Coastal Chinook Spawning Distribution: quantifying irregularities in real time
Background
The objective of this work is to quantify irregularities in the spawning distribution of Chinook within a given basin. This information could be compiled on a weekly basis, providing managers with nearly real-time information on whether the observed spawning distribution is unusual relative to our historical data.
Data
The raw data are counts of spawners at index sites. These data exist at multiple sites per basin. The temporal component of the data includes counts at approximately weekly intervals during a run, from 1998 through 2018.
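For concreteness, each basin's sheet in WK44_All.xlsx is assumed to hold one row per year and one column of week-44 counts per index site, along the lines of the sketch below (the values shown are illustrative, not real data):

Year   S1   S2   S3   S4   S5   S6
1998   34   12  NaN    7    0   21
1999   28  NaN   15    9    3   18
...
2018   19    5   11  NaN    0    8

Missing surveys appear as NaN and are filled by the imputation step described later.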
Example: Tillamook
Load raw counts on week 44 in the Tillamook:
cd('\\Kalawatseti\home\falcym\docs\Projects\CoastalMultispeciesPlan\SlidingScale\EmergencyClosure')
X1=readtable('WK44_All.xlsx','Sheet','Tillamook')
We wish to know whether the distribution of counts across six sites (S1, S2, ..., S6) observed on week 44 in 2018 is unusual relative to the distribution at week 44 observed since 1998. Since this analysis is not concerned with abundance, I will examine how the total count within a year is apportioned across sites.
First, I will impute all missing values by weighting several simple linear regressions among sites for which data exist. The function for doing this is given in the Appendix of this document.
X2=array2table(Impute(X1{:,:})); %impute missing counts (function in Appendix)
X2.Properties.VariableNames=X1.Properties.VariableNames;
X2
Now get proportions of annual totals using the imputed data set:
X2{:,2:end}=X2{:,2:end}./sum(X2{:,2:end},2)
Quantiles
The proportions in the table above will be used to compute quantiles for the proportion of total abundance through time. This will be done for each site. The years used for computing quantiles are 1998 through 2017, which is n = 20 years. Notice that we are computing quantiles up through, but not including, the focal year, 2018. The first step in the quantile analysis is to sort observations low to high. The first observation in the sort corresponds with a cumulative probability of 1/20 = 0.05. The second observation has cumulative probability 2/20 = 0.1, etc. This is how the horizontal black bars are computed in the figure below.
The quantiles are defined as 0.5/n, 1.5/n, ..., (n-0.5)/n, where n is the number of years (n = 20). The first quantile is 0.5/20 = 0.025, the second is 1.5/20 = 0.075, and the last quantile is (20-0.5)/20 = 0.975. From here, a smoothing function is needed to connect the midpoints of the quantiles. This will provide a means of calculating quantiles for new observations (2018). The smoothed function is computed with the following linear interpolation between adjacent sorted observations x_(k) and x_(k+1):

Q(p) = x_(k) + ((p - p_k)/(p_(k+1) - p_k))*(x_(k+1) - x_(k)),  where p_k = (k - 0.5)/n

The 0.06 quantile falls between p_1 = 0.025 and p_2 = 0.075, so

Q(0.06) = x_(1) + ((0.06 - 0.025)/0.05)*(x_(2) - x_(1)) = 0.221

So the 0.06 quantile is associated with the proportion 0.221, which is greater than the lowest observed proportion at site S1 (0.0278). This occurs because we are making the function connect quantiles at the midpoint between adjacent proportions. See the figure below.
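Before drawing the figure, a quick numerical check (this snippet is not part of the original analysis): MATLAB's quantile function uses this same midpoint definition by default, so interpolating the sorted values against their midpoint probabilities reproduces it on arbitrary data. For p below 0.025 or above 0.975, quantile additionally clamps to the minimum or maximum, which the bare interp1 call does not.

%check: midpoint linear interpolation matches MATLAB's default quantile
Xdemo = sort(rand(20,1)); %any sorted sample of n = 20 values
n = length(Xdemo);
pk = ((1:n)' - 0.5)/n; %midpoint probability assigned to each sorted value
Q = @(p) interp1(pk,Xdemo,p,'linear'); %interpolate between the midpoints
[Q(0.06), quantile(Xdemo,0.06)] %the two values agree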
j=1; %index of site (S1)
Xs=sort(X2{1:(end-1),1+j}); %sorted 1998-2017 proportions, excluding 2018
%horizontal black bars: cumulative probability i/n between adjacent observations
for i=1:(length(Xs)-1)
    line([Xs(i),Xs(i+1)],[i/length(Xs),i/length(Xs)],'col','k');
    line([Xs(i),Xs(i)],[0,i/length(Xs)],'col','k','LineStyle',':')
end
line([Xs(end),1],[1,1],'col','k');
line([Xs(end),Xs(end)],[0,1],'col','k','LineStyle',':')
hold on
%blue curve: linear interpolation through the midpoints of the bars
p=0:0.01:1;
y=quantile(Xs,p);
plot(y,p,'b')
xlabel('Observation')
ylabel('Quantile')
%red: quantile of the 2018 observation (invprctile is not built into MATLAB;
%it computes the inverse percentile and is available on the File Exchange)
line([X2{end,1+j},X2{end,1+j}],[0,invprctile(Xs,X2{end,1+j})/100],'col','r')
drawArrow = @(x,y,varargin) quiver(x(1),y(1),x(2)-x(1),y(2)-y(1),0,varargin{:});
drawArrow([X2{end,1+j},0],[invprctile(Xs,X2{end,1+j})/100,invprctile(Xs,X2{end,1+j})/100],'color','r');
Interpretation for first site within Tillamook
In the figure above, vertical dotted lines are given above each "observed" proportion for site S1 (in quotes because the values include imputation of missing data). See the table immediately above. The horizontal solid black lines connect two observations. The height of this line is the quantile for observations that fall within these two values. The blue line is a linear interpolation designed to connect the midpoints of adjacent solid black lines. Armed with this function, it is possible to compute the quantile associated with the observation from 2018. This is shown in red in the figure above.
All Sites
The concept illustrated above can now be applied to all counts (observed and imputed) in 2018.
for j = 1:(size(X2,2)-1) %loop over sites
    Xs=sort(X2{1:(end-1),1+j}); %1998-2017 proportions for site j
    pt(j)=invprctile(Xs,X2{end,1+j})/100; %quantile of the 2018 proportion
end
T=array2table([2018 pt]);
T.Properties.VariableNames=X2.Properties.VariableNames
Interpretation for all sites within Tillamook
For site S1, we have already seen in the graph above that the quantile associated with the 2018 week 44 count is 0.31. Note that this is an imputed value. It would be prudent to ignore the quantiles associated with sites that have no observation, as sketched below. However, imputation is still needed to fill in historic values so that we can compare present proportions to (assumed) past proportions.
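A minimal sketch of that masking, assuming the last row of the raw table X1 is 2018 (as in the tables above); this step is not part of the original analysis:

obs2018 = ~isnan(X1{end,2:end}); %true where a week-44 count was actually recorded
pt_obs = pt(obs2018) %quantiles restricted to observed sites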
Notice that the quantile for the sixth site, S6, is 0.575. The observed count for that site in 2018 is 0. The computed quantile is 0.575 because about half of the recorded counts from this site at week 44 were 0 fish. This can be appreciated by inspecting the first table in this report.
Roll-up within a basin
The previously stated goal of this work is to quantify irregularity in spawning distribution within a basin on an ongoing basis. To this end, we need to condense the multiple quantiles within a basin into a single metric. Averaging raw quantiles won't work because the average of two sites with an unusual distribution (quantiles 0.95 and 0.05) is the same as the average of two sites with a commonly observed distribution (quantiles 0.5 and 0.5). To measure "irregularity", R, at the basin scale, we can take the absolute value of the difference between 0.5 and the quantile, and then average over sites:

R = (1/m) * sum_j |q_j - 0.5|

where q_j is the 2018 quantile for site j and m is the number of sites in the basin.
R will be bounded between 0 and 0.5. R increases as the proportion of spawners across sites becomes extreme relative to the historic distribution.
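To make the contrast described above concrete, here is the arithmetic for the two hypothetical pairs of sites mentioned earlier:

R_unusual = mean(abs([0.95 0.05] - 0.5)) %= 0.45, both sites far from typical
R_typical = mean(abs([0.50 0.50] - 0.5)) %= 0, both sites at the median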
R=mean(abs(T{1,2:end}-0.5))
Benchmarking the basin roll-up
Is the computed value of R extreme? We could answer this by contemplating how it is computed, but it would be easier to interpret it in light of the distribution of R values we get when we apply the logic from the previous section to previous years.
for t=1:(size(X2,1)-1) %don't do 2018
    for j = 1:(size(X2,2)-1) %don't do the year column
        X3=X2([1:t-1 t+1:end-1],:); %leave out year t and 2018
        Xs=sort(X3{:,1+j});
        pt2(t,j)=invprctile(Xs,X2{t,1+j})/100; %quantile of year t at site j
    end
end
figure
hist(mean(abs(pt2-0.5),2))
hold on
line([R, R],[0, max(ylim(gca))],'col','r')
ylabel('Frequency')
xlabel('Irregularity Index')
The figure above suggests that the computed irregularity index R for 2018 (red line) is typical for this basin.
Multiple basins
Now I just repeat the foregoing for all basins.
names={'Nehalem','Tillamook','Nestucca','Salmon','Siletz','Yaquina','Alsea','Siuslaw'};
figure
for p=1:length(names)
    clear X1 X2 Xs pt pt2 R
    X1=readtable('WK44_All.xlsx','Sheet',string(names(p)));
    X1=X1(sum(isnan(X1{:,2:end}),2)<(size(X1,2)-1),:); %remove years with no observations
    X2=array2table(Impute(X1{:,:})); %impute missing counts (function in Appendix)
    X2.Properties.VariableNames=X1.Properties.VariableNames;
    X2{:,2:end}=X2{:,2:end}./sum(X2{:,2:end},2); %proportions of annual totals
    for j = 1:(size(X2,2)-1) %quantile of the 2018 proportion at each site
        Xs=sort(X2{1:(end-1),1+j});
        pt(j)=invprctile(Xs,X2{end,1+j})/100;
    end
    T=array2table([2018 pt]);
    T.Properties.VariableNames=X2.Properties.VariableNames;
    R=mean(abs(T{1,2:end}-0.5)); %irregularity index for 2018
    for t=1:(size(X2,1)-1) %don't do 2018
        for j = 1:(size(X2,2)-1) %don't do the year column
            X3=X2([1:t-1 t+1:end-1],:);
            Xs=sort(X3{:,1+j});
            pt2(t,j)=invprctile(Xs,X2{t,1+j})/100;
        end
    end
    subplot(2,4,p)
    hist(mean(abs(pt2-0.5),2))
    hold on
    line([R, R],[0, max(ylim(gca))],'col','r')
    ylabel('Frequency')
    xlabel('Irregularity Index')
    title(names(p))
end %population
The warnings given above pertain to the multiple imputation performed in the Siuslaw. There is no irregularity index (R) for the Nestucca because there are not enough data for the technique to work:
readtable('WK44_All.xlsx','Sheet','Nestucca')
Appendix: multiple imputation method
This section contains the function for performing multiple imputation. Since different regressions will have different numbers of observations, AIC cannot be used to perform model averaging. Instead, R^2 is used here to weight individual regressions into a single estimate.
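As a toy illustration of the weighting scheme (the numbers here are made up), each candidate prediction of a missing value is multiplied by the R^2 of the regression that produced it, and the products are normalized by the summed weights:

predy = [42 55 48]; %predictions of one missing value from three univariate fits
R2s = [0.9 0.2 0.6]; %R^2 of each fit
yhat = sum(predy.*R2s)/sum(R2s) %weighted estimate, approximately 45.6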
function imputed=Impute(PC)
%Impute missing values in the matrix PC (year in column 1, site counts in
%the remaining columns). Each missing value is predicted from simple linear
%regressions on the columns observed in that row, and the predictions are
%averaged with weights equal to each regression's R^2.
xnew=PC;
[row,col] = ind2sub(size(PC),find(isnan(PC))); %indices of missing values
for i = 1:length(row) %loop over missing values
    R2s=[];
    predy=[];
    predcols=find(~isnan(PC(row(i),:))); %columns usable as predictors for this row
    xi=PC(:,predcols);
    yi=PC(:,col(i));
    for j = 1:length(predcols) %univariate regression w/ each predictor column
        if sum(~isnan(yi(~isnan(xi(:,j)))))>3 %make sure there are at least 4 data points
            [b,dev,stats]=glmfit(xi(~isnan(xi(:,j)),j),yi(~isnan(xi(:,j))),'normal','link','identity');
            pred=glmval(b,xi(~isnan(xi(:,j)),j),'identity',stats);
            %prediction for the row holding the missing value (index shifted
            %to account for NaNs dropped from the predictor column)
            predy(j)=pred(row(i)-sum(isnan(xi(1:row(i),j))));
            %R^2 of this regression
            R2s(j)=1-nansum(stats.resid.^2)/nansum((yi(~isnan(xi(:,j)))-nanmean(yi(~isnan(xi(:,j))))).^2);
        else
            predy(j)=NaN;
            R2s(j)=NaN;
        end
    end
    %R^2-weighted average of the candidate predictions
    yhat=sum(predy(~isnan(predy)).*R2s(~isnan(R2s)))/sum(R2s(~isnan(R2s)));
    xnew(row(i),col(i))=yhat;
end
imputed=xnew;
end
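A hypothetical call to the function, using a made-up matrix in the same layout as X1{:,:} (year in the first column, site counts in the remaining columns):

M = [1998 120  80 NaN;
     1999 100  60  40;
     2000 140  90  55;
     2001  90  50  30;
     2002 110  70  45];
Mi = Impute(M) %the NaN is replaced by an R^2-weighted blend of predictions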