Abundance Estimation with Index Counts and Mark-Recapture
Matt Falcy
July 28, 2017
Problem Definition
We have mark-recapture estimates of abundance, \(\hat{N}\), and the associated standard errors, \(se_{\hat{N}}\). We also have a count, \(C\), of animals at an index location that are an unknown proportion, \(p\), of the latent abundance, \(N\). Such data exist for many independent years. We wish to infer \(N\) in a year when we only have \(C\).
The Model
- \(\hat{N}\)~Normal(\(N\),\(se_{\hat{N}}\))
This means we regard our mark-recapture estimates as unbiased because they are “centered” on the latent abundance. We know the mark-recapture estimate is imperfect, so we use its standard error to model the normally distributed error around the mark-recapture point estimate.
- \(C\)~Binomial(\(N\),\(p\))
The counts are like the result of a coin-flipping experiment. The total number of flips is the latent abundance, and the number of heads is the number of fish counted at the index location. We want to know the proportion \(p\), for which \(C/N\) is the natural point estimate. This proportion must satisfy \(0 \le p \le 1\), and the binomial distribution gives us a way of confining uncertainty in \(p\) to this range.
Notice that \(N\) appears in both of the distributions given above. This is important. Our theory is that \(C\) contains information about \(N\). So we’re actually going to estimate \(N\) even though we already have the mark-recapture estimate, \(\hat{N}\). If \(C\) is totally unrelated to \(N\), then our new estimate of \(N\) will simply match \(\hat{N}\). But if there is a relationship, then the new estimate of \(N\) will leverage the information contained in \(C\) (a small numerical illustration follows the Beta density plot below). Recall that we know the mark-recapture estimate is imperfect (\(se_{\hat{N}}\)).
- \(p\)~Beta(\(a\),\(b\))
We shouldn’t assume that the proportion of the total abundance counted at the index site is perfectly constant across years. Let’s give it some ability to move around. The parameters of the Beta distribution describe how much \(p\) varies. For example, in the plot below \(p\) is constrained to the probability distribution defined by a Beta distribution with \(a\)=2 and \(b\)=40. We really want to know the values of \(a\) and \(b\) because they determine how well we can predict \(N\) when we only have \(C\).
p<-seq(0,1,0.01)
dp<-dbeta(p,2,40)
plot(p,dp,type='l',ylab='Probability Density',xlab='p')
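Before moving on, here is a small numerical illustration of why \(C\) carries information about \(N\) (all numbers here are invented for the example): because \(E(C)=N\,a/(a+b)\), dividing a count by the mean of \(p\) roughly recovers \(N\).
#Toy illustration (invented numbers): when C ~ Binomial(N, p) with p ~ Beta(a, b),
#E(C) = N*a/(a+b), so dividing a count by the mean of p roughly recovers N.
set.seed(42)
a<-2; b<-40                                   #the Beta parameters plotted above
N.true<-8000                                  #a hypothetical latent abundance
C.sim<-rbinom(1000,N.true,rbeta(1000,a,b))    #many hypothetical index counts
mean(C.sim)/(a/(a+b))                         #back-calculated abundance; close to 8000 on average
sd(C.sim/(a/(a+b)))                           #but any single year's back-calculation can be well off
Of course, in any single year the back-calculation can miss badly when \(p\) happens to be unusually high or low; that is exactly why we need to estimate how variable \(p\) is.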
Simulation test
First, let’s test the approach outlined above with a simulation. We choose values of \(N\), \(se_{\hat{N}}\), \(a\), and \(b\) and then use these to create data, \(\hat{N}\) and \(C\).
N<-floor(rlnorm(10,8.5,0.5)) #params here aren't of interest
sdNhat<-rnorm(10,100,20) #params here aren't of interest
Nhat<-floor(rnorm(10,N,sdNhat)) #This matches the first line of the model.
p<-rbeta(10,3,15) #This matches the third line of the model. a=3, and b=15.
C<-rbinom(10,N,p) #This matches the second line of the model.
Now we need to write a statistical model that receives only the data \(\hat{N}\), \(se_{\hat{N}}\), and \(C\) and then estimates the parameters \(N\), \(a\), and \(b\), which in this simulation we already know. We’ll do this below by maximizing the likelihood of the joint data. For computational reasons, we’ll actually minimize the negative log-likelihood, which is equivalent.
library(rmutil)
##
## Attaching package: 'rmutil'
## The following object is masked from 'package:stats':
##
## nobs
NLL<-function(par,data){
  #par[1:10] = latent abundances N; par[11] = mean of p; par[12] = overdispersion
  LL1<-LL2<-numeric(10)
  for (i in 1:10){
    LL1[i]<-dnorm(data$Nhat[i],par[i],data$sdNhat[i],log=TRUE)            #Nhat~Normal(N,seNhat)
    LL2[i]<-dbetabinom(data$C[i],floor(par[i]),par[11],par[12],log=TRUE)  #C~BetaBinomial, p integrated out
  }
  -1*(sum(LL1)+sum(LL2))
}
out<-optim(par=c(rep(10000,10),0.2,2),fn=NLL,
           data=list(Nhat=Nhat,sdNhat=sdNhat,C=C),
           method='L-BFGS-B',
           lower=c(rep(min(N),10),0.0001,1),
           upper=c(rep(max(N),10),0.9999,100),
           control=list(factr=1e3))
A couple of things are worth noting in the code above. First, because the Beta distribution is the conjugate prior for the Binomial, \(p\) can be integrated out analytically; the resulting marginal distribution of \(C\) is the beta-binomial distribution used here. Its parameterization is different (a mean probability and an overdispersion parameter rather than \(a\) and \(b\)), but it is mathematically equivalent. Second, the “floor” function is needed because \(N\) must be integer-valued, but the optimizer works in floating point. Third, constrained optimization is needed because \(0 \le p \le 1\), and this is carried out with the “L-BFGS-B” method. Constraints on the other parameters are set very liberally.
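As a quick check of that parameterization (the mapping below is my reading of rmutil, so treat it as an assumption): if dbetabinom(y, size, m, s) uses a mean probability m and overdispersion s, then it should match the classical Beta-Binomial with \(a=ms\) and \(b=(1-m)s\), whose log density is \(\log\binom{n}{y}+\log B(y+a,\,n-y+b)-\log B(a,b)\).
y<-40; size<-1000; a<-3; b<-15                 #invented example values
m<-a/(a+b); s<-a+b                             #assumed mapping from (a,b) to rmutil's (m,s)
manual<-lchoose(size,y)+lbeta(y+a,size-y+b)-lbeta(a,b)
c(rmutil=dbetabinom(y,size,m,s,log=TRUE),manual=manual)  #should agree if the mapping is right
If the two numbers disagree, the assumed mapping is wrong and the conversion between (m, s) and (a, b) would need to be revisited.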
Now let’s inspect the output of the optimization.
out
## $par
## [1] 8246.3208831 6047.6470343 4219.0364708 6711.1387517 8497.0000000
## [6] 7390.4702323 6364.7556926 5012.9037908 5849.0697996 6156.8711109
## [11] 0.1727046 10.5319054
##
## $value
## [1] 252.7052
##
## $counts
## function gradient
## 140 140
##
## $convergence
## [1] 1
##
## $message
## [1] "NEW_X"
mean(p)
## [1] 0.1637982
The 11th entry of “par” is the estimate of the mean of \(p\); compare it to mean(p). Looks good. The 12th entry is the overdispersion parameter. Now plot the latent \(N\) against our new estimates of \(N\). This also looks good.
plot(N,out$par[1:10])
The Empirical Data and Maximum Likelihood Analysis
Now we’re going to do the same thing with real-world data.
Nhat<-c(14335,15891,2700,1218,2213,10985,4985,8738,13878)
seNhat<-sqrt(c(26915344,6574096,175561,51529,71824,12166144,145924,2808976,3319684))
C<-c(313,410,38,30,69,175,132,193,240)
NLL<-function(par,data){
  #par[1:9] = latent abundances N; par[10] = mean of p; par[11] = overdispersion
  n<-length(data$Nhat)
  LL1<-LL2<-numeric(n)
  for (i in 1:n){
    LL1[i]<-dnorm(data$Nhat[i],par[i],data$seNhat[i],log=TRUE)  #use the empirical seNhat, not the simulated sdNhat
    LL2[i]<-dbetabinom(data$C[i],floor(par[i]),par[n+1],par[n+2],log=TRUE)
  }
  -1*(sum(LL1)+sum(LL2))
}
out<-optim(par=c(floor(Nhat*runif(length(Nhat),0.9,1.1)),0.02,3),
fn=NLL,data=list(Nhat=Nhat,seNhat=seNhat,C=C),
method='L-BFGS-B',
lower=c(rep(100,length(Nhat)),0.0001,1),
upper=c(rep(100000,length(Nhat)),0.9999,50000),
hessian=TRUE)
out
## $par
## [1] 1.475387e+04 1.488908e+04 2.703184e+03 1.025222e+03 2.297120e+03
## [6] 1.065165e+04 4.971002e+03 8.939399e+03 1.445472e+04 2.144821e-02
## [11] 8.814388e+03
##
## $value
## [1] 170.8571
##
## $counts
## function gradient
## 113 113
##
## $convergence
## [1] 52
##
## $message
## [1] "ERROR: ABNORMAL_TERMINATION_IN_LNSRCH"
##
## $hessian
## [,1] [,2] [,3] [,4] [,5]
## [1,] 6.495782e-05 0.000000e+00 0.000000e+00 -7.105427e-09 -7.105427e-09
## [2,] 0.000000e+00 6.937739e-05 0.000000e+00 0.000000e+00 0.000000e+00
## [3,] 0.000000e+00 0.000000e+00 1.249134e-04 -7.105427e-09 0.000000e+00
## [4,] -7.105427e-09 0.000000e+00 -7.105427e-09 8.563461e-05 -7.105427e-09
## [5,] -7.105427e-09 0.000000e+00 0.000000e+00 -7.105427e-09 8.745360e-05
## [6,] -7.105427e-09 0.000000e+00 3.552714e-09 -7.105427e-09 0.000000e+00
## [7,] 0.000000e+00 0.000000e+00 7.105427e-09 0.000000e+00 7.105427e-09
## [8,] 0.000000e+00 0.000000e+00 3.552714e-09 0.000000e+00 0.000000e+00
## [9,] -3.552714e-09 0.000000e+00 0.000000e+00 0.000000e+00 -3.552714e-09
## [10,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [11,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [,6] [,7] [,8] [,9] [,10]
## [1,] -7.105427e-09 0.000000e+00 0.000000e+00 -3.552714e-09 0.000000e+00
## [2,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [3,] 3.552714e-09 7.105427e-09 3.552714e-09 0.000000e+00 0.000000e+00
## [4,] -7.105427e-09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [5,] 0.000000e+00 7.105427e-09 0.000000e+00 -3.552714e-09 0.000000e+00
## [6,] 9.057999e-05 7.105427e-09 0.000000e+00 0.000000e+00 0.000000e+00
## [7,] 7.105427e-09 2.285248e-04 0.000000e+00 7.105427e-09 0.000000e+00
## [8,] 0.000000e+00 0.000000e+00 3.303242e-04 0.000000e+00 0.000000e+00
## [9,] 0.000000e+00 7.105427e-09 0.000000e+00 8.800782e-05 0.000000e+00
## [10,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.621247e+06
## [11,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 2.029122e-02
## [,11]
## [1,] 0.000000e+00
## [2,] 0.000000e+00
## [3,] 0.000000e+00
## [4,] 0.000000e+00
## [5,] 0.000000e+00
## [6,] 0.000000e+00
## [7,] 0.000000e+00
## [8,] 0.000000e+00
## [9,] 0.000000e+00
## [10,] 2.029122e-02
## [11,] -1.250555e-06
The 10th value of “par” is the mean of \(p\). The 11th value of “par” is the overdispersion parameter, which describes the variation in \(p\); this is what we would use when predicting \(N\) from \(C\) alone.
Look at the relationship between the new estimates of \(N\) and \(\hat{N}\).
plot(Nhat,out$par[1:9],xlab='Mark-recapture estimate',ylab='New estimate of abundance')
abline(0,1)
So it looks like \(C\) is contributing very little additional information to the new estimates of \(N\).
The numbers in “par” above are just point estimates. What about standard errors of the parameters? In the chunk of code above, I asked the optimizer to return the Hessian, which is the square matrix of second-order partial derivatives evaluated at the maximum likelihood estimates of the parameters.
\[H = \begin{pmatrix} \frac{\partial^2f}{\partial x_1^2} & \frac{\partial^2f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2f}{\partial x_1 \partial x_n} \\ \frac{\partial^2f}{\partial x_2 \partial x_1} & \frac{\partial^2f}{\partial x_2^2} & \cdots & \frac{\partial^2f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2f}{\partial x_n \partial x_1} & \frac{\partial^2f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2f}{\partial x_n^2} \end{pmatrix}\]
This matrix gives the magnitude of the curvature in all directions of the multidimensional likelihood surface. A more “spiked” surface means that a parameter can only take on a narrow range of values, whereas a flatter surface means that the parameter estimates are more uncertain.
We can get standard errors from the observed Fisher Information matrix \(\mathbf I(\mathbf\theta)\), where \(\mathbf\theta\) is the vector of all the parameters. Because we minimized a negative log-likelihood with the optimizer, the observed Fisher Information is simply the Hessian evaluated at the estimates:
\(\mathbf I(\hat{\mathbf\theta}) = \mathbf H\)
(If we had maximized the log-likelihood instead, we would take the negative of the Hessian.) Inverting the Fisher Information gives the asymptotic variance-covariance matrix of the estimates, and the standard errors are the square roots of its diagonal elements:
\(s_{\hat\theta_i} = \sqrt{\left[\mathbf H^{-1}\right]_{ii}}\)
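As a quick sanity check of this recipe on a one-parameter problem (toy data made up here, not the fish data): estimate the mean of normal observations with known standard deviation by minimizing the negative log-likelihood, and compare the Hessian-based standard error to the textbook value \(\sigma/\sqrt{n}\).
set.seed(1)
x<-rnorm(200,mean=50,sd=4)                         #toy data with known sd
nll<-function(mu) -sum(dnorm(x,mean=mu,sd=4,log=TRUE))
fit<-optim(par=45,fn=nll,method='BFGS',hessian=TRUE)
c(hessian.se=sqrt(1/fit$hessian[1,1]),analytical.se=4/sqrt(length(x)))  #both about 0.28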
And here’s the implementation for the model fitted above…
fisher.info<-solve(out$hessian) #inverse of the observed Fisher Information, i.e., the variance-covariance matrix
sig<-sqrt(diag(fisher.info))    #standard errors are the square roots of the diagonal
## Warning in sqrt(diag(fisher.info)): NaNs produced
upper<-out$par+1.96*sig  #approximate 95% confidence limits
lower<-out$par-1.96*sig
interval<-data.frame(out$par,lower=lower,upper=upper)
interval
## out.par lower upper
## 1 1.475387e+04 1.451069e+04 1.499706e+04
## 2 1.488908e+04 1.465377e+04 1.512440e+04
## 3 2.703184e+03 2.527816e+03 2.878552e+03
## 4 1.025222e+03 8.134193e+02 1.237025e+03
## 5 2.297120e+03 2.087531e+03 2.506708e+03
## 6 1.065165e+04 1.044572e+04 1.085759e+04
## 7 4.971002e+03 4.841347e+03 5.100657e+03
## 8 8.939399e+03 8.831557e+03 9.047240e+03
## 9 1.445472e+04 1.424579e+04 1.466365e+04
## 10 2.144821e-02 1.990904e-02 2.298738e-02
## 11 8.814388e+03 NaN NaN
Bayesian Analysis
Let’s go Bayesian. It will make quantifying prediction error easier than the pure maximum-likelihood approach.
The first thing to do is write the JAGS model, which we’ll call subsequently to fit the model.
Mod<-function(){
  #Likelihood
  for (t in 1:T){
    Nhat[t]~dnorm(round(N[t]),tau[t])  #JAGS dnorm is parameterized by precision, not sd
    C[t]~dbin(p[t],round(N[t]))
    p[t]~dbeta(a,b)
  }
  #Prediction: Cpred is a vector, so Npred is a vector of predictions
  Npred<-Cpred/qbeta(0.5,a,b)
  #Priors
  a~dunif(0,2000)
  b~dunif(0,20000)
  for (t in 1:T){
    N[t]~dunif(100,20000)
  }
}
This should look pretty similar to the data-generating model. It’s pretty much the same thing, just “going in reverse.”
The great thing about the Markov chain Monte Carlo (MCMC) method we’ll use is that we can make it sample over all unknowns, including our prediction. So in the middle of the JAGS model there is the line “Npred<-Cpred/qbeta(0.5,a,b)”. “Cpred” is any \(C\) we want to use for a future prediction of \(N\); below we will use \(C\) = 50, 100, 150, 200. We’re using the point estimate of \(p\) for the prediction, hence the 50th percentile of the Beta distribution. But there is uncertainty in the best value of that 50th percentile, represented by \(a\) and \(b\), and this uncertainty gets wrapped into our prediction, Npred. Awesome, because the hard analytical work is done for us by the MCMC!
There are some diffuse priors in there. It took some trial and error before I realized how large the upper bounds of the priors on \(a\) and \(b\) needed to be. More to come on that…
The code below just repackages the data so it can be fed into JAGS, sets up initial starting points, and specifies how many MCMC samples to take. Notice that we use the empirical data to generate predictions of \(N\) for a future scenario where \(C\) = 50, 100, 150, 200.
#Initialize MCMC chains
Init.fun<-function(){
list("a"=runif(1,20,40),"b"=runif(1,1000,2000),"N"=Nhat)
}
# List the parameters to save as output
params=as.character(c("a","b","N","Npred"))
#Repackage empirical data
D<-list(Nhat=Nhat,C=C,tau=1/seNhat^2,T=length(Nhat),Cpred=c(50,100,150,200))  #tau is a precision (1/variance), built from the empirical seNhat
#Call JAGS into a parallel process
library(R2jags)
## Loading required package: rjags
## Loading required package: coda
## Linked to JAGS 4.2.0
## Loaded modules: basemod,bugs
##
## Attaching package: 'R2jags'
## The following object is masked from 'package:coda':
##
## traceplot
jags.out<-jags.parallel(data=D, inits=Init.fun, parameters.to.save=params, model.file = Mod,
n.chains = 4,n.iter = 100000, n.burnin = 50000, n.thin = 13,
n.cluster= 4, DIC = TRUE,
working.directory = NULL, jags.seed = 123, digits=5,
jags.module = c("dic"))
This model fits really fast compared to most. Let’s check the model diagnostics to make sure that everything worked OK.
#traceplot(jags.out)
The trace plots look great except for \(a\) and \(b\); I think the chains had some difficulty converging for those parameters. Let’s look at the output, including the Gelman-Rubin diagnostic “Rhat”.
jags.out$BUGSoutput$summary
## mean sd 2.5% 25% 50%
## N[1] 14334.99000 11.134297 14313.27860 14327.43992 14335.06216
## N[2] 15891.02549 10.927245 15869.94910 15883.53435 15891.04665
## N[3] 2699.72342 9.461943 2681.08576 2693.41149 2699.67465
## N[4] 1218.15379 10.328818 1197.89501 1211.19418 1218.08402
## N[5] 2213.47373 10.368390 2193.19359 2206.49838 2213.45530
## N[6] 10984.92299 10.261010 10964.59590 10978.08537 10984.90174
## N[7] 4985.10455 8.086714 4969.08066 4979.59756 4985.21236
## N[8] 8737.99349 7.375209 8723.55039 8732.94878 8738.02624
## N[9] 13877.85854 10.251964 13857.72775 13870.85709 13877.91518
## Npred[1] 2330.00497 156.552135 2033.50364 2229.66546 2325.44504
## Npred[2] 4660.00993 313.104270 4067.00728 4459.33093 4650.89007
## Npred[3] 6990.01490 469.656404 6100.51092 6688.99639 6976.33511
## Npred[4] 9320.01987 626.208539 8134.01456 8918.66185 9301.78014
## a 47.87672 33.824238 11.79644 25.56129 38.40141
## b 2166.61597 1546.995686 521.05019 1145.67770 1728.93234
## deviance 138.78767 7.139184 127.33667 133.65798 137.93076
## 75% 97.5% Rhat n.eff
## N[1] 14342.50348 14356.6095 1.001002 15000
## N[2] 15898.43059 15912.4434 1.001057 13000
## N[3] 2706.07556 2718.3394 1.001113 10000
## N[4] 1225.18214 1238.2662 1.000964 15000
## N[5] 2220.51934 2233.7448 1.000987 15000
## N[6] 10991.80006 11004.7619 1.000893 15000
## N[7] 4990.62048 5000.6985 1.000915 15000
## N[8] 8743.01409 8752.2573 1.000994 15000
## N[9] 13884.80359 13897.7649 1.000976 15000
## Npred[1] 2424.39456 2654.2538 1.000937 15000
## Npred[2] 4848.78912 5308.5077 1.000937 15000
## Npred[3] 7273.18368 7962.7615 1.000937 15000
## Npred[4] 9697.57824 10617.0154 1.000937 15000
## a 59.21211 144.7215 1.002348 1700
## b 2688.69154 6611.0513 1.002294 1800
## deviance 143.01654 154.8320 1.001061 13000
The Rhats look good. They should be less than 1.1, and even \(a\) and \(b\) easily meet this criterion.
Let’s do a reality check. The expected value and variance of \(p\) can be computed from the estimated values of \(a\) and \(b\) using some simple theory. First, \(E(p)=\frac{a}{a+b}\).
(jags.out$BUGSoutput$mean$a/(jags.out$BUGSoutput$mean$a+jags.out$BUGSoutput$mean$b))
## [1] 0.02161972
mean(C/Nhat)
## [1] 0.02214563
The theoretical variance of \(p\) is: \(var(p)=\frac{ab}{(a+b)^2(a+b+1)}\).
ai<-jags.out$BUGSoutput$mean$a
bi<-jags.out$BUGSoutput$mean$b
(ai*bi)/((ai+bi)^2*(ai+bi+1))
## [1] 9.547453e-06
var(C/Nhat)
## [1] 3.091778e-05
The means match up nicely. The variance is less convincing, though part of the gap is expected: var(\(C/\hat{N}\)) includes binomial sampling noise and error in \(\hat{N}\) on top of the Beta variation in \(p\). Either way, the tiny values for Var(\(p\)) indicate this method is going to work well for prediction.
Let’s use our point estimates of \(a\) and \(b\) to plot our belief in the values of \(p\).
pi<-seq(0,1,0.001)
dpi<-dbeta(pi,ai,bi)
plot(pi,dpi,type='l',ylab='Probability Density',xlab='p')
This narrow range of \(p\) is why I needed to use large upper bounds on the priors for \(a\) and \(b\).
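To see roughly how large \(a\) and \(b\) have to be, here is a quick method-of-moments conversion (my own check, not part of the analysis above) from a target mean and standard deviation of \(p\) to Beta parameters, using the posterior estimates \(E(p)\approx 0.022\) and \(sd(p)\approx 0.003\) from the reality check.
beta.mom<-function(mu,sigma){
  s<-mu*(1-mu)/sigma^2-1     #implied a+b
  c(a=mu*s,b=(1-mu)*s)
}
beta.mom(mu=0.0216,sigma=0.0031)  #roughly a~47 and b~2150, in the ballpark of the posterior means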
The Prediction Interval
Now plot the 95% credible interval on our prediction of \(N\) for the four different levels of \(C\): 50, 100, 150, 200.
Cpred<-D$Cpred  #the four hypothetical future counts; rows 10:13 of the summary are Npred
plot(Cpred,jags.out$BUGSoutput$summary[10:13,7],type='b',ylim=c(0,11000),
     xlab='Index count (C)',ylab='Abundance',main='95% Prediction Interval')  #97.5% quantile
lines(Cpred,jags.out$BUGSoutput$summary[10:13,3],type='b')  #2.5% quantile
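As a cross-check (my own addition using R2jags’s sims.list output), the same interval can be rebuilt in R from the posterior draws of \(a\) and \(b\), mirroring the Npred node in the JAGS model; this is also how you would predict \(N\) for any other future count.
a.post<-jags.out$BUGSoutput$sims.list$a
b.post<-jags.out$BUGSoutput$sims.list$b
Cnew<-100                                      #a hypothetical future index count
Npred.post<-Cnew/qbeta(0.5,a.post,b.post)      #one prediction per posterior draw
quantile(Npred.post,c(0.025,0.5,0.975))        #compare with the Npred[2] row of the summary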