Indirect inference through prediction - a tutorial
You built a simulation model and now you want to tune its parameters to data. Here, and in this arXiv preprint, I recycle the methods reviewed in Blum et al. (2013) to do just that.
It’s a simple strategy that requires no fancy new techniques.
The paper comes with replication code, but there is a lot of fluff in it that you don’t really need. Here I try to produce a minimal example with no wasted effort.
The model and its summary statistics
First of all, let’s load some basic packages: tidyverse and zoo are useful for data manipulation, knitr for printing tables. The glmnetUtils library is useful for running elastic net regressions: fast, regularized linear regressions that automatically drop uninformative or quasi-collinear coefficients.
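A minimal loading chunk, assuming all four packages are installed (the original post does not show this step explicitly):
library(tidyverse)   # data manipulation and ggplot2
library(zoo)         # as.zoo and its autoplot method
library(knitr)       # kable, for printing tables
library(glmnetUtils) # formula interface to glmnet elastic nets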
Imagine that somebody hands us a complicated black-box agent-based model depending on two parameters $\alpha,\beta$. In reality the model is just an ARIMA(1,1,1), with $\alpha$ as the AR coefficient and $\beta$ as the MA coefficient, but let’s pretend not to know that.
complicated_model <- function(n, ar, ma) {
  # n is the number of observations; ar and ma are the two parameters
  return(arima.sim(n = n, model = list("ar" = ar, "ma" = ma, order = c(1, 1, 1))))
}
Run the model once and it’s apparent we aren’t dealing with a stationary process (let alone an ergodic one):
trial_run <- complicated_model(100, .3, -.2)
autoplot(as.zoo(trial_run)) +
  xlab("Steps") + ylab("Simulation Output")
If we observe one run (100 steps) out of this model, how can we obtain its parameters?
First, let’s come up with a bunch of indicators (summary statistics) that describe the output of the model:
- The trend
- The average
- The last value
- The standard deviation
- Number of peaks and valleys (see the find_peaks function below)
- The coefficients of an AR(5) fit
- The first 10 ACF and PACF values
- Same summary statistics (1-7) for the speed (the $\Delta$ of the simulation output)
Here’s the code to extract them:
# an easy function to find peaks in the data
# grabbed from here: https://github.com/stas-g/findPeaks
find_peaks <- function(x, m = 3) {
  shape <- diff(sign(diff(x, na.pad = FALSE)))
  pks <- sapply(which(shape < 0), FUN = function(i) {
    z <- i - m + 1
    z <- ifelse(z > 0, z, 1)
    w <- i + m + 1
    w <- ifelse(w < length(x), w, length(x))
    if (all(x[c(z:i, (i + 2):w)] <= x[i + 1])) return(i + 1) else return(numeric(0))
  })
  pks <- unlist(pks)
  pks
}
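As a quick sanity check (my own toy example, not from the original post), the function should flag the single local maximum of a simple hump:
find_peaks(c(1, 2, 3, 4, 10, 4, 3, 2, 1))
## [1] 5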
# this function takes a simulation output and returns all of its
# summary statistics
extract_summary_statistics <- function(model_run)
{
  time <- 1:length(model_run)  # time steps, for the trend
  speed <- diff(model_run)
  timed <- 1:length(speed)     # time steps, for the trend of the speed
  return(
    unlist(list(
      # basic info about the time series
      "trend" = coef(lm(model_run ~ time))[2],
      "average" = mean(model_run),
      "last_value" = last(model_run),
      "stdev" = sd(model_run),
      "number_of_peaks" = length(find_peaks(model_run, m = 3)),
      "number_of_valleys" = length(find_peaks(-model_run, m = 3)),
      "ar_coefficients" = (ar(model_run, order.max = 5, aic = FALSE))$ar,
      "acf" = (acf(model_run, plot = FALSE))$acf[2:11],
      # note: pacf's $acf starts at lag 1, so [2:11] actually picks
      # lags 2 to 11; kept as is for replication
      "pacf" = (pacf(model_run, plot = FALSE))$acf[2:11],
      # basic info about the speed (ugly code duplication!)
      "speed_trend" = coef(lm(speed ~ timed))[2],
      "speed_average" = mean(speed),
      "speed_last_value" = last(speed),
      "speed_stdev" = sd(speed),
      "speed_number_of_peaks" = length(find_peaks(speed, m = 3)),
      "speed_number_of_valleys" = length(find_peaks(-speed, m = 3)),
      "speed_ar_coefficients" = (ar(speed, order.max = 5, aic = FALSE))$ar,
      "speed_acf" = (acf(speed, plot = FALSE))$acf[2:11],
      "speed_pacf" = (pacf(speed, plot = FALSE))$acf[2:11]
    ))
  )
}
summary_names <- names(extract_summary_statistics(trial_run))
This sums to 62 base summary statistics. We know immediately that most of them are useless, but let’s see if the technique from the paper can figure that out automatically.
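A one-line check of that count (mine, not in the original code):
length(extract_summary_statistics(trial_run))
## [1] 62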
Building the training data set
The first step is quite obvious: run the model “many” times (let’s do it 5,000 times), changing the parameters each time, and collect all the summary statistics. Assuming $\alpha$ (the AR coefficient) and $\beta$ (the MA coefficient) both lie in $[-.9,.9]$, let’s build a table where each row is a separate, independent simulation and each column is either $\alpha$, $\beta$, or a summary statistic.
SAMPLE_SIZE <- 5000        # how many simulations we run
SIMULATION_LENGTH <- 100   # how many time steps each simulation runs
training <- list()
for (i in 1:SAMPLE_SIZE)
{
  alpha <- runif(min = -.9, max = .9, n = 1)
  beta <- runif(min = -.9, max = .9, n = 1)
  # ugly way to build the data set
  training[[i]] <-
    c(alpha, beta,
      # extract the summary statistics...
      extract_summary_statistics(
        # ...of a run at the random parameters
        complicated_model(
          n = SIMULATION_LENGTH,
          ar = alpha,
          ma = beta
        )
      )
    )
}
# turn the list into a data frame and name it properly
training <- do.call("rbind", training)
colnames(training) <- c("alpha", "beta", summary_names)
# show results
training <- tbl_df(training)
head(training)
## # A tibble: 6 x 64
## alpha beta trend.time average last_value stdev number_of_peaks
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.806 0.574 0.467 -19.1 15.5 19.9 3.
## 2 -0.468 0.655 0.0469 4.57 7.61 2.10 12.
## 3 -0.598 0.847 -0.0182 -3.48 -9.79 3.77 10.
## 4 -0.380 0.00351 -0.0181 -0.0671 -2.06 2.04 9.
## 5 -0.405 0.543 -0.00551 -3.68 -6.77 2.35 8.
## 6 -0.262 -0.642 0.0128 -0.254 2.82 1.24 15.
## # ... with 57 more variables: number_of_valleys <dbl>,
## # ar_coefficients1 <dbl>, ar_coefficients2 <dbl>,
## # ar_coefficients3 <dbl>, ar_coefficients4 <dbl>,
## # ar_coefficients5 <dbl>, acf1 <dbl>, acf2 <dbl>, acf3 <dbl>,
## # acf4 <dbl>, acf5 <dbl>, acf6 <dbl>, acf7 <dbl>, acf8 <dbl>,
## # acf9 <dbl>, acf10 <dbl>, pacf1 <dbl>, pacf2 <dbl>, pacf3 <dbl>,
## # pacf4 <dbl>, pacf5 <dbl>, pacf6 <dbl>, pacf7 <dbl>, pacf8 <dbl>,
## # pacf9 <dbl>, pacf10 <dbl>, speed_trend.timed <dbl>,
## # speed_average <dbl>, speed_last_value <dbl>, speed_stdev <dbl>,
## # speed_number_of_peaks <dbl>, speed_number_of_valleys <dbl>,
## # speed_ar_coefficients1 <dbl>, speed_ar_coefficients2 <dbl>,
## # speed_ar_coefficients3 <dbl>, speed_ar_coefficients4 <dbl>,
## # speed_ar_coefficients5 <dbl>, speed_acf1 <dbl>, speed_acf2 <dbl>,
## # speed_acf3 <dbl>, speed_acf4 <dbl>, speed_acf5 <dbl>,
## # speed_acf6 <dbl>, speed_acf7 <dbl>, speed_acf8 <dbl>,
## # speed_acf9 <dbl>, speed_acf10 <dbl>, speed_pacf1 <dbl>,
## # speed_pacf2 <dbl>, speed_pacf3 <dbl>, speed_pacf4 <dbl>,
## # speed_pacf5 <dbl>, speed_pacf6 <dbl>, speed_pacf7 <dbl>,
## # speed_pacf8 <dbl>, speed_pacf9 <dbl>, speed_pacf10 <dbl>
Fitting the regressions
Looking only at the observed summary statistics, let’s try to back out the parameters that generated them. We need two regressions:
\[ \left\{\begin{aligned} \alpha &= \sum_i a_{\alpha,i} S_i + \sum_i b_{\alpha,i} S_{i}^2 + \sum_{i<j} c_{\alpha,i,j} S_i S_j \\ \beta &= \sum_i a_{\beta,i} S_i + \sum_i b_{\beta,i} S_{i}^2 + \sum_{i<j} c_{\beta,i,j} S_i S_j \end{aligned}\right. \tag{1} \] where $S$ is the vector of summary statistics (of size 62). In effect we regress on not just the base summary statistics but also all their squares and cross-products.
alpha_formula <-
  paste("alpha~(", paste(summary_names, collapse = "+"), ")^2 + I(",
        paste(summary_names, collapse = "^2)+I("), "^2)")
fit <-
  cv.glmnet(formula(alpha_formula), data = training)
beta_formula <-
  paste("beta~(", paste(summary_names, collapse = "+"), ")^2 + I(",
        paste(summary_names, collapse = "^2)+I("), "^2)")
fit_b <-
  cv.glmnet(formula(beta_formula), data = training)
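A nice side effect of the elastic net is sparsity: most of the squared and interaction terms get zeroed out. A quick way to count the survivors (my own check, not in the original post):
# number of non-zero coefficients at the chosen penalty
sum(coef(fit, s = "lambda.1se") != 0)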
We can already tell the parametrization is only okay: the $R^2$ of the regression is 0.7153599. But we can make sure of its quality by predicting out of sample.
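For reference, that in-sample $R^2$ can be recomputed by predicting back on the training set; a sketch (the variable names are mine):
in_sample <- predict(fit, newdata = training, s = "lambda.1se")[, 1]
1 - sum((in_sample - training$alpha)^2) /
  sum((training$alpha - mean(training$alpha))^2)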
Estimating the parameters
Now, let’s build a testing data set by running the model 500 more times. We then use the regressions we just built, without retraining them, to predict $\alpha$ and $\beta$ from the new model runs.
Let’s compare the estimates from our regressions against those we obtain by running R’s arima command.
# build the testing data set
testing <- list()
for (i in 1:500)
{
  tryCatch({
    alpha <- runif(min = -.9, max = .9, n = 1)
    beta <- runif(min = -.9, max = .9, n = 1)
    simulation <- complicated_model(
      n = SIMULATION_LENGTH,
      ar = alpha,
      ma = beta
    )
    # arima is the standard R command to fit ARIMA models
    fitt <- arima(simulation, order = c(1, 1, 1))
    testing[[length(testing) + 1]] <-
      c(alpha, beta,
        # extract the summary statistics of this run
        extract_summary_statistics(simulation),
        # the testing data frame will also have two more
        # columns with the arima fit from R
        fitt$coef[1],
        fitt$coef[2]
      )
  },
  # skip the (rare) runs where arima fails
  error = function(e) {})
}
# turn the list into a data frame and name it properly
testing <- tbl_df(do.call("rbind", testing))
colnames(testing) <- c("alpha", "beta", summary_names, "arima_alpha", "arima_beta")
# now let's use the regressions we built before
# (without retraining them!)
alpha_predictions <-
  data.frame(
    # this is the prediction from the elastic net fit
    prediction = (predict(fit, newdata = testing, s = "lambda.1se"))[, 1],
    # this is the real alpha
    correct = testing$alpha,
    # this is the alpha from the arima fit
    arima = testing$arima_alpha
  ) %>%
  gather(method, value, -correct) %>%
  mutate(variable = "alpha")
# do the same for beta
beta_predictions <-
  data.frame(
    prediction = (predict(fit_b, newdata = testing, s = "lambda.1se"))[, 1],
    correct = testing$beta,
    arima = testing$arima_beta
  ) %>%
  gather(method, value, -correct) %>%
  mutate(variable = "beta")
# put them in one table
predict_table <-
  bind_rows(alpha_predictions,
            beta_predictions)
# show prediction quality graphically
ggplot(predict_table,
       aes(y = value, x = correct)) +
  geom_point(aes(col = method)) +
  geom_abline(slope = 1, lwd = 2, col = "red") +
  scale_color_discrete(guide = FALSE) +
  facet_grid(variable ~ method,
             labeller = as_labeller(
               c("arima" = "R arima function",
                 "prediction" = "Regression",
                 "alpha" = expression(alpha),
                 "beta" = expression(beta)
               )
             )) +
  xlab("Real Value") +
  ylab("Prediction") +
  ggtitle("Out of Sample Estimation")
# show a table with basic error measures
kable(
  predict_table %>%
    group_by(method, variable) %>%
    summarise(predictivity = 1 - sum((value - correct)^2) / sum((mean(correct) - correct)^2),
              rmse = sqrt(mean((value - correct)^2)),
              mae = mean(abs(value - correct)),
              median_rse = sqrt(median((value - correct)^2))
    ) %>%
    arrange(variable, method)
)
| method | variable | predictivity | rmse | mae | median_rse |
|---|---|---|---|---|---|
| arima | alpha | 0.5313216 | 0.3491251 | 0.2082134 | 0.1042458 |
| prediction | alpha | 0.6931038 | 0.2825136 | 0.2085228 | 0.1469686 |
| arima | beta | 0.5556443 | 0.3497373 | 0.2099202 | 0.1031024 |
| prediction | beta | 0.7141395 | 0.2805134 | 0.2037229 | 0.1399748 |
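For reference, the predictivity column is simply the out-of-sample $R^2$ computed in the summarise call above: \[ \text{predictivity} = 1 - \frac{\sum_k (\hat{\theta}_k - \theta_k)^2}{\sum_k (\bar{\theta} - \theta_k)^2} \] where $\theta_k$ is the true parameter in test run $k$, $\hat{\theta}_k$ its estimate, and $\bar{\theta}$ the mean of the true values.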
We learn two things:
- The R arima function with its standard settings isn’t particularly good (although its low predictivity is mostly due to a few large mistakes)
- The predictivity of both our regressions is only about .7, much as the in-sample $R^2$ suggested
And that’s it!
A neat little way to parametrize the model. It isn’t perfect, obviously: in this example the error is $\approx .2$ to $.3$, which is quite big given that the total parameter range is only 1.8.
Still, it’s not that different from what you get by running the proper arima command, which knows the true model and maximizes its likelihood (although clearly with terrible starting guesses)!
Bibliography
Blum, M G B, M A Nunes, D Prangle, and S A Sisson. 2013. “A comparative review of dimension reduction methods in approximate Bayesian computation.” Statistical Science 28 (2):189–208. https://doi.org/10.1214/12-STS406.