Introduction to INLA

INLA

In the Bayesian paradigm all unknown quantities in the model are treated as random variables, and the aim is to compute (or estimate) the joint posterior distribution: that is, the distribution of the parameters, θ, conditional on the observed data y. The posterior distribution is obtained via Bayes' theorem:

\[\begin{equation} \pi(\theta|\textbf{y}) = \frac{ \pi(\textbf{y}|\theta)\pi(\theta) }{\pi(\textbf{y}) } \end{equation}\]

where \(\pi(\textbf{y}|\theta)\) is the likelihood of the data \(\textbf{y}\) given the parameters \(\theta\), \(\pi(\theta)\) is the prior distribution of the parameters, and \(\pi(\textbf{y})\) is the marginal likelihood, which acts as a normalizing constant (Gómez-Rubio, 2021).
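For completeness, the marginal likelihood in the denominator is the likelihood integrated over the prior:

\[\begin{equation} \pi(\textbf{y}) = \int \pi(\textbf{y}|\theta)\pi(\theta)d\theta \end{equation}\]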

The Integrated Nested Laplace Approximation, or INLA, is a recent method for fitting Bayesian models. The INLA approach aims to overcome the computational difficulty of MCMC in data-intensive problems or complex models: in many applications, sampling the posterior distribution with MCMC can take too long and is often not even feasible with the available computational resources.
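Instead of sampling from the joint posterior, INLA builds an approximation of the posterior marginal of each model parameter,

\[\begin{equation} \pi(\theta_{j}|\textbf{y}) = \int \pi(\theta|\textbf{y})d\theta_{-j}, \end{equation}\]

where \(\theta_{-j}\) denotes all the parameters except \(\theta_{j}\).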

The slides of the “SPATIAL PREDICTION MODELS IN R” lecture at UCSD-GPS Fall 2021 can be found here.

Required packages

First, we install and load all the packages needed for this workshop. Here is a reference for the installation of the INLA package.

# install.packages("kableExtra")
# install.packages("tidyverse")
# install.packages("yardstick")
# install.packages("gt")
# install.packages("spdep")
# install.packages("viridis")
# install.packages("INLA",repos=c(getOption("repos"),INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)


library(kableExtra)
library(tidyverse)
library(yardstick)
library(gt)
library(spdep)
library(viridis)
library(sf) # for st_read(), st_drop_geometry() and other spatial operations used below
library(INLA)
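If the installation succeeded, INLA can report its own version, which is a quick sanity check (inla.version() ships with the package):

# Quick check that INLA is correctly installed
inla.version()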

Mortality data

For this workshop we will use monthly mortality data from Lima, Peru (2018-2019). We download the data directly from the GitHub repository. You can check the dictionary at the bottom of the table.

db <- readRDS(url("https://github.com/healthinnovation/Inla_intro/raw/main/db_excess_proc_dis_1819_m.rds"))
reg   prov  distr          year  month  n    week  date        temperature  precipitation  pp.insured  pp.edu.under25  pp.pover   pp.no.elec  pp.no.water
LIMA  LIMA  SANTA ROSA     2019  11     14   44    2019-11-04  29.05179     0.0628037      0.7250182   0.1420677       0.1330481  0.0308246   0.103765
LIMA  LIMA  SURQUILLO      2019  04     82   13    2019-04-01  29.43726     0.4175346      0.7250182   0.1420677       0.4330481  0.7308246   0.803765
LIMA  LIMA  PUNTA NEGRA    2018  01     0    1     2018-01-08  29.27019     0.5562810      0.7250182   0.1420677       0.1330481  0.0308246   0.103765
LIMA  LIMA  CHACLACAYO     2019  11     30   44    2019-11-04  29.05179     0.0628037      0.7250182   0.1420677       0.4330481  0.0308246   0.103765
LIMA  LIMA  CHACLACAYO     2019  01     42   1     2019-01-07  29.26792     0.2194335      0.7250182   0.1420677       0.4330481  0.0308246   0.103765
LIMA  LIMA  CARABAYLLO     2019  07     194  26    2019-07-01  29.06252     0.0398653      0.3250182   0.1420677       0.4330481  0.7308246   0.803765
LIMA  LIMA  SAN BORJA      2018  12     104  48    2018-12-03  29.15187     0.1387612      0.7250182   0.1420677       0.4330481  0.7308246   0.803765
LIMA  LIMA  LIMA           2018  09     316  35    2018-09-03  29.00803     0.0264610      0.7250182   0.1420677       0.4330481  0.7308246   0.803765
LIMA  LIMA  LIMA           2019  12     454  48    2019-12-02  29.15607     0.1205020      0.7250182   0.1420677       0.4330481  0.7308246   0.803765
LIMA  LIMA  SURQUILLO      2018  10     116  39    2018-10-01  29.03968     0.0989436      0.7250182   0.1420677       0.4330481  0.7308246   0.803765
LIMA  LIMA  BARRANCO       2018  05     22   18    2018-05-07  29.31195     0.2102735      0.7250182   0.1420677       0.4330481  0.0308246   0.103765
LIMA  LIMA  RIMAC          2018  12     180  48    2018-12-03  29.15187     0.1387612      0.7250182   0.1420677       0.4330481  0.7308246   0.803765
LIMA  LIMA  PUNTA HERMOSA  2018  07     0    26    2018-07-02  29.01577     0.0666329      0.7250182   0.1420677       0.1330481  0.0308246   0.103765
LIMA  LIMA  SAN BORJA      2019  08     80   31    2019-08-05  28.98726     0.0000000      0.7250182   0.1420677       0.4330481  0.7308246   0.803765
LIMA  LIMA  LA MOLINA      2019  07     116  26    2019-07-01  29.06252     0.0398653      0.7250182   0.1420677       0.4330481  0.7308246   0.803765
Dictionary

Variable name   Description
reg             region
prov            province
distr           district
year            year of register
month           month of register
week            week of register
n               number of deaths
temperature     monthly temperature
precipitation   monthly precipitation
pp.pover        poverty indicator
pp.edu.under25  proportion of people under 25 with a low level of education
pp.insured      proportion of insured population
pp.no.elec      proportion of people without access to basic electricity service
pp.no.water     proportion of people without access to basic water service

1 Descriptive analysis

2 Creating a model with INLA

To construct models using the INLA approach we can use the inla() function which has several parameters. Some of the most important are:

  • data : An object, typically of class data.frame, containing the data used to fit the model.

  • formula : An inla formula like y ~ 1 + z. This is much like the formula for a glm, except that smooth or spatial terms can be added to the right-hand side. See f for full details and the web site www.r-inla.org for several worked examples. Each smooth or spatial term specified through f should correspond to a separate column of the data frame data.

  • verbose : A boolean indicating whether to print the convergence process to the console.

lima.m0 <- inla(n ~ 1 + temperature,
                verbose = F,
                data = db)

The parameters detailed above are the essential ones to execute the model using INLA. However, some extra parameters to consider are the following:

  • family: An object of class character. This parameter is crucial, as it determines the distribution of the target variable; by default it is family = "gaussian".
  • control.compute: An object of class list that allows us to request model-assessment criteria such as dic, waic and cpo.
  • control.predictor: We set link = 1 to establish that the fitted values are on the same scale as the target variable (this will be very helpful in the forecasting phase).

lima.m1 <- inla(n ~ 1 + temperature,
                verbose = F,
                data = db,
                family = "Gaussian",
                control.compute = list(dic=TRUE, waic=TRUE,cpo=TRUE),
                control.predictor = list(link = 1))

We can inspect the results of the models using the summary function. For example, for the last model:

summary(lima.m1)
## 
## Call:
##    c("inla(formula = n ~ 1 + temperature, family = \"Gaussian\", data = 
##    db, ", " verbose = F, control.compute = list(dic = TRUE, waic = TRUE, 
##    ", " cpo = TRUE), control.predictor = list(link = 1))") 
## Time used:
##     Pre = 2.93, Running = 0.486, Post = 0.187, Total = 3.61 
## Fixed effects:
##                 mean      sd 0.025quant 0.5quant 0.975quant     mode kld
## (Intercept) 1073.012 429.485    232.184 1072.101   1918.001 1070.258   0
## temperature  -32.730  14.717    -61.722  -32.700     -3.931  -32.636   0
## 
## Model hyperparameters:
##                                         mean   sd 0.025quant 0.5quant
## Precision for the Gaussian observations 0.00 0.00       0.00     0.00
##                                         0.975quant mode
## Precision for the Gaussian observations       0.00 0.00
## 
## Expected number of effective parameters(stdev): 1.19(0.212)
## Number of equivalent replicates : 865.14 
## 
## Deviance Information Criterion (DIC) ...............: 12461.25
## Deviance Information Criterion (DIC, saturated) ....: 253267.12
## Effective number of parameters .....................: 2.27
## 
## Watanabe-Akaike information criterion (WAIC) ...: 12462.05
## Effective number of parameters .................: 3.05
## 
## Marginal log-Likelihood:  -6249.86 
## CPO and PIT are computed
## 
## Posterior marginals for the linear predictor and
##  the fitted values are computed

To obtain specific information from the summary we can use the $ operator on the INLA object. For example, we can access the linear fixed effects using InlaObject$summary.fixed:

lima.m1$summary.fixed
##                   mean        sd 0.025quant   0.5quant  0.975quant       mode
## (Intercept) 1073.01172 429.48474  232.18447 1072.10124 1918.001113 1070.25818
## temperature  -32.73019  14.71715  -61.72204  -32.69981   -3.930967  -32.63582
##                      kld
## (Intercept) 6.134579e-07
## temperature 6.133797e-07
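Other elements of the object can be extracted in the same way. As a sketch, using components that inla() returns when the corresponding options are enabled:

# Posterior summaries of the hyperparameters (here, the Gaussian precision)
lima.m1$summary.hyperpar
# Model-choice criteria requested through control.compute
lima.m1$dic$dic
lima.m1$waic$waic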

In addition, we can use more variables in the linear predictor or make other assumptions about the distribution of the target variable in order to obtain a better model. Given that we are modelling a count process, we will assume a negative binomial distribution and employ socio-economic and climatic variables that can explain the variability of the number of deaths in Lima. Until now we have only considered linear effects, but with INLA we can also model non-linear effects that control for temporal and spatial structure.

2.1 Temporal Effects

When considering temporal effects, we are implicitly assuming that the time series can be decomposed as follows:

\[\begin{align*} y_{t} = S_{t}+T_{t}+e_{t} \end{align*}\]

where:

\(S_{t}\) : seasonality
\(T_{t}\) : trend
\(e_{t}\) : white noise

To model these components in INLA, we can use:

  • AR1 (1st order auto-regressive process) : f(variable, model = "ar1")

  • RW1 (1st order random walk) : f(variable, model = "rw1")

  • RW2 (2nd order random walk) : f(variable, model = "rw2")

# Assuming a prior distribution for the years: "linear"
lima.linear <- inla(n ~ 1 + year + temperature + pp.insured + pp.pover + pp.no.elec + pp.no.water,
                verbose = F,
                data = db,
                family = "nbinomial",
                control.compute = list(dic=TRUE, waic=TRUE,cpo=TRUE),
                control.predictor = list(link = 1))

# Assuming a prior distribution for the weeks: "rw1" and "linear" for the  year 
lima.rw <- inla(n ~ 1 + year + f(week,model="rw1") + temperature + pp.insured + pp.pover + 
                  pp.no.elec + pp.no.water,
                verbose = F,
                data = db,
                family = "nbinomial",
                control.compute = list(dic=TRUE, waic=TRUE,cpo=TRUE),
                control.predictor = list(link = 1))

# Assuming a prior distribution for the weeks: "rw1" and "ar1" for the year
lima.ar_rw <- inla(n ~ 1  + f(year,model="ar1") + f(week,model="rw1") + temperature + pp.insured + 
                     pp.pover + pp.no.elec + pp.no.water,
                verbose = F,
                data = db,
                family = "nbinomial",
                control.compute = list(dic=TRUE, waic=TRUE,cpo=TRUE),
                control.predictor = list(link = 1))
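Since we requested dic and waic through control.compute, we can already sketch a quick comparison of the three temporal specifications (lower values indicate a better fit):

# Collect DIC and WAIC for the three temporal specifications
data.frame(model = c("linear", "RW", "AR_RW"),
           dic   = c(lima.linear$dic$dic, lima.rw$dic$dic, lima.ar_rw$dic$dic),
           waic  = c(lima.linear$waic$waic, lima.rw$waic$waic, lima.ar_rw$waic$waic))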

You can explore further details about the priors using the inla.doc("prior name") function. For example:

inla.doc("ar1")

2.2 Spatial Effects

The assumption about the decomposition of the time series can become more complex when considering the spatial dimension:

\[\begin{align*} y_{t} = S_{t}+T_{t}+u_{t}+e_{t} \end{align*}\]

where:

\(u_{t}\) : unstructured effects

The component \(u_{t}\), also known as unstructured random effects, is included to take into account the spatial dimension of the data and allows us to control for unobserved characteristics of each area under study. These components can be modeled in INLA; however, first we need to assign an index to every spatial unit:

db.lima.sp <- db %>% 
  group_by(distr) %>% 
  mutate(id.sp=cur_group_id())
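As a quick sanity check, the number of spatial indices should match the number of districts in the data:

# One index per district
n_distinct(db.lima.sp$id.sp)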

In the descriptive analysis we observed signs of spatial correlation between areas. In order to control for this mechanism we need to add \(\nu_{t}\), the structured-effects component, to our equation. Our final equation is:

\[\begin{align*} y_{t} = S_{t}+T_{t}+\nu_{t}+u_{t}+e_{t} \end{align*}\]

These structured effects are random effects that explicitly take the spatial structure into account. They can be modeled in INLA in various ways:

  • Using besag spatial effects : f(area, model = "besag")

  • Using proper besag spatial effects : f(area, model = "besagproper")

To include these types of priors as spatial effects, we first need the geographic information expressed in a shapefile (in our case, the district-level spatial data for Lima). From the shapefile we create a neighborhood structure, and from that neighborhood structure we create the spatial weight matrix.

# lima.shp: an sf object with one polygon per Lima district
# (read beforehand from the district shapefile, e.g. with st_read())
# Creating the neighborhood structure
lima.adj <- poly2nb(lima.shp)
# Spatial weights
W.lima <- nb2mat(lima.adj, style = "W")

The code above encodes the neighborhood structure of the Lima districts.
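A minimal sketch to visualize it, assuming lima.shp is an sf object with the district polygons:

# Plot the district borders and overlay the neighborhood links
plot(st_geometry(lima.shp), border = "grey")
plot(lima.adj, st_coordinates(st_centroid(st_geometry(lima.shp))),
     add = TRUE, col = "red")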

Now we are ready to fit a model in which the effects by district are correlated in space. We use the f(id.sp, model = "bym", graph = W.lima) term in the formula to specify the prior (“bym”) and the spatial structure object we just created (W.lima):

lima.spat <- inla(n ~ 1  + f(year,model="ar1") + f(week,model="rw1") +
                    f(id.sp, model = "bym", graph = W.lima) + temperature + pp.insured + 
                    pp.pover + pp.no.elec + pp.no.water,
                verbose = F,
                data = db.lima.sp,
                family = "nbinomial",
                control.compute = list(dic=TRUE, waic=TRUE,cpo=TRUE),
                control.predictor = list(link = 1))
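The estimated spatial random effects can be inspected through summary.random. For the "bym" model, INLA stacks the combined (structured + unstructured) effects on top of the structured ones, so the table has twice as many rows as districts:

# Posterior summaries of the area-level random effects
head(lima.spat$summary.random$id.sp)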

We can collect the information from all the models tested so far and compare whether the specification of the spatial and temporal trends affects the magnitude and precision of the fixed-effect coefficients (socioeconomic and climatic).

f.linear <- lima.linear$summary.fixed %>% 
  mutate(Model="linear") %>%
  rownames_to_column("Variable")
f.rw <- lima.rw$summary.fixed %>% 
  mutate(Model="RW") %>%
  rownames_to_column("Variable")
f.ar_rw <- lima.ar_rw$summary.fixed %>% 
  mutate(Model="AR_RW") %>%
  rownames_to_column("Variable")
f.spat <- lima.spat$summary.fixed %>% 
  mutate(Model="Spatial") %>% 
  rownames_to_column("Variable")

fix.data<-rbind(f.linear, f.rw, f.ar_rw, f.spat) %>% 
  filter(!str_detect(Variable, 'year')) %>% 
  filter(!str_detect(Variable, '(Intercept)'))

and plot them:

fix.data %>% 
  ggplot(aes(colour=Model)) + 
  geom_pointrange(aes( x = Variable, 
                       y = mean,
                       ymin = `0.025quant`,
                       ymax = `0.975quant`),
                  position = position_dodge(width = 1/2), 
                  lwd = 1/2) + 
  scale_color_brewer(palette = "Paired") +
  geom_hline(yintercept = 0, lty = 2) +
  coord_flip() +
  theme_linedraw()

However, since we’re using Bayesian models to improve predictions, we’re not focusing on dose-response (mechanistic) effects; we’re focusing on improving the accuracy of our model in predicting current and future trends.

3 Model accuracy

3.1 Predicted values

We can proceed to analyse our fitted models in space. We access the fitted values with InlaObject$summary.fitted.values$mean and then use them to plot the predicted values for a specific date.

db.lima.sf <-  lima.shp %>% 
  inner_join(db.lima.sp %>% 
               ungroup() %>% 
               mutate(fit =lima.rw$summary.fitted.values$mean,
                      fit2=lima.spat$summary.fitted.values$mean) %>% 
               group_by(prov,distr,week,year,month) %>% 
               slice(1),
             by = c("prov" ="prov", "distr"="distr")) 

map.true <- db.lima.sf %>% 
  filter(year==2019 & month == 12) %>% 
  ggplot() +
  geom_sf(aes(fill=n/10)) +
  theme_linedraw(base_size = 23) +
  scale_fill_viridis(name="Number of\nactual deaths x10\n(12/2019)", 
                     option="rocket",direction = -1)       

map.rw <- db.lima.sf %>% 
  filter(year==2019 & month ==12) %>% 
  ggplot() +
  geom_sf(aes(fill=fit/10)) +
  theme_linedraw(base_size = 23) +
  scale_fill_viridis(name="Number of\nfitted deaths x10\n(rw 12/2019)",
                     option="rocket",
                     direction = -1)

map.fit <- db.lima.sf %>% 
  filter(year==2019 & month ==12) %>% 
  ggplot() +
  geom_sf(aes(fill=fit2/10)) +
  theme_linedraw(base_size = 23) +
  scale_fill_viridis(name  ="Number of\nfitted deaths x10\n(spatial 12/2019)", option="rocket",direction = -1)   

cowplot::plot_grid(map.true, map.rw, map.fit, ncol = 3)

3.2 Performance metrics

In order to assess the performance of the models, we will calculate in-sample metrics. First, we collect the fitted values from the INLA objects:

fit.m.linear <- lima.linear$summary.fitted.values$mean
fit.m.rw <- lima.rw$summary.fitted.values$mean
fit.m.ar_rw <- lima.ar_rw$summary.fitted.values$mean
fit.m.spat <- lima.spat$summary.fitted.values$mean
n.lima <- db.lima.sp$n

fitted_vals  <-  list("linear" = fit.m.linear,
                "RW" = fit.m.rw,
                "AR_RW" = fit.m.ar_rw,
                "spatial" = fit.m.spat) %>% 
  as.data.frame() %>% 
  gather(key = "modelo", value = "fit") %>% 
  group_by(modelo) %>% 
  mutate(actual = n.lima,
         date = db.lima.sp$date,
         prov = db.lima.sp$prov,
         distr = db.lima.sp$distr)

To facilitate the calculation of the performance metrics we transformed the data to long format above; now we use the yardstick package to calculate the following metrics: mae, mase, smape, rmse.

We can calculate these metrics for the entire dataset:

#Metrics to use
perform.metrics <- metric_set(mae,mase,smape,rmse)

# Calculation of metrics over the entire dataset
tbl.yrd.full <-  fitted_vals %>% 
  group_by(modelo) %>%
  perform.metrics(truth = actual, estimate = fit)

And per district:

#Metrics to use
perform.metrics <- metric_set(mae,mase,smape,rmse)

# Calculation of metrics per district
tbl.yrd.per <- fitted_vals %>% 
  group_by(modelo,prov,distr) %>%
  perform.metrics(truth = actual, estimate = fit)

Finally, we use the gt package to show the results table:

# Results table
tbl.yrd.full %>% 
  pivot_wider(id_cols = modelo,
              names_from = .metric,
              values_from = .estimate) %>%         
  gt() %>%
  tab_header(title = md("in-sample accuracy metrics"))
in-sample accuracy metrics

modelo   mae¹       mase²      smape³     rmse⁴
AR_RW    56.64523   1.8077520  48.59065   89.61038
linear   56.76891   1.8116989  48.65139   89.86674
RW       56.61960   1.8069340  48.57694   89.55081
spatial  16.33685   0.5213672  24.32419   25.09145

1 mae = mean absolute error
2 mase = mean absolute scaled error
3 smape = symmetric mean absolute percentage error
4 rmse = root mean square error

We can assess this performance spatially, in this case by district. In this example, we plot the spatial distribution of the mean absolute error (MAE) for the models with temporal (RW) and spatial (bym) effects:

tbl.yrd.per.sf <- lima.shp  %>%
  inner_join(tbl.yrd.per, by=c("prov"="prov","distr"="distr"))

tbl.yrd.per.sf %>% 
  filter(.metric=="mae" & modelo %in% c("RW","spatial")) %>% 
  ggplot() +
  geom_sf(aes(fill=.estimate),lwd=0.1) +
  scale_fill_distiller(palette="Reds",direction=1,name="MAE") +
  facet_wrap(vars(modelo)) +
  theme_linedraw(base_size = 23) +
  theme(strip.text = element_text(face = "bold",size = 30)) 

We can observe that the MAE is lower when using the spatial (bym) model than the temporal (RW) model in most districts.

4 Cross-validation

In order to obtain more precise results we can calculate the CPO (conditional predictive ordinate). This is a cross-validation criterion for model assessment that is computed for each observation as

\[\begin{align*} CPO_{i} = \pi(y_{i}|\textbf{y}_{-i}) \end{align*}\]

Hence, for each observation its CPO is the posterior predictive density of that observation when the model is fit using all data but \(y_{i}\). These per-observation values are usually summarized in a single metric as:

\[\begin{align*} CPO = -2 \sum_{i=1}^{n} \log(CPO_{i}) \end{align*}\]

To perform this calculation we collect the CPO information using model$cpo$cpo:

cpo.linear <- -2*sum(log(lima.linear$cpo$cpo))
cpo.rw     <- -2*sum(log(lima.rw$cpo$cpo))
cpo.ar_rw  <- -2*sum(log(lima.ar_rw$cpo$cpo))
cpo.spat   <- -2*sum(log(lima.spat$cpo$cpo))
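Before trusting these values, it is worth checking INLA's internal reliability flags: a nonzero entry in model$cpo$failure indicates that the corresponding CPO value may be unreliable. A quick sketch for the spatial model:

# Number of observations whose CPO computation is flagged as unreliable
sum(lima.spat$cpo$failure > 0, na.rm = TRUE)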

data.cpo  <-  list("linear" = cpo.linear,
                   "RW" = cpo.rw,
                   "AR_RW" = cpo.ar_rw,
                   "spatial" = cpo.spat) %>% 
  as.data.frame()

Finally, we report these results in a table using gt:

data.cpo %>% 
  pivot_longer(cols=colnames(data.cpo),
               names_to = "model",
               values_to = "CPO")  %>% 
  gt() %>%
  tab_header(title = md("LOO-CV")) 
LOO-CV

model    CPO
linear   10547.748
RW       10546.322
AR_RW    10544.647
spatial  8667.694

This indicator has an interpretation similar to that of information criteria such as AIC: lower values indicate a better model. In that sense, given the previous table, the best model is the one that considers the mortality data at the district level to be spatially correlated.

5 Forecasting

In order to predict the number of deaths in Lima during 2020 we need information on our covariates and on the areas around our area of interest (AOI). In this section we focus on Lima district, so only the 2020 outcome in this district is missing.

db.frcst <- readRDS(url("https://github.com/healthinnovation/Inla_intro/raw/main/db_excess_proc_dis_20_m.rds")) %>%
  inner_join(db.lima.sp %>% select(prov, distr, month, id.sp),
             by=c("prov", "distr", "month")) %>%
  distinct(reg,prov,distr,month,n,.keep_all = T)

db.lima.sp.2 <- db.lima.sp %>% 
  bind_rows(db.frcst)

db.lima.frct <- lima.shp %>%
  inner_join(db.lima.sp.2, by=c("prov","distr")) 
  
db.lima.frct.geo.off <- db.lima.frct %>% 
  st_drop_geometry()
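INLA treats rows with a missing response as prediction points, so we can verify that the rows to be forecasted are exactly those with a missing outcome:

# Rows with NA in the outcome are the ones INLA will forecast
sum(is.na(db.lima.frct.geo.off$n))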

Then we proceed to model our target variable with our best model. INLA will internally forecast the death counts for Lima district (the rows where the outcome is missing).

lima.forecast <- inla(n ~ 1 + f(year,model="ar1") + f(week,model="rw1") +
                        f(id.sp, model = "bym", graph = W.lima) + temperature +
                        pp.insured + pp.pover + pp.no.elec + pp.no.water,
                verbose = F,
                data = db.lima.frct.geo.off,
                family = "nbinomial",
                control.compute = list(dic=TRUE, waic=TRUE,cpo=TRUE),
                control.predictor = list(link = 1))


db.lima.frct.end <- db.lima.frct %>% 
  ungroup() %>% 
  mutate(fit = lima.forecast$summary.fitted.values$mean)      
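Beyond the posterior mean, summary.fitted.values also stores the posterior quantiles, so we can sketch 95% credibility intervals for the forecasts:

# 95% credibility intervals for the fitted and forecasted values
head(lima.forecast$summary.fitted.values[, c("mean", "0.025quant", "0.975quant")])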

Finally, in a similar way as above, we assess the forecasted deaths spatially.

(map.frct <- db.lima.frct.end %>% 
   filter(year== 2020 & month == 12) %>% 
   ggplot() +
   geom_sf(aes(fill=fit)) +
   theme_linedraw(base_size = 23) +
   scale_fill_viridis(name="Number of\nforecasted\ndeaths\n(12/2020)",
                      direction = -1,option="magma"))