In the Bayesian paradigm, all unknown quantities in the model are treated as random variables and the aim is to compute (or estimate) the joint posterior distribution. That is, the distribution of the parameters, θ, conditional on the observed data y. The posterior distribution is obtained through Bayes’ theorem:
\[\begin{equation} \pi(\theta|\textbf{y}) = \frac{ \pi(\textbf{y}|\theta)\pi(\theta) }{\pi(\textbf{y}) } \end{equation}\]where \(\pi(\textbf{y}|\theta)\) is the likelihood of the data \(\textbf{y}\) given the parameters \(\theta\), \(\pi(\theta)\) is the prior distribution of the parameters and \(\pi(\textbf{y})\) is the marginal likelihood, which acts as a normalizing constant (Gómez-Rubio, 2021).
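As a minimal worked example (a textbook conjugate case, not part of the workshop data): with a Binomial likelihood and a Beta prior, the posterior is available in closed form,
\[\begin{align*} y|\theta \sim \text{Binomial}(n, \theta), \quad \theta \sim \text{Beta}(a, b) \implies \theta|y \sim \text{Beta}(a + y, b + n - y) \end{align*}\]In most realistic models, however, \(\pi(\textbf{y})\) is intractable, which is what motivates approximate methods such as MCMC and INLA.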
The Integrated Nested Laplace Approximation, or INLA, is a relatively recent method for fitting Bayesian models. The INLA approach aims to overcome the computational difficulty of MCMC in data-intensive problems or complex models: in many applications, sampling from the posterior distribution with MCMC can take too long and is often not even feasible with existing computational resources.
The slides of the “SPATIAL PREDICTION MODELS IN R” lecture at UCSD-GPS Fall 2021 can be found here.
First, we install and load all the packages needed for this workshop. Here is a reference for the installation of the INLA package.
# install.packages("kableExtra")
# install.packages("tidyverse")
# install.packages("yardstick")
# install.packages("gt")
# install.packages("spdep")
# install.packages("viridis")
# install.packages("INLA",repos=c(getOption("repos"),INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)
library(kableExtra)
library(tidyverse)
library(yardstick)
library(gt)
library(spdep)
library(viridis)
library(INLA)
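If the installation succeeded, a quick sanity check (a minimal sketch) is to print the INLA version information:

# Check that INLA is installed and loaded correctly
inla.version()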
For this workshop we will use monthly mortality data from Lima, Peru (2018-2019). We download the data directly from the GitHub repository. You can check the data dictionary below the table.
db <- readRDS(url("https://github.com/healthinnovation/Inla_intro/raw/main/db_excess_proc_dis_1819_m.rds"))
reg | prov | distr | year | month | n | week | date | temperature | precipitation | pp.insured | pp.edu.under25 | pp.pover | pp.no.elec | pp.no.water |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LIMA | LIMA | SANTA ROSA | 2019 | 11 | 14 | 44 | 2019-11-04 | 29.05179 | 0.0628037 | 0.7250182 | 0.1420677 | 0.1330481 | 0.0308246 | 0.103765 |
LIMA | LIMA | SURQUILLO | 2019 | 04 | 82 | 13 | 2019-04-01 | 29.43726 | 0.4175346 | 0.7250182 | 0.1420677 | 0.4330481 | 0.7308246 | 0.803765 |
LIMA | LIMA | PUNTA NEGRA | 2018 | 01 | 0 | 1 | 2018-01-08 | 29.27019 | 0.5562810 | 0.7250182 | 0.1420677 | 0.1330481 | 0.0308246 | 0.103765 |
LIMA | LIMA | CHACLACAYO | 2019 | 11 | 30 | 44 | 2019-11-04 | 29.05179 | 0.0628037 | 0.7250182 | 0.1420677 | 0.4330481 | 0.0308246 | 0.103765 |
LIMA | LIMA | CHACLACAYO | 2019 | 01 | 42 | 1 | 2019-01-07 | 29.26792 | 0.2194335 | 0.7250182 | 0.1420677 | 0.4330481 | 0.0308246 | 0.103765 |
LIMA | LIMA | CARABAYLLO | 2019 | 07 | 194 | 26 | 2019-07-01 | 29.06252 | 0.0398653 | 0.3250182 | 0.1420677 | 0.4330481 | 0.7308246 | 0.803765 |
LIMA | LIMA | SAN BORJA | 2018 | 12 | 104 | 48 | 2018-12-03 | 29.15187 | 0.1387612 | 0.7250182 | 0.1420677 | 0.4330481 | 0.7308246 | 0.803765 |
LIMA | LIMA | LIMA | 2018 | 09 | 316 | 35 | 2018-09-03 | 29.00803 | 0.0264610 | 0.7250182 | 0.1420677 | 0.4330481 | 0.7308246 | 0.803765 |
LIMA | LIMA | LIMA | 2019 | 12 | 454 | 48 | 2019-12-02 | 29.15607 | 0.1205020 | 0.7250182 | 0.1420677 | 0.4330481 | 0.7308246 | 0.803765 |
LIMA | LIMA | SURQUILLO | 2018 | 10 | 116 | 39 | 2018-10-01 | 29.03968 | 0.0989436 | 0.7250182 | 0.1420677 | 0.4330481 | 0.7308246 | 0.803765 |
LIMA | LIMA | BARRANCO | 2018 | 05 | 22 | 18 | 2018-05-07 | 29.31195 | 0.2102735 | 0.7250182 | 0.1420677 | 0.4330481 | 0.0308246 | 0.103765 |
LIMA | LIMA | RIMAC | 2018 | 12 | 180 | 48 | 2018-12-03 | 29.15187 | 0.1387612 | 0.7250182 | 0.1420677 | 0.4330481 | 0.7308246 | 0.803765 |
LIMA | LIMA | PUNTA HERMOSA | 2018 | 07 | 0 | 26 | 2018-07-02 | 29.01577 | 0.0666329 | 0.7250182 | 0.1420677 | 0.1330481 | 0.0308246 | 0.103765 |
LIMA | LIMA | SAN BORJA | 2019 | 08 | 80 | 31 | 2019-08-05 | 28.98726 | 0.0000000 | 0.7250182 | 0.1420677 | 0.4330481 | 0.7308246 | 0.803765 |
LIMA | LIMA | LA MOLINA | 2019 | 07 | 116 | 26 | 2019-07-01 | 29.06252 | 0.0398653 | 0.7250182 | 0.1420677 | 0.4330481 | 0.7308246 | 0.803765 |
Variable name | Description |
---|---|
reg | region |
prov | province |
distr | district |
year | year of register |
month | month of register |
week | week of register |
n | number of deaths |
temperature | monthly temperature |
precipitation | monthly precipitation |
pp.pover | poverty indicator |
pp.edu.under25 | proportion of people under 25 with a low level of education |
pp.insured | proportion of insured population |
pp.no.elec | proportion of people without access to basic electricity service |
pp.no.water | proportion of people without access to basic water service |
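Before the descriptive analysis, it can be useful to take a quick look at the structure of the data; a minimal sketch using functions from the tidyverse:

# Inspect variable types and a few values
glimpse(db)
# Number of districts and the period covered
n_distinct(db$distr)
range(db$date)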
We proceed to do a temporal descriptive analysis using the ggplot2 package.
db %>%
  group_by(date) %>%
  summarise(n = sum(n)) %>%
  ggplot(aes(x = date, y = n)) +
  geom_line(color = "red") +
  geom_point(color = "red", shape = 21) +
  labs(y = "Deaths count") +
  theme_bw(base_size = 15)
Similarly, we can check the spatial distribution of the data. We load a shapefile that contains the geographic information of Peru and merge it with our tabular (mortality) data:
lima.shp <- readRDS(url("https://github.com/healthinnovation/Inla_intro/raw/main/lima_shp.rds"))

db.sp.lima <- lima.shp %>%
  inner_join(db %>%
               group_by(year, prov, distr) %>%
               summarise(n = sum(n)))
We can then plot the number of deaths using the ggplot2 package:
db.sp.lima %>%
  ggplot() +
  geom_sf(aes(fill = n)) +
  scale_fill_viridis(name = "Number of deaths", direction = -1, option = "rocket") +
  facet_wrap(vars(year)) +
  theme_linedraw(base_size = 25)
To construct models using the INLA approach we can use the inla() function, which has several parameters. Some of the most important are:
- data: an object, typically of class data.frame, containing the data used to fit the model.
- formula: an inla formula like y ~ 1 + z. This is much like the formula for a glm, except that smooth or spatial terms can be added to the right-hand side. See f() for full details and the website www.r-inla.org for several worked examples. Each smooth or spatial term specified through f() should correspond to a separate column of the data frame.
- verbose: a boolean indicating whether to show the convergence process in the console.
lima.m0 <- inla(n ~ 1 + temperature,
                verbose = F,
                data = db)
The parameters detailed above are the essential ones to run a model with INLA. However, some extra parameters to consider are the following:
- family: an object of class character. This parameter is crucial, as it determines the distribution assumed for the target variable; by default it is family = "Gaussian".
- control.compute: an object of class list that allows us to request the calculation of model-assessment criteria such as dic, waic and cpo.
- control.predictor: we set link = 1 to establish that the fitted values are on the same scale as the target variable (this will be very helpful in the forecasting phase).

lima.m1 <- inla(n ~ 1 + temperature,
                verbose = F,
                data = db,
                family = "Gaussian",
                control.compute = list(dic = TRUE, waic = TRUE, cpo = TRUE),
                control.predictor = list(link = 1))
We can inspect the results of the models using the summary() function. For example, for the last model:
summary(lima.m1)
##
## Call:
## c("inla(formula = n ~ 1 + temperature, family = \"Gaussian\", data =
## db, ", " verbose = F, control.compute = list(dic = TRUE, waic = TRUE,
## ", " cpo = TRUE), control.predictor = list(link = 1))")
## Time used:
## Pre = 2.93, Running = 0.486, Post = 0.187, Total = 3.61
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) 1073.012 429.485 232.184 1072.101 1918.001 1070.258 0
## temperature -32.730 14.717 -61.722 -32.700 -3.931 -32.636 0
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant
## Precision for the Gaussian observations 0.00 0.00 0.00 0.00
## 0.975quant mode
## Precision for the Gaussian observations 0.00 0.00
##
## Expected number of effective parameters(stdev): 1.19(0.212)
## Number of equivalent replicates : 865.14
##
## Deviance Information Criterion (DIC) ...............: 12461.25
## Deviance Information Criterion (DIC, saturated) ....: 253267.12
## Effective number of parameters .....................: 2.27
##
## Watanabe-Akaike information criterion (WAIC) ...: 12462.05
## Effective number of parameters .................: 3.05
##
## Marginal log-Likelihood: -6249.86
## CPO and PIT are computed
##
## Posterior marginals for the linear predictor and
## the fitted values are computed
In order to obtain specific information from the summary we can use the $ operator on INLA objects. For example, we can access the fixed effects using InlaObject$summary.fixed:
lima.m1$summary.fixed
## mean sd 0.025quant 0.5quant 0.975quant mode
## (Intercept) 1073.01172 429.48474 232.18447 1072.10124 1918.001113 1070.25818
## temperature -32.73019 14.71715 -61.72204 -32.69981 -3.930967 -32.63582
## kld
## (Intercept) 6.134579e-07
## temperature 6.133797e-07
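Beyond these point summaries, INLA returns the full posterior marginals. As a sketch, we can plot the marginal of the temperature coefficient and compute the posterior probability that it is negative using the built-in inla.smarginal() and inla.pmarginal() helpers:

# Posterior marginal of the temperature coefficient
marg.temp <- lima.m1$marginals.fixed$temperature
plot(inla.smarginal(marg.temp), type = "l",
     xlab = "Temperature coefficient", ylab = "Density")
# Posterior probability that the coefficient is below zero
inla.pmarginal(0, marg.temp)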
In addition, we can use more variables in the linear predictor or other assumptions about the distribution of the target variable in order to obtain a better model. Given that we are modelling a count process, we will assume a negative binomial distribution and employ socio-economic and climatic variables that can explain the variability in the number of deaths in Lima. Until now we have only considered linear effects, but with INLA we can also model non-linear effects that control for temporal and spatial structure. When considering temporal effects, we are implicitly assuming that the time series can be decomposed as follows:
\[\begin{align*} y_{t} = S_{t}+T_{t}+e_{t} \end{align*}\]where:
\(S_{t}\): seasonality
\(T_{t}\): trend
\(e_{t}\): white noise
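To make this decomposition concrete, here is a toy simulation (made-up numbers, for illustration only) of two years of weekly data with the three components:

# Simulated example of y_t = S_t + T_t + e_t
set.seed(1)
t <- 1:104                          # two years of weekly observations
S.t <- 10 * sin(2 * pi * t / 52)    # seasonality: annual cycle
T.t <- 0.5 * t                      # trend: linear increase
e.t <- rnorm(104, sd = 2)           # white noise
y.t <- S.t + T.t + e.t
plot(t, y.t, type = "l", xlab = "week", ylab = "y")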
To model these components in INLA, we can use:
- AR1 (1st-order autoregressive process): f(variable, model = "ar1")
- RW1 (1st-order random walk): f(variable, model = "rw1")
- RW2 (2nd-order random walk): f(variable, model = "rw2")
# Assuming a "linear" effect for the year
lima.linear <- inla(n ~ 1 + year + temperature + pp.insured + pp.pover + pp.no.elec + pp.no.water,
                    verbose = F,
                    data = db,
                    family = "nbinomial",
                    control.compute = list(dic = TRUE, waic = TRUE, cpo = TRUE),
                    control.predictor = list(link = 1))
# Assuming an "rw1" prior for the weeks and a linear effect for the year
lima.rw <- inla(n ~ 1 + year + f(week, model = "rw1") + temperature + pp.insured + pp.pover +
                  pp.no.elec + pp.no.water,
                verbose = F,
                data = db,
                family = "nbinomial",
                control.compute = list(dic = TRUE, waic = TRUE, cpo = TRUE),
                control.predictor = list(link = 1))
# Assuming an "rw1" prior for the weeks and an "ar1" prior for the year
lima.ar_rw <- inla(n ~ 1 + f(year, model = "ar1") + f(week, model = "rw1") + temperature + pp.insured +
                     pp.pover + pp.no.elec + pp.no.water,
                   verbose = F,
                   data = db,
                   family = "nbinomial",
                   control.compute = list(dic = TRUE, waic = TRUE, cpo = TRUE),
                   control.predictor = list(link = 1))
You can explore further details about the priors using the inla.doc("prior name") function. For example:
inla.doc("ar1")
The assumption about the decomposition of the time series can become more complex when considering the spatial dimension:
\[\begin{align*} y_{t} = S_{t}+T_{t}+u_{t}+e_{t} \end{align*}\]where:
\(u_{t}\): unstructured effects
The component \(u_{t}\), also known as unstructured random effects, is included to take into account the spatial dimension of the data and to control for unobserved characteristics of each area under study. These components can be modeled in INLA; however, we first need to identify every spatial unit:
db.lima.sp <- db %>%
  group_by(distr) %>%
  mutate(id.sp = cur_group_id())
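A quick check (a sketch) that every district received exactly one identifier:

# One spatial id per district
n_distinct(db.lima.sp$id.sp) == n_distinct(db.lima.sp$distr)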
In the descriptive analysis we observed signals of spatial correlation between areas. In order to account for this mechanism we need to add \(\nu_{t}\), the structured effects component, to our equation. Our final equation is:
\[\begin{align*} y_{t} = S_{t}+T_{t}+u_{t}+\nu_{t}+e_{t} \end{align*}\]
These structured effects are random effects that explicitly take spatial structure into account. They can be modeled in INLA in various ways:
- Using Besag spatial effects: f(area, model = "besag")
- Using proper Besag spatial effects: f(area, model = "besagproper")
To include these types of priors as spatial effects, we first need the geographic information of Peru expressed in a shapefile (in our case, the district-level spatial data). We then create a neighborhood structure from the shapefile and, from that neighborhood structure, the spatial weight matrix.
# Creating the neighborhood structure
lima.adj <- poly2nb(lima.shp)
# Spatial weights
W.lima <- nb2mat(lima.adj, style = "W")
The code above builds the neighborhood structure, which we can visualize by drawing the adjacency links over the district map.
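One way to draw it (a sketch, assuming lima.shp is an sf object):

# Plot district borders with adjacency links between neighboring districts
library(sf)
coords <- st_coordinates(st_centroid(st_geometry(lima.shp)))
plot(st_geometry(lima.shp), border = "grey")
plot(lima.adj, coords, add = TRUE, col = "red")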
Now we are ready to fit a model in which the district effects are correlated in space. We use the f(id.sp, model = "bym", graph = W.lima) term in the formula to specify the prior ("bym") and the spatial structure object we just created (W.lima):
lima.spat <- inla(n ~ 1 + f(year, model = "ar1") + f(week, model = "rw1") +
                    f(id.sp, model = "bym", graph = W.lima) + temperature + pp.insured +
                    pp.pover + pp.no.elec + pp.no.water,
                  verbose = F,
                  data = db.lima.sp,
                  family = "nbinomial",
                  control.compute = list(dic = TRUE, waic = TRUE, cpo = TRUE),
                  control.predictor = list(link = 1))
We can collect the information from all the models tested so far and compare whether the specification of the spatial and temporal trends affects the magnitude and precision of the fixed-effect coefficients (socioeconomic and climatic variables).
f.linear <- lima.linear$summary.fixed %>%
  mutate(Model = "linear") %>%
  rownames_to_column("Variable")

f.rw <- lima.rw$summary.fixed %>%
  mutate(Model = "RW") %>%
  rownames_to_column("Variable")

f.ar_rw <- lima.ar_rw$summary.fixed %>%
  mutate(Model = "AR_RW") %>%
  rownames_to_column("Variable")

f.spat <- lima.spat$summary.fixed %>%
  mutate(Model = "Spatial") %>%
  rownames_to_column("Variable")

fix.data <- rbind(f.linear, f.rw, f.ar_rw, f.spat) %>%
  filter(!str_detect(Variable, 'year')) %>%
  filter(!str_detect(Variable, '(Intercept)'))
and plot them:
fix.data %>%
  ggplot(aes(colour = Model)) +
  geom_pointrange(aes(x = Variable,
                      y = mean,
                      ymin = `0.025quant`,
                      ymax = `0.975quant`),
                  position = position_dodge(width = 1/2),
                  lwd = 1/2) +
  scale_color_brewer(palette = "Paired") +
  geom_hline(yintercept = 0, lty = 2) +
  coord_flip() +
  theme_linedraw()
However, since we are using Bayesian models to improve predictions, we are not focusing on the dose-response (mechanistic) effects; we are focusing on improving the accuracy of our model to predict current and future trends.
We can proceed to analyse our fitted models in space. We access the fitted values with InlaObject$summary.fitted.values$mean. Then we can use them to plot the predicted values for a specific date.
db.lima.sf <- lima.shp %>%
  inner_join(db.lima.sp %>%
               ungroup() %>%
               mutate(fit = lima.rw$summary.fitted.values$mean,
                      fit2 = lima.spat$summary.fitted.values$mean) %>%
               group_by(prov, distr, week, year, month) %>%
               slice(1),
             by = c("prov" = "prov", "distr" = "distr"))
map.true <- db.lima.sf %>%
  filter(year == 2019 & month == 12) %>%
  ggplot() +
  geom_sf(aes(fill = n/10)) +
  theme_linedraw(base_size = 23) +
  scale_fill_viridis(name = "Number of\nactual deaths x10\n(12/2019)",
                     option = "rocket", direction = -1)

map.rw <- db.lima.sf %>%
  filter(year == 2019 & month == 12) %>%
  ggplot() +
  geom_sf(aes(fill = fit/10)) +
  theme_linedraw(base_size = 23) +
  scale_fill_viridis(name = "Number of\nfitted deaths x10\n(rw 12/2019)",
                     option = "rocket",
                     direction = -1)

map.fit <- db.lima.sf %>%
  filter(year == 2019 & month == 12) %>%
  ggplot() +
  geom_sf(aes(fill = fit2/10)) +
  theme_linedraw(base_size = 23) +
  scale_fill_viridis(name = "Number of\nfitted deaths x10\n(spatial 12/2019)",
                     option = "rocket", direction = -1)

cowplot::plot_grid(map.true, map.rw, map.fit, ncol = 3)
In order to assess the performance of the models, we will calculate in-sample metrics. First, we collect the fitted values from the INLA objects:
fit.m.linear <- lima.linear$summary.fitted.values$mean
fit.m.rw <- lima.rw$summary.fitted.values$mean
fit.m.ar_rw <- lima.ar_rw$summary.fitted.values$mean
fit.m.spat <- lima.spat$summary.fitted.values$mean
n.lima <- db.lima.sp$n

fitted_vals <- list("linear" = fit.m.linear,
                    "RW" = fit.m.rw,
                    "AR_RW" = fit.m.ar_rw,
                    "spatial" = fit.m.spat) %>%
  as.data.frame() %>%
  gather(key = "modelo", value = "fit") %>%
  group_by(modelo) %>%
  mutate(actual = n.lima,
         date = db.lima.sp$date,
         prov = db.lima.sp$prov,
         distr = db.lima.sp$distr)
Then, to facilitate the calculation of the performance metrics, we transform the data to long format and use the yardstick package to calculate the following metrics: mae, mase, smape and rmse.
We can calculate these metrics for the entire dataset:
# Metrics to use
perform.metrics <- metric_set(mae, mase, smape, rmse)

# Calculation of metrics over the entire dataset
tbl.yrd.full <- fitted_vals %>%
  group_by(modelo) %>%
  perform.metrics(truth = actual, estimate = fit)
And per district:
# Metrics to use
perform.metrics <- metric_set(mae, mase, smape, rmse)

# Calculation of metrics per district
tbl.yrd.per <- fitted_vals %>%
  group_by(modelo, prov, distr) %>%
  perform.metrics(truth = actual, estimate = fit)
Finally, we use the gt package to show the results table:
# Results table
tbl.yrd.full %>%
  pivot_wider(id_cols = modelo,
              names_from = .metric,
              values_from = .estimate) %>%
  gt() %>%
  tab_header(title = md("in-sample accuracy metrics"))
in-sample accuracy metrics

modelo | mae1 | mase2 | smape3 | rmse4
---|---|---|---|---
AR_RW | 56.64523 | 1.8077520 | 48.59065 | 89.61038
linear | 56.76891 | 1.8116989 | 48.65139 | 89.86674
RW | 56.61960 | 1.8069340 | 48.57694 | 89.55081
spatial | 16.33685 | 0.5213672 | 24.32419 | 25.09145

1 mae = mean absolute error
2 mase = mean absolute scaled error
3 smape = symmetric mean absolute percentage error
4 rmse = root mean square error
We can assess this performance spatially, in this case by district. In this example, we plot the spatial distribution of the mean absolute error (MAE) for the models with temporal (RW) and spatial (bym) effects:
tbl.yrd.per.sf <- lima.shp %>%
  inner_join(tbl.yrd.per, by = c("prov" = "prov", "distr" = "distr"))

tbl.yrd.per.sf %>%
  filter(.metric == "mae" & modelo %in% c("RW", "spatial")) %>%
  ggplot() +
  geom_sf(aes(fill = .estimate), lwd = 0.1) +
  scale_fill_distiller(palette = "Reds", direction = 1, name = "MAE") +
  facet_wrap(vars(modelo)) +
  theme_linedraw(base_size = 23) +
  theme(strip.text = element_text(face = "bold", size = 30))
We can observe that the MAE is lower when using the spatial (bym) model than the temporal (RW) model in most districts.
In order to obtain more precise results we can calculate the cpo (conditional predictive ordinate). This is a cross-validation criterion for model assessment that is computed for each observation as
\[\begin{align*} CPO_{i} = \pi(y_{i}|\textbf{y}_{-i}) \end{align*}\]Hence, for each observation, its CPO is the posterior probability of observing that observation when the model is fit using all data but \(y_{i}\). These per-observation values are usually summarized in a single metric as:
\[\begin{align*} CPO = -2 \sum_{i=1}^{n} \log(CPO_{i}) \end{align*}\]To perform this calculation we collect the cpo information using model$cpo$cpo:
cpo.linear <- -2*sum(log(lima.linear$cpo$cpo))
cpo.rw <- -2*sum(log(lima.rw$cpo$cpo))
cpo.ar_rw <- -2*sum(log(lima.ar_rw$cpo$cpo))
cpo.spat <- -2*sum(log(lima.spat$cpo$cpo))

data.cpo <- list("linear" = cpo.linear,
                 "RW" = cpo.rw,
                 "AR_RW" = cpo.ar_rw,
                 "spatial" = cpo.spat) %>%
  as.data.frame()
Finally, we report these results in a table using gt:
data.cpo %>%
  pivot_longer(cols = colnames(data.cpo),
               names_to = "model",
               values_to = "CPO") %>%
  gt() %>%
  tab_header(title = md("LOO-CV"))
LOO-CV

model | CPO
---|---
linear | 10547.748
RW | 10546.322
AR_RW | 10544.647
spatial | 8667.694
This indicator has an interpretation similar to information criteria such as AIC: lower values indicate a better model. Given the previous table, the best model is the one that considers the mortality data at the district level to be spatially correlated.
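Since we also requested dic and waic in control.compute, the same comparison can be made with those criteria; a minimal sketch:

# DIC and WAIC for the four models (lower is better)
data.frame(model = c("linear", "RW", "AR_RW", "spatial"),
           dic   = c(lima.linear$dic$dic, lima.rw$dic$dic,
                     lima.ar_rw$dic$dic, lima.spat$dic$dic),
           waic  = c(lima.linear$waic$waic, lima.rw$waic$waic,
                     lima.ar_rw$waic$waic, lima.spat$waic$waic))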
In order to predict the number of deaths in Lima during 2020, we need information on our covariates and on the areas around our area of interest (AOI). In this section we will focus on the Lima district, so only the 2020 outcome for this district is missing.
db.frcst <- readRDS(url("https://github.com/healthinnovation/Inla_intro/raw/main/db_excess_proc_dis_20_m.rds")) %>%
  inner_join(db.lima.sp %>% select(prov, distr, month, id.sp),
             by = c("prov", "distr", "month")) %>%
  distinct(reg, prov, distr, month, n, .keep_all = T)

db.lima.sp.2 <- db.lima.sp %>%
  bind_rows(db.frcst)

db.lima.frct <- lima.shp %>%
  inner_join(db.lima.sp.2, by = c("prov", "distr"))

db.lima.frct.geo.off <- db.lima.frct %>%
  st_drop_geometry()
Then we proceed to model our target variable with our best model. In INLA, observations with a missing (NA) response are treated as prediction points, so the model will internally forecast the death counts for the Lima district.
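Before fitting, we can verify that the rows to be forecast actually carry missing outcomes (a sketch; it assumes, as described above, that n is NA for the 2020 rows of the Lima district):

# The 2020 rows to be forecast should have NA in n
db.lima.frct.geo.off %>%
  filter(year == 2020) %>%
  summarise(n.missing = sum(is.na(n)), n.rows = n())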
lima.forecast <- inla(n ~ 1 + f(year, model = "ar1") + f(week, model = "rw1") +
                        f(id.sp, model = "bym", graph = W.lima) + temperature + pp.insured +
                        pp.pover + pp.no.elec + pp.no.water,
                      verbose = F,
                      data = db.lima.frct.geo.off,
                      family = "nbinomial",
                      control.compute = list(dic = TRUE, waic = TRUE, cpo = TRUE),
                      control.predictor = list(link = 1))
db.lima.frct.end <- db.lima.frct %>%
  ungroup() %>%
  mutate(fit = lima.forecast$summary.fitted.values$mean)
Finally, in a similar way as above, we assess the forecasted deaths spatially.
(map.frct <- db.lima.frct.end %>%
   filter(year == 2020 & month == 12) %>%
   ggplot() +
   geom_sf(aes(fill = fit)) +
   theme_linedraw(base_size = 23) +
   scale_fill_viridis(name = "Number of\nforecasted\ndeaths\n(12/2020)",
                      direction = -1, option = "magma"))
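Note that summary.fitted.values also contains posterior quantiles, so we can attach 95% credible intervals to the forecasts; a minimal sketch:

# Attach 95% credible intervals to the forecasted counts
db.lima.frct.ci <- db.lima.frct.end %>%
  mutate(fit.lwr = lima.forecast$summary.fitted.values$`0.025quant`,
         fit.upr = lima.forecast$summary.fitted.values$`0.975quant`)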