---
title: "Supplemental Report for JRSSA-Apr-2024-0120"
author: "Sho Kawano | January 25th, 2025"
output:
pdf_document:
    toc: true             # include a table of contents
    toc_depth: 1          # depth of headings shown in the table of contents
    number_sections: true # number the section headings
---
```{r setup, include=F}
knitr::opts_chunk$set(echo = F, warning = F, message=F)
library(usmap)
library(tidyverse)
library(tidycensus)
library(patchwork)
```
```{r include=F}
load("data/data_NC/data.RDA")
load("sim_results_NC/results_by_sim.RDA")
pop <- readRDS("data/nc_data_w_geom.RDS")
tmp.df = errs_all_sims %>%
group_by(model, area) %>%
reframe(mse=mean(err^2))
ref_table = data.frame(area=1:nrow(all_data), GEOID=all_data$fips)
tmp.df = tmp.df %>% pivot_wider(names_from = model, values_from = mse) %>%
pivot_longer(cols=!c(D.Est, area), names_to = "model", values_to = "mse")
mse_pltDf = pop %>% left_join(ref_table, by = join_by(GEOID)) %>%
left_join(tmp.df, by = join_by(area)) %>%
filter(model!="CAR") %>%
mutate(model=ifelse(model=="New", "SSD", model))
```
\newpage
# Further Analysis on Simulation Study Results
## Comparing MSE
```{r fig.width=7, fig.height=5, fig.cap="Area-Specific MSE from Direct Estimates vs. Models"}
mse_pltDf %>%
ggplot(aes(x=log(D.Est), y=log(mse) )) +
facet_wrap(~model, nrow=2) +
geom_abline(slope=1) +
geom_tile(aes(x=-6, y=-5, width=2, height=2), fill=NA, color="blue")+
geom_point(alpha=0.5) +
coord_fixed() +
xlab("log(MSE) for D.Est") +
ylab("log(MSE) for Models") +
ggtitle("") +
theme(axis.text.x = element_text(angle = 45))
```
In Figure 1, we compare the area-specific MSE, which we define as:
$$\mbox{MSE for area } i = \frac{1}{G} \sum_{g=1}^G (\hat{z}^{(g)}_i-z_i)^2$$
where $G=300$ is the number of simulations in the study. Note that this *differs* from the average MSE compared in the main manuscript, which averages across all areas and simulations. If a model outperforms the direct estimate for a given area, the corresponding point lies *below* the line. We can see that the SSD model achieves MSE reductions across the board.
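For reference, the area-specific MSE plotted in Figure 1 is computed from the per-simulation errors in the (hidden) setup chunk; a minimal sketch of that calculation:
```{r eval=FALSE, echo=TRUE}
# Area-specific MSE: average squared error over the G = 300 simulations,
# computed separately for each model and area (mirrors the setup chunk above)
errs_all_sims %>%
  group_by(model, area) %>%
  reframe(mse = mean(err^2))
```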
There are a few areas where using a model increases the MSE relative to the direct estimate (enclosed in the blue rectangle). This increase in MSE is *smaller* for the SSD model than for all other models, which is a major factor in why the SSD model outperforms the others.
\newpage
## Comparing Coverage
```{r}
# Calculating coverage by area
covRate_df <- ci_all_sims %>% group_by(area, model) %>%
reframe(cov_rate=mean(covered))
cr_pltDf = pop %>%
left_join(ref_table, by = join_by(GEOID)) %>%
left_join(covRate_df, by = join_by(area)) %>%
filter(model!="CAR") %>%
mutate(model=ifelse(model=="New", "SSD", model))
```
```{r fig.width=8, fig.height=3.5, fig.cap="Undercoverage of a 90% credible interval by model"}
cr_pltDf %>%
ggplot(aes(fill = 0.9-cov_rate)) +
geom_sf() +
theme_void() + facet_wrap(~model, nrow=2) +
scale_fill_gradient(low = "light blue", high = "red")+
#scale_fill_viridis_c(option="E") +
labs(title = "",
fill = "Undercoverage= \n 0.9-Cov.Rate",
caption="Red indicates high undercoverage")
```
In Figure 2, we plot the area-specific undercoverage of a 90% credible interval. We define the area-specific coverage rate as:
$$\mbox{Coverage Rate for Area } i = \frac{1}{G} \sum_{g=1}^G \boldsymbol{I} \{\hat{l}_i^{(g)} < z_i\} \cdot \boldsymbol{I} \{z_i< \hat{u}_i^{(g)}\}.$$
We define undercoverage as $0.9-\textrm{Coverage Rate}$. If a model, on average, achieves 90% coverage in a given area, this value should be zero. Counties in **red** on the map indicate **high undercoverage**, meaning that the **interval estimate is poor** for that county and model.
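For reference, below is a minimal sketch of how the area-level coverage rate is computed. The column names `lower`, `upper`, and `truth` are hypothetical stand-ins for $\hat{l}_i^{(g)}$, $\hat{u}_i^{(g)}$, and $z_i$; the actual chunk above uses a precomputed `covered` indicator in `ci_all_sims`.
```{r eval=FALSE, echo=TRUE}
# Sketch: coverage indicator and area-level coverage rate
# (`lower`, `upper`, `truth` are hypothetical column names)
ci_all_sims %>%
  mutate(covered = (lower < truth) & (truth < upper)) %>%  # I{l < z} * I{z < u}
  group_by(area, model) %>%
  reframe(cov_rate = mean(covered))
```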
The counties where the other models struggle in terms of coverage are the same counties where they struggle in terms of accuracy. The models without random effect selection (the BYM and FH models) struggle mightily in these problematic counties. The DM model struggles somewhat less than the other models in these counties but has higher undercoverage in other parts of the state. This can also be seen in Figure 3, where we plot the log design SEs against the coverage rate of a 90% credible interval from each model. The dashed line marks 90% (the target coverage rate), and the red region indicates coverage below 70%. We can see that the coverage rate for many of the models starts to drop as the SE gets larger, and there is a clear difference when comparing the performance of the SSD model to the others.
```{r fig.width=6, fig.height=3, fig.cap="Log Design SEs vs. Coverage by model"}
tmp.df = data.frame(area=1:nrow(all_data),
SE = all_data$rentBurdenSE)
covRate_df %>% left_join(tmp.df) %>%
filter(model!="CAR") %>%
mutate(model=ifelse(model=="New", "SSD", model)) %>%
ggplot(aes(x=log(SE), y=cov_rate)) +
geom_point(alpha=0.7) +
facet_wrap(~model, nrow=2) +
geom_hline(yintercept = 0.9, lty=2, color="blue") +
theme_bw() +
geom_area(aes(y=0.7), fill="red", alpha=0.1) +
labs(title = "") +
ylab("Cov. Rate - 90% Cr.Interval")
```
\newpage
# Further Analysis on Random Effects from Data Analysis
```{r include=FALSE}
rm(list = ls()) # clear all objects
options(scipen=20)
# ---- Packages ----
library(bayesplot)
library(tidycensus)
library(spdep)
library(patchwork)
library(tidyverse)
# ---- Data ----
load("data/data_multi-state/data.RDA")
all_data <- all_data %>%
mutate(high_rentBurden = ifelse(rentBurden>0.4, T, F))
# the response
covs = c("degree", "assistance", "no_car", "povPerc",
"white", "black", "native", "asian", "hispanic")
X <- model.matrix(~., all_data[, covs, drop=F])
y <- all_data$rentBurden
D <- all_data$rentBurdenSE^2
y.star <- log(y)
D.star <- D / y^2
center <- mean(y.star)
scale <- sd(y.star)
y.scaled <- (y.star-center)/scale
D.scaled <- D.star/scale^2
# ---- Load Chains ----
load("data_analysis_chains/bym_res.RDA")
load("data_analysis_chains/dm_res.RDA")
load("data_analysis_chains/fh_res.RDA")
load("data_analysis_chains/new_res.RDA")
# ---- Load Map -----
pop <- readRDS("data/sa_data_w_geom.RDS")
```
## What kind of random effects are selected?
```{r fig.height=4, fig.width=8, fig.cap="Posterior Means of Selection Probabilities vs. Random Effects for DM and SSD Model"}
# create the dataframe with the necessary data
dm_reffects = (dm_res$v*dm_res$delta) %>% colMeans()
tmp.df = data.frame(area=1:nrow(all_data),
sel_prob = dm_res$p.delta %>% colMeans(),
reffects=dm_reffects,
theta=exp(new_res$theta*scale+center) %>% colMeans() )
# plot
plt1 = tmp.df %>% ggplot(aes(x=sel_prob, y=reffects)) +
geom_point(color="blue", alpha=0.5, shape=1) +
ylab("Random Effects") +
xlab("Selection Probabilities")+
ggtitle("DM Model")
# create the dataframe with the necessary data
new_reffects = scale*(new_res$delta*(new_res$v1+new_res$v2)) %>% colMeans()
tmp.df = data.frame(area=1:nrow(all_data),
sel_prob = new_res$p.delta %>% colMeans(),
reffects=new_reffects,
theta=exp(new_res$theta*scale+center) %>% colMeans() )
plt2 = tmp.df %>% ggplot(aes(x=sel_prob, y=reffects)) +
geom_point(color="blue", alpha=0.5, shape=1) +
ylab("Random Effects") +
xlab("Selection Probabilities") +
ggtitle("SSD Model")
plt1+ plt2
```
We can see from Figure 4 that the areas where the selection probability is high (close to 1) also have larger-magnitude random effects. The relationship is more clear-cut for the DM model. There is some "noise" for the SSD model resulting from the dependence between the random effects and the $p_i$ parameters, both of which impact the selection probability. See the discussion in Section 3.2 of the main manuscript for details.
## Comparing Random Effects from Different Models
We can see from the map below that the SSD model incorporates elements of the other models. Both models with selection (DM and SSD) have larger-magnitude effects than the other two, since they do not make a common-variance assumption. We can also see that all of the models share fairly similar spatial patterns, but the spatial dependence is, as expected, more explicit in the spatial models (BYM and SSD).
```{r}
fh_reffects = fh_res$v %>% colMeans()
dm_reffects = (dm_res$v*dm_res$delta) %>% colMeans()
bym_reffects = (bym_res$v1+bym_res$v2) %>% colMeans()
new_reffects = scale*(new_res$delta*(new_res$v1+new_res$v2)) %>% colMeans()
tmp.df = data.frame(area=1:nrow(all_data),
FH=fh_reffects,
DM=dm_reffects,
BYM=bym_reffects,
New=new_reffects)
ref_table = data.frame(area=1:nrow(all_data), GEOID=all_data$fips)
plt_df = pop %>%
left_join(ref_table, by = join_by(GEOID)) %>%
left_join(tmp.df, by = join_by(area)) %>%
arrange(area)
plt_df = plt_df %>%
pivot_longer(cols=FH:New, names_to="model", values_to = "reffects")
```
```{r big_reffect_map, fig.dim=c(7, 6.5), fig.cap="Random Effect Posterior Means from 4 Different Models", cache=TRUE}
plt_df %>%
# filter(model %in% c("DM", "New")) %>%
mutate(model=ifelse(model=="New", "SSD", model)) %>%
mutate(model = factor(model, levels=c("FH", "DM", "BYM", "SSD"))) %>%
ggplot(aes(fill = reffects )) +
geom_sf() +
# scale_fill_viridis_b(breaks=c(-0.15, -0.01, -0.005, 0.005, 0.01, 0.15)) +
# scale_fill_distiller(palette = "BuPu") +
scale_fill_distiller(palette = "PuOr", direction = 1) +
labs(#title = "Random Effects from Each Model",
#caption="Comparing Posterior Means",
fill = "Random Effects") +
theme_void() + facet_wrap(~model, nrow=2)
```
\newpage
# Prior Sensitivity Analysis - Random Effect Variances
```{r include=FALSE}
rm(list = ls())
#NOTE: This section takes a while to run (~5 minutes)
knitr::opts_chunk$set(echo = FALSE)
library(LaplacesDemon)
library(BayesLogit)
library(Matrix)
library(latex2exp)
library(tidyverse)
```
```{r warning=FALSE}
# North Carolina county data
load("data/data_NC/data.RDA")
#data
covs = c("degree", "assistance", "no_car", "povPerc",
"white", "black", "native", "asian", "hispanic")
X <- model.matrix(~., all_data[, covs, drop=F])
n <- nrow(X); j <- ncol(X) # number of covariates INCLUDES intercept
#response (with transformation)
y <- all_data$rentBurden
d.var <- all_data$rentBurdenSE^2
y.star <- log(y)
d.star <- d.var / y^2
d.scl <- d.star / sd(y.star)^2
```
For the proposed Spatially Selected and Dependent (SSD) model, the data model is $[\boldsymbol{y} | \boldsymbol{\theta} ] \sim N_n(\boldsymbol{\theta}, \boldsymbol{D})$ where $\boldsymbol{y}$ are the direct estimates from the survey and the covariance matrix $\boldsymbol{D}$ is a diagonal matrix of the design variances. The small area means are modeled with $\boldsymbol{\theta} = \boldsymbol{X} \boldsymbol{\beta} \, + \boldsymbol{u}$.
Each random effect is modeled as $u_i = \delta_i \cdot (v_{1i} +v_{2i})$, where $[\boldsymbol{v}_1 | \sigma_{1}^2] \sim N_n(\boldsymbol{0}, \sigma_{1}^2 \, \boldsymbol{I})$ and $[\boldsymbol{v}_2 | \sigma_{2}^2] \sim N_n(\boldsymbol{0}, \sigma_{2}^2 \, \mathbf{Q}^-)$, with $\mathbf{Q}^-$ the generalized inverse of the scaled ICAR precision matrix. The selection indicators are modeled as $[\delta_i|p_i] \overset{ind}\sim Bernoulli(p_i)$, where the $p_i$ are the selection probabilities. We model these selection probabilities via $\mathrm{logit}(\boldsymbol{p}) = \boldsymbol{\psi}_1 + \boldsymbol{\psi}_2$ through two logit effects, $[\boldsymbol{\psi}_1|s_1^2] \sim N_n(\boldsymbol{0}, s_{1}^2 \, \boldsymbol{I})$ and $[\boldsymbol{\psi}_2|s_2^2] \sim N_n(\boldsymbol{0}, s_{2}^2 \, \boldsymbol{Q}^-)$. For the full details of our model, please see the main manuscript.
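To make this hierarchy concrete, below is a minimal prior-simulation sketch of the random-effect structure. The chain-graph adjacency, the variance values, and the use of a plain generalized inverse in place of the scaled $\mathbf{Q}^-$ are purely illustrative assumptions, not the implementation used in the paper.
```{r eval=FALSE, echo=TRUE}
# Illustrative prior simulation of the SSD random-effect structure
# (adjacency and all variance values below are hypothetical)
library(MASS)  # mvrnorm(), ginv()
set.seed(1)
n <- 10
A <- matrix(0, n, n)                      # adjacency: a simple chain of n areas
for (i in 1:(n - 1)) { A[i, i + 1] <- 1; A[i + 1, i] <- 1 }
Q <- diag(rowSums(A)) - A                 # ICAR precision (unscaled, for illustration)
Q_minus <- ginv(Q)                        # generalized inverse, standing in for Q^-
Q_minus <- (Q_minus + t(Q_minus)) / 2     # symmetrize for mvrnorm()
sigma1_sq <- 0.5; sigma2_sq <- 0.5        # random-effect variances (assumed values)
s1_sq <- 1; s2_sq <- 1                    # logit-effect variances (assumed values)
v1    <- rnorm(n, 0, sqrt(sigma1_sq))                 # iid component v_1
v2    <- mvrnorm(1, rep(0, n), sigma2_sq * Q_minus)   # spatial component v_2
psi1  <- rnorm(n, 0, sqrt(s1_sq))                     # iid logit effect
psi2  <- mvrnorm(1, rep(0, n), s2_sq * Q_minus)       # spatial logit effect
p     <- plogis(psi1 + psi2)                          # selection probabilities
delta <- rbinom(n, 1, p)                              # selection indicators
u     <- delta * (v1 + v2)                            # SSD random effects
```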
The parameters that require priors are the regression coefficients $\boldsymbol{\beta}$ and the two pairs of variance parameters, $\sigma_1^2, \sigma_2^2$ and $s_1^2, s_2^2$. The prior for the regression coefficients is set non-informatively, $\boldsymbol{\beta} \sim N_j(\boldsymbol{0}, \ 100^2 \boldsymbol{I})$, and therefore has little impact on the estimates. The choice of prior for $s_1^2, s_2^2$ can affect convergence of the MCMC chains, but it has minimal impact on the estimates themselves.
We have observed that the choice of prior on the random effect variances $\sigma_1^2, \sigma_2^2$ has the biggest impact on the estimates, as these variances influence the inclusion probabilities $p_i$. Thus, our prior **sensitivity analysis focuses entirely on these random effect variances**. We begin by discussing a key feature of the inverse-gamma distribution: the gap near the origin.
## The Gap Near the Origin for Different Inverse-Gamma Priors
Let $X$ be an inverse-gamma random variable with density
$$f(x | c, d) = \frac{d^c}{\Gamma(c)} (1/x)^{c+1} \exp(- d / x).$$ We denote this as $X \sim IG(c, d).$
For every inverse-gamma distribution, there is a threshold value $t$ at which $P(X < t)= \alpha$
for some small probability $\alpha$. For example, for $c=3$, $d=3$, and $\alpha=0.001$ the threshold is $t \approx 0.27$. In the figure below, we can see the gap near zero widen as we alter the shape $c$ and scale $d$ parameters.
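Because $1/X \sim \mathrm{Gamma}(\mathrm{shape}=c, \mathrm{rate}=d)$, the threshold can be computed directly from the gamma quantile function; a minimal sketch (the helper name `ig_threshold` is ours, not part of the paper's code):
```{r eval=FALSE, echo=TRUE}
# t solves P(X < t) = alpha for X ~ IG(c, d), i.e. t = 1 / qgamma(1 - alpha, c, rate = d)
ig_threshold <- function(c, d, alpha = 0.001) 1 / qgamma(1 - alpha, shape = c, rate = d)
ig_threshold(3, 3)  # approximately 0.27
ig_threshold(5, 5)  # approximately 0.34 (the prior used in the main manuscript)
```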
```{r fig.height=2, fig.width=6, fig.cap="Gaps Near Zero for Different Inverse-Gamma Priors"}
par(mfrow=c(1, 3), mar=c(4, 3, 2, 3))
c=1; d=1
curve(dinvgamma(x, c, d), from=0, to=0.8, n=1000,
ylab="density", main=paste0("IG(c=", c, " d=", d, ")"),
xlab=TeX("$\\sigma^2$"), col="blue", cex.main=1)
abline(h=0, lty=2)
c=5; d=5
curve(dinvgamma(x, c, d), from=0, to=0.8, n=1000,
ylab="density", main=paste0("IG(c=", c, " d=", d, ")"),
xlab=TeX("$\\sigma^2$"), col="blue", cex.main=1)
abline(h=0, lty=2)
c=5; d=10
curve(dinvgamma(x, c, d), from=0, to=0.8, n=1000,
ylab="density", main=paste0("IG(c=", c, " d=", d, ")"),
xlab=TeX("$\\sigma^2$"), col="blue", cex.main=1)
abline(h=0, lty=2)
```
When considering inverse-gamma priors for the random effect variance parameters in models like the Datta-Mandal (DM) or SSD model, this gap is important for making sure that the "spike" and the "slab" are distinguishable.
For the SSD model, we would like a prior that is non-informative but still leaves enough space near zero. Making the inverse-gamma hyperparameters in $IG(c, d)$ smaller yields a less informative prior; however, doing so also produces a very small threshold and not enough of a gap near the origin. So what threshold best balances these two factors? In our experience, what makes a suitable prior depends on the scale of the data. This is why we recommend scaling the direct estimates as $(y_i - \bar{y}) / s_{\boldsymbol{y}}$, where $\bar{y}$ and $s_{\boldsymbol{y}}$ are the sample mean and standard deviation of the direct estimates, to aid prior specification.
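For reference, below is a minimal sketch of this standardization, mirroring the data-analysis setup chunk earlier in this report (where `y` holds the direct estimates and `D` their design variances; the log transform is specific to the rent burden application):
```{r eval=FALSE, echo=TRUE}
# Standardizing the log-transformed direct estimates before prior specification
# (mirrors the setup chunk used for the data analysis)
y.star   <- log(y)                      # log of the direct estimates
D.star   <- D / y^2                     # delta-method design variances on the log scale
center   <- mean(y.star)
scale    <- sd(y.star)
y.scaled <- (y.star - center) / scale   # scaled response
D.scaled <- D.star / scale^2            # design variances on the scaled scale
```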
```{r }
find_t = function(k){
c=hyp_choices$shp[k]
d=hyp_choices$scl[k]
# since P(X<t)=P(1/X > 1/u) where 1/X is gamma(a, b)
threshCDF <- function(t, alpha=0.001){
abs((1-pgamma(1/t, c, rate=d))-alpha)
}
optimize(threshCDF, interval = c(0, 1))$minimum
}
# load prior choices
C = c(0.5, 1, 3, 5, 7)
hyp_choices = data.frame(shp=rep(C, 2), scl=c(C, 2*C)) %>% arrange(shp)
num_priors = nrow(hyp_choices)
hyp_choices = hyp_choices %>% mutate(t = sapply(1:num_priors, find_t)) %>% arrange(t)
```
## Assessing Impact on Random Effects Shrinkage
```{r mcmc_chains_nc, warning=F, message=FALSE, cache=TRUE}
# load sampler
source("samplers/ssd_fit.R")
set.seed(7)
# function to iterate over
fit_chain <- function(k){
# note that the function automatically handles the scaling
chain = ssd_fit(X, y.star, d.star, A, ndesired=2000, nburn=1000, nthin=1,
hyp=list(c1=hyp_choices$shp[k], d1=hyp_choices$scl[k],
c2=hyp_choices$shp[k], d2=hyp_choices$scl[k]))
return(chain)
}
# fit chains in parallel
library(parallel); library(doParallel)
options(mc.cores = 4)
cl <- parallel::makeForkCluster(4, setup_strategy = "sequential")
doParallel::registerDoParallel(cl)
list_of_chains = parLapply(cl, 1:num_priors, fit_chain)
stopCluster(cl)
# done!
# calculate average selection prob.
avg_sel_prob = sapply(1:num_priors, function(i){ list_of_chains[[i]]$p.delta %>% colMeans() %>% mean()})
```
The *posterior selection probabilities* $\tilde p_{i}$ allow us to assess shrinkage in the random effects (see Eq. 2 in Section 3.2). To get a sense of the overall level of shrinkage for a given dataset, we examine the average posterior selection probability across all areas, i.e. $\tfrac{1}{n} \sum_{i=1}^n \tilde p_{i}$.
Below is a plot of the threshold values $t$ and the resulting average *posterior* selection probabilities for different inverse-gamma priors, using the rent burden dataset for North Carolina. We can see that larger values of $t$ lead to more random effects being shrunk to zero (lower selection probabilities). Note that the drop in the average selection probability is quite large for $t< 0.3$, with an inflection point near $0.3$. This is a pattern we have observed in other datasets as well. Thus *we recommend any inverse-gamma prior with $t>0.3$ at $\alpha=0.001$*.
```{r message=FALSE, fig.height=3, fig.width=5, fig.cap="Impact on Random Effects Shrinkage of Varying Priors" }
hyp_choices %>% mutate(avg_selection_prob = avg_sel_prob ) %>%
ggplot(aes(x=t, y=avg_selection_prob)) + geom_point() +
geom_line(linewidth=0.5, lty=2) +
xlab("t (for different priors)") +
ylab("Avg. Posterior Selection Probability")+
ggtitle("Variability of Shrinkage for IG Priors - NC Data")+
geom_vline(xintercept = 0.3, lty=3, color="blue")
```
## Assessing Impact on Estimates (North Carolina)
Here we show the variability of the point estimates of rent burden for inverse-gamma priors with $t>0.3$ for the North Carolina dataset.
```{r fig.height=8, fig.width=6, fig.cap="Prior Impact on Point Estimates for IG Priors (t>0.3) - North Carolina Rent Burden Data"}
estim = sapply(5:num_priors,
function(i){
theta.orig.scl = exp(list_of_chains[[i]]$theta*list_of_chains[[i]]$scale + list_of_chains[[i]]$center)
# point estimates
theta.orig.scl %>% colMeans()})
estim %>% t() %>% as.data.frame() %>%
pivot_longer(cols = starts_with("theta"), names_to = "area") %>%
mutate(area = str_remove(area, "theta_") %>% factor(levels=as.character(1:100))) %>%
ggplot(aes(x=area, y=value)) + geom_boxplot() +
ylab("Poverty Rate Estimate") + xlab("County")+
coord_flip() +
geom_vline(xintercept = c(94, 72), lty=2, linewidth=0.1) +
theme(axis.text=element_text(size=7))
```
We can see that the prior choice has minimal impact outside of a few areas. There are two areas with noticeable variability (areas 72 and 94), corresponding to the estimates for Perquimans and Washington counties, which have the highest design variances. For these counties, the prior choice does have a meaningful impact on the estimates. Note that this is not an issue for all datasets, as the next example shows.
## Assessing Impact on Estimates (Illinois)
```{r mcmc_chains_il, warning=FALSE, cache=TRUE}
# Illinois county data
load("data/data_IL/data.RDA")
#data
covs = c("degree", "assistance")
X <- model.matrix(~., all_data[, covs, drop=F])
n <- nrow(X); j <- ncol(X) # number of covariates INCLUDES intercept
#response (with transformation)
y <- all_data$povPerc
d.var <- all_data$povPercSE^2
y.star <- log(y)
d.star <- d.var / y^2
# load sampler
setwd("~/coding/spike_slab_fh/ssd_paper_code")
source("samplers/ssd_fit.R")
# function to iterate over
fit_chain <- function(k){
# note that the function automatically handles the scaling
chain = ssd_fit(X, y.star, d.star, A, ndesired=2000, nburn=1000, nthin=1,
hyp=list(c1=hyp_choices$shp[k], d1=hyp_choices$scl[k],
c2=hyp_choices$shp[k], d2=hyp_choices$scl[k]))
return(chain)
}
set.seed(7)
# fit chains in parallel
library(parallel); library(doParallel)
options(mc.cores = 4)
cl <- parallel::makeForkCluster(4, setup_strategy = "sequential")
doParallel::registerDoParallel(cl)
list_of_chains = parLapply(cl, 1:num_priors, fit_chain)
stopCluster(cl)
# done!
# calculate average selection prob.
avg_sel_prob = sapply(1:num_priors, function(i){ list_of_chains[[i]]$p.delta %>% colMeans() %>% mean()})
```
We repeat this exercise with Illinois ($n=102$), this time estimating poverty rates. Here we show the variability of the point estimates for inverse-gamma priors with $t>0.3$.
```{r fig.height=9, fig.width=6, fig.cap="Prior Impact on Point Estimates for IG Priors (t>0.3) - Illinois Poverty Data"}
estim = sapply(5:num_priors,
function(i){
theta.orig.scl = exp(list_of_chains[[i]]$theta*list_of_chains[[i]]$scale + list_of_chains[[i]]$center)
# point estimates
theta.orig.scl %>% colMeans()})
estim %>% t() %>% as.data.frame() %>%
pivot_longer(cols = starts_with("theta"), names_to = "area") %>%
mutate(area = str_remove(area, "theta_") %>% factor(levels=as.character(1:102))) %>%
ggplot(aes(x=area, y=value)) + geom_boxplot() +
ylab("Poverty Rate Estimate") + xlab("County")+
coord_flip() + theme(axis.text=element_text(size=7))
```
## Conclusion & Recommendation
We conducted a prior sensitivity analysis for the random effect variance parameters of the SSD model. The choice of prior for these parameters can influence the level of shrinkage and the resulting estimates, especially in areas with high design variance.
We tested the model on two different datasets. We saw that choosing an inverse-gamma prior with a $t$ value that is too small may force the random effects for most areas to be selected. On the other hand, a very large $t$ can force higher shrinkage and lower selection probabilities. We recommend choosing an inverse-gamma prior with $t>0.3$ for $\alpha=0.001$, which balances both of these factors. For the work in our main manuscript, we chose the prior $IG(5, 5)$, for which $t\approx 0.34$.
# Full Conditionals Derivation