
# Item Response Theory

Item response theory (IRT) builds models for items (the stimuli from which the measures are collected). These models fall into two broad classes:

1. Models for dichotomous (binary, 0/1) items and
2. Models for polytomous (multi-category) items.

## IRT Models for Dichotomous Data

First, for conventional dichotomous observed variables, an IRT model can be generally specified as follows.

Let $x_{ij}$ be the observed value from respondent $i$ on observable (item) $j$.
When $x$ is binary, the observed value can be $0$ or $1$. 
Some common IRT models for binary observed variables can be expressed as a version of 

\[p(x_{ij} = 1 \mid \theta_i, d_j, a_j, c_j) = c_j + (1-c_j)F(a_j \theta_i + d_j),\]

where,

* $\theta_i$ is the magnitude of the latent variable that individual $i$ possesses. In educational measurement, $\theta_i$ commonly represents proficiency, so that a higher $\theta_i$ means that the individual has more of the trait,
* $d_j$ is the item location or difficulty parameter. $d_j$ is commonly transformed to be $d_j = -a_jb_j$ so that the location parameter is easier to interpret in relation to the latent trait $\theta_i$,
* $a_j$ is the item slope or discrimination parameter,
* $c_j$ is the item lower asymptote or guessing parameter,
* $F(.)$ is the link function, which determines the form of the transformation between the latent trait and the item response probability. The link is commonly chosen to be either the logistic link or the normal-ogive link.

Common IRT models are the

* 1-PL, or one-parameter logistic model, which only uses one measurement parameter $d_j$ per item,
* 2-PL, or two-parameter logistic model, which uses two measurement model parameters $a_j$ and $d_j$ per item,
* 3-PL, or three-parameter logistic model, which uses all three parameters as shown above.
* other models are also possible for binary item response formats but are omitted here.
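As a minimal sketch of the response function above, the R function below evaluates $c_j + (1-c_j)F(a_j\theta_i + d_j)$ for either link. The name `irf_3pl`, its arguments, and the parameter values are illustrative assumptions and do not come from any package. Setting `c = 0` gives a 2-PL form, and additionally holding `a` at a common value (here 1) gives a 1-PL form.

```r
# Item response function c + (1 - c) * F(a * theta + d).
# `irf_3pl` is a hypothetical helper name used only for illustration.
irf_3pl <- function(theta, a = 1, d = 0, c = 0,
                    link = c("logistic", "normal-ogive")) {
  link <- match.arg(link)
  eta <- a * theta + d
  F_eta <- if (link == "logistic") plogis(eta) else pnorm(eta)
  c + (1 - c) * F_eta
}

theta <- c(-2, 0, 2)
p_1pl <- irf_3pl(theta, d = 0.5)                     # location only
p_2pl <- irf_3pl(theta, a = 1.5, d = 0.5)            # slope and location
p_3pl <- irf_3pl(theta, a = 1.5, d = 0.5, c = 0.2)   # adds a lower asymptote
round(cbind(theta, p_1pl, p_2pl, p_3pl), 3)
```

With a positive slope, each curve increases in $\theta$, and the 3-PL curve never drops below its lower asymptote.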

The above describes the functional form used to model why an individual may have a greater or lesser likelihood of endorsing an item (i.e., of having a $1$ as the observed value).
We use this model as the basis for defining the conditional probability of any response given the values of the parameters.
This conditional probability is commonly used as part of a *marginal maximum likelihood (MML)* approach to find the values of the measurement model parameters that maximize the likelihood.
Because the values of the latent variable $\theta_i$ are also unknown, $\theta_i$ is marginalized (integrated) out of the likelihood function with respect to its assumed distribution.
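To make the marginalization concrete, the sketch below integrates $\theta$ out of the likelihood of a single response pattern using base R's `integrate`. It assumes a standard normal latent distribution and a logistic link; the item parameter values and the helper name `marginal_lik` are made up for illustration.

```r
# Marginal likelihood of one response pattern x (length J):
#   integral over theta of
#   prod_j P_j(theta)^x_j * (1 - P_j(theta))^(1 - x_j) * N(theta; 0, 1)
# `marginal_lik` is a hypothetical helper name.
marginal_lik <- function(x, a, d, c) {
  integrand <- function(theta) {
    sapply(theta, function(t) {
      p <- c + (1 - c) * plogis(a * t + d)
      prod(p^x * (1 - p)^(1 - x)) * dnorm(t)
    })
  }
  integrate(integrand, lower = -Inf, upper = Inf)$value
}

# Hypothetical item parameters for J = 3 items
a_item <- c(1.0, 1.5, 0.8)
d_item <- c(0.5, 0.0, -0.5)
c_item <- c(0.2, 0.2, 0.2)

# Marginal probabilities of all 2^3 response patterns
patterns <- as.matrix(expand.grid(0:1, 0:1, 0:1))
pattern_probs <- apply(patterns, 1, marginal_lik,
                       a = a_item, d = d_item, c = c_item)
sum(pattern_probs)  # should be close to 1, up to integration error
```

MML maximizes the product of such marginal probabilities over the observed patterns with respect to the item parameters.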

In the Bayesian formulation, we can side-step some of these issues by using prior distributions.
We start with the general form of the likelihood function
\[p(\mathbf{x}\mid \boldsymbol{\theta}, \boldsymbol{\omega}) = \prod_{i=1}^np(\mathbf{x}_i\mid \theta_i, \boldsymbol{\omega}) = \prod_{i=1}^n\prod_{j=1}^Jp(x_{ij}\mid \theta_i, \boldsymbol{\omega}_j),\]
where
\[x_{ij}\mid \theta_i, \boldsymbol{\omega}_j \sim \mathrm{Bernoulli}[p(x_{ij} = 1\mid \theta_i, \boldsymbol{\omega}_j)].\]
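On the log scale, the product over persons and items becomes a sum of Bernoulli log densities. The sketch below evaluates that sum for simulated data, assuming a normal-ogive link; `loglik_3pl` and all parameter values are hypothetical and serve only to illustrate the structure of the likelihood.

```r
# Log of p(x | theta, omega) for an n x J binary matrix x.
# `loglik_3pl` is an illustrative helper, not a package function.
loglik_3pl <- function(x, theta, a, d, c) {
  n <- length(theta)
  J <- length(a)
  eta <- outer(theta, a) + matrix(d, nrow = n, ncol = J, byrow = TRUE)
  guess <- matrix(c, nrow = n, ncol = J, byrow = TRUE)
  p <- guess + (1 - guess) * pnorm(eta)
  sum(dbinom(x, size = 1, prob = p, log = TRUE))
}

# Simulate data from known (made-up) parameter values
set.seed(1)
n <- 200; J <- 5
theta <- rnorm(n)
a <- rep(1, J)
d <- seq(-1, 1, length.out = J)
c_par <- rep(0.2, J)
p_true <- 0.2 + 0.8 * pnorm(outer(theta, a) + rep(d, each = n))
x <- matrix(rbinom(n * J, 1, p_true), n, J)

ll_true <- loglik_3pl(x, theta, a, d, c_par)
ll_true
```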

Developing a joint prior distribution $p(\boldsymbol{\theta}, \boldsymbol{\omega})$ is not straightforward given the high dimensionality of its components.
A common assumption, however, is that the distribution for the latent variables ($\boldsymbol{\theta}$) is independent of the distribution for the measurement model parameters ($\boldsymbol{\omega}$).
That is, we can separate the problem into independent priors
\[p(\boldsymbol{\theta}, \boldsymbol{\omega}) = p(\boldsymbol{\theta})p(\boldsymbol{\omega}).\]

For the latent variables, the prior distribution is generally built by assuming that all individuals are also independent.
The independence of observations leads to a joint prior that is a product of priors with a common distribution,
\[p(\boldsymbol{\theta}) = \prod_{i=1}^np(\theta_i\mid \boldsymbol{\theta}_p),\]
where $\boldsymbol{\theta}_p$ are the hyperparameters governing the common prior for the latent variable distribution.
A common choice is $\theta_i \sim \mathrm{Normal}(\mu_{\theta} = 0, \sigma^2_{\theta}=1)$, because the location and scale of the latent variable are arbitrary.

For the measurement model parameters, a slightly more complex specification is generally needed.
One *simple* approach would be to invoke an exchangeability assumption among items and among item parameters.
This would essentially make all priors independent and simplify the specification to a product of univariate priors over all measurement model parameters
\[p(\boldsymbol{\omega}) = \prod_{j=1}^Jp(\boldsymbol{\omega}_j)=\prod_{j=1}^Jp(d_j)p(a_j)p(c_j).\]
For the location parameter ($d_j$), a common prior distribution is an unbounded normal distribution, because the location can take on any value within the range of the latent variable, which is also technically unbounded. So we let
\[d_j \sim \mathrm{Normal}(\mu_{d},\sigma^2_d).\]
The choice of hyperparameters can be guided by prior research or set to a common, relatively diffuse value for all items, such as $\mu_{d}=0,\sigma^2_d=10$.

The discrimination parameter governs the strength of the relationship between the latent variable and the probability of endorsing the item.
This is similar in flavor to a factor loading in CFA.
An issue with specifying a prior for the discrimination parameter is the indeterminacy with respect to the orientation of the latent variable.
In CFA, we resolved the orientation indeterminacy issue by fixing one factor loading to 1.
In IRT, we can do so by constraining the discrimination parameters to be strictly positive.
This gives each item the interpretation that higher values on the latent variable increase the probability of endorsing the item.
We achieve this by using a prior such as
\[a_j \sim \mathrm{Normal}^{+}(\mu_a,\sigma^2_a).\]
The term $\mathrm{Normal}^{+}(.)$ means that the normal distribution is truncated at 0 so that only positive values are possible.

Lastly, the guessing parameter $c_j$ takes on values in the interval $[0,1]$.
A common choice for parameters bounded between 0 and 1 is a Beta prior, that is
\[c_j \sim \mathrm{Beta}(\alpha_c, \beta_c).\]
The hyperparameters $\alpha_c$ and $\beta_c$ determine the shape of the beta prior and affect the likelihood and magnitude of guessing parameters.
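To get a feel for these priors, the sketch below draws from a positive-truncated normal and from a beta distribution using base R. The hyperparameter values ($\mu_a = 1$, $\sigma^2_a = 2$, $\alpha_c = 5$, $\beta_c = 17$) are the ones used in the LSAT example that follows; the helper `rnorm_pos` is a hypothetical name, and inverse-CDF sampling is just one convenient way to draw from the truncated distribution.

```r
set.seed(11)
n_draws <- 10000

# Normal^+(mu, sigma^2): draw by inverting the CDF on the region above zero.
# `rnorm_pos` is a hypothetical helper; sigma2 is a variance, not a precision.
rnorm_pos <- function(n, mu, sigma2) {
  s <- sqrt(sigma2)
  u <- runif(n, min = pnorm(0, mu, s), max = 1)
  qnorm(u, mu, s)
}

a_draws <- rnorm_pos(n_draws, mu = 1, sigma2 = 2)
c_draws <- rbeta(n_draws, shape1 = 5, shape2 = 17)

# Beta(alpha, beta) has mean alpha / (alpha + beta)
c(min_a = min(a_draws), mean_c = mean(c_draws), prior_mean_c = 5 / 22)
```

A Beta(5, 17) prior centers the guessing parameter a little above 0.2, which is roughly what random guessing would give on a five-option multiple-choice item.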

## 3-PL LSAT Example

In the Law School Admission Test (LSAT) example (p. 263-271), the data are from 1000 examinees responding to five items, which are just a subset of the LSAT.
We hypothesize that only one underlying latent variable is measured by these items, but that guessing is also plausible.
The full three-parameter model we will use (with a normal-ogive link, i.e., the 3P-NO) can be described in an equation as
\[p(\boldsymbol{\theta}, \boldsymbol{d}, \boldsymbol{a}, \boldsymbol{c} \mid \mathbf{x}) \propto \prod_{i=1}^n\prod_{j=1}^Jp(x_{ij}\mid\theta_i, d_j, a_j, c_j)\prod_{i=1}^np(\theta_i)\prod_{j=1}^Jp(d_j)p(a_j)p(c_j),\]
where
\begin{align*}
x_{ij}\mid\theta_i, d_j, a_j, c_j &\sim \mathrm{Bernoulli}[p(x_{ij}=1\mid\theta_i, d_j, a_j, c_j)],\ \mathrm{for}\ i=1, \cdots, 1000,\ j = 1, \cdots, 5;\\
p(x_{ij}=1\mid\theta_i, d_j, a_j, c_j) &= c_j + (1-c_j)\Phi(a_j\theta_i + d_j),\ \mathrm{for}\ i=1, \cdots, 1000,\ j = 1, \cdots, 5;\\
\theta_i &\sim \mathrm{Normal}(0,1),\ \mathrm{for}\ i = 1, \cdots, 1000;\\
d_j &\sim \mathrm{Normal}(0, 2),\ \mathrm{for}\ j=1, \cdots, 5;\\
a_j &\sim \mathrm{Normal}^{+}(1, 2),\ \mathrm{for}\ j=1, \cdots, 5;\\
c_j &\sim \mathrm{Beta}(5, 17),\ \mathrm{for}\ j=1, \cdots, 5.
\end{align*}
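The second argument of $\mathrm{Normal}(\cdot,\cdot)$ in the statement above is a variance. JAGS's `dnorm` is parameterized by a precision (the inverse variance), whereas Stan's `normal` takes a standard deviation, so the same prior is written differently in each language. The short sketch below does the conversion for a prior variance of 2; the variable names are illustrative.

```r
# Convert the prior variance used in the model statement to the
# parameterization each sampler expects.
prior_var <- 2
jags_precision <- 1 / prior_var   # dnorm(mu, tau) in JAGS/BUGS
stan_sd <- sqrt(prior_var)        # normal(mu, sigma) in Stan
c(variance = prior_var, jags_precision = jags_precision, stan_sd = stan_sd)
```

This is why `dnorm(0, .5)` in the JAGS code corresponds to $\mathrm{Normal}(0, 2)$ in the notation used here.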

The above model can be illustrated in a DAG as shown below.

```{r chp11-dag-1, echo=FALSE,fig.align='center',fig.cap='DAG for 3-PL IRT model for LSAT Example', out.width="75%"}
knitr::include_graphics(paste0(w.d,'/dag/chp11-irt1.png'),
                        auto_pdf = TRUE)
```

The path diagram for an IRT model is essentially identical to the path diagram for a CFA model.
This fact highlights an important feature of IRT/CFA: the major conceptual difference between these approaches is how we define the link between the latent variable and the observed items.

```{r chp11-pathdiag-1, echo=FALSE,fig.align='center',fig.cap='Path diagram for 3-PL IRT model', out.width="90%"}
knitr::include_graphics(paste0(w.d,'/path-diagram/chp11-irt1.png'),
                        auto_pdf = TRUE)
```

For completeness, I have included the model specification diagram that more concretely connects the DAG and path diagram to the assumed distributions and priors.

```{r chp11-spec-1, echo=FALSE,fig.align='center',fig.cap='Model specification diagram for the 3-PL IRT model', out.width="90%"}
knitr::include_graphics(paste0(w.d,'/model-spec/chp11-irt1.png'),
                        auto_pdf = TRUE)
```

## LSAT Example - JAGS

```{r chp11-lsat-jags, warnings=T, message=T, error=T, cache=TRUE}

jags.model.lsat <- function(){

#########################################
# Specify the item response measurement model for the observables
#########################################
for (i in 1:n){
  for(j in 1:J){
    P[i,j] <- c[j]+(1-c[j])*phi(a[j]*theta[i]+d[j])       # 3P-NO expression
    x[i,j] ~ dbern(P[i,j])                  # distribution for each observable
  }
}


##########################################
# Specify the (prior) distribution for the latent variables
##########################################
for (i in 1:n){
  theta[i] ~ dnorm(0, 1)  # distribution for the latent variables
}


##########################################
# Specify the prior distribution for the measurement model parameters
##########################################
for(j in 1:J){
  d[j] ~ dnorm(0, .5)          # Locations for observables
  a[j] ~ dnorm(1, .5); I(0,)    # Discriminations for observables
  c[j] ~ dbeta(5,17)           # Lower asymptotes for observables
  
}


} # closes the model

# initial values
start_values <- list(
  list("d"=c(1.00, 1.00, 1.00, 1.00, 1.00),
       "a"=c(1.00, 1.00, 1.00, 1.00, 1.00),
       "c"=c(0.20, 0.20, 0.20, 0.20, 0.20)),
  list("d"=c(-3.00, -3.00, -3.00, -3.00, -3.00),
       "a"=c(3.00, 3.00, 3.00, 3.00, 3.00),
       "c"=c(0.50, 0.50, 0.50, 0.50, 0.50)),
  list("d"=c(3.00, 3.00, 3.00, 3.00, 3.00),
       "a"=c(0.1, 0.1, 0.1, 0.1, 0.1),
       "c"=c(0.05, 0.05, 0.05, 0.05, 0.05))
)

# vector of all parameters to save
param_save <- c("a", "c", "d", "theta")

# dataset
dat <- read.table("data/LSAT.dat", header=T)

mydata <- list(
  n = nrow(dat), J = ncol(dat),
  x = as.matrix(dat)
)

# fit model
fit <- jags(
  model.file=jags.model.lsat,
  data=mydata,
  inits=start_values,
  parameters.to.save = param_save,
  n.iter=2000,
  n.burnin = 1000,
  n.chains = 3,
  progress.bar = "none")

#print(fit)
round(fit$BUGSoutput$summary[ !rownames(fit$BUGSoutput$summary) %like% "theta", ], 3)

# extract posteriors for all chains
jags.mcmc <- as.mcmc(fit)
# the below two plots are too big to be useful given the 1000 observations.
#R2jags::traceplot(jags.mcmc)

# Gelman-Rubin-Brooks
#gelman.plot(jags.mcmc)

# convert to single data.frame for density plot
a <- colnames(as.data.frame(jags.mcmc[[1]]))
plot.data <- data.frame(as.matrix(jags.mcmc, chains=T, iters = T))
colnames(plot.data) <- c("chain", "iter", a)


bayesplot::mcmc_acf(plot.data,pars = c(paste0("d[", 1:5, "]")))
bayesplot::mcmc_trace(plot.data,pars = c(paste0("d[", 1:5, "]")))
ggmcmc::ggs_grb(ggs(jags.mcmc), family="d")
mcmc_areas(plot.data, pars = c(paste0("d[",1:5,"]")), prob = 0.8)

bayesplot::mcmc_acf(plot.data,pars = c(paste0("a[", 1:5, "]")))
bayesplot::mcmc_trace(plot.data,pars = c(paste0("a[", 1:5, "]")))
ggmcmc::ggs_grb(ggs(jags.mcmc), family="a")
mcmc_areas( plot.data,pars = c(paste0("a[", 1:5, "]")), prob = 0.8)

bayesplot::mcmc_acf(plot.data,pars = c(paste0("c[", 1:5, "]")))
bayesplot::mcmc_trace(plot.data,pars = c(paste0("c[", 1:5, "]")))
ggmcmc::ggs_grb(ggs(jags.mcmc), family="c")
mcmc_areas(plot.data, pars = c(paste0("c[", 1:5, "]")), prob = 0.8)

```

### Posterior Predicted Distributions

Here, we want to compare features of the observed data with the same features computed on data drawn from the posterior predictive distribution.

Statistical functions of interest are (1) the standardized model-based covariance (SMBC) and (2) the standardized generalized dimensionality discrepancy measure (SGDDM).

For (1), the SMBC is
\[SMBC_{jj^\prime}=\frac{\frac{1}{n}\sum_{i=1}^n(x_{ij} - E(x_{ij} \mid \theta_i,\boldsymbol{\omega}_j))(x_{ij^\prime} - E(x_{ij^\prime} \mid \theta_i,\boldsymbol{\omega}_{j^\prime}))}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij} - E(x_{ij} \mid \theta_i,\boldsymbol{\omega}_j))^2}\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij^\prime} - E(x_{ij^\prime} \mid \theta_i,\boldsymbol{\omega}_{j^\prime}))^2}}\]

For (2), the SGDDM is the average absolute SMBC over the $J(J-1)/2$ unique item pairs,
\[SGDDM = \frac{\sum_{j > j^\prime}\left|SMBC_{jj^\prime}\right|}{J(J-1)/2}.\]

In R, the functions below can be used to compute these quantities.

```{r chp11-ppd-functions, warnings=T, message=T, error=T, cache=TRUE}

calculate.SGDDM <- function(data.matrix, expected.value.matrix){
	
	J.local = ncol(data.matrix)

	SMBC.matrix <- calculate.SMBC.matrix(data.matrix, expected.value.matrix)
	
	SGDDM = sum(abs((lower.tri(SMBC.matrix, diag=FALSE))*SMBC.matrix))/((J.local*(J.local-1))/2)

	SGDDM

} # closes calculate.SGDDM

calculate.SMBC.matrix <- function(data.matrix, expected.value.matrix){
	
	N.local <- nrow(data.matrix)

	MBC.matrix <- (t(data.matrix-expected.value.matrix) %*% (data.matrix-expected.value.matrix))/N.local

	MBStddevs.matrix <- diag(sqrt(diag(MBC.matrix)))

	#SMBC.matrix <- solve(MBStddevs.matrix) %*% MBC.matrix %*% solve(MBStddevs.matrix)


	J.local <- ncol(data.matrix)

	SMBC.matrix <- matrix(NA, nrow=J.local, ncol=J.local)

	for(j in 1:J.local){
		for(jj in 1:J.local){
			SMBC.matrix[j,jj] <- MBC.matrix[j,jj]/(MBStddevs.matrix[j,j]*MBStddevs.matrix[jj,jj])
		}
	}

	SMBC.matrix 

} # closes calculate.SMBC.matrix

```
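As a cross-check on the loop-based helpers above, the sketch below computes the same two quantities in vectorized form on simulated data. The function names `smbc` and `sgddm`, and the simple normal-ogive generating model, are assumptions made only for this illustration. Because the data are generated from the same model that supplies the expected values, the off-diagonal SMBC entries should be close to zero.

```r
# Vectorized SMBC and SGDDM. `smbc` and `sgddm` are illustrative names.
smbc <- function(x, e) {
  r <- x - e                          # residuals from model-implied expectations
  mbc <- crossprod(r) / nrow(x)       # model-based covariance matrix
  s <- sqrt(diag(mbc))                # model-based standard deviations
  mbc / outer(s, s)                   # standardize
}
sgddm <- function(x, e) {
  m <- smbc(x, e)
  mean(abs(m[lower.tri(m)]))          # average |SMBC| over unique item pairs
}

set.seed(2)
n <- 500; J <- 5
theta <- rnorm(n)
e <- pnorm(outer(theta, rep(1, J)))       # expected values, hypothetical model
x <- matrix(rbinom(n * J, 1, e), n, J)    # data generated from that same model

m <- smbc(x, e)
round(m, 3)
sgddm(x, e)
```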

Next, we use the functions above, along with some basic data wrangling, to carry out a posterior predictive check of the fitted model.

```{r chp11-lsat-jags-ppd, warnings=T, message=T, error=T, cache=TRUE}
# Data wrangle the results/posterior draws for use
datv1 <- plot.data %>%
  pivot_longer(
    cols = `a[1]`:`a[5]`,
    values_to = "a",
    names_to = "item"
  ) %>%
  mutate(item = substr(item, 3,3)) %>%
  select(chain, iter, item, a)
datv2 <- plot.data %>%
  pivot_longer(
    cols = `c[1]`:`c[5]`,
    values_to = "c",
    names_to = "item"
  ) %>%
  mutate(item = substr(item, 3,3)) %>%
  select(chain, iter, item, c)
datv3 <- plot.data %>%
  pivot_longer(
    cols = `d[1]`:`d[5]`,
    values_to = "d",
    names_to = "item"
  ) %>%
  mutate(item = substr(item, 3,3)) %>%
  select(chain, iter, item, d)

datv4 <- plot.data %>%
  pivot_longer(
    cols = `theta[1]`:`theta[999]`,
    values_to = "theta",
    names_to = "person"
  ) %>%
  select(chain, iter, person, theta)

dat_long <- full_join(datv1, datv2)
dat_long <- full_join(dat_long, datv3)
dat_long <- full_join(dat_long, datv4)

dat1 <- dat
dat1$person <- paste0("theta[",1:nrow(dat), "]")
datvl <- dat1 %>%
  pivot_longer(
    cols=contains("item"),
    names_to = "item",
    values_to = "x"
  ) %>%
  mutate(
    item = substr(item, 6, 100)
  )

dat_long <- left_join(dat_long, datvl)

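# compute expected prob under the fitted model, i.e., the normal-ogive
# expression used for P[i,j] in the JAGS model: c + (1-c)*phi(a*theta + d)
ilogit <- function(x){exp(x)/(1+exp(x))} # inverse-logit helper (not used by the normal-ogive model)
dat_long <- dat_long %>%
  as_tibble()%>%
  mutate(
    x.exp = c + (1-c)*pnorm(a*theta + d),
    x.dif = x - x.exp
  )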


# draw one posterior predictive response per row from Bernoulli(x.exp)
dat_long$x.ppd <- rbinom(nrow(dat_long), size = 1, prob = dat_long$x.exp)
itermin <- min(dat_long$iter) # used for subsetting
# figure 11.4
d <- dat_long %>%
  group_by(chain, iter, person) %>%
  summarise(raw.score = sum(x),
            raw.score.ppd = sum(x.ppd))

di <- d %>%
  filter(chain==1, iter==1001) %>%
  group_by(raw.score) %>%
  summarise(count = n())
dii <- d %>%
  group_by(chain, iter, raw.score.ppd)%>%
  summarise(raw.score = n())
# overall fit of observed scores
ggplot()+
  geom_boxplot(data=dii, aes(y=raw.score, x= raw.score.ppd, group=raw.score.ppd))+
  geom_point(data=di, aes(x=raw.score, y=count), color="red", size=2)+
  labs(x="Raw Score", y="Number of Examinees")+
  scale_x_continuous(breaks=0:5)+
  theme_classic()

# by item
d <- dat_long %>%
  group_by(chain, iter, person) %>%
  mutate(raw.score = sum(x),
         raw.score.ppd = sum(x.ppd))

di <- d %>%
  filter(chain==1, iter==1001) %>%
  group_by(raw.score, item) %>%
  summarise(p.correct = mean(x))
dii <- d %>%
  group_by(chain, iter, raw.score.ppd, item)%>%
  summarise(p.correct = mean(x.ppd))

ggplot()+
  geom_boxplot(data=dii,
               aes(y= p.correct,
                   x= raw.score.ppd,
                   group=raw.score.ppd))+
  geom_point(data=di,
             aes(x=raw.score, y=p.correct),
             color="red", size=2)+
  facet_wrap(.~item)+
  labs(x="Raw Score", y="Proportion Correct")+
  theme_classic()

# computing standardized model summary statistics
# objects for results
J <- 5
n.chain <- 3
n.iters <- length(unique(dat_long$iter))
n.iters.PPMC <- n.iters*n.chain

realized.SMBC.array <- array(NA, c(n.iters.PPMC, J, J))
postpred.SMBC.array <- array(NA, c(n.iters.PPMC, J, J))
realized.SGDDM.vector <- array(NA, c(n.iters.PPMC))
postpred.SGDDM.vector <- array(NA, c(n.iters.PPMC))


ii <- i <- c <- 1
# iteration conditions
iter.cond <- unique(dat_long$iter)
Xobs <- as.matrix(dat[,-6])

for(i in 1:length(iter.cond)){
  for(c in 1:3){
  cc <- iter.cond[i]
  Xexp <- dat_long[dat_long$chain==c & dat_long$iter==cc , ] %>%
    pivot_wider(
      id_cols = person,
      names_from = "item",
      values_from = "x.exp",
      names_prefix = "item"
    ) %>%
    ungroup()%>%
    select(item1:item5)%>%
    as.matrix()
  Xppd <- dat_long[dat_long$chain==c & dat_long$iter==cc , ] %>%
    pivot_wider(
      id_cols = person,
      names_from = "item",
      values_from = "x.ppd",
      names_prefix = "item"
    ) %>%
    ungroup()%>%
    select(item1:item5)%>%
    as.matrix()

  # compute realized values
  realized.SMBC.array[ii, ,] <- calculate.SMBC.matrix(Xobs, Xexp)
  realized.SGDDM.vector[ii] <-  calculate.SGDDM(Xobs, Xexp)
  # compute PPD values
  postpred.SMBC.array[ii, ,] <- calculate.SMBC.matrix(Xppd, Xexp)
  postpred.SGDDM.vector[ii] <-  calculate.SGDDM(Xppd, Xexp)
    ii <- ii + 1
  }
}


plot.dat.ppd <- data.frame(
  real = realized.SGDDM.vector,
  ppd = postpred.SGDDM.vector
)


ggplot(plot.dat.ppd, aes(x=real, y=ppd))+
  geom_point()+
  geom_abline(intercept = 0, slope=1)+
  lims(x=c(0,0.5), y=c(0, 0.5))


# transform smbc into plotable format
ddim <- dim(postpred.SMBC.array)
plot.dat.ppd <- as.data.frame(matrix(0, nrow=ddim[1]*ddim[2]*ddim[3], ncol=4))
colnames(plot.dat.ppd) <- c("itemj", "itemjj", "real", "ppd")
ii <- i <- j <- jj <- 1

for(i in 1:ddim[1]){
  for(j in 1:ddim[2]){
    for(jj in 1:ddim[3]){
      plot.dat.ppd[ii, 1] <- j
      plot.dat.ppd[ii, 2] <- jj
      plot.dat.ppd[ii, 3] <- realized.SMBC.array[i, j, jj]
      plot.dat.ppd[ii, 4] <- postpred.SMBC.array[i, j, jj]
      ii <- ii + 1
    }
  }
}

plot.dat.ppd <- plot.dat.ppd %>%
  filter(itemj < itemjj) %>%
  mutate(
    cov = paste0("cov(", itemj, ", ", itemjj,")")
  )

ggplot(plot.dat.ppd, aes(x=real, y=ppd))+
  geom_point(alpha=0.25)+
  geom_density2d(adjust=2)+
  geom_abline(intercept = 0, slope=1)+
  facet_wrap(.~cov)+
  lims(x=c(-1,1), y=c(-1,1))+
  theme_classic()


```
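A common numerical summary of the scatterplots above is the posterior predictive p-value: the proportion of draws in which the posterior predicted discrepancy is at least as large as the realized one. With the objects created in the chunk above this would be `mean(postpred.SGDDM.vector >= realized.SGDDM.vector)`. The self-contained sketch below uses simulated placeholder vectors, not results from the LSAT fit, purely to show the computation.

```r
# Posterior predictive p-value: share of draws in which the replicated
# discrepancy is at least as large as the realized discrepancy.
# Both vectors are simulated placeholders, NOT output from the fitted model.
set.seed(3)
realized <- rnorm(3000, mean = 0.030, sd = 0.005)
postpred <- rnorm(3000, mean = 0.030, sd = 0.005)

ppp <- mean(postpred >= realized)
ppp
```

Values near 0.5 suggest that the realized discrepancies look like what the model predicts, while values near 0 or 1 point to misfit.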

<!-- ## LSAT Example - OpenBUGS -->

<!-- ```{r chp11-lsat-openbugs, warnings=T, message=T, error=T, cache=TRUE} -->
<!-- # model code -->
<!-- model.file <- paste0(w.d,"/code/IRT-for-Dichotomous-Observables/WinBUGS/3PNO.bug") -->
<!-- cat(model.file) -->
<!-- # get data file -->
<!-- data.file <- paste0(w.d,"/code/IRT-for-Dichotomous-Observables/WinBUGS/data.txt") -->

<!-- # initial values -->
<!-- start_values <- list( -->
<!--   list("d"=c(1.00, 1.00, 1.00, 1.00, 1.00), -->
<!--        "a"=c(1.00, 1.00, 1.00, 1.00, 1.00), -->
<!--        "c"=c(0.20, 0.20, 0.20, 0.20, 0.20)), -->
<!--   list("d"=c(-3.00, -3.00, -3.00, -3.00, -3.00), -->
<!--        "a"=c(3.00, 3.00, 3.00, 3.00, 3.00), -->
<!--        "c"=c(0.50, 0.50, 0.50, 0.50, 0.50)), -->
<!--   list("d"=c(3.00, 3.00, 3.00, 3.00, 3.00), -->
<!--        "a"=c(0.1, 0.1, 0.1, 0.1, 0.1), -->
<!--        "c"=c(0.05, 0.05, 0.05, 0.05, 0.05)) -->
<!-- ) -->

<!-- # vector of all parameters to save -->
<!-- param_save <- c("a", "c", "d", "theta") -->


<!-- # fit model -->
<!-- fit <- openbugs( -->
<!--   data= data.file,  -->
<!--   model.file = model.file, # R grabs the file and runs it in openBUGS -->
<!--   parameters.to.save = param_save, -->
<!--   #inits=start_values, -->
<!--   n.chains = 1, -->
<!--   n.iter = 2000, -->
<!--   n.burnin = 1000 -->
<!-- ) -->
<!-- print(fit) -->
<!-- ``` -->

## LSAT Example - Stan

```{r chp11-lsat-stan, warnings=T, message=T, error=T}


model_irt_lsat <- '
data {
  int  N;
  int  J;
  int x[N,J];
}

parameters {
  real<lower=0> a[J]; //discrimination
  real d[J]; //location
  real<lower=0, upper=1> c[J]; //guessing
  real theta[N]; //person parameters
}

model {
  matrix[N,J] pi;
  // item response probabilities
  for(n in 1:N){
    for(j in 1:J){
      pi[n,j] = c[j]+(1-c[j])*Phi(a[j]*theta[n]+d[j]);
      x[n,j] ~ bernoulli(pi[n,j]);
    }
  }
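  //measurement model priors
  //  note: Stan normal() takes a standard deviation, so normal(., 2) has
  //  scale sd = 2, whereas the model statement and JAGS code use variance 2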
  for(j in 1:J){
    a[j] ~ normal(1,2)T[0,];
    d[j] ~ normal(0,2);
    c[j] ~ beta(5, 17);
  }
  theta ~ normal(0,1);
}

'


# initial values
start_values <- list(
  list("d"=c(1.00, 1.00, 1.00, 1.00, 1.00),
       "a"=c(1.00, 1.00, 1.00, 1.00, 1.00),
       "c"=c(0.20, 0.20, 0.20, 0.20, 0.20)),
  list("d"=c(-3.00, -3.00, -3.00, -3.00, -3.00),
       "a"=c(3.00, 3.00, 3.00, 3.00, 3.00),
       "c"=c(0.50, 0.50, 0.50, 0.50, 0.50)),
  list("d"=c(3.00, 3.00, 3.00, 3.00, 3.00),
       "a"=c(0.1, 0.1, 0.1, 0.1, 0.1),
       "c"=c(0.05, 0.05, 0.05, 0.05, 0.05))
)

# dataset
dat <- read.table("data/LSAT.dat", header=T)


mydata <- list(
  N = nrow(dat), 
  J = ncol(dat),
  x = as.matrix(dat)
)


# Next, need to fit the model
#   I have explicitly outlined some common parameters
fit <- stan(
  model_code = model_irt_lsat, # model code to be compiled
  data = mydata,          # my data
  init = start_values,    # starting values
  chains = 3,             # number of Markov chains
  warmup = 2000,          # number of warm up iterations per chain
  iter = 4300,            # total number of iterations per chain
  cores = 1,              # number of cores (could use one per chain)
  refresh = 0             # no progress shown
)
```

```{r chp11-lsat-stan-plots, warnings=T, message=T, error=T}
# first get a basic breakdown of the posteriors
print(fit,pars =c("d", "a", "c", "theta[1000]"))

# plot the posterior in a
#  95% probability interval
#  and 80% to contrast the dispersion
plot(fit,pars =c("d", "a", "c", "theta[1000]"))

# traceplots
rstan::traceplot(fit,pars =c("d", "a", "c", "theta[1000]"), inc_warmup = TRUE)

# Gelman-Rubin-Brooks Convergence Criterion
ggs_grb(ggs(fit, family = c("d"))) +
   theme_bw() + theme(panel.grid = element_blank())
ggs_grb(ggs(fit, family = "a")) +
   theme_bw() + theme(panel.grid = element_blank())
ggs_grb(ggs(fit, family = "c")) +
   theme_bw() + theme(panel.grid = element_blank())
ggs_grb(ggs(fit, family = "theta[1000]")) +
   theme_bw() + theme(panel.grid = element_blank())
# autocorrelation
ggs_autocorrelation(ggs(fit, family="d")) +
   theme_bw() + theme(panel.grid = element_blank())
ggs_autocorrelation(ggs(fit, family="a")) +
   theme_bw() + theme(panel.grid = element_blank())
ggs_autocorrelation(ggs(fit, family="c")) +
   theme_bw() + theme(panel.grid = element_blank())
ggs_autocorrelation(ggs(fit, family="theta[1000]")) +
   theme_bw() + theme(panel.grid = element_blank())

# plot the posterior density
plot.data <- as.matrix(fit)

mcmc_areas(plot.data, pars = paste0("d[",1:5,"]"),prob = 0.8)
mcmc_areas(plot.data, pars = paste0("a[",1:5,"]"),prob = 0.8)
mcmc_areas(plot.data, pars = paste0("c[",1:5,"]"),prob = 0.8)
mcmc_areas(plot.data, pars = paste0("theta[1000]"),prob = 0.8)

```


## IRT Models for Polytomous Data

A commonly used IRT model for polytomous items is the graded response model (GRM).
Below is one way of describing the model.
Let $x_{ij}$ be the observed response to item $j$ from examinee $i$ that may take on values 1, 2, ..., $K_j$, where $K_j$ is the number of possible responses/outcomes for item $j$.
In many applications, the number of response options is constant across items, though this need not be the case.
The GRM uses conditional probability statements about a response being at or above a specific category and obtains the probability for each category as the difference of two such conditional probabilities.
That is 
\[P(x_{ij} = k \mid \theta_i, \boldsymbol{d}_j,a_j) = P(x_{ij} \geq k \mid \theta_i, d_{jk},a_j) - P(x_{ij} \geq k+1 \mid \theta_i, d_{j(k+1)},a_j),\]
where $\boldsymbol{d}_j$ is the collection of location/threshold parameters for item $j$.
The GRM takes on a structure similar to the 2-PL for any one category
\[P(x_{ij} \geq k \mid \theta_i, d_{jk},a_j)=F(a_j\theta_i + d_{jk}).\]
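The differencing step can be sketched in a few lines of R. The function below assumes a logistic link and takes the thresholds $d_{j2}, \ldots, d_{jK_j}$ in decreasing order; it prepends $P(x_{ij} \geq 1) = 1$ and appends $P(x_{ij} \geq K_j + 1) = 0$ so that every category probability is a difference of adjacent cumulative probabilities. The name `grm_probs` and the parameter values are hypothetical.

```r
# Category probabilities for one item under the GRM with a logistic link.
# `d` holds d_2, ..., d_K (decreasing). `grm_probs` is a hypothetical helper.
grm_probs <- function(theta, a, d) {
  # P(x >= 1) = 1, P(x >= 2), ..., P(x >= K), P(x >= K + 1) = 0
  p_geq <- c(1, plogis(a * theta + d), 0)
  # P(x = k) = P(x >= k) - P(x >= k + 1)
  -diff(p_geq)
}

p <- grm_probs(theta = 0.3, a = 1.2, d = c(2, 1, -1, -2))
round(p, 3)
sum(p)
```

The probabilities are all positive only when the thresholds are ordered, which is why the ordering has to be built into the prior.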

The conditional probability of observed responses may be modeled similarly as we have used for dichotomous responses but with a few important differences.
The conditional distribution of the data is
\[p(\boldsymbol{x}\mid \boldsymbol{\theta},\boldsymbol{\omega}) = \prod_{i=1}^np(\boldsymbol{x}_i\mid \theta_i, \boldsymbol{\omega}) = \prod_{i=1}^n\prod_{j=1}^Jp(x_{ij}\mid \theta_i, \boldsymbol{\omega}_j),\]
where each $x_{ij}$ is specified as a categorical random variable (or multinomial).
A categorical random variable is a generalization of the Bernoulli distribution that is defined by the collection of category response probabilities
\[x_{ij} \sim \mathrm{Categorical}(\boldsymbol{P}(x_{ij}\mid\theta_i, \boldsymbol{\omega}_j)).\]
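As a small illustration of the categorical distribution, the sketch below draws one response per respondent for a single five-category item with base R's `sample`. The probability vectors are made up; in the GRM they would come from the differences of cumulative probabilities described above.

```r
# One categorical draw per respondent for a single 5-category item.
# Each row of P is a made-up vector of category probabilities.
set.seed(4)
P <- rbind(c(0.10, 0.20, 0.40, 0.20, 0.10),
           c(0.05, 0.10, 0.20, 0.30, 0.35))
x <- apply(P, 1, function(p) sample(seq_along(p), size = 1, prob = p))
x

# With only two categories this is a Bernoulli draw recoded to {1, 2}
sample(1:2, size = 1, prob = c(0.3, 0.7))
```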

The above helps form the likelihood of the observed data.
Next, we describe the prior distribution, because its structure is not necessarily obvious.

First, the prior for the latent ability follows the same logic from the dichotomous model.
We employ an exchangeability assumption to specify independent, normally distributed priors for each respondent.

Next, the measurement model parameters' priors are described.
We again can assume exchangeability and arrive at a common but independent prior across items, and assume that the priors for the location and discrimination parameters are independent.
These assumptions may not be tenable in theory, but they are practically useful.
The priors for discrimination stay the same as the dichotomous model.
The priors for the location parameters are a bit more involved.

For the location parameters, the first location parameter $d_{j1}$ governs the probability of responding 1 or greater, which is a certainty if a response was given.
Therefore, this probability is 1.
Under the $a_j\theta_i + d_{jk}$ parameterization this corresponds to $d_{j1} = +\infty$, and we set a normal prior for $d_{j2}\sim \mathrm{Normal}(\mu_{d2},\sigma^2_{d2})$.
The priors for the remaining location parameters ($d_{j3}, \ldots, d_{jK_j}$) can be specified as truncated normal distributions.
Because $P(x_{ij} \geq k)$ cannot increase with $k$, each threshold is constrained to be smaller than the previous threshold; formally,
\[d_{jk} \sim \mathrm{Normal}^{<d_{j(k-1)}}(\mu_{d_k},\sigma^2_{d_k}),\ \mathrm{for}\ k=3, \ldots,K_j.\]

The posterior distribution for the GRM can be parameterized as follows.
The model as described below is very general and can accommodate a varying number of thresholds per item, but it is constrained to a single latent factor.

\[p(\boldsymbol{\theta}, \boldsymbol{d}, \boldsymbol{a}\mid \mathbf{x}) \propto \prod_{i=1}^n\prod_{j=1}^Jp(x_{ij}\mid\theta_i, \boldsymbol{d}_j, a_j)\prod_{i=1}^np(\theta_i)\prod_{j=1}^Jp(a_j)\prod_{k=2}^{K_j}p(d_{jk}),\]
where
\begin{align*}
x_{ij}\mid\theta_i, \boldsymbol{d}_j, a_j &\sim \mathrm{Categorical}(\boldsymbol{P}(x_{ij}\mid\theta_i, \boldsymbol{d}_j, a_j)),\ \mathrm{for}\ i=1, \cdots, n,\ j = 1, \cdots, J;\\
\boldsymbol{P}(x_{ij}\mid\theta_i, \boldsymbol{d}_j, a_j) &= \left(P(x_{ij}=1\mid\theta_i, \boldsymbol{d}_j, a_j), \cdots, P(x_{ij}=K_j\mid\theta_i, \boldsymbol{d}_j, a_j)\right),\ \mathrm{for}\ i=1, \cdots, n,\ j = 1, \cdots, J;\\
P(x_{ij}=k\mid\theta_i, \boldsymbol{d}_j, a_j) &= P(x_{ij}\geq k\mid\theta_i,d_{jk}, a_j) - P(x_{ij}\geq k+1\mid\theta_i, d_{j(k+1)}, a_j),\ \mathrm{for}\ i=1, \cdots, n,\ j = 1, \cdots, J,\ k = 1,\cdots,K_j-1;\\
P(x_{ij}=K_j\mid\theta_i, \boldsymbol{d}_j, a_j) &= P(x_{ij}\geq K_j\mid\theta_i,d_{jK_j}, a_j),\ \mathrm{for}\ i=1, \cdots, n,\ j = 1, \cdots, J;\\
P(x_{ij}\geq k\mid\theta_i, d_{jk}, a_j) &= F(a_j\theta_i + d_{jk}),\ \mathrm{for}\ i=1, \cdots, n,\ j = 1, \cdots, J,\ k=2,\cdots,K_j;\\
P(x_{ij}\geq 1\mid\theta_i, d_{j1}, a_j) &= 1,\ \mathrm{for}\ i=1, \cdots, n,\ j = 1, \cdots, J;\\
\theta_i \mid \mu_{\theta}, \sigma^2_{\theta} &\sim \mathrm{Normal}(\mu_{\theta}, \sigma^2_{\theta}),\ \mathrm{for}\ i = 1, \cdots, n;\\
a_j \mid \mu_{a}, \sigma^2_{a} &\sim \mathrm{Normal}^{+}(\mu_{a}, \sigma^2_{a}),\ \mathrm{for}\ j=1, \cdots, J;\\
d_{j2}\mid\mu_{d_{j2}}, \sigma^2_{d_{j2}} &\sim \mathrm{Normal}(\mu_{d_{j2}}, \sigma^2_{d_{j2}}),\ \mathrm{for}\ j=1, \cdots, J;\ \mathrm{and}\\
d_{jk}\mid\mu_{d_{jk}},\sigma^2_{d_{jk}} &\sim \mathrm{Normal}^{<d_{j(k-1)}}(\mu_{d_{jk}},\sigma^2_{d_{jk}}),\ \mathrm{for}\ j=1, \cdots, J,\ k=3, \cdots,K_j.
\end{align*}

## GRM Peer Interactions Example

The book uses a Peer Interactions example based on responses from 500 respondents to seven items. All the responses are coded from 1 to 5 on an agreement Likert-type scale.
A DAG for the GRM corresponding to these data is shown below.

```{r chp11-dag-2, echo=FALSE,fig.align='center',fig.cap='DAG for the Peer Interactions GRM analysis', out.width="90%"}
knitr::include_graphics(paste0(w.d,'/dag/chp11-grm.png'),
                        auto_pdf = TRUE)
```

The path diagram version is substantially simpler and identical to the path diagrams for the 3-PL and factor analysis models, which highlights the similarity in the *substantive* modeling of polytomous and dichotomous items.

```{r chp11-pathdiag-2, echo=FALSE,fig.align='center',fig.cap='Path diagram for the Peer Interactions GRM analysis', out.width="90%"}
knitr::include_graphics(paste0(w.d,'/path-diagram/chp11-grm.png'),
                        auto_pdf = TRUE)
```

For completeness, I have included the model specification diagram that more concretely connects the DAG and path diagram to the assumed distributions and priors.

```{r chp11-spec-2, echo=FALSE,fig.align='center', out.width="90%",fig.cap='Model specification diagram for the Peer Interactions GRM analysis'}
knitr::include_graphics(paste0(w.d,'/model-spec/chp11-grm.png'),
                        auto_pdf = TRUE)
```

### Example Specific Model Specification

In fitting the GRM to the Peer Interactions data, we can be more precise about the prior and likelihood structure. 
Below is a breakdown of the model specific to this example.
Everything is structurally identical to the general specification in the previous section, but specific values are chosen for the hyperparameters.

\[p(\boldsymbol{\theta}, \boldsymbol{d}, \boldsymbol{a}\mid \mathbf{x}) \propto \prod_{i=1}^n\prod_{j=1}^Jp(\theta_i\mid\theta_i, \boldsymbol{d}_j, a_j)p(\theta_i)p(a_j)\prod_{k=2}^{K_j}p(d_{jk}),\]
where
\begin{align*}
x_{ij}\mid\theta_i\mid\theta_i, \boldsymbol{d}_j, a_j) &\sim \mathrm{Categorical}(\boldsymbol{P}(x_{ij}\mid\theta_i, \boldsymbol{\omega}_j)),\ \mathrm{for}\ i=1, \cdots, 500,\ j = 1, \cdots, 7;\\
\mathbf{P}(x_{ij}\mid\theta_i, \boldsymbol{d}_j, a_j) &= \left(P(x_{ij}=1\mid\theta_i, \boldsymbol{d}_j, a_j), \cdots, P(x_{ij}=5\mid\theta_i, \boldsymbol{d}_j, a_j)\right),\ \mathrm{for}\ i=1, \cdots, 500,\ j = 1, \cdots, 7;\\
P(x_{ij}=k\mid\theta_i, \boldsymbol{d}_j, a_j) &= P(x_{ij}\geq k\mid\theta_i,d_{jk}, a_j) - P(x_{ij}\geq k+1\mid\theta_i, d_{j(k+1)}, a_j),\ \mathrm{for}\ i=1, \cdots, 500,\ j = 1, \cdots, 7,\ k = 1,\cdots,4;\\
P(x_{ij}=5\mid\theta_i, \boldsymbol{d}_j, a_j) &= P(x_{ij}\geq 5\mid\theta_i,d_{j5}, a_j),\ \mathrm{for}\ i=1, \cdots, 500,\ j = 1, \cdots, 7;\\
P(x_{ij}\geq k\mid\theta_i, d_{jk}, a_j) &= F(a_j\theta_j + d_{jk}) = \frac{\exp\left(a_j\theta_i +d_{jk}\right)}{1+\exp\left(a_j\theta_i +d_{jk}\right)},\ \mathrm{for}\ i=1, \cdots, 500,\ j = 1, \cdots, 7,\ k=2,\cdots,5;\\
P(x_{ij}\geq 1\mid\theta_i, d_{j1}, a_j) &= 1,\ \mathrm{for}\ i=1, \cdots, 500,\ j = 1, \cdots, 7;\\
\theta_i \mid \mu_{\theta}, \sigma^2_{\theta} &\sim \mathrm{Normal}(0, 1),\ \mathrm{for}\ i = 1, \cdots, 500;\\
a_j \mid \mu_{a}, \sigma^2_{a} &\sim \mathrm{Normal}^{+}(0,2),\ \mathrm{for}\ j=1, \cdots, 7;\\
d_{j2}\mid\mu_{d_{j2}}, \sigma^2_{d_{j2}} &\sim \mathrm{Normal}(2,2),\ \mathrm{for}\ j=1, \cdots, 7;\\
d_{j3}\mid\mu_{d_{j3}},\sigma^2_{d_{j3}} &\sim \mathrm{Normal}^{<d_{j2}}(1, 2),\ \mathrm{for}\ j=1, \cdots, 7;\\
d_{j4}\mid\mu_{d_{j4}},\sigma^2_{d_{j4}} &\sim \mathrm{Normal}^{<d_{j3}}(-1, 2),\ \mathrm{for}\ j=1, \cdots, 7; \mathrm{and}\\
d_{j5}\mid\mu_{d_{j5}},\sigma^2_{d_{j5}} &\sim \mathrm{Normal}^{<d_{j4}}(-2, 2),\ \mathrm{for}\ j=1, \cdots, 7.
\end{align*}
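
To make the category probabilities above concrete, here is a minimal R sketch of the GRM response function for a single respondent and item. The helper name `grm_probs` and the parameter values are assumptions chosen for illustration (the locations are set to their prior means); they are not estimates from the Peer Interactions data.

```r
# Minimal sketch with illustrative values (not estimates from the PI data).
# grm_probs() is a hypothetical helper name.
grm_probs <- function(theta, a, d) {
  # d holds the locations d_{j2}, ..., d_{jK} in decreasing order
  p_gte <- c(1, plogis(a * theta + d))  # P(x >= 1), ..., P(x >= K)
  p_eq  <- p_gte - c(p_gte[-1], 0)      # P(x = k) = P(x >= k) - P(x >= k + 1)
  names(p_eq) <- seq_along(p_eq)
  p_eq
}

# locations at their prior means, with a = 1 and theta = 0
p <- grm_probs(theta = 0, a = 1, d = c(2, 1, -1, -2))
round(p, 3)
sum(p)  # the category probabilities sum to 1
```

Because the differences telescope, the probabilities sum to one for any value of `theta`, provided the locations are decreasing so that each difference is nonnegative.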


## PI Example - JAGS

In the implementation below, I had to change `d[j,3] ~ dnorm(1, .5)I(d[j,4],d[j,2])` to `d[j,3] ~ dnorm(1, .5);I(,d[j,2])` because (1) R does not realize that `I` is a JAGS function; and (2) JAGS does not allow for a *directed cycle*.
The directed cycle in the DAG arises when the range of values for `d[j,3]` is fixed to be within `I(d[j,4],d[j,2])`, which is not permissible. Instead, we simply constrain each threshold to be smaller than the previous threshold, so that the thresholds are decreasing.
I'm not sure of the underlying technical reason for this error, but I found that adding the semicolon fixes the issue when defining the model as an R function.
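
Two details of the priors used in the model are easy to miss: JAGS parameterizes `dnorm` by precision, so `dnorm(1, .5)` has variance $1/0.5 = 2$, and each bound only needs to reference the previous threshold. The R sketch below draws from these sequentially truncated priors by inverse-CDF sampling. The helper `rtnorm_upper` is a hypothetical name written for this illustration, and the draws come from the prior only, not the posterior.

```r
set.seed(1)

# Hypothetical helper: draw from Normal(mean, sd) truncated above at `upper`
rtnorm_upper <- function(n, mean, sd, upper) {
  u <- runif(n, min = 0, max = pnorm(upper, mean, sd))
  qnorm(u, mean, sd)
}

prec <- 0.5
sd_d <- sqrt(1 / prec)   # precision 0.5 corresponds to variance 2

n_draws <- 1000
d2 <- rnorm(n_draws, mean = 2, sd = sd_d)
d3 <- rtnorm_upper(n_draws, mean = 1,  sd = sd_d, upper = d2)
d4 <- rtnorm_upper(n_draws, mean = -1, sd = sd_d, upper = d3)
d5 <- rtnorm_upper(n_draws, mean = -2, sd = sd_d, upper = d4)

# every draw respects the ordering d2 >= d3 >= d4 >= d5
all(d2 >= d3 & d3 >= d4 & d4 >= d5)
```

Bounding each threshold only from above by its predecessor keeps the dependence one-directional, which is what avoids the cycle.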

```{r chp11-peer-int-jags, warning=TRUE, message=TRUE, error=TRUE, cache=TRUE}

jags.model.peer.int <- function(){

  #######################################
  # Specify the item response measurement model for the observables
  #######################################
  for (i in 1:n){
    for(j in 1:J){
  
      ###################################
      # Specify the probabilities of a value being greater than or equal to each category
      ###################################
      for(k in 2:(K[j])){
        # P(greater than or equal to category k > 1)
        logit(P.gte[i,j,k]) <- a[j]*theta[i]+d[j,k]
      }
      # P(greater than or equal to category 1)
      P.gte[i,j,1] <- 1
  
  
      ###################################
      # Specify the probabilities of a value being equal to each category
      ###################################
      for(k in 1:(K[j]-1)){
        # P(equal to category k), for k < K
        P[i,j,k] <- P.gte[i,j,k]-P.gte[i,j,k+1]
      }
      # P(equal to category K)
      P[i,j,K[j]] <- P.gte[i,j,K[j]]
      
      ###################################
      # Specify the distribution for each observable
      ###################################
      x[i,j] ~ dcat(P[i,j,1:K[j]])
    }
  }
  
  
  #######################################
  # Specify the (prior) distribution for the latent variables
  #######################################
  for (i in 1:n){
    theta[i] ~ dnorm(0, 1)  # distribution for the latent variables
  }
  
  
  #######################################
  # Specify the prior distribution for the measurement model parameters
  #######################################
  for(j in 1:J){
    
    d[j,2] ~ dnorm(2, .5)                   # Locations for k = 2
    d[j,3] ~ dnorm(1, .5);I(,d[j,2])   # Locations for k = 3
    d[j,4] ~ dnorm(-1, .5);I(,d[j,3])  # Locations for k = 4
    d[j,5] ~ dnorm(-2, .5);I(,d[j,4])        # Locations for k = 5
    a[j] ~ dnorm(1, .5); I(0,)    # Discriminations for observables
  
  }

} # closes the model

# initial values
start_values <- list(
  list(
    d= matrix(c(NA, 3.00E+00, 1.00E+00, 0.00E+00, -1.00E+00,
               NA, 3.00E+00, 1.00E+00, 0.00E+00, -1.00E+00,
               NA, 3.00E+00, 1.00E+00, 0.00E+00, -1.00E+00,
               NA, 3.00E+00, 1.00E+00, 0.00E+00, -1.00E+00,
               NA, 3.00E+00, 1.00E+00, 0.00E+00, -1.00E+00,
               NA, 3.00E+00, 1.00E+00, 0.00E+00, -1.00E+00,
               NA, 3.00E+00, 1.00E+00, 0.00E+00, -1.00E+00),
      ncol=5, nrow=7, byrow=T),
    a=c(1.00E-01, 1.00E-01, 1.00E-01, 1.00E-01, 1.00E-01, 1.00E-01, 1.00E-01)),
  list(
    d= matrix(c(NA, 2.00E+00, 0.00E+00, -1.00E+00, -2.00E+00,
               NA, 2.00E+00, 0.00E+00, -1.00E+00, -2.00E+00,
               NA, 2.00E+00, 0.00E+00, -1.00E+00, -2.00E+00,
               NA, 2.00E+00, 0.00E+00, -1.00E+00, -2.00E+00,
               NA, 2.00E+00, 0.00E+00, -1.00E+00, -2.00E+00,
               NA, 2.00E+00, 0.00E+00, -1.00E+00, -2.00E+00,
               NA, 2.00E+00, 0.00E+00, -1.00E+00, -2.00E+00),
      ncol=5, nrow=7, byrow=T),
    a=c(3.00E+00, 3.00E+00, 3.00E+00, 3.00E+00, 3.00E+00, 3.00E+00, 3.00E+00)),
  list(
    d= matrix(c(NA, 1.00E+00, -1.00E+00, -2.00E+00, -3.00E+00,
               NA, 1.00E+00, -1.00E+00, -2.00E+00, -3.00E+00,
               NA, 1.00E+00, -1.00E+00, -2.00E+00, -3.00E+00,
               NA, 1.00E+00, -1.00E+00, -2.00E+00, -3.00E+00,
               NA, 1.00E+00, -1.00E+00, -2.00E+00, -3.00E+00,
               NA, 1.00E+00, -1.00E+00, -2.00E+00, -3.00E+00,
               NA, 1.00E+00, -1.00E+00, -2.00E+00, -3.00E+00),
              ncol=5, nrow=7, byrow=T),
    a=c(1.00E+00, 1.00E+00, 1.00E+00, 1.00E+00, 1.00E+00, 1.00E+00, 1.00E+00))
)

# vector of all parameters to save
param_save <- c("a", "d", "theta")

# dataset
dat <- read.table("data/PI.dat", header=T)

mydata <- list(
  n = nrow(dat), J = ncol(dat),
  K = rep(5, ncol(dat)),
  x = as.matrix(dat)
)

# fit model
fit <- jags(
  model.file=jags.model.peer.int,
  data=mydata,
  inits=start_values,
  parameters.to.save = param_save,
  n.iter=4000,
  n.burnin = 2000,
  n.chains = 3,
  progress.bar = "none")

print(fit)
round(fit$BUGSoutput$summary[ !rownames(fit$BUGSoutput$summary) %like% "theta", ], 3)

# extract posteriors for all chains
jags.mcmc <- as.mcmc(fit)

# convert to single data.frame for density plot
a <- colnames(as.data.frame(jags.mcmc[[1]]))
plot.data <- data.frame(as.matrix(jags.mcmc, chains=T, iters = T))
colnames(plot.data) <- c("chain", "iter", a)


plot_title <- ggtitle("Posterior distributions","with medians and 80% intervals")
bayesplot::mcmc_areas(plot.data, regex_pars = "d", prob = 0.8) +  plot_title + lims(x=c(-10, 10))

bayesplot::mcmc_areas(
  plot.data,
  pars = c(paste0("a[", 1:7, "]")),
  prob = 0.8) +
  plot_title

bayesplot::mcmc_acf(plot.data,pars = c(paste0("a[", 1:7, "]")))

bayesplot::mcmc_trace(plot.data,pars = c(paste0("a[", 1:7, "]")))

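# family is a regular expression; anchor it so that only d[j,k] or a[j]
# are matched (and not, e.g., deviance or theta)
ggmcmc::ggs_grb(ggs(jags.mcmc), family="^d\\[")
ggmcmc::ggs_grb(ggs(jags.mcmc), family="^a\\[")

ggmcmc::ggs_autocorrelation(ggs(jags.mcmc), family="^d\\[")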
```

<!-- ## PI Example - OpenBUGS -->

<!-- I ran into issues using the BUGs code provided in text using JAGS so I figured I would replicate the model using OpenBUGS directly to test whether the model can be run as is without any modifications. -->

<!-- ```{r chp11-peer-int-openbugs, warnings=T, message=T, error=T, cache=TRUE} -->
<!-- # model code -->
<!-- model.file <- paste0(w.d,"/code/IRT-for-Polytomous-Observables/IRT for Polytomous Observables/WinBUGS/L-GRM (Normal with Bounds, Truncated-Normal).bug") -->
<!-- cat(model.file) -->
<!-- # get data file -->
<!-- data.file <- paste0(w.d,"/code/IRT-for-Polytomous-Observables/IRT for Polytomous Observables/WinBUGS/data.txt") -->

<!-- # initial values -->
<!-- start_values <- list( -->
<!--   list( -->
<!--     d= matrix(c(NA, 3.00E+00, 1.00E+00, 0.00E+00, -1.00E+00, -->
<!--                NA, 3.00E+00, 1.00E+00, 0.00E+00, -1.00E+00, -->
<!--                NA, 3.00E+00, 1.00E+00, 0.00E+00, -1.00E+00, -->
<!--                NA, 3.00E+00, 1.00E+00, 0.00E+00, -1.00E+00, -->
<!--                NA, 3.00E+00, 1.00E+00, 0.00E+00, -1.00E+00, -->
<!--                NA, 3.00E+00, 1.00E+00, 0.00E+00, -1.00E+00, -->
<!--                NA, 3.00E+00, 1.00E+00, 0.00E+00, -1.00E+00), -->
<!--       ncol=5, nrow=7, byrow=T), -->
<!--     a=c(1.00E-01, 1.00E-01, 1.00E-01, 1.00E-01, 1.00E-01, 1.00E-01, 1.00E-01)), -->
<!--   list( -->
<!--     d= matrix(c(NA, 2.00E+00, 0.00E+00, -1.00E+00, -2.00E+00, -->
<!--                NA, 2.00E+00, 0.00E+00, -1.00E+00, -2.00E+00, -->
<!--                NA, 2.00E+00, 0.00E+00, -1.00E+00, -2.00E+00, -->
<!--                NA, 2.00E+00, 0.00E+00, -1.00E+00, -2.00E+00, -->
<!--                NA, 2.00E+00, 0.00E+00, -1.00E+00, -2.00E+00, -->
<!--                NA, 2.00E+00, 0.00E+00, -1.00E+00, -2.00E+00, -->
<!--                NA, 2.00E+00, 0.00E+00, -1.00E+00, -2.00E+00), -->
<!--       ncol=5, nrow=7, byrow=T), -->
<!--     a=c(3.00E+00, 3.00E+00, 3.00E+00, 3.00E+00, 3.00E+00, 3.00E+00, 3.00E+00)), -->
<!--   list( -->
<!--     d= matrix(c(NA, 1.00E+00, -1.00E+00, -2.00E+00, -3.00E+00, -->
<!--                NA, 1.00E+00, -1.00E+00, -2.00E+00, -3.00E+00, -->
<!--                NA, 1.00E+00, -1.00E+00, -2.00E+00, -3.00E+00, -->
<!--                NA, 1.00E+00, -1.00E+00, -2.00E+00, -3.00E+00, -->
<!--                NA, 1.00E+00, -1.00E+00, -2.00E+00, -3.00E+00, -->
<!--                NA, 1.00E+00, -1.00E+00, -2.00E+00, -3.00E+00, -->
<!--                NA, 1.00E+00, -1.00E+00, -2.00E+00, -3.00E+00), -->
<!--               ncol=5, nrow=7, byrow=T), -->
<!--     a=c(1.00E+00, 1.00E+00, 1.00E+00, 1.00E+00, 1.00E+00, 1.00E+00, 1.00E+00)) -->
<!-- ) -->


<!-- # vector of all parameters to save -->
<!-- param_save <- c("a", "d", "theta") -->

<!-- # fit model -->
<!-- fit <- openbugs( -->
<!--   data= data.file,  -->
<!--   model.file = model.file, # R grabs the file and runs it in openBUGS -->
<!--   parameters.to.save = param_save, -->
<!--   inits=start_values, -->
<!--   n.chains = 3, -->
<!--   n.iter = 2000, -->
<!--   n.burnin = 1000, -->
<!--   n.thin = 1 -->
<!-- ) -->
<!-- print(fit) -->
<!-- ``` -->

## PI Example - Stan


```{r chp11-peer-int-stan, warning=TRUE, message=TRUE, error=TRUE, cache=TRUE}


model_irt_peer_int <- '
data {
  int  N;
  int  J;
  int  K;
  int  x[N,J];
}

parameters {
  real<lower=0> a[J]; //discrimination
  ordered[K-1] d[J]; //location/thresholds
  real theta[N]; //person parameters
}

model {
  // item response probabilities
  for(n in 1:N){
    for(j in 1:J){
      x[n,j] ~ ordered_logistic(a[j]*theta[n], d[j,1:(K-1)]);
    }
  }
  //measurement model priors
  theta ~ normal(0,1);
  for(j in 1:J){
    a[j] ~ normal(1,2)T[0,];
    d[j,1] ~ normal(-2,2);
    d[j,2] ~ normal(-1,2)T[d[j,1],];
    d[j,3] ~ normal(1,2)T[d[j,2],];
    d[j,4] ~ normal(2,2)T[d[j,3],];
  }
}

'


# initial values
start_values <- list(
  list(
    d= matrix(c(-3.00E+00, -1.00E+00, 0.00E+00, 1.00E+00,
               -3.00E+00, -1.00E+00, 0.00E+00, 1.00E+00,
               -3.00E+00, -1.00E+00, 0.00E+00, 1.00E+00,
               -3.00E+00, -1.00E+00, 0.00E+00, 1.00E+00,
               -3.00E+00, -1.00E+00, 0.00E+00, 1.00E+00,
               -3.00E+00, -1.00E+00, 0.00E+00, 1.00E+00,
               -3.00E+00, -1.00E+00, 0.00E+00, 1.00E+00),
      ncol=4, nrow=7, byrow=T),
    a=c(1.00E-01, 1.00E-01, 1.00E-01, 1.00E-01, 1.00E-01, 1.00E-01, 1.00E-01)),
  list(
    d= matrix(c(-2.00E+00, 0.00E+00, 1.00E+00, 2.00E+00,
               -2.00E+00, 0.00E+00, 1.00E+00, 2.00E+00,
               -2.00E+00, 0.00E+00, 1.00E+00, 2.00E+00,
               -2.00E+00, 0.00E+00, 1.00E+00, 2.00E+00,
               -2.00E+00, 0.00E+00, 1.00E+00, 2.00E+00,
               -2.00E+00, 0.00E+00, 1.00E+00, 2.00E+00,
               -2.00E+00, 0.00E+00, 1.00E+00, 2.00E+00),
      ncol=4, nrow=7, byrow=T),
    a=c(3.00E+00, 3.00E+00, 3.00E+00, 3.00E+00, 3.00E+00, 3.00E+00, 3.00E+00)),
  list(
    d= matrix(c(-1.00E+00, 1.00E+00, 2.00E+00, 3.00E+00,
               -1.00E+00, 1.00E+00, 2.00E+00, 3.00E+00,
               -1.00E+00, 1.00E+00, 2.00E+00, 3.00E+00,
               -1.00E+00, 1.00E+00, 2.00E+00, 3.00E+00,
               -1.00E+00, 1.00E+00, 2.00E+00, 3.00E+00,
               -1.00E+00, 1.00E+00, 2.00E+00, 3.00E+00,
               -1.00E+00, 1.00E+00, 2.00E+00, 3.00E+00),
              ncol=4, nrow=7, byrow=T),
    a=c(1.00E+00, 1.00E+00, 1.00E+00, 1.00E+00, 1.00E+00, 1.00E+00, 1.00E+00))
)

# dataset
dat <- read.table("data/PI.dat", header=T)

mydata <- list(
  N = nrow(dat),
  J = ncol(dat),
  K = 5,
  x = as.matrix(dat)
)


# Next, need to fit the model
#   I have explicitly outlined some common parameters
fit <- stan(
  model_code = model_irt_peer_int, # model code to be compiled
  data = mydata,          # my data
  init = start_values,    # starting values
  chains = 3,             # number of Markov chains
  warmup = 2000,          # number of warm up iterations per chain
  iter = 4000,            # total number of iterations per chain
  cores = 3,              # number of cores (could use one per chain)
  refresh = 0             # no progress shown
)

# first get a basic breakdown of the posteriors
print(fit,pars =c("d", "a", "theta[500]"))

# plot the posteriors with
#  95% (outer) probability intervals
#  and 80% (inner) intervals to contrast the dispersion
plot(fit,pars =c("d", "a", "theta[500]"))

# traceplots
rstan::traceplot(fit,pars =c("d", "a", "theta[500]"), inc_warmup = TRUE)

# Gelman-Rubin-Brooks Convergence Criterion
ggs_grb(ggs(fit, family = c("d"))) +
   theme_bw() + theme(panel.grid = element_blank())
ggs_grb(ggs(fit, family = "^a\\[")) +   # anchored so theta is not matched
   theme_bw() + theme(panel.grid = element_blank())
ggs_grb(ggs(fit, family = "theta\\[500\\]")) +   # brackets escaped in the regex
   theme_bw() + theme(panel.grid = element_blank())
# autocorrelation
ggs_autocorrelation(ggs(fit, family="d")) +
   theme_bw() + theme(panel.grid = element_blank())
ggs_autocorrelation(ggs(fit, family="^a\\[")) +
   theme_bw() + theme(panel.grid = element_blank())
ggs_autocorrelation(ggs(fit, family="theta\\[500\\]")) +
   theme_bw() + theme(panel.grid = element_blank())

# plot the posterior density
plot.data <- as.matrix(fit)

mcmc_areas(plot.data, pars = paste0("d[1,",1:4,"]"),prob = 0.8)
mcmc_areas(plot.data, pars = paste0("d[2,",1:4,"]"),prob = 0.8)
mcmc_areas(plot.data, pars = paste0("d[3,",1:4,"]"),prob = 0.8)
mcmc_areas(plot.data, pars = paste0("d[4,",1:4,"]"),prob = 0.8)
mcmc_areas(plot.data, pars = paste0("d[5,",1:4,"]"),prob = 0.8)
mcmc_areas(plot.data, pars = paste0("d[6,",1:4,"]"),prob = 0.8)
mcmc_areas(plot.data, pars = paste0("d[7,",1:4,"]"),prob = 0.8)
mcmc_areas(plot.data, pars = paste0("a[",1:7,"]"),prob = 0.8)
mcmc_areas(plot.data, pars = paste0("theta[500]"),prob = 0.8)

```
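
The Stan thresholds are increasing while the JAGS locations are decreasing because `ordered_logistic` subtracts the cutpoints from the linear predictor. Assuming Stan's documented definition of `ordered_logistic`, the R sketch below checks that cutpoints equal to the negated locations reproduce the GRM category probabilities. Both helper functions are hypothetical names for this illustration, and the parameter values are arbitrary. Note also that Stan's `normal()` takes a standard deviation, so `normal(-2,2)` has variance 4, whereas the JAGS prior `dnorm(2, .5)` has variance 2.

```r
# P(x = k) using the JAGS parameterization: P(x >= k) = plogis(a * theta + d_k)
grm_probs_d <- function(theta, a, d) {
  p_gte <- c(1, plogis(a * theta + d))
  p_gte - c(p_gte[-1], 0)
}

# P(y = k) following Stan's ordered_logistic(eta, cut) definition
ordered_logistic_probs <- function(eta, cut) {
  p_gt <- plogis(eta - cut)   # P(y > k) for k = 1, ..., K - 1
  c(1, p_gt) - c(p_gt, 0)
}

# illustrative values (assumptions, not estimates)
theta <- 0.5
a <- 1.3
d <- c(2, 1, -1, -2)          # decreasing locations (JAGS)
cut <- -d                     # increasing cutpoints (Stan)

p_jags <- grm_probs_d(theta, a, d)
p_stan <- ordered_logistic_probs(a * theta, cut)
all.equal(p_jags, p_stan)
```

This is why the prior means in the Stan program have the opposite sign and order from those in the JAGS model.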

## Latent Response Formulation

Connecting IRT models to a factor analytic perspective can be helpful from a modeling standpoint, especially when the model is multidimensional and leads into structural equation models.
A useful connection can be made by introducing extra variable(s) into the model to represent the *latent response variable* underlying the *observed categorical response variable*.
We can think of this latent response variable as 

* a latent continuous variable hypothesized to underlie the observed categorical variable, which was discretized due to data collection or difficulty in measurement; or

* when this natural interpretation is not appropriate, we can think of the latent response variable as a propensity measure for the given response. Although this is not a perfect interpretation, the use of a latent response formulation eases some of the computational machinery and allows for a nice connection between IRT and CFA models.

Next, the latent response formulation is shown for a set of dichotomous outcomes.
This model is conceptually a 2-PL/2-PNO (2 parameter normal ogive) model and is essentially a probit model.
The model can be defined as

\[x^{\ast}_{ij} = a_j\theta_i+d_j+\varepsilon_{ij},\]

where $\varepsilon_{ij} \sim N(0, 1)$. An important feature is how this latent response is related back to the observed indicator: the range of possible values of the latent response is defined conditional on the value of the observed response. That is,
\[
x^{\ast}_{ij} \sim \begin{cases}
N(a_j\theta_i+d_j, 1)I(x^{\ast}_{ij}\geq0) & \mathrm{if}\ x_{ij}=1\\
N(a_j\theta_i+d_j, 1)I(x^{\ast}_{ij}<0) & \mathrm{if}\ x_{ij}=0.
\end{cases}
\]
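
As a quick check on this formulation, the R sketch below simulates latent responses for one respondent and item and compares the proportion of simulated responses at or above zero with the probit probability $\Phi(a_j\theta_i + d_j)$. The parameter values are assumptions chosen for illustration only.

```r
set.seed(123)

# illustrative values (assumptions, not estimates)
a <- 1.2
d <- -0.5
theta <- 0.8

n_rep <- 1e5
xstar <- a * theta + d + rnorm(n_rep)   # latent responses with N(0, 1) errors
x <- as.integer(xstar >= 0)             # observed dichotomous responses

c(simulated = mean(x), probit = pnorm(a * theta + d))
```

The two values agree up to simulation error, which is the sense in which the latent response model is a probit (normal-ogive) model.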


### A Comment on use of JAGS 

After a lot of trial and error in JAGS, I discovered that the latent response model is a bit difficult to code up exactly as defined above.
I did find a way to utilize the idea of the latent response formulation.
The approach is not perfect, and there are likely more efficient ways of coding the model.

In the approach demonstrated in the example code below, I use a model nearly identical to the above specification.
However, I alter the model so that an "observed stochastic node" can be utilized as part of the sampling.
That is, I had to use the latent response variables to obtain a probability of the observed response.
This is straightforward because the latent response variable is based on the standard normal distribution, so we can use the standard normal CDF ($\Phi$, `phi` in JAGS) to obtain the probability of responding "1".
Then a Bernoulli distribution is used as the data model.
This approach is a straightforward addition to the latent response modeling and connects well to what is done in the IRT models above.

### LSAT Example Revisited - JAGS

```{r chp11-lrv-lsat-jags, warning=TRUE, message=TRUE, error=TRUE, cache=TRUE}

jags.model.lsat <- function(){

for (i in 1:n){
  for(j in 1:J){
    # latent response variable
    
    xpos[i,j] ~ dnorm(0, 1);T(0,)
    xneg[i,j] ~ dnorm(0, 1);T(,0)
    
    mu[i,j] = xpos[i,j]*z[i,j] + xneg[i,j]*(1-z[i,j])
    
    # compute probabilities based on probit to obtain probabilities for observed categories
    x[i,j] ~ dbern(phi(xstar[i,j] - d[j]))

  }
  
  xstar[i,1:J] ~ dmnorm.vcov(mu[i,1:J], Omega)
  
  
}

psi ~ dgamma(1, 0.5)
invpsi = 1/psi;
for(i in 1:n){
  theta[i] ~ dnorm(0, invpsi)  # distribution for the latent variables
}

for(j in 1:J){
  d[j] ~ dnorm(0, .5)          # Locations for observables
}
a[1] = 1
for(j in 2:J){
  a[j] ~ dnorm(1, .5)    # Discriminations for observables
}

for(i in 1:J){
  for(j in 1:J){
    Omega[i,j] = ifelse(i==j, 1 + psi*a[i]*a[j], psi*a[i]*a[j])
  }
}


} # closes the model

# initial values
start_values <- list(
  list("d"=c(1.00, 1.00, 1.00, 1.00, 1.00),
       "a"=c(NA, 1.00, 1.00, 1.00, 1.00)),
  list("d"=c(-3.00, -3.00, -3.00, -3.00, -3.00),
       "a"=c(NA, 3.00, 3.00, 3.00, 3.00)),
  list("d"=c(3.00, 3.00, 3.00, 3.00, 3.00),
       "a"=c(NA, 0.1, 0.1, 0.1, 0.1))
)

# vector of all parameters to save
param_save <- c("a", "d", "theta", "psi", "Omega", "xstar")

# dataset
dat <- read.table("data/LSAT.dat", header=T)

mydata <- list(
  n = nrow(dat),
  J = ncol(dat),
  x = as.matrix(dat),
  z = as.matrix(dat)
)

# fit model
fit <- jags(
  model.file=jags.model.lsat,
  data=mydata,
  inits=start_values,
  parameters.to.save = param_save,
  n.iter=2000,
  n.burnin = 1000,
  n.chains = 3,
  progress.bar = "none")

#print(fit)
round(fit$BUGSoutput$summary[ !c(rownames(fit$BUGSoutput$summary) %like% "theta" |
                                   rownames(fit$BUGSoutput$summary) %like% "xstar" ), ], 3)
round(fit$BUGSoutput$summary[c("xstar[4,4]","xstar[4,5]","xstar[10,4]","xstar[10,5]"),] ,3)
# extract posteriors for all chains
jags.mcmc <- as.mcmc(fit)
# the below two plots are too big to be useful given the 1000 observations.
#R2jags::traceplot(jags.mcmc)

# Gelman-Rubin-Brooks
#gelman.plot(jags.mcmc)

# convert to single data.frame for density plot
a <- colnames(as.data.frame(jags.mcmc[[1]]))
plot.data <- data.frame(as.matrix(jags.mcmc, chains=T, iters = T))
colnames(plot.data) <- c("chain", "iter", a)


plot_title <- ggtitle("Posterior distributions",
                      "with medians and 80% intervals")
mcmc_areas(
  plot.data,
  pars = c(paste0("d[",1:5,"]")),
  prob = 0.8) +
  plot_title

mcmc_areas(
  plot.data,
  pars = c(paste0("a[", 2:5, "]")),
  prob = 0.8) +
  plot_title

print(mydata$x[50,])
mcmc_areas(
  plot.data,
  pars = c("xstar[50,1]", "xstar[50,2]"),
  prob = 0.8) +
  plot_title


```
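
The `Omega` matrix built elementwise in the model above has the form $\boldsymbol{\Omega} = \mathbf{I} + \psi\boldsymbol{a}\boldsymbol{a}^{\prime}$, which is the covariance implied by a single-factor model with factor variance $\psi$ and unit residual variances. The R sketch below confirms that the loop and the matrix expression agree; the values of `a` and `psi` are assumptions for illustration, not posterior estimates.

```r
# illustrative values (assumptions, not posterior estimates)
a   <- c(1, 0.8, 1.1, 0.9, 1.3)
psi <- 0.7
J   <- length(a)

# elementwise construction, as in the model code
Omega_loop <- matrix(NA_real_, J, J)
for (i in 1:J) {
  for (j in 1:J) {
    Omega_loop[i, j] <- ifelse(i == j, 1 + psi * a[i] * a[j], psi * a[i] * a[j])
  }
}

# equivalent matrix form: identity plus psi times the outer product of a
Omega <- diag(J) + psi * outer(a, a)

all.equal(Omega, Omega_loop)
round(cov2cor(Omega), 3)   # implied correlations among the latent responses
```

Rescaling with `cov2cor()` gives the correlations among the latent responses implied by these values.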


### LSAT Example Revisited - Stan

```{r chp11-lrv-lsat-stan, warning=TRUE, message=TRUE, error=TRUE, cache=TRUE}

model_lrv <- '
functions {
  //y = data vector of 0/1 responses that determines which bound (and nu) to use
  //mu=vector of K means fixed to 0 to align with blavaan
  //L= cholesky  factor for K variables
  //b=vector of bounds/threshold values
  //u=vector of random uniform numbers (K)
  vector[] tmvn(int[] y, vector mu, matrix L, vector b, real[] u) {
    int K = rows(mu);
    vector[K] d;
    vector[K] z;
    vector[K] out[2];
    for (k in 1:K) {
      int km1 = k - 1;
      real nu;
      //y==0 => upper bound only
      if (y[k] == 0) {
        real z_star = (b[k] - (mu[k] + ((k > 1) ? L[k,1:km1] * head(z, km1) : 0))) /  L[k,k];
        real u_star = Phi(z_star); //normal CDF for implied density of TMVN
        nu = u_star * u[k];
        d[k] = u_star;
      }
      //y==1 => lower bound only
      if(y[k] == 1) {
        real z_star = (b[k] - (mu[k] + ((k > 1) ? L[k,1:km1] * head(z, km1) : 0))) /  L[k,k];
        real u_star = Phi(z_star);
        d[k] = 1 - u_star;
        nu = u_star + d[k] * u[k];
      }
      z[k] = inv_Phi(nu); //convert back to z-score from uniform variate
    }

    out[1] = z; //simulated ystar value
    out[2] = d; //density
    return(out);
  }

}
data {
  int<lower=1>  N;
  int<lower=1>  J;
  array[N, J] int<lower=0, upper=1> x;
}

parameters {
  vector[J] tau; //threshold parameters
  vector[J-1] lambda_fr; //factor loadings
  real<lower=0> psi; // factor standard deviation
  real eta[N]; //factor scores
  array[N, J] real<lower=0, upper=1> u; // nuisance that absorbs inequality constraints
}

transformed parameters {
  vector[J] lambda;
  matrix[J,J] Omega;
  matrix[J,J] L_Omega;
  real psi_var;

  lambda[1] = 1;
  lambda[2:J] = lambda_fr;
  psi_var = psi^2;

  for(d in 1:J){
    for(f in 1:J){
      if(d==f){
        Omega[d,f] = 1 + psi_var*lambda[d]^2;
      } else {
        Omega[d,f] = psi_var*lambda[d]*lambda[f];
      }
    }
  }
  //decompose for use in stan sampling statement
  L_Omega = cholesky_decompose(Omega);
}

model {
  for(n in 1:N)
    target += log(tmvn(x[n,1:J], rep_vector(0,J), L_Omega, tau, u[n,1:J])[2]);
  // Jacobian adjustments to kernel density for use of TMVN density on x*
  // implicit: u ~ uniform(0,1)
  // truncated multivariate normal
  
  
  //priors for measurement model
  psi ~ gamma(1, 0.5);
  eta ~ normal(0,psi);
  lambda_fr ~ normal(0,1.5);
  tau ~ normal(0,1.5);
}

generated quantities {
  corr_matrix[J] Omega_cor;

  for(d in 1:J){
    for(f in 1:J){
      if(d==f){
        Omega_cor[d,f] = 1;
      } else {
        Omega_cor[d,f] = Omega[d,f]/(sqrt(Omega[d,d])*sqrt(Omega[f,f]));
      }
    }
  }
}

'
cat(model_lrv)
# initial values
start_values <- list(
  list("tau"=c(1.00, 1.00, 1.00, 1.00, 1.00),
       "lambda_fr"=c(1.00, 1.00, 1.00, 1.00)),
  list("tau"=c(-3.00, -3.00, -3.00, -3.00, -3.00),
       "lambda_fr"=c(3.00, 3.00, 3.00, 3.00)),
  list("tau"=c(3.00, 3.00, 3.00, 3.00, 3.00),
       "lambda_fr"=c(0.1, 0.1, 0.1, 0.1))
)

# dataset
dat <- read.table("data/LSAT.dat", header=T)

mydata <- list(
  N = nrow(dat),
  J = ncol(dat),
  x = as.matrix(dat)
)


# Next, need to fit the model
#   I have explicitly outlined some common parameters
fit <- stan(
  model_code = model_lrv, # model code to be compiled
  data = mydata,          # my data
  #init = start_values,    # starting values
  chains = 4,             # number of Markov chains
  warmup = 1000,          # number of warm up iterations per chain
  iter = 2000,            # total number of iterations per chain
  cores = 4,              # number of cores (could use one per chain)
  refresh = 0             # no progress shown
)

# first get a basic breakdown of the posteriors
print(fit,pars =c("lambda", "tau","psi", "Omega_cor"))


rstan::traceplot(fit,pars =c("lambda_fr", "tau", "psi"), inc_warmup = TRUE)

# Gelman-Rubin-Brooks Convergence Criterion
ggs_grb(ggs(fit, family = c("lambda_fr"))) +
   theme_bw() + theme(panel.grid = element_blank())
ggs_grb(ggs(fit, family = "tau")) +
   theme_bw() + theme(panel.grid = element_blank())
ggs_grb(ggs(fit, family = "psi")) +
   theme_bw() + theme(panel.grid = element_blank())
# autocorrelation
ggs_autocorrelation(ggs(fit, family="lambda_fr")) +
   theme_bw() + theme(panel.grid = element_blank())
ggs_autocorrelation(ggs(fit, family="tau")) +
   theme_bw() + theme(panel.grid = element_blank())
ggs_autocorrelation(ggs(fit, family="psi")) +
   theme_bw() + theme(panel.grid = element_blank())

# plot the posterior density
plot.data <- as.matrix(fit)

plot_title <- ggtitle("Posterior distributions",
                      "with medians and 80% intervals")
mcmc_areas(
  plot.data,
  pars = paste0("lambda_fr[",1:4,"]"),
  prob = 0.8) +
  plot_title

mcmc_areas(
  plot.data,
  pars = paste0("tau[",1:5,"]"),
  prob = 0.8) +
  plot_title

mcmc_areas(
  plot.data,
  pars = c("psi"),
  prob = 0.8) +
  plot_title
```
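
The `tmvn` function in the Stan program above turns each uniform nuisance variable into a truncated normal draw through an inverse-CDF transform. A univariate R sketch of that step follows. The helper `trunc_z` is a hypothetical name, and the standardized bound `z_star` is treated as given here rather than computed from the threshold and the Cholesky factor.

```r
set.seed(42)

# Hypothetical helper mirroring one pass of the loop in tmvn()
trunc_z <- function(y, z_star, u) {
  u_star <- pnorm(z_star)
  if (y == 0) {
    nu   <- u_star * u                  # upper bound only: nu in (0, u_star)
    prob <- u_star                      # mass of the region below the bound
  } else {
    nu   <- u_star + (1 - u_star) * u   # lower bound only: nu in (u_star, 1)
    prob <- 1 - u_star                  # mass of the region above the bound
  }
  list(z = qnorm(nu), prob = prob)      # back to the z-score scale
}

u  <- runif(1000)
z0 <- trunc_z(y = 0, z_star = 0.3, u = u)
z1 <- trunc_z(y = 1, z_star = 0.3, u = u)

range(z0$z)          # all draws fall below 0.3
range(z1$z)          # all draws fall above 0.3
z0$prob + z1$prob    # the two regions have total mass 1
```

In the Stan program the returned mass plays the role of `d[k]`, whose log is added to `target`, while the uniform `u` stays a parameter with an implicit uniform prior.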


## Final Notes

* A fully Bayesian approach to psychometric modeling helps highlight the major similarities between factor analytic frameworks and the item response theory perspective.