
\chapter[RNN-Based Counterfactual Prediction, with an Application to Homestead Policy and Public Schooling]{RNN-Based Counterfactual Prediction, with an Application to Homestead Policy and Public Schooling}\label{rnns-causal}

\begin{quote}  
	\textbf{Summary:} This paper proposes an alternative to the synthetic control method (SCM) for estimating the effect of a policy intervention on an outcome over time. Recurrent neural networks (RNNs) are used to predict the counterfactual outcomes of treated units using only the outcomes of control units as predictors. This approach is less susceptible to $p$-hacking because it does not require the researcher to choose predictors or pre-intervention covariates to construct the synthetic control. RNNs do not assume a functional form, can learn nonconvex combinations of control units, and are specifically structured to exploit temporal dependencies in sequential data. I apply the approach to the problem of estimating the long-run impacts of U.S. homestead policy on public school spending.
\end{quote}

\clearpage

\section{Introduction} 
\noindent

An important problem in the social sciences is estimating the effect of a discrete intervention on a continuous outcome over time. When interventions take place at an aggregate level (e.g., a state), researchers make causal inferences by comparing the post-intervention (``post-period'') outcomes of affected (``treated'') units against the outcomes of unaffected (``control'') units. A common approach to the problem is the synthetic control method (SCM) \citep{abadie2010synthetic}, which predicts the counterfactual outcomes of treated units by finding a convex combination of control units that matches the treated units in terms of lagged outcomes or pre-intervention (``pre-period'') covariates. 

The SCM has several limitations. First, the convexity restriction of the synthetic control estimator precludes dynamic, nonlinear interactions between multiple control units. Intuitively, one can expect that the treated unit may exhibit nonlinear or negative correlations with the control units. \citet{ferman2016revisiting} demonstrate that the convexity restriction implies that the SCM estimator may be biased even if selection into treatment is only correlated with time-invariant unobserved covariates. Second, \citet{ferman2018synthetic} demonstrate that the SCM is generally biased if treatment assignment is correlated with unobserved confounders, even when the number of pre-intervention periods grows. Moreover, the authors show that while the SCM minimizes imbalance in pre-period outcomes, the likelihood of finding exact balancing weights vanishes as the number of time periods increases, which results in bias. 

Third, several problems arise from the lack of guidance on how to specify the SCM estimator. The specification of the estimator can produce very different results: \citet{ferman2018cherry} show, for example, how cherry-picking between common SCM specifications can facilitate $p$-hacking. \citet{kaul2015synthetic} show that the common practice of including lagged outcomes as model inputs can render all other covariates irrelevant. \citet{klossner2017comparative} demonstrate that the common practice of using cross-validation to select importance weights can yield multiple values and consequently different results. 

This paper proposes an alternative to the SCM that is capable of automatically selecting appropriate control units at each time period, allows for nonconvex combinations of control units, and does not rely on pre-period covariates. The method uses recurrent neural networks (RNNs) to predict the counterfactual outcomes of treated units using only control unit outcomes as model inputs. RNNs are a class of neural networks that take advantage of the sequential nature of temporal data by sharing model parameters across multiple time periods \citep{el1995}. RNNs are nonparametric in that they do not assume a functional form when fitting the data. In addition, RNNs can learn the most useful nonconvex combination of control unit outcomes at each time period for generating counterfactual predictions. Relaxing the convexity restriction is useful when the data-generating process underlying the outcome of interest depends nonlinearly on the history of its inputs. RNNs have been shown to outperform various linear models on time-series prediction tasks \citep{cinar2017position}. 

RNNs are end-to-end trainable and can be flexibly adapted to a given sequential prediction problem. For example, they are capable of sharing learned parameters across time periods and multiple treated units. While the SCM can be generalized to handle multiple treated units \citep[e.g.,][]{dube2015pooling,xu2017generalized}, the generalized SCM is not capable of sharing model weights when predicting the outcomes of multiple treated units. Regularization methods such as dropout can easily be incorporated into RNN architectures to prevent overfitting during training, which is a particular concern when the networks learn to rely too heavily on a few inputs.

The proposed method builds on a new literature that uses machine learning methods for data-driven counterfactual prediction, such as matrix completion \citep{athey2017matrix}, or two-stage estimators that reduce data dimensionality via L1-regularized regression \citep{doudchenko2016balancing,carvalho2018arco} or matrix factorization \citep{amjad2018robust} prior to regressing the outcomes on the reduced data. These methods are data-driven in the sense that they are capable of finding an appropriate subset of control units to form the synthetic control, without domain knowledge or pre-period covariates. 

In the section immediately below, I describe the problem of counterfactual prediction and its relationship to matrix completion and the problem of covariate shift; Section \ref{RNNs-section} introduces the approach of using RNNs for counterfactual prediction; Section \ref{placebo} presents the results of the placebo tests; Section \ref{schooling-app} details the procedure for hypothesis testing and applies the RNN-based method and inferential procedure to the problem of estimating the impact of homestead policy on long-run state government investment in public schooling; Section \ref{ch3-conclusion} concludes and offers potential avenues for future research. 

\section{Counterfactual prediction} \label{prediction}

The proposed method estimates the causal effect of a discrete intervention in observational panel data; i.e., settings in which treatment is not randomly assigned and there exist both pre- and post-period observations of the outcome of interest. Let $\boldsymbol{Y}$ denote a $\text{N} \times \text{T}$ matrix of outcomes for each unit $i =1, \ldots, \text{N}$, at time $t = 1, \ldots, \text{T}$. $\boldsymbol{Y}$ is incomplete because we observe each element $Y_{it}$ for only the control units and the treated units prior to time of initial treatment exposure, $\text{T}_0 < \text{T}$. Let $\mathcal{O}$ denote the set of $(it)$ values that are observed and $\mathcal{M}$ the set of $(it)$ missing values. Let the values of the $\text{N} \times \text{T}$ complete matrix $\boldsymbol{W}$ be $W_{it} =1$ if $(it) \in \mathcal{M}$ and $W_{it} = 0$ if $(it) \in \mathcal{O}$. The pattern of missing data is assumed throughout this paper to follow a simultaneous treatment adoption setting, where treated units are exposed to treatment at time $\text{T}_0$ and every subsequent period. 

This setup is motivated by the \citet{neyman1923} potential outcomes framework, where for each $it$ value there exists a pair of potential outcomes, $Y_{it}(1)$ and $Y_{it}(0)$, representing the response to treated and control regimes, respectively. The observed outcomes are 

\begin{align*} 
	Y_{it} = \begin{cases}
		Y_{it}(0) 	& \mbox{if } W_{it} = 0  \text{ or } t < \text{T}_0 \\
		Y_{it}(1) 	& \mbox{if } W_{it} = 1  \text{ and } t \geq \text{T}_0.
	\end{cases} 
\end{align*} 
\noindent
The problem of counterfactual prediction is that we cannot directly observe the missing potential outcomes and instead wish to impute the missing values in $\boldsymbol{Y}(0)$ for treated units with $W_{it} =1$.  The potential outcomes framework explicitly assumes unconfoundedness. In an observational setting, this assumption requires

$$
\left(\boldsymbol{Y}(0), \boldsymbol{Y}(1) \right) \independent \boldsymbol{W}| \boldsymbol{Y}(\mathcal{O}),
$$ where $\boldsymbol{Y}(\mathcal{O})$ is the observed data. The potential outcomes framework implicitly assumes treatment is well-defined to ensure that each unit has the same number of potential outcomes \citep{imbens2015causal}. It also excludes interference between units, which would undermine the framework by creating more than two potential outcomes per unit, depending on the treatment status of other units \citep{rubin1990}.

\subsection{Relationship to matrix completion and covariate shift}

The intuition behind the proposed approach to counterfactual prediction is similar to that of the method of matrix completion via nuclear norm minimization (MC-NNM) proposed by \citet{athey2017matrix}. Matrix completion methods attempt to impute missing entries in a low-rank matrix by solving a convex optimization problem, even when relatively few values are observed in $\boldsymbol{Y}$ \citep{candes2009exact,candes2010matrix}. The estimator recovers an $\text{N} \times \text{T}$ low-rank matrix by minimizing the sum of squared errors via nuclear norm regularized least squares. The estimator reconstructs the matrix by iteratively replacing missing values with those recovered from a singular value decomposition \citep{mazumder2010spectral}. 
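
To make the estimator concrete, a simplified form of the MC-NNM objective can be sketched as follows. The sketch omits the unit and time fixed effects that \citet{athey2017matrix} also allow for, and the symbols $\boldsymbol{L}$ (the low-rank matrix to be recovered), $\lambda$ (the regularization parameter), and $\| \cdot \|_{\ast}$ (the nuclear norm, i.e., the sum of singular values) are introduced here for illustration only:

\begin{equation*}
	\hat{\boldsymbol{L}} = \underset{\boldsymbol{L}}{\arg\min} \left\{ \frac{1}{|\mathcal{O}|} \sum_{(it) \in \mathcal{O}} \left(Y_{it} - L_{it}\right)^2 + \lambda \| \boldsymbol{L} \|_{\ast} \right\}.
\end{equation*}
\noindent
The iterative scheme of \citet{mazumder2010spectral} can then be summarized, again only as a sketch, by the update

\begin{equation*}
	\boldsymbol{L}_{k+1} = \mathcal{S}_{\lambda} \left( P_{\mathcal{O}}(\boldsymbol{Y}) + P_{\mathcal{O}}^{\perp}(\boldsymbol{L}_{k}) \right),
\end{equation*}
\noindent
where $P_{\mathcal{O}}(\cdot)$ keeps the observed entries and sets the others to zero, $P_{\mathcal{O}}^{\perp}(\cdot)$ does the reverse, and $\mathcal{S}_{\lambda}(\cdot)$ takes a singular value decomposition and soft-thresholds the singular values at a level proportional to $\lambda$. Each iteration therefore fills in the missing entries with the current low-rank estimate and then shrinks the filled-in matrix toward lower rank.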

\citet{athey2017matrix} note two drawbacks of MC-NNM. First, the errors may be autocorrelated because the estimator does not account for temporal dependencies in the observed data. The estimator detects patterns row- and column-wise, but treats the data as perfectly synchronized \citep{yoon2018estimating}. In contrast, the SCM assumes that correlations across units are stable over time, while the RNN-based approach exploits the temporal component of the data and therefore does not have the problem of autocorrelated errors. 

Second, the MC-NNM estimator penalizes the errors for each observed value equally without regard to the fact that the probability of missingness (i.e., the propensity score) increases with $t$. \citet{athey2017matrix} suggest weighting the loss function by the propensity score, which is similar to the importance weighting scheme proposed by \citet{cortes2008sample} to address the problem of covariate shift, which is a special case of domain adaptation \citep{huang2007correcting,ben2007analysis,bickel2009discriminative,cortes2010learning,JMLR:v17:15-239}.\footnote{\citet{schnabel2016recommendations} first connected the matrix completion problem with causal inference in observational settings in the context of recommender systems under confounding. \citet{johansson2016learning} formulate the general problem of counterfactual inference as a covariate shift problem.}

The covariate shift problem occurs when training and test data are drawn from different distributions. Define the training set input-output pair as 

$$\left(\boldsymbol{X}^{\text{train}}, \boldsymbol{Y}^{\text{train}}\right) = \left(\boldsymbol{Y}(\boldsymbol{W})^{\left(t < \text{T}_0\right)}, \boldsymbol{Y}(\boldsymbol{W})^{\left(t \geq \text{T}_0\right)}\right)$$
\noindent
for units with $\boldsymbol{W}=0$ and the test set pair $\left(\boldsymbol{X}^{\text{test}}, \boldsymbol{Y}^{\text{test}}\right)$ for units with $\boldsymbol{W}=1$. In the proposed approach, the model weights learned on the training set are applied to $\boldsymbol{X}^{\text{test}}$ to predict $\boldsymbol{Y}^{\text{test}}$. The approach therefore assumes similarity between the distributions of $\boldsymbol{X}^{\text{train}}$ and $\boldsymbol{X}^{\text{test}}$. In order to minimize the discrepancy between the training and test set input distributions, I estimate the propensity score $\hat{e}_{it} = \Pr(W_{it}=1 | Z_{it})$, conditional on covariate matrix $\boldsymbol{Z}$, and then weight the training loss by the estimated propensity scores. 
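
The role of the weights can be seen from the standard importance-weighting identity, sketched here under two assumptions: that the conditional distribution of the outputs given the inputs is the same in the training and test sets, and that the test inputs have positive density only where the training inputs do. The symbols $p_{\text{train}}$, $p_{\text{test}}$, $\ell$, and $\omega$ are introduced for illustration and are not used elsewhere in the chapter. For a generic loss $\ell$,

\begin{equation*}
	\mathrm{E}_{\text{test}} \left[ \ell(\boldsymbol{x}, \boldsymbol{y}) \right] = \mathrm{E}_{\text{train}} \left[ \omega(\boldsymbol{x}) \, \ell(\boldsymbol{x}, \boldsymbol{y}) \right], \qquad \omega(\boldsymbol{x}) = \frac{p_{\text{test}}(\boldsymbol{x})}{p_{\text{train}}(\boldsymbol{x})}.
\end{equation*}
\noindent
Writing $e(\boldsymbol{x}) = \Pr(W = 1 \,|\, \boldsymbol{x})$ for the propensity score, Bayes' rule expresses the density ratio in terms of the odds of treatment:

\begin{equation*}
	\omega(\boldsymbol{x}) = \frac{p(\boldsymbol{x} \,|\, W = 1)}{p(\boldsymbol{x} \,|\, W = 0)} = \frac{e(\boldsymbol{x})}{1 - e(\boldsymbol{x})} \cdot \frac{\Pr(W = 0)}{\Pr(W = 1)}.
\end{equation*}
\noindent
This is the textbook form of the weights. The loss in (\ref{eq:mse}) weights by the estimated scores themselves, which, like the odds, places more weight on training observations that resemble the treated units.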

\subsection{Nonparametric regression}

In its most basic form, counterfactual prediction can be represented as a nonparametric regression of the training set outputs on the inputs,

\begin{equation}\label{eq:np}
	\boldsymbol{Y}^{\text{train}} = f_0 \left(\boldsymbol{X}^{\text{train}}\right) + \upepsilon^{(t)},
\end{equation}
\noindent
where the noise variables $\upepsilon^{(t)}$ are assumed to be i.i.d. standard normal and independent of the observed data. The nonlinear function $f_0$ is estimated by $\hat{f_0}$, which yields the fitted values $\boldsymbol{\hat{Y}}^{\text{train}} = \hat{f_0} \left(\boldsymbol{X}^{\text{train}}\right)$ and is chosen to minimize the weighted mean squared error on the training set outputs, 

\begin{equation} \label{eq:mse}
	\text{WMSE} = \sum \left(\boldsymbol{Y}^{\text{train}} - \boldsymbol{\hat{Y}}^{\text{train}}  \right)^2  \cdot \frac{\boldsymbol{\hat{E}}^\text{train}}{|\boldsymbol{X}^\text{train}|},
\end{equation}
\noindent
where $\boldsymbol{\hat{E}}^\text{train}$ is a matrix of estimated propensity scores. 

At test time, the estimated function is used to predict $\boldsymbol{\hat{Y}}^{\text{test}} = \hat{f_0} \left(\boldsymbol{X}^{\text{test}}\right)$. The estimated causal effect of the intervention is then

\begin{equation}\label{eq:pointwise}
	\boldsymbol{\hat{\upphi}} = \boldsymbol{Y}^{\text{test}} - \boldsymbol{\hat{Y}}^{\text{test}}. 
\end{equation}

The estimated average causal effect of the intervention on treated units is calculated by averaging $\boldsymbol{\hat{\upphi}}$ across the treated units at each post-period time step, resulting in the vector $\boldsymbol{\bar{\upphi}}^{(t)}$ of length $\text{T}_\star = \text{T}-\text{T}_0$. 
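
Since $\boldsymbol{\bar{\upphi}}^{(t)}$ is indexed by post-period time step, one way to write each of its elements explicitly is as a cross-sectional mean of the pointwise effects in (\ref{eq:pointwise}). The symbol $\text{N}_1$, the number of treated units, is notation assumed here for illustration:

\begin{equation*}
	\bar{\upphi}^{(t)} = \frac{1}{\text{N}_1} \sum_{i \,:\, W_{i\text{T}} = 1} \left( Y_{it} - \hat{Y}_{it} \right) \quad \text{for each post-period } t.
\end{equation*}
\noindent
Under simultaneous adoption, the units with $W_{i\text{T}} = 1$ are exactly the treated units, so the sum runs over the same set of units at every post-period time step.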

\section{RNNs for counterfactual prediction} \label{RNNs-section}

RNNs \citep{graves2012,goodfellow2016deep} consist of an input $\boldsymbol{X} = \left(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(n_x)}\right)$, an output $\boldsymbol{Y} = \left(\boldsymbol{y}^{(1)}, \ldots, \boldsymbol{y}^{(n_y)}\right)$, and a hidden state $\boldsymbol{h}^{(t)}$. In the plain vanilla RNN it is assumed $n_x = n_y = T$; in the encoder-decoder network architecture described below, $n_x$ and $n_y$ can vary in length. 

At each $t$, the RNN reads $\boldsymbol{x}^{(t)}$ and passes it to the hidden state $\boldsymbol{h}^{(t)}$, which is updated with a function $g^{(t)}$ of the entire history of the input, unfolded backwards in time:
%
\begin{align}
	\boldsymbol{h}^{(t)} &= g^{(t)} \left(\boldsymbol{x}^{(t)}, \boldsymbol{x}^{(t-1)}, \ldots, \boldsymbol{x}^{(1)} \right) \\ 
	&= f_1 \left( \boldsymbol{h}^{(t-1)}, \boldsymbol{x}^{(t)}; \, \theta \right). \label{eq:hidden}
\end{align} 
The activation function $f_1 (\cdot)$, parameterized by $\theta$, is shared for all $t$. Parameter sharing is particularly useful in the current application because it allows for better generalization when the dimension of the training data is relatively small. The updated hidden state (\ref{eq:hidden}) is used to generate a sequence of values $\boldsymbol{o}^{(t)}$ in the form of log probabilities corresponding to the output. The loss function internally computes $\boldsymbol{\hat{y}}^{(t)} = f_2 \left(\boldsymbol{o}^{(t)}\right)$, where $f_2 (\cdot)$ can be a linear function for regression problems. The total loss for the input-output pair is the sum of the losses over all $t$.
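
As an illustration of (\ref{eq:hidden}), one common choice of $f_1$ is the hyperbolic tangent update of a plain RNN \citep{goodfellow2016deep}. The chapter does not commit to this form, and the parameter names $\boldsymbol{\Theta}_h$, $\boldsymbol{\Theta}_x$, $\boldsymbol{\Theta}_o$, $\boldsymbol{b}_h$, and $\boldsymbol{b}_o$ are introduced here only for the sketch:

\begin{align*}
	\boldsymbol{h}^{(t)} &= \tanh \left( \boldsymbol{\Theta}_h \boldsymbol{h}^{(t-1)} + \boldsymbol{\Theta}_x \boldsymbol{x}^{(t)} + \boldsymbol{b}_h \right), \\
	\boldsymbol{o}^{(t)} &= \boldsymbol{\Theta}_o \boldsymbol{h}^{(t)} + \boldsymbol{b}_o.
\end{align*}
\noindent
Here $\theta = \left\{ \boldsymbol{\Theta}_h, \boldsymbol{\Theta}_x, \boldsymbol{\Theta}_o, \boldsymbol{b}_h, \boldsymbol{b}_o \right\}$ is the same at every $t$, so the number of parameters does not grow with the length of the sequence. This is the sense in which parameter sharing helps when the training data are small.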

The RNNs are trained to estimate the conditional distribution of $\boldsymbol{y}^{(t)}$ given the past inputs and also the previous output. This is accomplished by offsetting the input-output pairs by one time period so that the networks receive $\boldsymbol{y}^{(t)}$ as input at $t + 1$ to be conditioned on for predicting subsequent outputs. This popular training procedure is known as teacher forcing because it forces the networks to stay close to the ground-truth output $\boldsymbol{y}^{(t)}$ \citep{lamb2016professor}. Specifically, the RNNs are trained to maximize the log-likelihood

\begin{equation} \label{rnn-obj}
	\log \Pr \left(\boldsymbol{y}^{(t)} | \boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(t)}, \boldsymbol{y}^{(1)}, \ldots, \boldsymbol{y}^{(t-1)} \right).
\end{equation}
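
Because the total loss is the sum of the per-period losses, the objective over a whole output sequence can be sketched as the sum of (\ref{rnn-obj}) over time steps. The symbol $\mathcal{L}(\theta)$ is introduced here for illustration:

\begin{equation*}
	\mathcal{L}(\theta) = \sum_{t=1}^{n_y} \log \Pr \left(\boldsymbol{y}^{(t)} | \boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(t)}, \boldsymbol{y}^{(1)}, \ldots, \boldsymbol{y}^{(t-1)}; \, \theta \right).
\end{equation*}
\noindent
For example, with $n_y = 3$ the terms for $t = 2$ and $t = 3$ condition on the observed $\boldsymbol{y}^{(1)}$ and $\left(\boldsymbol{y}^{(1)}, \boldsymbol{y}^{(2)}\right)$, respectively, rather than on the network's own predictions $\boldsymbol{\hat{y}}^{(1)}$ and $\boldsymbol{\hat{y}}^{(2)}$. Conditioning on the observed values is what distinguishes teacher forcing from feeding the predictions back into the network.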

\subsection{Encoder-decoder networks}

Encoder-decoder networks are the standard for neural machine translation (NMT) \citep{cho2014learning,bahdanau2014neural,vinyals2014grammar} and are also widely used for predictive tasks, including speech recognition \citep{chorowski2015attention} and time-series forecasting \citep{zhu2017deep}. 

The encoder RNN reads in $\boldsymbol{x}^{(t)}$ sequentially and the hidden state of the network updates according to (\ref{eq:hidden}). The hidden state of the encoder is a context vector $\boldsymbol{c}$ that summarizes the input sequence, which is copied over to the decoder RNN. The decoder generates a variable-length output sequence by predicting $\boldsymbol{y}^{(t)}$ given the encoder hidden state and the previous element of the output sequence. Thus, the hidden state of the decoder is updated recursively by

\begin{equation}
	\boldsymbol{h}^{(t)} = f_1 \left( \boldsymbol{h}^{(t-1)}, \boldsymbol{y}^{(t-1)}, \boldsymbol{c}; \theta \right), \label{eq:decoder}
\end{equation} and the conditional probability of the next element of the sequence is 

\begin{equation}
	\Pr (\boldsymbol{y}^{(t)} | \boldsymbol{y}^{(1)}, \ldots, \boldsymbol{y}^{(t-1)}, \boldsymbol{c}) =  f_1 \left( \boldsymbol{h}^{(t)}, \boldsymbol{y}^{(t-1)}, \boldsymbol{c}; \, \theta \right).
\end{equation}  Effectively, the decoder learns to generate outputs $\boldsymbol{y}^{(t)}$ given the previous outputs, conditioned on the input sequence. 

\subsection{Recurrent variational autoencoder}

While the encoder-decoder architecture is effective for many sequential prediction tasks, the model does not learn a vector representation of the entire input. The variational autoencoder (VAE) \citep{kingma2013auto} is a generative model that learns a latent variable model for $\boldsymbol{x}^{(t)}$ such that new sequences $\boldsymbol{x'}^{(t)}$ can be generated by sampling from the latent space $q$. Similar to encoder-decoder networks, the VAE has an encoder that learns a latent representation of the input sequence and a decoder that maps the representation back to the inputs. The VAE architecture differs from encoder-decoder networks in that the VAE does not have a final dense layer that compares the decoder outputs to  $\boldsymbol{x'}^{(t)}$; i.e., it is a ``self-supervised'' technique. Another difference is that the VAE learns parameter weights by mapping the inputs to a distribution over parameters of $q$.

The recurrent VAE (RVAE) \citep{fabius2014variational, chung2015recurrent,bowman2015generating} consists of an encoder RNN that maps $\boldsymbol{x}^{(t)}$ to a distribution over parameters of $q$. The model then randomly samples $\boldsymbol{z}$ from the latent distribution, 

\begin{equation}
	q(\boldsymbol{z} | \boldsymbol{x}^{(t)}) = q (\boldsymbol{z}; f_3 (\boldsymbol{x}^{(t)};\, \theta)).
\end{equation}

Finally, a decoder RNN takes the form of a conditional probability model $\Pr (\boldsymbol{x}^{(t)} | \boldsymbol{z})$. The parameters of the model are learned by maximizing an objective function that takes the difference between the log-likelihood of $\boldsymbol{x}^{(t)}$ under the decoder outputs $\boldsymbol{x'}^{(t)}$ and the relative entropy between $q(\boldsymbol{z} | \boldsymbol{x}^{(t)})$ and the model prior $\Pr (\boldsymbol{z})$. The latter component of the objective acts as a regularizer by forcing the learned latent distribution to be similar to the model prior. 
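
The objective described above corresponds to the variational lower bound of \citet{kingma2013auto}, which can be sketched as follows. The symbols $\mathcal{L}$ and $D_{\text{KL}}$ (the relative entropy) are introduced here for illustration:

\begin{equation*}
	\mathcal{L} \left(\theta; \boldsymbol{x}^{(t)}\right) = \mathrm{E}_{q(\boldsymbol{z} | \boldsymbol{x}^{(t)})} \left[ \log \Pr \left(\boldsymbol{x}^{(t)} | \boldsymbol{z}\right) \right] - D_{\text{KL}} \left( q(\boldsymbol{z} | \boldsymbol{x}^{(t)}) \, \| \, \Pr (\boldsymbol{z}) \right).
\end{equation*}
\noindent
If one assumes, as is common but not stated in this chapter, a diagonal Gaussian $q(\boldsymbol{z} | \boldsymbol{x}^{(t)}) = \mathcal{N}\left(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2)\right)$ and a standard normal prior, the relative entropy has the closed form

\begin{equation*}
	D_{\text{KL}} = \frac{1}{2} \sum_{j} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right),
\end{equation*}
\noindent
which is zero when $\mu_j = 0$ and $\sigma_j^2 = 1$ for every latent dimension $j$ and grows as the learned latent distribution moves away from the prior.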

\section{Placebo tests} \label{placebo}

I conduct placebo tests on actual datasets in order to benchmark the accuracy of RNN-based estimators. There are no actual treated units in the placebo tests, so the estimators are evaluated on their ability to recover a null effect. 

For each trial run, I randomly select half of the units in the dataset to be treated and predict their counterfactual outcomes for periods following a selected $\text{T}_0$. I compare the predicted values to the observed values by calculating the root-mean-square error (RMSE). I benchmark the encoder-decoder networks and RVAE against the following estimators: 
%
\begin{description}
	{\setlength\itemindent{1mm}
		\item[(a) DID] Regression of $\boldsymbol{Y}$ on $\boldsymbol{W}$ and unit and time fixed effects
		\item[(b) MC-NNM] Matrix completion via nuclear norm minimization, with the regularization term on the nuclear norm selected by cross-validation \citep{athey2017matrix}
		\item[(c) SCM] Synthetic control method, estimated via exponentiated gradient descent \citep{abadie2010synthetic}
		\item[(d) VT-EN] Vertical regression with elastic-net regularization, with the regularization and mixing parameters selected by cross-validation \citep{zou2005regularization,athey2017matrix}.
	}
\end{description}
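
The RMSE used to compare predicted and observed values can be written as follows. The chapter does not spell out how the errors are pooled, so this sketch assumes pooling over all placebo-treated post-period entries, with $\mathcal{M}$ here denoting that set of $(it)$ values:

\begin{equation*}
	\text{RMSE} = \sqrt{ \frac{1}{|\mathcal{M}|} \sum_{(it) \in \mathcal{M}} \left( Y_{it} - \hat{Y}_{it} \right)^2 }.
\end{equation*}
\noindent
Because no unit is actually treated in the placebo tests, $Y_{it}$ is observed for every $(it) \in \mathcal{M}$, and an estimator that recovers the null effect exactly would attain an RMSE of zero.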

Implementation details for the encoder-decoder networks and RVAE are provided in Section \ref{imp}. In the placebo tests, the networks are trained using an unweighted MSE loss function for 500 epochs on a 12GB NVIDIA Titan Xp GPU.

\subsection{Synthetic control datasets} \label{synth-placebo}

I first conduct placebo tests on three datasets common to the synthetic control literature, with the actual treated unit removed from each dataset: \possessivecite{abadie2003economic} study of the economic impact of terrorism in the Basque Country during the late 1960s ($\text{N}=16$, $\text{T}=43$); \possessivecite{abadie2010synthetic} study of the effects of a large-scale tobacco control program implemented in California in 1988 ($\text{N}=38$, $\text{T}=31$); and \possessivecite{abadie2015comparative} study of the economic impact of the 1990 German reunification on West Germany ($\text{N}=16$, $\text{T}=44$). Each dataset is log-transformed to alleviate exponential effects.

Figure \ref{california-sim} reports the estimated average prediction error on the California smoking dataset, with the estimates jittered horizontally to reduce overlap. Figures \ref{basque-sim} and \ref{germany-sim} report the estimates for the Basque Country and West Germany datasets, respectively. Error bars are calculated using the standard deviation of the error distribution generated by multiple runs. The RNN-based estimators yield comparable error rates vis-à-vis the alternatives only for high ratios of $\text{T}_0/\text{T}$, which reflects the need for sizeable training sets for the RNN-based approach. The RVAE performs the worst on comparatively small training data since it is learning from less information than the encoder-decoder networks; i.e., without the post-period observations of the control units. The MC-NNM estimator does comparatively well in the simulations because it is capable of using additional information in the form of pre-period observations of the treated units, whereas the other estimators train only on the control observations. 

\begin{figure}[htbp]
	\centering
	\includegraphics[width=0.9\textwidth]{/media/jason/Dropbox/github/rnns-causal/paper/plots/california-sim.png}
	\caption{Placebo tests on California smoking data: 
		{\protect\tikz \protect\draw[color={rgb:red,4;green,0;yellow,1}] (0,0) -- plot[mark=o, mark options={scale=2}] (0.25,0) -- (0.5,0);}, DID;
		{\protect\tikz \protect\draw[color={rgb:red,244;green,226;blue,66}] (0,0) -- plot[mark=triangle*, mark options={scale=2,fill=white}] (0.25,0) -- (0.5,0);}, ED; 
		{\protect\tikz \protect\draw[color={rgb:red,0;green,5;blue,1}] (0,0) -- plot[mark=+, mark options={scale=2}] (0.25,0) -- (0.5,0);}, MC-NNM;
		{\protect\tikz \protect\draw[color={rgb:red,66;green,200;blue,244}] (0,0) -- plot[mark=x, mark options={scale=2}] (0.25,0) -- (0.5,0);}, RVAE;
		{\protect\tikz \protect\draw[color={rgb:red,66;green,107;blue,244}] (0,0) -- plot[mark=diamond, mark options={scale=2}] (0.25,0) -- (0.5,0);}, SCM;
		{\protect\tikz \protect\draw[color={rgb:red,244;pink,66;blue,223}] (0,0) -- plot[mark=triangle, mark options={scale=2, rotate=180}] (0.25,0) -- (0.5,0);}, VT-EN.\label{california-sim}}
\end{figure}

\subsection{Stock market data}

The second battery of placebo tests draws on a dataset of stock market returns compiled by \citet{athey2017matrix}. The dataset consists of daily returns for 2,453 stocks over 3,082 days. In order to track how the error rates vary according to the dimensionality of the data, I create six sub-samples of the first $T$ daily returns of $N$ randomly selected stocks for the $(\text{N}, \text{T})$ pairs $\left\{(10, 490), (20, 245), (50, 98), (70, 70), (100, 49), (140, 35)\right\}$. In each sub-sample, half of the units are randomly selected as treated, and $\text{T}_0 = \text{T}/2$.

Figure \ref{stock-sim} reports the average RMSE for each pair with standard errors informed by the error distribution generated by five trial runs. The average RMSE is the lowest for all estimators in the sub-sample $(\text{N}, \text{T}) = (10, 490)$, which reflects the benefit of training on a large number of time periods. Within this sub-sample, encoder-decoder networks and RVAE achieve the lowest average RMSE, followed by MC-NNM, SCM, DID, and lastly, vertical regression. The RNN-based estimators do comparatively less well when $N \gg T$ since there is not an adequate number of training set pre-periods to learn a concise representation of the inputs. 

\begin{figure}[htbp]
	\centering
	\includegraphics[width=0.9\textwidth]{/media/jason/Dropbox/github/rnns-causal/paper/plots/stock-sim.png}
	\caption{Placebo tests on stock market data: 
		{\protect\tikz \protect\draw[color={rgb:red,4;green,0;yellow,1}] (0,0) -- plot[mark=o, mark options={scale=2}] (0.25,0) -- (0.5,0);}, DID;
		{\protect\tikz \protect\draw[color={rgb:red,244;green,226;blue,66}] (0,0) -- plot[mark=triangle*, mark options={scale=2,fill=white}] (0.25,0) -- (0.5,0);}, ED; 
		{\protect\tikz \protect\draw[color={rgb:red,0;green,5;blue,1}] (0,0) -- plot[mark=+, mark options={scale=2}] (0.25,0) -- (0.5,0);}, MC-NNM;
		{\protect\tikz \protect\draw[color={rgb:red,66;green,200;blue,244}] (0,0) -- plot[mark=x, mark options={scale=2}] (0.25,0) -- (0.5,0);}, RVAE;
		{\protect\tikz \protect\draw[color={rgb:red,66;green,107;blue,244}] (0,0) -- plot[mark=diamond, mark options={scale=2}] (0.25,0) -- (0.5,0);}, SCM;
		{\protect\tikz \protect\draw[color={rgb:red,244;pink,66;blue,223}] (0,0) -- plot[mark=triangle, mark options={scale=2, rotate=180}] (0.25,0) -- (0.5,0);}, VT-EN.\label{stock-sim}}
\end{figure}

\section{Application: Homestead policy and public schooling} \label{schooling-app}

In the empirical application, I apply the RNN-based approach to the problem of estimating the long-run impacts of the Homestead Act (HSA) on state government public education spending. Sociologists and political economists \citep[e.g.,][]{meyer1979public,alesina2013nation,bandiera2018nation} have viewed the rapid development of public schooling in the U.S. during the 19th century as a nation-building policy. It is argued that states across the U.S. adopted compulsory primary education as a means to homogenize the population during the `Age of Mass Migration', when tens of millions of foreign migrants arrived in the country between 1850 and 1914. 

An alternative explanation for the rise of public schooling is the view of \citet{engerman2005evolution}, that frontier state governments sought to increase public investments in order to attract eastern migrants following the passage of the Homestead Act (HSA) of 1862. The HSA opened for settlement hundreds of millions of acres of frontier land. Any adult citizen could apply for a homestead grant of 160 acres of land, provided that they lived on and made improvements to the land for five years. According to this view, the sparse population on the frontier meant that state and local governments competed with each other to attract migrants in order to lower local labor costs and to increase land values and tax revenues. Frontier governments offered migrants broad access to cheap land and property rights, unrestricted voting rights, and a more generous provision of schooling and other public goods.

The HSA may also have expanded investments in public schooling by reducing the degree of land inequality on the frontier. Homestead policies are expected to lower land inequality by fixing land grants to 160 acres, thereby encouraging farm sizes to approach their ideal scale. Political economy frameworks \citep[e.g.,][]{acemoglu2008persistence, besley2009origins} emphasize that greater economic power of the ruling class reduces public investments. In the model of \citet{galor2009inequality}, wealthy landowners block education reforms because public schooling favors industrial labor productivity and decreases the value of farm rents. Inequality in this context can be thought of as a proxy for the amount of \emph{de facto} political influence elites have to block reforms.

\subsection{Data and assumptions} \label{educ-data}

I create a state-level measure of state government education spending from the records of 48 state governments during the period of 1783 to 1932 \citep{sylla1993sources} and the records of 16 state governments during the period of 1933 to 1937 \citep{sylla1995sourcesa,sylla1995sourcesb}. Comparable measures for 48 states are drawn from U.S. Census special reports for the years 1902, 1913, 1932, and 1942 \citep{haines2010}.

The data pre-processing steps are as follows. The measure is inflation-adjusted according to the U.S. Consumer Price Index \citep{williamson2017seven} and scaled by the total free population in the decennial census \citep{haines2010}. Missing values are imputed separately in the pre- and post-periods by carrying the last observation forward, and remaining missing values are imputed by carrying the next observation backward. The data are log-transformed to alleviate exponential effects. Lastly, I remove states with no variance in the pre-period outcomes, resulting in a complete matrix of size $(\text{N} \times \text{T})= (32 \times 156)$. 

In this application, public land states, i.e., states crafted from the public domain, serve as treated units (i.e., the test set). State land states, which include states of the original 13 colonies, Maine, Tennessee, Texas, Vermont, and West Virginia, were not directly affected by homestead policies and therefore serve as control units (i.e., the training set). The RNN-based approach assumes the distributions of $\boldsymbol{X}^{\text{train}}$ and $\boldsymbol{X}^{\text{test}}$ are similar.

I weight the training loss by propensity scores in order to minimize the discrepancy between the distributions of training and test set inputs. The propensity scores are estimated via logistic regression with unit-specific, pre-period covariates including state-level average farm sizes measured in the 1860 census and average farm values measured in the 1850 and 1860 censuses \citep{haines2010} to control for homesteaders migrating to more productive land. To control for selection bias arising from differences in access to frontier lands, I create a measure of total miles of operational track per square mile aggregated to the state level using digitized railroad maps provided by \citet{atack2013use}. Figure \ref{educ-dense} shows that the training and test set input distributions weighted by the propensity scores are visually similar.\footnote{However, a weighted two-sided $t$-test rejects the null of equivalence for the difference-in-means between the two distributions ($t= \boldsymbol{\bar{X}}^{\text{train}} - \boldsymbol{\bar{X}}^{\text{test}} = -0.86$; $\sigma_t = 0.07$; $p < 0.01$).}

Aggregating to the state level approximately 1.46 million individual land patent records authorized under the HSA, I determine that the earliest homestead entries occurred in 1869 in about half of the frontier states, about seven years following the enactment of the HSA.\footnote{Land patent records provide information on the initial transfer of land titles from the federal government and are published online by the U.S. General Land Office (\url{https://glorecords.blm.gov}).} Using this information, I set $\text{T}_0 = 87$, which leaves $\text{T}_\star = 69$ time periods when half of the states are exposed to treatment. While the approach assumes that treatment adoption is simultaneous across states, the date of initial treatment exposure varied as new frontier land opened between 1869 and 1902.\footnote{The assumption of simultaneous adoption is standard for DID estimation \citep{doudchenko2016balancing}.} Also note that while the no-interference assumption cannot be tested directly, it is likely that state land states were indirectly affected by the out-migration of homesteaders to frontier states. Interference in this case would lead to an underestimate of the effect of the intervention because it would make the counterfactual and observed treated unit observations in the post-period more similar. 

\subsection{Estimates} 

Prior to analyzing the data, I conduct placebo tests on the education spending data similar to those described in Section \ref{synth-placebo}. Figure \ref{educ-sim} presents the average RMSE calculated on the control unit outcomes with standard errors originating from 10 runs. In line with the previous placebo tests, the RNN-based estimators yield error rates comparable to the alternative estimators only when there are sufficient pre-period observations to train on; in this case, when $\text{T}_0/\text{T} \geq 0.5$. We can be reasonably confident that the RNN-based estimators will be at least as accurate as the other estimators since $\text{T}_0/\text{T} = 0.55$ in this application. 

Next, I train an encoder-decoder network on the training set of state land states and use the learned weights to predict the counterfactual outcomes of public land states. The top panel of Figure \ref{educ-ed} compares the average outcomes of treated units and control units along with the average predicted outcomes of treated units. The dashed vertical line represents the first year of treatment exposure in 1869. We are primarily interested in the difference between the observed and predicted treated unit outcomes, which is the quantity $\boldsymbol{\bar{\upphi}^{(t)}}$. These per-period average causal impacts are plotted in the bottom panel and are bounded by 95\% randomization confidence intervals, which are estimated following the procedure described in Section \ref{eval}. 
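
To make the plotted quantity concrete, the per-period average impact can be sketched as follows. The notation other than $\boldsymbol{\bar{\upphi}^{(t)}}$ is illustrative and may differ from that used in Section \ref{eval}: let $N_1$ be the number of treated units, $Y_i^{(t)}$ the observed outcome of treated unit $i$ at time $t$, and $\hat{Y}_i^{(t)}(0)$ its predicted outcome in the absence of treatment. Then, for each post-period time step,
\[
\boldsymbol{\bar{\upphi}^{(t)}} = \frac{1}{N_1} \sum_{i=1}^{N_1} \left( Y_i^{(t)} - \hat{Y}_i^{(t)}(0) \right), \qquad t = \text{T}_0 + 1, \ldots, \text{T}_0 + \text{T}_\star .
\]
A positive value indicates that observed spending in treated states exceeds the spending predicted for them had they not been exposed to homesteads.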

Counterfactual predictions of state government education spending in the absence of the HSA generally track the observed control time-series until the turn of the 20$^\text{th}$ century, at which point the counterfactual flattens and diverges from the increasing observed control time-series. This delay can potentially be explained by the fact that homestead entries did not substantially accumulate until after Congress prohibited the sale of public land in 1889 in all states except Missouri \citep{gates1941land,gates1979federal}. 

Taking the mean of post-period impacts, I estimate that the impact of the HSA on the state government education spending of states exposed to homesteads is 0.69 [$-0.19$, 2.01]. The confidence interval surrounding this estimate contains zero, which implies that the estimated impact is not significantly more extreme than the exact distribution of average placebo effects under the null hypothesis. Examining the time-specific causal estimates reveals that in 1919, fifty years after the first homestead entry, the estimated impact of the HSA on state government education spending is 0.68 log points [0.13, 1.24]. The confidence interval surrounding this time-specific estimate does not contain zero, which implies that the estimated impact is significantly more extreme than the average placebo effects. To put the magnitude of the point estimate in perspective, it represents about 3\% of the total school expenditures per capita in 1929 \citep{snyder2010digest}.
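
For reference, the mean of post-period impacts can be written in terms of the per-period quantities $\boldsymbol{\bar{\upphi}^{(t)}}$. The symbol $\bar{\upphi}$ for the pooled average is introduced here only for illustration:
\[
\bar{\upphi} = \frac{1}{\text{T}_\star} \sum_{t = \text{T}_0 + 1}^{\text{T}_0 + \text{T}_\star} \boldsymbol{\bar{\upphi}^{(t)}}, \qquad \text{T}_0 = 87, \quad \text{T}_\star = 69 .
\]
Because this average pools all 69 post-period time steps, including the early years in which the counterfactual and observed series remain close, a pooled estimate whose interval contains zero is compatible with individual time-specific estimates whose intervals do not.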

\begin{figure}[htbp]
	\centering
	\includegraphics[width=0.9\textwidth]{/media/jason/Dropbox/github/rnns-causal/paper/plots/educ-ed.png}
	\caption{Encoder-decoder estimates of the impact of the HSA on state government education spending, 1809 to 1942:		{\color{Darjeeling15}{\sampleline{}}}, observed treated;
		{\color{Darjeeling11}{\sampleline{dashed}}}, observed control;
		{\color{Darjeeling15}{\sampleline{dotted}}}, counterfactual treated;
		{\color{Darjeeling15}{\sampleline{dash pattern=on .7em off .2em on .05em off .2em}}}, $\boldsymbol{\bar{\upphi}^{(t)}}$.\label{educ-ed}} 
\end{figure}

\subsection{Sensitivity to imputation method}

The previously described estimates imply that homestead policy had no statistically significant overall long-term impact on state education spending. How much of this conclusion depends on the imputation procedure? I compare the following four imputation methods used for time-series analysis in the presence of missing values:

%
\begin{description}
	{\setlength\itemindent{1mm}
		\item[(a) Linear interpolation] Replace missing values by linear interpolation between the nearest non-missing values.
		\item[(b) LOCF] Replace each missing value with the most recent non-missing value prior to it (Last Observation Carried Forward); any remaining missing values are imputed by applying LOCF in reverse.
		\item[(c) Median replacement] Replace missing values with the median of the training set.
		\item[(d) Random replacement] Replace each missing value with a random draw between the minimum and the maximum non-missing values in the data.
	}
\end{description}
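
The four rules can be sketched more formally as follows. The notation is introduced here for illustration only: $y^{(t)}$ is a unit's outcome at time $t$, $\mathcal{O}$ is the set of time periods in which it is observed, $a = \max\{s \in \mathcal{O} : s < t\}$ and $b = \min\{s \in \mathcal{O} : s > t\}$ are the nearest observed periods before and after a missing period $t$, and $\tilde{y}^{(t)}$ is the imputed value. The uniform distribution in (d) is an assumption, since the description above specifies only the range of the draw.
\[
\begin{array}{ll}
\mbox{(a)} & \tilde{y}^{(t)} = y^{(a)} + \displaystyle\frac{t - a}{b - a} \left( y^{(b)} - y^{(a)} \right) \\[2ex]
\mbox{(b)} & \tilde{y}^{(t)} = y^{(a)} \mbox{ if $a$ exists, and } \tilde{y}^{(t)} = y^{(b)} \mbox{ otherwise} \\[1ex]
\mbox{(c)} & \tilde{y}^{(t)} = \mbox{median of the non-missing training set values} \\[1ex]
\mbox{(d)} & \tilde{y}^{(t)} \sim \mbox{Uniform}\left( y_{\min}, \, y_{\max} \right)
\end{array}
\]
Here $y_{\min}$ and $y_{\max}$ denote the minimum and maximum non-missing values in the data. Methods (a) and (b) use only temporally adjacent information within a series, whereas (c) and (d) ignore the time ordering entirely.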

Note that LOCF (b) is the imputation method used in the previous section. Also note that each imputation procedure is performed separately on the training and test sets to ensure that the networks do not learn from the unseen test data. I train encoder-decoder networks (Figure \ref{educ-ed-imp}) and RVAEs (Figure \ref{educ-rvae-imp}) on each differently imputed dataset and present the results in Table \ref{educ-sens}. The encoder-decoder causal estimates are generally insensitive to the choice of imputation method. The exception is that the confidence bound does not include zero when missing values are randomly replaced, and in this case, the estimates imply a positive and statistically significant impact of homestead policy on education spending. 

\begin{table}[htbp]
	\captionsetup{font=normalsize}
	\caption{Causal impacts on education spending by RNN architecture and imputation method.\label{educ-sens}}
	\begin{center}
		\scalebox{.9}{\input{/media/jason/Dropbox/github/rnns-causal/paper/educ-sens.tex}}
	\end{center}
\end{table}

The RVAE estimates tend to be larger in magnitude and have wider confidence bands, which suggests more uncertainty compared to the encoder-decoder estimates. The RVAE estimates, which tend to be positive and statistically significant, should be interpreted with caution since the RVAE is a self-supervised model that learns without outputs (i.e., the post-period observations of the control units). 

\section{Conclusion} \label{ch3-conclusion}

This paper makes a methodological contribution in proposing a novel alternative to the SCM for estimating the effect of a policy intervention on an outcome over time. The SCM is growing in popularity in the social sciences despite its limitations --- the most obvious being that the choice of specification can lead to different results, and thus facilitate $p$-hacking. By inputting only control unit outcomes and not relying on pre-period covariates, the proposed method offers a more principled approach than the SCM. 

The RNN-based approach joins a new generation of data-driven machine learning techniques for generating counterfactual predictions. Machine learning techniques in general have an advantage over the SCM in that they automatically choose appropriate predictors without relying on pretreatment covariates; this capability limits `researcher degrees of freedom' that arise from choices on how to specify the model. RNNs do not assume a specific functional form, can learn nonconvex combinations of control units, and are specifically structured to exploit temporal dependencies in the data. RNNs are also capable of handling multiple treated units, which is useful because the model can share parameters across treated units, and thus generate more precise predictions in settings in which treated units share similar data-generating processes. 

In placebo tests, RNN-based estimators perform comparatively worse than the alternatives on small-dimensional datasets such as those featured in the original synthetic control papers. Both RNN-based estimators require sufficient pre-period observations in order to learn an informative representation of the control units. The RVAE in particular requires a large amount of training data since it is a self-supervised method that learns without outputs. In higher-dimensional datasets such as the stock market data, the RNN-based methods generally outperform the alternatives when $N \ll T$. The estimators underperform when $N \gg T$, which again reflects the need for sufficient pre-period observations. 

The matrix completion method performs well in either case, despite its disadvantage of treating the data as static and thus ignoring the temporal component of the data. A built-in advantage of the matrix completion approach is that it does not assume a specific structure to the treatment assignment mechanism and thus can accommodate settings in which the time of initial treatment exposure varies across treated units. One potential avenue for future research is to integrate RNNs into the matrix completion approach by training multidirectional RNNs \citep[e.g.,][]{yoon2018estimating} to both impute missing values across the unit dimension and interpolate missing values within the time dimension. 

A second area of future research would explore ways to relax the assumption of equivalence between the distributions of training and test set inputs, beyond propensity score reweighting of the training loss. An alternative approach is to treat the problem of counterfactual prediction like an NMT problem by training the networks on the pre-period outcomes of control units to predict those of treated units. The learned model weights would then be applied to the post-period outcomes of control units at test time. This setup would instead assume equivalence between the distributions of pre- and post-period outcomes of control units, which is more likely to be satisfied in the absence of interference between treated and control units. 
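
As a rough sketch of this alternative setup, with notation introduced here purely for illustration, let $\mathbf{Y}_{c}^{(t)}$ and $\mathbf{Y}_{tr}^{(t)}$ collect the outcomes of control and treated units at time $t$, let $f_{\theta}$ be a network with weights $\theta$ that maps the control history up to time $t$ to the treated outcomes at $t$, and let $\ell$ be a prediction loss. Training uses only the pre-period, and prediction uses only post-period control outcomes:
\[
\hat{\theta} = \arg\min_{\theta} \sum_{t=1}^{\text{T}_0} \ell\left( f_{\theta}\big( \mathbf{Y}_{c}^{(1:t)} \big), \, \mathbf{Y}_{tr}^{(t)} \right), \qquad
\hat{\mathbf{Y}}_{tr}^{(t)}(0) = f_{\hat{\theta}}\big( \mathbf{Y}_{c}^{(1:t)} \big) \quad \mbox{for } t > \text{T}_0 .
\]
The exact form of the inputs (for example, a fixed-length window rather than the full history) is left open; the point of the sketch is only that the mapping is learned across units within the pre-period rather than across time within the control group.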

In the empirical application, I estimate the causal impacts of the HSA on state government education spending. I find that homestead policy had positive long-run impacts on public education spending, although the impacts are not statistically significant when averaging across the entire post-intervention period. Time-specific causal estimates suggest that the HSA had positive and significant impacts on state government education spending fifty years after the first homestead entry in 1869. The estimated increase in education spending attributable to homestead policy translates to about 3\% of the total school expenditures per capita in 1929.