# Prediction and Estimation
In modern statistics, econometrics, and machine learning, two primary goals often motivate data analysis:
1. **Prediction**: To build a function $\hat{f}$ that **accurately predicts** an outcome $Y$ from observed features (predictors) $X$.
2. **Estimation or Causal Inference**: To **uncover and quantify** the relationship (often causal) between $X$ and $Y$, typically by estimating parameters like $\beta$ in a model $Y = g(X; \beta)$.
These goals, while superficially similar, rest on distinct philosophical and mathematical foundations. Below, we explore the difference in detail, illustrating key ideas with formal definitions, theorems, proofs (where relevant), and references to seminal works.
## Conceptual Framing
### Predictive Modeling
Predictive modeling focuses on building a function $\hat{f}: \mathcal{X} \rightarrow \mathcal{Y}$ that maps inputs $X$ to outputs $Y$. For simplicity, assume:
- $X \in \mathbb{R}^p$ (though in practice $X$ can be images, text, time series, etc.).
- $Y \in \mathbb{R}$ for regression or $Y \in \{0, 1\}$ (or other finite set) for classification.
The yardstick for success is the function's accuracy in out-of-sample predictions, often measured by a loss function $L(\hat{y}, y)$. We typically choose $\hat{f}$ to minimize expected loss:
$$
\text{(Predictive Problem)} \quad \hat{f} = \arg \min_{f \in \mathcal{F}} \mathbb{E}[L(f(X), Y)],
$$
where $\mathcal{F}$ is a class of functions (models) and $\mathbb{E}[\cdot]$ is taken over the joint distribution of $(X, Y)$.
### Estimation or Causal Inference
By contrast, estimation or causal inference generally aims to uncover the underlying mechanism: *how does* $X$ *(or a particular component* $T \subseteq X$*) cause changes in* $Y$? The canonical problem is to estimate parameters $\beta$ in a model $m_\beta(x)$ such that:
$$
Y = m_\beta(X) + \varepsilon,
$$
or, in linear form,
$$
Y = X\beta + \varepsilon.
$$
A variety of statistical properties---**consistency, unbiasedness, efficiency, confidence intervals, hypothesis tests**---are relevant here. *Causal* interpretations usually require assumptions beyond typical i.i.d. sampling: unconfoundedness, exogeneity, or random assignment, so that $\beta$ indeed captures how changes in $X$ cause changes in $Y$.
Key Distinction:
- **Prediction** does not require that the parameters used in $\hat{f}$ reflect any real-world mechanism. As long as out-of-sample predictive performance is good, the model is deemed successful---even if it's a "black box."
- **Causal inference** demands interpretability in terms of structural or exogenous relationships. The main objective is consistent estimation of the true (or theoretically defined) parameter $\beta$, which has an economic, biomedical, or policy interpretation.
------------------------------------------------------------------------
## Mathematical Setup
### Probability Space and Data
We posit a probability space $(\Omega, \mathcal{F}, P)$ and random variables $(X, Y)$ on it. We typically have an i.i.d. sample $\{(X_i, Y_i)\}_{i=1}^n$ from the true distribution $\mathcal{D}$. Let:
$$
(X, Y) \sim \mathcal{D}, \quad (X_i, Y_i) \overset{\text{i.i.d.}}{\sim} \mathcal{D}.
$$
In prediction, we train on $\{(X_i, Y_i)\}_{i=1}^n$ to obtain $\hat{f}$, and we evaluate on a test point $(\tilde{X}, \tilde{Y})$ drawn from $\mathcal{D}$. In causal inference, we scrutinize the data generating process carefully, ensuring that we can identify a causal effect. For example, we may require:
- Potential outcomes $\{Y_i(0), Y_i(1)\}$ for treatment effect settings.
- Unconfoundedness or randomization assumptions.
### Loss Functions and Risk
A general framework for both tasks is the risk minimization approach. For a function $f$, define:
- The population (or expected) risk: $$
\mathcal{R}(f) = \mathbb{E}[L(f(X), Y)].
$$
- The empirical risk (on a sample of size $n$): $$
\hat{\mathcal{R}}_n(f) = \frac{1}{n} \sum_{i=1}^n L(f(X_i), Y_i).
$$
**Prediction:** We often solve the empirical risk minimization (ERM) problem:
$$
\hat{f} = \arg \min_{f \in \mathcal{F}} \hat{\mathcal{R}}_n(f),
$$
possibly with regularization. The measure of success is $\mathcal{R}(\hat{f})$, i.e., how well $\hat{f}$ generalizes beyond the training sample.
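As a minimal sketch of ERM under squared-error loss (simulated data; variable names are illustrative), the closed-form least-squares fit from `lm()` coincides with direct numerical minimization of the empirical risk:
```{r}
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
# Empirical risk of a linear predictor f(x) = b0 + b1 * x under squared error
emp_risk <- function(b) mean((y - (b[1] + b[2] * x))^2)
# Direct numerical ERM ...
optim(c(0, 0), emp_risk)$par
# ... agrees with the closed-form least-squares solution
coef(lm(y ~ x))
```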
**Causal/Parameter Estimation:** We might define an $M$-estimator for $\beta$ [@newey1994large]. Consider a function $\psi(\beta; X, Y)$ such that the true parameter $\beta_0$ satisfies:
$$
\mathbb{E}[\psi(\beta_0; X, Y)] = 0.
$$
The empirical $M$-estimator solves
$$
\hat{\beta} = \arg \min_\beta \left\| \frac{1}{n} \sum_{i=1}^n \psi(\beta; X_i, Y_i) \right\|,
$$
or equivalently sets it to zero in a method-of-moments sense:
$$
\frac{1}{n} \sum_{i=1}^n \psi(\hat{\beta}; X_i, Y_i) = 0.
$$
Properties like consistency ($\hat{\beta} \overset{p}{\to} \beta_0$) or asymptotic normality ($\sqrt{n}(\hat{\beta} - \beta_0) \overset{d}{\to} N(0, \Sigma)$) are central. The emphasis is on *uncovering the true* $\beta_0$ rather than purely predictive accuracy.
------------------------------------------------------------------------
## Prediction in Detail
### Empirical Risk Minimization and Generalization
In supervised learning, the goal is to find a function $f$ from a class of candidate models $\mathcal{F}$ (e.g., linear models, neural networks, tree-based models) that accurately predicts an outcome $Y$ given an input $X$. This is typically formulated as an **Empirical Risk Minimization problem**, where we seek to minimize the average loss over the training data:
$$
\hat{f} = \arg \min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n L(f(X_i), Y_i),
$$
where $L(\cdot, \cdot)$ is a **loss function** that quantifies the error between predictions and actual values. Common choices include:
- **Squared Error (Regression):** $L(\hat{y}, y) = (\hat{y} - y)^2$.
- **Absolute Error (Regression):** $L(\hat{y}, y) = |\hat{y} - y|$.
- **Logistic Loss (Classification):** $L(\hat{p}, y) = -[y \log \hat{p} + (1 - y) \log(1 - \hat{p})]$.
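For concreteness, these losses are one-liners in R (a sketch; `yhat` and `phat` denote a predicted value and a predicted probability):
```{r}
sq_loss  <- function(yhat, y) (yhat - y)^2    # squared error
abs_loss <- function(yhat, y) abs(yhat - y)   # absolute error
log_loss <- function(phat, y) -(y * log(phat) + (1 - y) * log(1 - phat))  # logistic
c(sq = sq_loss(1.5, 1), abs = abs_loss(1.5, 1), log = log_loss(0.8, 1))
```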
By minimizing empirical risk, we find a function $\hat{f}$ that best fits the observed data. However, **minimizing training error does not guarantee good generalization**---the ability of $\hat{f}$ to perform well on unseen data.
#### Overfitting and Regularization
If $\mathcal{F}$ is very large or expressive (e.g., deep neural networks with millions of parameters), $\hat{f}$ can become too **complex**, learning patterns that exist in the training set but do not generalize to new data. This is called **overfitting**.
To mitigate overfitting, we introduce **regularization**, modifying the optimization objective to penalize complex models:
$$
\hat{f}_\lambda = \arg \min_{f \in \mathcal{F}} \left\{ \hat{\mathcal{R}}_n(f) + \lambda \Omega(f) \right\},
$$
where:
- $\hat{\mathcal{R}}_n(f) = \frac{1}{n} \sum_{i=1}^{n} L(f(X_i), Y_i)$ is the empirical risk.
- $\Omega(f)$ is a **complexity penalty** that discourages overly flexible models.
- $\lambda$ controls the strength of regularization.
Common choices of $\Omega(f)$ include:
- **LASSO penalty:** $\|\beta\|_1$ (sparsity constraint in linear models).
- **Ridge penalty:** $\|\beta\|_2^2$ (shrinking coefficients to reduce variance).
- **Neural network weight decay:** $\sum w^2$ (prevents exploding weights).
Regularization encourages **simpler models**, which are more likely to generalize well.
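As a sketch of regularized ERM in practice (assuming the `glmnet` package is available; `alpha = 1` gives the LASSO penalty, `alpha = 0` the ridge penalty), with $\lambda$ chosen by cross-validation:
```{r}
library(glmnet)
set.seed(2)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, rep(0, p - 2))           # sparse truth: only two signals
y <- X %*% beta + rnorm(n)
cv_lasso <- cv.glmnet(X, y, alpha = 1)    # lambda tuned by cross-validation
coef(cv_lasso, s = "lambda.min")[1:5, ]   # most coefficients shrunk to (near) zero
```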
#### Generalization and Statistical Learning Theory
A fundamental question in machine learning is: *How well does* $\hat{f}$ perform on unseen data? This is captured by the **expected risk**:
$$
R(f) = \mathbb{E}[L(f(X), Y)].
$$
Ideally, we want to minimize the gap between the **true risk** $R(\hat{f})$ and the best possible risk $R(f^*)$ within $\mathcal{F}$:
$$
R(\hat{f}) - \min_{f \in \mathcal{F}} R(f).
$$
This difference, called the **excess risk**, measures how well $\hat{f}$ generalizes beyond the training sample. **Statistical Learning Theory** provides theoretical tools to analyze this gap [@vapnik2013nature; @hastie2009elements]. In particular, it establishes generalization bounds that depend on the **capacity** of the function class $\mathcal{F}$.
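A simple simulation makes the excess-risk story concrete: as model capacity grows (here, polynomial degree), training error falls monotonically while test error eventually rises. This is a sketch with simulated data:
```{r}
set.seed(3)
n <- 50
x_tr <- runif(n, -2, 2);    y_tr <- sin(x_tr) + rnorm(n, sd = 0.3)
x_te <- runif(1000, -2, 2); y_te <- sin(x_te) + rnorm(1000, sd = 0.3)
errs <- sapply(1:10, function(d) {
  fit <- lm(y_tr ~ poly(x_tr, d))
  c(train = mean(resid(fit)^2),
    test  = mean((y_te - predict(fit, data.frame(x_tr = x_te)))^2))
})
round(errs, 3)  # training error keeps falling; test error bottoms out, then climbs
```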
#### Complexity Measures
Two important ways to quantify the complexity of $\mathcal{F}$ are:
- [VC Dimension]
- [Rademacher Complexity]
##### VC Dimension
The VC dimension measures the ability of a hypothesis class $\mathcal{F}$ to fit arbitrary labels. Formally, the VC dimension of $\mathcal{F}$, denoted as $\operatorname{VC}(\mathcal{F})$, is the largest number of points that can be shattered by some function in $\mathcal{F}$.
- A set of points is shattered by $\mathcal{F}$ if, for every possible labeling of these points, there exists a function $f \in \mathcal{F}$ that perfectly classifies them.
**Example 1: Linear Classifiers in 2D**
- Consider a set of points in $\mathbb{R}^2$ (the plane).
- If $\mathcal{F}$ consists of linear decision boundaries, three points in general position can be shattered (a line can realize every possible labeling of them).
- However, four points cannot always be shattered (e.g., if arranged in an XOR pattern).
- Thus, the VC dimension of linear classifiers in $\mathbb{R}^2$ is 3.
**Key Property:**
- A higher VC dimension means a more expressive model class (higher capacity).
- If $\operatorname{VC}(\mathcal{F})$ is too large, the model can memorize the training set, leading to poor generalization.
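The XOR configuration can be checked directly: a logistic regression (linear decision boundary) cannot realize that labeling, and its fitted probabilities collapse to 0.5 for all four points. A quick sketch:
```{r}
xor_df <- data.frame(x1 = c(0, 0, 1, 1),
                     x2 = c(0, 1, 0, 1),
                     y  = c(0, 1, 1, 0))   # the XOR labeling
fit <- suppressWarnings(glm(y ~ x1 + x2, family = binomial, data = xor_df))
predict(fit, type = "response")            # all 0.5: no line shatters these points
```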
##### Rademacher Complexity
VC dimension is a combinatorial measure, but Rademacher complexity is a more refined, data-dependent measure of function class flexibility.
**Intuition:** Rademacher complexity quantifies how well functions in $\mathcal{F}$ can **correlate with random noise**. If a function class can fit random labels well, it is too flexible and likely to overfit.
**Definition:**\
Given $n$ training samples, let $\sigma_1, \dots, \sigma_n$ be **independent Rademacher variables** (i.e., random variables taking values $\pm1$ with equal probability). The **empirical Rademacher complexity** of $\mathcal{F}$ is:
$$
\hat{\mathcal{R}}_n(\mathcal{F}) = \mathbb{E}_{\sigma} \left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(X_i) \right].
$$
**Interpretation:**
- If $\hat{\mathcal{R}}_n(\mathcal{F})$ is large, then $\mathcal{F}$ can fit random noise well $\Rightarrow$ high risk of overfitting.
- If $\hat{\mathcal{R}}_n(\mathcal{F})$ is small, then $\mathcal{F}$ is more stable $\Rightarrow$ better generalization.
**Example 2: Linear Models with Bounded Norm**
- Suppose $\mathcal{F}$ consists of linear models $f(X) = w^\top X$, where $\|w\| \leq C$.
- The **Rademacher complexity** of this class scales as $\mathcal{O}(C/\sqrt{n})$.
- This suggests that controlling the norm of $w$ (e.g., via [Ridge Regression]) improves generalization.
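For this Euclidean-ball class, the supremum has the closed form $\sup_{\|w\| \le C} \frac{1}{n} \sum_i \sigma_i w^\top X_i = \frac{C}{n} \bigl\| \sum_i \sigma_i X_i \bigr\|_2$, so the empirical Rademacher complexity can be estimated by Monte Carlo over $\sigma$ (a sketch with Gaussian inputs):
```{r}
rademacher <- function(n, p = 5, C = 1, B = 2000) {
  X <- matrix(rnorm(n * p), n, p)
  mean(replicate(B, {
    sigma <- sample(c(-1, 1), n, replace = TRUE)
    (C / n) * sqrt(sum(colSums(sigma * X)^2))  # (C/n) * ||sum_i sigma_i X_i||_2
  }))
}
set.seed(4)
sapply(c(100, 400, 1600), rademacher)  # roughly halves as n quadruples: O(C / sqrt(n))
```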
### Bias-Variance Decomposition
For a regression problem with squared-error loss, a classic decomposition is:
$$
\mathbb{E}\bigl[(\hat{f}(x) - Y)^2 \mid X = x\bigr] = \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f^*(x)\bigr)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\bigl[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\bigr]}_{\text{Variance}} + \underbrace{\sigma_\varepsilon^2}_{\text{Irreducible Error}}
$$
where $f^*(x) = \mathbb{E}[Y \mid X = x]$ and the expectations over $\hat{f}$ are taken with respect to the randomness of the training sample. Minimizing the sum of bias$^2$ and variance is key.
In prediction, a small increase in bias is often acceptable if it yields a large reduction in variance---this can improve out-of-sample performance. However, for causal inference, any added bias is problematic if it distorts the interpretation of parameters.
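A Monte Carlo sketch of this tradeoff: at a fixed test point, ridge predictions carry a small squared bias but markedly lower variance than OLS (simulated data; the penalty value is illustrative):
```{r}
set.seed(5)
n <- 30; beta <- c(1, 2); x0 <- c(1, 0.5)        # f*(x0) = 1 + 2 * 0.5 = 2
sims <- replicate(5000, {
  X <- cbind(1, rnorm(n)); y <- X %*% beta + rnorm(n)
  ols   <- solve(crossprod(X), crossprod(X, y))                # OLS coefficients
  ridge <- solve(crossprod(X) + 5 * diag(2), crossprod(X, y))  # ridge, lambda = 5
  c(ols = sum(x0 * ols), ridge = sum(x0 * ridge))              # predictions at x0
})
apply(sims, 1, function(p) c(bias2 = (mean(p) - 2)^2, variance = var(p)))
```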
### Example: Linear Regression for Prediction
Consider a linear predictor:
$$
\hat{y} = x^\top \hat{\beta}.
$$
We choose $\hat{\beta}$ to minimize:
$$
\sum_{i=1}^n (y_i - x_i^\top \beta)^2 \quad \text{or with a penalty:} \quad \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \|\beta\|_2^2.
$$
**Goal:** Achieve minimal prediction error on unseen data $(\tilde{x}, \tilde{y})$.
The estimated $\hat{\beta}$ might be biased if we use regularization (e.g., ridge). But from a purely predictive lens, that bias can be advantageous if it lowers variance substantially and thus lowers expected prediction error.
### Applications in Economics
In economics (and related social sciences), **prediction** plays an increasingly prominent role [@mullainathan2017machine; @athey2019machine]:
- **Measure Variables**: Predicting missing or proxy variables (e.g., predicting income from observable covariates, or predicting individual preferences from online behaviors).
- **Embed Prediction Tasks Within Parameter Estimation or Treatment Effects**: Sometimes, a first-stage prediction (e.g., imputing missing data or generating prognostic scores) is used as an input for subsequent causal analyses.
- **Control for Observed Confounders**: Machine learning methods---such as LASSO, random forests, or neural nets---can be used to control for high-dimensional $X$ when doing partial-out adjustments or residualizing outcomes [@belloni2014high; @chernozhukov2018double].
------------------------------------------------------------------------
## Parameter Estimation and Causal Inference
### Estimation in Parametric Models
In a simple parametric form:
$$
Y = X\beta + \varepsilon, \quad \mathbb{E}[\varepsilon \mid X] = 0, \quad \text{Var}(\varepsilon \mid X) = \sigma^2 I.
$$
The [Ordinary Least Squares] estimator is:
$$
\hat{\beta}_{\text{OLS}} = \arg \min_\beta \|Y - X\beta\|_2^2 = (X^\top X)^{-1} X^\top Y.
$$
Under classical assumptions (e.g., no perfect collinearity, homoskedastic errors), $\hat{\beta}_{\text{OLS}}$ is BLUE---the Best Linear Unbiased Estimator.
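The closed form is easy to verify numerically; a sketch comparing the matrix formula with `lm()` on simulated data:
```{r}
set.seed(6)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))           # includes an intercept column
y <- X %*% c(1, 2, -1) + rnorm(n)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'y
cbind(by_hand = drop(beta_hat), lm = coef(lm(y ~ X[, 2] + X[, 3])))
```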
More generally, **parameter estimation** produces an estimator $\hat{\beta}$ whose aim is **estimating the relationship** between $y$ and $x$, often with a view toward **causality**. In many econometric or statistical settings, we write:
$$ y = x^\top \beta + \varepsilon, $$
or more generally $y = g\bigl(x;\beta\bigr) + \varepsilon,$ where $\beta$ encodes the structural or causal parameters we wish to recover.
The core aim is **consistency**---that is, for large $n$, we want $\hat{\beta}$ to converge to the true $\beta$ that defines the underlying relationship. In other words:
$$ \hat{\beta} \xrightarrow{p} \beta, \quad \text{as } n \to \infty. $$
Some texts phrase this informally as requiring that
$$ \mathbb{E}\bigl[\hat{f}\bigr] = f, $$
i.e., that the estimator is unbiased for the true function or parameters. Strictly speaking, unbiasedness and consistency are distinct properties, but both capture the idea that the estimator targets the truth rather than merely predicting well.
However, **consistency** alone may not suffice for scientific inference. One often also examines:
- **Asymptotic Normality**: $\sqrt{n}(\hat{\beta} - \beta) \;\;\xrightarrow{d}\;\; \mathcal{N}(0,\Sigma).$
- **Confidence Intervals**: $\hat{\beta}_j \;\pm\; z_{\alpha/2}\,\mathrm{SE}\bigl(\hat{\beta}_j\bigr).$
- **Hypothesis Tests**: $H_0\colon \beta_j = 0 \quad\text{vs.}\quad H_1\colon \beta_j \neq 0.$
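All three objects come directly out of a fitted linear model in R; a quick sketch:
```{r}
set.seed(7)
x <- rnorm(100); y <- 1 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
summary(fit)$coefficients  # estimates, standard errors, t-statistics, p-values
confint(fit)               # beta_j +/- quantile * SE(beta_j)
```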
### Causal Inference Fundamentals
To interpret $\beta$ in $Y = X\beta + \varepsilon$ as "causal," we typically require that changes in $X$ (or at least in one component of $X$) lead to changes in $Y$ that are not confounded by omitted variables or simultaneity. In a prototypical potential-outcomes framework (for a binary treatment $D$):
- $Y_i(1)$: outcome if unit $i$ receives treatment $D = 1$.
- $Y_i(0)$: outcome if unit $i$ receives no treatment $D = 0$.
The observed outcome $Y_i$ is
$$
Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0).
$$
The **Average Treatment Effect** (ATE) is:
$$
\tau = \mathbb{E}[Y(1) - Y(0)].
$$
Identification of $\tau$ requires an assumption like unconfoundedness:
$$
\{Y(0), Y(1)\} \perp D \mid X,
$$
i.e., after conditioning on $X$, the treatment assignment is as-if random. Estimation strategies then revolve around properly adjusting for $X$.
Such assumptions are not necessary for raw prediction of $Y$: a black-box function can yield $\hat{Y} \approx Y$ without ensuring that $\hat{Y}(1) - \hat{Y}(0)$ is an unbiased estimate of $\tau$.
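A small simulation illustrates the point: when assignment depends on $X$, the naive difference in means is biased for $\tau$, while regression adjustment for the confounder recovers it (a sketch with a constant treatment effect $\tau = 2$):
```{r}
set.seed(8)
n <- 10000
X <- rnorm(n)
D <- rbinom(n, 1, plogis(X))          # treatment probability depends on X
Y <- 1 + 2 * D + 1.5 * X + rnorm(n)   # true effect tau = 2
c(naive    = mean(Y[D == 1]) - mean(Y[D == 0]),  # biased upward by confounding
  adjusted = coef(lm(Y ~ D + X))["D"])           # approx 2 under unconfoundedness
```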
### Role of Identification
Identification means that the parameter of interest ($\beta$ or $\tau$) is uniquely pinned down by the distribution of observables (under assumptions). If $\beta$ is not identified (e.g., because of endogeneity or insufficient variation in $X$), no matter how large the sample, we cannot estimate $\beta$ consistently.
In prediction, "identification" is not usually the main concern. The function $\hat{f}(x)$ could be a complicated ensemble method that just fits well, without guaranteeing any structural or causal interpretation of its parameters.
### Challenges
1. **High-Dimensional Spaces**: With a large number of predictors $p$, strong correlation among predictors (multicollinearity) can hamper classical estimation. This is the setting of the well-known **bias-variance tradeoff** [@hastie2009elements; @bishop2006pattern].
2. **Endogeneity**: If $x$ is correlated with the error term $\varepsilon$, ordinary least squares (OLS) is biased. Causal inference demands identifying exogenous variation in $x$, which requires additional assumptions or designs (e.g., randomization).
3. **Model Misspecification**: If the functional form $g\bigl(x;\beta\bigr)$ is incorrect, parameter estimates can systematically deviate from capturing the true underlying mechanism.
## Causation versus Prediction
Understanding the relationship between **causation** and **prediction** is crucial in statistical modeling. Building on @kleinberg2015prediction and @mullainathan2017machine, consider a scenario where $Y$ is an outcome variable dependent on $X$, and we want to **manipulate** $X$ to maximize some payoff function $\pi(X,Y)$. Formally:
$$
\pi(X,Y)
=
\mathbb{E}\bigl[\,U(X,Y)\bigr]
\quad
\text{or some other objective measure}.
$$
The decision on $X$ depends on **how changes in** $X$ influence $\pi$. Taking a derivative:
$$
\frac{d\,\pi(X,Y)}{dX}
=
\frac{\partial \pi}{\partial X}(Y)
+
\frac{\partial \pi}{\partial Y}\,\frac{\partial Y}{\partial X}.
$$
We can interpret the terms:
- $\displaystyle \frac{\partial \pi}{\partial X}$: The *direct* dependence of the payoff on $X$, which can be predicted if we can forecast how $\pi$ changes with $X$.
- $\displaystyle \frac{\partial Y}{\partial X}$: The *causal* effect of $X$ on $Y$, essential for understanding how interventions on $X$ shift $Y$.
- $\displaystyle \frac{\partial \pi}{\partial Y}$: The marginal effect of $Y$ on the payoff.
Hence, @kleinberg2015prediction frames this distinction as one between **predicting** $Y$ effectively (for instance, "If I observe $X$, can I guess $Y$?") versus **managing** or **causing** $Y$ to change via **interventions** on $X$. Empirically:
- To **predict** $Y$, we model $\mathbb{E}\bigl[Y\mid X\bigr]$.
- To **infer causality**, we require identification strategies that isolate exogenous variation in $X$.
Empirical work in economics or other social sciences often aims to estimate partial derivatives of structural or reduced-form equations:
- $\displaystyle \frac{\partial Y}{\partial X}$: The **causal derivative**; tells us how $Y$ changes if we **intervene** on $X$.
- $\displaystyle \frac{\partial \pi}{\partial X}$: The **effect of** $X$ on payoff, partially mediated by changes in $Y$.
Without proper identification (e.g., **randomization, instrumental variables, difference-in-differences, or other quasi-experimental designs**), we risk **conflating association** ($\hat{f}$ that predicts $Y$) with **causation** ($\hat{\beta}$ that truly captures how $X$ shifts $Y$).
To illustrate these concepts, consider the following directed acyclic graph (DAG):
```{r}
library(ggdag)
library(dagitty)
library(ggplot2)
# Define the DAG structure with custom coordinates
dag <- dagitty('
dag {
X0 [pos="0,1"]
X [pos="1,2"]
Y [pos="1,1"]
II [pos="1,0"]
X0 -> Y
X0 -> II
X -> Y
Y -> II
}
')
# Plot the DAG; annotate() draws each label once, whereas geom_text() with
# constant aesthetics would overplot one copy per node
dag_plot <- ggdag(dag) +
theme_void() +
annotate("text", x = 0.5, y = 1.2, label = "Causation", size = 4) +
annotate("text", x = 0.3, y = 0.5, label = "Prediction", size = 4)
# Display the DAG
dag_plot
```
## Illustrative Equations and Mathematical Contrasts
Below, we showcase a few derivations that highlight how predictive modeling vs. causal inference differ in their mathematical structure and interpretation.
### Risk Minimization vs. Consistency
Consider a real-valued outcome $Y$ and predictors $X$. Let $\ell(y, \hat{y})$ be a loss function, and define the Bayes regressor $f^*$ as:
$$
f^* = \arg \min_f \mathbb{E}[\ell(Y, f(X))].
$$
For squared error loss, the Bayes regressor is $f^*(x) = \mathbb{E}[Y \mid X = x]$.
A learning algorithm tries to approximate $f^*$. If we parametrize $f_\beta(x) = x^\top \beta$ and do empirical risk minimization with a large enough sample, $\beta$ converges to the minimizer of:
$$
\beta^* = \arg \min_\beta \mathbb{E}[(Y - X^\top \beta)^2].
$$
Note that $\beta^*$ solves the normal equations $\mathbb{E}[XX^\top] \beta = \mathbb{E}[XY]$. If the variables are centered (or an intercept is included) and $\text{Cov}(X, X)$ is invertible, then
$$
\beta^* = \text{Cov}(X, X)^{-1} \text{Cov}(X, Y).
$$
This $\beta^*$ is not necessarily the same as the "true" $\beta_0$ from a structural equation $Y = X\beta_0 + \varepsilon$ unless $\mathbb{E}[\varepsilon \mid X] = 0$.
From a predictive standpoint, $\beta^*$ is the best linear predictor in the sense of mean squared error. From a causal standpoint, we want $\beta_0$ such that $\varepsilon$ is mean-independent of $X$. If that fails, $\beta^* \neq \beta_0$.
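This divergence is easy to simulate: make $X$ correlated with the error and OLS converges to $\beta^*$, not $\beta_0$ (a sketch; here $\beta_0 = 1$ while $\beta^* = \text{Cov}(X, Y)/\text{Var}(X) = 1.5$):
```{r}
set.seed(9)
n <- 100000
u <- rnorm(n)
x <- rnorm(n) + u     # x is correlated with the error term
y <- 1 * x + u        # structural beta_0 = 1, but E[u | x] != 0
coef(lm(y ~ x))["x"]  # approx 1.5: the best linear predictor, not beta_0
```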
### Partial Derivatives vs. Predictions
A powerful way to see the difference is to compare:
- $\frac{\partial}{\partial x} f^*(x)$ -- The partial derivative of the best predictor w.r.t. $x$. This is about how the model's *prediction* changes with $x$.
- $\frac{\partial}{\partial x} m_\beta(x)$ -- The partial derivative of the structural function $m_\beta(\cdot)$. This is about how the *true outcome* $Y$ changes with $x$, i.e., a causal effect if $m_\beta$ is indeed structural.
Unless the model was identified and the assumptions hold (exogeneity, no omitted variables, etc.), the partial derivative from a purely predictive model does not represent the causal effect.
In short: "slopes" from a black-box predictive model are not guaranteed to reflect how interventions on $X$ would shift $Y$.
### Example: High-Dimensional Regularization
Suppose we have a large number of predictors $p$, possibly $p \gg n$. A common approach in both prediction and inference is LASSO:
$$
\hat{\beta}_{\text{LASSO}} = \arg \min_\beta \left\{ \frac{1}{n} \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \|\beta\|_1 \right\}.
$$
- **Prediction:** Choose $\lambda$ to optimize out-of-sample MSE. Some bias is introduced in $\hat{\beta}$, but the final model might predict extremely well, especially if many true coefficients are near zero.
- **Causal Estimation:** We must worry about whether the LASSO is shrinking or zeroing out confounders. If a crucial confounder's coefficient is set to zero, the resulting estimate for a treatment variable's coefficient will be biased. Therefore, special procedures (like the double/debiased machine learning approach [@chernozhukov2018double]) are introduced to correct for the selection bias or to do post-selection inference [@belloni2014high].
The mathematics of "best subset" for prediction vs. valid coverage intervals for parameters diverges significantly.
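A stylized sketch of the failure mode and of the double-selection fix [@belloni2014high] (assuming `glmnet`; the data-generating process and tuning are illustrative): a confounder with a modest coefficient may be zeroed out by the naive LASSO, pushing its influence into the treatment coefficient, whereas refitting OLS on the union of controls selected for $Y$ and for $D$ guards against this:
```{r}
library(glmnet)
set.seed(10)
n <- 200; p <- 100
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
D <- 0.8 * X[, 1] + rnorm(n)            # x1 confounds treatment assignment
Y <- 1 * D + 0.5 * X[, 1] + rnorm(n)    # true treatment effect = 1
# Naive: penalized regression of Y on (D, X) may drop or shrink x1
naive <- coef(cv.glmnet(cbind(D = D, X), Y), s = "lambda.min")["D", 1]
# Double selection: union of controls predicting Y and controls predicting D,
# then *unpenalized* OLS of Y on D and the selected controls
sel_y <- which(coef(cv.glmnet(X, Y), s = "lambda.min")[-1, 1] != 0)
sel_d <- which(coef(cv.glmnet(X, D), s = "lambda.min")[-1, 1] != 0)
keep  <- union(sel_y, sel_d)
c(naive = naive,
  double_selection = unname(coef(lm(Y ~ D + X[, keep, drop = FALSE]))["D"]))
```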
### Potential Outcomes Notation
Let $D \in \{0, 1\}$ be a treatment indicator, and define potential outcomes:
$$
Y_i(0), Y_i(1).
$$
The observed outcome is:
$$
Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0).
$$
- **Prediction:** One might train a model $\hat{Y} = \hat{f}(X, D)$ to guess $Y$ from $(X, D)$. That model could be a black box with no guarantee that $\hat{Y}(1) - \hat{Y}(0)$ is an unbiased estimate of $Y_i(1) - Y_i(0)$.
- **Causal Inference:** We want to estimate $\mathbb{E}[Y(1) - Y(0)]$ or $\mathbb{E}[Y(1) - Y(0) \mid X = x]$. Identification typically requires $\{Y(0), Y(1)\} \perp D \mid X$, i.e., after conditioning on $X$, the treatment assignment is as-if random. Under such an assumption, the difference $\hat{f}(x, 1) - \hat{f}(x, 0)$ can be interpreted as a causal effect.
## Extended Mathematical Points
We now delve deeper into some mathematical nuances that are especially relevant when distinguishing between predictive vs. causal modeling.
### M-Estimation and Asymptotic Theory
$M$-Estimators unify many approaches: maximum likelihood, method of moments, generalized method of moments, and quasi-likelihood estimators. Let $\beta_0$ be the true parameter and define the population criterion function:
$$
Q(\beta) = \mathbb{E}[m(\beta; X, Y)],
$$
for some function $m$. The M-estimator $\hat{\beta}$ solves:
$$
\hat{\beta} = \arg \max_{\beta \in \Theta} \frac{1}{n} \sum_{i=1}^n m(\beta; X_i, Y_i).
$$
(Or $\arg \min$, depending on convention.)
Under regularity conditions [@newey1994large; @white1980heteroskedasticity], we have:
- **Consistency:** $\hat{\beta} \overset{p}{\to} \beta_0$.
- **Asymptotic Normality:** $\sqrt{n}(\hat{\beta} - \beta_0) \overset{d}{\to} N(0, \Sigma)$,
where $\Sigma$ is derived from derivatives of $m(\cdot; \cdot, \cdot)$ and the distribution of $(X, Y)$.
For prediction, such classical asymptotic properties may be of less interest unless we want to build confidence intervals around predictions. For causal inference, the entire enterprise revolves around these properties to ensure valid inference about $\beta_0$.
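A quick Monte Carlo sketch of asymptotic normality for OLS (here $\beta_0 = 2$; with unit-variance regressor and noise, $\Sigma = 1$):
```{r}
set.seed(11)
beta0 <- 2; n <- 500
z <- replicate(2000, {
  x <- rnorm(n); y <- beta0 * x + rnorm(n)
  sqrt(n) * (coef(lm(y ~ x))["x"] - beta0)  # centered and scaled estimator
})
c(mean = mean(z), sd = sd(z))  # approx N(0, 1) here, since sigma^2 / Var(x) = 1
```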
### The Danger of Omitted Variables
Consider a structural equation:
$$
Y = \beta_1 X_1 + \beta_2 X_2 + \varepsilon,
\quad
\mathbb{E}[\varepsilon \mid X_1, X_2] = 0.
$$
If we ignore $X_2$ and regress $Y$ on $X_1$ only, the resulting $\hat{\beta}_1$ can be severely biased:
$$
\hat{\beta}_1
=
\arg\min_{b} \sum_{i=1}^n \bigl(y_i - b\,x_{i1}\bigr)^2.
$$
In large samples, $\hat{\beta}_1$ converges in probability to:
$$
\beta_1
\;+\;
\beta_2 \,\frac{\mathrm{Cov}(X_1, X_2)}{\mathrm{Var}(X_1)}.
$$
This extra term $\displaystyle \beta_2 \,\frac{\mathrm{Cov}(X_1, X_2)}{\mathrm{Var}(X_1)}$ is the **omitted variables bias**. For **prediction**, omitting $X_2$ might sometimes be acceptable if $X_2$ has little incremental predictive value or if we only care about accuracy in some domain. However, for **inference** on $\beta_1$, ignoring $X_2$ invalidates the causal interpretation.
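The bias formula can be verified numerically; a sketch with $\beta_1 = 1$, $\beta_2 = 2$, and $\mathrm{Cov}(X_1, X_2)/\mathrm{Var}(X_1) = 0.6$, so the short regression should converge to $1 + 2 \times 0.6 = 2.2$:
```{r}
set.seed(12)
n <- 100000
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)          # Cov(x1, x2) = 0.6, Var(x1) = 1
y  <- 1 * x1 + 2 * x2 + rnorm(n)   # beta1 = 1, beta2 = 2
c(short_regression = coef(lm(y ~ x1))["x1"],
  ovb_formula      = 1 + 2 * cov(x1, x2) / var(x1))
```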
### Cross-Validation vs. Statistical Testing
- **Cross-Validation**: Predominantly used in **prediction** tasks. We split the data into training and validation sets, measure out-of-sample error, and select hyperparameters that minimize CV error.
- **Statistical Testing**: Predominantly used in **inference** tasks. We compute test statistics (e.g., $t$-test, Wald test), form confidence intervals, or test hypotheses about parameters ($H_0: \beta_j = 0$).
They serve different objectives:
1. **CV** is about **predictive model selection**.
2. **Testing** is about **scientific or policy conclusions** on whether $\beta_j$ differs from zero (i.e., "Does a particular variable have a causal effect?").
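The contrast in one sketch: the same data yield a cross-validated MSE (a model-selection quantity) and a $p$-value on $\beta_x$ (an inferential quantity), and each answers a different question:
```{r}
set.seed(13)
n <- 200
dat <- data.frame(x = rnorm(n)); dat$y <- 1 + 0.3 * dat$x + rnorm(n)
# 5-fold cross-validated MSE: predictive assessment
folds <- sample(rep(1:5, length.out = n))
cv_mse <- mean(sapply(1:5, function(k) {
  fit <- lm(y ~ x, data = dat[folds != k, ])
  mean((dat$y[folds == k] - predict(fit, dat[folds == k, ]))^2)
}))
# t-test on beta_x: scientific inference
c(cv_mse = cv_mse, p_value = summary(lm(y ~ x, dat))$coefficients["x", 4])
```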
------------------------------------------------------------------------
## Putting It All Together: Comparing Objectives
As an overarching illustration, let $\hat{f}$ be any trained predictor (ML model, regression, etc.) and let $\hat{\beta}$ be a parameter estimator from a structural or causal model. Their respective tasks differ:
- **Form of Output**
- $\hat{f}$ is a *function* from $\mathcal{X} \to \mathcal{Y}$.
- $\hat{\beta}$ is a *vector of parameters* with theoretical meaning.
- **Criterion**
- **Prediction**: Minimizes predictive loss $\mathbb{E}[L(Y,\hat{f}(X))]$.
- **Causal Inference**: Seeks $\beta$ such that $Y = m_\beta(X)$ is a correct *structural* representation. Minimizes bias in $\beta$, or satisfies orthogonality conditions in method-of-moments style, etc.
- **Validity**
- **Prediction**: Usually validated by out-of-sample experiments or cross-validation.
- **Estimation**: Validated by theoretical identification arguments, assumptions about exogeneity, randomization, or no omitted confounders.
- **Interpretation**
- **Prediction**: "$\hat{f}(x)$ is our best guess of $Y$ for new $x$."
- **Causal Inference**: "$\beta$ measures how $Y$ changes if we intervene on $X$."
------------------------------------------------------------------------
## Conclusion
**Prediction** and **Estimation/Causal Inference** serve distinctly different roles in data analysis:
- **Prediction**: The emphasis is on **predictive accuracy**. The final model $\hat{f}$ may have uninterpretable parameters (e.g., deep neural networks) yet excel at forecasting $Y$. Bias in parameter estimates is not necessarily problematic if it reduces variance and improves out-of-sample performance.
- **Estimation/Causal Inference**: The emphasis is on obtaining **consistent** and **unbiased** estimates of parameters ($\beta$, or a treatment effect $\tau$). We impose stronger assumptions about data collection and the relationship between $X$ and $\varepsilon$. The success criterion is whether $\hat{\beta}\approx\beta_0$ in a formal sense, with valid confidence intervals and robust identification strategies.
> **Key Takeaway**:
>
> - If your question is "How do I **predict** $Y$ for new $X$ as accurately as possible?", you prioritize **prediction**.
>
> - If your question is "How does **changing** $X$ (or assigning treatment $D$) affect $Y$ in a **causal** sense?", you focus on **estimation** with a fully developed identification strategy.