Created additional vignette for calib. weighting
maciejostapiuk committed Sep 4, 2024
1 parent dab9f86 commit 220142a
Showing 5 changed files with 189 additions and 4 deletions.
3 changes: 2 additions & 1 deletion docs/LICENSE-text.html


3 changes: 2 additions & 1 deletion docs/LICENSE.html


3 changes: 2 additions & 1 deletion docs/authors.html


3 changes: 2 additions & 1 deletion docs/pkgdown.yml
@@ -2,8 +2,9 @@ pandoc: 3.1.11
 pkgdown: 2.1.0
 pkgdown_sha: ~
 articles:
+  calibration_weighting: calibration_weighting.html
   theory: theory.html
-last_built: 2024-08-23T10:28Z
+last_built: 2024-09-04T21:24Z
 urls:
   reference: https://maciejostapiuk.github.io/MNAR/reference
   article: https://maciejostapiuk.github.io/MNAR/articles
181 changes: 181 additions & 0 deletions vignettes/calibration_weighting.Rmd
@@ -0,0 +1,181 @@
---
title: "Calibration weighting"
author: "Maciej Ostapiuk and Maciej Beręsewicz"
output:
  html_vignette:
    df_print: kable
    toc: true
    number_sections: true
vignette: >
  %\VignetteIndexEntry{Calibration weighting}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
bibliography: references.bib
link-citations: true
header-includes:
  - \usepackage{amsmath}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
\newcommand{\bX}{\boldsymbol{X}}
\newcommand{\bx}{\boldsymbol{x}}
\newcommand{\bY}{\boldsymbol{Y}}
\newcommand{\by}{\boldsymbol{y}}
\newcommand{\bh}{\boldsymbol{h}}
\newcommand{\bg}{\boldsymbol{g}}
\newcommand{\bH}{\boldsymbol{H}}
\newcommand{\ba}{\boldsymbol{a}}
\newcommand{\bp}{\boldsymbol{p}}
\newcommand{\bA}{\boldsymbol{A}}
\newcommand{\bw}{\boldsymbol{w}}
\newcommand{\bd}{\boldsymbol{d}}
\newcommand{\bZ}{\boldsymbol{Z}}
\newcommand{\bz}{\boldsymbol{z}}
\newcommand{\bv}{\boldsymbol{v}}
\newcommand{\bb}{\boldsymbol{b}}
\newcommand{\bu}{\boldsymbol{u}}
\newcommand{\bU}{\boldsymbol{U}}
\newcommand{\bQ}{\boldsymbol{Q}}
\newcommand{\bG}{\boldsymbol{G}}
\newcommand{\HT}{\text{\rm HT}}

\newcommand{\bbeta}{\boldsymbol{\beta}}
\newcommand{\balpha}{\boldsymbol{\alpha}}
\newcommand{\btau}{\boldsymbol{\tau}}
\newcommand{\bgamma}{\boldsymbol{\gamma}}
\newcommand{\bGamma}{\boldsymbol{\Gamma}}
\newcommand{\btheta}{\boldsymbol{\theta}}
\newcommand{\blambda}{\boldsymbol{\lambda}}
\newcommand{\bPhi}{\boldsymbol{\Phi}}
\newcommand{\bEta}{\boldsymbol{\eta}}

\newcommand{\bZero}{\boldsymbol{0}}
\newcommand{\indicator}{\mathbb{1}}

\newcommand{\colvec}{\operatorname{colvec}}
\newcommand{\logit}{\operatorname{logit}}
\newcommand{\Exp}{\operatorname{Exp}}
\newcommand{\Ber}{\operatorname{Bernoulli}}
\newcommand{\Uni}{\operatorname{Uniform}}
\newcommand{\argmin}{\operatorname{argmin}}



# Weighting methods
Usually, when non-response occurs, the summation in (\ref{eq:HT estimator}) underestimates the population total from (\ref{eq:Pop. total}). Thus, the initial weights $d_k$ under the given sampling design need to be corrected; in other words, we have to perform **calibration** of the $d_k$'s.

In general, we have the following settings:
- **Information on the target variable $Y$ is only available for respondents.**

- **Information on auxiliary variables $X$ is available under the following settings:**
- Unit-level data is available for respondents and non-respondents.
- Unit-level data is available only for respondents, but we have population totals for the reference population.

## Case when dimensions of calibration and response-model variables coincide

Let $\bx = (x_1, x_2, \ldots, x_p)^{\text{T}}$ denote the benchmark vector of chosen auxiliary variables and let $\bx_{k} = (x_{1_k}, x_{2_k}, \ldots, x_{p_k})^{\text{T}}$ be the vector of auxiliary variables for the $k$-th element of the sample $s$. Our settings state that $\bX$, the vector of population totals of the auxiliary variables, is known, i.e.
\begin{equation}\label{eq:Aux. Total}
\bX = \left(\sum_{k=1}^{N}{x_{1_k}}, \sum_{k=1}^{N}{x_{2_k}}, ..., \sum_{k=1}^{N}{x_{p_k}}\right)^{\text{T}} = \sum_{k \in U}{\bx_k}.
\end{equation}
If any of the auxiliary totals is not known, one might estimate it by substituting $x_{i_k}$ for $y_k$ in (\ref{eq:HT estimator}), i.e. $$\hat{X}^{i}_{\text{HT}} = \sum_{k \in s}{d_k x_{i_k}}, \; i = 1, \ldots, p.$$
However, using $\bx_k$ in place of $y_k$ does not always work in the process of estimating $\bX$. One needs weights slightly different from the $d_k$'s. These weights, denoted $w_k$, arise as solutions to an optimization problem of the form
\begin{equation}\label{eq: optimization w_k}
\argmin_{w_k}{\sum_{k\in r}G_k\left(w_k,d_k\right)},
\end{equation}
where $G_k$ is a strictly convex, differentiable distance function for which $G_k(d_k, d_k) = 0$ (equivalently, viewing $G_k$ as a function of the ratio $w_k/d_k$, $G_k(1) = G_k'(1) = 0$). In addition, the weights have to satisfy the following condition:
\begin{equation}\label{eq: calib eq}
\sum_{k\in r}{w_k\bx_k} = \sum_{k\in U}{\bx_k}.
\end{equation}
Equation (\ref{eq: calib eq}) is called the **calibration equation**. Using the method of Lagrange multipliers, @deville1992calibration show that the vector of calibration weights can be written as:
\begin{equation}\label{eq: w_k minimizers form}
w_k = d_k F_k (\blambda^{\text{T}}\bz_k)
\end{equation}
where $\bz_k$ is a vector of instrumental variables whose dimension coincides with that of $\bx_k$. Later in this vignette, we will consider the situation where $\bx_k$ has a higher dimension than $\bz_k$. $F_k$ is the inverse of $G_k'(w_k, d_k)$, defined as:
\begin{equation}\label{eq: partial derivative of G_k}
G_k'(w_k, d_k) = \frac{\partial{G_k(w_k, d_k)}}{\partial w_k}.
\end{equation}
There are various ways to choose the function $G_k$, but a common choice is:
\begin{equation}\label{eq: G_k example}
G_k(w_k,d_k) = \frac{\left(w_k - d_k\right)^2}{2d_k}.
\end{equation}
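For this choice, the derivative (\ref{eq: partial derivative of G_k}) equals $G_k'(w_k, d_k) = (w_k - d_k)/d_k$, so its inverse is the linear function $F_k(u) = 1 + u$, which explains the form of the weights that follows.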
Indeed, the solution $w_k$ of the problem stated in (\ref{eq: optimization w_k}) is then given by @estevaofunctional2000 as:
\begin{equation}\label{eq:G_k optimizers}
w_k = d_k(1 + \bz_k^{\text{T}}\blambda),
\end{equation}
where $\blambda$ is given by:
\begin{equation}\label{eq: g vector}
\blambda = \left(\sum_{k \in r}{d_k\bx_k\bz_k^{\text{T}}}\right)^{-1} \times \left(\bX - \sum_{k\in r}{d_k \bx_k}\right).
\end{equation}
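To make this concrete, below is a minimal R sketch (illustrative only; all object names are hypothetical and it does not use the package API) that simulates a population, treats a simple random sample as the respondent set $r$, computes $\blambda$, and verifies the calibration equation (\ref{eq: calib eq}):

```{r}
set.seed(123)
N <- 1000                          # population size
x1 <- rexp(N, rate = 1 / 2)        # auxiliary variables
x2 <- runif(N, 0, 10)
X_pop <- cbind(1, x1, x2)          # intercept included, so weights sum to N
X_total <- colSums(X_pop)          # known benchmark vector of totals

n <- 200
r_idx <- sample.int(N, n)          # respondents (SRS for illustration)
d <- rep(N / n, n)                 # initial design weights d_k
x_r <- X_pop[r_idx, ]
z_r <- x_r                         # instruments z_k = x_k (coinciding dimensions)

# lambda = (sum_r d_k x_k z_k^T)^{-1} (X - sum_r d_k x_k)
lambda <- solve(t(x_r * d) %*% z_r, X_total - colSums(x_r * d))
w <- d * (1 + drop(z_r %*% lambda))  # linear calibration weights

# calibration equation: sum_r w_k x_k = X
all.equal(colSums(x_r * w), X_total)
```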
Using the $w_k$ obtained in this way, known as the linear weights, a new, so-called "calibration-weighted" estimator of the target variable total from (\ref{eq:Pop. total}) takes the form:
\begin{equation}\label{eq: calibration-weighted estimator}
\hat{Y}_{\text{cal}} = \sum_{k \in r}{w_k y_k},
\end{equation}
which can be rewritten as:
\begin{equation}\label{eq: rendered calibration-weighted estimator}
\hat{Y}_{\text{cal}} = \sum_{k \in r }{d_k y_k} + \left(\bX - \sum_{k\in r}{d_k \bx_k}\right)^{\text{T}}\bb,
\end{equation}
where
\begin{equation*}
\bb = \left(\sum_{k \in r}{d_k\bz_k\bx_k^{\text{T}}}\right)^{-1} \times \sum_{k \in r}{d_k\bz_k y_k}.
\end{equation*}
Notice that $\hat{Y}_{\text{cal}}$ is no longer unbiased by design. However, it can still be consistent, as described in @isakifuller1982.
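Continuing the sketch above, one can check numerically that the two forms (\ref{eq: calibration-weighted estimator}) and (\ref{eq: rendered calibration-weighted estimator}) agree; the target variable below is simulated purely for illustration:

```{r}
y <- 2 + 3 * x1 - 0.5 * x2 + rnorm(N)  # hypothetical target variable
y_r <- y[r_idx]

Y_cal <- sum(w * y_r)                  # calibration-weighted estimator

# b = (sum_r d_k z_k x_k^T)^{-1} sum_r d_k z_k y_k
b <- solve(t(z_r * d) %*% x_r, colSums(z_r * (d * y_r)))
Y_cal2 <- sum(d * y_r) + drop(crossprod(X_total - colSums(x_r * d), b))

all.equal(Y_cal, Y_cal2)
```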

How does one formulate the prediction model in this case? Let us define two indicator random variables:
\begin{equation*}
I_k = \indicator\{k \in s\}, \;\;\; R_k = \indicator\{k \in r\}.
\end{equation*}
@kottweightingnonignorable2010 proposed the following set of equations as a double-protection justification:
\begin{equation}\label{eq: double-security-pred.framework}
\left\{\begin{array}{ll}
y_k &= \bx_k^{\text{T}} \bbeta_{\bx} + \epsilon_k,\\
\bz_k^{\text{T}} &= \bx_k^{\text{T}}\bGamma + \boldsymbol{\eta}_k^{\text{T}},
\end{array} \right.
\end{equation}
where $\bGamma$ is usually (though not necessarily) of full rank, $\bbeta_{\bx}$ is a coefficient vector, and
\begin{equation}
E{\left(\epsilon_k \mid \bx_k, I_k, R_k\right)} = 0, \;\; E{\left(\boldsymbol{\eta}_k \mid \bx_k, I_k, R_k\right)} = \bZero.
\end{equation}
Under the proposal in (\ref{eq: double-security-pred.framework}), the following property holds:
\begin{equation}\label{eq: property of 2-sec prediction}
\left(y_k - \bz_k^{\text{T}}\bbeta_{\bz}\right)|\bx_k = \left(\epsilon_k - \boldsymbol{\eta}_k^{\text{T}}\bGamma^{-1}\bbeta_{\bx}\right)|\bx_k,
\end{equation}
where
\begin{equation*}
\bbeta_{\bz} = \bGamma^{-1}\bbeta_{\bx}.
\end{equation*}
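To see why, substitute (\ref{eq: double-security-pred.framework}) into the left-hand side of (\ref{eq: property of 2-sec prediction}):
\begin{equation*}
y_k - \bz_k^{\text{T}}\bbeta_{\bz} = \bx_k^{\text{T}}\bbeta_{\bx} + \epsilon_k - \left(\bx_k^{\text{T}}\bGamma + \boldsymbol{\eta}_k^{\text{T}}\right)\bGamma^{-1}\bbeta_{\bx} = \epsilon_k - \boldsymbol{\eta}_k^{\text{T}}\bGamma^{-1}\bbeta_{\bx}.
\end{equation*}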

## When there are more calibration than response-model variables

First, let us consider $\bb_{\bz}^{*}$, the asymptotic limit of:
\begin{equation}\label{eq: b_z}
\bb_{\bz} = \left(\sum_{k \in S}{d_kR_kF'_k(\bz_k^{\text{T}}\blambda)\bx_k}\bz_k^{\text{T}}\right)^{-1} \times \sum_{k \in S}{d_kR_kF'_k(\bz_k^{\text{T}}\blambda)\bx_ky_k},
\end{equation}
which, alongside $\bb_{\bz}$, exists whether or not the prediction model holds. When the prediction model fails, $\bb_{\bz}$ no longer converges to $\bbeta_{\bz}$, although it still converges to $\bb_{\bz}^{*}$ as long as $\blambda$ converges to a finite $\blambda^{*}$. @chang_using_2008 considered this case and extended the weighting approach by replacing the calibration equation (\ref{eq: calib eq}) with the reformulated one:
\begin{equation}\label{eq: reformulated calib. eq.}
\boldsymbol{s} = \frac{1}{N} \left[\sum_{k \in S}{d_kR_kF_k(\bz_k^{\text{T}}\blambda)\bx_k} - \sum_{k \in S}{d_k\bx_k}\right] = \bZero
\end{equation}
by finding the $\blambda$ that minimizes $\boldsymbol{s}^{\text{T}}\bQ\boldsymbol{s}$ for some symmetric and positive definite $\bQ$. There are various ways to pick $\bQ$, as well as to deal with $\bGamma$ not being of full rank; a couple of examples can be found in @kottliaoalowingmorecalib2017. For example, one option is to use $\displaystyle \bQ^{-1} = \text{DIAG}\left[\left({N}^{-1} \sum_{S} d_k \bx_k\right) \left({N}^{-1}\sum_{S} d_k \bx_k^{\text{T}} \right)\right]$. After finding $\blambda$, the dimension of $\bx_k^{\text{T}}$ is reduced as follows:
\begin{align*}
\tilde{\bx}_k^{\text{T}} &= \bx_k^{\text{T}} N^{-1} \bQ \sum_{j \in S} d_j R_j F_k' \left( \bz_j^{\text{T}} \blambda \right) \bx_j \bz_j^{\text{T}}\\
&= \bx_k^{\text{T}} \left( \sum_{j \in S} d_j R_j F_k' \left( \bz_j^{\text{T}} \blambda \right) \bx_j \bx_j^{\text{T}} \right)^{-1} \sum_{j \in S} d_j R_j F_k' \left( \bz_j^{\text{T}} \blambda \right) \bx_j \bz_j^{\text{T}}\\
&= \bx_k^{\text{T}} \mathbf{B}_{\bz}.
\end{align*}
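As a rough numerical illustration of this approach (a sketch under assumed names, with a raking-type $F_k(u) = \exp(u)$ and $\bQ$ taken as the identity; this is not the `MNAR` implementation):

```{r}
set.seed(321)
N <- 1000; n <- 300
x1 <- rnorm(N); x2 <- runif(N)
X_pop <- cbind(1, x1, x2)   # calibration variables x_k (dimension 3)
Z_pop <- cbind(1, x1)       # response-model variables z_k (dimension 2)

S_idx <- sample.int(N, n)   # sample S
d <- rep(N / n, n)          # design weights
R <- rbinom(n, 1, plogis(0.5 + Z_pop[S_idx, 2]))  # response indicators R_k

# s(lambda) from the reformulated calibration equation
s_fun <- function(lambda) {
  Fv <- exp(drop(Z_pop[S_idx, ] %*% lambda))      # F_k(z_k^T lambda) = exp(.)
  (colSums(d * R * Fv * X_pop[S_idx, ]) - colSums(d * X_pop[S_idx, ])) / N
}
# minimize s^T Q s with Q = I; since dim(x) > dim(z), s is driven
# close to zero in the least-squares sense rather than exactly to zero
fit <- optim(c(0, 0), function(l) sum(s_fun(l)^2), method = "BFGS")
w <- d * R * exp(drop(Z_pop[S_idx, ] %*% fit$par))  # calibration weights
round(s_fun(fit$par), 4)
```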
Another approach to dimension reduction, proposed by @andridge2011proxy, works without searching for $\blambda$ and does not even rely on picking a $\bQ$ matrix; the idea relies on satisfying
\begin{equation}\label{eq: reformulated calib.eq}
\sum_{k \in S} w_k \tilde{\bx}_k = \sum_{k \in S} d_k R_k F_k(\bz_k^{\text{T}}\blambda) \tilde{\bx}_k = \sum_{k \in S} d_k \tilde{\bx}_k
\end{equation}
and setting
\begin{equation}\label{eq: setting A&L}
\tilde{\bx}_k^{\text{T}} = \bx_k^{\text{T}}\bA^{\text{T}},
\end{equation}
where $\bA^{\text{T}} =\left(\sum_{S}{R_j\bx_j\bx_j^{\text{T}}}\right)^{-1} \sum_{S}{R_j\bx_j\bz_j^{\text{T}}}$. Again, the reduction of dimensions is needed, and with the $\tilde{\bx}_k$ obtained this way one is able to perform the generalized calibration weighting technique. So far, this method is implemented as part of the `MNAR::gencal()` function.
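Continuing the example above, the @andridge2011proxy reduction (\ref{eq: setting A&L}) can be computed directly (again only a sketch; the packaged routine is `MNAR::gencal()`):

```{r}
xs <- X_pop[S_idx, ]; zs <- Z_pop[S_idx, ]
# A^T = (sum_S R_j x_j x_j^T)^{-1} sum_S R_j x_j z_j^T
A_t <- solve(t(R * xs) %*% xs, t(R * xs) %*% zs)
x_tilde <- xs %*% A_t   # reduced x_k, now of the same dimension as z_k
dim(x_tilde)
```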

# References
