Skip to content

Commit

Permalink
update rank regression example jynb
Browse files Browse the repository at this point in the history
  • Loading branch information
statmlben committed Nov 11, 2024
1 parent ec30df0 commit 43259e7
Show file tree
Hide file tree
Showing 3 changed files with 427 additions and 124 deletions.
330 changes: 330 additions & 0 deletions doc/source/examples/.ipynb_checkpoints/RankRegression-checkpoint.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,330 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "J0qXcTp-FofG"
},
"source": [
"\n",
"# Rank Regression\n",
"[![Slides](https://img.shields.io/badge/🦌-ReHLine-blueviolet)](https://rehline-python.readthedocs.io/en/latest/)\n",
"\n",
"Suppose we have a dataset $(y_i,\\mathbf{x}_j)_{i=1}^n$, where $y_i \\in \\mathbb{R}$ is the response and $\\mathbf{x}_i \\in \\mathbb{R}^{d}$ is the independent variable. Assume the data is generated from the following linear regression model,\n",
"\n",
"$$\n",
" y_i = \\boldsymbol{\\beta}^{\\intercal}\\mathbf{x}_i + \\epsilon_i, \\quad i=1,2,\\cdots n,\n",
"$$\n",
"where $\\boldsymbol{\\beta} \\in \\mathbb{R}^d$ is the unknown parameter to be estimated and $\\{\\epsilon_i\\}_{i=1}^n$ are iid random errors. \n",
"\n",
"When the random errors $\\{\\epsilon_i\\}_{i=1}^n$ follow non-Gaussian distributions, particularly those with heavy tails, the performance of OLS maybe poor. Rank regression is one of the robust regression methods which can deal with the case. It suffices to solve the following optimization problem (Jaeckel, 1972):\n",
"\n",
"$$\n",
" \\min_{\\boldsymbol{\\beta} \\in \\mathbb{R}^d} \\sum\\limits_{i=1}^{n}R_i^c(\\boldsymbol{\\beta})(y_i - \\boldsymbol{\\beta}^{\\intercal}\\mathbf{x}_i),\n",
"$$\n",
"where $R_i^c(\\boldsymbol{\\beta})=R_i(\\boldsymbol{\\beta}) - \\frac{n+1}{2}$, $R_i(\\boldsymbol{\\beta})$ is the rank of ${y_i-\\boldsymbol{\\beta}^{\\intercal}\\mathbf{x}_i}$ among all $\\{ y_i-\\boldsymbol{\\beta}^{\\intercal}\\mathbf{x}_i\\}_{i=1}^n$. Canonically, we estimate $\\boldsymbol{\\beta}$ by solving the following optimization problem,\n",
"\n",
"$$\n",
" \\min_{\\boldsymbol{\\beta} \\in \\mathbb{R}^d} \\frac{2}{n(n-1)}\\sum\\limits_{1 \\leq i < j \\leq n} \\big| y_{ij} - \\boldsymbol{\\beta}^{\\intercal} \\mathbf{x}_{ij} \\big| \\tag{1}\n",
"$$\n",
"where $\\mathbf{x}_{ij} = \\mathbf{x}_i - \\mathbf{x}_j$, $y_{ij}= y_i - y_j$. In the remaining part, I will show two methods to solve $(1)$ with implementations via `rehline`.\n",
"\n",
"In `ReHLine`, we have two options to solve rank regression: (i) use [Manual ReHLine Formulation](https://rehline-python.readthedocs.io/en/latest/tutorials/ReHLine_manual.html) based on `rehline.ReHLine`; (ii) use the form of [Empirical Risk Minimization](https://rehline-python.readthedocs.io/en/latest/tutorials/ReHLine_ERM.html) with the check loss based on `rehline.plqERM_Ridge`.\n",
"<!-- \n",
"**Least absolute deviation (LAD)** \n",
"By taking $\\mathbf{x}_{ij}$ and $y_{ij}$ as observasions, problem $(1)$ can be regarded as a LAD problem. We can solve the LAD problem by rehline. The composite ReLU-ReHU parameters for LAD can be found in Table 2 of (Dai and Qiu, 2024).\n",
"\n",
"**Quantile regression (QR)** \n",
"Let $e_{ij}=y_{ij} - \\boldsymbol{\\beta}^{\\intercal} \\mathbf{x}_{ij}$. $(1)$ is equivalent to\n",
"\n",
"$$\n",
" \\min_{\\boldsymbol{\\beta} \\in \\mathbb{R}^d} \\frac{2}{n(n-1)} \\sum\\limits_{1 \\leq i < j \\leq n } ( e_{ij} \\mathbb{1}_{ \\lbrace e_{ij} \\geq 0 \\rbrace} - e_{ij} \\mathbb{1}_{ \\lbrace e_{ij} < 0 \\rbrace} ) \\tag{2}\n",
"$$\n",
"\n",
"We compare $(2)$ with the check loss for quantile regression $\\rho_{\\tau}(ϵ)=ϵ(τ-\\mathbb{1}_{\\lbrace ϵ < 0 \\rbrace}) = \\tau ϵ \\mathbb{1}_{ \\lbrace \\epsilon \\geq 0 \\rbrace } +(\\tau-1)ϵ\\mathbb{1}_{ \\lbrace ϵ < 0 \\rbrace}$. When $\\tau=0.5$, solving $(2)$ is eqivalent to solving Quantile Regression by taking $\\mathbf{x}_{ij}$ and $y_{ij}$ as observasions.\n",
"\n",
"$$\n",
" \\min_{\\boldsymbol{\\beta} \\in \\mathbb{R}^d} \\frac{4}{n(n-1)} \\sum\\limits_{1 \\leq i < j \\leq n } \\rho_{0.5}(y_{ij}-\\boldsymbol{\\beta}^{\\intercal}\\mathbf{x}_{ij}) \\tag{3}\n",
"$$ -->"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "u3gEkxOBdUft"
},
"source": [
"## Numerical example"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CDuMki43tnuL"
},
"source": [
"### Preparation \n",
"**Generate the data**"
]
},
{
"cell_type": "code",
"execution_count": 200,
"metadata": {
"id": "r3VoUcR4FceK"
},
"outputs": [],
"source": [
"from rehline import ReHLine, plqERM_Ridge\n",
"from sklearn.datasets import make_regression\n",
"from sklearn.linear_model import LinearRegression, QuantileRegressor\n",
"import numpy as np\n",
"\n",
"np.random.seed(0)\n",
"\n",
"# Generate the data\n",
"n, d = 3000, 3\n",
"X = np.random.randn(n, d)\n",
"beta_true = np.random.randn(d)\n",
"y = X.dot(beta_true) + np.random.standard_t(df=10, size=n)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8B62NsX7t2SS"
},
"source": [
"**Pairwise differenced dataset**"
]
},
{
"cell_type": "code",
"execution_count": 201,
"metadata": {
"id": "OcdNUtvrn_Hy"
},
"outputs": [],
"source": [
"# Differencing\n",
"i, j = np.triu_indices(n, k=1)\n",
"X_diff = (X[:, np.newaxis, :] - X[np.newaxis, :, :])[i,j]\n",
"y_diff = (y[:, np.newaxis] - y[np.newaxis, :])[i,j]\n",
"n_diff = y_diff.shape[0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VtuYCtgruBdK"
},
"source": [
"### Method 1: Solve Rank Regression via `rehline.plqERM_Ridge`\n",
"\n",
"Since `rehline.plqERM_Ridge` support for check loss, we can directly use it to solve the problem.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 202,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "pNLvCXgO2KcT",
"outputId": "7f205dd8-3daf-4cd6-ce8b-5e43c7a4ab57"
},
"outputs": [],
"source": [
"# solve the problem use quantitle regression model with $\\tau = 0.5$ by reline\n",
"C = 20/n_diff\n",
"RankReg = plqERM_Ridge(loss={'name': 'QR', 'qt': np.array([0.5])}, C=C, max_iter=10000)"
]
},
{
"cell_type": "code",
"execution_count": 203,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 811 ms, sys: 105 ms, total: 916 ms\n",
"Wall time: 915 ms\n"
]
}
],
"source": [
"%%time\n",
"RankReg.fit(X_diff, y_diff)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_aE8U8k8t4Uq"
},
"source": [
"### Method 2: Solve Rank Regression by `rehline.ReHLine`\n",
"Since the rank regression is a plq formulation, we can also use `rehline.RehLine` to solve it by manually specifying the ReLU-ReHU parameters.\n",
"\n",
"As shown in table 2 of (Dai and Qiu, 2024), when $L_i = C_i \\big| y_i - \\boldsymbol{\\beta}^{\\intercal} \\mathbf{x}_i \\big|$, ReLU parameters are as followed: $u_{1i}=C_i$, $v_{1i}=-C_i y_i$, $u_{2i}=-C_i$, $v_{2i}=C_i y_i$."
]
},
{
"cell_type": "code",
"execution_count": 204,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "9BIbKUma39CH",
"outputId": "1b663080-64df-4dc4-cee4-4acc5e4ef819"
},
"outputs": [],
"source": [
"# solve the problem use the least absolute deviation model (LAD)\n",
"C = 20/n_diff/2\n",
"U = np.zeros(shape=(2,n_diff))\n",
"V = np.zeros(shape=(2,n_diff))\n",
"U[0,:], U[1,:] = C, -C\n",
"V[0,:], V[1,:] = -C * y_diff, C * y_diff\n",
"RankReg = ReHLine(C=C, max_iter=10000)\n",
"RankReg.U, RankReg.V = U,V"
]
},
{
"cell_type": "code",
"execution_count": 205,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 744 ms, sys: 77.2 ms, total: 821 ms\n",
"Wall time: 819 ms\n"
]
}
],
"source": [
"%%time\n",
"RankReg.fit(X_diff)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pwPOUiEyeQfF"
},
"source": [
"### Results\n",
"\n",
"We compare the results of rank regression with OLS methods."
]
},
{
"cell_type": "code",
"execution_count": 206,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Nvjfr8YbCsUM",
"outputId": "46dd7432-4c08-4083-8672-fe3fae7e397d"
},
"outputs": [
{
"data": {
"text/html": [
"<style>#sk-container-id-23 {color: black;}#sk-container-id-23 pre{padding: 0;}#sk-container-id-23 div.sk-toggleable {background-color: white;}#sk-container-id-23 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-23 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-23 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-23 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-23 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-23 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-23 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-23 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-23 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-23 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-23 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-23 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-23 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-23 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-23 div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-23 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-23 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-23 div.sk-item {position: relative;z-index: 1;}#sk-container-id-23 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-23 div.sk-item::before, #sk-container-id-23 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-23 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-23 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-23 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-23 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-23 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-23 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-23 div.sk-label-container {text-align: center;}#sk-container-id-23 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-23 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-23\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>LinearRegression(fit_intercept=False)</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-23\" type=\"checkbox\" checked><label for=\"sk-estimator-id-23\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">LinearRegression</label><div class=\"sk-toggleable__content\"><pre>LinearRegression(fit_intercept=False)</pre></div></div></div></div></div>"
],
"text/plain": [
"LinearRegression(fit_intercept=False)"
]
},
"execution_count": 206,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ols = LinearRegression(fit_intercept=False)\n",
"ols.fit(X, y)"
]
},
{
"cell_type": "code",
"execution_count": 207,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Name Value \n",
"-------------------- --------------------\n",
"#samples: 3000 \n",
"#diff_samples: 4498500 \n",
"-------------------- --------------------\n",
"ture-β [ 1.4171 0.4437 -0.077 ]\n",
"RankReg-β [ 1.4207 0.4351 -0.0467]\n",
"ols-β [ 1.4227 0.4386 -0.0364]\n"
]
}
],
"source": [
"import pprint\n",
"np.set_printoptions(precision=4)\n",
"\n",
"res = {'ture-β': beta_true, 'RankReg-β': RankReg.coef_, 'ols-β': ols.coef_}\n",
"\n",
"## print the estimation results\n",
"print(\"Name\".ljust(20), \"Value\".ljust(20))\n",
"print(\"-\" * 20, \"-\" * 20)\n",
"print(\"#samples:\".ljust(20), str(n).ljust(20))\n",
"print(\"#diff_samples:\".ljust(20), str(n_diff).ljust(20))\n",
"print(\"-\" * 20, \"-\" * 20)\n",
"for key, value in res.items():\n",
" print(f\"{str(key).ljust(20)} {str(value).ljust(20)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fivGfEGe8yhz"
},
"source": [
"## References\n",
"[1] Jaeckel, L. A. (1972). Estimating regression coefficients by minimizing the dispersion of the residuals. _The Annals of Mathematical Statistics_, 1449-1458. \n",
"[2] Dai, B., & Qiu, Y. (2024). ReHLine: regularized composite ReLU-ReHU loss minimization with linear computation and linear convergence. _Advances in Neural Information Processing Systems_, 36."
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Loading

0 comments on commit 43259e7

Please sign in to comment.