
Commit

regularization updates
ludwigbothmann committed Jan 17, 2024
1 parent ca32b67 commit a3827e1
Showing 3 changed files with 23 additions and 13 deletions.
slides/regularization/slides-regu-intro.tex: 2 changes (1 addition & 1 deletion)

@@ -113,7 +113,7 @@

\begin{vbframe}{Example II: Overfitting}

We train a shallow neural network with one hidden layer and 100 hidden units as well as a SVM with RBF kernel on a small regression task. No form of explicit regularization is imposed on the models. %The target variable is house price.
We train a shallow neural network with one hidden layer and 100 hidden units as well as an SVM with RBF kernel ($C=10^6$, $\gamma=10$) on a small regression task. No form of explicit regularization is imposed on the models. %The target variable is house price.
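For illustration (not part of the commit): a minimal scikit-learn sketch of how such an unregularized comparison could be set up. The simulated data, seed, and train/test split are assumptions; only the architecture (one hidden layer, 100 units) and the SVM hyperparameters ($C=10^6$, $\gamma=10$) come from the slide text.

```python
# Hypothetical sketch: fit an unregularized MLP and an RBF-SVM on a small
# simulated regression task and compare training vs. test error.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(60, 1))               # small sample: easy to overfit
y = np.sin(X).ravel() + rng.normal(0, 0.3, 60)     # assumed toy target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

models = {
    # one hidden layer with 100 units, no explicit (L2) regularization
    "MLP": MLPRegressor(hidden_layer_sizes=(100,), alpha=0.0,
                        max_iter=20000, random_state=1),
    # SVM hyperparameters as stated on the slide
    "SVM-RBF": SVR(kernel="rbf", C=1e6, gamma=10),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name:8s} train MSE: {mean_squared_error(y_tr, model.predict(X_tr)):.3f}  "
          f"test MSE: {mean_squared_error(y_te, model.predict(X_te)):.3f}")
```

Both models typically reach a very low training error but a much higher test error, which is the overfitting behaviour this example demonstrates.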
\vspace{0.2cm}
\begin{table}[ht]
\centering
slides/regularization/slides-regu-l1l2.tex: 26 changes (18 additions & 8 deletions)

@@ -33,16 +33,16 @@

\end{vbframe}

\begin{vbframe}{Example: Ridge Regression}
Assume the data generating process $y=3x_{1} -2x_{2} +\epsilon $, where $\displaystyle \epsilon \sim N( 0,1)$. The true minimizer is given by $\theta ^{*} =( 3,-2)^{T}$.
%\begin{vbframe}{Example: Ridge Regression}
%Assume the data generating process $y=3x_{1} -2x_{2} +\epsilon $, where $\displaystyle \epsilon \sim N( 0,1)$. The true minimizer is given by $\theta ^{*} =( 3,-2)^{T}$.

\begin{figure}
\includegraphics[width=0.8\textwidth]{figure/lin_reg_l2.png}
\end{figure}
%\begin{figure}
%\includegraphics[width=0.8\textwidth]{figure/lin_reg_l2.png}
%\end{figure}

With increasing regularization, $\theta_{\textit{reg}}$ is pulled back to the origin.
%With increasing regularization, $\theta_{\textit{reg}}$ is pulled back to the origin.

\end{vbframe}
%\end{vbframe}


% \section{Ridge Regression}
@@ -64,6 +64,16 @@

\framebreak

Assume the data generating process $y=3x_{1} -2x_{2} +\epsilon $, where $\displaystyle \epsilon \sim N( 0,1)$. The true minimizer is given by $\theta ^{*} =( 3,-2)^{T}$.

\begin{figure}
\includegraphics[width=0.8\textwidth]{figure/lin_reg_l2.png}
\end{figure}

With increasing regularization, $\theta_{\textit{reg}}$ is pulled back to the origin.
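As an aside (not part of the commit), a small sketch of this shrinkage effect; the sample size, seed, and penalty grid are assumptions, and scikit-learn's `alpha` plays the role of $\lambda$:

```python
# Sketch: simulate y = 3*x1 - 2*x2 + eps, eps ~ N(0,1), and fit ridge
# regression with increasing penalty; the estimates shrink toward (0, 0).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)   # theta* = (3, -2)

for lam in [0.01, 1.0, 10.0, 100.0, 1000.0]:
    theta_reg = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
    print(f"lambda = {lam:7.2f}: theta_reg = {np.round(theta_reg, 3)}")
# The printed estimates move from roughly (3, -2) toward the origin as lambda grows.
```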

\framebreak

We understand the geometry of the two components mixed in our regularized risk objective (empirical risk and penalty) much better if we formulate the optimization as a constrained problem (think of this as Lagrange multipliers in reverse).

\vspace{-0.5cm}
@@ -80,7 +90,7 @@
\end{figure}

\begin{footnotesize}
NB: Relationship between $\lambda$ and $t$ will be explained later.
NB: There is a bijective relationship between $\lambda$ and $t$: $\, \lambda \uparrow \,\, \Rightarrow \,\, t \downarrow$ and vice versa.
\end{footnotesize}
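For orientation (a standard reformulation, not text from the hidden part of this diff), the correspondence between $\lambda$ and $t$ can be sketched as follows; the empirical-risk notation $\mathcal{R}_{\text{emp}}$ is an assumption, since the slides' own risk macro is not visible in this hunk:

```latex
% Sketch: penalized problem, one instance per value of \lambda \geq 0
\[ \min_{\thetab} \; \mathcal{R}_{\text{emp}}(\thetab) + \lambda \, J(\thetab) \]
% Equivalent constrained problem, with a matching budget t = t(\lambda);
% larger \lambda corresponds to a smaller budget t, and vice versa.
\[ \min_{\thetab} \; \mathcal{R}_{\text{emp}}(\thetab) \quad \text{s.t.} \quad J(\thetab) \leq t \]
```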

\framebreak
slides/regularization/slides-regu-l1vsl2.tex: 8 changes (4 additions & 4 deletions)

@@ -75,8 +75,8 @@

\begin{itemize}
\item Typically we omit $\theta_0$ in the penalty term $J(\thetab)$ so that the ``infinitely'' regularized model is the constant model (but this can be implementation-dependent).
\item Penalty methods are typically not equivariant under scaling of the inputs, so one usually standardizes the features beforehand.
\item Note a normal LM has the inductive bias of rescaling equivariance, i.e., if you scale some features, we can simply "anti-scale" the coefficients the same way. The risk does not change.
\item Note that the unregularized LM has the inductive bias of \textbf{rescaling equivariance}: if we scale some features, we can simply ``anti-scale'' the coefficients and the risk does not change.
\item Penalty methods are typically not equivariant under rescaling of the inputs, so one usually standardizes the features beforehand.
\item While regularized LMs exhibit a low-complexity inductive bias, they lose this equivariance property: if we down-scale features, the coefficients have to become larger to compensate. They are then penalized more strongly in $J(\thetab)$, making some features less attractive without any relevant change in the data (see the sketch below).

% \item While ridge regression usually leads to smaller estimated coefficients, but still dense $\thetab$ vectors,
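A minimal sketch of the (non-)equivariance point referenced above; the simulated data, the rescaling factor, and the ridge penalty strength are assumptions:

```python
# OLS is equivariant under feature rescaling (the coefficient simply anti-scales),
# while ridge regression with a fixed penalty is not.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

X_scaled = X.copy()
X_scaled[:, 0] *= 1000.0              # e.g. measure feature 1 in different units

for Model in (LinearRegression, Ridge):
    m_orig = Model().fit(X, y)
    m_resc = Model().fit(X_scaled, y)
    unchanged = np.allclose(m_orig.predict(X), m_resc.predict(X_scaled))
    print(f"{Model.__name__}: predictions unchanged after rescaling: {unchanged}")
# LinearRegression: True  (the coefficient of feature 1 is simply divided by 1000)
# Ridge:            False (the rescaled coefficient is penalized differently by J)
```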
@@ -140,8 +140,8 @@
\begin{vbframe}{Summarizing Comments}

\begin{itemize}
\item Neither one can be classified as overall better.
\item Lasso is likely better if the true underlying structure is sparse, so if only few features influence $y$. Ridge works well if there are many (weakly) influential features.
\item Neither one can be classified as overall better (no free lunch!).
\item Lasso is likely better if the true underlying structure is sparse, i.e., if only a few features influence $y$. Ridge works well if there are many (weakly) influential features (see the sketch below).
\item Lasso can set some coefficients to zero, thus performing variable selection, while Ridge regression usually leads to smaller estimated coefficients, but still dense parameter vectors $\thetab$.
\item Lasso has difficulties handling correlated predictors. For high correlation, Ridge dominates Lasso in performance.
\item With Lasso, one of the correlated predictors will get a large coefficient, while the rest are (nearly) zeroed out. Which of them is selected is, however, more or less arbitrary.
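A small simulation sketch of the sparse-vs-dense contrast summarized above (the data-generating setup and the penalty values are assumptions):

```python
# Only 3 of 20 features truly influence y; compare the fitted coefficient
# patterns of Lasso (sparse) and Ridge (dense but shrunken).
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n, p = 100, 20
X = StandardScaler().fit_transform(rng.normal(size=(n, p)))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]                  # sparse ground truth
y = X @ beta + rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
print("Lasso nonzero coefficients:", int(np.sum(lasso.coef_ != 0.0)), "of", p)
print("Ridge nonzero coefficients:", int(np.sum(ridge.coef_ != 0.0)), "of", p)
# Lasso typically zeroes out many of the irrelevant features (variable selection),
# while Ridge keeps all p coefficients nonzero, only shrunken toward zero.
```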
