diff --git a/slides/regularization/slides-regu-intro.tex b/slides/regularization/slides-regu-intro.tex
index 41d015a5..bc35be40 100644
--- a/slides/regularization/slides-regu-intro.tex
+++ b/slides/regularization/slides-regu-intro.tex
@@ -113,7 +113,7 @@
 
 \begin{vbframe}{Example II: Overfitting}
 
-We train a shallow neural network with one hidden layer and 100 hidden units as well as a SVM with RBF kernel on a small regression task. No form of explicit regularization is imposed on the models. %The target variable is house price.
+We train a shallow neural network with one hidden layer and 100 hidden units as well as an SVM with RBF kernel ($C = 10^6$, $\gamma = 10$) on a small regression task. No form of explicit regularization is imposed on the models. %The target variable is house price.
 \vspace{0.2cm}
 \begin{table}[ht]
 \centering
diff --git a/slides/regularization/slides-regu-l1l2.tex b/slides/regularization/slides-regu-l1l2.tex
index 76ee3881..79303b36 100644
--- a/slides/regularization/slides-regu-l1l2.tex
+++ b/slides/regularization/slides-regu-l1l2.tex
@@ -33,16 +33,16 @@
 
 \end{vbframe}
 
-\begin{vbframe}{Example: Ridge Regression}
-Assume the data generating process $y=3x_{1} -2x_{2} +\epsilon $, where $\displaystyle \epsilon \sim N( 0,1)$. The true minimizer is given by $\theta ^{*} =( 3,-2)^{T}$.
+%\begin{vbframe}{Example: Ridge Regression}
+%Assume the data generating process $y=3x_{1} -2x_{2} +\epsilon $, where $\displaystyle \epsilon \sim N( 0,1)$. The true minimizer is given by $\theta ^{*} =( 3,-2)^{T}$.
 
-\begin{figure}
-\includegraphics[width=0.8\textwidth]{figure/lin_reg_l2.png}
-\end{figure}
+%\begin{figure}
+%\includegraphics[width=0.8\textwidth]{figure/lin_reg_l2.png}
+%\end{figure}
 
-With increasing regularization, $\theta_{\textit{reg}}$ is pulled back to the origin.
+%With increasing regularization, $\theta_{\textit{reg}}$ is pulled back to the origin.
 
-\end{vbframe}
+%\end{vbframe}
 
 % \section{Ridge Regression}
 
@@ -64,6 +64,16 @@
 
 \framebreak
 
+Assume the data generating process $y=3x_{1} -2x_{2} +\epsilon $, where $\displaystyle \epsilon \sim N( 0,1)$. The true minimizer is given by $\theta ^{*} =( 3,-2)^{T}$.
+
+\begin{figure}
+\includegraphics[width=0.8\textwidth]{figure/lin_reg_l2.png}
+\end{figure}
+
+With increasing regularization, $\theta_{\textit{reg}}$ is pulled back to the origin.
+
+\framebreak
+
 We understand the geometry of these 2 mixed components in our regularized risk objective much better, if we formulate the optimization as a constrained problem (see this as Lagrange multipliers in reverse).
 
 \vspace{-0.5cm}
@@ -80,7 +90,7 @@
 \end{figure}
 
 \begin{footnotesize}
-NB: Relationship between $\lambda$ and $t$ will be explained later.
+NB: There is a bijective relationship between $\lambda$ and $t$: $\, \lambda \uparrow \,\, \Rightarrow \,\, t \downarrow$ and vice versa.
 \end{footnotesize}
 
 \framebreak
diff --git a/slides/regularization/slides-regu-l1vsl2.tex b/slides/regularization/slides-regu-l1vsl2.tex
index 5f8c30d9..4a898676 100644
--- a/slides/regularization/slides-regu-l1vsl2.tex
+++ b/slides/regularization/slides-regu-l1vsl2.tex
@@ -75,8 +75,8 @@
 
 \begin{itemize}
 \item Typically we omit $\theta_0$ in the penalty term $J(\thetab)$ so that the ``infinitely'' regularized model is the constant model (but this can be implementation-dependent).
-  \item Penalty methods are typically not equivariant under scaling of the inputs, so one usually standardizes the features beforehand.
-  \item Note a normal LM has the inductive bias of rescaling equivariance, i.e., if you scale some features, we can simply "anti-scale" the coefficients the same way. The risk does not change.
+  \item Note that an unregularized LM has the inductive bias of \textbf{rescaling equivariance}, i.e., if we scale some features, we can simply ``anti-scale'' the coefficients and the risk does not change.
+  \item Penalty methods are typically not equivariant under rescaling of the inputs, so one usually standardizes the features beforehand.
   \item While regularized LMs exhibit low-complexity inductive bias, they lose equivariance property: if you down-scale features, coefficients have to become larger to counteract. Then they are penalized stronger in $J(\thetab)$, making some features less attractive without relevant changes in data.
 % \item While ridge regression usually leads to smaller estimated coefficients, but still dense $\thetab$ vectors,
 
@@ -140,8 +140,8 @@
 \begin{vbframe}{Summarizing Comments}
 
 \begin{itemize}
-\item Neither one can be classified as overall better.
-\item Lasso is likely better if the true underlying structure is sparse, so if only few features influence $y$. Ridge works well if there are many (weakly) influential features.
+\item Neither one can be classified as overall better (no free lunch!).
+\item Lasso is likely better if the true underlying structure is sparse, i.e., if only a few features influence $y$. Ridge works well if there are many (weakly) influential features.
 \item Lasso can set some coefficients to zero, thus performing variable selection, while Ridge regression usually leads to smaller estimated coefficients, but still dense parameter vectors $\thetab$.
 \item Lasso has difficulties handling correlated predictors. For high correlation Ridge dominates Lasso in performance.
 \item For Lasso one of the correlated predictors will have a larger coefficient, while the rest are (nearly) zeroed. The respective feature is, however, selected randomly.
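The two claims touched above (rescaling equivariance of the unregularized LM, and ridge estimates being pulled toward the origin as regularization grows) can be sanity-checked numerically. Below is a minimal Python sketch, not part of the slides: it assumes numpy and scikit-learn; the sample size, random seed, and the lambda grid are arbitrary choices, and scikit-learn's Ridge(alpha=...) takes the role of $\lambda$ while leaving the intercept unpenalized, consistent with the $\theta_0$ remark in the bullet list.

# Sketch: rescaling equivariance of OLS and ridge shrinkage on the
# DGP from the slides, y = 3*x1 - 2*x2 + eps with eps ~ N(0, 1).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)      # seed is an arbitrary choice
n = 200                             # sample size is an arbitrary choice
X = rng.normal(size=(n, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)

# Rescaling equivariance: scaling x1 by 10 simply divides its OLS coefficient
# by 10; the fitted values (and hence the empirical risk) are unchanged.
lm = LinearRegression().fit(X, y)
X_s = X.copy()
X_s[:, 0] *= 10.0
lm_s = LinearRegression().fit(X_s, y)
print(lm.coef_, lm_s.coef_)                            # approx (3, -2) vs. (0.3, -2)
print(np.allclose(lm.predict(X), lm_s.predict(X_s)))   # True

# Ridge shrinkage: theta_reg moves from approx. (3, -2) toward the origin
# as the regularization strength grows (alpha corresponds to lambda).
for lam in [0, 1, 10, 100, 1000]:
    print(lam, Ridge(alpha=lam).fit(X, y).coef_)

Penalized fits, in contrast, are not equivariant under such rescaling, which is exactly why the bullet above recommends standardizing features before applying Ridge or Lasso.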