diff --git a/slides/information-theory/figure_man/kl_ml_dist_sim.png b/slides/information-theory/figure_man/kl_ml_dist_sim.png
new file mode 100644
index 00000000..c6ff29eb
Binary files /dev/null and b/slides/information-theory/figure_man/kl_ml_dist_sim.png differ
diff --git a/slides/information-theory/figure_man/kl_ml_fkl_rkl.png b/slides/information-theory/figure_man/kl_ml_fkl_rkl.png
new file mode 100644
index 00000000..93954696
Binary files /dev/null and b/slides/information-theory/figure_man/kl_ml_fkl_rkl.png differ
diff --git a/slides/information-theory/figure_man/kl_ml_mi.png b/slides/information-theory/figure_man/kl_ml_mi.png
new file mode 100644
index 00000000..690689de
Binary files /dev/null and b/slides/information-theory/figure_man/kl_ml_mi.png differ
diff --git a/slides/information-theory/figure_man/kl_ml_prob_fit.png b/slides/information-theory/figure_man/kl_ml_prob_fit.png
new file mode 100644
index 00000000..d2483602
Binary files /dev/null and b/slides/information-theory/figure_man/kl_ml_prob_fit.png differ
diff --git a/slides/information-theory/figure_man/kl_ml_vi.png b/slides/information-theory/figure_man/kl_ml_vi.png
new file mode 100644
index 00000000..5ab7d436
Binary files /dev/null and b/slides/information-theory/figure_man/kl_ml_vi.png differ
diff --git a/slides/information-theory/slides-info-diffent.tex b/slides/information-theory/slides-info-diffent.tex
index ab2a4ef8..9af77476 100644
--- a/slides/information-theory/slides-info-diffent.tex
+++ b/slides/information-theory/slides-info-diffent.tex
@@ -21,7 +21,7 @@
 \begin{itemize}
 \item For a continuous random variable $X$ with density function $f(x)$ and support $\Xspace$, the analogue of entropy is \textbf{differential entropy}:
 \small{
- $$ h(X) := h(f) := - \int_{\Xspace} f(x) \log(f(x)) dx $$}
+ $$ h(X) := h(f) := - \mathbb{E}[\log(f(X))] = - \int_{\Xspace} f(x) \log(f(x)) dx $$}
 \item The base of the log is again somewhat arbitrary, and we could either use 2 (and measure in bits) or e (to measure in nats).
 \item The integral above does not necessarily exist for all densities.
 \item Differential entropy lacks the non-negativeness of discrete entropy: $h(X) < 0$ is possible as $f(x) > 1$ is possible:
@@ -134,6 +134,9 @@
 \item $h(aX) = h(X) + \log |a|$.
 \item $h(AX) = h(X) + \log |A|$ for random vectors and matrix A.
 \end{enumerate}
+\lz
+Properties 3) and 4) are slightly more involved to prove, while the other properties are relatively straightforward to show.
+
 \end{vbframe}

 \endlecture
diff --git a/slides/information-theory/slides-info-entropy.tex b/slides/information-theory/slides-info-entropy.tex
index 4d728b8c..5fd1aded 100644
--- a/slides/information-theory/slides-info-entropy.tex
+++ b/slides/information-theory/slides-info-entropy.tex
@@ -41,16 +41,16 @@
 \item We will show some proofs, but not for everything. We recommend \textit{Elements of Information Theory} by Cover and Thomas as a reference for more.
 \item The application of information theory to the concepts of statistics and ML can sometimes be confusing, we will try to make the connection as clear as possible.
+ \item In this unit we develop entropy as a measure of uncertainty in terms of expected information.
 \end{itemize}
 \end{vbframe}

 \begin{vbframe}{Entropy}

-\begin{itemize}
- \item We develop in this unit entropy as a measure of uncertainty in terms of expected information.
+%\begin{itemize}
 %\item Entropy is often introduced in IT as a measure of
 % expected information or in terms of bits needed for efficient coding,
 %but for us in stats and ML the first type of intuition seems most useful.
-\end{itemize}
+%\end{itemize}

 For a discrete random variable $X$ with domain $\Xspace \ni x$ and pmf $p(x)$:
@@ -60,20 +60,20 @@
 &= \E\left[\log_2\left(\frac{1}{p(X)}\right)\right]
 &= \sum_{x \in \Xspace} p(x) \log_2 \frac{1}{p(x)}
 \end{aligned}
 \end{equation*}
- \begin{itemize}
- \item \textbf{Definition:}
-Base $2$ means the information is measured in bits, but you can use any number $>1$ as base of the logarithm.
- \item \textbf{Note:} If $p(x) = 0$, then $p(x) \log_2 p(x)$ is taken to be zero, because $\lim _{p \rightarrow 0} p \log_2 p=0$. %for $x=0$.
- \item NB: $H$ is actually Greek capital letter \textbf{E}ta ($\eta$) for \textbf{e}ntropy
- \end{itemize}
+
+\begin{center}
+\includegraphics[width = 11cm ]{figure/entropy_calc.png}
+\end{center}
+\vspace{-0.5cm}
+\begin{itemize}
+\item The final entropy is $H(X)=1.5$.
+\end{itemize}
+
 \end{vbframe}

 \begin{vbframe}{Entropy Calculation}

- \begin{itemize}
- \item The negative log probabilities $\log_2 p(x)$ are called "Surprisal".
- \end{itemize}

 \begin{equation*}
 \begin{aligned}
@@ -81,13 +81,16 @@
 \end{aligned}
 \end{equation*}

-\begin{center}
-\includegraphics[width = 12cm ]{figure/entropy_calc.png} \\
-\end{center}
+\begin{itemize}
+\setlength\itemsep{1.2em}
+\item \textbf{Definition:}
+Base $2$ means the information is measured in bits, but you can use any number $>1$ as base of the logarithm.
+\item \textbf{Note:} If $p(x) = 0$, then $p(x) \log_2 p(x)$ is taken to be zero, because $\lim _{p \rightarrow 0} p \log_2 p=0$. %for $x=0$.
+\item NB: $H$ is actually the Greek capital letter \textbf{E}ta ($\eta$) for \textbf{e}ntropy.
+\item The negative log probabilities $-\log_2 p(x)$ are called ``surprisal''.
+\end{itemize}
+

- \begin{itemize}
- \item The final entropy is $H(X)=1.5$.
- \end{itemize}
 \end{vbframe}
diff --git a/slides/information-theory/slides-info-kl-ment.tex b/slides/information-theory/slides-info-kl-ment.tex
index a361dd01..cb43367c 100644
--- a/slides/information-theory/slides-info-kl-ment.tex
+++ b/slides/information-theory/slides-info-kl-ment.tex
@@ -35,7 +35,7 @@
 \lz
 Let $\mathcal{X}$ be a measurable space with $\sigma$-algebra $\mathcal{F}$ and measure $\mu$ that can be continuous or discrete. \\
 We start with a prior distribution $q$ over $\mathcal{X}$ dominated by $\mu$ and a constraint of the form $$\int_D a(\xv) dq(\xv) = c \in \R$$
- with $D \in \mathcal{F}.$
+ with $D \in \mathcal{F}.$ Note that the constraint function $a(\xv)$ is analogous to the moment condition functions $g(\cdot)$ in the discrete case.
 We want to update the prior distribution $q$ to a posterior distribution $p$ that fulfills the constraint and is maximal w.r.t. $S(p).$ \\
 \lz
 For this maximization to make sense, $S$ must be transitive, i.e.,
@@ -44,9 +44,9 @@
 \begin{vbframe}{Constructing the KL}
 \textbf{1) Locality} \\
 The constraint must only update the prior distribution in $D, i.e.,$ the region where it is active.
 \\
-
+\includegraphics[width=0.3\linewidth]{slides/information-theory/figure_man/kl_me_constraint.png} \\
 \lz
- For this, it can be shown that the non-overlapping domains of $\mathal{X}$ must contribute additively to the entropy, i.e.,
+ For this, it can be shown that the non-overlapping domains of $\mathcal{X}$ must contribute additively to the entropy, i.e.,
 $$S(p) = \int F(p(\xv), \xv) d\mu(\xv)$$
 where $F$ is an unknown function.
@@ -54,8 +54,7 @@

 \textbf{2) Invariance to coordinate system} \\
 \lz
- TODO: image \\
- \lz
+ \includegraphics[width=0.5\linewidth]{slides/information-theory/figure_man/kl_me_cosy.png} \\
 Enforcing 2) results in
 $$S(p) = \int \bm{\Phi}\left(\frac{dp}{dm}(\xv)\right)dm(\xv)$$
 where $\bm{\Phi}$ is an unknown function, $m$ is another measure on $\mathcal{X}$ dominated by $\mu$ and $\frac{dp}{dm}$ the Radon–Nikodym derivative which becomes
@@ -70,8 +69,7 @@
 \\ $\Rightarrow m$ must be the prior distribution $q$, and our entropy measure must be understood relatively to this prior, so $S(p)$ becomes, in fact, $S(p\|q).$\\
 \lz
 \textbf{3) Independent subsystems} \\
- TODO: image \\
- \lz
+ \includegraphics[width=0.6\linewidth]{slides/information-theory/figure_man/kl_me_indep_sub.png} \\
 If the prior distribution defines a subsystem of $\mathcal{X}$ to be independent, then the priors can be independently updated, and the resulting posterior is just their product density.

 \framebreak
@@ -85,7 +83,7 @@
 \item With our desired properties, we ended up with KL minimization
 \item This is called the principle of minimum discrimination information, i.e., the posterior should differ from the prior as least as possible
 \item This principle is meaningful for continuous and discrete RVs
- \item Maximum entropy is just a special case when $\mathcal{X}$ is discrete and $q$ is the uniform distribution.
+ \item The maximum entropy principle is just a special case when $\mathcal{X}$ is discrete and $q$ is the uniform distribution.
 \item Analogously, Shannon entropy can always be treated as negative KL with uniform reference distribution.
 \end{itemize}

diff --git a/slides/information-theory/slides-info-kl-ml.tex b/slides/information-theory/slides-info-kl-ml.tex
index cb240697..9fa5d05b 100644
--- a/slides/information-theory/slides-info-kl-ml.tex
+++ b/slides/information-theory/slides-info-kl-ml.tex
@@ -22,26 +22,7 @@
 \begin{vbframe} {Measuring Distribution Similarity in ML}
 \begin{itemize}
 \item Information theory provides tools (e.g., divergence measures) to quantify the similarity between probability distributions
-\begin{tikzpicture}
-    % Define parameters for the first Gaussian curve
-    \def\muA{0}
-    \def\sigmaA{1}
-    \def\scaleA{1.3}
-
-    % Define parameters for the second Gaussian curve
-    \def\muB{4}
-    \def\sigmaB{1}
-    \def\scaleB{1.3}
-
-    % Plot the first Gaussian curve
-    \draw[domain=-3:3, smooth, samples=100, variable=\x, blue] plot ({\x}, {\scaleA*exp(-(\x-\muA)^2/(2*\sigmaA^2))});
-
-    % Plot the second Gaussian curve
-    \draw[domain=1:7, smooth, samples=100, variable=\x, red] plot ({\x}, {\scaleB*exp(-(\x-\muB)^2/(2*\sigmaB^2))});
-
-    % Add a question mark symbol above the curves
-    \node at (2, 1.5) {?};
-\end{tikzpicture}
+\includegraphics[width=0.4\linewidth]{slides/information-theory/figure_man/kl_ml_dist_sim.png}
 \item The most prominent divergence measure is the KL divergence
 \item In ML, measuring (and maximizing) the similarity between probability distributions is a ubiquitous concept, which will be shown in the following.
 \end{itemize}
@@ -50,11 +31,7 @@

 \item \textbf{Probabilistic model fitting}\\
 Assume our learner is probabilistic, i.e., we model $p(y| \mathbf{x})$ for example (for example, ridge regression, logistic regression, ...).
-\lz
-
-TODO: picture
-
-\lz
+\includegraphics[width=0.4\linewidth]{slides/information-theory/figure_man/kl_ml_prob_fit.png}

 We want to minimize the difference between $p(y \vert \mathbf{x})$ and the conditional data generating process $\mathbb{P}_{y\vert\mathbf{x}}$ based on the data stemming from $\mathbb{P}_{y, \mathbf{x}}.$
@@ -68,13 +45,9 @@
 \begin{itemize}
 \item \textbf{Feature selection}

-In feature selection, we want to select features that the target strongly depends on.
-
-\lz
+In feature selection, we want to choose features that the target strongly depends on.

-TODO: picture
-
-\lz
+\includegraphics[width=0.6\linewidth]{slides/information-theory/figure_man/kl_ml_mi.png}

 We can measure dependency by measuring the similarity between $p(\mathbf{x}, y)$ and $p(\mathbf{x})\cdot p(y).$ \\
 We will later see that measuring this similarity with KL leads to the concept of mutual information.
@@ -87,11 +60,7 @@
 \item \textbf{Variational inference (VI)}
 Our data can also induce probability distributions: By Bayes' theorem it holds that the posterior density $$p(\bm{\theta}\vert \mathbf{X}, \mathbf{y}) = \frac{p(\mathbf{y}|\mathbf{X}, \bm{\theta})p(\bm{\theta})}{\int p(\mathbf{y}|\mathbf{X}, \bm{\theta})p(\bm{\theta})d\bm{\theta}}.$$ However, computing this density analytically is usually intractable.
-\lz
-
-TODO: picture
-
-\lz
+\includegraphics[width=0.99\linewidth]{slides/information-theory/figure_man/kl_ml_vi.png}

 In VI, we want to fit a density $q_{\bm{\phi}}$ with parameters $\bm{\phi}$ to $p(\bm{\theta}\vert \mathbf{X}, \mathbf{y}).$
@@ -153,8 +122,7 @@
 \end{itemize}

 \framebreak
-TODO: image
-\lz \\ \lz \\
+\includegraphics[width=0.6\linewidth]{slides/information-theory/figure_man/kl_ml_fkl_rkl.png} \\
 The asymmetry of the KL has the following implications
 \begin{itemize}
 \item The forward KL $D_{KL}(p\|q_{\bm{\phi}}) = \E_{\xv \sim p} \log\left(\frac{p(\xv)}{q_{\bm{\phi}}(\xv)}\right)$ is mass-covering since $p(\xv)\log\left(\frac{p(\xv)}{q_{\bm{\phi}}(\xv)}\right) \approx 0$ if $p(\xv) \approx 0$ (as long as both distribution do not extremely differ)
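
Worked example for the differential entropy properties in slides-info-diffent.tex: a minimal Python sketch, assuming numpy is available and using the standard closed form h = 0.5*log(2*pi*e*sigma^2) nats for a Gaussian N(0, sigma^2); the numbers are purely illustrative. It checks the two claims from the slides that differential entropy can be negative and that h(aX) = h(X) + log|a|.

import numpy as np

def h_gauss(sigma):
    # closed-form differential entropy of N(0, sigma^2) in nats
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

print(h_gauss(1.0))    # about 1.419 nats
print(h_gauss(0.1))    # about -0.884 nats: differential entropy can be negative
a = 3.0
# scaling property: h(aX) = h(X) + log|a|; here X ~ N(0, 1), so aX ~ N(0, a^2)
print(h_gauss(a), h_gauss(1.0) + np.log(abs(a)))   # both about 2.518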
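Worked example for the entropy slides in slides-info-entropy.tex: entropy as expected surprisal, computed in Python (assuming numpy). The pmf p = (0.5, 0.25, 0.25) is an illustrative assumption; it is one distribution whose entropy is exactly the 1.5 bits quoted on the slide, not necessarily the example shown in figure/entropy_calc.png.

import numpy as np

# Entropy as expected surprisal: H(X) = sum_x p(x) * (-log2 p(x))
p = np.array([0.5, 0.25, 0.25])   # assumed illustrative pmf, sums to 1
surprisal = -np.log2(p)           # surprisal of each outcome, in bits
H = np.sum(p * surprisal)         # expected surprisal = entropy
print(surprisal)                  # [1. 2. 2.]
print(H)                          # 1.5 bits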
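Quick numeric check for the last bullet of the minimum-discrimination slide in slides-info-kl-ment.tex, assuming numpy and an arbitrary illustrative pmf: Shannon entropy equals log K minus the KL divergence to the uniform reference on K outcomes, so entropy is indeed a negative KL up to an additive constant.

import numpy as np

p = np.array([0.5, 0.3, 0.2])        # illustrative pmf with K = 3 outcomes
u = np.full_like(p, 1.0 / p.size)    # uniform reference distribution

H = -np.sum(p * np.log(p))           # Shannon entropy (nats)
kl_pu = np.sum(p * np.log(p / u))    # D_KL(p || u)
print(H, np.log(p.size) - kl_pu)     # identical: H(p) = log K - D_KL(p || u)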
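Sketch for the probabilistic model fitting slide in slides-info-kl-ml.tex (figure kl_ml_prob_fit.png). A standard identity, stated here as an assumption about where the slides go since the hunk itself does not spell it out, is D_KL(p || q_theta) = E_p[log p] - E_p[log q_theta], so minimizing the empirical forward KL over theta is the same as maximizing the empirical log-likelihood. Minimal Python illustration with a Gaussian model of fixed unit variance, where the likelihood-optimal mean is the sample mean.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=10_000)   # samples from the "data generating process"

def neg_log_lik(mu, x):
    # average negative log-likelihood of N(mu, 1); equals the empirical forward KL
    # up to an additive constant that does not depend on mu
    return np.mean(0.5 * np.log(2 * np.pi) + 0.5 * (x - mu) ** 2)

grid = np.linspace(0.0, 4.0, 401)
nll = [neg_log_lik(mu, x) for mu in grid]
print(grid[int(np.argmin(nll))])   # close to the sample mean ...
print(x.mean())                    # ... which is close to the true mean 2.0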
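Sketch for the feature selection slide (figure kl_ml_mi.png): mutual information is the KL divergence between the joint p(x, y) and the product of the marginals p(x)p(y). Tiny discrete example in Python, assuming numpy; the joint table is invented for illustration and all its entries are strictly positive, so no 0*log(0) handling is needed.

import numpy as np

# illustrative joint pmf of a binary feature X (rows) and binary target Y (columns)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of X
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of Y

# I(X; Y) = D_KL( p(x, y) || p(x) p(y) ), in bits
mi = np.sum(p_xy * np.log2(p_xy / (p_x * p_y)))
print(mi)   # > 0, so X carries information about Y; it is 0 iff X and Y are independent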
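Sketch for the forward/reverse KL discussion (figure kl_ml_fkl_rkl.png): a discrete toy computation, assuming numpy, showing that the KL divergence is asymmetric; the two pmfs are made up for illustration.

import numpy as np

def kl(p, q):
    # D_KL(p || q) = sum_x p(x) * log(p(x) / q(x)); terms with p(x) = 0 are treated as 0
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.70, 0.20, 0.10]   # "true" distribution (illustrative)
q = [0.40, 0.40, 0.20]   # approximating distribution (illustrative)
print(kl(p, q))          # forward KL,  D_KL(p || q)
print(kl(q, p))          # reverse KL,  D_KL(q || p), generally a different value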