
Commit

Merge overleaf-2023-11-22-0832 into main
ludwigbothmann authored Nov 22, 2023
2 parents 152a3fb + 63eec09 commit 4b57d88
Showing 9 changed files with 37 additions and 65 deletions.
(Five of the nine changed files could not be rendered by the diff viewer and are omitted here.)
5 changes: 4 additions & 1 deletion slides/information-theory/slides-info-diffent.tex
@@ -21,7 +21,7 @@
\begin{itemize}
\item For a continuous random variable $X$ with density function $f(x)$ and support $\Xspace$, the analogue of entropy is \textbf{differential entropy}:
\small{
$$ h(X) := h(f) := - \int_{\Xspace} f(x) \log(f(x)) dx $$}
$$ h(X) := h(f) := - \mathbb{E}[\log(f(X))] = - \int_{\Xspace} f(x) \log(f(x)) dx $$}
\item The base of the log is again somewhat arbitrary, and we could either use 2 (and measure in bits) or e (to measure in nats).
\item The integral above does not necessarily exist for all densities.
\item Differential entropy lacks the non-negativity of discrete entropy: $h(X) < 0$ is possible as $f(x) > 1$ is possible:
@@ -134,6 +134,9 @@
\item $h(aX) = h(X) + \log |a|$.
\item $h(AX) = h(X) + \log |\det A|$ for a random vector $X$ and an invertible matrix $A$.
\end{enumerate}
\lz
3) and 4) are slightly involved to prove, while the other properties are relatively straightforward to show; a quick numerical sanity check follows below.
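
As a quick numerical sanity check of these statements (not part of the slides; a minimal Python sketch assuming NumPy and SciPy are available, entropies in nats):

import numpy as np
from scipy import stats

# h(X) < 0 is possible: Uniform(0, 1/2) has density f(x) = 2 > 1 on its support,
# so h(X) = log(1/2) < 0 (in nats).
print(stats.uniform(loc=0.0, scale=0.5).entropy())          # approx -0.693

# Scaling property h(aX) = h(X) + log|a|, checked for X ~ N(0, 1) and a = 3,
# using that aX ~ N(0, a^2) has a closed-form differential entropy.
x_ent  = stats.norm(0.0, 1.0).entropy()
ax_ent = stats.norm(0.0, 3.0).entropy()
print(ax_ent, x_ent + np.log(3.0))                          # both approx 2.518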

\end{vbframe}

\endlecture
39 changes: 21 additions & 18 deletions slides/information-theory/slides-info-entropy.tex
@@ -41,16 +41,16 @@
\item We will show some proofs, but not for everything. We recommend
\textit{Elements of Information Theory} by Cover and Thomas as a reference for more.
\item The application of information theory to the concepts of statistics and ML can sometimes be confusing; we will try to make the connection as clear as possible.
\item In this unit we develop entropy as a measure of uncertainty in terms of expected information.
\end{itemize}
\end{vbframe}

\begin{vbframe}{Entropy}
\begin{itemize}
\item We develop in this unit entropy as a measure of uncertainty in terms of expected information.
%\begin{itemize}
%\item Entropy is often introduced in IT as a measure of
% expected information or in terms of bits needed for efficient coding,
%but for us in stats and ML the first type of intuition seems most useful.
\end{itemize}
%\end{itemize}


For a discrete random variable $X$ with domain $\Xspace \ni x$ and pmf $p(x)$:
@@ -60,34 +60,37 @@
&= \E\left[\log_2\left(\frac{1}{p(X)}\right)\right] &= \sum_{x \in \Xspace} p(x) \log_2 \frac{1}{p(x)}
\end{aligned}
\end{equation*}
\begin{itemize}
\item \textbf{Definition:}
Base $2$ means the information is measured in bits, but you can use any number $>1$ as base of the logarithm.
\item \textbf{Note:} If $p(x) = 0$, then $p(x) \log_2 p(x)$ is taken to be zero, because $\lim _{p \rightarrow 0} p \log_2 p=0$. %for $x=0$.
\item NB: $H$ is actually Greek capital letter \textbf{E}ta ($\eta$) for \textbf{e}ntropy
\end{itemize}

\begin{center}
\includegraphics[width = 11cm ]{figure/entropy_calc.png}
\end{center}
\vspace{-0.5cm}
\begin{itemize}
\item The final entropy is $H(X)=1.5$.
\end{itemize}


\end{vbframe}

\begin{vbframe}{Entropy Calculation}

\begin{itemize}
\item The negative log probabilities $-\log_2 p(x)$ are called "Surprisal".
\end{itemize}

\begin{equation*}
\begin{aligned}
H(X) = - \E[\log_2(p(X))] &= -\sum_{x \in \Xspace} p(x) \log_2 p(x)
\end{aligned}
\end{equation*}

\begin{center}
\includegraphics[width = 12cm ]{figure/entropy_calc.png} \\
\end{center}
\begin{itemize}
\setlength\itemsep{1.2em}
\item \textbf{Definition:}
Base $2$ means the information is measured in bits, but you can use any number $>1$ as base of the logarithm.
\item \textbf{Note:} If $p(x) = 0$, then $p(x) \log_2 p(x)$ is taken to be zero, because $\lim _{p \rightarrow 0} p \log_2 p=0$. %for $x=0$.
\item NB: $H$ is actually Greek capital letter \textbf{E}ta ($\eta$) for \textbf{e}ntropy
\item The negative log probabilities $-\log_2 p(x)$ are called "Surprisal".
\end{itemize}


\begin{itemize}
\item The final entropy is $H(X)=1.5$ (a worked recomputation follows after this frame).
\end{itemize}

\end{vbframe}
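
The figure itself is not reproduced in this diff, so as a hypothetical stand-in: the three-outcome distribution p = (1/2, 1/4, 1/4) is one example that yields exactly H(X) = 1.5 bits. A small Python sketch of the calculation (assuming NumPy):

import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # convention: 0 * log2(0) := 0
    return -np.sum(p * np.log2(p))

p = [0.5, 0.25, 0.25]                  # hypothetical example distribution
print(-np.log2(p))                     # surprisal of each outcome: 1, 2, 2 bits
print(entropy_bits(p))                 # expected surprisal = entropy = 1.5 bits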

14 changes: 6 additions & 8 deletions slides/information-theory/slides-info-kl-ment.tex
@@ -35,7 +35,7 @@
\lz
Let $\mathcal{X}$ be a measurable space with $\sigma$-algebra $\mathcal{F}$ and measure $\mu$ that can be continuous or discrete. \\
We start with a prior distribution $q$ over $\mathcal{X}$ dominated by $\mu$ and a constraint of the form $$\int_D a(\xv) dq(\xv) = c \in \R$$
with $D \in \mathcal{F}.$
with $D \in \mathcal{F}.$ Note that the constraint function $a(\xv)$ is analogous to moment condition functions $g(\cdot)$ in the discrete case.
We want to update the prior distribution $q$ to a posterior distribution $p$ that fulfills the constraint and is maximal w.r.t. $S(p).$ \\
\lz
For this maximization to make sense, $S$ must be transitive, i.e.,
@@ -44,18 +44,17 @@
\begin{vbframe}{Constructing the KL}
\textbf{1) Locality} \\
The constraint must only update the prior distribution in $D$, i.e., the region where it is active. \\

\includegraphics[width=0.3\linewidth]{slides/information-theory/figure_man/kl_me_constraint.png} \\
\lz
For this, it can be shown that the non-overlapping domains of $\mathal{X}$ must contribute additively to the entropy, i.e.,
For this, it can be shown that the non-overlapping domains of $\mathcal{X}$ must contribute additively to the entropy, i.e.,
$$S(p) = \int F(p(\xv), \xv) d\mu(\xv)$$
where $F$ is an unknown function.

\framebreak

\textbf{2) Invariance to coordinate system} \\
\lz
TODO: image \\
\lz
\includegraphics[width=0.5\linewidth]{slides/information-theory/figure_man/kl_me_cosy.png} \\
Enforcing 2) results in
$$S(p) = \int \bm{\Phi}\left(\frac{dp}{dm}(\xv)\right)dm(\xv)$$
where $\bm{\Phi}$ is an unknown function, $m$ is another measure on $\mathcal{X}$ dominated by $\mu$ and $\frac{dp}{dm}$ the Radon–Nikodym derivative which becomes
@@ -70,8 +69,7 @@
\\ $\Rightarrow m$ must be the prior distribution $q$, and our entropy measure must be understood relative to this prior, so $S(p)$ becomes, in fact, $S(p\|q).$\\
\lz
\textbf{3) Independent subsystems} \\
TODO: image \\
\lz
\includegraphics[width=0.6\linewidth]{slides/information-theory/figure_man/kl_me_indep_sub.png} \\
If the prior distribution defines a subsystem of $\mathcal{X}$ to be independent, then the priors can be independently updated, and the resulting posterior is just their product density.

\framebreak
@@ -85,7 +83,7 @@
\item With our desired properties, we ended up with KL minimization
\item This is called the principle of minimum discrimination information, i.e., the posterior should differ from the prior as little as possible
\item This principle is meaningful for continuous and discrete RVs
\item Maximum entropy is just a special case when $\mathcal{X}$ is discrete and $q$ is the uniform distribution.
\item The maximum entropy principle is just a special case when $\mathcal{X}$ is discrete and $q$ is the uniform distribution.
\item Analogously, Shannon entropy can always be treated as negative KL with a uniform reference distribution (a small numerical check follows below).
\end{itemize}
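
A small numerical check of the last point (not from the slides; a Python sketch assuming NumPy/SciPy): for a discrete p over n outcomes and the uniform reference u, D_KL(p || u) = log2(n) - H(p), so maximizing entropy is exactly minimizing the KL to the uniform distribution.

import numpy as np
from scipy.stats import entropy

p = np.array([0.5, 0.25, 0.125, 0.125])
u = np.full_like(p, 1.0 / p.size)       # uniform reference distribution

H  = entropy(p, base=2)                 # Shannon entropy: 1.75 bits
kl = entropy(p, u, base=2)              # KL divergence D_KL(p || u): 0.25 bits
print(H, np.log2(p.size) - kl)          # both equal 1.75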

44 changes: 6 additions & 38 deletions slides/information-theory/slides-info-kl-ml.tex
@@ -22,26 +22,7 @@
\begin{vbframe} {Measuring Distribution Similarity in ML}
\begin{itemize}
\item Information theory provides tools (e.g., divergence measures) to quantify the similarity between probability distributions
\begin{tikzpicture}
% Define parameters for the first Gaussian curve
\def\muA{0}
\def\sigmaA{1}
\def\scaleA{1.3}

% Define parameters for the second Gaussian curve
\def\muB{4}
\def\sigmaB{1}
\def\scaleB{1.3}

% Plot the first Gaussian curve
\draw[domain=-3:3, smooth, samples=100, variable=\x, blue] plot ({\x}, {\scaleA*exp(-(\x-\muA)^2/(2*\sigmaA^2))});

% Plot the second Gaussian curve
\draw[domain=1:7, smooth, samples=100, variable=\x, red] plot ({\x}, {\scaleB*exp(-(\x-\muB)^2/(2*\sigmaB^2))});

% Add a question mark symbol above the curves
\node at (2, 1.5) {?};
\end{tikzpicture}
\includegraphics[width=0.4\linewidth]{slides/information-theory/figure_man/kl_ml_dist_sim.png}
\item The most prominent divergence measure is the KL divergence
\item In ML, measuring (and maximizing) the similarity between probability distributions is a ubiquitous concept, which will be shown in the following.
\end{itemize}
@@ -50,11 +31,7 @@
\item \textbf{Probabilistic model fitting}\\
Assume our learner is probabilistic, i.e., we model $p(y| \mathbf{x})$ (for example, ridge regression, logistic regression, ...).

\lz

TODO: picture

\lz
\includegraphics[width=0.4\linewidth]{slides/information-theory/figure_man/kl_ml_prob_fit.png}

We want to minimize the difference between $p(y \vert \mathbf{x})$ and the conditional data generating process $\mathbb{P}_{y\vert\mathbf{x}}$ based on the data stemming from $\mathbb{P}_{y, \mathbf{x}}$ (a toy numerical illustration follows below).
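
A toy illustration of this point (not from the slides; a Python sketch with NumPy/SciPy, all numbers made up): the average negative log-likelihood over a large sample from the data-generating process estimates the cross-entropy, which differs from D_KL(DGP || model) only by the model-independent entropy of the DGP, so the likelihood-best model is also the KL-closest one.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=100_000)      # sample from the DGP N(2, 1)

for mu in [0.0, 1.0, 2.0]:                            # candidate models N(mu, 1)
    nll = -np.mean(stats.norm(mu, 1.0).logpdf(y))     # Monte Carlo cross-entropy
    kl_closed_form = (2.0 - mu) ** 2 / 2              # D_KL(N(2,1) || N(mu,1))
    print(mu, nll, kl_closed_form + stats.norm(2.0, 1.0).entropy())
    # the two printed values agree; mu = 2 minimizes both NLL and KL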

@@ -68,13 +45,9 @@

\begin{itemize}
\item \textbf{Feature selection}
In feature selection, we want to select features that the target strongly depends on.

\lz
In feature selection, we want to choose features the target strongly depends on.

TODO: picture

\lz
\includegraphics[width=0.6\linewidth]{slides/information-theory/figure_man/kl_ml_mi.png}

We can measure dependency by measuring the similarity between $p(\mathbf{x}, y)$ and $p(\mathbf{x})\cdot p(y).$ \\
We will later see that measuring this similarity with KL leads to the concept of mutual information (a small numerical example follows below).
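
A minimal numerical example of this idea (not from the slides; a Python sketch with NumPy, the joint table is made up): the KL divergence between a joint p(x, y) and the product of its marginals is exactly the mutual information I(X; Y), and it is zero iff X and Y are independent.

import numpy as np

pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])                     # hypothetical joint p(x, y)
px = pxy.sum(axis=1, keepdims=True)              # marginal p(x), shape (2, 1)
py = pxy.sum(axis=0, keepdims=True)              # marginal p(y), shape (1, 2)

mi = np.sum(pxy * np.log2(pxy / (px * py)))      # D_KL(p(x,y) || p(x)p(y)) in bits
print(mi)                                        # approx 0.28 > 0: X and Y are dependent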
@@ -87,11 +60,7 @@
\item \textbf{Variational inference (VI)}
Our data can also induce probability distributions: By Bayes' theorem it holds that the posterior density $$p(\bm{\theta}\vert \mathbf{X}, \mathbf{y}) = \frac{p(\mathbf{y}|\mathbf{X}, \bm{\theta})p(\bm{\theta})}{\int p(\mathbf{y}|\mathbf{X}, \bm{\theta})p(\bm{\theta})d\bm{\theta}}.$$ However, computing this density analytically is usually intractable.

\lz

TODO: picture

\lz
\includegraphics[width=0.99\linewidth]{slides/information-theory/figure_man/kl_ml_vi.png}

In VI, we want to fit a density $q_{\bm{\phi}}$ with parameters $\bm{\phi}$ to
$p(\bm{\theta}\vert \mathbf{X}, \mathbf{y})$ (a stripped-down sketch follows below).
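
A stripped-down sketch of this matching problem (not from the slides; a Python sketch with NumPy, all numbers hypothetical): here the "posterior" is deliberately a known 1D Gaussian so that the reverse KL to a Gaussian candidate q has a closed form and a grid search recovers it; in real VI the posterior is intractable and one instead maximizes the ELBO with stochastic optimization.

import numpy as np

post_mu, post_sd = 1.2, 0.7               # stand-in for the (normally intractable) posterior

def kl_q_p(m, s):                          # closed-form D_KL( N(m, s^2) || N(post_mu, post_sd^2) )
    return np.log(post_sd / s) + (s**2 + (m - post_mu)**2) / (2 * post_sd**2) - 0.5

M, S = np.meshgrid(np.linspace(-2.0, 3.0, 201), np.linspace(0.1, 2.0, 191))
kl = kl_q_p(M, S)
best = np.unravel_index(np.argmin(kl), kl.shape)
print(M[best], S[best])                    # approx (1.2, 0.7): q_phi matches the posterior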
@@ -153,8 +122,7 @@
\end{itemize}
\framebreak

TODO: image
\lz \\ \lz \\
\includegraphics[width=0.6\linewidth]{slides/information-theory/figure_man/kl_ml_fkl_rkl.png} \\
The asymmetry of the KL has the following implications:
\begin{itemize}
\item The forward KL $D_{KL}(p\|q_{\bm{\phi}}) = \E_{\xv \sim p} \log\left(\frac{p(\xv)}{q_{\bm{\phi}}(\xv)}\right)$ is mass-covering since $p(\xv)\log\left(\frac{p(\xv)}{q_{\bm{\phi}}(\xv)}\right) \approx 0$ if $p(\xv) \approx 0$ (as long as the two distributions do not differ too extremely; a numerical comparison follows below)
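
A small numerical comparison of the two directions (not from the slides; a Python sketch with NumPy/SciPy, densities discretized on a grid): for a bimodal p and single-Gaussian candidates q, the forward KL prefers a wide, mass-covering q, while the reverse KL prefers a narrow q sitting on one mode.

import numpy as np
from scipy import stats

x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
p = 0.5 * stats.norm(-3, 1).pdf(x) + 0.5 * stats.norm(3, 1).pdf(x)   # bimodal target

def kl(a, b):                              # grid approximation of D_KL(a || b)
    return np.sum(a * np.log(a / b)) * dx

q_wide   = stats.norm(0, 3.2).pdf(x)       # covers both modes
q_narrow = stats.norm(3, 1.0).pdf(x)       # covers only one mode

print(kl(p, q_wide), kl(p, q_narrow))      # forward KL: wide q is far better (mass-covering)
print(kl(q_wide, p), kl(q_narrow, p))      # reverse KL: narrow q wins (mode-seeking)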
