# Theory - EM

The source for most of the theory here is *Machine Learning: A Probabilistic Perspective* by Kevin P. Murphy {cite}`murphy2012`.

Let us represent all the parameters collectively as $\boldsymbol \theta$:

$$
\boldsymbol \theta = \{\boldsymbol \pi, \boldsymbol \mu, \boldsymbol \Sigma\}
$$

## The Need for EM

For computing the Maximum Likelihood Estimate (MLE) or the Maximum A Posteriori (MAP) estimate, one approach is to use a generic gradient-based optimizer to find a local minimum of the negative log likelihood:

$$
\text{NLL}(\boldsymbol \theta) = - \frac{1}{N} \log P(\boldsymbol X | \boldsymbol \theta)
$$

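For a Gaussian mixture, $P(\boldsymbol x_n | \boldsymbol \theta) = \sum_k \pi_k \mathcal{N}(\boldsymbol x_n | \boldsymbol \mu_k, \boldsymbol \Sigma_k)$, so the NLL can be evaluated directly. Below is a minimal sketch (not from the book), assuming NumPy arrays `X` of shape `(N, D)`, `pi` of shape `(K,)`, `mus` of shape `(K, D)`, and `Sigmas` of shape `(K, D, D)`:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_nll(X, pi, mus, Sigmas):
    """Average negative log likelihood: -(1/N) * sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    N, K = X.shape[0], pi.shape[0]
    weighted = np.zeros((N, K))
    for k in range(K):
        # pi_k * N(x_n | mu_k, Sigma_k), evaluated for every data point at once
        weighted[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    return -np.mean(np.log(weighted.sum(axis=1)))
```
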
However, there are constraints that need to be enforced:
- Covariance matrices must be positive semi-definite
- Mixing proportions $\pi_k$ must sum to one

Enforcing these constraints in a generic optimizer can be tricky, so it is often simpler to use an iterative algorithm like Expectation-Maximization (EM).

EM alternates between inferring the missing values given the current parameters (the E step) and optimizing the parameters given the "filled in" data (the M step).

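Schematically, the alternation looks like the following sketch (not the book's code); `e_step` and `m_step` are hypothetical callables, concrete versions of which are sketched later on this page:

```python
def em(theta, e_step, m_step, n_iters=100):
    """Generic EM skeleton: alternate the E and M steps for a fixed number of iterations."""
    for _ in range(n_iters):
        latent = e_step(theta)   # E step: infer the missing values given the current parameters
        theta = m_step(latent)   # M step: optimize the parameters given the "filled in" data
    return theta
```

In practice one would also monitor the log likelihood, which EM never decreases, and stop once it stabilizes.
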
## EM in GMM

### Initialization

We first initialize our parameters. We can use the results of K-Means clustering as a starting point.

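A minimal sketch of such an initializer (not from the book), assuming scikit-learn is available and that `X` is an `(N, D)` NumPy array; the helper name `init_from_kmeans` is made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def init_from_kmeans(X, K, seed=0):
    """Derive starting values of pi, mu and Sigma from a K-Means clustering of X."""
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)
    pi = np.array([np.mean(labels == k) for k in range(K)])                      # cluster fractions
    mus = np.array([X[labels == k].mean(axis=0) for k in range(K)])              # cluster centers
    Sigmas = np.array([np.cov(X[labels == k], rowvar=False) for k in range(K)])  # within-cluster covariances
    return pi, mus, Sigmas
```
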
### The E Step

Consider the auxiliary function, the expected complete-data log likelihood:

$$
Q(\boldsymbol \theta^*, \boldsymbol \theta) = \mathbb{E}[\log P(\boldsymbol X, \boldsymbol Z | \boldsymbol \theta^*)] = \sum_{\boldsymbol Z} P(\boldsymbol Z | \boldsymbol X, \boldsymbol \theta) \log P(\boldsymbol X, \boldsymbol Z | \boldsymbol \theta^*)
$$

Here, $P(\boldsymbol Z | \boldsymbol X, \boldsymbol \theta)$ is just the responsibility $\gamma(z_{nk})$, as seen in Equation {eq}`responsibilities`.

For the other factor, use Equations {eq}`P(z)` and {eq}`P(x|z)`:

$$
\begin{aligned}
P(\boldsymbol X, \boldsymbol Z | \boldsymbol \theta^*) &= P(\boldsymbol X | \boldsymbol Z, \boldsymbol \theta^*) P(\boldsymbol Z | \boldsymbol \theta^*) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N} (\boldsymbol x_n | \boldsymbol \mu_k, \boldsymbol \Sigma_k)^{z_{nk}} \pi_k^{z_{nk}} \\ \\
\implies \log P(\boldsymbol X, \boldsymbol Z | \boldsymbol \theta^*) &= \sum_{n=1}^N \sum_{k=1}^K z_{nk} \left[ \log \pi_k + \log \mathcal{N}(\boldsymbol x_n | \boldsymbol \mu_k, \boldsymbol \Sigma_k) \right]
\end{aligned}
$$

Since the latent indicators are one-hot (exactly one $z_{nk}$ equals 1 for each data point), taking the expectation under $P(\boldsymbol Z | \boldsymbol X, \boldsymbol \theta)$ simply replaces each $z_{nk}$ with $\gamma(z_{nk})$, and the final auxiliary function becomes:

```{math}
:label: the-auxiliary-function
Q(\boldsymbol \theta^*, \boldsymbol \theta) = \sum_{n=1}^N \sum_{k=1}^K \gamma(z_{nk}) \left[ \log \pi_k + \log \mathcal{N}(\boldsymbol x_n | \boldsymbol \mu_k, \boldsymbol \Sigma_k) \right]
```

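Equation {eq}`the-auxiliary-function` can be evaluated directly, which is handy for checking that each M step actually increases $Q$. A minimal sketch (not from the book), assuming `X` is `(N, D)`, `gamma` is `(N, K)`, and `pi`, `mus`, `Sigmas` have shapes `(K,)`, `(K, D)`, `(K, D, D)`:

```python
import numpy as np
from scipy.stats import multivariate_normal

def auxiliary_q(X, gamma, pi, mus, Sigmas):
    """Q(theta*, theta) = sum_n sum_k gamma[n, k] * (log pi_k + log N(x_n | mu_k, Sigma_k))."""
    K = pi.shape[0]
    log_terms = np.zeros_like(gamma)
    for k in range(K):
        log_terms[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mus[k], cov=Sigmas[k])
    return np.sum(gamma * log_terms)
```
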
The E step is about computing the missing values. In practice, however, we only need to compute the responsibilities $\gamma(z_{nk})$ in the E step; the auxiliary function in its full form is needed only to derive the results used in the M step.

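A minimal sketch of this computation (not from the book), using the usual form of the responsibilities, $\gamma(z_{nk}) = \pi_k \mathcal{N}(\boldsymbol x_n | \boldsymbol \mu_k, \boldsymbol \Sigma_k) / \sum_j \pi_j \mathcal{N}(\boldsymbol x_n | \boldsymbol \mu_j, \boldsymbol \Sigma_j)$, and the same assumed array shapes as above:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mus, Sigmas):
    """E step: gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    N, K = X.shape[0], pi.shape[0]
    gamma = np.zeros((N, K))
    for k in range(K):
        gamma[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    gamma /= gamma.sum(axis=1, keepdims=True)  # normalize each row over the K components
    return gamma
```
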
### The M Step

The M step uses the data "filled in" during the E step and the current parameters $\boldsymbol \theta$ to compute revised parameters $\boldsymbol \theta^*$ such that:

$$
\boldsymbol \theta^* = \underset{\boldsymbol \theta^*}{\text{argmax}}\ Q(\boldsymbol \theta^*, \boldsymbol \theta)
$$

For $\boldsymbol \pi$, we have:

```{math}
:label: optimal-pi
\pi_k^* = \frac{1}{N} \sum_{n=1}^{N} \gamma(z_{nk})
```

For the revised means and covariances, we compute the partial derivatives of $Q$ with respect to these parameters and set them to zero. One can show that the new parameter estimates are given by:

```{math}
:label: optimal-mu
\boldsymbol \mu_k^* = \frac{\sum_{n=1}^N \gamma(z_{nk}) \boldsymbol x_n}{\sum_{n=1}^N \gamma(z_{nk})}
```

```{math}
:label: optimal-sigma
\boldsymbol \Sigma_k^* = \frac{\sum_{n=1}^N \gamma(z_{nk}) (\boldsymbol x_n - \boldsymbol \mu_k^*) (\boldsymbol x_n - \boldsymbol \mu_k^*)^\intercal}{\sum_{n=1}^N \gamma(z_{nk})}
```

Computing these revised parameters constitutes the M step.

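A minimal sketch of these updates (not from the book), assuming a responsibility matrix `gamma` of shape `(N, K)` such as the one produced by the `e_step` sketch above:

```python
import numpy as np

def m_step(X, gamma):
    """M step: revised pi, mu, Sigma from the responsibilities (the optimal-pi/mu/sigma equations)."""
    N, D = X.shape
    Nk = gamma.sum(axis=0)             # effective number of points assigned to each component
    pi = Nk / N                        # pi_k* = (1/N) sum_n gamma(z_nk)
    mus = (gamma.T @ X) / Nk[:, None]  # mu_k* = responsibility-weighted mean of the data
    Sigmas = np.zeros((len(Nk), D, D))
    for k in range(len(Nk)):
        diff = X - mus[k]                                        # deviations from the new mean
        Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]  # responsibility-weighted covariance
    return pi, mus, Sigmas
```

Alternating `e_step` and `m_step`, starting from an initialization such as the K-Means one sketched earlier, gives the complete EM loop for the GMM.
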
---

## References

```{bibliography}
:filter: docname in docnames
```