# Mean-field Normal

For a parameter vector theta_1, theta_2, ..., theta_n in the true model p(theta), the mean-field normal (MFN) approximation models each theta_i with its own normal N(mu_i, sigma_i) inside the approximating model q. All parameters within q are treated as independent of each other.

The density of the mean-field approximation is the product of these normal densities: q(theta) = prod_i N(theta_i | mu_i, sigma_i).
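
For example (a minimal sketch; `mean_field_log_density` is just an illustrative name, and the numbers are arbitrary):

```python
import numpy as np
from scipy.stats import norm

def mean_field_log_density(theta, mu, sigma):
    # log q(theta) = sum_i log N(theta_i | mu_i, sigma_i^2),
    # because the factors are independent.
    return np.sum(norm.logpdf(theta, loc=mu, scale=sigma))

# Example with three parameters
mu = np.array([0.0, 1.0, -2.0])
sigma = np.array([1.0, 0.5, 2.0])
theta = np.array([0.1, 0.9, -1.5])
print(mean_field_log_density(theta, mu, sigma))
```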

# Calculating ELBO

In Stan and viabel:

elbo = E_q[log p(x, theta)] + q.entropy()

Entropy of a continuous random variable with density q:

H[q] = integral from -inf to inf of -q(x) ln q(x) dx

     = E[-ln q(x)] # differential entropy

ELBO equation: E[ln p(x, theta)] - E[ln q(theta)]

Therefore we can just add the entropy of q, since the entropy is exactly -E[ln q(theta)].
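
Putting the two lines above together (standard variational-inference identity):

```latex
\mathrm{ELBO}(q)
  = \mathbb{E}_{q}[\ln p(x,\theta)] - \mathbb{E}_{q}[\ln q(\theta)]
  = \mathbb{E}_{q}[\ln p(x,\theta)] + H[q]
```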

The mean-field normal is equal to a multivariate normal distribution with diag(sigma_1^2, ..., sigma_n^2) as its covariance matrix, so we can use the multivariate normal's entropy formula after simplifying it. A proof is here: https://markusthill.github.io/gaussian-distribution-with-a-diagonal-covariance-matrix/
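
As a quick numerical check of that equivalence (a small sketch; the sigma values are arbitrary), the simplified mean-field entropy can be compared against scipy's multivariate normal entropy:

```python
import numpy as np
from scipy.stats import multivariate_normal

sigma = np.array([1.0, 0.5, 2.0])   # example standard deviations
n = sigma.size

# Simplified mean-field entropy: n/2 * (1 + ln(2*pi)) + sum_i ln(sigma_i)
entropy_closed_form = 0.5 * n * (1 + np.log(2 * np.pi)) + np.sum(np.log(sigma))

# The same quantity via a multivariate normal with covariance diag(sigma^2)
entropy_mvn = multivariate_normal(mean=np.zeros(n), cov=np.diag(sigma**2)).entropy()

print(np.isclose(entropy_closed_form, entropy_mvn))  # True
```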

In the viabel code, the ELBO is implemented roughly as:

```python
if approx.supports_entropy:
    # closed-form entropy available: E[log p(x, theta)] + H[q]
    lower_bound = np.mean(self.model(samples)) + approx.entropy(var_param)
else:
    # fall back to a Monte Carlo estimate of -E[log q(theta)]
    lower_bound = np.mean(self.model(samples) - approx.log_density(samples))

# log density of the mean-field normal (mvn is presumably
# scipy.stats.multivariate_normal); the variance is exp(2 * log_sigma)
def log_density(x):
    return mvn.logpdf(x, param_dict['mu'], np.diag(np.exp(2*param_dict['log_sigma'])))
```

If the entropy can't be calculated in closed form, E[log q(theta)] is approximated by averaging log q(x) over a finite number of samples x (the else branch above).
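
Here is a minimal self-contained sketch of both estimators on a toy target; `log_p`, `mu`, and `log_sigma` are illustrative stand-ins, not viabel's actual API:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def log_p(theta):
    # Toy joint log density log p(x, theta); stands in for self.model(samples).
    return np.sum(norm.logpdf(theta, loc=2.0, scale=1.5), axis=-1)

mu, log_sigma = np.zeros(2), np.zeros(2)
sigma = np.exp(log_sigma)
n = mu.size

# Draw Monte Carlo samples from q
samples = mu + sigma * rng.standard_normal((1000, n))

# Estimator 1: Monte Carlo E[log p] plus the closed-form entropy
entropy = 0.5 * n * (1 + np.log(2 * np.pi)) + np.sum(log_sigma)
elbo_entropy = np.mean(log_p(samples)) + entropy

# Estimator 2: Monte Carlo estimate of E[log p] - E[log q]
log_q = np.sum(norm.logpdf(samples, loc=mu, scale=sigma), axis=-1)
elbo_mc = np.mean(log_p(samples) - log_q)

print(elbo_entropy, elbo_mc)  # close; the entropy version typically has lower variance
```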

# SGA

Recall that a multivariate function increases the fastest along the direction of its gradient. The reverse also holds: its value decreases the fastest in the direction of the negative gradient.

So the most basic form of stochastic gradient ascent updates the current parameter value in the direction of the gradient, scaled by some "learning rate" eta, which can be fixed or adaptive.

Since the function is at a critical point when its gradient is zero, unless we "overshoot", each step is guaranteed to move towards the local maximum (well, there are some additional requirements for that to always hold true).

In the case of ELBO maximization, we apply this to each mu and sigma: the gradient is calculated and the update applied separately with respect to each mu_i and sigma_i.
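
A rough sketch of that loop for a toy Gaussian target, using the reparameterization theta = mu + sigma * eps so the gradients with respect to mu and log_sigma can be written by hand (illustrative only, not viabel's implementation; all names and values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: independent normals with these parameters
target_mu, target_sigma = np.array([2.0, -1.0]), np.array([1.5, 0.5])

def grad_log_p(theta):
    # d log p / d theta for the toy target
    return -(theta - target_mu) / target_sigma**2

mu, log_sigma = np.zeros(2), np.zeros(2)
eta = 0.05                      # fixed learning rate
num_steps, num_samples = 2000, 32

for _ in range(num_steps):
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal((num_samples, 2))
    theta = mu + sigma * eps    # reparameterized samples from q
    g = grad_log_p(theta)       # gradient of log p at each sample

    grad_mu = g.mean(axis=0)                               # dELBO/dmu
    grad_log_sigma = (g * sigma * eps).mean(axis=0) + 1.0  # dELBO/dlog_sigma (+1 from the entropy term)

    # gradient *ascent*: step in the direction of the gradient
    mu += eta * grad_mu
    log_sigma += eta * grad_log_sigma

print(mu, np.exp(log_sigma))    # should approach target_mu, target_sigma
```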