*You can find the notebook for this article here. It is part of the bayesian-machine-learning repo on Github.*

This is part 1 of a two-part series of articles about latent variable models. Part 1 covers the expectation maximization (EM) algorithm and its application to Gaussian mixture models. Part 2 covers approximate inference and variational autoencoders.

## Introduction

Given a probabilistic model $p(\mathbf{x} \lvert \boldsymbol{\theta})$ and $N$ observations we often want to find a value for parameter $\boldsymbol{\theta}$ that maximizes the likelihood function $p(\mathbf{X} \lvert \boldsymbol{\theta})$, a function of parameter $\boldsymbol{\theta}$. This is known as maximimum likelihood estimation (MLE).

If the model is a simple probability distribution, like a single Gaussian, for example, then has an analytical solution. A common approach for more complex models is *gradient descent* using the *negative log likelihood*, $-\log p(\mathbf{x} \lvert \boldsymbol{\theta})$, as loss function. This can easily be implemented with frameworks like PyTorch or Tensorflow provided that $p(\mathbf{x} \lvert \boldsymbol{\theta})$ is differentiable w.r.t. $\boldsymbol{\theta}$. But this is not necessarily the most efficient approach.

## Gaussian mixture model

MLE can often be simplified by introducing *latent variables*. A latent variable model makes the assumption that an observation $\mathbf{x}_i$ is caused by some underlying latent variable, a variable that cannot be observed directly but can be inferred from observed variables and parameters. For example, the following plot shows observations in 2-dimensional space and one can see that their overall distribution doesn’t seem follow a simple distribution like a single Gaussian.

```
from latent_variable_models_util import n_true, mu_true, sigma_true
from latent_variable_models_util import generate_data, plot_data, plot_densities
%matplotlib inline
X, T = generate_data(n=n_true, mu=mu_true, sigma=sigma_true)
plot_data(X, color='grey')
```

But we can see that there are clusters of higher density. Also, the distribution within a cluster looks more like a Gaussian compared to the overall distribution. Indeed, these data were generated using a *mixture* of three Gaussians as shown in the following plot.

```
plot_data(X, color=T)
plot_densities(X, mu=mu_true, sigma=sigma_true)
```

The underlying probabilistic model is called a *Gaussian mixture model* (GMM), a weighted sum of $C$ Gaussian components where $C=3$ in this example.

$\pi_c$, and are the weight, mean vector and covariance matrix of mixture component $c$, respectively. The weights are non-negative and sum up to $1$ i.e. $\sum_{c=1}^{C} \pi_c = 1$. Parameter vector denotes the set of all model parameters. If we introduce a discrete latent variable $\mathbf{t}$ that determines the assignment of observations to mixture components we can define a joint distribution over observed and latent variables $p(\mathbf{x}, \mathbf{t} \lvert \boldsymbol{\theta})$ in terms of a conditional distribution $p(\mathbf{x} \lvert \mathbf{t}, \boldsymbol{\theta})$ and a prior distribution $p(\mathbf{t} \lvert \boldsymbol{\theta})$

where $p(\mathbf{x} \lvert t_c = 1, \boldsymbol{\theta}) = \mathcal{N}(\mathbf{x} \lvert \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)$ and $p(t_c = 1 \lvert \boldsymbol{\theta}) = \pi_c$. The values of $\mathbf{t}$ are one-hot encoded. For example, $t_2 = 1$ refers to the second mixture component which means $\mathbf{t} = (0,1,0)^T$ if there are $C=3$ components in total. The marginal distribution $p(\mathbf{x} \lvert \boldsymbol{\theta})$ is obtained by summing over all possible states of $\mathbf{t}$.

For each observation $\mathbf{x}_i$ we have one latent variable $\mathbf{t}_i$, as shown in the following plate notation of the model.

We denote the set of all observations by $\mathbf{X}$ and the set of all latent variables by $\mathbf{T}$. If we could observe $\mathbf{T}$ directly, we could easily maximize the *complete-data likelihood* $p(\mathbf{X}, \mathbf{T} \lvert \boldsymbol{\theta})$ because we would then know the assignment of data points to components, and fitting a single Gaussian per component can be done analytically. But since we can only observe $\mathbf{X}$, we have to maximize the *marginal likelihood* or *incomplete-data likelihood* $p(\mathbf{X} \lvert \boldsymbol{\theta})$. By using the logarithm of the likelihood we have

which involves a summation over latent variables inside the logarithm. This prevents a simple analytical solution to the optimization problem.

## Expectation maximization algorithm

Although we cannot observe latent variables directly, we can obtain their posterior distribution. We start with a preliminary parameter value $\boldsymbol{\theta}^{old}$.

This allows us to define an *expectation* of the complete-data likelihood w.r.t. to the posterior distribution.

This expectation is then *maximized* w.r.t. to $\boldsymbol{\theta}$ resulting in an updated parameter vector $\boldsymbol{\theta}^{new}$.

In Eq. $(7)$ the summation is outside the logarithm which enables an analytical solution for $\boldsymbol{\theta}^{new}$ in the case of GMMs. We then let $\boldsymbol{\theta}^{old} \leftarrow \boldsymbol{\theta}^{new}$ and repeat these steps until convergence. This is the essence of the *expectation maximization (EM) algorithm*. It has an expectation- or *E-step* where the posterior over latent variables is obtained and a maximization- or *M-step* where the expectation of the complete-data likelihood w.r.t. the posterior distribution is maximized. It can be shown that the EM algorithm always converges to a local maximum of $p(\mathbf{X} \lvert \boldsymbol{\theta})$. The following subsection describes the EM algorithm in a more general form.

### General form

By introducing a latent variable $\mathbf{t}_i$ for each observation $\mathbf{x}_i$ we can define the log marginal likelihood as

Next we introduce a distribution $q(\mathbf{t}_i)$ over latent variable $\mathbf{t}_i$, which can be any probability distribution, and multiply and divide the RHS of the Eq. $(9)$ with this distribution.

We now have a concave function ($\log$) of an expectation which allows us to apply Jensen’s inequality to define a *lower bound* $\mathcal{L}$ on $\log p(\mathbf{X} \lvert \boldsymbol{\theta})$.

This lower bound is a function of $\boldsymbol{\theta}$ and $q$. When we subtract the lower bound from the marginal log likelihood we should end up with something that is non-negative.

We end up with the Kullback-Leibler (KL) divergence between $q(\mathbf{T})$ and the true posterior over latent variables. It can be shown that the KL divergence is always non-negative. We finally can write the following expression for the lower bound.

In the E-step of the EM algorithm $\mathcal{L}(\boldsymbol{\theta}, q)$ is maximized w.r.t. $q$ and $\boldsymbol{\theta}$ is held fixed.

$\mathcal{L}(\boldsymbol{\theta}, q)$ is maximized when $\mathrm{KL}(q(\mathbf{T}) \mid\mid p(\mathbf{T} \lvert \mathbf{X}, \boldsymbol{\theta}))$ is minimized as $\log p(\mathbf{X} \lvert \boldsymbol{\theta})$ doesn’t depend on $q$. If we can obtain the true posterior, like in the GMM case, we can set $q(\mathbf{T})$ to $p(\mathbf{T} \lvert \mathbf{X}, \boldsymbol{\theta})$ and the KL divergence becomes $0$. If the true posterior is not tractable we have to use approximations. One family of approximation techniques is called *variational inference* which uses specific forms of $q$ to approximate the true posterior. For example, $q$ may come from a parameteric family of distributions and the parameters of $q$ are chosen such that the KL divergence is minimized. Variational inference is covered in more detail in part 2.

In the M-step $\mathcal{L}(\boldsymbol{\theta}, q)$ is maximized w.r.t. $\boldsymbol{\theta}$ and $q$ is held fixed. Using Eq. $(11)$ we get

If the true posterior is known Eq. $(15)$ becomes Eq. $(7)$ except for the constant term which can be ignored during optimization. Again, we let $\boldsymbol{\theta}^{old} \leftarrow \boldsymbol{\theta}^{new}$ and repeat these steps until convergence. The next section applies the EM algorithm to GMMs.

## Expectation maximization algorithm for GMMs

The parameters for a GMM with 3 components are . The prior probability for component $c$ is $p(t_c = 1 \lvert \boldsymbol{\theta}) = \pi_c$ and the conditional distribution of $\mathbf{x}$ given the latent variable value for this component is $p(\mathbf{x} \lvert t_c = 1, \boldsymbol{\theta}) = \mathcal{N}(\mathbf{x} \lvert \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)$.

### Implementation with numpy and scipy

```
import numpy as np
from scipy.stats import multivariate_normal as mvn
```

In the E-step we set $q(\mathbf{T})$ to the known posterior $ p(\mathbf{T} \lvert \mathbf{X}, \boldsymbol{\theta})$ using Eq. $(6)$ and compute the posterior probabilities.

```
def e_step(X, pi, mu, sigma):
"""
Computes posterior probabilities from data and parameters.
Args:
X: observed data (N, D).
pi: prior probabilities (C,).
mu: mixture component means (C, D).
sigma: mixture component covariances (C, D, D).
Returns:
Posterior probabilities (N, C).
"""
N = X.shape[0]
C = mu.shape[0]
q = np.zeros((N, C))
# Equation (6)
for c in range(C):
q[:, c] = mvn(mu[c], sigma[c]).pdf(X) * pi[c]
return q / np.sum(q, axis=-1, keepdims=True)
```

In the M-step, we take the derivatives of $\mathcal{L}(\boldsymbol{\theta}, q)$ w.r.t. to $\pi_c$, and and set the resulting expressions to $0$ i.e. do MLE of parameters and also apply constraints to ensure that $\sum_{c=1}^C \pi_c = 1$ and $\boldsymbol{\Sigma}_{c})^T$ is positive semi-definite. The details are omitted here and only the results are presented.

```
def m_step(X, q):
"""
Computes parameters from data and posterior probabilities.
Args:
X: data (N, D).
q: posterior probabilities (N, C).
Returns:
tuple of
- prior probabilities (C,).
- mixture component means (C, D).
- mixture component covariances (C, D, D).
"""
N, D = X.shape
C = q.shape[1]
sigma = np.zeros((C, D, D))
# Equation (16)
pi = np.sum(q, axis=0) / N
# Equation (17)
mu = q.T.dot(X) / np.sum(q.T, axis=1, keepdims=True)
# Equation (18)
for c in range(C):
delta = (X - mu[c])
sigma[c] = (q[:, [c]] * delta).T.dot(delta) / np.sum(q[:, c])
return pi, mu, sigma
```

For computing the lower bound, using the results from the E-step and M-step, we first re-arrange Eq. $(11)$

and then implement it in this form.

```
def lower_bound(X, pi, mu, sigma, q):
"""
Computes lower bound from data, parameters and posterior probabilities.
Args:
X: observed data (N, D).
pi: prior probabilities (C,).
mu: mixture component means (C, D).
pi: mixture component covariances (C, D, D).
q: posterior probabilities (N, C).
Returns:
Lower bound.
"""
N, C = q.shape
ll = np.zeros((N, C))
# Equation (19)
for c in range(C):
ll[:,c] = mvn(mu[c], sigma[c]).logpdf(X)
return np.sum(q * (ll + np.log(pi) - np.log(np.maximum(q, 1e-8))))
```

Model training iterates over E- and M-steps alternately until convergence of the lower bound. To increase the chance of escaping local maxima and finding the global maximum, training is restarted several times from random initial parameters.

```
def random_init_params(X, C):
D = X.shape[1]
pi = np.ones(C) / C
mu = mvn(mean=np.mean(X, axis=0), cov=[np.var(X[:, 0]),
np.var(X[:, 1])]).rvs(C).reshape(C, D)
sigma = np.tile(np.eye(2), (C, 1, 1))
return pi, mu, sigma
def train(X, C, n_restarts=10, max_iter=50, rtol=1e-3):
q_best = None
pi_best = None
mu_best = None
sigma_best = None
lb_best = -np.inf
for _ in range(n_restarts):
pi, mu, sigma = random_init_params(X, C)
prev_lb = None
try:
for _ in range(max_iter):
q = e_step(X, pi, mu, sigma)
pi, mu, sigma = m_step(X, q)
lb = lower_bound(X, pi, mu, sigma, q)
if lb > lb_best:
q_best = q
pi_best = pi
mu_best = mu
sigma_best = sigma
lb_best = lb
if prev_lb and np.abs((lb - prev_lb) / prev_lb) < rtol:
break
prev_lb = lb
except np.linalg.LinAlgError:
# Singularity. One of the components collapsed
# onto a specific data point. Start again ...
pass
return pi_best, mu_best, sigma_best, q_best, lb_best
pi_best, mu_best, sigma_best, q_best, lb_best = train(X, C=3)
print(f'Lower bound = {lb_best:.2f}')
```

```
Lower bound = -3923.77
```

After convergence we can use the posterior probabilties of latent variables to plot the soft-assignment of data points to mixture components.

```
plot_data(X, color=q_best)
plot_densities(X, mu=mu_best, sigma=sigma_best)
```

### Optimal number of components

Usually, we do not know the optimal number of mixture components a priori. But we can get a hint when plotting the lower bound vs. the number of mixture components.

```
import matplotlib.pyplot as plt
Cs = range(1, 8)
lbs = []
for C in Cs:
lb = train(X, C)[-1]
lbs.append(lb)
plt.plot(Cs, lbs)
plt.xlabel('Number of mixture components')
plt.ylabel('Lower bound');
```

There is a strong increase in the lower bound value until $C = 3$ and then the lower bound more or less doesn’t increase any more. With more components there are of course more options to overfit but the simplest model that reaches a relatively high lower bound value is a GMM with 3 components. This is exactly the number of components used to generate the data.

A more principled approach to determine the optimal number of components requires a Bayesian treatment of model parameters. In this case the lower bound would also take into account model complexity and we would see decreasing lower bound values for $C \gt 3$ and a maximum at $C = 3$. For details see section 10.2.4 in [1].

### Implementation with scikit-learn

The low-level implementation above was just for illustration purposes. Scikit-learn already comes with a `GaussianMixture`

class that can be readily used.

```
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=3, n_init=10)
gmm.fit(X)
plot_data(X, color=gmm.predict_proba(X))
plot_densities(X, mu=gmm.means_, sigma=gmm.covariances_)
```

The results are very similar. The differences come from different random initializations. Also, plotting the lower bound vs. the number of mixture components reproduces our previous findings.

```
Cs = range(1, 8)
lbs = []
for C in Cs:
gmm = GaussianMixture(n_components=C, n_init=10)
gmm.fit(X)
lbs.append(gmm.lower_bound_)
plt.plot(Cs, lbs)
plt.xlabel('Number of mixture components')
plt.ylabel('Lower bound (normalized)');
```

The lower bound values obtained via `gmm.lower_bound_`

are normalized i.e. divided by $N = 1000$ in this example.

## Conclusion

Inference of latent variables and estimation of parameters in GMMs is an example of exact inference. The exact posterior can be obtained in the E-step and analytical solutions exist for parameter MLE in the M-step. But for many models of practical interest, exact inference is not possible and approximate inference methods must be used. The EM algorithm has numerous extensions for approximate inference. *Variational inference* is one of them and will be covered in part 2, together with a *variational autoencoder* as application example.

## References

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning (PDF), Chapter 9.

[2] Kevin P. Murphy. Machine Learning, A Probabilistic Perspective, Chapter 11.