Latent variable models - part 1: Gaussian mixture models and the EM algorithm

November 21, 2019

You can find the notebook for this article here. It is part of the bayesian-machine-learning repo on GitHub.

This is part 1 of a two-part series of articles about latent variable models. Part 1 covers the expectation maximization (EM) algorithm and its application to Gaussian mixture models. Part 2 covers approximate inference and variational autoencoders.

Introduction

Given a probabilistic model $p(\mathbf{x} \lvert \boldsymbol{\theta})$ and $N$ observations $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ we often want to find a value for parameter $\boldsymbol{\theta}$ that maximizes the likelihood function $p(\mathbf{X} \lvert \boldsymbol{\theta})$, a function of parameter $\boldsymbol{\theta}$. This is known as maximum likelihood estimation (MLE).

If the model is a simple probability distribution, like a single Gaussian, then MLE has an analytical solution. A common approach for more complex models is gradient descent using the negative log likelihood, $-\log p(\mathbf{x} \lvert \boldsymbol{\theta})$, as loss function. This can easily be implemented with frameworks like PyTorch or TensorFlow, provided that $p(\mathbf{x} \lvert \boldsymbol{\theta})$ is differentiable w.r.t. $\boldsymbol{\theta}$. But this is not necessarily the most efficient approach.
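The gradient-based approach can be sketched in a few lines. The following example fits the mean of a single Gaussian with unit variance by gradient descent on the negative log likelihood. It uses plain NumPy with a hand-derived gradient instead of an autodiff framework; the data, learning rate and iteration count are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1000)  # toy 1-D observations

def nll_grad(mu, x):
    # Gradient of the negative log likelihood of N(x | mu, 1):
    # d/dmu 0.5 * sum((x - mu)^2) = -sum(x - mu)
    return -np.sum(x - mu)

mu = 0.0   # initial parameter value
lr = 1e-4  # learning rate
for _ in range(500):
    mu -= lr * nll_grad(mu, x)

# mu converges to the sample mean, which is the analytical MLE solution
```

For a single Gaussian the loop is of course unnecessary, since the MLE is just the sample mean. That is exactly the point: simple models have analytical solutions, while gradient descent is the general fallback for models that don't.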

Gaussian mixture model

MLE can often be simplified by introducing latent variables. A latent variable model makes the assumption that an observation $\mathbf{x}_i$ is caused by some underlying latent variable, a variable that cannot be observed directly but can be inferred from observed variables and parameters. For example, the following plot shows observations in 2-dimensional space, and one can see that their overall distribution doesn't seem to follow a simple distribution like a single Gaussian.

from latent_variable_models_util import n_true, mu_true, sigma_true
from latent_variable_models_util import generate_data, plot_data, plot_densities

%matplotlib inline

X, T = generate_data(n=n_true, mu=mu_true, sigma=sigma_true)

plot_data(X, color='grey')


But we can see that there are clusters of higher density. Also, the distribution within a cluster looks more like a Gaussian compared to the overall distribution. Indeed, these data were generated using a mixture of three Gaussians as shown in the following plot.

plot_data(X, color=T)
plot_densities(X, mu=mu_true, sigma=sigma_true)


The underlying probabilistic model is called a Gaussian mixture model (GMM), a weighted sum of $C$ Gaussian components, where $C=3$ in this example:

$$p(\mathbf{x} \lvert \boldsymbol{\theta}) = \sum_{c=1}^{C} \pi_c\,\mathcal{N}(\mathbf{x} \lvert \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)\tag{1}$$

$\pi_c$, $\boldsymbol{\mu}_c$ and $\boldsymbol{\Sigma}_c$ are the weight, mean vector and covariance matrix of mixture component $c$, respectively. The weights are non-negative and sum up to $1$, i.e. $\sum_{c=1}^{C} \pi_c = 1$. The parameter vector $\boldsymbol{\theta} = \{\pi_c, \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c\}_{c=1}^{C}$ denotes the set of all model parameters. If we introduce a discrete latent variable $\mathbf{t}$ that determines the assignment of observations to mixture components, we can define a joint distribution over observed and latent variables $p(\mathbf{x}, \mathbf{t} \lvert \boldsymbol{\theta})$ in terms of a conditional distribution $p(\mathbf{x} \lvert \mathbf{t}, \boldsymbol{\theta})$ and a prior distribution $p(\mathbf{t} \lvert \boldsymbol{\theta})$

$$p(\mathbf{x}, \mathbf{t} \lvert \boldsymbol{\theta}) = p(\mathbf{x} \lvert \mathbf{t}, \boldsymbol{\theta})\,p(\mathbf{t} \lvert \boldsymbol{\theta})\tag{2}$$

where $p(\mathbf{x} \lvert t_c = 1, \boldsymbol{\theta}) = \mathcal{N}(\mathbf{x} \lvert \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)$ and $p(t_c = 1 \lvert \boldsymbol{\theta}) = \pi_c$. The values of $\mathbf{t}$ are one-hot encoded. For example, $t_2 = 1$ refers to the second mixture component, which means $\mathbf{t} = (0,1,0)^T$ if there are $C=3$ components in total. The marginal distribution $p(\mathbf{x} \lvert \boldsymbol{\theta})$ is obtained by summing over all possible states of $\mathbf{t}$:

$$p(\mathbf{x} \lvert \boldsymbol{\theta}) = \sum_{c=1}^{C} p(t_c = 1 \lvert \boldsymbol{\theta})\,p(\mathbf{x} \lvert t_c = 1, \boldsymbol{\theta}) = \sum_{c=1}^{C} \pi_c\,\mathcal{N}(\mathbf{x} \lvert \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)\tag{3}$$
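As a numeric illustration of this sum, the density of a hypothetical two-component mixture in 2-D can be evaluated directly; the weights, means and covariances below are arbitrary example values.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

# hypothetical 2-component GMM in 2-D (example values only)
pi = np.array([0.4, 0.6])
mu = np.array([[0., 0.], [3., 3.]])
sigma = np.array([np.eye(2), np.eye(2)])

def marginal_density(x, pi, mu, sigma):
    # sum over all possible states of t: one weighted Gaussian per component
    return sum(pi_c * mvn(mu_c, sigma_c).pdf(x)
               for pi_c, mu_c, sigma_c in zip(pi, mu, sigma))
```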

For each observation $\mathbf{x}_i$ we have one latent variable $\mathbf{t}_i$, as shown in the following plate notation of the model.

(Figure: plate notation of the GMM.)

We denote the set of all observations by $\mathbf{X}$ and the set of all latent variables by $\mathbf{T}$. If we could observe $\mathbf{T}$ directly, we could easily maximize the complete-data likelihood $p(\mathbf{X}, \mathbf{T} \lvert \boldsymbol{\theta})$ because we would then know the assignment of data points to components, and fitting a single Gaussian per component can be done analytically. But since we can only observe $\mathbf{X}$, we have to maximize the marginal likelihood or incomplete-data likelihood $p(\mathbf{X} \lvert \boldsymbol{\theta})$. By using the logarithm of the likelihood we have

$$\log p(\mathbf{X} \lvert \boldsymbol{\theta}) = \sum_{i=1}^{N} \log p(\mathbf{x}_i \lvert \boldsymbol{\theta}) = \sum_{i=1}^{N} \log \sum_{c=1}^{C} \pi_c\,\mathcal{N}(\mathbf{x}_i \lvert \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)\tag{4}$$

which involves a summation over latent variables inside the logarithm. This prevents a simple analytical solution to the optimization problem.
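The contrast with the complete-data case can be made concrete: if the component assignments were observed, MLE would decompose into independent closed-form fits, one Gaussian per component. A small sketch with hypothetical labeled data:

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical complete data: observations together with their component labels
X = np.concatenate([rng.normal(0.0, 1.0, size=(100, 2)),
                    rng.normal(4.0, 1.0, size=(100, 2))])
labels = np.repeat([0, 1], 100)

# with assignments known, the MLE is computed per component in closed form
pi = np.bincount(labels) / len(labels)
mu = np.array([X[labels == c].mean(axis=0) for c in range(2)])
sigma = np.array([np.cov(X[labels == c].T, bias=True) for c in range(2)])
```

`bias=True` gives the maximum likelihood covariance estimate (division by $N$ rather than $N-1$).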

Expectation maximization algorithm

Although we cannot observe latent variables directly, we can obtain their posterior distribution. We start with a preliminary parameter value $\boldsymbol{\theta}^{old}$.

$$p(\mathbf{T} \lvert \mathbf{X}, \boldsymbol{\theta}^{old}) = {p(\mathbf{X} \lvert \mathbf{T}, \boldsymbol{\theta}^{old})\,p(\mathbf{T} \lvert \boldsymbol{\theta}^{old}) \over p(\mathbf{X} \lvert \boldsymbol{\theta}^{old})}\tag{5}$$

For a GMM the posterior factorizes over data points, and the posterior probability of component $c$ for observation $\mathbf{x}_i$ is

$$p(t_{ic} = 1 \lvert \mathbf{x}_i, \boldsymbol{\theta}^{old}) = {\pi_c\,\mathcal{N}(\mathbf{x}_i \lvert \boldsymbol{\mu}_c^{old}, \boldsymbol{\Sigma}_c^{old}) \over \sum_{k=1}^{C} \pi_k\,\mathcal{N}(\mathbf{x}_i \lvert \boldsymbol{\mu}_k^{old}, \boldsymbol{\Sigma}_k^{old})}\tag{6}$$

This allows us to define the expectation of the complete-data log likelihood w.r.t. the posterior distribution:

$$\mathbb{E}_{p(\mathbf{T} \lvert \mathbf{X}, \boldsymbol{\theta}^{old})} \log p(\mathbf{X}, \mathbf{T} \lvert \boldsymbol{\theta}) = \sum_{\mathbf{T}} p(\mathbf{T} \lvert \mathbf{X}, \boldsymbol{\theta}^{old}) \log p(\mathbf{X}, \mathbf{T} \lvert \boldsymbol{\theta})\tag{7}$$

This expectation is then maximized w.r.t. $\boldsymbol{\theta}$, resulting in an updated parameter vector $\boldsymbol{\theta}^{new}$:

$$\boldsymbol{\theta}^{new} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\ \mathbb{E}_{p(\mathbf{T} \lvert \mathbf{X}, \boldsymbol{\theta}^{old})} \log p(\mathbf{X}, \mathbf{T} \lvert \boldsymbol{\theta})\tag{8}$$

In Eq. $(7)$ the summation is outside the logarithm, which enables an analytical solution for $\boldsymbol{\theta}^{new}$ in the case of GMMs. We then let $\boldsymbol{\theta}^{old} \leftarrow \boldsymbol{\theta}^{new}$ and repeat these steps until convergence. This is the essence of the expectation maximization (EM) algorithm. It has an expectation- or E-step, where the posterior over latent variables is obtained, and a maximization- or M-step, where the expectation of the complete-data log likelihood w.r.t. the posterior distribution is maximized. It can be shown that the EM algorithm always converges to a local maximum of $p(\mathbf{X} \lvert \boldsymbol{\theta})$. The following subsection describes the EM algorithm in a more general form.

General form

By introducing a latent variable $\mathbf{t}_i$ for each observation $\mathbf{x}_i$ we can define the log marginal likelihood as

$$\log p(\mathbf{X} \lvert \boldsymbol{\theta}) = \sum_{i=1}^{N} \log p(\mathbf{x}_i \lvert \boldsymbol{\theta}) = \sum_{i=1}^{N} \log \sum_{\mathbf{t}_i} p(\mathbf{x}_i, \mathbf{t}_i \lvert \boldsymbol{\theta})\tag{9}$$

Next we introduce a distribution $q(\mathbf{t}_i)$ over latent variable $\mathbf{t}_i$, which can be any probability distribution, and multiply and divide the RHS of Eq. $(9)$ by this distribution:

$$\log p(\mathbf{X} \lvert \boldsymbol{\theta}) = \sum_{i=1}^{N} \log \sum_{\mathbf{t}_i} q(\mathbf{t}_i) {p(\mathbf{x}_i, \mathbf{t}_i \lvert \boldsymbol{\theta}) \over q(\mathbf{t}_i)}\tag{10}$$

We now have a concave function ($\log$) of an expectation, which allows us to apply Jensen's inequality to define a lower bound $\mathcal{L}$ on $\log p(\mathbf{X} \lvert \boldsymbol{\theta})$:

$$\log p(\mathbf{X} \lvert \boldsymbol{\theta}) \geq \sum_{i=1}^{N} \sum_{\mathbf{t}_i} q(\mathbf{t}_i) \log {p(\mathbf{x}_i, \mathbf{t}_i \lvert \boldsymbol{\theta}) \over q(\mathbf{t}_i)} = \mathcal{L}(\boldsymbol{\theta}, q)\tag{11}$$

This lower bound is a function of $\boldsymbol{\theta}$ and $q$. When we subtract the lower bound from the log marginal likelihood we should end up with something that is non-negative:

$$\log p(\mathbf{X} \lvert \boldsymbol{\theta}) - \mathcal{L}(\boldsymbol{\theta}, q) = \sum_{i=1}^{N} \sum_{\mathbf{t}_i} q(\mathbf{t}_i) \log {q(\mathbf{t}_i) \over p(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\theta})} = \mathrm{KL}(q(\mathbf{T}) \mid\mid p(\mathbf{T} \lvert \mathbf{X}, \boldsymbol{\theta}))\tag{12}$$

We end up with the Kullback-Leibler (KL) divergence between $q(\mathbf{T})$ and the true posterior over latent variables. It can be shown that the KL divergence is always non-negative. We can finally write the following expression for the lower bound:

$$\mathcal{L}(\boldsymbol{\theta}, q) = \log p(\mathbf{X} \lvert \boldsymbol{\theta}) - \mathrm{KL}(q(\mathbf{T}) \mid\mid p(\mathbf{T} \lvert \mathbf{X}, \boldsymbol{\theta}))\tag{13}$$
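The non-negativity of the KL divergence, and the fact that it vanishes when both distributions coincide, can be checked numerically for discrete distributions. A tiny sketch with arbitrary example distributions:

```python
import numpy as np

def kl(q, p):
    # KL divergence between discrete distributions with full support
    return np.sum(q * (np.log(q) - np.log(p)))

p = np.array([0.2, 0.3, 0.5])    # stand-in for the true posterior
q = np.array([0.5, 0.25, 0.25])  # some other distribution

# kl(q, p) > 0 for q != p, and kl(p, p) = 0
```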

In the E-step of the EM algorithm, $\mathcal{L}(\boldsymbol{\theta}, q)$ is maximized w.r.t. $q$ while $\boldsymbol{\theta}$ is held fixed:

$$q^{new} = \underset{q}{\operatorname{argmax}}\ \mathcal{L}(\boldsymbol{\theta}^{old}, q)\tag{14}$$

$\mathcal{L}(\boldsymbol{\theta}, q)$ is maximized when $\mathrm{KL}(q(\mathbf{T}) \mid\mid p(\mathbf{T} \lvert \mathbf{X}, \boldsymbol{\theta}))$ is minimized, as $\log p(\mathbf{X} \lvert \boldsymbol{\theta})$ doesn't depend on $q$. If we can obtain the true posterior, as in the GMM case, we can set $q(\mathbf{T})$ to $p(\mathbf{T} \lvert \mathbf{X}, \boldsymbol{\theta})$ and the KL divergence becomes $0$. If the true posterior is not tractable, we have to use approximations. One family of approximation techniques is called variational inference, which uses specific forms of $q$ to approximate the true posterior. For example, $q$ may come from a parametric family of distributions whose parameters are chosen such that the KL divergence is minimized. Variational inference is covered in more detail in part 2.

In the M-step, $\mathcal{L}(\boldsymbol{\theta}, q)$ is maximized w.r.t. $\boldsymbol{\theta}$ while $q$ is held fixed. Using Eq. $(11)$ we get

$$\boldsymbol{\theta}^{new} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\ \mathcal{L}(\boldsymbol{\theta}, q^{new}) = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\ \sum_{i=1}^{N} \sum_{\mathbf{t}_i} q(\mathbf{t}_i) \log p(\mathbf{x}_i, \mathbf{t}_i \lvert \boldsymbol{\theta}) + \mathrm{const.}\tag{15}$$

If the true posterior is known Eq. $(15)$ becomes Eq. $(7)$ except for the constant term which can be ignored during optimization. Again, we let $\boldsymbol{\theta}^{old} \leftarrow \boldsymbol{\theta}^{new}$ and repeat these steps until convergence. The next section applies the EM algorithm to GMMs.

Expectation maximization algorithm for GMMs

The parameters of a GMM with $C$ components are $\boldsymbol{\theta} = \{\pi_c, \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c\}_{c=1}^{C}$. The prior probability for component $c$ is $p(t_c = 1 \lvert \boldsymbol{\theta}) = \pi_c$ and the conditional distribution of $\mathbf{x}$ given the latent variable value for this component is $p(\mathbf{x} \lvert t_c = 1, \boldsymbol{\theta}) = \mathcal{N}(\mathbf{x} \lvert \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)$.

Implementation with numpy and scipy

import numpy as np

from scipy.stats import multivariate_normal as mvn

In the E-step we set $q(\mathbf{T})$ to the known posterior $p(\mathbf{T} \lvert \mathbf{X}, \boldsymbol{\theta})$ using Eq. $(6)$ and compute the posterior probabilities.

def e_step(X, pi, mu, sigma):
    """
    Computes posterior probabilities from data and parameters.
    
    Args:
        X: observed data (N, D).
        pi: prior probabilities (C,).
        mu: mixture component means (C, D).
        sigma: mixture component covariances (C, D, D).

    Returns:
        Posterior probabilities (N, C).
    """

    N = X.shape[0]
    C = mu.shape[0]
    q = np.zeros((N, C))

    # Equation (6)
    for c in range(C):
        q[:, c] = mvn(mu[c], sigma[c]).pdf(X) * pi[c]        
    return q / np.sum(q, axis=-1, keepdims=True) 

In the M-step, we take the derivatives of $\mathcal{L}(\boldsymbol{\theta}, q)$ w.r.t. $\pi_c$, $\boldsymbol{\mu}_c$ and $\boldsymbol{\Sigma}_c$ and set the resulting expressions to $0$, i.e. do MLE of the parameters, applying constraints to ensure that $\sum_{c=1}^C \pi_c = 1$ and that $\boldsymbol{\Sigma}_c$ is positive semi-definite. The details are omitted here and only the results are presented:

$$\pi_c = {1 \over N} \sum_{i=1}^{N} q(t_{ic} = 1)\tag{16}$$

$$\boldsymbol{\mu}_c = {\sum_{i=1}^{N} q(t_{ic} = 1)\,\mathbf{x}_i \over \sum_{i=1}^{N} q(t_{ic} = 1)}\tag{17}$$

$$\boldsymbol{\Sigma}_c = {\sum_{i=1}^{N} q(t_{ic} = 1)\,(\mathbf{x}_i - \boldsymbol{\mu}_c)(\mathbf{x}_i - \boldsymbol{\mu}_c)^T \over \sum_{i=1}^{N} q(t_{ic} = 1)}\tag{18}$$

def m_step(X, q):
    """
    Computes parameters from data and posterior probabilities.

    Args:
        X: data (N, D).
        q: posterior probabilities (N, C).

    Returns:
        tuple of
        - prior probabilities (C,).
        - mixture component means (C, D).
        - mixture component covariances (C, D, D).
    """    
    
    N, D = X.shape
    C = q.shape[1]    
    sigma = np.zeros((C, D, D))
    
    # Equation (16)
    pi = np.sum(q, axis=0) / N

    # Equation (17)
    mu = q.T.dot(X) / np.sum(q.T, axis=1, keepdims=True)
    
    # Equation (18)
    for c in range(C):
        delta = (X - mu[c])
        sigma[c] = (q[:, [c]] * delta).T.dot(delta) / np.sum(q[:, c])
        
    return pi, mu, sigma    

For computing the lower bound, using the results from the E-step and M-step, we first re-arrange Eq. $(11)$

$$\mathcal{L}(\boldsymbol{\theta}, q) = \sum_{i=1}^{N} \sum_{c=1}^{C} q(t_{ic} = 1) \left( \log \mathcal{N}(\mathbf{x}_i \lvert \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c) + \log \pi_c - \log q(t_{ic} = 1) \right)\tag{19}$$

and then implement it in this form.

def lower_bound(X, pi, mu, sigma, q):
    """
    Computes lower bound from data, parameters and posterior probabilities.

    Args:
        X: observed data (N, D).
        pi: prior probabilities (C,).
        mu: mixture component means (C, D).
        sigma: mixture component covariances (C, D, D).
        q: posterior probabilities (N, C).

    Returns:
        Lower bound.
    """    

    N, C = q.shape
    ll = np.zeros((N, C))
    
    # Equation (19)
    for c in range(C):
        ll[:,c] = mvn(mu[c], sigma[c]).logpdf(X)
    return np.sum(q * (ll + np.log(pi) - np.log(np.maximum(q, 1e-8))))

Model training iterates over E- and M-steps alternately until convergence of the lower bound. To increase the chance of escaping local maxima and finding the global maximum, training is restarted several times from random initial parameters.

def random_init_params(X, C):
    D = X.shape[1]
    pi = np.ones(C) / C
    mu = mvn(mean=np.mean(X, axis=0), cov=[np.var(X[:, 0]), 
                                           np.var(X[:, 1])]).rvs(C).reshape(C, D)
    sigma = np.tile(np.eye(2), (C, 1, 1))
    return pi, mu, sigma


def train(X, C, n_restarts=10, max_iter=50, rtol=1e-3):
    q_best = None
    pi_best = None
    mu_best = None
    sigma_best = None
    lb_best = -np.inf

    for _ in range(n_restarts):
        pi, mu, sigma = random_init_params(X, C)
        prev_lb = None

        try:
            for _ in range(max_iter):
                q = e_step(X, pi, mu, sigma)
                pi, mu, sigma = m_step(X, q)
                lb = lower_bound(X, pi, mu, sigma, q)

                if lb > lb_best:
                    q_best = q
                    pi_best = pi
                    mu_best = mu
                    sigma_best = sigma
                    lb_best = lb

                if prev_lb is not None and np.abs((lb - prev_lb) / prev_lb) < rtol:
                    break

                prev_lb = lb
        except np.linalg.LinAlgError:
            # Singularity. One of the components collapsed
            # onto a specific data point. Start again ...
            pass

    return pi_best, mu_best, sigma_best, q_best, lb_best

pi_best, mu_best, sigma_best, q_best, lb_best = train(X, C=3)
print(f'Lower bound = {lb_best:.2f}')
Lower bound = -3923.77

After convergence, we can use the posterior probabilities of the latent variables to plot the soft assignment of data points to mixture components.

plot_data(X, color=q_best)
plot_densities(X, mu=mu_best, sigma=sigma_best)


Optimal number of components

Usually, we do not know the optimal number of mixture components a priori. But we can get a hint when plotting the lower bound vs. the number of mixture components.

import matplotlib.pyplot as plt

Cs = range(1, 8)
lbs = []

for C in Cs:
    lb = train(X, C)[-1]
    lbs.append(lb)
    
plt.plot(Cs, lbs)
plt.xlabel('Number of mixture components')
plt.ylabel('Lower bound');


There is a strong increase in the lower bound value until $C = 3$, after which the lower bound barely increases. With more components there are of course more ways to overfit, but the simplest model that reaches a relatively high lower bound value is a GMM with 3 components. This is exactly the number of components used to generate the data.

A more principled approach to determine the optimal number of components requires a Bayesian treatment of model parameters. In this case the lower bound would also take into account model complexity and we would see decreasing lower bound values for $C \gt 3$ and a maximum at $C = 3$. For details see section 10.2.4 in [1].
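A simpler, non-Bayesian alternative that also penalizes model complexity is the Bayesian information criterion (BIC), which scikit-learn exposes directly. The following sketch applies it to a hypothetical dataset of three well-separated clusters, generated here just for the example (not the data from above):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# hypothetical dataset: three well-separated 2-D Gaussian clusters
X_demo = np.concatenate([rng.normal(loc=[0, 0], scale=0.5, size=(200, 2)),
                         rng.normal(loc=[5, 0], scale=0.5, size=(200, 2)),
                         rng.normal(loc=[0, 5], scale=0.5, size=(200, 2))])

# lower BIC is better; its penalty term grows with the number of parameters
bics = [GaussianMixture(n_components=C, n_init=5, random_state=0)
        .fit(X_demo).bic(X_demo) for C in range(1, 8)]
best_C = int(np.argmin(bics)) + 1  # index 0 corresponds to C = 1
```

Unlike the unpenalized lower bound, the BIC curve typically has a minimum rather than a plateau, so no visual judgment of an "elbow" is needed.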

Implementation with scikit-learn

The low-level implementation above was just for illustration purposes. Scikit-learn already comes with a GaussianMixture class that can be readily used.

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, n_init=10)
gmm.fit(X)

plot_data(X, color=gmm.predict_proba(X))
plot_densities(X, mu=gmm.means_, sigma=gmm.covariances_)


The results are very similar. The differences come from different random initializations. Also, plotting the lower bound vs. the number of mixture components reproduces our previous findings.

Cs = range(1, 8)
lbs = []

for C in Cs:
    gmm = GaussianMixture(n_components=C, n_init=10)
    gmm.fit(X)
    lbs.append(gmm.lower_bound_)
    
plt.plot(Cs, lbs)
plt.xlabel('Number of mixture components')
plt.ylabel('Lower bound (normalized)');


The lower bound values obtained via gmm.lower_bound_ are normalized, i.e. divided by the number of observations $N$, where $N = 1000$ in this example.

Conclusion

Inference of latent variables and estimation of parameters in GMMs is an example of exact inference. The exact posterior can be obtained in the E-step and analytical solutions exist for parameter MLE in the M-step. But for many models of practical interest, exact inference is not possible and approximate inference methods must be used. The EM algorithm has numerous extensions for approximate inference. Variational inference is one of them and will be covered in part 2, together with a variational autoencoder as application example.

References

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning, Chapter 9.
[2] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective, Chapter 11.


