Gaussian Mixture Model

by Weidong Liang

Beijing, 2014.09


Introduction

A mixture of Gaussians provides a way to softly cluster data points into groups. It is a generative model built from K Gaussian components: $$ p(x) = \sum_{k=1}^{K}{\pi_k N(x \mid \mu_k, \Sigma_k)} $$ The probability density of a data point x under component k is: $$ p(x \mid z_k = 1) = N(x \mid \mu_k, \Sigma_k) $$ where N is the multivariate Gaussian: $$ N(x \mid \mu_k, \Sigma_k) = \frac{1}{(2 \pi)^{\frac{D}{2}} \sqrt{\vert \Sigma_k \vert} } \exp{\left({-\frac{1}{2}} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)} $$ The marginal distribution over z is specified in terms of the mixing coefficients \( \pi_k \): $$ p(z_k = 1) = \pi_k, \quad 0 \le \pi_k \le 1, \quad \sum_{k=1}^{K}{\pi_k} = 1 $$ Using a 1-of-K encoding for z, we have: $$ p(z) = \prod_{k=1}^K{\pi_k^{z_k}} $$ $$ p(x \mid z) = \prod_{k=1}^K{ N(x \mid \mu_k, \Sigma_k) ^ {z_k} } $$
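
As a quick illustration of the density above, here is a minimal NumPy sketch; the helper names (gaussian_pdf, gmm_density) and the two-component, 2-D parameters are made up for the example and are not taken from the text.

    import numpy as np

    def gaussian_pdf(x, mu, sigma):
        """Multivariate Gaussian density N(x | mu, sigma)."""
        d = mu.shape[0]
        diff = x - mu
        norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
        return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm

    def gmm_density(x, pis, mus, sigmas):
        """Mixture density p(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
        return sum(pi * gaussian_pdf(x, mu, sigma)
                   for pi, mu, sigma in zip(pis, mus, sigmas))

    # Illustrative two-component mixture in 2-D (made-up parameters).
    pis = np.array([0.4, 0.6])
    mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
    sigmas = [np.eye(2), 0.5 * np.eye(2)]
    print(gmm_density(np.array([1.0, 1.0]), pis, mus, sigmas))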

Training GMM using the Expectation-Maximization Algorithm

For the E-step, we have: $$ p(z_k = 1 \mid x_n) = \gamma(z_{nk}) = \frac{p(z_k = 1, x_n)}{\sum_{j=1}^K{p(z_j = 1, x_n)}} = \frac{\pi_k N(x_n \mid \mu_k, \Sigma_k)}{ \sum_{j=1}^K{\pi_j N(x_n \mid \mu_j, \Sigma_j)} } $$
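
In code, the E-step might look like the following NumPy sketch; it reuses the gaussian_pdf helper from the Introduction snippet, and X is assumed to be an N-by-D array of data points.

    import numpy as np

    def e_step(X, pis, mus, sigmas):
        """Return the (N, K) matrix of responsibilities gamma(z_nk)."""
        N, K = X.shape[0], len(pis)
        gamma = np.zeros((N, K))
        for n in range(N):
            for k in range(K):
                gamma[n, k] = pis[k] * gaussian_pdf(X[n], mus[k], sigmas[k])
            gamma[n] /= gamma[n].sum()  # normalize over the K components
        return gamma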

For the M-step, we have to maximize: $$ \Theta^{new} = \arg\max_{\Theta} Q(\Theta, \Theta^{old}) = \arg\max_{\Theta} \sum_Z{p(Z \mid X, \Theta^{old}) \ln p(X, Z \mid \Theta)} $$ In this case: $$ Q(\Theta, \Theta^{old}) = \sum_{n=1}^N{ \sum_{k=1}^K{ p(z_k = 1 \mid x_n, \Theta^{old}) \ln p(x_n, z_k = 1 \mid \Theta)} } = \sum_{n=1}^N{ \sum_{k=1}^K{ \gamma(z_{nk}) \ln \left( \pi_k N(x_n \mid \mu_k, \Sigma_k) \right) } } = \sum_{n=1}^N{ \sum_{k=1}^K{ \gamma(z_{nk}) \left( \ln \pi_k + \ln N(x_n \mid \mu_k, \Sigma_k) \right) }} $$ For \( \pi_k \), we maximize \( Q(\Theta, \Theta^{old}) \) subject to the constraint \( \sum_{k=1}^K{\pi_k} = 1 \), using a Lagrange multiplier \( \lambda_0 \): $$ \frac{\partial}{\partial \pi_k} \left[ Q(\Theta, \Theta^{old}) + \lambda_0 \left( 1 - \sum_{k=1}^K{\pi_k} \right) \right] = \sum_{n=1}^N{\gamma(z_{nk})\frac{1}{\pi_k}} - \lambda_0 = 0 $$ Multiplying this condition by \( \pi_k \), summing over k, and using the constraint gives \( \lambda_0 = \sum_{n=1}^N \sum_{k=1}^K{\gamma(z_{nk})} = N \), hence: $$ \pi_k = \frac{\sum_{n=1}^N{\gamma(z_{nk})}}{ \sum_{n=1}^N \sum_{k=1}^K{\gamma(z_{nk})} } = \frac{N_k}{N} $$ Similarly, setting the derivatives with respect to \( \mu_k \) and \( \Sigma_k \) to zero (the constraint term does not depend on them), we have: $$ \frac{\partial}{\partial \mu_k} Q(\Theta, \Theta^{old}) = 0 \;\Rightarrow\; \mu_k^{new} = \frac{1}{N_k}{\sum_{n=1}^N{ \gamma(z_{nk}) x_n }} $$ $$ \frac{\partial}{\partial \Sigma_k} Q(\Theta, \Theta^{old}) = 0 \;\Rightarrow\; \Sigma_k^{new} = \frac{1}{N_k}{\sum_{n=1}^N{\gamma(z_{nk})(x_n - \mu_k^{new})(x_n - \mu_k^{new})^T }} $$
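
The closed-form updates derived above translate directly into code; a NumPy sketch in the same style as the snippets above, where gamma is the responsibility matrix produced by the e_step sketch:

    import numpy as np

    def m_step(X, gamma):
        """Return updated (pis, mus, sigmas) from the responsibilities."""
        N, D = X.shape
        K = gamma.shape[1]
        Nk = gamma.sum(axis=0)             # N_k = sum_n gamma(z_nk)
        pis = Nk / N                       # pi_k = N_k / N
        mus = (gamma.T @ X) / Nk[:, None]  # mu_k = (1/N_k) sum_n gamma(z_nk) x_n
        sigmas = []
        for k in range(K):
            diff = X - mus[k]              # rows are x_n - mu_k^new
            sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
        return pis, mus, np.array(sigmas)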


Simulation


Algorithm

  1. Initialize parameters \( \mu_k \), \( \Sigma_k \), and \( \pi_k \).
  2. E-step: evaluate \( p(Z \mid X, \Theta) \) $$ \gamma(z_{nk}) = \frac{\pi_k N (x_n \mid \mu_k, \Sigma_k)} {\sum_{j=1}^K{\pi_j N(x_n \mid \mu_j, \Sigma_j)}} $$
  3. M-step: maximize \( \sum_{Z}{ p(Z \mid X, \Theta^{old}) \ln{ p(X, Z \mid \Theta) } } \) $$ N_k = \sum_{n=1}^{N}{\gamma(z_{nk})} $$ $$ \mu_k^{new} = \frac{1}{N_k}{\sum_{n=1}^N{ \gamma(z_{nk}) x_n }} $$ $$ \Sigma_k^{new} = \frac{1}{N_k}{\sum_{n=1}^N{\gamma(z_{nk})(x_n - \mu_k^{new})(x_n - \mu_k^{new})^T }} $$ $$ \pi_k^{new} = \frac{N_k}{N} $$
  4. Evaluate log likelihood and check for convergence: $$ \ln p(X \mid \mu, \Sigma, \pi) = \sum_{n=1}^N{\ln\{\sum_{k = 1}^K{\pi_k N(x_n \mid \mu_k, \Sigma_k)}\}} $$
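
The quantity in step 4 can be evaluated with a small helper in the same NumPy style as the earlier sketches, again reusing the hypothetical gaussian_pdf helper:

    import numpy as np

    def log_likelihood(X, pis, mus, sigmas):
        """Compute ln p(X | mu, Sigma, pi) summed over all data points."""
        total = 0.0
        for x in X:
            total += np.log(sum(pi * gaussian_pdf(x, mu, sigma)
                                for pi, mu, sigma in zip(pis, mus, sigmas)))
        return total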


Implementation
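
A minimal sketch tying together the gaussian_pdf, e_step, m_step, and log_likelihood helpers from the snippets above; the random initialization and convergence tolerance are arbitrary illustrative choices, not prescribed by the algorithm.

    import numpy as np

    def fit_gmm(X, K, max_iter=100, tol=1e-6, seed=0):
        """Fit a K-component GMM to the (N, D) array X with EM."""
        rng = np.random.default_rng(seed)
        N, D = X.shape
        # 1. Initialize: random data points as means, identity covariances,
        #    uniform mixing coefficients.
        mus = X[rng.choice(N, size=K, replace=False)]
        sigmas = np.array([np.eye(D) for _ in range(K)])
        pis = np.full(K, 1.0 / K)
        prev_ll = -np.inf
        for _ in range(max_iter):
            gamma = e_step(X, pis, mus, sigmas)           # 2. E-step
            pis, mus, sigmas = m_step(X, gamma)           # 3. M-step
            ll = log_likelihood(X, pis, mus, sigmas)      # 4. convergence check
            if ll - prev_ll < tol:
                break
            prev_ll = ll
        return pis, mus, sigmas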


Note

The EM Algorithm

For models involving hidden variables, training requires marginalizing over Z, i.e. \( p(X) = \sum_Z{p(X, Z)} \), and evaluating this summation directly is usually infeasible because of the exponential number of terms. The EM algorithm avoids this shortcoming by working with \( p(X, Z) \) and \( p(Z \mid X) \) instead of \( p(X) \) directly.

Assuming that we want to maximize \( p(X \mid \Theta) = \sum_Z{ p(X, Z \mid \Theta)} \), we have the following derivation for an arbitrary distribution \( q(Z) \) over the latent variables: $$ p(X, Z \mid \Theta) = p(Z \mid X, \Theta) \, p(X \mid \Theta) $$ $$ \ln p(X \mid \Theta) = \ln p(X, Z \mid \Theta) - \ln p(Z \mid X, \Theta) $$ $$ \ln p(X \mid \Theta) = \ln p(X, Z \mid \Theta) - \ln p(Z \mid X, \Theta) + \ln q(Z) - \ln q(Z) $$ $$ \ln p(X \mid \Theta) = \ln \frac{p(X, Z \mid \Theta)}{q(Z)} - \ln \frac{p(Z \mid X, \Theta)}{q(Z)} $$ Taking the expectation of both sides under \( q(Z) \): $$ \sum_Z{q(Z) \ln p(X \mid \Theta)} = \sum_Z{q(Z) \ln \frac{p(X, Z \mid \Theta)}{q(Z)} } - \sum_Z{q(Z) \ln \frac{p(Z \mid X, \Theta)}{q(Z)} } $$ and since \( \sum_Z{q(Z)} = 1 \) while \( \ln p(X \mid \Theta) \) does not depend on Z: $$ \ln p(X \mid \Theta) = \sum_Z{q(Z) \ln \frac{p(X, Z \mid \Theta)}{q(Z)} } - \sum_Z{q(Z) \ln \frac{p(Z \mid X, \Theta)}{q(Z)} } $$ $$ \ln p(X \mid \Theta) = L(q, \Theta) + KL(q \,\|\, p) $$ where \( L(q, \Theta) = \sum_Z{q(Z) \ln \frac{p(X, Z \mid \Theta)}{q(Z)}} \) and \( KL(q \,\|\, p) = -\sum_Z{q(Z) \ln \frac{p(Z \mid X, \Theta)}{q(Z)}} \) is the Kullback-Leibler divergence between \( q(Z) \) and the posterior \( p(Z \mid X, \Theta) \).

Since \( KL(q \,\|\, p) \ge 0 \), we have \( L(q, \Theta) \le \ln p(X \mid \Theta) \), i.e. \( L(q, \Theta) \) is a lower bound on \( \ln p(X \mid \Theta) \).

In the Expectation Step, the lower bound \( L(q, \Theta) \) is maximized with respect to \( q(Z) \) while holding \( \Theta \) fixed; the maximum is attained when \( KL(q \,\|\, p) = 0 \), i.e. when \( q(Z) = p(Z \mid X, \Theta) \).
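
At that point the bound is tight, which is the key fact behind EM's monotonic improvement of the likelihood: $$ q(Z) = p(Z \mid X, \Theta^{old}) \;\Rightarrow\; KL(q \,\|\, p) = 0 \;\Rightarrow\; L(q, \Theta^{old}) = \ln p(X \mid \Theta^{old}) $$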

In the Maximization Step, the lower bound \( L(q, \Theta) \) is maximized with respect to \( \Theta \) while holding \( q(Z) \) fixed. Since \( q(Z) = p(Z \mid X, \Theta^{old}) \) from the E-step, we have: $$ L(q, \Theta) = \sum_Z{p(Z \mid X, \Theta^{old}) \ln p(X, Z \mid \Theta)} - \sum_Z{p(Z \mid X, \Theta^{old}) \ln p(Z \mid X, \Theta^{old})} = \sum_Z{p(Z \mid X, \Theta^{old}) \ln p(X, Z \mid \Theta)} + const $$ Therefore, the M-Step is equivalent to maximizing: $$ Q(\Theta, \Theta^{old}) = \sum_Z{p(Z \mid X, \Theta^{old}) \ln p(X, Z \mid \Theta)} $$ Note that maximizing \( Q(\Theta, \Theta^{old}) \) can be easier than directly maximizing \( \ln p(X) \), because the logarithm now acts on the joint distribution; when the joint is a member of the exponential family, or a product of such members, the logarithm cancels the exponential and the maximization often has a closed form.

Summing up, the EM algorithm alternates between evaluating \( q(Z) = p(Z \mid X, \Theta^{old}) \) (E-step) and maximizing \( Q(\Theta, \Theta^{old}) = \sum_Z{p(Z \mid X, \Theta^{old}) \ln p(X, Z \mid \Theta)} \) (M-step) until convergence, which trains the model.



