Maximum Likelihood Estimators - Multivariate Gaussian

The Multivariate Gaussian appears frequently in Machine Learning and the following results are used in many ML books and courses without the derivations.

$\hat \mu = \frac\sum_^m \mathbf < x^> = \mathbf>$
$\hat \Sigma = \frac\sum_^m \mathbf <(x^- \hat \mu) (x^ -\hat \mu)>^T $

I understand that knowledge of the multivariate Gaussian is a pre-requisite for many ML courses, but it would be helpful to have the full derivation in a self contained answer once and for all as I feel many self-learners are bouncing around the stats.stackexchange and math.stackexchange websites looking for answers.

Question

What is the full derivation of the Maximum Likelihood Estimators for the multivariate Gaussian

Examples:

These lecture notes (page 11) on Linear Discriminant Analysis, or these ones make use of the results and assume previous knowledge.

There are also a few posts which are partly answered or closed:

Maximum likelihood estimator for multivariate normal distribution
Need help to understand Maximum Likelihood Estimation for multivariate normal distribution?

normal-distribution
maximum-likelihood
estimators
multivariate-normal-distribution

asked Jun 15, 2018 at 13:10 Xavier Bourret Sicotte Xavier Bourret Sicotte 9,856 4 4 gold badges 48 48 silver badges 73 73 bronze badges

3 Answers 3

$\begingroup$

Deriving the Maximum Likelihood Estimators

Assume that we have $m$ random vectors, each of size $p$ : $\mathbf, X^, \dotsc, X^>$ where each random vectors can be interpreted as an observation (data point) across $p$ variables. If each $\mathbf^$ are i.i.d. as multivariate Gaussian vectors:

$$ \mathbf> \sim \mathcal_p(\mu, \Sigma) $$

Where the parameters $\mu, \Sigma$ are unknown. To obtain their estimate we can use the method of maximum likelihood and maximize the log likelihood function.

Note that by the independence of the random vectors, the joint density of the data $\mathbf< \>, i = 1,2, \dotsc ,m\>$ is the product of the individual densities, that is $\prod_^m f_<\mathbf>>(\mathbf ; \mu , \Sigma >)$ . Taking the logarithm gives the log-likelihood function

\begin l(\mathbf < \mu, \Sigma | x^>) & = \log \prod_^m f_<\mathbf>(\mathbf ) \\ & = \log \ \prod_^m \frac <(2 \pi)^

|\Sigma|^> \exp \left( - \frac \mathbf <(x^- \mu)^T \Sigma^ (x^ - \mu) > \right) \\ & = \sum_^m \left( - \frac

\log (2 \pi) - \frac \log |\Sigma| - \frac \mathbf <(x^- \mu)^T \Sigma^ (x^ - \mu) > \right) \end

\begin l(\mu, \Sigma ; ) & = - \frac \log (2 \pi) - \frac \log |\Sigma| - \frac \sum_^m \mathbf <(x^- \mu)^T \Sigma^ (x^ - \mu) > \end

Deriving $\hat \mu$

To take the derivative with respect to $\mu$ and equate to zero we will make use of the following matrix calculus identity:

Which is often called the sample mean vector.

Deriving $\hat \Sigma$

Deriving the MLE for the covariance matrix requires more work and the use of the following linear algebra and calculus properties:

Combining these properties allows us to calculate

Which is the outer product of the vector $x$ with itself.

We can now re-write the log-likelihood function and compute the derivative w.r.t. $\Sigma^$ (note $C$ is constant)

Equating to zero and solving for $\Sigma$

\begin 0 &= m \Sigma - \sum_^m \mathbf <(x^- \mu) (x^ - \mu)>^T \\ \hat \Sigma & = \frac \sum_^m \mathbf <(x^- \hat \mu) (x^ -\hat \mu)>^T \end

Sources

https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter13.pdf
http://ttic.uchicago.edu/~shubhendu/Slides/Estimation.pdf