# The Statement of Fundamental Theorem of MLEs¶

Loosely, the Fundamental Theorem of Maximum Likelihood Estimators states:

Theorem

Maximum likelihood estimators are asymptotically normal.

Making precise sense of this requires considerable work.

## Recollections¶

Definition

The relative entropy:

\begin{split}\begin{align*} \mathcal{D}(\rho_A || \rho_B) &= \langle I_{\rho_B} - I_{\rho_A} \rangle_{\rho_A} \\ &= \langle I_{\rho_B} \rangle_{\rho_A} - \mathcal{S}(\rho_A) \end{align*}\end{split}

where $$I_\rho$$ is the information associated to the distibution $$\rho_A$$

We begin with a brief review.

Note

Recall that, given a parametric family:

\begin{split}\begin{align*} \Theta &\overset{\theta}\longrightarrow \mathrm{Prob}(\Omega) \\ \theta &\longmapsto \rho_\theta \end{align*}\end{split}

maximum likelihood estimation provides a map:

\begin{split}\begin{align*} \mathfrak{D}(\Omega) &\overset{\mathrm{MLE}_\Theta}\longrightarrow \mathrm{Prob}(\Omega) \\ \rho_X &\longmapsto \hat{\rho}_\theta(X) = \mathrm{MLE_\Theta}(\rho_x) := \ \mathrm{argmin}_\theta \mathfrak{D}(\rho_X || \rho_\theta \end{align*}\end{split}

When the data is drawn from a probability distribution, $$\rho \in \mathrm{Prob}(\Omega)$$, the MLE map gives a probability distribution on the space of probability distributions:

$\mathrm{MLE}_*(\rho_\theta) := \hat{\rho_\theta} \in \mathrm{Prob}(\Omega)$

Warning

Although $$\hat{\rho}$$ is “random” (in the sense that it is a a probability distribution) it is not a “random variable”. This subtle, technical point is meant to emphasize the intrinsic nature of MLEs.

However, a choice of coordinates allows us to consider $$\hat{\rho}$$ a random variable.

Note

Stein’s Lemma interprets the function MLEs are trying to minimize interpretation in terms of hypothesis testing.

## Geometric Preliminaries¶

Given a smooth function on a manifold e.g. ($$\mathbb{R}^n$$):

$M \overset{f}\longrightarrow \mathbb{R}$

along with a “dummy” metric, $$g$$, we can construct a symmetric quadratic form, the Hessian of $$f$$:

$\mathrm{Hess}_g(f) \in \mathrm{Sym}^2(\mathrm{T}^*M)$

which can be computed as second derivatives in coordinates.

In general, this form depends on the dummy metric. However, if:

$\mathrm{d}f|_p = 0$

then

$\mathrm{Hess}_g(f)|_p \in \mathrm{Sym}^2(\mathrm{T}^*_p M)$

is independent of the dummy metric independent of the dummy metric.

Moreover, if $$f$$ is convex, $$\mathrm{Hess}_g(f)|_p$$ is a positive definite symmetric quadratic form on $$\mathrm{T}_p M$$.

Given coordinates $$\varphi$$, this can be computed as:

$(\partial_i \partial_j \varphi^*f)(p) \in \mathbb{R}$

Moreover, one can explicitly compute the covariance matrix using the following construction:

## Back to Statistics¶

In the setting of MLE, the dictionary is:

\begin{split}\begin{align*} \Theta &:= M \\ f(\rho) &:= \mathcal{D}(\rho_\theta || \rho) \in \mathrm{C}^\infty(\Theta) \end{align*}\end{split}

As we vary $$\theta$$, we obtain an element of:

$f_\theta \in \Theta \times \mathrm{C}^\infty(\Theta)$

In other words, the function which MLEs are trying to minimize gives a positive definite quadratic form on the tangent space of $$\hat{\rho}_\theta$$.

Definition

The Fisher information metric is defined as:

$\mathbb{I}_\Theta = \left. \mathrm{Hess} \bigl( \mathcal{D}(\rho_\theta || \rho) \bigl) \right\vert_{\rho = \rho_\theta} \in \mathrm{Sym}^2(\mathrm{T}^*\Theta)$

Note

As this metric is positive definite and symmetric, it defines a Riemannian metric on $$\Theta$$, conventionally referred to as the Fisher-Rao metric

Theorem

When $$g \in \mathrm{Im}(\theta)$$, in the limit of $$n \rightarrow \infty$$:

$\hat{\rho}_\theta \sim \mathcal{N} \bigl( g, (n \cdot \mathbb{I}|_{\hat{\rho}_\theta)})^{-1} \bigl)$

where:

Lemma

When $$g \in \mathrm{Im}(\theta)$$ (i.e. the model is correctly “specified”, MLE’s are consistent.

Note

As will be seen in the next sections, the relative entropy is a generalization of the effect size between two normal distibutions with identical standard deviations.

Theorem

Maximum likelihood estimation is positive definite and convex.

Corrollary

The central limit theorem?