26 Maximum Likelihood Estimation
26.1 The Normal Example
We formulated both confidence intervals and hypothesis tests under the following idealized scenario:
Suppose a simple random sample of \(n\) data points is collected so that the following model of the data is reasonable: \(X_1, X_2, \ldots, X_n\) are iid Normal(\(\mu\), \(\sigma^2\)). The goal is to do inference on \(\mu\), the population mean. For simplicity, assume that \(\sigma^2\) is known (e.g., \(\sigma^2 = 1\)).
There is a good reason why we did this.
26.2 MLE \(\rightarrow\) Normal Pivotal Statistics
The distributions we introduced last week have maximum likelihood estimators (MLEs) that, by MLE theory, can be standardized to yield an approximate pivotal statistic with a Normal(0,1) distribution.
For example, if \(X \sim \mbox{Binomial}(n,p)\) then \(\hat{p}=X/n\) is the MLE. For large \(n\) it approximately holds that \[ \frac{\hat{p} - p}{\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}} \sim \mbox{Normal}(0,1). \]
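To see this in action, here is a minimal Python sketch (all values assumed for illustration: \(p = 0.3\), \(n = 500\)) that simulates many Binomial data sets and standardizes \(\hat{p}\) as above; the resulting \(z\)-values should behave approximately like Normal(0,1) draws.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 0.3        # assumed sample size and true success probability
reps = 10_000          # number of simulated data sets

x = rng.binomial(n, p, size=reps)   # one Binomial(n, p) draw per data set
p_hat = x / n                       # MLE of p in each data set

# Standardize each p_hat with its estimated standard error:
z = (p_hat - p) / np.sqrt(p_hat * (1 - p_hat) / n)

print("mean of z:", z.mean())                        # should be near 0
print("sd of z:  ", z.std())                         # should be near 1
print("Pr(|z| > 1.96):", np.mean(np.abs(z) > 1.96))  # should be near 0.05
```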
26.3 Likelihood Function
Suppose that we observe \(x_1, x_2, \ldots, x_n\) according to the model \(X_1, X_2, \ldots, X_n \sim F_{\theta}\). The joint pdf is \(f(\boldsymbol{x} ; \theta)\). We view the pdf as being a function of \(\boldsymbol{x}\) for a fixed \(\theta\).
The likelihood function is obtained by reversing the arguments and viewing this as a function of \(\theta\) for a fixed, observed \(\boldsymbol{x}\):
\[L(\theta ; \boldsymbol{x}) = f(\boldsymbol{x} ; \theta).\]
26.4 Log-Likelihood Function
The log-likelihood function is
\[ \ell(\theta ; \boldsymbol{x}) = \log L(\theta ; \boldsymbol{x}).\]
When the data are iid, we have
\[ \ell(\theta ; \boldsymbol{x}) = \log \prod_{i=1}^n f(x_i ; \theta) = \sum_{i=1}^n \log f(x_i ; \theta).\]
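As a concrete illustration, here is a small Python sketch (assuming an iid Normal(\(\mu\), 1) model and made-up data) that evaluates \(\ell(\mu ; \boldsymbol{x})\) as a sum of log densities:

```python
import numpy as np
from scipy import stats

x = np.array([1.2, 0.7, 2.1, 1.5, 0.9])  # illustrative observed data

def log_likelihood(mu, x):
    """Log-likelihood of mu under an iid Normal(mu, 1) model:
    the sum of the log density evaluated at each observation."""
    return np.sum(stats.norm.logpdf(x, loc=mu, scale=1.0))

# The log-likelihood is a function of mu with the data held fixed:
for mu in [0.0, 1.0, float(np.mean(x)), 2.0]:
    print(f"ell({mu:.2f}) = {log_likelihood(mu, x):.4f}")
```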
26.5 Calculating MLEs
The maximum likelihood estimate is the value of \(\theta\) that maximizes \(L(\theta ; \boldsymbol{x})\) for an observed data set \(\boldsymbol{x}\).
\[\begin{align*} \hat{\theta}_{{\rm MLE}} & = \operatorname{argmax}_{\theta} L(\theta ; \boldsymbol{x}) \\ & = \operatorname{argmax}_{\theta} \ell (\theta ; \boldsymbol{x}) \\ & = \operatorname{argmax}_{\theta} L (\theta ; T(\boldsymbol{x})) \end{align*}\]where the last equality holds for sufficient statistics \(T(\boldsymbol{x})\).
The MLE can usually be calculated analytically or numerically.
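Here is a minimal sketch of the numerical route, assuming a Gamma model, simulated data, and scipy's general-purpose optimizer (one reasonable setup, not the only one):

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=1.5, size=200)  # simulated data, true (2.0, 1.5)

def neg_log_lik(theta, x):
    """Negative log-likelihood for a Gamma(shape, scale) model."""
    shape, scale = theta
    return -np.sum(stats.gamma.logpdf(x, a=shape, scale=scale))

# Minimizing the negative log-likelihood maximizes the likelihood:
fit = minimize(neg_log_lik, x0=[1.0, 1.0], args=(x,),
               bounds=[(1e-6, None), (1e-6, None)])
print("numerical MLE (shape, scale):", fit.x)
```

Minimizing the negative log-likelihood is the usual numerical device, since optimization routines minimize by convention.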
26.6 Properties
When “certain regularity assumptions” are true, the following properties hold for MLEs.
- Consistent
- Equivariant
- Asymptotically Normal
- Asymptotically Efficient (or Optimal)
- Approximate Bayes Estimator
26.7 Assumptions and Notation
We will assume that \(X_1, X_2, \ldots, X_n \stackrel{{\rm iid}}{\sim} F_{\theta}\) and let \(\hat{\theta}_n\) be the MLE of \(\theta\) based on the \(n\) observations.
The only exception is for the Binomial distribution where we will assume that \(X \sim \mbox{Binomial}(n, p)\), which is the sum of \(n\) iid \(\mbox{Bernoulli}(p)\) rv’s.
We will assume that the “certain regularity assumptions” are true in the following results.
26.8 Consistency
An estimator is consistent if it converges in probability to the true parameter value. MLEs are consistent so that as \(n \rightarrow \infty\),
\[\hat{\theta}_n \stackrel{P}{\rightarrow} \theta,\]
where \(\theta\) is the true value.
26.9 Equivariance
If \(\hat{\theta}_n\) is the MLE of \(\theta\), then \(g\left(\hat{\theta}_n\right)\) is the MLE of \(g(\theta)\).
Example: For the Normal\((\mu, \sigma^2)\) the MLE of \(\mu\) is \(\overline{X}\). Therefore, the MLE of \(e^\mu\) is \(e^{\overline{X}}\).
26.10 Fisher Information
The Fisher Information of \(X_1, X_2, \ldots, X_n \stackrel{{\rm iid}}{\sim} F_{\theta}\) is:
\[\begin{align*} I_n(\theta) & = \operatorname{Var}\left( \frac{d}{d\theta} \log f(\boldsymbol{X}; \theta) \right) = \sum_{i=1}^n \operatorname{Var}\left( \frac{d}{d\theta} \log f(X_i; \theta) \right) \\ & = - \operatorname{E}\left( \frac{d^2}{d\theta^2} \log f(\boldsymbol{X}; \theta) \right) = - \sum_{i=1}^n \operatorname{E}\left( \frac{d^2}{d\theta^2} \log f(X_i; \theta) \right) \end{align*}\]
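To see the definition in action, here is a small simulation sketch for the Bernoulli(\(p\)) model (assumed values \(p = 0.3\), \(n = 50\)), where both expressions reduce analytically to \(I_n(p) = n/(p(1-p))\), a standard calculation:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 0.3, 50            # assumed true parameter and sample size
reps = 100_000

def score(x, p):
    """d/dp of log f(x; p) for a single Bernoulli(p) observation."""
    return x / p - (1 - x) / (1 - p)

x = rng.binomial(1, p, size=(reps, n))   # reps samples of n Bernoullis each
total_score = score(x, p).sum(axis=1)    # score of the full sample

print("simulated Var(score):", total_score.var())
print("analytic I_n(p) = n/(p(1-p)):", n / (p * (1 - p)))
```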
26.11 Standard Error
In general, the standard error is the standard deviation of the sampling distribution of an estimate or statistic.
For MLEs, the standard error is \(\sqrt{{\operatorname{Var}}\left(\hat{\theta}_n\right)}\). It has the approximation
\[\operatorname{se}\left(\hat{\theta}_n\right) \approx \frac{1}{\sqrt{I_n(\theta)}}\] and the standard error estimate is
\[\hat{\operatorname{se}}\left(\hat{\theta}_n\right) = \frac{1}{\sqrt{I_n\left(\hat{\theta}_n\right)}}.\]
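For example, under a Poisson(\(\lambda\)) model the MLE is \(\hat{\lambda} = \overline{X}\), and a standard calculation gives \(I_n(\lambda) = n/\lambda\), so \(\hat{\operatorname{se}}(\hat{\lambda}) = \sqrt{\hat{\lambda}/n}\). A minimal sketch with assumed values:

```python
import numpy as np

rng = np.random.default_rng(3)
lam, n = 4.0, 100                  # assumed true rate and sample size
x = rng.poisson(lam, size=n)

lam_hat = x.mean()                 # the MLE of lambda is the sample mean
# With I_n(lambda) = n/lambda, the plug-in estimate is sqrt(lam_hat/n):
se_hat = np.sqrt(lam_hat / n)

print("MLE:", lam_hat, "  estimated se:", se_hat)
print("se at the true lambda, 1/sqrt(I_n(lambda)):", np.sqrt(lam / n))
```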
26.12 Asymptotic Normality
When standardized, MLEs converge in distribution to the Normal distribution. Specifically, as \(n \rightarrow \infty\),
\[\frac{\hat{\theta}_n - \theta}{{\operatorname{se}}\left(\hat{\theta}_n\right)} \stackrel{D}{\longrightarrow} \mbox{Normal}(0,1)\] and
\[\frac{\hat{\theta}_n - \theta}{\hat{{\operatorname{se}}}\left(\hat{\theta}_n\right)} \stackrel{D}{\longrightarrow} \mbox{Normal}(0,1).\]
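A small simulation sketch of this convergence, assuming an Exponential model with rate \(\lambda\) (standard calculations give MLE \(\hat{\lambda} = 1/\overline{X}\) and \(I_n(\lambda) = n/\lambda^2\)); the tail probability should approach the Normal(0,1) value of 0.05 as \(n\) grows:

```python
import numpy as np

rng = np.random.default_rng(4)
lam, reps = 2.0, 10_000            # assumed true rate

for n in [10, 100, 1000]:
    x = rng.exponential(scale=1/lam, size=(reps, n))
    lam_hat = 1 / x.mean(axis=1)        # MLE of the rate
    se_hat = lam_hat / np.sqrt(n)       # 1/sqrt(I_n(lam_hat)), I_n = n/lam^2
    z = (lam_hat - lam) / se_hat
    print(f"n = {n:4d}   Pr(|Z| > 1.96) = {np.mean(np.abs(z) > 1.96):.4f}")
```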
26.13 Asymptotic Pivotal Statistic
By the previous result, we now have an approximate (asymptotic) pivotal statistic:
\[Z = \frac{\hat{\theta}_n - \theta}{\hat{{\operatorname{se}}}\left(\hat{\theta}_n\right)} \stackrel{D}{\longrightarrow} \mbox{Normal}(0,1).\]
This allows us to construct approximate confidence intervals and hypothesis tests as in the idealized \(\mbox{Normal}(\mu, \sigma^2)\) (with \(\sigma^2\) known) scenario from the previous sections.
26.14 Wald Test
Consider the hypothesis test, \(H_0: \theta=\theta_0\) vs \(H_1: \theta \not= \theta_0\). We form the test statistic
\[z = \frac{\hat{\theta}_n - \theta_0}{\hat{{\operatorname{se}}}\left(\hat{\theta}_n\right)},\]
which has approximate p-value
\[\mbox{p-value} = {\rm Pr}(|Z^*| \geq |z|),\]
where \(Z^*\) is a Normal\((0,1)\) random variable.
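A minimal sketch of the Wald test on made-up Binomial data (assumed: \(x = 58\) successes in \(n = 100\) trials, testing \(H_0: p = 0.5\)):

```python
import numpy as np
from scipy import stats

x, n, p0 = 58, 100, 0.5            # assumed data and null value

p_hat = x / n                               # MLE
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error
z = (p_hat - p0) / se_hat                   # Wald test statistic

# Two-sided p-value: Pr(|Z*| >= |z|) for Z* ~ Normal(0, 1),
# computed via the Normal survival function
p_value = 2 * stats.norm.sf(abs(z))
print(f"z = {z:.3f},  p-value = {p_value:.4f}")
```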
26.15 Confidence Intervals
Using this MLE theory, we can form approximate \((1-\alpha)\) level confidence intervals as follows.
Two-sided: \[\left(\hat{\theta}_n - |z_{\alpha/2}| \hat{{\operatorname{se}}}\left(\hat{\theta}_n\right), \hat{\theta}_n + |z_{\alpha/2}| \hat{{\operatorname{se}}}\left(\hat{\theta}_n\right)\right)\]
Upper: \[\left(-\infty, \hat{\theta}_n + |z_{\alpha}| \hat{{\operatorname{se}}}\left(\hat{\theta}_n\right)\right)\]
Lower: \[\left(\hat{\theta}_n - |z_{\alpha}| \hat{{\operatorname{se}}}\left(\hat{\theta}_n\right), \infty\right)\]
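Continuing the same made-up Binomial data, here is a sketch of all three interval forms (scipy.stats.norm.ppf supplies the Normal quantiles):

```python
import numpy as np
from scipy import stats

x, n, alpha = 58, 100, 0.05        # assumed data and level
p_hat = x / n
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)

z2 = abs(stats.norm.ppf(alpha / 2))   # |z_{alpha/2}| for the two-sided interval
z1 = abs(stats.norm.ppf(alpha))       # |z_{alpha}| for one-sided intervals

print("two-sided:", (p_hat - z2 * se_hat, p_hat + z2 * se_hat))
print("upper:    ", (-np.inf, p_hat + z1 * se_hat))
print("lower:    ", (p_hat - z1 * se_hat, np.inf))
```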
26.16 Optimality
The MLE is such that
\[ \sqrt{n} \left( \hat{\theta}_n - \theta \right) \stackrel{D}{\longrightarrow} \mbox{Normal}(0, \tau^2)\]
for some \(\tau^2\). Suppose that \(\tilde{\theta}_n\) is any other estimator such that
\[ \sqrt{n} \left( \tilde{\theta}_n - \theta \right) \stackrel{D}{\longrightarrow} \mbox{Normal}(0, \gamma^2).\]
It follows that
\[\frac{\tau^2}{\gamma^2} \leq 1.\]
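A simulation sketch of this inequality for the Normal model, comparing the MLE \(\overline{X}\) to the sample median (whose asymptotic variance for a Normal(0,1) population is \(\pi/2\), a standard fact assumed here):

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 200, 20_000

x = rng.normal(0.0, 1.0, size=(reps, n))
tau2 = (np.sqrt(n) * x.mean(axis=1)).var()          # MLE: the sample mean
gamma2 = (np.sqrt(n) * np.median(x, axis=1)).var()  # competing estimator

print("tau^2 (mean):    ", tau2)     # near 1
print("gamma^2 (median):", gamma2)   # near pi/2, about 1.57
print("ratio tau^2/gamma^2:", tau2 / gamma2)  # approximately <= 1
```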
26.17 Delta Method
Suppose that \(g()\) is a differentiable function and \(g'(\theta) \not= 0\). Note that for \(t\) in a neighborhood of \(\theta\), a first-order Taylor expansion tells us that \(g(t) \approx g(\theta) + g'(\theta) (t - \theta)\). Since the additive constant \(g(\theta)\) does not change the variance, it follows that
\[{\operatorname{Var}}\left(g(\hat{\theta}_n) \right) \approx g'(\theta)^2 {\operatorname{Var}}(\hat{\theta}_n).\]
The delta method shows that \(\hat{{\operatorname{se}}}\left(g(\hat{\theta}_n)\right) = |g'(\hat{\theta}_n)| \hat{{\operatorname{se}}}\left(\hat{\theta}_n\right)\) and
\[\frac{g(\hat{\theta}_n) - g(\theta)}{|g'(\hat{\theta}_n)| \hat{{\operatorname{se}}}\left(\hat{\theta}_n\right)} \stackrel{D}{\longrightarrow} \mbox{Normal}(0,1).\]
26.18 Delta Method Example
Suppose \(X \sim \mbox{Binomial}(n,p)\), which has MLE \(\hat{p} = X/n\). By the equivariance property, the MLE of the per-trial variance \(p(1-p)\) is \(\hat{p}(1-\hat{p})\). It can be calculated that \(\hat{{\operatorname{se}}}(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n}\).
Let \(g(p) = p(1-p)\). Then \(g'(p) = 1-2p\). By the delta method,
\[\hat{{\operatorname{se}}}\left( \hat{p}(1-\hat{p}) \right) = \left| (1-2\hat{p}) \right| \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.\]
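A quick simulation sketch checking this delta-method standard error against the empirical standard deviation of \(\hat{p}(1-\hat{p})\) (assumed values \(p = 0.3\), \(n = 400\)):

```python
import numpy as np

rng = np.random.default_rng(6)
p, n, reps = 0.3, 400, 50_000      # assumed true p and sample size

p_hat = rng.binomial(n, p, size=reps) / n
g_hat = p_hat * (1 - p_hat)        # MLE of p(1-p) by equivariance

# Delta-method standard error, evaluated at the true p for comparison:
se_delta = abs(1 - 2 * p) * np.sqrt(p * (1 - p) / n)
print("delta-method se:", se_delta)
print("empirical sd:   ", g_hat.std())
```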
26.19 Multiparameter Fisher Info Matrix
Suppose that \(X_1, X_2, \ldots, X_n \stackrel{{\rm iid}}{\sim} F_{\boldsymbol{\theta}}\) where \(\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_d)^T\) has MLE \(\hat{\boldsymbol{\theta}}_n\).
The Fisher Information Matrix \(I_n(\boldsymbol{\theta})\) is the \(d \times d\) matrix with \((i, j)\) entry
\[ -\sum_{k=1}^n \operatorname{E}\left( \frac{\partial^2}{\partial\theta_i \partial\theta_j} \log f(X_k; \boldsymbol{\theta}) \right).\]
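For example, under a Normal(\(\mu, \sigma^2\)) model with \(\boldsymbol{\theta} = (\mu, \sigma^2)^T\), a standard calculation gives a diagonal information matrix with entries \(n/\sigma^2\) and \(n/(2\sigma^4)\). A small sketch that builds and inverts it:

```python
import numpy as np

def fisher_info_normal(sigma2, n):
    """Fisher information matrix for n iid Normal(mu, sigma^2) observations,
    with theta = (mu, sigma^2); the off-diagonal entries are zero and the
    matrix does not depend on mu."""
    return np.array([[n / sigma2, 0.0],
                     [0.0, n / (2 * sigma2**2)]])

I_n = fisher_info_normal(sigma2=4.0, n=100)
print("I_n:\n", I_n)
print("I_n^{-1} (approximate covariance of the MLE):\n", np.linalg.inv(I_n))
```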
26.20 Multiparameter Asymptotic MVN
Under appropriate regularity conditions, as \(n \rightarrow \infty\),
\[ I_n(\boldsymbol{\theta})^{1/2} \left( \hat{\boldsymbol{\theta}}_n - \boldsymbol{\theta} \right) \stackrel{D}{\longrightarrow} \mbox{MVN}_d \left( \boldsymbol{0}, \boldsymbol{I} \right) \mbox{ and } \]
\[ I_n(\hat{\boldsymbol{\theta}}_n)^{1/2} \left( \hat{\boldsymbol{\theta}}_n - \boldsymbol{\theta} \right) \stackrel{D}{\longrightarrow} \mbox{MVN}_d \left( \boldsymbol{0}, \boldsymbol{I} \right), \]
so that \(\hat{\boldsymbol{\theta}}_n\) is approximately \(\mbox{MVN}_d \left( \boldsymbol{\theta}, I_n(\boldsymbol{\theta})^{-1} \right)\) for large \(n\).
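A simulation sketch checking that the MLE's covariance approaches \(I_n(\boldsymbol{\theta})^{-1}\) in the Normal model, with MLEs \(\overline{X}\) and \(\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \overline{X})^2\) (all values assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma2, n, reps = 0.0, 4.0, 200, 20_000   # assumed true values

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
theta_hat = np.column_stack([x.mean(axis=1),   # MLE of mu
                             x.var(axis=1)])   # MLE of sigma^2 (ddof = 0)

print("simulated Cov(theta_hat):\n", np.cov(theta_hat.T))
print("I_n(theta)^{-1}:\n", np.diag([sigma2 / n, 2 * sigma2**2 / n]))
```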