# 26 Maximum Likelihood Estimation

## 26.1 The Normal Example

We formulated both confidence intervals and hypothesis tests under the following idealized scenario:

Suppose a simple random sample of $$n$$ data points is collected so that the following model of the data is reasonable: $$X_1, X_2, \ldots, X_n$$ are iid Normal($$\mu$$, $$\sigma^2$$). The goal is to do inference on $$\mu$$, the population mean. For simplicity, assume that $$\sigma^2$$ is known (e.g., $$\sigma^2 = 1$$).

There is a good reason why we did this.

## 26.2 MLE $$\rightarrow$$ Normal Pivotal Statistics

The distributions we introduced last week have maximum likelihood estimators (MLEs) that can be standardized to yield a pivotal statistic with an approximate Normal(0,1) distribution, according to MLE theory.

For example, if $$X \sim \mbox{Binomial}(n,p)$$ then $$\hat{p}=X/n$$ is the MLE. For large $$n$$ it approximately holds that $\frac{\hat{p} - p}{\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}} \sim \mbox{Normal}(0,1).$
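
As a quick sanity check, the following sketch simulates Binomial draws and standardizes $$\hat{p}$$ as above; across simulations the statistic should have mean near 0 and standard deviation near 1. The seed, $$n$$, $$p$$, and simulation size are illustrative assumptions.

```python
import math
import random

random.seed(1)
n, p = 500, 0.3

# Simulate many Binomial(n, p) draws and form the standardized statistic
# (p_hat - p) / sqrt(p_hat (1 - p_hat) / n); by MLE theory it should be
# approximately Normal(0, 1) for large n.
zs = []
for _ in range(2000):
    x = sum(random.random() < p for _ in range(n))  # one Binomial(n, p) draw
    p_hat = x / n
    zs.append((p_hat - p) / math.sqrt(p_hat * (1 - p_hat) / n))

mean_z = sum(zs) / len(zs)
sd_z = math.sqrt(sum((z - mean_z) ** 2 for z in zs) / (len(zs) - 1))
print(round(mean_z, 2), round(sd_z, 2))  # both should be near 0 and 1
```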

## 26.3 Likelihood Function

Suppose that we observe $$x_1, x_2, \ldots, x_n$$ according to the model $$X_1, X_2, \ldots, X_n \sim F_{\theta}$$. The joint pdf is $$f(\boldsymbol{x} ; \theta)$$. We view the pdf as being a function of $$\boldsymbol{x}$$ for a fixed $$\theta$$.

The likelihood function is obtained by reversing the arguments and viewing this as a function of $$\theta$$ for a fixed, observed $$\boldsymbol{x}$$:

$L(\theta ; \boldsymbol{x}) = f(\boldsymbol{x} ; \theta).$

## 26.4 Log-Likelihood Function

The log-likelihood function is

$\ell(\theta ; \boldsymbol{x}) = \log L(\theta ; \boldsymbol{x}).$

When the data are iid, we have

$\ell(\theta ; \boldsymbol{x}) = \log \prod_{i=1}^n f(x_i ; \theta) = \sum_{i=1}^n \log f(x_i ; \theta).$
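
For instance, with a small (assumed) data set and the Normal density, the log of the product of densities and the sum of log densities agree numerically:

```python
import math

def normal_pdf(x, mu, sigma2):
    # Normal(mu, sigma2) density
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

xs = [1.2, -0.4, 0.7, 2.1]  # small illustrative data set (assumed)
mu, sigma2 = 0.5, 1.0

# Likelihood: product of densities; log-likelihood: sum of log densities.
L = math.prod(normal_pdf(x, mu, sigma2) for x in xs)
ll = sum(math.log(normal_pdf(x, mu, sigma2)) for x in xs)

print(round(math.log(L), 6), round(ll, 6))  # the two values match
```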

## 26.5 Calculating MLEs

The maximum likelihood estimate is the value of $$\theta$$ that maximizes $$L(\theta ; \boldsymbol{x})$$ for an observed data set $$\boldsymbol{x}$$.

\begin{align*} \hat{\theta}_{{\rm MLE}} & = \operatorname{argmax}_{\theta} L(\theta ; \boldsymbol{x}) \\ & = \operatorname{argmax}_{\theta} \ell (\theta ; \boldsymbol{x}) \\ & = \operatorname{argmax}_{\theta} L (\theta ; T(\boldsymbol{x})) \end{align*}

where the last equality holds for sufficient statistics $$T(\boldsymbol{x})$$.

The MLE can usually be calculated analytically or numerically.
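
As a sketch of the numerical route, the following maximizes a Poisson log-likelihood by ternary search, which is valid here because that log-likelihood is concave in $$\lambda$$; the counts are assumed for illustration. Since the analytic Poisson MLE is the sample mean, the numerical answer should agree with it.

```python
import math

def poisson_loglik(lam, xs):
    # Log-likelihood of iid Poisson(lam) data; the log-factorial terms are
    # omitted because they do not depend on lam and so do not affect the argmax.
    return sum(x * math.log(lam) - lam for x in xs)

xs = [3, 1, 4, 2, 2, 5, 0, 3]  # assumed example counts
lo, hi = 1e-6, 20.0

# Ternary search for the maximizer of a concave function on [lo, hi].
for _ in range(200):
    m1 = lo + (hi - lo) / 3
    m2 = hi - (hi - lo) / 3
    if poisson_loglik(m1, xs) < poisson_loglik(m2, xs):
        lo = m1
    else:
        hi = m2

lam_hat = (lo + hi) / 2
print(lam_hat, sum(xs) / len(xs))  # numeric MLE vs analytic MLE (sample mean)
```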

## 26.6 Properties

When “certain regularity assumptions” are true, the following properties hold for MLEs.

- Consistent
- Equivariant
- Asymptotically Normal
- Asymptotically Efficient (or Optimal)
- Approximate Bayes Estimator

## 26.7 Assumptions and Notation

We will assume that $$X_1, X_2, \ldots, X_n \stackrel{{\rm iid}}{\sim} F_{\theta}$$ and let $$\hat{\theta}_n$$ be the MLE of $$\theta$$ based on the $$n$$ observations.

The only exception is for the Binomial distribution where we will assume that $$X \sim \mbox{Binomial}(n, p)$$, which is the sum of $$n$$ iid $$\mbox{Bernoulli}(p)$$ rv’s.

We will assume that the “certain regularity assumptions” are true in the following results.

## 26.8 Consistency

An estimator is consistent if it converges in probability to the true parameter value. MLEs are consistent so that as $$n \rightarrow \infty$$,

$\hat{\theta}_n \stackrel{P}{\rightarrow} \theta,$

where $$\theta$$ is the true value.

## 26.9 Equivariance

If $$\hat{\theta}_n$$ is the MLE of $$\theta$$, then $$g\left(\hat{\theta}_n\right)$$ is the MLE of $$g(\theta)$$.

Example: For the Normal$$(\mu, \sigma^2)$$ the MLE of $$\mu$$ is $$\overline{X}$$. Therefore, the MLE of $$e^\mu$$ is $$e^{\overline{X}}$$.

## 26.10 Fisher Information

The Fisher Information of $$X_1, X_2, \ldots, X_n \stackrel{{\rm iid}}{\sim} F_{\theta}$$ is:

\begin{align*} I_n(\theta) & = \operatorname{Var}\left( \frac{d}{d\theta} \log f(\boldsymbol{X}; \theta) \right) = \sum_{i=1}^n \operatorname{Var}\left( \frac{d}{d\theta} \log f(X_i; \theta) \right) \\ & = - \operatorname{E}\left( \frac{d^2}{d\theta^2} \log f(\boldsymbol{X}; \theta) \right) = - \sum_{i=1}^n \operatorname{E}\left( \frac{d^2}{d\theta^2} \log f(X_i; \theta) \right) \end{align*}

## 26.11 Standard Error

In general, the standard error is the standard deviation of the sampling distribution of an estimate or statistic.

For MLEs, the standard error is $$\sqrt{{\operatorname{Var}}\left(\hat{\theta}_n\right)}$$. It has the approximation

$\operatorname{se}\left(\hat{\theta}_n\right) \approx \frac{1}{\sqrt{I_n(\theta)}}$ and the standard error estimate is

$\hat{\operatorname{se}}\left(\hat{\theta}_n\right) = \frac{1}{\sqrt{I_n\left(\hat{\theta}_n\right)}}.$
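
For example, suppose $$X \sim \mbox{Binomial}(n, p)$$, the sum of $$n$$ iid $$\mbox{Bernoulli}(p)$$ rv’s. Each Bernoulli observation has $$\log f(x ; p) = x \log p + (1-x) \log(1-p)$$, so

\begin{align*} - \operatorname{E}\left( \frac{d^2}{dp^2} \log f(X_i; p) \right) & = \operatorname{E}\left( \frac{X_i}{p^2} + \frac{1-X_i}{(1-p)^2} \right) = \frac{1}{p} + \frac{1}{1-p} = \frac{1}{p(1-p)}. \end{align*}

Summing over the $$n$$ observations gives $$I_n(p) = \frac{n}{p(1-p)}$$, and therefore

$\hat{\operatorname{se}}(\hat{p}) = \frac{1}{\sqrt{I_n(\hat{p})}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},$

which is the standard error used in the Binomial pivotal statistic of Section 26.2.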

## 26.12 Asymptotic Normality

When properly standardized, MLEs converge in distribution to the Normal distribution. Specifically, as $$n \rightarrow \infty$$,

$\frac{\hat{\theta}_n - \theta}{{\operatorname{se}}\left(\hat{\theta}_n\right)} \stackrel{D}{\longrightarrow} \mbox{Normal}(0,1)$ and

$\frac{\hat{\theta}_n - \theta}{\hat{{\operatorname{se}}}\left(\hat{\theta}_n\right)} \stackrel{D}{\longrightarrow} \mbox{Normal}(0,1).$

## 26.13 Asymptotic Pivotal Statistic

By the previous result, we now have an approximate (asymptotic) pivotal statistic:

$Z = \frac{\hat{\theta}_n - \theta}{\hat{{\operatorname{se}}}\left(\hat{\theta}_n\right)} \stackrel{D}{\longrightarrow} \mbox{Normal}(0,1).$

This allows us to construct approximate confidence intervals and hypothesis tests as in the idealized $$\mbox{Normal}(\mu, \sigma^2)$$ (with $$\sigma^2$$ known) scenario from the previous sections.

## 26.14 Wald Test

Consider the hypothesis test, $$H_0: \theta=\theta_0$$ vs $$H_1: \theta \not= \theta_0$$. We form the test statistic

$z = \frac{\hat{\theta}_n - \theta_0}{\hat{{\operatorname{se}}}\left(\hat{\theta}_n\right)},$

which has approximate p-value

$\mbox{p-value} = {\rm Pr}(|Z^*| \geq |z|),$

where $$Z^*$$ is a Normal$$(0,1)$$ random variable.
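
The Wald test can be carried out with a few lines of arithmetic. The sketch below tests $$H_0: p = 0.5$$ for Binomial data, using the standard Normal cdf via the error function; the sample numbers are assumed for illustration.

```python
import math

def normal_cdf(z):
    # Standard Normal cdf via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Wald test of H0: p = 0.5 for Binomial data (example numbers assumed)
n, x, p0 = 100, 60, 0.5
p_hat = x / n
se_hat = math.sqrt(p_hat * (1 - p_hat) / n)

z = (p_hat - p0) / se_hat
p_value = 2 * (1 - normal_cdf(abs(z)))  # two-sided: Pr(|Z*| >= |z|)
print(round(z, 3), round(p_value, 4))
```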

## 26.15 Confidence Intervals

Using this MLE theory, we can form approximate $$(1-\alpha)$$ level confidence intervals as follows.

Two-sided: $\left(\hat{\theta}_n - |z_{\alpha/2}| \hat{{\operatorname{se}}}\left(\hat{\theta}_n\right), \hat{\theta}_n + |z_{\alpha/2}| \hat{{\operatorname{se}}}\left(\hat{\theta}_n\right)\right)$

Upper: $\left(-\infty, \hat{\theta}_n + |z_{\alpha}| \hat{{\operatorname{se}}}\left(\hat{\theta}_n\right)\right)$

Lower: $\left(\hat{\theta}_n - |z_{\alpha}| \hat{{\operatorname{se}}}\left(\hat{\theta}_n\right), \infty\right)$
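
As a sketch, the two-sided interval can be computed directly; the Normal quantile is obtained by bisection on the cdf so that only the standard library is needed, and the Binomial data values are assumed for illustration.

```python
import math

def normal_quantile(q, tol=1e-10):
    # |z_q|: the upper-q quantile of Normal(0,1), found by bisection on the cdf
    lo, hi = -10.0, 10.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        cdf = 0.5 * (1 + math.erf(mid / math.sqrt(2)))
        if cdf < 1 - q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# 95% two-sided interval for p from Binomial data (example numbers assumed)
n, x, alpha = 100, 60, 0.05
p_hat = x / n
se_hat = math.sqrt(p_hat * (1 - p_hat) / n)
z = normal_quantile(alpha / 2)  # |z_{alpha/2}|, approximately 1.96

ci = (p_hat - z * se_hat, p_hat + z * se_hat)
print(tuple(round(c, 3) for c in ci))
```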

## 26.16 Optimality

The MLE is such that

$\sqrt{n} \left( \hat{\theta}_n - \theta \right) \stackrel{D}{\longrightarrow} \mbox{Normal}(0, \tau^2)$

for some $$\tau^2$$. Suppose that $$\tilde{\theta}_n$$ is any other estimate so that

$\sqrt{n} \left( \tilde{\theta}_n - \theta \right) \stackrel{D}{\longrightarrow} \mbox{Normal}(0, \gamma^2).$

It follows that

$\frac{\tau^2}{\gamma^2} \leq 1.$

## 26.17 Delta Method

Suppose that $$g()$$ is a differentiable function and $$g'(\theta) \not= 0$$. Note that for $$t$$ in a neighborhood of $$\theta$$, a first-order Taylor expansion tells us that $$g(t) \approx g(\theta) + g'(\theta) (t - \theta)$$. From this we know that

${\operatorname{Var}}\left(g(\hat{\theta}_n) \right) \approx g'(\theta)^2 {\operatorname{Var}}(\hat{\theta}_n)$

The delta method shows that $$\hat{{\operatorname{se}}}\left(g(\hat{\theta}_n)\right) = |g'(\hat{\theta}_n)| \hat{{\operatorname{se}}}\left(\hat{\theta}_n\right)$$ and

$\frac{g(\hat{\theta}_n) - g(\theta)}{|g'(\hat{\theta}_n)| \hat{{\operatorname{se}}}\left(\hat{\theta}_n\right)} \stackrel{D}{\longrightarrow} \mbox{Normal}(0,1).$

## 26.18 Delta Method Example

Suppose $$X \sim \mbox{Binomial}(n,p)$$, which has MLE $$\hat{p} = X/n$$. By the equivariance property, the MLE of the per-trial variance $$p(1-p)$$ is $$\hat{p}(1-\hat{p})$$. It can be calculated that $$\hat{{\operatorname{se}}}(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n}$$.

Let $$g(p) = p(1-p)$$. Then $$g'(p) = 1-2p$$. By the delta method,

$\hat{{\operatorname{se}}}\left( \hat{p}(1-\hat{p}) \right) = \left| (1-2\hat{p}) \right| \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.$

## 26.19 Multiparameter Fisher Info Matrix

Suppose that $$X_1, X_2, \ldots, X_n \stackrel{{\rm iid}}{\sim} F_{\boldsymbol{\theta}}$$ where $$\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_d)^T$$ has MLE $$\hat{\boldsymbol{\theta}}_n$$.

The Fisher Information Matrix $$I_n(\boldsymbol{\theta})$$ is the $$d \times d$$ matrix with $$(i, j)$$ entry

$-\sum_{k=1}^n \operatorname{E}\left( \frac{\partial^2}{\partial\theta_i \partial\theta_j} \log f(X_k; \boldsymbol{\theta}) \right).$
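
For example, if $$X_1, X_2, \ldots, X_n \stackrel{{\rm iid}}{\sim} \mbox{Normal}(\mu, \sigma^2)$$ with $$\boldsymbol{\theta} = (\mu, \sigma^2)^T$$, then $$\log f(x ; \boldsymbol{\theta}) = -\frac{1}{2} \log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}$$, and carrying out the expectations above yields

$I_n(\boldsymbol{\theta}) = n \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix}.$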

## 26.20 Multiparameter Asymptotic MVN

Under appropriate regularity conditions, as $$n \rightarrow \infty$$,

$\left( \hat{\boldsymbol{\theta}}_n - \boldsymbol{\theta} \right) \stackrel{D}{\longrightarrow} \mbox{MVN}_d \left( \boldsymbol{0}, I_n(\boldsymbol{\theta})^{-1} \right) \mbox{ and }$

$\left( \hat{\boldsymbol{\theta}}_n - \boldsymbol{\theta} \right)^T I_n(\hat{\boldsymbol{\theta}}_n) \left( \hat{\boldsymbol{\theta}}_n - \boldsymbol{\theta} \right) \stackrel{D}{\longrightarrow} \chi^2_d.$