26 Maximum Likelihood Estimation
26.1 The Normal Example
We formulated both confidence intervals and hypothesis tests under the following idealized scenario:
Suppose a simple random sample of \(n\) data points is collected so that the following model of the data is reasonable: \(X_1, X_2, \ldots, X_n\) are iid Normal(\(\mu\), \(\sigma^2\)). The goal is to do inference on \(\mu\), the population mean. For simplicity, assume that \(\sigma^2\) is known (e.g., \(\sigma^2 = 1\)).
There is a good reason why we did this.
26.2 MLE \(\rightarrow\) Normal Pivotal Statistics
The distributions we introduced last week have maximum likelihood estimators (MLEs) that, by MLE theory, can be standardized to yield an approximate pivotal statistic with a Normal(0,1) distribution.
For example, if \(X \sim \mbox{Binomial}(n,p)\) then \(\hat{p}=X/n\) is the MLE. For large \(n\) it approximately holds that \[ \frac{\hat{p} - p}{\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}} \sim \mbox{Normal}(0,1). \]
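To see this in action, here is a minimal Python sketch (all values assumed for illustration: \(p = 0.3\), \(n = 500\)) that simulates many Binomial data sets and standardizes \(\hat{p}\) as above; the resulting \(z\)-values should behave approximately like Normal(0,1) draws.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 0.3        # assumed sample size and true success probability
reps = 10_000          # number of simulated data sets

x = rng.binomial(n, p, size=reps)   # one Binomial(n, p) draw per data set
p_hat = x / n                       # MLE of p in each data set

# Standardize each p_hat with its estimated standard error:
z = (p_hat - p) / np.sqrt(p_hat * (1 - p_hat) / n)

print("mean of z:", z.mean())                        # should be near 0
print("sd of z:  ", z.std())                         # should be near 1
print("Pr(|z| > 1.96):", np.mean(np.abs(z) > 1.96))  # should be near 0.05
```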
26.3 Likelihood Function
Suppose that we observe \(x_1, x_2, \ldots, x_n\) according to the model \(X_1, X_2, \ldots, X_n \sim F_{\theta}\). The joint pdf is \(f(\boldsymbol{x} ; \theta)\). We view the pdf as being a function of \(\boldsymbol{x}\) for a fixed \(\theta\).
The likelihood function is obtained by reversing the arguments and viewing this as a function of \(\theta\) for a fixed, observed \(\boldsymbol{x}\):
\[L(\theta ; \boldsymbol{x}) = f(\boldsymbol{x} ; \theta).\]
26.4 Log-Likelihood Function
The log-likelihood function is
\[ \ell(\theta ; \boldsymbol{x}) = \log L(\theta ; \boldsymbol{x}).\]
When the data are iid, we have
\[ \ell(\theta ; \boldsymbol{x}) = \log \prod_{i=1}^n f(x_i ; \theta) = \sum_{i=1}^n \log f(x_i ; \theta).\]
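As a concrete illustration, here is a small Python sketch (assuming an iid Normal(\(\mu\), 1) model and made-up data) that evaluates \(\ell(\mu ; \boldsymbol{x})\) as a sum of log densities:

```python
import numpy as np
from scipy import stats

x = np.array([1.2, 0.7, 2.1, 1.5, 0.9])  # illustrative observed data

def log_likelihood(mu, x):
    """Log-likelihood of mu under an iid Normal(mu, 1) model:
    the sum of the log density evaluated at each observation."""
    return np.sum(stats.norm.logpdf(x, loc=mu, scale=1.0))

# The log-likelihood is a function of mu with the data held fixed:
for mu in [0.0, 1.0, float(np.mean(x)), 2.0]:
    print(f"ell({mu:.2f}) = {log_likelihood(mu, x):.4f}")
```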
26.5 Calculating MLEs
The maximum likelihood estimate is the value of \(\theta\) that maximizes \(L(\theta ; \boldsymbol{x})\) for an observed data set \(\boldsymbol{x}\).
\[\begin{align*} \hat{\theta}_{{\rm MLE}} & = \operatorname{argmax}_{\theta} L(\theta ; \boldsymbol{x}) \\ & = \operatorname{argmax}_{\theta} \ell (\theta ; \boldsymbol{x}) \\ & = \operatorname{argmax}_{\theta} L (\theta ; T(\boldsymbol{x})) \end{align*}\]where the last equality holds for sufficient statistics \(T(\boldsymbol{x})\).
The MLE can usually be calculated analytically or numerically.
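Here is a minimal sketch of the numerical route, assuming a Gamma model, simulated data, and scipy's general-purpose optimizer (one reasonable setup, not the only one):

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=1.5, size=200)  # simulated data, true (2.0, 1.5)

def neg_log_lik(theta, x):
    """Negative log-likelihood for a Gamma(shape, scale) model."""
    shape, scale = theta
    return -np.sum(stats.gamma.logpdf(x, a=shape, scale=scale))

# Minimizing the negative log-likelihood maximizes the likelihood:
fit = minimize(neg_log_lik, x0=[1.0, 1.0], args=(x,),
               bounds=[(1e-6, None), (1e-6, None)])
print("numerical MLE (shape, scale):", fit.x)
```

Minimizing the negative log-likelihood is the usual numerical device, since optimization routines minimize by convention.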
26.6 Properties
When “certain regularity assumptions” are true, the following properties hold for MLEs.
- Consistent
- Equivariant
- Asymptotically Normal
- Asymptotically Efficient (or Optimal)
- Approximate Bayes Estimator
26.7 Assumptions and Notation
We will assume that \(X_1, X_2, \ldots, X_n \stackrel{{\rm iid}}{\sim} F_{\theta}\) and let \(\hat{\theta}_n\) be the MLE of \(\theta\) based on the \(n\) observations.
The only exception is for the Binomial distribution where we will assume that \(X \sim \mbox{Binomial}(n, p)\), which is the sum of \(n\) iid \(\mbox{Bernoulli}(p)\) rv’s.
We will assume that the “certain regularity assumptions” are true in the following results.
26.8 Consistency
An estimator is consistent if it converges in probability to the true parameter value. MLEs are consistent so that as \(n \rightarrow \infty\),
\[\hat{\theta}_n \stackrel{P}{\rightarrow} \theta,\]
where \(\theta\) is the true value.
26.9 Equivariance
If \(\hat{\theta}_n\) is the MLE of \(\theta\), then \(g\left(\hat{\theta}_n\right)\) is the MLE of \(g(\theta)\).
Example: For the Normal\((\mu, \sigma^2)\) the MLE of \(\mu\) is \(\overline{X}\). Therefore, the MLE of \(e^\mu\) is \(e^{\overline{X}}\).
26.10 Fisher Information
The Fisher Information of \(X_1, X_2, \ldots, X_n \stackrel{{\rm iid}}{\sim} F_{\theta}\) is:
\[\begin{align*} I_n(\theta) & = \operatorname{Var}\left( \frac{d}{d\theta} \log f(\boldsymbol{X}; \theta) \right) = \sum_{i=1}^n \operatorname{Var}\left( \frac{d}{d\theta} \log f(X_i; \theta) \right) \\ & = - \operatorname{E}\left( \frac{d^2}{d\theta^2} \log f(\boldsymbol{X}; \theta) \right) = - \sum_{i=1}^n \operatorname{E}\left( \frac{d^2}{d\theta^2} \log f(X_i; \theta) \right) \end{align*}\]
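To see the definition in action, here is a small simulation sketch for the Bernoulli(\(p\)) model (assumed values \(p = 0.3\), \(n = 50\)), where both expressions reduce analytically to \(I_n(p) = n/(p(1-p))\), a standard calculation:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 0.3, 50            # assumed true parameter and sample size
reps = 100_000

def score(x, p):
    """d/dp of log f(x; p) for a single Bernoulli(p) observation."""
    return x / p - (1 - x) / (1 - p)

x = rng.binomial(1, p, size=(reps, n))   # reps samples of n Bernoullis each
total_score = score(x, p).sum(axis=1)    # score of the full sample

print("simulated Var(score):", total_score.var())
print("analytic I_n(p) = n/(p(1-p)):", n / (p * (1 - p)))
```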
26.11 Standard Error
In general, the standard error is the standard deviation of the sampling distribution of an estimate or statistic.
For MLEs, the standard error is \(\sqrt{{\operatorname{Var}}\left(\hat{\theta}_n\right)}\). It has the approximation
\[\operatorname{se}\left(\hat{\theta}_n\right) \approx \frac{1}{\sqrt{I_n(\theta)}}\] and the standard error estimate is
\[\hat{\operatorname{se}}\left(\hat{\theta}_n\right) = \frac{1}{\sqrt{I_n\left(\hat{\theta}_n\right)}}.\]
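For example, under a Poisson(\(\lambda\)) model the MLE is \(\hat{\lambda} = \overline{X}\), and a standard calculation gives \(I_n(\lambda) = n/\lambda\), so \(\hat{\operatorname{se}}(\hat{\lambda}) = \sqrt{\hat{\lambda}/n}\). A minimal sketch with assumed values:

```python
import numpy as np

rng = np.random.default_rng(3)
lam, n = 4.0, 100                  # assumed true rate and sample size
x = rng.poisson(lam, size=n)

lam_hat = x.mean()                 # the MLE of lambda is the sample mean
# With I_n(lambda) = n/lambda, the plug-in estimate is sqrt(lam_hat/n):
se_hat = np.sqrt(lam_hat / n)

print("MLE:", lam_hat, "  estimated se:", se_hat)
print("se at the true lambda, 1/sqrt(I_n(lambda)):", np.sqrt(lam / n))
```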
26.12 Asymptotic Normality
When standardized, MLEs converge in distribution to the Normal distribution. Specifically, as \(n \rightarrow \infty\),
\[\frac{\hat{\theta}_n - \theta}{{\operatorname{se}}\left(\hat{\theta}_n\right)} \stackrel{D}{\longrightarrow} \mbox{Normal}(0,1)\] and
\[\frac{\hat{\theta}_n - \theta}{\hat{{\operatorname{se}}}\left(\hat{\theta}_n\right)} \stackrel{D}{\longrightarrow} \mbox{Normal}(0,1).\]
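A small simulation sketch of this convergence, assuming an Exponential model with rate \(\lambda\) (standard calculations give MLE \(\hat{\lambda} = 1/\overline{X}\) and \(I_n(\lambda) = n/\lambda^2\)); the tail probability should approach the Normal(0,1) value of 0.05 as \(n\) grows:

```python
import numpy as np

rng = np.random.default_rng(4)
lam, reps = 2.0, 10_000            # assumed true rate

for n in [10, 100, 1000]:
    x = rng.exponential(scale=1/lam, size=(reps, n))
    lam_hat = 1 / x.mean(axis=1)        # MLE of the rate
    se_hat = lam_hat / np.sqrt(n)       # 1/sqrt(I_n(lam_hat)), I_n = n/lam^2
    z = (lam_hat - lam) / se_hat
    print(f"n = {n:4d}   Pr(|Z| > 1.96) = {np.mean(np.abs(z) > 1.96):.4f}")
```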
26.13 Asymptotic Pivotal Statistic
By the previous result, we now have an approximate (asymptotic) pivotal statistic:
\[Z = \frac{\hat{\theta}_n - \theta}{\hat{{\operatorname{se}}}\left(\hat{\theta}_n\right)} \stackrel{D}{\longrightarrow} \mbox{Normal}(0,1).\]
This allows us to construct approximate confidence intervals and hypothesis tests as in the idealized \(\mbox{Normal}(\mu, \sigma^2)\) (with \(\sigma^2\) known) scenario from the previous sections.
26.14 Wald Test
Consider the hypothesis test, \(H_0: \theta=\theta_0\) vs \(H_1: \theta \not= \theta_0\). We form the test statistic
\[z = \frac{\hat{\theta}_n - \theta_0}{\hat{{\operatorname{se}}}\left(\hat{\theta}_n\right)},\]
which has approximate p-value
\[\mbox{p-value} = {\rm Pr}(|Z^*| \geq |z|),\]
where \(Z^*\) is a Normal\((0,1)\) random variable.
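A minimal sketch of the Wald test on made-up Binomial data (assumed: \(x = 58\) successes in \(n = 100\) trials, testing \(H_0: p = 0.5\)):

```python
import numpy as np
from scipy import stats

x, n, p0 = 58, 100, 0.5            # assumed data and null value

p_hat = x / n                               # MLE
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error
z = (p_hat - p0) / se_hat                   # Wald test statistic

# Two-sided p-value: Pr(|Z*| >= |z|) for Z* ~ Normal(0, 1),
# computed via the Normal survival function
p_value = 2 * stats.norm.sf(abs(z))
print(f"z = {z:.3f},  p-value = {p_value:.4f}")
```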
26.15 Confidence Intervals
Using this MLE theory, we can form approximate \((1-\alpha)\) level confidence intervals as follows.
Two-sided: \[\left(\hat{\theta}_n - |z_{\alpha/2}| \hat{{\operatorname{se}}}\left(\hat{\theta}_n\right), \hat{\theta}_n + |z_{\alpha/2}| \hat{{\operatorname{se}}}\left(\hat{\theta}_n\right)\right)\]
Upper: \[\left(-\infty, \hat{\theta}_n + |z_{\alpha}| \hat{{\operatorname{se}}}\left(\hat{\theta}_n\right)\right)\]
Lower: \[\left(\hat{\theta}_n - |z_{\alpha}| \hat{{\operatorname{se}}}\left(\hat{\theta}_n\right), \infty\right)\]
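Continuing the same made-up Binomial data, here is a sketch of all three interval forms (scipy.stats.norm.ppf supplies the Normal quantiles):

```python
import numpy as np
from scipy import stats

x, n, alpha = 58, 100, 0.05        # assumed data and level
p_hat = x / n
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)

z2 = abs(stats.norm.ppf(alpha / 2))   # |z_{alpha/2}| for the two-sided interval
z1 = abs(stats.norm.ppf(alpha))       # |z_{alpha}| for one-sided intervals

print("two-sided:", (p_hat - z2 * se_hat, p_hat + z2 * se_hat))
print("upper:    ", (-np.inf, p_hat + z1 * se_hat))
print("lower:    ", (p_hat - z1 * se_hat, np.inf))
```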
26.16 Optimality
The MLE is such that
\[ \sqrt{n} \left( \hat{\theta}_n - \theta \right) \stackrel{D}{\longrightarrow} \mbox{Normal}(0, \tau^2)\]
for some \(\tau^2\). Suppose that \(\tilde{\theta}_n\) is any other estimator such that
\[ \sqrt{n} \left( \tilde{\theta}_n - \theta \right) \stackrel{D}{\longrightarrow} \mbox{Normal}(0, \gamma^2).\]
It follows that
\[\frac{\tau^2}{\gamma^2} \leq 1.\]
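A simulation sketch of this inequality for the Normal model, comparing the MLE \(\overline{X}\) to the sample median (whose asymptotic variance for a Normal(0,1) population is \(\pi/2\), a standard fact assumed here):

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 200, 20_000

x = rng.normal(0.0, 1.0, size=(reps, n))
tau2 = (np.sqrt(n) * x.mean(axis=1)).var()          # MLE: the sample mean
gamma2 = (np.sqrt(n) * np.median(x, axis=1)).var()  # competing estimator

print("tau^2 (mean):    ", tau2)     # near 1
print("gamma^2 (median):", gamma2)   # near pi/2, about 1.57
print("ratio tau^2/gamma^2:", tau2 / gamma2)  # approximately <= 1
```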
26.17 Delta Method
Suppose that \(g()\) is a differentiable function and \(g'(\theta) \not= 0\). Note that for \(t\) in a neighborhood of \(\theta\), a first-order Taylor expansion tells us that \(g(t) \approx g(\theta) + g'(\theta) (t - \theta)\). Since the additive constant \(g(\theta)\) does not change the variance, it follows that
\[{\operatorname{Var}}\left(g(\hat{\theta}_n) \right) \approx g'(\theta)^2 {\operatorname{Var}}(\hat{\theta}_n).\]
The delta method shows that \(\hat{{\operatorname{se}}}\left(g(\hat{\theta}_n)\right) = |g'(\hat{\theta}_n)| \hat{{\operatorname{se}}}\left(\hat{\theta}_n\right)\) and
\[\frac{g(\hat{\theta}_n) - g(\theta)}{|g'(\hat{\theta}_n)| \hat{{\operatorname{se}}}\left(\hat{\theta}_n\right)} \stackrel{D}{\longrightarrow} \mbox{Normal}(0,1).\]
26.18 Delta Method Example
Suppose \(X \sim \mbox{Binomial}(n,p)\), which has MLE \(\hat{p} = X/n\). By the equivariance property, the MLE of the per-trial variance \(p(1-p)\) is \(\hat{p}(1-\hat{p})\). It can be calculated that \(\hat{{\operatorname{se}}}(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n}\).
Let \(g(p) = p(1-p)\). Then \(g'(p) = 1-2p\). By the delta method,
\[\hat{{\operatorname{se}}}\left( \hat{p}(1-\hat{p}) \right) = \left| (1-2\hat{p}) \right| \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.\]
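A quick simulation sketch checking this delta-method standard error against the empirical standard deviation of \(\hat{p}(1-\hat{p})\) (assumed values \(p = 0.3\), \(n = 400\)):

```python
import numpy as np

rng = np.random.default_rng(6)
p, n, reps = 0.3, 400, 50_000      # assumed true p and sample size

p_hat = rng.binomial(n, p, size=reps) / n
g_hat = p_hat * (1 - p_hat)        # MLE of p(1-p) by equivariance

# Delta-method standard error, evaluated at the true p for comparison:
se_delta = abs(1 - 2 * p) * np.sqrt(p * (1 - p) / n)
print("delta-method se:", se_delta)
print("empirical sd:   ", g_hat.std())
```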
26.19 Multiparameter Fisher Info Matrix
Suppose that \(X_1, X_2, \ldots, X_n \stackrel{{\rm iid}}{\sim} F_{\boldsymbol{\theta}}\) where \(\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_d)^T\) has MLE \(\hat{\boldsymbol{\theta}}_n\).
The Fisher Information Matrix \(I_n(\boldsymbol{\theta})\) is the \(d \times d\) matrix with \((i, j)\) entry
\[ -\sum_{k=1}^n \operatorname{E}\left( \frac{\partial^2}{\partial\theta_i \partial\theta_j} \log f(X_k; \boldsymbol{\theta}) \right).\]
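For example, under a Normal(\(\mu, \sigma^2\)) model with \(\boldsymbol{\theta} = (\mu, \sigma^2)^T\), a standard calculation gives a diagonal information matrix with entries \(n/\sigma^2\) and \(n/(2\sigma^4)\). A small sketch that builds and inverts it:

```python
import numpy as np

def fisher_info_normal(sigma2, n):
    """Fisher information matrix for n iid Normal(mu, sigma^2) observations,
    with theta = (mu, sigma^2); the off-diagonal entries are zero and the
    matrix does not depend on mu."""
    return np.array([[n / sigma2, 0.0],
                     [0.0, n / (2 * sigma2**2)]])

I_n = fisher_info_normal(sigma2=4.0, n=100)
print("I_n:\n", I_n)
print("I_n^{-1} (approximate covariance of the MLE):\n", np.linalg.inv(I_n))
```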
26.20 Multiparameter Asymptotic MVN
Under appropriate regularity conditions, as \(n \rightarrow \infty\),
\[ I_n(\boldsymbol{\theta})^{1/2} \left( \hat{\boldsymbol{\theta}}_n - \boldsymbol{\theta} \right) \stackrel{D}{\longrightarrow} \mbox{MVN}_d \left( \boldsymbol{0}, \boldsymbol{I} \right) \mbox{ and } \]
\[ I_n(\hat{\boldsymbol{\theta}}_n)^{1/2} \left( \hat{\boldsymbol{\theta}}_n - \boldsymbol{\theta} \right) \stackrel{D}{\longrightarrow} \mbox{MVN}_d \left( \boldsymbol{0}, \boldsymbol{I} \right), \]
so that \(\hat{\boldsymbol{\theta}}_n\) is approximately \(\mbox{MVN}_d \left( \boldsymbol{\theta}, I_n(\boldsymbol{\theta})^{-1} \right)\) for large \(n\).
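A simulation sketch checking that the MLE's covariance approaches \(I_n(\boldsymbol{\theta})^{-1}\) in the Normal model, with MLEs \(\overline{X}\) and \(\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \overline{X})^2\) (all values assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma2, n, reps = 0.0, 4.0, 200, 20_000   # assumed true values

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
theta_hat = np.column_stack([x.mean(axis=1),   # MLE of mu
                             x.var(axis=1)])   # MLE of sigma^2 (ddof = 0)

print("simulated Cov(theta_hat):\n", np.cov(theta_hat.T))
print("I_n(theta)^{-1}:\n", np.diag([sigma2 / n, 2 * sigma2**2 / n]))
```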