20 From Probability to Likelihood
20.1 Likelihood Function
Suppose that we observe \(x_1, x_2, \ldots, x_n\) according to the model \(X_1, X_2, \ldots, X_n \sim F_{\theta}\). The joint pdf is \(f(\boldsymbol{x} ; \theta)\). We view the pdf as being a function of \(\boldsymbol{x}\) for a fixed \(\theta\).
The likelihood function is obtained by reversing the arguments and viewing this as a function of \(\theta\) for a fixed, observed \(\boldsymbol{x}\):
\[L(\theta ; \boldsymbol{x}) = f(\boldsymbol{x} ; \theta).\]
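For example, here is a minimal sketch (made-up data and a Bernoulli model chosen purely for illustration, not taken from these notes) of evaluating \(L(\theta ; \boldsymbol{x})\) over a grid of \(\theta\) values with \(\boldsymbol{x}\) held fixed:

```python
# A minimal sketch: evaluate L(theta; x) on a grid for a hypothetical
# Bernoulli(theta) sample, reading the same formula f(x; theta) as a
# function of theta with the observed x held fixed.
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1])          # observed data (made up for illustration)
theta_grid = np.linspace(0.01, 0.99, 99)   # candidate values of theta

# joint pmf of the sample, viewed as a function of theta:
# L(theta; x) = prod_i theta^{x_i} (1 - theta)^{1 - x_i}
likelihood = np.array([np.prod(t ** x * (1 - t) ** (1 - x)) for t in theta_grid])

print(theta_grid[np.argmax(likelihood)])   # close to the sample proportion 4/6
```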
20.2 Log-Likelihood Function
The log-likelihood function is
\[ \ell(\theta ; \boldsymbol{x}) = \log L(\theta ; \boldsymbol{x}).\]
When the data are iid, we have
\[ \ell(\theta ; \boldsymbol{x}) = \log \prod_{i=1}^n f(x_i ; \theta) = \sum_{i=1}^n \log f(x_i ; \theta).\]
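As a quick sketch (hypothetical Normal data with \(\sigma\) treated as known, chosen only to illustrate the formula), the iid log-likelihood can be computed as a sum of per-observation log densities:

```python
# A minimal sketch, assuming iid Normal(mu, sigma^2) data with sigma known:
# the iid log-likelihood is the sum of per-observation log densities.
import numpy as np
from scipy.stats import norm

x = np.array([2.1, 1.7, 3.0, 2.4, 2.2])   # hypothetical observed sample
sigma = 1.0                                # treated as known here

def log_likelihood(mu, x, sigma):
    # ell(mu; x) = sum_i log f(x_i; mu)
    return np.sum(norm.logpdf(x, loc=mu, scale=sigma))

print(log_likelihood(2.0, x, sigma), log_likelihood(3.0, x, sigma))
```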
20.3 Sufficient Statistics
A statistic \(T(\boldsymbol{x})\) is defined to be a function of the data.
A sufficient statistic is a statistic such that the distribution of the data, conditional on the statistic, does not depend on \(\theta\). That is, \(\boldsymbol{X} \mid T(\boldsymbol{X})\) does not depend on \(\theta\).
The interpretation is that the information in \(\boldsymbol{X}\) about \(\theta\) (the target of inference) is contained in \(T(\boldsymbol{X})\).
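For example (a standard case, not derived in these notes), if \(X_1, \ldots, X_n \stackrel{{\rm iid}}{\sim} \mbox{Bernoulli}(\theta)\) and \(T(\boldsymbol{X}) = \sum_{i=1}^n X_i\), then for any \(x_1, \ldots, x_n\) with \(\sum_{i=1}^n x_i = t\),
\[\Pr(X_1 = x_1, \ldots, X_n = x_n \mid T(\boldsymbol{X}) = t) = \frac{\theta^{t}(1-\theta)^{n-t}}{\binom{n}{t}\theta^{t}(1-\theta)^{n-t}} = \frac{1}{\binom{n}{t}},\]
which does not involve \(\theta\), so \(T(\boldsymbol{X})\) is sufficient.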
20.4 Factorization Theorem
The factorization theorem says that \(T(\boldsymbol{x})\) is a sufficient statistic if and only if we can factor
\[f(\boldsymbol{x} ; \theta) = g(T(\boldsymbol{x}), \theta) h(\boldsymbol{x}).\]
Therefore, if \(T(\boldsymbol{x})\) is a sufficient statistic then
\[L(\theta ; \boldsymbol{x}) = g(T(\boldsymbol{x}), \theta) h(\boldsymbol{x}) \propto L(\theta ; T(\boldsymbol{x})).\]
This formalizes the idea that the information in \(\boldsymbol{X}\) about \(\theta\) (the target of inference) is contained in \(T(\boldsymbol{X})\).
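As an illustration (a standard example, not worked in these notes), if \(X_1, \ldots, X_n \stackrel{{\rm iid}}{\sim} \mbox{Poisson}(\lambda)\), then
\[f(\boldsymbol{x} ; \lambda) = \prod_{i=1}^n \frac{e^{-\lambda} \lambda^{x_i}}{x_i!} = \underbrace{e^{-n\lambda} \lambda^{\sum_{i=1}^n x_i}}_{g(T(\boldsymbol{x}), \lambda)} \times \underbrace{\frac{1}{\prod_{i=1}^n x_i!}}_{h(\boldsymbol{x})},\]
so \(T(\boldsymbol{x}) = \sum_{i=1}^n x_i\) is sufficient for \(\lambda\).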
20.5 Example: Normal
If \(X_1, X_2, \ldots, X_n \stackrel{{\rm iid}}{\sim} \mbox{Normal}(\mu, \sigma^2)\) with \(\sigma^2\) known, then \(\overline{X}\) is sufficient for \(\mu\).
As an exercise, show this via the factorization theorem.
Hint: \(\sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n (x_i - \overline{x})^2 + n(\overline{x} - \mu)^2\).
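A numerical sanity check (not a proof, and not a substitute for the exercise) using made-up data with \(\sigma^2 = 1\) treated as known: two samples with the same \(\overline{x}\) give likelihoods for \(\mu\) that differ only by a constant factor.

```python
# A minimal numerical check that the Normal likelihood for mu depends on the
# data only through the sample mean when sigma is known: two hypothetical
# samples with the same mean give proportional likelihoods.
import numpy as np
from scipy.stats import norm

sigma = 1.0
x = np.array([1.0, 2.0, 3.0])      # mean 2.0
y = np.array([0.0, 2.0, 4.0])      # same mean 2.0, different spread

mu_grid = np.linspace(0.0, 4.0, 9)
L_x = np.array([np.prod(norm.pdf(x, loc=m, scale=sigma)) for m in mu_grid])
L_y = np.array([np.prod(norm.pdf(y, loc=m, scale=sigma)) for m in mu_grid])

print(L_x / L_y)   # constant in mu, i.e., L(mu; x) is proportional to L(mu; y)
```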
20.6 Likelihood Principle
If \(\boldsymbol{x}\) and \(\boldsymbol{y}\) are two data sets such that
\[L(\theta ; \boldsymbol{x}) \propto L(\theta ; \boldsymbol{y}),\]
\[\mbox{i.e., } L(\theta ; \boldsymbol{x}) = c(\boldsymbol{x}, \boldsymbol{y}) L(\theta ; \boldsymbol{y}),\]
then inference about \(\theta\) should be the same for \(\boldsymbol{x}\) and \(\boldsymbol{y}\).
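A classic illustration, sketched here with hypothetical numbers: 9 successes observed in 12 trials under a fixed-\(n\) binomial design, versus sampling until the 3rd failure under a negative binomial design. Both likelihoods are proportional to \(\theta^9 (1-\theta)^3\), so the likelihood principle says inference about \(\theta\) should not depend on which design generated the data.

```python
# A minimal sketch of the binomial / negative binomial illustration of the
# likelihood principle: the two likelihoods differ only by a constant factor.
import numpy as np
from scipy.stats import binom, nbinom

theta_grid = np.linspace(0.05, 0.95, 10)

# fixed n = 12 trials, x = 9 successes observed
L_binom = binom.pmf(9, 12, theta_grid)

# sample until r = 3 failures; 9 successes observed before the 3rd failure
# (scipy's nbinom counts "failures" before the r-th "success", so we pass
#  1 - theta as its success probability to make our successes its failures)
L_nbinom = nbinom.pmf(9, 3, 1 - theta_grid)

print(L_binom / L_nbinom)   # constant ratio across theta (here 220/55 = 4)
```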
20.7 Maximum Likelihood
A common starting point for inference is to calculate the maximum likelihood estimate. This is the value of \(\theta\) that maximizes \(L(\theta ; \boldsymbol{x})\) for the observed data set \(\boldsymbol{x}\).
\[\begin{align*} \hat{\theta}_{{\rm MLE}} & = \operatorname{argmax}_{\theta} L(\theta ; \boldsymbol{x}) \\ & = \operatorname{argmax}_{\theta} \ell (\theta ; \boldsymbol{x}) \\ & = \operatorname{argmax}_{\theta} L (\theta ; T(\boldsymbol{x})) \end{align*}\]where the last equality holds for sufficient statistics \(T(\boldsymbol{x})\).
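As a sketch (hypothetical Poisson counts, chosen only for illustration), the MLE can be computed numerically by minimizing the negative log-likelihood and compared with the closed-form answer, which here is the sample mean, a function of the sufficient statistic \(\sum_{i=1}^n x_i\):

```python
# A minimal sketch with hypothetical Poisson data: compute the MLE numerically
# by minimizing the negative log-likelihood, then check it against the
# closed-form MLE, the sample mean.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

x = np.array([3, 1, 4, 2, 2, 5, 0, 3])   # hypothetical counts

def neg_log_likelihood(lam):
    return -np.sum(poisson.logpmf(x, lam))

fit = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50.0), method="bounded")
print(fit.x, x.mean())   # numerical MLE vs closed-form MLE (they agree)
```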
20.8 Going Further
If this interests you, be sure to read about:
- Minimal sufficient statistics
- Complete sufficient statistics
- Ancillary statistics
- Basu’s theorem