20 From Probability to Likelihood

20.1 Likelihood Function

Suppose that we observe \(x_1, x_2, \ldots, x_n\) according to the model \(X_1, X_2, \ldots, X_n \sim F_{\theta}\). The joint pdf is \(f(\boldsymbol{x} ; \theta)\). We view the pdf as being a function of \(\boldsymbol{x}\) for a fixed \(\theta\).

The likelihood function is obtained by reversing the arguments and viewing this as a function of \(\theta\) for a fixed, observed \(\boldsymbol{x}\):

\[L(\theta ; \boldsymbol{x}) = f(\boldsymbol{x} ; \theta).\]

20.2 Log-Likelihood Function

The log-likelihood function is

\[ \ell(\theta ; \boldsymbol{x}) = \log L(\theta ; \boldsymbol{x}).\]

When the data are iid, we have

\[ \ell(\theta ; \boldsymbol{x}) = \log \prod_{i=1}^n f(x_i ; \theta) = \sum_{i=1}^n \log f(x_i ; \theta).\]

20.3 Sufficient Statistics

A statistic \(T(\boldsymbol{x})\) is defined to be a function of the data.

A sufficient statistic is a statistic where the distribution of data, conditional on this statistic, does not depend on \(\theta\). That is, \(\boldsymbol{X} | T(\boldsymbol{X})\) does not depend on \(\theta\).

The interpretation is that the information in \(\boldsymbol{X}\) about \(\theta\) (the target of inference) is contained in \(T(\boldsymbol{X})\).

20.4 Factorization Theorem

The factorization theorem says that \(T(\boldsymbol{x})\) is a sufficient statistic if and only if we can factor

\[f(\boldsymbol{x} ; \theta) = g(T(\boldsymbol{x}), \theta) h(\boldsymbol{x}).\]

Therefore, if \(T(\boldsymbol{x})\) is a sufficient statistic then

\[L(\theta ; \boldsymbol{x}) = g(T(\boldsymbol{x}), \theta) h(\boldsymbol{x}) \propto L(\theta ; T(\boldsymbol{x})).\]

This formalizes the idea that the information in \(\boldsymbol{X}\) about \(\theta\) (the target of inference) is contained in \(T(\boldsymbol{X})\).

20.5 Example: Normal


If \(X_1, X_2, \ldots, X_n \stackrel{{\rm iid}}{\sim} \mbox{Normal}(\mu \sigma^2)\), then \(\overline{X}\) is sufficient for \(\mu\).


As an exercise, show this via the factorization theorem.


Hint: \(\sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n (x_i - \overline{x})^2 + n(\overline{x} - \mu)^2\).

20.6 Likelihood Principle


If \(\boldsymbol{x}\) and \(\boldsymbol{y}\) are two data sets so that

\[L(\theta ; \boldsymbol{x}) \propto L(\theta ; \boldsymbol{y}),\]

\[\mbox{i.e., } L(\theta ; \boldsymbol{x}) = c(\boldsymbol{x}, \boldsymbol{y}) L(\theta ; \boldsymbol{y}),\]

then inferenece \(\theta\) should be the same for \(\boldsymbol{x}\) and \(\boldsymbol{y}\).

20.7 Maximum Likelihood

A common starting point for inference is to calculate the maximum likelihood estimate. This is the value of \(\theta\) that maximizes \(L(\theta ; \boldsymbol{x})\) for an observe data set \(\boldsymbol{x}\).

\[\begin{align*} \hat{\theta}_{{\rm MLE}} & = \operatorname{argmax}_{\theta} L(\theta ; \boldsymbol{x}) \\ & = \operatorname{argmax}_{\theta} \ell (\theta ; \boldsymbol{x}) \\ & = \operatorname{argmax}_{\theta} L (\theta ; T(\boldsymbol{x})) \end{align*}\]

where the last equality holds for sufficient statistics \(T(\boldsymbol{x})\).

20.8 Going Further

If this interests you, be sure to read about:

  • Minimal sufficient statistics
  • Complete sufficient statistics
  • Ancillary statistics
  • Basu’s theorem