# 20 From Probability to Likelihood

## 20.1 Likelihood Function

Suppose that we observe $$x_1, x_2, \ldots, x_n$$ according to the model $$X_1, X_2, \ldots, X_n \sim F_{\theta}$$. The joint pdf is $$f(\boldsymbol{x} ; \theta)$$. We view the pdf as being a function of $$\boldsymbol{x}$$ for a fixed $$\theta$$.

The likelihood function is obtained by reversing the arguments and viewing this as a function of $$\theta$$ for a fixed, observed $$\boldsymbol{x}$$:

$L(\theta ; \boldsymbol{x}) = f(\boldsymbol{x} ; \theta).$

## 20.2 Log-Likelihood Function

The log-likelihood function is

$\ell(\theta ; \boldsymbol{x}) = \log L(\theta ; \boldsymbol{x}).$

When the data are iid, we have

$\ell(\theta ; \boldsymbol{x}) = \log \prod_{i=1}^n f(x_i ; \theta) = \sum_{i=1}^n \log f(x_i ; \theta).$
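
The product-to-sum identity can be checked numerically. The following is a minimal sketch that assumes a Normal model with known standard deviation; the function names and the data values are made up for illustration and are not part of the original notes.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of Normal(mu, sigma^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def log_likelihood(mu, data, sigma=1.0):
    """Sum of per-observation log densities (uses the iid assumption)."""
    return sum(normal_logpdf(x, mu, sigma) for x in data)

def likelihood(mu, data, sigma=1.0):
    """Product of per-observation densities (uses the iid assumption)."""
    return math.prod(math.exp(normal_logpdf(x, mu, sigma)) for x in data)

data = [1.2, 0.7, 2.1, 1.5, 0.9]  # hypothetical observations

# The log of the product agrees with the sum of the logs.
print(math.log(likelihood(1.0, data)), log_likelihood(1.0, data))
```

In practice the sum form is usually preferred, since a product of many small densities can underflow in floating point.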

## 20.3 Sufficient Statistics

A statistic $$T(\boldsymbol{x})$$ is defined to be a function of the data.

A sufficient statistic is a statistic for which the conditional distribution of the data, given the statistic, does not depend on $$\theta$$. That is, the distribution of $$\boldsymbol{X} | T(\boldsymbol{X})$$ does not depend on $$\theta$$.

The interpretation is that all of the information in $$\boldsymbol{X}$$ about $$\theta$$ (the target of inference) is contained in $$T(\boldsymbol{X})$$.

## 20.4 Factorization Theorem

The factorization theorem says that $$T(\boldsymbol{x})$$ is a sufficient statistic if and only if we can factor

$f(\boldsymbol{x} ; \theta) = g(T(\boldsymbol{x}), \theta) h(\boldsymbol{x}).$

Therefore, if $$T(\boldsymbol{x})$$ is a sufficient statistic then

$L(\theta ; \boldsymbol{x}) = g(T(\boldsymbol{x}), \theta) h(\boldsymbol{x}) \propto L(\theta ; T(\boldsymbol{x})).$

This formalizes the idea that the information in $$\boldsymbol{X}$$ about $$\theta$$ (the target of inference) is contained in $$T(\boldsymbol{X})$$.
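
As an illustration that is not in the original notes, consider iid Bernoulli($$\theta$$) data. The joint pmf is $$\theta^{\sum x_i} (1-\theta)^{n - \sum x_i}$$, so the factorization holds with $$T(\boldsymbol{x}) = \sum_{i=1}^n x_i$$, $$g(t, \theta) = \theta^t (1-\theta)^{n-t}$$, and $$h(\boldsymbol{x}) = 1$$. The sketch below checks this on two made-up data sets that share the same sum; the function names are hypothetical.

```python
import math

def bernoulli_likelihood(theta, data):
    """Joint pmf of iid Bernoulli(theta) data, viewed as a function of theta."""
    return math.prod(theta if x == 1 else 1 - theta for x in data)

def g(t, n, theta):
    """Factor that depends on the data only through t = sum(data)."""
    return theta**t * (1 - theta) ** (n - t)

x = [1, 0, 0, 1, 1, 0]  # hypothetical data set
y = [0, 1, 1, 1, 0, 0]  # different ordering, same sum

for theta in (0.2, 0.5, 0.8):
    # Full-data likelihood, the factor g, and the likelihood of y all agree.
    print(theta, bernoulli_likelihood(theta, x), g(sum(x), len(x), theta),
          bernoulli_likelihood(theta, y))
```

Because the two data sets have the same value of $$T$$, their likelihood functions coincide at every $$\theta$$.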

## 20.5 Example: Normal

If $$X_1, X_2, \ldots, X_n \stackrel{{\rm iid}}{\sim} \mbox{Normal}(\mu, \sigma^2)$$ with $$\sigma^2$$ known, then $$\overline{X}$$ is sufficient for $$\mu$$.

As an exercise, show this via the factorization theorem.

Hint: $$\sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n (x_i - \overline{x})^2 + n(\overline{x} - \mu)^2$$.
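
Before proving the identity in the hint, it can be sanity-checked numerically. This is only a sketch with made-up data and a hypothetical helper name; it does not replace the algebraic argument.

```python
import math

def sum_of_squares_decomposition(data, mu):
    """Return both sides of the identity in the hint for a given mu."""
    n = len(data)
    xbar = sum(data) / n
    lhs = sum((x - mu) ** 2 for x in data)
    rhs = sum((x - xbar) ** 2 for x in data) + n * (xbar - mu) ** 2
    return lhs, rhs

data = [2.3, 1.9, 3.1, 2.8, 2.2]  # hypothetical observations

for mu in (-1.0, 0.0, 2.5):
    lhs, rhs = sum_of_squares_decomposition(data, mu)
    print(mu, lhs, rhs)  # the two sides agree up to rounding error
```

Note that only the second term on the right-hand side involves $$\mu$$, and it depends on the data only through $$\overline{x}$$.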

## 20.6 Likelihood Principle

If $$\boldsymbol{x}$$ and $$\boldsymbol{y}$$ are two data sets so that

$L(\theta ; \boldsymbol{x}) \propto L(\theta ; \boldsymbol{y}),$

$\mbox{i.e., } L(\theta ; \boldsymbol{x}) = c(\boldsymbol{x}, \boldsymbol{y}) L(\theta ; \boldsymbol{y}),$

then inference about $$\theta$$ should be the same for $$\boldsymbol{x}$$ and $$\boldsymbol{y}$$.
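
A standard textbook illustration, not taken from these notes, compares two sampling schemes for Bernoulli trials: fixing the number of trials at 12 and observing 3 successes (binomial), versus sampling until the 3rd success, which happens to occur on trial 12 (negative binomial). The sketch below, with hypothetical function names and default values, shows that the two likelihoods differ only by a constant that does not involve $$\theta$$.

```python
import math

def binomial_likelihood(theta, n=12, x=3):
    """Likelihood for x successes in a fixed number n of trials."""
    return math.comb(n, x) * theta**x * (1 - theta) ** (n - x)

def negative_binomial_likelihood(theta, r=3, n=12):
    """Likelihood when the r-th success occurs on trial n."""
    return math.comb(n - 1, r - 1) * theta**r * (1 - theta) ** (n - r)

thetas = (0.1, 0.3, 0.5, 0.7)
ratios = [binomial_likelihood(t) / negative_binomial_likelihood(t) for t in thetas]
print(ratios)  # the ratio is the same at every theta
```

Here the ratio equals $$\binom{12}{3} / \binom{11}{2} = 4$$ for every $$\theta$$, so under the likelihood principle the two experiments should lead to the same inference about $$\theta$$.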

## 20.7 Maximum Likelihood

A common starting point for inference is to calculate the maximum likelihood estimate. This is the value of $$\theta$$ that maximizes $$L(\theta ; \boldsymbol{x})$$ for an observed data set $$\boldsymbol{x}$$.

\begin{align*} \hat{\theta}_{{\rm MLE}} & = \operatorname{argmax}_{\theta} L(\theta ; \boldsymbol{x}) \\ & = \operatorname{argmax}_{\theta} \ell (\theta ; \boldsymbol{x}) \\ & = \operatorname{argmax}_{\theta} L (\theta ; T(\boldsymbol{x})) \end{align*}

where the last equality holds for sufficient statistics $$T(\boldsymbol{x})$$.
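
As a rough sketch of these equalities, the code below maximizes a Bernoulli log-likelihood using only the sufficient statistic $$t = \sum x_i$$ and compares the result with the known closed form $$t/n$$. The grid search, the function names, and the data are assumptions made for illustration; a real analysis would typically use a closed form or a proper numerical optimizer.

```python
import math

def bernoulli_log_likelihood(theta, t, n):
    """Log-likelihood of iid Bernoulli(theta) data with t successes in n trials."""
    return t * math.log(theta) + (n - t) * math.log(1 - theta)

def grid_mle(t, n, grid_size=999):
    """Crude argmax over an evenly spaced grid in (0, 1); illustration only."""
    grid = [k / (grid_size + 1) for k in range(1, grid_size + 1)]
    return max(grid, key=lambda theta: bernoulli_log_likelihood(theta, t, n))

data = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # hypothetical observations
t, n = sum(data), len(data)            # the data enter only through t and n

theta_hat = grid_mle(t, n)
closed_form = t / n
print(theta_hat, closed_form)
```

Since the logarithm is increasing, maximizing the log-likelihood gives the same maximizer as maximizing the likelihood itself.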