52 Goodness of Fit
52.1 Rationale
Sometimes we want to figure out which probability distribution is a reasonable model for the data.
This is related to nonparametric inference in that we wish to move from a nonparametric framework to a parametric one: we ask whether a particular parametric family adequately models the data.
Goodness of fit (GoF) tests allow one to perform a hypothesis test of how well a particular parametric probability model explains variation observed in a data set.
52.2 Chi-Square GoF Test
Suppose we have a data generating process \(X_1, X_2, \ldots, X_n {\; \stackrel{\text{iid}}{\sim}\;}F\) for some probability distribution \(F\). We wish to test \(H_0: F \in \{F_{{\boldsymbol{\theta}}}: {\boldsymbol{\theta}}\in \boldsymbol{\Theta} \}\) vs \(H_1: \mbox{not } H_0\). Suppose that \(\boldsymbol{\Theta}\) is \(d\)-dimensional.
Divide the support of \(\{F_{{\boldsymbol{\theta}}}: {\boldsymbol{\theta}}\in \boldsymbol{\Theta} \}\) into \(k\) bins \(I_1, I_2, \ldots, I_k\).
For \(j=1, 2, \ldots, k\), calculate
\[ q_j({\boldsymbol{\theta}}) = \int_{I_j} dF_{{\boldsymbol{\theta}}}(x). \]
Suppose we observe data \(x_1, x_2, \ldots, x_n\). For \(j = 1, 2, \ldots, k\), let \(n_j\) be the number of values \(x_i \in I_j\).
Let \(\tilde{\theta}_1, \tilde{\theta}_2, \ldots, \tilde{\theta}_d\) be the values that maximize the multinomial likelihood
\[ \prod_{j=1}^k q_j({\boldsymbol{\theta}})^{n_j}. \]
Form the GoF statistic
\[ s({\boldsymbol{x}}) = \sum_{j=1}^k \frac{\left(n_j - n q_j\left(\tilde{{\boldsymbol{\theta}}} \right) \right)^2}{n q_j\left(\tilde{{\boldsymbol{\theta}}} \right)}. \]
When \(H_0\) is true and \(n\) is large, \(S = s(\boldsymbol{X})\) approximately follows a \(\chi^2_v\) distribution, where \(v = k - d - 1\). The p-value is calculated as \(\Pr(S^* \geq s({\boldsymbol{x}}))\), where \(S^* \sim \chi^2_{k-d-1}\).
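As a rough sketch of these steps in R (the helper name chisq_gof and its arguments are illustrative, not from the original notes), the statistic and p-value can be computed from the bin counts, the fitted bin probabilities, and the number of estimated parameters:
# sketch: chi-square GoF statistic and p-value from bin counts nj,
# fitted bin probabilities qj (evaluated at theta-tilde), and d estimated parameters
chisq_gof <- function(nj, qj, d) {
  n <- sum(nj)
  s <- sum((nj - n*qj)^2 / (n*qj))
  v <- length(nj) - d - 1
  c(statistic = s, df = v, p.value = 1 - pchisq(s, df = v))
}
The example below carries out the same calculation step by step.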
52.3 Example: Hardy-Weinberg
Suppose at your favorite SNP, we observe genotypes from 100 randomly sampled individuals as follows:
| AA | AT | TT |
|----|----|----|
| 28 | 60 | 12 |
If we code these genotypes as 0, 1, 2, testing for Hardy-Weinberg equilibrium is equivalent to testing whether \(X_1, X_2, \ldots, X_{100} {\; \stackrel{\text{iid}}{\sim}\;}\mbox{Binomial}(2, \theta)\) for some unknown allele frequency of T, \(\theta\).
The parameter is one-dimensional, so \(d=1\). We will also set \(k=3\), where each bin is a genotype. Therefore, we have \(n_1 = 28\), \(n_2 = 60\), and \(n_3 = 12\). Also,
\[ q_1(\theta) = (1-\theta)^2, \ \ q_2(\theta) = 2 \theta (1-\theta), \ \ q_3(\theta) = \theta^2. \]
Maximizing the multinomial likelihood under these bin probabilities, we find \(\tilde{\theta} = (n_2 + 2n_3)/(2n)\), which is simply the observed frequency of the T allele among the \(2n\) sampled alleles. The degrees of freedom of the \(\chi^2_v\) null distribution are \(v = k - d - 1 = 3 - 1 - 1 = 1\).
Let’s carry out the test in R.
> n <- 100
> nj <- c(28, 60, 12)
>
> # parameter estimates
> theta <- (nj[2] + 2*nj[3])/(2*n)
> qj <- c((1-theta)^2, 2*theta*(1-theta), theta^2)
>
> # gof statistic
> s <- sum((nj - n*qj)^2 / (n*qj))
>
> # p-value
> 1-pchisq(s, df=1)
[1] 0.02059811
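As a cross-check (not part of the original analysis), the built-in chisq.test() function computes the same statistic from the observed counts and fitted bin probabilities, but it assumes df = k - 1 = 2 because it does not know that one parameter was estimated; recomputing the p-value with df = 1 reproduces the value above.
# cross-check with the built-in chi-square test; chisq.test() uses df = k - 1,
# so the p-value must be recomputed with df = k - d - 1 = 1
s2 <- unname(chisq.test(x = nj, p = qj)$statistic)
1 - pchisq(s2, df = 1)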
52.4 Kolmogorov–Smirnov Test
The KS test can be used to compare a sample of data to a particular distribution, or to compare two samples of data. The former is a parametric GoF test, and the latter is a nonparametric test of equal distributions.
52.5 One Sample KS Test
Suppose we have a data generating process \(X_1, X_2, \ldots, X_n {\; \stackrel{\text{iid}}{\sim}\;}F\) for some probability distribution \(F\). We wish to test \(H_0: F = F_{{\boldsymbol{\theta}}}\) vs \(H_1: F \not= F_{{\boldsymbol{\theta}}}\) for some parametric distribution \(F_{{\boldsymbol{\theta}}}\).
For observed data \(x_1, x_2, \ldots, x_n\) we form the empirical distribution function (edf) \(\hat{F}_{{\boldsymbol{x}}}\) and the test statistic
\[ D({\boldsymbol{x}}) = \sup_{z} \left| \hat{F}_{{\boldsymbol{x}}}(z) - F_{{\boldsymbol{\theta}}}(z) \right|. \]
The null distribution of this test can be approximated based on a stochastic process called the Brownian bridge.
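To make the statistic concrete, here is a small sketch (not from the original notes) that computes \(D\) directly for a hypothesized Normal(0,1) null. Because \(\hat{F}_{{\boldsymbol{x}}}\) is a step function, the supremum is attained at the sorted data points, where the edf takes the values \(i/n\) and \((i-1)/n\).
# manual one-sample KS statistic against F_theta = Normal(0, 1)
set.seed(123)
x <- rnorm(50)
z <- sort(x)
n <- length(z)
Fz <- pnorm(z)    # hypothesized cdf evaluated at the sorted data
D <- max(pmax((1:n)/n - Fz, Fz - (0:(n-1))/n))
D                 # agrees with ks.test(x, "pnorm")$statistic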
52.6 Two Sample KS Test
Suppose \(X_1, X_2, \ldots, X_m {\; \stackrel{\text{iid}}{\sim}\;}F_X\) and \(Y_1, Y_2, \ldots, Y_n {\; \stackrel{\text{iid}}{\sim}\;}F_Y\). We wish to test \(H_0: F_X = F_Y\) vs \(H_1: F_X \not= F_Y\).
For observed data \(x_1, x_2, \ldots, x_m\) and \(y_1, y_2, \ldots, y_n\) we form the edf's \(\hat{F}_{{\boldsymbol{x}}}\) and \(\hat{F}_{\boldsymbol{y}}\). We then form the test statistic
\[ D({\boldsymbol{x}},\boldsymbol{y}) = \sup_{z} \left| \hat{F}_{{\boldsymbol{x}}}(z) - \hat{F}_{\boldsymbol{y}}(z) \right|. \]
The null distribution of this statistic can be approximated using results on edf’s.
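Here is a small sketch (not from the original notes) of computing the two-sample statistic directly: both edf's are step functions, so the supremum can be evaluated at the pooled data points (assuming no ties).
# manual two-sample KS statistic from the two edf's (assumes no ties)
set.seed(456)
x <- rnorm(100)
y <- rexp(100)
Fx <- ecdf(x)         # edf of the x sample
Fy <- ecdf(y)         # edf of the y sample
z <- sort(c(x, y))    # pooled data points
D <- max(abs(Fx(z) - Fy(z)))
D                     # agrees with ks.test(x, y)$statistic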
Both of these tests can be carried out using the ks.test() function in R.
ks.test(x, y, ...,
        alternative = c("two.sided", "less", "greater"),
        exact = NULL)
52.7 Example: Exponential vs Normal
Two-sample KS test. Both samples below have mean 1, so a location test such as the Wilcoxon rank sum test detects little difference, whereas the KS test compares the entire distributions.
> x <- rnorm(100, mean=1)
> y <- rexp(100, rate=1)
> wilcox.test(x, y)
Wilcoxon rank sum test with continuity correction
data: x and y
W = 5021, p-value = 0.9601
alternative hypothesis: true location shift is not equal to 0
> ks.test(x, y)
Two-sample Kolmogorov-Smirnov test
data: x and y
D = 0.19, p-value = 0.0541
alternative hypothesis: two-sided
> qqplot(x, y); abline(0,1)
One sample KS tests.
> ks.test(x=x, y="pnorm")
One-sample Kolmogorov-Smirnov test
data: x
D = 0.41398, p-value = 2.554e-15
alternative hypothesis: two-sided
>
> ks.test(x=x, y="pnorm", mean=1)
One-sample Kolmogorov-Smirnov test
data: x
D = 0.068035, p-value = 0.7436
alternative hypothesis: two-sided
Standardize (center by the mean, scale by the standard deviation) the observations before comparing to a Normal(0,1) distribution. Note that because the mean and standard deviation are estimated from the same data, the usual KS null distribution is only an approximation here.
> ks.test(x=((x-mean(x))/sd(x)), y="pnorm")
One-sample Kolmogorov-Smirnov test
data: ((x - mean(x))/sd(x))
D = 0.05896, p-value = 0.8778
alternative hypothesis: two-sided
>
> ks.test(x=((y-mean(y))/sd(y)), y="pnorm")
One-sample Kolmogorov-Smirnov test
data: ((y - mean(y))/sd(y))
D = 0.14439, p-value = 0.03092
alternative hypothesis: two-sided