20 t-test
So far, in all of our hypothesis tests we have been lucky enough to know both the mean and the variance of the sampling distribution. The mean is specified by \(H_0\) and \(H_1\), and the variance has either fallen out naturally (e.g., as with the Binomial test) or I have simply given you a number and told you to pretend that we know it to be true (e.g., the previous cheese maze example). Of course, in most real-world scenarios we will not know the variance of the sampling distribution, which means that the approaches we have developed so far aren't quite appropriate. Here is what we do instead:
Let \(X_1, X_2, \ldots, X_n\) be independent and identically distributed as
\[ X_i \sim N(\mu_X, \sigma_X) \]
and define two random variables \(\bar{X}\) and \(S^2\) as
\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \]
\[ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 \]
then the random variable
\[ \frac{\bar{X} - \mu_X}{\frac{\sigma_X}{\sqrt{n}}} \sim N(0, 1) = Z \]
and
\[ \frac{\bar{X} - \mu_X}{\frac{S}{\sqrt{n}}} \sim t(n-1) \]
where \(t\) is a t-distribution, which is completely defined by a single parameter, the degrees of freedom, here equal to \(n-1\). This means that the mathematical form of our sampling distribution differs depending on whether or not we know \(\sigma_X\). Let's examine how this pans out using our previous cheese example, but without assuming known variance.
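As a quick aside, base R's `var()` uses exactly the \(n-1\) denominator in the definition of \(S^2\) above, which is easy to verify:

```r
x <- c(2, 4, 6, 8)
xbar <- mean(x)
# divide the sum of squared deviations by n - 1, as in the definition of S^2
s2_by_hand <- sum((x - xbar)^2) / (length(x) - 1)
stopifnot(all.equal(s2_by_hand, var(x)))  # var() also divides by n - 1
```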
1. Specify the null and alternative hypotheses (\(H_0\) and \(H_1\)) in terms of a distribution and population parameter.
\[ H_0: \mu = 90 \\ H_1: \mu < 90 \]
2. Specify the type I error rate – denoted by the symbol \(\alpha\) – you are willing to tolerate.
\[ \alpha = 0.05 \]
3. Specify the sample statistic that you will use to estimate the population parameter in step 1 and state how it is distributed under the assumption that \(H_0\) is true.
In this example we do not know \(\sigma_{x}\), and so we must estimate it. This means that we do not want to reason using the observed \(\bar{x}\) value and its corresponding sampling distribution, but instead want to reason using an observed \(t\) value and its corresponding t-distribution.
\[ t_{obs} = \frac{\bar{x} - \mu_x}{\frac{s_x}{\sqrt{n}}} \sim t(n-1) \]
4. Obtain a random sample and use it to compute the sample statistic from step 3. Call this value \(\widehat{\theta}_{\text{obs}}\).
For our data, the following is true:
- \(n = 100\),
- \(\bar{x} = 90.1349684\),
- \(s_x = 9.5456853\),
- \(t_{obs} = 0.141392\).
The above plot shows the \(t(n-1)\) sampling distribution in colour, and the \(Z \sim \mathcal{N}(0,1)\) distribution in black. The \(t\) distribution has heavier tails than \(Z\). This is because the t-value depends on two random variables (the sample mean and the sample variance), while the z-value depends on only one random variable (the sample mean). However, it is easy to see that the difference between \(t\) and \(Z\) shrinks as \(n\) increases.
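This convergence is easy to check numerically by comparing upper critical values as the degrees of freedom grow; a quick sketch in base R:

```r
# upper 2.5% critical value of t(n-1) approaches the Normal value as n grows
for (n in c(5, 15, 30, 100, 1000)) {
  cat(sprintf("n = %4d: t critical value = %.4f\n", n, qt(0.975, df = n - 1)))
}
cat(sprintf("Normal:    z critical value = %.4f\n", qnorm(0.975)))
```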
5. If \(\widehat{\theta}_{\text{obs}}\) is very unlikely to occur under the assumption that \(H_0\) is true, then reject \(H_0\). Otherwise, do not reject \(H_0\).
When computing the p-value, we will turn to `pt()`. From the plot above, and from reasoning about the alternative hypothesis, we see that we need `lower.tail=TRUE`.
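A sketch of those two computations, using the observed t-value from step 4 and the \(n - 1 = 99\) degrees of freedom reported in the `t.test()` output below:

```r
t_obs <- 0.141392  # observed t-value from step 4
df <- 99           # degrees of freedom, n - 1

# p-value: P(T <= t_obs) under H0, so lower.tail = TRUE for H1: mu < 90
pt(t_obs, df = df, lower.tail = TRUE)

# critical value: cuts off the lower alpha = 0.05 tail of t(df)
qt(0.05, df = df, lower.tail = TRUE)
```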
## [1] 0.5560762
## [1] -1.660391
Finally, there is a built-in function called `t.test()` that will do all of this for you.
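The output below was presumably produced by a call of this form, assuming the sample is stored in a vector named `x_obs`:

```r
# one-sided test of H0: mu = 90 against H1: mu < 90
t.test(x_obs, mu = 90, alternative = "less")
```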
##
## One Sample t-test
##
## data: x_obs
## t = 0.14139, df = 99, p-value = 0.5561
## alternative hypothesis: true mean is less than 90
## 95 percent confidence interval:
## -Inf 91.71993
## sample estimates:
## mean of x
## 90.13497
20.1 Two-tailed t-test
- The t-test arises in Normal-test scenarios in which the population standard deviation \(\sigma_X\) is unknown and must be estimated from the sample.
1. Specify the null and alternative hypotheses (\(H_0\) and \(H_1\)) in terms of a distribution and population parameter.
\[ H_0: \mu = 90 \\ H_1: \mu \neq 90 \]
2. Specify the type I error rate – denoted by the symbol \(\alpha\) – you are willing to tolerate.
\[ \alpha = 0.05 \]
3. Specify the sample statistic that you will use to estimate the population parameter in step 1 and state how it is distributed under the assumption that \(H_0\) is true.
\[ \widehat{\mu} = \bar{x} \\ \bar{x} \sim \mathcal{N}(\mu_{\bar{x}}, \sigma_{\bar{x}}) \\ \mu_{\bar{x}} = \mu_{x} \\ \sigma_{\bar{x}} = \frac{\sigma_{x}}{\sqrt{n}} \rightarrow \widehat{\sigma}_{\bar{x}} = \frac{s_{x}}{\sqrt{n}} \\ t = \frac{\bar{x} - \mu_x}{\frac{s_x}{\sqrt{n}}} \sim t(n-1) \]
4. Obtain a random sample and use it to compute the sample statistic from step 3. Call this value \(\widehat{\theta}_{\text{obs}}\).
The researchers perform 15 trials and measure the time to cheese on each trial. The data are as follows:
xobs <- c(105.25909, 73.47533, 106.59599, 105.44859, 88.29283,
          49.20100, 61.42866, 74.10559, 79.88466, 128.09307,
          95.27187, 64.01982, 57.04686, 74.21077, 74.01570)
n <- length(xobs)
xbarobs <- mean(xobs)
sigxbarobs <- sd(xobs) / sqrt(n)  # estimated standard error of the mean
muxbar <- 90                      # mean of the sampling distribution under H0
tobs <- (xbarobs - muxbar) / sigxbarobs
5. If \(\widehat{\theta}_{\text{obs}}\) is very unlikely to occur under the assumption that \(H_0\) is true, then reject \(H_0\). Otherwise, do not reject \(H_0\).
# two-tailed test: t-values extreme in either direction count against H0,
# so use abs() to get the cutoffs regardless of the sign of tobs
tobs_upper <- abs(tobs)
tobs_lower <- -abs(tobs)
# critical values cutting off alpha/2 = 0.025 in each tail
t_crit_upper <- qt(0.05/2, n-1, lower.tail=FALSE)
t_crit_lower <- qt(0.05/2, n-1, lower.tail=TRUE)
# compute p-value by hand: sum the probability in both tails
pval_upper <- pt(tobs_upper, n-1, lower.tail=FALSE)
pval_lower <- pt(tobs_lower, n-1, lower.tail=TRUE)
pval <- pval_upper + pval_lower
pval
## [1] 0.2024876
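The built-in `t.test()` reproduces this result; with the `xobs` vector defined above, the output below comes from:

```r
# two-sided test of H0: mu = 90 against H1: mu != 90 (the default alternative)
t.test(xobs, mu = 90, alternative = "two.sided")
```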
##
## One Sample t-test
##
## data: xobs
## t = -1.3372, df = 14, p-value = 0.2025
## alternative hypothesis: true mean is not equal to 90
## 95 percent confidence interval:
## 70.27055 94.57609
## sample estimates:
## mean of x
## 82.42332