13 Distribution of sample means
13.1 Learning objectives
Understand the concept of a sample mean as a random variable.
Understand the concept of the distribution of sample means.
13.2 Introduction
Every time we draw a sample from any random variable, we can compute the sample mean of that sample.
In general, if we repeat that procedure many times, each sample will contain different outcomes and will therefore yield a different sample mean.
This means that the sample mean is itself a random variable: each time we measure it, we can get a different number.
Let's examine this by considering samples from a Binomial random variable and samples from a Normal random variable, with distributions as specified below.
More precisely, consider the following:
\(X \sim \mathrm{Binomial}(n=10, p=0.5)\)
## sample_number sample_outcome sample_mean
## 1: sample_1 4 5
## 2: sample_1 3 5
## 3: sample_1 6 5
## 4: sample_1 3 5
## 5: sample_1 6 5
## 6: sample_1 5 5
## 7: sample_1 7 5
## 8: sample_1 2 5
## 9: sample_1 8 5
## 10: sample_1 6 5
## 11: sample_2 4 5
## 12: sample_2 4 5
## 13: sample_2 6 5
## 14: sample_2 3 5
## 15: sample_2 7 5
## 16: sample_2 4 5
## 17: sample_2 5 5
## 18: sample_2 6 5
## 19: sample_2 4 5
## 20: sample_2 7 5
## 21: sample_3 3 6
## 22: sample_3 7 6
## 23: sample_3 7 6
## 24: sample_3 6 6
## 25: sample_3 9 6
## 26: sample_3 6 6
## 27: sample_3 3 6
## 28: sample_3 6 6
## 29: sample_3 5 6
## 30: sample_3 8 6
## sample_number sample_outcome sample_mean
\(Y \sim \mathcal{N}(\mu=5, \sigma=2.5)\)
## sample_number sample_outcome sample_mean
## 1: sample_1 1.07982714 5.128237
## 2: sample_1 5.17417802 5.128237
## 3: sample_1 4.23205034 5.128237
## 4: sample_1 4.96986678 5.128237
## 5: sample_1 12.23713598 5.128237
## 6: sample_1 1.50329880 5.128237
## 7: sample_1 6.08028350 5.128237
## 8: sample_1 9.58164429 5.128237
## 9: sample_1 3.47244364 5.128237
## 10: sample_1 2.95164322 5.128237
## 11: sample_2 5.12077365 3.982125
## 12: sample_2 8.25137842 3.982125
## 13: sample_2 4.14218790 3.982125
## 14: sample_2 2.43552182 3.982125
## 15: sample_2 5.17637135 3.982125
## 16: sample_2 -0.04454817 3.982125
## 17: sample_2 1.31136220 3.982125
## 18: sample_2 7.71615700 3.982125
## 19: sample_2 6.14703894 3.982125
## 20: sample_2 -0.43499107 3.982125
## 21: sample_3 6.54404064 4.965397
## 22: sample_3 -0.76198838 4.965397
## 23: sample_3 3.88257822 4.965397
## 24: sample_3 5.74872669 4.965397
## 25: sample_3 1.42881353 4.965397
## 26: sample_3 8.16874371 4.965397
## 27: sample_3 8.03626447 4.965397
## 28: sample_3 3.31286017 4.965397
## 29: sample_3 7.80255477 4.965397
## 30: sample_3 5.49137444 4.965397
## sample_number sample_outcome sample_mean
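Tables like the two above can be generated with a short simulation. Here is a minimal sketch in R (the variable names are illustrative, not from the original code):

```r
library(data.table)

set.seed(1)  # for reproducibility

n_per_sample <- 10  # outcomes drawn per sample
n_samples <- 3      # number of samples

# draw 3 samples of 10 outcomes each from X ~ Binomial(n = 10, p = 0.5)
dd <- data.table(
  sample_number = rep(paste0("sample_", 1:n_samples), each = n_per_sample),
  sample_outcome = rbinom(n_samples * n_per_sample, size = 10, prob = 0.5)
)

# compute the sample mean within each sample
dd[, sample_mean := mean(sample_outcome), by = sample_number]
dd
```

Replacing the `rbinom(...)` call with `rnorm(n_samples * n_per_sample, mean = 5, sd = 2.5)` produces a table like the one shown for \(Y\).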
We see that every time we sample from either random variable:
We get a sample mean \(\bar{\boldsymbol{x}}\) that is close to the population mean \(\mu\), but rarely matches it exactly.
Different samples generally lead to different values for \(\bar{\boldsymbol{x}}\).
Thus, we have verified that \(\bar{\boldsymbol{x}}\) must itself be a random variable.
Moving forward, we will denote the random variable corresponding to the distribution of sample means with the symbol \(\bar{X}\), and continue to use \(\bar{\boldsymbol{x}}\) to refer to a particular sample mean.
\[ \begin{align} X & \rightarrow \{x_{1}, \ldots, x_{n}\} \\ \\ \bar{X} & \rightarrow \frac{1}{n} \left( x_{1} + \ldots + x_{n} \right) \end{align} \]
Notice that \(n>1\) samples from \(X\) are needed to generate a single observation of \(\bar{X}\).
This means that in order to estimate the distribution of sample means, we must repeat the process of drawing \(n\) samples from \(X\) many times.
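This repeated-sampling procedure can be sketched in a few lines of R (`n` and `n_reps` below are illustrative names for the sample size and number of repetitions):

```r
set.seed(2)  # for reproducibility

n <- 10       # samples drawn from X (or Y) per experiment
n_reps <- 500 # number of times the experiment is repeated

# each replication draws n outcomes and returns their sample mean
xbar <- replicate(n_reps, mean(rbinom(n, size = 10, prob = 0.5)))
ybar <- replicate(n_reps, mean(rnorm(n, mean = 5, sd = 2.5)))

# xbar and ybar now hold 500 draws each from the distributions of
# sample means; their histograms estimate those distributions
hist(xbar)
hist(ybar)
```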
If we perform an experiment in which we draw \(n=10\) samples from \(X\) or from \(Y\), compute the sample means for each (\(\bar{x}\) and \(\bar{y}\)), and repeat 500 times, then we get the following estimate for the distribution of sample means:
Note that even though \(X\) is discrete, \(\bar{X}\) is continuous. We also note that both \(\bar{X}\) and \(\bar{Y}\) look bell-shaped. This is because of something called the central limit theorem. Before we say more about this very important theorem, let's investigate how the original distributions (\(X\) and \(Y\)) compare to their corresponding distributions of sample means (\(\bar{X}\) and \(\bar{Y}\)) in terms of central tendency and spread.
We can see a few important things from these plots:
The central tendency (i.e., mean) of the distribution of sample means \(\bar{X}\) looks to be about the same as that for the original distribution \(X\).
The spread of the distribution of sample means \(\bar{X}\) looks to be smaller than that for the original distribution \(X\).
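Both observations can be checked numerically. A quick sketch in R, comparing draws from \(X\) against draws from \(\bar{X}\) (sample sizes here are illustrative):

```r
set.seed(3)  # for reproducibility

n <- 10  # samples from X per sample mean

# many draws from the original distribution X
x <- rbinom(100000, size = 10, prob = 0.5)

# many draws from the distribution of sample means X-bar
xbar <- replicate(500, mean(rbinom(n, size = 10, prob = 0.5)))

mean(x); mean(xbar)  # both close to the population mean mu = 5
sd(x); sd(xbar)      # sd(xbar) is noticeably smaller than sd(x)
```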
The central limit theorem helps us formalise both of these observations.