5 Basic descriptive statistics
5.1 Central tendency of a sample
Given a sample, a measure of central tendency is supposed to tell us where most values tend to be clustered. One very common measure of sample central tendency is called the sample mean. The sample mean is denoted by \(\overline{\boldsymbol{x}}\), and is defined by the following equation:
\[ \overline{\boldsymbol{x}} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} \]
We can write this concisely as:
\[ \overline{\boldsymbol{x}} = \frac{1}{n} \sum_{i=1}^{n} x_{i} \]
Another common measure of sample central tendency is called the sample median. We will denote it by \(\widetilde{\boldsymbol{x}}\), and it is defined simply as the value that splits the sorted observations in half, so that half of them fall below it and half above it. Finally, the sample mode is the value that occurs most often in the sample.
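The mode is the one measure here that base R does not compute for us directly (base R's mode() reports how an object is stored, not the most frequent value). A minimal sketch of one way to get it is shown below, assuming numeric data; the helper name sample_mode and the example data are made up for illustration.

```r
# Hypothetical helper (not part of base R): most frequent value(s) in x.
sample_mode <- function(x) {
  counts <- table(x)                            # frequency of each distinct value
  top <- names(counts)[counts == max(counts)]   # keep every value tied for first
  as.numeric(top)                               # table() stores the values as names
}

x <- c(3, 7, 7, 2, 7, 3)   # made-up data
sample_mode(x)             # 7 occurs three times, more than any other value
```

If several values tie for most frequent, this sketch returns all of them, which is one reasonable convention among several.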
5.1.1 Central tendency by hand
Suppose you have the following observations: \[ \boldsymbol{x} = (55, 35, 23, 44, 31) \]
To compute the mean, we simply plug these numbers into the equation. \[ \overline{\boldsymbol{x}} = \frac{55 + 35 + 23 + 44 + 31}{5} = \frac{188}{5} = 37.6 \]
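The same arithmetic can be mirrored line for line in R as a sanity check. This is only a sketch of the formula, spelled out with sum() and length(); the variable names n and x_bar are our own choices.

```r
x <- c(55, 35, 23, 44, 31)   # the sample from the text

n <- length(x)               # number of observations, n = 5
x_bar <- sum(x) / n          # (x_1 + ... + x_n) / n = 188 / 5

x_bar                        # 37.6
```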
To compute the median, first sort the data from smallest to largest: \[ \boldsymbol{x}_{sorted} = (23, 31, 35, 44, 55) \]
Then, pick the value that ends up in the middle: \[ \widetilde{\boldsymbol{x}} = 35 \]
Since we have an odd number of observations, finding the median is pretty intuitive, but what if we had an even number of observations? In this case, we will take the mean of the middle two numbers.
\[ \boldsymbol{x} = (55, 35, 23, 44) \]
\[ \boldsymbol{x}_{sorted} = (23, 35, 44, 55) \]
\[ \widetilde{\boldsymbol{x}} = \frac{35 + 44}{2} = 39.5 \]
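The sort-then-pick-the-middle recipe can be written out as a short R function. This is a sketch that follows the by-hand steps above rather than anything you need in practice; the name median_by_hand is hypothetical.

```r
# Hypothetical helper that follows the by-hand recipe from the text.
median_by_hand <- function(x) {
  s <- sort(x)                       # sort from smallest to largest
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                   # odd n: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2    # even n: mean of the two middle values
  }
}

median_by_hand(c(55, 35, 23, 44, 31))   # 35
median_by_hand(c(55, 35, 23, 44))       # 39.5
```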
5.1.2 Central tendency using R
In general, things are easier and we are happier and more productive human beings if we use R. We just store our sample observations in a variable x, and use the built-in R functions mean() and median() to compute the sample mean and sample median.
x <- c(55, 35, 23, 44, 31)
mean(x)
## [1] 37.6
median(x)
## [1] 35
x <- c(55, 35, 23, 44)
mean(x)
## [1] 39.25
median(x)
## [1] 39.5
5.1.3 Central tendency and outliers
Sometimes a sample contains a few observations that are very different from the majority of the others. These observations are called outliers. How will outliers influence our measures of central tendency? To answer this question, consider the rat maze running example from above.
# define vectors that contain the maze times from the
# example given at the top of the lecture.
x1 <- c(52.38, 55.41, 70.88, 43.30, 50.15, 41.99, 36.82, 34.05, 52.70, 72.25)
x2 <- c(62.36, 53.89, 53.95, 33.81, 61.12, 61.48, 36.89, 49.45, 52.50, 50.95)
x3 <- c(52.04, 48.28, 48.12, 58.89, 51.76, 42.88, 49.04, 60.41, 53.99, 70.06)
# combine all observations into a single vector
x <- c(x1, x2, x3)
# compute the mean and median of x
mean(x)
## [1] 52.06
median(x)
## [1] 52.21
# add an outlier to x
x <- c(x, 300)
# compute the mean and median of x with the outlier in the data
mean(x)
## [1] 60.05806
median(x)
## [1] 52.38
Here, you can see that the mean is sensitive to outliers, while the median is much less so: a single extreme value of 300 pulled the mean from 52.06 up to about 60.06, but moved the median only from 52.21 to 52.38. So, which is the better measure of central tendency? The answer to this question depends entirely on what you consider an outlier and how much you care about such observations. Saying much more than that is beyond the scope of this lecture, but we should leave with at least a simple lesson: it is always a good idea to identify and investigate outliers in our data.
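As one possible starting point for identifying outliers, the sketch below uses base R's boxplot.stats(), which flags values lying more than 1.5 times the interquartile range beyond the quartiles (hinges). That 1.5 rule is a common convention and an assumption on our part, not something this lecture prescribes, and the data are made up for illustration.

```r
# Made-up sample with one suspicious value
x <- c(52, 48, 55, 50, 47, 53, 300)

# boxplot.stats() returns, among other things, the points that fall
# beyond the boxplot whiskers (default coef = 1.5)
out <- boxplot.stats(x)$out
out                        # 300

# Compare the summaries with and without the flagged values
x_clean <- x[!(x %in% out)]
c(mean(x), mean(x_clean))       # the mean changes a lot
c(median(x), median(x_clean))   # the median changes only a little
```

A flagged value is a prompt to go and look at the data, not an automatic reason to delete it.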
5.2 Spread of a sample
Measures of spread of a sample are supposed to tell us how widely the sample observations are distributed. One very common measure of spread is called sample variance. It is denoted by \(\boldsymbol{s}^2\) and it is defined as:
\[ \boldsymbol{s}^2 = \frac{1}{n-1} \sum_{i=1}^{n} ( x_{i} - \overline{\boldsymbol{x}} )^2 \]
An additional measure of spread is called the sample standard deviation. It is denoted by \(\boldsymbol{s}\), and it is defined simply as the square root of the sample variance.
\[ \boldsymbol{s} = \sqrt{\boldsymbol{s}^2} \]
A third measure of spread that we will consider is called the sample range, and it is defined as the difference between the largest and smallest observed values.
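To make the three definitions concrete, here is a minimal sketch that translates each of them into R one piece at a time. The sample is made up for illustration, and the variable names (x_bar, s2, s, r) are our own.

```r
x <- c(2, 4, 4, 6)                   # made-up sample

n <- length(x)
x_bar <- sum(x) / n                  # sample mean, here 4

s2 <- sum((x - x_bar)^2) / (n - 1)   # sample variance: 8 / 3, about 2.67
s  <- sqrt(s2)                       # sample standard deviation, about 1.63
r  <- max(x) - min(x)                # sample range: 6 - 2 = 4
```

Note that the sketch uses max(x) - min(x) for the range because base R's range() returns the smallest and largest values themselves rather than their difference.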
5.2.1 Spread by hand
Consider the following sample:
\[ \boldsymbol{x} = (55, 35, 23, 44) \]
If for some reason you needed to compute sample variance, and every computer near you was broken, then you could compute the sample variance of this sample by hand as follows:
\[ \boldsymbol{s}^2 = \frac{ (55-39.25)^2 + (35 -39.25)^2 + (23-39.25)^2 + (44-39.25)^2 }{4-1} \]
\[ \boldsymbol{s}^2 = \frac{ (15.75)^2 + (-4.25)^2 + (-16.25)^2 + (4.75)^2 }{4-1} \]
\[ \boldsymbol{s}^2 = \frac{ 248.0625 + 18.0625 + 264.0625 + 22.5625 }{4-1} \]
\[ \boldsymbol{s}^2 = \frac{ 552.75 }{4-1} = 184.25 \]
Well, that sucked, and in the next section we will see that R will do this for us with grace and ease.