Tutorial Code

Normal test vs Z test vs T test

# Normal test p value for temperature example
xbar_sd <- 5/sqrt(90)
pnorm(26, 23, xbar_sd, lower.tail=FALSE)
## [1] 6.274324e-09
# p = 6.274324e-09

# Z test p value for temperature example
z = (26-23)/xbar_sd
pnorm(z, 0, 1, lower.tail=FALSE)
## [1] 6.274324e-09
# p = 6.274324e-09

# The p-value is the same!
# The normal test works on the original scale (mean 23, sd xbar_sd), while the z test works on the standardized scale (mean 0, sd 1). Standardizing the observation, z = (26 - 23)/xbar_sd, maps one scale onto the other, so the "area under the curve" beyond the observed value (the p-value) is identical.


# T test p value for temperature example
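# Here 5 is treated as the sample standard deviation s, and df = n - 1 = 89
# (named t_stat so it does not mask base R's t() function)
t_stat <- (26 - 23)/(5/sqrt(90))
pt(t_stat, df = 89, lower.tail=FALSE)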
## [1] 7.937711e-08
# p = 7.937711e-08

# The p-value for the t test is different
# Why?
# The distribution we used to model the null (t distribution) is defined by a different parameter (degrees of freedom, df) rather than mu and sigma. With df = 89 it has heavier tails than the standard normal, so more "area under the curve" lies beyond the same test statistic and the p-value is larger.
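
To see how the choice of null distribution changes the tail area, the sketch below evaluates the same test statistic under t distributions with several degrees of freedom and under the standard normal. The df values other than 89 are illustrative assumptions, not part of the temperature example.

```r
# Same test statistic as in the temperature example
stat <- (26 - 23) / (5 / sqrt(90))

# Illustrative degrees of freedom (89 matches n - 1 = 90 - 1; the rest are assumptions)
dfs <- c(5, 30, 89, 1000)

# Upper-tail area beyond the statistic under each t distribution
p_t <- pt(stat, df = dfs, lower.tail = FALSE)

# Upper-tail area under the standard normal (the z test p-value)
p_z <- pnorm(stat, lower.tail = FALSE)

# Side-by-side comparison: how many times larger the t tail is than the normal tail
data.frame(df = dfs, p_t = p_t, ratio_to_normal = p_t / p_z)
```

As df grows, the t tail area shrinks towards the normal tail area, which is why the two tests disagree most for small samples.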

Practicing data wrangling and analysis

This is a simple example of data wrangling and analysis. We will be using an example dataset from R (iris) that we used in the Week 3 ggplot tutorial. This is an example of what’s to come and what to expect when you start wrangling, cleaning and analysing your own data for your final project.

Setting up pt 2

Research Question: Does sepal length differ significantly between the virginica and versicolor species?

# What test will we need to run for this RQ?
# We can't run a normal or z test because we don't know the population parameters (mu and sigma)
# We have 2 unrelated samples (e.g. not before/after measurements on the same flowers), so there are no natural pairs to take differences between, and it cannot be a paired t test
# Independent samples t test!

# What are our null and alternative hypotheses? And what are our key random variables?
#H0: muX - muY = 0
#H1: muX - muY != 0 (two-sided, because the RQ asks whether the species differ)
#X = virginica sepal length
#Y = versicolor sepal length

# What is the function we need for that test?
# What values do we need to input into that function?

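# Separating out our data
# The [i, j] syntax with .() below is data.table syntax, so iris must be a data.table first
library(data.table)
iris <- as.data.table(iris)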
iris[Species == "virginica", .(Sepal.Length, Species)]
##     Sepal.Length   Species
##            <num>    <fctr>
##  1:          6.3 virginica
##  2:          5.8 virginica
##  3:          7.1 virginica
##  4:          6.3 virginica
##  5:          6.5 virginica
##  6:          7.6 virginica
##  7:          4.9 virginica
##  8:          7.3 virginica
##  9:          6.7 virginica
## 10:          7.2 virginica
## 11:          6.5 virginica
## 12:          6.4 virginica
## 13:          6.8 virginica
## 14:          5.7 virginica
## 15:          5.8 virginica
## 16:          6.4 virginica
## 17:          6.5 virginica
## 18:          7.7 virginica
## 19:          7.7 virginica
## 20:          6.0 virginica
## 21:          6.9 virginica
## 22:          5.6 virginica
## 23:          7.7 virginica
## 24:          6.3 virginica
## 25:          6.7 virginica
## 26:          7.2 virginica
## 27:          6.2 virginica
## 28:          6.1 virginica
## 29:          6.4 virginica
## 30:          7.2 virginica
## 31:          7.4 virginica
## 32:          7.9 virginica
## 33:          6.4 virginica
## 34:          6.3 virginica
## 35:          6.1 virginica
## 36:          7.7 virginica
## 37:          6.3 virginica
## 38:          6.4 virginica
## 39:          6.0 virginica
## 40:          6.9 virginica
## 41:          6.7 virginica
## 42:          6.9 virginica
## 43:          5.8 virginica
## 44:          6.8 virginica
## 45:          6.7 virginica
## 46:          6.7 virginica
## 47:          6.3 virginica
## 48:          6.5 virginica
## 49:          6.2 virginica
## 50:          5.9 virginica
##     Sepal.Length   Species
iris[Species == "versicolor", .(Sepal.Length, Species)]
##     Sepal.Length    Species
##            <num>     <fctr>
##  1:          7.0 versicolor
##  2:          6.4 versicolor
##  3:          6.9 versicolor
##  4:          5.5 versicolor
##  5:          6.5 versicolor
##  6:          5.7 versicolor
##  7:          6.3 versicolor
##  8:          4.9 versicolor
##  9:          6.6 versicolor
## 10:          5.2 versicolor
## 11:          5.0 versicolor
## 12:          5.9 versicolor
## 13:          6.0 versicolor
## 14:          6.1 versicolor
## 15:          5.6 versicolor
## 16:          6.7 versicolor
## 17:          5.6 versicolor
## 18:          5.8 versicolor
## 19:          6.2 versicolor
## 20:          5.6 versicolor
## 21:          5.9 versicolor
## 22:          6.1 versicolor
## 23:          6.3 versicolor
## 24:          6.1 versicolor
## 25:          6.4 versicolor
## 26:          6.6 versicolor
## 27:          6.8 versicolor
## 28:          6.7 versicolor
## 29:          6.0 versicolor
## 30:          5.7 versicolor
## 31:          5.5 versicolor
## 32:          5.5 versicolor
## 33:          5.8 versicolor
## 34:          6.0 versicolor
## 35:          5.4 versicolor
## 36:          6.0 versicolor
## 37:          6.7 versicolor
## 38:          6.3 versicolor
## 39:          5.6 versicolor
## 40:          5.5 versicolor
## 41:          5.5 versicolor
## 42:          6.1 versicolor
## 43:          5.8 versicolor
## 44:          5.0 versicolor
## 45:          5.6 versicolor
## 46:          5.7 versicolor
## 47:          5.7 versicolor
## 48:          6.2 versicolor
## 49:          5.1 versicolor
## 50:          5.7 versicolor
##     Sepal.Length    Species
# The above won't work for t.test(): it returns data.tables (including the Species column), but t.test() needs numeric vectors, so we extract Sepal.Length as a vector instead
x_obs <- iris[Species == "virginica", Sepal.Length]
y_obs <- iris[Species == "versicolor", Sepal.Length]

# Eyeball check: are the two sample variances roughly equal?
var(x_obs)
## [1] 0.4043429
var(y_obs)
## [1] 0.2664327
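
The eyeball check can be complemented with a formal comparison. The sketch below uses `var.test()` from base R's stats package, which runs an F test of whether the ratio of the two variances is 1. It is an optional extra rather than part of the tutorial's method, it assumes both samples are roughly normally distributed, and it subsets with base R so that it runs whether or not iris has been converted to a data.table.

```r
# Extract the two samples with base R subsetting
x_obs <- iris$Sepal.Length[iris$Species == "virginica"]
y_obs <- iris$Sepal.Length[iris$Species == "versicolor"]

# F test of H0: the two population variances are equal (variance ratio = 1)
f_res <- var.test(x_obs, y_obs)

# The test statistic is the ratio of sample variances, var(x_obs) / var(y_obs)
f_res$statistic

# A small p-value would be evidence against equal variances
f_res$p.value
```

If there is any doubt about equal variances, setting var.equal = FALSE (Welch's t test) is the safer choice.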
# Running the t test
# What if we set var.equal = FALSE? (Welch's t test, the default in t.test)
res_unequal <- t.test(x = x_obs,
                      y = y_obs,
                      alternative = "two.sided",
                      mu = 0,
                      paired = FALSE,
                      var.equal = FALSE,
                      conf.level = 0.95)
# What if we set var.equal = TRUE? (pooled-variance two sample t test)
res_equal <- t.test(x = x_obs,
                    y = y_obs,
                    alternative = "two.sided",
                    mu = 0,
                    paired = FALSE,
                    var.equal = TRUE,
                    conf.level = 0.95)


res_unequal
## 
##  Welch Two Sample t-test
## 
## data:  x_obs and y_obs
## t = 5.6292, df = 94.025, p-value = 1.866e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.4220269 0.8819731
## sample estimates:
## mean of x mean of y 
##     6.588     5.936
res_equal
## 
##  Two Sample t-test
## 
## data:  x_obs and y_obs
## t = 5.6292, df = 98, p-value = 1.725e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.4221484 0.8818516
## sample estimates:
## mean of x mean of y 
##     6.588     5.936
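
To see where the numbers in the two outputs come from, the sketch below recomputes both test statistics, degrees of freedom and p-values by hand, using the standard Welch-Satterthwaite and pooled-variance formulas. The variable names are illustrative, and the data are taken with base R subsetting so the block runs on its own.

```r
# The two samples (X = virginica, Y = versicolor)
x <- iris$Sepal.Length[iris$Species == "virginica"]
y <- iris$Sepal.Length[iris$Species == "versicolor"]
n_x <- length(x)
n_y <- length(y)

# Welch (var.equal = FALSE): each sample contributes its own variance
vx <- var(x) / n_x
vy <- var(y) / n_y
t_welch <- (mean(x) - mean(y)) / sqrt(vx + vy)
# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (vx + vy)^2 / (vx^2 / (n_x - 1) + vy^2 / (n_y - 1))
p_welch <- 2 * pt(abs(t_welch), df_welch, lower.tail = FALSE)

# Pooled (var.equal = TRUE): one shared variance estimate
sp2 <- ((n_x - 1) * var(x) + (n_y - 1) * var(y)) / (n_x + n_y - 2)
t_pooled <- (mean(x) - mean(y)) / sqrt(sp2 * (1 / n_x + 1 / n_y))
df_pooled <- n_x + n_y - 2
p_pooled <- 2 * pt(abs(t_pooled), df_pooled, lower.tail = FALSE)

# Collect the hand-computed values for comparison with the t.test() output
data.frame(test = c("Welch", "Pooled"),
           t = c(t_welch, t_pooled),
           df = c(df_welch, df_pooled),
           p = c(p_welch, p_pooled))
```

Because both groups have the same size (50 each), the two standard errors coincide, so the t statistics match. Only the degrees of freedom differ, which is why the p-values and confidence intervals differ slightly.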

What other research questions could you ask of this dataset? Which tests would be appropriate for answering them? Practice, practice, practice :)