Introduction to Correlation

  • An independent-samples t-test assumes that the two samples \(x\) and \(y\) are independent.

  • Independence means that knowing the value of one variable tells you nothing about the value of the other.

  • This also means that you can write the joint probability of two independent random variables as the product of their marginal probabilities:

\[ P(X=x \cap Y=y) = P(X=x) P(Y=y) \]

Example of independence

  • Flipping two totally different coins.

  • The result of flipping coin 1 doesn’t affect the result of flipping coin 2, and vice versa.

  • This allows us to write:

\[ P(X=H \cap Y=T) = P(X=H) P(Y=T) \]
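We can check this factorisation by simulation. A minimal sketch (the number of flips here is arbitrary):

# simulate many flips of two independent fair coins
set.seed(1)
n_flips <- 1e5
coin1 <- sample(c("H", "T"), n_flips, replace = TRUE)
coin2 <- sample(c("H", "T"), n_flips, replace = TRUE)

# joint probability of (H, T) vs the product of the marginals
p_joint <- mean(coin1 == "H" & coin2 == "T")
p_product <- mean(coin1 == "H") * mean(coin2 == "T")
c(joint = p_joint, product = p_product)  # both approximately 0.25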

How can we know if two random variables are independent?

  • We can estimate whether RVs \(X\) and \(Y\) are independent by looking at their correlation.

  • Statistical independence implies zero correlation, but zero correlation does not imply independence.
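To see why the converse fails, here is a sketch in which \(Y\) is completely determined by \(X\) yet the two are (approximately) uncorrelated, because \(X\) is symmetric around zero:

# zero correlation does not imply independence:
# y is a deterministic function of x, yet their correlation is near zero
set.seed(1)
x <- rnorm(1e5)
y <- x^2
cor(x, y)  # approximately 0 despite perfect dependence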

Correlation is a measure of how two variables covary

  • We can write: \[ \begin{align} Var(X) &= E\left[ (X - \mu_X)^2 \right] \\ Var(Y) &= E\left[ (Y - \mu_Y)^2 \right] \\ Cov(X, Y) &= E \left[ (X - \mu_X) (Y - \mu_Y) \right] \\ \end{align} \]

  • Or equivalently: \[ \begin{align} \sigma^2_X &= Var(X) \\ \sigma^2_Y &= Var(Y) \\ Cov_{X,Y} &= Cov(X,Y) \\ \end{align} \]

Estimating covariance from a sample

  • Variance from a single sample: \[ \begin{align} s^2_{x} &= \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \\ &= \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x}) (x_i - \bar{x}) \end{align} \]

  • Covariance from two paired samples: \[ \begin{align} cov_{xy} &= \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x}) (y_i - \bar{y}) \end{align} \]
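A minimal check of these estimators in R, comparing the formulas above against the built-in var() and cov() (the simulated data are just for illustration):

# sample variance and covariance computed from the formulas above
set.seed(1)
n <- 50
x <- rnorm(n)
y <- rnorm(n)

var_x  <- sum((x - mean(x))^2) / (n - 1)
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

c(var_x, var(x))      # agrees with the built-in variance
c(cov_xy, cov(x, y))  # agrees with the built-in covariance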

Attributes of variance and covariance

  • Variance is large when the sample values \(x_i\) regularly deviate far from their mean \(\bar{x}\).

  • Covariance is large in magnitude when the \(x_i\) deviate from \(\bar{x}\) at the same time that the \(y_i\) deviate from \(\bar{y}\).

  • Variance is always \(\geq 0\).

  • Covariance is positive if deviations from the mean in \(X\) match the direction of deviations from the mean in \(Y\).

  • Covariance is negative if deviations from the mean in \(X\) and \(Y\) go in opposite directions.
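A quick sketch of this sign behaviour, using made-up generating relationships:

# covariance is positive when y tends to increase with x,
# and negative when y tends to decrease as x increases
set.seed(1)
x <- rnorm(100)
y_pos <-  x + rnorm(100, sd = 0.5)  # deviations move together
y_neg <- -x + rnorm(100, sd = 0.5)  # deviations move in opposite directions
cov(x, y_pos)  # positive
cov(x, y_neg)  # negative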

Correlation

  • Correlation is simply covariance standardised to vary from \(-1\) to \(+1\).

  • It’s represented by Pearson’s correlation coefficient, typically denoted as \(r\).

  • Named after Karl Pearson, who developed it.

\[ r = \frac{cov_{x,y}}{s_x s_y} \]

  • You can compute it in R using cor(x, y)
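For example, a minimal sketch showing that cor(x, y) matches the formula above:

# Pearson's r is the covariance scaled by the two standard deviations
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)
r_manual <- cov(x, y) / (sd(x) * sd(y))
c(r_manual, cor(x, y))  # identical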

[Figures: scatterplots illustrating zero, positive, and negative correlation]

Interpreting the Correlation Coefficient

  • How do we interpret the value of \(r\)?

  • A rule of thumb:

    • \(|r| > 0.8\) : Very strong relationship
    • \(0.5 < |r| \leq 0.8\) : Strong relationship
    • \(0.3 < |r| \leq 0.5\) : Moderate relationship
    • \(0.1 < |r| \leq 0.3\) : Weak relationship
    • \(|r| \leq 0.1\) : Very weak or no relationship

NHST: correlation coefficient

  • How do we know whether or not a sample correlation is indicative of a true population correlation?

  • We use NHST, the same as always.

1. State the null and alternative hypotheses

  • Null Hypothesis (\(H_0\)): There is no correlation between the two variables in the population, i.e., \(\rho = 0\), where \(\rho\) is the population correlation coefficient.

  • Alternative Hypothesis (\(H_a\)): There is a correlation between the two variables in the population, i.e., \(\rho \neq 0\).

2. Pick a type I error rate

  • \(\alpha = 0.05\)

3. Fully specify a statistic that estimates the parameter in step 1

  • \(\widehat{\rho} = r\)

  • Problem: \(r\) does not have a friendly familiar sampling distribution.

  • Solution: Transform it into a statistic with a familiar, easy-to-use sampling distribution.

\[ t_{obs} = r \sqrt{\frac{n-2}{1-r^2}} \sim t(n-2) \]

  • \(r\) is the sample correlation coefficient and \(n\) is the number of pairs of data points.

  • Note that \(r\) itself does not follow a t-distribution; the transformation above turns it into a statistic that follows a \(t\) distribution with \(n-2\) degrees of freedom.

Correlation NHST Using R

# generate some data
x <- rnorm(100)
y <- rnorm(100)

# test for significant correlation
cor.test(x, y)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = -0.31696, df = 98, p-value = 0.752
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2269924  0.1654571
## sample estimates:
##         cor 
## -0.03200104
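As a sketch, the t statistic and p-value reported by cor.test() can be reproduced by hand from \(r\) and \(n\) using the formula above (re-using the x and y generated in the previous chunk):

# reproduce cor.test's t statistic and two-sided p-value by hand
r <- cor(x, y)
n <- length(x)
t_obs <- r * sqrt((n - 2) / (1 - r^2))
p_val <- 2 * pt(-abs(t_obs), df = n - 2)
c(t = t_obs, p = p_val)  # matches the cor.test() output above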

Coefficient of determination (\(R^2\))

  • \(R^2 = r^2\)

  • \(R^2\) is a measure of the proportion of variance in one variable that can be predicted from the variance of another (see the quick example after this list).

  • I don’t know why we use a capital \(R\) here but I’d love to know the history. Please share if you know it.
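A one-line illustration with simulated data:

# R^2 is the squared correlation: the proportion of variance in y
# that can be predicted (linearly) from x
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)
cor(x, y)^2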

Correlation and causation

  • If \(X\) and \(Y\) are causally related, then they are correlated.

  • If \(X\) and \(Y\) are correlated, it does not mean that \(X\) causes \(Y\) or vice versa.

  • Example: Ice cream sales and drowning deaths are correlated. Does this mean that ice cream causes drowning? No: both are driven by a third variable, warm weather.

Other types of correlation

  • Spearman’s rho: This is a non-parametric test (it doesn’t assume that your data are distributed in any particular way), so it’s good to use when the data you are trying to correlate are highly non-normal.

  • Kendall’s tau: Non-parametric like Spearman’s rho but often better if you don’t have a large data set.

  • Biserial and point biserial correlations: These correlation measurements are used when one variable is dichotomous / categorical.

  • Partial correlation: An estimate of the correlation between two variables when the influence of a third variable is controlled for.
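Spearman’s rho and Kendall’s tau are available directly through cor() and cor.test() via the method argument; a minimal sketch is below (partial correlation typically requires an add-on package such as ppcor and isn’t shown here):

# rank-based alternatives to Pearson's r
set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50)

cor(x, y, method = "spearman")  # Spearman's rho
cor(x, y, method = "kendall")   # Kendall's tau

# significance tests work the same way
cor.test(x, y, method = "spearman")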

Concluding remarks

  • Statistical independence implies zero correlation, but not the other way around

  • Correlation is not causation

  • Pearson’s correlation is just one of many different types of correlation