
Introduction to correlation

  • Correlation is a measure of the linear relationship between two random variables. It tells us the strength and direction of the linear relationship.

  • Given samples \(\boldsymbol{x}\) and \(\boldsymbol{y}\) from two random variables \(X\) and \(Y\):

    • Positive correlation: Increasing values of \(\boldsymbol{x}\) are associated with increasing values of \(\boldsymbol{y}\).

    • Negative correlation: Increasing values of \(\boldsymbol{x}\) are associated with decreasing values of \(\boldsymbol{y}\).

Examples of correlation at the sample level

Correlation at the population level

  • Variance is a population parameter that describes the variability of a single random variable.

  • Covariance is a population parameter that describes how much the variability of one random variable is related to the variability of another random variable.

\[ \begin{align} Var(X) &= E\left[ (X - \mu_X)^2 \right] \\ Var(Y) &= E\left[ (Y - \mu_Y)^2 \right] \\ Cov(X, Y) &= E \left[ (X - \mu_X) (Y - \mu_Y) \right] \\ \end{align} \]

  • We will ultimately define correlation in terms of covariance.

Correlation at the population level

  • It’s nice to see the variances and covariances expressed in terms of expected values \(E[\cdot]\), but they are often abbreviated as follows:

\[ \begin{align} \sigma^2_X &= Var(X) \\ \sigma^2_Y &= Var(Y) \\ Cov_{X,Y} &= Cov(X,Y) \\ \end{align} \]

  • \(Var(\cdot)\) and \(Cov(\cdot, \cdot)\) are operators defined in terms of expected values \(E[\cdot]\) as seen on the previous slide.
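
  • As a quick sanity check, these expected values can be approximated by simulation. The sketch below is illustrative only: the distributions, means, and the 0.5 coefficient are arbitrary choices, not part of the notes.

# Monte Carlo sketch: approximate Var(X) and Cov(X, Y) by averaging over many draws
set.seed(1)
N <- 1e6                          # many draws so sample averages approximate E[.]
X <- rnorm(N, mean = 2, sd = 3)   # mu_X = 2, Var(X) = 9
Y <- 0.5 * X + rnorm(N)           # Y built from X, so Cov(X, Y) = 0.5 * Var(X) = 4.5

mean((X - mean(X))^2)                # approximates Var(X), about 9
mean((X - mean(X)) * (Y - mean(Y)))  # approximates Cov(X, Y), about 4.5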

Estimating covariance from a sample

  • Variance from a single sample: \[ \begin{align} s^2_{x} &= \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \\ &= \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x}) (x_i - \bar{x}) \end{align} \]

  • Covariance from a paired sample of two variables (pairs \((x_i, y_i)\)): \[ \begin{align} cov_{x,y} &= \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x}) (y_i - \bar{y}) \end{align} \]
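
  • A minimal sketch with made-up numbers, checking these formulas against R’s built-in var() and cov():

# hand-computed sample variance and covariance vs R's built-ins (made-up data)
x <- c(1, 3, 2, 5, 4)
y <- c(2, 6, 4, 9, 8)
n <- length(x)

sum((x - mean(x))^2) / (n - 1)                # sample variance of x
sum((x - mean(x)) * (y - mean(y))) / (n - 1)  # sample covariance of x and y

var(x)      # same as the hand-computed variance
cov(x, y)   # same as the hand-computed covariance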

Attributes of variance and covariance

  • Variance is large when the observations \(x_i\) regularly deviate far from their mean \(\bar{x}\).

  • Covariance is large in magnitude when the observations \(x_i\) deviate from \(\bar{x}\) at the same time that the corresponding \(y_i\) deviate from \(\bar{y}\).

  • Variance is always \(\geq 0\).

  • Covariance is positive if deviations from the mean in \(X\) match the direction of deviations from the mean in \(Y\).

  • Covariance is negative if deviations from the mean in \(X\) and \(Y\) go in opposite directions.

Correlation

  • Correlation is simply covariance standardised to vary from \(-1\) to \(+1\).

  • It’s represented by Pearson’s correlation coefficient, typically denoted as \(r\).

  • Named after Karl Pearson, who developed it.

\[ r = \frac{cov_{x,y}}{s_x s_y} \]

  • You can compute it in R using cor(x, y), as in the sketch below.
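
  • A minimal sketch (with simulated data; the 0.6 slope is an arbitrary choice) showing that r is just the covariance scaled by the two standard deviations, and that it matches cor(x, y):

# r as standardised covariance, compared with R's cor()
set.seed(2)
x <- rnorm(50)
y <- 0.6 * x + rnorm(50)        # build in a positive linear relationship

cov(x, y) / (sd(x) * sd(y))     # r computed from the definition
cor(x, y)                       # same value from the built-in function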

Examples of Correlation

NHST: correlation coefficient

  • How do we know whether or not a sample correlation is indicative of a true non-zero population correlation?

  • We use the 5-step NHST procedure.

1. State the null and alternative hypotheses

  • Null Hypothesis (\(H_0\)): There is no correlation between the two variables in the population, i.e., \(\rho = 0\), where \(\rho\) is the population correlation coefficient.

  • Alternative Hypothesis (\(H_1\)): There is a correlation between the two variables in the population, i.e., \(\rho \neq 0\).

\[ \begin{align} H_0: \rho &= 0 \\ H_1: \rho &\neq 0 \end{align} \]

2. Pick a type I error rate

  • \(\alpha = 0.05\)

3. Fully specify a statistic that estimates the parameter in step 1

  • \(\widehat{\rho} = r\)

  • \(\widehat{\rho} \sim \text{???}\)

The sampling distribution of \(r\) is weird and messy

  • The formula for Pearson’s \(r\) is:

\[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2} \sqrt{\sum (y_i - \bar{y})^2}} \]

  • Both the numerator and the denominator depend on the sample.

  • That makes \(r\) a nonlinear function of samples from random variables.

  • Nonlinear functions of random variables usually don’t have simple sampling distributions.

The sampling distribution of \(r\) is bounded and skewed

  • \(r\) is bounded between \(-1\) and \(1\).

  • For small sample sizes, the sampling distribution of \(r\) is often skewed.

  • There’s no simple formula for its shape.
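
  • Although there’s no simple formula, we can look at the sampling distribution of \(r\) by simulation. In the sketch below, the true correlation of 0.8 and the sample size of 10 are arbitrary choices used only to show the skew and the bounds.

# simulate the sampling distribution of r for a small sample size
set.seed(3)
rho <- 0.8    # assumed true population correlation
n   <- 10     # small sample size

r_sim <- replicate(5000, {
  x <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)   # population correlation equals rho
  cor(x, y)
})

hist(r_sim, breaks = 50, xlim = c(-1, 1),
     main = "Sampling distribution of r (n = 10)", xlab = "r")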

Transform \(r\) into a \(t\) value:

  • Fortunately, we can transform \(r\) into a \(t\) value that has a known sampling distribution. This can be derived from first principles, but it goes beyond the scope of this course.

\[ \begin{align} t &= r \sqrt{\frac{n - 2}{1 - r^2}} \\ df &= n - 2 \end{align} \]

  • \(n\) is the sample size and \(r\) is the sample correlation coefficient.

  • This gives us a test statistic that does follow a known \(t\) distribution under the null (i.e., when \(\rho = 0\)).

Correlation NHST Using R

# generate some data
x <- rnorm(100)
y <- rnorm(100)

# test for significant correlation
cor.test(x, y)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = -0.085458, df = 98, p-value = 0.9321
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2047033  0.1881048
## sample estimates:
##          cor 
## -0.008632214
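
  • As a quick check, the t value and p-value printed above can be reproduced from the reported r and the sample size using the transformation from the previous slide:

# reproduce cor.test()'s t and p-value from r and n
r <- -0.008632214
n <- 100
t_stat <- r * sqrt((n - 2) / (1 - r^2))
t_stat                            # about -0.085, matching the output above
2 * pt(-abs(t_stat), df = n - 2)  # about 0.93, matching the output above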

Interpreting \(r\) as Effect Size

  • The value of \(r\) tells us the strength and direction of a linear relationship.

  • It’s already a decent effect size measure:

    • Values closer to \(-1\) or \(1\) mean stronger relationships. Values near \(0\) mean weaker or no linear relationship.

  • In practice, we often square \(r\):

    • This gives us \(r^2\), a number between 0 and 1.

    • It reflects the proportion of shared variance between the two variables.

    • It ignores direction, which can be helpful when reporting effect size.

  • We’ll return to \(r^2\) later when we learn about regression models.

Correlation as a window into statistical independence

  • An independent-samples t-test assumes that the two samples \(\boldsymbol{x}\) and \(\boldsymbol{y}\) come from random variables \(X\) and \(Y\) that are independent.

  • Independence means that knowing the value of one variable tells you nothing about the value of the other.

  • This also means that you can write the joint probability of two independent random variables as the product of their marginal probabilities:

\[ P(X=x \cap Y=y) = P(X=x) P(Y=y) \]

Example of independence

  • Flipping two totally different coins.

  • The result of flipping coin 1 doesn’t affect the result of flipping coin 2, and vice versa.

  • This allows us to write:

\[ P(X=H \cap Y=T) = P(X=H) P(Y=T) \]
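
  • A quick sketch simulating two independent fair coins and checking that the joint probability is approximately the product of the marginals (the coin labels and number of flips are illustrative):

# two independent fair coins: joint probability vs product of marginals
set.seed(4)
N <- 1e5
coin1 <- sample(c("H", "T"), N, replace = TRUE)
coin2 <- sample(c("H", "T"), N, replace = TRUE)

mean(coin1 == "H" & coin2 == "T")         # estimates P(X = H and Y = T), about 0.25
mean(coin1 == "H") * mean(coin2 == "T")   # estimates P(X = H) * P(Y = T), about 0.25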

How can we know if two random variables are independent?

  • We can estimate whether RVs \(X\) and \(Y\) are independent by looking at their correlation.

  • Statistical independence implies zero correlation, but zero correlation does not imply independence (a counterexample is sketched below).
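
  • A classic counterexample, sketched below: \(Y\) is completely determined by \(X\), yet the correlation is near zero because the relationship is quadratic rather than linear. (The quadratic form is an illustrative choice.)

# dependent but (nearly) uncorrelated: y is a deterministic function of x
set.seed(5)
x <- rnorm(10000)
y <- x^2          # y depends entirely on x, but not linearly

cor(x, y)         # close to 0, even though x and y are clearly not independent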

Correlation and causation

  • If \(X\) and \(Y\) are causally related, then they will typically be correlated.

  • If \(X\) and \(Y\) are correlated, it does not mean that \(X\) causes \(Y\) or vice versa.

  • Example: Ice cream sales and drowning deaths are correlated. Does this mean that ice cream causes drowning? (No: both rise in hot weather, a third variable that drives each of them.)

Other types of correlation

  • Spearman’s rho: This is a non-parametric test (it doesn’t assume that your data follow any particular distribution), so it’s useful when the data you are trying to correlate are highly non-normal.

  • Kendall’s tau: Non-parametric like Spearman’s rho but often better if you don’t have a large data set.

  • Biserial and point-biserial correlations: These are used when one of the variables is dichotomous: point-biserial when the variable is truly dichotomous, biserial when a continuous variable has been split into two categories.

  • Partial correlation: An estimate of the correlation between two variables when the influence of a third variable is controlled for.
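
  • A brief sketch of how these alternatives can be requested in R. Spearman’s rho and Kendall’s tau are available through the method argument of cor() and cor.test(); the partial correlation shown here is computed from regression residuals, which is one standard approach. The simulated variables are illustrative only.

# rank-based correlations via the method argument (simulated, illustrative data)
set.seed(6)
x <- rnorm(40)
z <- rnorm(40)
y <- 0.5 * x + 0.5 * z + rnorm(40)

cor(x, y, method = "spearman")   # Spearman's rho
cor(x, y, method = "kendall")    # Kendall's tau

# partial correlation between x and y, controlling for z (residual approach)
cor(resid(lm(x ~ z)), resid(lm(y ~ z)))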