
Introduction

  • Correlation: Measures the strength and direction of the relationship between two variables.

  • Partial Correlation: Measures the strength and direction of the relationship between two variables while controlling for the effect of one or more additional variables.

Use-case examples

  • Medicine: Understanding the relationship between a treatment and outcome, controlling for age or other risk factors.

  • Economics: Analyzing the impact of one economic variable on another while controlling for other influencing factors.

  • Psychology: Studying the relationship between two psychological traits while controlling for external variables.

Mathematical Definition

  • The partial correlation between \(X\) and \(Y\) given a controlling variable \(Z\) is denoted \(r_{XY \cdot Z}\) and can be computed from the pairwise correlations as:

\[ r_{XY \cdot Z} = \frac{r_{XY} - r_{XZ}r_{YZ}}{\sqrt{(1-r_{XZ}^2)(1-r_{YZ}^2)}} \]
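The formula can be checked numerically against ppcor::pcor.test; a minimal sketch with illustrative simulated data (the seed and coefficients are arbitrary, not from the original):

library(ppcor)

set.seed(1)
z <- rnorm(500)
x <- 0.5*z + rnorm(500)
y <- 0.5*z + 0.3*x + rnorm(500)

r_xy <- cor(x, y); r_xz <- cor(x, z); r_yz <- cor(y, z)

# Partial correlation assembled from the pairwise correlations (formula above)
(r_xy - r_xz*r_yz) / sqrt((1 - r_xz^2)*(1 - r_yz^2))

# The same quantity computed by ppcor
pcor.test(x, y, z)$estimate

Both lines should print the same value.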

Relationship to correlation

The partial correlation can be smaller, equal to, or even larger than the simple correlation, depending on the relationships between the variables involved.

  1. Third Variable Influence:

    • If the third variable (or controlling variable) has a strong influence on both of the variables being correlated, the partial correlation can be significantly different from the simple correlation.

    • If the third variable is not strongly related to the two variables being correlated, the partial correlation may be similar to the simple correlation.

  2. Sign and Magnitude of Relationships:

    • Depending on the sign and magnitude of the relationships between the controlling variable and the other two variables, the partial correlation can be either higher or lower than the simple correlation.

    • In some cases, controlling for a third variable might reveal a stronger direct relationship between the two variables, leading to a higher partial correlation.

Partial Correlation Smaller than Simple Correlation

  • If \(X\) and \(Y\) have a strong positive correlation and \(Z\) is positively correlated with both \(X\) and \(Y\), the partial correlation between \(X\) and \(Y\) controlling for \(Z\) will typically be smaller than the simple correlation between \(X\) and \(Y\). This is because \(Z\) explains some of the variation in both \(X\) and \(Y\).

library(data.table)
library(ggplot2)
library(ggpubr)
library(ppcor)

# Simulate data in which z is positively related to both x and y,
# so z accounts for part of their shared variation
x <- rnorm(100)
y <- 0.5*x + rnorm(100)
z <- 0.5*x + 0.5*y + rnorm(100)

d <- data.table(x, y, z)

# Pairwise scatter plots with fitted regression lines
g_xy <- ggplot(d, aes(x, y)) + geom_point() + geom_smooth(method="lm")
g_xz <- ggplot(d, aes(x, z)) + geom_point() + geom_smooth(method="lm")
g_yz <- ggplot(d, aes(y, z)) + geom_point() + geom_smooth(method="lm")
ggarrange(g_xy, g_xz, g_yz, ncol=3)

# Partial correlations for every pair, controlling for the remaining variable
pcor_result <- pcor(d)
pcor_result

## $estimate
##           x         y         z
## x 1.0000000 0.1959269 0.4424041
## y 0.1959269 1.0000000 0.4674071
## z 0.4424041 0.4674071 1.0000000
## 
## $p.value
##              x            y            z
## x 0.000000e+00 5.194723e-02 4.543067e-06
## y 5.194723e-02 0.000000e+00 1.074282e-06
## z 4.543067e-06 1.074282e-06 0.000000e+00
## 
## $statistic
##          x        y        z
## x 0.000000 1.967795 4.858494
## y 1.967795 0.000000 5.207247
## z 4.858494 5.207247 0.000000
## 
## $n
## [1] 100
## 
## $gp
## [1] 1
## 
## $method
## [1] "pearson"

Partial Correlation Equal to Simple Correlation

  • If \(Z\) is uncorrelated with both \(X\) and \(Y\), then controlling for \(Z\) will not affect the correlation between \(X\) and \(Y\). In this case, the partial correlation will be equal to the simple correlation.
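A minimal sketch of this case (seed and coefficients are illustrative): \(Z\) is generated independently of \(X\) and \(Y\), so the partial and simple correlations agree up to sampling noise.

set.seed(2)
x <- rnorm(1000)
y <- 0.5*x + rnorm(1000)
z <- rnorm(1000)                    # independent of both x and y

cor(x, y)                           # simple correlation
pcor.test(x, y, z)$estimate         # essentially the same value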

Partial Correlation Larger than Simple Correlation

  • If \(X\) and \(Y\) have a positive correlation, but \(Z\) is related to \(X\) and \(Y\) in opposite directions (or is correlated with only one of them), controlling for \(Z\) can actually increase the correlation between \(X\) and \(Y\). Here \(Z\) acts as a suppressor variable: removing its influence reveals a stronger direct relationship between \(X\) and \(Y\), as in the sketch below.
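A minimal sketch of suppression (the seed and coefficients are illustrative, not from the original analysis): \(Z\) is positively related to \(X\) but pulls \(Y\) in the opposite direction, so the simple correlation understates the direct relationship.

set.seed(3)
z <- rnorm(1000)
x <- 0.7*z + rnorm(1000)            # z is positively related to x
y <- 0.5*x - 0.7*z + rnorm(1000)    # z pulls y in the opposite direction

cor(x, y)                           # simple correlation, attenuated by z
pcor.test(x, y, z)$estimate         # partial correlation, noticeably larger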

Partial correlation from a regression perspective

  • Consider the multiple regression of \(Y\) on \(X\) and \(Z\):

\[ Y = \beta_{0} + \beta_{1}X + \beta_{2}Z + \epsilon \]

  • The partial correlation between \(X\) and \(Y\) controlling for \(Z\) can be calculated by fitting two simple linear regressions, one of \(X\) on \(Z\) and one of \(Y\) on \(Z\):

\[ X = \beta_{0} + \beta_{1}Z + \epsilon_{X} \]

\[ Y = \beta_{0} + \beta_{1}Z + \epsilon_{Y} \]

  • The partial correlation between \(X\) and \(Y\) controlling for \(Z\) is the correlation between the residuals of these two regressions.

fm_xz <- lm(x ~ z, data=d)    # regress x on z
fm_yz <- lm(y ~ z, data=d)    # regress y on z

resid_xz <- resid(fm_xz)      # part of x not explained by z
resid_yz <- resid(fm_yz)      # part of y not explained by z

# The correlation of the residuals is the partial correlation
pcor_xy_z <- cor(resid_xz, resid_yz)

pcor_result <- pcor(d)
pcor_result$estimate          # matrix of partial correlations
pcor_xy_z                     # matches the (x, y) entry above

##           x         y         z
## x 1.0000000 0.1959269 0.4424041
## y 0.1959269 1.0000000 0.4674071
## z 0.4424041 0.4674071 1.0000000
## [1] 0.1959269

Partial Correlation versus multiple regression betas

  • The standardized regression coefficient for \(X\) in the multiple regression of \(Y\) on \(X\) and \(Z\) shares its numerator with the partial correlation, so the two are always zero together and always carry the same sign; they differ only in how that numerator is scaled:

\[ \beta_{x} = \frac{r_{xy} - r_{xz}r_{yz}}{1 - r_{xz}^2} \]

\[ r_{xy \cdot z} = \frac{r_{xy} - r_{xz}r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}} \]
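A quick numerical comparison, reusing the simulated data d from the earlier code; wrapping the variables in scale() standardizes them so the fitted coefficient is the standardized beta:

r_xy <- cor(d$x, d$y); r_xz <- cor(d$x, d$z); r_yz <- cor(d$y, d$z)

# Standardized regression coefficient for x in lm(y ~ x + z)
(r_xy - r_xz*r_yz) / (1 - r_xz^2)
coef(lm(scale(y) ~ scale(x) + scale(z), data=d))["scale(x)"]

# Partial correlation: same numerator, different denominator
(r_xy - r_xz*r_yz) / sqrt((1 - r_xz^2)*(1 - r_yz^2))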

Simple linear regression beta and Pearson correlation

  • In a simple linear regression, the slope is the Pearson correlation rescaled by the ratio of the standard deviations:

\[ Y = \beta_{0} + \beta_{1}X + \epsilon \]

\[ \beta_{1} = r_{xy} \frac{s_{y}}{s_{x}} \]
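A quick check of this identity on the simulated data d from the earlier code:

# Slope from the simple regression of y on x
coef(lm(y ~ x, data=d))["x"]

# Pearson correlation rescaled by the ratio of standard deviations
cor(d$x, d$y) * sd(d$y) / sd(d$x)

Both expressions produce the identical value.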