Assumptions of simple linear regression

  • The relationship between \(X\) and \(Y\) is linear.

  • The residuals are independent.

  • The residuals are normally distributed.

  • The residuals have constant variance (homoscedasticity).
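  • As an aside not in the original slides: base R bundles one diagnostic plot for each of these assumptions. A minimal sketch, assuming a fitted model like the lm(y ~ x, data = d) used throughout the slides below:

# base R's built-in lm diagnostics (assumes 'd' is the data used in later slides)
fit <- lm(y ~ x, data = d)
par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a grid
plot(fit)              # residuals vs fitted, normal Q-Q, scale-location, leverage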

Assumption: Linearity

  • The expected value of \(Y\) given \(X\) is a linear function:

    \[ \mathbb{E}[Y \mid X] = \beta_0 + \beta_1 X \]

  • If this assumption is violated, the model is misspecified and predictions or hypothesis tests may be misleading.

Non-linearity example
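  • The simulation code is not shown in the original slides. Here is a minimal sketch of the kind of data-generating process involved (the seed and coefficients are assumptions, so the exact numbers in the output below will differ):

# sketch: the true relationship is quadratic, not linear
# (seed and coefficients are assumptions; exact output values will differ)
library(data.table)
set.seed(1)
d <- data.table(x = runif(100, -3, 3))
d[, y := x^2 + rnorm(.N, sd = 2)]   # y depends on x only through x^2
summary(lm(y ~ x, data = d))        # the linear fit finds no slope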

Non-linearity example

## 
## Call:
## lm(formula = y ~ x, data = d)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.268 -4.920 -1.695  4.262 11.319 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.230099   0.557821  12.961   <2e-16 ***
## x           -0.007442   0.318853  -0.023    0.981    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.578 on 98 degrees of freedom
## Multiple R-squared:  5.559e-06,  Adjusted R-squared:  -0.0102 
## F-statistic: 0.0005448 on 1 and 98 DF,  p-value: 0.9814

Non-linearity example

  • The true relationship between \(x\) and \(y\) is clearly non-linear.

  • The slope for \(x\) is not significantly different from zero.

  • This might lead you to falsely conclude that there is no relationship between \(x\) and \(y\).

  • But we know the true relationship is quadratic, so applying a linear model here is an instance of model misspecification.

Diagnosing linearity

  • Plot residuals vs fitted values.

  • If the relationship is linear, the residuals should hover randomly around zero.

  • Curvature or systematic patterns suggest model misspecification.

Diagnosing linearity

# fit the model once, then store fitted values and residuals
fit <- lm(y ~ x, data = d)
d[, fitted := fitted(fit)]
d[, residual := resid(fit)]

ggplot(d, aes(x=fitted, y=residual)) +
  geom_point() +
  geom_hline(yintercept=0, linetype="dashed", color="red") +
  theme_minimal() +
  labs(title="Residuals vs Fitted Values")

Assumption: Independence of residuals

  • Each residual should be independent of the others.

  • Violations often lead to underestimated standard errors and consequently inflated type I error rates.

  • This assumption is almost always violated in time series or clustered data (e.g., cognitive science experiments that measure learning across time).

Diagnosing independence

  • Look for patterns in residuals plotted against observation order.

Diagnosing independence

# plot residuals against observation order
d[, order := 1:.N]
d[, residual := resid(lm(y ~ x, data=d))]
ggplot(d, aes(x=order, y=residual)) +
  geom_point() +
  geom_hline(yintercept=0, linetype="dashed", color="red") +
  theme_minimal() +
  labs(title="Residuals vs Observation Order")

Diagnosing independence
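
  • As an addition not covered in these slides, the Durbin-Watson test from the lmtest package gives a formal check for first-order autocorrelation in the residuals:

# formal check for first-order autocorrelation (requires the lmtest package)
library(lmtest)
dwtest(y ~ x, data = d)   # a DW statistic near 2 suggests independent residuals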

Assumption: Normality of residuals

  • Residuals should be normally distributed.

Diagnosing normality

  • Use a histogram or density plot of residuals.

  • A Q-Q plot is more sensitive to deviations from normality. We have not covered these yet, but a sketch follows the histogram code below.

  • Small deviations are okay if \(n\) is large because of the Central Limit Theorem.

Diagnosing normality

# plot histogram of residuals
d[, residual := resid(lm(y ~ x, data=d))]
ggplot(d, aes(x=residual)) +
  geom_histogram(bins=15, color="white", fill="steelblue") +
  theme_minimal() +
  labs(title="Histogram of Residuals")

Diagnosing normality
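
  • A minimal Q-Q plot sketch, added here since the original slides do not show one; it plots the sample quantiles of the residuals against theoretical normal quantiles:

# Q-Q plot of residuals (an addition; not shown in the original slides)
d[, residual := resid(lm(y ~ x, data = d))]
ggplot(d, aes(sample = residual)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  theme_minimal() +
  labs(title = "Q-Q Plot of Residuals")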

Consequences of non-normality

  • We use a \(t\)-test to ask whether a regression coefficient (e.g., \(\beta_1\)) is significantly different from zero:

\[ t = \frac{\widehat{\beta}_1 - 0}{\widehat{\sigma}_{\widehat{\beta}_1}} \]

  • This test assumes that the sampling distribution of \(\widehat{\beta}_1\) is approximately normal under the null hypothesis.

  • If the residuals are normally distributed, then \(\widehat{\beta}_1\) is normally distributed too.

  • If the residuals are not normal, the \(t\)-test may give misleading \(p\)-values and confidence intervals—especially in small samples.
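
  • As a concrete check (an addition to these slides): the \(t\) value reported by summary() is just the estimate divided by its standard error. From the output above, \(-0.007442 / 0.318853 \approx -0.023\), the reported \(t\) value for \(x\). The same check in code:

# the reported t value is just estimate / standard error
fit <- lm(y ~ x, data = d)
ct  <- summary(fit)$coefficients
ct["x", "Estimate"] / ct["x", "Std. Error"]   # matches ct["x", "t value"]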

Non-normal residuals can inflate false positives

  • Let’s simulate data where there is no real relationship between \(x\) and \(y\).

  • We’ll draw residuals from a highly skewed distribution (exponential) instead of normal.

  • Even though \(\beta_1 = 0\) in truth, we’ll see that \(t\)-tests often return significant results.

Non-normal residuals can inflate false positives
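
  • The simulation code is not shown in the original slides. A minimal sketch under assumed parameters (sample size, replication count, and seed are all assumptions):

# sketch: beta_1 = 0 in truth, but residuals are drawn from a skewed
# exponential distribution (n, replications, and seed are assumptions)
set.seed(1)
n <- 10
pvals <- replicate(5000, {
  x <- rnorm(n)
  y <- rexp(n)   # skewed residuals; no true relationship with x
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})
mean(pvals < 0.05)   # proportion of false positives at alpha = 0.05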

Non-normal residuals can inflate false positives

  • The type I error rate can be higher than 5%.

Regression is robust to non-normality

  • Regression is robust to non-normality if your sample size is large and residuals are not wildly skewed or heavy-tailed.

Assumption: Homoscedasticity

  • Residuals should have constant variance across all values of \(X\).

  • If not, standard errors may be biased, leading to invalid inference.

Diagnosing homoscedasticity

  • Again, use the residuals vs fitted values plot.

  • Look for a funnel shape — increasing or decreasing spread suggests heteroscedasticity.

Diagnosing homoscedasticity
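
  • As a supplement not in the original slides, a scale-location style plot (square root of the absolute residuals against fitted values) can make changes in spread easier to see:

# scale-location style plot (an addition; not shown in the original slides)
fit <- lm(y ~ x, data = d)
d[, `:=`(fitted = fitted(fit), resid_sqrt = sqrt(abs(resid(fit))))]
ggplot(d, aes(x = fitted, y = resid_sqrt)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Scale-Location Plot", y = "sqrt(|residual|)")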

Consequences of heteroscedasticity

  • Let’s simulate data in which the predictor \(x\) is unrelated to \(y\), but the residual variance increases with \(x\). This creates heteroscedasticity.

Simulation: Heteroscedasticity and false positives

  • Even though \(x\) has no effect, the heteroscedasticity can lead to inflated type I error rates.
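
  • A minimal sketch of such a simulation (sample size, replication count, and seed are assumptions):

# sketch: beta_1 = 0 in truth, but the residual spread grows with x
# (n, replications, and seed are assumptions)
set.seed(1)
n <- 50
pvals <- replicate(5000, {
  x <- runif(n, 0, 3)
  y <- rnorm(n, mean = 0, sd = 0.5 + x)   # variance increases with x
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})
mean(pvals < 0.05)   # can exceed 0.05 under heteroscedasticity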

Summary: Assumptions of Linear Regression

| Assumption | Consequence of Violation | Robust? | Solution Exists? |
|---|---|---|---|
| Linearity | Model is misspecified; predictions and inference may be misleading | No | Yes |
| Independence | Inflated type I error rate | No | Yes |
| Homoscedasticity | Inflated type I error rate | Somewhat | Yes |
| Normality of residuals | Inflated type I error rate | Yes, if \(n\) is large | Yes |
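
  • The “Solution Exists?” column is not elaborated in these slides. As one example (an addition, not a method from the slides), heteroscedasticity-consistent standard errors from the sandwich package can restore valid inference when the constant-variance assumption fails:

# robust (heteroscedasticity-consistent) standard errors
# (requires the sandwich and lmtest packages)
library(sandwich)
library(lmtest)
fit <- lm(y ~ x, data = d)
coeftest(fit, vcov. = vcovHC(fit, type = "HC3"))   # HC3 robust standard errors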