Assumptions of simple linear regression

  • The relationship between \(X\) and \(Y\) is linear.

  • The residuals are independent.

  • The residuals are normally distributed.

  • The residuals have constant variance (homoscedasticity).
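  • As an aside not in the original slides: base R bundles one diagnostic plot for each of these assumptions. A minimal sketch, assuming a fitted model like the lm(y ~ x, data = d) used throughout the slides below:

# base R's built-in lm diagnostics (assumes 'd' is the data used in later slides)
fit <- lm(y ~ x, data = d)
par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a grid
plot(fit)              # residuals vs fitted, normal Q-Q, scale-location, leverage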

Assumption: Linearity

  • The expected value of \(Y\) given \(X\) is a linear function:

    \[ \mathbb{E}[Y \mid X] = \beta_0 + \beta_1 X \]

  • If this assumption is violated, the model is misspecified and predictions or hypothesis tests may be misleading.

Non-linearity example
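  • The simulation code is not shown in the original slides. Here is a minimal sketch of the kind of data-generating process involved (the seed and coefficients are assumptions, so the exact numbers in the output below will differ):

# sketch: the true relationship is quadratic, not linear
# (seed and coefficients are assumptions; exact output values will differ)
library(data.table)
set.seed(1)
d <- data.table(x = runif(100, -3, 3))
d[, y := x^2 + rnorm(.N, sd = 2)]   # y depends on x only through x^2
summary(lm(y ~ x, data = d))        # the linear fit finds no slope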

Non-linearity example

## 
## Call:
## lm(formula = y ~ x, data = d)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.268 -4.920 -1.695  4.262 11.319 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.230099   0.557821  12.961   <2e-16 ***
## x           -0.007442   0.318853  -0.023    0.981    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.578 on 98 degrees of freedom
## Multiple R-squared:  5.559e-06,  Adjusted R-squared:  -0.0102 
## F-statistic: 0.0005448 on 1 and 98 DF,  p-value: 0.9814

Non-linearity example

  • The true relationship between \(x\) and \(y\) is clearly non-linear.

  • The slope for \(x\) is not significantly different from zero.

  • This might lead you to falsely conclude that there is no relationship between \(x\) and \(y\).

  • But we know the true relationship is quadratic, so applying a linear model here is an instance of model misspecification.

Diagnosing linearity

  • Plot residuals vs fitted values.

  • If the relationship is linear, the residuals should hover randomly around zero.

  • Curvature or systematic patterns suggest model misspecification.

Diagnosing linearity

# fit the model once, then store fitted values and residuals
fit <- lm(y ~ x, data = d)
d[, fitted := fitted(fit)]
d[, residual := resid(fit)]

ggplot(d, aes(x=fitted, y=residual)) +
  geom_point() +
  geom_hline(yintercept=0, linetype="dashed", color="red") +
  theme_minimal() +
  labs(title="Residuals vs Fitted Values")

Assumption: Independence of residuals

  • Each residual should be independent of the others.

  • Violations often lead to underestimated standard errors and consequently inflated type I error rates.

  • This assumption is almost always violated in time series or clustered data (e.g., cognitive science experiments that measure learning across time).

Diagnosing independence

  • Look for patterns in residuals plotted against observation order.

Diagnosing independence

# plot residuals against observation order
d[, order := 1:.N]
d[, residual := resid(lm(y ~ x, data=d))]
ggplot(d, aes(x=order, y=residual)) +
  geom_point() +
  geom_hline(yintercept=0, linetype="dashed", color="red") +
  theme_minimal() +
  labs(title="Residuals vs Observation Order")

Diagnosing independence
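
  • As an addition not covered in these slides, the Durbin-Watson test from the lmtest package gives a formal check for first-order autocorrelation in the residuals:

# formal check for first-order autocorrelation (requires the lmtest package)
library(lmtest)
dwtest(y ~ x, data = d)   # a DW statistic near 2 suggests independent residuals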

Assumption: Normality of residuals

  • Residuals should be normally distributed.

Diagnosing normality

  • Use a histogram or density plot of residuals.

  • A Q-Q plot is more sensitive to deviations from normality. We have not covered these yet, but a sketch follows the histogram code below.

  • Small deviations are okay if \(n\) is large because of the Central Limit Theorem.

Diagnosing normality

# plot histogram of residuals
d[, residual := resid(lm(y ~ x, data=d))]
ggplot(d, aes(x=residual)) +
  geom_histogram(bins=15, color="white", fill="steelblue") +
  theme_minimal() +
  labs(title="Histogram of Residuals")

Diagnosing normality
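
  • A minimal Q-Q plot sketch, added here since the original slides do not show one; it plots the sample quantiles of the residuals against theoretical normal quantiles:

# Q-Q plot of residuals (an addition; not shown in the original slides)
d[, residual := resid(lm(y ~ x, data = d))]
ggplot(d, aes(sample = residual)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  theme_minimal() +
  labs(title = "Q-Q Plot of Residuals")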

Consequences of non-normality

  • We use a \(t\)-test to ask whether a regression coefficient (e.g., \(\beta_1\)) is significantly different from zero:

\[ t = \frac{\widehat{\beta}_1 - 0}{\widehat{\sigma}_{\widehat{\beta}_1}} \]

  • This test assumes that the sampling distribution of \(\widehat{\beta}_1\) is approximately normal under the null hypothesis.

  • If the residuals are normally distributed, then \(\widehat{\beta}_1\) is normally distributed too.

  • If the residuals are not normal, the \(t\)-test may give misleading \(p\)-values and confidence intervals—especially in small samples.
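
  • As a concrete check (an addition to these slides): the \(t\) value reported by summary() is just the estimate divided by its standard error. From the output above, \(-0.007442 / 0.318853 \approx -0.023\), the reported \(t\) value for \(x\). The same check in code:

# the reported t value is just estimate / standard error
fit <- lm(y ~ x, data = d)
ct  <- summary(fit)$coefficients
ct["x", "Estimate"] / ct["x", "Std. Error"]   # matches ct["x", "t value"]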

Non-normal residuals can inflate false positives

  • Let’s simulate data where there is no real relationship between \(x\) and \(y\).

  • We’ll draw residuals from a highly skewed distribution (exponential) instead of normal.

  • Even though \(\beta_1 = 0\) in truth, we’ll see that \(t\)-tests often return significant results.

Non-normal residuals can inflate false positives
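
  • The simulation code is not shown in the original slides. A minimal sketch under assumed parameters (sample size, replication count, and seed are all assumptions):

# sketch: beta_1 = 0 in truth, but residuals are drawn from a skewed
# exponential distribution (n, replications, and seed are assumptions)
set.seed(1)
n <- 10
pvals <- replicate(5000, {
  x <- rnorm(n)
  y <- rexp(n)   # skewed residuals; no true relationship with x
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})
mean(pvals < 0.05)   # proportion of false positives at alpha = 0.05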

Non-normal residuals can inflate false positives

  • The type I error rate can be higher than 5%.

Regression is robust to non-normality

  • Regression is robust to non-normality if your sample size is large and residuals are not wildly skewed or heavy-tailed.

Assumption: Homoscedasticity

  • Residuals should have constant variance across all values of \(X\).

  • If not, standard errors may be biased, leading to invalid inference.

Diagnosing homoscedasticity

  • Again, use the residuals vs fitted values plot.

  • Look for a funnel shape — increasing or decreasing spread suggests heteroscedasticity.

Diagnosing homoscedasticity
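
  • As a supplement not in the original slides, a scale-location style plot (square root of the absolute residuals against fitted values) can make changes in spread easier to see:

# scale-location style plot (an addition; not shown in the original slides)
fit <- lm(y ~ x, data = d)
d[, `:=`(fitted = fitted(fit), resid_sqrt = sqrt(abs(resid(fit))))]
ggplot(d, aes(x = fitted, y = resid_sqrt)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Scale-Location Plot", y = "sqrt(|residual|)")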

Consequences of heteroscedasticity

  • Let’s simulate data in which the predictor \(x\) is unrelated to \(y\), but the residual variance increases with \(x\). This creates heteroscedasticity.

Simulation: Heteroscedasticity and false positives

  • Even though \(x\) has no effect, the heteroscedasticity can lead to inflated type I error rates.
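
  • A minimal sketch of such a simulation (sample size, replication count, and seed are assumptions):

# sketch: beta_1 = 0 in truth, but the residual spread grows with x
# (n, replications, and seed are assumptions)
set.seed(1)
n <- 50
pvals <- replicate(5000, {
  x <- runif(n, 0, 3)
  y <- rnorm(n, mean = 0, sd = 0.5 + x)   # variance increases with x
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})
mean(pvals < 0.05)   # can exceed 0.05 under heteroscedasticity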

Summary: Assumptions of Linear Regression

| Assumption | Consequence of Violation | Robust? | Solution Exists? |
|---|---|---|---|
| Linearity | Model is misspecified; predictions and inference may be misleading | No | Yes |
| Independence | Inflated type I error rate | No | Yes |
| Homoscedasticity | Inflated type I error rate | Somewhat | Yes |
| Normality of residuals | Inflated type I error rate | Yes, if \(n\) is large | Yes |
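
  • The “Solution Exists?” column is not elaborated in these slides. As one example (an addition, not a method from the slides), heteroscedasticity-consistent standard errors from the sandwich package can restore valid inference when the constant-variance assumption fails:

# robust (heteroscedasticity-consistent) standard errors
# (requires the sandwich and lmtest packages)
library(sandwich)
library(lmtest)
fit <- lm(y ~ x, data = d)
coeftest(fit, vcov. = vcovHC(fit, type = "HC3"))   # HC3 robust standard errors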