The relationship between \(X\) and \(Y\) is linear.
The residuals are independent.
The residuals are normally distributed.
The residuals have constant variance (homoscedasticity).
The expected value of \(Y\) given \(X\) is a linear function:
\[ \mathbb{E}[Y \mid X] = \beta_0 + \beta_1 X \]
If this assumption is violated, the model is misspecified and predictions or hypothesis tests may be misleading.
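The output below comes from fitting a straight line to simulated data whose true relationship is quadratic. The original simulation chunk isn't shown here, but it was presumably something along these lines (the coefficients, noise level, and seed are my assumptions, not the original values):

```r
library(data.table)
library(ggplot2)

# hypothetical data-generating code: the truth is quadratic in x,
# but we fit a straight line anyway
set.seed(1)
d <- data.table(x = rnorm(100))
d[, y := 5 + 2 * x^2 + rnorm(.N, sd = 2)]  # no linear term: the slope on x is 0
summary(lm(y ~ x, data = d))
```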
```
## 
## Call:
## lm(formula = y ~ x, data = d)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.268 -4.920 -1.695  4.262 11.319 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.230099   0.557821  12.961   <2e-16 ***
## x           -0.007442   0.318853  -0.023    0.981    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.578 on 98 degrees of freedom
## Multiple R-squared:  5.559e-06,  Adjusted R-squared:  -0.0102
## F-statistic: 0.0005448 on 1 and 98 DF,  p-value: 0.9814
```
The true relationship between \(x\) and \(y\) is clearly non-linear.
The slope for \(x\) is not significantly different from zero.
This might lead you to falsely conclude that there is no relationship between \(x\) and \(y\).
But we know the true relationship is quadratic, so fitting a linear model here is an instance of model misspecification.
Plot residuals vs fitted values.
If the relationship is linear, the residuals should hover randomly around zero.
Curvature or systematic patterns suggest model misspecification.
```r
# plot residuals vs fitted values
fit <- lm(y ~ x, data = d)  # fit once and reuse
d[, fitted := predict(fit)]
d[, residual := resid(fit)]
ggplot(d, aes(x = fitted, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  theme_minimal() +
  labs(title = "Residuals vs Fitted Values")
```
Each residual should be independent of the others.
Violations often lead to underestimated standard errors and consequently inflated type I error rates.
This assumption is almost always violated in time series or clustered data (e.g., cognitive science experiments that measure learning across time).
```r
# plots residuals against observation order
d[, order := 1:.N]
d[, residual := resid(lm(y ~ x, data = d))]
ggplot(d, aes(x = order, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  theme_minimal() +
  labs(title = "Residuals vs Observation Order")
```
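To see what a violation looks like, here is a small sketch (not part of the original notes) that simulates autocorrelated AR(1) errors; with dependence, the residuals drift in slow waves instead of scattering randomly around zero:

```r
# simulate residuals that violate independence: AR(1) errors
set.seed(2)
d_ar <- data.table(
  order = 1:100,
  residual = as.numeric(arima.sim(model = list(ar = 0.8), n = 100))
)
ggplot(d_ar, aes(x = order, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  theme_minimal() +
  labs(title = "Autocorrelated Residuals Drift Instead of Scattering")
```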
Use a histogram or density plot of residuals.
A Q-Q plot is more sensitive to deviations from normality than a histogram; a minimal sketch is shown after the histogram code below.
Small deviations are okay if \(n\) is large because of the Central Limit Theorem.
```r
# plot histogram of residuals
d[, residual := resid(lm(y ~ x, data = d))]
ggplot(d, aes(x = residual)) +
  geom_histogram(bins = 15, color = "white", fill = "steelblue") +
  theme_minimal() +
  labs(title = "Histogram of Residuals")
```
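And the Q-Q plot sketch, using ggplot2's `stat_qq()` on the same `residual` column; if the residuals are normal, the points should fall along the reference line:

```r
# Q-Q plot of residuals against a normal reference line
ggplot(d, aes(sample = residual)) +
  stat_qq() +
  stat_qq_line(color = "red", linetype = "dashed") +
  theme_minimal() +
  labs(title = "Q-Q Plot of Residuals")
```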
\[ t = \frac{\widehat{\beta}_1 - 0}{\widehat{\sigma}_{\widehat{\beta}_1}} \]
This test assumes that the sampling distribution of \(\widehat{\beta}_1\) is approximately normal under the null hypothesis.
If the residuals are normally distributed, then \(\widehat{\beta}_1\) is normally distributed too.
If the residuals are not normal, the \(t\)-test may give misleading \(p\)-values and confidence intervals—especially in small samples.
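To make this concrete, the \(t\) value reported by `summary()` can be reproduced by hand from the coefficient table; a minimal sketch, assuming the fitted model `lm(y ~ x, data = d)` from above:

```r
# reproduce the slope's t statistic by hand
fit <- lm(y ~ x, data = d)
beta_hat <- coef(fit)["x"]                               # estimated slope
se_hat <- summary(fit)$coefficients["x", "Std. Error"]   # its standard error
beta_hat / se_hat  # matches the "t value" column in summary(fit)
```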
Let’s simulate data where there is no real relationship between \(x\) and \(y\).
We’ll draw residuals from a highly skewed distribution (exponential) instead of normal.
Even though \(\beta_1 = 0\) in truth, we'll see that \(t\)-tests return significant results more often than the nominal \(\alpha = 0.05\) would allow.
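A minimal sketch of such a simulation (the sample size, number of replications, and exponential rate are my choices, not fixed by the text):

```r
# simulate: true slope is 0, residuals drawn from a skewed (exponential) distribution
set.seed(3)
n_sims <- 5000
n <- 10  # small sample, where the CLT offers little protection
p_values <- replicate(n_sims, {
  x <- rnorm(n)
  y <- 1 + 0 * x + rexp(n)  # beta_1 = 0; errors are right-skewed
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})
mean(p_values < 0.05)  # false-positive rate; should be 0.05 if the test is valid
```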
Residuals should have constant variance across all values of \(X\).
If not, standard errors may be biased, leading to invalid inference.
Again, use the residuals vs fitted values plot.
Look for a funnel shape — increasing or decreasing spread suggests heteroscedasticity.
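To illustrate, here is a sketch (not from the original notes) that simulates data where the error spread grows with \(x\); every parameter is an arbitrary choice for demonstration:

```r
# simulate heteroscedastic data: error spread grows with x
set.seed(4)
d_het <- data.table(x = runif(200, 0, 10))
d_het[, y := 2 + 0.5 * x + rnorm(.N, sd = 0.3 * x)]  # sd increases with x
fit_het <- lm(y ~ x, data = d_het)
d_het[, `:=`(fitted = predict(fit_het), residual = resid(fit_het))]
ggplot(d_het, aes(x = fitted, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  theme_minimal() +
  labs(title = "Heteroscedastic Residuals: A Funnel Shape")
```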
| Assumption | Consequence of Violation | Robust to Violation? | Solution Exists? |
|---|---|---|---|
| Linearity | Model is misspecified; predictions and inference may be misleading | No | Yes |
| Independence | Inflated type I error rate | No | Yes |
| Homoscedasticity | Inflated type I error rate | Somewhat | Yes |
| Normality of residuals | Inflated type I error rate | Yes, if \(n\) is large | Yes |