Simple Linear Regression

  • We have seen that correlation measures the strength and direction of a linear relationship between two variables.

  • Simple linear regression goes one step further: It models the relationship between two variables so we can predict the outcome variable \(Y\) from the predictor variable \(X\).

  • It quantifies how much \(Y\) changes when \(X\) changes by a given amount.

Visualising correlation and regression

Regression model

  • For the \(i^\text{th}\) observation, we can write:

\[ y_{i} = \beta_{0} + \beta_{1} x_{i} + \epsilon_{i} \]

  • \(y_{i}\) is the \(i^\text{th}\) observed outcome.

  • \(x_{i}\) is the \(i^\text{th}\) value of the predictor variable.

  • \(\epsilon_{i}\) is called the residual: the difference between the observed outcome and the predicted outcome. The residuals are assumed to be normally distributed:

\[ \epsilon_{i} \sim \mathscr{N}(0, \sigma_{\epsilon}) \]

  • \(\beta_{0}\) and \(\beta_{1}\) are population parameters of the linear regression model.

Regression model

  • Now let's extend the model to many observations:

\[ \begin{align} y_{1} &= \beta_{0} + \beta_{1} x_{1} + \epsilon_{1} \\ y_{2} &= \beta_{0} + \beta_{1} x_{2} + \epsilon_{2} \\ &\vdots \\ y_{n} &= \beta_{0} + \beta_{1} x_{n} + \epsilon_{n} \\ \end{align} \]

  • Notice that the \(x_i\) and \(y_i\) values change depending on the observation but \(\beta_0\) and \(\beta_1\) are the same for all observations.

Regression model

  • We can gather the \(n\) independent observations into vectors:

\[ \begin{align} \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{bmatrix} &= \beta_{0} \begin{bmatrix} 1\\ 1\\ \vdots\\ 1 \end{bmatrix} + \beta_{1} \begin{bmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{bmatrix} + \begin{bmatrix} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n \end{bmatrix} \end{align} \]

Regression model

  • We can next gather the vectors up into a matrix:

\[ \begin{align} \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{bmatrix} &= \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \end{bmatrix} + \begin{bmatrix} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n \end{bmatrix} \end{align} \]

Regression model

  • We can finally write the model in compact matrix form:

\[ \begin{align} \boldsymbol{y} &= \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \end{align} \]

  • \(\boldsymbol{y}\) is a vector of observed outcomes.

  • \(\boldsymbol{X}\) is a matrix of predictor variables and is called the design matrix.

  • \(\boldsymbol{\beta}\) is a vector of \(\beta\) parameters.

  • \(\boldsymbol{\epsilon}\) is a vector of residuals.
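
  • In R, this design matrix can be built with the model.matrix function. A minimal sketch (the x values here are made up):

# model.matrix() builds the design matrix: a column of 1s for the
# intercept plus a column holding the predictor values.
x <- c(-0.34, 1.50, 0.53)
model.matrix(~ x)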

How can we pick \(\boldsymbol{\beta}\) values that best fit our data?

  • Let \(y_{i}\) denote the observed values.

  • Let \(\widehat{y_{i}}\) denote the predicted values:

\[ \widehat{y_{i}} = \beta_{0} + \beta_{1} x_{i} \]

How can we pick \(\boldsymbol{\beta}\) values that best fit our data?

  • The best fitting \(\boldsymbol{\beta}\) values are those that minimise the discrepancy between \(y_{i}\) and \(\widehat{y_{i}}\).

\[ \DeclareMathOperator*{\argmin}{\arg\!\min} \argmin_{\boldsymbol{\beta}} \sum_{i=1}^{n} (y_{i} - \widehat{y_{i}})^2 \]

\[ \DeclareMathOperator*{\argmin}{\arg\!\min} \argmin_{\boldsymbol{\beta}} \sum_{i=1}^{n} (y_{i} - (\beta_{0} + \beta_{1} x_{i}))^2 \]
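
  • As a concrete illustration, the sketch below minimises this sum of squared errors numerically with optim; the data-generating values (\(\beta_0 = 2\), \(\beta_1 = 0.5\), \(\sigma_{\epsilon} = 0.5\)) are assumptions chosen for the example.

# Simulate data from an assumed model (hypothetical parameter values)
set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100, sd = 0.5)

# Sum of squared errors for candidate (beta_0, beta_1) values
sse <- function(beta) sum((y - (beta[1] + beta[2] * x))^2)

# Numerically search for the beta values that minimise the SSE
optim(par = c(0, 0), fn = sse)$par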

How can we pick \(\boldsymbol{\beta}\) values that best fit our data?

  • The \(\boldsymbol{\beta}\) values that minimise error can be solved for analytically.

  • The method is to take the derivative with respect to \(\boldsymbol{\beta}\), and then find the \(\boldsymbol{\beta}\) values that make the resulting expression equal to zero.

  • This process yields the following solution for the \(\boldsymbol{\beta}\) values: \[ \widehat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \]

  • This method of finding \(\boldsymbol{\beta}\) values is called ordinary least squares.
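
  • The closed-form solution gives the same answer directly. A sketch, regenerating the same simulated data as above:

set.seed(1)                                   # same assumed data as before
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100, sd = 0.5)

X <- cbind(1, x)                              # design matrix
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y  # (X'X)^{-1} X'y
beta_hat                                      # close to the assumed (2, 0.5)
coef(lm(y ~ x))                               # lm() agrees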

Example regression \(\widehat{\beta}\) estimates

  • \(\beta_0\) is the \(y\)-intercept of the regression line.

  • \(\beta_1\) is the slope of the regression line.

Testing \(\beta\) values for significance

  • We can use the best fitting \(\widehat{\boldsymbol{\beta}}\) values to answer questions about the population \(\boldsymbol{\beta}\) parameters (e.g., is either significantly different from zero?).

  • The raw data in a regression analysis are a random sample \(\boldsymbol{y}\) from a random variable \(Y\).

  • The values in the design matrix \(\boldsymbol{X}\) are treated as fixed (i.e., not random).

  • Because \(\widehat{\boldsymbol{\beta}}\) is computed from the random sample \(\boldsymbol{y}\), the \(\widehat{\beta}\) values are also random variables.

  • It turns out the best fitting \(\beta\) values (i.e., \(\widehat{\beta}\)) can be tested with a \(t\)-test.

1. NHST for \(\beta\) values

  • We test each regression coefficient \(\beta_0\) and \(\beta_1\) to see if it is significantly different from 0.

\[ \begin{align} H_0: & \ \beta_0 = 0 \\ H_1: & \ \beta_0 \neq 0 \end{align} \]

\[ \begin{align} H_0: & \ \beta_1 = 0 \\ H_1: & \ \beta_1 \neq 0 \end{align} \]

  • We are usually most interested in \(\beta_1\), because it tells us how much \(Y\) changes when \(X\) changes by a given amount, so we focus on it from here on.

2. Choose a significance level

  • We pick a Type I error rate:

\[ \alpha = 0.05 \]

3. Sampling distribution of \(\widehat{\beta}\)

  • We obtain \(\widehat{\beta}_1\) using ordinary least squares (OLS) and describe its sampling distribution:

\[ \begin{align} \widehat{\beta}_1 &\sim \mathscr{N}(\beta_1, \sigma_{\widehat{\beta}_1}) \\ \widehat{\beta}_1 &\sim \mathscr{N}(0, \sigma_{\widehat{\beta}_1}) \quad \text{if } H_0 \text{ is true} \end{align} \]
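
  • A quick simulation illustrates this sampling distribution; the true values used here (\(\beta_1 = 0.5\), \(\sigma_{\epsilon} = 0.5\)) are assumptions for the sketch.

# Repeatedly sample data and refit the model to see how beta_hat_1 varies
set.seed(2)
beta1_hats <- replicate(1000, {
  x <- rnorm(50)
  y <- 2 + 0.5 * x + rnorm(50, sd = 0.5)
  coef(lm(y ~ x))[2]    # slope estimate for this sample
})
mean(beta1_hats)        # centred near the true beta_1
sd(beta1_hats)          # approximates sigma_beta_hat_1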

  • To test the null hypothesis, we compute a \(t\)-statistic:

\[ \begin{align} t &= \frac{\widehat{\beta}_1 - 0}{\widehat{\sigma}_{\widehat{\beta}_1}} \sim t(n - p) \\ t &= \frac{\widehat{\beta}_j - 0}{\sqrt{\widehat{\sigma}^2 \cdot \left[ (\mathbf{X}^T \mathbf{X})^{-1} \right]_{jj}}} \sim t(n - p) \end{align} \]

  • Here \(n\) is the number of observations, \(p\) is the number of estimated \(\beta\) parameters (here \(p = 2\)), \(\widehat{\sigma}^2\) is the estimated residual variance, and \(\widehat{\sigma}_{\widehat{\beta}_1}\) is the estimated standard error of \(\widehat{\beta}_1\).

4. / 5. NHST decision

  • Once we have the \(t\)-statistic, we can compute the \(p\)-value and make a decision about the null hypothesis as we have done many times before.
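
  • These steps can be carried out by hand in R. A sketch, again using simulated data with assumed parameter values:

set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100, sd = 0.5)

n <- length(y)
p <- 2                                         # number of estimated betas
X <- cbind(1, x)
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
res <- y - X %*% beta_hat                      # residuals
sigma2_hat <- sum(res^2) / (n - p)             # estimated residual variance
se_beta1 <- sqrt(sigma2_hat * solve(t(X) %*% X)[2, 2])
t_stat <- beta_hat[2] / se_beta1               # t-statistic for H0: beta_1 = 0
p_val <- 2 * pt(-abs(t_stat), df = n - p)      # two-tailed p-value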

Simple linear regression in R

  • Suppose we obtain a random sample from the random variable \(Y\), with each observation \(y_i\) paired with a predictor value \(x_i\):
##                 x          y
##             <num>      <num>
##   1: -0.341066980  1.4756824
##   2:  1.502424534  3.7369983
##   3:  0.528307712  2.2191545
##   4:  0.542191355  2.2640871
##   5: -0.136673356  1.3699349
##   6: -1.136733853  0.7595680
##   7: -1.496627154  0.4901085
##   8: -0.223385644  1.6773231
##   9:  2.001719228  3.6813218
##  10:  0.221703816  2.9877493
##  11:  0.164372909  2.8663688
##  12:  0.332623609  2.8146896
##  13: -0.385207999  1.6885979
##  14: -1.398754027  0.6885479
##  15:  2.675740796  3.1739641
##  16: -0.423686089  0.5819318
##  17: -0.298601512  1.6938028
##  18: -1.792341727  1.9337685
##  19: -0.248008225  1.9414724
##  20: -0.247303918  2.4242924
##  21: -0.255510379  2.1169153
##  22: -1.786938100  0.7170758
##  23:  1.784662816  3.7641111
##  24:  1.763586348  2.8425995
##  25:  0.689600222  1.8570232
##  26: -1.100740644  1.4849596
##  27:  0.714509357  1.5979549
##  28: -0.246470317  2.3086544
##  29: -0.319786166  2.0908911
##  30:  1.362644293  2.5039315
##  31: -1.227882590  1.1418443
##  32: -0.511219233  2.2125374
##  33: -0.731194999  1.1031983
##  34:  0.019752007  1.5179656
##  35: -1.572863915  1.4256920
##  36: -0.703333270  1.4226766
##  37:  0.715932089  2.8205084
##  38:  0.465214906  2.1332970
##  39: -0.973902306  2.1104744
##  40:  0.559217730  2.5273812
##  41: -2.432639745 -0.3388962
##  42: -0.340484927  1.1620719
##  43:  0.713033195  2.9979042
##  44: -0.659037386  2.0158793
##  45: -0.036402623  1.4982674
##  46: -1.593286302  0.5304600
##  47:  0.847792797  2.9407291
##  48: -1.850388849  0.6689173
##  49: -0.323650632  2.7390374
##  50: -0.255248113  2.7581469
##  51:  0.060921227  1.3031149
##  52: -0.823491629  1.1654270
##  53:  1.829730485  2.2896254
##  54: -1.429916216  1.6186859
##  55:  0.254137143  1.4816837
##  56: -2.939773695 -0.4873886
##  57:  0.002415809  3.0118814
##  58:  0.509665571  2.7578195
##  59: -1.084720001  1.8662018
##  60:  0.704832977  2.0204223
##  61:  0.330976350  2.1598476
##  62:  0.976327473  2.7980024
##  63: -0.843339880  0.9377107
##  64: -0.970579905  1.4525794
##  65: -1.771531349  1.2021052
##  66: -0.322470342  2.6851517
##  67: -1.338800742  1.6516660
##  68:  0.688156028  2.9851945
##  69:  0.071280652  2.1059138
##  70:  2.189752359  2.5386248
##  71: -1.157707599  1.2513077
##  72:  1.181688064  1.7584617
##  73: -0.527368362  2.2007417
##  74: -1.456628011  1.9800994
##  75:  0.572967370  2.2551233
##  76: -1.433377705  0.7928600
##  77: -1.055185019  2.0159826
##  78: -0.733111877  1.7031076
##  79:  0.210907264  1.9123176
##  80: -0.998920727  2.0623323
##  81:  1.077850323  2.1590023
##  82: -1.198974383  1.9749924
##  83:  0.216637035  1.6870804
##  84:  0.143087030  2.2672502
##  85: -1.065750091  1.9128136
##  86: -0.428623411  1.1180589
##  87: -0.656179477  1.8709720
##  88:  0.959394327  2.4239038
##  89:  1.556052636  3.1158983
##  90: -1.040796434  1.0853028
##  91:  0.930572409  2.4217930
##  92: -0.075445931  2.6534190
##  93: -1.967195349  1.1006474
##  94: -0.755903643  2.0336437
##  95:  0.461149161  2.1201273
##  96:  0.145106631  1.5578575
##  97: -2.442311321  0.7733815
##  98:  0.580318685  1.6776638
##  99:  0.655051998  1.0294703
## 100: -0.304508837  2.4323069
##                 x          y
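
  • The slides don't show the code that generated d, but output like this could come from something along the following lines; the seed and parameter values (\(\beta_0 = 2\), \(\beta_1 = 0.5\), \(\sigma_{\epsilon} = 0.5\)) are guesses based on the estimates reported below.

library(data.table)

set.seed(0)                              # hypothetical seed
n <- 100
x <- rnorm(n)                            # predictor values
y <- 2 + 0.5 * x + rnorm(n, sd = 0.5)    # assumed beta_0, beta_1, sigma
d <- data.table(x = x, y = y)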

Simple linear regression in R

  • We can plot the data to see if there is a relationship between \(x\) and \(y\).
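
  • For example, a minimal scatterplot sketch with ggplot2 (consistent with the plotting code later in this deck):

library(ggplot2)

ggplot(data = d, aes(x = x, y = y)) +
  geom_point() +
  theme_minimal()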

Simple linear regression in R

  • We can fit a simple linear regression model in R using the lm function.
fm <- lm(y ~ x, data = d)

Simple linear regression in R

summary(fm)
## 
## Call:
## lm(formula = y ~ x, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.34112 -0.44276 -0.03841  0.46488  1.00040 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.01015    0.05581   36.02   <2e-16 ***
## x            0.55023    0.05103   10.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5487 on 98 degrees of freedom
## Multiple R-squared:  0.5427, Adjusted R-squared:  0.538 
## F-statistic: 116.3 on 1 and 98 DF,  p-value: < 2.2e-16

Simple linear regression in R

  • The Estimate column provides the best fitting \(\beta\) values.

  • The Std. Error column provides the standard error associated with the \(\beta\) values reported in the Estimate column.

  • The t value column provides the \(t\)-statistic for the null hypothesis that the \(\beta\) value is zero.

  • The Pr(>|t|) column provides the \(p\)-value associated with the \(t\)-statistic.

  • The Residual standard error provides an estimate of the standard deviation of the residuals.

  • Before we can understand Multiple R-squared, Adjusted R-squared, and F-statistic we need to understand regression as a type of model comparison. We will cover this in a separate slide deck.
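
  • The same quantities can also be pulled directly from the fitted model object:

coef(fm)             # best-fitting beta values
coef(summary(fm))    # estimates, standard errors, t values, and p-values
sigma(fm)            # residual standard error
confint(fm)          # 95% confidence intervals for the beta parameters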

Simple linear regression in R

  • The best-fitting \(\beta_0\) is 2.01 and the best-fitting \(\beta_1\) is 0.55.

  • \(\beta_0\) is the intercept and \(\beta_1\) is the slope of the regression model.

Plotting regression lines

ggplot(data = d, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "steelblue") +
  theme_minimal() +
  theme(
    aspect.ratio = 1,
    plot.title = element_text(size = 14),
    plot.subtitle = element_text(size = 12)
  )

Plotting regression lines

Why do the regression error bands widen at the tails?

  • The bands show confidence intervals for the mean predicted value \(\widehat{y}\) at each \(x\).

  • Even though the residual variance is assumed constant, uncertainty in the mean prediction increases as \(x\) moves away from \(\bar{x}\).

  • The further you are from \(\bar{x}\), the more sensitive the prediction becomes to uncertainty in \(\widehat{\beta}_1\).
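
  • This is visible in the standard error of the mean prediction at a new value \(x_0\) (a standard OLS result, stated here for reference):

\[ \mathrm{SE}(\widehat{y}_{0}) = \widehat{\sigma}_{\epsilon} \sqrt{\frac{1}{n} + \frac{(x_{0} - \bar{x})^2}{\sum_{i=1}^{n} (x_{i} - \bar{x})^2}} \]

  • The \((x_{0} - \bar{x})^2\) term in the numerator grows as \(x_0\) moves away from \(\bar{x}\), which is exactly why the band widens at the tails.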