
Simple linear regression

Simple linear regression models attempt to predict the value of some observed outcome random variable \(\boldsymbol{Y}\) as a linear function of a predictor random variable \(\boldsymbol{X}\).

Regression model

For the \(i^{th}\) observation, we can write:

\[ y_{i} = \beta_{0} + \beta_{1} x_{i} + \epsilon_{i} \]

  • \(y_{i}\) is the \(i^{th}\) observed outcome

  • \(x_{i}\) is the \(i^{th}\) value of the predictor variable

  • \(\epsilon_{i}\) is called the residual and is the difference between the observed outcome and the predicted outcome.

\[ \epsilon_{i} \sim Normal(0, \sigma_{\epsilon}) \]

  • \(\beta_{0}\) and \(\beta_{1}\) are the parameters of the linear regression model (the intercept and the slope, respectively)

Now let's extend this to many observations:

\[ \begin{align} y_{1} &= \beta_{0} + \beta_{1} x_{1} + \epsilon_{1} \\ y_{2} &= \beta_{0} + \beta_{1} x_{2} + \epsilon_{2} \\ &\vdots \\ y_{n} &= \beta_{0} + \beta_{1} x_{n} + \epsilon_{n} \\ \end{align} \]

We can gather the independent observations up into vectors:

\[ \begin{align} \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{bmatrix} &= \beta_{0} \begin{bmatrix} 1\\ 1\\ \vdots\\ 1 \end{bmatrix} + \beta_{1} \begin{bmatrix} x_1\\ x_2\\ \vdots\\ x_n\\ \end{bmatrix} + \begin{bmatrix} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n\\ \end{bmatrix} \\\\ \end{align} \]

We can next gather the vectors up into a matrix:

\[ \begin{align} \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{bmatrix} &= \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \end{bmatrix} + \begin{bmatrix} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n\\ \end{bmatrix} \\\\ \end{align} \]

We can finally write the model in compact matrix form:

\[ \begin{align} \boldsymbol{y} &= \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \end{align} \]

  • \(\boldsymbol{y}\) is a vector of observed outcomes.

  • \(\boldsymbol{X}\) is a matrix of predictor variables and is called the design matrix.

  • \(\boldsymbol{\beta}\) is a vector of \(\beta\) parameters.

  • \(\boldsymbol{\epsilon}\) is a vector of residuals.
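
For a concrete sense of what the design matrix looks like, here is a minimal sketch with three made-up predictor values (the values are arbitrary); in R, model.matrix() builds exactly this column-of-ones-plus-predictor matrix:

x <- c(0.5, -1.2, 2.0)   # hypothetical predictor values
model.matrix(~ x)        # column of 1s (intercept) alongside the x column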

How can we pick \(\boldsymbol{\beta}\) values that best fit our data?

  • let \(y_i\) denote observed values

  • let \(\widehat{y_{i}}\) denote predicted values:

\[ \widehat{y_{i}} = \beta_{0} + \beta_{1} x_{i} \]

  • The best fitting \(\boldsymbol{\beta}\) values are those that minimise the discrepancy between \(y_{i}\) and \(\widehat{y_{i}}\).

\[ \DeclareMathOperator*{\argmin}{\arg\!\min} \argmin_{\boldsymbol{\beta}} \sum_{i=1}^{n} (y_{i} - \widehat{y_{i}})^2 \]

\[ \DeclareMathOperator*{\argmin}{\arg\!\min} \argmin_{\boldsymbol{\beta}} \sum_{i=1}^{n} (y_{i} - (\beta_{0} + \beta_{1} x_{i}))^2 \]

  • The \(\boldsymbol{\beta}\) values that minimise error can be solved for analytically.

  • The method is to take the derivative with respect to \(\boldsymbol{\beta}\), and then find the \(\boldsymbol{\beta}\) values that make the resulting expression equal to zero.

  • I won’t do this here and won’t require you to do so either.

  • You should know, however, that this method of finding \(\beta\) values is called ordinary least squares; its closed-form solution is shown below for reference.
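
In the matrix notation introduced above, the ordinary least squares solution can be written compactly (stated here without derivation):

\[ \hat{\boldsymbol{\beta}} = (\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}\boldsymbol{y} \]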

Regression model terms

  • \(SS_{error} = SS_{residual} = \sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2\)

  • \(SS_{error}\) is what you get when you compare raw observations against the full model predictions.

  • \(SS_{total} = \sum_{i=1}^{n} (y_{i} - \bar{y})^2\)

  • \(SS_{total}\) is what you get when you compare raw observations against the grand mean.

  • \(SS_{error}\) comes from \(\sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2\) with \(\hat{y_{i}} = \beta_{0} + \beta_{1} x_{i}\),

  • \(SS_{total}\) comes from \(\sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2\) with \(\hat{y_{i}} = \bar{y}\)

  • \(SS_{model} = \sum_{i=1}^{n} (\bar{y} - \hat{y_i})^2\) tells you how much the added complexity of the full model reduces the overall variability (i.e., makes better predictions).

  • The proportion of variability accounted for over and above the simple mean model is given by:

\[R^2 = \frac{SS_{model}}{SS_{total}}\]

  • \(R^2\) is called the coefficient of determination and is just the square of the correlation coefficient between \(x\) and \(y\).

  • The \(F\) ratio tells us whether the more complex regression model provides a significantly better fit to the data than the simple mean model.

\[ F = \frac{MS_{model}}{MS_{error}} \]

  • \(MS_{model} = SS_{model} / df_{model}\) and \(MS_{error} = SS_{error} / df_{error}\), where \(df_{model}\) is the number of predictors (here 1) and \(df_{error} = n - 2\) for simple linear regression.

  • The regression \(F\)-ratio tells us how much the regression model has improved the prediction over the simple mean model, relative to the inaccuracy that remains in the regression model.

  • We can also ask questions about the best fitting \(\beta\) values (i.e., is either \(\beta\) significantly different from zero?).

  • The data you use in a regression comes from random variables and so the \(\beta\) values you estimate are also random.

  • It turns out the best fitting \(\beta\) values (i.e., \(\hat{\beta}\)) can be tested with a \(t\)-test.

Simple linear regression in R

  • Suppose we obtain data on two variables \(x\) and \(y\).
##                 x         y
##             <num>     <num>
##   1:  0.914186363 2.5710357
##   2: -0.191610690 1.4122583
##   3:  0.667938628 1.7740207
##   4:  1.057980473 3.1853587
##   5:  0.750844852 3.4891289
##   6: -0.389775827 1.9720604
##   7: -0.614894504 2.1335376
##   8: -0.148907099 1.1984930
##   9:  0.326261069 2.4123828
##  10: -1.458975014 0.8152275
##  11:  0.702920059 2.6628545
##  12:  0.970466002 2.5176534
##  13: -0.943710324 1.9354231
##  14: -0.641900172 1.4205616
##  15:  2.053199777 2.0558633
##  16:  0.803199108 2.8877389
##  17:  1.017441022 3.1679954
##  18:  0.746545724 3.3881426
##  19:  1.626192054 2.9516998
##  20: -0.200436490 1.3499400
##  21:  0.467695177 1.5874055
##  22: -0.519355788 2.4455923
##  23: -0.930288298 1.3959144
##  24:  1.293559425 2.9121413
##  25: -0.286899159 0.9476620
##  26: -0.279107115 1.9863391
##  27:  0.707897970 1.7719498
##  28:  0.334746318 2.5310623
##  29:  2.020931738 3.0361243
##  30: -0.503240982 2.0150786
##  31: -0.883711010 0.9777824
##  32:  1.598162368 2.8554240
##  33: -0.686483341 2.1493065
##  34:  1.111706349 1.9925258
##  35: -0.659655032 2.1844825
##  36:  0.355806326 3.1164497
##  37: -1.416421469 0.9176625
##  38:  1.089299697 2.8366766
##  39:  1.046278390 2.5003388
##  40: -0.712364755 2.4933058
##  41: -0.437944993 0.8058006
##  42: -0.279485122 1.8329278
##  43: -1.219988216 1.3105080
##  44: -2.568485137 1.8613566
##  45: -0.122033134 1.2526652
##  46: -0.131689278 2.5610838
##  47: -0.777620364 2.3732874
##  48: -0.516385751 0.9361903
##  49: -0.029042687 2.8928216
##  50: -2.287842766 0.6873935
##  51: -1.105518080 0.5191420
##  52: -2.342077893 0.2317854
##  53:  0.056956845 2.6369269
##  54:  2.225019955 2.1175029
##  55:  1.015350997 3.2284353
##  56: -0.105534296 1.0250612
##  57:  1.324037784 2.9939987
##  58: -0.242651161 1.7068616
##  59: -0.169495877 1.7461208
##  60:  1.324047634 2.0886769
##  61: -0.705204696 1.9650054
##  62:  0.060115964 2.0121209
##  63: -1.191044669 1.7597514
##  64:  0.269471551 1.5564801
##  65:  1.650474862 2.9288975
##  66:  0.172054713 1.3294465
##  67:  1.281079905 2.5324887
##  68: -0.610439344 1.6778129
##  69:  0.403155422 2.4708264
##  70: -0.329592606 1.9640561
##  71: -0.091675765 1.5250026
##  72:  0.862018885 3.6518703
##  73:  1.502706541 1.6834040
##  74: -0.455089484 2.3206510
##  75:  0.707292600 2.5387373
##  76: -0.008342843 1.6516972
##  77:  0.190088769 2.1183953
##  78: -1.908316927 1.4897671
##  79: -0.091287053 1.7832808
##  80: -0.509344230 2.0201003
##  81: -1.457420795 1.4149256
##  82: -1.482262543 1.1224994
##  83:  0.315629253 1.4926988
##  84: -1.039007994 1.8064505
##  85:  0.594275896 1.8231943
##  86: -1.440066878 2.1115185
##  87: -0.591969920 2.0987456
##  88:  1.337461222 2.7517628
##  89:  2.338961212 3.4148130
##  90: -0.414190827 2.4339902
##  91: -0.248339401 2.2056044
##  92: -0.127354187 2.2711397
##  93: -1.270728753 1.0834128
##  94:  0.176038888 2.3642855
##  95: -0.541558849 1.4844782
##  96: -1.401232330 1.6270596
##  97: -0.731449686 2.0990979
##  98: -0.146833628 1.8201237
##  99: -0.677546816 2.3360614
## 100: -1.459524671 0.7799083
##                 x         y
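
The code that created d is not shown above. Purely as an illustration, data with this structure could be simulated along the following lines (the seed, coefficients, and noise level here are assumptions, not the values actually used):

library(data.table)

set.seed(1)                            # hypothetical seed
n <- 100
x <- rnorm(n)                          # predictor values
y <- 2 + 0.5 * x + rnorm(n, 0, 0.55)   # assumed intercept, slope, and residual SD
d <- data.table(x = x, y = y)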

  • We can plot the data to see if there is a relationship between \(x\) and \(y\).
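
For example, a base-R scatter plot (a minimal sketch; any plotting approach would do):

plot(d$x, d$y, xlab = "x", ylab = "y")   # scatter plot of the raw data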

  • We can fit a simple linear regression model in R using the lm function.
fm <- lm(y ~ x, data=d)

summary(fm)
## 
## Call:
## lm(formula = y ~ x, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.04713 -0.47513  0.07949  0.33500  1.21665 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.03789    0.05509  36.994  < 2e-16 ***
## x            0.46093    0.05388   8.554 1.64e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5506 on 98 degrees of freedom
## Multiple R-squared:  0.4275, Adjusted R-squared:  0.4217 
## F-statistic: 73.18 on 1 and 98 DF,  p-value: 1.641e-13

  • The Estimate column provides the best fitting \(\beta\) values.

  • The Std. Error column provides the standard error associated with each \(\beta\) estimate reported in the Estimate column.

  • The t value column provides the \(t\)-statistic for the null hypothesis that the \(\beta\) value is zero.

  • The Pr(>|t|) column provides the \(p\)-value associated with the \(t\)-statistic.

  • The Residual standard error provides an estimate of the standard deviation of the residuals.

  • The Multiple R-squared provides the \(R^2\) value.

  • The Adjusted R-squared provides the \(R^2\) value adjusted for the number of predictors in the model.

  • The F-statistic line provides the \(F\)-ratio, and the p-value reported next to it is the \(p\)-value associated with that \(F\)-ratio.

  • The best-fitting \(\beta_0\) is 2.04 and the best fitting \(\beta_1\) is 0.46.

  • \(\beta_0\) is the intercept and \(\beta_1\) is the slope of the regression model. Both can be extracted directly from the fitted model object, and the sums of squares can be recomputed by hand, as sketched below.
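
A short sketch (using the fm and d objects from above) that pulls the summary quantities out of the fitted model and recomputes \(R^2\) and the \(F\)-ratio from the sums of squares defined earlier:

coef(fm)            # best-fitting beta values (intercept and slope)
coef(summary(fm))   # estimates, standard errors, t values, and p values

y_hat    <- fitted(fm)                   # model predictions
ss_error <- sum((d$y - y_hat)^2)         # residual sum of squares
ss_total <- sum((d$y - mean(d$y))^2)     # total sum of squares
ss_model <- sum((y_hat - mean(d$y))^2)   # model sum of squares

ss_model / ss_total                              # R-squared; should match Multiple R-squared
(ss_model / 1) / (ss_error / df.residual(fm))    # F-ratio; should match the F-statistic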

Assumptions of simple linear regression

  • The relationship between \(x\) and \(y\) is linear.

  • The residuals are independent.

  • The residuals are normally distributed.

  • The residuals are homoscedastic.

Check assumptions with plots

  • A common diagnostic tool is to plot the residuals against the predicted values.

  • If linearity holds then this plot will hover around zero from the beginning to the end of the observed range. The residuals should not show trends or patterns across the range of predicted values.

  • If homoscedasticity holds then this plot will show a constant spread of residuals across the range of predicted values.

  • If normality holds then the residuals will be normally distributed, and we can use a histogram or density plot of the residuals to check this as usual (see the sketch below).
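
A minimal base-R sketch of these diagnostic plots (using fm from above; other plotting tools would work equally well):

plot(fitted(fm), resid(fm), xlab = "Predicted values", ylab = "Residuals")  # look for trends or changing spread
abline(h = 0, lty = 2)                                                      # reference line at zero

hist(resid(fm), xlab = "Residual", main = "Histogram of residuals")         # check approximate normality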

More regression diagnostics

  • We are just scratching the surface of regression, but we will not go much further in this course.

  • If you’d like a taste, try plot(fm) to see a range of diagnostic plots. You can ask your favourite AI to explain them to you.

Regression vs correlation

  • Both are used to examine the relationship between variables.

  • In simple linear regression with a single predictor, the sign (positive or negative) of the slope coefficient always matches the sign of the correlation coefficient between the two variables.

  • In simple linear regression with one predictor, the square of the correlation coefficient \(r^2\) is equal to the coefficient of determination in regression analysis.

  • Correlation provides a single metric describing the linear relationship between two variables.

  • Regression is used to model the relationship and predict values, offering more detailed insight into how the variables interact.

  • In simple linear regression, \(R^2\) (a measure of how well the variation in the dependent variable is explained by the independent variable) is exactly the square of the correlation coefficient \(r\).

\[ R^2 = \frac{SS_{model}}{SS_{total}} = r^2 \]
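
This identity is easy to verify in R with the objects from above (a quick check, not part of the original analysis):

cor(d$x, d$y)^2        # squared correlation between x and y
summary(fm)$r.squared  # R-squared reported by the regression; the two values agree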