
Multiple regression

  • Multiple regression works just like the simple linear regression we just covered, but with multiple predictor variables.

Regression model

For the \(i^{th}\) observation, we can write:

\[ \begin{align} y_{i} &= \beta_{0} + \beta_{1} x_{1_i} + \beta_{2} x_{2_i} + \dots + \beta_{k} x_{k_i} + \epsilon_{i} \\\\ \end{align} \]

  • \(y_{i}\) is the \(i^{th}\) observed outcome.

  • \(x_{k_{i}}\) is the \(i^{th}\) value of the \(k^{th}\) predictor variable.

  • \(\epsilon_{i}\) is called the residual and is the difference between the observed outcome and the predicted outcome.

\[ \epsilon_{i} \sim Normal(0, \sigma_{\epsilon}) \]

  • \(\beta_{0}\), \(\beta_{1}\), \(\ldots\), \(\beta_{k}\) are the parameters of the linear regression model (they do not vary across observations, so they carry no \(i\) subscript).

Now let's extend this to many observations:

\[ \begin{align} y_{1} &= \beta_{0} + \beta_{1} x_{1_1} + \beta_{2} x_{2_1} + \dots + \beta_{k} x_{k_1} + \epsilon_{1} \\ y_{2} &= \beta_{0} + \beta_{1} x_{1_2} + \beta_{2} x_{2_2} + \dots + \beta_{k} x_{k_2} + \epsilon_{2} \\ &\vdots \\ y_{n} &= \beta_{0} + \beta_{1} x_{1_n} + \beta_{2} x_{2_n} + \dots + \beta_{k} x_{k_n} + \epsilon_{n} \\ \end{align} \]

We can gather the independent observations up into vectors:

\[ \begin{align} \boldsymbol{y} &= \beta_{0} \begin{bmatrix} 1\\ 1\\ \vdots\\ 1 \end{bmatrix} + \beta_{1} \begin{bmatrix} x_{1_1}\\ x_{1_2}\\ \vdots\\ x_{1_n}\\ \end{bmatrix} + \beta_{2} \begin{bmatrix} x_{2_1}\\ x_{2_2}\\ \vdots\\ x_{2_n}\\ \end{bmatrix} + \ldots + \beta_{k} \begin{bmatrix} x_{k_1}\\ x_{k_2}\\ \vdots\\ x_{k_n}\\ \end{bmatrix} + \begin{bmatrix} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n\\ \end{bmatrix} \\\\ \end{align} \]

We can next gather the vectors up into a matrix:

\[ \begin{align} \boldsymbol{y} &= \begin{bmatrix} 1 & x_{1_1} & x_{2_1} & \ldots & x_{k_1} \\ 1 & x_{1_2} & x_{2_2} & \ldots & x_{k_2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1_n} & x_{2_n} & \ldots & x_{k_n} \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{k} \end{bmatrix} + \begin{bmatrix} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n\\ \end{bmatrix} \\ \end{align} \]

We can finally write the model in compact matrix form:

\[ \begin{align} \boldsymbol{y} &= \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \\ \end{align} \]

  • \(\boldsymbol{y}\) is a vector of observed outcomes.

  • \(\boldsymbol{X}\) is a matrix of predictor variables and is called the design matrix (see the R sketch after this list).

  • \(\boldsymbol{\beta}\) is a vector of \(\beta\) parameters.

  • \(\boldsymbol{\epsilon}\) is a vector of residuals.
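
  • To make the design matrix concrete, here is a minimal R sketch (the data frame d2 and its predictor values are made up for illustration; the slides do not include this code). model.matrix() builds \(\boldsymbol{X}\) from a formula, adding the leading column of 1s for the intercept.

# a minimal sketch: build a design matrix from made-up predictors
set.seed(1)
d2 <- data.frame(x1 = rnorm(5), x2 = rnorm(5))  # hypothetical predictors

# model.matrix() adds the column of 1s for the intercept term
X <- model.matrix(~ x1 + x2, data = d2)
X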

How can we pick \(\boldsymbol{\beta}\) values that best fit our data?

  • let \(y_i\) denote observed values

  • let \(\widehat{y_{i}}\) denote predicted values:

\[ \widehat{y_{i}} = \beta_{0} + \beta_{1} x_{1_i} + \beta_{2} x_{2_i} + \ldots + \beta_{k} x_{k_i} \]

  • The best fitting \(\boldsymbol{\beta}\) values are those that minimise the discrepancy between \(y_{i}\) and \(\widehat{y_{i}}\).

\[ \DeclareMathOperator*{\argmin}{\arg\!\min} \argmin_{\boldsymbol{\beta}} \sum_{i=1}^{n} (y_{i} - \widehat{y_{i}})^2 \]

\[ \DeclareMathOperator*{\argmin}{\arg\!\min} \argmin_{\boldsymbol{\beta}} \sum_{i=1}^{n} (y_{i} - (\beta_{0} + \beta_{1} x_{1_i} + \beta_{2} x_{2_i} + \ldots + \beta_{k} x_{k_i}))^2 \]

  • The \(\boldsymbol{\beta}\) values that minimise error can be solved for analytically.

  • The method is to take the derivative with respect to \(\boldsymbol{\beta}\), and then find the \(\boldsymbol{\beta}\) values that make the resulting expression equal to zero.

  • I won’t do this here and won’t require you to do so either.

  • You should know, however, that this method of finding \(\beta\) values is called ordinary least squares (see the sketch below).
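
  • For reference, the closed-form solution that ordinary least squares arrives at is \(\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}\boldsymbol{y}\). The following minimal R sketch (made-up data and variable names, not from the slides) applies this formula directly and checks it against lm().

# a minimal sketch of the ordinary least squares solution on made-up data
set.seed(2)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 0.5 * x1 - 0.3 * x2 + rnorm(n)  # arbitrary true betas

X <- cbind(1, x1, x2)  # design matrix with a column of 1s for the intercept

# beta_hat = (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat

# lm() gives the same estimates
coef(lm(y ~ x1 + x2))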

Regression model terms

  • \(SS_{error} = SS_{residual} = \sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2\)

  • \(SS_{error}\) is what you get when you compare raw observations against the full model predictions.

  • \(SS_{total} = \sum_{i=1}^{n} (y_{i} - \bar{y})^2\)

  • \(SS_{total}\) is what you get when you compare raw observations against the grand mean.

  • \(SS_{error}\) comes from \(\sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2\) with \(\hat{y_{i}} = \beta_{0} + \beta_{1} x_{1_i} + \beta_{2} x_{2_i} + \ldots + \beta_{k} x_{k_i}\), i.e., the full model predictions (no \(\epsilon\) term, because predictions do not include the residual).

  • \(SS_{total}\) comes from \(\sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2\) with \(\hat{y_{i}} = \bar{y}\), i.e., the predictions of the grand-mean (intercept-only) model.

  • \(SS_{model} = \sum_{i=1}^{n} (\bar{y} - \hat{y_i})^2\) tells you how much the added complexity of the full model reduces the overall variability (i.e., makes better predictions).

  • The proportion of variability accounted for by the full model over the grand-mean model is given by:

\[R^2 = \frac{SS_{model}}{SS_{total}}\]

  • \(R^2\) is called the coefficient of determination. In multiple regression it is the square of the correlation between the observed outcomes \(y\) and the model's predictions \(\hat{y}\).

  • The \(F\) ratio tells us whether the more complex model provides a significantly better fit to the data than the simplest model.

\[ F = \frac{MS_{model}}{MS_{error}} \]

  • The regression \(F\)-ratio tells us how much the regression model has improved the prediction over the simple mean model, relative to the remaining inaccuracy of the regression model. Here \(MS_{model} = SS_{model}/k\) and \(MS_{error} = SS_{error}/(n - k - 1)\), where \(k\) is the number of predictors and \(n\) is the number of observations (see the worked R sketch after this list).

  • We can also ask questions about the best fitting \(\beta\) values (i.e., is each \(\beta\) significantly different from zero?).

  • The data you use in a regression are realisations of random variables, so the \(\beta\) values you estimate are themselves random.

  • It turns out the best fitting \(\beta\) values (i.e., \(\hat{\beta}\)) can be tested with a \(t\)-test.
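
  • The following is a worked R sketch of these terms on made-up data (the simulated variables and their relationship are assumptions for illustration, not from the slides): fit a model with lm(), compute \(SS_{total}\), \(SS_{model}\), \(SS_{error}\), \(R^2\), and the \(F\)-ratio by hand, and compare against summary().

# a minimal sketch: compute the regression SS terms, R^2, and F by hand
set.seed(3)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.6 * x1 + 0.4 * x2 + rnorm(n)  # arbitrary made-up relationship

fit   <- lm(y ~ x1 + x2)
y_hat <- fitted(fit)
k     <- 2  # number of predictors

ss_total <- sum((y - mean(y))^2)
ss_error <- sum((y - y_hat)^2)
ss_model <- sum((y_hat - mean(y))^2)

r2 <- ss_model / ss_total
f  <- (ss_model / k) / (ss_error / (n - k - 1))
c(R2 = r2, F = f)

# summary(fit) reports the same R^2 and F-statistic,
# plus the t-tests of each beta_hat against zero
summary(fit)$r.squared
summary(fit)$fstatistic
summary(fit)$coefficients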

Multiple regression in R

  • Suppose we obtain data on three variables \(X\), \(Y\), and \(Z\).
##                  x            y         z
##              <num>        <num>     <num>
##   1: -0.1270006060 -0.934710687 2.0486143
##   2: -0.5870636454  1.063589007 1.1481576
##   3: -1.1818252735  0.724968223 1.5816542
##   4: -0.1741249442 -0.657355162 1.8778684
##   5: -0.8180039439 -1.666401819 0.1107309
##   6:  0.5305352204 -0.144770226 1.5961658
##   7:  1.0383201363  0.252550838 2.7035856
##   8: -1.4877138731  1.008351871 0.7093535
##   9:  0.7451555105  0.469323406 1.8467448
##  10: -0.3250450726  0.017017592 2.6619083
##  11:  0.4742104651 -0.716119979 2.2566865
##  12: -0.2191240330 -1.780661910 0.8750041
##  13: -0.5945857330 -0.141113048 2.1996881
##  14:  0.2965757615  1.321872432 2.0548366
##  15: -0.2748796337  0.885514687 2.8331131
##  16:  0.2166176012  0.376039733 2.5491432
##  17: -0.0964755586 -0.279854319 0.9866483
##  18:  0.6615163612 -1.388729827 2.9667074
##  19: -1.3141795513 -1.339835208 1.1069781
##  20:  1.0280068532  0.236673830 2.1086470
##  21:  1.3521744516  0.714833075 3.3516741
##  22: -0.9658127229 -0.008983147 2.6575836
##  23:  1.4094250078 -0.516019454 2.9328589
##  24:  0.0462948578 -0.651362393 2.7198915
##  25: -1.3789626734  1.416110774 2.1860071
##  26: -0.5925630474 -0.384247805 1.4564110
##  27:  0.7705200592 -1.941888128 2.1529061
##  28:  1.3932777933 -0.282364105 2.1041343
##  29:  0.0002704068 -0.173978580 1.7032701
##  30:  0.5658312288  2.280987227 2.6441624
##  31:  1.5473586141 -0.170937566 2.0852717
##  32: -1.1633893181  0.239961586 2.2083166
##  33:  0.1567007307  1.245200253 3.2622994
##  34: -0.9588288119 -1.705699817 0.1238667
##  35: -1.1785038009 -0.803787851 1.3857629
##  36:  0.2559013697 -0.957923081 1.6362327
##  37:  0.4608409528  0.408912710 1.1429944
##  38: -1.1723481181  1.345224931 0.9654977
##  39:  0.6940095186 -0.424766964 1.0627294
##  40:  0.7752643683  0.810273484 2.1220918
##  41:  0.4327599498 -2.563958192 0.4008419
##  42: -0.0484148433 -0.535657620 3.5398112
##  43: -2.2067419222  0.947190416 0.4392061
##  44:  0.0825344876  2.390919609 2.4572655
##  45: -0.0833179373 -0.068754973 2.9264009
##  46:  0.0373104949  0.520167591 2.0813167
##  47: -0.0797836139  1.074655385 2.9517762
##  48:  0.6655914689  2.256068349 2.5417768
##  49:  0.5209649828 -0.133387632 1.8678902
##  50:  0.2862544751  0.064305675 1.3905928
##  51: -0.4151271715  0.435108743 1.2383324
##  52: -0.7333221366  0.844920022 2.2189971
##  53: -0.9376260898 -0.529578566 2.2589999
##  54: -2.0947900276  1.874641256 0.5661773
##  55: -0.6776633352  0.157206735 1.5164424
##  56: -0.3162105599 -0.332278750 3.0531567
##  57: -0.4964971841  1.853505230 2.5283272
##  58: -1.0234945012  0.267333578 1.7323478
##  59: -0.6695785453 -0.333011744 2.0949894
##  60: -0.8993140502  0.255003108 0.4012755
##  61: -1.7683996212  0.710140853 1.0237524
##  62:  2.1342042208  1.344929915 3.6912281
##  63:  0.9870178435  0.820062417 2.2380800
##  64: -1.3796090580 -0.328095437 0.9079280
##  65:  1.0498777825 -0.866045317 2.5250971
##  66:  1.3244778760 -1.212855287 1.5154542
##  67: -0.4072676635 -0.291607690 1.4116408
##  68: -1.0172279316 -0.971097327 1.8208804
##  69:  0.8687647119 -1.204071958 1.7270428
##  70:  0.3371437922 -0.672554380 1.6450115
##  71:  0.4319908783  0.014424685 3.0738613
##  72: -1.2367760588  1.384372450 2.0645434
##  73: -0.3328934056 -0.291881109 2.1274974
##  74: -0.6628828343  0.604491232 1.2871221
##  75:  0.5095508997 -0.402507801 1.1319215
##  76: -0.6544678551  1.154985979 1.8160114
##  77: -0.7752333911 -0.147638619 0.9831006
##  78: -0.7819610739 -0.626241342 0.1707455
##  79:  2.7268084575  0.334527649 4.2255598
##  80: -0.8377102939 -0.195936683 1.1452770
##  81:  0.4675923609  1.115666181 1.3640133
##  82:  0.4724233069 -0.023654958 2.7222533
##  83: -1.0463127257  0.137068172 1.9236423
##  84: -0.1816832752  0.538653492 3.1039916
##  85:  0.4315281063  1.205831628 3.4000622
##  86: -0.3500500808 -2.399040157 0.9020665
##  87: -0.4498263550 -0.105254427 0.2005886
##  88: -0.7060834816  1.534204112 1.2784907
##  89:  0.4628889947 -0.830638637 2.2592983
##  90:  0.3164579048 -0.709403980 1.3733458
##  91:  1.0409744934 -0.810298074 1.3038503
##  92: -0.8282991122  0.598402311 1.8848516
##  93:  0.4292346448  0.015580387 3.4812245
##  94:  0.0646466573  2.329475334 2.7021788
##  95: -0.9593979163  0.227242014 1.2074298
##  96: -0.5172125470  0.274008498 1.6536878
##  97: -1.3277282431  0.454203238 0.6934003
##  98: -1.0571495267 -0.031772740 1.8316349
##  99:  0.6960103553  1.561989379 2.7994811
## 100: -1.0612214539  0.066561716 1.0467034
##                  x            y         z

  • We can plot the data to see if there is a relationship between \(x\), \(y\), and \(z\).
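
  • One way to do this (a sketch only; the slides do not show the plotting code, and it assumes d is the data table printed above with columns x, y, and z) is a scatterplot matrix plus plots of z against each predictor.

# a sketch: scatterplot matrix of all pairwise relationships in d
pairs(d)

# or plot z against each predictor separately
par(mfrow = c(1, 2))
plot(z ~ x, data = d)
plot(z ~ y, data = d)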

  • We can fit a multiple regression model in R using the lm function.
fm <- lm(z ~ x + y, data=d)

summary(fm)
## 
## Call:
## lm(formula = z ~ x + y, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4422 -0.5214 -0.0799  0.4804  1.8042 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.91693    0.07039  27.234  < 2e-16 ***
## x            0.54163    0.07741   6.997 3.39e-10 ***
## y            0.28953    0.06964   4.158 6.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6944 on 97 degrees of freedom
## Multiple R-squared:  0.3873, Adjusted R-squared:  0.3747 
## F-statistic: 30.66 on 2 and 97 DF,  p-value: 4.798e-11

  • The Estimate column provides the best fitting \(\beta\) values.

  • The Std. Error column provides the standard error associated with each \(\beta\) value reported in the Estimate column.

  • The t value column provides the \(t\)-statistic for the null hypothesis that the \(\beta\) value is zero.

  • The Pr(>|t|) column provides the \(p\)-value associated with the \(t\)-statistic.

  • The Residual standard error provides an estimate of the standard deviation of the residuals.

  • The Multiple R-squared provides the \(R^2\) value.

  • The Adjusted R-squared provides the \(R^2\) value adjusted for the number of predictors in the model.

  • The F-statistic line reports the \(F\)-ratio for the full model and the \(p\)-value associated with it. All of these quantities can also be extracted programmatically (see the sketch after this list).

  • The best-fitting \(\beta_0\) is 1.92, the best-fitting \(\beta_1\) is 0.54, and the best-fitting \(\beta_2\) is 0.29.

  • \(\beta_0\) is the intercept and \(\beta_1\) is the slope of the regression plane in the \(x\) direction and \(\beta_2\) is the slope of the regression plane in the \(y\) direction.
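
  • A brief sketch of how to pull these quantities out of the fitted object fm in code (base R accessors only; nothing here is specific to these slides):

# a sketch: extract pieces of the fit from fm programmatically
coef(fm)                  # the best-fitting beta values
confint(fm)               # 95% confidence intervals for each beta
summary(fm)$coefficients  # estimates, SEs, t values, and p-values as a matrix
summary(fm)$r.squared     # R^2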

Assumptions and all the rest

  • As with the ongoing theme of this slide deck, the assumptions (e.g., normally distributed residuals, homogeneity of variance, independence) and the ways of checking them work out the same for multiple regression as they do for simple linear regression.
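
  • For example, the usual residual diagnostics carry over directly; a brief sketch using the fitted object fm from above (base R's built-in diagnostic plots, not code from the slides):

# a sketch: standard residual diagnostics for the fitted model fm
par(mfrow = c(2, 2))
plot(fm)  # residuals vs fitted, QQ plot, scale-location, residuals vs leverage

# residual normality can also be eyeballed directly
hist(resid(fm))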