
Multiple regression

  • Multiple regression works just like the simple linear regression we just covered, but with multiple predictor variables.

Regression model

For the \(i^{th}\) observation, we can write:

\[ \begin{align} y_{i} &= \beta_{0} + \beta_{1} x_{1_i} + \beta_{2} x_{2_i} + \dots + \beta_{k} x_{k_i} + \epsilon_{i} \\\\ \end{align} \]

  • \(y_{i}\) is the \(i^{th}\) observed outcome.

  • \(x_{k_{i}}\) is the \(i^{th}\) value of the \(k^{th}\) predictor variable.

  • \(\epsilon_{i}\) is called the residual and is the difference between the observed outcome and the predicted outcome.

\[ \epsilon_{i} \sim Normal(0, \sigma_{\epsilon}) \]

  • \(\beta_{0}\), \(\beta_{1}\), \(\ldots\), \(\beta_{k}\) are parameters of the linear regression model.

Now let's extend to many observations:

\[ \begin{align} y_{1} &= \beta_{0} + \beta_{1} x_{1_1} + \beta_{2} x_{2_1} + \dots + \beta_{k} x_{k_1} + \epsilon_{1} \\ y_{2} &= \beta_{0} + \beta_{1} x_{1_2} + \beta_{2} x_{2_2} + \dots + \beta_{k} x_{k_2} + \epsilon_{2} \\ &\vdots \\ y_{n} &= \beta_{0} + \beta_{1} x_{1_n} + \beta_{2} x_{2_n} + \dots + \beta_{k} x_{k_n} + \epsilon_{n} \\ \end{align} \]

We can gather the independent observations up into vectors:

\[ \begin{align} \boldsymbol{y} &= \beta_{0} \begin{bmatrix} 1\\ 1\\ \vdots\\ 1 \end{bmatrix} + \beta_{1} \begin{bmatrix} x_{1_1}\\ x_{1_2}\\ \vdots\\ x_{1_n}\\ \end{bmatrix} + \beta_{2} \begin{bmatrix} x_{2_1}\\ x_{2_2}\\ \vdots\\ x_{2_n}\\ \end{bmatrix} + \ldots + \beta_{k} \begin{bmatrix} x_{k_1}\\ x_{k_2}\\ \vdots\\ x_{k_n}\\ \end{bmatrix} + \begin{bmatrix} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n\\ \end{bmatrix} \\\\ \end{align} \]

We can next gather the vectors up into a matrix:

\[ \begin{align} \boldsymbol{y} &= \begin{bmatrix} 1 & x_{1_1} & x_{2_1} & \ldots & x_{k_1} \\ 1 & x_{1_2} & x_{2_2} & \ldots & x_{k_2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1_n} & x_{2_n} & \ldots & x_{k_n} \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{k} \end{bmatrix} + \begin{bmatrix} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n\\ \end{bmatrix} \end{align} \]

We can finally write the model in compact matrix form:

\[ \begin{align} \boldsymbol{y} &= \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \\ \end{align} \]

  • \(\boldsymbol{y}\) is a vector of observed outcomes.

  • \(\boldsymbol{X}\) is a matrix of predictor variables and is called the design matrix (a short R sketch of building one follows this list).

  • \(\boldsymbol{\beta}\) is a vector of \(\beta\) parameters.

  • \(\boldsymbol{\epsilon}\) is a vector of residuals.
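
In R, the design matrix for a given model formula can be inspected directly with model.matrix(). A minimal sketch using a small made-up data frame (d_toy, x1, and x2 are illustrative names, not the slides' data):

set.seed(1)
d_toy <- data.frame(x1 = rnorm(5), x2 = rnorm(5))

# model.matrix() builds the design matrix X that lm() would use,
# including the leading column of 1s for the intercept
X <- model.matrix(~ x1 + x2, data = d_toy)
X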

How can we pick \(\boldsymbol{\beta}\) values that best fit our data?

  • let \(y_i\) denote observed values

  • let \(\widehat{y_{i}}\) denote predicted values:

\[ \widehat{y_{i}} = \beta_{0} + \beta_{1} x_{1_i} + \beta_{2} x_{2_i} + \ldots + \beta_{k} x_{k_i} \]

  • The best fitting \(\boldsymbol{\beta}\) values are those that minimise the discrepancy between \(y_{i}\) and \(\widehat{y_{i}}\).

\[ \DeclareMathOperator*{\argmin}{\arg\!\min} \argmin_{\boldsymbol{\beta}} \sum_{i=1}^{n} (y_{i} - \widehat{y_{i}})^2 \]

\[ \DeclareMathOperator*{\argmin}{\arg\!\min} \argmin_{\boldsymbol{\beta}} \sum_{i=1}^{n} (y_{i} - (\beta_{0} + \beta_{1} x_{1_i} + \beta_{2} x_{2_i} + \ldots + \beta_{k} x_{k_i}))^2 \]

  • The \(\boldsymbol{\beta}\) values that minimise error can be solved for analytically.

  • The method is to take the derivative with respect to \(\boldsymbol{\beta}\), and then find the \(\boldsymbol{\beta}\) values that make the resulting expression equal to zero.

  • I won’t do this here and won’t require you to do so either.

  • You should know, however, that this method of finding \(\beta\) values is called ordinary least squares; a short R sketch of the closed-form solution follows.
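
For reference, the closed-form solution is \(\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}\). A minimal sketch, using made-up data, that checks the matrix formula against lm() (variable names here are illustrative):

set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 0.5 * x1 + 0.3 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                        # design matrix with intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y # closed-form OLS estimates

t(beta_hat)                                  # matrix solution
coef(lm(y ~ x1 + x2))                        # lm() returns the same values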

Regression model terms

  • \(SS_{error} = SS_{residual} = \sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2\)

  • \(SS_{error}\) is what you get when you compare raw observations against the full model predictions.

  • \(SS_{total} = \sum_{i=1}^{n} (y_{i} - \bar{y})^2\)

  • \(SS_{total}\) is what you get when you compare raw observations against the grand mean.

  • \(SS_{error}\) comes from \(\sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2\) with predictions from the full model, \(\hat{y_{i}} = \beta_{0} + \beta_{1} x_{1_i} + \beta_{2} x_{2_i} + \ldots + \beta_{k} x_{k_i}\).

  • \(SS_{total}\) comes from \(\sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2\) with predictions from the grand-mean model, \(\hat{y_{i}} = \bar{y}\).

  • \(SS_{model} = \sum_{i=1}^{n} (\bar{y} - \hat{y_i})^2\) tells you how much the added complexity of the full model reduces the overall variability (i.e., makes better predictions).

  • The proportion of variability accounted for by the full model over and above the simple grand-mean model is given by:

\[R^2 = \frac{SS_{model}}{SS_{total}}\]

  • \(R^2\) is called the coefficient of determination; it is the square of the correlation between the observed outcomes \(y_{i}\) and the model predictions \(\hat{y_{i}}\).

  • The \(F\) ratio tells us whether or not the more complex model provides a significantly better fit to the data than the simplest model (i.e., the grand-mean model).

\[ F = \frac{MS_{model}}{MS_{error}} \]

  • The regression \(F\)-ratio tells us how much the regression model has improved the prediction over the simple mean model relative to the overall inaccuracy in the regression model.

  • We can also ask questions about the best fitting \(\beta\) values (e.g., is a given \(\beta\) significantly different from zero?).

  • The data you use in a regression comes from random variables and so the \(\beta\) values you estimate are also random.

  • It turns out the best fitting \(\beta\) values (i.e., \(\hat{\beta}\)) can be tested with a \(t\)-test; a short R sketch computing these sums of squares, \(R^2\), and \(F\) by hand follows.
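
A minimal sketch, using made-up data (not the slides' dataset), that computes \(SS_{error}\), \(SS_{total}\), \(SS_{model}\), \(R^2\), and the \(F\) ratio by hand and checks them against the values lm() reports:

set.seed(2)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 0.5 * x1 + 0.3 * x2 + rnorm(n)

fm_toy <- lm(y ~ x1 + x2)
y_hat  <- fitted(fm_toy)

ss_error <- sum((y - y_hat)^2)        # residual sum of squares
ss_total <- sum((y - mean(y))^2)      # total sum of squares (grand-mean model)
ss_model <- sum((y_hat - mean(y))^2)  # sum of squares explained by the model

k         <- 2                        # number of predictors
r_squared <- ss_model / ss_total
f_ratio   <- (ss_model / k) / (ss_error / (n - k - 1))

c(r_squared, summary(fm_toy)$r.squared)          # should match
c(f_ratio, summary(fm_toy)$fstatistic["value"])  # should match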

Multiple regression in R

  • Suppose we obtain data on three variables \(X\), \(Y\), and \(Z\) (a sketch of how such data might be simulated follows the printout).

##                x            y         z
##            <num>        <num>     <num>
##   1:  0.85311300  0.487138784 3.0873510
##   2:  1.20438541  1.389917470 2.5984987
##   3: -1.28659994 -1.223566894 1.6218499
##   4:  0.84577345 -0.441547804 2.8634952
##   5:  0.27635164 -1.450387597 1.5656639
##   6: -0.12673236  0.898431978 1.8067982
##   7: -0.13993094 -0.981138227 1.8116365
##   8: -1.38771057 -0.392723741 1.3924902
##   9:  0.86160534 -0.136435347 3.3061524
##  10: -0.29424708  0.579056513 0.9801463
##  11: -0.01421590  0.232370234 2.2844782
##  12: -1.11570872  1.348404144 2.2900233
##  13:  1.10528247 -1.493397984 2.5527940
##  14: -0.48409937  0.879792256 2.4569203
##  15:  1.07847442  0.762604710 2.1935727
##  16:  0.49593836  1.095560085 2.1293470
##  17:  0.99315853  0.228730483 3.0356999
##  18:  0.99326393 -1.581002741 1.4542371
##  19:  1.02452323 -1.137180161 2.8585073
##  20: -0.50282080  0.164626945 0.4444487
##  21: -1.03765850  0.043788879 1.8577367
##  22: -2.41472149  0.558191762 0.4094872
##  23: -0.79070697 -0.166252854 2.6798007
##  24:  1.57227805  0.460529587 3.4264510
##  25:  2.51188894 -0.284494534 2.3626133
##  26:  2.46299408 -1.210737300 2.5661298
##  27:  0.46962058 -0.030036260 2.2627202
##  28: -0.06077540 -1.294121601 1.7447721
##  29:  0.61993623 -0.245632546 3.0161838
##  30: -1.33503368  1.292276624 2.7435626
##  31:  0.81876915 -0.661484683 2.1203286
##  32:  0.70441248  2.147847851 3.5175991
##  33:  1.67009822  0.064872673 1.2880427
##  34:  1.43567004 -0.661030175 3.3278752
##  35:  1.16065865  0.582526744 3.1986949
##  36:  0.14732453 -0.480174658 3.1835957
##  37: -0.89599009  1.648205999 2.6965203
##  38:  0.20953338 -0.834871043 2.6861088
##  39: -0.56782087 -0.530972254 1.9126693
##  40:  0.49949918 -1.012841493 2.5880642
##  41:  0.38630153 -0.166158232 1.0570819
##  42: -0.98188263  0.198916555 0.5250248
##  43: -0.14830928  0.720804015 1.9558362
##  44: -0.29247481 -0.076674844 1.6682475
##  45: -0.26429773  0.320764297 2.3838150
##  46: -0.63979180 -0.715912236 1.4219839
##  47:  1.13828667 -1.413212449 0.6564161
##  48:  0.19892233  0.251059670 1.2431054
##  49:  0.82012519 -0.416334802 1.8077582
##  50:  0.78307038 -0.424468752 2.0635771
##  51: -1.80256628  1.584956316 1.2904355
##  52:  0.24490864 -1.916759566 1.5913308
##  53:  1.46431967 -0.456791435 4.1898713
##  54: -0.76447578  0.898130547 2.0871943
##  55: -0.57180846 -0.423088855 2.6069509
##  56:  0.08403486  0.155991973 1.2771178
##  57:  1.28949079 -0.409964140 2.7650973
##  58:  0.32563860  0.432397191 3.4751940
##  59: -0.54985142  0.385532611 1.3156206
##  60: -0.22344420  0.975761790 0.8691344
##  61:  1.64278557  0.524099560 3.4647008
##  62: -0.68216875 -0.999281549 0.9448999
##  63: -0.67931048  0.815751050 1.8222288
##  64:  0.44170282  0.696837685 3.2465521
##  65:  0.36569680  1.151918841 1.2593354
##  66:  1.03181050 -1.358816143 1.9402901
##  67:  2.12413686  1.472982276 3.9699601
##  68:  1.09156477  0.518598918 2.4223726
##  69: -0.82881354 -0.368569807 0.4277738
##  70: -1.63245858 -0.684386759 0.7614473
##  71: -0.55851807 -1.179793016 1.4731648
##  72: -1.27128946  1.404279321 1.4794151
##  73: -0.20369270 -2.415418369 0.6448221
##  74: -0.61973296 -0.773386630 0.5223003
##  75: -0.03291869 -0.933431202 2.6375865
##  76:  0.54736004 -1.657498099 0.9882637
##  77: -1.22023332 -0.472340560 1.7960976
##  78:  0.71131542 -1.089941222 0.8779406
##  79:  1.08273878  0.091547271 3.7610724
##  80:  1.43673407  2.551502382 2.2495473
##  81: -0.16181588  1.064529287 2.5754156
##  82:  0.04992388  0.005010507 3.1658723
##  83:  1.60284084  0.133254171 3.6527010
##  84:  0.39884919  0.171280543 1.7921047
##  85:  0.88845338 -1.031571412 0.7029190
##  86:  0.43268772 -1.490998843 2.3857329
##  87:  0.68200854  1.164499227 2.8427663
##  88:  0.40909282  0.345645943 3.1406423
##  89:  0.25423453 -1.561080202 1.5302701
##  90: -0.40541666  1.272093897 2.3168414
##  91: -0.14302845 -1.226177690 3.2117305
##  92:  1.43483981  0.202039362 2.8057783
##  93:  0.15254758 -1.367957087 1.9701856
##  94: -0.11902890 -1.843744495 1.7456947
##  95:  0.28557964 -2.014390759 1.3753395
##  96:  0.47463923  1.125654682 2.9627311
##  97: -0.62538116 -0.507217430 0.3565728
##  98: -1.25426048 -0.396301864 1.6994436
##  99: -0.82703299 -0.492398492 2.3278016
## 100:  0.74763125 -1.848939602 0.8916969
##                x            y         z
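
The chunk that generated d isn't shown here; a plausible sketch, assuming x and y are standard normal and z depends linearly on both (the coefficients and noise level below are illustrative guesses based on the fit reported later, not the true generating values):

library(data.table)

set.seed(0)
n <- 100
d <- data.table(x = rnorm(n), y = rnorm(n))
d[, z := 2 + 0.5 * x + 0.3 * y + rnorm(n, sd = 0.75)]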

  • We can plot the data to see if there is a relationship between \(x\), \(y\), and \(z\).
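
A minimal sketch of one way to do this (the slides' actual plotting code isn't shown): a base R scatterplot matrix of all pairwise relationships.

pairs(d[, .(x, y, z)])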

  • We can fit a multiple regression model in R using the lm function.

fm <- lm(z ~ x + y, data = d)

summary(fm)
## 
## Call:
## lm(formula = z ~ x + y, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5717 -0.4881  0.1136  0.4995  1.6117 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.03317    0.07759  26.206  < 2e-16 ***
## x            0.48339    0.08001   6.041 2.83e-08 ***
## y            0.29685    0.07502   3.957 0.000145 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7559 on 97 degrees of freedom
## Multiple R-squared:  0.3353, Adjusted R-squared:  0.3216 
## F-statistic: 24.46 on 2 and 97 DF,  p-value: 2.499e-09

  • The Estimate column provides the best fitting \(\beta\) values.

  • The Std. Error column provides the standard error associated with each \(\beta\) estimate reported in the Estimate column.

  • The t value column provides the \(t\)-statistic for the null hypothesis that the \(\beta\) value is zero.

  • The Pr(>|t|) column provides the \(p\)-value associated with the \(t\)-statistic.

  • The Residual standard error provides an estimate of the standard deviation of the residuals.

  • The Multiple R-squared provides the \(R^2\) value.

  • The Adjusted R-squared provides the \(R^2\) value adjusted for the number of predictors in the model.

  • The F-statistic provides the \(F\)-ratio and the p-value reported next to it provides the \(p\)-value associated with the \(F\)-ratio.

  • The best-fitting \(\beta_0\) is 2.03, the best-fitting \(\beta_1\) is 0.48, and the best-fitting \(\beta_2\) is 0.30.

  • \(\beta_0\) is the intercept and \(\beta_1\) is the slope of the regression plane in the \(x\) direction and \(\beta_2\) is the slope of the regression plane in the \(y\) direction.
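
These quantities can also be pulled out of the fitted model programmatically; a small sketch using base R accessors:

coef(fm)                   # best-fitting beta values (intercept and slopes)
confint(fm)                # 95% confidence intervals for each beta
summary(fm)$coefficients   # estimates, standard errors, t values, p values

# the p-value for the F-statistic can be recovered by hand via pf()
fstat <- summary(fm)$fstatistic
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)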

Assumptions and all the rest

  • As with the ongoing theme of this slide deck, the assumptions and diagnostics work out much the same for multiple regression as they do for simple linear regression.