28 Simple linear regression

Simple linear regression models predict the value of an observed outcome random variable \(\boldsymbol{Y}\) as a linear function of a single predictor random variable \(\boldsymbol{X}\).

For the \(i^{th}\) observation, we can write:

\[ y_{i} = \beta_{0} + \beta_{1} x_{i} + \epsilon_{i} \]

  • \(y_{i}\) is the \(i^{th}\) observed outcome
  • \(x_{i}\) is the \(i^{th}\) value of the predictor variable
  • \(\epsilon_{i} \sim \mathcal{N}(0, \sigma_{\epsilon})\) is called the residual; it is assumed to be normally distributed with mean 0 and standard deviation \(\sigma_{\epsilon}\)
  • \(\beta_{0}\) (the intercept) and \(\beta_{1}\) (the slope) are the parameters of the linear regression model; they take the same values for every observation

Now imagine having many observations such that \(i \in [1, n]\):

\[\begin{align} \boldsymbol{y} &= \beta_{0} + \beta_{1} \boldsymbol{x} + \boldsymbol{\epsilon} \\\\ \boldsymbol{y} &= \beta_{0} \begin{bmatrix} 1\\ 1\\ \vdots\\ 1 \end{bmatrix} + \beta_{1} \begin{bmatrix} x_1\\ x_2\\ \vdots\\ x_n\\ \end{bmatrix} + \begin{bmatrix} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n\\ \end{bmatrix} \\\\ \boldsymbol{y} &= \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \end{bmatrix} + \begin{bmatrix} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n\\ \end{bmatrix} \\\\ \boldsymbol{y} &= \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \end{align}\]
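
To make the matrix form concrete, here is a minimal sketch in R that simulates data from this model. The \(\beta\) values, \(n\), and noise level are made up for illustration and are not taken from any data set used in this book.

## simulate n observations from y = X beta + epsilon
set.seed(1)

n <- 10
x <- rnorm(n)
beta <- c(2, 0.5)           ## made-up beta_0 (intercept) and beta_1 (slope)
epsilon <- rnorm(n, 0, 1)   ## residuals drawn from N(0, sigma_epsilon = 1)

## design matrix: a column of 1s for the intercept, then the predictor
X <- cbind(1, x)

## generate the observed outcomes
y <- as.numeric(X %*% beta + epsilon)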

28.0.0.1 How can we pick \(\boldsymbol{\beta}\) values that best fit our data?

  • let \(y_i\) denote observed values
  • let \(\hat{y_{i}}\) denote predicted values

\[\begin{align} \hat{y_{i}} &= E[y_{i}]\\ &= E[\beta_{0} + \beta_{1} x_{i} + \epsilon_{i}]\\ &= \beta_{0} + \beta_{1} x_{i} \quad \text{(since } E[\epsilon_{i}] = 0\text{)} \end{align}\]

  • The best fitting \(\boldsymbol{\beta}\) values are those that minimise the discrepancy between \(y_{i}\) and \(\hat{y_{i}}\).

\[ \DeclareMathOperator*{\argmin}{\arg\!\min} \argmin_{\boldsymbol{\beta}} \sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2 \]

  • One way to find the \(\boldsymbol{\beta}\) values that do this is numerical: use a computer to search over many candidate values (e.g., a grid search or an iterative optimiser) and keep the values that give the smallest sum of squared errors (see the first sketch after this list).

  • However, in this case the \(\boldsymbol{\beta}\) values that minimise the error can also be solved for analytically. I won’t go through the derivation here, but it is fairly short: take the derivative of the sum of squared errors with respect to \(\boldsymbol{\beta}\), set the resulting expression equal to zero, and solve for \(\boldsymbol{\beta}\). The resulting closed-form solution is applied in the second sketch below.
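
As a sketch of the numerical approach, the following code reuses the simulated x and y from above and lets optim() search for the \(\boldsymbol{\beta}\) values that minimise the sum of squared errors (the starting values and variable names here are arbitrary):

## sum of squared errors for a candidate pair of beta values
sse <- function(b, x, y) {
  y_hat <- b[1] + b[2] * x
  sum((y - y_hat)^2)
}

## search numerically for the beta values that minimise the SSE
fit_num <- optim(par=c(0, 0), fn=sse, x=x, y=y)
fit_num$par   ## numerical estimates of beta_0 and beta_1

And a sketch of the analytic solution, which applies \(\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}\boldsymbol{y}\) to the same simulated data and compares the result against lm():

## analytic least-squares solution: (X'X)^-1 X'y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat

## lm() gives the same estimates (up to numerical precision)
coef(lm(y ~ x))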

28.0.0.2 Regression model significance

  • Notice that \(\sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2\) is a sum of squares very much in the style of the sums of squares we have seen thus far in ANOVAs.

  • \(SS_{error} = \sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2\)

  • \(SS_{error}\) is sometimes called \(SS_{residual}\)

  • \(SS_{error}\) is what you get when you compare raw observations against the full model predictions.

  • \(SS_{total}\) is what you get when you compare raw observations against the grand mean.

  • \(SS_{total} = \sum_{i=1}^{n} (y_{i} - \bar{y})^2\), where \(\bar{y}\) is the grand mean

  • Just as \(SS_{error}\) comes from \(\sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2\) with \(\hat{y_{i}} = \beta_{0} + \beta_{1} x_{i}\) (the full model), you can think of \(SS_{total}\) as coming from the same sum with \(\hat{y_{i}} = \bar{y}\) (a model that predicts every observation with the grand mean).

  • From this perspective, \(SS_{error}\) is the variability of the data around the prediction from the full model, and \(SS_{total}\) is the variability of the data around the mean (which is pretty much the simplest possible model and thus serves as a good baseline).

  • Finally, \(SS_{model} = \sum_{i=1}^{n} (\bar{y} - \hat{y_i})^2\) tells you how much the added complexity of the full model reduces the unexplained variability (i.e., how much better its predictions are than the grand mean’s).

  • The proportion of variability accounted for by the full model over and above the simple mean model is given by:

\[ R^2 = \frac{SS_{model}}{SS_{total}} \]

  • Does the more complex model provide a significantly better fit to the data than the simplest model? This is what the \(F\) ratio tells us.

\[ F = \frac{MS_{model}}{MS_{error}} \]

  • That is, the regression \(F\)-ratio tells us how much the regression model has improved the prediction over the simple mean model, relative to the remaining inaccuracy in the regression model. Here \(MS_{model} = SS_{model} / df_{model}\) and \(MS_{error} = SS_{error} / df_{error}\), with \(df_{model} = 1\) and \(df_{error} = n - 2\) for simple linear regression.

  • The \(F\) ratio tells us if the overall regression model provides a better fit to the data than the simple model, but we can also ask questions about the best fitting \(\beta\) values (i.e., is either \(\beta\) different from zero?).

  • We won’t prove this here, but it turns out the best fitting \(\beta\) values (i.e., \(\hat{\beta}\)) can be tested with a \(t\)-test. The sketch below checks these sums of squares, the \(F\) ratio, and the slope \(t\)-test by hand.
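
The following sketch computes these quantities for the simulated x, y, X, and beta_hat from the sketches above, and then checks them against the output of summary(). The variable names (y_hat, ss_error, and so on) are just for illustration.

## predicted values from the analytic beta estimates computed above
y_hat <- as.numeric(X %*% beta_hat)

## sums of squares
ss_error <- sum((y - y_hat)^2)
ss_total <- sum((y - mean(y))^2)
ss_model <- sum((mean(y) - y_hat)^2)

## for a least-squares fit with an intercept, SS_total = SS_model + SS_error
ss_model + ss_error
ss_total

## proportion of variability accounted for by the full model
ss_model / ss_total

## degrees of freedom for simple linear regression
df_model <- 1
df_error <- length(y) - 2

## mean squares and the F ratio
ms_model <- ss_model / df_model
ms_error <- ss_error / df_error
f_ratio <- ms_model / ms_error
f_ratio

## p-value for the F ratio
pf(f_ratio, df_model, df_error, lower.tail=FALSE)

## t-test on the slope: estimate divided by its standard error
## (in simple linear regression, t^2 equals the F ratio)
se_slope <- sqrt(ms_error / sum((x - mean(x))^2))
beta_hat[2] / se_slope

## summary() reports the same R^2, F statistic, and coefficient t-tests
fm <- lm(y ~ x)
summary(fm)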

28.0.0.3 Example: Predict one MEG channel from another

library(data.table)
library(ggplot2)

rm(list=ls())

d <- fread('https://crossley.github.io/book_stats/data/eeg/epochs.txt')
d[, V1 := NULL]

## convert from wide to long format
dd <- melt(d, id.vars=c('time', 'condition', 'epoch'))

## pick out some columns randomly
y <- dd[variable=='MEG 001', mean(value), .(epoch)][, V1]
x <- dd[variable=='MEG 010', mean(value), .(epoch)][, V1]

## visualise possible linear relationship
ddd <- data.table(x, y)
ggplot(ddd, aes(x, y)) +
  geom_point() +
  geom_smooth(method='lm')
## `geom_smooth()` using formula = 'y ~ x'

## fit a simple linear regression model
fm <- lm(y~x, data=ddd)
summary(fm)
## 
## Call:
## lm(formula = y ~ x, data = ddd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -527.21 -133.24   -4.46  118.84  417.85 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -23.6812    18.1518  -1.305    0.195    
## x             2.5436     0.3759   6.766 1.01e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 180.5 on 97 degrees of freedom
## Multiple R-squared:  0.3206, Adjusted R-squared:  0.3136 
## F-statistic: 45.78 on 1 and 97 DF,  p-value: 1.006e-09
cor(x, y)^2
## [1] 0.3206392

  • A correlation analysis provides information on the strength and direction of the linear relationship between two variables, while a simple linear regression analysis estimates parameters in a linear equation that can be used to predict values of one variable based on the other.

  • Note that Multiple R-squared is equal to cor(x,y)^2

  • \(R^2 = \rho_{x,y}^{2}\), i.e., the squared correlation between \(x\) and \(y\)
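
These links can be checked directly on the fitted model from the example above: the squared Pearson correlation matches the Multiple R-squared value, and the t statistic returned by cor.test() matches the t-test on the slope.

## R^2 reported by the regression equals the squared correlation
summary(fm)$r.squared
cor(x, y)^2

## the t-test on the slope is equivalent to the test of the correlation
summary(fm)$coefficients['x', 't value']
cor.test(x, y)$statistic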

28.0.0.4 Example: Speed-Accuracy trade-off in MIS data

library(data.table)
library(ggplot2)

rm(list=ls())

d <- fread('https://crossley.github.io/book_stats/data/mis/mis_data.csv')

## give subjects in different groups unique numbers
d[group==1, subject := subject + 10]

## define x and y
x <- d[, mean(error, na.rm=T), .(group, subject)][, V1]
y <- d[, mean(movement_time, na.rm=T), .(group, subject)][, V1]

## visualise possible linear relationship
ddd <- data.table(x, y)
ggplot(ddd, aes(x, y)) +
  geom_point() +
  geom_smooth(method='lm')
## `geom_smooth()` using formula = 'y ~ x'

## fit a simple linear regression model
fm <- lm(y~x, data=ddd)
summary(fm)
## 
## Call:
## lm(formula = y ~ x, data = ddd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.820 -11.013  -6.315  12.287  34.197 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  323.951      4.597  70.476  < 2e-16 ***
## x            -12.703      2.154  -5.898 1.39e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.62 on 18 degrees of freedom
## Multiple R-squared:  0.659,  Adjusted R-squared:  0.6401 
## F-statistic: 34.79 on 1 and 18 DF,  p-value: 1.388e-05
cor(x, y)^2
## [1] 0.659039
cor(fm$fitted.values, y)^2
## [1] 0.659039
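
As a final check, tying back to the sums of squares section, the same \(R^2\) can be recovered by computing \(SS_{model}\) and \(SS_{total}\) by hand from the fitted model:

## R^2 = SS_model / SS_total
ss_model <- sum((mean(y) - fm$fitted.values)^2)
ss_total <- sum((y - mean(y))^2)
ss_model / ss_total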