30 General linear model (GLM)
30.1 General linear model: ANOVA and regression have a common base
- The multiple regression model we just considered is an example of the general linear model.
\[\begin{align}
y_{i} &= \beta_{0} + \beta_{1} x_{1_i} + \beta_{2} x_{2_i} + \dots + \beta_{k} x_{k_i} + \epsilon_{i} \\\\
\boldsymbol{y} &= \beta_{0} \begin{bmatrix} 1\\ 1\\ \vdots\\ 1 \end{bmatrix} + \beta_{1} \begin{bmatrix} x_{1_1}\\ x_{1_2}\\ \vdots\\ x_{1_n} \end{bmatrix} + \beta_{2} \begin{bmatrix} x_{2_1}\\ x_{2_2}\\ \vdots\\ x_{2_n} \end{bmatrix} + \ldots + \beta_{k} \begin{bmatrix} x_{k_1}\\ x_{k_2}\\ \vdots\\ x_{k_n} \end{bmatrix} + \begin{bmatrix} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n \end{bmatrix} \\\\
\boldsymbol{y} &= \begin{bmatrix} 1 & x_{1_1} & x_{2_1} & \ldots & x_{k_1} \\ 1 & x_{1_2} & x_{2_2} & \ldots & x_{k_2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1_n} & x_{2_n} & \ldots & x_{k_n} \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{k} \end{bmatrix} + \begin{bmatrix} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n \end{bmatrix} \\\\
\boldsymbol{y} &= \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}
\end{align}\]
In regression, the \(x_{j_i}\) values that make up the design matrix \(\boldsymbol{X}\) – i.e., the regressors – are continuous-valued.
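To make the matrix form concrete, here is a quick sketch (using simulated data and made-up variable names) showing that solving the normal equations \(\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}\boldsymbol{y}\) by hand gives the same estimates as `lm()`:
## simulate a small regression problem with two continuous regressors
set.seed(0)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

## build the design matrix X: a column of 1s plus one column per regressor
X <- cbind(1, x1, x2)

## solve the normal equations by hand: (X'X)^{-1} X'y
solve(t(X) %*% X) %*% t(X) %*% y

## lm() gives the same estimates
coef(lm(y ~ x1 + x2))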
ANOVA can be framed in exactly the same linear model, provided the ANOVA factors are entered as “dummy coded” regressors.
What is dummy coding? Consider a factor with three levels. We can dummy code this as follows:
| Factor level | Dummy variable 1 | Dummy variable 2 |
|---|---|---|
| 1 | 0 | 0 |
| 2 | 1 | 0 |
| 3 | 0 | 1 |
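In R, `model.matrix()` builds exactly this kind of dummy-coded design matrix for us. A quick sketch with a made-up three-level factor:
## a factor with three levels and two observations per level
g <- factor(c(1, 1, 2, 2, 3, 3))

## with R's default treatment contrasts, we get an intercept column
## plus one 0/1 dummy column for each non-baseline level
model.matrix(~ g)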
- This coding leads to the following linear model:
\[\begin{align} \boldsymbol{y} &= \beta_{0} \begin{bmatrix} 1\\ 1\\ \vdots\\ 1\\ 1\\ \vdots\\ 1\\ 1\\ \end{bmatrix} + \beta_{1} \begin{bmatrix} 0\\ 0\\ \vdots\\ 1\\ 1\\ \vdots\\ 0\\ 0 \end{bmatrix} + \beta_{2} \begin{bmatrix} 0\\ 0\\ \vdots\\ 0\\ 0\\ \vdots\\ 1\\ 1 \end{bmatrix} + \begin{bmatrix} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_{m-1}\\ \epsilon_m\\ \vdots\\ \epsilon_{n-1}\\ \epsilon_n \end{bmatrix} \\\\ \boldsymbol{y} &= \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ \vdots & \vdots & \vdots\\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ \vdots & \vdots & \vdots\\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \beta_{2} \end{bmatrix} + \begin{bmatrix} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_{m-1}\\ \epsilon_m\\ \vdots\\ \epsilon_{n-1}\\ \epsilon_n\\ \end{bmatrix} \\\\ \boldsymbol{y} &= \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \end{align}\]
- With this coding, it’s straightforward to show the following, where \(\bar{y}_{j}\) is the mean of \(y\) for factor level \(j\):
\[\begin{align} \beta_0 &= \bar{y}_{1} \\ \beta_1 &= \bar{y}_{2} - \bar{y}_{1} \\ \beta_2 &= \bar{y}_{3} - \bar{y}_{1} \end{align}\]
- That is, the \(\beta\) estimates correspond to group means and differences between group means, which is exactly what ANOVA is after.
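A quick simulation (with made-up group means of 10, 15, and 20) illustrates this correspondence:
## simulate three groups of 50 observations with known means
set.seed(1)
g <- factor(rep(1:3, each = 50))
y <- c(rnorm(50, mean = 10), rnorm(50, mean = 15), rnorm(50, mean = 20))

## group means of y
tapply(y, g, mean)

## with dummy coding, the intercept estimates the group 1 mean, and the
## remaining coefficients estimate differences from the group 1 mean
coef(lm(y ~ g))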
30.1.0.1 Example: Criterion learning data
library(data.table)
library(ggplot2)

d <- fread('https://crossley.github.io/book_stats/data/criterion_learning/crit_learn.csv')
## We will answer this question: Are there significant differences in the
## mean `t2c` per subject between conditions? For simplicity, only consider
## the `Delay`, `Long ITI`, and `Short ITI` conditions.

## clean up the data (just trust me that this makes sense to do here)
dd <- unique(d[prob_num <= 3 &
               nps >= 3 &
               cnd %in% c('Delay', 'Long ITI', 'Short ITI'),
               .(cnd, sub, prob_num, t2c)])
ddd <- dd[, .(t2c=mean(t2c)), .(cnd, sub)]
ddd[, sub := factor(sub)]
ddd[, cnd := factor(cnd)]
ggplot(ddd, aes(x=cnd, y=t2c)) +
geom_boxplot() +
theme(legend.position='none')
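The model fit itself isn’t shown above, but from the `Call:` line in the output it came from something like the following (a sketch; note that the `cnd1` and `cnd2` coefficient names in the output suggest that sum-to-zero contrasts were in effect when it was generated, rather than R’s default dummy coding):
## fit the linear model predicting t2c from condition
fit <- lm(t2c ~ cnd, data = ddd)
summary(fit)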
##
## Call:
## lm(formula = t2c ~ cnd, data = ddd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.593 -21.259 -12.481 7.519 106.949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.708 5.109 9.142 6.52e-12 ***
## cnd1 5.884 7.017 0.839 0.406
## cnd2 -2.894 7.017 -0.412 0.682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.35 on 46 degrees of freedom
## Multiple R-squared: 0.01518, Adjusted R-squared: -0.02764
## F-statistic: 0.3544 on 2 and 46 DF, p-value: 0.7035
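The ANOVA table below comes from passing the same fitted model to `anova()` (again a sketch, reusing the `fit` object from above):
anova(fit)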
## Analysis of Variance Table
##
## Response: t2c
## Df Sum Sq Mean Sq F value Pr(>F)
## cnd 2 886 442.82 0.3544 0.7035
## Residuals 46 57475 1249.45
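The three numbers printed below match the `Delay` condition mean and the differences of the `Long ITI` and `Short ITI` means from it. One way to compute them (a sketch, assuming these condition labels):
## mean t2c in the Delay condition
ddd[cnd == 'Delay', mean(t2c)]

## Long ITI mean minus Delay mean
ddd[cnd == 'Long ITI', mean(t2c)] - ddd[cnd == 'Delay', mean(t2c)]

## Short ITI mean minus Delay mean
ddd[cnd == 'Short ITI', mean(t2c)] - ddd[cnd == 'Delay', mean(t2c)]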
## [1] 52.59259
## [1] -8.777778
## [1] -8.874644
Cool! We can see that R does the dummy coding for us, and it automatically sets the baseline to the `Delay` condition. What if you want to assign a different condition as the baseline? That’s beyond this lecture, but you should keep taking classes and learning!
In summary, a regression with dummy-coded factors as regressors is equivalent to an ANOVA, and both are just instances of the general linear model.