# load libraries

library(data.table)
library(ggplot2)

# clean the workspace

rm(list = ls())

# init colorscheme

COL <- c("#2271B2", "#E69F00", "#D55E00")
names(COL) <- c("blue", "orange", "red")
theme_set(
  theme_minimal(base_size = 13) +
    theme(
      panel.grid.minor = element_blank(),
      strip.text = element_text(face = "bold"),
      legend.position = "bottom"
    )
)
update_geom_defaults("point", list(size = 2))
update_geom_defaults("line", list(linewidth = 0.8))

Overview

So far, you have worked with simple linear regression, where one predictor is used to explain one outcome. Multiple regression extends that framework by including two or more predictors at the same time.

In this tutorial, you will use participant-level summaries from a category learning dataset. The key point is the same as last week: before fitting an inferential model, make sure each row is an independent participant or unit.

You will model late-session accuracy using:

  • one continuous predictor: early_accuracy
  • one categorical predictor: group
  • their interaction

Part 1 - Introducing the data

Load the block-level summary file.

d <- fread("data/cat_learn_rb_ii_summary.csv")
str(d)
## Classes 'data.table' and 'data.frame':   1000 obs. of  4 variables:
##  $ sub_id      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ group       : chr  "RB" "RB" "RB" "RB" ...
##  $ block       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ prop_correct: num  0.55 0.5 0.6 0.7 0.55 0.55 0.6 0.55 0.6 0.5 ...
##  - attr(*, ".internal.selfref")=<externalptr>
head(d)
##    sub_id  group block prop_correct
##     <int> <char> <int>        <num>
## 1:      1     RB     1         0.55
## 2:      1     RB     2         0.50
## 3:      1     RB     3         0.60
## 4:      1     RB     4         0.70
## 5:      1     RB     5         0.55
## 6:      1     RB     6         0.55

Each row here is one participant in one block. That means the file is still at the repeated-measures level. For the regression in this tutorial, first aggregate to one row per participant.

Create participant-level summaries:

  • early_accuracy - mean accuracy across Blocks 1 to 5
  • late_accuracy - mean accuracy across Blocks 21 to 25
  • learning_gain - late minus early accuracy

d_sub <- d[, .(
  early_accuracy = mean(prop_correct[block %in% 1:5]),
  late_accuracy  = mean(prop_correct[block %in% 21:25]),
  learning_gain  = mean(prop_correct[block %in% 21:25]) -
                   mean(prop_correct[block %in% 1:5])
), by = .(sub_id, group)]

d_sub[, group_fac := factor(group, levels = c("II", "RB"))]

d_sub
##     sub_id  group early_accuracy late_accuracy learning_gain group_fac
##      <int> <char>          <num>         <num>         <num>    <fctr>
##  1:      1     RB           0.58          0.94          0.36        RB
##  2:      2     RB           0.60          0.93          0.33        RB
##  3:      3     RB           0.63          0.91          0.28        RB
##  4:      4     RB           0.46          0.90          0.44        RB
##  5:      5     RB           0.51          0.94          0.43        RB
##  6:      6     RB           0.45          0.93          0.48        RB
##  7:      7     RB           0.58          0.97          0.39        RB
##  8:      8     RB           0.58          0.87          0.29        RB
##  9:      9     RB           0.47          0.95          0.48        RB
## 10:     10     RB           0.63          0.95          0.32        RB
## 11:     11     RB           0.52          0.91          0.39        RB
## 12:     12     RB           0.55          0.91          0.36        RB
## 13:     13     RB           0.45          0.95          0.50        RB
## 14:     14     RB           0.61          0.92          0.31        RB
## 15:     15     RB           0.55          0.93          0.38        RB
## 16:     16     RB           0.51          0.93          0.42        RB
## 17:     17     RB           0.51          0.88          0.37        RB
## 18:     18     RB           0.60          0.90          0.30        RB
## 19:     19     RB           0.54          0.94          0.40        RB
## 20:     20     RB           0.60          0.90          0.30        RB
## 21:     21     II           0.47          0.86          0.39        II
## 22:     22     II           0.51          0.85          0.34        II
## 23:     23     II           0.50          0.90          0.40        II
## 24:     24     II           0.49          0.84          0.35        II
## 25:     25     II           0.54          0.88          0.34        II
## 26:     26     II           0.58          0.87          0.29        II
## 27:     27     II           0.46          0.88          0.42        II
## 28:     28     II           0.54          0.85          0.31        II
## 29:     29     II           0.43          0.89          0.46        II
## 30:     30     II           0.52          0.84          0.32        II
## 31:     31     II           0.50          0.89          0.39        II
## 32:     32     II           0.47          0.92          0.45        II
## 33:     33     II           0.47          0.82          0.35        II
## 34:     34     II           0.57          0.86          0.29        II
## 35:     35     II           0.48          0.91          0.43        II
## 36:     36     II           0.52          0.87          0.35        II
## 37:     37     II           0.51          0.85          0.34        II
## 38:     38     II           0.49          0.88          0.39        II
## 39:     39     II           0.42          0.89          0.47        II
## 40:     40     II           0.56          0.92          0.36        II
##     sub_id  group early_accuracy late_accuracy learning_gain group_fac

At this point, each row is one participant. That is the correct level for the regression models below.
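Before moving on, it is worth verifying that the aggregation really did produce one row per participant. A minimal sketch of the check, using a toy data frame in place of d_sub (the equivalent call on the real table is shown as a comment):

```r
# Toy stand-in for d_sub; values are illustrative only
toy <- data.frame(sub_id = c(1, 2, 3), group = c("RB", "RB", "II"))

# One row per participant means no sub_id occurs more than once
stopifnot(anyDuplicated(toy$sub_id) == 0)

# The same check on the real table:
# stopifnot(anyDuplicated(d_sub$sub_id) == 0)
```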


Part 2 - Visualise the participant-level relationship

Before fitting a model, plot the data.

ggplot(d_sub, aes(x = early_accuracy, y = late_accuracy, colour = group_fac)) +
  geom_point(size = 2, alpha = 0.8) +
  labs(
    x = "Early-session accuracy",
    y = "Late-session accuracy",
    colour = "Group",
    title = "Participant-level category learning summaries"
  ) +
  scale_colour_manual(values = COL)
Late accuracy as a function of early accuracy, coloured by learning group.

This plot suggests two things:

  • participants who start stronger may also finish stronger
  • the RB and II groups may follow different patterns

Multiple regression lets you test both ideas in one model.
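Simple group descriptives can make the second point concrete before any model is fitted. A sketch with made-up numbers; on the real table you would write d_sub[, .(mean_late = mean(late_accuracy)), by = group_fac]:

```r
# Toy data; the values are illustrative, not from the tutorial file
toy <- data.frame(
  group = c("RB", "RB", "II", "II"),
  late  = c(0.94, 0.90, 0.86, 0.88)
)

# Mean late accuracy per group
tapply(toy$late, toy$group, mean)
##   II   RB
## 0.87 0.92
```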


Part 3 - Simple regression baseline

Start with the simplest model: predict late accuracy from early accuracy alone.

fit_simple <- lm(late_accuracy ~ early_accuracy, data = d_sub)
summary(fit_simple)
## 
## Call:
## lm(formula = late_accuracy ~ early_accuracy, data = d_sub)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.071098 -0.028565  0.004565  0.028455  0.064333 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.82885    0.05415  15.305   <2e-16 ***
## early_accuracy  0.13245    0.10278   1.289    0.205    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03592 on 38 degrees of freedom
## Multiple R-squared:  0.04187,    Adjusted R-squared:  0.01666 
## F-statistic: 1.661 on 1 and 38 DF,  p-value: 0.2053

This model asks whether participants who are more accurate early in training also tend to be more accurate late in training.


Part 4 - Add a categorical predictor

Now add group as a second predictor.

fit_group <- lm(late_accuracy ~ early_accuracy + group_fac, data = d_sub)
summary(fit_group)
## 
## Call:
## lm(formula = late_accuracy ~ early_accuracy + group_fac, data = d_sub)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.055384 -0.019800  0.002623  0.016461  0.049999 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.903494   0.042314  21.352  < 2e-16 ***
## early_accuracy -0.059808   0.083532  -0.716    0.478    
## group_facRB     0.052191   0.009233   5.653 1.85e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02667 on 37 degrees of freedom
## Multiple R-squared:  0.4859, Adjusted R-squared:  0.4581 
## F-statistic: 17.49 on 2 and 37 DF,  p-value: 4.512e-06

Because group_fac is a factor, R creates a dummy-coded comparison. With II as the reference level:

  • the intercept is the predicted late accuracy for an II participant with early_accuracy = 0
  • the coefficient for early_accuracy is the slope for II participants
  • the coefficient for group_facRB is the RB - II difference in late accuracy, holding early accuracy constant
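You can inspect the dummy coding directly with model.matrix(), which shows the design matrix lm() actually fits. A minimal sketch on toy data; the same call works on d_sub:

```r
# Toy data mirroring the structure of d_sub
toy <- data.frame(
  early_accuracy = c(0.50, 0.55, 0.48, 0.60),
  group_fac = factor(c("II", "II", "RB", "RB"), levels = c("II", "RB"))
)

# RB rows get group_facRB = 1; II rows, the reference level, get 0
model.matrix(~ early_accuracy + group_fac, data = toy)
```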

Compare the R-squared values from the two models.

summary(fit_simple)$r.squared
## [1] 0.04187443
summary(fit_group)$r.squared
## [1] 0.4858977

R-squared rises from about 0.04 to 0.49 here. Keep in mind, though, that plain R-squared never decreases when a predictor is added to a nested model, so an increase on its own is weak evidence. Check adjusted R-squared, or the formal nested-model test below, before concluding that group explains additional variance above and beyond early accuracy.
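To see why adjusted R-squared is the safer comparison, here is a sketch on toy data in which the added predictor z is pure noise by construction; plain R-squared still rises, while adjusted R-squared penalises the extra parameter:

```r
set.seed(4)  # toy data; z is unrelated to y by construction
toy <- data.frame(x = rnorm(40), z = rnorm(40))
toy$y <- 0.5 * toy$x + rnorm(40)

m1 <- lm(y ~ x, data = toy)
m2 <- lm(y ~ x + z, data = toy)

# Plain R-squared can only rise (or stay equal) in the larger model
summary(m1)$r.squared <= summary(m2)$r.squared

# Adjusted R-squared can fall when the new predictor adds nothing
c(adj_m1 = summary(m1)$adj.r.squared, adj_m2 = summary(m2)$adj.r.squared)
```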


Part 5 - Add an interaction

The additive model above assumes that the relationship between early and late accuracy is the same in both groups. To test whether that slope differs by group, add an interaction.

fit_interact <- lm(late_accuracy ~ early_accuracy * group_fac, data = d_sub)
summary(fit_interact)
## 
## Call:
## lm(formula = late_accuracy ~ early_accuracy * group_fac, data = d_sub)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.056096 -0.020356  0.003024  0.016455  0.051320 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 0.91482    0.07242  12.632 8.56e-15 ***
## early_accuracy             -0.08240    0.14390  -0.573    0.570    
## group_facRB                 0.03433    0.09248   0.371    0.713    
## early_accuracy:group_facRB  0.03454    0.17794   0.194    0.847    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02702 on 36 degrees of freedom
## Multiple R-squared:  0.4864, Adjusted R-squared:  0.4436 
## F-statistic: 11.37 on 3 and 36 DF,  p-value: 2.162e-05

Interpret the coefficients like this:

  • Intercept - predicted late accuracy for an II participant when early_accuracy = 0
  • early_accuracy - slope of early -> late accuracy for II participants
  • group_facRB - difference in intercept between RB and II
  • early_accuracy:group_facRB - difference in slope between RB and II

If the interaction is significant, the relationship between early and late accuracy differs across groups.
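One detail that is easy to get wrong: in the interaction model, the RB slope is not a coefficient you can read off directly; it is the II slope plus the interaction term. A sketch on simulated toy data (not the tutorial file), cross-checked against a separate fit to the RB subset:

```r
set.seed(1)  # simulated toy data for illustration
toy <- data.frame(
  x = runif(40, 0.4, 0.65),
  g = factor(rep(c("II", "RB"), each = 20), levels = c("II", "RB"))
)
toy$y <- ifelse(toy$g == "RB", 0.90 + 0.10 * toy$x, 0.87 - 0.05 * toy$x) +
  rnorm(40, sd = 0.02)

fit <- lm(y ~ x * g, data = toy)
b <- coef(fit)

slope_II <- b[["x"]]                  # slope for the reference group
slope_RB <- b[["x"]] + b[["x:gRB"]]   # main effect + interaction

# Cross-check: fitting the RB subset alone recovers the same slope
fit_rb <- lm(y ~ x, data = toy[toy$g == "RB", ])
all.equal(slope_RB, unname(coef(fit_rb)["x"]))
## [1] TRUE
```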

Visualise the fitted lines

ggplot(d_sub, aes(x = early_accuracy, y = late_accuracy, colour = group_fac)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Early-session accuracy",
    y = "Late-session accuracy",
    colour = "Group",
    title = "Multiple regression with group-specific slopes"
  ) +
  scale_colour_manual(values = COL)
Regression lines from the interaction model.


Part 6 - Model comparison

When models are nested, anova() lets you test whether the added term improves fit.

anova(fit_simple, fit_group, fit_interact)
## Analysis of Variance Table
## 
## Model 1: late_accuracy ~ early_accuracy
## Model 2: late_accuracy ~ early_accuracy + group_fac
## Model 3: late_accuracy ~ early_accuracy * group_fac
##   Res.Df      RSS Df Sum of Sq       F   Pr(>F)    
## 1     38 0.049034                                  
## 2     37 0.026310  1 0.0227240 31.1253 2.54e-06 ***
## 3     36 0.026283  1 0.0000275  0.0377   0.8472    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This table shows whether:

  • adding group improves fit beyond early accuracy alone
  • adding the interaction improves fit beyond the additive model

You can also compare the models with AIC.

AIC(fit_simple, fit_group, fit_interact)
##              df       AIC
## fit_simple    3 -148.6494
## fit_group     4 -171.5516
## fit_interact  5 -169.5935

Lower AIC indicates the better balance of fit and parsimony.
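A common convention is to report differences from the smallest AIC rather than the raw values. A sketch with toy models on made-up data; the same idea applies directly to fit_simple, fit_group, and fit_interact:

```r
set.seed(2)  # toy data for illustration
toy <- data.frame(x = rnorm(40), z = rnorm(40))
toy$y <- 0.5 * toy$x + rnorm(40)

m1 <- lm(y ~ x, data = toy)
m2 <- lm(y ~ x + z, data = toy)

aic_tab <- AIC(m1, m2)
aic_tab$delta <- aic_tab$AIC - min(aic_tab$AIC)  # 0 marks the preferred model
aic_tab
```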


Part 7 - Multiple regression assumptions

Because d_sub has one row per participant, the independence assumption is plausible here. The remaining assumptions are the usual regression checks:

  1. linearity
  2. homoscedasticity
  3. approximately normal residuals

d_diag <- data.table(fitted = fitted(fit_interact), resid = resid(fit_interact))

ggplot(d_diag, aes(x = fitted, y = resid)) +
  geom_point(colour = COL[1], alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    x = "Fitted values",
    y = "Residuals",
    title = "Residuals vs fitted"
  )

ggplot(d_diag, aes(sample = resid)) +
  stat_qq(colour = COL[1]) +
  stat_qq_line(colour = COL[2]) +
  labs(
    x = "Theoretical quantiles",
    y = "Sample quantiles",
    title = "Q-Q plot of residuals"
  )

Look for residuals scattered around zero and Q-Q points that stay reasonably close to the diagonal.
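If you want a numeric check alongside the plots, a Shapiro-Wilk test on the residuals is one option, though with a sample this small it should supplement, not replace, the visual inspection. A sketch on simulated stand-in residuals; on the real model you would pass resid(fit_interact):

```r
set.seed(3)  # simulated stand-in for the model residuals
toy_resid <- rnorm(40, sd = 0.03)

# Large p-values are consistent with approximately normal residuals
shapiro.test(toy_resid)
```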


Apply this to your assigned dataset

Your assigned dataset is determined by your student ID number. Take the last digit and compute last_digit %% 3 in R.
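For example, with a hypothetical student ID ending in 7 (the ID below is made up):

```r
id <- 123457           # hypothetical student ID
last_digit <- id %% 10
last_digit %% 3        # your assigned dataset index
## [1] 1
```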

Load your assigned dataset and fit a multiple regression model. Specifically:

  1. Decide on an independent unit of analysis before you fit the model.
  2. Create one summary row per participant or unit if needed.
  3. Identify a continuous outcome and at least two predictors.
  4. Fit a simple regression, then add a second predictor.
  5. Test an interaction if it is theoretically meaningful.
  6. Compare the nested models with anova().
  7. Check the residual plots.
  8. Write a short paragraph interpreting the final model.