# load libraries
library(data.table)
library(ggplot2)
# clean work space
rm(list = ls())
# init colorscheme
COL <- c("#2271B2", "#E69F00", "#D55E00")
names(COL) <- c("blue", "orange", "red")
theme_set(
  theme_minimal(base_size = 13) +
    theme(
      panel.grid.minor = element_blank(),
      strip.text = element_text(face = "bold"),
      legend.position = "bottom"
    )
)
update_geom_defaults("point", list(size = 2))
update_geom_defaults("line", list(linewidth = 0.8))
So far, you have worked with simple linear regression, where one predictor is used to explain one outcome. Multiple regression extends that framework by including two or more predictors at the same time.
In this tutorial, you will use participant-level summaries from a category learning dataset. The key point is the same as last week: before fitting an inferential model, make sure each row is an independent participant or unit.
You will model late-session accuracy using:
- early_accuracy (a continuous predictor)
- group (a categorical predictor)

Load the block-level summary file.
d <- fread("data/cat_learn_rb_ii_summary.csv")
str(d)
## Classes 'data.table' and 'data.frame': 1000 obs. of 4 variables:
## $ sub_id : int 1 1 1 1 1 1 1 1 1 1 ...
## $ group : chr "RB" "RB" "RB" "RB" ...
## $ block : int 1 2 3 4 5 6 7 8 9 10 ...
## $ prop_correct: num 0.55 0.5 0.6 0.7 0.55 0.55 0.6 0.55 0.6 0.5 ...
## - attr(*, ".internal.selfref")=<externalptr>
head(d)
## sub_id group block prop_correct
## <int> <char> <int> <num>
## 1: 1 RB 1 0.55
## 2: 1 RB 2 0.50
## 3: 1 RB 3 0.60
## 4: 1 RB 4 0.70
## 5: 1 RB 5 0.55
## 6: 1 RB 6 0.55
Each row here is one participant in one block. That means the file is still at the repeated-measures level. For the regression in this tutorial, first aggregate to one row per participant.
Create participant-level summaries:
- early_accuracy - mean accuracy across Blocks 1 to 5
- late_accuracy - mean accuracy across Blocks 21 to 25
- learning_gain - late minus early accuracy

d_sub <- d[, .(
  early_accuracy = mean(prop_correct[block %in% 1:5]),
  late_accuracy = mean(prop_correct[block %in% 21:25]),
  learning_gain = mean(prop_correct[block %in% 21:25]) -
    mean(prop_correct[block %in% 1:5])
), by = .(sub_id, group)]
d_sub[, group_fac := factor(group, levels = c("II", "RB"))]
d_sub
## sub_id group early_accuracy late_accuracy learning_gain group_fac
## <int> <char> <num> <num> <num> <fctr>
## 1: 1 RB 0.58 0.94 0.36 RB
## 2: 2 RB 0.60 0.93 0.33 RB
## 3: 3 RB 0.63 0.91 0.28 RB
## 4: 4 RB 0.46 0.90 0.44 RB
## 5: 5 RB 0.51 0.94 0.43 RB
## 6: 6 RB 0.45 0.93 0.48 RB
## 7: 7 RB 0.58 0.97 0.39 RB
## 8: 8 RB 0.58 0.87 0.29 RB
## 9: 9 RB 0.47 0.95 0.48 RB
## 10: 10 RB 0.63 0.95 0.32 RB
## 11: 11 RB 0.52 0.91 0.39 RB
## 12: 12 RB 0.55 0.91 0.36 RB
## 13: 13 RB 0.45 0.95 0.50 RB
## 14: 14 RB 0.61 0.92 0.31 RB
## 15: 15 RB 0.55 0.93 0.38 RB
## 16: 16 RB 0.51 0.93 0.42 RB
## 17: 17 RB 0.51 0.88 0.37 RB
## 18: 18 RB 0.60 0.90 0.30 RB
## 19: 19 RB 0.54 0.94 0.40 RB
## 20: 20 RB 0.60 0.90 0.30 RB
## 21: 21 II 0.47 0.86 0.39 II
## 22: 22 II 0.51 0.85 0.34 II
## 23: 23 II 0.50 0.90 0.40 II
## 24: 24 II 0.49 0.84 0.35 II
## 25: 25 II 0.54 0.88 0.34 II
## 26: 26 II 0.58 0.87 0.29 II
## 27: 27 II 0.46 0.88 0.42 II
## 28: 28 II 0.54 0.85 0.31 II
## 29: 29 II 0.43 0.89 0.46 II
## 30: 30 II 0.52 0.84 0.32 II
## 31: 31 II 0.50 0.89 0.39 II
## 32: 32 II 0.47 0.92 0.45 II
## 33: 33 II 0.47 0.82 0.35 II
## 34: 34 II 0.57 0.86 0.29 II
## 35: 35 II 0.48 0.91 0.43 II
## 36: 36 II 0.52 0.87 0.35 II
## 37: 37 II 0.51 0.85 0.34 II
## 38: 38 II 0.49 0.88 0.39 II
## 39: 39 II 0.42 0.89 0.47 II
## 40: 40 II 0.56 0.92 0.36 II
## sub_id group early_accuracy late_accuracy learning_gain group_fac
At this point, each row is one participant. That is the correct level for the regression models below.
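A quick sanity check after aggregating is that every sub_id appears exactly once. A minimal sketch of the idiom on a toy table (the same check applies to d_sub):

```r
library(data.table)

# toy block-level table: 2 participants x 3 blocks each
d_toy <- data.table(
  sub_id = rep(1:2, each = 3),
  block = rep(1:3, times = 2),
  prop_correct = c(0.50, 0.60, 0.70, 0.40, 0.50, 0.60)
)

# aggregate to one row per participant, then confirm the row counts
d_agg <- d_toy[, .(mean_acc = mean(prop_correct)), by = sub_id]
d_agg[, .N, by = sub_id][, all(N == 1)]
```

If this returns FALSE, the table is still at the repeated-measures level and is not yet ready for the models below.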
Before fitting a model, plot the data.
ggplot(d_sub, aes(x = early_accuracy, y = late_accuracy, colour = group_fac)) +
  geom_point(size = 2, alpha = 0.8) +
  labs(
    x = "Early-session accuracy",
    y = "Late-session accuracy",
    colour = "Group",
    title = "Participant-level category learning summaries"
  ) +
  scale_colour_manual(values = COL)
Late accuracy as a function of early accuracy, coloured by learning group.
This plot suggests two things: the RB group appears to reach higher late-session accuracy than the II group, and early accuracy may carry some additional information about late accuracy.
Multiple regression lets you test both ideas in one model.
Start with the simplest model: predict late accuracy from early accuracy alone.
fit_simple <- lm(late_accuracy ~ early_accuracy, data = d_sub)
summary(fit_simple)
##
## Call:
## lm(formula = late_accuracy ~ early_accuracy, data = d_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.071098 -0.028565 0.004565 0.028455 0.064333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.82885 0.05415 15.305 <2e-16 ***
## early_accuracy 0.13245 0.10278 1.289 0.205
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03592 on 38 degrees of freedom
## Multiple R-squared: 0.04187, Adjusted R-squared: 0.01666
## F-statistic: 1.661 on 1 and 38 DF, p-value: 0.2053
This model asks whether participants who are more accurate early in training also tend to be more accurate late in training.
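Beyond the p-value, a confidence interval shows the plausible range for that slope. A minimal sketch with simulated data, not the tutorial dataset (the same confint() call applies directly to fit_simple):

```r
set.seed(2)
# toy regression: y tracks x with a true slope of 2
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50, sd = 0.5)
fit <- lm(y ~ x)

# 95% confidence interval for each coefficient
confint(fit)
```

An interval that excludes zero tells the same story as a small p-value, but in units of the outcome.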
Now add group as a second predictor.
fit_group <- lm(late_accuracy ~ early_accuracy + group_fac, data = d_sub)
summary(fit_group)
##
## Call:
## lm(formula = late_accuracy ~ early_accuracy + group_fac, data = d_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.055384 -0.019800 0.002623 0.016461 0.049999
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.903494 0.042314 21.352 < 2e-16 ***
## early_accuracy -0.059808 0.083532 -0.716 0.478
## group_facRB 0.052191 0.009233 5.653 1.85e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02667 on 37 degrees of freedom
## Multiple R-squared: 0.4859, Adjusted R-squared: 0.4581
## F-statistic: 17.49 on 2 and 37 DF, p-value: 4.512e-06
Because group_fac is a factor, R creates a dummy-coded
comparison. With II as the reference level:

- (Intercept) is the predicted late accuracy for II participants with early_accuracy = 0
- early_accuracy is the slope relating early to late accuracy (shared by both groups in this additive model)
- group_facRB is the RB - II difference, holding early accuracy constant

Compare the R-squared values from the two models.
summary(fit_simple)$r.squared
## [1] 0.04187443
summary(fit_group)$r.squared
## [1] 0.4858977
If R-squared increases, the group predictor explains additional variance above and beyond early accuracy.
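If the dummy coding is unfamiliar, model.matrix() makes it visible by printing the design matrix R builds from a formula. A small made-up example, not the tutorial data:

```r
# toy two-level factor with II as the reference level, as in the tutorial
g <- factor(c("II", "II", "RB", "RB"), levels = c("II", "RB"))
x <- c(0.4, 0.5, 0.5, 0.6)  # stand-in for early_accuracy

# columns: intercept, x, and a 0/1 dummy gRB marking the RB rows
model.matrix(~ x + g)
```

The gRB column is 0 for reference-level (II) rows and 1 for RB rows, which is why the group_facRB coefficient reads as an RB - II difference.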
The additive model above assumes that the relationship between early and late accuracy is the same in both groups. To test whether that slope differs by group, add an interaction.
fit_interact <- lm(late_accuracy ~ early_accuracy * group_fac, data = d_sub)
summary(fit_interact)
##
## Call:
## lm(formula = late_accuracy ~ early_accuracy * group_fac, data = d_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.056096 -0.020356 0.003024 0.016455 0.051320
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.91482 0.07242 12.632 8.56e-15 ***
## early_accuracy -0.08240 0.14390 -0.573 0.570
## group_facRB 0.03433 0.09248 0.371 0.713
## early_accuracy:group_facRB 0.03454 0.17794 0.194 0.847
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02702 on 36 degrees of freedom
## Multiple R-squared: 0.4864, Adjusted R-squared: 0.4436
## F-statistic: 11.37 on 3 and 36 DF, p-value: 2.162e-05
Interpret the coefficients like this:

- (Intercept) is the predicted late accuracy for II participants at early_accuracy = 0
- early_accuracy is the slope for II participants
- group_facRB is the RB - II difference at early_accuracy = 0
- early_accuracy:group_facRB is how much the RB slope differs from the II slope

If the interaction is significant, the relationship between early and late accuracy differs across groups.
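One practical consequence of this coding: each group's slope can be recovered from the coefficients, since the RB slope is the II slope plus the interaction term. A sketch with simulated data (the same arithmetic applies to coef(fit_interact)):

```r
set.seed(1)
# simulate two groups with deliberately different slopes (0.5 vs 1.0)
n <- 40
grp <- factor(rep(c("II", "RB"), each = n), levels = c("II", "RB"))
x <- runif(2 * n, 0.4, 0.7)
y <- ifelse(grp == "II", 0.40 + 0.5 * x, 0.20 + 1.0 * x) + rnorm(2 * n, sd = 0.02)

fit <- lm(y ~ x * grp)
b <- coef(fit)

# II slope is the x coefficient; the RB slope adds the interaction term
c(II = unname(b["x"]), RB = unname(b["x"] + b["x:grpRB"]))
```

The recovered slopes should land near the simulated values of 0.5 and 1.0.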
ggplot(d_sub, aes(x = early_accuracy, y = late_accuracy, colour = group_fac)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Early-session accuracy",
    y = "Late-session accuracy",
    colour = "Group",
    title = "Multiple regression with group-specific slopes"
  ) +
  scale_colour_manual(values = COL)
Regression lines from the interaction model.
When models are nested, anova() lets you test whether
the added term improves fit.
anova(fit_simple, fit_group, fit_interact)
## Analysis of Variance Table
##
## Model 1: late_accuracy ~ early_accuracy
## Model 2: late_accuracy ~ early_accuracy + group_fac
## Model 3: late_accuracy ~ early_accuracy * group_fac
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 38 0.049034
## 2 37 0.026310 1 0.0227240 31.1253 2.54e-06 ***
## 3 36 0.026283 1 0.0000275 0.0377 0.8472
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This table shows whether:

- group improves fit beyond early accuracy alone
- the interaction improves fit beyond the additive model

You can also compare the models with AIC.
AIC(fit_simple, fit_group, fit_interact)
## df AIC
## fit_simple 3 -148.6494
## fit_group 4 -171.5516
## fit_interact 5 -169.5935
Lower AIC indicates the better balance of fit and parsimony.
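Raw AIC values have no absolute meaning; only differences between models matter. A self-contained sketch of computing those differences, using toy nested models rather than the tutorial fits (the same idiom applies to fit_simple, fit_group, and fit_interact):

```r
set.seed(3)
# toy nested models: x2 is irrelevant by construction
x1 <- rnorm(60)
x2 <- rnorm(60)
y <- 1 + 0.8 * x1 + rnorm(60)

m1 <- lm(y ~ x1)
m2 <- lm(y ~ x1 + x2)

# difference from the smallest AIC; a 0 marks the preferred model
aic_tab <- AIC(m1, m2)
aic_tab$dAIC <- aic_tab$AIC - min(aic_tab$AIC)
aic_tab
```

A common rule of thumb treats differences under about 2 as negligible, so a dAIC column makes the comparison easier to read than the raw values.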
Because d_sub has one row per participant, the
independence assumption is appropriate here. The remaining assumptions
are the usual regression checks: linearity, approximately normal
residuals, and constant residual variance.
d_diag <- data.table(fitted = fitted(fit_interact), resid = resid(fit_interact))
ggplot(d_diag, aes(x = fitted, y = resid)) +
  geom_point(colour = COL[1], alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    x = "Fitted values",
    y = "Residuals",
    title = "Residuals vs fitted"
  )
ggplot(d_diag, aes(sample = resid)) +
  stat_qq(colour = COL[1]) +
  stat_qq_line(colour = COL[2]) +
  labs(
    x = "Theoretical quantiles",
    y = "Sample quantiles",
    title = "Q-Q plot of residuals"
  )
Look for residuals scattered around zero and Q-Q points that stay reasonably close to the diagonal.
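If you want a numeric companion to the Q-Q plot, shapiro.test() runs a Shapiro-Wilk test of residual normality (treat it as a rough guide, not a gatekeeper). A self-contained sketch; in the tutorial the call would be shapiro.test(resid(fit_interact)):

```r
set.seed(4)
# toy fit whose residuals are normal by construction
x <- runif(40)
y <- 0.8 + 0.1 * x + rnorm(40, sd = 0.03)
fit <- lm(y ~ x)

# a large p-value is consistent with (not proof of) normal residuals
shapiro.test(resid(fit))
```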
Your assigned dataset is determined by your student ID number. Take
the last digit and compute last_digit %% 3 in R.

- 0: https://github.com/crossley/cogs2020/tree/main/final_project_data/cat_learn_switch
- 1: https://github.com/crossley/cogs2020/tree/main/final_project_data/cat_learn_auto
- 2: https://github.com/crossley/cogs2020/tree/main/final_project_data/cat_learn_unlearn

Load your assigned dataset and fit a multiple regression model. Specifically:

- aggregate the data to one row per participant
- fit an additive model and a model with an interaction
- compare the nested models with anova().