# load libraries
library(data.table)
library(ggplot2)
# clean work space
rm(list = ls())
# init colorscheme
COL <- c("#2271B2", "#E69F00", "#D55E00")
names(COL) <- c("blue", "orange", "red")
theme_set(
  theme_minimal(base_size = 13) +
    theme(
      panel.grid.minor = element_blank(),
      strip.text = element_text(face = "bold"),
      legend.position = "bottom"
    )
)
update_geom_defaults("point", list(size = 2))
update_geom_defaults("line", list(linewidth = 0.8))
In your final project you will:
This tutorial focuses on the second and third items.
A key idea for today:
Inference and study design can be done well or badly. Today you learn how to tell the difference.
We will use a rat T-maze dataset. In the maze, a rat can turn left or right at a T-junction, and one arm is baited with reward on the majority of trials. We compare two experiments that differ in which arm was baited more often.
d1 <- fread("data/experiment_1_summary.csv")
d2 <- fread("data/experiment_2_summary.csv")
str(d1)
## Classes 'data.table' and 'data.frame': 4800 obs. of 9 variables:
## $ experiment : chr "exp1" "exp1" "exp1" "exp1" ...
## $ rat_id : int 1 1 1 1 1 1 1 1 1 1 ...
## $ trial : int 1 2 3 4 5 6 7 8 9 10 ...
## $ cage_context : chr "mesh" "wood" "mesh" "mesh" ...
## $ scent : chr "none" "none" "none" "peppermint" ...
## $ choice : chr "L" "R" "L" "L" ...
## $ reward : int 1 0 1 1 1 1 0 1 1 1 ...
## $ reaction_time: num 0.572 0.936 0.457 0.446 0.2 ...
## $ maze_run_time: num 3.79 3.21 2.14 2.65 2.73 ...
## - attr(*, ".internal.selfref")=<externalptr>
str(d2)
## Classes 'data.table' and 'data.frame': 4800 obs. of 9 variables:
## $ experiment : chr "exp2" "exp2" "exp2" "exp2" ...
## $ rat_id : int 1 1 1 1 1 1 1 1 1 1 ...
## $ trial : int 1 2 3 4 5 6 7 8 9 10 ...
## $ cage_context : chr "wood" "mesh" "mesh" "mesh" ...
## $ scent : chr "peppermint" "peppermint" "none" "peppermint" ...
## $ choice : chr "L" "L" "R" "R" ...
## $ reward : int 0 0 1 0 0 0 0 1 1 1 ...
## $ reaction_time: num 0.872 0.302 0.486 0.704 0.343 ...
## $ maze_run_time: num 4.07 4.11 3.48 3.98 2.51 ...
## - attr(*, ".internal.selfref")=<externalptr>
Key variables:
- rat_id — individual rat (1–24)
- trial — trial number (1–200)
- choice — “L” (left) or “R” (right)
- reward — 1 if rewarded, 0 otherwise

Experiment design:
In Experiment 1, left choices are rewarded on about 70% of trials and right choices on about 30%.
In Experiment 2, right choices are rewarded on about 70% of trials and left choices on about 30%.
We can verify the reward contingencies directly from the loaded data by calculating the observed reward rate for each choice in each experiment.
contingency_check <- rbind(
  d1[, .(reward_rate = mean(reward), n_trials = .N), by = .(choice)][, experiment := "Experiment 1"],
  d2[, .(reward_rate = mean(reward), n_trials = .N), by = .(choice)][, experiment := "Experiment 2"]
)
setcolorder(contingency_check, c("experiment", "choice", "reward_rate", "n_trials"))
contingency_check
## experiment choice reward_rate n_trials
## <char> <char> <num> <int>
## 1: Experiment 1 L 0.6896811 2665
## 2: Experiment 1 R 0.3030445 2135
## 3: Experiment 2 L 0.2977820 2119
## 4: Experiment 2 R 0.7068258 2681
These are close to the nominal 70/30 contingencies, with the favoured arm reversed between experiments. Small deviations are expected because reward delivery is probabilistic.
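The size of those deviations is easy to check with a quick simulation: draw rewards as Bernoulli trials at the nominal contingency and see how far the observed rate drifts from 0.7 (a sketch; the trial count 2665 is taken from the Experiment 1 left-arm row above).

```r
# How much should an observed reward rate fluctuate around 0.7?
# Standard error of the rate is sqrt(0.7 * 0.3 / 2665), about 0.009.
set.seed(42)
replicate(5, mean(rbinom(2665, 1, 0.7)))
```

Each simulated rate lands close to 0.7, with wiggle on the same order as the deviations in the table.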
We compare the two experiments on the basis of each rat’s mean reward rate across all 200 trials.
rat_means_1 <- d1[, .(mean_reward = mean(reward), experiment = "Experiment 1"), .(rat_id)]
rat_means_2 <- d2[, .(mean_reward = mean(reward), experiment = "Experiment 2"), .(rat_id)]
rat_means <- rbind(rat_means_1, rat_means_2)
rat_means[, .N, .(experiment)]
## experiment N
## <char> <int>
## 1: Experiment 1 24
## 2: Experiment 2 24
Group-level summary:
rat_means[, .(
  n = .N,
  mean_reward = mean(mean_reward),
  sd_reward = sd(mean_reward)
), .(experiment)]
## experiment n mean_reward sd_reward
## <char> <int> <num> <num>
## 1: Experiment 1 24 0.5177083 0.05419328
## 2: Experiment 2 24 0.5262500 0.05485653
The observed difference in mean reward rate:
mean_exp1 <- rat_means[experiment == "Experiment 1", mean(mean_reward)]
mean_exp2 <- rat_means[experiment == "Experiment 2", mean(mean_reward)]
diff_obs <- mean_exp1 - mean_exp2
diff_obs
## [1] -0.008541667
Key comparison question: is diff_obs meaningful, or could it plausibly be noise? Today we develop tools to answer that responsibly.
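One standard tool for this question is a permutation test: shuffle the experiment labels many times and ask how often a random relabelling produces a difference at least as large as the one observed. A minimal sketch, using simulated per-rat means as a stand-in for rat_means (with the real data you would permute the experiment column the same way):

```r
# Permutation test sketch: is the observed group difference bigger than
# what random relabelling of the groups typically produces?
set.seed(42)
dat <- data.frame(
  experiment  = rep(c("Experiment 1", "Experiment 2"), each = 24),
  mean_reward = c(rnorm(24, 0.52, 0.055), rnorm(24, 0.53, 0.055))
)
obs <- diff(tapply(dat$mean_reward, dat$experiment, mean))
perm <- replicate(2000, {
  g <- sample(dat$experiment)  # shuffle labels, breaking any real link
  diff(tapply(dat$mean_reward, g, mean))
})
mean(abs(perm) >= abs(obs))  # two-sided permutation p-value
```

A large p-value here means a difference of the observed size arises routinely from label noise alone.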
set.seed(42)
ggplot(rat_means, aes(x = experiment, y = mean_reward)) +
  stat_summary(fun = mean, geom = "col", fill = COL[1], width = 0.6) +
  stat_summary(
    fun.data = mean_se,
    geom = "errorbar",
    width = 0.15,
    linewidth = 0.8
  ) +
  geom_point(position = position_jitter(width = 0.12), alpha = 0.5) +
  labs(
    x = "Experiment",
    y = "Per-rat mean reward rate",
    title = "Mean reward rate by experiment (real data)"
  )
This bar plot shows the group means, SEM error bars, and the individual rats. In the real data, the two groups do not appear to differ much in mean reward obtained per rat.
To illustrate how a small number of unusual observations can change the apparent result, we now create a teaching example by adding several extreme per-rat values to Experiment 2. These extra points are not part of the original dataset. Real data can contain values like this for legitimate reasons: a rat might have been injured, become ill, failed to engage with the task, or been affected by an unusual testing session. In that situation, the question is not simply “should we remove outliers?” but “do these cases belong to the population we are trying to study?”
outliers_demo <- data.table(
  rat_id = c(101, 102, 103),
  mean_reward = c(0.05, 0.08, 0.10),
  experiment = "Experiment 2"
)
rat_means_demo <- rbind(rat_means, outliers_demo)
set.seed(42)
ggplot(rat_means_demo, aes(x = experiment, y = mean_reward)) +
  stat_summary(fun = mean, geom = "col", fill = COL[1], width = 0.6) +
  stat_summary(
    fun.data = mean_se,
    geom = "errorbar",
    width = 0.15,
    linewidth = 0.8
  ) +
  geom_point(position = position_jitter(width = 0.12), alpha = 0.5) +
  labs(
    x = "Experiment",
    y = "Per-rat mean reward rate",
    title = "Mean reward rate by experiment (with added outliers)"
  )
Now the groups look more different, but that impression is driven partly by a few extreme observations pulling the mean down in Experiment 2 and inflating the SEM. This motivates the next question: if we remove outliers, how much does the result change?
If the population of interest is healthy rats performing the task under normal conditions, then observations from rats that were injured, sick, or otherwise unable to perform the task as intended may not be relevant to the research question. In that case, removing them can be sensible, but only if the rule is justified and reported transparently.
We will define an “outlier rule” using the IQR method applied to the pooled per-rat means in the demonstration dataset.
all_means <- rat_means_demo$mean_reward
q1 <- quantile(all_means, 0.25, na.rm = TRUE)
q3 <- quantile(all_means, 0.75, na.rm = TRUE)
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
lower
## 25%
## 0.3675
upper
## 75%
## 0.6675
Make a cleaned version.
rat_means_clean <- rat_means_demo[mean_reward >= lower & mean_reward <= upper]
rat_means_demo[, .(n_before = .N)]
## n_before
## <int>
## 1: 51
rat_means_clean[, .(n_after = .N)]
## n_after
## <int>
## 1: 48
Compare means before vs after outlier exclusion.
sum_raw <- rat_means_demo[, .(
  mean_reward = mean(mean_reward),
  sd_reward = sd(mean_reward),
  n = .N
), .(experiment)]
sum_cln <- rat_means_clean[, .(
  mean_reward = mean(mean_reward),
  sd_reward = sd(mean_reward),
  n = .N
), .(experiment)]
sum_raw
## experiment mean_reward sd_reward n
## <char> <num> <num> <int>
## 1: Experiment 1 0.5177083 0.05419328 24
## 2: Experiment 2 0.4762963 0.15310625 27
sum_cln
## experiment mean_reward sd_reward n
## <char> <num> <num> <int>
## 1: Experiment 1 0.5177083 0.05419328 24
## 2: Experiment 2 0.5262500 0.05485653 24
Boxplots summarise central tendency and spread differently from bar plots. A bar plot with SEM shows the mean and an uncertainty measure around it, whereas a boxplot shows the median, quartiles, and the overall shape of the distribution. Because the median is more robust to extreme values than the mean, boxplots are often less sensitive to outliers. Below we compare the demonstration dataset before and after outlier removal.
set.seed(42)
plot_raw <- copy(rat_means_demo)
plot_raw[, dataset := "With added outliers"]
plot_clean <- copy(rat_means_clean)
plot_clean[, dataset := "After outlier removal"]
plot_compare <- rbind(plot_raw, plot_clean, fill = TRUE)
ggplot(plot_compare, aes(x = experiment, y = mean_reward)) +
  geom_boxplot(fill = COL[1], outlier.shape = NA) +
  geom_point(position = position_jitter(width = 0.12), alpha = 0.5) +
  facet_wrap(~ dataset) +
  labs(
    x = "Experiment",
    y = "Per-rat mean reward rate",
    title = "Boxplots before and after outlier removal"
  )
The contrast between the bar plots and boxplots is useful. Both visualise central tendency and spread, but they do so in different ways. Mean-based plots can be strongly influenced by a few extreme observations, whereas boxplots are often more robust because the median and quartiles move less.
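This robustness is easy to see numerically. With toy values (not the rat data), a single extreme observation shifts the mean substantially while leaving the median almost untouched:

```r
# One extreme value moves the mean a lot but the median very little
x     <- c(0.48, 0.50, 0.52, 0.54, 0.56)
x_out <- c(x, 0.05)               # add a single extreme low value
c(mean(x), mean(x_out))           # 0.520 vs ~0.442
c(median(x), median(x_out))       # 0.520 vs 0.510
```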
Discussion points:
To justify your analysis, you must be able to explain why a decision was reasonable and what it changed.
In Week 6 we started doing inference (is there evidence?). This week we add: how big is the effect?
A simple effect size measure is the standardised mean difference. We will compute Cohen’s d using a pooled standard deviation. Here we use the demonstration dataset to show how a few extreme observations can change the estimated effect size.
pooled_sd <- function(x, g) {
  g <- as.factor(g)
  x1 <- x[g == levels(g)[1]]
  x2 <- x[g == levels(g)[2]]
  n1 <- length(x1)
  n2 <- length(x2)
  s1 <- sd(x1)
  s2 <- sd(x2)
  sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
  return(sp)
}
mean_exp1_demo <- rat_means_demo[experiment == "Experiment 1", mean(mean_reward)]
mean_exp2_demo <- rat_means_demo[experiment == "Experiment 2", mean(mean_reward)]
sp_raw <- pooled_sd(rat_means_demo$mean_reward, rat_means_demo$experiment)
d_raw <- (mean_exp1_demo - mean_exp2_demo) / sp_raw
sp_raw
## [1] 0.1175454
d_raw
## [1] 0.3523069
Repeat for the cleaned dataset.
mean_exp1_cln <- rat_means_clean[experiment == "Experiment 1", mean(mean_reward)]
mean_exp2_cln <- rat_means_clean[experiment == "Experiment 2", mean(mean_reward)]
sp_cln <- pooled_sd(rat_means_clean$mean_reward, rat_means_clean$experiment)
d_cln <- (mean_exp1_cln - mean_exp2_cln) / sp_cln
sp_cln
## [1] 0.05452591
d_cln
## [1] -0.1566533
Discussion points:
We will simulate a situation where the effect size is fixed, but we change the sample size.
We will generate data from two normal distributions with a fixed mean difference.
set.seed(42)
mu1 <- 0
mu2 <- 0.3
sigma <- 1
simulate_two_group <- function(n_per_group) {
  x1 <- rnorm(n_per_group, mu1, sigma)
  x2 <- rnorm(n_per_group, mu2, sigma)
  d <- data.table(
    grp = rep(c("A", "B"), each = n_per_group),
    x = c(x1, x2)
  )
  return(d)
}
A simple z-test for difference in means (normal test):
z_test_diff_means <- function(x, g, sigma_known) {
  g <- as.factor(g)
  x1 <- x[g == levels(g)[1]]
  x2 <- x[g == levels(g)[2]]
  n1 <- length(x1)
  n2 <- length(x2)
  z <- (mean(x1) - mean(x2)) / (sigma_known * sqrt(1/n1 + 1/n2))
  p <- 2 * pnorm(abs(z), lower.tail = FALSE)
  return(list(z = z, p = p))
}
Compare p-values for different n.
d_small <- simulate_two_group(20)
d_big <- simulate_two_group(200)
res_small <- z_test_diff_means(d_small$x, d_small$grp, sigma_known = sigma)
res_big <- z_test_diff_means(d_big$x, d_big$grp, sigma_known = sigma)
res_small
## $z
## [1] 0.5151724
##
## $p
## [1] 0.6064326
res_big
## $z
## [1] -3.241692
##
## $p
## [1] 0.001188223
Key point:
This is a central evaluation skill for the final project.
Power is about how likely your method is to detect an effect of a given size.
This matters for:
We will do a simple normal-approximation power calculation for a difference in means.
First, choose a target effect to design for. For power, we return to the real dataset rather than the teaching example with artificial outliers. Use the effect size observed in the real data as a starting point.
sp_real <- pooled_sd(rat_means$mean_reward, rat_means$experiment)
d_real <- (mean_exp1 - mean_exp2) / sp_real
d_real
## [1] -0.1566533
d_target <- abs(d_real)
d_target
## [1] 0.1566533
Effect sizes are in units of standard deviation. To understand what that means in the original measurement scale, we can convert back to a mean difference by multiplying by a plausible sigma.
Here we use the pooled SD from the real dataset as a proxy.
sigma_proxy <- sp_real
delta_target <- d_target * sigma_proxy
sigma_proxy
## [1] 0.05452591
delta_target
## [1] 0.008541667
Power is the probability that a study will detect a real effect of a given size.
One way to understand this is to imagine repeating the same study many times. Each repetition draws a new sample from the same populations, runs the same normal test, and asks whether the result is statistically significant.
If the sample size is small, sampling noise is large, so even a real effect will often be missed. If the sample size is larger, the estimated group difference becomes more stable, and the study detects the effect more often.
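The stabilisation with sample size can be made concrete: for two groups of n observations each with common standard deviation sigma, the standard error of the difference in means is sigma * sqrt(2 / n), so quadrupling n halves the noise on the estimated difference.

```r
# Standard error of the difference between two group means,
# sigma * sqrt(2 / n): shrinks with the square root of n
sigma <- 1
n <- c(10, 40, 160, 640)
round(sigma * sqrt(2 / n), 3)
# 0.447 0.224 0.112 0.056
```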
We can estimate power by simulation.
simulate_power_z <- function(n_per_group, mu1, mu2, sigma, n_sims = 1000) {
  p_vals <- replicate(n_sims, {
    x1 <- rnorm(n_per_group, mean = mu1, sd = sigma)
    x2 <- rnorm(n_per_group, mean = mu2, sd = sigma)
    z <- (mean(x1) - mean(x2)) / (sigma * sqrt(2 / n_per_group))
    2 * pnorm(abs(z), lower.tail = FALSE)
  })
  mean(p_vals < 0.05)
}
To run the simulation, we set one group mean to 0 and the other to
delta_target. This creates a population difference equal to
the effect size estimated from the real data.
mu1_sim <- 0
mu2_sim <- delta_target
n_vec <- seq(5, 800, by = 5)
pwr <- sapply(
  n_vec,
  simulate_power_z,
  mu1 = mu1_sim,
  mu2 = mu2_sim,
  sigma = sigma_proxy
)
d_pwr <- data.table(n_per_group = n_vec, power = pwr)
ggplot(d_pwr, aes(x = n_per_group, y = power)) +
  geom_line() +
  geom_hline(yintercept = 0.8, linetype = 2, colour = "red") +
  ylim(0, 1) +
  labs(
    x = "N per group",
    y = "Estimated power",
    title = "Power curve from repeated simulated studies"
  )
The curve shows how often the normal test would be expected to return
p < .05 if the true effect were about the size observed
in the real data.
When the observed effect is very small, the power analysis often says that an extremely large sample would be needed. This does not prove that a real but tiny effect exists. It may simply mean that the current data do not provide evidence for a meaningful effect.
This calculation treats the observed effect size as a plausible target for design. That is only sensible if you think the observed effect reflects something real. If the observed difference is mostly noise, then the estimated required sample size may not be very meaningful.
In practice, power analysis is most useful when the target effect is chosen from theory, prior evidence, or a minimum effect size that would matter scientifically.
Find the smallest n that achieves 0.8 power.
d_pwr[power >= 0.8][1]
## n_per_group power
## <num> <num>
## 1: 610 0.8
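A closed-form cross-check is available via power.t.test from base R's stats package, pasting in the delta and sigma values computed from the real data above. It should land in the same neighbourhood as the simulated answer; any gap reflects Monte Carlo noise in the simulation and the z-vs-t approximation.

```r
# Analytic sample-size calculation for the same target effect
# (delta and sd are the values computed from the real data above)
power.t.test(delta = 0.008541667, sd = 0.05452591,
             sig.level = 0.05, power = 0.8)
```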
Discussion points:
Below is a short, deliberately imperfect analysis summary. Your job is to evaluate it.
We compared two conditions and found a statistically significant difference (p < .05). We removed a few outliers because they looked weird. We tried a few different transformations and reported the one that gave the cleanest result. The effect is real because it is significant.
Critique prompts (write brief answers):
What information is missing that you would need to judge the strength of the evidence?
What analysis choices were made without justification?
Which sentence strongly suggests researcher degrees of freedom (p-hacking risk)?
What would you want reported instead of only “p < .05”?
What would you want to know about sample size and power?
This kind of critique is part of “defending analysis choices” in your final project: you show that you understand what makes an analysis trustworthy.
Today you practiced moving from:
to
This is the core shift needed for the final project.
In particular you now have language and tools to discuss:
Your assigned dataset is determined by your student ID number. Take
the last digit and compute last_digit %% 3 in R.
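For example, with a hypothetical student ID:

```r
student_id <- 45678913          # hypothetical ID; replace with your own
last_digit <- student_id %% 10  # 3
last_digit %% 3                 # 0, so dataset 0
```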
- 0: https://github.com/crossley/cogs2020/tree/main/final_project_data/cat_learn_switch
- 1: https://github.com/crossley/cogs2020/tree/main/final_project_data/cat_learn_auto
- 2: https://github.com/crossley/cogs2020/tree/main/final_project_data/cat_learn_unlearn

Load your assigned dataset into R.
For this tutorial, your task is:
You will include this reflection in your final project as part of the “limitations and assumptions” discussion.