# load libraries

library(data.table)
library(ggplot2)

# clean work space

rm(list = ls())

# init colorscheme

COL <- c("#2271B2", "#E69F00", "#D55E00")
names(COL) <- c("blue", "orange", "red")
theme_set(
  theme_minimal(base_size = 13) +
    theme(
      panel.grid.minor = element_blank(),
      strip.text = element_text(face = "bold"),
      legend.position = "bottom"
    )
)
update_geom_defaults("point", list(size = 2))
update_geom_defaults("line", list(linewidth = 0.8))

Overview

In your final project you will:

  • analyse a dataset (what does the data show?)
  • evaluate an analysis (how strong is the evidence?)
  • justify and defend your analysis choices (why these steps?)

This tutorial focuses on the second and third items.

A key idea for today:

Inference and study design can be done well or badly. Today you learn how to tell the difference.

We will use a rat T-maze dataset. In the maze, a rat can turn left or right at a T-junction, and one arm is baited with reward on the majority of trials. We compare two experiments that differ in which arm was baited more often.


1. Load real data and inspect the experiment design

d1 <- fread("data/experiment_1_summary.csv")
d2 <- fread("data/experiment_2_summary.csv")

str(d1)
## Classes 'data.table' and 'data.frame':   4800 obs. of  9 variables:
##  $ experiment   : chr  "exp1" "exp1" "exp1" "exp1" ...
##  $ rat_id       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ trial        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ cage_context : chr  "mesh" "wood" "mesh" "mesh" ...
##  $ scent        : chr  "none" "none" "none" "peppermint" ...
##  $ choice       : chr  "L" "R" "L" "L" ...
##  $ reward       : int  1 0 1 1 1 1 0 1 1 1 ...
##  $ reaction_time: num  0.572 0.936 0.457 0.446 0.2 ...
##  $ maze_run_time: num  3.79 3.21 2.14 2.65 2.73 ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(d2)
## Classes 'data.table' and 'data.frame':   4800 obs. of  9 variables:
##  $ experiment   : chr  "exp2" "exp2" "exp2" "exp2" ...
##  $ rat_id       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ trial        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ cage_context : chr  "wood" "mesh" "mesh" "mesh" ...
##  $ scent        : chr  "peppermint" "peppermint" "none" "peppermint" ...
##  $ choice       : chr  "L" "L" "R" "R" ...
##  $ reward       : int  0 0 1 0 0 0 0 1 1 1 ...
##  $ reaction_time: num  0.872 0.302 0.486 0.704 0.343 ...
##  $ maze_run_time: num  4.07 4.11 3.48 3.98 2.51 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Key variables:

  • rat_id — individual rat (1–24)
  • trial — trial number (1–200)
  • choice — “L” (left) or “R” (right)
  • reward — 1 if rewarded, 0 otherwise
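
As a quick sanity check on these ranges, we can count the rats and the trials per rat in each experiment (a minimal sketch; uniqueN is data.table's count of distinct values, and 24 rats × 200 trials matches the 4800 rows reported by str() above).

d1[, .(n_rats = uniqueN(rat_id), trials_per_rat = .N / uniqueN(rat_id))]
d2[, .(n_rats = uniqueN(rat_id), trials_per_rat = .N / uniqueN(rat_id))]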

Experiment design:

  • In Experiment 1, left choices are rewarded on about 70% of trials and right choices on about 30%.

  • In Experiment 2, right choices are rewarded on about 70% of trials and left choices on about 30%.

We can verify the reward contingencies directly from the loaded data by calculating the observed reward rate for each choice in each experiment.

contingency_check <- rbind(
  d1[, .(reward_rate = mean(reward), n_trials = .N), by = .(choice)][, experiment := "Experiment 1"],
  d2[, .(reward_rate = mean(reward), n_trials = .N), by = .(choice)][, experiment := "Experiment 2"]
)

setcolorder(contingency_check, c("experiment", "choice", "reward_rate", "n_trials"))
contingency_check
##      experiment choice reward_rate n_trials
##          <char> <char>       <num>    <int>
## 1: Experiment 1      L   0.6896811     2665
## 2: Experiment 1      R   0.3030445     2135
## 3: Experiment 2      L   0.2977820     2119
## 4: Experiment 2      R   0.7068258     2681

These observed rates are close to the nominal 70/30 contingencies, mirrored between the two experiments. Small deviations from the nominal values are expected because reward delivery is probabilistic.
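
To see roughly how much deviation to expect, we can compute a binomial standard error for an observed reward rate (a back-of-envelope sketch, assuming each trial's reward is an independent Bernoulli draw):

# with ~2,100-2,700 trials per choice, an observed rate around a true 0.70
# has a standard error of well under one percentage point
sqrt(0.7 * 0.3 / 2665)   # roughly 0.009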

2. Define a key comparison

We compare the two experiments on the basis of each rat’s mean reward rate across all 200 trials.

rat_means_1 <- d1[, .(mean_reward = mean(reward), experiment = "Experiment 1"), .(rat_id)]
rat_means_2 <- d2[, .(mean_reward = mean(reward), experiment = "Experiment 2"), .(rat_id)]

rat_means <- rbind(rat_means_1, rat_means_2)

rat_means[, .N, .(experiment)]
##      experiment     N
##          <char> <int>
## 1: Experiment 1    24
## 2: Experiment 2    24

Group-level summary:

rat_means[, .(
  n = .N,
  mean_reward = mean(mean_reward),
  sd_reward = sd(mean_reward)
), .(experiment)]
##      experiment     n mean_reward  sd_reward
##          <char> <int>       <num>      <num>
## 1: Experiment 1    24   0.5177083 0.05419328
## 2: Experiment 2    24   0.5262500 0.05485653

The observed difference in mean reward rate:

mean_exp1 <- rat_means[experiment == "Experiment 1", mean(mean_reward)]
mean_exp2 <- rat_means[experiment == "Experiment 2", mean(mean_reward)]
diff_obs <- mean_exp1 - mean_exp2
diff_obs
## [1] -0.008541667

Key question for this comparison:

  • Is diff_obs meaningful, or could it plausibly be noise?

Today we develop tools to answer that responsibly.
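
As a rough first sense of scale (this anticipates the effect-size discussion in section 5), we can express diff_obs relative to how much individual rats vary; the per-group SDs in the summary above were about 0.054.

# a value much smaller than 1 means the group difference is small
# compared with rat-to-rat variability
diff_obs / rat_means[, sd(mean_reward)]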


3. Visualise the comparison

set.seed(42)

ggplot(rat_means, aes(x = experiment, y = mean_reward)) +
  stat_summary(fun = mean, geom = "col", fill = COL[1], width = 0.6) +
  stat_summary(
    fun.data = mean_se,
    geom = "errorbar",
    width = 0.15,
    linewidth = 0.8
  ) +
  geom_point(position = position_jitter(width = 0.12), alpha = 0.5) +
  labs(
    x = "Experiment",
    y = "Per-rat mean reward rate",
    title = "Mean reward rate by experiment (real data)"
  )

This bar plot shows the group means, SEM error bars, and the individual rats. In the real data, the two groups do not appear to differ much in mean reward obtained per rat.

To illustrate how a small number of unusual observations can change the apparent result, we now create a teaching example by adding several extreme per-rat values to Experiment 2. These extra points are not part of the original dataset. Real data can contain values like this for legitimate reasons: a rat might have been injured, become ill, failed to engage with the task, or been affected by an unusual testing session. In that situation, the question is not simply “should we remove outliers?” but “do these cases belong to the population we are trying to study?”

outliers_demo <- data.table(
  rat_id = c(101, 102, 103),
  mean_reward = c(0.05, 0.08, 0.10),
  experiment = "Experiment 2"
)

rat_means_demo <- rbind(rat_means, outliers_demo)

set.seed(42)

ggplot(rat_means_demo, aes(x = experiment, y = mean_reward)) +
  stat_summary(fun = mean, geom = "col", fill = COL[1], width = 0.6) +
  stat_summary(
    fun.data = mean_se,
    geom = "errorbar",
    width = 0.15,
    linewidth = 0.8
  ) +
  geom_point(position = position_jitter(width = 0.12), alpha = 0.5) +
  labs(
    x = "Experiment",
    y = "Per-rat mean reward rate",
    title = "Mean reward rate by experiment (with added outliers)"
  )

Now the groups look more different, but that impression is driven partly by a few extreme observations pulling the mean down in Experiment 2 and inflating the SEM. This motivates the next question: if we remove outliers, how much does the result change?

If the population of interest is healthy rats performing the task under normal conditions, then observations from rats that were injured, sick, or otherwise unable to perform the task as intended may not be relevant to the research question. In that case, removing them can be sensible, but only if the rule is justified and reported transparently.

4. Outlier handling as an analysis choice

We will define an “outlier rule” using the IQR method applied to the pooled per-rat means in the demonstration dataset.

all_means <- rat_means_demo$mean_reward

q1 <- quantile(all_means, 0.25, na.rm = TRUE)
q3 <- quantile(all_means, 0.75, na.rm = TRUE)
iqr <- q3 - q1

lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

lower
##    25% 
## 0.3675
upper
##    75% 
## 0.6675

Make a cleaned version.

rat_means_clean <- rat_means_demo[mean_reward >= lower & mean_reward <= upper]

rat_means_demo[, .(n_before = .N)]
##    n_before
##       <int>
## 1:       51
rat_means_clean[, .(n_after = .N)]
##    n_after
##      <int>
## 1:      48

Compare means before vs after outlier exclusion.

sum_raw <- rat_means_demo[, .(
  mean_reward = mean(mean_reward),
  sd_reward = sd(mean_reward),
  n = .N
), .(experiment)]

sum_cln <- rat_means_clean[, .(
  mean_reward = mean(mean_reward),
  sd_reward = sd(mean_reward),
  n = .N
), .(experiment)]

sum_raw
##      experiment mean_reward  sd_reward     n
##          <char>       <num>      <num> <int>
## 1: Experiment 1   0.5177083 0.05419328    24
## 2: Experiment 2   0.4762963 0.15310625    27
sum_cln
##      experiment mean_reward  sd_reward     n
##          <char>       <num>      <num> <int>
## 1: Experiment 1   0.5177083 0.05419328    24
## 2: Experiment 2   0.5262500 0.05485653    24

Boxplots summarise central tendency and spread differently from bar plots. A bar plot with SEM shows the mean and an uncertainty measure around it, whereas a boxplot shows the median, quartiles, and the overall shape of the distribution. Because the median is more robust to extreme values than the mean, boxplots are often less sensitive to outliers. Below we compare the demonstration dataset before and after outlier removal.

set.seed(42)

plot_raw <- copy(rat_means_demo)
plot_raw[, dataset := "With added outliers"]

plot_clean <- copy(rat_means_clean)
plot_clean[, dataset := "After outlier removal"]

plot_compare <- rbind(plot_raw, plot_clean, fill = TRUE)

ggplot(plot_compare, aes(x = experiment, y = mean_reward)) +
  geom_boxplot(fill = COL[1], outlier.shape = NA) +
  geom_point(position = position_jitter(width = 0.12), alpha = 0.5) +
  facet_wrap(~ dataset) +
  labs(
    x = "Experiment",
    y = "Per-rat mean reward rate",
    title = "Boxplots before and after outlier removal"
  )

The contrast between the bar plots and the boxplots is instructive. Both visualise central tendency and spread, but a mean with SEM error bars can be pulled strongly by a few extreme observations, whereas the median and quartiles that a boxplot displays typically move far less.
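
We can check that numerically for Experiment 2 using the objects defined above (a quick sketch; the means match the summaries printed earlier, and you can verify the medians when running the code):

# with the three added extreme rats included: the mean was about 0.476
rat_means_demo[experiment == "Experiment 2",
               .(mean = mean(mean_reward), median = median(mean_reward))]

# after the IQR-based exclusion: the mean returns to about 0.526;
# the median should move far less between the two calls
rat_means_clean[experiment == "Experiment 2",
               .(mean = mean(mean_reward), median = median(mean_reward))]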

Discussion points:

  • Did the mean difference change?
  • Did the spread change?
  • If you removed outliers, how would you justify that choice?

To justify your analysis, you must be able to explain why a decision was reasonable and what it changed.


5. Effect size vs significance

In Week 6 we started doing inference (is there evidence?). This week we add: how big is the effect?

A simple effect size measure is the standardised mean difference. We will compute Cohen’s d using a pooled standard deviation. Here we use the demonstration dataset to show how a few extreme observations can change the estimated effect size.

pooled_sd <- function(x, g) {
  g <- as.factor(g)
  x1 <- x[g == levels(g)[1]]
  x2 <- x[g == levels(g)[2]]
  n1 <- length(x1)
  n2 <- length(x2)
  s1 <- sd(x1)
  s2 <- sd(x2)
  sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
  return(sp)
}

mean_exp1_demo <- rat_means_demo[experiment == "Experiment 1", mean(mean_reward)]
mean_exp2_demo <- rat_means_demo[experiment == "Experiment 2", mean(mean_reward)]

sp_raw <- pooled_sd(rat_means_demo$mean_reward, rat_means_demo$experiment)
d_raw  <- (mean_exp1_demo - mean_exp2_demo) / sp_raw

sp_raw
## [1] 0.1175454
d_raw
## [1] 0.3523069

Repeat for the cleaned dataset.

mean_exp1_cln <- rat_means_clean[experiment == "Experiment 1", mean(mean_reward)]
mean_exp2_cln <- rat_means_clean[experiment == "Experiment 2", mean(mean_reward)]

sp_cln <- pooled_sd(rat_means_clean$mean_reward, rat_means_clean$experiment)
d_cln  <- (mean_exp1_cln - mean_exp2_cln) / sp_cln

sp_cln
## [1] 0.05452591
d_cln
## [1] -0.1566533

Discussion points:

  • Did the effect size change after cleaning?
  • Which analysis would you report, and why?
  • If the effect is “statistically significant” but d is tiny, what does that mean for interpretation?

6. Why sample size can change “significance” without changing the effect

We will simulate a situation where the effect size is fixed, but we change the sample size.

We will generate data from two normal distributions with a fixed mean difference.

set.seed(42)

mu1 <- 0
mu2 <- 0.3
sigma <- 1

simulate_two_group <- function(n_per_group) {
  x1 <- rnorm(n_per_group, mu1, sigma)
  x2 <- rnorm(n_per_group, mu2, sigma)
  d <- data.table(
    grp = rep(c("A", "B"), each = n_per_group),
    x = c(x1, x2)
  )
  return(d)
}

A simple z-test for difference in means (normal test):

z_test_diff_means <- function(x, g, sigma_known) {
  g <- as.factor(g)
  x1 <- x[g == levels(g)[1]]
  x2 <- x[g == levels(g)[2]]
  n1 <- length(x1)
  n2 <- length(x2)
  z <- (mean(x1) - mean(x2)) / (sigma_known * sqrt(1/n1 + 1/n2))
  p <- 2 * pnorm(abs(z), lower.tail = FALSE)
  return(list(z = z, p = p))
}

Compare p-values for different n.

d_small <- simulate_two_group(20)
d_big   <- simulate_two_group(200)

res_small <- z_test_diff_means(d_small$x, d_small$grp, sigma_known = sigma)
res_big   <- z_test_diff_means(d_big$x,   d_big$grp,   sigma_known = sigma)

res_small
## $z
## [1] 0.5151724
## 
## $p
## [1] 0.6064326
res_big
## $z
## [1] -3.241692
## 
## $p
## [1] 0.001188223

Key point:

  • the effect size describes how big the difference is (here, in units of standard deviation)
  • the p-value combines the size of the difference with the sample size: more data gives a smaller p for the same underlying effect
  • as a result, large n can make small effects look “significant” (illustrated in the sketch below)

This is a central evaluation skill for the final project.
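
To make this concrete, here is a short sketch using the samples and functions defined above. The population effect size is fixed by the simulation settings; its estimate is simply noisier at small n, whereas the z statistic and p-value change systematically with n.

# true standardised effect, fixed by design (= 0.3 here), regardless of n
(mu2 - mu1) / sigma

# estimated Cohen's d in each simulated sample
d_hat_small <- (d_small[grp == "A", mean(x)] - d_small[grp == "B", mean(x)]) /
  pooled_sd(d_small$x, d_small$grp)
d_hat_big   <- (d_big[grp == "A", mean(x)] - d_big[grp == "B", mean(x)]) /
  pooled_sd(d_big$x, d_big$grp)

d_hat_small
d_hat_big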


7. Power as design (not just statistics)

Power is about how likely your method is to detect an effect of a given size.

This matters for:

  • evaluating an existing study
  • designing a new study

We will do a simple normal-approximation power calculation for a difference in means.

First, choose a target effect to design for. For power, we return to the real dataset rather than the teaching example with artificial outliers. Use the effect size observed in the real data as a starting point.

sp_real <- pooled_sd(rat_means$mean_reward, rat_means$experiment)
d_real  <- (mean_exp1 - mean_exp2) / sp_real

d_real
## [1] -0.1566533

d_target <- abs(d_real)
d_target
## [1] 0.1566533

Effect sizes are in units of standard deviation. To understand what that means in the original measurement scale, we can convert back to a mean difference by multiplying by a plausible sigma.

Here we use the pooled SD from the real dataset as a proxy.

sigma_proxy <- sp_real
delta_target <- d_target * sigma_proxy

sigma_proxy
## [1] 0.05452591
delta_target
## [1] 0.008541667

8. Estimating power by simulating repeated studies

Power is the probability that a study will detect a real effect of a given size.

One way to understand this is to imagine repeating the same study many times. Each repetition draws a new sample from the same populations, runs the same normal test, and asks whether the result is statistically significant.

If the sample size is small, sampling noise is large, so even a real effect will often be missed. If the sample size is larger, the estimated group difference becomes more stable, and the study detects the effect more often.

We can estimate power by simulation.

simulate_power_z <- function(n_per_group, mu1, mu2, sigma, n_sims = 1000) {
  p_vals <- replicate(n_sims, {
    x1 <- rnorm(n_per_group, mean = mu1, sd = sigma)
    x2 <- rnorm(n_per_group, mean = mu2, sd = sigma)

    z <- (mean(x1) - mean(x2)) / (sigma * sqrt(2 / n_per_group))
    2 * pnorm(abs(z), lower.tail = FALSE)
  })

  mean(p_vals < 0.05)
}

To run the simulation, we set one group mean to 0 and the other to delta_target. This creates a population difference equal to the effect size estimated from the real data.

mu1_sim <- 0
mu2_sim <- delta_target

n_vec <- seq(5, 800, by = 5)
pwr   <- sapply(
  n_vec,
  simulate_power_z,
  mu1 = mu1_sim,
  mu2 = mu2_sim,
  sigma = sigma_proxy
)

d_pwr <- data.table(n_per_group = n_vec, power = pwr)

ggplot(d_pwr, aes(x = n_per_group, y = power)) +
  geom_line() +
  geom_hline(yintercept = 0.8, linetype = 2, colour = "red") +
  ylim(0, 1) +
  labs(
    x = "N per group",
    y = "Estimated power",
    title = "Power curve from repeated simulated studies"
  )

The curve shows how often the normal test would be expected to return p < .05 if the true effect were about the size observed in the real data.

When the observed effect is very small, the power analysis often says that an extremely large sample would be needed. This does not prove that a real but tiny effect exists. It may simply mean that the current data do not provide evidence for a meaningful effect.

This calculation treats the observed effect size as a plausible target for design. That is only sensible if you think the observed effect reflects something real. If the observed difference is mostly noise, then the estimated required sample size may not be very meaningful.

In practice, power analysis is most useful when the target effect is chosen from theory, prior evidence, or a minimum effect size that would matter scientifically.

Find the smallest n that achieves 0.8 power.

d_pwr[power >= 0.8][1]
##    n_per_group power
##          <num> <num>
## 1:         610   0.8
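
As a cross-check on the simulation, a closed-form normal-approximation formula for a two-sample z-test gives a similar answer (assuming a two-sided alpha of .05 and target power of 0.80):

# required n per group ~ 2 * (z_{1 - alpha/2} + z_{power})^2 / d^2
n_required <- 2 * (qnorm(0.975) + qnorm(0.80))^2 / d_target^2
ceiling(n_required)   # roughly 640 per group; the simulated curve crosses 0.8
                      # slightly earlier because each point uses only 1000 simulations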

Discussion points:

  • Does the original study (24 rats per experiment) appear adequately powered to detect the effect you observed?
  • If your target effect is small, required n becomes large.
  • If a study has low n, it may be underpowered unless the effect is large.
  • This is how you evaluate design quality, not just p-values.

9. Analysis critique (practice evaluating)

Below is a short, deliberately imperfect analysis summary. Your job is to evaluate it.

We compared two conditions and found a statistically significant difference (p < .05). We removed a few outliers because they looked weird. We tried a few different transformations and reported the one that gave the cleanest result. The effect is real because it is significant.


Critique prompts (write brief answers):

  1. What information is missing that you would need to judge the strength of the evidence?

  2. What analysis choices were made without justification?

  3. Which sentence strongly suggests researcher degrees of freedom (p-hacking risk)?

  4. What would you want reported instead of only “p < .05”?

  5. What would you want to know about sample size and power?

This kind of critique is part of “defending analysis choices” in your final project: you show that you understand what makes an analysis trustworthy.


Summary

Today you practiced moving from:

  • doing analysis

to

  • evaluating analysis

This is the core shift needed for the final project.

In particular you now have language and tools to discuss:

  • effect size vs p-value
  • sensitivity to analysis choices
  • power as a design constraint
  • critique of weak or incomplete analysis claims

Apply this to your assigned dataset

Your assigned dataset is determined by your student ID number. Take the last digit of your ID and compute last_digit %% 3 in R; the result (0, 1, or 2) tells you which dataset to work with.
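
For example, with a hypothetical ID ending in 7:

last_digit <- 7   # hypothetical: use the last digit of your own student ID
last_digit %% 3   # returns 1 here; this value determines which dataset you load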

Load your assigned dataset into R.

For this tutorial, your task is:

  1. Identify a key comparison in your dataset (two conditions or two groups).
  2. Compute per-participant (or per-subject) means for each group or condition.
  3. Compute Cohen’s d for this comparison.
  4. Using the effect size you computed, estimate the required sample size for 80% power.
  5. Write 2–3 sentences reflecting on whether the original study appears adequately powered to detect the effect you observed.

You will include this reflection in your final project as part of the “limitations and assumptions” discussion.