# load libraries

library(data.table)
library(ggplot2)

# clean workspace

rm(list = ls())

# init colorscheme

COL <- c("#2271B2", "#E69F00", "#D55E00")
names(COL) <- c("blue", "orange", "red")
theme_set(
  theme_minimal(base_size = 13) +
    theme(
      panel.grid.minor = element_blank(),
      strip.text = element_text(face = "bold"),
      legend.position = "bottom"
    )
)
update_geom_defaults("point", list(size = 2))
update_geom_defaults("line", list(linewidth = 0.8))

Overview

In your final project you will:

  • analyse a dataset (what does the data show?)
  • evaluate an analysis (how strong is the evidence?)
  • justify and defend your analysis choices (why these steps?)

This tutorial focuses on the second and third items.

A key idea for today:

Inference and study design can be done well or badly. Today you learn how to tell the difference.

We will use a rat T-maze dataset. In the maze, a rat can turn left or right at a T-junction, and one arm is baited with reward on the majority of trials. We compare two experiments that differ in which arm was baited more often.


Part 1 — Load real data and define a comparison

d1 <- fread("data/experiment_1_summary.csv")
d2 <- fread("data/experiment_2_summary.csv")

str(d1)
## Classes 'data.table' and 'data.frame':   4800 obs. of  9 variables:
##  $ experiment   : chr  "exp1" "exp1" "exp1" "exp1" ...
##  $ rat_id       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ trial        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ cage_context : chr  "mesh" "wood" "mesh" "mesh" ...
##  $ scent        : chr  "none" "none" "none" "peppermint" ...
##  $ choice       : chr  "L" "R" "L" "L" ...
##  $ reward       : int  1 0 1 1 1 1 0 1 1 1 ...
##  $ reaction_time: num  0.572 0.936 0.457 0.446 0.2 ...
##  $ maze_run_time: num  3.79 3.21 2.14 2.65 2.73 ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(d2)
## Classes 'data.table' and 'data.frame':   4800 obs. of  9 variables:
##  $ experiment   : chr  "exp2" "exp2" "exp2" "exp2" ...
##  $ rat_id       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ trial        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ cage_context : chr  "wood" "mesh" "mesh" "mesh" ...
##  $ scent        : chr  "peppermint" "peppermint" "none" "peppermint" ...
##  $ choice       : chr  "L" "L" "R" "R" ...
##  $ reward       : int  0 0 1 0 0 0 0 1 1 1 ...
##  $ reaction_time: num  0.872 0.302 0.486 0.704 0.343 ...
##  $ maze_run_time: num  4.07 4.11 3.48 3.98 2.51 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Key variables:

  • rat_id — individual rat (1–24)
  • trial — trial number (1–200)
  • choice — “L” (left) or “R” (right)
  • reward — 1 if rewarded, 0 otherwise

Reward contingencies:

  • Experiment 1: left arm rewarded 70%, right arm 30%
  • Experiment 2: right arm rewarded 70%, left arm 30%
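
These contingencies imply two useful benchmarks (a back-of-the-envelope check, not part of the dataset): a rat choosing arms at random should earn reward on about 50% of trials, and a rat that always chooses the better arm should earn about 70%. A minimal sketch:

```r
# Expected reward rate under the stated 70%/30% contingencies,
# given p_better = probability of choosing the higher-reward arm
expected_reward <- function(p_better, p_hi = 0.7, p_lo = 0.3) {
  p_better * p_hi + (1 - p_better) * p_lo
}

expected_reward(0.5)  # random choice: 0.5
expected_reward(1.0)  # always the better arm: 0.7
```

Comparing the per-rat means computed below against these benchmarks tells you how much the rats actually learned the contingency.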

We compare the two experiments on the basis of each rat’s mean reward rate across all 200 trials.

rat_means_1 <- d1[, .(mean_reward = mean(reward), experiment = "Experiment 1"), .(rat_id)]
rat_means_2 <- d2[, .(mean_reward = mean(reward), experiment = "Experiment 2"), .(rat_id)]

rat_means <- rbind(rat_means_1, rat_means_2)

rat_means[, .N, .(experiment)]
##      experiment     N
##          <char> <int>
## 1: Experiment 1    24
## 2: Experiment 2    24

Group-level summary:

rat_means[, .(
  n = .N,
  mean_reward = mean(mean_reward),
  sd_reward = sd(mean_reward)
), .(experiment)]
##      experiment     n mean_reward  sd_reward
##          <char> <int>       <num>      <num>
## 1: Experiment 1    24   0.5177083 0.05419328
## 2: Experiment 2    24   0.5262500 0.05485653

The observed difference in mean reward rate:

mean_exp1 <- rat_means[experiment == "Experiment 1", mean(mean_reward)]
mean_exp2 <- rat_means[experiment == "Experiment 2", mean(mean_reward)]
diff_obs <- mean_exp1 - mean_exp2
diff_obs
## [1] -0.008541667

Question (final project framing):

  • Is diff_obs meaningful, or could it plausibly be noise?

Today we develop tools to answer that responsibly.
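
As a preview, one standard tool for this question is a label permutation test: shuffle the group labels many times and ask how often a difference at least as large as the observed one arises by chance. A minimal self-contained sketch on simulated per-rat means (hypothetical values drawn from one common distribution; with the real data you would permute rat_means$experiment):

```r
set.seed(1)

# simulated per-rat means: two groups of 24 from the SAME distribution,
# so any difference between them is pure noise by construction
g <- rep(c("A", "B"), each = 24)
x <- rnorm(48, mean = 0.52, sd = 0.055)

diff_in_means <- function(x, g) mean(x[g == "A"]) - mean(x[g == "B"])

obs <- diff_in_means(x, g)

# null distribution: the same statistic after shuffling the labels
perm <- replicate(2000, diff_in_means(x, sample(g)))

# two-sided permutation p-value: proportion of shuffles at least as extreme
p_perm <- mean(abs(perm) >= abs(obs))
```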


Part 2 — Same data, different conclusions

This section shows how small analysis choices can change the story, even when using the same dataset.

2.1 Visualise the comparison

set.seed(42)

ggplot(rat_means, aes(x = experiment, y = mean_reward)) +
  geom_boxplot(fill = COL[1], outlier.shape = NA) +
  geom_point(position = position_jitter(width = 0.15), alpha = 0.4) +
  labs(
    x = "Experiment",
    y = "Per-rat mean reward rate",
    title = "Reward rate by experiment"
  )

Despite the reversed reward contingencies, the two groups look very similar: both cluster around a mean reward rate of roughly 0.52, and the distributions overlap heavily. Whether the small observed difference reflects anything real is exactly the question at stake.

2.2 A common choice: removing outliers

We will define an “outlier rule” using the IQR method applied to the pooled per-rat means.

all_means <- rat_means$mean_reward

q1 <- quantile(all_means, 0.25, na.rm = TRUE)
q3 <- quantile(all_means, 0.75, na.rm = TRUE)
iqr <- q3 - q1

lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

lower
##    25% 
## 0.3875
upper
##    75% 
## 0.6575

Make a cleaned version.

rat_means_clean <- rat_means[mean_reward >= lower & mean_reward <= upper]

rat_means[, .(n_before = .N)]
##    n_before
##       <int>
## 1:       48
rat_means_clean[, .(n_after = .N)]
##    n_after
##      <int>
## 1:      48

Compare means before vs after outlier exclusion.

sum_raw <- rat_means[, .(
  mean_reward = mean(mean_reward),
  sd_reward = sd(mean_reward),
  n = .N
), .(experiment)]

sum_cln <- rat_means_clean[, .(
  mean_reward = mean(mean_reward),
  sd_reward = sd(mean_reward),
  n = .N
), .(experiment)]

sum_raw
##      experiment mean_reward  sd_reward     n
##          <char>       <num>      <num> <int>
## 1: Experiment 1   0.5177083 0.05419328    24
## 2: Experiment 2   0.5262500 0.05485653    24
sum_cln
##      experiment mean_reward  sd_reward     n
##          <char>       <num>      <num> <int>
## 1: Experiment 1   0.5177083 0.05419328    24
## 2: Experiment 2   0.5262500 0.05485653    24

Plot the cleaned data.

set.seed(42)

ggplot(rat_means_clean, aes(x = experiment, y = mean_reward)) +
  geom_boxplot(fill = COL[1], outlier.shape = NA) +
  geom_point(position = position_jitter(width = 0.15), alpha = 0.4) +
  labs(
    x = "Experiment",
    y = "Per-rat mean reward rate",
    title = "Reward rate by experiment (outliers removed)"
  )

Discussion points:

  • Did the mean difference change?
  • Did the spread change?
  • If you removed outliers, how would you justify that choice?

To justify your analysis, you must be able to explain why a decision was reasonable and what it changed.


Part 3 — Effect size vs significance

In Week 6 we started doing inference (“is there evidence?”). This week we add: “how big is the effect?”

3.1 A simple effect size: standardised mean difference

We will compute Cohen’s d using a pooled standard deviation.

pooled_sd <- function(x, g) {
  g <- as.factor(g)
  x1 <- x[g == levels(g)[1]]
  x2 <- x[g == levels(g)[2]]
  n1 <- length(x1)
  n2 <- length(x2)
  s1 <- sd(x1)
  s2 <- sd(x2)
  sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
  return(sp)
}

sp_raw <- pooled_sd(rat_means$mean_reward, rat_means$experiment)
d_raw  <- (mean_exp1 - mean_exp2) / sp_raw

sp_raw
## [1] 0.05452591
d_raw
## [1] -0.1566533

Repeat for the cleaned dataset.

mean_exp1_cln <- rat_means_clean[experiment == "Experiment 1", mean(mean_reward)]
mean_exp2_cln <- rat_means_clean[experiment == "Experiment 2", mean(mean_reward)]

sp_cln <- pooled_sd(rat_means_clean$mean_reward, rat_means_clean$experiment)
d_cln  <- (mean_exp1_cln - mean_exp2_cln) / sp_cln

sp_cln
## [1] 0.05452591
d_cln
## [1] -0.1566533

Discussion points:

  • Did the effect size change after cleaning?
  • Which analysis would you report, and why?
  • If the effect is “statistically significant” but d is tiny, what does that mean for interpretation?
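
For interpretation, Cohen's conventional benchmarks (small ≈ 0.2, medium ≈ 0.5, large ≈ 0.8) are a rough guide, not a rule. A small helper to label an effect size (the cut-offs are the conventional ones, not anything defined by this dataset):

```r
# Label |d| using Cohen's conventional cut-offs (rough guides only)
d_label <- function(d) {
  ad <- abs(d)
  if (ad < 0.2) "negligible"
  else if (ad < 0.5) "small"
  else if (ad < 0.8) "medium"
  else "large"
}

d_label(-0.157)  # the raw-data effect size above: "negligible"
```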

3.2 Why sample size can change “significance” without changing the effect

We will simulate a situation where the effect size is fixed, but we change the sample size.

We will generate data from two normal distributions with a fixed mean difference.

set.seed(42)

mu1 <- 0
mu2 <- 0.3
sigma <- 1

simulate_two_group <- function(n_per_group) {
  x1 <- rnorm(n_per_group, mu1, sigma)
  x2 <- rnorm(n_per_group, mu2, sigma)
  d <- data.table(
    grp = rep(c("A", "B"), each = n_per_group),
    x = c(x1, x2)
  )
  return(d)
}

A simple z-test for a difference in means (assuming a known population standard deviation):

z_test_diff_means <- function(x, g, sigma_known) {
  g <- as.factor(g)
  x1 <- x[g == levels(g)[1]]
  x2 <- x[g == levels(g)[2]]
  n1 <- length(x1)
  n2 <- length(x2)
  z <- (mean(x1) - mean(x2)) / (sigma_known * sqrt(1/n1 + 1/n2))
  p <- 2 * pnorm(abs(z), lower.tail = FALSE)
  return(list(z = z, p = p))
}

Compare p-values for different n.

d_small <- simulate_two_group(20)
d_big   <- simulate_two_group(200)

res_small <- z_test_diff_means(d_small$x, d_small$grp, sigma_known = sigma)
res_big   <- z_test_diff_means(d_big$x,   d_big$grp,   sigma_known = sigma)

res_small
## $z
## [1] 0.5151724
## 
## $p
## [1] 0.6064326
res_big
## $z
## [1] -3.241692
## 
## $p
## [1] 0.001188223

Key point:

  • the effect size is about the size of the difference
  • the p-value is about the evidence relative to sample size
  • large n can make small effects look “significant”

This is a central evaluation skill for the final project.
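
To see the first point concretely, we can estimate the standardised effect at several sample sizes from data with the same true effect (0.3, as above): all three values estimate the same d, and larger n only makes the estimate more precise, whereas the p-value (as res_small vs res_big showed) changes dramatically with n. A hedged sketch:

```r
set.seed(42)

# Sample Cohen's d for two groups with a fixed TRUE standardised effect of 0.3
sample_d <- function(n) {
  x1 <- rnorm(n, 0, 1)
  x2 <- rnorm(n, 0.3, 1)
  sp <- sqrt(((n - 1) * sd(x1)^2 + (n - 1) * sd(x2)^2) / (2 * n - 2))
  (mean(x2) - mean(x1)) / sp
}

# all three estimate the same true d = 0.3; only their precision differs
sapply(c(20, 200, 2000), sample_d)
```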


Part 4 — Power as design (not just statistics)

Power is about how likely your method is to detect an effect of a given size.

This matters for:

  • evaluating an existing study
  • designing a new study

We will do a simple normal-approximation power calculation for a difference in means.

4.1 Choose a target effect to design for

Use the effect size you observed (raw or cleaned) as a starting point.

d_target <- abs(d_raw)
d_target
## [1] 0.1566533

Convert the effect size back to the mean-difference scale by choosing a plausible sigma.

Here we use the pooled SD from the real dataset as a proxy.

sigma_proxy <- sp_raw
delta_target <- d_target * sigma_proxy

sigma_proxy
## [1] 0.05452591
delta_target
## [1] 0.008541667

4.2 Power calculation for a z-test

For a two-sided test at alpha = 0.05, the critical value is:

alpha <- 0.05
zcrit <- qnorm(1 - alpha / 2)
zcrit
## [1] 1.959964

If the true mean difference is delta_target, the z-test statistic has mean:

mu_z = delta_target / (sigma * sqrt(1/n1 + 1/n2))

Assume equal group sizes: n per group.

Then:

mu_z = delta_target / (sigma * sqrt(2/n))

Power (two-sided) is:

P(|Z| > zcrit) when Z ~ N(mu_z, 1).

We compute power across a range of sample sizes.

power_two_sided <- function(n_per_group, delta, sigma, zcrit) {
  mu_z  <- delta / (sigma * sqrt(2 / n_per_group))
  power <- pnorm(-zcrit, mean = mu_z, sd = 1) +
           (1 - pnorm(zcrit, mean = mu_z, sd = 1))
  return(power)
}

n_vec <- seq(5, 200, by = 5)
pwr   <- sapply(n_vec, power_two_sided,
                delta = delta_target, sigma = sigma_proxy, zcrit = zcrit)

d_pwr <- data.table(n_per_group = n_vec, power = pwr)

ggplot(d_pwr, aes(x = n_per_group, y = power)) +
  geom_line() +
  geom_hline(yintercept = 0.8, linetype = 2, colour = "red") +
  ylim(0, 1) +
  labs(
    x = "N per group",
    y = "Power",
    title = "Power curve for observed effect size"
  )

Find the smallest n that achieves 0.8 power. (An all-NA result means no n in the tested range reaches 0.8.)

d_pwr[power >= 0.8][1]
##    n_per_group power
##          <num> <num>
## 1:          NA    NA

Discussion points:

  • Does the original study (24 rats per experiment) appear adequately powered to detect the effect you observed?
  • If your target effect is small, required n becomes large.
  • If a study has low n, it may be underpowered unless the effect is large.
  • This is how you evaluate design quality, not just p-values.
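
As a cross-check on the power curve, the standard normal-approximation sample-size formula, n ≈ 2 (z_{1-alpha/2} + z_{power})^2 / d^2 per group, gives the required n directly (it ignores the negligible opposite-tail contribution to two-sided power):

```r
# Required n per group for a two-sided z-test at alpha = 0.05, power = 0.8,
# for the standardised effect observed above (d_target ~ 0.157)
alpha    <- 0.05
d_target <- 0.1566533

n_req <- ceiling(2 * (qnorm(1 - alpha / 2) + qnorm(0.8))^2 / d_target^2)
n_req
## [1] 640
```

This explains the all-NA result above: detecting an effect this small with 80% power needs roughly 640 rats per group, far beyond both the plotted range and the 24 rats per experiment actually run.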

Part 5 — Analysis critique (practice evaluating)

Below is a short, deliberately imperfect analysis summary. Your job is to evaluate it.

5.1 A contrived analysis summary


We compared two conditions and found a statistically significant difference (p < .05). We removed a few outliers because they looked weird. We tried a few different transformations and reported the one that gave the cleanest result. The effect is real because it is significant.


5.2 Critique prompts (write brief answers)

  1. What information is missing that you would need to judge the strength of the evidence?

  2. What analysis choices were made without justification?

  3. Which sentence strongly suggests researcher degrees of freedom (p-hacking risk)?

  4. What would you want reported instead of only “p < .05”?

  5. What would you want to know about sample size and power?

This kind of critique is part of “defending analysis choices” in your final project: you show that you understand what makes an analysis trustworthy.


Summary

Today you practiced moving from:

  • doing analysis

to

  • evaluating analysis

This is the core shift needed for the final project.

In particular you now have language and tools to discuss:

  • effect size vs p-value
  • sensitivity to analysis choices
  • power as a design constraint
  • critique of weak or incomplete analysis claims

Apply this to your assigned dataset

Your assigned dataset is determined by your student ID number. Take the last digit and compute last_digit %% 3 in R.
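
For example, with a hypothetical student ID:

```r
id <- 1234567           # hypothetical student ID
last_digit <- id %% 10  # 7
last_digit %% 3         # assigned dataset number: 1
```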

Load your assigned dataset into R.

For this tutorial, your task is:

  1. Identify a key comparison in your dataset (two conditions or two groups).
  2. Compute per-participant (or per-subject) means for each group or condition.
  3. Compute Cohen’s d for this comparison.
  4. Using the effect size you computed, estimate the required sample size for 80% power.
  5. Write 2–3 sentences reflecting on whether the original study appears adequately powered to detect the effect you observed.

You will include this reflection in your final project as part of the “limitations and assumptions” discussion.