# load libraries
library(data.table)
library(ggplot2)
# clean work space
rm(list = ls())
# init colorscheme
COL <- c("#2271B2", "#E69F00", "#D55E00")
names(COL) <- c("blue", "orange", "red")
theme_set(
  theme_minimal(base_size = 13) +
    theme(
      panel.grid.minor = element_blank(),
      strip.text = element_text(face = "bold"),
      legend.position = "bottom"
    )
)
update_geom_defaults("point", list(size = 2))
update_geom_defaults("line", list(linewidth = 0.8))
Last week we focused on describing how variables behave.
This week we take the next step: the goal of this tutorial is to connect those descriptive choices to statistical inference.
A key idea for today:
Statistical inference is about deciding whether observed differences are larger than we would expect from sampling variability alone.
This is the point where analysis moves from description to inference.
In your final project you will compare conditions and make claims about differences between them.
We begin with a simple example of that kind of question, using rat T-maze data from Experiment 1.
In Experiment 1, 24 rats ran 200 trials each in a T-maze. The left arm was rewarded on 70% of trials. Each row in the data file gives one trial for one rat.
Load Experiment 1.
d1 <- fread("data/experiment_1_summary.csv")
head(d1)
## experiment rat_id trial cage_context scent choice reward reaction_time maze_run_time
## <char> <int> <int> <char> <char> <char> <int> <num> <num>
## 1: exp1 1 1 mesh none L 1 0.5716877 3.789606
## 2: exp1 1 2 wood none R 0 0.9360044 3.209561
## 3: exp1 1 3 mesh none L 1 0.4574906 2.137541
## 4: exp1 1 4 mesh peppermint L 1 0.4461169 2.648328
## 5: exp1 1 5 mesh lemon L 1 0.2000000 2.727809
## 6: exp1 1 6 mesh peppermint L 1 0.6402233 2.699873
Our question:
Do rats in Experiment 1 choose the rewarded arm more than chance (i.e., is their mean reward rate greater than 0.5)?
Notice that the dependent variable here is a proportion computed from a binary trial-level variable (reward), so we are no longer working with a continuous timing variable like last week.
First, compute each rat’s mean reward rate, the proportion of trials on which the rat received a reward.
d1_rat <- d1[, .(reward_rate = mean(reward)), by = .(rat_id)]
d1_rat
## rat_id reward_rate
## <int> <num>
## 1: 1 0.585
## 2: 2 0.520
## 3: 3 0.520
## 4: 4 0.505
## 5: 5 0.535
## 6: 6 0.640
## 7: 7 0.455
## 8: 8 0.395
## 9: 9 0.545
## 10: 10 0.545
## 11: 11 0.515
## 12: 12 0.485
## 13: 13 0.580
## 14: 14 0.530
## 15: 15 0.565
## 16: 16 0.555
## 17: 17 0.490
## 18: 18 0.525
## 19: 19 0.540
## 20: 20 0.455
## 21: 21 0.445
## 22: 22 0.445
## 23: 23 0.490
## 24: 24 0.560
## rat_id reward_rate
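A quick reminder of why `mean()` is the right summary here: the mean of a 0/1 vector is just the proportion of 1s. A toy sketch (made-up vector, not the rat data):

```r
# mean() of a binary variable gives a proportion
x <- c(1, 0, 1, 1, 0)  # e.g. 3 rewarded trials out of 5
mean(x)
## [1] 0.6
```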
Why this summary makes sense: reward is binary on each trial, so each rat's mean of reward is exactly the proportion of rewarded trials.
Plot the distribution of per-rat reward rates.
ggplot(d1_rat, aes(x = reward_rate)) +
  geom_histogram(bins = 10, fill = COL[1], colour = "white") +
  geom_vline(xintercept = 0.5, linetype = 2, colour = "red") +
  labs(
    title = "Per-rat mean reward rate — Experiment 1",
    x = "Mean reward rate",
    y = "Number of rats"
  )
The dashed line marks chance (0.5).
Notice that the per-rat means vary even though all rats experienced the same 70% reward schedule.
Some of this spread reflects sampling variability: each rat’s 200-trial experience is a sample from the same reward schedule, yet produces a different summary statistic. For teaching purposes, we treat the rats here as if they were generated from the same underlying process.
Even when data come from the same generating process, samples differ.
To see this, simulate two groups of 24 rats whose true reward probability is exactly 0.5 (pure chance), and compare their sample mean reward rates.
set.seed(42)
group_a <- rbinom(24, size = 200, prob = 0.5) / 200
group_b <- rbinom(24, size = 200, prob = 0.5) / 200
mean(group_a)
## [1] 0.4841667
mean(group_b)
## [1] 0.491875
The means differ slightly even though both groups were generated from the same null distribution (p = 0.5).
Repeat several times to drive the point home.
set.seed(42)
replicate(5, mean(rbinom(24, 200, 0.5) / 200))
## [1] 0.4841667 0.4918750 0.5047917 0.5004167 0.5156250
Important idea:
This is why comparing means directly is not sufficient. We need to ask whether the observed difference is larger than what sampling variability alone would produce.
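To make "larger than sampling variability alone" concrete, we can simulate the null distribution of the difference between two chance-level groups. This is a sketch assuming the same design as above (24 rats, 200 trials each, true p = 0.5 in both groups):

```r
# How big a gap between two chance-level groups is "typical"?
set.seed(1)
diffs <- replicate(2000, {
  a <- mean(rbinom(24, 200, 0.5) / 200)  # group A mean reward rate
  b <- mean(rbinom(24, 200, 0.5) / 200)  # group B mean reward rate
  a - b
})
# 95% of purely chance-driven absolute differences fall below this
quantile(abs(diffs), 0.95)
```

Observed group differences smaller than this cutoff are easy to produce by chance alone; differences well beyond it are not.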
We now simulate many hypothetical experiments, each with 24 rats whose true reward probability is 0.5.
For each simulated experiment we compute the group mean reward rate.
set.seed(42)
sim_means <- replicate(5000, mean(rbinom(24, 200, 0.5) / 200))
d_sim <- data.table(xbar = sim_means)
ggplot(d_sim, aes(x = xbar)) +
  geom_histogram(bins = 40, fill = COL[1], colour = "white") +
  geom_vline(
    xintercept = mean(d1_rat$reward_rate),
    colour = COL[2], linetype = 1, linewidth = 1
  ) +
  labs(
    title = "Sampling distribution of mean reward rate under H\u2080 (p = 0.5)",
    subtitle = "Orange line = observed mean from Experiment 1",
    x = "Simulated group mean reward rate",
    y = "Count"
  )
This histogram shows the distribution of sample means we would expect if rats were performing at chance.
Notice that the observed mean (orange line) falls far into the upper tail of the simulated null distribution.
This visual comparison is the core logic of inference.
Last week the main question was “what does this variable look like?” This week the main question is “how unusual is this summary if the null hypothesis were true?”
So let us answer that directly.
How often would a group mean at least this large appear if rats were really performing at chance?
obs_mean <- mean(d1_rat$reward_rate)
tail_prop <- mean(sim_means >= obs_mean)
obs_mean
## [1] 0.5177083
tail_prop
## [1] 0.0052
tail_prop is the proportion of simulated null experiments that produced a group mean at least as large as the one we observed.
If this proportion is very small, then the observed mean is rare under the null model.
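Because tail_prop is itself estimated from a finite number of simulations, it carries Monte Carlo noise. A rough binomial standard error for the estimate, using the values printed above:

```r
# Monte Carlo uncertainty in the simulated tail proportion
p_hat <- 0.0052  # tail proportion printed above
n_sim <- 5000    # number of simulated experiments
mc_se <- sqrt(p_hat * (1 - p_hat) / n_sim)
mc_se            # about 0.001
```

So the simulated tail proportion is only pinned down to roughly plus or minus 0.002 (two standard errors); running more simulations would tighten it.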
This gives us a first inferential answer: a group mean this large appeared in only about 0.5% of the simulated null experiments.
This is the same core idea that later appears in formal terms like p-value.
The lecture framed inference as comparing two possible “universes.”
In this tutorial, those universes are a null universe in which rats perform at chance (p = 0.5) and an alternative universe in which rats choose the rewarded arm more often than chance.
The simulation in Part 4 built a model of the null universe.
Now we can ask the decision question clearly:
If the null universe were true, would our observed mean be common or rare?
Our simulation suggests it would be rare.
That means the observed data fit the null universe poorly, so we have evidence in favour of the alternative.
This is the basic logic of statistical inference: compare the observed summary to what the null universe would produce, and treat rare outcomes as evidence against the null.
The important idea is not the label. Whether we call the result a tail probability, a p-value, or evidence against the null, the reasoning is the same.
We can now turn to a different, but related, question.
Instead of asking “is this mean unusually large under the null?” we can ask:
What range of population mean reward rates is plausible given our sample?
Part 4 used simulation to estimate a tail probability directly.
There is another way to make the same decision.
Instead of waiting to see the observed mean and then asking how much null probability lies above it, we can choose a cutoff in advance.
That cutoff is called a critical value.
Because our question is one-sided:
H0: mean reward rate = 0.5
H1: mean reward rate > 0.5
we only care about the upper tail of the null distribution.
If we are willing to tolerate a 5% Type I error rate, then the critical value is the point with only 5% of the null distribution above it.
crit_value_sim <- quantile(sim_means, 0.95)
crit_value_sim
## 95%
## 0.5114583
Now add that cutoff to the simulated null distribution.
ggplot(d_sim, aes(x = xbar)) +
  geom_histogram(bins = 40, fill = COL[1], colour = "white") +
  geom_vline(xintercept = obs_mean, colour = COL[2], linewidth = 1) +
  geom_vline(xintercept = crit_value_sim, colour = COL[3], linetype = 2, linewidth = 1) +
  labs(
    title = "Observed mean and critical value under H0",
    subtitle = "Orange = observed mean, red dashed = 5% critical value",
    x = "Simulated group mean reward rate",
    y = "Count"
  )
Decision logic: if the observed mean exceeds the critical value, reject H0; otherwise, do not. Here the observed mean (0.518) lies above the 5% critical value (0.511), so we reject.
This is exactly the same inferential logic as the tail probability in Part 4. It is just a different way of expressing the same decision.
A confidence interval answers a different question.
Instead of asking:
Is the observed mean unusually large under a null value of 0.5?
it asks:
Which population mean reward rates are plausible given the sample we observed?
We can keep the same normal-style reasoning and use simulation to build intuition.
First estimate the standard error of the observed mean.
n_rats <- nrow(d1_rat)
se_obs <- sd(d1_rat$reward_rate) / sqrt(n_rats)
se_obs
## [1] 0.01106216
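As a sanity check on the sd / sqrt(n) formula, we can watch it predict the spread of repeated sample means. This is a self-contained sketch using standard normal draws, not the rat data:

```r
# The SE formula predicts the spread of repeated sample means
set.seed(7)
xs <- replicate(4000, mean(rnorm(24)))  # 4000 sample means, each of n = 24 draws
sd(xs)        # empirical spread of the sample means
1 / sqrt(24)  # theoretical standard error: sigma / sqrt(n), with sigma = 1
```

The two numbers agree closely, which is exactly what the formula promises.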
Now imagine many hypothetical repeats of the experiment if the observed mean were the true population mean.
set.seed(42)
sim_ci <- rnorm(5000, mean = obs_mean, sd = se_obs)
d_ci <- data.table(xbar = sim_ci)
ci_lower <- quantile(sim_ci, 0.025)
ci_upper <- quantile(sim_ci, 0.975)
ggplot(d_ci, aes(x = xbar)) +
  geom_histogram(bins = 40, fill = COL[1], colour = "white") +
  geom_vline(xintercept = obs_mean, colour = COL[2], linewidth = 1) +
  geom_vline(
    xintercept = c(ci_lower, ci_upper),
    colour = COL[3], linetype = 2, linewidth = 1
  ) +
  labs(
    title = "Confidence interval logic from repeated sampling",
    subtitle = "Orange = observed mean, red dashed = 95% interval bounds",
    x = "Hypothetical repeated sample means",
    y = "Count"
  )
The red dashed lines mark a range of values close to the observed mean that would commonly arise under repeated sampling.
That gives us a 95% confidence interval for the population mean reward rate.
ci_lower
## 2.5%
## 0.4955721
ci_upper
## 97.5%
## 0.5391596
Interpretation: if 0.5 lay outside the interval, that would be evidence that chance performance is not a plausible mean for this dataset.
Here 0.5 falls just inside the lower end of the interval (0.4956, 0.5392), so this two-sided interval by itself does not rule out chance performance.
The lecture showed that there are three closely related ways to express the same inferential decision.
We can now write those three views down formally for the rat example.
Before we do that, an important assumption:
A Normal test assumes that we know the true standard deviation of the sampling distribution under the null hypothesis.
In real data analysis, we usually do not know that value exactly.
Later in the course, we will learn how to proceed when the population standard deviation is unknown. That is where t-tests come in.
For now, we will make a teaching simplification and treat the standard deviation of the null sampling distribution as known.
We are testing:
H0: mu = 0.5
H1: mu > 0.5
The sample statistic is the mean per-rat reward rate.
From the simulated null distribution in Part 4, we can treat that statistic as approximately normal under H0.
mu0 <- 0.5
se_null <- sd(sim_means)
mu0
## [1] 0.5
se_null
## [1] 0.007231074
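The simulated se_null also has a closed form under the binomial model, assuming 24 independent rats each with 200 Bernoulli(p = 0.5) trials:

```r
# Theoretical SE of the group mean under H0:
# sd of one rat's proportion is sqrt(p * (1 - p) / n_trials),
# and averaging over rats divides it by sqrt(n_rats)
se_theory <- sqrt(0.5 * 0.5 / 200) / sqrt(24)
se_theory  # close to the simulated value of about 0.00723
```

The near-perfect agreement is a useful check that the 5000-experiment simulation is behaving as the theory says it should.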
The p-value is the probability, under H0, of observing a mean at least as large as the one we actually saw.
p_value <- pnorm(obs_mean, mean = mu0, sd = se_null, lower.tail = FALSE)
p_value
## [1] 0.007164229
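The same p-value can be written in the standardised z form you will often see in textbooks. A check using the rounded values printed above:

```r
# Standardised version of the same one-sided test
z <- (0.5177083 - 0.5) / 0.007231074  # (obs_mean - mu0) / se_null
z                                     # about 2.45 standard errors above chance
pnorm(z, lower.tail = FALSE)          # same p-value as above (~0.00716)
```

pnorm with a mean and sd, and pnorm of the z score, are two spellings of one computation.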
The critical value is the cutoff that leaves 5% of the null distribution above it.
crit_value <- qnorm(0.95, mean = mu0, sd = se_null)
crit_value
## [1] 0.5118941
If obs_mean > crit_value, reject H0.
For a one-sided test with H1: mu > 0.5, the matching confidence interval is a one-sided lower bound.
ci_lower_1s <- qnorm(0.05, mean = obs_mean, sd = se_obs)
ci_lower_1s
## [1] 0.4995127
If 0.5 is below this lower bound, reject H0.
Note that this bound is built from the sample-based standard error (se_obs) rather than the null standard error (se_null), so it will not line up exactly with the p-value and critical value computed above.
x <- seq(mu0 - 4 * se_null, mu0 + 4 * se_null, length.out = 1000)
fx <- dnorm(x, mean = mu0, sd = se_null)
d_norm <- data.table(x = x, fx = fx)
ggplot(d_norm, aes(x = x, y = fx)) +
  geom_line(linewidth = 1) +
  geom_area(
    data = d_norm[x >= obs_mean],
    aes(x = x, y = fx),
    fill = COL[2], alpha = 0.25
  ) +
  geom_vline(xintercept = obs_mean, colour = COL[2], linewidth = 1) +
  geom_vline(xintercept = crit_value, colour = COL[3], linetype = 2, linewidth = 1) +
  geom_vline(xintercept = ci_lower_1s, colour = "darkgreen", linetype = 2, linewidth = 1) +
  labs(
    title = "Normal test: p-value, critical value, and confidence interval",
    subtitle = "Orange = observed mean, red dashed = critical value, green dashed = lower CI bound",
    x = "Mean reward rate",
    y = "Density"
  )
These three views are equivalent ways to express the same decision: the p-value, the critical value, and the confidence interval all compare the observed mean to what the null model would produce.
The labels differ, but the inferential logic is the same.
In your final project, you will need to answer questions about whether your conditions differ by more than sampling variability alone would produce.
Use this tutorial as a model for how to answer those questions, not just how to list them.
A stronger explanation now sounds like this:
“The trial-level variable was binary, so I first summarised performance as a per-subject reward rate. I then defined a null model and asked whether the observed mean or group difference was unusually large under that model. After that, I used confidence intervals to describe the uncertainty around the estimated means and a plot to communicate those estimates clearly.”
Your assigned dataset is determined by your student ID number. Take the last digit and compute last_digit %% 3 in R.
0: https://github.com/crossley/cogs2020/tree/main/final_project_data/cat_learn_switch
1: https://github.com/crossley/cogs2020/tree/main/final_project_data/cat_learn_auto
2: https://github.com/crossley/cogs2020/tree/main/final_project_data/cat_learn_unlearn
Load your assigned dataset into R.
For this tutorial, your task is to build a visualisation of your assigned dataset and interpret it.
You will use this visualisation and interpretation in your final project.
Today we introduced the core logic of statistical inference: tail probabilities, critical values, and confidence intervals, all built from a simulated null distribution.
This extends last week's question, “what does this variable look like?”, to this week's question, “how unusual is this summary if the null hypothesis were true?”
Next week we will formalise this process into hypothesis tests and use them to make explicit statistical decisions.