# load libraries

library(data.table)
library(ggplot2)

# clear the workspace

rm(list = ls())

# init colorscheme

COL <- c("#2271B2", "#E69F00", "#D55E00")
names(COL) <- c("blue", "orange", "red")
theme_set(
  theme_minimal(base_size = 13) +
    theme(
      panel.grid.minor = element_blank(),
      strip.text = element_text(face = "bold"),
      legend.position = "bottom"
    )
)
update_geom_defaults("point", list(size = 2))
update_geom_defaults("line", list(linewidth = 0.8))

Overview (important)

This tutorial is directly connected to your final project.

In the final project, you will need to:

  • choose particular analyses
  • choose particular visualisations
  • justify why those choices make sense for your data

This tutorial starts building that habit.

We will:

  1. look at real experimental data
  2. treat it explicitly as a random variable
  3. compare it to data generated from known probability distributions
  4. practice explaining why some models make sense and others do not
  5. connect those decisions to the kinds of choices you will make in your project

Key idea for today

A probability distribution is a model describing how data vary.

We do not choose distributions because they are mathematically convenient.

We choose them because they are reasonable descriptions of how data behave.

That means we should always ask:

  • what values can this variable take?
  • what shape does it have?
  • what summary statistics make sense?
  • what assumptions would be defensible later if we used this variable in an analysis?

Part 1 — Load real experimental data

We will work with data from a rat T-maze experiment. Each row records one trial: which direction the rat chose, whether it received a reward, and how quickly it responded.

d <- fread("data/experiment_1_summary.csv")

Inspect the structure.

str(d)
## Classes 'data.table' and 'data.frame':   4800 obs. of  9 variables:
##  $ experiment   : chr  "exp1" "exp1" "exp1" "exp1" ...
##  $ rat_id       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ trial        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ cage_context : chr  "mesh" "wood" "mesh" "mesh" ...
##  $ scent        : chr  "none" "none" "none" "peppermint" ...
##  $ choice       : chr  "L" "R" "L" "L" ...
##  $ reward       : int  1 0 1 1 1 1 0 1 1 1 ...
##  $ reaction_time: num  0.572 0.936 0.457 0.446 0.2 ...
##  $ maze_run_time: num  3.79 3.21 2.14 2.65 2.73 ...
##  - attr(*, ".internal.selfref")=<externalptr>

The columns we will focus on:

  • reaction_time — how long (in seconds) the rat took to initiate its choice after the trial began
  • reward — whether the rat received a reward on that trial (0 = no, 1 = yes)
  • rat_id — which rat produced the observation
  • trial — trial number within the session

Part 2 — Reaction time as a random variable

reaction_time differs from trial to trial and from rat to rat. No two measurements are exactly the same. That is the defining feature of a random variable: a quantity whose value varies in a way governed by some underlying probability process.

We do not know in advance what a rat’s reaction time will be on any given trial. But we can study how those times are distributed across many observations.

A histogram is our first approximation to a probability distribution.

ggplot(d, aes(x = reaction_time)) +
  geom_histogram(bins = 30, fill = COL[1], colour = "white") +
  labs(
    x = "Reaction time (s)",
    y = "Count",
    title = "Distribution of reaction times across all rats and trials"
  )

Think carefully about what you see:

  • Is the distribution symmetric, or does it lean to one side?
  • Is it bounded below (can reaction time be negative)?
  • Are large values common or rare?
  • Does it have a long tail in one direction?

These questions matter later when you justify your analysis choices.

For your project, a histogram like this helps you inspect the shape of the variable before deciding how to summarise or model it.

Questions to consider:

  • If you had to describe the expected value of reaction time in plain language, what would you say it represents here?
  • Which feature of the histogram tells you whether the variance is likely to be small or large?

Part 3 — Describing the distribution numerically

We can summarise the distribution with a few descriptive statistics.

d[, .(
  mean_rt   = mean(reaction_time),
  median_rt = median(reaction_time),
  sd_rt     = sd(reaction_time)
)]
##      mean_rt median_rt     sd_rt
##        <num>     <num>     <num>
## 1: 0.6827906 0.5865538 0.4116576

Important distinctions:

  • these are sample statistics
  • they describe this particular dataset
  • they estimate properties of an underlying process

In lectures, these underlying properties are called parameters. The mean and standard deviation you compute here are estimates of the true characteristics of the process that generated the reaction times.

Notice that the mean and median do not answer exactly the same question:

  • the mean is the arithmetic average
  • the median is the middle value

When a distribution is skewed, these can differ in informative ways.

If the distribution has a long right tail, the mean is often pulled upward more than the median. That is one reason to look at both.
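You can see this pull with a quick simulation. This is an illustrative sketch: the gamma parameters here are arbitrary, chosen only to produce a clearly right-skewed sample.

```r
# simulate a clearly right-skewed sample (arbitrary illustrative parameters)
set.seed(1)
x_skew <- rgamma(10000, shape = 2, scale = 0.5)

# the long right tail pulls the mean above the median
mean(x_skew)    # theoretical mean is shape * scale = 1
median(x_skew)  # theoretical median is lower, around 0.84
```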

Questions to consider:

  • Which summary in the table above is closest to the idea of an expected value?
  • Which summary tells you most directly about the spread or variability of reaction time?

Part 4 — Generating data from known distributions

Now we generate artificial data where we know the distribution exactly.

This is useful because it lets us build intuition about what different shapes look like, and then ask which shape best matches our real data.

For continuous variables such as reaction time, we usually describe the model with a probability density function rather than a probability mass function. The histogram gives you a rough visual summary of that density.
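R's d* functions make this distinction concrete: for a discrete distribution they return probabilities directly, while for a continuous distribution they return densities, and probabilities come from areas under the density. A brief sketch, with parameter values chosen purely for illustration:

```r
# discrete variable: dbinom() returns an actual probability mass
p_discrete <- dbinom(3, size = 10, prob = 0.5)
p_discrete  # about 0.117

# continuous variable: dnorm() returns a density, not a probability --
# it can exceed 1 when the distribution is narrow
dens <- dnorm(0.6, mean = 0.6, sd = 0.1)
dens  # about 3.99

# probabilities for a continuous variable are areas under the density,
# obtained from the cumulative p* functions
p_cont <- pnorm(0.7, mean = 0.6, sd = 0.1) - pnorm(0.5, mean = 0.6, sd = 0.1)
p_cont  # about 0.68: the probability of landing within one sd of the mean
```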

Normal distribution

rnorm() generates random numbers from a normal distribution. Here we ask for 1000 values with mean 0.6 and standard deviation 0.1.

set.seed(42)
x_norm <- rnorm(1000, mean = 0.6, sd = 0.1)
d_norm <- data.table(x = x_norm)

ggplot(d_norm, aes(x)) +
  geom_histogram(bins = 30, fill = COL[1], colour = "white") +
  labs(
    x = "Value",
    y = "Count",
    title = "Simulated data: Normal distribution"
  )

Properties:

  • symmetric around the mean
  • defined by its mean and standard deviation
  • can produce negative values in principle, even when negative values make no physical sense
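How much those impossible negative values matter depends on the parameters. A sketch using pnorm(); the larger sd value is chosen purely for illustration:

```r
# with mean 0.6 and sd 0.1, a negative value is astronomically unlikely
pnorm(0, mean = 0.6, sd = 0.1)   # about 1e-9

# with a larger sd (0.4, an illustrative value), the model assigns
# real probability to impossible negative reaction times
pnorm(0, mean = 0.6, sd = 0.4)   # about 0.067, roughly 7% of draws
```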

Questions:

  • Would a normal distribution be a sensible model for reaction time? Why or why not?
  • If you increased the standard deviation but kept the mean fixed, what would happen to the variance and the overall spread of the histogram?

Uniform distribution

runif() generates random numbers that are equally likely across a range. Here every value between 0 and 1.2 is equally likely.

set.seed(42)
x_unif <- runif(1000, min = 0, max = 1.2)
d_unif <- data.table(x = x_unif)

ggplot(d_unif, aes(x)) +
  geom_histogram(bins = 30, fill = COL[1], colour = "white") +
  labs(
    x = "Value",
    y = "Count",
    title = "Simulated data: Uniform distribution"
  )

Properties:

  • all values in the range are equally likely
  • there is no strong centre
  • it has hard boundaries at both ends
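For a Uniform(a, b) distribution the moments have closed forms: the mean is (a + b) / 2 and the variance is (b - a)^2 / 12. A sketch checking the simulation above against those formulas:

```r
set.seed(42)
x_unif <- runif(1000, min = 0, max = 1.2)

# theoretical mean and variance of Uniform(0, 1.2)
(0 + 1.2) / 2      # mean: 0.6
(1.2 - 0)^2 / 12   # variance: 0.12

mean(x_unif)  # close to 0.6
var(x_unif)   # close to 0.12
```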

Questions:

  • Does this look anything like how reaction times actually behave?
  • Compared with the normal example above, does the uniform distribution look as though it has a larger or smaller variance? Why?

Gamma distribution

rgamma() generates positive, right-skewed values. Here the shape is concentrated away from zero but still has a long right tail.

set.seed(42)
x_gamma <- rgamma(1000, shape = 7, scale = 0.09)
d_gamma <- data.table(x = x_gamma)

ggplot(d_gamma, aes(x)) +
  geom_histogram(bins = 30, fill = COL[1], colour = "white") +
  labs(
    x = "Value",
    y = "Count",
    title = "Simulated data: Gamma distribution"
  )

Properties:

  • bounded below by 0
  • positively skewed, with a longer right tail
  • often more plausible for timing measures than a symmetric normal distribution
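For a gamma distribution parameterised by shape and scale, the expected value is shape × scale and the variance is shape × scale². A sketch linking the simulation parameters above to these theoretical moments:

```r
shape <- 7
scale <- 0.09

# theoretical moments of the gamma distribution
shape * scale     # expected value: 0.63
shape * scale^2   # variance: 0.0567

set.seed(42)
x_gamma <- rgamma(1000, shape = shape, scale = scale)
mean(x_gamma)  # close to 0.63
var(x_gamma)   # close to 0.0567
```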

Questions:

  • Which features of this gamma distribution make it more or less plausible for reaction time?
  • Compared with the normal example, does the gamma distribution appear to have a similar expected value but a different variance, or a different expected value as well? Explain your reasoning from the plot.

An exponential distribution is an even more strongly right-skewed positive distribution. It is useful to know about, but many real RT distributions have a clearer peak away from zero, so the gamma is often a better teaching example here.

Estimate the parameters by eye

One useful way to build intuition is to compare the observed RT distribution with several candidate distributions that differ in shape and tail length.

The goal is not to find a perfect answer. The goal is to practice noticing which parameters make a candidate distribution look too narrow, too wide, too symmetric, or too heavy-tailed.

set.seed(42)
d_candidates <- rbindlist(list(
  data.table(candidate = "A: shape = 2.5, scale = 0.27", x = rgamma(1000, shape = 2.5, scale = 0.27)),
  data.table(candidate = "B: shape = 4, scale = 0.17",   x = rgamma(1000, shape = 4, scale = 0.17)),
  data.table(candidate = "C: shape = 6, scale = 0.115",  x = rgamma(1000, shape = 6, scale = 0.115)),
  data.table(candidate = "D: shape = 9, scale = 0.075",  x = rgamma(1000, shape = 9, scale = 0.075))
))

ggplot(d_candidates, aes(x)) +
  geom_histogram(bins = 30, fill = COL[1], colour = "white") +
  facet_wrap(~ candidate, ncol = 2) +
  labs(
    x = "Value",
    y = "Count",
    title = "Candidate gamma distributions"
  )

Your task:

  • Compare the observed RT histogram with candidates A-D.
  • Decide which candidate looks closest overall.
  • Justify the choice using three features: where the peak is, how spread out the values are, and how heavy the right tail is.
  • Decide which candidate seems to have the largest expected value.
  • Decide which candidate seems to have the largest variance.

Use this comparison to practice estimating parameters by eye. Focus on which candidate has the closest peak, spread, and tail.
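After making your visual choice, you can check it against the theoretical moments of the four candidates, using the same formulas as before (mean = shape × scale, variance = shape × scale²). A sketch with the parameter values from the simulation above:

```r
# same candidate parameters as the simulation above
candidates <- data.frame(
  candidate = c("A", "B", "C", "D"),
  shape     = c(2.5, 4, 6, 9),
  scale     = c(0.27, 0.17, 0.115, 0.075)
)

candidates$mean_theory <- candidates$shape * candidates$scale
candidates$var_theory  <- candidates$shape * candidates$scale^2
candidates
```

Notice that the four candidates have nearly identical theoretical means (all close to 0.68); what separates them is the variance, and therefore the spread and the heaviness of the right tail.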


Part 5 — Comparing real data to models

Plot the reaction time data again.

ggplot(d, aes(x = reaction_time)) +
  geom_histogram(bins = 30, fill = COL[1], colour = "white") +
  labs(
    x = "Reaction time (s)",
    y = "Count",
    title = "Observed reaction times - which distribution does this resemble?"
  )

Now compare it qualitatively to the three distributions above:

  • normal
  • uniform
  • gamma

Key question:

Which of these distributions looks like a plausible description of how rat reaction times vary?

There is no single perfect answer here, but some answers are much easier to defend than others. The shape of the histogram and the physical constraints on reaction time should guide your reasoning.

A stronger justification would mention both:

  • the type of variable: reaction time is continuous and cannot be negative
  • the shape of the data: short and moderate RTs are common, with a longer tail toward slower responses

For your project, a defensible answer would sound like this:

“I would summarise reaction time with a histogram and numerical summaries such as the mean and median because it is a continuous variable. A normal distribution is not ideal because the RTs are bounded below and slightly right-skewed. A gamma-like model is more plausible because the data are positive and have a longer right tail.”


Part 6 — A different kind of variable: reward

Not all variables in this dataset behave like reaction time. Consider reward.

d[, .N, by = reward]
##    reward     N
##     <int> <int>
## 1:      1  2485
## 2:      0  2315

reward takes only two values: 0 or 1. Every trial is either a success or a failure. This kind of variable is called binary or dichotomous.

The appropriate probability model for a single binary observation is the Bernoulli distribution.

The Bernoulli distribution has one parameter:

  • p = the probability that the outcome is 1

For a Bernoulli variable, the expected value is the probability of observing a 1, and the variance tells you how much trial outcomes fluctuate around that expected value.
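These properties (expected value p, variance p(1 − p)) can be checked by simulation with rbinom(), using size = 1 so that each draw is a single Bernoulli trial. A sketch; the value of p here is illustrative, close to the observed reward proportion but not estimated formally:

```r
set.seed(42)
p <- 0.52   # illustrative value, roughly the observed reward proportion

# simulate many single Bernoulli trials (size = 1)
x_bern <- rbinom(10000, size = 1, prob = p)

mean(x_bern)  # close to the expected value, p
var(x_bern)   # close to the theoretical variance, p * (1 - p) = 0.2496

# the variance is largest at p = 0.5 and shrinks as p approaches 0 or 1
p_grid <- c(0.05, 0.50, 0.95)
p_grid * (1 - p_grid)  # 0.0475 0.2500 0.0475
```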

This matters because model choice begins with the kind of variable you have:

  • for reaction time, values vary continuously across a range
  • for reward, only two outcomes are possible on each trial

Questions to consider:

  • In this experiment, what does the expected value of reward mean in plain language?
  • If p increased, what would happen to the expected reward on a single trial?
  • When would the variance of a Bernoulli variable be relatively small, and when would it be larger?

When many Bernoulli trials are considered together, the total number of rewarded trials follows a Binomial distribution.

The Binomial distribution is discrete. For a rat that completed n trials, the number of rewarded trials can only be 0, 1, 2, …, n.

For this dataset, each rat completed 200 trials, so the total reward count for a rat is a Binomial-style count variable.

For a Binomial variable, the expected value is the expected number of rewarded trials out of n, and the variance tells you how much those counts should vary from rat to rat when the underlying process is the same.
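The Binomial analogues are expected value n × p and variance n × p × (1 − p). A sketch with n = 200 trials per rat and the same illustrative p as before:

```r
set.seed(42)
n_trials <- 200
p <- 0.52   # illustrative value

# expected count and theoretical variance for one rat
n_trials * p              # expected value: 104
n_trials * p * (1 - p)    # variance: 49.92

# simulate reward counts for 24 hypothetical rats of 200 trials each
counts <- rbinom(24, size = n_trials, prob = p)
counts        # whole numbers only, scattered around 104
mean(counts)  # close to 104
```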

d_rat_count <- d[, .(reward_count = sum(reward), n_trials = .N), .(rat_id)]
d_count_dist <- d_rat_count[, .N, by = reward_count][order(reward_count)]

ggplot(d_count_dist, aes(x = reward_count, y = N)) +
  geom_segment(aes(xend = reward_count, yend = 0), colour = COL[1], linewidth = 0.8) +
  geom_point(colour = COL[1], size = 2) +
  labs(
    x = "Rewarded trials per rat",
    y = "Number of rats",
    title = "Observed reward counts across rats"
  )

This plot is useful because it shows the discrete nature of the variable clearly: only whole-number counts are possible.

Questions to consider:

  • If each rat completed 200 trials with reward probability p, what would the expected reward count represent?
  • If two experiments had the same expected reward count but different variances, what would look different in the distribution of counts?

The same behaviour can also be represented at a more aggregated level. Instead of looking at the total reward count across all trials, you can summarise reward within blocks of trials and work with a block-level reward rate.

d[, block := ceiling(trial / 25)]
d_block <- d[, .(reward_rate = mean(reward)), .(rat_id, block)]

ggplot(d_block, aes(x = reward_rate)) +
  geom_histogram(bins = 16, fill = COL[1], colour = "white") +
  labs(
    x = "Reward rate within block",
    y = "Count",
    title = "Block-level reward rates"
  )

These three representations support different questions:

  • trial-level reward asks whether a single observation is reward or no reward, so a Bernoulli model makes sense
  • per-rat reward count asks how many rewarded trials occurred out of n, so a Binomial model makes sense
  • block-level reward rate asks how performance changes when trials are summarised into proportions, which is often a useful way to visualise learning or category accuracy

This is directly relevant to final projects on category learning, where the same raw behaviour can often be represented as:

  • a correct or incorrect response on each trial
  • a total number correct out of n trials
  • an average accuracy score across blocks, conditions, or sessions

This is the same logic you will later use when you justify:

  • why you used a mean or a proportion
  • why you used one plot rather than another
  • why one statistical test made more sense than another

Part 7 — Sampling variability

Even when the underlying process is identical, samples differ from one another. This is called sampling variability.

We will use a few new R functions here, so define them clearly first:

  • set.seed() fixes the random-number starting point so your results can be reproduced.
  • rgamma() generates random values from a gamma distribution.
  • replicate() repeats the same command many times.
  • sample() draws values from an existing vector.

First, generate two samples from the same gamma distribution and compare their means.

set.seed(42)
x1 <- rgamma(1000, shape = 7, scale = 0.09)
x2 <- rgamma(1000, shape = 7, scale = 0.09)

mean(x1)
## [1] 0.6233283
mean(x2)
## [1] 0.6334993

The means are similar, but not identical, even though both samples came from the same process.

Now use replicate() to repeat that same operation several times.

set.seed(42)
replicate(5, mean(rgamma(1000, shape = 7, scale = 0.09)))
## [1] 0.6233283 0.6334993 0.6297199 0.6134440 0.6447908

This shows that the sample mean itself varies from sample to sample.
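The spread of these repeated sample means can itself be summarised. Theory says the standard deviation of the sample mean (the standard error) is the sd of a single draw divided by sqrt(n); a sketch checking this for the same gamma process:

```r
set.seed(42)

# repeat the sampling process many times and keep each sample mean
means <- replicate(2000, mean(rgamma(1000, shape = 7, scale = 0.09)))

# theory: sd of the sample mean (the standard error) is sd / sqrt(n)
sd_draw <- sqrt(7) * 0.09            # sd of one gamma draw, about 0.238
se_theory <- sd_draw / sqrt(1000)    # about 0.0075

sd(means)  # close to se_theory
```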

We can see the same idea in the observed RT data. Here sample() draws 100 RTs from the dataset, and replace = TRUE means each draw is made with replacement, so the same value could be selected more than once.

set.seed(42)
replicate(5, mean(sample(d$reaction_time, size = 100, replace = TRUE)))
## [1] 0.7100559 0.7548614 0.6874426 0.7249125 0.7307430

Important idea:

  • data vary because of randomness in the underlying process
  • statistics such as the mean vary because samples vary
  • models describe the process, not just one realised dataset

This idea becomes central when we begin hypothesis testing.

When you later compare two group means, the question is not only “are they different?” It is also “is that difference larger than we would expect from sampling variability alone?”

This is also where visual estimation meets inference. A candidate distribution might look like a reasonable match by eye, but repeated samples from the same process still vary. That is why visual judgement is useful for building intuition, but not sufficient for making a formal inferential decision.


Why this matters for your final project

In your final project, you will need to answer questions like:

  • Why did you summarise the data this way?
  • Why did you choose this visualisation?
  • Why did you use this statistical test?
  • What assumptions are you making about the data?
  • What does the expected value mean for this variable?
  • How variable is the process around that expected value?

Use this tutorial to answer those questions more concretely.

A stronger project-style answer might look like this:

“I started by plotting the distribution of the dependent variable because the shape of the data affects both summary statistics and model choice. For reaction time, a histogram was useful because RT is continuous and positive. The distribution looked right-skewed, so a symmetric normal model would be harder to defend than a positive right-skewed model. That reasoning also affects which summaries and tests would be appropriate later.”

The answer always begins with understanding how the data vary.


Apply this to your assigned dataset

Your assigned dataset is determined by your student ID number. Take the last digit of your student ID and compute last_digit %% 3 in R.

Download the data files from your assigned repository. Load one of the key data files into R as a data.table.

For this tutorial, your task is:

  1. Identify one key dependent variable in your dataset.
  2. State what type of variable it is (for example: continuous, binary, count).
  3. Visualise its distribution using a plot you can defend.
  4. Compute summary statistics that match that variable type.
  5. Write 2-3 sentences explaining:
    1. why you chose that summary
    2. why you chose that visualisation
    3. what distributional assumptions seem most plausible at this stage

You will use this justification in your final project to explain your analysis choices.


Stop here

Today we did not perform hypothesis tests.

We focused on:

  • what random variables are
  • how data form distributions
  • how models describe variability
  • how variable type constrains model choice
  • how to begin defending analysis choices

Next week we will use these ideas to decide when observed differences are larger than we would expect by chance.