# load libraries
library(data.table)
library(ggplot2)
# clean work space
rm(list = ls())
# init colorscheme
COL <- c("#2271B2", "#E69F00", "#D55E00")
names(COL) <- c("blue", "orange", "red")
theme_set(
theme_minimal(base_size = 13) +
theme(
panel.grid.minor = element_blank(),
strip.text = element_text(face = "bold"),
legend.position = "bottom"
)
)
update_geom_defaults("point", list(size = 2))
update_geom_defaults("line", list(linewidth = 0.8))
This tutorial is directly connected to your final project. In the final project, you will need to justify the analysis choices you make. This tutorial starts building that habit.
A probability distribution is a model describing how data vary.
We do not choose distributions because they are mathematically convenient.
We choose them because they are reasonable descriptions of how data behave.
That means we should always ask whether a distribution is a reasonable description of the data at hand.
We will work with data from a rat T-maze experiment. Each row records one trial: which direction the rat chose, whether it received a reward, and how quickly it responded.
d <- fread("data/experiment_1_summary.csv")
Inspect the structure.
str(d)
## Classes 'data.table' and 'data.frame': 4800 obs. of 9 variables:
## $ experiment : chr "exp1" "exp1" "exp1" "exp1" ...
## $ rat_id : int 1 1 1 1 1 1 1 1 1 1 ...
## $ trial : int 1 2 3 4 5 6 7 8 9 10 ...
## $ cage_context : chr "mesh" "wood" "mesh" "mesh" ...
## $ scent : chr "none" "none" "none" "peppermint" ...
## $ choice : chr "L" "R" "L" "L" ...
## $ reward : int 1 0 1 1 1 1 0 1 1 1 ...
## $ reaction_time: num 0.572 0.936 0.457 0.446 0.2 ...
## $ maze_run_time: num 3.79 3.21 2.14 2.65 2.73 ...
## - attr(*, ".internal.selfref")=<externalptr>
The columns we will focus on:

- reaction_time — how long (in seconds) the rat took to initiate its choice after the trial began
- reward — whether the rat received a reward on that trial (0 = no, 1 = yes)
- rat_id — which rat produced the observation
- trial — trial number within the session

reaction_time differs from trial to trial and from rat to rat. No two measurements are exactly the same. That is the defining feature of a random variable: a quantity whose value varies in a way governed by some underlying probability process.
We do not know in advance what a rat’s reaction time will be on any given trial. But we can study how those times are distributed across many observations.
A histogram is our first approximation to a probability distribution.
ggplot(d, aes(x = reaction_time)) +
geom_histogram(bins = 30, fill = COL[1], colour = "white") +
labs(
x = "Reaction time (s)",
y = "Count",
title = "Distribution of reaction times across all rats and trials"
)
Think carefully about what you see. Where does the peak sit? How wide is the spread? Is the shape symmetric or skewed?
These questions matter later when you justify your analysis choices.
For your project, a histogram like this helps you inspect the shape of the variable before deciding how to summarise or model it.
Questions to consider:
We can summarise the distribution with a few descriptive statistics.
d[, .(
mean_rt = mean(reaction_time),
median_rt = median(reaction_time),
sd_rt = sd(reaction_time)
)]
## mean_rt median_rt sd_rt
## <num> <num> <num>
## 1: 0.6827906 0.5865538 0.4116576
Important distinction: these numbers are computed from a sample, so they are not the same as the underlying properties of the process that generated the data.
In lectures, these underlying properties are called parameters. The mean and standard deviation you compute here are estimates of the true characteristics of the process that generated the reaction times.
Notice that the mean and median do not answer exactly the same question:
When a distribution is skewed, these can differ in informative ways.
If the distribution has a long right tail, the mean is often pulled upward more than the median. That is one reason to look at both.
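A small simulation makes this concrete. The gamma parameters below are arbitrary, chosen only to produce a right-skewed sample:

```r
set.seed(42)
x <- rgamma(1000, shape = 2, scale = 0.3)  # right-skewed toy data

mean(x)    # pulled upward by the long right tail
median(x)  # sits closer to the bulk of the values
```

In a right-skewed sample like this one, the mean comes out larger than the median, which is exactly the pattern to look for in the reaction-time histogram.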
Questions to consider:
Now we generate artificial data where we know the distribution exactly.
This is useful because it lets us build intuition about what different shapes look like, and then ask which shape best matches our real data.
For continuous variables such as reaction time, we usually describe the model with a probability density function rather than a probability mass function. The histogram gives you a rough visual summary of that density.
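One quick check on this distinction: a density value is not a probability. It can exceed 1, and probabilities come from the area under the density over an interval. The base R functions `dnorm()` and `pnorm()` show this for a normal distribution with the same parameters used in the simulation below:

```r
# Height of the normal density (mean 0.6, sd 0.1) at its peak:
dnorm(0.6, mean = 0.6, sd = 0.1)              # about 3.99 -- not a probability

# Probability of an interval comes from the cumulative distribution function:
pnorm(0.7, 0.6, 0.1) - pnorm(0.5, 0.6, 0.1)   # P(0.5 < X < 0.7), about 0.68
```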
rnorm() generates random numbers from a normal
distribution. Here we ask for 1000 values with mean
0.6 and standard deviation 0.1.
set.seed(42)
x_norm <- rnorm(1000, mean = 0.6, sd = 0.1)
d_norm <- data.table(x = x_norm)
ggplot(d_norm, aes(x)) +
geom_histogram(bins = 30, fill = COL[1], colour = "white") +
labs(
x = "Value",
y = "Count",
title = "Simulated data: Normal distribution"
)
Properties:

- symmetric and bell-shaped, centred on the mean (0.6 here)
- spread controlled by the standard deviation (0.1 here)
- values can, in principle, fall anywhere on the real line, including below zero

Questions:

- Would a normal distribution be a sensible model for reaction time? Why or why not?
- If you increased the standard deviation but kept the mean fixed, what would happen to the variance and the overall spread of the histogram?
runif() generates random numbers that are equally likely
across a range. Here every value between 0 and
1.2 is equally likely.
set.seed(42)
x_unif <- runif(1000, min = 0, max = 1.2)
d_unif <- data.table(x = x_unif)
ggplot(d_unif, aes(x)) +
geom_histogram(bins = 30, fill = COL[1], colour = "white") +
labs(
x = "Value",
y = "Count",
title = "Simulated data: Uniform distribution"
)
Properties:

- flat: every value between the minimum and maximum (0 and 1.2 here) is equally likely
- no peak, and no values outside the range

Questions:

- Does this look anything like how reaction times actually behave?
- Compared with the normal example above, does the uniform distribution look as though it has a larger or smaller variance? Why?
rgamma() generates positive, right-skewed values. Here
the shape is concentrated away from zero but still has a long right
tail.
set.seed(42)
x_gamma <- rgamma(1000, shape = 7, scale = 0.09)
d_gamma <- data.table(x = x_gamma)
ggplot(d_gamma, aes(x)) +
geom_histogram(bins = 30, fill = COL[1], colour = "white") +
labs(
x = "Value",
y = "Count",
title = "Simulated data: Gamma distribution"
)
Properties:

- values are strictly positive
- right-skewed, with a long right tail
- the peak sits away from zero

Questions:

- Which features of this gamma distribution make it more or less plausible for reaction time?
- Compared with the normal example, does the gamma distribution appear to have a similar expected value but a different variance, or a different expected value as well? Explain your reasoning from the plot.
An exponential distribution is an even more strongly right-skewed positive distribution. It is useful to know about, but many real RT distributions have a clearer peak away from zero, so the gamma is often a better teaching example here.
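You can see the difference numerically. The sketch below simulates an exponential with the same mean as the gamma example (the rate is chosen so the mean is 7 * 0.09 = 0.63); unlike the gamma above, the exponential's density is highest at zero, so its most common values sit right at the lower boundary:

```r
set.seed(42)
x_exp <- rexp(1000, rate = 1 / 0.63)  # mean 0.63, matching the gamma example

mean(x_exp)        # close to 0.63
mean(x_exp < 0.2)  # a substantial share of values pile up near zero
```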
One useful way to build intuition is to compare the observed RT distribution with several candidate distributions that differ in shape and tail length.
The goal is not to find a perfect answer. The goal is to practice noticing which parameters make a candidate distribution look too narrow, too wide, too symmetric, or too heavy-tailed.
set.seed(42)
d_candidates <- rbindlist(list(
data.table(candidate = "A: shape = 2.5, scale = 0.27", x = rgamma(1000, shape = 2.5, scale = 0.27)),
data.table(candidate = "B: shape = 4, scale = 0.17", x = rgamma(1000, shape = 4, scale = 0.17)),
data.table(candidate = "C: shape = 6, scale = 0.115", x = rgamma(1000, shape = 6, scale = 0.115)),
data.table(candidate = "D: shape = 9, scale = 0.075", x = rgamma(1000, shape = 9, scale = 0.075))
))
ggplot(d_candidates, aes(x)) +
geom_histogram(bins = 30, fill = COL[1], colour = "white") +
facet_wrap(~ candidate, ncol = 2) +
labs(
x = "Value",
y = "Count",
title = "Candidate gamma distributions"
)
Your task:
Use this comparison to practice estimating parameters by eye. Focus on which candidate has the closest peak, spread, and tail.
Plot the reaction time data again.
ggplot(d, aes(x = reaction_time)) +
geom_histogram(bins = 30, fill = COL[1], colour = "white") +
labs(
x = "Reaction time (s)",
y = "Count",
title = "Observed reaction times - which distribution does this resemble?"
)
Now compare it qualitatively to the three simulated distributions above: the normal, the uniform, and the gamma.
Key question:
Which of these distributions looks like a plausible description of how rat reaction times vary?
There is no single perfect answer here, but some answers are much easier to defend than others. The shape of the histogram and the physical constraints on reaction time should guide your reasoning.
A stronger justification would mention both the shape of the histogram and the physical constraints on the variable. For your project, a defensible answer would sound like this:
“I would summarise reaction time with a histogram and numerical summaries such as the mean and median because it is a continuous variable. A normal distribution is not ideal because the RTs are bounded below and slightly right-skewed. A gamma-like model is more plausible because the data are positive and have a longer right tail.”
Not all variables in this dataset behave like reaction time. Consider
reward.
d[, .N, by = reward]
## reward N
## <int> <int>
## 1: 1 2485
## 2: 0 2315
reward takes only two values: 0 or
1. Every trial is either a success or a failure. This kind
of variable is called binary or
dichotomous.
The appropriate probability model for a single binary observation is the Bernoulli distribution.
The Bernoulli distribution has one parameter:
- p = the probability that the outcome is 1

For a Bernoulli variable, the expected value is the probability of observing a 1, and the variance tells you how much trial outcomes fluctuate around that expected value.
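A minimal simulation makes these formulas concrete. The value of p below is an assumption for the sketch, chosen to be near the observed reward rate:

```r
set.seed(42)
p <- 0.52                               # assumed reward probability
x <- rbinom(10000, size = 1, prob = p)  # 10,000 simulated Bernoulli trials

mean(x)  # estimates the expected value, p
var(x)   # estimates the variance, p * (1 - p) = 0.2496
```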
This matters because model choice begins with the kind of variable you have: continuous (like reaction_time), binary (like reward), or a count (like the number of rewarded trials).
Questions to consider:
- What is the reward mean in plain language?
- If p increased, what would happen to the expected reward on a single trial?

When many Bernoulli trials are considered together, the total number of rewarded trials follows a Binomial distribution.
The Binomial distribution is discrete. For a rat
that completed n trials, the number of rewarded trials can
only be 0, 1, 2, …,
n.
For this dataset, each rat completed 200 trials, so the total reward count for a rat is a Binomial-style count variable.
For a Binomial variable, the expected value is the expected number of
rewarded trials out of n, and the variance tells you how
much those counts should vary from rat to rat when the underlying
process is the same.
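Before looking at the real counts, a simulation can make these formulas concrete. Both n and p here are assumptions for the sketch (n matches the 200 trials per rat; p is chosen to be near the observed reward rate):

```r
set.seed(42)
n <- 200                                    # trials per rat
p <- 0.52                                   # assumed reward probability
counts <- rbinom(5000, size = n, prob = p)  # reward counts for 5000 simulated rats

mean(counts)  # near the expected value, n * p = 104
var(counts)   # near the variance, n * p * (1 - p) = 49.92
```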
d_rat_count <- d[, .(reward_count = sum(reward), n_trials = .N), .(rat_id)]
d_count_dist <- d_rat_count[, .N, by = reward_count][order(reward_count)]
ggplot(d_count_dist, aes(x = reward_count, y = N)) +
geom_segment(aes(xend = reward_count, yend = 0), colour = COL[1], linewidth = 0.8) +
geom_point(colour = COL[1], size = 2) +
labs(
x = "Rewarded trials per rat",
y = "Number of rats",
title = "Observed reward counts across rats"
)
This plot is useful because it shows the discrete nature of the variable clearly: only whole-number counts are possible.
Questions to consider:
- If a rat completes 200 trials with reward probability p, what would the expected reward count represent?

The same behaviour can also be represented at a more aggregated level. Instead of looking at the total reward count across all trials, you can summarise reward within blocks of trials and work with a block-level reward rate.
d[, block := ceiling(trial / 25)]
d_block <- d[, .(reward_rate = mean(reward)), .(rat_id, block)]
ggplot(d_block, aes(x = reward_rate)) +
geom_histogram(bins = 16, fill = COL[1], colour = "white") +
labs(
x = "Reward rate within block",
y = "Count",
title = "Block-level reward rates"
)
These three representations support different questions:

- trial-level reward is a binary outcome, so a Bernoulli model makes sense
- the per-rat reward count is a whole number out of n, so a Binomial model makes sense
- the block-level reward rate is a proportion, which is useful for summarising behaviour within parts of a session

This is directly relevant to final projects on category learning, where the same raw behaviour can often be represented as a binary outcome on each trial, a count of successes out of n trials, or a block-level rate.

This is the same logic you will later use when you justify your choice of summaries and models.
Even when the underlying process is identical, samples differ from one another. This is called sampling variability.
We will use a few new R functions here, so define them clearly first:
- set.seed() fixes the random-number starting point so your results can be reproduced.
- rgamma() generates random values from a gamma distribution.
- replicate() repeats the same command many times.
- sample() draws values from an existing vector.

First, generate two samples from the same gamma distribution and compare their means.
set.seed(42)
x1 <- rgamma(1000, shape = 7, scale = 0.09)
x2 <- rgamma(1000, shape = 7, scale = 0.09)
mean(x1)
## [1] 0.6233283
mean(x2)
## [1] 0.6334993
The means are similar, but not identical, even though both samples came from the same process.
Now use replicate() to repeat that same operation
several times.
set.seed(42)
replicate(5, mean(rgamma(1000, shape = 7, scale = 0.09)))
## [1] 0.6233283 0.6334993 0.6297199 0.6134440 0.6447908
This shows that the sample mean itself varies from sample to sample.
We can see the same idea in the observed RT data. Here
sample() draws 100 RTs from the dataset, and
replace = TRUE means each draw is made with replacement, so
the same value could be selected more than once.
set.seed(42)
replicate(5, mean(sample(d$reaction_time, size = 100, replace = TRUE)))
## [1] 0.7100559 0.7548614 0.6874426 0.7249125 0.7307430
Important idea: the sample mean is itself a random variable, and it varies from sample to sample.
This idea becomes central when we begin hypothesis testing.
When you later compare two group means, the question is not only “are they different?” It is also “is that difference larger than we would expect from sampling variability alone?”
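As a preview, here is a minimal sketch of that idea: two groups generated by an identical process still produce different means, so any observed difference has to be judged against this baseline of sampling variability.

```r
set.seed(42)
# Two groups generated by the SAME process
g1 <- rgamma(100, shape = 7, scale = 0.09)
g2 <- rgamma(100, shape = 7, scale = 0.09)

mean(g1) - mean(g2)  # nonzero purely because of sampling variability
```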
This is also where visual estimation meets inference. A candidate distribution might look like a reasonable match by eye, but repeated samples from the same process still vary. That is why visual judgement is useful for building intuition, but not sufficient for making a formal inferential decision.
In your final project, you will need to answer questions like: Which distribution plausibly describes your dependent variable? Which summary statistics are appropriate for it?
Use this tutorial to answer those questions more concretely.
A stronger project-style answer might look like this:
“I started by plotting the distribution of the dependent variable because the shape of the data affects both summary statistics and model choice. For reaction time, a histogram was useful because RT is continuous and positive. The distribution looked right-skewed, so a symmetric normal model would be harder to defend than a positive right-skewed model. That reasoning also affects which summaries and tests would be appropriate later.”
The answer always begins with understanding how the data vary.
Your assigned dataset is determined by your student ID number. Take
the last digit of your student ID and compute
last_digit %% 3 in R.
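For example, if the last digit of your student ID were 7, the computation would be:

```r
7 %% 3  # evaluates to 1, so you would use the dataset listed under 1
```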
- 0: https://github.com/crossley/cogs2020/tree/main/final_project_data/cat_learn_switch
- 1: https://github.com/crossley/cogs2020/tree/main/final_project_data/cat_learn_auto
- 2: https://github.com/crossley/cogs2020/tree/main/final_project_data/cat_learn_unlearn

Download the data files from your assigned repository. Load one of the key data files into R as a data.table.
For this tutorial, your task is to plot the distribution of a key variable in your assigned dataset and justify which probability distribution plausibly describes it.
You will use this justification in your final project to explain your analysis choices.
Today we did not perform hypothesis tests.
We focused on describing how data vary: inspecting distributions, comparing them with candidate probability models, and seeing how sampling variability makes even identical processes produce different samples.
Next week we will use these ideas to decide when observed differences are larger than we would expect by chance.