Tutorial 5 - Random Variables and Probability Distributions

Learning objective

You have just handed in homework 1, which involved heaps of simple data wrangling and ggplotting, so we will take a short break from that and focus on the theory of random variables and probability distributions.
You should aim to grapple with the practice problems below, but first inspect the upcoming quiz so that you can allocate your time and attention appropriately.

Work through these practice exercises (1)

It’s a good idea to work through these on your own, but if you get very stuck, solutions can be found here

1.

Consider an experiment in which two four-sided dice are thrown.

(a) Create a data.table named ans_1a with two columns labelled die1 and die2, where these columns encode the sample space of this experiment. As a hint, the following shows what ans_1a[1:8] should produce for you. It’s up to you to determine how many observations in total ans_1a should have.

##     die1  die2
##    <num> <num>
## 1:     1     1
## 2:     2     1
## 3:     3     1
## 4:     4     1
## 5:     1     2
## 6:     2     2
## 7:     3     2
## 8:     4     2

(b) Create a deep copy of ans_1a called ans_1b and then add a column to ans_1b called X and set it equal to the sum of the corresponding die1 and die2 values on a per row basis.

(c) Let \(X\) be the random variable defined by the X column in ans_1b. What is \(P(X<5)\)? Store your result in a variable named ans_1c.

(d) Is the event defined as the set of all outcomes for which \(X=2\) an elementary event? Store your answer ("YES" or "NO") in a variable named ans_1d.
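As a warm-up on a different experiment, the following sketch enumerates the sample space of two coin flips, derives a random variable from it, and computes an event probability by counting equally likely outcomes. The names `coins`, `H`, and `p_at_least_one` are illustrative and not part of the tutorial. The block uses base R only so that it runs without extra packages; it does not produce the data.table the problem asks for.

```r
# Sample space of two fair coin flips (0 = tails, 1 = heads).
# expand.grid() varies the first column fastest.
coins <- expand.grid(coin1 = 0:1, coin2 = 0:1)

# Random variable H: the number of heads in each outcome.
coins$H <- coins$coin1 + coins$coin2

# With equally likely outcomes, P(event) is the share of rows in the event.
p_at_least_one <- mean(coins$H >= 1)
```

The same pattern (enumerate outcomes, add a derived column, take the mean of a logical condition) carries over to other experiments with equally likely outcomes.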

2.

For the following probability distribution:

\(x\)	2	3	4	5	6
\(f(x)\)	.1	.3	.3	.2	.1

(a) Is \(f(x)\) a valid probability distribution? Store your answer ("YES" or "NO") in a variable named ans_2a.

(b) Calculate \(E(X)\) and store the result in a variable named ans_2b.

(c) Calculate \(P(X \geq 4)\) and store the result in a variable named ans_2c.

(d) Calculate \(P(2 < X \leq 4)\) and store the result in a variable named ans_2d.
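The quantities in this problem all reduce to sums over the table of values and probabilities. The sketch below shows one way to do this for a made-up distribution (the vectors `x` and `f` are hypothetical and deliberately differ from the table above): a validity check, the expected value \(E(X)=\sum_x x\,f(x)\), and two event probabilities.

```r
# Hypothetical discrete distribution (not the one from the problem).
x <- c(0, 1, 2, 3)
f <- c(0.1, 0.2, 0.3, 0.4)

# Valid if every probability is non-negative and they sum to 1.
# all.equal() guards against floating-point rounding in the sum.
is_valid <- all(f >= 0) && isTRUE(all.equal(sum(f), 1))

# Expected value: probability-weighted sum of the values.
e_x <- sum(x * f)

# Event probabilities: add up f(x) over the values in the event.
p_ge_2 <- sum(f[x >= 2])             # P(X >= 2)
p_0_to_2 <- sum(f[x > 0 & x <= 2])   # P(0 < X <= 2)
```

Logical indexing keeps the strict and non-strict inequalities visible in the code, which makes it harder to include or drop a boundary value by accident.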

3.

Given the following probability distribution:

Find:

(a) Calculate \(P(X \geq 2)\) and store your result in a variable named ans_3a.

(b) Calculate \(P(0 < X \leq 2)\) and store your result in a variable named ans_3b.

(c) Calculate \(E(X)\) and store your result in a variable named ans_3c.

(d) Calculate \(Var(X)\) and store your result in a variable named ans_3d.

(e) Calculate \(sd(X)\) and store your result in a variable named ans_3e.
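For a discrete distribution, the population variance is \(Var(X)=\sum_x (x-E(X))^2 f(x)\) and the standard deviation is its square root. A minimal sketch for a made-up distribution follows; `x` and `f` are hypothetical values, not the distribution for this problem. Note that R's built-in `var()` and `sd()` compute sample statistics from data, so they are not applied to the probability table here.

```r
# Hypothetical discrete distribution (not the one from the problem).
x <- c(-1, 0, 1)
f <- c(0.25, 0.50, 0.25)

# Population mean: probability-weighted sum of the values.
mu <- sum(x * f)

# Population variance: probability-weighted squared deviations from the mean.
var_x <- sum((x - mu)^2 * f)

# Population standard deviation.
sd_x <- sqrt(var_x)
```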

4.

Consider the probability distributions for random variable \(X\) and random variable \(Y\) defined below.

a. What is the population mean of \(X\)? Store your answer in a variable named ans_4a.

b. What is the population mean of \(Y\)? Store your answer in a variable named ans_4b.

c. Which of the histograms above most likely corresponds to a sample from \(X\)? Please store your answer as one of the following:

ans_4c <- 'upper left'
ans_4c <- 'lower left'
ans_4c <- 'upper right'
ans_4c <- 'lower right'

d. Which of the histograms above most likely corresponds to a sample from \(Y\)? Please store your answer as one of the following:

ans_4d <- 'upper left'
ans_4d <- 'lower left'
ans_4d <- 'upper right'
ans_4d <- 'lower right'

Work through these practice exercises (2)

It’s a good idea to work through these on your own, but if you get very stuck, solutions can be found here

1.

Consider the following two random variables:

\(X\sim\mathcal{N}(\mu=0,\sigma=1)\)
\(Y{\sim}Binom(n=10,p=0.2)\)

The \(X\) probability distribution (pdf since \(X\) is continuous) and the \(Y\) probability distribution (pmf since \(Y\) is discrete) are illustrated below. The probabilities corresponding to \(x{\leq}1\) and \(y{\leq}3\) (lower.tail=TRUE) are coloured blue and the probabilities corresponding to \(x{>}1\) and \(y{>}3\) (lower.tail=FALSE) are coloured green. This figure is only here to help you think about what is being asked below.
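The two coloured regions correspond to the `lower.tail` argument of R's cumulative distribution functions. As a sketch, the areas described above could be computed with `pnorm()` and `pbinom()` as follows; the variable names are illustrative only.

```r
# X ~ N(mu = 0, sigma = 1): areas on either side of x = 1.
p_lower_x <- pnorm(1, mean = 0, sd = 1)                      # P(X <= 1)
p_upper_x <- pnorm(1, mean = 0, sd = 1, lower.tail = FALSE)  # P(X > 1)

# Y ~ Binom(n = 10, p = 0.2): probability mass on either side of y = 3.
p_lower_y <- pbinom(3, size = 10, prob = 0.2)                      # P(Y <= 3)
p_upper_y <- pbinom(3, size = 10, prob = 0.2, lower.tail = FALSE)  # P(Y > 3)

# In both cases the lower and upper tails are complementary.
total_x <- p_lower_x + p_upper_x
total_y <- p_lower_y + p_upper_y
```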

Are the following statements TRUE or FALSE?

(a) \(p(X{<}x)=p(X{\leq}x)\) for all \(x\).

# uncomment the correct answer
# ans_1a <- TRUE
# ans_1a <- FALSE

(b) \(p(Y < y) = p(Y\leq y)\) for all \(y\).

# uncomment the correct answer
# ans_1b <- TRUE
# ans_1b <- FALSE

(c) \(p(X < x) = 1 - p(X > x)\) for all \(x\).

# uncomment the correct answer
# ans_1c <- TRUE
# ans_1c <- FALSE

(d) \(p(Y < y) = 1 - p(Y > y)\) for all \(y\).

# uncomment the correct answer
# ans_1d <- TRUE
# ans_1d <- FALSE

2.

Consider the following random variables:

\(X{\sim}\mathcal{N}(\mu_X=15,\sigma_X=2.74)\)
\(Y{\sim}Binom(n=30,p=0.5)\).

The \(X\) and \(Y\) distributions are illustrated in the following figure:

For all of the following use built-in R functions to compute the requested quantities. The functions are covered in the lecture notes this week.
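When using these functions, keep in mind that `pnorm()` and `pbinom()` return \(P(\cdot \leq q)\) by default and \(P(\cdot > q)\) with `lower.tail = FALSE`, so non-strict upper tails and strict lower tails of a discrete variable need the cut-off shifted by one. The sketch below illustrates this on hypothetical variables \(V\sim\mathcal{N}(10, 2)\) and \(W\sim Binom(8, 0.3)\), which are assumptions chosen for illustration and are not the \(X\) and \(Y\) of this problem.

```r
# Hypothetical continuous variable V ~ N(mu = 10, sigma = 2).
# P(V = 12) is zero, so P(V >= 12) and P(V > 12) coincide.
p_v_ge12 <- pnorm(12, mean = 10, sd = 2, lower.tail = FALSE)

# Hypothetical discrete variable W ~ Binom(n = 8, p = 0.3).
# P(W >= 4) = P(W > 3): shift the cut-off down by one.
p_w_ge4 <- pbinom(3, size = 8, prob = 0.3, lower.tail = FALSE)

# P(W < 4) = P(W <= 3): again shift the cut-off down by one.
p_w_lt4 <- pbinom(3, size = 8, prob = 0.3)

# Cross-check against the pmf by summing dbinom() over the event.
check_ge4 <- sum(dbinom(4:8, size = 8, prob = 0.3))
check_lt4 <- sum(dbinom(0:3, size = 8, prob = 0.3))
```

Summing `dbinom()` over the values in the event is a simple way to confirm that the cut-off passed to `pbinom()` is the intended one.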

(a) Compute \(P(X > 17)\) and store it in a variable named ans_2a.

(b) Compute \(P(X \leq 14)\) and store it in a variable named ans_2b.

(c) Compute \(P(X \geq 20)\) and store it in a variable named ans_2c.

(d) Compute \(P(X < 13)\) and store it in a variable named ans_2d.

(e) Compute \(P(Y > 17)\) and store it in a variable named ans_2e.

(f) Compute \(P(Y \leq 14)\) and store it in a variable named ans_2f.

(g) Compute \(P(Y \geq 20)\) and store it in a variable named ans_2g.

(h) Compute \(P(Y < 13)\) and store it in a variable named ans_2h.

(i) How do all of the above quantities computed with respect to \(X\) compare to those computed with respect to \(Y\)? No points for this one. Just think about it.

3.

Consider the following plots:

For each of the following problems, please respond by setting the appropriate variable to one of "I", "II", "III", "IV", "V", "VI", "VII", or "VIII". When responding to the questions below, please note that the x and y axis labels in the above plots all use ‘X’ or ‘x’ but may correspond to either the X or the Y random variable from problem 2.

(a) Which plot corresponds to the quantity requested in problem 2a? Please store your answer in a variable named ans_3a.

(b) Which plot corresponds to the quantity requested in problem 2b? Please store your answer in a variable named ans_3b.

(c) Which plot corresponds to the quantity requested in problem 2c? Please store your answer in a variable named ans_3c.

(d) Which plot corresponds to the quantity requested in problem 2d? Please store your answer in a variable named ans_3d.

(e) Which plot corresponds to the quantity requested in problem 2e? Please store your answer in a variable named ans_3e.

(f) Which plot corresponds to the quantity requested in problem 2f? Please store your answer in a variable named ans_3f.

(g) Which plot corresponds to the quantity requested in problem 2g? Please store your answer in a variable named ans_3g.

(h) Which plot corresponds to the quantity requested in problem 2h? Please store your answer in a variable named ans_3h.

4.

Consider the random variable \(X{\sim}F(d_1,d_2)\) with fixed parameters \(d_1=5\), \(d_2=100\). Given these values of \(d_1\) and \(d_2\), the following holds true:

\[ \text{if } X \sim F(d_1,d_2)\text{ then:}\\ \mathbb{E}[X] = \frac{d_2}{d_2 - 2} \\ \mathbb{V}\text{ar}[X] = \frac{2 d_2^2 (d_1 + d_2 - 2)}{d_1 (d_2 - 2)^2 (d_2 - 4)} \]

You might need numerical values for \(\mathbb{E}[X]\) and \(\mathbb{V}\text{ar}[X]\) later in this problem. If you do, you can compute them with the following code chunk:

d1 <- 5
d2 <- 100

mu_x <- d2 / (d2 - 2)  # E[X]
sig_x <- (2 * d2^2 * (d1 + d2 - 2)) / (d1 * (d2 - 2)^2 * (d2 - 4))  # Var[X]: the variance, not the standard deviation

We next perform 6 experiments. In each experiment, we draw 1000 samples from \(X\) and plot a histogram for the results of each experiment. The results of these experiments are shown in the following histograms.

(a) Do the histograms appear Normally distributed? If not, what is the main difference? Please respond with one of the following:

ans_4a <- "Skewed"
ans_4a <- "Normal"

We next estimate the distribution of sample means as a function of the number of samples n that were drawn per experiment. We will perform 6 experiments in which we draw 2, 5, 10, 15, 30, and 50 samples from \(X\) (each of these numbers corresponds to one experiment). Each of these experiments, if performed once, will give us one sample mean. To examine the distribution of sample means, we repeat each experiment 1000 times. This will give us 1000 sample means per experiment. For each experiment, we estimate the distribution of sample means with a histogram and make a single figure with separate panels for each experiment to show our results.
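The procedure just described can be sketched in a few lines of base R. This is one possible implementation and not necessarily how the tutorial's figure was produced; the seed and the names `n_values`, `n_reps`, and `sample_means` are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible

d1 <- 5
d2 <- 100
n_values <- c(2, 5, 10, 15, 30, 50)  # samples drawn per experiment
n_reps <- 1000                        # repetitions of each experiment

# For each n: draw n samples from F(d1, d2), take their mean, repeat n_reps times.
# The result is a matrix with one row per repetition and one column per n.
sample_means <- sapply(n_values, function(n) {
  replicate(n_reps, mean(rf(n, df1 = d1, df2 = d2)))
})
colnames(sample_means) <- paste0("n=", n_values)
```

Each column can then be inspected on its own, for example with `hist(sample_means[, "n=30"])`, or summarised across experiments with `apply(sample_means, 2, var)`.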

(b) What happens to the variance of the distribution of sample means as \(n\) increases?

ans_4b <- "Stays the same"
ans_4b <- "Increases"
ans_4b <- "Decreases"

(c) What happens to the skew of the distribution of sample means as \(n\) increases?

ans_4c <- "Stays the same"
ans_4c <- "Increases"
ans_4c <- "Decreases"

(d) If \(X{\sim}F(d_1,d_2)\) as defined at the top of the problem, and \(n=1000\), what is the population mean of \(\bar{X}\) (the distribution of sample means)? Store your answer in a variable named ans_4d.

(e) If \(X{\sim}F(d_1,d_2)\) as defined at the top of the problem, and \(n=1000\), what is the population variance of \(\bar{X}\) (the distribution of sample means)? Store your answer in a variable named ans_4e.

Author: Matthew J. Crossley

Last update: 20 May, 2025