You have just handed in homework 1, which involved heaps of simple data wrangling and ggplotting, so we will take a short break from that and focus on the theory of random variables and probability distributions.
You should aim to grapple with the practice problems below, but first inspect the upcoming quiz so that you can allocate your time and attention appropriately.
Consider an experiment in which two four-sided dice are thrown.
(a) Create a `data.table` named `ans_1a` with two columns labelled `die1` and `die2`, where these columns encode the sample space of this experiment. As a hint, the following shows what `ans_1a[1:8]` should produce. It's up to you to determine how many observations in total `ans_1a` should have.
## die1 die2
## <num> <num>
## 1: 1 1
## 2: 2 1
## 3: 3 1
## 4: 4 1
## 5: 1 2
## 6: 2 2
## 7: 3 2
## 8: 4 2
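One way to enumerate a sample space is to cross every outcome of one component with every outcome of the other. The sketch below does this for a hypothetical experiment (two coin flips, not the dice above) using base R's `expand.grid()`. The names `coin1`, `coin2`, and `omega` are illustrative assumptions.

```r
# Hypothetical experiment: flip two coins (not the dice experiment above).
# expand.grid() builds every combination, varying the first column fastest.
omega <- expand.grid(coin1 = c("H", "T"), coin2 = c("H", "T"),
                     stringsAsFactors = FALSE)

print(omega)
nrow(omega)  # number of outcomes in the sample space
```

The result is a `data.frame`; `data.table::as.data.table()` should convert it to a `data.table` if that is the required type.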
(b) Create a deep copy of `ans_1a` called `ans_1b`, then add a column to `ans_1b` called `X` and set it equal to the sum of the corresponding `die1` and `die2` values on a per-row basis.
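The distinction between a deep copy and a plain assignment matters in `data.table` because `:=` modifies a table by reference. A minimal sketch on a hypothetical table (the names `dt_orig`, `dt_copy`, `a`, `b`, and `total` are made up for illustration and are not part of the problem):

```r
library(data.table)

# Hypothetical table, unrelated to the dice problem.
dt_orig <- data.table(a = 1:3, b = c(10, 20, 30))

# `dt_copy <- dt_orig` alone would only create a second name for the same
# table; copy() makes an independent (deep) copy.
dt_copy <- copy(dt_orig)

# := adds a column by reference, computed row by row from a and b.
dt_copy[, total := a + b]

names(dt_orig)  # the original is unchanged
names(dt_copy)  # the copy has the new column
```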
(c) Let \(X\) be the random variable defined by the `X` column in `ans_1b`. What is \(P(X<5)\)? Store your result in a variable named `ans_1c`.
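When every outcome in the sample space is equally likely, the probability of an event is the fraction of outcomes that belong to it. A minimal sketch on a hypothetical two-coin experiment (the variable `H` and the names `omega` and `p_H_lt_2` are assumptions for illustration only):

```r
# Hypothetical experiment: two fair coin flips, all four outcomes equally likely.
omega <- expand.grid(coin1 = c(0, 1), coin2 = c(0, 1))  # 1 = heads, 0 = tails

# Random variable H = number of heads in each outcome.
omega$H <- omega$coin1 + omega$coin2

# P(H < 2) = (number of outcomes with H < 2) / (number of outcomes).
# mean() of a logical vector gives exactly that proportion.
p_H_lt_2 <- mean(omega$H < 2)
p_H_lt_2
```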
(d) Is the event defined as the set of all outcomes for which \(X=2\) an elementary event? Store your answer (`"YES"` or `"NO"`) in a variable named `ans_1d`.
For the following probability distribution:
\(x\) | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|
\(f(x)\) | .1 | .3 | .3 | .2 | .1 |
(a) Is \(f(x)\) a valid probability distribution? Store your answer (`"YES"` or `"NO"`) in a variable named `ans_2a`.
(b) Calculate \(E(X)\) and store the result in a variable named `ans_2b`.
(c) Calculate \(P(X \geq 4)\) and store the result in a variable named `ans_2c`.
(d) Calculate \(P(2 < X \leq 4)\) and store the result in a variable named `ans_2d`.
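The calculations in this problem all reduce to sums over the table of \(x\) and \(f(x)\) values. The following is a minimal sketch on a hypothetical pmf with different numbers from the table above. The vectors `x` and `f` and all result names are assumptions for illustration.

```r
# Hypothetical pmf (deliberately not the table above).
x <- c(0, 1, 2, 3)
f <- c(0.2, 0.5, 0.2, 0.1)

# A valid pmf has non-negative values that sum to 1.
# all.equal() avoids spurious failures from floating-point rounding.
is_valid <- all(f >= 0) && isTRUE(all.equal(sum(f), 1))

# Expected value: probability-weighted sum of the outcomes.
e_x <- sum(x * f)

# Event probabilities: add up f(x) over the outcomes in the event.
p_ge_2    <- sum(f[x >= 2])          # P(X >= 2)
p_0_to_2  <- sum(f[x > 0 & x <= 2])  # P(0 < X <= 2)
```

Note how strict and non-strict inequalities select different outcomes for a discrete variable, which is why the logical index has to match the event exactly.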
Given the following probability distribution:
Find:
(a) Calculate \(P(X \geq 2)\) and store your result in a variable named `ans_3a`.
(b) Calculate \(P(0 < X \leq 2)\) and store your result in a variable named `ans_3b`.
(c) Calculate \(E(X)\) and store your result in a variable named `ans_3c`.
(d) Calculate \(Var(X)\) and store your result in a variable named `ans_3d`.
(e) Calculate \(sd(X)\) and store your result in a variable named `ans_3e`.
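For a discrete random variable, the variance is the probability-weighted average of squared deviations from the mean, and the standard deviation is its square root. A minimal sketch on a hypothetical pmf (the values in `x` and `f` are made up and are not the distribution for this problem):

```r
# Hypothetical pmf, for illustration only.
x <- c(0, 1, 2, 3)
f <- c(0.2, 0.5, 0.2, 0.1)

mu    <- sum(x * f)             # E(X)
var_x <- sum((x - mu)^2 * f)    # Var(X) = E[(X - mu)^2]
sd_x  <- sqrt(var_x)            # sd(X)

# Equivalent shortcut: Var(X) = E(X^2) - E(X)^2
var_x_alt <- sum(x^2 * f) - mu^2
```

Note that R's built-in `var()` and `sd()` compute sample statistics from data, so they are not a substitute for these population formulas.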
Consider the probability distributions for random variable \(X\) and random variable \(Y\) defined below.
a. What is the population mean of \(X\)? Store your answer in a variable named `ans_4a`.
b. What is the population mean of \(Y\)? Store your answer in a variable named `ans_4b`.
c. Which of the histograms above most likely corresponds to a sample from \(X\)? Please store your answer as one of the following:
ans_4c <- 'upper left'
ans_4c <- 'lower left'
ans_4c <- 'upper right'
ans_4c <- 'lower right'
d. Which of the histograms above most likely corresponds to a sample from \(Y\)? Please store your answer as one of the following:
ans_4d <- 'upper left'
ans_4d <- 'lower left'
ans_4d <- 'upper right'
ans_4d <- 'lower right'
Consider the following two random variables:
\(X\sim\mathcal{N}(\mu=0,\sigma=1)\)
\(Y{\sim}Binom(n=10,p=0.2)\)
The \(X\) probability distribution (pdf, since \(X\) is continuous) and the \(Y\) probability distribution (pmf, since \(Y\) is discrete) are illustrated below. The probabilities corresponding to \(x{\leq}1\) and \(y{\leq}3\) (`lower.tail=TRUE`) are coloured blue, and the probabilities corresponding to \(x{>}1\) and \(y{>}3\) (`lower.tail=FALSE`) are coloured green. This figure is only here to help you think about what is being asked below.
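The blue and green regions described here can be computed with `pnorm()` and `pbinom()`, using the parameters and cut points stated above. This is a sketch only, and the result names are assumptions.

```r
# X ~ N(mu = 0, sigma = 1), split at x = 1
p_x_lower <- pnorm(1, mean = 0, sd = 1, lower.tail = TRUE)   # P(X <= 1), blue
p_x_upper <- pnorm(1, mean = 0, sd = 1, lower.tail = FALSE)  # P(X > 1), green

# Y ~ Binom(n = 10, p = 0.2), split at y = 3
p_y_lower <- pbinom(3, size = 10, prob = 0.2, lower.tail = TRUE)   # P(Y <= 3), blue
p_y_upper <- pbinom(3, size = 10, prob = 0.2, lower.tail = FALSE)  # P(Y > 3), green

# The two regions of each distribution together cover all of the probability.
p_x_lower + p_x_upper
p_y_lower + p_y_upper
```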
Are the following statements `TRUE` or `FALSE`?
(a) \(p(X{<}x)=p(X{\leq}x)\) for all \(x\).
# uncomment the correct answer
# ans_1a <- TRUE
# ans_1a <- FALSE
(b) \(p(Y < y) = p(Y\leq y)\) for all \(y\).
# uncomment the correct answer
# ans_1b <- TRUE
# ans_1b <- FALSE
(c) \(p(X < x) = 1 - p(X > x)\) for all \(x\).
# uncomment the correct answer
# ans_1c <- TRUE
# ans_1c <- FALSE
(d) \(p(Y < y) = 1 - p(Y > y)\) for all \(y\).
# uncomment the correct answer
# ans_1d <- TRUE
# ans_1d <- FALSE
Consider the following random variables:
\(X{\sim}\mathcal{N}(\mu_X=15,\sigma_X=2.74)\)
\(Y{\sim}Binom(n=30,p=0.5)\).
The \(X\) and \(Y\) distributions are illustrated in the following figure:
For all of the following, use built-in R functions to compute the requested quantities. The functions are covered in the lecture notes this week.
(a) Compute \(P(X > 17)\) and store it in a variable named `ans_2a`.
(b) Compute \(P(X \leq 14)\) and store it in a variable named `ans_2b`.
(c) Compute \(P(X \geq 20)\) and store it in a variable named `ans_2c`.
(d) Compute \(P(X < 13)\) and store it in a variable named `ans_2d`.
(e) Compute \(P(Y > 17)\) and store it in a variable named `ans_2e`.
(f) Compute \(P(Y \leq 14)\) and store it in a variable named `ans_2f`.
(g) Compute \(P(Y \geq 20)\) and store it in a variable named `ans_2g`.
(h) Compute \(P(Y < 13)\) and store it in a variable named `ans_2h`.
(i) How do all of the above quantities computed with respect to \(X\) compare to those computed with respect to \(Y\)? No points for this one. Just think about it.
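As a rough guide to the built-in functions, the sketch below applies `pnorm()` and `pbinom()` to hypothetical distributions whose parameters differ from those in this problem. The names `W` and `K`, their parameters, and the result variables are assumptions for illustration.

```r
# Hypothetical distributions (parameters differ from the problem above):
# W ~ N(mean = 10, sd = 3), K ~ Binom(size = 12, prob = 0.3)

# Continuous: pnorm() gives P(W <= q); lower.tail = FALSE gives P(W > q).
p_w_gt_12 <- pnorm(12, mean = 10, sd = 3, lower.tail = FALSE)  # P(W > 12)

# Discrete: pbinom() gives P(K <= q); lower.tail = FALSE gives P(K > q).
p_k_le_4 <- pbinom(4, size = 12, prob = 0.3)                       # P(K <= 4)
p_k_gt_4 <- pbinom(4, size = 12, prob = 0.3, lower.tail = FALSE)   # P(K > 4)

# K only takes integer values, so {K >= 5} is the same event as {K > 4},
# and {K < 5} is the same event as {K <= 4}.
p_k_ge_5 <- pbinom(5 - 1, size = 12, prob = 0.3, lower.tail = FALSE)
p_k_lt_5 <- pbinom(5 - 1, size = 12, prob = 0.3)

# Cross-check against summing the pmf directly.
all.equal(p_k_ge_5, sum(dbinom(5:12, size = 12, prob = 0.3)))
```

The shift by one in the last two calls is specific to integer-valued variables; whether any adjustment is needed for a continuous variable is worth thinking through separately.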
Consider the following plots:
For each of the following problems, please respond by setting the appropriate variable to one of `"I"`, `"II"`, `"III"`, `"IV"`, `"V"`, `"VI"`, `"VII"`, or `"VIII"`. When responding to the questions below, please note that the x and y axis labels in the above plots all use ‘X’ or ‘x’ but may correspond to either the X or the Y random variable from problem 2.
(a) Which plot corresponds to the quantity requested in problem 2a? Please store your answer in a variable named `ans_3a`.
(b) Which plot corresponds to the quantity requested in problem 2b? Please store your answer in a variable named `ans_3b`.
(c) Which plot corresponds to the quantity requested in problem 2c? Please store your answer in a variable named `ans_3c`.
(d) Which plot corresponds to the quantity requested in problem 2d? Please store your answer in a variable named `ans_3d`.
(e) Which plot corresponds to the quantity requested in problem 2e? Please store your answer in a variable named `ans_3e`.
(f) Which plot corresponds to the quantity requested in problem 2f? Please store your answer in a variable named `ans_3f`.
(g) Which plot corresponds to the quantity requested in problem 2g? Please store your answer in a variable named `ans_3g`.
(h) Which plot corresponds to the quantity requested in problem 2h? Please store your answer in a variable named `ans_3h`.
Consider the random variable \(X{\sim}F(d_1,d_2)\) with fixed parameters \(d_1=5\), \(d_2=100\). Given these values of \(d_1\) and \(d_2\), the following holds true:
\[ \text{if } X \sim F(d_1,d_2)\text{ then:}\\ \mathbb{E}[X] = \frac{d_2}{d_2 - 2} \\ \mathbb{V}\text{ar}[X] = \frac{2 d_2^2 (d_1 + d_2 - 2)}{d_1 (d_2 - 2)^2 (d_2 - 4)} \]
You might need numerical values for \(\mathbb{E}[X]\) and \(\mathbb{V}\text{ar}[X]\) later in this problem. If you do, you can compute them with the following code chunk:
d1 <- 5
d2 <- 100
mu_x <- (d2/(d2-2)) # population mean of X
sig_x <- (2*(d2^2)*(d1+d2-2))/(d1*(d2-2)^2*(d2-4)) # population variance of X (not the standard deviation, despite the name)
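As a sanity check on these formulas, one can compare them with the sample mean and variance of a large number of draws from `rf()`. The sketch below assumes an arbitrary seed and a sample size of 100,000. The names `mu_formula`, `var_formula`, and `draws` are illustrative and not part of the problem.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
d1 <- 5
d2 <- 100

# Population mean and variance from the formulas above.
mu_formula  <- d2 / (d2 - 2)
var_formula <- (2 * d2^2 * (d1 + d2 - 2)) / (d1 * (d2 - 2)^2 * (d2 - 4))

# Large simulated sample from F(d1, d2).
draws <- rf(1e5, df1 = d1, df2 = d2)

c(formula = mu_formula,  simulated = mean(draws))
c(formula = var_formula, simulated = var(draws))

# The standard deviation, if needed, is the square root of the variance.
sqrt(var_formula)
```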
We next perform 6 experiments. In each experiment, we draw 1000 samples from \(X\) and plot a histogram for the results of each experiment. The results of these experiments are shown in the following histograms.
(a) Do the histograms appear Normally distributed? If not, what is the main difference? Please respond with one of the following:
ans_4a <- "Skewed"
ans_4a <- "Normal"
We next estimate the distribution of sample means as a function of the number of samples `n` that were drawn per experiment. We will perform 6 experiments in which we draw 2, 5, 10, 15, 30, and 50 samples from \(X\) (each of these numbers corresponds to one experiment). Each of these experiments, if performed once, will give us one sample mean. To examine the distribution of sample means, we repeat each experiment 1000 times. This will give us 1000 sample means per experiment. For each experiment, we estimate the distribution of sample means with a histogram and make a single figure with separate panels for each experiment to show our results.
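The procedure just described could be simulated roughly as follows. This is a sketch under assumptions (an arbitrary seed, and the hypothetical names `n_values`, `n_reps`, and `sample_means`), not necessarily the code that produced the figure.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
d1 <- 5
d2 <- 100

n_values <- c(2, 5, 10, 15, 30, 50)  # samples drawn per experiment
n_reps   <- 1000                     # repetitions of each experiment

# For each n: draw n values from F(d1, d2), take their mean,
# and repeat n_reps times to get n_reps sample means.
sample_means <- lapply(n_values, function(n) {
  replicate(n_reps, mean(rf(n, df1 = d1, df2 = d2)))
})
names(sample_means) <- paste0("n = ", n_values)

# One vector of 1000 sample means per experiment.
sapply(sample_means, length)
```

Passing each element of `sample_means` to `hist()` would give one histogram panel per experiment.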
(b) What happens to the variance of the distribution of sample means as \(n\) increases?
ans_4b <- "Stays the same"
ans_4b <- "Increases"
ans_4b <- "Decreases"
(c) What happens to the skew of the distribution of sample means as \(n\) increases?
ans_4c <- "Stays the same"
ans_4c <- "Increases"
ans_4c <- "Decreases"
(d) If \(X{\sim}F(d_1,d_2)\) as defined at the top of the problem, and \(n=1000\), what is the population mean of \(\bar{X}\) (the distribution of sample means)? Store your answer in a variable named `ans_4d`.
(e) If \(X{\sim}F(d_1,d_2)\) as defined at the top of the problem, and \(n=1000\), what is the population variance of \(\bar{X}\) (the distribution of sample means)? Store your answer in a variable named `ans_4e`.