Day 6 lab: intro probability

Inverse cdf method for continuous random variables

Premise for the inverse CDF method: we want to generate random samples from a distribution for which no ready-made sampler is available.

Here we use examples from distributions that are easy to sample from, so we can compare our results with existing functions. In practice, this method is most useful for uncommon distributions that lack a built-in sampler.

Preparation

First, let’s recall/introduce how we can make our own functions in R.

Toy example:

add_two_numbers <- function(number1, number2){
  return(number1 + number2)
}

add_two_numbers(number1 = 2, number2 = 9)
[1] 11
# We can call the function without argument names, but it is not good practice e.g.
add_two_numbers(2, 9)
[1] 11

Before we start generating random numbers, let’s set the seed to make our experiments reproducible.

To understand why we set a seed, run the chunk below and compare the two outputs. Then run it again and compare between runs.

rnorm(1)
[1] 2.274749
rnorm(1)
[1] 0.9803857

In order to make our analysis reproducible we need to set the seed, so that another person running the script gets the same results.
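For example (a minimal sketch; the seed value 42 is an arbitrary choice), setting the same seed before two runs produces identical draws:

```r
# Same seed => same sequence of pseudo-random numbers
set.seed(42)           # Arbitrary seed value
first_run <- rnorm(3)
set.seed(42)           # Reset to the same seed
second_run <- rnorm(3)
identical(first_run, second_run) # TRUE
```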

set.seed(345454)

Example 1: Generating realizations of an exponential distributed random variable using the inverse cdf method

What is the inverse CDF of an exponential distribution? Uh oh. How can we calculate this? Luckily, R has functions that compute quantiles, and the quantile function is exactly the inverse cdf.

For example, if \(X \sim Exp(0.1)\), then the number \(c\) such that \(Pr(X < c) = 0.6\) is \(F^{-1}(0.6)\) and can be obtained with the following R code:

qexp(0.6, rate = 0.1)
[1] 9.162907
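As a sanity check, applying the cdf pexp() to this quantile should recover the original probability, confirming that qexp() really is the inverse of the cdf:

```r
c_val <- qexp(0.6, rate = 0.1) # F^{-1}(0.6)
pexp(c_val, rate = 0.1)        # F(F^{-1}(0.6)) = 0.6
```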

Equipped with this observation, let’s implement our own exponential sampler:

rexp_isi_buds <- function(num_draws, rate = 1){
  runif_draws <- runif(num_draws) # Random draws from the uniform distribution
  return(qexp(runif_draws, rate = rate)) # Plug draws into the inverse cdf
}

my_exp_sample <- rexp_isi_buds(100000, rate = 0.1)

We can plot our observations to visually assess that we did it correctly. Note the difference in default bin sizes between the two plotting methods.

# base R plot for reference
hist(my_exp_sample) 

# ggplot
library(tidyverse)
data.frame("my_exp_sample" = my_exp_sample) %>% # Need a data frame for ggplot()
  ggplot(aes(x = my_exp_sample)) +
  geom_histogram() + 
  theme_bw() # Try changing the number of bins by setting bins = 10
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

If you have time: spend a few minutes improving the plot above with the recommendations learned last week!

Let’s compare our simulations with simulations produced by the base R function rexp():

my_exp_sample2 <- rexp(100000, 0.1)

qqplot(my_exp_sample, my_exp_sample2)
abline(0,1)

We can also compare the empirical and theoretical means and standard deviations:

# Parentheses tell R to print the value being stored
(emp_exp_mean <- mean(my_exp_sample)) 
[1] 9.99156
(theor_exp_mean <- 1/0.1)
[1] 10
(emp_exp_sd <- sd(my_exp_sample))
[1] 10.07315
(theor_exp_sd <- sqrt(1/0.1^2))
[1] 10
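For a more formal check than comparing moments, base R’s ks.test() runs a one-sample Kolmogorov–Smirnov test of the sample against the theoretical cdf; with a correct sampler the p-value should typically not be small. A sketch with its own fresh sample (the seed is arbitrary):

```r
set.seed(2024)                             # Arbitrary seed for reproducibility
ks_sample <- qexp(runif(5000), rate = 0.1) # Inverse cdf method again
ks.test(ks_sample, "pexp", rate = 0.1)     # Compare against the Exp(0.1) cdf
```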

After working through this example, can you better explain why we would want to use the inverse cdf method to generate random samples from an uncommon distribution?

Exercise 1

Write a function to generate realizations of a normal random variable with a user specified mean and variance. Compare results of your simulations and simulation produced using rnorm().

rnorm_isi_buds <- function(num_draws, mean, var){
  runif_draws <- runif(num_draws) # Random draws from the uniform distribution
  return(qnorm(runif_draws, mean = mean, sd = sqrt(var))) # Plug draws into the inverse cdf
}

my_norm_sample <- rnorm_isi_buds(100000, mean = 0, var = 1)

# Looking at different sample sizes
par(mfrow = c(1, 2))
hist(my_norm_sample)
hist(rnorm_isi_buds(40, mean = 0, var = 1))

Comparing it to rnorm() implemented in base R:

qqplot(my_norm_sample, rnorm(100000, mean = 0, sd = 1))
abline(0,1)

(NOTE): the function we implemented takes the variance as an argument to parametrize the normal distribution, while rnorm() takes the standard deviation. Indeed, passing var = 10 to one and sd = 10 to the other produces samples on different scales:

qqplot(
  rnorm_isi_buds(100000, mean = 0, var = 10),
  rnorm(100000, mean = 0, sd = 10)
)
abline(0,1)

Inverse cdf method for discrete random variables

Example 2: Bernoulli random variable (aka flip of a biased coin)

rbernoulli <- function(success_prob){
  return_value <- 0
  
  if (runif(1) < success_prob){
    return_value <- 1 # else statement is not necessary, because we set return_value to 0 from the start
  }
  
  return(return_value)
}

Let’s test our function

num_sim <- 1000
my_coins <- numeric(num_sim)
my_prob <- 0.24

for (i in 1:num_sim){
  my_coins[i] <- rbernoulli(my_prob)
}

mean(my_coins)
[1] 0.233
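As an aside, the for loop is not required: comparing a whole vector of uniforms against success_prob gives the same result in one vectorized step (rbernoulli_vec is our own sketch, not a base R function):

```r
rbernoulli_vec <- function(num_draws, success_prob){
  # runif(num_draws) < success_prob yields a vector of TRUEs/FALSEs;
  # as.integer() converts them to 1s and 0s
  as.integer(runif(num_draws) < success_prob)
}

mean(rbernoulli_vec(1000, 0.24)) # Should be close to 0.24
```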

Exercise 2

Modify the rbernoulli() function so it has 2 arguments: the number of realizations to draw and the success probability.

Compare the mean number of successes, first from 10 realizations with probability of success p = 0.2 and 10 realizations with probability of success p = 0.6; then from 1000 realizations with each of p = 0.2 and p = 0.6. What do you observe?

rbernoulli_draws <- function(num_draws, success_prob){
  bernoulli_sample <- numeric(num_draws) # Preallocate instead of growing with c()
  for(i in 1:num_draws){
    bernoulli_sample[i] <- rbernoulli(success_prob)
  }
  return(bernoulli_sample)
}

sample1_10 <- rbernoulli_draws(10,0.2)
sample2_10 <- rbernoulli_draws(10,0.6)

paste0("Mean for sample size of 10, p = 0.2: ", mean(sample1_10)) 
[1] "Mean for sample size of 10, p = 0.2: 0.1"
paste0("Mean for sample size of 10, p = 0.6: ", mean(sample2_10)) 
[1] "Mean for sample size of 10, p = 0.6: 0.7"
sample1_1000 <- rbernoulli_draws(1000,0.2)
sample2_1000 <- rbernoulli_draws(1000,0.6)

paste0("Mean for sample size of 1000, p = 0.2: ", mean(sample1_1000)) 
[1] "Mean for sample size of 1000, p = 0.2: 0.187"
paste0("Mean for sample size of 1000, p = 0.6: ", mean(sample2_1000)) 
[1] "Mean for sample size of 1000, p = 0.6: 0.634"

Exercise 3

Write a function to simulate from a discrete random variable taking values 1, 2, 3 with probabilities p1, p2, p3, supplied by the user of the function. Note: you could do this with the sample() function, but we don’t want you to use it for this exercise.

rdiscr3 <- function(p1, p2, p3){
  if (abs(1 - p1 - p2 - p3) > 1e-7) { # Tolerance for floating-point rounding error
    stop("probabilities must add to 1")
  }
  
  return_value <- 3
  
  my_unif <- runif(1)
  
  if (my_unif < p1) {
    return_value <- 1
  } else if (my_unif < p1 + p2) {
    return_value <- 2
  }
  
  return(return_value)
}

rdiscr3_ndraws <- function(num_draws, p1, p2, p3){
  random_sample <- numeric(num_draws) # Preallocate instead of growing with c()
  for(i in 1:num_draws){
    random_sample[i] <- rdiscr3(p1, p2, p3)
  }
  return(random_sample)
}

The code below should help you test your function (name your function rdiscr3).

num_sim <- 10000
test_vector <- rdiscr3_ndraws(num_draws = num_sim, 0.1, 0.2, 0.7)
# Calculate empirical proportions
mean(test_vector == 1)
[1] 0.099
mean(test_vector == 2)
[1] 0.2014
mean(test_vector == 3)
[1] 0.6996
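Although the exercise rules out sample() for the solution itself, it is still handy for cross-checking: drawing from the same distribution with sample() should give similar empirical proportions:

```r
# Draw from {1, 2, 3} with the same probabilities using sample()
check_vector <- sample(1:3, size = 10000, replace = TRUE, prob = c(0.1, 0.2, 0.7))
table(check_vector) / 10000 # Proportions should be near 0.1, 0.2, 0.7
```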

Exercise 4

For many common distributions, you can use base R or common packages to simulate random samples. Generate 1000 samples from each of the following distributions and show the results using an appropriate plot:

  • Gamma(3,4) distribution.
gamma_draws <- rgamma(1000, shape = 3, rate = 4)
hist(gamma_draws, main = "Random Sample from Gamma(3,4)")

  • Beta(5,6) distribution.
beta_draws <- rbeta(1000, shape1 = 5, shape2 = 6)
hist(beta_draws, main = "Random Sample from Beta(5,6)")

  • Poisson(5) distribution.
poisson_draws <- rpois(1000, lambda = 5)
# Discrete distribution, so a barplot of counts is more appropriate than a histogram
barplot(table(poisson_draws), main = "Random Sample from Poisson(5)")
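One way to make the continuous-case plots more informative is to overlay the theoretical density: pass freq = FALSE so the histogram is on the density scale, then add the density with curve(). A sketch for the Gamma case (the Beta case works the same way with dbeta()):

```r
gamma_draws <- rgamma(1000, shape = 3, rate = 4)
hist(gamma_draws, freq = FALSE, main = "Gamma(3,4) sample with theoretical density")
curve(dgamma(x, shape = 3, rate = 4), add = TRUE, col = "red", lwd = 2)
```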