library(tidyverse)
library(ggforce)sample()How to use sample()
sample(x = a set vector,
size = the number of trials,
replace = replace the number you draw or not
prob = probability of each number drawn)
x is a set vector containing the
value from 1 to 10x <- c(1, 2, 3, 4, 5, 6)xsample(x,
size = 5)[1] 3 5 6 1 2
number 1, number 1 is not drawn in the
trialreplace = TRUE, which means
extraction with restoration, then it is likely that you
select number 1 in the trialsample(x,
size = 5,
replace = TRUE)[1] 6 2 2 6 2
prob means the probability that each number is
drawnsample(x,
size = 5,
replace = TRUE,
prob = c(1, 1, 1, 2, 2, 2))[1] 1 6 4 5 4
prob = c(1/9, 1/9, 1/, 2/9, 2/9, 2/9)
and get the same resultsize = 3replace = TRUE, which means
extraction with restorationsample(1:6,
3,
replace = TRUE)[1] 6 5 5
prob = c(2, 1, 2, 1, 2, 1)sample(x,
size = 5,
replace = TRUE,
prob = c(2, 1, 2, 1, 2, 1))[1] 1 5 3 5 1
table(), you can show the resultdice <- sample(1:6,
10^4,
replace = TRUE,
prob = c(2, 1, 2, 1, 2, 1))
table(dice)dice
1 2 3 4 5 6
2187 1126 2264 1156 2209 1058
barplot(), let’s visualize this resulttable(dice) |>
barplot()
- You see that the odd numbers (1, 3, 5) is twice as likely as the even
numbers (2, 4, 6) to be drawn in this simulation
| Function | Probability distribution | Parameters |
|---|---|---|
rbeta() |
beta distribution | n, shape1, shape2 |
rbinom() |
binomial distribution | n, size, prob |
rcauchy() |
cauchy distribution | n, location, scale |
rchisq() |
Chi-square distribution | n, df |
rexp() |
exponential distribution | n, rate |
rf() |
F distribution | n, |
rgamma() |
gamma distribution | n, shape, scale |
rgeom() |
geometric distribution | n, prob |
rhyper() |
hypergeometric distribution | nn, m, n, k |
rlnorm() |
lognormal distribution | n, meanlog, sdlog |
rmultinom() |
multinomial distribution | n, size, prob |
rnbinom() |
negative binomial distribution | n, size, prob |
rnorm() |
normal distribution | n, mean, sd |
rpois() |
Poisson distribution | n, lambda |
rt() |
t distribution | n, df |
runif() |
uniform distribution | n, min, max |
rweibull() |
Weibull distribution | n, shape, scale |
mvtnorm::rmvnorm() |
multivariate normal distribution | n, mean, sigma |
When you generate “random numbers” in R, you are actually
generating pseudorandom numbers
These numbers are generated with an algorithm that requires a
seed to initialize
Being pseudorandom instead of pure random means that, if you know
the seed and the generator, you can predict (and reproduce) the
output
Setting a seed in R means to initialize a pseudorandom number
generator
Most of the simulation methods in Statistics require the
possibility to generate pseudorandom numbers that mimic the properties
of independent generations of a uniform distribution in the interval
(0, 1)
If you want to draw a random number in R, you receive different
numbers every time you do
Suppose you wan to draw three numbers from a normal distribution
with average = 0 and standard deviation = 1
Round the number with 2 decimal places three times
rnorm(3) |>
round(2)[1] -1.52 0.39 -0.52
rnorm(3) |>
round(2)[1] -1.32 0.27 -0.06
rnorm(3) |>
round(2)[1] -0.12 0.02 0.19
You see that ever time you draw random numbers, you receive
different numbers
If you a simulation onece, there is no problem
But, if you want to get the same simulation result for some
reason, then you must be happy to have the same result by fixing random
number to get
This is when you use set.seed()
If you use the same set.seed(), then you get the
same random number
In the parenthesis, you set a numeric scalar (which means, any nubmers you like)
Let’s set a numeric scalar, 20220826 within
set.seed()
set.seed(20220826)
rnorm(3) |>
round(2)[1] -0.14 -0.87 -0.29
set.seed(20220826)
rnorm(3) |>
round(2)[1] -0.14 -0.87 -0.29
N randomly chosen people, at least two
will share a birthdayAnswer The probability of a shared birthday exceeds 50% in a group of only 23 people.
Birthday Paradox using Rn_studentn_student = 10n_student <- 10
Birth_vec <- sample(1:365,
n_student,
replace = TRUE)
Birth_vec [1] 140 55 363 170 292 335 345 6 179 6
duplicated()duplicated()TRUE,
FALSE otherwiseduplicated(Birth_vec) [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
FALSE and 1 TRUETRUE means that there is a shared birthdayany()any() enables us to check whether or not there is at
least one TRUETRUE means that there is at least one shared
birthdayFALSE means that there is no TRUEany(duplicated(Birth_vec))[1] TRUE
TRUE because the 6th DOB is a shared
birthdayn_student <- 10 # the number of students
n_trials <- 100 # the number of trials
Result_vec <- rep(NA, n_trials) # Prepare for the place where we put our simulation result: Result_vec
# iteration
for (i in 1:n_trials) {
Birth_vec <- sample(1:365, n_student, replace = TRUE)
Result_vec[i] <- any(duplicated(Birth_vec))
}
Result_vec [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
[61] FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[85] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[97] FALSE FALSE FALSE FALSE
sum(Result_vec)[1] 12
TRUEn_student <- 10 # the number of students
n_trials <- 1000 # the number of trials
# Prepare for the place where we put our simulation result: Result_vec
Result_vec <- rep(NA, n_trials)
# iteration
for (i in 1:n_trials) {
Birth_vec <- sample(1:365, n_student, replace = TRUE)
Result_vec[i] <- any(duplicated(Birth_vec))
}
sum(Result_vec)[1] 110
TRUEProb_df <- tibble(Students = 2:100, # change the number of students from 2 to 100
Probs = NA)
for (i in 1:nrow(Prob_df)) {
Result_vec <- rep(NA, 100) # Assign the number of trials = 100
for (j in 1:100) { # Assign the number of trials = 100
Birth_vec <- sample(1:365, Prob_df$Students[i], replace = TRUE)
Result_vec[j] <- any(duplicated(Birth_vec))
}
Prob_df$Probs[i] <- mean(Result_vec)
}DT::datatable(Prob_df)Students) when the
probability that we have at least one shared birthday
(Probs) is 0.5 (=50%)Prob_df |>
filter(Probs >= 0.4 & Probs <= 0.6)# A tibble: 7 × 2
Students Probs
<int> <dbl>
1 19 0.43
2 20 0.43
3 21 0.5
4 22 0.51
5 23 0.59
6 24 0.58
7 25 0.59
Students) when the
probability that we have at least one shared birthday
(Probs) is 0.5 (=50%) is around 21 or 22x-axis as the number of
students (Students) and y-axis as the
probability that we have at least one shared birthday
(Probs)Prob_df %>%
ggplot() +
geom_line(aes(x = Students,
y = Probs * 100),
size = 1) +
geom_hline(yintercept = 50, # Draw a red dotted line where Probs = 50%
color = "red",
linetype = 2) +
labs(x = "The number of Students",
y = "The prob. that we have at least one shared birthday (Probs) \n(Trials = 100)") Simulation result (Trial = 100) The probability of a shared birthday exceeds 50% in a group of only 23 people
n
people\[P(n) = 1 - \frac{365!}{365^n(365-n)!}\]
\(3! = 3 × 2 × 1 = 6\)
→ \(365! = 365 × 364 ×....×
1\)
In R, you can calculate this value by using
factorial()
For instance, if you have 50 people, then \(P(n)\) is calculated as follows:
1 - factorial(365) / (365^50 * factorial(365 - 50))[1] NaN
NaN means Not calculableInf ・Let’s calculate
\(100!\) on R
factorial(100)[1] 9.332622e+157
This is a gigantic number with 157 decimal points!!
Let’s calculate \(200!\) on R
factorial(200)[1] Inf
Inf means that R fails to calculate because the number
is goo large\[P(n) = 1 - \frac{n!×_{365}C_n}{365^n(365-n)!}\]
C means binomial coefficient\[_nC_k = \frac{n!}{k!(n-k)!}\]
C by using
choose(n, k) 1 - ((factorial(50) * choose(365, 50)) / 365^50)[1] 0.9703736
Probs) is 0.97 (= 97%)Birth_Expect <- function (n) {
1 - ((factorial(n) * choose(365, n)) / 365^n)
}
Birth_Expect(22)[1] 0.4756953
Birth_Expect <- function (n) {
1 - ((factorial(n) * choose(365, n)) / 365^n)
}
Birth_Expect(23)[1] 0.5072972
We see that the probability that we have at least one shared birthday is when there are 23 students
Let’s add this theoretical value to the data frame
(Prob_df) we made in
2.1.1 Simulation of birthday paradox and name it
Expect
Prob_df <- Prob_df |>
mutate(Expect = map_dbl(Students,
~Birth_Expect(.x)))DT::datatable(Prob_df)pivot_longer(), let’s customize this data to a
tidy oneProb_tidy <- Prob_df |>
pivot_longer(cols = Probs:Expect,
names_to = "Type",
values_to = "Prob")DT::datatable(Prob_tidy)Probs) and the result gained as theoretical value
(Expect)Prob_tidy %>%
mutate(Type = ifelse(Type == "Probs", "simulation", "theory")) %>%
ggplot() +
geom_line(aes(x = Students,
y = Prob * 100,
color = Type),
size = 1) +
labs(x = "The number of students",
y = "The prob. that we have at least one shared birthday (Probs) \n(Trials = 100)",
color = "") +
theme(legend.position = "bottom")n_student <- 100 # the number of students
n_trials <- 1000 # the number of trials
# Prepare for the place where we put our simulation result: Result_vec
Result_vec <- rep(NA, n_trials)
# iteration
for (i in 1:n_trials) {
Birth_vec <- sample(1:365,
n_student,
replace = TRUE)
Result_vec[i] <- any(duplicated(Birth_vec))
}Prob_df <- tibble(Students = 2:100,
Probs = NA)
for (i in 1:nrow(Prob_df)) {
Result_vec <- rep(NA, n_trials)
for (j in 1:n_trials) {
Birth_vec <- sample(1:365, Prob_df$Students[i], replace = TRUE)
Result_vec[j] <- any(duplicated(Birth_vec))
}
Prob_df$Probs[i] <- mean(Result_vec)
}Prob_df %>%
mutate(Expect = map_dbl(Students, ~Birth_Expect(.x))) %>%
pivot_longer(cols = Probs:Expect,
names_to = "Type",
values_to = "Prob") |>
mutate(Type = ifelse(Type == "Probs", "Simulation", "Theoretical value")) %>%
ggplot() +
geom_line(aes(x = Students, y = Prob * 100, color = Type), size = 1) +
labs(x = "The number of Students",
y = "The prob. that we have at least one shared birthday (Probs) \n(Trials = 1000)",
color = "") +
theme_minimal(base_family = "HiraKakuProN-W3") +
theme(legend.position = "bottom")Let’s consider a circle of radius 1 centered at the origin, find its area without using \(π\)
Step 1: Draw 4 lines around the circle and make a
square
Step 2: Randomly add points inside of the
square
Step 3: Calculate the ratio of dots inside and
outside of the square
→ We get the ratio of the area inside and outside of the square
→ We can calculate the are of the circle
→ We can estimate the value of \(\pi =
3.14\)
2 × 2 = 4set.seed(2022)
pi_df <- tibble(x = runif(100, -1, 1),
y = runif(100, -1, 1))
pi_df# A tibble: 100 × 2
x y
<dbl> <dbl>
1 0.632 0.910
2 0.295 -0.594
3 -0.759 0.681
4 0.0876 0.622
5 -0.631 0.148
6 0.272 -0.585
7 -0.851 0.784
8 -0.916 -0.831
9 -0.259 0.243
10 0.515 0.968
# … with 90 more rows
pi_df |>
ggplot() +
geom_rect(aes(xmin = -1,
ymin = -1,
xmax = 1,
ymax = 1),
fill = "white", color = "black") +
geom_point(aes(x = x, y = y)) +
coord_fixed(ratio = 1) +
theme_minimal()geom_circle() which
comes with {ggforce}x = 0, y = 0 and the radius of the
circle: \(r = 1\)pi_df %>%
ggplot() +
geom_rect(aes(xmin = -1,
ymin = -1,
xmax = 1,
ymax = 1),
fill = "white", color = "black") +
geom_circle(aes(x0 = 0,
y0 = 0,
r = 1)) +
geom_point(aes(x = x, y = y)) +
coord_fixed(ratio = 1) +
theme_minimal()in_circle\[(x-x')^2 + (y - y')^2 < r^2\]
\[x^2 + y^2 < 1^2\]
in_circle, you can decide whether or not the
random dots satisfy this mathematical equationpi_df <- pi_df %>%
mutate(in_circle = if_else(x^2 + y^2 < 1^2, "Inside", "Outside"))pi_df %>%
ggplot() +
geom_rect(aes(xmin = -1,
ymin = -1,
xmax = 1,
ymax = 1),
fill = "white",
color = "black") +
geom_point(aes(x = x,
y = y,
color = in_circle),
size = 2) +
geom_circle(aes(x0 = 0,
y0 = 0,
r = 1)) +
labs(x = "X", y = "Y", color = "") +
coord_fixed(ratio = 1) +
theme_void(base_size = 12) group_by(), you can count the number of dots
inside and outside of the circlepi_df |>
group_by(in_circle) |>
summarise(N = n())# A tibble: 2 × 2
in_circle N
<chr> <int>
1 Inside 82
2 Outside 18
4 * 0.82[1] 3.28
set.seed(2022)
pi_df2 <- tibble(x = runif(2000, -1, 1),
y = runif(2000, -1, 1))
pi_df2 <- pi_df2 %>%
mutate(in_circle = if_else(x^2 + y^2 < 1^2, "Inside", "Outside"))
pi_df2 %>%
ggplot() +
geom_rect(aes(xmin = -1,
ymin = -1,
xmax = 1,
ymax = 1),
fill = "white", color = "black") +
geom_point(aes(x = x,
y = y,
color = in_circle,
alpha = 0.5),
size = 2) +
geom_circle(aes(x0 = 0,
y0 = 0,
r = 1)) +
labs(x = "X", y = "Y", color = "") +
coord_fixed(ratio = 1) +
theme_void(base_size = 12) pi_df2 %>%
group_by(in_circle) %>%
summarise(N = n())# A tibble: 2 × 2
in_circle N
<chr> <int>
1 Inside 1562
2 Outside 438
4 * 1562 / 2000[1] 3.124
dots <- 2:2000 # Change the number of dots from 2 to 2000
n <- length(dots)
results <- rep(NA, n)
for (i in 1:n) {
results[i] <- tibble(x = runif(dots[i], -1, 1),
y = runif(dots[i], -1, 1)) |>
mutate(in_circle = if_else(x^2 + y^2 < 1^2, "Inside", "Outside")) |>
group_by(in_circle) |>
summarise(N = n()) |>
filter(in_circle == "Inside") |> # Select the dots insides of the circle
summarise(mean(N/dots[i]*4)) # Estimate pi
}
pi <- unlist(results) # Transform the date into vector
df_pi <- tibble(dots, pi) # Make a data frame, df_pi
df_pi %>%
ggplot() +
geom_line(aes(x = dots,
y = pi),
size = 0.2,
color = "blue",
alpha = 0.8) +
geom_hline(yintercept = 3.14, # Draw the line = 3.14
color = "red",
linetype = 2) +
labs(x = "The number of dots",
y = "Estimated pi") +
ggtitle("Estimated pi gained from Monte Carlo Simulation") Let’s think about Monty Hall problem
The Monty Hall problem is a brain teaser, in the form of a probability puzzle, loosely based on the American television game show Let’s Make a Deal and named after its original host, Monty Hall.
【Rules】
Stay:
You should stay, because there is no difference between picking door
No.1 and door No.2 in terms of winning a car.
Move:
You should pick door No.2 because it enhances your chance of winning a
car.
【Simulation on R】
Script for Monty Hall Problem on R
set.seed(1838-3-11) # Set a random number, Okuma's DOB
# Player are supposed to play this game 200 times
trials <- 200
prize <- sample(1:3, trials, replace = TRUE)
prize
stay <- prize == 1 # Define that a car is behind door No.1
head(stay)
move <- prize != 1 # Define that a car is not behind door No.2 and No.3
head(move)
prob.stay <- prob.move <- rep(NA, trials)
for (i in 1:trials) {
prob.stay[i] <- sum(stay[1:i]) / i # Prob. of winning a car if you stay
prob.move[i] <- sum(move[1:i]) / i # Prob . of winning a car if you move
# Plot your simulation results
plot(1:trials, prob.move, ylim = c(0, 1),
type = "l", col = "red",
xlab = "number of trials",
ylab = "prob. of getting a car",
main = "Simulation of Monty Hall Problem")
lines(1:trials, prob.stay, type = "l", col = "blue")
abline(h = c(0.33, 0.66), lty = "dotted") # Draw two lines with prob = 33 % and prob = 66%
legend("topright", lty = 1, col = c("red", "blue"),
legend = c("change", "stay"))}・How to run this simulation:
1. Go to File on R (not
RStudio) 2. Click New File
3. Paste the R code above
(Script for Monty Hall Problem on R) 4. Hit the
return key or enter key and you get the
following results: