library(tidyverse)
library(ggforce)
sample()
How to use sample()
sample(x = a set vector,
size = the number of trials,
replace = replace the number you draw or not
prob = probability of each number drawn)
x
is a set vector containing the
value from 1 to 10<- c(1, 2, 3, 4, 5, 6) x
x
sample(x,
size = 5)
[1] 3 5 6 1 2
number 1
, number 1
is not drawn in the
trialreplace = TRUE
, which means
extraction with restoration, then it is likely that you
select number 1
in the trialsample(x,
size = 5,
replace = TRUE)
[1] 6 2 2 6 2
prob
means the probability that each number is
drawnsample(x,
size = 5,
replace = TRUE,
prob = c(1, 1, 1, 2, 2, 2))
[1] 1 6 4 5 4
prob = c(1/9, 1/9, 1/, 2/9, 2/9, 2/9)
and get the same resultsize = 3
replace = TRUE
, which means
extraction with restorationsample(1:6,
3,
replace = TRUE)
[1] 6 5 5
prob = c(2, 1, 2, 1, 2, 1)
sample(x,
size = 5,
replace = TRUE,
prob = c(2, 1, 2, 1, 2, 1))
[1] 1 5 3 5 1
table()
, you can show the result<- sample(1:6,
dice 10^4,
replace = TRUE,
prob = c(2, 1, 2, 1, 2, 1))
table(dice)
dice
1 2 3 4 5 6
2187 1126 2264 1156 2209 1058
barplot()
, let’s visualize this resulttable(dice) |>
barplot()
- You see that the odd numbers (1, 3, 5) is twice as likely as the even numbers (2, 4, 6) to be drawn in this simulation
Function | Probability distribution | Parameters |
---|---|---|
rbeta() |
beta distribution | n, shape1, shape2 |
rbinom() |
binomial distribution | n, size, prob |
rcauchy() |
cauchy distribution | n, location, scale |
rchisq() |
Chi-square distribution | n, df |
rexp() |
exponential distribution | n, rate |
rf() |
F distribution | n, |
rgamma() |
gamma distribution | n, shape, scale |
rgeom() |
geometric distribution | n, prob |
rhyper() |
hypergeometric distribution | nn, m, n, k |
rlnorm() |
lognormal distribution | n, meanlog, sdlog |
rmultinom() |
multinomial distribution | n, size, prob |
rnbinom() |
negative binomial distribution | n, size, prob |
rnorm() |
normal distribution | n, mean, sd |
rpois() |
Poisson distribution | n, lambda |
rt() |
t distribution | n, df |
runif() |
uniform distribution | n, min, max |
rweibull() |
Weibull distribution | n, shape, scale |
mvtnorm::rmvnorm() |
multivariate normal distribution | n, mean, sigma |
When you generate “random numbers” in R, you are actually
generating pseudorandom numbers
These numbers are generated with an algorithm that requires a
seed to initialize
Being pseudorandom instead of pure random means that, if you know
the seed and the generator, you can predict (and reproduce) the
output
Setting a seed in R means to initialize a pseudorandom number
generator
Most of the simulation methods in Statistics require the
possibility to generate pseudorandom numbers that mimic the properties
of independent generations of a uniform distribution in the interval
(0, 1)
If you want to draw a random number in R, you receive different
numbers every time you do
Suppose you wan to draw three numbers from a normal distribution
with average = 0 and standard deviation = 1
Round the number with 2 decimal places three times
rnorm(3) |>
round(2)
[1] -1.52 0.39 -0.52
rnorm(3) |>
round(2)
[1] -1.32 0.27 -0.06
rnorm(3) |>
round(2)
[1] -0.12 0.02 0.19
You see that ever time you draw random numbers, you receive
different numbers
If you a simulation onece, there is no problem
But, if you want to get the same simulation result for some
reason, then you must be happy to have the same result by fixing random
number to get
This is when you use set.seed()
If you use the same set.seed()
, then you get the
same random number
In the parenthesis, you set a numeric scalar (which means, any nubmers you like)
Let’s set a numeric scalar, 20220826
within
set.seed()
set.seed(20220826)
rnorm(3) |>
round(2)
[1] -0.14 -0.87 -0.29
set.seed(20220826)
rnorm(3) |>
round(2)
[1] -0.14 -0.87 -0.29
N
randomly chosen people, at least two
will share a birthdayAnswer The probability of a shared birthday exceeds 50% in a group of only 23 people.
Birthday Paradox
using Rn_student
n_student = 10
<- 10
n_student <- sample(1:365,
Birth_vec
n_student,replace = TRUE)
Birth_vec
[1] 140 55 363 170 292 335 345 6 179 6
duplicated()
duplicated()
TRUE
,
FALSE
otherwiseduplicated(Birth_vec)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
FALSE
and 1 TRUE
TRUE
means that there is a shared birthdayany()
any()
enables us to check whether or not there is at
least one TRUE
TRUE
means that there is at least one shared
birthdayFALSE
means that there is no TRUE
any(duplicated(Birth_vec))
[1] TRUE
TRUE
because the 6th DOB is a shared
birthday<- 10 # the number of students
n_student <- 100 # the number of trials
n_trials
<- rep(NA, n_trials) # Prepare for the place where we put our simulation result: Result_vec
Result_vec
# iteration
for (i in 1:n_trials) {
<- sample(1:365, n_student, replace = TRUE)
Birth_vec <- any(duplicated(Birth_vec))
Result_vec[i]
}
Result_vec
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
[61] FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[85] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[97] FALSE FALSE FALSE FALSE
sum(Result_vec)
[1] 12
TRUE
<- 10 # the number of students
n_student <- 1000 # the number of trials
n_trials
# Prepare for the place where we put our simulation result: Result_vec
<- rep(NA, n_trials)
Result_vec
# iteration
for (i in 1:n_trials) {
<- sample(1:365, n_student, replace = TRUE)
Birth_vec <- any(duplicated(Birth_vec))
Result_vec[i]
}
sum(Result_vec)
[1] 110
TRUE
<- tibble(Students = 2:100, # change the number of students from 2 to 100
Prob_df Probs = NA)
for (i in 1:nrow(Prob_df)) {
<- rep(NA, 100) # Assign the number of trials = 100
Result_vec
for (j in 1:100) { # Assign the number of trials = 100
<- sample(1:365, Prob_df$Students[i], replace = TRUE)
Birth_vec <- any(duplicated(Birth_vec))
Result_vec[j]
}
$Probs[i] <- mean(Result_vec)
Prob_df }
::datatable(Prob_df) DT
Students
) when the
probability that we have at least one shared birthday
(Probs
) is 0.5 (=50%)|>
Prob_df filter(Probs >= 0.4 & Probs <= 0.6)
# A tibble: 7 × 2
Students Probs
<int> <dbl>
1 19 0.43
2 20 0.43
3 21 0.5
4 22 0.51
5 23 0.59
6 24 0.58
7 25 0.59
Students
) when the
probability that we have at least one shared birthday
(Probs
) is 0.5 (=50%) is around 21 or 22x-axis
as the number of
students (Students
) and y-axis
as the
probability that we have at least one shared birthday
(Probs
)%>%
Prob_df ggplot() +
geom_line(aes(x = Students,
y = Probs * 100),
size = 1) +
geom_hline(yintercept = 50, # Draw a red dotted line where Probs = 50%
color = "red",
linetype = 2) +
labs(x = "The number of Students",
y = "The prob. that we have at least one shared birthday (Probs) \n(Trials = 100)")
Simulation result (Trial = 100) The probability of a shared birthday exceeds 50% in a group of only 23 people
n
people\[P(n) = 1 - \frac{365!}{365^n(365-n)!}\]
\(3! = 3 × 2 × 1 = 6\)
→ \(365! = 365 × 364 ×....×
1\)
In R
, you can calculate this value by using
factorial()
For instance, if you have 50 people, then \(P(n)\) is calculated as follows:
1 - factorial(365) / (365^50 * factorial(365 - 50))
[1] NaN
NaN
means Not calculable
Inf
・Let’s calculate
\(100!\) on R
factorial(100)
[1] 9.332622e+157
This is a gigantic number with 157 decimal points!!
Let’s calculate \(200!\) on R
factorial(200)
[1] Inf
Inf
means that R fails to calculate because the number
is goo large\[P(n) = 1 - \frac{n!×_{365}C_n}{365^n(365-n)!}\]
C
means binomial coefficient\[_nC_k = \frac{n!}{k!(n-k)!}\]
C
by using
choose(n, k)
1 - ((factorial(50) * choose(365, 50)) / 365^50)
[1] 0.9703736
Probs
) is 0.97 (= 97%)<- function (n) {
Birth_Expect 1 - ((factorial(n) * choose(365, n)) / 365^n)
}Birth_Expect(22)
[1] 0.4756953
<- function (n) {
Birth_Expect 1 - ((factorial(n) * choose(365, n)) / 365^n)
}Birth_Expect(23)
[1] 0.5072972
We see that the probability that we have at least one shared birthday is when there are 23 students
Let’s add this theoretical value to the data frame
(Prob_df
) we made in
2.1.1 Simulation of birthday paradox
and name it
Expect
<- Prob_df |>
Prob_df mutate(Expect = map_dbl(Students,
~Birth_Expect(.x)))
::datatable(Prob_df) DT
pivot_longer()
, let’s customize this data to a
tidy one<- Prob_df |>
Prob_tidy pivot_longer(cols = Probs:Expect,
names_to = "Type",
values_to = "Prob")
::datatable(Prob_tidy) DT
Probs
) and the result gained as theoretical value
(Expect
)%>%
Prob_tidy mutate(Type = ifelse(Type == "Probs", "simulation", "theory")) %>%
ggplot() +
geom_line(aes(x = Students,
y = Prob * 100,
color = Type),
size = 1) +
labs(x = "The number of students",
y = "The prob. that we have at least one shared birthday (Probs) \n(Trials = 100)",
color = "") +
theme(legend.position = "bottom")
<- 100 # the number of students
n_student <- 1000 # the number of trials
n_trials
# Prepare for the place where we put our simulation result: Result_vec
<- rep(NA, n_trials)
Result_vec
# iteration
for (i in 1:n_trials) {
<- sample(1:365,
Birth_vec
n_student, replace = TRUE)
<- any(duplicated(Birth_vec))
Result_vec[i] }
<- tibble(Students = 2:100,
Prob_df Probs = NA)
for (i in 1:nrow(Prob_df)) {
<- rep(NA, n_trials)
Result_vec
for (j in 1:n_trials) {
<- sample(1:365, Prob_df$Students[i], replace = TRUE)
Birth_vec <- any(duplicated(Birth_vec))
Result_vec[j]
}
$Probs[i] <- mean(Result_vec)
Prob_df }
%>%
Prob_df mutate(Expect = map_dbl(Students, ~Birth_Expect(.x))) %>%
pivot_longer(cols = Probs:Expect,
names_to = "Type",
values_to = "Prob") |>
mutate(Type = ifelse(Type == "Probs", "Simulation", "Theoretical value")) %>%
ggplot() +
geom_line(aes(x = Students, y = Prob * 100, color = Type), size = 1) +
labs(x = "The number of Students",
y = "The prob. that we have at least one shared birthday (Probs) \n(Trials = 1000)",
color = "") +
theme_minimal(base_family = "HiraKakuProN-W3") +
theme(legend.position = "bottom")
Let’s consider a circle of radius 1 centered at the origin, find its area without using \(π\)
Step 1
: Draw 4 lines around the circle and make a
square
Step 2
: Randomly add points inside of the
square
Step 3
: Calculate the ratio of dots inside and
outside of the square
→ We get the ratio of the area inside and outside of the square
→ We can calculate the are of the circle
→ We can estimate the value of \(\pi =
3.14\)
2 × 2 = 4
set.seed(2022)
<- tibble(x = runif(100, -1, 1),
pi_df y = runif(100, -1, 1))
pi_df
# A tibble: 100 × 2
x y
<dbl> <dbl>
1 0.632 0.910
2 0.295 -0.594
3 -0.759 0.681
4 0.0876 0.622
5 -0.631 0.148
6 0.272 -0.585
7 -0.851 0.784
8 -0.916 -0.831
9 -0.259 0.243
10 0.515 0.968
# … with 90 more rows
|>
pi_df ggplot() +
geom_rect(aes(xmin = -1,
ymin = -1,
xmax = 1,
ymax = 1),
fill = "white", color = "black") +
geom_point(aes(x = x, y = y)) +
coord_fixed(ratio = 1) +
theme_minimal()
geom_circle()
which
comes with {ggforce}
x = 0
, y = 0
and the radius of the
circle: \(r = 1\)%>%
pi_df ggplot() +
geom_rect(aes(xmin = -1,
ymin = -1,
xmax = 1,
ymax = 1),
fill = "white", color = "black") +
geom_circle(aes(x0 = 0,
y0 = 0,
r = 1)) +
geom_point(aes(x = x, y = y)) +
coord_fixed(ratio = 1) +
theme_minimal()
in_circle
\[(x-x')^2 + (y - y')^2 < r^2\]
\[x^2 + y^2 < 1^2\]
in_circle
, you can decide whether or not the
random dots satisfy this mathematical equation<- pi_df %>%
pi_df mutate(in_circle = if_else(x^2 + y^2 < 1^2, "Inside", "Outside"))
%>%
pi_df ggplot() +
geom_rect(aes(xmin = -1,
ymin = -1,
xmax = 1,
ymax = 1),
fill = "white",
color = "black") +
geom_point(aes(x = x,
y = y,
color = in_circle),
size = 2) +
geom_circle(aes(x0 = 0,
y0 = 0,
r = 1)) +
labs(x = "X", y = "Y", color = "") +
coord_fixed(ratio = 1) +
theme_void(base_size = 12)
group_by()
, you can count the number of dots
inside and outside of the circle|>
pi_df group_by(in_circle) |>
summarise(N = n())
# A tibble: 2 × 2
in_circle N
<chr> <int>
1 Inside 82
2 Outside 18
4 * 0.82
[1] 3.28
set.seed(2022)
<- tibble(x = runif(2000, -1, 1),
pi_df2 y = runif(2000, -1, 1))
<- pi_df2 %>%
pi_df2 mutate(in_circle = if_else(x^2 + y^2 < 1^2, "Inside", "Outside"))
%>%
pi_df2 ggplot() +
geom_rect(aes(xmin = -1,
ymin = -1,
xmax = 1,
ymax = 1),
fill = "white", color = "black") +
geom_point(aes(x = x,
y = y,
color = in_circle,
alpha = 0.5),
size = 2) +
geom_circle(aes(x0 = 0,
y0 = 0,
r = 1)) +
labs(x = "X", y = "Y", color = "") +
coord_fixed(ratio = 1) +
theme_void(base_size = 12)
%>%
pi_df2 group_by(in_circle) %>%
summarise(N = n())
# A tibble: 2 × 2
in_circle N
<chr> <int>
1 Inside 1562
2 Outside 438
4 * 1562 / 2000
[1] 3.124
<- 2:2000 # Change the number of dots from 2 to 2000
dots <- length(dots)
n <- rep(NA, n)
results
for (i in 1:n) {
<- tibble(x = runif(dots[i], -1, 1),
results[i] y = runif(dots[i], -1, 1)) |>
mutate(in_circle = if_else(x^2 + y^2 < 1^2, "Inside", "Outside")) |>
group_by(in_circle) |>
summarise(N = n()) |>
filter(in_circle == "Inside") |> # Select the dots insides of the circle
summarise(mean(N/dots[i]*4)) # Estimate pi
}
<- unlist(results) # Transform the date into vector
pi <- tibble(dots, pi) # Make a data frame, df_pi
df_pi
%>%
df_pi ggplot() +
geom_line(aes(x = dots,
y = pi),
size = 0.2,
color = "blue",
alpha = 0.8) +
geom_hline(yintercept = 3.14, # Draw the line = 3.14
color = "red",
linetype = 2) +
labs(x = "The number of dots",
y = "Estimated pi") +
ggtitle("Estimated pi gained from Monte Carlo Simulation")
Let’s think about Monty Hall problem
The Monty Hall problem is a brain teaser, in the form of a probability puzzle, loosely based on the American television game show Let’s Make a Deal and named after its original host, Monty Hall.
【Rules】
Stay
:
You should stay, because there is no difference between picking door
No.1 and door No.2 in terms of winning a car.
Move
:
You should pick door No.2 because it enhances your chance of winning a
car.
【Simulation on R】
Script for Monty Hall Problem on R
set.seed(1838-3-11) # Set a random number, Okuma's DOB
# Player are supposed to play this game 200 times
<- 200
trials <- sample(1:3, trials, replace = TRUE)
prize
prize<- prize == 1 # Define that a car is behind door No.1
stay head(stay)
<- prize != 1 # Define that a car is not behind door No.2 and No.3
move head(move)
<- prob.move <- rep(NA, trials)
prob.stay for (i in 1:trials) {
<- sum(stay[1:i]) / i # Prob. of winning a car if you stay
prob.stay[i] <- sum(move[1:i]) / i # Prob . of winning a car if you move
prob.move[i]
# Plot your simulation results
plot(1:trials, prob.move, ylim = c(0, 1),
type = "l", col = "red",
xlab = "number of trials",
ylab = "prob. of getting a car",
main = "Simulation of Monty Hall Problem")
lines(1:trials, prob.stay, type = "l", col = "blue")
abline(h = c(0.33, 0.66), lty = "dotted") # Draw two lines with prob = 33 % and prob = 66%
legend("topright", lty = 1, col = c("red", "blue"),
legend = c("change", "stay"))}
・How to run this simulation:
1. Go to File
on R
(not
RStudio
) 2. Click New File
3. Paste the R code above
(Script for Monty Hall Problem on R
) 4. Hit the
return key
or enter key
and you get the
following results: