R Packages we use in this section

Statistics | Objectives |
---|---|
Descriptive Statistics | Quantitatively describes or summarizes features from a collection of information |
Inferential Statistics | Infers properties of a population by testing hypotheses and deriving estimates |
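The simulation and plotting code later in this section uses `tibble()` and `ggplot()`. Assuming the course loads the tidyverse (an assumption — the package list did not survive conversion), the setup would be:

```r
# Assumed setup: the simulations below use tibble() and ggplot(),
# both provided by the tidyverse meta-package.
library(tidyverse)
```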
Two key terms in inferential statistics are the parameter (a property of the population) and the statistic (a quantity computed from the sample).

How to estimate
・Calculate a sample statistic (such as the sample mean \(\bar{X}\)) → infer the corresponding parameter (such as the population mean \(\mu\))
Assumptions in statistical estimation
・The population distribution is normally distributed.
・But this is not always the case (many populations are not normally distributed).
・The Central Limit Theorem enables us to conduct statistical estimation even so.
Which estimation approach to use depends on the population distribution and the population variance:

Population Distribution | Population Variance \(\sigma^2\) | Approach |
---|---|---|
Normal Distribution | Known | Use population variance \(\sigma^2\) |
Normal Distribution | Unknown | Use unbiased variance & t-distribution |
Unknown | Unknown | Use unbiased variance & t-distribution & collect more samples |
(1) Is the estimation method ‘point estimation’ or ‘interval estimation’?
(2) Is the population distribution ‘normal’?
(3) Is the ‘population variance’ known or unknown?
(4) Is the estimation aimed at the ‘population mean’ or the ‘population variance’?
\[\bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n}\]
The sample mean \(\bar{X}\) is a random variable because it is the sum of random variables (\(X_1, X_2, \ldots\)). As a random variable, it is denoted with a capital letter (the lowercase \(x\) represents a specific observation).

Variance

\[S^2 = \frac{(X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots + (X_n-\bar{X})^2}{n}\]
The sample variance \(S^2\) is also a random variable because it is a function of the random variables (\(X_1, X_2, \ldots\)). The unbiased variance divides by \(n-1\) instead of \(n\):

\[U^2 = \frac{(X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots + (X_n-\bar{X})^2}{n-1}\]
Why we use unbiased variance instead of sample variance ①

・Sample variance \(S^2\) underestimates the true variance \(\sigma^2\).
・Let’s express the expected value of sample variance \(S^2\) in an equation.
\[E[S^2] = \sigma^2 - \frac{1}{n}\sigma^2\]
・It can be seen that sample variance \(S^2\) underestimates the true variance \(\sigma^2\) by \(\frac{1}{n}\sigma^2\).
・The degree of underestimation decreases as \(n\) increases.
・Let’s rewrite the above equation:
\[E[S^2] = \sigma^2 - \frac{1}{n}\sigma^2 = \frac{n-1}{n}\sigma^2\]
\[E[S^2] = \frac{n-1}{n}\sigma^2\]
・To make this expected value of sample variance \(S^2\) equal to the true variance \(\sigma^2\), it is sufficient to
multiply \(E[S^2]\) by \(\frac{n}{n-1}\).
・The equation representing sample variance \(S^2\) is as follows:
\[S^2 = \frac{(X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots + (X_n-\bar{X})^2}{n}\]
・Let’s try multiplying both sides by \(\frac{n}{n-1}\):
\[\frac{n}{n-1}\cdot S^2 = \frac{(X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots + (X_n-\bar{X})^2}{n}\cdot\frac{n}{n-1}\\ = \frac{(X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots + (X_n-\bar{X})^2}{n-1} \\= U^2\]
・This matches the formula for unbiased variance.
→ The expected value of the unbiased variance is
\[E[U^2] = \sigma^2\]
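A quick way to see this in R: `var()` computes the unbiased variance \(U^2\) (division by \(n-1\)), and a small simulation (a sketch with an arbitrary seed and a standard normal population, so \(\sigma^2 = 1\)) shows that the average of \(S^2\) across many samples is close to \(\frac{n-1}{n}\sigma^2\), while the average of \(U^2\) is close to \(\sigma^2\):

```r
set.seed(123)                    # arbitrary seed for reproducibility
n    <- 5                        # sample size
reps <- 20000                    # number of simulated samples
s2 <- u2 <- numeric(reps)
for (i in 1:reps) {
  x     <- rnorm(n)                    # population: N(0, 1), so sigma^2 = 1
  s2[i] <- sum((x - mean(x))^2) / n    # sample variance S^2 (divides by n)
  u2[i] <- var(x)                      # unbiased variance U^2 (divides by n - 1)
}
mean(s2)   # close to (n-1)/n * sigma^2 = 0.8
mean(u2)   # close to sigma^2 = 1
```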
Why we use unbiased variance instead of sample variance ②
・Here, we are performing point estimation of the population variance \(\sigma^2\) when the population mean \(\mu\) is unknown.
・Assume we have obtained three observed values: \(x_1\), \(x_2\), and \(x_3\).
・The sample variance is calculated from the squared differences between the mean of the three observed values and each observed value.
- If we want to estimate the true population variance \(\sigma^2\), ideally we should calculate the sample variance using the population mean \(\mu\).
- However, since the population mean \(\mu\) is unknown, we are using the sample
mean \(\bar{X}\) obtained from the
sample.
- We are using the difference between the mean calculated from the
obtained sample and the individual observation values.
→ If we compare it with the difference between the true population mean
\(\mu\) and the individual observation
values, …
→ The difference between the mean calculated from the obtained sample
and the individual observation values will be smaller.
- The following formula can be used to calculate the sample variance \(S^2\):
\[S^2 = \frac{(X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots + (X_n-\bar{X})^2}{n}\]
・If the population mean \(\mu\) were known, the sample variance would instead be calculated from the squared differences between \(\mu\) and each observed value:
\[S^2 = \frac{(X_1-\mu)^2 + (X_2-\mu)^2 + \cdots + (X_n-\mu)^2}{n}\]
・In point estimation, the sample variance is smaller when computed around the sample mean \(\bar{X}\) than when computed around the true population mean \(\mu\), because \(\bar{X}\) is the value that minimizes the sum of squared deviations.
Since probability corresponds to area, the area enclosed by the
graph and the horizontal axis = 1 (= 100%).
Moving 1.96 times the standard deviation (\(\sigma\)) to the left and right from \(\mu\) (i.e., \(\mu \pm 1.96\sigma\)) covers 95% of the area.
Extract only one sample \(x\)
from this population.
The probability that the value of x is within the dark grey area is
95%.
→ The following inequality holds with 95% probability.
\[\mu−1.96\sigma ≦ X ≦ \mu+1.96\sigma\]
Solve this inequality for \(\mu\):
\[X−1.96\sigma ≦ \mu ≦ X+1.96\sigma\]
Substituting the observed value \(x = 160\) and \(\sigma = 8\):
\[144.32 ≦ \mu ≦ 175.68\]
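This calculation can be checked in R (assuming, consistent with the interval shown, that the drawn value was \(x = 160\) with \(\sigma = 8\)):

```r
x     <- 160   # the single observed value (assumed from the interval shown)
sigma <- 8     # known population standard deviation
lower <- x - 1.96 * sigma
upper <- x + 1.96 * sigma
c(lower, upper)   # 144.32 175.68
```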
Key Points to Understand a 95% Confidence Interval

・A 95% confidence interval (represented as a horizontal bar) can be created for each extracted value.
・Out of the many possible confidence intervals, 95% will include the
population mean \(\mu\).
→ 5% will fail to capture the population
mean.
The property of the normal distribution

The sample mean of the \(n = 9\) observations (R output):

[1] 168.5
\[μ−1.96\frac{\sigma}{\sqrt{n}}≦ \bar{X} ≦ μ+ 1.96\frac{\sigma}{\sqrt{n}}\]
Solving for \(\mu\)
\[\bar{X}−1.96\frac{\sigma}{\sqrt{n}}≦ \mu ≦
\bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\]
Substitute \(\bar{X} = 168.5\), \(n = 9\), \(\sigma = 8\):
\[168.5−1.96\frac{8}{\sqrt{9}}≦ \mu ≦ 168.5+ 1.96\frac{8}{\sqrt{9}}\\ = 168.5−5.23 ≦ \mu ≦ 168.5+5.23\\ = 163.3 ≦ \mu ≦ 173.7\]
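The same interval can be computed in R:

```r
xbar  <- 168.5   # sample mean
sigma <- 8       # known population standard deviation
n     <- 9       # sample size
margin <- 1.96 * sigma / sqrt(n)       # half-width of the 95% interval
c(xbar - margin, xbar + margin)        # approximately 163.3 173.7
```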
Key Points to Understand Interval Estimation

・By taking more samples, a more precise (narrower) interval estimate can be made.
・This is because the estimation is based on the distribution of sample means, not the distribution of the samples themselves.
・The distribution of the sample mean is sharper (i.e., has a smaller variance) than the population distribution.
\[\bar{X}−1.96\frac{\sigma}{\sqrt{n}}≦ \mu ≦ \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\]
However, this is only possible when
the population variance \(\sigma^2\) is
known.
→ It cannot be used when the population variance \(\sigma^2\) is unknown
Why Interval Estimation Cannot Be Performed When Population Variance \(\sigma^2\) is Unknown
・Properties of Expectation and Variance
・If a random variable is multiplied by \(a\) and the constant \(b\) is added, then \(a\) factors out of the expectation and \(b\) comes out as a constant:
\[E[aX + b] = aE[X] + b\]
・For the variance, the constant \(a\) comes out squared and \(b\) disappears:
\[V[aX + b] = a^2V[X]\]
Definition of Variance \[V[X] = E[(X-E[X])^2]\\ = E[X^2]-E[X]^2\]
・\(\bar{X}\)’s mean is \(\mu\)
・\(\bar{X}\)’s variance is \(\frac{\sigma^2}{n}\)
 | \(\bar{X}\) | Standardization |
---|---|---|
Mean | \(\mu\) | 0 |
Variance | \(\frac{\sigma^2}{n}\) | 1 |
・The random variable \(Z\) follows
a standard normal distribution.
・The mean of the standard normal distribution is 0, and the variance is
1.
\[Z = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}}}\]
\[−1.96 ≦
\frac{\bar{X}-\mu}{\frac{\sigma}{\sqrt{n}}} ≦ 1.96\]
・Solving for \(\mu\) \[\bar{X}−1.96\frac{\sigma}{\sqrt{n}}≦ \mu
≦ \bar{X}+1.96\frac{\sigma}{\sqrt{n}}\]
\[Z = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}}}\]
・Instead of the population standard deviation \(\sigma\), we use \(U\) (the square root of the unbiased variance \(U^2\)) for estimation:
\[T = \frac{\bar{X} - \mu}{\frac{U}{\sqrt{n}}}\]
Theorem: t distribution

・Assume there are independent random variables \(X_1, X_2, \ldots, X_n\) that follow a normal distribution with mean \(\mu\) and variance \(\sigma^2\).
・In this case, \(T\) follows a \(t\) distribution with \(n-1\) degrees of freedom:
\[T = \frac{\bar{X} - \mu}{\frac{U}{\sqrt{n}}}\]
\[−3.182 ≦ \frac{\bar{X}-\mu}{\frac{U}{\sqrt{n}}} ≦ 3.182\]
\[\bar{X} = \frac{500 + 450 + 623 + 400}{4} = 493.25\]
\[U^2 = \frac{(x_1-\bar{x})^2 + \cdots + (x_n-\bar{x})^2}{n-1}\\ = \frac{(500-493.25)^2 + \cdots + (400-493.25)^2}{4-1} = 9149\]
so \(U = \sqrt{9149} ≈ 95.65\).
\[493.25−3.182\frac{95.65}{\sqrt{4}}≦ \mu ≦ 493.25+3.182\frac{95.65}{\sqrt{4}}\\ = 493.25−152.2 ≦ \mu ≦ 493.25+152.2\\ = 341.05 ≦ \mu ≦ 645.45\]
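The R command that produces the output below (assuming the four scores are stored in a vector named toefl, as the output's data: line suggests) is:

```r
toefl <- c(500, 450, 623, 400)   # the four observed scores
qt(0.975, df = 3)                # 3.182..., the t quantile used above
t.test(toefl)                    # one-sample t-test; default mu = 0
```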
One Sample t-test
data: toefl
t = 10.314, df = 3, p-value = 0.001944
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
341.0496 645.4504
sample estimates:
mean of x
493.25
This command has the same meaning as t.test(toefl, mu = 0). It performs a t-test on the obtained sample to test whether the population mean is 0.
The values below “95 percent confidence interval:” represent the 95% confidence interval (341.0496, 645.4504). This matches the values calculated manually above.
FYI (From Estimation to Hypothesis Testing)

・If you want to test whether the population mean is 342 using the four obtained samples, you can enter and execute the following in RStudio.
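A command consistent with the output below (repeating the toefl vector so the snippet is self-contained) would be:

```r
toefl <- c(500, 450, 623, 400)   # the four observed scores
t.test(toefl, mu = 342)          # test H0: population mean = 342
```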
One Sample t-test
data: toefl
t = 3.1626, df = 3, p-value = 0.05077
alternative hypothesis: true mean is not equal to 342
95 percent confidence interval:
341.0496 645.4504
sample estimates:
mean of x
493.25
→ Since p = 0.05077 > 0.05, the null hypothesis \(H_0: \mu = 342\) cannot be rejected.
One Sample t-test
data: toefl
t = 3.2044, df = 3, p-value = 0.04917
alternative hypothesis: true mean is not equal to 340
95 percent confidence interval:
341.0496 645.4504
sample estimates:
mean of x
493.25

→ Since p = 0.04917 < 0.05, the null hypothesis \(H_0: \mu = 340\) is rejected; 340 lies just outside the 95% confidence interval, while 342 lies just inside it.
Simulation of the Central Limit Theorem
Does the sample mean approximately follow a normal distribution as the sample size increases?
bag <- 0:9 # Create cards artificially from 0 to 9.
exp_1 <- sample(bag, # Use the bag containing cards.
size = 2, # Specify drawing 2 cards (sample size N = 2).
replace = TRUE) # Specify that the drawn cards are returned to the bag.
mean(exp_1) # Calculate and display the mean.
[1] 6
To repeat this experiment many times, we use a for loop.

How to use a for loop: if you want to add 10 five times, starting from 0, you can do the following. First, prepare a container called result to store the results.

result <- rep(NA, # NA represents empty contents.
length.out = 5) # Specify how many storage locations are needed.
result

[1] NA NA NA NA NA
Next, write the for loop itself: start with for and enclose the loop’s contents in curly braces {}.

A <- 0 # Initialize 'A' with the starting value 0.
for(i in 1:5){ # 'i' repeats from 1 to 5.
A <- A + 10 # Add 10 to 'A' for each iteration.
result[i] <- A # Store the result of the 'i-th' operation (i.e., addition) in 'result[i].'
}
result # Check the contents of the container.
[1] 10 20 30 40 50
→ You can see that the container (result) contains the expected values (numbers obtained by starting from 0 and adding 10 each time), as anticipated.
bag <- 0:9 # Create cards artificially
N <- 2 # Sample size (choose 2 cards)
trials <- 10 # Number of experiment repetitions = 10
sim1 <- rep(NA, length.out = trials) # Container to store results
for (i in 1:trials) {
experiment <- sample(bag,
size = N,
replace = TRUE) # Sampling with replacement
sim1[i] <- mean(experiment) # Save the mean for the i-th trial
}
df_sim1 <- tibble(avg = sim1)
h_sim1 <- ggplot(df_sim1, aes(x = avg)) +
geom_histogram(binwidth = 1,
boundary = 0.5,
color = "black") +
labs(x = "Mean of 2 cards",
y = "Frequency") +
ggtitle("SIM1: N = 2, Number of Trials = 10") +
scale_x_continuous(breaks = 0:9) +
geom_vline(xintercept = mean(df_sim1$avg), # Draw a vertical line at the mean
col = "lightgreen") +
theme_bw(base_size = 14, base_family = "HiraKakuPro-W3") # Command to prevent character encoding issues (for Mac users only)
plot(h_sim1) # Visualize and display the results
# A tibble: 10 × 1
avg
<dbl>
1 2.5
2 3.5
3 1.5
4 5.5
5 5.5
6 6.5
7 6.5
8 2
9 6
10 3.5
[1] 4.3
bag <- 0:9 # Create cards artificially
N <- 5 # Sample size (choose 5 cards)
trials <- 100 # Number of experiment repetitions = 100
sim2 <- rep(NA, length.out = trials) # Container to store results
for (i in 1:trials) {
experiment <- sample(bag,
size = N,
replace = TRUE) # Sampling with replacement
sim2[i] <- mean(experiment) # Save the mean for the i-th trial
}
df_sim2 <- tibble(avg = sim2)
h_sim2 <- ggplot(df_sim2, aes(x = avg)) +
geom_histogram(binwidth = 1,
boundary = 0.5,
color = "black") +
labs(x = "Mean of 5 cards",
y = "Frequency") +
ggtitle("SIM2: N = 5, Number of Trials = 100") +
scale_x_continuous(breaks = 0:9) +
geom_vline(xintercept = mean(df_sim2$avg), # Draw a vertical line at the mean
col = "lightgreen") +
theme_bw(base_size = 14, base_family = "HiraKakuPro-W3")
plot(h_sim2)
# A tibble: 100 × 1
avg
<dbl>
1 4.2
2 2.8
3 2.8
4 3.2
5 6
6 3.8
7 4.8
8 3
9 2.4
10 3.2
# ℹ 90 more rows
[1] 4.346
bag <- 0:9 # Create cards artificially
N <- 100 # Sample size (choose 100 cards)
trials <- 1000 # Number of experiment repetitions
sim3 <- rep(NA, length.out = trials) # Container to store results
for (i in 1:trials) {
experiment <- sample(bag, size = N, replace = TRUE) # Sampling with replacement
sim3[i] <- mean(experiment) # Save the mean for the i-th trial
}
df_sim3 <- tibble(avg = sim3)
h_sim3 <- ggplot(df_sim3, aes(x = avg)) +
geom_histogram(binwidth = 0.125,
color = "black") +
labs(x = "Mean of 100 cards", y = "Frequency")+
ggtitle("SIM3: N = 100, Number of Trials = 1000") +
scale_x_continuous(breaks = 0:9) +
geom_vline(xintercept = mean(df_sim3$avg), # Draw a vertical line at the mean
col = "lightgreen") +
theme_bw(base_size = 14, base_family = "HiraKakuPro-W3")
plot(h_sim3)
# A tibble: 1,000 × 1
avg
<dbl>
1 4.38
2 4.76
3 4.74
4 4.53
5 4.89
6 4.19
7 4.23
8 4.6
9 4.64
10 3.72
# ℹ 990 more rows
[1] 4.49938
What is the Bernoulli Distribution?

・A probability distribution that represents the results of experiments or trials with only two possible outcomes, such as win or lose, heads or tails, or pass or fail, using 0 and 1.
・When the probability of getting 1 is \(p\), the probability of getting 0 is \(1-p\), making it a simple probability
distribution.
・\(k\): The outcome value (1 for success, 0 for failure).
・\(p\): Probability of success.
\[E(X) =\sum kP(X = k) = p\]
\[V(X) = E(X^2)-(E(X))^2\\ = p(1-p)\]
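These two formulas can be verified numerically for any \(p\); a sketch with the arbitrary choice \(p = 0.3\):

```r
p  <- 0.3              # arbitrary success probability
k  <- c(0, 1)          # possible outcomes
pr <- c(1 - p, p)      # their probabilities
EX  <- sum(k * pr)     # E(X)   = p
EX2 <- sum(k^2 * pr)   # E(X^2) = p  (since 0^2 = 0 and 1^2 = 1)
VX  <- EX2 - EX^2      # V(X)   = p(1 - p)
c(EX, VX)              # 0.3 0.21
```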
\[p−1.96\sqrt{\frac{p(1-p)}{n}}≦ \bar{X} ≦ p+ 1.96\sqrt{\frac{p(1-p)}{n}}\]
\[R−1.96\sqrt{\frac{p(1-p)}{n}}≦ p ≦ R+ 1.96\sqrt{\frac{p(1-p)}{n}}\]
(where \(R\) is the sample proportion, i.e., the sample mean of the 0/1 outcomes)
→ Since the unknown \(p\) also appears inside the square root, the bounds of the 95% confidence interval cannot be computed as they stand.
→ As written, \(p\) cannot be estimated.
→ However, the \(p\) inside the square root can be approximated, and the inequality rearranged.
・When \(n\) is sufficiently large, \(R\) is very close to \(p\) (Law of Large Numbers).
Law of Large Numbers ・The sample mean of independent random variables from the same distribution converges to the population mean.
\[R−1.96\sqrt{\frac{p(1-p)}{n}}≦ p≦ R+ 1.96\sqrt{\frac{p(1-p)}{n}}\]
\[R−1.96\sqrt{\frac{R(1-R)}{n}}≦ p≦ R+ 1.96\sqrt{\frac{R(1-R)}{n}}\] → 95% CI
\[R−1.96\sqrt{\frac{R(1-R)}{n}}≦ p≦ R+ 1.96\sqrt{\frac{R(1-R)}{n}}\]
\[0.75-1.96\sqrt{\frac{0.75*0.25}{100}}≦
p≦ 0.75+1.96\sqrt{\frac{0.75*0.25}{100}}\]
\[0.67 ≦ p ≦ 0.83\]
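Computing the interval directly in R, with the sample proportion 0.75 and \(n = 100\) from above:

```r
R_hat <- 0.75   # sample proportion (e.g., 75 successes out of 100)
n     <- 100
margin <- 1.96 * sqrt(R_hat * (1 - R_hat) / n)   # half-width of the interval
c(R_hat - margin, R_hat + margin)                # approximately 0.665 0.835
```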
The Chi-Square (\(\chi^2\)) theorem ・If we have independent random variables \(X_1, X_2, ..., X_n\) following a normal distribution with mean \(\mu\) and variance \(\sigma^2\), then
\[T =
\frac{(n-1)U^2}{\sigma^2}\]
follows a chi-square (\(\chi^2\))
distribution with \(n-1\) degrees of
freedom.
・\(U^2\) represents the unbiased sample variance calculated from \(n\) observations.
・\(T\) follows a \(\chi^2\) distribution with \(n-1\) degrees of freedom.
・\(U^2\) is itself a random variable, so \(T\) is also a random variable.

\[T = \frac{(n-1)U^2}{\sigma^2}\]

This \(T\) can also be written as:
\[T = (\frac{X_1-\bar{X}}{\sigma})^2 + (\frac{X_2-\bar{X}}{\sigma})^2 +...+ (\frac{X_n-\bar{X}}{\sigma})^2\]
What \(T\) stands for ・The sum of how much each sample, \(X_1, X_2,..., X_n\), deviates from the sample mean, \(\bar{X}\).
\[T = (\frac{X_1-\bar{X}}{\sigma})^2 +
(\frac{X_2-\bar{X}}{\sigma})^2
+...+ (\frac{X_n-\bar{X}}{\sigma})^2\]
\[= \frac{n-1}{\sigma^2}\cdot\frac{(X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots + (X_n-\bar{X})^2}{n-1}\]
・The fraction on the right, \(\frac{(X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots + (X_n-\bar{X})^2}{n-1}\), is the unbiased variance \(U^2\), so
\[= \frac{(n-1)U^2}{\sigma^2}\]
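The equality of the two expressions for \(T\) can be confirmed numerically; a sketch with arbitrary sample values and an arbitrary \(\sigma\):

```r
x     <- c(500, 450, 623, 400)   # arbitrary sample values
sigma <- 50                      # arbitrary population standard deviation
T1 <- sum(((x - mean(x)) / sigma)^2)       # sum-of-squared-deviations form
T2 <- (length(x) - 1) * var(x) / sigma^2   # (n-1)U^2 / sigma^2 form (var() is unbiased)
all.equal(T1, T2)   # TRUE
```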
Interpretation of \(T\):

→ When the differences between the sample values and the sample mean are small, the value of \(T\) is small.
→ When the differences between the sample values and the sample mean are large, the value of \(T\) is large.

\(T\) represents the (standardized) sum of how much each sample \(X_1, X_2, \ldots, X_n\) deviates from the sample mean \(\bar{X}\):
\[T = \frac{(n-1)U^2}{\sigma^2}\]
\[2.7 ≦ \frac{(n-1)U^2}{\sigma^2}≦ 19.2\]
\[\frac{(n-1)U^2}{19.2}≦ \sigma^2 ≦ \frac{(n-1)U^2}{2.7}\]
Substituting \(n = 10\) and \(U^2 = 9.2\) (so \((n-1)U^2 = 82.8\)):
\[\frac{82.8}{19.2}≦ \sigma^2 ≦ \frac{82.8}{2.7}\] \[4.3 ≦ \sigma^2 ≦ 30.7\]
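In R, the exact \(\chi^2\) quantiles come from qchisq. They are 2.70 and 19.02; the 19.2 above appears to be a rounded table value, so the lower bound here comes out slightly higher (≈4.35 rather than 4.3):

```r
n  <- 10
U2 <- 9.2
lower <- (n - 1) * U2 / qchisq(0.975, df = n - 1)   # qchisq(0.975, 9) = 19.02...
upper <- (n - 1) * U2 / qchisq(0.025, df = n - 1)   # qchisq(0.025, 9) = 2.70...
c(lower, upper)   # approximately 4.35 30.66
```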
Statistical hypothesis testing involves verifying hypotheses about a population based on information obtained from a sample.

Theorem on the t distribution

・Assume there are independent random variables \(X_1, X_2, \ldots, X_n\) that follow a normal distribution with mean \(\mu\) and variance \(\sigma^2\).
- In this case,
\[T = \frac{\bar{X} - \mu}{\frac{U}{\sqrt{n}}}\]
follows a t distribution with \(n-1\) degrees of freedom.
\[T = \frac{\bar{X} - \mu}{\frac{U}{\sqrt{n}}} = \frac{493.25-300}{\frac{95.65}{\sqrt{4}}}= \frac{193.25}{47.83} ≈ 4.04\]
Verify using R (Two-tailed test)
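The command that produces this output (assuming the scores are stored in a vector named test, as the output's data: line suggests) would be:

```r
test <- c(500, 450, 623, 400)   # the four observed scores
t.test(test, mu = 300)          # two-tailed test of H0: population mean = 300
```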
One Sample t-test
data: test
t = 4.0408, df = 3, p-value = 0.02727
alternative hypothesis: true mean is not equal to 300
95 percent confidence interval:
341.0496 645.4504
sample estimates:
mean of x
493.25
→ The p-value obtained here is 0.02727.
→ Null Hypothesis \(H_0\): “The
population mean is 300 points” is rejected.
→ The population mean is more than 300 points
(since the sample mean is 493 points)
Using the same theorem and the same test statistic (\(T ≈ 4.04\)), we can also run a one-tailed test.
Verify using R (one-tailed t.test)
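The corresponding one-tailed command (again assuming the vector test holds the four scores) would be:

```r
test <- c(500, 450, 623, 400)                     # the four observed scores
t.test(test, mu = 300, alternative = "greater")   # H1: population mean > 300
```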
One Sample t-test
data: test
t = 4.0408, df = 3, p-value = 0.01364
alternative hypothesis: true mean is greater than 300
95 percent confidence interval:
380.7004 Inf
sample estimates:
mean of x
493.25
→ The p-value obtained here is 0.01364.
→ Null Hypothesis \(H_0\): “The population mean is at most 300 points” is rejected (p = 0.01364 < 0.05).
→ The population mean is more than 300
points