• R Packages we use in this section
library(tidyverse)
library(patchwork)

1. Statistics

Statistics Objectives
Descriptive Statistics: Quantitatively describes or summarizes features from a collection of information
Inferential Statistics: Infer properties of a population, by testing hypotheses and deriving estimates

Descriptive Statistics

  • Descriptive statistics is solely concerned with properties of the observed data, and it does not rest on the assumption that the data come from a larger population.
  • Descriptive statistics is distinguished from inferential statistics by its aim to summarize a sample, rather than use the data to learn about the population that the sample of data is thought to represent.
  • Descriptive statistics, unlike inferential statistics, is not developed on the basis of probability theory and frequently uses nonparametric statistics.

Inferential Statistics

  • The aim of inferential statistics is to infer properties of a population, by testing hypotheses and deriving estimates.
  • Inferential statistics is developed on the basis of probability theory and frequently uses parametric statistics.

2. Population and Sample

Population:

  • A population is the pool of individuals from which a statistical sample is drawn for a study
Source: 浅野正彦, 矢内勇生.『Rによる計量政治学』オーム社、2018年、p.117

Sample:

  • A sample is a smaller, manageable version of a larger group.
  • It is a subset containing the characteristics of a larger population.
  • Samples are used in statistical testing when population sizes are too large for the test to include all possible members or observations.
  • A sample should represent the population as a whole and not reflect any bias toward a specific attribute.

Population parameter and Sample statistic

  • In inferential statistics, it is important to differentiate between a parameter and a statistic.

How to estimate ・Calculate a sample statistic (such as the sample mean: \(\bar{X}\))
→ Infer the parameter (such as the population mean: \(\mu\))

Assumptions in statistical estimation ・The population distribution is normally distributed
・But this is not always the case (many populations are not normally distributed)
・The Central Limit Theorem enables us to conduct statistical estimation anyway

I. Statistical Estimation

3. Point Estimation and Interval Estimation

Estimation refers to:

  • Using specific values to predict something like “The average value of a population is probably around ○○.”
1) Point Estimation
  • Based on a sample, predicting a specific population parameter value at a pinpoint.
    => ex: The average height of female students at Waseda University is probably around 160 cm.
2) Interval Estimation
  • Based on a sample, predicting a specific population parameter value within a certain range.
    => ex: The average height of female students at Waseda University is probably between 160 cm and 165 cm.

Population Distribution and Population Variance

Population Distribution    Population Variance \(\sigma^2\)    Approach
Normal Distribution        Known                               Use the population variance \(\sigma^2\)
Normal Distribution        Unknown                             Use the unbiased variance & t-distribution
Unknown                    Unknown                             Use the unbiased variance & t-distribution & collect more samples

(1) Is the estimation method ‘point estimation’ or ‘interval estimation’?
(2) Is the population distribution ‘normal’?

  • Even if not normal, ‘Central Limit Theorem’ + large sample size can enable estimation.

(3) Is the ‘population variance’ known or unknown?

  • If unknown, use ‘unbiased variance’ instead.
  • Estimation possible using t-distribution.

(4) Is the estimation aimed at ‘population mean’ or ‘population variance’?

  • Consider estimating the population mean with n samples from the population.
  • The diagram below summarizes the content explained here.

4. Point Estimation

4.1 Point Estimation of the ‘Mean’

  • Taking \(n\) samples from a population where neither the population mean \(\mu\) nor the population variance is known

  • Calculate the sample mean \(\bar{X}\) (read as ‘X-bar’).

\[\bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n}\]

  • \(X_1, X_2, ...\) etc., are chosen randomly
    → Since it’s unknown which numbers will be selected, it can be called random variable.
  • \(\bar{X}\) is a random variable because it is the sum of random variables (\(X_1, X_2, ...\)).
  • The sample mean \(\bar{X}\) is a random variable, so it is denoted with a capital letter (the lowercase \(x\) represents a specific observation).
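
As a small supplementary illustration (the normal population with mean 160 and standard deviation 8 used here is an arbitrary choice), each random sample yields a different value of \(\bar{X}\):
set.seed(2023)                           # For reproducibility
mean(rnorm(n = 10, mean = 160, sd = 8))  # 1st sample of size 10 -> one value of X-bar
mean(rnorm(n = 10, mean = 160, sd = 8))  # 2nd sample -> a different value of X-bar
mean(rnorm(n = 10, mean = 160, sd = 8))  # 3rd sample -> yet another value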

4.2 Point Estimation of Variance

  • Taking \(n\) samples from a population where neither population mean \(\mu\) nor population variance \(\sigma^2\) is known.

  • Calculate the sample variance \(S^2\)

\[S^2= \frac{(X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots + (X_n-\bar{X})^2}{n}\]

  • \(S^2\) is a random variable because it is the sum of random variables (\(X_1, X_2, ...\)).
  • However, the sample variance is inappropriate for point estimation, so it is not used.
  • The unbiased variance \(U^2\) is more suitable for point estimation.
  • We calculate the unbiased variance \(U^2\).

\[U^2= \frac{(X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots + (X_n-\bar{X})^2}{n-1}\]

  • \(S^2\) is smaller than \(U^2\).
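
In R, the built-in var() function computes the unbiased variance \(U^2\) (division by \(n-1\)); a sample variance that divides by \(n\) must be computed by hand. A small sketch with an arbitrary data vector:
x <- c(500, 450, 623, 400)           # An arbitrary sample
var(x)                               # Unbiased variance U^2 (divides by n - 1)
sum((x - mean(x))^2) / length(x)     # Sample variance S^2 (divides by n) -> smaller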

Why we use unbiased variance instead of sample variance ・Sample variance \(S^2\) underestimates the true variance \(\sigma^2\).
・Let’s express the expected value of sample variance \(S^2\) in an equation.

\[E[S^2] = \sigma^2 - \frac{1}{n}\sigma^2\]

・It can be seen that sample variance \(S^2\) underestimates the true variance \(\sigma^2\) by \(\frac{1}{n}\sigma^2\).
・The degree of underestimation decreases as \(n\) increases.
・Let’s rewrite the above equation:

\[E[S^2] = \sigma^2 - \frac{1}{n}\sigma^2 = \frac{n-1}{n}\sigma^2\]

\[E[S^2] = \frac{n-1}{n}\sigma^2\]

・To make this expected value of sample variance \(S^2\) equal to the true variance \(\sigma^2\), it is sufficient to multiply \(E[S^2]\) by \(\frac{n}{n-1}\).
・The equation representing sample variance \(S^2\) is as follows:

\[S^2= \frac{(X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots + (X_n-\bar{X})^2}{n}\]

・Let’s try multiplying both sides by \(\frac{n}{n-1}\)

\[\frac{n}{n-1}\cdot S^2= \frac{(X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots + (X_n-\bar{X})^2}{n}\cdot\frac{n}{n-1}\\ = \frac{(X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots + (X_n-\bar{X})^2}{n-1} \\=U^2\]

・This matches the formula for unbiased variance.
→ The expected value of the unbiased variance is

\[E[U^2] = \sigma^2\]
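
These two expectations can be checked with a small supplementary simulation (the normal population with \(\sigma^2 = 64\) and the sample size \(n = 5\) are arbitrary choices):
set.seed(2023)
n      <- 5                                  # Small sample size
sigma2 <- 64                                 # True population variance
s2 <- u2 <- rep(NA, length.out = 10000)      # Containers for the two estimators
for (i in 1:10000) {
  x     <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  s2[i] <- sum((x - mean(x))^2) / n          # Sample variance (divides by n)
  u2[i] <- var(x)                            # Unbiased variance (divides by n - 1)
}
mean(s2)   # Close to (n - 1) / n * sigma2 = 51.2
mean(u2)   # Close to sigma2 = 64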

Why we use unbiased variance instead of sample variance

Sample variance when the population mean \(μ\) is unknown.

・Here, we are performing point estimation of the population variance \(\sigma^2\) when the population mean \(\mu\) is unknown.
・Assume we have obtained three observation values: \(x_1\), \(x_2\), and \(x_3\).

・The sample variance is calculated from the squared differences between each observation value and the mean of the three observation values (\(x_1\), \(x_2\), and \(x_3\)).
- If we want to estimate the true population variance \(\sigma^2\), ideally, we should calculate the sample variance using the population mean \(\mu\).
- However, since the population mean \(\mu\) is unknown, we are using the sample mean \(\bar{X}\) obtained from the sample.
- We are using the difference between the mean calculated from the obtained sample and the individual observation values.
→ If we compare it with the difference between the true population mean \(\mu\) and the individual observation values, …
→ The difference between the mean calculated from the obtained sample and the individual observation values will be smaller.
- The following formula can be used to calculate the sample variance \(S^2\):

\[S^2= \frac{(X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots + (X_n-\bar{X})^2}{n}\]

Sample variance when the population mean \(\mu\) is known

・The sample variance is calculated from the squared differences between the known population mean \(\mu\) and each of the three observation values \(x_1\), \(x_2\), and \(x_3\).
- The following formula can be used to calculate the sample variance:

\[S^2= \frac{(X_1-\mu)^2 + (X_2-\mu)^2 + \cdots + (X_n-\mu)^2}{n}\]

・In point estimation, the sample variance computed with the mean of the three observation values \(\bar{X}\) (used as a substitute) is smaller than the sample variance computed with the true population mean \(\mu\).

→ Using the mean of the observation values \(\bar{X}\) leads to an underestimation of the population variance \(\sigma^2\).

5. Interval Estimation

5.1 When the Population Variance is Known

  • In ‘point estimation,’ we predicted the population mean and variance with pinpoint accuracy.
  • In ‘interval estimation,’ we predict the population mean and variance with a range. Here, we assume that the population variance \(\sigma^2\) is known.
    ↑ This assumption is not very common in reality…
    → But here, we deliberately make this assumption.
    ・In ‘interval estimation,’ we use the confidence level to make estimations with a range.

Estimating the population mean with a single sample

Refer to the left side of the figure below:

  • Assume the population is normally distributed.
  • The population mean \(\mu\) is unknown.
  • The population variance \(\sigma^2\) = 8^2.
  • Extract only one sample \(X\) from this population.
    → Estimate the population mean \(\mu\) with 95% confidence.

Refer to the right side of the figure above:
  • Since probability corresponds to area, the area enclosed by the graph and the horizontal axis = 1 (= 100%).

  • Moving 1.96 times the standard deviation (\(\sigma\)) to the left and right from \(\mu\) (equals 1.96\(\sigma\)) covers 95% of the area.

  • Extract only one sample \(x\) from this population.
    The probability that the value of x is within the dark grey area is 95%.
    → The following inequality holds with 95% probability.
    \[μ−1.96σ ≦ X ≦ μ+1.96σ\]

  • Solve this equation for \(\mu\):

\[X−1.96σ ≦ μ ≦ X +1.96σ\]

  • Assuming the sample \(x\) extracted from this population is 160 cm.
  • Substitute X = 160, σ = 8.

\[144.32 ≦ μ ≦ 175.68\]

  • This is the 95% confidence interval.
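
This interval can be reproduced in R as a quick check of the arithmetic above:
x     <- 160                            # The single observed value
sigma <- 8                              # Known population standard deviation
c(x - 1.96 * sigma, x + 1.96 * sigma)   # 95% confidence interval
[1] 144.32 175.68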

What is a “95% Confidence Interval”?

  • Consider taking two samples.
  • If a sample is taken a little to the left of the population mean \(\mu\), then the 95% confidence interval includes the population mean \(\mu\).
  • If a sample is taken a little to the right of the population mean \(\mu\), the 95% confidence interval includes the population mean \(\mu\).

Common Misconception:

  • One might think that a “95% confidence interval” means there’s a 95% probability that the interval contains the value of the parameter we’re interested in.
  • However, this is not correct.
  • If the parameter is within the interval, the probability that this confidence interval includes the parameter is 100%.
  • If the parameter is not within the interval, the probability is 0%.
  • A single confidence interval either contains or does not contain the parameter.
  • The existence of the parameter’s “true value” (a constant) is assumed, regardless of whether we know this value.
  • Therefore, it’s not possible for a 95% confidence interval to have a 95% probability of capturing the parameter we want to estimate.
  • The probability is either 100% (successfully capturing the parameter) or 0% (missing the parameter).

Application:

  • A 95% confidence interval is calculated for each sample.
  • Of these calculated 95% confidence intervals, 95% will capture the true parameter within their range, while the remaining 5% will fail to capture it.

  • When 20 samples are taken, a “95% confidence interval” means that, on average, 19 of the intervals include the population mean \(\mu\) and 1 does not.

Key Points to Understand a 95% Confidence Interval ・A 95% confidence interval (represented as a horizontal bar) can be created for each extracted value.
・Out of the many possible confidence intervals, 95% will include the population mean \(\mu\).
5% will fail to capture the population mean.
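
This interpretation can be ‘experienced’ with a small supplementary simulation (assuming a normal population with \(\mu = 160\) and \(\sigma = 8\)): draw many samples, build a 95% confidence interval from each, and count how often the interval captures \(\mu\).
set.seed(2023)
mu <- 160; sigma <- 8; n <- 9                 # Assumed population and sample size
trials  <- 10000
capture <- rep(NA, length.out = trials)       # Does the i-th interval contain mu?
for (i in 1:trials) {
  xbar  <- mean(rnorm(n, mean = mu, sd = sigma))
  lower <- xbar - 1.96 * sigma / sqrt(n)
  upper <- xbar + 1.96 * sigma / sqrt(n)
  capture[i] <- (lower <= mu) & (mu <= upper)
}
mean(capture)   # Proportion of intervals capturing mu: approximately 0.95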

Interval Estimation of the Population Mean with Multiple Samples:

  • It is not practical to estimate the population mean using a single sample.
    → It is common to extract multiple samples for interval estimation of the population mean.
  • Utilize the property of the normal distribution that the larger the sample size, the smaller the variance of the sample mean becomes.

The property of the normal distribution

5.2 When the Population Variance is Known:

On the left side of the diagram:

  • Extract 9 samples from a population with a known population variance \(\sigma^2 = 8^2\) and unknown population mean.
  • Estimate the 95% confidence interval for the population mean \(\mu\).
  • The sample mean \(\bar{X} = 168.5\).
height <- c(165, 170, 173, 163, 166, 176, 163, 160, 172, 177)

height |> 
  mean()
[1] 168.5

On the right side of the diagram:

  • The variance of the sample mean \(\bar{X}\): \(\frac{\sigma^2}{n}\)
  • The standard deviation of the sample mean \(\bar{X}\): \(\frac{\sigma}{\sqrt{n}}\)
  • To estimate the 95% confidence interval for the population mean \(\mu\):
    → In other words, to cover 95% of the area under the normal distribution:
    → It’s sufficient to move 1.96 times the standard deviation of the sample mean \(\frac{\sigma}{\sqrt{n}}\) away from the population mean \(\mu\).
  • The formula that holds true 95% of the time for the sample mean \(\bar{X}\) randomly extracted from the population is as follows:

\[μ−1.96\frac{\sigma}{\sqrt{n}}≦ \bar{X} ≦ μ+ 1.96\frac{\sigma}{\sqrt{n}}\]

  • Solving for \(\mu\)
    \[\bar{X}−1.96\frac{\sigma}{\sqrt{n}}≦ \mu ≦ \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\]

  • Substitute \(\bar{X} = 168.5\), \(n = 9\), \(\sigma = 8\)

\[168.5−1.96\frac{8}{\sqrt{9}}≦ \mu ≦ 168.5+ 1.96\frac{8}{\sqrt{9}}\\ 168.5-5.23≦ \mu ≦ 168.5 + 5.23\\ 163.3 ≦ \mu ≦ 173.7\]
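
The same interval can be computed in R as a quick check of the arithmetic (only \(\bar{X}\), \(\sigma\), and \(n\) are needed):
xbar  <- 168.5                           # Sample mean
sigma <- 8                               # Known population standard deviation
n     <- 9                               # Sample size
c(xbar - 1.96 * sigma / sqrt(n),
  xbar + 1.96 * sigma / sqrt(n))         # 95% confidence interval
[1] 163.2733 173.7267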

Relationship Between 95% Confidence Interval and Sample Size:

  • Compare the 95% confidence intervals of the population mean \(\mu\) when only one sample is randomly extracted from the same population versus when multiple samples (here, 9) are randomly extracted.
  • When only one sample is randomly extracted
    → Approximately from 144 cm to 176 cm, a width of over 30 cm

  • When 9 samples are randomly extracted
    → Approximately from 163 cm to 174 cm, a width of about 10 cm

Key Points to Understand Interval Estimation ・By taking more samples, a more precise (narrower width) interval estimation can be made.

The Reasons:

・Because the estimation is based on the distribution of sample means, not the distribution of the samples themselves.
・The distribution of the sample mean is sharper (i.e., has a smaller variance) than the population distribution.

5.3 When the population variance is unknown (→ Use t-distribution)

  • To estimate the population mean \(\mu\), the following formula is used:

\[\bar{X}−1.96\frac{\sigma}{\sqrt{n}}≦ \mu ≦ \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\]

However, this is only possible when the population variance \(\sigma^2\) is known.
→ It cannot be used when the population variance \(\sigma^2\) is unknown

Why Interval Estimation Cannot Be Performed When Population Variance \(\sigma^2\) is Unknown

1. Standardization of Random Variables (Standard Normal Distribution)

・Properties of Expectation and Variance

2. Properties of the Mean (Linearity of Expectation)

・Multiplying a random variable by \(a\) and adding \(b\) → the expectation is multiplied by \(a\), and the constant \(b\) is added as is.

\[E[aX + b] = aE[X] + b\]

Properties of the Variance

・The constant \(a\) comes out squared, and the added constant \(b\) disappears.

\[V[aX + b] = a^2V[X]\]
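
Both properties can be checked numerically (a small supplementary sketch; the distribution of \(X\) and the constants a = 4 and b = 7 are arbitrary choices):
set.seed(2023)
x <- rnorm(100000, mean = 3, sd = 2)  # A random variable X with E[X] = 3, V[X] = 4
a <- 4
b <- 7
mean(a * x + b); a * mean(x) + b      # E[aX + b] = aE[X] + b (both close to 19)
var(a * x + b);  a^2 * var(x)         # V[aX + b] = a^2 V[X]  (both close to 64)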

Definition of Variance \[V[X] = E[(X-E[X])^2]\\ = E[X^2]-(E[X])^2\]

Consider \(\bar{X}\)

\(\bar{X}\)’s mean is \(\mu\)
\(\bar{X}\)’s variance is \(\frac{\sigma^2}{n}\)

             \(\bar{X}\)                  After standardization
Mean         \(\mu\)                       0
Variance     \(\frac{\sigma^2}{n}\)        1

・The random variable \(Z\) follows a standard normal distribution.
・The mean of the standard normal distribution is 0, and the variance is 1.

  • \(Z\)’s definition:

\[Z = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}}}\]

  • The following inequality holds with a 95% probability.

\[−1.96 ≦ \frac{\bar{X}-\mu}{\frac{\sigma}{\sqrt{n}}} ≦ 1.96\]
・Solving for \(\mu\):

\[\bar{X}−1.96\frac{\sigma}{\sqrt{n}}≦ \mu ≦ \bar{X}+1.96\frac{\sigma}{\sqrt{n}}\]

  • A 95% confidence interval was obtained.
    However, this can only be used when the population variance \(\sigma^2\) is known.

A solution:

\[Z = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}}}\]

・Instead of the unknown population standard deviation \(\sigma\), we use the unbiased standard deviation \(U\) (the square root of the unbiased variance \(U^2\)) for estimation.

\[T = \frac{\bar{X} - \mu}{\frac{U}{\sqrt{n}}}\]

  • The random variable \(Z\) follows a standard normal distribution.
  • Does the random variable \(T\) also follow a standard normal distribution?
  • If the denominator contained only constants, the statistic would follow a standard normal distribution.
  • However, when calculating \(T\), we divide by the random variable \(U\).
    → \(T\) follows not a normal distribution, but a t-distribution.

Theorem: T distribution ・Assume there are independent random variables \((X_1, X_2,..., X_n)\) that follow a normal distribution with mean \(\mu\) and variance \(\sigma^2\).
・In this case, \(T\) follows a \(t\) distribution with \(n-1\) degrees of freedom:

\[T = \frac{\bar{X} - \mu}{\frac{U}{\sqrt{n}}}\]

Using a t-test to find a 95% confidence interval:

  • Investigating the TOEFL ITP scores of 4 seminar students
toefl <- c(500, 450, 623, 400)
  • Estimating the population mean (TOEFL ITP scores of Takudai students) \(μ\) with a 95% confidence interval.
  • Since the sample size is n = 4, it follows a t-distribution with degrees of freedom 4-1 = 3.
  • The number “3.182” where “degrees of freedom = 3” and “0.025” intersect in the table below is the threshold value for a 95% confidence level (equivalent to a 5% significance level).

  • In the graph on the right side of the diagram below, the area to the left of -3.182 is 0.025 (2.5%), and the area to the right of 3.182 is 0.025 (2.5%).
  • Adding both areas together equals 0.05, which is 5%.

  • The following inequality holds with a 95% probability.

\[−3.182 ≦ \frac{\bar{X}-\mu}{\frac{U}{\sqrt{n}}} ≦ 3.182\]

  • Solving for \(\mu\)

\[\bar{X}−3.182\frac{U}{\sqrt{n}}≦ \mu ≦ \bar{X}+3.182\frac{U}{\sqrt{n}}\]

  • Let’s calculate the values to substitute into this equation and calculate the sample mean \(\bar{X}\)

\[\bar{X} = \frac{500 + 450 + 623 + 400}{4} = 493.25\]

  • Let’s calculate the unbiased variance \(U^2\).

\[U^2 = \frac{(x_1-\bar{x})^2 + .. + (x_n-\bar{x})^2}{n-1}\\ = \frac{(500-493.25)^2 + .. + (400-493.25)^2}{4-1} = 9149\]

  • Since the unbiased standard deviation is \(U =\sqrt{U^2}\), \(U = 95.7\).
  • Let’s substitute these values into the following equation.

\[\bar{X}−3.182\frac{U}{\sqrt{n}}≦ \mu ≦ \bar{X}+3.182\frac{U}{\sqrt{n}}\]

\[493.25−3.182\frac{95.7}{\sqrt{4}}≦ \mu ≦ 493.25+3.182\frac{95.7}{\sqrt{4}}\\ 493.25−152.3≦ \mu ≦ 493.25+152.3\\ 341 ≦ \mu ≦ 645\]
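
The threshold 3.182 and the interval itself can also be reproduced in R with qt() and the unbiased standard deviation (a supplementary check of the hand calculation):
toefl  <- c(500, 450, 623, 400)
n      <- length(toefl)                       # n = 4
xbar   <- mean(toefl)                         # 493.25
u      <- sd(toefl)                           # Unbiased standard deviation (about 95.7)
t_crit <- qt(0.975, df = n - 1)               # 3.182
c(xbar - t_crit * u / sqrt(n),
  xbar + t_crit * u / sqrt(n))                # 95% confidence interval
[1] 341.0496 645.4504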

  • To calculate a 95% confidence interval using R, enter and execute the following within the code chunk.
t.test(toefl)

    One Sample t-test

data:  toefl
t = 10.314, df = 3, p-value = 0.001944
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 341.0496 645.4504
sample estimates:
mean of x 
   493.25 
  • This command has the same meaning as t.test(toefl, mu = 0).

  • It performs a t-test on the obtained sample to test whether the population mean is 0.

  • The values below “95 percent confidence interval:” represent the 95% confidence interval (341.0496 645.4504).

  • This matches the values calculated manually above.

FYI (From Estimation to Hypothesis testing)

Is the population mean 342?

・If you want to test whether the population mean is 342 using the obtained four samples, you can enter and execute the following in RStudio.

t.test(toefl, mu = 342)

    One Sample t-test

data:  toefl
t = 3.1626, df = 3, p-value = 0.05077
alternative hypothesis: true mean is not equal to 342
95 percent confidence interval:
 341.0496 645.4504
sample estimates:
mean of x 
   493.25 
  • Since we obtained a 95% confidence interval (from 341 to 645), the population mean of 342 falls within the 95% confidence interval.
  • The p-value is 0.05077, which is greater than 0.05. → The null hypothesis \(H_0: \mu = 342\) cannot be rejected.
    → The population mean could be 342.

Is the population mean 340?

  • Now, if you want to test whether the population mean is 340 using the obtained four samples, you can enter and execute the following in RStudio
t.test(toefl, mu = 340)

    One Sample t-test

data:  toefl
t = 3.2044, df = 3, p-value = 0.04917
alternative hypothesis: true mean is not equal to 340
95 percent confidence interval:
 341.0496 645.4504
sample estimates:
mean of x 
   493.25 
  • Since we obtained a 95% confidence interval (from 341 to 645), the population mean of 340 falls outside the 95% confidence interval.
  • The p-value is 0.04917, which is smaller than 0.05. → The null hypothesis \(H_0: \mu = 340\) is rejected. → It is unlikely that the population mean is 340. → The obtained sample mean \(\bar{X}=493.25\) suggests that the population mean is greater than 340.
    ・For more details on statistical testing, refer to “II. Hypothesis Testing.”

5.4 The Central Limit Theorem

The reason why we can estimate the population parameters.

  • Even when the population is not normally distributed, we can estimate the population parameters because the sample mean approximately follows a normal distribution as the sample size increases.
  • Thanks to the Central Limit Theorem, we can collect samples and estimate population parameters without worrying about the distribution of the population.

Definition of the Central Limit Theorem

  • The Central Limit Theorem (CLT) is a theorem that states that even if the population is not normally distributed, as the sample size increases, the distribution of the sample mean can be approximated by a normal distribution.
  • It is a theorem that states, “when the sample size is increased to infinity, the standardized sample mean converges to a standard normal distribution.”
  • It is thanks to this theorem that we can use the standard normal distribution when conducting statistical estimation and testing.
  • Thanks to this theorem, it is possible to calculate the required sample size for conducting statistical estimation in surveys and other situations.

Normal Distribution

  • A probability distribution for continuous variables representing data distributions that accumulate around the mean value.
  • By approximating various distributions to the normal distribution, it becomes possible to probabilistically predict their range.
  • Here, let’s use R to run simulations and ‘experience’ the credibility of the Central Limit Theorem.

Simulation of the Central Limit Theorem

Does the sample mean approximately follow a normal distribution as the sample size increases?

Simulation using 10 cards in a deck.
  • Create 10 cards with numbers from 0 to 9 and place them in a deck.
  • Consider this deck as the population (creating an artificial population).
  • Randomly select one card from the population.
  • Each card has an equal probability of being chosen, which is 1/10.
  • The distribution of numbers when a card is randomly chosen follows a discrete uniform distribution.
  • Since the population of 10 cards has a population mean of 4.5, will the sample mean also be 4.5?
  • Attempt to estimate the (usually unknown but known here) population mean from the sample mean.
  • Draw one card at a time, record the number drawn, and return the card to the deck.
  • Shuffle the cards, draw another card, record its number, and return the card to the deck.
  • Record the average score of the two randomly chosen cards.
  • Repeat these steps n times.

  • The number of ways to choose the first card is 10 (since there are 10 cards numbered from 0 to 9).
  • The number of ways to choose the second card is also 10 (since there are 10 cards numbered from 0 to 9).
    → There are 100 possible combinations when drawing 2 out of the 10 cards.
  • The numbers on the cards range from 0 to 9, totaling 10 different integers.
    → The possible total values range from 0 to 18, for a total of 19 possibilities (for example, if both chosen numbers are 9, the total is 18).
    → The possible average values range from 0 to 9, for a total of 19 possibilities (for example, if both chosen numbers are 9, the average is 9).
    The probability of obtaining a certain average value can be represented as follows.
  • For example, when drawing two cards like the 0 and 8 assumed here, we find that the probability of obtaining the sample mean (4) is 0.09 (or 9%).
  • The probability of obtaining the population mean of 4.5 is 0.1 (or 10%).
  • Let’s actually draw once using R to see this in practice.
bag <- 0:9                   # Create cards artificially from 0 to 9.
exp_1 <- sample(bag,          # Use the bag containing cards.
              size = 2,       # Specify drawing 2 cards (sample size N = 2).
  replace = TRUE)             # Specify that the drawn cards are returned to the bag.
mean(exp_1)                   # Calculate and display the mean.
[1] 6
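For reference, all 100 equally likely ordered pairs can be enumerated to confirm the probabilities mentioned above (a supplementary check):
pairs <- expand.grid(first = 0:9, second = 0:9)  # All 100 ordered combinations
avg   <- (pairs$first + pairs$second) / 2        # Mean of each pair
mean(avg == 4)     # Probability that the sample mean is 4   -> 0.09
mean(avg == 4.5)   # Probability that the sample mean is 4.5 -> 0.10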
  • Each time you conduct the experiment, you will obtain a different mean value (please try it several times).
  • Let’s repeat this experiment 1,000 times.
  • Examine the distribution of the mean values obtained in each trial.
  • A simple way to perform the same task in R is to use a for loop.

How to use for loop

  • For example, to repeat the task of adding 10 five times starting from 0, you can do the following.
  • First, create and save the initial number, which is 0.
A <- 0 # Specify and save the initial number as A, which is 0.
  • Next, to save the results, prepare a container named result.
result <- rep(NA,              # NA represents empty contents.
              length.out = 5) # Specify how many storage locations are needed.
  • Let’s check the contents of result
result
[1] NA NA NA NA NA
  • Here, we have only created a container, so it is empty (= NA).
  • Let’s try repeating the task of adding 10 to the number 5 times using a for loop.
  • Specify the number of repetitions inside the parentheses after for and enclose the loop’s content in curly braces {}.
for(i in 1:5){   # 'i' repeats from 1 to 5.  
  A <- A + 10    # Add 10 to 'A' for each iteration.
  result[i] <- A # Store the result of the 'i-th' operation (i.e., addition) in 'result[i].' 
}
result           # Check the contents of the container.
[1] 10 20 30 40 50

→ You can see that the container (result) contains the expected values (numbers obtained by starting from 0 and adding 10 each time), as anticipated.

Simulation of the Central Limit Theorem using a for loop

sim1: Sample Size = 2, Number of Iterations = 10
  • Utilize this for loop to repeat the process of drawing two cards 10 times, as you did in the class practice, and visualize the results.
bag <- 0:9                          # Create cards artificially
N <- 2                                # Sample size (choose 2 cards)
trials <- 10                          # Number of experiment repetitions = 10
sim1 <- rep(NA, length.out = trials)  # Container to store results
for (i in 1:trials) {
  experiment <- sample(bag, 
                       size = N, 
                       replace = TRUE) # Sampling with replacement
  sim1[i] <- mean(experiment)          # Save the mean for the i-th trial
}

df_sim1 <- tibble(avg = sim1)
h_sim1 <- ggplot(df_sim1, aes(x = avg)) +
  geom_histogram(binwidth = 1, 
                 boundary = 0.5,
                 color = "black") +
  labs(x = "Mean of 2 cards", 
       y = "Frequency") +
  ggtitle("SIM1: N = 2, Number of Trials = 10") +
  scale_x_continuous(breaks = 0:9) +
  geom_vline(xintercept = mean(df_sim1$avg),  # Draw a vertical line at the mean
             col = "lightgreen") +
  theme_bw(base_size = 14, base_family = "HiraKakuPro-W3") # Command to prevent character encoding issues (for Mac users only)

plot(h_sim1)      # Visualize and display the results

  • It doesn’t appear to be a normal distribution.
df_sim1
# A tibble: 10 × 1
     avg
   <dbl>
 1   2.5
 2   3.5
 3   1.5
 4   5.5
 5   5.5
 6   6.5
 7   6.5
 8   2  
 9   6  
10   3.5
  • If you want to find out the mean values obtained in this experiment, you can execute the following command:
mean(df_sim1$avg)
[1] 4.3
sim2: Sample Size = 5, Number of Trials = 100
  • Try repeating the task of drawing 5 cards 100 times.
bag <- 0:9                          # Create cards artificially
N <- 5                                # Sample size (choose 5 cards)
trials <- 100                          # Number of experiment repetitions = 100
sim2 <- rep(NA, length.out = trials)   # Container to store results
for (i in 1:trials) {
  experiment <- sample(bag, 
                       size = N, 
                       replace = TRUE)  # Sampling with replacement
  sim2[i] <- mean(experiment)         # Save the mean for the i-th trial
}

df_sim2 <- tibble(avg = sim2)
h_sim2 <- ggplot(df_sim2, aes(x = avg)) +
  geom_histogram(binwidth = 1, 
                 boundary = 0.5,
                 color = "black") +
  labs(x = "Mean of 5 cards", 
       y = "Frequency") +
  ggtitle("SIM2: N = 5, Number of Trials = 100") +
  scale_x_continuous(breaks = 0:9) +
  geom_vline(xintercept = mean(df_sim2$avg),  # Draw a vertical line at the mean
             col = "lightgreen") +
  theme_bw(base_size = 14, base_family = "HiraKakuPro-W3")

plot(h_sim2)

  • It doesn’t appear to be a normal distribution.
df_sim2
# A tibble: 100 × 1
     avg
   <dbl>
 1   4.2
 2   2.8
 3   2.8
 4   3.2
 5   6  
 6   3.8
 7   4.8
 8   3  
 9   2.4
10   3.2
# ℹ 90 more rows
  • If you want to find out the mean values obtained in this experiment, you can execute the following command:
mean(df_sim2$avg)
[1] 4.346
sim3: Sample Size = 100, Number of Trials = 1000
  • Try repeating the task of drawing 100 cards 1000 times.
bag <- 0:9                          # Create cards artificially
N <- 100                            # Sample size (choose 100 cards)
trials <- 1000                     # Number of experiment repetitions
sim3 <- rep(NA, length.out = trials) # Container to store results
for (i in 1:trials) {
  experiment <- sample(bag, size = N, replace = TRUE)  # Sampling with replacement
  sim3[i] <- mean(experiment)        # Save the mean for the i-th trial
}

df_sim3 <- tibble(avg = sim3)
h_sim3 <- ggplot(df_sim3, aes(x = avg)) +
  geom_histogram(binwidth = 0.125, 
                 color = "black") +
  labs(x = "Mean of 100 cards", y = "Frequency")+
  ggtitle("SIM3: N = 100, Number of Trials = 1000") +
  scale_x_continuous(breaks = 0:9) +
  geom_vline(xintercept = mean(df_sim3$avg),  # Draw a vertical line at the mean
             col = "lightgreen") +
  theme_bw(base_size = 14, base_family = "HiraKakuPro-W3")

plot(h_sim3)

df_sim3
# A tibble: 1,000 × 1
     avg
   <dbl>
 1  4.38
 2  4.76
 3  4.74
 4  4.53
 5  4.89
 6  4.19
 7  4.23
 8  4.6 
 9  4.64
10  3.72
# ℹ 990 more rows
  • If you want to find out the mean values obtained in this experiment, you can execute the following command:
mean(df_sim3$avg)
[1] 4.49938
  • Compared to the simulation with N = 2, it’s evident that the distribution is much closer to a normal distribution.
library(patchwork)
h_sim1 + h_sim2 + h_sim3

Summary ・Even if the underlying population distribution is uniform, as the sample size N increases, the “distribution of means” approaches a normal distribution.
→ When the sample size N is sufficiently large (a rough rule of thumb is N > 100), it is permissible to use the normal distribution for statistical estimation and testing.

5.5 Estimation of Population Proportion (Bernoulli Distribution)

  • Here, as an example, we will attempt to estimate the population proportion using political approval ratings.
  • Suppose we conduct a survey asking, “Do you support the current government?”
  • Sample size N = 5.

On the right side of the figure:

  • Let’s assume the obtained sample is Support, Support, Support, Support, Not Support.
  • For convenience, let’s represent ‘Support’ as 1 and ‘Not Support’ as 0.
    → The random variable \(X_i\) takes values of 1 or 0.
  • We want to calculate the 95% confidence interval for the population proportion.
  • With a distribution of 0 or 1, the variables \(X_i\) (\(i = 1, 2, \dots, n\)) follow a Bernoulli distribution.

What is the Bernoulli Distribution? ・A probability distribution that represents the results of experiments or trials with only two possible outcomes, such as win or lose, heads or tails, or pass or fail, using 0 and 1.
・When the probability of getting 1 is \(p\), the probability of getting 0 is \(1-p\), making it a simple probability distribution.

\(k\): The outcome value (1 for success, 0 for failure).
\(p\): Probability of success.

Expected Value (Mean) of the Bernoulli Distribution

\[E(X) =\sum kP(X = k) = p\]

Variance of the Bernoulli Distribution

\[V(X) = E(X^2)-(E(X))^2\\ = p(1-p)\]

  • The variance of the sample mean (\(\bar{X}\)) = \(\frac{p(1-p)}{n}\).
  • Standard deviation of the sample mean (\(\bar{X}\)) = \(\sqrt{\frac{p(1-p)}{n}}\).
  • To estimate the population proportion \(p\) with a 95% confidence level:
    → In other words, to cover 95% of the area under the normal distribution,
    → it is sufficient to move 1.96 times the standard deviation of the sample mean, \(\sqrt{\frac{p(1-p)}{n}}\), to either side of \(p\).
  • The equation for the 95% confidence interval of the sample mean (\(\bar{X}\)) randomly extracted from the population is as follows:

\[p−1.96\sqrt{\frac{p(1-p)}{n}}≦ \bar{X} ≦ p+ 1.96\sqrt{\frac{p(1-p)}{n}}\]

  • Since \(\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = R\) (the sample proportion), and \(\bar{X}\) is approximately normal by the Central Limit Theorem, we get the following:

\[R−1.96\sqrt{\frac{p(1-p)}{n}}≦ p≦ R+ 1.96\sqrt{\frac{p(1-p)}{n}}\]

  • The bounds of this inequality still contain the unknown population proportion \(p\).
  • The population proportion \(p\) cannot be directly determined from the sample.

→ Since the interval for \(p\) cannot be narrowed down, the 95% confidence interval cannot be calculated.
→ It cannot be estimated.

When n is sufficiently large

・When \(n\) is sufficiently large, \(R\) is very close to \(p\) (Law of Large Numbers).
・So \(p\) inside the square root can be approximated by \(R\) and the inequality can be rearranged.

Law of Large Numbers ・The sample mean of independent random variables from the same distribution converges to the population mean.

\[R−1.96\sqrt{\frac{p(1-p)}{n}}≦ p≦ R+ 1.96\sqrt{\frac{p(1-p)}{n}}\]

  • Replace \(\sqrt{\frac{p(1-p)}{n}}\) with \(\sqrt{\frac{R(1-R)}{n}}\).
  • The equation can be replaced as follows: (The \(p\) in the middle, which we want to estimate, is not replaced)

\[R−1.96\sqrt{\frac{R(1-R)}{n}}≦ p≦ R+ 1.96\sqrt{\frac{R(1-R)}{n}}\] → 95% CI

Quiz:

  • Waseda University randomly surveyed 100 students, asking them if they support the current government.
  • Out of these 100 students, 75 answered “yes” in support of the government.
  • Now, let’s estimate the government’s support rate among Waseda University students with a 95% confidence level.
Answer:
  • Substitute the sample proportion R = 0.75 and n = 100 into the following inequality:

\[R−1.96\sqrt{\frac{R(1-R)}{n}}≦ p≦ R+ 1.96\sqrt{\frac{R(1-R)}{n}}\]

\[0.75-1.96\sqrt{\frac{0.75*0.25}{100}}≦ p≦ 0.75+1.96\sqrt{\frac{0.75*0.25}{100}}\]
\[0.665≦ p≦ 0.835\]
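
The same interval can be computed in R as a quick check of the arithmetic (for comparison, prop.test(75, 100) reports a similar interval based on a slightly different method):
R  <- 0.75                          # Sample proportion
n  <- 100
se <- sqrt(R * (1 - R) / n)         # Standard error of the sample proportion
c(R - 1.96 * se, R + 1.96 * se)     # 95% confidence interval
[1] 0.6651295 0.8348705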

5.6 Chi-Square Distribution

  • Assume that both the population mean and variance are unknown.
  • The focus is on estimating the ‘population variance’.
  • Examples of situations in real life where knowing the population variance is important.
  • The variation in the amount of coffee served at Starbucks.
  • The variation in the diameter of screws produced in a factory.
  • Assume that the population follows a normal distribution.
  • For estimation, it is necessary to have a distribution of variance that can be derived from a sample.
  • What distribution does the estimated variance follow?
    → Use the Chi-Square distribution \(\chi^2\).

The Chi-Square (\(\chi^2\)) theorem ・If we have independent random variables \(X_1, X_2, ..., X_n\) following a normal distribution with variance \(\sigma^2\), then

\[T = \frac{(n-1)U^2}{\sigma^2}\]
follows a chi-square (\(\chi^2\)) distribution with \(n-1\) degrees of freedom.

  • \(U^2\) represents the unbiased variance.
  • \(\sigma^2\) is the population variance.
  • We have a sample of n observations.
  • The probability distribution of T follows a \(\chi^2\) distribution with n-1 degrees of freedom.
  • On the right side of the equation, only \(U^2\) is a random variable.
  • In summary, \(T\), which is computed from the unbiased variance \(U^2\), follows a chi-square distribution with \(n-1\) degrees of freedom.

The reason for considering the random variable T

\[T = \frac{(n-1)U^2}{\sigma^2}\]

  • The equation above can be rewritten as follows:

\[T = (\frac{X_1-\bar{X}}{\sigma})^2 + (\frac{X_2-\bar{X}}{\sigma})^2 +...+ (\frac{X_n-\bar{X}}{\sigma})^2\]

  • \(\sigma^2\)・・・population variance
  • \(\sigma\)・・・standard deviation
  • \(X_1-\bar{X}\)・・・How far \(X_1\) is from the sample mean
  • \((\frac{X_1-\bar{X}}{\sigma})^2\)・・・The degree of deviation from the mean

What \(T\) stands for ・The sum of how much each sample, \(X_1, X_2,..., X_n\), deviates from the sample mean, \(\bar{X}\).

\[T = (\frac{X_1-\bar{X}}{\sigma})^2 + (\frac{X_2-\bar{X}}{\sigma})^2 +...+ (\frac{X_n-\bar{X}}{\sigma})^2\]
\[= \frac{n-1}{\sigma^2}\frac{(X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots + (X_n-\bar{X})^2}{n-1}\]

・The second factor on the right-hand side, \(\frac{(X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots + (X_n-\bar{X})^2}{n-1}\), is the unbiased variance, denoted as \(U^2\)

\[= \frac{(n-1)U^2}{\sigma^2}\]

Let’s consider what it means for the shape of a chi-squared distribution:

  • For example, in the case of 3 degrees of freedom, the probabilities are concentrated around small values (around 1).
  • As the degrees of freedom increase from 1 to 5, 10, 20, etc., the probabilities gradually concentrate around larger values of t.

When the degrees of freedom (sample size) are small,

→ The sum of squared deviations from the sample mean has few terms and tends to be small.
 → The value of \(T\) is small.

When the degrees of freedom (sample size) are large,

→ The sum of squared deviations from the sample mean has many terms and tends to be large.
 → The value of \(T\) is large.

Exercise:

  • From a population with an unknown population mean and an unknown population variance, 10 random samples are drawn.
  • We want to estimate the population variance with a 95% confidence interval.
  • Sample mean (\(\bar{X}\)) = 5.8.
  • Sample unbiased variance (\(U^2\)) = 9.2

Answer:

  • With a sample size of n = 10, we consider a chi-square distribution with 9 degrees of freedom (df) since df = n - 1 = 9.
  • The chi-square distribution with 9 degrees of freedom has a shape similar to the one on the left side of the following diagram.

  • Using the chi-square distribution table on the right side of the diagram, we can find the chi-square values (\(\chi^2\) values) that correspond to a 95% confidence interval.
  • To create a 95% confidence interval, we evenly distribute the 5% rejection region between the left and right sides.
  • So, for a chi-square distribution with 9 degrees of freedom (df),
    ・The lower 2.5% point (left tail) with df = 9 is approximately 2.7.
    ・The upper 2.5% point (right tail) with df = 9 is approximately 19.0.
  • The variable T represents the sum of how much each sample X_1, X_2,..., X_n deviates from the sample mean \(\bar{X}\).
  • This sum is represented by the chi-square value (\(\chi^2\)), which we use to calculate the confidence interval.

\[T = \frac{(n-1)U^2}{\sigma^2}\]

  • This means that the following inequality holds with 95% probability.

\[2.7 ≦ \frac{(n-1)U^2}{\sigma^2}≦ 19.0\]

  • What we want to determine here is the evaluation of the population variance \(\sigma^2\).
    → Solving for \(\sigma^2\)

\[\frac{(n-1)U^2}{19.0}≦ \sigma^2 ≦ \frac{(n-1)U^2}{2.7}\]

  • Substitute n = 10 and \(U^2 = 9.2\):

\[\frac{82.8}{19.0}≦ \sigma^2 ≦ \frac{82.8}{2.7}\] \[4.4 ≦ \sigma^2 ≦ 30.7\]
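
The chi-square thresholds and the interval can also be obtained in R with qchisq() (a supplementary check of the calculation above):
n  <- 10
u2 <- 9.2                                  # Unbiased variance
lower_chi <- qchisq(0.025, df = n - 1)     # About 2.70
upper_chi <- qchisq(0.975, df = n - 1)     # About 19.02
c((n - 1) * u2 / upper_chi,
  (n - 1) * u2 / lower_chi)                # 95% CI for sigma^2: about 4.4 to 30.7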

II. Hypothesis Testing

  • Statistical hypothesis testing involves verifying hypotheses about a population based on information obtained from a sample.
  • The mathematical methods used in statistical hypothesis testing are similar to those used in estimation.
  • The methods of argumentation are unique and specific to probability theory.

6.1 Two-Tailed t.test

  • Four samples (test scores) are randomly drawn from the population.
  • The population is assumed to follow a normal distribution.
  • Both the population mean and population variance are unknown.
  • The average of the four samples is 493.3 points.
  • The unbiased standard deviation of the four samples is 95.7 points.
  • Based on this information, can we conclude that the population mean is 300 points?
  • We want to test this at a 5% significance level.

Hypothesis Testing Procedure

1. Determine the null hypothesis

  • \(H_0\): The population mean is 300 points:
    → \(\mu = 300\)
    => This is the hypothesis you want to reject

2. Formulate the alternative hypothesis

  • \(H_1\): The population mean is not 300 points
    \(\mu < 300\) or \(\mu > 300\)
    => This is the hypothesis you want to assert
    Two-tailed test

3. Examine the distribution of the target statistical measure

  • Under \(H_0\), examine the distribution of the target statistical measure (here, the sample mean \(\bar{X}\)).

4. Decide on the significance level \(\alpha\) and establish the rejection region

  • The significance level is usually 0.05 = 5%
  • Look for the threshold value for each degree of freedom (df) and significance level \(\alpha\)
  • Set up the rejection region on the t-distribution table that is advantageous for the alternative hypothesis \(H_1\)
  • For the claim that “the population mean is not 300 points,” having two rejection regions is advantageous
    → Divide the 5% significance level into two, assigning 2.5% to the upper and 2.5% to the lower end
    → This is Two-tailed test
  • The threshold value for a degree of freedom = 3 and significance level of 5% is 3.182

Hypothesis Testing Procedure: Two-Tailed (Summary)

Theorem on T distribution ・Assume there are independent random variables \(X_1, X_2,..., X_n\) that follow a normal distribution with mean \(\mu\) and variance \(\sigma^2\)
- In this case,

\[T = \frac{\bar{X} - \mu}{\frac{U}{\sqrt{n}}}\]

follows a t distribution with n-1 degrees of freedom.

Calculate the t-value obtained from the sample (by hand calculation).

  • Substitute the following and calculate the value of T
  • \(\bar{X}=493.3\)
  • \(\mu = 300\)
  • \(U = 95.7\)
  • \(n = 4\)

\[T = \frac{\bar{X} - \mu}{\frac{U}{\sqrt{n}}} = \frac{493.3-300}{\frac{95.7}{\sqrt{4}}}= \frac{193.3}{47.85}=4.04\]

5. Check if the statistic obtained from the sample is in the rejection region

  • If it is in the rejection region
    → Reject the null hypothesis and accept the alternative hypothesis.
  • If it is not in the rejection region
    → Cannot reject the null hypothesis
    → Cannot conclude anything.
  • The t-value obtained from the sample (4.04) is in the rejection region (3.182 or above)
    → Reject the null hypothesis \(H_0\): The population mean is 300 points.
    → Accept the alternative hypothesis “The population mean is more than 300 points” (because the average score is 493 points).
  • If the t-value obtained from the sample is outside the rejection region.
    → Cannot reject the null hypothesis.

Verify using R (Two-tailed test)

test <- c(500, 450, 623, 400)
t.test(test, 
  mu = 300) # The default in R is a Two-tailed test

    One Sample t-test

data:  test
t = 4.0408, df = 3, p-value = 0.02727
alternative hypothesis: true mean is not equal to 300
95 percent confidence interval:
 341.0496 645.4504
sample estimates:
mean of x 
   493.25 

→ The p-value obtained here is 0.02727.
→ Null Hypothesis \(H_0\): “The population mean is 300 points” is rejected.
The population mean is more than 300 points (since the sample mean is 493 points)

6.2 One-Tailed t.test

  • Four samples (test scores) are randomly drawn from the population.
  • The population is assumed to follow a normal distribution.
  • Both the population mean and population variance are unknown.
  • The average of the four samples is 493.3 points.
  • The unbiased standard deviation of the four samples is 95.7 points.
  • Based on this information, can we conclude that the population mean is less than 300 points?

Hypothesis Testing Procedure

1. Determine the null hypothesis

  • \(H_0\): The population mean is less than 300 points:
    → \(\mu < 300\)
    => This is the hypothesis you want to reject

2. Formulate the alternative hypothesis

  • \(H_1\): The population mean is greater than 300 points
    \(\mu > 300\)
    => This is the hypothesis you want to assert
    one-tailed t.test

3. Examine the distribution of the target statistical measure

  • Under \(H_0\), examine the distribution of the target statistical measure (here, the sample mean \(\bar{X}\)).

4. Decide on the significance level \(\alpha\) and establish the rejection region

  • The significance level is usually 0.05 = 5%
  • Look for the threshold value for each degree of freedom (df) and significance level \(\alpha\)
  • Set up the rejection region on the t-distribution table that is advantageous for the alternative hypothesis \(H_1\)
  • For the claim that “the population mean is greater than 300 points,” a single rejection region in the upper tail is advantageous
    → Assign the entire 5% significance level to the upper tail.
    → This is a one-tailed test
  • The threshold value for a degree of freedom = 3 and significance level of 5% is 2.353.

Hypothesis Testing Procedure: One-Tailed (Summary)

Theorem on T distribution ・Assume there are independent random variables \(X_1, X_2,..., X_n\) that follow a normal distribution with mean \(\mu\) and variance \(\sigma^2\)
- In this case,

\[T = \frac{\bar{X} - \mu}{\frac{U}{\sqrt{n}}}\]

follows a t distribution with n-1 degrees of freedom.

Calculate the t-value obtained from the sample (by hand calculation).

  • Substitute the following and calculate the value of T
  • \(\bar{X}=493.3\)
  • \(\mu = 300\)
  • \(U = 95.7\)
  • \(n = 4\)

\[T = \frac{\bar{X} - \mu}{\frac{U}{\sqrt{n}}} = \frac{493.3-300}{\frac{95.7}{\sqrt{4}}}= \frac{193.3}{47.85}=4.04\]

5. Check if the statistic obtained from the sample is in the rejection region

  • If it is in the rejection region
    → Reject the null hypothesis and accept the alternative hypothesis.
  • If it is not in the rejection region
    → Cannot reject the null hypothesis
    → Cannot conclude anything.
  • The t-value obtained from the sample (4.04) is in the rejection region (2.353 or above)
    → Reject the null hypothesis \(H_0\): The population mean is less than 300 points.
    → Accept the alternative hypothesis “The population mean is greater than 300 points”.
  • If the t-value obtained from the sample is outside the rejection region.
    → Cannot reject the null hypothesis.

Verify using R (one-tailed t.test)

test <- c(500, 450, 623, 400)
t.test(test, 
  mu=300, 
  alternative="greater") # The alternative hypothesis in this one-tailed test is that the population mean is greater than 300

    One Sample t-test

data:  test
t = 4.0408, df = 3, p-value = 0.01364
alternative hypothesis: true mean is greater than 300
95 percent confidence interval:
 380.7004      Inf
sample estimates:
mean of x 
   493.25 

→ The p-value obtained here is 0.01364.
→ Null Hypothesis \(H_0\): “The population mean is less than 300 points” is rejected.
The population mean is more than 300 points

References
  • 宋財泫 (Jaehyun Song)・矢内勇生 (Yuki Yanai)「私たちのR: ベストプラクティスの探究」
  • 土井翔平(北海道大学公共政策大学院)「Rで計量政治学入門」
  • 矢内勇生(高知工科大学)統計学2
  • 浅野正彦, 矢内勇生.『Rによる計量政治学』オーム社、2018年