1. Statistics

  • Statistics can be classified into two types: Descriptive Statistics and Inferential Statistics.
  • Descriptive Statistics: uses statistical measures to grasp the trends and properties of the data at hand.
  • Inferential Statistics: tests and estimates population parameters (such as the population mean and variance).

Inferential Statistics

  • The skill of predicting and estimating unobserved events using available descriptive statistics is known as “Inferential Statistics”.
  • It involves randomly extracting a sample from the population, and using statistics such as the sample mean and unbiased variance obtained from this sample to test and estimate population parameters (like population mean and variance).
  • The basic idea of inferential statistics is that by increasing the number of randomly drawn samples from the population and repeating trials infinitely, we can infer about the vast and unknown population from a part of it, the sample.
  • In inferential statistics, it is assumed that the subject of analysis follows a probability distribution.
  • Here, we will explain and practice Descriptive Statistics, which is the foundational knowledge commonly used in both types of statistics.

2. Population and Sample

Population:

  • A population is the pool of individuals from which a statistical sample is drawn for a study
Source: 浅野正彦, 矢内勇生.『Rによる計量政治学』オーム社、2018年、p.117

Sample:

  • A sample is a smaller, manageable version of a larger group.
  • It is a subset containing the characteristics of a larger population.
  • Samples are used in statistical testing when population sizes are too large for the test to include all possible members or observations.
  • A sample should represent the population as a whole and not reflect any bias toward a specific attribute.

Population parameter and Sample statistic

  • In inferential statistics, it is important to differentiate between a parameter and a statistic.

How to estimate
・Calculate a sample statistic (such as the sample mean: \(\bar{X}\))
→ Infer the parameter (such as the population mean: \(\mu\))

Assumptions in statistical estimation
・The population distribution is normally distributed.
・However, this is not always the case (many populations are not normally distributed).
・The Central Limit Theorem enables us to conduct statistical estimation nonetheless.
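The role of the Central Limit Theorem can be illustrated with a short simulation. This is a minimal sketch, not part of the original text: the exponential population (rate = 1, so the population mean is 1) is an arbitrary choice of a clearly non-normal distribution.

```r
# Central Limit Theorem sketch: even when the population is far from normal
# (here, Exponential(1)), the distribution of sample means is approximately
# normal and centered on the population mean.
set.seed(123)
sample_means <- replicate(5000, mean(rexp(50, rate = 1)))
hist(sample_means)   # roughly bell-shaped around 1
mean(sample_means)   # close to the population mean of 1
```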

2.1 Number of Samples and Sample Size

  • Careful attention is needed as sample size and number of samples are often confused.
  • Sample Size: Refers to the number of observations extracted, denoted by n.
  • Number of Samples : The number of sample sets extracted with n observations.

Example:

  • Mr. Yanai went to Koriyama City, with a population of 328,000, and took two samples regarding the Great East Japan Earthquake.
  • In sample A, a public opinion survey was conducted with 1,000 people, and in sample B, another survey was conducted with 2,000 people.
    → The number of samples is 2 (sample A and sample B), and the sample sizes are 1,000 and 2,000, respectively.

2.2 Random Sampling

(1) Simple Random Sampling:

  • A typical method for selecting an unbiased sample.
  • A sample obtained through simple random sampling can be considered an unbiased representation of the population.
    → This sample can be used to estimate the population.

(2) Multistage Sampling Method:

  • A method of extracting samples in several stages.
    → Example: Extracting a sample of 1000 people from the eligible voters in Japan.
  • 1st stage: Randomly selecting municipalities.
  • 2nd stage: Randomly selecting samples from the chosen municipalities.
  • Multistage sampling is used when the population distribution is widespread.
    → An example of multistage sampling… Stratified two-stage sampling.
  • Stratify each municipality into large cities, small and medium cities, and counties.
  • Assign weights according to the size of each stratum (for example, 3:2:1) and randomly select municipalities.
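The stratified two-stage idea above can be sketched in R. This is a hypothetical illustration: the strata, the 3:2:1 weights, and the municipality counts are invented for the example, not taken from any real survey design.

```r
# Stage 1 of a stratified two-stage sample: choose municipalities at random,
# weighting the three strata 3:2:1.
set.seed(1)
municipalities <- data.frame(
  name    = paste0("muni", 1:60),
  stratum = rep(c("large city", "small/medium city", "county"), each = 20)
)
w <- c("large city" = 3, "small/medium city" = 2, "county" = 1)
stage1 <- municipalities[sample(nrow(municipalities), 12,
                                prob = w[municipalities$stratum]), ]
table(stage1$stratum)  # large cities tend to be over-represented, as intended
# Stage 2 would then randomly sample voters within each chosen municipality.
```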

Relationship between the Population Mean \((μ)\) and the Sample Mean \(\bar{x}\)

  • Although parameters related to the population are often unknown, here we will attempt to artificially create a population using R.
  • The population to be created has a population mean of 10 and a standard deviation of 5 (population variance = 25).
  • Using the dnorm() function, you can artificially create a population and display its distribution density.
# Specify a population mean of 10, standard deviation of 5  
# Specify the range from -10 to 30
curve(dnorm(x, 10, 5), from = -10, to = 30) 

  • Randomly sample from this population.
  • The sample size to be extracted is 20 (n = 20); name this sample x1.
  • When randomly sampling from this artificially created population, use the rnorm() function.
# Randomly extract 20 numbers from a population with a mean of 10 and a standard deviation of 5 (sd = 5), name it x1, and display the data and a histogram.

x1 <- rnorm(20, mean = 10, sd = 5)
x1
 [1] 11.978594  6.917834  7.668746  7.307478 10.847311  6.187763 11.212720
 [8] 10.524525 21.320335  7.414352 12.755208  7.578064  9.529863  6.792361
[15]  8.679532  2.512480 11.451368  6.328466 13.408677  8.167278
  • You can specify the number of digits you want to display using the round() function.
  • For example, if you want to display integers, you should specify the number of decimal places (digits) as 0.
round(x1, digits = 0) # Specify the Number of Decimal Places to Display as 0. 
 [1] 12  7  8  7 11  6 11 11 21  7 13  8 10  7  9  3 11  6 13  8
hist(x1) # Display the extracted sample in a histogram

The sample mean \(\bar{x}\) can be calculated using the following formula:
\[\bar{x} = \frac{\sum_{i=1}^n x_i}{n}\]

・In R, the sample mean can be calculated using the mean() function.

mean(x1)              
[1] 9.429148

2.3 Estimator and Estimate

\[\bar{x} = \frac{\sum_{i=1}^n x_i}{n}\]

Estimator:

  • Used for estimating a parameter and expressed as a function of the sample.
  • Here, the above formula for \(\bar{x}\).

Estimate:

  • The numerical result calculated by substituting actual sample values.
  • Here, the value of the above \(\bar{x}\).
  • Randomly extract a sample of size 20, name it x2, and calculate the sample mean.
  • The population mean is specified as 10 (mean = 10), and standard deviation as 5 (sd = 5).
x2 <- rnorm(20, mean = 10, sd = 5)
x2
 [1] 11.051708 19.158094  9.271258  4.834369 16.371883  1.000707 17.500840
 [8]  8.262222 16.659910 14.391944  5.199113 12.250473 18.393731  9.871836
[15] 16.134405 12.462566  7.896137  8.655362  7.190878  6.212851
round(x2, digits = 0)      
 [1] 11 19  9  5 16  1 18  8 17 14  5 12 18 10 16 12  8  9  7  6
hist(x2)                 

mean(x2)                  
[1] 11.13851
  • Since R randomly extracts numbers from the population, there is no bias in sampling, but there is an error.
  • Here, even though the population mean is set to 10, the sample means of the first sample x1 and the second sample x2 do not exactly match 10.
  • This is the error.
  • However, if sampling is repeated many times (i.e., x1, x2, ….. xn), sample means that are either larger or smaller than the population mean are obtained, and averaging these values will match the population mean.
    → This provides the basis for statistically estimating the population parameters using the statistics obtained from the sample.
  • Apart from the sample mean, functions such as min(), median(), and max() allow for easy acquisition of descriptive statistics such as the minimum, median, and maximum values.
min(x2)  
[1] 1.000707
median(x2) 
[1] 10.46177
max(x2)   
[1] 19.15809
  • The five-number summary can be displayed at once with the following command
quantile(x2, c(0, .25, .5, .75, 1))
       0%       25%       50%       75%      100% 
 1.000707  7.719822 10.461772 16.193775 19.158094 
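The earlier claim that averaging repeatedly drawn sample means recovers the population mean can be checked by simulation. This is a quick numerical sketch, not a proof; the 1,000 repetitions are an arbitrary choice.

```r
# Draw 1,000 samples of size 20 from the same population (mean 10, sd 5)
# and average the sample means: the result is very close to 10.
set.seed(42)
many_means <- replicate(1000, mean(rnorm(20, mean = 10, sd = 5)))
mean(many_means)  # close to the population mean of 10
```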

2.4 Sample Unbiased Variance and Sample Variance

  • Suppose nine students took the TOEFL (iBT) test.
  • Assign the test scores of these nine students to a variable named x.
x <- c(22, 33, 44, 55, 66, 77, 88, 99, 100)
x
[1]  22  33  44  55  66  77  88  99 100
  • In typical analyses using samples, it is common to use sample unbiased variance \(U^2\).
    R’s variance calculation function var() is set to use sample unbiased variance \(U^2\) by default.
  • There is a need to differentiate between sample unbiased variance and sample variance.
  • Sample unbiased variance \(U^2\)・・・used when we are interested in the population behind the sample.
  • Sample variance \(S^2\)・・・used when we are interested only in the extracted sample itself.
When using sample unbiased variance \(U^2\) (default in R):
  • Used when only a part of the population data is available, and there is a need to understand the variability of the population data behind the obtained data.

  • Sample unbiased variance \(U^2\) can be calculated using var()

  • Suppose \(x\) is a sample extracted from the population.

  • Since the population variance is unknown, we use the statistics obtained from the sample to estimate the unknown population variance.

  • The statistic for estimating the population variance is the sample unbiased variance \(U^2\).

  • Sample unbiased variance \(U^2\) can be calculated using the following formula:

\[U^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}\]

\(\bar{x}\): sample mean.

・Let’s calculate in R.

sum((x - mean(x))^2) / (length(x) - 1) 
[1] 808.6111
  • We get the same result with var(x).
var(x)
[1] 808.6111
When using sample variance \(S^2\):
  • When a sample extracted from the population is available.

  • When the interest lies not in the population but only in the sample.

  • When the only concern is how much the sample is dispersed.

  • R does not have a function to calculate sample variance \(S^2\).

  • \(S^2\) must be calculated by creating the following formula:

\[S^2 = \frac{\sum_{i=1}^N (x_i - \bar{x})^2}{N}\]

\(\bar{x}\): sample mean

  • In R, \(S^2\) can be obtained by rescaling var():
var(x) * (length(x) - 1) / length(x)
[1] 718.7654

The sample variance \(S^2\) (718.7654) is always smaller than the sample unbiased variance \(U^2\) (808.6111).

  • This is because in the formula for calculating sample variance \(S^2\), the denominator is N, whereas for sample unbiased variance \(U^2\), the denominator is N - 1.
  • For more details, refer to 10.2 Inferential Statistics (Estimation and Testing).
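A quick simulation (a sketch added here, not part of the original text) shows why the \(N - 1\) denominator matters: across many samples, var() with its \(n - 1\) denominator averages out to the true population variance, while the \(n\)-denominator version systematically underestimates it.

```r
# Draw 2,000 samples of size 10 from a population with variance 25 (sd = 5).
set.seed(7)
u2 <- replicate(2000, var(rnorm(10, mean = 10, sd = 5)))  # unbiased variance
s2 <- u2 * (10 - 1) / 10                                  # sample variance
mean(u2)  # close to the true population variance of 25
mean(s2)  # systematically below 25
```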

Why we use the sum of squares (\(S^2\) or \(U^2\)) in variance calculation?

・The difference (deviation) of each data point from the mean indicates how far that point is from the mean.
→ Deviations can be both positive and negative, so they cancel out when summed.
→ The raw deviations are therefore unsuitable as a measure of spread.
・Squaring the deviations makes them all non-negative, yielding an appropriate measure of distance from the mean.

3. Sample Distribution

3.1 Do Statistics Distribute?

  • Sampling Distribution:
  • The distribution of a statistic obtained when a sample is drawn of a certain size.
  • In random sampling, there are many ways to select a sample.
  • For example, there are 120 ways to choose 3 people from 10 people \((_{10}C_3)\).
  • In R, this can be calculated using the choose() function as follows:
choose(10, 3)  
[1] 120
  • Let’s calculate the number of ways to select a sample of 10 people from a population of 100 million voters.
  • It turns out to be more than \(2.7 \times 10^{73}\) ways.
choose(100000000, 10) # Combinations of choosing 10 people out of 100 million
[1] 2.755731e+73
  • There are numerous ways to select a sample of a specific size from a specific population.
  • It is possible to calculate the value of a statistic for each sample.
  • However, the values of the statistics do not necessarily all match.
  • If the process of extracting samples of the same size is repeated, statistics will distribute.
  • The distribution of statistics obtained when samples are drawn in all combinations for a certain sample size is called sampling distribution.
  • Sampling distribution… the crux of statistical estimation, linking statistics and parameters with probability.

3.2 Population Mean and Sample Mean

  • Artificially create a population.
  • The number of times each of the three legislators (Mr. A, Mr. B, and Mr. C) has been elected.
                     Mr.A  Mr.B  Mr.C
# of times elected      3     6    12
  • Create a population with election counts of 3, 6, and 12, and calculate the population mean and standard deviation.
win <- c(3, 6, 12)  # Create a population with election counts of 3, 6, and 12
mean(win)           # Population mean of election counts
[1] 7
  • Calculate the variance
var1 <-  ((7-3)^2 +(7-6)^2 + (7-12)^2)/3
var1
[1] 14
  • Since standard deviation is the square root of variance, the standard deviation is
sqrt(var1)
[1] 3.741657
  • The average number of election victories is 7, and the standard deviation of election counts is 3.7.
  • Therefore, the population mean \((μ = 7)\) and population standard deviation \((σ = 3.7)\).
  • Choose all samples of size = 2 from this population.
  • There are 3 ways to choose 2 out of 3 people, A, B, and C: ‘A and B’, ‘A and C’, ‘B and C’.
  • For example, let’s calculate the sample mean and standard deviation of election counts for the sample ‘A and B’ in R.
ab <- c(3, 6)  # Name the sample chosen as A and B as ab
mean(ab)       # Sample mean of sample ab
[1] 4.5
  • Calculate the variance
var2 <-  ((4.5-3)^2 + (4.5-6)^2)/2
var2
[1] 2.25
  • Since standard deviation is the square root of variance, the standard deviation is
sqrt(var2)
[1] 1.5
sample number  legislators chosen  sample mean  sample sd
            1  A & B                       4.5        1.5
            2  A & C                       7.5        4.5
            3  B & C                       9.0        3.0
  • Looking at the sample mean in the table above, the sample means of the three samples (4.5, 7.5, 9) are all different.
  • Moreover, they differ from the previously set population mean \((μ = 7)\).
  • Being able to estimate population parameters from a sample rests on the sample statistic being an unbiased representation of the population.
  • Unbiased means that even if each statistic differs from the parameter, the average of the obtained values matches the parameter.
  • Calculating the average of the three sample means \((μ)\) gives:

\[{μ} = \frac{4.5+7.5+9}{3} = 7\]

  • When samples are drawn in all possible combinations and the average of the averages from each sample is calculated, it matches the population mean.
    → Unbiasedness
    → Unbiased Estimator: A term for an estimator that possesses unbiasedness.
    → It can be said that the sample mean \(\bar{x}\) is an unbiased estimator of the population mean \((μ)\).
    → The population mean \((μ)\) can be estimated using the sample mean \(\bar{x}\).
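The three sample means in the table above can be generated in one step with combn(), which enumerates all size-2 samples from the population; this sketch simply reproduces the hand calculation.

```r
# Enumerate all samples of size 2 from the population of election counts
# and compute each sample mean.
win <- c(3, 6, 12)
sample_means <- combn(win, 2, mean)
sample_means        # 4.5 7.5 9.0, matching the table
mean(sample_means)  # 7: the average of the sample means equals the population mean
```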

3.3 Population Standard Deviation and Sample Standard Deviation

  • The population mean \((μ)\) can be estimated using the sample mean \(\bar{x}\).
  • However, if we calculate the average of the three sample standard deviations, it does not match the population standard deviation (3.7).

\[\frac{1.5+4.5+3}{3} = 3 < {3.7}.\]

  • The average of the sample standard deviations is always smaller than the population standard deviation.
  • The sample standard deviation (statistic) has a systematic bias (bias) that underestimates the population standard deviation (parameter).
    → The sample standard deviation \((s)\) cannot be said to be an unbiased estimator of the population standard deviation \((σ)\).
    → Use the sample unbiased standard deviation \((u_x)\) to estimate the population standard deviation \((σ)\).
    → The population standard deviation \((σ)\) can be estimated by calculating the sample unbiased standard deviation \((u_x)\).
  • Estimate using the sd() function.
sd(win) # Calculate the unbiased standard deviation of win
[1] 4.582576
  • The formula for the population variance \((\sigma^2)\).
    => the population standard deviation \((σ)\) is the square root of \((σ^2)\)

\[\sigma^2 = \frac{\sum_{i=1}^N (x_i - \mu)^2}{N}\]

  • The formula for calculating the sample unbiased variance \((u_x^2)\).
    => the unbiased standard deviation \((u_x)\) is the square root of \((u_x^2)\)

\[u_x^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}\]

  • The code used for calculations in R (here x is assumed to contain the entire population):

                                            R code
Population variance \((\sigma^2)\)          var(x) * (length(x) - 1) / length(x)
Population standard deviation \((\sigma)\)  sqrt(var(x) * (length(x) - 1) / length(x))
Sample unbiased variance \((u_x^2)\)        var(x)
Unbiased standard deviation \((u_x)\)       sd(x)
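The formulas in the table can be checked against the legislator example from the previous section (win = 3, 6, 12), whose population variance and standard deviation were computed by hand as 14 and about 3.74.

```r
# Verify the table's formulas on the election-count population.
win <- c(3, 6, 12)
n <- length(win)
var(win) * (n - 1) / n        # population variance: 14
sqrt(var(win) * (n - 1) / n)  # population standard deviation: about 3.74
var(win)                      # unbiased variance: 21
sd(win)                       # unbiased standard deviation: about 4.58
```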

3.4 Simulation Assuming a Large Population

  • Artificially create a normally distributed population representing the height of Japanese adult men (n = 100,000, population mean = 170cm, population standard deviation = 6), and name it height.
height <- rnorm(100000, 170, 6)
hist(height)

  • It turns out there are more than \(2.7 \times 10^{43}\) ways to choose 10 people from 100,000.
choose(100000, 10)
[1] 2.754492e+43
  • It is impossible to calculate the sample mean for all \(2.7 \times 10^{43}\) combinations.
  • Alternative approach
    → Randomly select 10 people from the 100,000 population, name it sample1, calculate the sample mean, and display it in a histogram.
sample1 <- sample(height, 10, replace = FALSE) 
hist(sample1)                

mean(sample1)                 
[1] 169.903
  • Conduct the task of randomly selecting 10 people from the population 500 times and examine the distribution.
  • Create two types of samples with sizes of 10 and 90 (n = 10, 90), and display them in an animation.

Animation-Based Simulation (n = 10, 90)

  • Artificially create a population representing the height of Japanese adult men (n = 100,000, population mean = 170cm, population standard deviation = 6), and name it height.
height <- rnorm(100000, 170, 6)  
## Parameters: n = sample size
##             T = number of samples
height_experiment <- function(n = 10, T = 500, seed = 2016-01-24) {
    means <- rep(NA, T)
    s_mat <- matrix(NA, ncol = n, nrow = T)
    set.seed(seed)
    height <- rnorm(100000, 170, 6)
    for (i in 1:T) {
        s <- sample(height, n, replace = TRUE)
        s_mat[i,] <- s
        means[i] <- mean(s)
        par(family = "HiraKakuPro-W3", cex.lab = 1.4, cex.axis = 1.2, cex.main = 1.6, mex = 1.4)
        hist(means[1:i], axes = FALSE, freq = FALSE, col = "gray",
             xlim = c(min(height), max(height)),
             xlab = "sample mean", ylab = "probability density",
             main = paste0("sample size = ", n, ", sample id = ", i))
        axis(1, min(height):max(height))
        abline(v = 170, lwd = 2, col = "red")
    }
    axis(2)
    return(s_mat)
}
  • To display the results of the simulation, enter the following command.
source("<path-of-the-directory>/height_sim.R")
sim_1 <- height_experiment(n = 10) # Extract 500 samples of size 10. 
sim_2 <- height_experiment(n = 90) # Extract 500 samples of size 90
  • The distribution of the sample mean \(\bar{x}\) when 500 samples of size 10 are extracted from the population.

  • The distribution of the sample mean \(\bar{x}\) when 500 samples of size 90 are extracted from the population.

  1. In both samples, the center of the histogram matches the population mean \(({μ} = 170)\).
    → Regardless of the sample size, the sample mean is an unbiased estimator of the population mean.
  • The variability of the sample means is smaller than that of the population.
    The reason → It’s less likely for the sample mean to take extreme values.
  • The sample mean is less likely to take values near the extremes of the population distribution, and is distributed over a narrower range than the population.
  • Variability of the distribution… measured by standard deviation.
  • Variability of the sample mean… represented by the following formula:

\[SD (\bar{x})= \frac{\sigma}{\sqrt{n}}\]

  • Since the denominator of the formula includes \(n\), the larger the sample size \(n\), the less likely the sample mean is to take extreme values.
  • With the population standard deviation of height \(({\sigma = 6})\) and 500 extracted samples of size 10,

\[SD (\bar{x})= \frac{\sigma}{\sqrt{n}}\]

\[= \frac{6}{\sqrt{10}}\] \[= 1.9\]

  • With the population standard deviation of height \(({\sigma = 6})\) and increasing the size of the 500 extracted samples to 90, then….

\[SD (\bar{x})= \frac{\sigma}{\sqrt{n}}\]

\[= \frac{6}{\sqrt{90}}\] \[= 0.63\]

  • The standard deviation for a sample size of 90 (0.63) is about one third of that for a sample size of 10 (1.90).
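The two values above can be confirmed directly in R; this is just the arithmetic of the \(SD(\bar{x}) = \sigma/\sqrt{n}\) formula with \(\sigma = 6\).

```r
# Standard deviation of the sample mean for the two sample sizes.
6 / sqrt(10)  # about 1.90 (sample size 10)
6 / sqrt(90)  # about 0.63 (sample size 90)
```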
Consistency:
  • The larger the sample size, the higher the probability that the sample mean calculated from a single sample will be close to the population mean.
    → Consistent estimator

4. Estimation of Population Mean

4.1 \(z\) Values and \(t\) Values

  • Point Estimation: Ex) The population mean is 45 points.
  • Interval Estimation: Ex) The population mean is between 43 and 47 points.
  • Confidence Interval (CI) is used to express the uncertainty associated with the estimation.
  • If the standard deviation of the sample mean \(SD(\bar{x})\) is known:
    → The uncertainty of the sample mean can be captured.
    → The uncertainty of the estimation can be clarified.
  • The formula for the standard deviation of the sample mean \(SD(\bar{x})\) is

\[SD (\bar{x})= \frac{\sigma}{\sqrt{n}}\]

  • If the values of the population standard deviation \({\sigma}\) and the sample size \(n\) are known, then the standard deviation of the sample mean \(SD(\bar{x})\) can be determined.

  • While the sample size \(n\) is known, the population standard deviation \({\sigma}\) is usually unknown.
    → Instead of the population standard deviation \({\sigma}\), use the unbiased standard deviation \(u_x\) to estimate the standard deviation of the sample mean \(SD(\bar{x})\).
    → Refer to the Sample Distribution section of this page.

\[SE = \frac{u_x}{\sqrt{n}}\]

  • \(SE\) stands for Standard Error (the abbreviation Std. Err. is also used).
  • When the population standard deviation \({\sigma}\) is unknown, the Standard Error \(SE\) is used as an estimate of the standard deviation of the sample mean \(SD(\bar{x})\).
  • The \(t\) value can be calculated using the following formula, using the Standard Error \(SE\), which represents the variability of the sample mean.

\[t = \frac{\bar{x} - μ}{SE} = \frac{\bar{x} - μ}{u_x / \sqrt{n}} \]
\(\bar{x}\) : sample mean
\(μ\) : population mean
\(n\) : sample size
\(u_x\) : unbiased standard deviation
\(SE\) : Standard Error
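As a quick sketch (the data and the hypothesized mean mu = 10 below are arbitrary illustration values, not from the original text), the \(t\) formula above can be checked against R's built-in t.test():

```r
# Compute the t value by hand and compare it with t.test()'s statistic.
set.seed(5)
x  <- rnorm(20, mean = 10, sd = 5)
mu <- 10
t_by_hand <- (mean(x) - mu) / (sd(x) / sqrt(length(x)))
t_by_hand
t.test(x, mu = mu)$statistic  # same value
```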

Reference

  • When the population standard deviation \({\sigma}\) is known
  • Standard normal distribution (\(z\) distribution) can be used instead of \(t\) distribution.
    The \(z\) value can be calculated using the following formula:

\[Z = \frac{\bar{x} - μ_0}{σ / \sqrt{n}} \]

  • The \(z\) value is the transformed (standardized) value of \(\bar{x}\) using \({\sigma}\).
  • However, in reality, it is rare for the population standard deviation \({\sigma}\) to be known in data analysis.
  • In practice, the \(t\) value is used for analysis.
  • Draw four \(t\) distribution curves, one for each value of the degrees of freedom (df).
x <- seq(-4, 4, length=100)
hx <- dnorm(x)

df <- c(1, 3, 8, 30)
colors <- c("blue", "red", "green", "gold", "black")
labels <- c("df=1", "df=3", "df=8", "df=30", "N.distribution")

plot(x, hx, type="l", lty=2, xlab="t value",
  ylab="Density", main="Normal Distribution and t Distribution with multiple degrees of freedom")

for (i in 1:4){
  lines(x, dt(x,df[i]), lwd=2, col=colors[i])
}

legend("topright", inset=.05, title="Degree of Freedom",
  labels, lwd=2, lty=c(1, 1, 1, 1, 2), col=colors)

The characteristics of the \(t\) distribution:

  • A symmetric distribution centered around 0.
  • The shape of the distribution changes depending on the degrees of freedom (df).
  • When the degree of freedom is small (i.e., the sample size is small), the \(t\) value fluctuates greatly.
    → This is why the \(t\) distribution table details \(t\) values for small degrees of freedom.
  • As the degrees of freedom increase (i.e., the sample size increases), the distribution becomes more centered.
  • Since the numerator in the formula for calculating the \(t\) value is \(({\bar{x} - μ})\), when \(({\bar{x} = μ})\) (i.e., when the sample mean equals the population mean), the \(t\) value becomes 0.
  • In estimating the sample mean, use the \(t\) distribution table with degrees of freedom \(({n - 1})\).
  • For example, for a sample size n = 100, use the \(t\) distribution with 99 degrees of freedom.
  • Use the fact that the transformed \(t\) of the sample mean \(\bar{x}\) follows the \(t\) distribution with 99 degrees of freedom for statistical estimation.
  • What the black solid line in the \(t\) distribution diagram (degrees of freedom = 99) indicates:
    → The distribution of \(t\), conforming to the \(t\) distribution with 99 degrees of freedom, when the sample is redrawn many times.

The probability density curve of the \(t\) distribution with 99 degrees of freedom.

The area enclosed between the probability density curve and the horizontal axis (where probability density = 0) represents probability.

  • When the degrees of freedom \(({n - 1 = 99})\), 95% of the \(t\) values fall within the range of \(({-1.98})\) to \(({1.98})\) (this range is denoted as \(({[-1.98, 1.98]})\)).
  • The shape of the \(t\) distribution changes with degrees of freedom.
    → With different degrees of freedom, values other than \(({-1.98})\) or \(({1.98})\) are used.
  • Generally, when \(t\) follows a \(t\) distribution with degrees of freedom \(({n - 1})\)
    → 95% of \(t\) values fall within the interval \(({[-t_{n-1,0.025}, t_{n-1,0.025}]})\).
    → The area of the empty space at each end of the curve is \(0.025\) (\(0.025 \times 2 = 0.05\): significance level \((α) = 0.05\)).
    → 90% of \(t\) values fall within the interval \(({[-t_{n-1,0.05}, t_{n-1,0.05}]})\).
    → The area of the empty space at each end of the curve is \(0.05\) (\(0.05 \times 2 = 0.10\): significance level \((α) = 0.10\)).

Significance Level (\(α\): level of significance):

A criterion for how low a probability must be to be considered ‘unlikely to occur in reality’.

  • It is common to use \(α = 0.05 (5 percent)\).
  • The qt() function can be used to specify a significance level and find the rejection region.
  • For degrees of freedom of 99, a significance level of 5%, and a two-tailed test, the rejection region is
# The t-value at a cumulative (lower-tail) probability of 0.025 with 99 degrees of freedom in a t-distribution.
qt(0.025, 99)  
[1] -1.984217
# The t-value at a cumulative probability of 0.975 with 99 degrees of freedom in a t-distribution.
qt(0.975, 99) 
[1] 1.984217

4.2 Exercises

  • Q1: Using R, find the rejection region for a two-tailed test with 99 degrees of freedom and a significance level of 10%.

  • Q2: Using R, find the rejection region for a two-tailed test with 99 degrees of freedom and a significance level of 1%.

5. Confidence Interval

5.1 How to Calculate Confidence Intervals

  • Survey the temperature of sentiment among Japanese voters towards the Democratic Party of Japan (DPJ).
  • Obtain a sample of n = 100 people through simple random sampling.
  • Unbiased standard deviation \(u_x\) = 21.36.
  • Temperature of sentiment towards the DPJ: average sentiment temperature \(\bar{x}\) = 55.13 degrees.

Q: Can it be said that the average sentiment temperature of all voters is 50 degrees?

  • The 95% confidence interval for the population mean \((μ)\) can be calculated using the following formula when we have the sample mean \(\bar{x}\) and the standard error \((SE = \frac{u_x}{\sqrt{n}})\).

\[\left[\bar{x} - t_{n-1,0.025} \times SE,\ \bar{x} + t_{n-1,0.025} \times SE\right]\]

  • the average sentiment temperature \(\bar{x}\) = 55.13
  • \(({t_{99,0.975}})\) = 1.98

\[SE = \frac{u_x}{\sqrt{n}}= \frac{21.36}{\sqrt{100}} = 2.136\]

  • Substituting the above values, the 95% confidence interval for the population mean \((μ)\) is

\[\bar{x} - t_{n-1,0.025} \times SE ≦ μ ≦ \bar{x} + t_{n-1,0.025} \times SE\]

\[55.13 - 1.98 \times 2.136 ≦ μ ≦ 55.13 + 1.98 \times 2.136\]

\[50.9 ≦ μ ≦ 59.4\]

Conclusions
・The 95% confidence interval for the temperature of sentiment towards the Democratic Party among all voters is between 50.9 degrees and 59.4 degrees.
→ 50 degrees is not included in the 95% confidence interval for sentiment temperature.
→ It cannot be said that the average sentiment temperature in the population is 50 degrees.
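The interval above can be computed directly from the summary statistics given in the example (\(\bar{x}\) = 55.13, \(u_x\) = 21.36, n = 100); this sketch only reproduces the hand calculation with qt() supplying the exact critical value.

```r
# 95% confidence interval from summary statistics alone.
x_bar  <- 55.13
u_x    <- 21.36
n      <- 100
se     <- u_x / sqrt(n)           # standard error: 2.136
t_crit <- qt(0.975, df = n - 1)   # about 1.98
c(x_bar - t_crit * se, x_bar + t_crit * se)  # roughly [50.9, 59.4]
```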

Summary of Confidence Intervals

  1. When multiple samples are extracted, the transformed \(t\) of sample mean \(\bar{x}\) follows a \(t\) distribution with degrees of freedom \(({n - 1})\).

\[t = \frac{\bar{x} - μ}{SE} = \frac{\bar{x} - μ}{u_x / \sqrt{n}} \]

  2. Due to the properties of the \(t\) distribution, the following inequality holds for 95% of all samples:

\[-t_{n-1,0.025} ≦ \frac{\bar{x} - μ}{SE} ≦ t_{n-1,0.025}\]

  3. Rearranging around \(μ\), we get:

\[\bar{x} - t_{n-1,0.025} \times SE ≦ μ ≦ \bar{x} + t_{n-1,0.025} \times SE\]

  4. For 95% of the obtained samples, the true population mean \((μ)\) lies within the interval \(({[\bar{x} - t_{n-1,0.025} \times SE,\ \bar{x} + t_{n-1,0.025} \times SE]})\).
  • This interval is referred to as the ‘95% confidence interval’.

5.2 Interpretation of Confidence Intervals

  • The meaning of 95% Confidence Interval:
    The statement “The probability that the parameter value falls within this interval is 95%” is incorrect.
  • The probability that the parameter is included in the 95% confidence interval derived from each sample is either 0% or 100%.

→ The probability that the parameter \((μ = 45)\) is included in the 95% confidence interval of the first sample from the top = 100%.
→ The probability that the parameter \((μ = 45)\) is included in the 95% confidence interval of the eleventh sample from the top = 0%.

  • Assume a sample size of 100, with 20 samples taken.
  • 20 “95% Confidence Intervals” are obtained.
  • On average, 95% of these (i.e., about 19) will capture the parameter, while 5% (about 1) will fail to capture it.
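The coverage claim above can be verified by simulation. This is a sketch added for illustration: the true mean (50), standard deviation (20), and 2,000 repetitions are arbitrary choices.

```r
# Draw many samples, build a 95% CI from each, and count how often the
# interval contains the true population mean.
set.seed(1)
mu <- 50
covers <- replicate(2000, {
  s  <- rnorm(100, mean = mu, sd = 20)
  ci <- t.test(s)$conf.int
  ci[1] <= mu && mu <= ci[2]
})
mean(covers)  # close to 0.95
```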

Using R to Verify the Sentiment Temperature Issue

  • Survey the temperature of sentiment among Japanese voters towards the Democratic Party.

  • Obtain a sample of n = 100 people through simple random sampling.

  • Unbiased standard deviation \(u_x\) = 21.36.

  • Temperature of sentiment towards the Democratic Party: average sentiment temperature \(\bar{x}\) = 55.13 degrees.

  • Can it be said that the average sentiment temperature of all voters is 50 degrees?

  • Assuming an average sentiment temperature of 55 degrees and a standard deviation of 20 in the population, randomly extract a sample of 100 people and display descriptive statistics.

set.seed(2017)
emotion <- rnorm(100, mean=55, sd=20)
summary(emotion)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3.14   39.56   55.20   55.13   69.00  117.75 
hist(emotion)

sd(emotion)
[1] 21.36289
  • Using R to perform a t-test under the condition that the population mean = 50.
t.test(emotion, mu=50)

    One Sample t-test

data:  emotion
t = 2.4016, df = 99, p-value = 0.01819
alternative hypothesis: true mean is not equal to 50
95 percent confidence interval:
 50.89171 59.36943
sample estimates:
mean of x 
 55.13057 
  • Pay attention to the results on the 4th to 5th lines from the bottom “95 percent confidence interval: 50.89171 − 59.36943”.
    → This is the “95% confidence interval for the sentiment temperature”.

Conclusion: The 95% confidence interval for the temperature of sentiment towards the DPJ among all voters is between 50.9 degrees and 59.4 degrees.
→ “50 degrees” is not included in the 95% confidence interval for sentiment temperature.
→ It cannot be said that the average sentiment temperature in the population is 50 degrees.
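The t.test() output above can also be reproduced by hand from the reported summary statistics (\(\bar{x}\) = 55.13057, \(u_x\) = 21.36289, n = 100):

```r
# Recomputing the one-sample t statistic and 95% CI from summary statistics
xbar <- 55.13057; u <- 21.36289; n <- 100
se <- u / sqrt(n)                 # standard error of the mean
t_stat <- (xbar - 50) / se        # t = 2.4016, matching the t.test() output
ci <- xbar + c(-1, 1) * qt(0.975, df = n - 1) * se
round(c(t = t_stat, lower = ci[1], upper = ci[2]), 4)
```

The interval (50.89, 59.37) agrees with the “95 percent confidence interval” reported by t.test().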

6. Probability Density Function

6.1 Bernoulli Trial

  • An experiment in which the occurrence or non-occurrence of an event is determined by a certain probability, like flipping a coin, is called a Bernoulli Trial.
  • When representing the probability of getting heads on a coin with the lowercase letter \((p)\), a Bernoulli Trial can be expressed by the following formula:

\[P(X = Head) = p\] \[P(X = Tail) = 1 - p\]

  • \((P (X))\) represents probability and takes real number values in the range from 0 to 1.
  • The variable \((X)\) representing heads or tails of a coin is called a random variable.
  • Whether the coin shows heads or tails is not determined until the coin is tossed.
    → Such a variable is called a “random variable”.
  • When each value taken by a random variable (in this case, “Head” and “Tail”) is paired with the probability of its occurrence, this correspondence is called a probability distribution.
  • A random variable \((X)\) is “a variable that takes specific values according to a probability distribution”.
  • The random variable \((X)\) takes heads (“Head”) with probability \((p)\), and tails (“Tail”) with probability \((1 - p)\).
  • Repeating Bernoulli Trials.
  • The distribution of the number of times heads (“Head”) appears in coin flipping is called a binomial distribution.
    → R provides dbinom(), the probability density function for the binomial distribution.
  • When flipping a coin 10 times, the probability density (i.e., probability) of a binomial distribution where heads appear 5 times is
dbinom(5, 10, 0.5)  # dbinom(x, size, probability)
[1] 0.2460938
  • When flipping a coin 10 times, to find the probability of heads appearing from 0 to 10 times each, you would use:
dbinom(0:10, 10, 0.5)
 [1] 0.0009765625 0.0097656250 0.0439453125 0.1171875000 0.2050781250
 [6] 0.2460937500 0.2050781250 0.1171875000 0.0439453125 0.0097656250
[11] 0.0009765625
  • The number immediately to the right of [1] represents the probability of the coin showing heads 0 times, the next number represents the probability of 1 time, and the last number to the right of [11] represents the probability of 10 times.
  • As expected, the probability of getting 5 heads is the highest (0.246), followed by nearly the same probability for 4 and 6 (0.205), and then 3 and 7 (0.117).
  • The probability decreases as the count moves away from 5.
  • When repeating Bernoulli Trials 10 times with a probability of 0.5 for heads, the probability of getting heads 5 times is the highest.
  • When flipping a coin 10 times, plotting not the number of times heads appears, but the probability of each number of times heads appears
# Specifying the number of times the coin lands heads from 0 to 10, with 10 trials and a probability of 0.5 for heads:
coin10 <- dbinom(0:10, 10, 0.5)

# "h" specifies vertical lines in the chart, lwd specifies the thickness of the lines
plot(0:10, coin10, type = "h", lwd = 5) 

  • The height of the vertical bars represents the probability density (= probability).
  • (In the case of continuous quantities, probability density ≠ probability, instead the area of the interval = probability)
  • The correspondence between each outcome shown here and its probability of occurrence is called the probability distribution.
  • The formula that calculates these probabilities is referred to as the probability density function (PDF), and is represented by the following formula:

\[f(X) = {}_{N}C_{X}\, P^{X} (1 - P)^{N-X}\]

  • Let’s substitute the number of trials \((N = 10)\) and the number of successes \((X = 5)\) into this formula.
  • The number of combinations of getting heads 5 times in 10 coin tosses \((_NC_X)\) can be calculated using the choose() function in R.
choose(10, 5)  # The combination of getting heads 5 times out of 10 throws
[1] 252
  • We get the value of \((P^X{(1 - P)^{N-X}})\)
(0.5)^5*(1-0.5)^5
[1] 0.0009765625
  • Therefore, the value of \((f(X))\) is
(252)*(0.0009765625)
[1] 0.2460938
  • This value corresponds to the height (probability density = probability) of the vertical bars in the histogram above.
  • This probability density (= probability) can be calculated in R as follows:
dbinom(5, 10, 0.5)  # dbinom(x, size, probability)
[1] 0.2460938
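As a sanity check, the formula-based calculation can be compared with dbinom() for every possible number of heads, and the eleven probabilities should sum to 1:

```r
# Comparing the manual binomial formula NCX * P^X * (1-P)^(N-X)
# with dbinom() for X = 0, ..., 10
N <- 10; p <- 0.5
X <- 0:N
manual <- choose(N, X) * p^X * (1 - p)^(N - X)
all.equal(manual, dbinom(X, N, p))  # TRUE
sum(manual)                         # the probabilities sum to 1
```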

6.2 Coin Toss Simulation

  • In the previous section’s simulation, we counted the number of times heads appeared when a coin was tossed 10 times (integers from 0 to 10), which are discrete values.
  • Here, we will conduct a simulation using continuous quantities.
  • In continuous data, the probability is not represented by a specific value, but by a certain interval that includes the value, expressed as an area.
  • As an example, let’s randomly extract the heights of 10 students from a continuous variable (the height of students at Waseda University) that is uniformly distributed between a minimum of 140 and a maximum of 185.
  • The function used here is runif(n, a, b), where \((a)\) is the minimum value, \((b)\) is the maximum value, and \((n)\) is the number of draws.
runif(10, 140, 185)
 [1] 162.8133 170.4660 181.2077 177.7272 161.2677 171.0087 176.0036 156.1114
 [9] 147.5133 150.7378
  • The r in runif stands for random, and unif represents uniform (uniform distribution).
  • Since this is a random extraction, different numbers are extracted each time it is executed.
runif(10, 140, 185)
 [1] 141.0996 142.8775 171.5061 140.8381 167.1321 170.8180 177.8217 150.8456
 [9] 151.2468 175.7331
  • When finalizing an academic paper, set.seed() can be used to fix the seed of the random number generator so that the same random sampling results can be reproduced.
set.seed(2016-01-09)
runif(10, 140, 185)
 [1] 179.9503 179.6794 149.5198 162.7430 148.9673 175.3110 171.7291 168.4825
 [9] 154.0057 140.8939
  • Setting the same value for seed ensures that the same numbers are extracted every time it is executed.
set.seed(2016-01-09)
runif(10, 140, 185)
 [1] 179.9503 179.6794 149.5198 162.7430 148.9673 175.3110 171.7291 168.4825
 [9] 154.0057 140.8939
  • Let’s create a histogram from 10 numbers obtained through random sampling
set.seed(2016-01-09)
hist(runif(10, 140, 185))

  • Let’s create a histogram from 100 numbers obtained through random sampling
set.seed(2016-01-09)
hist(runif(100, 140, 185))

  • Let’s create a histogram from 1000 numbers obtained through random sampling
set.seed(2016-01-09)
hist(runif(1000, 140, 185))

  • Since the population is a continuous variable uniformly distributed with a minimum value of 140 and a maximum value of 185, the more the sample size is increased, the more it resembles the shape of the population.
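For a continuous variable, probability corresponds to the area under the density over an interval, not to the height at a single point. For the uniform distribution above, this area can be computed with punif(), the cumulative distribution function:

```r
# P(150 <= X <= 160) for X ~ Uniform(140, 185):
# the interval length divided by the total range, 10 / 45
punif(160, 140, 185) - punif(150, 140, 185)  # ~ 0.2222
```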
  • Not only for uniform distributions, but R also allows random sampling from various distributions using the following commands.
  • Commonly used distributions are as follows:
rnorm() : normal distribution
rt() : \(t\) distribution
rchisq() : \((\chi^2)\) Chi-square distribution
  • When randomly sampling from each distribution, it is important to note that each has its own specific arguments.
  • For example, for a normal distribution, you need to specify the mean (mean) and standard deviation (sd) inside the parentheses of rnorm(), while for the \(t\) distribution and \((\chi^2)\) distribution, the degrees of freedom (df) must be specified inside the parentheses.
  • Let’s try random sampling using rnorm().
  • When randomly extracting n numbers from a normally distributed population \(N(\mu, \sigma^2)\):
    → Input the necessary information in rnorm(n, mean = mu, sd = sigma).
  • If randomly extracting n = 100 numbers from a population normally distributed with a population mean (mu) of 160 and a population variance of 16 (sigma = 4), enter as follows:
rnorm(100, 160, 4)
  [1] 159.3848 159.4990 164.7559 155.6251 153.1917 158.3015 163.9680 153.7831
  [9] 158.1598 160.8625 161.2840 161.1273 164.9580 164.3071 157.1850 152.9570
 [17] 157.9721 165.1655 161.9859 157.9363 159.3190 153.6560 160.8890 160.0436
 [25] 152.7490 156.1923 166.4202 159.1252 161.6708 158.8871 158.8819 163.8937
 [33] 154.8815 163.1115 159.2016 152.5330 157.0690 152.9178 158.2415 162.3454
 [41] 160.8212 160.1693 162.7420 157.0018 162.6093 156.4365 163.4589 157.0134
 [49] 161.0999 163.0928 154.6236 167.2826 165.1174 152.7759 156.5711 156.8450
 [57] 160.3522 162.3860 159.9502 162.2307 159.0250 163.0924 164.7416 161.5864
 [65] 157.7696 167.6563 159.5414 163.9407 164.6144 156.0849 156.9475 164.7797
 [73] 156.1790 157.2057 157.2734 159.0295 157.1015 157.6920 159.8148 160.7410
 [81] 164.8714 157.5850 160.9043 163.3799 157.6321 156.9417 154.8651 154.0923
 [89] 161.5911 163.2804 165.0042 159.4423 163.6247 161.3810 159.2093 160.0146
 [97] 153.1990 164.3284 150.2262 158.1045
hist(rnorm(100, 160, 4))

  • The larger the sample size, the more the randomly extracted samples will resemble the shape of a normal distribution.
quantile(rnorm(100, 160, 4), c(0, .25, .5, .75, 1))
      0%      25%      50%      75%     100% 
150.0572 157.2044 159.8816 163.0106 168.9840 
boxplot(rnorm(100, 160, 4))
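The claim above, that larger samples look more like the normal distribution, can be illustrated by drawing histograms side by side; the sample sizes 10, 100, and 10,000 are arbitrary choices for illustration:

```r
# As n grows, the sample histogram approaches the N(160, 4^2) bell shape
set.seed(2016)
par(mfrow = c(1, 3))  # three panels side by side
for (n in c(10, 100, 10000)) {
  hist(rnorm(n, 160, 4), main = paste("n =", n), xlab = "height")
}
par(mfrow = c(1, 1))  # restore the default single-panel layout
```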

7. Expected Value

7.1 Expected Value = Mean Value

  • The expected value \((E(X))\) is the sum, over all outcomes, of ‘the probability of each event occurring \((p(X_i))\)’ multiplied by ‘the numerical result of that event \((X_i)\)’.
  • The expected value is also referred to as the ‘mean value’.
  • Expressed in a formula:

\[E(X) = \sum_{i=1}^N p(X_i)X_i\]

An example:

  • The expected value when rolling a die is calculated by multiplying ‘the probability of each face appearing (1/6)’ with ‘the numerical result of that outcome (1, 2, 3, 4, 5, 6)’ and then summing the results.

\[E(X) = \sum_{i=1}^N p(X_i)X_i\]

\[= (1/6)*1 + (1/6)*2 + (1/6)*3 + (1/6)*4 + (1/6)*5 + (1/6)*6\]

\[= 3.5\]
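The same calculation can be done directly in R, summing probability times value over the six faces:

```r
# Theoretical expected value of a fair die: sum of p(x) * x
faces <- 1:6
prob <- rep(1/6, 6)   # each face appears with probability 1/6
sum(prob * faces)     # 3.5
```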

  • Using R, let’s roll a die 10 times and calculate the mean (our estimate of the expected value).
x <- 1:6
y <- sample(x, 10, replace = TRUE)
mean(y)
[1] 2.7
  • In theory, the expected value when rolling a die is 3.5, but in practice, we do not always obtain exactly 3.5.
  • Let’s roll it another 10 times and see.
y <- sample(x, 10, replace = TRUE)
mean(y)
[1] 4.3
  • If we increase the number of times we roll the die to 10,000, and calculate the expected value, then…
y <- sample(x, 10000, replace = TRUE)
mean(y)
[1] 3.5204
  • As the number of die rolls approaches infinity, the sample mean approaches the expected value of 3.5 (the law of large numbers).
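The law of large numbers can be visualized by plotting the running mean of the rolls as the number of rolls grows; this is a sketch, with the seed chosen arbitrarily:

```r
# The running mean of die rolls converges toward the expected value 3.5
set.seed(2017)
rolls <- sample(1:6, 10000, replace = TRUE)
running_mean <- cumsum(rolls) / seq_along(rolls)
plot(running_mean, type = "l", xlab = "number of rolls", ylab = "running mean")
abline(h = 3.5, col = "red")  # the theoretical expected value
```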

7.2 Expected Value Simulation

Probability Distribution:

  • A probability distribution assigns to each possible outcome of the data-generating process the probability with which it occurs.
  • Examples include rolling a die, tossing a coin, etc.
  • An outcome that happens with a specific probability is referred to as an “event”.
  • R language has a feature for randomly sampling from a population that is distributed with certain probabilities.
  • Use R to simulate tossing a coin 10 times.
  • First, define the heads of a coin as “H” and tails as “T”, and create a hypothetical coin (coin).
  • Toss the coin 10 times and record the number of times (in integers) heads and tails appear.
  • Such integers are referred to as discrete values.
coin <- c("H", "T") 
flip <- sample(coin, 10, replace = TRUE) 
table(flip) 
flip
H T 
2 8 
  • While theoretically, the probability of getting heads (“H”) on a coin toss is 50%, in practice, tossing a coin 10 times doesn’t always result in exactly 5 heads.
  • Try tossing a coin 10 times and repeat this experiment 20 times. Then, represent the results in a histogram.
x <- numeric(20) # Create a vector x and fill it with 20 zeros
for(i in 1:20) { # Repeat the 10-toss experiment, with i running from 1 to 20
  flip <- sample(coin, 10, rep = TRUE) # Toss the coin 10 times
  x[i] <- sum(flip == "H") # Count the heads ("H") and store the count in x[i]
}
table(x) # Command to tabulate the results in x
x
4 5 6 7 8 
2 5 9 3 1 
hist(x, breaks = 0:10) # Limit histogram intervals from 0 to 10

  • Although the results of each simulation vary, when you toss a coin 10 times for 20 trials, the most frequent outcome isn’t always 5 heads.
  • Often, the distribution can be skewed.
  • Try tossing a coin 10 times for 10,000 trials.
x <- numeric(10000) # Create a vector x and fill it with 10,000 zeros
for(i in 1:10000) { # Repeat the 10-toss experiment, with i running from 1 to 10000
  flip <- sample(coin, 10, rep = TRUE) # Toss the coin 10 times
  x[i] <- sum(flip == "H") # Count the heads ("H") and store the count in x[i]
}
table(x) # Command to tabulate the results in x
x
   0    1    2    3    4    5    6    7    8    9   10 
  13  110  427 1213 2021 2429 2072 1139  454  114    8 
hist(x, breaks = 0:10) # Limit histogram intervals from 0 to 10
abline(v = mean(x), col = "red") # Draw a red vertical line at the mean

mean(x)
[1] 4.9975
  • When the results are represented in a histogram, it becomes clear that the distribution is centered on the mean of 5.
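The simulated relative frequencies can also be compared directly with the theoretical binomial probabilities from Section 6.1. As a shortcut, rbinom() generates the number of heads in 10 tosses directly; this is an equivalent alternative to the sample() loop above:

```r
# 10,000 trials of 10 coin tosses, using rbinom() as a shortcut
set.seed(2017)
heads <- rbinom(10000, size = 10, prob = 0.5)
round(table(heads) / 10000, 4)    # simulated relative frequencies
round(dbinom(0:10, 10, 0.5), 4)   # theoretical probabilities
```

The two rows of numbers should be close, with the agreement improving as the number of trials increases.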

8. Exercise:

  • Q1: Using R, randomly draw 3 samples (number of samples = 3) of sample size 5 from a population with a mean of 50 and a standard deviation of 10. Show the histogram, mean, and standard deviation for each of the 3 samples.

  • Q2: Using R, randomly draw 3 samples (number of samples = 3) of sample size 500 from a population with a mean of 50 and a standard deviation of 10. Show the histogram, mean, and standard deviation for each of the 3 samples.

  • Q3: What can be inferred from these two types of simulations?

Reference
  • 宋財泫 (Jaehyun Song)・矢内勇生 (Yuki Yanai)「私たちのR: ベストプラクティスの探究」
  • 土井翔平(北海道大学公共政策大学院)「Rで計量政治学入門」
  • 矢内勇生(高知工科大学)統計学2
  • 浅野正彦, 矢内勇生.『Rによる計量政治学』オーム社、2018年
  • 浅野正彦, 中村公亮.『初めてのRStudio』オーム社、2018年
  • Winston Chang, R Graphics Cookbook, O’Reilly Media, 2012.
  • Kieran Healy, DATA VISUALIZATION, Princeton, 2019
  • Kosuke Imai, Quantitative Social Science: An Introduction, Princeton University Press, 2017