Descriptive Statistics and Inferential Statistics

Types of Statistics | Characteristics |
---|---|
Descriptive Statistics | Use statistical measures to grasp data trends and properties |
Inferential Statistics | Test and estimate population parameters (such as the population mean and variance) |
The key idea of inferential statistics is that by increasing the number of randomly drawn samples from the population and repeating trials infinitely, we can infer properties of the vast and unknown population from a part of it, the sample.

We begin with descriptive statistics, the foundational knowledge commonly used in both types of statistics. Two terms are central: the parameter (a property of the population) and the statistic (a property of the sample).
How to estimate

・Calculate a sample statistic (such as the sample mean \(\bar{X}\)) → Infer the parameter (such as the population mean \(\mu\))
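As a minimal sketch of this logic (the population below is hypothetical, created only for illustration):

```r
# Hypothetical population with a known parameter: mu = 50 (unknown in practice)
set.seed(1)
population <- rnorm(100000, mean = 50, sd = 10)

x <- sample(population, 100)  # a randomly drawn sample
x_bar <- mean(x)              # sample statistic: the sample mean
x_bar                         # close to, but not exactly, the population mean
```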
Assumption in statistical estimation

・The population is normally distributed.
・But this is not always the case (many populations are not normally distributed).
・The Central Limit Theorem enables us to conduct statistical estimation anyway.
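A sketch of the Central Limit Theorem at work, using a clearly non-normal population (an exponential population is assumed here purely for illustration): the distribution of sample means is approximately normal even though the population is skewed.

```r
set.seed(2016)
population <- rexp(100000, rate = 1)  # skewed, non-normal population; mean = 1

# 5,000 sample means, each computed from a sample of size 100
sample_means <- replicate(5000, mean(sample(population, 100, replace = TRUE)))

mean(sample_means)  # close to the population mean
hist(sample_means)  # approximately bell-shaped despite the skewed population
```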
Number of Samples and Sample Size

The terms sample size and number of samples are often confused.

・Sample Size: the number of observations extracted in one sample, denoted by n.
・Number of Samples: the number of sample sets extracted, each containing n observations.

Using the dnorm() function, you can artificially create a population and display its distribution density.

# Specify a population mean of 10 and a standard deviation of 5
# Specify the range from -10 to 30
curve(dnorm(x, 10, 5), from = -10, to = 30)
Random numbers can be extracted from this population using the rnorm() function.

# Randomly extract 20 numbers from a population with a mean of 10 and a standard deviation of 5 (sd = 5),
# name the result x1, and display the data and a histogram
x1 <- rnorm(20, mean = 10, sd = 5)
x1
[1] 11.978594 6.917834 7.668746 7.307478 10.847311 6.187763 11.212720
[8] 10.524525 21.320335 7.414352 12.755208 7.578064 9.529863 6.792361
[15] 8.679532 2.512480 11.451368 6.328466 13.408677 8.167278
The values can be rounded to whole numbers using the round() function.

 [1] 12 7 8 7 11 6 11 11 21 7 13 8 10 7 9 3 11 6 13 8
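To make the sample size vs. number of samples distinction concrete, here is a sketch (the population parameters follow the example above; drawing each sample with rnorm() is an illustrative shortcut):

```r
set.seed(1)
n <- 20          # sample size: observations in one sample
n_samples <- 3   # number of samples: how many sample sets are drawn

# Each column of the matrix is one sample of size n
samples <- replicate(n_samples, rnorm(n, mean = 10, sd = 5))
dim(samples)     # 20 rows (sample size) by 3 columns (number of samples)
```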
The sample mean \(\bar{x}\) can be
calculated using the following formula:
\[\bar{x} = \frac{\sum_{i=1}^n
x_i}{n}\]
・In R, the sample mean can be calculated using the mean() function.
[1] 9.429148
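The formula can be checked by hand: mean() is simply the sum divided by the sample size. Using the rounded values shown above (the result differs slightly from 9.429148 because the values are rounded):

```r
x1 <- c(12, 7, 8, 7, 11, 6, 11, 11, 21, 7, 13, 8, 10, 7, 9, 3, 11, 6, 13, 8)
manual_mean <- sum(x1) / length(x1)  # sum of all x_i divided by n
manual_mean
all.equal(manual_mean, mean(x1))     # TRUE
```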
Drawing a second sample from the same population gives a different set of values:
[1] 11.051708 19.158094 9.271258 4.834369 16.371883 1.000707 17.500840
[8] 8.262222 16.659910 14.391944 5.199113 12.250473 18.393731 9.871836
[15] 16.134405 12.462566 7.896137 8.655362 7.190878 6.212851
[1] 11 19 9 5 16 1 18 8 17 14 5 12 18 10 16 12 8 9 7 6
[1] 11.13851
There is no bias in sampling, but there is an error: the means of the first sample x1 and the second sample x2 do not exactly match the population mean of 10.

The min(), median(), and max() functions allow easy acquisition of descriptive statistics such as the minimum, median, and maximum values.

[1] 1.000707
[1] 10.46177
[1] 19.15809
0% 25% 50% 75% 100%
1.000707 7.719822 10.461772 16.193775 19.158094
Next, consider the following vector x:

 [1] 22 33 44 55 66 77 88 99 100
In R, the var() function uses the sample unbiased variance \(U^2\) by default. It is important to distinguish between the sample unbiased variance and the sample variance.

Sample unbiased variance \(U^2\) (default in R): used when only a part of the population data is available and there is a need to understand the variability of the population data behind the obtained data.
Sample unbiased variance \(U^2\)
can be calculated using var().
Suppose \(x\) is a sample
extracted from the population.
Since the population variance is unknown, we use the statistics
obtained from the sample to estimate the unknown population
variance.
The statistic for estimating the population variance is the
sample unbiased variance \(U^2\).
Sample unbiased variance \(U^2\) can be calculated using the following formula:
\[U^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}\]
・\(\bar{x}\): sample mean.
・Let’s calculate in R.
[1] 808.6111
[1] 808.6111
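The value 808.6111 can be reproduced directly from the formula, using the vector x = (22, 33, …, 100) shown earlier:

```r
x <- c(22, 33, 44, 55, 66, 77, 88, 99, 100)
n <- length(x)
u2 <- sum((x - mean(x))^2) / (n - 1)  # sample unbiased variance U^2
u2                                    # 808.6111
all.equal(u2, var(x))                 # var() uses the n - 1 denominator
```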
Sample variance \(S^2\) is used in the following cases:

・When a sample extracted from the population is available.
・When the interest lies not in the population but only in the sample.
・When the only concern is how much the sample is dispersed.
R does not have a function to calculate sample variance \(S^2\).
\(S^2\) must be calculated directly from the following formula:
\[S^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n}\]
・\(\bar{x}\): sample mean
[1] 718.7654
Sample variance \(S^2\) (718.7654) is always smaller than sample unbiased variance \(U^2\) (808.6111).
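Since R has no built-in function for \(S^2\), it can be sketched directly from the formula, again using the vector x above:

```r
x <- c(22, 33, 44, 55, 66, 77, 88, 99, 100)
n <- length(x)
s2 <- sum((x - mean(x))^2) / n       # sample variance S^2 (denominator n)
s2                                   # 718.7654
s2 < var(x)                          # TRUE: S^2 is always smaller than U^2
all.equal(s2, var(x) * (n - 1) / n)  # the two differ by a factor of (n-1)/n
```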
10.2 Inferential Statistics (Estimation and Testing)
Why do we use the sum of squared deviations (as in \(S^2\) or \(U^2\)) to measure variability?
・The difference (deviation) of each data point in a group from the
average value is an indicator of how far each is from the mean.
→ Deviations can be both positive and negative.
→ Their simple sum is therefore not suitable as a measure of how far the data points are from the mean, because positive and negative deviations cancel out.
・By squaring the deviations, every value becomes non-negative, making the sum an appropriate indicator of dispersion.
The number of possible combinations can be calculated with the choose() function as follows:

[1] 120
[1] 2.755731e+73
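As a sketch of choose() (the arguments below are assumptions chosen for illustration; 120 is, for example, the number of ways to pick 3 items from 10):

```r
choose(10, 3)  # number of combinations of 3 items out of 10: 120
# The same value from the factorial definition of a combination
factorial(10) / (factorial(3) * factorial(7))
```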
Sampling Distribution

The sampling distribution is the crux of statistical estimation, linking statistics and parameters with probability.

Population Mean and Sample Mean
 | Mr.A | Mr.B | Mr.C |
---|---|---|---|
# of times elected | 3 | 6 | 12 |
win <- c(3, 6, 12) # Create a population with election counts of 3, 6, and 12
mean(win) # Population mean of election counts
[1] 7
[1] 14
[1] 3.741657
[1] 4.5
[1] 2.25
[1] 1.5
sample number | legislator chosen | sample mean | sample sd |
---|---|---|---|
1 | A & B | 4.5 | 1.5 |
2 | A & C | 7.5 | 4.5 |
3 | B & C | 9 | 3 |
Unbiased means that even if each individual statistic differs from the parameter, the average of the obtained statistics matches the parameter:

\[\mu = \frac{4.5 + 7.5 + 9}{3} = 7\]
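The unbiasedness of the sample mean in the legislator example can be checked by enumerating every possible sample of size 2:

```r
win <- c(3, 6, 12)        # population: election counts for A, B, C
mean(win)                 # population mean: 7

samples <- combn(win, 2)  # all samples of size 2: A&B, A&C, B&C
sample_means <- apply(samples, 2, mean)
sample_means              # 4.5, 7.5, 9.0
mean(sample_means)        # 7: matches the population mean
```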
Population Standard Deviation and Sample Standard Deviation

\[\frac{1.5 + 4.5 + 3}{3} = 3 < 3.74\]

The average of the sample standard deviations (3) is smaller than the population standard deviation (3.74). The sample standard deviation (a statistic) has a systematic bias: it underestimates the population standard deviation (the parameter). We therefore use the sample unbiased standard deviation \(u_x\) to estimate the population standard deviation \(\sigma\). In R, \(u_x\) is calculated with the sd() function.

[1] 4.582576
\[\sigma^2 = \frac{\sum_{i=1}^N (x_i - \mu)^2}{N}\]

\[u_x^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}\]

 | R code |
---|---|
Population variance \((\sigma^2)\) | var(x) * (length(x) - 1) / length(x) |
Population standard deviation \((\sigma)\) | sqrt(var(x) * (length(x) - 1) / length(x)) |
Sample unbiased variance \((u_x^2)\) | var(x) |
Unbiased standard deviation \((u_x)\) | sd(x) |
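These conversions can be verified with the legislator data, reproducing the values 14 and 3.741657 shown earlier:

```r
x <- c(3, 6, 12)
n <- length(x)
pop_var <- var(x) * (n - 1) / n  # population variance: 14
pop_var
sqrt(pop_var)                    # population standard deviation: about 3.74
sd(x)                            # unbiased standard deviation: about 4.58
```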
[1] 2.754492e+43
[1] 169.903
height <- rnorm(100000, 170, 6)
## Parameters: n = sample size
## T = number of samples
height_experiment <- function(n = 10, T = 500, seed = 20160124) {
means <- rep(NA, T)
s_mat <- matrix(NA, ncol = n, nrow = T)
set.seed(seed)
height <- rnorm(100000, 170, 6)
for (i in 1:T) {
s <- sample(height, n, replace = TRUE)
s_mat[i,] <- s
means[i] <- mean(s)
par(family = "HiraKakuPro-W3", cex.lab = 1.4, cex.axis = 1.2, cex.main = 1.6, mex = 1.4)
hist(means[1:i], axes = FALSE, freq = FALSE, col = "gray",
xlim = c(min(height), max(height)),
xlab = "sample mean", ylab = "probability density",
main = paste0("sample size = ", n, ", sample id = ", i))
axis(1, min(height):max(height))
abline(v = 170, lwd = 2, col = "red")
}
axis(2)
return(s_mat)
}
source("<path-of-the-directory>/height_sim.R")
sim_1 <- height_experiment(n = 10) # Extract 500 samples of size 10.
sim_2 <- height_experiment(n = 90) # Extract 500 samples of size 90
\[SD(\bar{x}) = \frac{\sigma}{\sqrt{n}}\]

For a sample size of \(n = 10\):
\[SD(\bar{x}) = \frac{6}{\sqrt{10}} \approx 1.9\]

For a sample size of \(n = 90\):
\[SD(\bar{x}) = \frac{6}{\sqrt{90}} \approx 0.63\]
If the values of the population standard deviation \(\sigma\) and the sample size \(n\) are known, then the standard deviation of the sample mean \(SD(\bar{x})\) can be determined.
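The two calculations above can be reproduced in R, using the population standard deviation of 6 from the height example:

```r
sigma <- 6        # population standard deviation of heights
sigma / sqrt(10)  # SD of the sample mean when n = 10: about 1.90
sigma / sqrt(90)  # SD of the sample mean when n = 90: about 0.63
```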
While the sample size \(n\) is known, the population standard deviation \(\sigma\) is usually unknown.
→ Instead of the population standard deviation \(\sigma\), use the unbiased standard deviation \(u_x\) to estimate the standard deviation of the sample mean \(SD(\bar{x})\).
→ Refer to the Sampling Distribution section of this page.
\[SE = \frac{u_x}{\sqrt{n}}\]
\[t = \frac{\bar{x} - \mu}{SE} = \frac{\bar{x} - \mu}{u_x / \sqrt{n}}\]

・\(\bar{x}\): sample mean
・\(\mu\): population mean
・\(n\): sample size
・\(u_x\): unbiased standard deviation
・\(SE\): standard error
\[Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}\]
x <- seq(-4, 4, length=100)
hx <- dnorm(x)
df <- c(1, 3, 8, 30)
colors <- c("blue", "red", "green", "gold", "black")
labels <- c("df=1", "df=3", "df=8", "df=30", "N.distribution")
plot(x, hx, type="l", lty=2, xlab="t value",
ylab="Density", main="Normal Distribution and t Distribution with multiple degrees of freedom")
for (i in 1:4){
lines(x, dt(x,df[i]), lwd=2, col=colors[i])
}
legend("topright", inset=.05, title="Degree of Freedom",
labels, lwd=2, lty=c(1, 1, 1, 1, 2), col=colors)
The area enclosed between the probability density curve and the horizontal axis (where probability density = 0) represents probability.
A criterion for how low a probability must be to be considered ‘unlikely to occur in reality’.
The qt() function can be used to specify a significance level and find the rejection region.

# The t-value at a lower cumulative probability of 0.025 with 99 degrees of freedom in the t distribution
qt(0.025, 99)
[1] -1.984217
# The t-value at an upper cumulative probability of 0.975 with 99 degrees of freedom in the t distribution
qt(0.975, 99)
[1] 1.984217
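Because the t distribution is symmetric around zero, the two rejection boundaries are mirror images of each other:

```r
lower <- qt(0.025, 99)    # about -1.984
upper <- qt(0.975, 99)    # about  1.984
all.equal(lower, -upper)  # TRUE: symmetric around 0
```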
Q1: Using R, find the rejection region for a two-tailed test with 99 degrees of freedom and a significance level of 10%.
Q2: Using R, find the rejection region for a two-tailed test with 99 degrees of freedom and a significance level of 1%.
\[\bar{x} - t_{n-1,\,0.975} \times SE \le \mu \le \bar{x} + t_{n-1,\,0.975} \times SE\]

\[SE = \frac{u_x}{\sqrt{n}} = \frac{21.36}{\sqrt{100}} = 2.136\]

\[55.13 - 1.98 \times 2.136 \le \mu \le 55.13 + 1.98 \times 2.136\]

\[50.9 \le \mu \le 59.4\]
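The interval can be reproduced in R from the summary statistics given in the text (sample mean 55.13, unbiased standard deviation 21.36, n = 100):

```r
x_bar <- 55.13   # sample mean from the survey
u_x <- 21.36     # unbiased standard deviation
n <- 100

se <- u_x / sqrt(n)              # standard error: 2.136
t_crit <- qt(0.975, df = n - 1)  # about 1.98

c(lower = x_bar - t_crit * se,
  upper = x_bar + t_crit * se)   # about 50.9 and 59.4
```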
Conclusions

・The 95% confidence interval for the temperature of sentiment towards the Democratic Party among all voters is between 50.9 degrees and 59.4 degrees.
→ 50 degrees is not included in the 95% confidence interval for
sentiment temperature.
→ It cannot be said that the average sentiment temperature in the
population is 50 degrees.
Summary of Confidence Intervals
\[t = \frac{\bar{x} - \mu}{SE} = \frac{\bar{x} - \mu}{u_x / \sqrt{n}}\]

\[-t_{n-1,\,0.975} \le \frac{\bar{x} - \mu}{SE} \le t_{n-1,\,0.975}\]

\[\bar{x} - t_{n-1,\,0.975} \times SE \le \mu \le \bar{x} + t_{n-1,\,0.975} \times SE\]
95% Confidence Interval

→ The probability that the parameter \((\mu = 45)\) is included in the 95% confidence interval of the first sample from the top = 100%.
→ The probability that the parameter \((\mu = 45)\) is included in the 95% confidence interval of the eleventh sample from the top = 0%.
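What "95%" means can be checked by simulation (a sketch; the population parameters \(\mu = 45\), \(\sigma = 10\) and the sample size are assumed for illustration): about 95% of intervals constructed this way contain the true mean.

```r
set.seed(2016)
mu <- 45; sigma <- 10; n <- 25

# For each of 10,000 samples, build a 95% CI and record whether it contains mu
covered <- replicate(10000, {
  x <- rnorm(n, mu, sigma)
  se <- sd(x) / sqrt(n)
  ci <- mean(x) + c(-1, 1) * qt(0.975, n - 1) * se
  ci[1] <= mu && mu <= ci[2]
})
mean(covered)  # close to 0.95
```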
Survey the temperature of sentiment among Japanese voters towards
the Democratic Party.
Obtain a sample of n = 100 people through simple random
sampling.
Unbiased standard deviation \(u_x\) = 21.36.
Temperature of sentiment towards the Democratic Party: average sentiment temperature \(\bar{x}\) = 55.13 degrees.
Can it be said that the average sentiment
temperature of all voters is 50 degrees?
Assuming an average sentiment temperature of 55 degrees and a standard deviation of 20 in the population, randomly extract a sample of 100 people and display descriptive statistics.
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.14 39.56 55.20 55.13 69.00 117.75
[1] 21.36289
One Sample t-test
data: emotion
t = 2.4016, df = 99, p-value = 0.01819
alternative hypothesis: true mean is not equal to 50
95 percent confidence interval:
50.89171 59.36943
sample estimates:
mean of x
55.13057
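Output of this form comes from a call to t.test(). The sketch below simulates similar data (the variable name emotion and the population parameters are taken from the example; the exact numbers will differ because the data are regenerated):

```r
set.seed(2016)
emotion <- rnorm(100, mean = 55, sd = 20)  # simulated sentiment temperatures
t.test(emotion, mu = 50)                   # test H0: population mean = 50
```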
Conclusion

・The 95% confidence interval for the temperature of sentiment towards the DPJ among all voters is between 50.9 degrees and 59.4 degrees.
→ “50 degrees” is not included in the 95% confidence interval for
sentiment temperature.
→ It cannot be said that the average sentiment temperature in the population is 50 degrees.
\[P(X = Head) = p\] \[P(X = Tail) = 1 - p\]
Binomial probabilities can be calculated with the dbinom() function.

[1] 0.2460938
[1] 0.0009765625 0.0097656250 0.0439453125 0.1171875000 0.2050781250
[6] 0.2460937500 0.2050781250 0.1171875000 0.0439453125 0.0097656250
[11] 0.0009765625
dbinom() returns not the number of times heads appears, but the probability of each possible number of heads.

# Specify the number of heads from 0 to 10, with 10 trials and a probability of 0.5 for heads
coin10 <- dbinom(0:10, 10, 0.5)
# "h" specifies vertical lines in the chart, lwd specifies the thickness of the lines
plot(0:10, coin10, type = "h", lwd = 5)
\[f(X) = {}_NC_X \, p^X (1 - p)^{N-X}\]
The combination \({}_NC_X\) corresponds to the choose() function in R.

[1] 252
[1] 0.0009765625
[1] 0.2460938
[1] 0.2460938
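The formula can be checked against dbinom(), reproducing the value 0.2460938 shown above:

```r
# Probability of exactly 5 heads in 10 fair-coin tosses, from the formula
p_manual <- choose(10, 5) * 0.5^5 * (1 - 0.5)^(10 - 5)
p_manual            # 0.2460938
dbinom(5, 10, 0.5)  # dbinom() gives the same value
```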
A coin toss takes discrete values; height, by contrast, is a continuous quantity. For continuous quantities, probability is not represented by a specific value, but by an interval that includes the value, expressed as an area.

Suppose heights are uniformly distributed between a minimum of 140 and a maximum of 185. Uniform random numbers can be extracted with runif(n, a, b), where \(a\) is the minimum value, \(b\) is the maximum value, and \(n\) is the number of extractions.

 [1] 162.8133 170.4660 181.2077 177.7272 161.2677 171.0087 176.0036 156.1114
[9] 147.5133 150.7378
[1] 141.0996 142.8775 171.5061 140.8381 167.1321 170.8180 177.8217 150.8456
[9] 151.2468 175.7331
set.seed() can be used to set a seed for random sampling.

 [1] 179.9503 179.6794 149.5198 162.7430 148.9673 175.3110 171.7291 168.4825
 [9] 154.0057 140.8939

Setting the same seed ensures that the same numbers are extracted every time the code is executed.

 [1] 179.9503 179.6794 149.5198 162.7430 148.9673 175.3110 171.7291 168.4825
[9] 154.0057 140.8939
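Reproducibility via the seed can be sketched directly (the seed value 123 is an arbitrary choice):

```r
set.seed(123)
a <- runif(5, 140, 185)  # first extraction
set.seed(123)
b <- runif(5, 140, 185)  # second extraction with the same seed
identical(a, b)          # TRUE: exactly the same random numbers
```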
rnorm() | normal distribution |
rt() | \(t\) distribution |
rchisq() | \(\chi^2\) (chi-square) distribution |

Each function takes arguments unique to it. For the normal distribution, the mean (mean) and standard deviation (sd) are specified inside the parentheses of rnorm(), while for the \(t\) distribution and the \(\chi^2\) distribution, the degrees of freedom (df) must be specified inside the parentheses.

Random numbers from a normal distribution are extracted with rnorm(n, mean = mu, sd = sigma).
To extract n = 100 numbers from a population normally distributed with a population mean (mu) of 160 and a population variance of 16 (so the standard deviation sigma = 4), enter as follows:

 [1] 159.3848 159.4990 164.7559 155.6251 153.1917 158.3015 163.9680 153.7831
[9] 158.1598 160.8625 161.2840 161.1273 164.9580 164.3071 157.1850 152.9570
[17] 157.9721 165.1655 161.9859 157.9363 159.3190 153.6560 160.8890 160.0436
[25] 152.7490 156.1923 166.4202 159.1252 161.6708 158.8871 158.8819 163.8937
[33] 154.8815 163.1115 159.2016 152.5330 157.0690 152.9178 158.2415 162.3454
[41] 160.8212 160.1693 162.7420 157.0018 162.6093 156.4365 163.4589 157.0134
[49] 161.0999 163.0928 154.6236 167.2826 165.1174 152.7759 156.5711 156.8450
[57] 160.3522 162.3860 159.9502 162.2307 159.0250 163.0924 164.7416 161.5864
[65] 157.7696 167.6563 159.5414 163.9407 164.6144 156.0849 156.9475 164.7797
[73] 156.1790 157.2057 157.2734 159.0295 157.1015 157.6920 159.8148 160.7410
[81] 164.8714 157.5850 160.9043 163.3799 157.6321 156.9417 154.8651 154.0923
[89] 161.5911 163.2804 165.0042 159.4423 163.6247 161.3810 159.2093 160.0146
[97] 153.1990 164.3284 150.2262 158.1045
0% 25% 50% 75% 100%
150.0572 157.2044 159.8816 163.0106 168.9840
\[E(X) = \sum_{i=1}^N p(X_i)\, X_i\]

For a fair six-sided die:

\[E(X) = \frac{1}{6}\times 1 + \frac{1}{6}\times 2 + \frac{1}{6}\times 3 + \frac{1}{6}\times 4 + \frac{1}{6}\times 5 + \frac{1}{6}\times 6 = 3.5\]
[1] 2.7
[1] 4.3
[1] 3.5204
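The outputs above are sample means of simulated die rolls; with more rolls, the sample mean gets closer to the expected value 3.5. A sketch (the numbers of rolls are illustrative):

```r
set.seed(1)
die <- 1:6
mean(sample(die, 10,     replace = TRUE))  # few rolls: can be far from 3.5
mean(sample(die, 100000, replace = TRUE))  # many rolls: very close to 3.5
```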
As the number of trials increases, the sample mean approaches the expected value (the law of large numbers).

Each possible outcome of a trial is called an "event". A coin toss takes discrete values.

flip
H T
2 8
coin <- c("H", "T")  # Create the coin: two faces, heads and tails
x <- numeric(20)     # Create a vector x filled with 20 zeros
for (i in 1:20) {    # Repeat 20 sets of 10 coin tosses
  flip <- sample(coin, 10, rep = TRUE)  # Toss the coin 10 times
  x[i] <- sum(flip == "H")              # Count the heads and store the count in x[i]
}
table(x)             # Tabulate the results in x
x
4 5 6 7 8
2 5 9 3 1
x <- numeric(10000)  # Create a vector x filled with 10,000 zeros
for (i in 1:10000) { # Repeat 10,000 sets of 10 coin tosses
  flip <- sample(coin, 10, rep = TRUE)  # Toss the coin 10 times
  x[i] <- sum(flip == "H")              # Count the heads and store the count in x[i]
}
table(x)             # Tabulate the results in x
x
0 1 2 3 4 5 6 7 8 9 10
13 110 427 1213 2021 2429 2072 1139 454 114 8
hist(x, breaks = 0:10) # Limit histogram intervals from 0 to 10
abline(v = mean(x), col = "red") # Draw a red vertical line at the mean
[1] 4.9975
Q1: Using R, randomly draw 3 samples (number of samples = 3) of sample size 5 from a population with a mean of 50 and a standard deviation of 10. Show the histogram, mean, and standard deviation for each of the 3 samples.
Q2: Using R, randomly draw 3 samples (number of samples = 3) of sample size 500 from a population with a mean of 50 and a standard deviation of 10. Show the histogram, mean, and standard deviation for each of the 3 samples.
Q3: What can be inferred from these two types of simulations?