Statistics can be broadly divided into two types: Descriptive Statistics and Inferential Statistics.
Types of Statistics | Characteristics |
---|---|
Descriptive Statistics: | Use statistical measures to grasp data trends and properties |
Inferential Statistics: | Test and estimate population parameters (like population mean and variance) |
The key idea of inferential statistics is that, by increasing the number of randomly drawn samples from the population and repeating trials infinitely, we can make inferences about the vast and unknown population from a part of it, the sample.
This section covers Descriptive Statistics, which is the foundational knowledge commonly used in both types of statistics.
Descriptive statistics are numerical summaries that encapsulate information contained in a variable.
R function | Details |
---|---|
mean() | mean |
median() | median |
var() | unbiased variance (note_1) |
sd() | unbiased standard deviation (note_1) |
IQR() | interquartile range (note_2) |
min() | minimum |
max() | maximum |
For unbiased variance and unbiased standard deviation (note_1), refer to the explanation in 10.1 Inferential Statistics (Basics).
[1] 22 33 44 55 66 77 88 99 100
\[\bar{x} = \frac{\sum_{i=1}^n x_i}{n}\]
[1] 64.88889
[1] 66
To find the median, arrange x in ascending order.
[1] 22 33 44 55 66 77 88 99 100
\[\text{Variance} = \frac{\sum_{i=1}^N (\text{each data point} - \text{mean})^2}{N}\]
\[σ^2 = \frac{\sum_{i=1}^N (x_i - \bar{x})^2}{N}\]
Display x.
[1] 22 33 44 55 66 77 88 99 100
Calculate the mean of x and name it x_mean.
[1] 64.88889
Calculate the deviation (each data point - mean) from the mean and name it x1.
[1] -42.888889 -31.888889 -20.888889 -9.888889 1.111111 12.111111 23.111111
[8] 34.111111 35.111111
For example, take the first element of x (22) minus the mean (64.88889).
[1] -42.88889
This matches the first deviation from the mean (x1).
Square the deviation from the mean (x1) and name it x2.
[1] 1839.456790 1016.901235 436.345679 97.790123 1.234568 146.679012
[7] 534.123457 1163.567901 1232.790123
Sum x2 and name it sum_x2.
[1] 6468.889
\[σ^2 = \frac{\sum_{i=1}^N (x_i - \bar{x})^2}{N} = \frac{6468.889}{9} = 718.77\]
→ This value differs from the result of var(x).
→ The reason: var() calculates not the population variance \(σ^2\), but the unbiased variance \(U^2\).
→ For more details, refer to 10.1 and 10.2.
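The steps above can be collected into one short script. This is a minimal sketch using the x from the text; pop_var and unbiased_var are names introduced here for illustration only.

```r
# Data used in the text
x <- c(22, 33, 44, 55, 66, 77, 88, 99, 100)

x_mean <- mean(x)      # mean of x
x1     <- x - x_mean   # deviations from the mean
x2     <- x1^2         # squared deviations
sum_x2 <- sum(x2)      # sum of squared deviations

# Hypothetical names: divide by N (population) or by N - 1 (unbiased)
pop_var      <- sum_x2 / length(x)
unbiased_var <- sum_x2 / (length(x) - 1)

pop_var        # variance with N in the denominator
unbiased_var   # variance with N - 1 in the denominator
var(x)         # agrees with unbiased_var, not with pop_var
```

Dividing by N - 1 instead of N is the only difference between the two results, which is why var() returns a larger value than the hand calculation.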
[1] 44
Use summary() to display the descriptive statistics of x.
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  22.00   44.00   66.00   64.89   88.00  100.00
Descriptive Statistics | Details |
---|---|
Min: | Minimum |
1st Qu.: | 1st quartile (25%) |
Median: | Median (50%) |
Mean: | Average |
3rd Qu.: | 3rd quartile (75%) |
Max: | Maximum |
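As a quick check of how the quartiles relate to the interquartile range, the sketch below recomputes IQR() from the 1st and 3rd quartiles of the same x. The object name q is an assumption made for this example.

```r
# Data used in the text
x <- c(22, 33, 44, 55, 66, 77, 88, 99, 100)

# 1st and 3rd quartiles (R's default quantile type)
q <- quantile(x, probs = c(0.25, 0.75))
q

# The interquartile range is the 3rd quartile minus the 1st quartile
q[["75%"]] - q[["25%"]]
IQR(x)   # same value
```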
dplyr
Load the tidyverse package, which includes the dplyr package.
Data Preparation (hr96-21.csv)
・Download hr96-21.csv to your computer
・Create a folder named data inside the RProject folder.
・Manually place the downloaded hr96-21.csv file into the data folder within the RProject folder.
・Read the file and name the data frame hr.
・na = "." is a command that means “interpret dots as missing values.”
hr96-21.csv contains data from the nine House of Representatives elections held since the introduction of single-seat constituencies in 1996 (years: 1996, 2000, 2003, 2005, 2009, 2012, 2014, 2017, 2021).
Using the dim() function, it is clear that hr has 9,660 rows and 22 columns.
[1] 9660 22
hr contains 22 variables.
Variable Name | Details |
---|---|
year | Election year (1996-2021) |
pref | Prefecture name |
ku | Single-member district name |
kun | Single-member district number |
rank | Order of election victory |
wl | Election result: 1 = elected in single-member district, 2 = elected through proportional representation, 0 = not elected |
nocand | Number of candidates |
seito | Political party affiliation of the candidate |
j_name | Candidate’s name (in Japanese) |
name | Candidate’s name (in English) |
previous | Number of previous victories (excluding the current election) |
gender | Candidate’s gender: male, female |
age | Candidate’s age |
exp | Election expenses used by the candidate (reported to the Ministry of Internal Affairs and Communications) |
status | Candidate’s status: 0 = non-incumbent, 1 = incumbent, 2 = former incumbent |
vote | Number of votes received |
voteshare | Vote share (%) |
eligible | Number of eligible voters in the single-member district |
turnout | Voting rate (%) in the single-member district |
seshu_dummy | Hereditary candidate dummy: 1 = hereditary, 0 = non-hereditary (inherited base or non-hereditary) |
jiban_seshu | Name and relation of the politician from whom the base was inherited |
nojiban_seshu | Name and relation of the hereditary politician |
summarise()
How to use summarise() function
data frame |>
summarise(new variable name = function(variable name, ...))
Calculate the mean of the number of candidates (nocand).
# A tibble: 1 × 1
`mean(nocand)`
<dbl>
1 3.89
Calculate the mean of election expenses (exp).
# A tibble: 1 × 1
`mean(exp)`
<dbl>
1 NA
→ The reason is that exp
contains missing values.
→ Try adding na.rm = TRUE
(an instruction to remove missing
values).
# A tibble: 1 × 1
`mean(exp, na.rm = TRUE)`
<dbl>
1 7551393.
→ Calculation successful!
Use summarise() to calculate the mean() of exp, age, and voteshare in hr.
# A tibble: 1 × 3
`mean(exp, na.rm = TRUE)` `mean(age, na.rm = TRUE)` mean(voteshare, na.rm = …¹
<dbl> <dbl> <dbl>
1 7551393. 51.2 27.7
# ℹ abbreviated name: ¹`mean(voteshare, na.rm = TRUE)`
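The results are displayed in data frame (tibble) format.
The columns can be given readable names inside summarise(), in the same spirit as the rename() function.
hr |>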
summarise(Mean_exp = mean(exp, na.rm = TRUE),
Mean_age = mean(age, na.rm = TRUE),
Mean_voteshare = mean(voteshare, na.rm = TRUE))
# A tibble: 1 × 3
Mean_exp Mean_age Mean_voteshare
<dbl> <dbl> <dbl>
1 7551393. 51.2 27.7
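The behaviour of na.rm = TRUE and of named columns inside summarise() can be tried without the election data. The sketch below uses a small made-up tibble called toy (hypothetical, not hr96-21.csv), and res is a name chosen for this example.

```r
library(dplyr)

# Hypothetical toy data; exp contains one missing value
toy <- tibble(
  exp = c(100, 200, NA, 400),
  age = c(40, 50, 60, 70)
)

res <- toy |>
  summarise(Mean_exp_raw = mean(exp),                # NA: the missing value propagates
            Mean_exp     = mean(exp, na.rm = TRUE),  # missing value dropped first
            Mean_age     = mean(age, na.rm = TRUE))

res
```

Naming each summary on the left-hand side of = keeps the output columns short and avoids the abbreviated backtick names seen earlier.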
group_by()
When calculating statistics such as mean() or sd() for each group, the dplyr package is very useful.
First, consider calculating the average election expenses (exp) for each general election without using dplyr.
To calculate the average election expenses (exp) for the 1996 general election, identified as ‘1996’, you would need to write it as follows.
na.rm = TRUE (an instruction to remove missing values) is added because exp contains missing values.
[1] 9136316
mean(hr$exp[hr$year == 1996], na.rm = TRUE)
mean(hr$exp[hr$year == 2000], na.rm = TRUE)
mean(hr$exp[hr$year == 2003], na.rm = TRUE)
mean(hr$exp[hr$year == 2005], na.rm = TRUE)
mean(hr$exp[hr$year == 2009], na.rm = TRUE)
mean(hr$exp[hr$year == 2012], na.rm = TRUE)
mean(hr$exp[hr$year == 2014], na.rm = TRUE)
mean(hr$exp[hr$year == 2017], na.rm = TRUE)
mean(hr$exp[hr$year == 2021], na.rm = TRUE)
The group_by() function from the dplyr package can remarkably simplify the command.
How to use group_by() function
data frame |>
group_by(variable name) |>
summarise(...)
Group the hr data by year and calculate the average of exp.
Placing group_by(year) before summarise() in the pipe means “calculate the average of exp for each year.”
# A tibble: 9 × 2
year Mean_exp
<dbl> <dbl>
1 1996 9136316.
2 2000 8388889.
3 2003 7935408.
4 2005 8142244.
5 2009 6118181.
6 2012 5769988.
7 2014 7440127.
8 2017 9298783.
9 2021 NaN
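The following sketch reproduces the same pattern on a small made-up tibble (toy and by_year are hypothetical names, and the numbers are invented). It also shows one way a NaN can arise: when every value in a group is missing, mean(..., na.rm = TRUE) has nothing left to average. Whether that is what happens for 2021 in hr is an assumption that should be checked against the data.

```r
library(dplyr)

# Hypothetical toy data; all exp values for 2021 are missing
toy <- tibble(
  year = c(1996, 1996, 2000, 2000, 2000, 2021),
  exp  = c(10, 30, 5, NA, 15, NA)
)

# One group_by() replaces one mean() call per year
by_year <- toy |>
  group_by(year) |>
  summarise(Mean_exp = mean(exp, na.rm = TRUE))

by_year

# Base R equivalent for a single year
mean(toy$exp[toy$year == 1996], na.rm = TRUE)
```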
The group_by() function can also group data by two or more variables.
Group the hr data by year and seito, and then calculate the average of exp.
# A tibble: 113 × 3
# Groups: year [9]
year seito Mean_exp
<dbl> <chr> <dbl>
1 1996 さわやか神戸・市民の会 NaN
2 1996 共産 3158354.
3 1996 国民党 NaN
4 1996 市民新党にいがた 3044160
5 1996 政事公団太平会 5540
6 1996 文化フォーラム NaN
7 1996 新党さきがけ 13030901
8 1996 新社会 4545681
9 1996 新進 12395369.
10 1996 日本新進 1898969
# ℹ 103 more rows
There are many political parties (seito), so let’s simplify the calculation by focusing only on the two parties: the Liberal Democratic Party (自民) and the Democratic Party (民主).
 [1] "新進" "自民" "民主"
[4] "共産" "文化フォーラム" "国民党"
[7] "無所" "自由連合" "政事公団太平会"
[10] "新社会" "社民" "新党さきがけ"
[13] "沖縄社会大衆党" "市民新党にいがた" "緑の党"
[16] "さわやか神戸・市民の会" "民主改革連合" "青年自由"
[19] "日本新進" "公明" "諸派"
[22] "保守" "無所属の会" "自由"
[25] "改革クラブ" "保守新" "ニューディールの会"
[28] "新党尊命" "世界経済共同体党" "新党日本"
[31] "国民新党" "新党大地" "幸福"
[34] "みんな" "改革" "日本未来"
[37] "日本維新の会" "当たり前" "政治団体代表"
[40] "安楽死党" "アイヌ民族党" "次世"
[43] "維新" "生活" "立憲"
[46] "希望" "緒派" ""
[49] "N党" "国民" "れい"
hr |>
group_by(year, seito) |>
summarise(Mean_exp = mean(exp, na.rm = TRUE)) |>
filter(seito == "自民" | seito == "民主")
# A tibble: 16 × 3
# Groups: year [9]
year seito Mean_exp
<dbl> <chr> <dbl>
1 1996 民主 9961458.
2 1996 自民 14460093.
3 2000 民主 10207109.
4 2000 自民 14251423.
5 2003 民主 9772028.
6 2003 自民 12821990.
7 2005 民主 9574473.
8 2005 自民 12710075.
9 2009 民主 7802585.
10 2009 自民 11374974.
11 2012 民主 7728045.
12 2012 自民 9335490.
13 2014 民主 9757272
14 2014 自民 11450459.
15 2017 自民 12338476.
16 2021 自民 NaN
Use the ggplot() function to visualize the average election expenses of candidates from both the Liberal Democratic Party and the Democratic Party.
hr %>%
group_by(year, seito) |>
summarise(Mean_exp = mean(exp, na.rm = TRUE)) |>
filter(seito == "自民" | seito == "民主") |>
ggplot(aes(x = year,
y = Mean_exp,
color = seito)) +
geom_line(aes(group = seito),
linewidth = 1) +
geom_point(aes(shape = seito),
size = 3) +
labs(x = "Year",
y = "Election expense (yen)",
color = "Party",
shape = "Party") +
ggtitle("Trends in Election Expenses in General Elections (LDP & DPJ)") +
scale_color_manual(values = c("民主" = "orangered",
"自民" = "limegreen")) +
theme_bw(base_family = "HiraKakuProN-W3") # Settings for ggplot to avoid character garbling (for Mac users only)
hr |>
filter(seito == "自民" | seito == "民主") |>
group_by(seito) |>
summarise(Mean_exp = mean(exp, na.rm = TRUE))
# A tibble: 2 × 2
seito Mean_exp
<chr> <dbl>
1 民主 9096866.
2 自民 12456211.
n()
To count the number of candidates in each group, add n() to the command.
hr |>
filter(seito == "自民" | seito == "民主") |>
group_by(year, seito) |>
summarise(Mean_exp = mean(exp, na.rm = TRUE),
Cand_number = n())
# A tibble: 16 × 4
# Groups: year [9]
year seito Mean_exp Cand_number
<dbl> <chr> <dbl> <int>
1 1996 民主 9961458. 143
2 1996 自民 14460093. 288
3 2000 民主 10207109. 242
4 2000 自民 14251423. 271
5 2003 民主 9772028. 267
6 2003 自民 12821990. 277
7 2005 民主 9574473. 289
8 2005 自民 12710075. 290
9 2009 民主 7802585. 271
10 2009 自民 11374974. 291
11 2012 民主 7728045. 264
12 2012 自民 9335490. 289
13 2014 民主 9757272 178
14 2014 自民 11450459. 283
15 2017 自民 12338476. 276
16 2021 自民 NaN 277
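A minimal sketch of n() on invented data follows; toy, party, and counts are hypothetical names. n() counts the rows in each group, so a row is counted even when its exp is missing.

```r
library(dplyr)

# Hypothetical toy data; one candidate has a missing exp
toy <- tibble(
  year  = c(1996, 1996, 1996, 2000, 2000),
  party = c("A", "A", "B", "A", "B"),
  exp   = c(10, 20, NA, 30, 40)
)

counts <- toy |>
  group_by(year, party) |>
  summarise(Mean_exp    = mean(exp, na.rm = TRUE),
            Cand_number = n(),          # rows per year-party group
            .groups = "drop")           # return an ungrouped tibble

counts
```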
Use the ggplot() function to visualize the number of candidates for both the Liberal Democratic Party and the Democratic Party in each general election.
The number of candidates is displayed on the graph using the geom_text() function.
hr %>%
filter(seito == "自民" | seito == "民主") |>
group_by(year, seito) |>
summarise(Mean_exp = mean(exp, na.rm = TRUE),
Cand_number = n()) |>
ggplot(aes(x = year,
y = Cand_number,
color = seito)) +
geom_line(aes(group = seito),
linewidth = 1) +
geom_point(aes(shape = seito),
size = 3) +
labs(x = "Year",
y = "Number of Candidates",
color = "Party",
shape = "Party") +
geom_text(aes(y = Cand_number + 3, label = Cand_number),
size = 4, vjust = 0) +
ggtitle("Trends in the Number of Candidates in General Elections (LDP & DPJ)") +
scale_color_manual(values = c("民主" = "orangered",
"自民" = "limegreen")) +
theme_bw(base_family = "HiraKakuProN-W3")
# A tibble: 2 × 2
seito Cand_number
<chr> <int>
1 民主 1654
2 自民 2542
across()
The across() function allows you to calculate multiple descriptive statistics for multiple variables with shorter code.
 [1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu"
To calculate the mean and standard deviation of the variables from previous to exp in hr without across(), you would need to enter the following code.
hr |>
summarise(previous_Mean = mean(previous, na.rm = TRUE),
previous_SD = sd(previous, na.rm = TRUE),
age_Mean = mean(age, na.rm = TRUE),
age_SD = sd(age, na.rm = TRUE),
exp_Mean = mean(exp, na.rm = TRUE),
exp_SD = sd(exp, na.rm = TRUE))
The result is a one-row tibble with six columns (previous_Mean, previous_SD, age_Mean, age_SD, exp_Mean, exp_SD); the means are 1.48 for previous, 51.2 for age, and 7551393 for exp.
The same calculation can be written more concisely with the across() function.
hr |>
summarise(across(previous:exp,
.fns = list(Mean = ~mean(.x, na.rm = TRUE),
SD = ~sd(.x, na.rm = TRUE))))
The output has the six columns previous_Mean, previous_SD, age_Mean, age_SD, exp_Mean, and exp_SD.
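To see the column naming that across() produces, here is a small sketch on invented data (toy and stats are hypothetical names). Each function in the .fns list is applied to every selected column, and the list names become the column suffixes.

```r
library(dplyr)

# Hypothetical toy data with the same three column names as in hr
toy <- tibble(
  previous = c(0, 1, 2, 3),
  age      = c(30, 40, 50, 60),
  exp      = c(100, NA, 300, 500)
)

stats <- toy |>
  summarise(across(previous:exp,
                   .fns = list(Mean = ~mean(.x, na.rm = TRUE),
                               SD   = ~sd(.x, na.rm = TRUE))))

stats
names(stats)   # previous_Mean, previous_SD, age_Mean, ...
```

Using sd() for the SD entry matters: if mean() is written in both places, the Mean and SD columns come out identical.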
mutate()
The mutate() function enables us to make a new variable.
How to use mutate() function
data frame |>
mutate(new variable name = formula)
To create a new variable, election expenses per eligible voter (exppv), by dividing the election expenses (exp) in hr by the number of eligible voters (eligible), the code would look like this without using dplyr.
 [1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu" "exppv"
exppv is added to the data.
Using dplyr::mutate(), the code would be as follows.
# A tibble: 9,660 × 23
year pref ku kun wl rank nocand seito j_name gender name previous
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 1996 愛知 aichi 1 1 1 7 新進 河村… male KAWA… 2
2 1996 愛知 aichi 1 0 2 7 自民 今枝… male IMAE… 2
3 1996 愛知 aichi 1 0 3 7 民主 佐藤… male SATO… 2
4 1996 愛知 aichi 1 0 4 7 共産 岩中… female IWAN… 0
5 1996 愛知 aichi 1 0 5 7 文化… 伊東… female ITO,… 0
6 1996 愛知 aichi 1 0 6 7 国民… 山田浩 male YAMA… 0
7 1996 愛知 aichi 1 0 7 7 無所 浅野… male ASAN… 0
8 1996 愛知 aichi 2 1 1 8 新進 青木… male AOKI… 2
9 1996 愛知 aichi 2 0 2 8 自民 田辺… male TANA… 0
10 1996 愛知 aichi 2 2 3 8 民主 古川… male FURU… 0
# ℹ 9,650 more rows
# ℹ 11 more variables: age <dbl>, exp <dbl>, status <dbl>, vote <dbl>,
# voteshare <dbl>, eligible <dbl>, turnout <dbl>, seshu_dummy <dbl>,
# jiban_seshu <chr>, nojiban_seshu <chr>, exppv <dbl>
exppv is not visible on the output screen.
This is because exppv, being the newly added variable, is the last column and doesn’t fit on the screen.
exppv has been added successfully and is included among the variable names that are truncated at the bottom of the output screen.
 [1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu" "exppv"
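The two approaches can be compared side by side on a made-up tibble. This is only a sketch: toy, toy_base, and toy_dplyr are hypothetical names, and the values are invented.

```r
library(dplyr)

# Hypothetical toy data with the two columns used in the text
toy <- tibble(
  exp      = c(1000, 2000, NA),
  eligible = c(100, 400, 500)
)

# Without dplyr: assign a new column with $
toy_base <- toy
toy_base$exppv <- toy_base$exp / toy_base$eligible

# With dplyr: mutate() returns a new data frame with the extra column
toy_dplyr <- toy |>
  mutate(exppv = exp / eligible)

names(toy_dplyr)   # the new variable is appended as the last column
toy_dplyr
```

Note that mutate() does not change toy itself; the result has to be assigned to a name to be kept.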
if_else()
if_else() is a function used in re-coding a variable.
How to use if_else() function
if_else(condition, return value if TRUE, return value if FALSE)
status in hr has three values: 0, 1, 2.
Variables | Details |
---|---|
status | Candidate’s status: 0 = non-incumbent, 1 = incumbent, 2 = former incumbent |
Group by the status variable and calculate the average values of vote and exp.
hr |>
group_by(status) |>
summarise(Mean_vote = mean(vote, na.rm = TRUE),
Mean_exp = mean(exp, na.rm = TRUE))
# A tibble: 3 × 3
status Mean_vote Mean_exp
<dbl> <dbl> <dbl>
1 0 33252. 5110084.
2 1 89259. 11545342.
3 2 69632. 9335640.
Create a current officeholder dummy variable inc using status.
Set the inc values to "incumbent" when status is 1 and "non-incumbent" otherwise.
Use the table() and unique() functions to check if the variable inc has been correctly created.
incumbent non-incumbent
0 0 5517
1 3510 0
2 0 633
[1] "incumbent" "non-incumbent"
When re-coding, it is preferable not to overwrite an existing variable (in this case, status), but to add it as a new variable (in this case, inc).
Group by the inc variable and calculate the average values of vote and exp.
hr |>
group_by(inc) |>
summarise(Mean_vote = mean(vote, na.rm = TRUE),
Mean_exp = mean(exp, na.rm = TRUE))
# A tibble: 2 × 3
inc Mean_vote Mean_exp
<chr> <dbl> <dbl>
1 incumbent 89259. 11545342.
2 non-incumbent 36996. 5482955.
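The re-coding step and the check with table() and unique() might look like the sketch below. It runs on an invented status vector (toy is a hypothetical name), so the counts differ from those in hr.

```r
library(dplyr)

# Hypothetical toy data: 0 = non-incumbent, 1 = incumbent, 2 = former incumbent
toy <- tibble(status = c(0, 1, 2, 1, 0))

# Add inc as a new variable instead of overwriting status
toy <- toy |>
  mutate(inc = if_else(status == 1, "incumbent", "non-incumbent"))

# Cross-tabulate the old and new variables to confirm the re-coding
table(toy$status, toy$inc)
unique(toy$inc)
```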
case_when()
Re-coding a variable into three or more values is cumbersome with the if_else() function.
In such cases, use the case_when() function.
How to use case_when() function
data frame |>
mutate(new variable name = case_when(condition_1 ~ new value,
condition_2 ~ new value,
......
TRUE ~ new value))
TRUE ~ new value means the “value for when none of the above conditions are met.”
Re-code the values of status into English labels (incumbent, former_incumbent, challenger) and add the result as the variable status_E.
hr %>%
mutate(status_E = case_when(status == 1 ~ "incumbent",
status == 2 ~ "former_incumbent",
TRUE ~ "challenger")) %>%
group_by(Cand_status = status_E) %>%
summarise(vs_mean = mean(vote, na.rm = TRUE),
cand_number = n())
# A tibble: 3 × 3
Cand_status vs_mean cand_number
<chr> <dbl> <int>
1 challenger 33252. 5517
2 former_incumbent 69632. 633
3 incumbent 89259. 3510
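The same pattern on a few invented rows makes the evaluation order easier to follow: conditions are checked from top to bottom, and TRUE catches everything that is left. In this sketch, toy and res are hypothetical names and the vote values are made up.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  status = c(0, 1, 2, 1, 0),
  vote   = c(10, 50, 30, 70, 20)
)

res <- toy |>
  mutate(status_E = case_when(status == 1 ~ "incumbent",
                              status == 2 ~ "former_incumbent",
                              TRUE        ~ "challenger")) |>  # everything else
  group_by(status_E) |>
  summarise(Mean_vote   = mean(vote),
            cand_number = n())

res
```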
%in%
seito contains the party names in Japanese lower house elections.
Variable Name | Details |
---|---|
seito | Candidate’s Affiliated Political Party |
Extract the 2021 general election data from hr, name it hr2021, and check the values of seito.
 [1] "自民" "立憲" "N党" "国民" "維新" "共産" "れい" "無所" "社民" "諸派"
[11] "公明"
hr2021_1 <- hr2021 %>%
mutate(party = case_when(seito == "自民" ~ "LDP",
seito == "公明" ~ "CGP",
seito == "立憲" ~ "CDP",
seito == "維新" ~ "ISHIN",
seito == "国民" ~ "KOKUMIN",
seito == "N党" ~ "N-PARTY",
seito == "れい" ~ "REIWA",
seito == "社民" ~ "SDP",
seito == "共産" ~ "JCP",
seito == "無所" ~ "IND",
TRUE ~ "SHOHA")) %>%
group_by(PARTY = party) %>%
summarise(total_votes = sum(vote, na.rm = TRUE),
number_of_cand = n(),
ave_vote_per_cand = mean(vote, na.rm = TRUE))
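With case_when(), you need to write a line (or TRUE ~ "") for every condition.
Next, consider a simpler re-coding:
・If seito is either LDP or CGP, change it to Ruling.
・Otherwise, change it to Opposition.
・Name the grouping variable R_or_O.
hr2021_2 <- hr2021 %>%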
mutate(party = case_when(seito == "自民" ~ "Ruling",
seito == "公明" ~ "Ruling",
TRUE ~ "Opposition")) %>%
group_by(R_or_O = party) %>%
summarise(total_votes = sum(vote, na.rm = TRUE),
number_of_cand = n(),
average_votes_per_cand = mean(vote, na.rm = TRUE))
# A tibble: 2 × 4
R_or_O total_votes number_of_cand average_votes_per_cand
<chr> <dbl> <int> <dbl>
1 Opposition 28957940 571 50714.
2 Ruling 28499088 286 99647.
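In such cases, the %in% operator can be used.
If a value on the left-hand side (here, hr2021_1$PARTY) is included in the set on the right-hand side => TRUE.
Check whether PARTY in hr2021_1 corresponds to either LDP or CGP.
[1] FALSE TRUE FALSE FALSE FALSE FALSE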
The first is FALSE
=> this candidate is neither
from LDP nor CGP.
The second is TRUE
=> this candidate is either
from LDP or CGP.
This can be used as the conditional expression in the
if_else()
function.
hr2021 %>%
mutate(party = if_else(seito %in% c("自民", "公明"), "Ruling", "Opposition")) %>%
group_by(R_or_O = party) %>%
summarise(total_votes = sum(vote, na.rm = TRUE),
number_of_cand = n(),
average_votes_per_cand = mean(vote, na.rm = TRUE))
# A tibble: 2 × 4
R_or_O total_votes number_of_cand average_votes_per_cand
<chr> <dbl> <int> <dbl>
1 Opposition 28957940 571 50714.
2 Ruling 28499088 286 99647.
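A small sketch of %in% on its own and inside if_else() is shown below. The vector party, the tibble toy, and the vote counts are all invented for illustration.

```r
library(dplyr)

# Hypothetical party labels
party <- c("CDP", "CGP", "IND", "LDP")

# One TRUE/FALSE per element: is it one of the listed values?
party %in% c("LDP", "CGP")

# The logical vector can serve directly as the condition of if_else()
toy <- tibble(party = party, vote = c(100, 50, 30, 200))

res <- toy |>
  mutate(R_or_O = if_else(party %in% c("LDP", "CGP"), "Ruling", "Opposition")) |>
  group_by(R_or_O) |>
  summarise(total_votes    = sum(vote),
            number_of_cand = n())

res
```

Compared with chaining several == comparisons with |, listing the values in c() stays readable as the number of values grows.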
df1 <- tibble(id = 1:10,
age = c(19, 20, 22, 999, 35, 45, 50, 60, 70, 999),
college = c(1, 0, 0, 1, 1, 99, 1, 1, 99, 0))
df1
# A tibble: 10 × 3
id age college
<int> <dbl> <dbl>
1 1 19 1
2 2 20 0
3 3 22 0
4 4 999 1
5 5 35 1
6 6 45 99
7 7 50 1
8 8 60 1
9 9 70 99
10 10 999 0
Variables | Details |
---|---|
id: | |
age: | 999 = missing values |
college: | 1 = college graduate, 0 = otherwise, 99 = missing values |
[1] 231.9
[1] 20.3
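Create a new variable, AGE, and set it to 1 if age is 39 or below, otherwise 0.
If age is 999, set it to NA.
Add the variable with mutate() using case_when().
Error message
Error in mutate(): ! Problem while computing AGE = case_when(age == 999 ~ NA, age <= 39 ~ 1, TRUE ~ 0).
Caused by error in names(message) <- `*vtmp*`: ! ‘names’ attribute [1] must be the same length as the vector [0]
The cause of the error is age == 999 ~ NA.
In tidyverse, the NA used in case_when() must have the same type as the other return values; two types of NA are relevant here: NA_real_ for numeric values and NA_character_ for character values.
Since the values of AGE are numerical, AGE is of numeric type.
Therefore, write age == 999 ~ NA_real_.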
# A tibble: 10 × 4
id age college AGE
<int> <dbl> <dbl> <dbl>
1 1 19 1 1
2 2 20 0 1
3 3 22 0 1
4 4 999 1 NA
5 5 35 1 1
6 6 45 99 0
7 7 50 1 0
8 8 60 1 0
9 9 70 99 0
10 10 999 0 NA
Next, create a new variable, COLLEGE, and specify it as “university graduate or higher” if college is 1, “less than university graduate” if 0, and NA if 99.
df1 %>%
mutate(AGE = case_when(age == 999 ~ NA_real_,
age <= 39 ~ 1,
TRUE ~ 0),
COLLEGE = case_when(college == 99 ~ NA,
college == 1 ~ "university graduate or higher",
TRUE ~ "less than university graduate"))
Error message
Error in mutate(): ! Problem while computing
COLLEGE = case_when(edu) == 1 ~ "university graduate or higher, TRUE ~ "less than university graduate").
Caused by error:
The cause of the error is college == 99 ~ NA.
Since the values of COLLEGE are character type, COLLEGE is of character type.
Therefore, write college == 99 ~ NA_character_.
df1 %>%
mutate(AGE = case_when(age == 999 ~ NA_real_,
age <= 39 ~ 1,
TRUE ~ 0),
COLLEGE = case_when(college == 99 ~ NA_character_,
college == 1 ~ "university graduate or higher",
TRUE ~ "less than university graduate"))
# A tibble: 10 × 5
id age college AGE COLLEGE
<int> <dbl> <dbl> <dbl> <chr>
1 1 19 1 1 university graduate or higher
2 2 20 0 1 less than university graduate
3 3 22 0 1 less than university graduate
4 4 999 1 NA university graduate or higher
5 5 35 1 1 university graduate or higher
6 6 45 99 0 <NA>
7 7 50 1 0 university graduate or higher
8 8 60 1 0 university graduate or higher
9 9 70 99 0 <NA>
10 10 999 0 NA less than university graduate
Points on missing values (NAs)
・When missing values are numeric
var == value ~ NA_real_
・When missing values are character
var == value ~ NA_character_
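To illustrate why the placeholder codes need to be re-coded before summarising, the sketch below uses the df1 from the text and compares the mean of age with and without the 999 codes. The column name age_clean is an assumption introduced for this example.

```r
library(dplyr)

# df1 as defined in the text
df1 <- tibble(id = 1:10,
              age = c(19, 20, 22, 999, 35, 45, 50, 60, 70, 999),
              college = c(1, 0, 0, 1, 1, 99, 1, 1, 99, 0))

# Hypothetical new column: 999 becomes a numeric NA, other ages are kept
df1_clean <- df1 |>
  mutate(age_clean = case_when(age == 999 ~ NA_real_,
                               TRUE       ~ age))

mean(df1$age)                               # inflated by the two 999 codes
mean(df1_clean$age_clean, na.rm = TRUE)     # mean of the eight valid ages
```

Because age is a double column, NA_real_ matches its type, so the two branches of case_when() can be combined without a type error.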
Use the group_by() and ggplot() functions to calculate and visualize the average age of candidates in the general elections (1996-2021) for all political parties with a line graph.
In seito, you find the following Japanese notations on the three parties:
Variable | Details |
---|---|
wl = 0 | Not elected |
wl = 1 | Elected in a single-member district |
wl = 2 | Elected through proportional representation |
Use the if_else() function to create a single-member district election dummy (wlsmd), with 1 for candidates elected in single-member districts and 0 for those not elected.
Use the table() and unique() functions to check if the wlsmd variable is correctly created.
Use the filter() function to extract data from the 2021 general election, and the if_else() function to create a Liberal Democratic Party dummy (ldp) with 1 for LDP candidates and 0 for others.
Use the table() and unique() functions to check if the ldp variable is correctly created.
Use the filter() function to extract data from the 2021 general election, and the mutate(), if_else(), %in%, and group_by() functions to compare and display the average age of ruling party candidates (combined LDP and Komeito) and opposition candidates.
Use the filter() function to extract data from the 2021 general election, and the mutate(), if_else(), %in%, and group_by() functions to compare and display the average age of elected candidates (including proportional representation winners) from the ruling parties (combined LDP and Komeito) and the opposition parties.
Use the filter() function to extract data from the 2021 general election, and the mutate(), case_when(), and group_by() functions to compare and display the average age of elected candidates (including proportional representation winners) from the LDP, Komeito, Constitutional Democratic Party, and other opposition parties.