1. Statistics

  • Statistics can be classified into two types: Descriptive Statistics and Inferential Statistics.
Types of Statistics Characteristics
Descriptive Statistics: Use statistical measures to grasp data trends and properties
Inferential Statistics: Test and estimate population parameters (like population mean and variance)

Descriptive Statistics

  • Descriptive statistics involves calculating statistical measures (like mean, variance, standard deviation) of collected data to describe and summarize it in a form that is understandable to us.
  • This approach reveals the distribution of data, helping to grasp trends and properties of the data using statistical methods.

Inferential Statistics

  • The skill of predicting and estimating unobserved events using available descriptive statistics is known as “Inferential Statistics.”
  • It involves randomly extracting a sample from the population, and using statistics such as the sample mean and unbiased variance obtained from this sample to test and estimate population parameters (like population mean and variance).
  • The basic idea of inferential statistics is that by increasing the number of randomly drawn samples from the population and repeating trials infinitely, we can infer about the vast and unknown population from a part of it, the sample.
  • In inferential statistics, it is assumed that the subject of analysis follows a probability distribution.
  • Here, we will explain and practice Descriptive Statistics, which is the foundational knowledge commonly used in both types of statistics.

2. Descriptive statistics

  • Descriptive statistics are numerical summaries that encapsulate information contained in a variable.

Typical Descriptive Statistics

R function Details
mean() mean
median() median
var() unbiased variance (note_1)
sd() unbiased standard deviation (note_1)
IQR() interquartile range (note_2)
min() minimum
max() maximum
  • Note 1: For details on unbiased variance and unbiased standard deviation, refer to the explanation in 10.1 Inferential Statistics (Basics).
  • Note 2: The interquartile range is the difference between the 75th percentile and the 25th percentile.

Suppose 9 students took the TOEFL (iBT).

  • Let’s assign the variable name x to the test scores of these 9 students.
x <- c(22, 33, 44, 55, 66, 77, 88, 99, 100)
x
[1]  22  33  44  55  66  77  88  99 100
  • We will calculate descriptive statistics for x: measures of central tendency (mean and median) and measures of spread (variance, standard deviation, interquartile range, and range).

2.1 Mean

  • The mean of the variable x (\(\bar{x}\)) can be calculated using the following formula:

\[\bar{x} = \frac{\sum_{i=1}^n x_i}{n}\]

How to calculate the mean using R: Method 1

  • The most labor-intensive method
(22 + 33 + 44 + 55 + 66 + 77 + 88 + 99 + 100) / 9
[1] 64.88889

How to calculate the mean using R: Method 2

  • Using sum( ) and length( )
sum(x) / length(x)
[1] 64.88889

How to calculate the mean using R: Method 3

  • Using mean( ) function
mean(x)
[1] 64.88889

How to calculate the mean using R: Method 4

  • Using summary() function
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  22.00   44.00   66.00   64.89   88.00  100.00 

2.2 Median (the middle value)

median(x)
[1] 66
  • Sort x in ascending order.
sort(x)
[1]  22  33  44  55  66  77  88  99 100
  • 66 is the median.
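The example above has an odd number of values, so there is a single middle value. As a small sketch of the even case (the vector y and the extra score 110 are hypothetical, added only for illustration), median() averages the two middle values:

```r
# Scores from the text, plus one hypothetical extra score so that n is even
x <- c(22, 33, 44, 55, 66, 77, 88, 99, 100)
y <- c(x, 110)

# With 10 sorted values, the two middle values are the 5th and 6th
middle_two <- sort(y)[c(5, 6)]

# median() returns the average of those two values
median_even <- median(y)
median_even
```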

2.3 Variance (Measure of Spread)

  • Variance is a measure used to represent the degree of spread in numerical data.
  • Variance is defined as the average of the squared differences from the mean.
  • If many data points are far from the mean → larger variance.
  • If most data points are close to the mean → smaller variance.

Variance

  • Generally, variance is calculated using the following formula:

\[Variance = \frac{\sum_{i=1}^N (each\_data\_point - mean)^2}{N}\]

  • Writing the variance as \(σ^2\) (sigma squared), each data point as \(x_i\), and the mean as \(\bar{x}\), the formula becomes:

\[σ^2 = \frac{\sum_{i=1}^N (x_i - \bar{x})^2}{N}\]

Calculating variance of x.

x
[1]  22  33  44  55  66  77  88  99 100
  • Calculate the mean of x and name it x_mean.
x_mean <- mean(x)
x_mean
[1] 64.88889
  • Compute the deviation (each data point - mean) from the mean and name it x1.
x1 <- x - x_mean
x1
[1] -42.888889 -31.888889 -20.888889  -9.888889   1.111111  12.111111  23.111111
[8]  34.111111  35.111111
  • For instance, the number next to [1] (-42.888889) is the value of the first data point of x (22) minus the mean (64.88889).
22 - 64.88889 
[1] -42.88889
  • This is the deviation from the mean (x1).
  • Next, square this deviation from the mean (x1) and name it x2.
x2 <- x1^2
x2
[1] 1839.456790 1016.901235  436.345679   97.790123    1.234568  146.679012
[7]  534.123457 1163.567901 1232.790123
  • Calculate the sum of x2 and name it sum_x2.
sum_x2 <- sum(x2)
sum_x2
[1] 6468.889
  • This completes the numerator of the variance formula.
  • For the denominator (N), use 9.
  • The variance of x \(\sigma^2\) can be calculated as follows:

\[σ^2 = \frac{\sum_{i=1}^N (x_i - \bar{x})^2}{N} = \frac{6468.889}{9} = 718.77\]

Using R function var() to calculate variance

var(x)
[1] 808.6111

The variance calculated above (718.77) differs from this result.

→ The reason: var() calculates not the population variance \(σ^2\), but the unbiased variance \(U^2\).
→ For more details, refer to 10.1 and 10.2.
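As a hedged sketch of the difference described above: the only change between the two quantities is the denominator, N for the population variance and N - 1 for the unbiased variance. The variable names below (sum_sq, pop_var, unbiased_var) are chosen for this illustration.

```r
x <- c(22, 33, 44, 55, 66, 77, 88, 99, 100)
n <- length(x)

# Numerator shared by both formulas: sum of squared deviations from the mean
sum_sq <- sum((x - mean(x))^2)

pop_var      <- sum_sq / n        # population variance (divide by N)
unbiased_var <- sum_sq / (n - 1)  # unbiased variance (divide by N - 1)

# var() and sd() use the N - 1 denominator
all.equal(unbiased_var, var(x))
all.equal(sqrt(unbiased_var), sd(x))
```

Both all.equal() calls should return TRUE, which confirms that var() divides by N - 1 and that sd() is the square root of that value.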

2.5 Interquartile Range

  • The interquartile range is the difference between the 75th percentile and the 25th percentile.
IQR(x)
[1] 44
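To make the definition concrete, here is a minimal sketch that computes the same value from the two percentiles with quantile(), using R's default quantile method. The name iqr_manual is introduced only for this example.

```r
x <- c(22, 33, 44, 55, 66, 77, 88, 99, 100)

# 25th and 75th percentiles (R's default quantile method)
q <- quantile(x, probs = c(0.25, 0.75))
q

# Interquartile range = 75th percentile - 25th percentile
iqr_manual <- unname(q[2] - q[1])
iqr_manual
```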

2.6 Range

max(x) - min(x)
[1] 78
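As an alternative sketch, base R's range() returns the minimum and maximum together, and diff() takes their difference. The names r and range_width are illustrative.

```r
x <- c(22, 33, 44, 55, 66, 77, 88, 99, 100)

# range() returns a vector of length 2: c(minimum, maximum)
r <- range(x)
r

# diff() subtracts the first element from the second: maximum - minimum
range_width <- diff(r)
range_width
```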

2.7 summary()

  • Display a summary of the statistical measures for the variable x.
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  22.00   44.00   66.00   64.89   88.00  100.00 
  • The following information is displayed for each variable.
Descriptive Statistics Details
Min.: Minimum
1st Qu.: 1st Quartile (25%)
Median: Median (50%)
Mean: Mean
3rd Qu.: 3rd Quartile (75%)
Max.: Maximum

3. dplyr

  • Load the tidyverse package, which includes the dplyr package.
library(tidyverse)

Data Preparation (hr96-21.csv)

  • Download hr96-21.csv to your computer.

  • Create a folder named data inside the RProject folder.
  • Manually place the downloaded hr96-21.csv file into the data folder within the RProject folder.
  • Read the election data and name the data frame as hr.
  • na = "." is an argument that means “interpret dots as missing values.”
hr <- read_csv("data/hr96-21.csv",
               na = ".")    
  • hr96-21.csv contains data from the nine House of Representatives elections held since the introduction of single-seat constituencies in 1996 (years: 1996, 2000, 2003, 2005, 2009, 2012, 2014, 2017, 2021).
  • Verify the read election data.
  • Using the dim() function, it is clear that hr has 9,660 rows and 22 columns.
dim(hr)
[1] 9660   22
  • hr contains 22 variables.
Variable Name Details
year Election year (1996-2021)
pref Prefecture name
ku Single-member district name
kun Single-member district
rank Order of election victory
wl Election result: 1 = elected in single-member district, 2 = re-elected, 0 = not elected
nocand Number of candidates
seito Political party affiliation of the candidate
j_name Candidate’s name (in Japanese)
name Candidate’s name (in English)
previous Number of previous victories (excluding the current election)
gender Candidate’s gender: male, female
age Candidate’s age
exp Election expenses used by the candidate (reported to the Ministry of Internal Affairs and Communications)
status Candidate’s status: 0 = non-incumbent, 1 = incumbent, 2 = former incumbent
vote Number of votes received
voteshare Vote share (%)
eligible Number of eligible voters in the single-member district
turnout Voter turnout (%) in the single-member district
seshu_dummy Hereditary candidate dummy: 1 = hereditary, 0 = non-hereditary (inherited base or non-hereditary)
jiban_seshu Name and relation of the politician from whom the base was inherited
nojiban_seshu Name and relation of the hereditary politician

3.1 summarise()

How to use summarise() function.

data frame |>
  summarise(new variable name = function(variable), ...)

  • For instance, suppose you want to calculate the average number of candidates (nocand)
hr |> 
  summarise(mean(nocand))
# A tibble: 1 × 1
  `mean(nocand)`
           <dbl>
1           3.89
  • From the 1996 to 2021 general elections, the average number of candidates per single-member district was 3.89.
  • Let’s calculate the average election expenses (exp).
hr |> 
  summarise(mean(exp))
# A tibble: 1 × 1
  `mean(exp)`
        <dbl>
1          NA

The average election expense is not calculated!

The reason is that exp contains missing values.
→ Try adding na.rm = TRUE (an instruction to remove missing values).

hr |> 
  summarise(mean(exp, 
    na.rm = TRUE))
# A tibble: 1 × 1
  `mean(exp, na.rm = TRUE)`
                      <dbl>
1                  7551393.

→ Calculation successful!
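The same behaviour can be reproduced on a tiny vector. This is only an illustrative sketch: the vector v is hypothetical and is not taken from hr.

```r
# Hypothetical values with one missing entry
v <- c(10, 20, NA, 30)

mean_with_na    <- mean(v)                # NA: a single missing value makes the mean unknown
mean_without_na <- mean(v, na.rm = TRUE)  # missing values are dropped first: (10 + 20 + 30) / 3

# Counting the missing values before dropping them is a useful check
n_missing <- sum(is.na(v))

mean_with_na
mean_without_na
n_missing
```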

  • The average election expense per candidate is about 7.5 million yen!
  • Multiple variable names can be specified in summarise()
  • For example, let’s calculate the average mean() of exp, age, and voteshare in hr.
hr |> 
  summarise(mean(exp, na.rm = TRUE),
    mean(age, na.rm = TRUE),
    mean(voteshare, na.rm = TRUE))
# A tibble: 1 × 3
  `mean(exp, na.rm = TRUE)` `mean(age, na.rm = TRUE)` mean(voteshare, na.rm = …¹
                      <dbl>                     <dbl>                      <dbl>
1                  7551393.                      51.2                       27.7
# ℹ abbreviated name: ¹​`mean(voteshare, na.rm = TRUE)`
  • The output is displayed in a data frame (tibble) format.
  • The variable names are long and difficult to read.
  • Column names can be made more readable by giving each summary a name inside summarise() (name = function(variable)).
hr |> 
  summarise(Mean_exp = mean(exp, na.rm = TRUE),
    Mean_age = mean(age, na.rm = TRUE),
    Mean_voteshare = mean(voteshare, na.rm = TRUE)) 
# A tibble: 1 × 3
  Mean_exp Mean_age Mean_voteshare
     <dbl>    <dbl>          <dbl>
1 7551393.     51.2           27.7
  • This is much easier to read.

3.2 group_by()

Grouping with One Variable

  • When calculating descriptive statistics for a specific variable, it’s more efficient to use functions like mean() or sd().
  • However, when calculating descriptive statistics for each group, the dplyr package is very useful.
  • For example, let’s calculate the average election expenses (exp) for each general election without using dplyr.
  • To calculate the average election expenses (exp) for the 1996 general election, identified as ‘1996’, you would need to write it as follows.
  • Note: na.rm = TRUE (an instruction to remove missing values) is added because exp contains missing values.
mean(hr$exp[hr$year == 1996], na.rm = TRUE)
[1] 9136316
  • Since this needs to be calculated for each of the nine general elections from 1996 to 2021, it requires the following input.
mean(hr$exp[hr$year == 1996], na.rm = TRUE)
mean(hr$exp[hr$year == 2000], na.rm = TRUE)
mean(hr$exp[hr$year == 2003], na.rm = TRUE)
mean(hr$exp[hr$year == 2005], na.rm = TRUE)
mean(hr$exp[hr$year == 2009], na.rm = TRUE)
mean(hr$exp[hr$year == 2012], na.rm = TRUE)
mean(hr$exp[hr$year == 2014], na.rm = TRUE)
mean(hr$exp[hr$year == 2017], na.rm = TRUE)
mean(hr$exp[hr$year == 2021], na.rm = TRUE)
  • However, using the group_by() function from the dplyr package can remarkably simplify the command.

How to use group_by() function
group_by(variable name) |>
  summarise(...)

  • To group the hr data by year and calculate the average of exp.
  • Use group_by(year) before summarise() in the pipe
    → This calculates the average of exp for each year.
hr |> 
  group_by(year) |> 
  summarise(Mean_exp = mean(exp, na.rm = TRUE))
# A tibble: 9 × 2
   year Mean_exp
  <dbl>    <dbl>
1  1996 9136316.
2  2000 8388889.
3  2003 7935408.
4  2005 8142244.
5  2009 6118181.
6  2012 5769988.
7  2014 7440127.
8  2017 9298783.
9  2021     NaN 
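The NaN for 2021 is what mean(..., na.rm = TRUE) returns when every value in a group is missing: after the missing values are removed, nothing is left to average. A minimal sketch with hypothetical data (the tibble toy and the column n_missing are made up for illustration and are not part of hr):

```r
library(dplyr)

# Hypothetical data: every exp value for 2021 is missing
toy <- tibble(year = c(2017, 2017, 2021, 2021),
              exp  = c(100, 300, NA, NA))

res <- toy |>
  group_by(year) |>
  summarise(Mean_exp  = mean(exp, na.rm = TRUE),  # NaN when a group has no non-missing values
            n_missing = sum(is.na(exp)))          # number of missing values per group
res
```

Adding a count such as n_missing next to the mean makes it easier to see whether a NaN comes from a group with no usable observations.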

Grouping with Multiple Variables

  • The group_by() function can also group data by two or more variables.
  • Suppose we want to investigate how much election expenses are used by each political party over time.
  • Group the hr data by year and seito, and then calculate the average of exp.
hr |> 
  group_by(year, seito) |> 
  summarise(Mean_exp = mean(exp, na.rm = TRUE))
# A tibble: 113 × 3
# Groups:   year [9]
    year seito                   Mean_exp
   <dbl> <chr>                      <dbl>
 1  1996 さわやか神戸・市民の会      NaN 
 2  1996 共産                    3158354.
 3  1996 国民党                      NaN 
 4  1996 市民新党にいがた        3044160 
 5  1996 政事公団太平会             5540 
 6  1996 文化フォーラム              NaN 
 7  1996 新党さきがけ           13030901 
 8  1996 新社会                  4545681 
 9  1996 新進                   12395369.
10  1996 日本新進                1898969 
# ℹ 103 more rows
  • The output becomes unwieldy because of the large number of political parties (seito), so let’s focus on only two parties: the Liberal Democratic Party (自民) and the Democratic Party of Japan (民主).
unique(hr$seito)
 [1] "新進"                   "自民"                   "民主"                  
 [4] "共産"                   "文化フォーラム"         "国民党"                
 [7] "無所"                   "自由連合"               "政事公団太平会"        
[10] "新社会"                 "社民"                   "新党さきがけ"          
[13] "沖縄社会大衆党"         "市民新党にいがた"       "緑の党"                
[16] "さわやか神戸・市民の会" "民主改革連合"           "青年自由"              
[19] "日本新進"               "公明"                   "諸派"                  
[22] "保守"                   "無所属の会"             "自由"                  
[25] "改革クラブ"             "保守新"                 "ニューディールの会"    
[28] "新党尊命"               "世界経済共同体党"       "新党日本"              
[31] "国民新党"               "新党大地"               "幸福"                  
[34] "みんな"                 "改革"                   "日本未来"              
[37] "日本維新の会"           "当たり前"               "政治団体代表"          
[40] "安楽死党"               "アイヌ民族党"           "次世"                  
[43] "維新"                   "生活"                   "立憲"                  
[46] "希望"                   "緒派"                   ""                      
[49] "N党"                   "国民"                   "れい"                  
hr |>  
  group_by(year, seito) |> 
  summarise(Mean_exp = mean(exp, na.rm = TRUE)) |> 
  filter(seito == "自民" | seito == "民主") 
# A tibble: 16 × 3
# Groups:   year [9]
    year seito  Mean_exp
   <dbl> <chr>     <dbl>
 1  1996 民主   9961458.
 2  1996 自民  14460093.
 3  2000 民主  10207109.
 4  2000 自民  14251423.
 5  2003 民主   9772028.
 6  2003 自民  12821990.
 7  2005 民主   9574473.
 8  2005 自民  12710075.
 9  2009 民主   7802585.
10  2009 自民  11374974.
11  2012 民主   7728045.
12  2012 自民   9335490.
13  2014 民主   9757272 
14  2014 自民  11450459.
15  2017 自民  12338476.
16  2021 自民       NaN 
  • Use the ggplot() function to visualize the average election expenses of candidates from both the Liberal Democratic Party and the Democratic Party.
hr |> 
  group_by(year, seito) |> 
  summarise(Mean_exp = mean(exp, na.rm = TRUE)) |> 
  filter(seito == "自民" | seito == "民主") |> 
  ggplot(aes(x = year,
             y = Mean_exp,
             color = seito)) +
  geom_line(aes(group = seito),
            linewidth = 1) +   # linewidth replaces size for lines in ggplot2 3.4.0 and later
  geom_point(aes(shape = seito),
             size = 3) +
  labs(x = "Year",
       y = "Election expense (yen)",
       color = "Party",
       shape = "Party") +
  ggtitle("Trends in Election Expenses in General Elections (LDP & DPJ)") +
  scale_color_manual(values = c("民主" = "orangered",
                                "自民" = "limegreen")) +
  theme_bw(base_family = "HiraKakuProN-W3")  # Font setting to avoid garbled Japanese characters (Mac users only)

  • Let’s calculate the average election expenses for both the Liberal Democratic Party and the Democratic Party.
hr |> 
  filter(seito == "自民" | seito == "民主") |> 
  group_by(seito) |> 
  summarise(Mean_exp = mean(exp, na.rm = TRUE))
# A tibble: 2 × 2
  seito  Mean_exp
  <chr>     <dbl>
1 民主   9096866.
2 自民  12456211.

Want to know the number of cases in each group: n()

  • When performing grouping, you may want to know the number of cases in each group.
  • For example, if you want to display not only the average election expenses used by political parties in each general election from 1996 to 2021, but also the number of candidates, add n() to the command.
hr |> 
  filter(seito == "自民" | seito == "民主") |> 
  group_by(year, seito) |> 
  summarise(Mean_exp = mean(exp, na.rm = TRUE),
            Cand_number = n())
# A tibble: 16 × 4
# Groups:   year [9]
    year seito  Mean_exp Cand_number
   <dbl> <chr>     <dbl>       <int>
 1  1996 民主   9961458.         143
 2  1996 自民  14460093.         288
 3  2000 民主  10207109.         242
 4  2000 自民  14251423.         271
 5  2003 民主   9772028.         267
 6  2003 自民  12821990.         277
 7  2005 民主   9574473.         289
 8  2005 自民  12710075.         290
 9  2009 民主   7802585.         271
10  2009 自民  11374974.         291
11  2012 民主   7728045.         264
12  2012 自民   9335490.         289
13  2014 民主   9757272          178
14  2014 自民  11450459.         283
15  2017 自民  12338476.         276
16  2021 自民       NaN          277
  • Use the ggplot() function to visualize the number of candidates for both the Liberal Democratic Party and the Democratic Party in each general election.
  • It’s also possible to display the number of candidates on the chart using the geom_text() function.
hr |> 
  filter(seito == "自民" | seito == "民主") |> 
  group_by(year, seito) |> 
  summarise(Mean_exp = mean(exp, na.rm = TRUE),
            Cand_number = n()) |> 
  ggplot(aes(x = year,
             y = Cand_number,
             color = seito)) +
  geom_line(aes(group = seito),
            linewidth = 1) +   # linewidth replaces size for lines in ggplot2 3.4.0 and later
  geom_point(aes(shape = seito),
             size = 3) +
  labs(x = "Year",
       y = "Number of Candidates",
       color = "Party",
       shape = "Party") +
  geom_text(aes(y = Cand_number + 3, label = Cand_number),
            size = 4, vjust = 0) +
  ggtitle("Trends in the Number of Candidates in General Elections (LDP & DPJ)") +
  scale_color_manual(values = c("民主" = "orangered",
                                "自民" = "limegreen")) +
  theme_bw(base_family = "HiraKakuProN-W3")  # Font setting to avoid garbled Japanese characters (Mac users only)

  • If you want to know the total number of LDP and DPJ candidates who ran in the general elections from 1996 to 2021…
hr |> 
  filter(seito == "自民" | seito == "民主") |> 
  group_by(seito) |> 
  summarize(Cand_number = n())
# A tibble: 2 × 2
  seito Cand_number
  <chr>       <int>
1 民主         1654
2 自民         2542
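As a hedged aside, dplyr also provides count(), which combines group_by() and summarise(n()) in one step. The sketch below uses a small hypothetical tibble (toy) rather than hr, and compares the two approaches.

```r
library(dplyr)

# Hypothetical candidates and their parties
toy <- tibble(seito = c("LDP", "LDP", "DPJ", "LDP", "DPJ"))

# group_by() + summarise(n()), as in the text
by_n <- toy |>
  group_by(seito) |>
  summarise(Cand_number = n())

# count() does the same; name = sets the name of the count column
by_count <- toy |>
  count(seito, name = "Cand_number")

by_n
by_count
```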

3.3 across()

  • Using the across() function allows you to calculate multiple descriptive statistics for multiple variables with a shorter code.
names(hr)
 [1] "year"          "pref"          "ku"            "kun"          
 [5] "wl"            "rank"          "nocand"        "seito"        
 [9] "j_name"        "gender"        "name"          "previous"     
[13] "age"           "exp"           "status"        "vote"         
[17] "voteshare"     "eligible"      "turnout"       "seshu_dummy"  
[21] "jiban_seshu"   "nojiban_seshu"
  • To calculate the mean and standard deviation for three adjacent variables from previous to exp in hr, you would need to enter the following code.
hr |> 
  summarise(previous_Mean = mean(previous, na.rm = TRUE),
            previous_SD   = sd(previous, na.rm = TRUE),
            age_Mean      = mean(age, na.rm = TRUE),
            age_SD        = sd(age, na.rm = TRUE),
            exp_Mean      = mean(exp, na.rm = TRUE),
            exp_SD        = sd(exp, na.rm = TRUE))
  • The result is a tibble with one row and six columns: a mean and a standard deviation for each of the three variables.
  • This task can be done in four lines using the across() function.
hr |> 
  summarise(across(previous:exp,
                   .fns = list(Mean = ~mean(.x, na.rm = TRUE),
                               SD = ~sd(.x, na.rm = TRUE))))
  • The output columns are named by combining the variable name and the function name (previous_Mean, previous_SD, age_Mean, age_SD, exp_Mean, exp_SD).
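To check what across() produces when the list contains both mean() and sd(), here is a minimal sketch on a small hypothetical tibble (toy, with made-up values), where the results are easy to verify by hand.

```r
library(dplyr)

# Hypothetical data with the same three column names as in hr
toy <- tibble(previous = c(0, 2, 4),
              age      = c(30, 40, 50),
              exp      = c(1, NA, 3))

res <- toy |>
  summarise(across(previous:exp,
                   .fns = list(Mean = ~mean(.x, na.rm = TRUE),
                               SD   = ~sd(.x, na.rm = TRUE))))

# Column names follow the pattern <variable>_<function name>
names(res)
res
```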

3.4 mutate()

  • mutate() function enables us to make a new variable

How to use mutate() function

data frame |>
  mutate(new variable name = formula)

  • If the new variable name already exists,
    → the existing variable will be overwritten.
  • If the variable name does not already exist,
    → a new variable will be added as the last variable.
  • To create election expenses per eligible voter (exppv) by dividing the election expenses (exp) in hr by the number of eligible voters (eligible), the code would look like this without using dplyr.
hr$exppv <- hr$exp / hr$eligible
names(hr)
 [1] "year"          "pref"          "ku"            "kun"          
 [5] "wl"            "rank"          "nocand"        "seito"        
 [9] "j_name"        "gender"        "name"          "previous"     
[13] "age"           "exp"           "status"        "vote"         
[17] "voteshare"     "eligible"      "turnout"       "seshu_dummy"  
[21] "jiban_seshu"   "nojiban_seshu" "exppv"        
  • exppv is added to the data.
  • When using dplyr::mutate(), the code would be as follows.
hr |> 
  dplyr::mutate(exppv = exp / eligible)
# A tibble: 9,660 × 23
    year pref  ku      kun    wl  rank nocand seito j_name gender name  previous
   <dbl> <chr> <chr> <dbl> <dbl> <dbl>  <dbl> <chr> <chr>  <chr>  <chr>    <dbl>
 1  1996 愛知  aichi     1     1     1      7 新進  河村…  male   KAWA…        2
 2  1996 愛知  aichi     1     0     2      7 自民  今枝…  male   IMAE…        2
 3  1996 愛知  aichi     1     0     3      7 民主  佐藤…  male   SATO…        2
 4  1996 愛知  aichi     1     0     4      7 共産  岩中…  female IWAN…        0
 5  1996 愛知  aichi     1     0     5      7 文化… 伊東…  female ITO,…        0
 6  1996 愛知  aichi     1     0     6      7 国民… 山田浩 male   YAMA…        0
 7  1996 愛知  aichi     1     0     7      7 無所  浅野…  male   ASAN…        0
 8  1996 愛知  aichi     2     1     1      8 新進  青木…  male   AOKI…        2
 9  1996 愛知  aichi     2     0     2      8 自民  田辺…  male   TANA…        0
10  1996 愛知  aichi     2     2     3      8 民主  古川…  male   FURU…        0
# ℹ 9,650 more rows
# ℹ 11 more variables: age <dbl>, exp <dbl>, status <dbl>, vote <dbl>,
#   voteshare <dbl>, eligible <dbl>, turnout <dbl>, seshu_dummy <dbl>,
#   jiban_seshu <chr>, nojiban_seshu <chr>, exppv <dbl>
  • The exppv is not visible on the output screen.
    → This is because exppv, being the newly added variable, is the last column and doesn’t fit on the screen.
  • exppv has been added successfully and is included among the variable names that are truncated at the bottom of the output screen.
names(hr)
 [1] "year"          "pref"          "ku"            "kun"          
 [5] "wl"            "rank"          "nocand"        "seito"        
 [9] "j_name"        "gender"        "name"          "previous"     
[13] "age"           "exp"           "status"        "vote"         
[17] "voteshare"     "eligible"      "turnout"       "seshu_dummy"  
[21] "jiban_seshu"   "nojiban_seshu" "exppv"        
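One point worth checking: mutate() returns a new data frame and does not change the original unless the result is assigned. The sketch below uses a hypothetical two-row tibble (toy) to show the difference.

```r
library(dplyr)

# Hypothetical data
toy <- tibble(exp = c(100, 200), eligible = c(50, 40))

# Without assignment: the result is only printed, toy itself is unchanged
toy |>
  mutate(exppv = exp / eligible)
names(toy)

# With assignment: the new column is kept in toy2
toy2 <- toy |>
  mutate(exppv = exp / eligible)
names(toy2)
```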

4. Re-coding variables

4.1 if_else()

  • if_else() is a function used to re-code a variable.

How to use if_else() function

if_else(condition, return value if TRUE, return value if FALSE)

  • Let’s illustrate this with a specific example.
  • For instance, the variable status in hr has three values: 0, 1, 2.
Variables Details
status Candidate’s status: 0 = non-incumbent, 1 = incumbent, 2 = former incumbent
  • Group data by the status variable and calculate the average values of vote and exp.
hr |> 
  group_by(status) |> 
  summarise(Mean_vote = mean(vote, na.rm = TRUE),
            Mean_exp = mean(exp, na.rm = TRUE))
# A tibble: 3 × 3
  status Mean_vote  Mean_exp
   <dbl>     <dbl>     <dbl>
1      0    33252.  5110084.
2      1    89259. 11545342.
3      2    69632.  9335640.
  • Let’s say we want to create an incumbency variable inc from status.
  • Re-code inc as "incumbent" if status is 1, and "non-incumbent" otherwise.
hr <- hr |> 
  mutate(inc = if_else(status == 1, "incumbent", "non-incumbent")) 
  • Use the table() and unique() functions to check if the variable inc has been correctly created.
table(hr$status, hr$inc)
   
    incumbent non-incumbent
  0         0          5517
  1      3510             0
  2         0           633
unique(hr$inc)
[1] "incumbent"     "non-incumbent"
  • When re-coding, it is preferable not to overwrite an existing variable (in this case, status), but to add it as a new variable (in this case, inc).
  • Group data by the inc variable and calculate the average values of vote and exp.
hr |> 
  group_by(inc) |> 
  summarise(Mean_vote = mean(vote, na.rm = TRUE),
            Mean_exp = mean(exp, na.rm = TRUE))
# A tibble: 2 × 3
  inc           Mean_vote  Mean_exp
  <chr>             <dbl>     <dbl>
1 incumbent        89259. 11545342.
2 non-incumbent    36996.  5482955.
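A detail that matters when the source variable has missing values: if_else() returns NA where the condition is NA, unless a value is supplied through its missing argument. The vector status below is hypothetical, and the label "unknown" is chosen only for illustration.

```r
library(dplyr)

# Hypothetical status values, including one missing value
status <- c(0, 1, 2, NA)

# NA in the condition gives NA in the result
inc <- if_else(status == 1, "incumbent", "non-incumbent")
inc

# The missing argument sets the value returned for NA conditions
inc_filled <- if_else(status == 1, "incumbent", "non-incumbent",
                      missing = "unknown")
inc_filled
```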

4.2 case_when()

  • When you want to re-code based on a single condition → use the if_else() function.
  • When you want to re-code based on two or more conditions → use the case_when() function.

How to use case_when() function

data frame |>
  mutate(new variable name = case_when(condition_1 ~ new value,
                  condition_2 ~ new value,
                  ......
                  TRUE ~ new value))

  • TRUE ~ new value means the “value for when none of the above conditions are met.”
  • Convert the values (1, 2, 0) of the variable status to English labels (incumbent, former_incumbent, challenger) and add them as the variable status_E.
hr |>
  mutate(status_E = case_when(status == 1 ~ "incumbent",
                              status == 2 ~ "former_incumbent",
                              TRUE        ~ "challenger")) |>
  group_by(Cand_status = status_E) |>
  summarise(vs_mean     = mean(vote, na.rm = TRUE),
            cand_number = n())
# A tibble: 3 × 3
  Cand_status      vs_mean cand_number
  <chr>              <dbl>       <int>
1 challenger        33252.        5517
2 former_incumbent  69632.         633
3 incumbent         89259.        3510
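case_when() evaluates its conditions from top to bottom and uses the first one that is TRUE, so the order of the conditions matters. A minimal sketch with hypothetical ages and labels (not taken from hr):

```r
library(dplyr)

# Hypothetical ages
age <- c(25, 45, 70)

# 25 satisfies both age < 40 and age < 65, but only the first match is used
age_group <- case_when(age < 40 ~ "under 40",
                       age < 65 ~ "40 to 64",
                       TRUE     ~ "65 or over")
age_group
```

If the second condition were written first, the value 25 would be labelled "40 to 64", which is why narrower conditions usually come before broader ones.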

4.3 %in%

  • seito has party names in Japanese lower house elections.
Variable Name Details
seito Candidate’s Affiliated Political Party
  • Extract only the data of the 2021 general election from hr and check the values of seito.
hr2021 <- hr |> 
  filter(year == 2021)
unique(hr2021$seito)
 [1] "自民" "立憲" "N党" "国民" "維新" "共産" "れい" "無所" "社民" "諸派"
[11] "公明"
  • Convert party names in Japanese to English names.
hr2021_1 <- hr2021 %>%
  mutate(party = case_when(seito == "自民" ~ "LDP",
                          seito == "公明" ~ "CGP",
                          seito == "立憲" ~ "CDP",
                          seito == "維新" ~ "ISHIN",
                          seito == "国民" ~ "KOKUMIN",
                          seito == "N党" ~ "N-PARTY",
                          seito == "れい" ~ "REIWA",
                          seito == "社民" ~ "SDP",
                          seito == "共産" ~ "JCP",
                          seito == "無所" ~ "IND",
                           TRUE  ~ "SHOHA")) %>% 
  group_by(PARTY = party) %>%
  summarise(total_votes           = sum(vote, na.rm = TRUE),
            number_of_cand       = n(),
            ave_vote_per_cand = mean(vote, na.rm = TRUE))
DT::datatable(hr2021_1)
  • Different conditions may return the same new value.
  • For instance, if the value of the variable seito is either 自民 (LDP) or 公明 (CGP), change it to Ruling.
  • Change all other values to Opposition.
  • Let’s add this result as a new variable named R_or_O.
hr2021_2 <- hr2021 %>%
  mutate(party = case_when(seito == "自民" ~ "Ruling",
                          seito == "公明" ~ "Ruling",
                           TRUE  ~ "Opposition")) %>% 
  group_by(R_or_O = party) %>%
  summarise(total_votes             = sum(vote, na.rm = TRUE),
            number_of_cand             = n(),
            average_votes_per_cand = mean(vote, na.rm = TRUE))
hr2021_2
# A tibble: 2 × 4
  R_or_O     total_votes number_of_cand average_votes_per_cand
  <chr>            <dbl>          <int>                  <dbl>
1 Opposition    28957940            571                 50714.
2 Ruling        28499088            286                 99647.
  • Here, there are only two types of return values (“Ruling” or “Opposition”).

The %in% operator can be used.

  • x %in% y checks, for each element of the vector on the left (x), whether it is included in the vector on the right (y).
  • If it is included => TRUE.
  • If it is not included => FALSE.
  • Check whether each value of the variable PARTY in hr2021_1 is either LDP or CGP.
head(hr2021_1$PARTY %in% c("LDP", "CGP"))
[1] FALSE  TRUE FALSE FALSE FALSE FALSE
  • hr2021_1 has one row per party, so each result refers to a party.
  • The first is FALSE => this party is neither LDP nor CGP.

  • The second is TRUE => this party is either LDP or CGP.

  • This can be used as the conditional expression in the if_else() function.

hr2021 |>
  mutate(party = if_else(seito %in% c("自民", "公明"),
                         "Ruling", "Opposition")) |>
  group_by(R_or_O = party) |>
  summarise(total_votes            = sum(vote, na.rm = TRUE),
            number_of_cand         = n(),
            average_votes_per_cand = mean(vote, na.rm = TRUE))
# A tibble: 2 × 4
  R_or_O     total_votes number_of_cand average_votes_per_cand
  <chr>            <dbl>          <int>                  <dbl>
1 Opposition    28957940            571                 50714.
2 Ruling        28499088            286                 99647.
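For a quick check of how %in% behaves, the sketch below compares it with the equivalent chain of == and | on a short hypothetical vector (party is made up for illustration).

```r
# Hypothetical party labels
party <- c("LDP", "CDP", "CGP", "JCP")

# One result per element of the left-hand vector
in_result <- party %in% c("LDP", "CGP")
in_result

# The same test written with == and |
or_result <- party == "LDP" | party == "CGP"
or_result

identical(in_result, or_result)
```

The %in% form stays short as the list of values to match grows, whereas the == form needs one extra comparison per value.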

5. Handling Missing Values

  • When dealing with data from public opinion polls and similar sources, missing values are often represented by numbers like 99 or 999.
  • For instance, let’s consider the following hypothetical data.
df1 <- tibble(id      = 1:10,
              age     = c(19, 20, 22, 999, 35, 45, 50, 60, 70, 999),
              college = c(1, 0, 0, 1, 1, 99, 1, 1, 99, 0))
df1
# A tibble: 10 × 3
      id   age college
   <int> <dbl>   <dbl>
 1     1    19       1
 2     2    20       0
 3     3    22       0
 4     4   999       1
 5     5    35       1
 6     6    45      99
 7     7    50       1
 8     8    60       1
 9     9    70      99
10    10   999       0
Variables Details
id:
age: 999 = missing values
college: 1 = college graduate, 0 = otherwise, 99 = missing values
  • Let’s try outputting the average age and education level of the respondents.
mean(df1$age)
[1] 231.9
  • The average age is 231.9?
mean(df1$college)
[1] 20.3
  • The average rate of college graduates is 20.3?
  • Checking and handling missing values is essential before data analysis.
    → Neglecting this task can lead to significant distortions in research results.
  • This is especially important when missing values are not denoted as NA, but numbers like 99 or 999.
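When the goal is only to turn such codes into NA, one possible approach is dplyr's na_if(), which replaces a given value with NA. This is a hedged sketch: the new column names age_na and college_na are chosen for this example so that the original columns are not overwritten.

```r
library(dplyr)

df1 <- tibble(id      = 1:10,
              age     = c(19, 20, 22, 999, 35, 45, 50, 60, 70, 999),
              college = c(1, 0, 0, 1, 1, 99, 1, 1, 99, 0))

# Replace the missing-value codes with NA in new columns
df1_clean <- df1 |>
  mutate(age_na     = na_if(age, 999),
         college_na = na_if(college, 99))

# Means computed on the non-missing values only
mean_age     <- mean(df1_clean$age_na, na.rm = TRUE)
mean_college <- mean(df1_clean$college_na, na.rm = TRUE)
mean_age
mean_college
```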

Methods for Handling Missing Values

Step 1:
  • Create a variable AGE, and set it to 1 if age is 39 or below, otherwise 0.
  • If age is 999, set it to NA.
  • Specify this within mutate() using case_when().
df1 %>%
  mutate(AGE  = case_when(age == 999 ~ NA,
                          age <=  39 ~ 1,
                          TRUE       ~ 0))
  • The following error occurs!

Error message Error in mutate(): ! Problem while computing AGE = case_when(age == 999 ~ NA, age <= 39 ~ 1, TRUE ~ 0). Caused by error in names(message) <- `*vtmp*`: ! ‘names’ attribute [1] must be the same length as the vector [0]

  • The cause of the error is the setting of NA: age == 999 ~ NA.
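  • In R, a plain NA is of logical type.
  • R also provides typed missing values, including:
NA_real_: numeric type NA
NA_character_: character type NA
  • In older versions of dplyr, case_when() requires every return value to have the same type, so a plain NA cannot be mixed with numeric or character values. (From dplyr 1.1.0, a plain NA is also accepted.)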
  • In the case of the above code, since all values of the new variable AGE are numerical, AGE is of numeric type.
    → Set it as age == 999 ~ NA_real_
df1 %>%
  mutate(AGE  = case_when(age == 999 ~ NA_real_,
                          age <=  39 ~ 1,
                          TRUE       ~ 0))
# A tibble: 10 × 4
      id   age college   AGE
   <int> <dbl>   <dbl> <dbl>
 1     1    19       1     1
 2     2    20       0     1
 3     3    22       0     1
 4     4   999       1    NA
 5     5    35       1     1
 6     6    45      99     0
 7     7    50       1     0
 8     8    60       1     0
 9     9    70      99     0
10    10   999       0    NA
  • Well done!
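The type of each NA variant can be checked directly in base R. This is a small sketch, not part of the original exercise; na_types and all_missing are names introduced here for illustration.

```r
# A bare NA is logical; the typed variants carry the type of the target vector
typeof(NA)             # "logical"
typeof(NA_real_)       # "double"
typeof(NA_integer_)    # "integer"
typeof(NA_character_)  # "character"

# All of them are still recognised as missing by is.na()
na_types    <- list(NA, NA_real_, NA_integer_, NA_character_)
all_missing <- all(vapply(na_types, is.na, logical(1)))
all_missing            # TRUE
```

Because the other right-hand sides of the case_when() call in Step 1 are the numbers 1 and 0, NA_real_ is the variant that matches.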
Step 2:
  • Create a variable COLLEGE, and specify it as “university graduate or higher” if college is 1, “less than university graduate” if 0, and NA if 99.
df1 %>%
  mutate(AGE  = case_when(age == 999 ~ NA_real_,
                          age <=  39 ~ 1,
                          TRUE       ~ 0),
         COLLEGE = case_when(college == 99 ~ NA,
                            college == 1 ~ "university graduate or higher",
                            TRUE          ~ "less than university graduate"))
  • The following error occurs!

Error message

Error in mutate():
! Problem while computing COLLEGE = case_when(college == 99 ~ NA, college == 1 ~ "university graduate or higher", TRUE ~ "less than university graduate").
Caused by error:

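  • The cause is again the setting of NA: college == 99 ~ NA.
  • A bare NA is of logical type, while the other right-hand sides are character strings, so a typed NA is needed:
numeric type NA: NA_real_
character type NA: NA_character_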
  • In the case of the above code, since all values of the new variable COLLEGE are character type, COLLEGE is of character type.
    → Set it as college == 99 ~ NA_character_
df1 %>%
  mutate(AGE  = case_when(age == 999 ~ NA_real_,
                          age <=  39 ~ 1,
                          TRUE       ~ 0),
         COLLEGE = case_when(college == 99 ~ NA_character_,
                            college == 1 ~ "university graduate or higher",
                            TRUE          ~ "less than university graduate"))
# A tibble: 10 × 5
      id   age college   AGE COLLEGE                      
   <int> <dbl>   <dbl> <dbl> <chr>                        
 1     1    19       1     1 university graduate or higher
 2     2    20       0     1 less than university graduate
 3     3    22       0     1 less than university graduate
 4     4   999       1    NA university graduate or higher
 5     5    35       1     1 university graduate or higher
 6     6    45      99     0 <NA>                         
 7     7    50       1     0 university graduate or higher
 8     8    60       1     0 university graduate or higher
 9     9    70      99     0 <NA>                         
10    10   999       0    NA less than university graduate

Points on missing values (NAs)
・When the new variable is numeric
   var == value ~ NA_real_
・When the new variable is character
   var == value ~ NA_character_
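
When the goal is only to turn a missing-value code into NA without recoding the other values, dplyr's na_if() is a possible alternative to case_when(). The sketch below is one way to do this under that assumption, not the method used in the steps above; df1_clean and res are names introduced here for illustration.

```r
library(dplyr)  # dplyr re-exports tibble()

df1 <- tibble(id      = 1:10,
              age     = c(19, 20, 22, 999, 35, 45, 50, 60, 70, 999),
              college = c(1, 0, 0, 1, 1, 99, 1, 1, 99, 0))

# Replace the coded values with NA, leaving all other values unchanged
df1_clean <- df1 %>%
  mutate(age     = na_if(age, 999),
         college = na_if(college, 99))

# na.rm = TRUE drops the NAs before the mean is computed
res <- df1_clean %>%
  summarise(mean_age     = mean(age, na.rm = TRUE),
            mean_college = mean(college, na.rm = TRUE))
res
```

Without na.rm = TRUE, mean() would return NA for both columns, which is a safer failure than the silently inflated averages seen earlier.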

5. Practice Questions

  • Using the Lower House election data (hr96-21.csv), answer the following questions:

Q5.1:

  • Use the group_by() and ggplot() functions to calculate and visualize the average age of candidates in the general elections (1996-2021) for all political parties with a line graph.

Q5.2:

  • Redraw the graph from Q5.1, limiting it to the Liberal Democratic Party (LDP), Democratic Party, and Komeito.
  • In the variable seito, the three parties are recorded in Japanese as follows:
    ・自民 = the Liberal Democratic Party (LDP)
    ・公明 = Komeito
    ・民主 = Democratic Party

Q5.3:

  • Calculate and visualize with a line graph the average age of winners in the general elections (1996-2021) for the three political parties (LDP, Democratic Party, Komeito). Also display the number of winners.

Q5.4:

  • The variable wl has three values: 0, 1, 2.
Variable Details
wl = 0 Not elected
wl = 1 Elected in a single-member district
wl = 2 Elected through proportional representation
  • Use the if_else() function to create a single-member district election dummy (wlsmd), with 1 for candidates elected in single-member districts and 0 for those not elected.
  • Use the table() and unique() functions to check if the wlsmd variable is correctly created.
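
As a reminder of the general pattern (not the answer to this question), the sketch below builds a different dummy on toy data and then checks it. The tibble toy and the variable won are hypothetical; only the coding of wl follows the table above.

```r
library(dplyr)  # dplyr re-exports tibble()

# Toy data using the wl coding described above (not the real election file)
toy <- tibble(wl = c(0, 1, 2, 1, 0, 2))

# Dummy: 1 if elected by either route, 0 if not elected
toy <- toy %>%
  mutate(won = if_else(wl > 0, 1, 0))

# Cross-tabulating the original and the new variable shows
# which wl values were mapped to which dummy value
table(toy$wl, toy$won)

# unique() confirms that the dummy takes only the intended values
unique(toy$won)
```

A cross-tabulation is often more informative than a one-way table here, because it reveals any original category that was mapped to the wrong dummy value.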

Q5.5:

  • Use the filter() function to extract data from the 2021 general election, and the if_else() function to create a Liberal Democratic Party dummy (ldp) with 1 for LDP candidates and 0 for others.
  • Use the table() and unique() functions to check if the ldp variable is correctly created.

Q5.6:

  • Use the filter() function to extract data from the 2021 general election and the mutate(), if_else(), %in%, and group_by() functions to compare and display the average age of government party candidates (combined LDP and Komeito) and opposition candidates.

  • Also display the number of candidates for both the government and opposition parties.

Q5.7:

  • Use the filter() function to extract data from the 2021 general election, and the mutate(), if_else(), %in%, and group_by() functions to compare and display the average age of elected candidates (including proportional representation winners) from the government (combined LDP and Komeito) and opposition parties.
  • Also display the number of candidates for both the government and opposition parties.

Q5.8:

  • Use the filter() function to extract data from the 2021 general election, and the mutate(), case_when(), and group_by() functions to compare and display the average age of elected candidates (including proportional representation winners) from the LDP, Komeito, Constitutional Democratic Party, and other opposition parties.
  • Also display the number of candidates for the LDP, Komeito, Constitutional Democratic Party, and other opposition parties.