• R packages used in this section
library(haven)
library(readxl)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.1
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
  • {tidyverse} includes 8 useful packages
  • To read data files, you need {readr}, which is included in {tidyverse}

1. Data frame

1.1 How to make a variable

  • A variable is also called a vector

  • A vector is the smallest unit in data handling

  • A vector is an object that holds one or more values (elements) of the same class (such as numeric, character, factor, …)

  • All of the elements in a vector must have the same class; if you mix classes, R converts them to a common one

  • Let’s make a vector (= variable) including 8 elements (= values)

  • Name it "id"

id <- c(1,2,3,4,5,6,7,8)
  • Make another variable including 8 different names
  • Let’s name it "name"
name <- c("Joe", "Carol", "Mike", "Ross", "Shira", "Inha", "Jih-wen", "Amital")
  • Make another variable including test scores
  • Let’s name it "score"
score <- c(43, 74, 80, 37, 20, 83, 64, 35)
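
The "same class" rule can be checked with class(). The following is a minimal sketch using the vectors above; the vector mixed is a hypothetical example that is not part of the original data.

```r
id <- c(1, 2, 3, 4, 5, 6, 7, 8)
name <- c("Joe", "Carol", "Mike", "Ross", "Shira", "Inha", "Jih-wen", "Amital")

class(id)      # class of a vector of numbers
class(name)    # class of a vector of strings
length(id)     # number of elements in the vector

# Hypothetical example: mixing a number and a string
# R converts every element to character so that the class stays uniform
mixed <- c(1, "Joe")
class(mixed)
```

Because every element of mixed becomes a character string, you can no longer do arithmetic with it; this is one reason to keep one class per variable.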

1.2 Make a data frame with variables

Two ways of making a data frame:
1. data.frame(var1, var2, ...)
2. tibble::tibble(var1, var2, ...) (Recommended!)

tibble() comes from {tibble}, which is loaded with {tidyverse}; it is an extended version of data.frame()
tibble() is recommended in terms of versatility

  • Let’s make a data frame using tibble() here
  • Load {tidyverse} to use tibble()
library(tidyverse)
  • Let’s merge these three variables and make a data frame, df1
df1 <- tibble(id, name, score)
df1
# A tibble: 8 × 3
     id name    score
  <dbl> <chr>   <dbl>
1     1 Joe        43
2     2 Carol      74
3     3 Mike       80
4     4 Ross       37
5     5 Shira      20
6     6 Inha       83
7     7 Jih-wen    64
8     8 Amital     35

1.3 Add a new variable to data frame

  • You can add a new variable, gakka (major), to the data frame df1 with the following R code:
df1$gakka <- c("Poli-sci", "Econ", "Stat", "History", "Poli-sci", "Econ", "Stat", "Business")
df1
# A tibble: 8 × 4
     id name    score gakka   
  <dbl> <chr>   <dbl> <chr>   
1     1 Joe        43 Poli-sci
2     2 Carol      74 Econ    
3     3 Mike       80 Stat    
4     4 Ross       37 History 
5     5 Shira      20 Poli-sci
6     6 Inha       83 Econ    
7     7 Jih-wen    64 Stat    
8     8 Amital     35 Business
  • Let’s add another variable (gender) to df1
df1$gender <- c("male", "female", "male", "male", "female", "male", "male", "female")
df1
# A tibble: 8 × 5
     id name    score gakka    gender
  <dbl> <chr>   <dbl> <chr>    <chr> 
1     1 Joe        43 Poli-sci male  
2     2 Carol      74 Econ     female
3     3 Mike       80 Stat     male  
4     4 Ross       37 History  male  
5     5 Shira      20 Poli-sci female
6     6 Inha       83 Econ     male  
7     7 Jih-wen    64 Stat     male  
8     8 Amital     35 Business female

1.4 How to merge data frames

  • Let’s make another data frame, df2

  • df2 includes the following two variables:
    id
    city

  • Let’s make a vector (= variable) including 8 elements (= values)

  • Name it "id"

id <- c(1,2,3,4,5,6,7,8)
  • Make another variable including the cities the 8 people come from
  • Let’s name it "city"
city <- c("Eugene", "Portland", "Chicago", "Oxnard", "Seattle", "Soeul", "Taipei", "Seattle")
  • Let’s merge these two variables and make a data frame, df2
df2 <- tibble(id, city)
df2
# A tibble: 8 × 2
     id city    
  <dbl> <chr>   
1     1 Eugene  
2     2 Portland
3     3 Chicago 
4     4 Oxnard  
5     5 Seattle 
6     6 Soeul   
7     7 Taipei  
8     8 Seattle 
  • Merge df1 and df2 by id, the variable included in both df1 and df2
  • Name the merged data frame M
M <- merge(df1, df2, by = "id")
M
  id    name score    gakka gender     city
1  1     Joe    43 Poli-sci   male   Eugene
2  2   Carol    74     Econ female Portland
3  3    Mike    80     Stat   male  Chicago
4  4    Ross    37  History   male   Oxnard
5  5   Shira    20 Poli-sci female  Seattle
6  6    Inha    83     Econ   male    Soeul
7  7 Jih-wen    64     Stat   male   Taipei
8  8  Amital    35 Business female  Seattle
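
Note that merge() is a base R function and returns a plain data frame, as the output of M shows. As a hedged alternative that is not in the original notes, {dplyr} offers left_join(), which keeps the result as a tibble. The sketch below uses shortened, hypothetical versions of df1 and df2 so that it runs on its own.

```r
library(dplyr)
library(tibble)

# Shortened, hypothetical versions of df1 and df2 (3 rows only)
df1 <- tibble(id = c(1, 2, 3), name = c("Joe", "Carol", "Mike"))
df2 <- tibble(id = c(1, 2, 3), city = c("Eugene", "Portland", "Chicago"))

# Keep every row of df1 and attach the matching columns of df2 by id
M2 <- left_join(df1, df2, by = "id")
M2
class(M2)   # still a tibble
```

Which function to use is a matter of preference; left_join() fits well when the rest of your code uses {tidyverse} functions.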

2. Basics on data handling

2.1 File and folder

  • A computer file is a computer resource for recording data in a computer storage device, primarily identified by its file name.
  • In computers, a folder is the virtual location for applications and computer files (e.g., documents, data, etc.).
  • Folders help in storing and organizing files and data in the computer.
  • A directory is a location for storing files on your computer.
  • In the following example, you see 4 folders (backdoor, maps, R, RDD) on the left side.

  • Folder R has 4 files with two types of extensions (.html & .Rmd)
  • A folder does not have an extension.
  • A folder is also called a directory
  • R Project folder = Working directory

2.2 Path

  • A path is a string of characters used to uniquely identify a location in a directory structure.
  • It is built by following the directory tree: each component, separated by a delimiting character such as the slash (/), represents one directory.
  • If you want to know where you are, type getwd() (get working directory); R then prints the path of your current location.
  • For instance, the following is the result I get when I type getwd() on my computer:

getwd()
[1] "/Users/asanomasahiko/Dropbox/statistics/class_materials/R"

  • If you are a Windows user, the path starts with a drive letter such as C:/

2.3 Working directory

  • You see the folder R at the right end of the path above
  • This path means that you are working in the folder R, which is the name of your R Project folder (= working directory)
  • You can set your working directory by passing a path to setwd()
    However, it is inefficient to set your working directory every time you knit in RStudio
    ・Once you make an R Project folder, you save time in specifying paths, reading csv files, and saving the figures you make
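
The reason an R Project folder saves time is that relative paths such as data/hr96-21.csv are resolved from the working directory. The following is a small sketch using only base R functions; the object name csv_path is made up for illustration.

```r
getwd()             # absolute path of the current working directory
basename(getwd())   # last component of the path, i.e. the folder you are in

# Build a relative path from its components
csv_path <- file.path("data", "hr96-21.csv")
csv_path

# TRUE only if the file really exists under the working directory
file.exists(csv_path)
```

If file.exists() returns FALSE, check that you opened the R Project and that the data folder sits directly inside it.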

2.4 User names and folder names

  • Do not use Japanese characters in a file name or a directory name

  • Do not put spaces in a file name or a directory name

  • Use only half-width (ASCII) letters, numbers, and symbols in a file name or a directory name

  • Avoid starting a file name or a directory name with a number or a symbol

  • Example:

Bad: 2021_grades (starts with a number)
Good: grades_2021
Bad: grades 2021 (a space between grades and 2021)
Good: grades2021
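
These naming rules can be turned into a quick check. The helper is_safe_name() below is a hypothetical function written for this illustration (it is not part of R or any package); it assumes that "safe" means the name starts with an ASCII letter and continues with ASCII letters, digits, underscores, dots, or hyphens.

```r
# Hypothetical helper: TRUE if a name follows the rules above
is_safe_name <- function(x) {
  grepl("^[A-Za-z][A-Za-z0-9_.-]*$", x)
}

candidates <- c("grades_2021", "2021_grades", "grades 2021", "grades2021")
result <- is_safe_name(candidates)
names(result) <- candidates
result
```

The pattern is deliberately strict; adjust it if your own projects allow other symbols.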

3. Load a built-in dataset in R

  • R and R packages come with various built-in datasets
  • Type data() in Console and hit the return key (or Enter key)
data()
  • You see the following:

  • For instance, check the 7th dataset, state.x77, and show the first 6 rows:

  • Type head(state.x77) in Console and hit the return key (or Enter key)

head(state.x77)
           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
  • Let’s display the last 6 rows:

  • Type tail(state.x77) in Console and hit the return key (or Enter key)

tail(state.x77)
              Population Income Illiteracy Life Exp Murder HS Grad Frost  Area
Vermont              472   3907        0.6    71.64    5.5    57.1   168  9267
Virginia            4981   4701        1.4    70.08    9.5    47.8    85 39780
Washington          3559   4864        0.6    71.72    4.3    63.5    32 66570
West Virginia       1799   3617        1.4    69.48    6.7    41.6   100 24070
Wisconsin           4589   4468        0.7    72.48    3.0    54.5   149 54464
Wyoming              376   4566        0.6    70.29    6.9    62.9   173 97203
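
The output above has no "# A tibble" header because state.x77 is stored as a matrix with the state names as row names. The following sketch, using base R only, checks this and converts it to a data frame; the object name states and the new column state are chosen for illustration.

```r
class(state.x77)   # a matrix, not a data frame
dim(state.x77)     # number of rows and columns

# Convert to a data frame and keep the state names as an ordinary column
states <- as.data.frame(state.x77)
states$state <- rownames(state.x77)

head(states[, c("state", "Population", "Income")], n = 3)
```

Converting first makes the dataset behave like the data frames built in section 1.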

4. Load data other than built-in datasets in R

Load data according to its file type

  • You have two types of data: text data & binary data
  • You can tell the difference by the file extension

text data

  • Data people can read directly
    .txt file: You use this data when you analyze text
    .tsv file: Data separated not with commas, but with tabs
    .html file: You use this data when you want to do web scraping and gather data
    .csv file: Data separated with commas (csv means comma-separated values)
    → The .csv file is the format most commonly used in RStudio

binary data

  • Data a computer can read but people cannot without particular software
    .xls file: Data used in Excel
    .xlsx file: Data used in Excel (newer than .xls)
    .dta file: Data used in Stata
    .rds file: Data for R only

Why you should use LibreOffice rather than MS Office Excel

→ LibreOffice is free!
→ You can specify the character encoding
→ You can avoid garbled characters

  • If your data includes Japanese, you are very likely to get garbled characters
Solution:
  • Use LibreOffice to set the encoding, or pass locale = locale(encoding = "cp932") to read_csv() as follows:
df <- read_csv("data/hr96-21.csv",
               na = ".",
               locale = locale(encoding = "cp932"))

4.1 How to read a .csv file

  • Let’s read the lower house election data in Japan: 1996-2021 (hr96-21.csv)
  • Make a folder named data in your R Project folder
  • Click hr96-21.csv and download it to your computer
  • Put hr96-21.csv into the data folder within your R Project folder
  • Read the election data and name it df
df <- read_csv("data/hr96-21.csv",
               na = ".")

Check whether you read the data correctly

  • hr96-21.csv is a collection of Japanese lower house election data covering 9 national elections (1996, 2000, 2003, 2005, 2009, 2012, 2014, 2017, 2021)
  • Check the names of the variables df contains
names(df)
 [1] "year"          "pref"          "ku"            "kun"          
 [5] "wl"            "rank"          "nocand"        "seito"        
 [9] "j_name"        "gender"        "name"          "previous"     
[13] "age"           "exp"           "status"        "vote"         
[17] "voteshare"     "eligible"      "turnout"       "seshu_dummy"  
[21] "jiban_seshu"   "nojiban_seshu"
  • Display the first 6 rows of the data
head(df)
# A tibble: 6 × 22
   year pref  ku      kun    wl  rank nocand seito   j_name gender name  previ…¹
  <dbl> <chr> <chr> <dbl> <dbl> <dbl>  <dbl> <chr>   <chr>  <chr>  <chr>   <dbl>
1  1996 愛知  aichi     1     1     1      7 新進    河村…  male   KAWA…       2
2  1996 愛知  aichi     1     0     2      7 自民    今枝…  male   IMAE…       2
3  1996 愛知  aichi     1     0     3      7 民主    佐藤…  male   SATO…       2
4  1996 愛知  aichi     1     0     4      7 共産    岩中…  female IWAN…       0
5  1996 愛知  aichi     1     0     5      7 文化フ… 伊東…  female ITO,…       0
6  1996 愛知  aichi     1     0     6      7 国民党  山田浩 male   YAMA…       0
# … with 10 more variables: age <dbl>, exp <dbl>, status <dbl>, vote <dbl>,
#   voteshare <dbl>, eligible <dbl>, turnout <dbl>, seshu_dummy <dbl>,
#   jiban_seshu <chr>, nojiban_seshu <chr>, and abbreviated variable name
#   ¹​previous
  • If you get garbled characters, use the following R code:
df <- read_csv("data/hr96-21.csv", 
na = ".",
locale = locale(encoding = "cp932"))

read.csv() and read_csv()
・read.csv() is a base R function
・read_csv() is a function you can use only after loading {readr}, which is included in {tidyverse}
・You can use either read.csv() or read_csv() to read a csv file
・Both work fine, but read_csv() is much faster and more reliable, especially when you read large data
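
To see the two functions side by side, the sketch below writes a tiny, hypothetical csv file to a temporary location and reads it both ways. The file contents are invented for illustration; as in hr96-21.csv, a dot marks a missing value.

```r
library(readr)

# Write a tiny hypothetical csv file ("." marks a missing value)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,43", "2,.", "3,80"), tmp)

# Base R: the missing-value marker is given with na.strings
base_df <- read.csv(tmp, na.strings = ".")

# {readr}: the missing-value marker is given with na
readr_df <- read_csv(tmp, na = ".", show_col_types = FALSE)

class(base_df)    # a plain data frame
class(readr_df)   # a tibble
readr_df
```

Note that the argument for missing values has a different name in each function (na.strings vs. na).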

  • If you download a csv file which includes Japanese, you need to be careful

When you fail to read a csv file, there are 2 solutions:

Open your csv file in Excel
  • If you see garbled characters, open the csv file you want to read (in this case, hr96-21.csv) in Excel, as you would an .xls or .xlsx file
  • It is very likely that you see garbled characters like the following:

  • Go to File and choose Save As
  • Choose CSV UTF-8 (.csv) and save it

Open your csv file in LibreOffice
  • Choose Unicode (UTF-8) in LibreOffice

  • If you still cannot read hr96-21.csv, then use the following R code:
hr <- read_csv("data/hr96-21.csv", 
na = ".",
locale = locale(encoding = "cp932"))

4.2 How to read an .xls[x] file

  • You need to install and load {readxl} to read an .xls[x] file
library(readxl)
  • Suppose you want to read Freedom House data
  • Download FH_Country.xls into your data folder and load it
fh <- read_excel("data/FH_Country.xls")

Keep in mind that loading data into R is just the beginning of an empirical analysis. You need to clean and customize the data before starting your analysis.

4.3 How to read a .dta file

  • A .dta file is binary data
  • You need to install and load {haven} to read a .dta file
library(haven)
  • Suppose you want to load the replication data of Bruce Russett and John R. Oneal (2001) “Triangulating Peace”
  • Download TRIANGLE.DTA to your computer and load it
triangle <- read_dta("data/TRIANGLE.DTA")
head(triangle)
# A tibble: 6 × 19
  statea stateb  year dependa dependb demauta demautb allies dispute1 logdstab
   <dbl>  <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>    <dbl>    <dbl>
1      2     20  1920  0.0157   0.280      10       9      0        0     5.82
2      2     20  1921  0.0115   0.224      10      10      0        0     5.82
3      2     20  1922  0.0113   0.201      10      10      0        0     5.82
4      2     20  1923  0.0112   0.213      10      10      0        0     5.82
5      2     20  1924  0.0110   0.213      10      10      0        0     5.82
6      2     20  1925  0.0108   0.191      10      10      0        0     5.82
# … with 9 more variables: lcaprat2 <dbl>, smigoabi <dbl>, opena <dbl>,
#   openb <dbl>, minrpwrs <dbl>, noncontg <dbl>, smldmat <dbl>, smldep <dbl>,
#   dyadid <dbl>
  • Check the names of variables triangle includes
names(triangle)
 [1] "statea"   "stateb"   "year"     "dependa"  "dependb"  "demauta" 
 [7] "demautb"  "allies"   "dispute1" "logdstab" "lcaprat2" "smigoabi"
[13] "opena"    "openb"    "minrpwrs" "noncontg" "smldmat"  "smldep"  
[19] "dyadid"  
  • If you want to save TRIANGLE.DTA as a csv file (triangle.csv), use the following R code:
write_excel_csv(triangle, "data/triangle.csv")
  • Check that you see triangle.csv in the data folder within your R Project folder

4.4 How to read an .Rds file

  • You need to install and load {readr} to read an .Rds file
  • Suppose you want to load hr96-21.Rds into R
  • Since {tidyverse} contains {readr}, you need to install and load either of these two packages
  • Let’s load {readr} here:
library(readr)
hr <- readr::read_rds("hr96-21.Rds")
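
An .Rds file stores a single R object exactly as it is, including the class of each variable. The following sketch shows a round trip with write_rds() and read_rds(); the data frame scores and the temporary file are hypothetical and stand in for hr96-21.Rds.

```r
library(readr)

# Hypothetical data frame used only for this illustration
scores <- data.frame(id = 1:3, score = c(43, 74, 80))

# Save to a temporary .rds file and read it back
rds_path <- tempfile(fileext = ".rds")
write_rds(scores, rds_path)
restored <- read_rds(rds_path)

identical(scores, restored)   # the object comes back unchanged
```

Unlike a csv file, nothing has to be guessed when the file is read, so variable classes survive the round trip.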

5. What can you do with {tidyverse}?

  • {tidyverse} contains multiple useful R packages for data manipulation and data visualization

  • To use {tidyverse}, you need to take the following 2 steps:
1. Install {tidyverse}
  • Type install.packages("tidyverse") in Console
    → Hit the Return key
  • You need to do this only once
install.packages("tidyverse")
2. Load {tidyverse}
  • Type library(tidyverse) in a chunk
    → Click knit
library(tidyverse)

5.1 Pipes (%>%)

・Pipes (%>%) allow you to express a sequence of multiple operations
・Pipes can greatly simplify your code and make your operations more intuitive
・The pipe operator (%>%) becomes available automatically when you load {tidyverse}
・The pipe operator (%>%) is provided by the {magrittr} package
・The {magrittr} package is included in {tidyverse}
→ You need to load either of the following packages to use the pipe operator (%>%)

library(magrittr)  
library(tidyverse)

How pipes (%>%) are used

The pipe operator (%>%) passes the result of the expression on its left to the function on its right as that function’s first argument

Let’s take a look at an example of using the pipe
  • Generate a vector of integers from 1 to 10
1:10
 [1]  1  2  3  4  5  6  7  8  9 10
  • Calculate 1 + 2 + .... + 10
sum(1:10)
[1] 55
  • If you use pipe (%>%), you write R code as follows:
1:10 %>%  # Generate a vector from 1 to 10
  sum()   # Add them all
[1] 55
  • If you want to calculate the square root, …
1:10 %>%    # Generate a vector from 1 to 10
  sum() %>% # Add them all
  sqrt()    # Calculate the square root
[1] 7.416198
  • Generate a vector from 1 to 10 → Add them all → Calculate the square root
  • This makes the sequence of operations easier to interpret
If you don’t use pipes,…
sqrt(sum(1:10))
[1] 7.416198
  • This is less intuitive because you have to think backward
  • Calculate the square root ← Add them all ← Generate a vector
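
The pipe also works when a function takes extra arguments: the left-hand side fills the first argument and the rest are written as usual. The sketch below uses a hypothetical vector x with a missing value to show that the piped and nested versions give the same result.

```r
library(magrittr)

# Hypothetical vector with a missing value
x <- c(4, 9, NA)

# x becomes the first argument of sum(); na.rm = TRUE is an extra argument
piped <- x %>%
  sum(na.rm = TRUE) %>%  # Add the non-missing values
  sqrt() %>%             # Calculate the square root
  round(2)               # Round to 2 decimal places

# The same computation written without pipes
nested <- round(sqrt(sum(x, na.rm = TRUE)), 2)

identical(piped, nested)
```

In other words, x %>% f(y) is just another way of writing f(x, y).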

How to interpret R code with pipes (%>%)
・You can interpret the R code from “4.1 A simple scatter plot” as follows (the df1 used there contains the variables math, stat, and gender):

df1 %>%                         # Use df1 as data
  ggplot(aes(x = math,          # Assign x = math
             y = stat,          # Assign y = stat
             color = gender)) + # Dots are colored by gender
  geom_point()                  # Draw a scatter plot

df1 %>% ggplot() means the first argument of ggplot() is df1

Interpretation of the R code:
Use df1 as data
Assign x = math
Assign y = stat
Dots are colored by gender
Draw a scatter plot
・You don’t have to go backward in interpreting the R code
・R code with pipes (%>%) is intuitive and easy to follow

5.2 Data preparation

  • {dplyr} is one of the {tidyverse} packages and is powerful for manipulating data
  • You can easily select rows and columns with {dplyr}

Data preparation: hr96-21.csv
・Download the Japanese lower house election data (1996-2021): hr96-21.csv
・In your R Project folder, make a new folder named data, and put hr96-21.csv in it

hr <- read_csv("data/hr96-21.csv", na = ".")

hr96-21.csv is a collection of Japanese lower house election data covering 9 national elections (1996, 2000, 2003, 2005, 2009, 2012, 2014, 2017, 2021)
・Check the names of the variables hr contains

names(hr)
 [1] "year"          "pref"          "ku"            "kun"          
 [5] "wl"            "rank"          "nocand"        "seito"        
 [9] "j_name"        "gender"        "name"          "previous"     
[13] "age"           "exp"           "status"        "vote"         
[17] "voteshare"     "eligible"      "turnout"       "seshu_dummy"  
[21] "jiban_seshu"   "nojiban_seshu"

hr has the following 22 variables (the list below also describes castvote, which does not appear in the names(hr) output above)

variable        detail
year            Election year (1996-2021)
pref            Prefecture
ku              Electoral district name
kun             Number of electoral district
rank            Rank by number of votes (1 = most votes)
wl              0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner
nocand          Number of candidates in each district
seito           Candidate’s affiliated party (in Japanese)
j_name          Candidate’s name (Japanese)
name            Candidate’s name (English)
previous        Number of previous wins
gender          Candidate’s gender: “male”, “female”
age             Candidate’s age
exp             Election expenditure (yen) spent by each candidate
status          0 = challenger / 1 = incumbent / 2 = former incumbent
vote            Votes each candidate garnered
voteshare       Vote share (%)
eligible        Eligible voters in each district
turnout         Turnout in each district (%)
castvote        Total votes cast in each district
seshu_dummy     0 = non-hereditary candidate / 1 = hereditary candidate
jiban_seshu     Relationship between candidate and his predecessor
nojiban_seshu   Relationship between candidate and his predecessor

  • Using hr, let’s take a look at what we can do with dplyr

5.2 Select a column: select()

  • Using select(), you can select the column (or columns) you want from the data frame

How to use: without the pipe %>%
select(data frame, Var1, Var2, ...)

  • The pipe operator (%>%) passes what is on its left (in this case, the data frame) to the function on its right (in this case, select()) as its first argument

How to use: with the pipe %>%
data frame %>% select(Var1, Var2, ...)

  • Display the variable names included in the data frame hr
names(hr)
 [1] "year"          "pref"          "ku"            "kun"          
 [5] "wl"            "rank"          "nocand"        "seito"        
 [9] "j_name"        "gender"        "name"          "previous"     
[13] "age"           "exp"           "status"        "vote"         
[17] "voteshare"     "eligible"      "turnout"       "seshu_dummy"  
[21] "jiban_seshu"   "nojiban_seshu"
  • You select 5 variables (year, ku, kun, seito, j_name) from the data frame hr
hr %>%   
  select(year, ku, kun, seito, j_name) 
# A tibble: 9,660 × 5
    year ku      kun seito          j_name    
   <dbl> <chr> <dbl> <chr>          <chr>     
 1  1996 aichi     1 新進           河村たかし
 2  1996 aichi     1 自民           今枝敬雄  
 3  1996 aichi     1 民主           佐藤泰介  
 4  1996 aichi     1 共産           岩中美保子
 5  1996 aichi     1 文化フォーラム 伊東マサコ
 6  1996 aichi     1 国民党         山田浩    
 7  1996 aichi     1 無所           浅野光雪  
 8  1996 aichi     2 新進           青木宏之  
 9  1996 aichi     2 自民           田辺広雄  
10  1996 aichi     2 民主           古川元久  
# … with 9,650 more rows

5.2.1 Select a column (select consecutive variables)

  • If you want to select consecutive variables, you can use the : operator
hr %>%
  select(c(year:j_name)) # Select the 1st to the 9th consecutive variables
# A tibble: 9,660 × 9
    year pref  ku      kun    wl  rank nocand seito          j_name    
   <dbl> <chr> <chr> <dbl> <dbl> <dbl>  <dbl> <chr>          <chr>     
 1  1996 愛知  aichi     1     1     1      7 新進           河村たかし
 2  1996 愛知  aichi     1     0     2      7 自民           今枝敬雄  
 3  1996 愛知  aichi     1     0     3      7 民主           佐藤泰介  
 4  1996 愛知  aichi     1     0     4      7 共産           岩中美保子
 5  1996 愛知  aichi     1     0     5      7 文化フォーラム 伊東マサコ
 6  1996 愛知  aichi     1     0     6      7 国民党         山田浩    
 7  1996 愛知  aichi     1     0     7      7 無所           浅野光雪  
 8  1996 愛知  aichi     2     1     1      8 新進           青木宏之  
 9  1996 愛知  aichi     2     0     2      8 自民           田辺広雄  
10  1996 愛知  aichi     2     2     3      8 民主           古川元久  
# … with 9,650 more rows

Save the result under another object name
・Here, what you did was temporarily select variables and display them
・You did not save the result as an object
・Let’s check how hr looks now

names(hr)
 [1] "year"          "pref"          "ku"            "kun"          
 [5] "wl"            "rank"          "nocand"        "seito"        
 [9] "j_name"        "gender"        "name"          "previous"     
[13] "age"           "exp"           "status"        "vote"         
[17] "voteshare"     "eligible"      "turnout"       "seshu_dummy"  
[21] "jiban_seshu"   "nojiban_seshu"

・The object hr still has 22 variables
・If you want to use an object with only these 9 variables, you need to save it under a different object name from hr, such as hr1

hr1 <- hr %>%
  select(c(year:j_name)) # Select the 1st to the 9th consecutive variables
  • Check hr1
names(hr1)
[1] "year"   "pref"   "ku"     "kun"    "wl"     "rank"   "nocand" "seito" 
[9] "j_name"

・This is how you get what you want

5.2.2 Simultaneously select a column and change the column name

  • Using select(), you can simultaneously select a column and change the column name
  • When you select a column, you write new name = old name
  • For instance, suppose you want to select 6 variables (year, ku, kun, seito, j_name, vote) from the data frame hr and change j_name to namae
hr %>% 
  select(year, ku, kun, seito, namae = j_name, vote)
# A tibble: 9,660 × 6
    year ku      kun seito          namae       vote
   <dbl> <chr> <dbl> <chr>          <chr>      <dbl>
 1  1996 aichi     1 新進           河村たかし 66876
 2  1996 aichi     1 自民           今枝敬雄   42969
 3  1996 aichi     1 民主           佐藤泰介   33503
 4  1996 aichi     1 共産           岩中美保子 22209
 5  1996 aichi     1 文化フォーラム 伊東マサコ   616
 6  1996 aichi     1 国民党         山田浩       566
 7  1996 aichi     1 無所           浅野光雪     312
 8  1996 aichi     2 新進           青木宏之   56101
 9  1996 aichi     2 自民           田辺広雄   44938
10  1996 aichi     2 民主           古川元久   43804
# … with 9,650 more rows

5.2.3 Change the name of a variable

  • Using rename(), you can change the name of a variable
  • Suppose you want to change the variable names as follows:
    ku => senkyoku
    seito => party
hr %>% 
  rename(senkyoku = ku, 
         party = seito)
# A tibble: 9,660 × 22
    year pref  senkyoku   kun    wl  rank nocand party       j_name gender name 
   <dbl> <chr> <chr>    <dbl> <dbl> <dbl>  <dbl> <chr>       <chr>  <chr>  <chr>
 1  1996 愛知  aichi        1     1     1      7 新進        河村…  male   KAWA…
 2  1996 愛知  aichi        1     0     2      7 自民        今枝…  male   IMAE…
 3  1996 愛知  aichi        1     0     3      7 民主        佐藤…  male   SATO…
 4  1996 愛知  aichi        1     0     4      7 共産        岩中…  female IWAN…
 5  1996 愛知  aichi        1     0     5      7 文化フォー… 伊東…  female ITO,…
 6  1996 愛知  aichi        1     0     6      7 国民党      山田浩 male   YAMA…
 7  1996 愛知  aichi        1     0     7      7 無所        浅野…  male   ASAN…
 8  1996 愛知  aichi        2     1     1      8 新進        青木…  male   AOKI…
 9  1996 愛知  aichi        2     0     2      8 自民        田辺…  male   TANA…
10  1996 愛知  aichi        2     2     3      8 民主        古川…  male   FURU…
# … with 9,650 more rows, and 11 more variables: previous <dbl>, age <dbl>,
#   exp <dbl>, status <dbl>, vote <dbl>, voteshare <dbl>, eligible <dbl>,
#   turnout <dbl>, seshu_dummy <dbl>, jiban_seshu <chr>, nojiban_seshu <chr>
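
Both select() and rename() accept new name = old name, but they differ in what happens to the other columns. The sketch below uses a small hypothetical tibble, toy, that borrows a few column names from hr.

```r
library(dplyr)
library(tibble)

# Hypothetical tibble with a few of the column names used in hr
toy <- tibble(year = c(2017, 2021),
              ku = c("aichi", "tokyo"),
              seito = c("A", "B"),
              vote = c(100, 200))

# select(): keeps only the listed columns, renaming seito on the way
selected <- toy %>% select(year, party = seito)
names(selected)

# rename(): keeps every column and changes only the name
renamed <- toy %>% rename(party = seito)
names(renamed)
```

Use select() when you also want to drop columns, and rename() when you want to keep them all.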

5.2.4 How NOT to select a variable

  • If you want to select 20 variables out of 22, it is more efficient to specify the variables you DO NOT want
  • Suppose you DO NOT want to select year and pref from the data frame hr; you use the following R code:
hr %>% 
  select(!c(year, pref))
# A tibble: 9,660 × 20
   ku      kun    wl  rank nocand seito       j_name  gender name  previ…¹   age
   <chr> <dbl> <dbl> <dbl>  <dbl> <chr>       <chr>   <chr>  <chr>   <dbl> <dbl>
 1 aichi     1     1     1      7 新進        河村た… male   KAWA…       2    47
 2 aichi     1     0     2      7 自民        今枝敬… male   IMAE…       2    72
 3 aichi     1     0     3      7 民主        佐藤泰… male   SATO…       2    53
 4 aichi     1     0     4      7 共産        岩中美… female IWAN…       0    43
 5 aichi     1     0     5      7 文化フォー… 伊東マ… female ITO,…       0    51
 6 aichi     1     0     6      7 国民党      山田浩  male   YAMA…       0    51
 7 aichi     1     0     7      7 無所        浅野光… male   ASAN…       0    45
 8 aichi     2     1     1      8 新進        青木宏… male   AOKI…       2    51
 9 aichi     2     0     2      8 自民        田辺広… male   TANA…       0    71
10 aichi     2     2     3      8 民主        古川元… male   FURU…       0    30
# … with 9,650 more rows, 9 more variables: exp <dbl>, status <dbl>,
#   vote <dbl>, voteshare <dbl>, eligible <dbl>, turnout <dbl>,
#   seshu_dummy <dbl>, jiban_seshu <chr>, nojiban_seshu <chr>, and abbreviated
#   variable name ¹​previous
  • Combining ! with c() and :, you can drop consecutive variables
  • Suppose you DO NOT want to select the 7 consecutive variables from year to nocand; you use the following R code:
hr %>% 
  select(!c(year:nocand))
# A tibble: 9,660 × 15
   seito  j_name gender name  previ…¹   age     exp status  vote votes…² eligi…³
   <chr>  <chr>  <chr>  <chr>   <dbl> <dbl>   <dbl>  <dbl> <dbl>   <dbl>   <dbl>
 1 新進   河村…  male   KAWA…       2    47  9.83e6      1 66876    40    346774
 2 自民   今枝…  male   IMAE…       2    72  9.31e6      2 42969    25.7  346774
 3 民主   佐藤…  male   SATO…       2    53  9.23e6      1 33503    20.1  346774
 4 共産   岩中…  female IWAN…       0    43  2.18e6      0 22209    13.3  346774
 5 文化…  伊東…  female ITO,…       0    51 NA           0   616     0.4  346774
 6 国民党 山田浩 male   YAMA…       0    51 NA           0   566     0.3  346774
 7 無所   浅野…  male   ASAN…       0    45 NA           0   312     0.2  346774
 8 新進   青木…  male   AOKI…       2    51  1.29e7      1 56101    32.9  338310
 9 自民   田辺…  male   TANA…       0    71  1.65e7      2 44938    26.4  338310
10 民主   古川…  male   FURU…       0    30  1.14e7      0 43804    25.7  338310
# … with 9,650 more rows, 4 more variables: turnout <dbl>, seshu_dummy <dbl>,
#   jiban_seshu <chr>, nojiban_seshu <chr>, and abbreviated variable names
#   ¹​previous, ²​voteshare, ³​eligible
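
In code written by others you may also see a minus sign used to drop columns. As a hedged illustration on a hypothetical tibble, toy, the sketch below checks that ! and - give the same columns in this simple case.

```r
library(dplyr)
library(tibble)

# Hypothetical tibble with a few of the column names used in hr
toy <- tibble(year = 2021, pref = "愛知", ku = "aichi", kun = 1, vote = 100)

dropped_bang  <- toy %>% select(!c(year, pref))   # drop with !
dropped_minus <- toy %>% select(-c(year, pref))   # drop with -
dropped_range <- toy %>% select(!c(year:ku))      # drop a consecutive range

names(dropped_bang)
names(dropped_minus)
names(dropped_range)
```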

5.2.5 Select variables starting with a particular string: starts_with()

  • hr includes the following 22 variables:
names(hr)
 [1] "year"          "pref"          "ku"            "kun"          
 [5] "wl"            "rank"          "nocand"        "seito"        
 [9] "j_name"        "gender"        "name"          "previous"     
[13] "age"           "exp"           "status"        "vote"         
[17] "voteshare"     "eligible"      "turnout"       "seshu_dummy"  
[21] "jiban_seshu"   "nojiban_seshu"
  • You can select variables whose names start with ku in addition to j_name
hr %>% 
  select(j_name, starts_with("ku"))
# A tibble: 9,660 × 3
   j_name     ku      kun
   <chr>      <chr> <dbl>
 1 河村たかし aichi     1
 2 今枝敬雄   aichi     1
 3 佐藤泰介   aichi     1
 4 岩中美保子 aichi     1
 5 伊東マサコ aichi     1
 6 山田浩     aichi     1
 7 浅野光雪   aichi     1
 8 青木宏之   aichi     2
 9 田辺広雄   aichi     2
10 古川元久   aichi     2
# … with 9,650 more rows

5.2.6 Select variables ending with a particular string: ends_with()

  • You can select variables which end with seshu in addition to j_name
hr %>% 
  select(j_name, ends_with("seshu"))
# A tibble: 9,660 × 3
   j_name     jiban_seshu                    nojiban_seshu
   <chr>      <chr>                          <chr>        
 1 河村たかし <NA>                           <NA>         
 2 今枝敬雄   <NA>                           <NA>         
 3 佐藤泰介   <NA>                           <NA>         
 4 岩中美保子 <NA>                           <NA>         
 5 伊東マサコ <NA>                           <NA>         
 6 山田浩     <NA>                           <NA>         
 7 浅野光雪   <NA>                           <NA>         
 8 青木宏之   <NA>                           <NA>         
 9 田辺広雄   伯父=加藤鐐五郎(衆議院議員) <NA>         
10 古川元久   <NA>                           <NA>         
# … with 9,650 more rows
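
The difference between starts_with() and ends_with() is easy to check on a small example. The sketch below uses a hypothetical one-row tibble, toy, with some of the column names from hr; contains() is an additional helper that does not appear in the original notes.

```r
library(dplyr)
library(tibble)

# Hypothetical one-row tibble with some of the column names from hr
toy <- tibble(j_name = "A", ku = "aichi", kun = 1,
              jiban_seshu = NA_character_, nojiban_seshu = NA_character_)

start_ku  <- toy %>% select(starts_with("ku"))   # names beginning with "ku"
end_ku    <- toy %>% select(ends_with("ku"))     # names ending with "ku"
end_seshu <- toy %>% select(ends_with("seshu"))  # names ending with "seshu"
has_seshu <- toy %>% select(contains("seshu"))   # names containing "seshu" anywhere

names(start_ku)
names(end_ku)
names(end_seshu)
names(has_seshu)
```

Here starts_with("ku") picks up both ku and kun, while ends_with("ku") picks up ku only.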

5.2.7 Select variables in the order you like

  • You can select variables in the order you like with the following R code:
names(hr)
 [1] "year"          "pref"          "ku"            "kun"          
 [5] "wl"            "rank"          "nocand"        "seito"        
 [9] "j_name"        "gender"        "name"          "previous"     
[13] "age"           "exp"           "status"        "vote"         
[17] "voteshare"     "eligible"      "turnout"       "seshu_dummy"  
[21] "jiban_seshu"   "nojiban_seshu"
  • Suppose you want to select year first
  • Then, suppose you want the 12 variables (name to nojiban_seshu) after year
  • Finally, suppose you want the 9 variables (pref to gender) after nojiban_seshu
  • You can make this happen with the following R code:
hr1 <- hr %>% 
  select(year, name:nojiban_seshu, pref:gender)

names(hr1)
 [1] "year"          "name"          "previous"      "age"          
 [5] "exp"           "status"        "vote"          "voteshare"    
 [9] "eligible"      "turnout"       "seshu_dummy"   "jiban_seshu"  
[13] "nojiban_seshu" "pref"          "ku"            "kun"          
[17] "wl"            "rank"          "nocand"        "seito"        
[21] "j_name"        "gender"       

5.2.8 How to change the order of variables: relocate()

  • Suppose you want to keep all the variables and only change their order
  • Suppose you want to put the group of variables (from name to nojiban_seshu) right after year

How to use relocate()

Relocate var2 after var1
data frame %>% relocate(var2, .after = var1)

Relocate var2 before var1
data frame %>% relocate(var2, .before = var1)

hr2 <- hr %>% 
  relocate(name:nojiban_seshu, .after = year)
names(hr2)
 [1] "year"          "name"          "previous"      "age"          
 [5] "exp"           "status"        "vote"          "voteshare"    
 [9] "eligible"      "turnout"       "seshu_dummy"   "jiban_seshu"  
[13] "nojiban_seshu" "pref"          "ku"            "kun"          
[17] "wl"            "rank"          "nocand"        "seito"        
[21] "j_name"        "gender"       
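
The two forms of relocate() described above can be tried on a small data frame. The sketch below uses a hypothetical toy tibble (toy, with columns a to d), not the hr data, so the column names are assumptions made only for illustration.

```r
library(dplyr)

# Hypothetical toy data frame (not the hr data used in this section)
toy <- tibble::tibble(a = 1, b = 2, c = 3, d = 4)

# Move c and d so that they come right after a
after_a <- toy %>% relocate(c:d, .after = a)
names(after_a)    # "a" "c" "d" "b"

# Move d so that it comes right before a
before_a <- toy %>% relocate(d, .before = a)
names(before_a)   # "d" "a" "b" "c"

# With neither .before nor .after, the variable moves to the front
to_front <- toy %>% relocate(c)
names(to_front)   # "c" "a" "b" "d"
```

In every case all four columns are kept; only their positions change, which is the difference from select().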

5.3 Select a row: filter()

select(): Select a column
filter(): Select a row

How to use filter()
data frame %>% filter(condition1, condition2,...)

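  • To use filter(), you need to understand what logical operators such as ==, >, and & mean: each one returns TRUE or FALSE for every row, and filter() keeps the rows where the result is TRUE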

  • Let’s use the data frame on Japanese lower house election hr here

names(hr)
 [1] "year"          "pref"          "ku"            "kun"          
 [5] "wl"            "rank"          "nocand"        "seito"        
 [9] "j_name"        "gender"        "name"          "previous"     
[13] "age"           "exp"           "status"        "vote"         
[17] "voteshare"     "eligible"      "turnout"       "seshu_dummy"  
[21] "jiban_seshu"   "nojiban_seshu"
  • Using unique(), check the elements (values) in year
unique(hr$year)
[1] 1996 2000 2003 2005 2009 2012 2014 2017 2021
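
Before filtering, it may help to see what the logical operators return on a plain vector. The following is a minimal sketch using a hypothetical vector of test scores (score); it needs no packages.

```r
# Hypothetical test scores, used only to illustrate the operators
score <- c(43, 74, 80, 37)

is_80   <- score == 80               # equal to 80?
not_80  <- score != 80               # not equal to 80?
over_40 <- score > 40                # greater than 40?
both    <- score > 40 & score < 80   # AND: both conditions hold
either  <- score < 40 | score > 75   # OR: at least one condition holds

is_80     # FALSE FALSE  TRUE FALSE
both      #  TRUE  TRUE FALSE FALSE
either    # FALSE FALSE  TRUE  TRUE

# Subsetting with a logical vector keeps the TRUE positions,
# which is what filter() does row by row
score[both]   # 43 74
```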

Select the candidates running in the 2021 HR election

hr %>% 
  filter(year == 2021)
# A tibble: 857 × 22
    year pref  ku      kun    wl  rank nocand seito j_name  gender name  previ…¹
   <dbl> <chr> <chr> <dbl> <dbl> <dbl>  <dbl> <chr> <chr>   <chr>  <chr>   <dbl>
 1  2021 愛知  aichi     1     1     1      3 自民  熊田裕… male   KUMA…       3
 2  2021 愛知  aichi     1     2     2      3 立憲  吉田統… male   YOSH…       2
 3  2021 愛知  aichi     1     0     3      3 N党  門田節… female KADO…       0
 4  2021 愛知  aichi     2     1     1      2 国民  古川元… male   FURU…       8
 5  2021 愛知  aichi     2     2     2      2 自民  中川貴… male   NAKA…       0
 6  2021 愛知  aichi     3     1     1      2 立憲  近藤昭… male   KOND…       8
 7  2021 愛知  aichi     3     2     2      2 自民  池田佳… male   IKED…       3
 8  2021 愛知  aichi     4     1     1      3 自民  工藤彰… male   KUDO…       3
 9  2021 愛知  aichi     4     2     2      3 立憲  牧義夫  male   MAKI…       6
10  2021 愛知  aichi     4     0     3      3 維新  中田千… female NAKA…       0
# … with 847 more rows, 10 more variables: age <dbl>, exp <dbl>, status <dbl>,
#   vote <dbl>, voteshare <dbl>, eligible <dbl>, turnout <dbl>,
#   seshu_dummy <dbl>, jiban_seshu <chr>, nojiban_seshu <chr>, and abbreviated
#   variable name ¹​previous

Select the LDP candidates

  • Using unique(), check the elements (values) in seito in the data frame hr
unique(hr$seito)
 [1] "新進"                   "自民"                   "民主"                  
 [4] "共産"                   "文化フォーラム"         "国民党"                
 [7] "無所"                   "自由連合"               "政事公団太平会"        
[10] "新社会"                 "社民"                   "新党さきがけ"          
[13] "沖縄社会大衆党"         "市民新党にいがた"       "緑の党"                
[16] "さわやか神戸・市民の会" "民主改革連合"           "青年自由"              
[19] "日本新進"               "公明"                   "諸派"                  
[22] "保守"                   "無所属の会"             "自由"                  
[25] "改革クラブ"             "保守新"                 "ニューディールの会"    
[28] "新党尊命"               "世界経済共同体党"       "新党日本"              
[31] "国民新党"               "新党大地"               "幸福"                  
[34] "みんな"                 "改革"                   "日本未来"              
[37] "日本維新の会"           "当たり前"               "政治団体代表"          
[40] "安楽死党"               "アイヌ民族党"           "次世"                  
[43] "維新"                   "生活"                   "立憲"                  
[46] "希望"                   "緒派"                   ""                      
[49] "N党"                   "国民"                   "れい"                  
  • Select the LDP candidates
hr %>% 
  filter(seito == "自民")
# A tibble: 2,542 × 22
    year pref  ku      kun    wl  rank nocand seito j_name  gender name  previ…¹
   <dbl> <chr> <chr> <dbl> <dbl> <dbl>  <dbl> <chr> <chr>   <chr>  <chr>   <dbl>
 1  1996 愛知  aichi     1     0     2      7 自民  今枝敬… male   IMAE…       2
 2  1996 愛知  aichi     2     0     2      8 自民  田辺広… male   TANA…       0
 3  1996 愛知  aichi     3     0     2      7 自民  片岡武… male   KATA…       2
 4  1996 愛知  aichi     4     0     2      6 自民  塚本三… male   TSUK…       9
 5  1996 愛知  aichi     5     2     2      7 自民  木村隆… male   KIMU…       0
 6  1996 愛知  aichi     6     0     2      8 自民  伊藤勝… male   ITO,…       0
 7  1996 愛知  aichi     7     0     2      7 自民  丹羽太… male   NIWA…       0
 8  1996 愛知  aichi     8     1     1      5 自民  久野統… male   KUNO…       2
 9  1996 愛知  aichi     9     0     2      7 自民  吉川博  male   YOSH…       0
10  1996 愛知  aichi    10     0     2      7 自民  森治男  male   MORI…       0
# … with 2,532 more rows, 10 more variables: age <dbl>, exp <dbl>,
#   status <dbl>, vote <dbl>, voteshare <dbl>, eligible <dbl>, turnout <dbl>,
#   seshu_dummy <dbl>, jiban_seshu <chr>, nojiban_seshu <chr>, and abbreviated
#   variable name ¹​previous

Simultaneously use select() and filter()

  • Suppose you want to select the LDP candidates (seito == "自民") running in the 2021 HR election (year == 2021)
hr2021 <- hr %>% 
  filter(year == 2021) %>% 
  filter(seito == "自民") %>% 
  select(pref:gender)
  • Check whether you got it right
DT::datatable(hr2021)

Order matters in using both filter() and select()

① Order matters: filter() first, then select()

  • Suppose you use the following R code:
hr2021 <- hr %>% 
  select(pref:gender) %>% 
  filter(year == 2021) %>% 
  filter(seito == "自民") 
  • Then the following error returns:

Error in filter():
! Problem while computing ..1 = year == 2021.
Caused by error:
! object year not found

  • This error tells you that the object year was not found.
  • This makes perfect sense: year was already dropped in the second line of the code, select(pref:gender), so filter() can no longer find it
Solution:
  • Change the order of R code as follows:
hr2021 <- hr %>% 
  filter(year == 2021) %>% 
  filter(seito == "自民") %>% 
  select(pref:gender)

Variable class matters

  • If the variable class is numeric (that is, double, numeric, or integer), write the value in the condition as a bare number, without quotation marks.
  • If the variable class is character, write the value in the condition within quotation marks ("").
Examples:

filter(year == 2021)・・・the variable class of the year is numeric
filter(seito == "自民")・・・the variable class of the seito is character
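
If you are not sure which class a variable has, class() tells you before you write the condition. The sketch below uses a hypothetical two-row tibble (toy) that only mimics the year and seito columns of hr.

```r
library(dplyr)

# Hypothetical toy data mimicking two columns of hr
toy <- tibble::tibble(
  year  = c(2017, 2021),
  seito = c("自民", "立憲")
)

class(toy$year)    # "numeric"   -> compare with a bare number
class(toy$seito)   # "character" -> compare with a quoted string

rows_2021 <- toy %>% filter(year == 2021)
rows_cdp  <- toy %>% filter(seito == "立憲")
```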

Select rows simultaneously satisfying 2 conditions:

  • Use either the & operator or a comma (,) between the conditions
  • For instance, suppose you want to select the rows simultaneously satisfying the following two conditions from hr:

・Condition 1: Candidates nominated in the 2021 HR election (year == 2021)
・Condition 2: Female Candidates (gender == "female")

hr %>% 
  filter(year == 2021 & gender == "female") %>% # Select 2021 HR election data and female candidates  
  select(pref, kun, seito, j_name, gender, rank, wl) # Select 7 variables  
# A tibble: 140 × 7
   pref    kun seito j_name     gender  rank    wl
   <chr> <dbl> <chr> <chr>      <chr>  <dbl> <dbl>
 1 愛知      1 N党  門田節代   female     3     0
 2 愛知      4 維新  中田千代   female     3     0
 3 愛知      5 維新  岬麻紀     female     3     2
 4 愛知      7 共産  須山初美   female     3     0
 5 愛知     10 れい  安井美沙子 female     4     0
 6 愛媛      2 国民  石井智恵   female     2     0
 7 茨城      4 維新  武藤優子   female     2     0
 8 茨城      4 共産  大内久美子 female     3     0
 9 茨城      5 共産  飯田美弥子 female     3     0
10 茨城      6 自民  国光文乃   female     1     1
# … with 130 more rows
  • The first line of the output (# A tibble: 140 × 7) shows that the number of female candidates nominated in the 2021 HR election was 140
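
As a minimal sketch of the two equivalent ways to combine conditions, the code below filters a hypothetical toy tibble (toy) with & and with a comma, then counts the matching rows with nrow() instead of reading the count off the printed output.

```r
library(dplyr)

# Hypothetical toy data mimicking two columns of hr
toy <- tibble::tibble(
  year   = c(2017, 2021, 2021, 2021),
  gender = c("female", "female", "male", "female")
)

with_amp   <- toy %>% filter(year == 2021 & gender == "female")
with_comma <- toy %>% filter(year == 2021, gender == "female")

identical(with_amp, with_comma)   # TRUE: the two forms give the same rows
nrow(with_amp)                    # 2
```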

Select rows satisfying one of the multiple conditions:

  • Use the operator |
  • For instance, suppose you want to select the rows satisfying one of the following two conditions from hr:

・Condition 1: CDP Candidates nominated in the 2021 HR election (seito == "立憲")
・Condition 2: SDP Candidates nominated in the 2021 HR election (seito == "社民")

hr %>% 
  filter(year == 2021) %>%  # Select 2021 HR election data 
  filter(seito  == "立憲" | seito == "社民") %>%  # Select either CDP or SDP candidates   
  select(pref, kun, seito, j_name, gender, rank, wl) # Select 7 variables  
# A tibble: 223 × 7
   pref    kun seito j_name   gender  rank    wl
   <chr> <dbl> <chr> <chr>    <chr>  <dbl> <dbl>
 1 愛知      1 立憲  吉田統彦 male       2     2
 2 愛知      3 立憲  近藤昭一 male       1     1
 3 愛知      4 立憲  牧義夫   male       2     2
 4 愛知      5 立憲  西川厚志 male       2     0
 5 愛知      6 立憲  松田功   male       2     0
 6 愛知      7 立憲  森本和義 male       2     0
 7 愛知      8 立憲  伴野豊   male       2     2
 8 愛知      9 立憲  岡本充功 male       2     0
 9 愛知     10 立憲  藤原規真 male       3     0
10 愛知     12 立憲  重徳和彦 male       1     1
# … with 213 more rows
  • The first line of the output (# A tibble: 223 × 7) shows that the number of CDP or SDP candidates nominated in the 2021 HR election was 223

Select rows using both & and |

  • Suppose you want to select the candidates in the 2021 HR election who satisfy either condition 1 or condition 2, and who also satisfy condition 3.

・Condition 1: CDP Candidates nominated in the 2021 HR election (seito == "立憲")
・Condition 2: SDP Candidates nominated in the 2021 HR election (seito == "社民")
・Condition 3: Winners in single-member-district (wl == 1)

hr %>% 
  filter(year == 2021) %>%    
  filter(seito  == "立憲" | seito == "社民") %>% 
  filter(wl == 1) %>% 
  select(pref, kun, seito, j_name, gender, rank, wl) 
# A tibble: 58 × 7
   pref    kun seito j_name     gender  rank    wl
   <chr> <dbl> <chr> <chr>      <chr>  <dbl> <dbl>
 1 愛知      3 立憲  近藤昭一   male       1     1
 2 愛知     12 立憲  重徳和彦   male       1     1
 3 愛知     13 立憲  大西健介   male       1     1
 4 沖縄      2 社民  新垣邦男   male       1     1
 5 岩手      1 立憲  階猛       male       1     1
 6 宮崎      1 立憲  渡辺創     male       1     1
 7 宮城      2 立憲  鎌田さゆり female     1     1
 8 宮城      5 立憲  安住淳     male       1     1
 9 広島      6 立憲  佐藤公治   male       1     1
10 香川      1 立憲  小川淳也   male       1     1
# … with 48 more rows
  • The first line of the output (# A tibble: 58 × 7) shows that the number of CDP or SDP single-member-district winners in the 2021 HR election was 58.
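
If you write the | and & conditions inside a single filter() instead of chaining separate filter() calls, note that R evaluates & before |, so parentheses are needed around the | part. The sketch below shows the difference on a hypothetical four-row tibble (toy).

```r
library(dplyr)

# Hypothetical toy data mimicking two columns of hr
toy <- tibble::tibble(
  seito = c("立憲", "立憲", "社民", "自民"),
  wl    = c(1, 0, 1, 1)
)

# Intended: (CDP or SDP) and winner
with_parens <- toy %>%
  filter((seito == "立憲" | seito == "社民") & wl == 1)
nrow(with_parens)   # 2

# Without parentheses R reads: CDP, or (SDP and winner),
# so the CDP candidate with wl == 0 is also kept
no_parens <- toy %>%
  filter(seito == "立憲" | seito == "社民" & wl == 1)
nrow(no_parens)     # 3
```

Chaining one filter() per step, as in the code above, avoids the issue because each filter() is applied to the result of the previous one.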

%in%

  • By using %in%, you can further simplify the R code:
hr %>% 
  filter(year == 2021) %>%   
  filter(seito %in% c("立憲","社民"), 
         wl == 1) %>% 
  select(pref, kun, seito, j_name, gender, rank, wl) 
# A tibble: 58 × 7
   pref    kun seito j_name     gender  rank    wl
   <chr> <dbl> <chr> <chr>      <chr>  <dbl> <dbl>
 1 愛知      3 立憲  近藤昭一   male       1     1
 2 愛知     12 立憲  重徳和彦   male       1     1
 3 愛知     13 立憲  大西健介   male       1     1
 4 沖縄      2 社民  新垣邦男   male       1     1
 5 岩手      1 立憲  階猛       male       1     1
 6 宮崎      1 立憲  渡辺創     male       1     1
 7 宮城      2 立憲  鎌田さゆり female     1     1
 8 宮城      5 立憲  安住淳     male       1     1
 9 広島      6 立憲  佐藤公治   male       1     1
10 香川      1 立憲  小川淳也   male       1     1
# … with 48 more rows
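
To see what %in% does on its own, the sketch below applies it to a hypothetical character vector (party). It also shows why == with a vector on the right-hand side is not a substitute: == compares element by element and recycles the shorter vector.

```r
# Hypothetical vector of party labels
party <- c("社民", "立憲", "自民", "維新")

in_set <- party %in% c("立憲", "社民")
in_set        #  TRUE  TRUE FALSE FALSE

# Same result as writing the conditions out with |
with_or <- party == "立憲" | party == "社民"
identical(in_set, with_or)   # TRUE

# Negation: everything except CDP and SDP
not_in_set <- !(party %in% c("立憲", "社民"))
not_in_set    # FALSE FALSE  TRUE  TRUE

# Pitfall: == pairs elements by position ("社民" vs "立憲", "立憲" vs "社民", ...)
wrong <- party == c("立憲", "社民")
wrong         # FALSE FALSE FALSE FALSE
```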

6. Missing values

  • Usually, a data set will include missing values.
  • In R, you represent a missing value as NA.
  • Here, we delete the row which contains a missing value.
  • If your data frame includes missing values, many R functions return NA instead of a result unless you tell them how to handle the missing values.
  • For example, candidate election campaign expenditure (exp) includes a number of missing values.
  • Suppose you want to calculate the mean of exp.
mean(hr$exp)
[1] NA
  • You get NA

Solution ①:

  • Add na.rm = TRUE within mean() to exclude the missing values when calculating the mean.
mean(hr$exp, na.rm = TRUE)
[1] 7551393
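
To check how many values are missing before deciding what to do with them, is.na() can be combined with sum(). The following is a minimal sketch on a hypothetical vector (spending) standing in for a column such as exp.

```r
# Hypothetical expenditure values with two missing entries
spending <- c(100, NA, 300, NA)

missing_flags <- is.na(spending)
missing_flags          # FALSE  TRUE FALSE  TRUE

n_missing <- sum(missing_flags)   # TRUE counts as 1
n_missing              # 2

mean(spending)                       # NA: one missing value is enough
mean_ok <- mean(spending, na.rm = TRUE)
mean_ok                # 200, the mean of 100 and 300
```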

Solution ②:

Drop the missing values before calculating the mean.

  • You add drop_na() before calculating the mean.
hr %>% 
  drop_na(exp) %>% 
  select(year, j_name, exp) 
# A tibble: 6,829 × 3
    year j_name          exp
   <dbl> <chr>         <dbl>
 1  1996 河村たかし  9828097
 2  1996 今枝敬雄    9311555
 3  1996 佐藤泰介    9231284
 4  1996 岩中美保子  2177203
 5  1996 青木宏之   12940178
 6  1996 田辺広雄   16512426
 7  1996 古川元久   11435567
 8  1996 石山淳一    2128510
 9  1996 藤原美智子  3270533
10  1996 吉田幸弘   11245219
# … with 6,819 more rows
  • Now, we see the list of the 6,829 candidates who reported their campaign expenditure.

  • Calculate the mean of exp for these 6,829 candidates.

hr1 <- hr %>% 
  drop_na(exp) %>% 
  select(year, j_name, exp) 
mean(hr1$exp)
[1] 7551393
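
Note that drop_na() behaves differently depending on whether you name a variable. The sketch below uses a hypothetical three-row tibble (toy) to compare drop_na(exp), drop_na() with no arguments, and the equivalent filter() form.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with missing values in two different columns
toy <- tibble::tibble(
  j_name = c("A", "B", "C"),
  exp    = c(100, NA, 300),
  age    = c(NA, 40, 50)
)

# Drop rows where exp is missing; the NA in age is left alone
by_exp <- toy %>% drop_na(exp)
nrow(by_exp)     # 2 (A and C)

# With no variable named, drop rows with a missing value in ANY column
by_any <- toy %>% drop_na()
nrow(by_any)     # 1 (only C is complete)

# filter() with !is.na() keeps the same rows as drop_na(exp)
by_filter <- toy %>% filter(!is.na(exp))
identical(by_exp$j_name, by_filter$j_name)   # TRUE
```

Naming the variable, as in drop_na(exp), keeps as many rows as possible for the calculation you actually need.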

7. Exercise

In reference to 5.3 Select a row: filter(), answer the following 2 questions:
・Use the following 7 variables

  1. year
  2. pref
  3. kun
  4. wl
  5. seito
  6. j_name
  7. voteshare

・In answering these questions, use hr96_21.csv

  • Q1: Select the candidates with the following conditions and show the data frame with DT::datatable()
  1. Those female candidates nominated in the 2005 HR election
  2. Those who won in a single-member district (wl == 1)
  3. The DPJ candidates (seito == "民主")
  • Q2: Select the candidates with the following conditions and show the data frame with DT::datatable()
  1. Those female candidates nominated in the 2009 HR election
  2. Those who won in a single-member district (wl == 1)
  3. The DPJ candidates (seito == "民主")