R packages used in this section
library(haven)
library(readxl)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6 ✔ purrr 0.3.4
✔ tibble 3.1.8 ✔ dplyr 1.0.9
✔ tidyr 1.2.0 ✔ stringr 1.4.1
✔ readr 2.1.2 ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
{tidyverse} includes 8 useful packages
{readr} is one of the packages included in {tidyverse}
A variable is also called a vector
A vector is the smallest unit in data handling
A vector is an object which includes one or more values (= elements) of the same class (such as numeric, character, factor, …)
All of the elements in a vector should be of the same variable class
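The "same class" rule can be seen in a minimal sketch like the one below. It is not part of the original lesson, and the object names nums and mixed are made up for illustration. When values of different classes are combined with c(), R converts them all to a single class.

```r
# Hypothetical example vectors (names are illustrative only)
nums  <- c(1, 2, 3)      # all elements are numeric
mixed <- c(1, "two", 3)  # numeric and character values combined

class(nums)    # "numeric"
class(mixed)   # "character": the numbers were converted to text
length(mixed)  # 3
```

Because the numbers in mixed became text, arithmetic such as mean(mixed) no longer works on it, which is why the class of a variable is worth checking early.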
Let’s make a vector (= variable) including 8 elements (= values)
Name it "id"
id <- c(1, 2, 3, 4, 5, 6, 7, 8)
Make a variable including 8 different names, and name it "name"
name <- c("Joe", "Carol", "Mike", "Ross", "Shira", "Inha", "Jih-wen", "Amital")
Make a variable named "score"
score <- c(43, 74, 80, 37, 20, 83, 64, 35)
Two ways of making a data frame
1. data.frame(var1, var2, ...)
2. tibble::tibble(var1, var2, ...) (recommended)
・tibble::tibble() is an extended version of data.frame()
・tibble::tibble() is recommended in terms of versatility
We use tibble() here
You need to load {tidyverse} to use tibble()
library(tidyverse)
df1 <- tibble(id, name, score)
df1
# A tibble: 8 × 3
id name score
<dbl> <chr> <dbl>
1 1 Joe 43
2 2 Carol 74
3 3 Mike 80
4 4 Ross 37
5 5 Shira 20
6 6 Inha 83
7 7 Jih-wen 64
8 8 Amital 35
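As a rough comparison of the two approaches, the sketch below builds the same data with data.frame() and with tibble(). It loads only {tibble} rather than the whole {tidyverse}; the object names df_base and df_tbl are made up for this illustration.

```r
library(tibble)

# The same three vectors used above
id    <- c(1, 2, 3, 4, 5, 6, 7, 8)
name  <- c("Joe", "Carol", "Mike", "Ross", "Shira", "Inha", "Jih-wen", "Amital")
score <- c(43, 74, 80, 37, 20, 83, 64, 35)

df_base <- data.frame(id, name, score)  # base R data frame
df_tbl  <- tibble(id, name, score)      # tibble

class(df_base)  # "data.frame"
class(df_tbl)   # "tbl_df" "tbl" "data.frame"
dim(df_tbl)     # 8 rows and 3 columns
```

A tibble is still a data frame, so functions written for data frames generally accept it; the main visible difference is the more compact printing with column types.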
Add a new variable, gakka (= major), to the data frame you made, df1, with the following R code:
df1$gakka <- c("Poli-sci", "Econ", "Stat", "History", "Poli-sci", "Econ", "Stat", "Business")
df1
# A tibble: 8 × 4
id name score gakka
<dbl> <chr> <dbl> <chr>
1 1 Joe 43 Poli-sci
2 2 Carol 74 Econ
3 3 Mike 80 Stat
4 4 Ross 37 History
5 5 Shira 20 Poli-sci
6 6 Inha 83 Econ
7 7 Jih-wen 64 Stat
8 8 Amital 35 Business
Add another variable (gender) to df1
df1$gender <- c("male", "female", "male", "male", "female", "male", "male", "female")
df1
# A tibble: 8 × 5
id name score gakka gender
<dbl> <chr> <dbl> <chr> <chr>
1 1 Joe 43 Poli-sci male
2 2 Carol 74 Econ female
3 3 Mike 80 Stat male
4 4 Ross 37 History male
5 5 Shira 20 Poli-sci female
6 6 Inha 83 Econ male
7 7 Jih-wen 64 Stat male
8 8 Amital 35 Business female
Let’s make another data frame, df2
df2 includes the following two variables:
①id
②city
Let’s make a vector (= variable) including 8 elements (= values)
Name it "id"
id <- c(1, 2, 3, 4, 5, 6, 7, 8)
Make a variable including the cities the 8 people come from, and name it "city"
city <- c("Eugene", "Portland", "Chicago", "Oxnard", "Seattle", "Seoul", "Taipei", "Seattle")
df2 <- tibble(id, city)
df2
# A tibble: 8 × 2
id city
<dbl> <chr>
1 1 Eugene
2 2 Portland
3 3 Chicago
4 4 Oxnard
5 5 Seattle
6 6 Seoul
7 7 Taipei
8 8 Seattle
Merge df1 and df2 by id, which is included in both df1 and df2
M <- merge(df1, df2, by = "id")
M
id name score gakka gender city
1 1 Joe 43 Poli-sci male Eugene
2 2 Carol 74 Econ female Portland
3 3 Mike 80 Stat male Chicago
4 4 Ross 37 History male Oxnard
5 5 Shira 20 Poli-sci female Seattle
6 6 Inha 83 Econ male Seoul
7 7 Jih-wen 64 Stat male Taipei
8 8 Amital 35 Business female Seattle
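In the example above every id appears in both data frames. The sketch below, using made-up data frames named left and right, illustrates what merge() does when some ids are missing on one side: by default only matching rows are kept, while all = TRUE keeps every row and fills the gaps with NA.

```r
# Hypothetical small data frames; the values are made up
left  <- data.frame(id = c(1, 2, 3), name = c("Joe", "Carol", "Mike"))
right <- data.frame(id = c(2, 3, 4), city = c("Portland", "Chicago", "Oxnard"))

inner <- merge(left, right, by = "id")              # only ids 2 and 3
outer <- merge(left, right, by = "id", all = TRUE)  # ids 1 to 4, NA where missing

nrow(inner)  # 2
nrow(outer)  # 4
outer
```

Checking nrow() before and after a merge is a simple way to notice rows that were silently dropped.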
The folder R has 4 files with two types of extensions (.html & .Rmd)
A folder is also called a directory
In a path, slashes (/) separate the directories
Type getwd(), meaning get working directory, then you see the path showing where you are
The result of getwd() on my computer:
getwd()
"/Users/asanomasahiko/Dropbox/statistics/class_materials/R"
On a Windows computer, you see the C drive instead of Users
You are in R, on the right end of the path above
R is the name of your R Project folder (= working directory)
You can change the working directory with setwd()
Do not use Japanese characters in a file name or a directory name
Do not insert spaces in a file name or a directory name
Use half-width (ASCII) letters, numbers, and symbols in a file name and a directory name
Avoid using a number or a symbol as the first character of a file name or a directory name
Example:
Bad: | 2021_grades (starts with numbers) |
Good: | grades_2021 |
Bad: | grades 2021 (a space inserted between grades & 2021) |
Good: | grades2021 |
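The naming rules above can be checked mechanically. The helper below, is_safe_name(), is a hypothetical function written for this illustration (it is not part of R or of the original lesson). It assumes a "safe" name starts with an ASCII letter and continues with ASCII letters, digits, underscores, hyphens, or dots.

```r
# Hypothetical helper: TRUE if a name follows the rules listed above
is_safe_name <- function(x) {
  # starts with an ASCII letter, then only ASCII letters, digits, "_", "." or "-"
  grepl("^[A-Za-z][A-Za-z0-9_.-]*$", x)
}

is_safe_name(c("grades_2021", "2021_grades", "grades 2021", "grades2021"))
# TRUE FALSE FALSE TRUE
```

The exact set of allowed symbols is a judgment call; the pattern can be tightened or loosened to match your own conventions.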
R and R packages have various embedded datasets
Type data() in Console and hit the return key (or Enter key)
data()
For instance, check the 7th dataset, state.x77, and show the first 6 rows:
Type head(state.x77) in Console and hit the return key (or Enter key)
head(state.x77)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
California 21198 5114 1.1 71.71 10.3 62.6 20 156361
Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
Let’s display the last 6 rows:
Type tail(state.x77)
in Console and hit the return
key (or Enter key)
tail(state.x77)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Vermont 472 3907 0.6 71.64 5.5 57.1 168 9267
Virginia 4981 4701 1.4 70.08 9.5 47.8 85 39780
Washington 3559 4864 0.6 71.72 4.3 63.5 32 66570
West Virginia 1799 3617 1.4 69.48 6.7 41.6 100 24070
Wisconsin 4589 4468 0.7 72.48 3.0 54.5 149 54464
Wyoming 376 4566 0.6 70.29 6.9 62.9 173 97203
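head() and tail() show 6 rows by default. As a small supplementary sketch, the n argument changes how many rows are shown, and dim() and colnames() give a quick overview of the size and the columns of state.x77.

```r
dim(state.x77)          # 50 rows (states) and 8 columns
head(state.x77, n = 3)  # first 3 rows instead of the default 6
tail(state.x77, n = 2)  # last 2 rows
colnames(state.x77)     # names of the 8 columns
```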
Load data depending on the file type
.txt file・・・You use this data when you analyze text
.tsv file・・・Data divided not with a comma, but with a tab
.html file・・・You use this data when you want to do web scraping and gather data
.csv file・・・Data divided with a comma (csv means comma-separated values)
.xls file・・・Data used in Excel
.xlsx file・・・Data used in Excel (newer than .xls)
.dta file・・・Data used in Stata
.rds file・・・Data for R only
→ LibreOffice is free!
→ You can assign character encoding
→ You can avoid garbled characters
To assign the character encoding in R, add locale = locale(encoding = "cp932") as follows:
df <- read_csv("data/hr96-21.csv",
               na = ".",
               locale = locale(encoding = "cp932"))
Reading a .csv file
・Download the data (hr96-21.csv)
・Make a folder named data in your R Project folder
・Put hr96-21.csv into the data folder within your R Project folder
df <- read_csv("data/hr96-21.csv",
               na = ".")
hr96-21.csv is a collection of Japanese lower house election data covering 9 national elections (1996, 2000, 2003, 2005, 2009, 2012, 2014, 2017, 2021)
Check the names of the variables df contains
names(df)
[1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu"
head(df)
# A tibble: 6 × 22
year pref ku kun wl rank nocand seito j_name gender name previ…¹
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 1996 愛知 aichi 1 1 1 7 新進 河村… male KAWA… 2
2 1996 愛知 aichi 1 0 2 7 自民 今枝… male IMAE… 2
3 1996 愛知 aichi 1 0 3 7 民主 佐藤… male SATO… 2
4 1996 愛知 aichi 1 0 4 7 共産 岩中… female IWAN… 0
5 1996 愛知 aichi 1 0 5 7 文化フ… 伊東… female ITO,… 0
6 1996 愛知 aichi 1 0 6 7 国民党 山田浩 male YAMA… 0
# … with 10 more variables: age <dbl>, exp <dbl>, status <dbl>, vote <dbl>,
# voteshare <dbl>, eligible <dbl>, turnout <dbl>, seshu_dummy <dbl>,
# jiban_seshu <chr>, nojiban_seshu <chr>, and abbreviated variable name
# ¹previous
df <- read_csv("data/hr96-21.csv",
               na = ".",
               locale = locale(encoding = "cp932"))
read.csv() and read_csv()
・read.csv() is a base R function
・read_csv() is a function you can use only after loading the package {readr}, which is included in {tidyverse}
・You can use either read.csv() or read_csv() when you read a csv file
・Both work fine, but read_csv() is much faster and more reliable, especially when you read large data
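To compare the two functions without downloading anything, the sketch below writes a tiny made-up CSV to a temporary file and reads it both ways. The file contents and the object names tmp, base_df, and readr_df are assumptions for illustration; show_col_types = FALSE only silences the column-type message in recent versions of {readr}.

```r
library(readr)

# Write a tiny made-up CSV to a temporary file (not the real hr96-21.csv)
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,exp", "A,100", "B,.", "C,300"), tmp)

base_df  <- read.csv(tmp, na.strings = ".")                  # base R
readr_df <- read_csv(tmp, na = ".", show_col_types = FALSE)  # {readr}

class(base_df)          # "data.frame"
class(readr_df)         # a tibble (also a data frame)
is.na(readr_df$exp[2])  # TRUE: "." was read as a missing value
```

Note that the argument for missing-value codes has a different name in each function: na.strings in read.csv() and na in read_csv().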
・Download the data (hr96-21.csv) and open the file as a .xls or .xlsx file in Excel
・Click File and choose Save as
・Choose CSV UTF-8 (.csv) and save it
・On LibreOffice, choose Unicode (UTF-8)
・To read hr96-21.csv with an assigned character encoding, use the following R code:
hr <- read_csv("data/hr96-21.csv",
               na = ".",
               locale = locale(encoding = "cp932"))
Reading a .xls[x] file
You need the package {readxl} to read a .xls[x] file
library(readxl)
fh <- read_excel("data/FH_Country.xls")
It is important to keep in mind that loading data into R is just the beginning of an empirical analysis. You need to clean and customize the data before starting your analysis
Reading a .dta file
A .dta file is binary data
You need the package {haven} to read a .dta file
library(haven)
Bruce Russett and John R. Oneal (2001) “Triangulating Peace”
triangle <- read_dta("data/TRIANGLE.DTA")
head(triangle)
# A tibble: 6 × 19
statea stateb year dependa dependb demauta demautb allies dispute1 logdstab
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 20 1920 0.0157 0.280 10 9 0 0 5.82
2 2 20 1921 0.0115 0.224 10 10 0 0 5.82
3 2 20 1922 0.0113 0.201 10 10 0 0 5.82
4 2 20 1923 0.0112 0.213 10 10 0 0 5.82
5 2 20 1924 0.0110 0.213 10 10 0 0 5.82
6 2 20 1925 0.0108 0.191 10 10 0 0 5.82
# … with 9 more variables: lcaprat2 <dbl>, smigoabi <dbl>, opena <dbl>,
# openb <dbl>, minrpwrs <dbl>, noncontg <dbl>, smldmat <dbl>, smldep <dbl>,
# dyadid <dbl>
Check the names of the variables triangle includes
names(triangle)
[1] "statea" "stateb" "year" "dependa" "dependb" "demauta"
[7] "demautb" "allies" "dispute1" "logdstab" "lcaprat2" "smigoabi"
[13] "opena" "openb" "minrpwrs" "noncontg" "smldmat" "smldep"
[19] "dyadid"
To convert TRIANGLE.DTA to triangle.csv, use the following R code:
write_excel_csv(triangle, "data/triangle.csv")
triangle.csv is saved in the data folder within your R Project folder
Reading a .Rds file
You need the package {readr} to read a .Rds file
Download hr96-21.Rds to your computer
Since {tidyverse} contains {readr}, you need to install and load either of these two packages
We load {readr} here:
library(readr)
hr <- readr::read_rds("hr96-21.Rds")
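What is {tidyverse}?
{tidyverse} contains multiple useful R packages for data manipulation and data visualization
To use {tidyverse}, you need to take the following 2 steps:
1. Install {tidyverse}: type the following in Console and hit the Return key
install.packages("tidyverse")
2. Load {tidyverse} (the following line must be in your document whenever you knit)
library(tidyverse)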
The pipe operator (%>%)
・Pipes (%>%) allow you to express a sequence of multiple operations
・Pipes can greatly simplify your code and make your operations more intuitive
・The pipe operator (%>%) is automatically imported as part of the {tidyverse} library
・Pipes (%>%) are included in the {magrittr} package
・The {magrittr} package is included in the {tidyverse} package
→ You need to load either of the following packages to use the pipe operator (%>%)
library(magrittr)
library(tidyverse)
How pipes (%>%) are used
The pipe operator (%>%) automatically passes the output from the first line into the next line as the input
1:10
[1] 1 2 3 4 5 6 7 8 9 10
To calculate 1 + 2 + ... + 10 without pipes:
sum(1:10)
[1] 55
Using the pipe operator (%>%), you write R code as follows:
1:10 %>%  # Generate a vector from 1 to 10
  sum()   # Add them all
[1] 55
1:10 %>%     # Generate a vector from 1 to 10
  sum() %>%  # Add them all
  sqrt()     # Calculate the square root
[1] 7.416198
sqrt(sum(1:10))
[1] 7.416198
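The equivalence between piped and nested code can be verified directly. In the minimal sketch below (the vector x and the object names piped and nested are made up), x %>% f(y) is evaluated as f(x, y), so the left-hand side always becomes the first argument.

```r
library(magrittr)

x <- c(4, 9, 16)

# Piped version: read from left to right
piped <- x %>% sqrt() %>% sum()  # 2 + 3 + 4

# Nested version: read from the inside out
nested <- sum(sqrt(x))

piped                     # 9
identical(piped, nested)  # TRUE

# The left-hand side becomes the first argument of the function
2.567 %>% round(1)        # same as round(2.567, 1)
```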
How to interpret R code with pipes (%>%)
・You can interpret the R code you made in 4.1 A simple scatter plot as follows:
df1 %>%                          # Use df1 as data
  ggplot(aes(x = math,           # Assign x = math
             y = stat,           # Assign y = stat
             color = gender)) +  # Dots are colored by gender
  geom_point()                   # Draw a scatter plot
・df1 %>% ggplot() means the first argument of ggplot() is df1
Interpretation of the R code:
・Use df1 as data
→ Assign x = math
→ Assign y = stat
→ Dots are colored by gender
→ Draw a scatter plot
・You don’t have to go backward in interpreting the R code
・R code with pipes (%>%) is intuitive and easy to follow
{dplyr} is one of the {tidyverse} packages, and it is powerful in manipulating data
Data preparation: hr96-21.csv
・Download Japanese lower house election data (1996-2021): hr96-21.csv
・In your R Project folder, make a new folder, name it data, and put hr96-21.csv in it
hr <- read_csv("data/hr96-21.csv", na = ".")
・hr96-21.csv is a collection of Japanese lower house election data covering 9 national elections (1996, 2000, 2003, 2005, 2009, 2012, 2014, 2017, 2021)
・Check the names of the variables hr contains
names(hr)
[1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu"
・hr has the following 22 variables
variable | detail |
---|---|
year | Election year (1996-2021) |
pref | Prefecture |
ku | Electoral district name |
kun | Number of electoral district |
rank | Ascending order of votes |
wl | 0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner |
nocand | Number of candidates in each district |
seito | Candidate’s affiliated party (in Japanese) |
j_name | Candidate’s name (Japanese) |
name | Candidate’s name (English) |
previous | Previous wins |
gender | Candidate’s gender: “male”, “female” |
age | Candidate’s age |
exp | Election expenditure (yen) spent by each candidate |
status | 0 = challenger / 1 = incumbent / 2 = former incumbent |
vote | Votes each candidate garnered |
voteshare | Vote share (%) |
eligible | Eligible voters in each district |
turnout | Turnout in each district (%) |
seshu_dummy | 0 = non-hereditary candidate / 1 = hereditary candidate |
jiban_seshu | Relationship between candidate and his predecessor |
nojiban_seshu | Relationship between candidate and his predecessor |
・Check the data frame (hr) you get
names(hr)
[1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu"
{dplyr} is one of the {tidyverse} packages, and it is powerful in manipulating data
Using hr, let’s take a look at what we can do with {dplyr}
Select a column: select()
Using select(), you can select a column (or columns) you want from the data frame
How to use: without the pipe %>%
select(data frame, Var1, Var2, ...)
The pipe operator (%>%) automatically passes the output from the first line (in this case, data frame) into the next line (in this case, select()) as the input
How to use: with the pipe %>%
data frame %>% select(Var1, Var2, ...)
Check the names of the variables in hr
names(hr)
[1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu"
Select 5 variables (year, ku, kun, seito, j_name) from the data frame hr
hr %>%
  select(year, ku, kun, seito, j_name)
# A tibble: 9,660 × 5
year ku kun seito j_name
<dbl> <chr> <dbl> <chr> <chr>
1 1996 aichi 1 新進 河村たかし
2 1996 aichi 1 自民 今枝敬雄
3 1996 aichi 1 民主 佐藤泰介
4 1996 aichi 1 共産 岩中美保子
5 1996 aichi 1 文化フォーラム 伊東マサコ
6 1996 aichi 1 国民党 山田浩
7 1996 aichi 1 無所 浅野光雪
8 1996 aichi 2 新進 青木宏之
9 1996 aichi 2 自民 田辺広雄
10 1996 aichi 2 民主 古川元久
# … with 9,650 more rows
Using :, you can select consecutive variables
hr %>%
  select(c(year:j_name)) # Select the 1st through the 9th consecutive variables
# A tibble: 9,660 × 9
year pref ku kun wl rank nocand seito j_name
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 1996 愛知 aichi 1 1 1 7 新進 河村たかし
2 1996 愛知 aichi 1 0 2 7 自民 今枝敬雄
3 1996 愛知 aichi 1 0 3 7 民主 佐藤泰介
4 1996 愛知 aichi 1 0 4 7 共産 岩中美保子
5 1996 愛知 aichi 1 0 5 7 文化フォーラム 伊東マサコ
6 1996 愛知 aichi 1 0 6 7 国民党 山田浩
7 1996 愛知 aichi 1 0 7 7 無所 浅野光雪
8 1996 愛知 aichi 2 1 1 8 新進 青木宏之
9 1996 愛知 aichi 2 0 2 8 自民 田辺広雄
10 1996 愛知 aichi 2 2 3 8 民主 古川元久
# … with 9,650 more rows
Save the result under another object name
・Here, what you did was temporarily select variables and display them
・You did not save the result as an object
・Let’s check how hr looks now
names(hr)
[1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu"
・The object hr has 22 variables
・If you want to use an object with 9 variables, you need to save it with an object name different from hr, such as hr1
hr1 <- hr %>%
  select(c(year:j_name)) # Select the 1st through the 9th consecutive variables
Check the names of the variables in hr1
names(hr1)
[1] "year" "pref" "ku" "kun" "wl" "rank" "nocand" "seito"
[9] "j_name"
・This is how you get what you want
Using select(), you can simultaneously select a column and change the column name
Syntax: new name = old name
Select 6 variables (year, ku, kun, seito, j_name, vote) from the data frame hr and change j_name to namae
hr %>%
  select(year, ku, kun, seito, namae = j_name, vote)
# A tibble: 9,660 × 6
year ku kun seito namae vote
<dbl> <chr> <dbl> <chr> <chr> <dbl>
1 1996 aichi 1 新進 河村たかし 66876
2 1996 aichi 1 自民 今枝敬雄 42969
3 1996 aichi 1 民主 佐藤泰介 33503
4 1996 aichi 1 共産 岩中美保子 22209
5 1996 aichi 1 文化フォーラム 伊東マサコ 616
6 1996 aichi 1 国民党 山田浩 566
7 1996 aichi 1 無所 浅野光雪 312
8 1996 aichi 2 新進 青木宏之 56101
9 1996 aichi 2 自民 田辺広雄 44938
10 1996 aichi 2 民主 古川元久 43804
# … with 9,650 more rows
Using rename(), you can change the name of a variable
ku => senkyoku
seito => party
hr %>%
  rename(senkyoku = ku,
         party = seito)
# A tibble: 9,660 × 22
year pref senkyoku kun wl rank nocand party j_name gender name
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 1996 愛知 aichi 1 1 1 7 新進 河村… male KAWA…
2 1996 愛知 aichi 1 0 2 7 自民 今枝… male IMAE…
3 1996 愛知 aichi 1 0 3 7 民主 佐藤… male SATO…
4 1996 愛知 aichi 1 0 4 7 共産 岩中… female IWAN…
5 1996 愛知 aichi 1 0 5 7 文化フォー… 伊東… female ITO,…
6 1996 愛知 aichi 1 0 6 7 国民党 山田浩 male YAMA…
7 1996 愛知 aichi 1 0 7 7 無所 浅野… male ASAN…
8 1996 愛知 aichi 2 1 1 8 新進 青木… male AOKI…
9 1996 愛知 aichi 2 0 2 8 自民 田辺… male TANA…
10 1996 愛知 aichi 2 2 3 8 民主 古川… male FURU…
# … with 9,650 more rows, and 11 more variables: previous <dbl>, age <dbl>,
# exp <dbl>, status <dbl>, vote <dbl>, voteshare <dbl>, eligible <dbl>,
# turnout <dbl>, seshu_dummy <dbl>, jiban_seshu <chr>, nojiban_seshu <chr>
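Renaming inside select() and renaming with rename() differ in which columns survive. The sketch below uses a made-up tibble called toy (not the real hr data) to show the difference: rename() keeps every column, while select() keeps only the columns you list.

```r
library(dplyr)

# Hypothetical toy data; the values are made up
toy <- tibble(ku = c("aichi", "tokyo"), seito = c("A", "B"), vote = c(100, 200))

renamed  <- toy %>% rename(senkyoku = ku)  # keeps all 3 columns
selected <- toy %>% select(senkyoku = ku)  # keeps only the renamed column

names(renamed)   # "senkyoku" "seito" "vote"
names(selected)  # "senkyoku"
```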
To remove year and pref from the data frame hr, you use the following R code:
hr %>%
  select(!c(year, pref))
# A tibble: 9,660 × 20
ku kun wl rank nocand seito j_name gender name previ…¹ age
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 aichi 1 1 1 7 新進 河村た… male KAWA… 2 47
2 aichi 1 0 2 7 自民 今枝敬… male IMAE… 2 72
3 aichi 1 0 3 7 民主 佐藤泰… male SATO… 2 53
4 aichi 1 0 4 7 共産 岩中美… female IWAN… 0 43
5 aichi 1 0 5 7 文化フォー… 伊東マ… female ITO,… 0 51
6 aichi 1 0 6 7 国民党 山田浩 male YAMA… 0 51
7 aichi 1 0 7 7 無所 浅野光… male ASAN… 0 45
8 aichi 2 1 1 8 新進 青木宏… male AOKI… 2 51
9 aichi 2 0 2 8 自民 田辺広… male TANA… 0 71
10 aichi 2 2 3 8 民主 古川元… male FURU… 0 30
# … with 9,650 more rows, 9 more variables: exp <dbl>, status <dbl>,
# vote <dbl>, voteshare <dbl>, eligible <dbl>, turnout <dbl>,
# seshu_dummy <dbl>, jiban_seshu <chr>, nojiban_seshu <chr>, and abbreviated
# variable name ¹previous
Using ! with c() and :, you can drop (NOT select) consecutive variables
To remove the variables from year to nocand, you use the following R code:
hr %>%
  select(!c(year:nocand))
# A tibble: 9,660 × 15
seito j_name gender name previ…¹ age exp status vote votes…² eligi…³
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 新進 河村… male KAWA… 2 47 9.83e6 1 66876 40 346774
2 自民 今枝… male IMAE… 2 72 9.31e6 2 42969 25.7 346774
3 民主 佐藤… male SATO… 2 53 9.23e6 1 33503 20.1 346774
4 共産 岩中… female IWAN… 0 43 2.18e6 0 22209 13.3 346774
5 文化… 伊東… female ITO,… 0 51 NA 0 616 0.4 346774
6 国民党 山田浩 male YAMA… 0 51 NA 0 566 0.3 346774
7 無所 浅野… male ASAN… 0 45 NA 0 312 0.2 346774
8 新進 青木… male AOKI… 2 51 1.29e7 1 56101 32.9 338310
9 自民 田辺… male TANA… 0 71 1.65e7 2 44938 26.4 338310
10 民主 古川… male FURU… 0 30 1.14e7 0 43804 25.7 338310
# … with 9,650 more rows, 4 more variables: turnout <dbl>, seshu_dummy <dbl>,
# jiban_seshu <chr>, nojiban_seshu <chr>, and abbreviated variable names
# ¹previous, ²voteshare, ³eligible
starts_with() and ends_with()
hr includes the following 22 variables:
names(hr)
[1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu"
Select the variable whose name ends with ku in addition to j_name
(starts_with("ku") would also match kun)
hr %>%
  select(j_name, ends_with("ku"))
# A tibble: 9,660 × 2
j_name ku
<chr> <chr>
1 河村たかし aichi
2 今枝敬雄 aichi
3 佐藤泰介 aichi
4 岩中美保子 aichi
5 伊東マサコ aichi
6 山田浩 aichi
7 浅野光雪 aichi
8 青木宏之 aichi
9 田辺広雄 aichi
10 古川元久 aichi
# … with 9,650 more rows
ends_with()
Select the variables whose names end with seshu in addition to j_name
hr %>%
  select(j_name, ends_with("seshu"))
# A tibble: 9,660 × 3
j_name jiban_seshu nojiban_seshu
<chr> <chr> <chr>
1 河村たかし <NA> <NA>
2 今枝敬雄 <NA> <NA>
3 佐藤泰介 <NA> <NA>
4 岩中美保子 <NA> <NA>
5 伊東マサコ <NA> <NA>
6 山田浩 <NA> <NA>
7 浅野光雪 <NA> <NA>
8 青木宏之 <NA> <NA>
9 田辺広雄 伯父=加藤鐐五郎(衆議院議員) <NA>
10 古川元久 <NA> <NA>
# … with 9,650 more rows
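starts_with() and ends_with() can pick up different columns for the same text. The sketch below uses a made-up one-row tibble, toy, that reuses four column names from hr; the values and the object names are assumptions for illustration.

```r
library(dplyr)

# Hypothetical one-row tibble that reuses a few column names from hr
toy <- tibble(ku = "aichi", kun = 1, jiban_seshu = NA, seshu_dummy = 0)

start_ku    <- toy %>% select(starts_with("ku")) %>% names()
end_ku      <- toy %>% select(ends_with("ku")) %>% names()
start_seshu <- toy %>% select(starts_with("seshu")) %>% names()
end_seshu   <- toy %>% select(ends_with("seshu")) %>% names()

start_ku     # "ku" "kun"
end_ku       # "ku"
start_seshu  # "seshu_dummy"
end_seshu    # "jiban_seshu"
```

Running names() on the selection, as done here, is a quick way to confirm which columns a helper matched before using it on the full data.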
names(hr)
[1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu"
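Change the order of the variables as follows:
・year comes first
・Place the variables (name to nojiban_seshu) after year
・Place the variables (pref to gender) after nojiban_seshu
hr1 <- hr %>%
  select(year, name:nojiban_seshu, pref:gender)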
names(hr1)
[1] "year" "name" "previous" "age"
[5] "exp" "status" "vote" "voteshare"
[9] "eligible" "turnout" "seshu_dummy" "jiban_seshu"
[13] "nojiban_seshu" "pref" "ku" "kun"
[17] "wl" "rank" "nocand" "seito"
[21] "j_name" "gender"
relocate()
Using relocate(), place the variables (name to nojiban_seshu) after year
How to use relocate()
Relocate var2 after var1
data frame %>% relocate(var2, .after = var1)
Relocate var2 before var1
data frame %>% relocate(var2, .before = var1)
hr2 <- hr %>%
  relocate(name:nojiban_seshu, .after = year)
names(hr2)
[1] "year" "name" "previous" "age"
[5] "exp" "status" "vote" "voteshare"
[9] "eligible" "turnout" "seshu_dummy" "jiban_seshu"
[13] "nojiban_seshu" "pref" "ku" "kun"
[17] "wl" "rank" "nocand" "seito"
[21] "j_name" "gender"
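The two forms of relocate() can be tried on a small example. The tibble toy and the object names moved_after and moved_before below are made up for illustration; the first argument is always the column being moved, and .after or .before names the reference column.

```r
library(dplyr)

# Hypothetical toy data with four columns
toy <- tibble(a = 1, b = 2, c = 3, d = 4)

moved_after  <- toy %>% relocate(d, .after = a)   # d goes right after a
moved_before <- toy %>% relocate(d, .before = a)  # d goes right before a

names(moved_after)   # "a" "d" "b" "c"
names(moved_before)  # "d" "a" "b" "c"
```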
Select a row: filter()
select() : | Select a column |
filter() : | Select a row |
How to use filter()
data frame %>% filter(condition1, condition2, ...)
To use filter(), you need to understand what logical operators, such as ==, >, and &, mean
Let’s use the data frame on Japanese lower house elections, hr, here
names(hr)
[1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu"
Using unique(), check the elements (values) in year
unique(hr$year)
[1] 1996 2000 2003 2005 2009 2012 2014 2017 2021
Select the rows for the 2021 election
hr %>%
  filter(year == 2021)
# A tibble: 857 × 22
year pref ku kun wl rank nocand seito j_name gender name previ…¹
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 2021 愛知 aichi 1 1 1 3 自民 熊田裕… male KUMA… 3
2 2021 愛知 aichi 1 2 2 3 立憲 吉田統… male YOSH… 2
3 2021 愛知 aichi 1 0 3 3 N党 門田節… female KADO… 0
4 2021 愛知 aichi 2 1 1 2 国民 古川元… male FURU… 8
5 2021 愛知 aichi 2 2 2 2 自民 中川貴… male NAKA… 0
6 2021 愛知 aichi 3 1 1 2 立憲 近藤昭… male KOND… 8
7 2021 愛知 aichi 3 2 2 2 自民 池田佳… male IKED… 3
8 2021 愛知 aichi 4 1 1 3 自民 工藤彰… male KUDO… 3
9 2021 愛知 aichi 4 2 2 3 立憲 牧義夫 male MAKI… 6
10 2021 愛知 aichi 4 0 3 3 維新 中田千… female NAKA… 0
# … with 847 more rows, 10 more variables: age <dbl>, exp <dbl>, status <dbl>,
# vote <dbl>, voteshare <dbl>, eligible <dbl>, turnout <dbl>,
# seshu_dummy <dbl>, jiban_seshu <chr>, nojiban_seshu <chr>, and abbreviated
# variable name ¹previous
Using unique(), check the elements (values) in seito in the data frame hr
unique(hr$seito)
[1] "新進" "自民" "民主"
[4] "共産" "文化フォーラム" "国民党"
[7] "無所" "自由連合" "政事公団太平会"
[10] "新社会" "社民" "新党さきがけ"
[13] "沖縄社会大衆党" "市民新党にいがた" "緑の党"
[16] "さわやか神戸・市民の会" "民主改革連合" "青年自由"
[19] "日本新進" "公明" "諸派"
[22] "保守" "無所属の会" "自由"
[25] "改革クラブ" "保守新" "ニューディールの会"
[28] "新党尊命" "世界経済共同体党" "新党日本"
[31] "国民新党" "新党大地" "幸福"
[34] "みんな" "改革" "日本未来"
[37] "日本維新の会" "当たり前" "政治団体代表"
[40] "安楽死党" "アイヌ民族党" "次世"
[43] "維新" "生活" "立憲"
[46] "希望" "緒派" ""
[49] "N党" "国民" "れい"
Select the rows for LDP candidates (seito == "自民")
hr %>%
  filter(seito == "自民")
# A tibble: 2,542 × 22
year pref ku kun wl rank nocand seito j_name gender name previ…¹
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 1996 愛知 aichi 1 0 2 7 自民 今枝敬… male IMAE… 2
2 1996 愛知 aichi 2 0 2 8 自民 田辺広… male TANA… 0
3 1996 愛知 aichi 3 0 2 7 自民 片岡武… male KATA… 2
4 1996 愛知 aichi 4 0 2 6 自民 塚本三… male TSUK… 9
5 1996 愛知 aichi 5 2 2 7 自民 木村隆… male KIMU… 0
6 1996 愛知 aichi 6 0 2 8 自民 伊藤勝… male ITO,… 0
7 1996 愛知 aichi 7 0 2 7 自民 丹羽太… male NIWA… 0
8 1996 愛知 aichi 8 1 1 5 自民 久野統… male KUNO… 2
9 1996 愛知 aichi 9 0 2 7 自民 吉川博 male YOSH… 0
10 1996 愛知 aichi 10 0 2 7 自民 森治男 male MORI… 0
# … with 2,532 more rows, 10 more variables: age <dbl>, exp <dbl>,
# status <dbl>, vote <dbl>, voteshare <dbl>, eligible <dbl>, turnout <dbl>,
# seshu_dummy <dbl>, jiban_seshu <chr>, nojiban_seshu <chr>, and abbreviated
# variable name ¹previous
Combining select() and filter()
Select the LDP candidates (seito == "自民") running in the 2021 HR election (year == 2021)
hr2021 <- hr %>%
  filter(year == 2021) %>%
  filter(seito == "自民") %>%
  select(pref:gender)
DT::datatable(hr2021)
Order matters in using both filter() and select()
Use them in this order: filter() → select()
If you use select() first, you get an error:
hr2021 <- hr %>%
  select(pref:gender) %>%
  filter(year == 2021) %>%
  filter(seito == "自民")
Error in filter():
! Problem while computing ..1 = year == 2021.
Caused by error:
! object 'year' not found
The error message says object 'year' not found, because year is not among the variables kept by select(pref:gender)
Use filter() first, then select():
hr2021 <- hr %>%
  filter(year == 2021) %>%
  filter(seito == "自民") %>%
  select(pref:gender)
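The same ordering problem can be reproduced on a tiny made-up tibble, which makes it easier to experiment with. In the sketch below, toy, ok, and bad are names invented for this illustration, and try() is used only so that the failing pipeline does not stop the script.

```r
library(dplyr)

# Hypothetical toy data; the values are made up
toy <- tibble(year  = c(2017, 2021, 2021),
              seito = c("A", "A", "B"),
              vote  = c(10, 20, 30))

# filter() first: year is still available when the condition is evaluated
ok <- toy %>%
  filter(year == 2021) %>%
  select(seito, vote)

# select() first: year has already been dropped, so filter() fails
bad <- try(toy %>% select(seito, vote) %>% filter(year == 2021), silent = TRUE)

nrow(ok)                    # 2
inherits(bad, "try-error")  # TRUE
```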
Variable class matters
If the variable class is numeric (double, numeric, or integer), then these variables include numeric elements
Numeric values are written without quotation marks; character values need quotation marks ("")
filter(year == 2021)・・・the variable class of year is numeric
filter(seito == "自民")・・・the variable class of seito is character
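Before writing a condition, the class of a variable can be checked with class(). The sketch below uses a made-up tibble, toy, with placeholder party labels ("LDP", "CDP") instead of the Japanese names in hr.

```r
library(dplyr)

# Hypothetical toy data; party labels are placeholders
toy <- tibble(year = c(2017, 2021), seito = c("LDP", "CDP"))

class(toy$year)   # "numeric"
class(toy$seito)  # "character"

by_year  <- toy %>% filter(year == 2021)    # numeric value: no quotation marks
by_party <- toy %>% filter(seito == "CDP")  # character value: quotation marks

nrow(by_year)   # 1
nrow(by_party)  # 1
```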
Multiple conditions (AND): & or ,
Select the rows satisfying both of the following conditions from hr:
・Condition 1: Candidates nominated in the 2021 HR election (year == 2021)
・Condition 2: Female candidates (gender == "female")
hr %>%
  filter(year == 2021 & gender == "female") %>%          # Select 2021 HR election data and female candidates
  select(pref, kun, seito, j_name, gender, rank, wl)   # Select 7 variables
# A tibble: 140 × 7
pref kun seito j_name gender rank wl
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 愛知 1 N党 門田節代 female 3 0
2 愛知 4 維新 中田千代 female 3 0
3 愛知 5 維新 岬麻紀 female 3 2
4 愛知 7 共産 須山初美 female 3 0
5 愛知 10 れい 安井美沙子 female 4 0
6 愛媛 2 国民 石井智恵 female 2 0
7 茨城 4 維新 武藤優子 female 2 0
8 茨城 4 共産 大内久美子 female 3 0
9 茨城 5 共産 飯田美弥子 female 3 0
10 茨城 6 自民 国光文乃 female 1 1
# … with 130 more rows
Either condition (OR): |
Select the rows satisfying either of the following conditions from hr:
・Condition 1: CDP candidates nominated in the 2021 HR election (seito == "立憲")
・Condition 2: SDP candidates nominated in the 2021 HR election (seito == "社民")
hr %>%
  filter(year == 2021) %>%                              # Select 2021 HR election data
  filter(seito == "立憲" | seito == "社民") %>%          # Select either CDP or SDP candidates
  select(pref, kun, seito, j_name, gender, rank, wl)  # Select 7 variables
# A tibble: 223 × 7
pref kun seito j_name gender rank wl
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 愛知 1 立憲 吉田統彦 male 2 2
2 愛知 3 立憲 近藤昭一 male 1 1
3 愛知 4 立憲 牧義夫 male 2 2
4 愛知 5 立憲 西川厚志 male 2 0
5 愛知 6 立憲 松田功 male 2 0
6 愛知 7 立憲 森本和義 male 2 0
7 愛知 8 立憲 伴野豊 male 2 2
8 愛知 9 立憲 岡本充功 male 2 0
9 愛知 10 立憲 藤原規真 male 3 0
10 愛知 12 立憲 重徳和彦 male 1 1
# … with 213 more rows
Combining & and |
・Condition 1: CDP candidates nominated in the 2021 HR election (seito == "立憲")
・Condition 2: SDP candidates nominated in the 2021 HR election (seito == "社民")
・Condition 3: Winners in single-member districts (wl == 1)
hr %>%
  filter(year == 2021) %>%
  filter(seito == "立憲" | seito == "社民") %>%
  filter(wl == 1) %>%
  select(pref, kun, seito, j_name, gender, rank, wl)
# A tibble: 58 × 7
pref kun seito j_name gender rank wl
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 愛知 3 立憲 近藤昭一 male 1 1
2 愛知 12 立憲 重徳和彦 male 1 1
3 愛知 13 立憲 大西健介 male 1 1
4 沖縄 2 社民 新垣邦男 male 1 1
5 岩手 1 立憲 階猛 male 1 1
6 宮崎 1 立憲 渡辺創 male 1 1
7 宮城 2 立憲 鎌田さゆり female 1 1
8 宮城 5 立憲 安住淳 male 1 1
9 広島 6 立憲 佐藤公治 male 1 1
10 香川 1 立憲 小川淳也 male 1 1
# … with 48 more rows
%in%
Using %in%, you can further simplify the R code:
hr %>%
  filter(year == 2021) %>%
  filter(seito %in% c("立憲", "社民"),
         wl == 1) %>%
  select(pref, kun, seito, j_name, gender, rank, wl)
# A tibble: 58 × 7
pref kun seito j_name gender rank wl
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 愛知 3 立憲 近藤昭一 male 1 1
2 愛知 12 立憲 重徳和彦 male 1 1
3 愛知 13 立憲 大西健介 male 1 1
4 沖縄 2 社民 新垣邦男 male 1 1
5 岩手 1 立憲 階猛 male 1 1
6 宮崎 1 立憲 渡辺創 male 1 1
7 宮城 2 立憲 鎌田さゆり female 1 1
8 宮城 5 立憲 安住淳 male 1 1
9 広島 6 立憲 佐藤公治 male 1 1
10 香川 1 立憲 小川淳也 male 1 1
# … with 48 more rows
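That %in% gives the same result as a chain of == joined by | can be checked on a plain vector. The vector parties and its placeholder labels below are made up for illustration.

```r
# Hypothetical vector of party labels (placeholders)
parties <- c("A", "B", "C", "A")

with_or <- parties == "A" | parties == "C"  # two comparisons joined by OR
with_in <- parties %in% c("A", "C")         # one membership test

with_in                      # TRUE FALSE TRUE TRUE
identical(with_or, with_in)  # TRUE
```

The advantage of %in% grows with the number of values: adding a third party only means adding one element to the vector on the right-hand side.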
Missing values: NA
Election expenditure (exp) includes a number of missing values.
Calculate the mean of exp.
mean(hr$exp)
[1] NA
The result is NA
Add na.rm = TRUE within mean(), excluding the missing values in calculating the mean.
mean(hr$exp, na.rm = TRUE)
[1] 7551393
Alternatively, remove the missing values with drop_na() before calculating the mean.
hr %>%
  drop_na(exp) %>%
  select(year, j_name, exp)
# A tibble: 6,829 × 3
year j_name exp
<dbl> <chr> <dbl>
1 1996 河村たかし 9828097
2 1996 今枝敬雄 9311555
3 1996 佐藤泰介 9231284
4 1996 岩中美保子 2177203
5 1996 青木宏之 12940178
6 1996 田辺広雄 16512426
7 1996 古川元久 11435567
8 1996 石山淳一 2128510
9 1996 藤原美智子 3270533
10 1996 吉田幸弘 11245219
# … with 6,819 more rows
Now, we see the list of those candidates (6,829) who reported their campaign expenditure.
Calculate the mean of exp for these 6,829 candidates.
hr1 <- hr %>%
  drop_na(exp) %>%
  select(year, j_name, exp)
mean(hr1$exp)
[1] 7551393
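The effect of missing values on mean() can be seen on a three-element vector. The vector x below is made up for illustration; the last line removes the NA by hand, which is the same idea as dropping the missing rows before calculating the mean.

```r
# Hypothetical vector with one missing value
x <- c(10, NA, 30)

mean(x)                # NA: a single missing value makes the result missing
mean(x, na.rm = TRUE)  # 20: missing values are excluded
sum(is.na(x))          # 1: number of missing values
mean(x[!is.na(x)])     # 20: remove the NA first, then take the mean
```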
・In reference to 5.3 Select a row: filter(), answer the following 2 questions:
・Use the following 6 variables
・In answering these questions, use hr96-21.csv
・Show each answer with DT::datatable()
・Conditions: winners in single-member districts (wl == 1) and candidates of 民主 (seito == "民主")