R packages used in this section
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
{tidyverse} includes 9 core packages, one of which is {readr}
A variable is also called a vector
A vector is the smallest unit in data handling
A vector is an object which holds one or more values (elements) of the same class (such as numeric, character, factor, ...)
All of the elements in a vector must be of the same class
Let’s make a vector (= variable) including 8 elements (= values)
Name it "id"
Make a variable including 8 different names
Name it "name"
Make a variable including 8 scores
Name it "score"
Two ways of making a data frame:
1. data.frame(var1, var2, ...)
2. tibble::tibble(var1, var2, ...) (recommended)
・tibble() is an extended version of data.frame()
・tibble() is recommended in terms of versatility
・tibble() belongs to the {tibble} package, which is loaded with {tidyverse} (write tibble::tibble(), not tidyverse::tibble())
We use tibble() here
Load {tidyverse} to use tibble()
df1
# A tibble: 8 × 3
id name score
<dbl> <chr> <dbl>
1 1 Joe 43
2 2 Carol 74
3 3 Mike 80
4 4 Ross 37
5 5 Shira 20
6 6 Inha 83
7 7 Jih-wen 64
8 8 Amital 35
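The vectors and the tibble printed above can be put together as in the following sketch. The values are copied from the printed df1; the code used in the original material is not shown here, so this is only one possible way to build it.

```r
library(tidyverse)

# Three vectors (= variables), each with 8 elements of a single class
id    <- c(1, 2, 3, 4, 5, 6, 7, 8)                       # numeric (double)
name  <- c("Joe", "Carol", "Mike", "Ross",
           "Shira", "Inha", "Jih-wen", "Amital")         # character
score <- c(43, 74, 80, 37, 20, 83, 64, 35)               # numeric (double)

# Bind the vectors column-wise into a data frame (tibble)
df1 <- tibble(id, name, score)
df1
```

Because each vector has one class, each column of df1 has one class as well, which is what the `<dbl>` and `<chr>` labels in the printout indicate.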
Add a new variable gakka (major) to the data frame df1 you made with the following R code:
# A tibble: 8 × 4
id name score gakka
<dbl> <chr> <dbl> <chr>
1 1 Joe 43 Poli-sci
2 2 Carol 74 Econ
3 3 Mike 80 Stat
4 4 Ross 37 History
5 5 Shira 20 Poli-sci
6 6 Inha 83 Econ
7 7 Jih-wen 64 Stat
8 8 Amital 35 Business
Add another variable (gender) to df1
# A tibble: 8 × 5
id name score gakka gender
<dbl> <chr> <dbl> <chr> <chr>
1 1 Joe 43 Poli-sci male
2 2 Carol 74 Econ female
3 3 Mike 80 Stat male
4 4 Ross 37 History male
5 5 Shira 20 Poli-sci female
6 6 Inha 83 Econ male
7 7 Jih-wen 64 Stat male
8 8 Amital 35 Business female
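Adding the two columns can be sketched as follows. Using mutate() is an assumption on my part (base R assignment such as df1$gakka <- ... would also work); the values are copied from the printed output.

```r
library(tidyverse)

# Rebuild df1 so that this example runs on its own
df1 <- tibble(
  id    = c(1, 2, 3, 4, 5, 6, 7, 8),
  name  = c("Joe", "Carol", "Mike", "Ross", "Shira", "Inha", "Jih-wen", "Amital"),
  score = c(43, 74, 80, 37, 20, 83, 64, 35)
)

# Add gakka (major) and gender as new columns, then overwrite df1
df1 <- df1 %>%
  mutate(
    gakka  = c("Poli-sci", "Econ", "Stat", "History",
               "Poli-sci", "Econ", "Stat", "Business"),
    gender = c("male", "female", "male", "male",
               "female", "male", "male", "female")
  )
df1
```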
Let’s make another data frame, df2
df2 includes the following two variables:
① id
② city
Let’s make a vector (= variable) including 8 elements (= values)
Name it "id"
Make a variable including the 8 different cities they come from
Name it "city"
df2
# A tibble: 8 × 2
id city
<dbl> <chr>
1 1 Eugene
2 2 Portland
3 3 Chicago
4 4 Oxnard
5 5 Seattle
6 6 Seoul
7 7 Taipei
8 8 Seattle
Merge df1 and df2 by id, which is included in both df1 and df2
id name score gakka gender city
1 1 Joe 43 Poli-sci male Eugene
2 2 Carol 74 Econ female Portland
3 3 Mike 80 Stat male Chicago
4 4 Ross 37 History male Oxnard
5 5 Shira 20 Poli-sci female Seattle
6 6 Inha 83 Econ male Seoul
7 7 Jih-wen 64 Stat male Taipei
8 8 Amital 35 Business female Seattle
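Merging two data frames on a shared key can be sketched as follows. The example uses only the first three rows to stay short, and it shows both dplyr's left_join() and base R's merge(); which of the two the original material used is not shown, so both are offered as assumptions.

```r
library(tidyverse)

# Shortened versions of df1 and df2 (first three rows only)
df1 <- tibble(id = c(1, 2, 3),
              name = c("Joe", "Carol", "Mike"),
              score = c(43, 74, 80))
df2 <- tibble(id = c(1, 2, 3),
              city = c("Eugene", "Portland", "Chicago"))

# tidyverse way: keep every row of df1 and attach matching columns of df2
merged_tidy <- left_join(df1, df2, by = "id")

# base R way: returns a plain data.frame
merged_base <- merge(df1, df2, by = "id")

merged_tidy
```

The key column id appears once in the result, and every other column from both data frames is kept.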
R has 4 files with two types of extensions (.html & .Rmd)
A folder is also called a directory
Slashes (/) represent each directory
Type getwd(), meaning get working directory, then you see the path showing where you are
The result of getwd() on my computer:
getwd()
"/Users/asanomasahiko/Dropbox/statistics/class_materials/R"
On Windows, the path starts with the C drive instead of Users
Look at R, on the right end in the path above
R is the name of your R Project folder (= working directory)
You can change the working directory with setwd()
Do not use Japanese characters in a file name or a directory name
Do not put spaces in a file name or a directory name
Use half-width (ASCII) letters, numbers, and symbols in a file name and a directory name
Avoid using numbers and symbols as the first character of a file name or a directory name
Example:
| Bad: | 2021_grades (start with numbers) |
| Good: | grades_2021 |
| Bad: | grades 2021 (a space inserted between grades & 2021) |
| Good: | grades2021 |
R and R packages come with various built-in data sets
Type data() in Console and hit the return key (or Enter key) to see the list
For instance, check the 7th data set, state.x77, and show the first 6 rows:
Type head(state.x77) in Console and hit the return key (or Enter key)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
California 21198 5114 1.1 71.71 10.3 62.6 20 156361
Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
Let’s display the last 6 rows:
Type tail(state.x77) in Console and hit the return
key (or Enter key)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Vermont 472 3907 0.6 71.64 5.5 57.1 168 9267
Virginia 4981 4701 1.4 70.08 9.5 47.8 85 39780
Washington 3559 4864 0.6 71.72 4.3 63.5 32 66570
West Virginia 1799 3617 1.4 69.48 6.7 41.6 100 24070
Wisconsin 4589 4468 0.7 72.48 3.0 54.5 149 54464
Wyoming 376 4566 0.6 70.29 6.9 62.9 173 97203
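As a small extension of head() and tail(), the number of rows to display can be set with the n argument. The sketch below relies only on the built-in state.x77 data, which ships with base R.

```r
# state.x77 is a built-in matrix: 50 states (rows) by 8 variables (columns)
dim(state.x77)

# Show the first 3 rows instead of the default 6
first_rows <- head(state.x77, n = 3)
first_rows

# Show only the last row
last_row <- tail(state.x77, n = 1)
last_row
```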
Load data depending on the file type
・.txt file・・・You use this data when you analyze text
・.tsv file・・・Data separated not with commas, but with tabs
・.html file・・・You use this data when you want to do web scraping and gather data
・.csv file・・・csv means comma-separated values
・.xls file・・・Data used in Excel
・.xlsx file・・・Data used in Excel (newer than .xls)
・.dta file・・・Data used in Stata
・.rds file・・・Data for R only
→ LibreOffice is free!
→ You can assign character encoding
→ You can avoid garbled characters
.csv file (hr96-24.csv)
Make a folder named data in your R Project folder
Put hr96-24.csv into the data folder within your R Project folder
hr96-24.csv is a collection of Japanese lower house election data covering 10 national elections (1996, 2000, 2003, 2005, 2009, 2012, 2014, 2017, 2021, 2024)
Check the names of the variables hr contains
[1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu"
# A tibble: 6 × 22
year pref ku kun wl rank nocand seito j_name gender name previous
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 1996 愛知 aichi 1 1 1 7 新進 河村… male KAWA… 2
2 1996 愛知 aichi 1 0 2 7 自民 今枝… male IMAE… 2
3 1996 愛知 aichi 1 0 3 7 民主 佐藤… male SATO… 2
4 1996 愛知 aichi 1 0 4 7 共産 岩中… female IWAN… 0
5 1996 愛知 aichi 1 0 5 7 文化… 伊東… female ITO,… 0
6 1996 愛知 aichi 1 0 6 7 国民党 山田浩 male YAMA… 0
# ℹ 10 more variables: age <dbl>, exp <dbl>, status <dbl>, vote <dbl>,
# voteshare <dbl>, eligible <dbl>, turnout <dbl>, seshu_dummy <dbl>,
# jiban_seshu <chr>, nojiban_seshu <chr>
read.csv() and read_csv()
・read.csv() is a base R command
・read_csv() is an R command you can use only after you load the package {readr}, which is included in {tidyverse}
・You can use either read.csv() or read_csv() when you read a csv file
・Both work fine, but read_csv() is faster and more reliable, especially when you read large data
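The difference between the two readers can be seen in a self-contained sketch. It writes a tiny csv to a temporary file instead of using hr96-24.csv, so the file name and its three columns are stand-ins; with the real data you would pass the path to the file in your data folder.

```r
library(tidyverse)

# Write a tiny csv file to a temporary location (stand-in for the real file)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("year,seito,vote",
             "1996,A,66876",
             "1996,B,42969"), csv_path)

# Base R reader: returns a plain data.frame
hr_base <- read.csv(csv_path)

# {readr} reader: returns a tibble; show_col_types = FALSE silences the column report
hr_tidy <- read_csv(csv_path, show_col_types = FALSE)

class(hr_base)
class(hr_tidy)
```

Both objects hold the same values; the tibble prints more compactly and reports the class of each column.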
Get the file (hr96-24.csv) and open it as a .xls or .xlsx file in Excel
Click File and choose Save as
Choose CSV UTF-8 (.csv) and save it
.xls[x] file
Use {readxl} to read a .xls[x] file
It is important to keep in mind that loading data into R is just the beginning of an empirical analysis. You need to clean and customize the data before starting your empirical analysis
.dta file
A .dta file is binary data
Use {haven} to read a .dta file
Data source: Bruce Russett and John R. Oneal (2001) “Triangulating Peace”
# A tibble: 6 × 19
statea stateb year dependa dependb demauta demautb allies dispute1 logdstab
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 20 1920 0.0157 0.280 10 9 0 0 5.82
2 2 20 1921 0.0115 0.224 10 10 0 0 5.82
3 2 20 1922 0.0113 0.201 10 10 0 0 5.82
4 2 20 1923 0.0112 0.213 10 10 0 0 5.82
5 2 20 1924 0.0110 0.213 10 10 0 0 5.82
6 2 20 1925 0.0108 0.191 10 10 0 0 5.82
# ℹ 9 more variables: lcaprat2 <dbl>, smigoabi <dbl>, opena <dbl>, openb <dbl>,
# minrpwrs <dbl>, noncontg <dbl>, smldmat <dbl>, smldep <dbl>, dyadid <dbl>
Check the names of the variables triangle includes
[1] "statea" "stateb" "year" "dependa" "dependb" "demauta"
[7] "demautb" "allies" "dispute1" "logdstab" "lcaprat2" "smigoabi"
[13] "opena" "openb" "minrpwrs" "noncontg" "smldmat" "smldep"
[19] "dyadid"
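Reading a Stata file with {haven} and saving it as a csv file can be sketched as follows. To stay self-contained, the example first writes a two-row toy .dta file to a temporary location; the three columns reuse names from the list above, and the temporary paths are stand-ins for the files in your data folder.

```r
library(tidyverse)
library(haven)

# Create a toy .dta file (stand-in for the real Stata file)
dta_path <- tempfile(fileext = ".dta")
write_dta(tibble(statea = c(2, 2),
                 stateb = c(20, 20),
                 year   = c(1920, 1921)),
          dta_path)

# Read the binary .dta file into a tibble
triangle_toy <- read_dta(dta_path)

# Save the same data as a csv file
csv_path <- tempfile(fileext = ".csv")
write_csv(triangle_toy, csv_path)

names(triangle_toy)
```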
To convert RIANGLE.DTA to RIANGLE.csv, use the following R code:
triangle.csv is saved in the data folder within your R Project folder
.Rds file
Use {readr} to read an Rds file
Download hr96-21.Rds to your computer
Since {tidyverse} contains {readr}, you need to install and load either of these two packages:
Load {readr} here:
What is {tidyverse}?
{tidyverse} contains multiple useful R packages for data manipulation and data visualization
To use {tidyverse}, you need to take the following 2 steps:
1. Install {tidyverse}: type install.packages("tidyverse") in Console and hit the Return key
2. Load {tidyverse}: run library(tidyverse)
Pipes (%>%)
・Pipes (%>%) allow you to express a sequence of multiple operations
・Pipes can greatly simplify your code and make your operations more
intuitive
・The pipe operator (%>%) is automatically imported as
part of the {tidyverse} library
・Pipes (%>%) are included in the {magrittr} package
・The {magrittr} package is included in the {tidyverse} package
→ You need to load either {magrittr} or {tidyverse} to use the pipe operator (%>%)
How pipes (%>%) are
used
The pipe operator
(%>%) automatically passes the output from the first
line into the next line as the input
[1] 1 2 3 4 5 6 7 8 9 10
Calculate 1 + 2 + .... + 10
[1] 55
With the pipe operator (%>%), you write R code as follows:
[1] 55
[1] 7.416198
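The four results above can be reproduced with a sketch like the following. The object name x is my own choice; the original code is not shown.

```r
library(tidyverse)

# A vector holding the integers 1 to 10
x <- 1:10
x

# Without a pipe: the function wraps its input
sum(x)               # 55

# With a pipe: the input comes first, then the operation
x %>% sum()          # 55

# Pipes chain: the sum is passed on to sqrt()
x %>% sum() %>% sqrt()
```

The last line reads left to right (take x, sum it, take the square root), whereas the equivalent nested call sqrt(sum(x)) has to be read from the inside out.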
How to interpret R code with pipes (%>%)
・You can interpret the R code you made in 4.1 A simple scatter plot as follows:
df1 %>%                          # Use df1 as data
  ggplot(aes(x = math,           # Assign x = math
             y = stat,           # Assign y = stat
             color = gender)) +  # Dots are colored by gender
  geom_point()                   # Draw a scatter plot
・df1 %>% ggplot() means the first argument of ggplot() is df1
Interpretation of the R code:
・Use df1 as data
→ Assign x = math
→ Assign y = stat
→ Dots are colored by gender
→ Draw a scatter plot
・You don’t have to go backward in interpreting the R code
・R code with pipes (%>%) is intuitive and easy to follow
{dplyr} is one of the {tidyverse} packages and is powerful in manipulating data
Data preparation: hr96-24.csv
・Download Japanese lower house election data (1996-2024): hr96-24.csv
・In your R Project folder, make a new folder, name it data, and put hr96-24.csv in it
・hr96-24.csv is a collection of Japanese lower house election data covering 10 national elections (1996, 2000, 2003, 2005, 2009, 2012, 2014, 2017, 2021, 2024)
・Check the names of the variables hr contains
[1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu"
・hr has the following 22 variables (castvote in the table below is not among the names listed above)
| variable | detail |
|---|---|
| year | Election year (1996-2024) |
| pref | Prefecture |
| ku | Electoral district name |
| kun | Electoral district number |
| rank | Rank by votes within each district (1 = most votes) |
| wl | 0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner |
| nocand | Number of candidates in each district |
| seito | Candidate’s affiliated party (in Japanese) |
| j_name | Candidate’s name (Japanese) |
| name | Candidate’s name (English) |
| previous | Previous wins |
| gender | Candidate’s gender: "male", "female" |
| age | Candidate’s age |
| exp | Election expenditure (yen) spent by each candidate |
| status | 0 = challenger / 1 = incumbent / 2 = former incumbent |
| vote | Votes each candidate garnered |
| voteshare | Vote share (%) |
| eligible | Eligible voters in each district |
| turnout | Turnout in each district (%) |
| castvote | Total votes cast in each district |
| seshu_dummy | 0 = non-hereditary candidate, 1 = hereditary candidate |
| jiban_seshu | Relationship between candidate and their predecessor |
| nojiban_seshu | Relationship between candidate and their predecessor |
Using hr, let’s take a look at what we can do with {dplyr}
select()
With select(), you can select a column (or columns) you want from the data frame
How to use: Without using pipe %>%
select(data frame, Var1, Var2, ...)
The pipe operator (%>%) automatically passes the output from the first line (in this case, data frame) into the next line (in this case, select()) as the input
How to use: Using pipe %>%
data frame %>% select(Var1, Var2, ...)
Check the names of the variables in hr
[1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu"
Select 5 variables (year, ku, kun, seito, j_name) from the data frame hr
# A tibble: 10,773 × 5
year ku kun seito j_name
<dbl> <chr> <dbl> <chr> <chr>
1 1996 aichi 1 新進 河村たかし
2 1996 aichi 1 自民 今枝敬雄
3 1996 aichi 1 民主 佐藤泰介
4 1996 aichi 1 共産 岩中美保子
5 1996 aichi 1 文化フォーラム 伊東マサコ
6 1996 aichi 1 国民党 山田浩
7 1996 aichi 1 無所 浅野光雪
8 1996 aichi 2 新進 青木宏之
9 1996 aichi 2 自民 田辺広雄
10 1996 aichi 2 民主 古川元久
# ℹ 10,763 more rows
Select consecutive variables (year to j_name) with :
# A tibble: 10,773 × 9
year pref ku kun wl rank nocand seito j_name
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 1996 愛知 aichi 1 1 1 7 新進 河村たかし
2 1996 愛知 aichi 1 0 2 7 自民 今枝敬雄
3 1996 愛知 aichi 1 0 3 7 民主 佐藤泰介
4 1996 愛知 aichi 1 0 4 7 共産 岩中美保子
5 1996 愛知 aichi 1 0 5 7 文化フォーラム 伊東マサコ
6 1996 愛知 aichi 1 0 6 7 国民党 山田浩
7 1996 愛知 aichi 1 0 7 7 無所 浅野光雪
8 1996 愛知 aichi 2 1 1 8 新進 青木宏之
9 1996 愛知 aichi 2 0 2 8 自民 田辺広雄
10 1996 愛知 aichi 2 2 3 8 民主 古川元久
# ℹ 10,763 more rows
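The two ways of calling select(), and the : shortcut for consecutive columns, can be sketched on a toy data frame. hr_toy is a hypothetical stand-in for hr that holds three of the rows printed in this section and only six of the columns.

```r
library(tidyverse)

# Toy stand-in for hr (three rows, six columns)
hr_toy <- tibble(
  year   = c(1996, 1996, 1996),
  pref   = c("愛知", "愛知", "愛知"),
  ku     = c("aichi", "aichi", "aichi"),
  kun    = c(1, 1, 1),
  seito  = c("新進", "自民", "民主"),
  j_name = c("河村たかし", "今枝敬雄", "佐藤泰介")
)

# Without a pipe: the data frame is the first argument
picked_nopipe <- select(hr_toy, year, ku, kun, seito, j_name)

# With a pipe: the data frame is passed in from the left
picked_pipe <- hr_toy %>% select(year, ku, kun, seito, j_name)

# Consecutive columns: everything from year through kun
ranged <- hr_toy %>% select(year:kun)
names(ranged)
```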
Save the result under another object name
・So far, you have only selected variables temporarily and displayed them
・You did not save it as an object
・Let’s check how hr looks now
[1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu"
・The object hr has 22 variables
・If you want to use an object with 9 variables, you need to save it under a different object name from hr, such as hr1
・Check the names of the variables in hr1
[1] "year" "pref" "ku" "kun" "wl" "rank" "nocand" "seito"
[9] "j_name"
・This is how you get what you want
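That select() leaves the original object untouched unless you assign the result can be checked with a short sketch. hr_toy and its values are a hypothetical stand-in for hr.

```r
library(tidyverse)

# Toy stand-in for hr (two rows, four columns)
hr_toy <- tibble(
  year  = c(1996, 1996),
  pref  = c("愛知", "愛知"),
  ku    = c("aichi", "aichi"),
  seito = c("新進", "自民")
)

# Only displays the selection; hr_toy itself is not changed
hr_toy %>% select(year, ku)
names(hr_toy)    # still four variables

# Assigning with <- keeps the selection under a new name
hr1_toy <- hr_toy %>% select(year, ku)
names(hr1_toy)   # two variables
```

Saving under a new name rather than overwriting hr_toy keeps the full data available for later steps.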
With select(), you can simultaneously select a column and change the column name
Write: new name = old name
Select 6 variables (year, ku, kun, seito, j_name, vote) from the data frame hr and change j_name to namae
# A tibble: 10,773 × 6
year ku kun seito namae vote
<dbl> <chr> <dbl> <chr> <chr> <dbl>
1 1996 aichi 1 新進 河村たかし 66876
2 1996 aichi 1 自民 今枝敬雄 42969
3 1996 aichi 1 民主 佐藤泰介 33503
4 1996 aichi 1 共産 岩中美保子 22209
5 1996 aichi 1 文化フォーラム 伊東マサコ 616
6 1996 aichi 1 国民党 山田浩 566
7 1996 aichi 1 無所 浅野光雪 312
8 1996 aichi 2 新進 青木宏之 56101
9 1996 aichi 2 自民 田辺広雄 44938
10 1996 aichi 2 民主 古川元久 43804
# ℹ 10,763 more rows
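Selecting and renaming in one step can be sketched as follows, again on a hypothetical toy version of hr whose values are copied from the rows printed above.

```r
library(tidyverse)

# Toy stand-in for hr
hr_toy <- tibble(
  year   = c(1996, 1996),
  ku     = c("aichi", "aichi"),
  seito  = c("新進", "自民"),
  j_name = c("河村たかし", "今枝敬雄"),
  vote   = c(66876, 42969)
)

# Inside select(), write new name = old name to rename while selecting
renamed <- hr_toy %>%
  select(year, ku, seito, namae = j_name, vote)
names(renamed)
```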
With rename(), you can change the name of a variable
ku => senkyoku
seito => party
# A tibble: 10,773 × 22
year pref senkyoku kun wl rank nocand party j_name gender name
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 1996 愛知 aichi 1 1 1 7 新進 河村… male KAWA…
2 1996 愛知 aichi 1 0 2 7 自民 今枝… male IMAE…
3 1996 愛知 aichi 1 0 3 7 民主 佐藤… male SATO…
4 1996 愛知 aichi 1 0 4 7 共産 岩中… female IWAN…
5 1996 愛知 aichi 1 0 5 7 文化フォー… 伊東… female ITO,…
6 1996 愛知 aichi 1 0 6 7 国民党 山田浩 male YAMA…
7 1996 愛知 aichi 1 0 7 7 無所 浅野… male ASAN…
8 1996 愛知 aichi 2 1 1 8 新進 青木… male AOKI…
9 1996 愛知 aichi 2 0 2 8 自民 田辺… male TANA…
10 1996 愛知 aichi 2 2 3 8 民主 古川… male FURU…
# ℹ 10,763 more rows
# ℹ 11 more variables: previous <dbl>, age <dbl>, exp <dbl>, status <dbl>,
# vote <dbl>, voteshare <dbl>, eligible <dbl>, turnout <dbl>,
# seshu_dummy <dbl>, jiban_seshu <chr>, nojiban_seshu <chr>
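Unlike select(), rename() keeps every column and changes only the names you list. A minimal sketch on a hypothetical toy version of hr:

```r
library(tidyverse)

# Toy stand-in for hr
hr_toy <- tibble(
  year  = c(1996, 1996),
  ku    = c("aichi", "aichi"),
  seito = c("新進", "自民")
)

# new name = old name; all other columns stay as they are
renamed <- hr_toy %>%
  rename(senkyoku = ku,
         party    = seito)
names(renamed)
```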
To drop year and pref from the data frame hr, you use the following R code:
# A tibble: 10,773 × 20
ku kun wl rank nocand seito j_name gender name previous age
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 aichi 1 1 1 7 新進 河村… male KAWA… 2 47
2 aichi 1 0 2 7 自民 今枝… male IMAE… 2 72
3 aichi 1 0 3 7 民主 佐藤… male SATO… 2 53
4 aichi 1 0 4 7 共産 岩中… female IWAN… 0 43
5 aichi 1 0 5 7 文化フォー… 伊東… female ITO,… 0 51
6 aichi 1 0 6 7 国民党 山田浩 male YAMA… 0 51
7 aichi 1 0 7 7 無所 浅野… male ASAN… 0 45
8 aichi 2 1 1 8 新進 青木… male AOKI… 2 51
9 aichi 2 0 2 8 自民 田辺… male TANA… 0 71
10 aichi 2 2 3 8 民主 古川… male FURU… 0 30
# ℹ 10,763 more rows
# ℹ 9 more variables: exp <dbl>, status <dbl>, vote <dbl>, voteshare <dbl>,
# eligible <dbl>, turnout <dbl>, seshu_dummy <dbl>, jiban_seshu <chr>,
# nojiban_seshu <chr>
With c() and :, you can exclude (NOT select) consecutive variables
To drop the variables from year to nocand, you use the following R code:
# A tibble: 10,773 × 15
seito j_name gender name previous age exp status vote voteshare
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 新進 河村… male KAWA… 2 47 9.83e6 1 66876 40
2 自民 今枝… male IMAE… 2 72 9.31e6 2 42969 25.7
3 民主 佐藤… male SATO… 2 53 9.23e6 1 33503 20.1
4 共産 岩中… female IWAN… 0 43 2.18e6 0 22209 13.3
5 文化フォー… 伊東… female ITO,… 0 51 NA 0 616 0.4
6 国民党 山田浩 male YAMA… 0 51 NA 0 566 0.3
7 無所 浅野… male ASAN… 0 45 NA 0 312 0.2
8 新進 青木… male AOKI… 2 51 1.29e7 1 56101 32.9
9 自民 田辺… male TANA… 0 71 1.65e7 2 44938 26.4
10 民主 古川… male FURU… 0 30 1.14e7 0 43804 25.7
# ℹ 10,763 more rows
# ℹ 5 more variables: eligible <dbl>, turnout <dbl>, seshu_dummy <dbl>,
# jiban_seshu <chr>, nojiban_seshu <chr>
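Dropping columns can be sketched as follows. The minus sign in front of c() is one common spelling and is an assumption here, since the original code is not shown; hr_toy is a hypothetical stand-in for hr.

```r
library(tidyverse)

# Toy stand-in for hr
hr_toy <- tibble(
  year   = c(1996, 1996),
  pref   = c("愛知", "愛知"),
  ku     = c("aichi", "aichi"),
  kun    = c(1, 1),
  seito  = c("新進", "自民"),
  j_name = c("河村たかし", "今枝敬雄")
)

# Drop two named columns
no_year_pref <- hr_toy %>% select(-c(year, pref))
names(no_year_pref)

# Drop a run of consecutive columns, from year through kun
no_run <- hr_toy %>% select(-c(year:kun))
names(no_run)
```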
starts_with()
hr includes the following 22 variables:
[1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu"
Select the variable ku in addition to j_name
# A tibble: 10,773 × 2
j_name ku
<chr> <chr>
1 河村たかし aichi
2 今枝敬雄 aichi
3 佐藤泰介 aichi
4 岩中美保子 aichi
5 伊東マサコ aichi
6 山田浩 aichi
7 浅野光雪 aichi
8 青木宏之 aichi
9 田辺広雄 aichi
10 古川元久 aichi
# ℹ 10,763 more rows
ends_with()
Select the variables whose names end with seshu in addition to j_name
# A tibble: 10,773 × 3
j_name jiban_seshu nojiban_seshu
<chr> <chr> <chr>
1 河村たかし <NA> <NA>
2 今枝敬雄 <NA> <NA>
3 佐藤泰介 <NA> <NA>
4 岩中美保子 <NA> <NA>
5 伊東マサコ <NA> <NA>
6 山田浩 <NA> <NA>
7 浅野光雪 <NA> <NA>
8 青木宏之 <NA> <NA>
9 田辺広雄 伯父=加藤鐐五郎(衆議院議員) <NA>
10 古川元久 <NA> <NA>
# ℹ 10,763 more rows
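The helpers starts_with() and ends_with() match column names by prefix and suffix. The sketch below uses a hypothetical toy version of hr; the seshu_dummy values in it are made up for illustration. Note that in this toy data starts_with("ku") matches both ku and kun.

```r
library(tidyverse)

# Toy stand-in for hr (seshu_dummy values are hypothetical)
hr_toy <- tibble(
  j_name        = c("河村たかし", "田辺広雄"),
  ku            = c("aichi", "aichi"),
  kun           = c(1, 2),
  seshu_dummy   = c(0, 1),
  jiban_seshu   = c(NA, "伯父=加藤鐐五郎(衆議院議員)"),
  nojiban_seshu = c(NA_character_, NA_character_)
)

# Columns whose names start with "ku", plus j_name
by_prefix <- hr_toy %>% select(j_name, starts_with("ku"))
names(by_prefix)

# Columns whose names end with "seshu", plus j_name
# (seshu_dummy is not matched because it ends with "dummy")
by_suffix <- hr_toy %>% select(j_name, ends_with("seshu"))
names(by_suffix)
```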
[1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu"
Change the order of the variables:
・Put the variables (name to nojiban_seshu) after year
・Put the variables (pref to gender) after seshu
[1] "year" "name" "previous" "age"
[5] "exp" "status" "vote" "voteshare"
[9] "eligible" "turnout" "seshu_dummy" "jiban_seshu"
[13] "nojiban_seshu" "pref" "ku" "kun"
[17] "wl" "rank" "nocand" "seito"
[21] "j_name" "gender"
relocate()
Put the variables (name to nojiban_seshu) after year
How to use relocate()
Relocate var2 after var1
data frame %>% relocate(var2, .after = var1)
Relocate var2 before var1
data frame %>% relocate(var2, .before = var1)
[1] "year" "name" "previous" "age"
[5] "exp" "status" "vote" "voteshare"
[9] "eligible" "turnout" "seshu_dummy" "jiban_seshu"
[13] "nojiban_seshu" "pref" "ku" "kun"
[17] "wl" "rank" "nocand" "seito"
[21] "j_name" "gender"
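Both directions of relocate() can be sketched on a small example. hr_toy is a hypothetical stand-in for hr with five columns; the values are copied from rows printed earlier.

```r
library(tidyverse)

# Toy stand-in for hr
hr_toy <- tibble(
  year   = c(1996, 1996),
  pref   = c("愛知", "愛知"),
  ku     = c("aichi", "aichi"),
  j_name = c("河村たかし", "今枝敬雄"),
  age    = c(47, 72)
)

# Move a run of columns (j_name to age) so that it follows year
moved_after <- hr_toy %>% relocate(j_name:age, .after = year)
names(moved_after)

# Move ku so that it comes before pref
moved_before <- hr_toy %>% relocate(ku, .before = pref)
names(moved_before)
```

In both calls the first argument names the column(s) to move, and .after or .before names the anchor column.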
filter()
| select(): | Select a column |
| filter(): | Select a row |
How to use filter()
data frame %>% filter(condition1, condition2,...)
To use filter(), you need to understand what logical
operators, such as ==, >,
& mean
Let’s use the data frame on Japanese lower house election
hr here
[1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu"
Using unique(), check the elements (values) in year
[1] 1996 2000 2003 2005 2009 2012 2014 2017 2021 2024
Select the rows of the 2021 HR election (year == 2021)
# A tibble: 857 × 22
year pref ku kun wl rank nocand seito j_name gender name previous
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 2021 愛知 aichi 1 1 1 3 自民 熊田… male KUMA… 3
2 2021 愛知 aichi 1 2 2 3 立憲 吉田… male YOSH… 2
3 2021 愛知 aichi 1 0 3 3 N党 門田… female KADO… 0
4 2021 愛知 aichi 2 1 1 2 国民 古川… male FURU… 8
5 2021 愛知 aichi 2 2 2 2 自民 中川… male NAKA… 0
6 2021 愛知 aichi 3 1 1 2 立憲 近藤… male KOND… 8
7 2021 愛知 aichi 3 2 2 2 自民 池田… male IKED… 3
8 2021 愛知 aichi 4 1 1 3 自民 工藤… male KUDO… 3
9 2021 愛知 aichi 4 2 2 3 立憲 牧義夫 male MAKI… 6
10 2021 愛知 aichi 4 0 3 3 維新 中田… female NAKA… 0
# ℹ 847 more rows
# ℹ 10 more variables: age <dbl>, exp <dbl>, status <dbl>, vote <dbl>,
# voteshare <dbl>, eligible <dbl>, turnout <dbl>, seshu_dummy <dbl>,
# jiban_seshu <chr>, nojiban_seshu <chr>
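Checking the values of a variable with unique() and then keeping only the matching rows with filter() can be sketched as follows. hr_toy is a hypothetical four-row stand-in for hr, with values copied from rows printed in this section.

```r
library(tidyverse)

# Toy stand-in for hr
hr_toy <- tibble(
  year   = c(1996, 1996, 2021, 2021),
  seito  = c("新進", "自民", "N党", "立憲"),
  j_name = c("河村たかし", "今枝敬雄", "門田節代", "近藤昭一"),
  gender = c("male", "male", "female", "male")
)

# Which election years appear in the data?
unique(hr_toy$year)

# Keep only the rows where year equals 2021 (numeric: no quotation marks)
hr_2021 <- hr_toy %>% filter(year == 2021)

# Keep only the rows of one party (character: quotation marks needed)
hr_ldp <- hr_toy %>% filter(seito == "自民")

hr_2021
```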
Using unique(), check the elements (values) in seito in the data frame hr
[1] "新進" "自民" "民主"
[4] "共産" "文化フォーラム" "国民党"
[7] "無所" "自由連合" "政事公団太平会"
[10] "社民" "新社会" "日本新進"
[13] "新党さきがけ" "青年自由" "さわやか神戸・市民の会"
[16] "民主改革連合" "市民新党にいがた" "沖縄社会大衆党"
[19] "緑の党" "公明" "諸派"
[22] "保守" "無所属の会" "自由"
[25] "改革クラブ" "保守新" "新党尊命"
[28] "ニューディールの会" "世界経済共同体党" "新党日本"
[31] "国民新党" "新党大地" "幸福"
[34] "みんな" "改革" "日本未来"
[37] "日本維新の会" "当たり前" "アイヌ民族党"
[40] "政治団体代表" "安楽死党" "次世"
[43] "維新" "生活" "立憲"
[46] "希望" "緒派" "N党"
[49] "国民" "れい" "参政"
Select the rows of LDP candidates (seito == "自民")
# A tibble: 2,808 × 22
year pref ku kun wl rank nocand seito j_name gender name previous
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 1996 愛知 aichi 1 0 2 7 自民 今枝… male IMAE… 2
2 1996 愛知 aichi 2 0 2 8 自民 田辺… male TANA… 0
3 1996 愛知 aichi 3 0 2 7 自民 片岡… male KATA… 2
4 1996 愛知 aichi 4 0 2 6 自民 塚本… male TSUK… 9
5 1996 愛知 aichi 5 2 2 7 自民 木村… male KIMU… 0
6 1996 愛知 aichi 6 0 2 8 自民 伊藤… male ITO,… 0
7 1996 愛知 aichi 7 0 2 7 自民 丹羽… male NIWA… 0
8 1996 愛知 aichi 8 1 1 5 自民 久野… male KUNO… 2
9 1996 愛知 aichi 9 0 2 7 自民 吉川博 male YOSH… 0
10 1996 愛知 aichi 10 0 2 7 自民 森治男 male MORI… 0
# ℹ 2,798 more rows
# ℹ 10 more variables: age <dbl>, exp <dbl>, status <dbl>, vote <dbl>,
# voteshare <dbl>, eligible <dbl>, turnout <dbl>, seshu_dummy <dbl>,
# jiban_seshu <chr>, nojiban_seshu <chr>
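Combining select() and filter()
Select the LDP candidates (seito == "自民") running in the 2021 HR election (year == 2021)
Order matters in using both filter() and select()
Use filter() → select()
If you use select() first and drop a variable that filter() needs, you get an error:
Error in filter():
! Problem while computing ..1 = year == 2021.
Caused by error:
! object year not found
The error message says object "year" not found, because year is not among the variables kept by select(pref:turnout)
Variable class matters
If the variable class is numeric (double, numeric, or integer), then these variables include numeric elements, and you write the value without quotation marks
If the variable class is character, you need to enclose the value in quotation marks ("")
filter(year == 2021)・・・the variable class of year is numeric
filter(seito == "自民")・・・the variable class of seito is character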
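Why the order filter() → select() matters can be checked with a short sketch. hr_toy is a hypothetical stand-in for hr, and tryCatch() is used only so that the failing pipeline does not stop the script; the exact error message may differ depending on package versions.

```r
library(tidyverse)

# Toy stand-in for hr
hr_toy <- tibble(
  year   = c(1996, 2021, 2021),
  seito  = c("自民", "N党", "立憲"),
  j_name = c("今枝敬雄", "門田節代", "近藤昭一")
)

# Right order: filter on year while the column still exists, then select
right_order <- hr_toy %>%
  filter(year == 2021) %>%
  select(seito, j_name)

# Wrong order: select() drops year, so filter() can no longer use it
wrong_order <- tryCatch(
  hr_toy %>%
    select(seito, j_name) %>%
    filter(year == 2021),
  error = function(e) "error"
)

right_order
wrong_order
```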
AND: & or ,
Select the rows that satisfy both of the following conditions from hr:
・Condition 1: Candidates nominated in the 2021 HR election (year == 2021)
・Condition 2: Female candidates (gender == "female")
hr %>%
  filter(year == 2021 & gender == "female") %>%        # Select 2021 HR election data and female candidates
  select(pref, kun, seito, j_name, gender, rank, wl)   # Select 7 variables
# A tibble: 140 × 7
pref kun seito j_name gender rank wl
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 愛知 1 N党 門田節代 female 3 0
2 愛知 4 維新 中田千代 female 3 0
3 愛知 5 維新 岬麻紀 female 3 2
4 愛知 7 共産 須山初美 female 3 0
5 愛知 10 れい 安井美沙子 female 4 0
6 青森 1 共産 斎藤美緒 female 3 0
7 青森 2 立憲 高畑紀子 female 2 0
8 青森 2 共産 田端深雪 female 3 0
9 千葉 6 共産 浅野史子 female 3 0
10 千葉 7 立憲 竹内千春 female 2 0
# ℹ 130 more rows
OR: |
Select the rows that satisfy either of the following conditions from hr:
・Condition 1: CDP candidates nominated in the 2021 HR election (seito == "立憲")
・Condition 2: SDP candidates nominated in the 2021 HR election (seito == "社民")
hr %>%
  filter(year == 2021) %>%                             # Select 2021 HR election data
  filter(seito == "立憲" | seito == "社民") %>%        # Select either CDP or SDP candidates
  select(pref, kun, seito, j_name, gender, rank, wl)   # Select 7 variables
# A tibble: 223 × 7
pref kun seito j_name gender rank wl
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 愛知 1 立憲 吉田統彦 male 2 2
2 愛知 3 立憲 近藤昭一 male 1 1
3 愛知 4 立憲 牧義夫 male 2 2
4 愛知 5 立憲 西川厚志 male 2 0
5 愛知 6 立憲 松田功 male 2 0
6 愛知 7 立憲 森本和義 male 2 0
7 愛知 8 立憲 伴野豊 male 2 2
8 愛知 9 立憲 岡本充功 male 2 0
9 愛知 10 立憲 藤原規真 male 3 0
10 愛知 12 立憲 重徳和彦 male 1 1
# ℹ 213 more rows
Combining & and |
・Condition 1: CDP candidates nominated in the 2021 HR election (seito == "立憲")
・Condition 2: SDP candidates nominated in the 2021 HR election (seito == "社民")
・Condition 3: Winners in single-member districts (wl == 1)
hr %>%
  filter(year == 2021) %>%
  filter(seito == "立憲" | seito == "社民") %>%
  filter(wl == 1) %>%
  select(pref, kun, seito, j_name, gender, rank, wl)
# A tibble: 58 × 7
pref kun seito j_name gender rank wl
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 愛知 3 立憲 近藤昭一 male 1 1
2 愛知 12 立憲 重徳和彦 male 1 1
3 愛知 13 立憲 大西健介 male 1 1
4 秋田 2 立憲 緑川貴士 male 1 1
5 千葉 1 立憲 田嶋要 male 1 1
6 千葉 4 立憲 野田佳彦 male 1 1
7 千葉 8 立憲 本庄知史 male 1 1
8 千葉 9 立憲 奥野総一郎 male 1 1
9 福岡 5 立憲 堤かなめ female 1 1
10 福岡 10 立憲 城井崇 male 1 1
# ℹ 48 more rows
%in%
With %in%, you can further simplify the R code:
hr %>%
  filter(year == 2021) %>%
  filter(seito %in% c("立憲", "社民"),
         wl == 1) %>%
  select(pref, kun, seito, j_name, gender, rank, wl)
# A tibble: 58 × 7
pref kun seito j_name gender rank wl
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 愛知 3 立憲 近藤昭一 male 1 1
2 愛知 12 立憲 重徳和彦 male 1 1
3 愛知 13 立憲 大西健介 male 1 1
4 秋田 2 立憲 緑川貴士 male 1 1
5 千葉 1 立憲 田嶋要 male 1 1
6 千葉 4 立憲 野田佳彦 male 1 1
7 千葉 8 立憲 本庄知史 male 1 1
8 千葉 9 立憲 奥野総一郎 male 1 1
9 福岡 5 立憲 堤かなめ female 1 1
10 福岡 10 立憲 城井崇 male 1 1
# ℹ 48 more rows
Missing values: NA
Election expenditure (exp) includes a number of missing values.
Calculate the mean of exp.
[1] NA
The result is NA
Use na.rm = TRUE within mean(), excluding the missing values in calculating the mean.
[1] 7125574
Alternatively, use drop_na() before calculating the mean.
# A tibble: 9,479 × 3
year j_name exp
<dbl> <chr> <dbl>
1 1996 河村たかし 9828097
2 1996 今枝敬雄 9311555
3 1996 佐藤泰介 9231284
4 1996 岩中美保子 2177203
5 1996 青木宏之 12940178
6 1996 田辺広雄 16512426
7 1996 古川元久 11435567
8 1996 石山淳一 2128510
9 1996 藤原美智子 3270533
10 1996 吉田幸弘 11245219
# ℹ 9,469 more rows
Now, we see the list of those candidates (9,479) who reported their campaign expenditure.
Calculate the mean of exp for these 9,479 candidates.
[1] 7125574
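The two ways of handling missing values give the same mean, which can be checked on a small example. hr_toy is a hypothetical three-row stand-in for hr; the exp values are copied from rows printed in this section, one of them missing.

```r
library(tidyverse)

# Toy stand-in for hr: one candidate did not report expenditure
hr_toy <- tibble(
  j_name = c("河村たかし", "今枝敬雄", "伊東マサコ"),
  exp    = c(9828097, 9311555, NA)
)

# A single NA makes the plain mean NA
mean_plain <- mean(hr_toy$exp)

# Way 1: tell mean() to skip missing values
mean_narm <- mean(hr_toy$exp, na.rm = TRUE)

# Way 2: drop the rows with a missing exp first, then take the mean
hr_complete <- hr_toy %>% drop_na(exp)
mean_dropped <- mean(hr_complete$exp)

c(mean_narm, mean_dropped)
```

drop_na(exp) removes only the rows where exp is missing; calling drop_na() without naming a column would remove every row that has a missing value in any column.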
・In reference to 5.3 Select a row: filter(),
answer the following 2 questions:
・Use the following 7 variables
・In answering these questions, use hr96-24.csv
DT::datatable()wl == 1)seito == "民主")DT::datatable()wl == 1)seito == "民主")