• Load the R package {tidyverse} to be used here
library(haven)
library(readxl)
library(tidyverse)

In this section, you will learn:

  • How to create variables from scratch.
  • How to merge variables and create data frames.
  • How to merge multiple data frames.
  • How to import data of various file formats into RStudio.
  • As a preliminary step to data analysis, an explanation of basic terminology in the field will be provided.

Specialized terms explained here include:

  • Text data
  • Binary data
  • File extension
  • Path
  • File
  • Folder
  • R project
  • R project folder
  • Working directory
  • Missing values
  • Variable types
  • Built-in R data
  • Pipe operator (%>%, |>)
  • Variable types (class)
  • AND operator (&)
  • OR operator (|)
  • %in% operator
  • Missing values
Loading Data for Analysis:
  • Download the vote data for the House of Representatives general election from hr96-21.csv
  • Create a folder named data within your R project folder.
  • Place the downloaded hr96-21.csv file inside the data folder.
  • To read the .csv file, use the read_csv() function included within the {tidyverse} package.
hr <- read_csv("data/hr96-21.csv", 
               na = ".")  # Commands for handling missing values    
  • hr looks like this:
DT::datatable(hr)

1. Sorting Rows: arrange()

  • The first argument is the data frame object.
  • Subsequent arguments are the variable names used for sorting.
  • Rows with smaller values appear at the top.
  • For example, using hr, you can display candidates among the winners (wl > 0) with the lowest number of votes (vote).
hr %>%
  filter(wl > 0) %>%
  arrange(vote) %>%
  select(ku, kun, seito, name, wl, gender, rank, vote)
# A tibble: 3,659 × 8
   ku          kun seito        name                  wl gender  rank  vote
   <chr>     <dbl> <chr>        <chr>              <dbl> <chr>  <dbl> <dbl>
 1 tokyo        22 社民         HOSAKA, NOBUTO         2 male       5 13904
 2 kyoto         3 日本維新の会 MORI, NATSUE           2 female     4 16511
 3 saitama       6 社民         FUKADA, HAJIME         2 male       4 17909
 4 saitama       8 共産         SHIOKAWA, TETSUYA      2 male       4 18512
 5 nara          1 民主         IENISHI, SATORU        2 male       4 18994
 6 okinawa       1 共産         AKAMINE, SEIKEN        2 male       4 19528
 7 kochi         3 共産         HARUNA, NAOAKI         2 male       3 19549
 8 kyoto         5 希望         INOUE, KAZUNORI        2 male       4 19586
 9 tokushima     1 維新         YOSHIDA, TOMOYO        2 female     3 20065
10 ehime         2 維新         YOKOYAMA, HIROYUKI     2 male       3 22677
# ℹ 3,649 more rows

Excluding cases with missing values in advance: is.na()

  • To determine whether a value is NA or not, you can use the is.na() function.
hr %>%
  filter(is.na(exp)) %>%
  select(year, name, exp)
# A tibble: 181 × 3
    year name                 exp
   <dbl> <chr>              <dbl>
 1  1996 ITO, MASAKO           NA
 2  1996 YAMADA, HIROSHI       NA
 3  1996 ASANO, KOSETSU        NA
 4  1996 ISHIKAWA, KAZUMI      NA
 5  1996 MURAMATSU, YOICHI     NA
 6  1996 YAMAZAKI, YOSHIAKI    NA
 7  1996 NAKANO, YOKO          NA
 8  1996 OGAWA, OSAMU          NA
 9  1996 ITO, TADAHIKO         NA
10  1996 ITO, TAKAYOSHI        NA
# ℹ 171 more rows
  • It can be seen that there are as many as 2831 NA values in exp.
  • Check the number of candidates (N) and the number of variables included in hr.
dim(hr)
[1] 9660   22
  • It is evident that out of 9660, 2831 are missing values in exp.  
  • Since we want to exclude the missing values in exp here, we use the negation symbol !.
hr %>%
  filter(!is.na(exp)) %>%
  select(year, name, exp)
# A tibble: 9,479 × 3
    year name                    exp
   <dbl> <chr>                 <dbl>
 1  1996 KAWAMURA, TAKASHI   9828097
 2  1996 IMAEDA, NORIO       9311555
 3  1996 SATO, TAISUKE       9231284
 4  1996 IWANAKA, MIHOKO     2177203
 5  1996 AOKI, HIROYUKI     12940178
 6  1996 TANABE, HIROO      16512426
 7  1996 FURUKAWA, MOTOHISA 11435567
 8  1996 ISHIYAMA, JUNICHI   2128510
 9  1996 FUJIWARA, MICHIKO   3270533
10  1996 YOSHIDA, YUKIHIRO  11245219
# ℹ 9,469 more rows
  • We were able to extract 6829 rows where exp is not missing.

2. Sorting Rows

Display in ascending order of value: arrange()

  • Displaying in Ascending Order.
  • To sort rows, use the arrange() function.
  • The first argument is the data frame object.
  • Subsequent arguments are the variable names used for sorting.
  • Rows with smaller values appear at the top.
  • For example, you can extract rows of winners (that is, wl > 0) from hr and display the winning candidates with the fewest votes (vote) in ascending order.
hr %>%
  filter(wl > 0) %>%
  arrange(vote) %>%
  select(year, ku, kun, seito, name, wl, gender, rank, vote)
# A tibble: 3,659 × 9
    year ku          kun seito        name                 wl gender  rank  vote
   <dbl> <chr>     <dbl> <chr>        <chr>             <dbl> <chr>  <dbl> <dbl>
 1  1996 tokyo        22 社民         HOSAKA, NOBUTO        2 male       5 13904
 2  2017 kyoto         3 日本維新の会 MORI, NATSUE          2 female     4 16511
 3  1996 saitama       6 社民         FUKADA, HAJIME        2 male       4 17909
 4  2003 saitama       8 共産         SHIOKAWA, TETSUYA     2 male       4 18512
 5  1996 nara          1 民主         IENISHI, SATORU       2 male       4 18994
 6  2003 okinawa       1 共産         AKAMINE, SEIKEN       2 male       4 19528
 7  1996 kochi         3 共産         HARUNA, NAOAKI        2 male       3 19549
 8  2017 kyoto         5 希望         INOUE, KAZUNORI       2 male       4 19586
 9  2021 tokushima     1 維新         YOSHIDA, TOMOYO       2 female     3 20065
10  2014 ehime         2 維新         YOKOYAMA, HIROYU…     2 male       3 22677
# ℹ 3,649 more rows
  • In the general elections from 1996 onwards, the candidate who won with the fewest votes was Mr. HOSAKA, NOBUTO in Tokyo’s 22nd district in 1996, with 13,904 votes (wl = 2: Zombie winner).
  • If you want to find the candidate with the smallest number of votes in a single-member constituency, excluding zombie winners, you can specify the condition as wl == 1.
hr %>%
  filter(wl == 1) %>%
  arrange(vote) %>%
  select(year, ku, kun, seito, name, wl, gender, rank, vote)
# A tibble: 2,674 × 9
    year ku          kun seito name                 wl gender  rank  vote
   <dbl> <chr>     <dbl> <chr> <chr>             <dbl> <chr>  <dbl> <dbl>
 1  1996 kochi         1 共産  YAMAHARA, KENJIRO     1 male       1 33523
 2  2000 kochi         1 自民  FUKUI, TERU           1 male       1 40765
 3  2000 osaka        17 自民  OKASHITA, NOBUKO      1 female     1 41781
 4  1996 kyoto         2 自民  OKUDA, MIKIO          1 male       1 43060
 5  2003 kochi         1 自民  FUKUI, TERU           1 male       1 43232
 6  2012 kochi         1 自民  FUKUI, TERU           1 male       1 44027
 7  2009 kochi         1 自民  FUKUI, TERU           1 male       1 44068
 8  1996 kanagawa      4 自民  IIJIMA, TADAYOSHI     1 male       1 46389
 9  1996 tokushima     1 民主  SENGOKU, YOSHITO      1 male       1 47057
10  1996 fukui         1 新進  SASAKI, RYUZO         1 male       1 48214
# ℹ 2,664 more rows
  • Since 1996, the person who won an election with the fewest votes in a general election in Japan is YAMAHARA, KENJIRO from Kochi 1st district in 1996, with 33,523 votes (winning in a single-member district).

Display in descending order of value: arrange(desc())

  • If you want to display the candidates with the most votes at the top, enclose the variable name with the desc() function
hr %>%
  arrange(desc(vote)) %>%
  select(year, ku, kun, seito, name, wl, gender, rank, vote)
# A tibble: 9,660 × 9
    year ku         kun seito name                  wl gender  rank   vote
   <dbl> <chr>    <dbl> <chr> <chr>              <dbl> <chr>  <dbl>  <dbl>
 1  2021 kanagawa    15 自民  KONO, TARO             1 male       1 210515
 2  2009 hokkaido     9 民主  HATOYAMA, YUKIO        1 male       1 201461
 3  2009 shizuoka     6 民主  WATANABE, SHU          1 male       1 197688
 4  2005 kanagawa    11 自民  KOIZUMI, JUNICHIRO     1 male       1 197037
 5  2012 kanagawa    15 自民  KONO, TARO             1 male       1 192604
 6  2009 saitama      6 民主  OSHIMA, ATSUHI         1 male       1 186993
 7  2005 kanagawa    15 自民  KONO, TARO             1 male       1 186770
 8  2009 hokkaido     3 民主  ARAI, SATOSHI          1 male       1 186081
 9  2012 kanagawa    11 自民  KOIZUMI, SHINJIRO      1 male       1 184360
10  2009 shizuoka     5 民主  HOSONO, GOSHI          1 male       1 184328
# ℹ 9,650 more rows
  • Since 1996, the candidate who won with the highest number of votes in a general election is KONO, TARO from Kanagawa 15th district in 2021, with 210,515 votes (winning in a single-member district).
  • If you want to know about candidates who are not in the top ranks, it’s convenient to use the DT::datatable() function.
hr2 <- hr %>%
  filter(wl == 1) %>% # Extract only those single-member-constituency winners  
  arrange(desc(vote)) %>%
  select(year, ku, kun, seito, name, wl, gender, rank, vote, voteshare)

DT::datatable(hr2)

3. Row Binding: bind_rows()

  • When you want to vertically combine multiple data.frames, you can use bind_rows().
  • Let’s assume you have the follwoing two data.frames: df1 and df2
df1 <- data.frame(id = 1:5,
                  name = c("A", "B", "C", "D", "E"),
                  score = c(100, 90, 80, 70, 60))

df2 <- data.frame(id = 6:8,
                  name = c("F", "G", "H"),
                  score = c(50, 40, 30))
  • Check how df1 and df2 look like
df1
  id name score
1  1    A   100
2  2    B    90
3  3    C    80
4  4    D    70
5  5    E    60
df2
  id name score
1  6    F    50
2  7    G    40
3  8    H    30
  • The two data.frames share the same variable names: name and score.
    → You can stack them vertically using the bind_rows()function.
  • When combining data vertically, it’s necessary for the variable names in each data.frame to match.
  • In this case, id, name, and score match.
bind_rows(df1, df2)
  id name score
1  1    A   100
2  2    B    90
3  3    C    80
4  4    D    70
5  5    E    60
6  6    F    50
7  7    G    40
8  8    H    30

list() Function

  • Let’s say df1 and df2 represent student data for the 1st and 2nd-year seminars, respectively.
  • When you use bind_rows() to combine them vertically, you lose information about the students’ academic year.
  • If you want to add an identification variable for students (1st year or 2nd year) while combining the data vertically:
    → You can group the data using the list() function.
    → Add the .id argument.
bind_rows(list("freshman" = df1,
               "sophomore" = df2),
          .id = "year")
       year id name score
1  freshman  1    A   100
2  freshman  2    B    90
3  freshman  3    C    80
4  freshman  4    D    70
5  freshman  5    E    60
6 sophomore  6    F    50
7 sophomore  7    G    40
8 sophomore  8    H    30
  • Now, you can distinguish the academic year of each student.

4. Column Merge:

  • There are variations when merging data horizontally by columns.
Important points to note:
  • You need an identifier variable (key variable) for the merge.
  • Identifier variable (key variable) — a variable common to two data.frames (in the example below, it’s univ).
df1 <- data.frame(univ  = c("拓殖大学", "早稲田大学", "UCLA"),
                  city   = c("東京", "東京", "LA"),
                  pop  = c(8600, 47000, 45000))

df2 <- data.frame(univ  = c("拓殖大学", "早稲田大学", "UCLA"),
                  color  = c("オレンジ", "えんじ", "黄色と青"))
  • How to merge these two sets of data:
  1. left_join()
  2. right_join()
  3. inner_join()
  4. full_join()
All of these functions works in the same way.
  • Enter the object names of the two data.frames to be merged.
  • Add the argument by = "key variable name".

4.1 4.1 The Mechanism of the left_join() function

  • left_join(x, y) is a function that preserves x.
  • Data frame x contains: 拓殖大学、早稲田大学、and UCLA.
  • Data frame y contains: 拓殖大学、早稲田大学、and 東北大学.
  • 拓殖大学 and 早稲田大学, where the variable names match, can be merged without any issue.
  • What happens with UCLA in x and 東北大学 in y?
  • Using left_join(x, y), UCLA in x is prioritized and retained.
  • 東北大学 disappears → The founder for UCLA will be shown as a missing value (NA).


When a key variable with matching variable names is included in both datasets
x <- data.frame(univ  = c("拓殖大学", "早稲田大学", "UCLA"),
                pop  = c(8600, 47000, 45000))

y <- data.frame(univ  = c("拓殖大学", "早稲田大学", "東北大学"),
                founder  = c("桂太郎", "大隈重信", "日本国"))
left_join(x, y, by = "univ")
        univ   pop  founder
1   拓殖大学  8600   桂太郎
2 早稲田大学 47000 大隈重信
3       UCLA 45000     <NA>

4.2 The Mechanism of the right_join() function

  • right_join(x, y) is a function that preserves data frame y.
  • Data frame x contains: 拓殖大学、早稲田大学、and UCLA.
  • Data frame y contains: 拓殖大学、早稲田大学、and 東北大学.
  • 拓殖大学 and 早稲田大学, where the variable names match, can be merged without any issue.
  • What happens with UCLA in x and 東北大学 in y?
  • Using right_join(x, y), 東北大学 in y is prioritized and retained.
  • UCLA disappears → The pop for 東北大学 will be shown as a missing value (NA).


When a key variable with matching variable names is included in both datasets
x <- data.frame(pop  = c(8600, 47000, 45000),
                univ  = c("拓殖大学", "早稲田大学", "UCLA"))

y <- data.frame(univ  = c("拓殖大学", "早稲田大学", "東北大学"),
                  founder  = c("桂太郎", "大隈重信", "日本国"))
right_join(x, y, by = "univ")
    pop       univ  founder
1  8600   拓殖大学   桂太郎
2 47000 早稲田大学 大隈重信
3    NA   東北大学   日本国

4.3 The Mechanism of the inner_join() function.

  • Data frame x contains: 拓殖大学、早稲田大学、and UCLA.
  • Data frame y contains: 拓殖大学、早稲田大学、and 東北大学.
  • inner_join(x, y) merges only the rows that exist simultaneously in both x and y datasets.
    → Only 拓殖大学 and 早稲田大学 will be merged.


x <- data.frame(pop  = c(8600, 47000, 45000),
                univ  = c("拓殖大学", "早稲田大学", "UCLA"))

y <- data.frame(univ  = c("拓殖大学", "早稲田大学", "東北大学"),
                  founder  = c("桂太郎", "大隈重信", "日本国"))
inner_join(x, y, by = "univ")
    pop       univ  founder
1  8600   拓殖大学   桂太郎
2 47000 早稲田大学 大隈重信

4.4 The Mechanism of the full_join() Function

  • Data frame x contains: 拓殖大学、早稲田大学、and UCLA.
  • Data frame y contains: 拓殖大学、早稲田大学、and 東北大学.
  • full_join(x, y) preserves everything in both x and y.
    → Cells that are missing will be displayed as missing values (NA).


x <- data.frame(pop  = c(8600, 47000, 45000),
                univ  = c("拓殖大学", "早稲田大学", "UCLA"))

y <- data.frame(univ  = c("拓殖大学", "早稲田大学", "東北大学"),
                  founder  = c("桂太郎", "大隈重信", "日本国"))
full_join(x, y, by = "univ")
    pop       univ  founder
1  8600   拓殖大学   桂太郎
2 47000 早稲田大学 大隈重信
3 45000       UCLA     <NA>
4    NA   東北大学   日本国

5. “Tidy Data” Structure

  • “Tidy data” is a data structure suitable for data analysis, proposed by Hadley Wickham.
  • Tidy data is data that is easy to read for computers.
  • Many analyses in R are based on tidy data.
    (However, it’s not necessarily easy to read for humans.)
  • The tidyr package transforms untidy data (i.e., cluttered data) into tidy data.

4 Principles on tidy data

1. one column = one variable
2. one row = one observation
3. one cell = one value
4. one table = one unit of analysis

11.1 one column = one variable principle

  • It may seem like an obvious principle, but most data does not meet this criterion.
  • The table below shows data collected from 10 people who were asked to rate Mos Burger and McDonald’s.
  • There are three variables that make up this data: participant number, mos, mc.
cluttered data
cluttered data
  • The problem with this data is that the variable burger score is split into two columns (mos and mc, two separate variables).
  • In one column, there are two variables representing burger scores: mos, mc.
    → There is a need to create separate variables for burger score (score) and type of burger (burger).
  • This cluttered data needs to be transformed into tidy data as shown below.
Transformation from cluttered data to tidy data
Transformation from cluttered data to tidy data
  • In one column, only one variable (score) should represent burger scores.
  • In one column, only one variable (burger) should represent type of burger.

5.2 one row = one row principle

  • It might seem obvious that one row should contain only one observation.
  • In the cluttered data on the left side of the above figure, one row contains two observations (mos, mc) regarding burger scores.
  • There is a need to create new variables: score and burger.
    → In one row, there should be only one observation (score) regarding burger score.
    → In one row, there should be only one observation (burger) regarding type of burger.

5.3 one cell = one value principle

  • The principle that each cell should contain only one value (The 3rd principle) is usually maintained.
  • For instance, it’s rare to summarize data such as the case where Participant No. 3 rated both Mos and McDonald’s as 80 points in a single cell, as shown below.

5.4 one table = one unit of analysis principle

  • Many tables contain multiple units of analysis, especially in government statistics.
  • It’s common for a single table to have a mix of units of analysis such as country, prefecture, municipality, and administrative district.

6. Examples of Column Merging

6.1 Calculation of the Average Score for the Check Tests

  • Let’s create the ariable that shows the average score of the check tests
  • Let’s calculate the average score of the check tests conducted multiple times at a tutoring school

What we want to do here ・ 50 students are enrolled
・ 5 check tests were conducted during the spring semester
・ Not all students took every check test
・ How can we fairly calculate the check test scores?

  • Let’s assume we have the following mock data at the end of the semester
  • Check test data for 50 students across 5 tests: checktest.csv, (N=50)

データの準備

Check Test Data: checktest.csv

  • Load the {tidyverse} package, which is necessary for analysis
library(tidyverse)
  • Download checktest.csv

  • Load the final exam data and name it df_checktest

  • Specify na = "." to display missing values as "."

df_check <- read_csv("data/checktest.csv", 
  na = ".")
  • Check the contents of df_checktest
DT::datatable(df_check)
  • You can check the check test results for 50 students
  • Display the descriptive statistics of df_check
  • Specify {r, results = "asis"} in the chunk options
stargazer::stargazer(as.data.frame(df_check),
  type = "html")
Statistic N Mean St. Dev. Min Max
test_1 43 80.256 11.146 60 100
test_2 46 77.804 11.268 55 100
test_3 42 80.357 8.647 60 100
test_4 44 78.182 10.308 50 100
test_5 43 80.721 9.407 55 100

・Each check test has 4 to 8 missing values (NA)

  • Calculate the average score of the 5 check tests taken by 50 students.
df_check1 <- df_check |> 
  mutate(ave = (test_1 + test_2 + test_3 + test_4 + test_5) / 5)
DT::datatable(df_check1)
  • We were able to calculate the average score for students who submitted all 5 check tests.
  • However, the average score was not calculated for students who submitted only 2, 3, or 4 tests.

Solution

Add rowwise() and na.rm = TRUE

→ This allows for calculating the average score while considering missing values.

df_check1 <- df_check1 |> 
  rowwise() |> 
  mutate(ave = mean(c(test_1, test_2, test_3, test_4, test_5), 
    na.rm = TRUE))
DT::datatable(df_check1)
  • Narrow down to only the necessary variables.
df_check2 <- df_check1 |> 
  select(name, ave)
DT::datatable(df_check2)

Create a variable (submission) that indicates the number of submitted check tests.

Step 1

  • Convert the wide-format data df_check into long-format data df_check_long.
df_check_long <- df_check |> 
    pivot_longer(cols      = test_1:test_5,
                 names_to  = "test_number", # Input the test type    
                 values_to = "score")       # Input the test score  
DT::datatable(df_check_long)

Step 2

・Exclude students with missing values in score from df_check_long.

df_check_long <-  df_check_long |> 
  filter(!is.na(score))
DT::datatable(df_check_long)

Step 3

  • For each student (from Student_1 to Student_50), calculate and display the number of check tests they took:
    → Use the group_by() function to count the number of times each student’s name appears in the name column of df_check_long.
    → This represents the number of tests taken.
    → Store the “number of tests taken” in a variable called submission
df_check_nosub <- df_check_long |> 
  group_by(name) |> 
  summarise(submission = n())
DT::datatable(df_check_nosub)

Step 4

  • Merge the “average score” and “number of submissions” for the check tests.
  • df_check2 + df_check_nosub => df_st_list
  • Merge them using the common variable (name) as the key.

df_check2

DT::datatable(df_check2)

df_check_nosub

DT::datatable(df_check_nosub)
df_st_list <- left_join(df_check2, df_check_nosub, by = "name")
DT::datatable(df_st_list)

Check the relationship between the number of check tests taken and the average score.

  • Only Mac users should enter the following command
  theme_set(theme_bw(base_family = "HiraKakuProN-W3"))
df_st_list  |>  
  ggplot(aes(submission, ave)) +
  geom_point() +
  labs(x = "Number of Check Tests Taken", y = "Average Check Test Score",
         title = "Scatter Plot of Number of Check Tests Taken and Average Score") + 
  stat_smooth(method = lm) +  # (method = lm, se = FALSE) → The 95% confidence interval disappears
  geom_text(aes(y = ave + 0.2, 
                label = name), 
            size = 4, 
            vjust = 0)

  • Students who take more check tests tend to have higher average scores.

6.2 Evaluation of Grades for Enrolled Courses

  • Let’s consider the case of grade evaluations listed in the syllabus for courses taken at a university.

Grading Criteria ・Final Exam(60%)
・Assignment (40%)

  • Here, let’s assume that at the end of the semester, we have obtained two sets of fictional data as follows:
  • Final exam data for 100 students: R01_exam_score.csv, (N=100)
  • Data for five submitted homework assignments: R01_hw.csv, (N=500).

データの準備

Final Exam data: R01_exam_score.csv

  • Load {tidyverse} package
library(tidyverse)
  • Download R01_exam_score.csv to your computer.
  • Read this final exam data and name it df_exam.
  • Remember to specify na = "." to indicate missing values as “.”.
df_exam <- read_csv("data/R01_exam_score.csv", na = ".")
  • Display the content of df_exam using DT package
DT::datatable(df_exam)
  • The final exam results for 100 students can be viewed.  
  • Display the descriptive statistics for df_exam by using summary() function.
summary(df_exam)
     name               score       
 Length:100         Min.   : 45.00  
 Class :character   1st Qu.: 68.00  
 Mode  :character   Median : 74.00  
                    Mean   : 75.23  
                    3rd Qu.: 84.00  
                    Max.   :100.00  
                    NA's   :9       
  • 9 missing values (NA) are confirmed in score.
  • Let’s display the distribution of final exam results using a histogram.
df_exam %>%
  filter(!is.na(score)) %>%
  ggplot() +
  geom_histogram(aes(x = score), color = "white", 
                 binwidth = 10, boundary = 0) +
  labs(x = "Exam score", y = "Number of Students") +
  geom_vline(xintercept = mean(df_exam$score, 
                               na.rm = TRUE), # Since score contains missing values, specify na.rm = TRUE
             col = "magenta")  # Draw a vertical line in true magenta color at the average value

  • Although it’s not a perfect normal distribution, it’s nearly symmetrical, and we can see that the average is around 75 points.

Assignment data: R01_hw.csv

  • Download R01_df_hw.csv to your computer
  • Read this final exam data and name it df_exam.
  • Remember to specify na = "." to indicate missing values as “.”.
df_hw <- read_csv("data/R01_hw.csv", na = ".")
  • Display the content of df_hw using DT package
DT::datatable(df_hw)
  • 提出 = submitted, 未提出 = not submitted

  • We can see whether 100 students submitted their homework five times.

  • Since it’s the result of 100 students submitting homework five times, N = 500.

The problem here:

  • df_exam is N = 100, df_hw is N = 500.
  • To assign grades, we need the grade data for 100 students.
    → The number of observations needs to be unified to N = 100.
  • Calculate and display how many times each student (Student_1 to Student_100) submitted homework.
    → Use the group_by() function to group name and home work in df_hw and calculate the total number of submissions.
    → Store the total number of submissions in a variable called submission.
df_hw <- df_hw %>%
  group_by(name, hw) %>%
  summarise(submission = n())
DT::datatable(df_hw)
  • submission displays not only the number of times homework was submitted but also the number of times it was not submitted.
  • Since we don’t need the count of not submitted, use the filter() function to hide it.
  • The variable hw is also unnecessary, so use the select() function to hide it.
df_hw <- df_hw %>%
  filter(hw == "提出") %>%
  select(name, submission)
DT::datatable(df_hw)
  • There should be 100 students enrolled, but data is available for only 92 students.   → This indicates that 7 students did not submit homework even once.  
  • Let’s display the distribution of the 92 students who submitted homework using a histogram.
df_hw %>%
  filter(!is.na(submission)) %>% # 欠損値処理
  ggplot() +
  geom_histogram(aes(x = submission), color = "white", 
                 binwidth = 1, boundary = 0) +
  labs(x = "Number of Submissions", y = "Number of Students") +
  geom_vline(xintercept = mean(df_hw$submission,
    na.rm = T), 
    col = "yellow",
    linetype = "dotted") # Draw a vertical dotted line in yellow color at the average value 

  • Most students submitted their homework 3 or 4 times, and we can see that the average is about 3.6 submissions.
  • Let’s use the summary() function to display accurate descriptive statistics.
summary(df_hw$submission)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   3.000   4.000   3.652   4.000   5.000 
  • Here, we have obtained two data frames (df_exam and df_hw) that are necessary for assigning grades.
  • The question is which of the following functions to use to join these two data frames:
  1. left_join()
  2. right_join()
  3. inner_join()
  4. full_join()
  • The choice of which function to use depends on the criteria for determining grades.

7. Merging data frames

7.1 Joining data frames: left_join()

  • Suppose your grading criteria are as follows:

Grading Criteria_1 ・Final Exam (60%)
・Homework (40%)
Only those who took the final exam are eligible for grading

  • This means that “Only those who took the final exam are eligible for grading, regardless of whether they submitted homework.”
    → We preserve data frame x (in this case, df_exam) and join it with data frame y (in this case, df_hw).
    → Use the left_join() function.

  • In this case, the resulting joined data frame will have the following structure.

  • left_join() is a joining method that preserves x.
  • Only Student_1, Student_2, and Student_3 from x are preserved.
  • nStudent_4, which is present in y but not in x, is excluded.
    → In this case, name is the key variable.
df_left <- left_join(df_exam, df_hw, by = "name")
DT::datatable(df_left)
  • left_join() includes all 100 students from x, giving priority to x.
    → We get the data where all of them took the final exam.
  • Creating the Required Variables for Grading.
  • We have created the two variables (score and submission) required for grading.
  • However, while score is on a scale of 100 points, submission is on a scale of 5 points.
    → We need to convert submission to a scale of 100 points (sub).
df_left <- df_left %>%
  mutate(sub = submission * 100/5) %>%
  select(name, score, sub) # submission  を非表示にする 
DT::datatable(df_left)

Calculating Grades

  • The algorithm for determining grades is as follows:

Grading Criteria_1 ・Final Exam (60%)
・Homework (40%)
Only those who took the final exam are eligible for grading

  • Calculate the total score according to these criteria.
df_left <- df_left %>%
  mutate(grade = score * 0.6 + sub * 0.4)
DT::datatable(df_left)
  • Since grade is displayed with decimal places, use the round() function to round it to the nearest whole number and remove decimal places.
  • Narrow down the displayed variables to name and grade.
df_left <- df_left %>%
  mutate(grade = score * 0.6 + sub * 0.4) %>%
  mutate(grade = round(grade, digits = 0)) %>%
  select(name, grade)
DT::datatable(df_left)
  • With this, the final grades for students have been calculated.
  • Let’s visualize the distribution of students’ final grades with a histogram.
df_left %>%
  filter(!is.na(grade)) %>%
  ggplot() +
  geom_histogram(aes(x = grade), color = "white", 
                 binwidth = 10, boundary = 0) +
  labs(x = "Final Grade", y = "Number of Students") +
  geom_vline(xintercept = mean(df_left$grade,
    na.rm = T), 
    col = "yellow",
    linetype = "dotted") # Draw a vertical dotted line in yellow color at the average value  

  • The descriptive statistics for the final grades are as follows:
summary(df_left$grade)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  53.00   67.75   75.00   74.62   81.00  100.00      16 
  • The grades for the 84 students who took the final exam have been completed.

7.2 Joining data frames: full_join()

  • Suppose your grading criteria are as follows:

Grading Criteria_2 ・Final Exam (60%)
・Homework (40%)
Regardless of whether homework has been submitted or not, if a perfect score is obtained in the final exam, a passing grade (= 60 points) will be awarded

  • This means that “All students will be graded regardless of homework submission, and a passing grade of 60 points will be awarded if a perfect score is obtained in the final exam.
  • In this case, all rows present in both data frames are included in the join, and the full_join() function is used.
  • The resulting data frame will have the following structure:

  • full_join() includes all rows that exist in both x and y data frames.
  • If Student_3 is present in x but not in y, the score value for Student_3 will be retained.
  • If Student_4 is present in y but not in x, the submission value for Student_4 will be retained.
  • In the case of df_exam and df_hw, they can be joined without any issues, and all students from Student_1 to Student_100 can be joined together.
  • In this scenario, name serves as the key variable.
df_full <- full_join(df_exam, df_hw, by = "name")
DT::datatable(df_full)

Creating the Necessary Variables for Grading

  • Two variables required for determining grades (score and submission) have been created.
  • However, while score is on a scale of 100 points, submission is on a 5-point scale.
    → It is necessary to convert submission to a 100-point scale value (sub).
df_full <- df_full %>%
  mutate(sub = submission * 100/5) %>%
  select(name, score, sub) # submission  を非表示にする 
DT::datatable(df_full)
summary(df_full)
     name               score             sub        
 Length:100         Min.   : 45.00   Min.   : 20.00  
 Class :character   1st Qu.: 68.00   1st Qu.: 60.00  
 Mode  :character   Median : 74.00   Median : 80.00  
                    Mean   : 75.23   Mean   : 73.04  
                    3rd Qu.: 84.00   3rd Qu.: 80.00  
                    Max.   :100.00   Max.   :100.00  
                    NA's   :9        NA's   :8       

Missing Values:

  • 9 students did not take the final exam (NA's = 9).
  • 8 students did not submit homework even once (NA's = 8).

Calculating Grades:

  • The algorithm for determining grades is as follows:

Grading Criteria_2 ・Final Exam (60%)
・Homework (40%)
Regardless of whether homework has been submitted or not, if a perfect score is obtained in the final exam, a passing grade (= 60 points) will be awarded

  • Calculate the total score according to these criteria.
df_full <- df_full %>%
  mutate(grade = score * 0.6 + sub * 0.4)
DT::datatable(df_full)
  • Since grade is displayed with decimal places, we can use the round() function to round it to the nearest whole number and remove the decimal places.
  • Let’s also narrow down the displayed variables to name and grade.
df_full <- df_full %>%
  mutate(grade = score * 0.6 + sub * 0.4) %>%
  mutate(grade = round(grade, digits = 0)) %>%
  select(name, score, sub,grade)
DT::datatable(df_full)
  • With this, the final grades for students have been calculated.
  • Let’s visualize the distribution of students’ final grades with a histogram.
df_full %>%
  filter(!is.na(grade)) %>%
  ggplot() +
  geom_histogram(aes(x = grade), color = "white", 
                 binwidth = 10, boundary = 0) +
  labs(x = "Final Grade", y = "Number of Students") +
  geom_vline(xintercept = mean(df_full$grade,
    na.rm = T), 
    col = "yellow",
    linetype = "dotted") # Draw a vertical dotted line in yellow color at the average value    

  • The descriptive statistics for the final grades are as follows:
summary(df_full$grade)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  53.00   67.75   75.00   74.62   81.00  100.00      16 
  • There are no students who scored a perfect grade on the final exam without submitting any homework assignments.
  • Student_8 didn’t submit any homework assignments but scored 97 points on the final exam.
  • However, the final grade is calculated as 97 x 0.6 = 58 points, so they cannot be awarded a passing grade.

7.3 Joining data frames: inner_join()

  • Suppose your grading criteria are as follows:

Grading Criteria_3 ・Final exam (60%)
・Homework (40%)
・Individuals who have never submitted homework or have not taken the final exam will not be graded.

  • This means that “In addition to those who submitted homework AND took the final exam, those who have either submitted homework at least once OR have taken the final exam will also be graded.”
  • Rows that exist in both data frames are eligible for joining.
  • The inner_join() function is used in this case.
  • In this scenario, the resulting merged data frame will have the following structure.

  • inner_join() combines rows that exist in both x and y data frames.
  • If Student_3 exists in x but not in y, Student_3 will not be included.
  • If Student_4 exists in y but not in x, Student_4 will not be included.
  • Merging df_exam and df_hw should be joined without issues.
  • It’s not guaranteed that all students from Student_1 to Student_100 will be included in the merge.
  • In this case, name is the key variable used for merging.
df_final <- inner_join(df_exam, df_hw, by = "name")
DT::datatable(df_final)
  • inner_join() combines only the 92 students who exist simultaneously in both x and y.

Creating the required variables for grading

  • Two variables necessary for determining grades (score and submission) have been created.
  • However, score is in a 100-point scale, while submission is in a 5-point scale.
  • We need to convert submission into a 100-point scale (sub) to align the scales.
df_final <- df_final %>%
  mutate(sub = submission * 100/5) %>%
  select(name, score, sub) # submission  を非表示にする  
DT::datatable(df_final)
  • Individuals who submitted homework at least once and took the final exam amount to 92 people.

Calculating Grades:

  • The algorithm for determining grades is as follows:

Grading Criteria_3 ・Final exam (60%)
・Homework (40%)
・Individuals who have never submitted homework or have not taken the final exam will not be graded.

  • Calculate the total score according to these criteria.
df_final <- df_final %>%
  mutate(grade = score * 0.6 + sub * 0.4)
DT::datatable(df_final)
  • Since grade is displayed with decimal points, I will use the round() function to round it to the nearest whole number and display only the variables name and grade.
df_final <- df_final %>%
  mutate(grade = score * 0.6 + sub * 0.4) %>%
  mutate(grade = round(grade, digits = 0)) %>%
  select(name, grade)
DT::datatable(df_final)
  • With this, the final grades for students have been calculated.
  • Let’s visualize the distribution of students’ final grades with a histogram.
df_final %>%
  filter(!is.na(grade)) %>%
  ggplot() +
  geom_histogram(aes(x = grade), color = "white", 
                 binwidth = 10, boundary = 0) +
  labs(x = "Final Grade", y = "Number of Students") +
  geom_vline(xintercept = mean(df_final$grade,
    na.rm = T), 
    col = "yellow",
    linetype = "dotted")  # Draw a vertical dotted line in yellow color at the average value      

  • The descriptive statistics for the final grades are as follows:
summary(df_final$grade)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  53.00   67.75   75.00   74.62   81.00  100.00       8 

8. Wide to Long Transformation  

  • There are three types of data transformations using the tidyr package.
pivot_longer() wide (cluttered data) → long (tidy data)
pivot_wider() long (tidy data) → wide (cluttered data)
separate() separating cell (year/momth/day → year, month, day)

8.1 pivot_longer()

Data Preparation (mos_mc_paired.csv) ・Download mos_mc_paired.csv to your computer.

・Manually place the downloaded mos_mc.csv into the data folder inside the RProject folder.
・Read the data.

df_mosmc <- read_csv("data/mos_mc_paired.csv",
               na = ".")  
  • Check the data
DT::datatable(df_mosmc)
  • This data is in wide format, making it difficult to analyze in R.
    → Convert to long format using the pivot_longer() function.

pivot_longer() function new_data_frame <- original_data_frame %>%
pivot_longer(cols = row containing variables, names_to = "Variable name for the original column names",
values_to = "Variable name for the values of the variables")

df_long <- df_mosmc %>% 
    pivot_longer(cols      = mos:mc,
                 names_to  = "burger",   
                 values_to = "score")   
  • Check df_long
DT::datatable(df_long)

8.2 pivot_wider()

  • Use pivot_wider() function to reshape long type data into wide type data.
  • Messy data often takes the form of wide type data.
  • Messy data can sometimes be more readable for humans.
  • When creating diagrams or tables, it is always necessary to consider from the perspective of the reader.
  • Use the pivot_wider()function to convert to wider format.

pivot_wider() function new_data_frame <- original_data_frame %>%
pivot_wider(cols = row containing variables, names_to = "Variable name for the original column names",
values_to = "Variable name for the values of the variables")

df_wide <- df_long %>%
  pivot_wider(names_from  = "burger", 
              values_from = "score")   
  • Check df_wide
DT::datatable(df_wide)

8.3 separate()

Data Preparation: COVID19_Worldwide.csv

Analysis of COVID-19 New Infection Data

・Here, we use data COVID19_Worldwide.csv independently collected from the internet, etc., by Professor Jaehyun SONG (宋財泫) of Kansai University.

Variable Name Details
ID
Country Country Name
Date Date (Year-Month-Day)
Confirmed_Day Number of New COVID-19 Cases per Day
Confirmed_Total Total Cumulative Number of COVID-19 Cases
Death_Day Number of New COVID-19 Deaths per Day
Death_Total Total Cumulative Number of COVID-19 Deaths
Test_Day Number of New COVID-19 Tests per Day
Test_Total Total Cumulative Number of COVID-19 Tests
covid_df <- read_csv("data/COVID19_Worldwide.csv", 
                       guess_max = 10000) 
                      # Set to determine the data type after reading the first 10,000 rows
  • 変数を確かめる
names(covid_df)
[1] "ID"              "Country"         "Date"            "Confirmed_Day"  
[5] "Confirmed_Total" "Death_Day"       "Death_Total"     "Test_Day"       
[9] "Test_Total"     
  • Narrow down to only the variables used for analysis.
df1 <- covid_df %>%
  select(Country, Date, Confirmed_Total, Death_Total)
Variable Name Details
ID
Country Country Name
Date Date (Year-Month-Day)
Confirmed_Total Total Cumulative Number of COVID-19 Cases
Test_Day Number of New COVID-19 Tests per Day
  • Check df1
DT::datatable(df1)

df1 meets the four principles of tidy data.

1. one column = one variable
2. one row = one observation
3. one cell = one value
4. one table = one unit of analysis

→ This data is tidy data (=long type).
→ There is no need to transform the data. - Display the descriptive statistics of the variables in covid_df.

library(stargazer)
  • Remember to specify {r, results = "asis"} in the chunk options
stargazer(as.data.frame(df1), 
          type ="html",
          digits = 2)
Statistic N Mean St. Dev. Min Max
Confirmed_Total 31,806 18,250.14 115,471.60 0 3,184,582
Death_Total 31,806 1,039.01 6,565.51 0 134,094
  • Here, we will explain how to use the separate() function.
  • Suppose you want to summarize the number of new infections by year, having the date column as ‘YYYY/MM/DD’ is inconvenient.
    → It is necessary to split it into year, month, and day.
  • Let’s try splitting the separate() column of covid_df into Year, Month, and Day.

How to use separate() function data frame %>%
  separate(col = "Variable to split",
  into = c("Name of variable 1 after splitting",
  "Name of variable 2 after splitting",
  "Name of variable 3 after splitting", ...),
  sep = "Criterion for splitting")   

  • In the case of 2020/1/22, it is split into “2020”, “1”, and “22” based on the criterion of “/”.
df1 <- df1 %>%
  separate(col  = "Date",
           into = c("Year", "Month", "Day"),
           sep  = "/")

Let’s calculate the Total Number of Infections and Total Number of Deaths by country for the year 2020 (from January 22nd to July 10th).

death_country <- df1 %>%
  group_by(Country, Year) %>%
  summarise(Death = sum(Death_Total),
            Infected = sum(Confirmed_Total))
DT::datatable(death_country)
  • The descriptive statistics for the death_country are as follows:

  • Remember to specify {r, results = "asis"} in the chunk options

stargazer(as.data.frame(death_country), 
          type ="html",
          digits = 2)
Statistic N Mean St. Dev. Min Max
Death 186 177,670.60 782,130.30 0 8,616,010
Infected 186 3,120,774.00 13,105,502.00 731 160,231,690
  • Let’s draw a scatter plot of the cumulative number of Covid-19 cases and the cumulative number of deaths.
plot_1 <- death_country %>% 
  ggplot(aes(Infected, Death)) +
  geom_point() +
  stat_smooth(method = lm) +
  ggrepel::geom_text_repel(aes(label = Country),
            size = 3, 
            family = "HiraKakuPro-W3") +
  labs(x = "Total Cumulative Number of COVID-19 Cases", y = "Total Cumulative Number of COVID-19 Deaths")+
  theme_bw(base_family = "HiraKakuProN-W3")

plot_1

  • Try excluding the United States, which is an outlier.
plot_2 <- death_country %>% 
  filter(Country != "United States") %>%
  ggplot(aes(Infected, Death)) +
  geom_point() +
  stat_smooth(method = lm) +
  ggrepel::geom_text_repel(aes(label = Country),
            size = 3, 
            family = "HiraKakuPro-W3") +
  labs(x = "Total Cumulative Number of COVID-19 Cases", y = "Total Cumulative Number of COVID-19 Deaths")+
  theme_bw(base_family = "HiraKakuProN-W3")
plot_2

9. Exercise

Q9.1:

  • Refer to 1. Sorting Rows: arrange() and answer the following questions:
  • Use the House of Representatives election dataset (hr96-21.csv) for the analysis.
  • Limit the displayed variables to the following six:
  1. year
  2. ku
  3. kun
  4. seito
  5. name
  6. vote
  7. voteshare
  • Q1: Among the candidates of the 2021 general election, list the top 10 candidates in descending order of the number of votes (vote) obtained.

  • Q2: Among the candidates of the 2021 general election, list the top 10 candidates in descending order of vote share (voteshare).

Q9.2:

  • Refer to 7.3 separate() and answer the following questions:

  • Use COVID19_Worldwide.csv data in the analysis.

  • Q1: Draw a scatter plot for the year 2020 (January 22nd to July 10th) with Test_Total on the x-axis and Confirmed_Total on the y-axis.

  • Q2: Draw a scatter plot for the year 2020 (January 22nd to July 10th) with Test_Total on the x-axis and Confirmed_Total on the y-axis. If there are outliers, show a scatter plot excluding them.

Reference
  • Tidy Animated Verbs
  • 宋財泫 (Jaehyun Song)・矢内勇生 (Yuki Yanai)「私たちのR: ベストプラクティスの探究」
  • 宋財泫「ミクロ政治データ分析実習(2022年度)」
  • 土井翔平(北海道大学公共政策大学院)「Rで計量政治学入門」
  • 矢内勇生(高知工科大学)授業一覧
  • 浅野正彦, 矢内勇生.『Rによる計量政治学』オーム社、2018年
  • 浅野正彦, 中村公亮.『初めてのRStudio』オーム社、2018年
  • Winston Chang, R Graphics Cookbook, O’Reilly Media, 2012.
  • Kieran Healy, DATA VISUALIZATION, Princeton, 2019
  • Kosuke Imai, Quantitative Social Science: An Introduction, Princeton University Press, 2017