• R packages used in this section
library(DT)
library(gapminder)
library(gghighlight)
library(ggrepel)
library(stargazer)
library(tidyverse)

1. Histogram

  • A histogram is an approximate representation of the distribution of numerical data.

Difference between a bar chart and a histogram

| Variable type | How to visualize | Feature |
|---|---|---|
| Discrete variable | Bar chart | Lines between bars |
| Continuous variable | Histogram | No lines between bars |
When the x-axis is the election year
  • There are no values between 2017 and 2021 (because no election was held between them)
  • Election years look numeric, but they are treated as discrete categories, not as numbers
When the x-axis is vote share
  • Vote share can take any value from 0% to 100% (0%, 0.1%, ...)
    → One bar per value would require an infinite number of bars
    → So we do not draw a bar for each value
  • Instead, we group the values into a limited number of bars (bins)
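The contrast above can be sketched with a small made-up data set. The tibble toy and its values are hypothetical and are not taken from the election data; the point is only that a discrete variable gets one bar per value, while a continuous variable is grouped into bins.

```r
library(tidyverse)

# Hypothetical toy data: four made-up candidates per election year
toy <- tibble(
  year      = rep(c(2014, 2017, 2021), each = 4),
  voteshare = c(3.2, 18.5, 41.0, 55.3, 7.9, 22.4, 38.6, 61.1, 1.5, 29.8, 47.2, 52.0)
)

# Discrete variable: treat year as a factor and draw one bar per election
bar_toy <- ggplot(toy) +
  geom_bar(aes(x = factor(year)))

# Continuous variable: group vote share into bins of width 10
hist_toy <- ggplot(toy) +
  geom_histogram(aes(x = voteshare), binwidth = 10, color = "white")

bar_toy
hist_toy
```

Wrapping year in factor() is what tells ggplot2 to treat the election year as a set of categories rather than as a number.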

2. How to draw a histogram using ggplot2

  • You can draw a histogram by using geom_histogram()
  • You need to map a continuous variable to the x-axis
  • You don’t have to map anything to the y-axis (the counts are computed for you)
  • Let’s draw a histogram of vote share in the lower house elections in Japan between 1996 and 2021

2.1 Draw a simple histogram

  • Make a folder named data in your R Project folder
  • Download hr96-21.csv into the data folder in your R Project
  • Read the election data, hr96-21.csv, and name it df (na = "." tells read_csv() to treat "." as a missing value)
df <- read_csv("data/hr96-21.csv",
               na = ".")  
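As a minimal sketch of what na = "." does, the following reads a hypothetical two-row CSV supplied as literal text (the column names and values are made up and are not the real file). It assumes readr 2.0 or later, where literal data can be passed with I().

```r
library(tidyverse)

# Hypothetical two-row CSV given as literal text via I()
toy <- read_csv(I("name,voteshare\nA,45.2\nB,."),
                na = ".",
                show_col_types = FALSE)

# The "." in the second row is read as a missing value,
# so the column can still be parsed as numeric
toy$voteshare
```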
  • Using if_else(), make a dummy variable: ldp
    mutate(ldp = if_else(seito == "自民", "LDP", "Non-LDP")) makes a new variable named ldp
  • If the value of seito is “自民”, ldp takes the value “LDP”; for all other values (that is, the other party names), ldp takes the value “Non-LDP”. The variable seito itself is not changed
df <- df %>% 
  mutate(ldp = if_else(seito == "自民", "LDP", "Non-LDP"))
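To see what this command produces, here is a minimal sketch on a hypothetical three-row tibble (toy is made up for illustration; only the seito values mimic the real data).

```r
library(tidyverse)

# Hypothetical three-row example, not the real election data
toy <- tibble(seito = c("自民", "共産", "民主"))

# Same command as above: the first row becomes "LDP", the others "Non-LDP"
toy <- toy %>%
  mutate(ldp = if_else(seito == "自民", "LDP", "Non-LDP"))

toy
```

The original column seito is kept as it is; ldp is added as a new column next to it.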
  • Draw a histogram of vote share
df %>%
  ggplot() +
  geom_histogram(aes(x = voteshare)) 

2.2 What is a good histogram?

  • A good histogram clearly answers the following questions:
  1. How many peaks does the distribution have?
  2. If there are peaks, where are the values most densely (frequently) concentrated?
  • You need to adjust the number of bins to get an appropriate histogram
  • Which of the following three histograms is the best?
  • A histogram usually has no lines between bars, but white lines are added here to make the bars easier to tell apart.

  • The histogram on the left shows two peaks, but its bins are too narrow
  • The histogram on the right has bins that are too wide, so we fail to see the two peaks
  • The histogram in the middle has an appropriate bin width and shows the two peaks
  • If your bins are too narrow (as in the one on the left), the histogram shows too much detail
  • If your bins are too wide (as in the one on the right), you might lose important information that the data contain
  • You need to choose an appropriate bin size
How to customize the number of bins:
  • You can customize the number of bins using bins = XX
How to customize the width of bins:
  • You can customize the width of bins using binwidth = XX
How to customize colors
  • By adding lines between bins, you can make your histogram much easier to read.
  • You can set the line color (such as color = "white") outside aes() and inside geom_histogram()

bins and binwidth
・You can customize either binwidth (the width of bins) or bins (the number of bins)
・Set only one of them: if you supply both, binwidth overrides bins
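The two options can be compared with a minimal sketch on hypothetical data (toy is made up). It uses layer_data() from ggplot2, which returns the data computed for a layer, with one row per bin.

```r
library(tidyverse)

# Hypothetical data: 101 evenly spaced vote shares from 0 to 100
toy <- tibble(voteshare = seq(0, 100, by = 1))

# Fix the number of bins
p_bins <- ggplot(toy) +
  geom_histogram(aes(x = voteshare), bins = 10)

# Fix the width of bins
p_width <- ggplot(toy) +
  geom_histogram(aes(x = voteshare), binwidth = 5)

# One row per bin; xmin and xmax are the edges of each bin
d_bins  <- layer_data(p_bins)
d_width <- layer_data(p_width)

nrow(d_bins)                                   # how many bins bins = 10 produced
unique(round(d_width$xmax - d_width$xmin, 6))  # width of the bins with binwidth = 5
```

With bins the bin width follows from the range of the data, while with binwidth the number of bins follows from the range of the data.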

2.3 Customize the scale on x-axis and y-axis

scale_x_continuous() customizes the x-axis
scale_y_continuous() customizes the y-axis
  • The following is the histogram generated in the previous section
hist_plot1 <- df %>%
  ggplot() +
  geom_histogram(aes(x = voteshare),
                 color = "white",
                 binwidth = 5) 
hist_plot1

  • Let’s change how the x-axis of this histogram is displayed
  • Using scale_x_*(), you can customize the scale on the x-axis
  • The tick marks run from 0 to 100
  • The interval between tick marks is 10
hist_plot2 <- hist_plot1 +
  scale_x_continuous(breaks = seq(0, 100, by = 10),
                     labels = seq(0, 100, by = 10))
hist_plot2
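The y-axis can be customized in the same way with scale_y_continuous(). The following is a minimal sketch on hypothetical data (toy and the tick interval of 2 are made up for illustration, not taken from the election data).

```r
library(tidyverse)

# Hypothetical data: 200 made-up vote shares spread evenly between 0 and 100
toy <- tibble(voteshare = seq(0.25, 99.75, by = 0.5))

hist_toy <- ggplot(toy) +
  geom_histogram(aes(x = voteshare),
                 color = "white",
                 binwidth = 5) +
  scale_x_continuous(breaks = seq(0, 100, by = 10)) +  # x-axis ticks every 10
  scale_y_continuous(breaks = seq(0, 10, by = 2))      # y-axis ticks every 2
hist_toy
```

When labels is not given, the tick labels are taken from breaks, so breaks alone is enough when the labels should show the same numbers.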

2.4 Add another dimension

  • When you want to add another dimension, you have two ways of getting it done.

(1) facet_wrap(~)

  • Let’s add another dimension, ldp, to the histogram above by using facet_wrap(~)
hist_plot2 + 
  facet_wrap(~ldp) +
  theme_bw(base_family = "HiraKakuProN-W3")

(2) fill = ….

  • Let’s add another dimension, ldp, to the histogram above by using fill = ...
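  • You need to add position = "identity" so that the bars of the two groups overlap instead of being stacked (the default is position = "stack")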
df %>%
  mutate(ldp = if_else(seito == "自民", "LDP", "Non-LDP")) %>%
   ggplot() +
   geom_histogram(aes(x = voteshare, 
                      fill = ldp), 
                  position = "identity",
                  binwidth = 10, 
                  color = "white") +
   labs(x = "Vote Share", 
        y = "The Number of Candidates",
        fill = "") +
  theme_bw(base_family = "HiraKakuProN-W3")

  • This histogram has a serious problem!
  • You fail to see the hidden number of non-LDP candidates behind the green bars.
Solution:
  • Using alpha = ..., you can make the bars transparent
    alpha = 0: fully transparent
    alpha = 1: fully opaque (not transparent)
  • Let’s set alpha = 0.5 here
df %>%
  mutate(ldp = if_else(seito == "自民", "LDP", "Non-LDP")) %>%
   ggplot() +
   geom_histogram(aes(x = voteshare, 
                      fill = ldp), 
                  position = "identity",
                  binwidth = 10, 
                  color = "white",
                  alpha = 0.5) +
   labs(x = "Vote Share", 
        y = "The Number of Candidates",
        fill = "") +
  theme_bw(base_family = "HiraKakuProN-W3")

You had better not do this!!
  • Let’s generate a histogram of vote share for each party in the lower house elections between 1996 and 2021
df %>%
   ggplot() +
   geom_histogram(aes(x = voteshare, 
                      fill = seito), 
                  position = "identity",
                  binwidth = 10, 
                  color = "white",
                  alpha = 0.5) +
   labs(x = "Vote Share", 
        y = "The Number of Candidates",
        fill = "") +
  theme_bw(base_family = "HiraKakuProN-W3")

  • The plot contains too much information to read.
Solution:
  • Reduce the number of parties to, say, three.
df %>%
  filter(seito %in% c("自民", "共産", "民主")) %>%
  mutate(party = case_when(seito == "自民" ~ "LDP",    # Liberal Democratic Party
                           seito == "共産" ~ "JCP",    # Japanese Communist Party
                           seito == "民主" ~ "DPJ")) %>%  # Democratic Party of Japan
   ggplot() +
   geom_histogram(aes(x = voteshare, 
                      fill = party), 
                  position = "identity",
                  binwidth = 10, 
                  color = "white",
                  alpha = 0.5) +
   labs(x = "Vote Share", 
        y = "The Number of Candidates",
        fill = "") +
  theme_bw(base_family = "HiraKakuProN-W3")

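  • If you delete position = "identity", the default position = "stack" applies: the bars of the three parties are stacked, so each bar adds up their counts
df %>%
  filter(seito %in% c("自民", "共産", "民主")) %>%
  mutate(party = case_when(seito == "自民" ~ "LDP",    # Liberal Democratic Party
                           seito == "共産" ~ "JCP",    # Japanese Communist Party
                           seito == "民主" ~ "DPJ")) %>%  # Democratic Party of Japan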
   ggplot() +
   geom_histogram(aes(x = voteshare, 
                      fill = party), 
                  binwidth = 10, 
                  color = "white",
                  alpha = 0.5) +
   labs(x = "Vote Share", 
        y = "The Number of Candidates",
        fill = "") +
  theme_bw(base_family = "HiraKakuProN-W3")
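The difference between the two positions can be checked numerically. This is a minimal sketch on hypothetical data (the parties "A" and "B" and their vote shares are made up); it inspects the bar heights with layer_data() instead of looking at the plot.

```r
library(tidyverse)

# Hypothetical data: two parties with made-up vote shares
toy <- tibble(
  party     = rep(c("A", "B"), times = c(6, 4)),
  voteshare = c(12, 14, 16, 22, 24, 26, 13, 15, 23, 27)
)

base <- ggplot(toy, aes(x = voteshare, fill = party))

# position = "identity": every bar starts at 0, so the bars overlap
d_identity <- layer_data(
  base + geom_histogram(binwidth = 10, position = "identity")
)

# Default position = "stack": the bars of the two parties are piled up
d_stack <- layer_data(
  base + geom_histogram(binwidth = 10)
)

# ymin and ymax are the bottom and top of each bar
d_identity %>% select(x, count, ymin, ymax)
d_stack %>% select(x, count, ymin, ymax)
```

With position = "identity" every ymin is 0, whereas with the default stacking the top of the highest bar in each bin equals the total count of both parties in that bin.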

  • If you want to overlap the three distributions in one panel, you should use geom_density() instead of geom_histogram()
df %>%
  filter(seito %in% c("自民", "共産", "民主")) %>%
  mutate(party = case_when(seito == "自民" ~ "LDP",    # Liberal Democratic Party
                           seito == "共産" ~ "JCP",    # Japanese Communist Party
                           seito == "民主" ~ "DPJ")) %>%  # Democratic Party of Japan
   ggplot() +
   geom_density(aes(x = voteshare,
                    fill = party),
                color = "white",
                alpha = 0.5,
                adjust = 0.8) +
   labs(x = "Vote Share", 
        y = "Density",
        fill = "") +
  theme_bw(base_family = "HiraKakuProN-W3")
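The argument adjust = 0.8 rescales the bandwidth of the kernel density estimate: values below 1 give a wigglier curve, values above 1 a smoother one. As a minimal sketch, the effect can be seen with base R's density(), whose adjust argument has the same meaning; the simulated vector x is hypothetical, and the assumption that geom_density() relies on the same kind of bandwidth rule is stated here rather than shown.

```r
# Hypothetical data: 200 simulated vote shares with two peaks
set.seed(1)
x <- c(rnorm(100, mean = 20, sd = 5), rnorm(100, mean = 50, sd = 8))

d_default <- density(x)                # adjust = 1 (default bandwidth)
d_narrow  <- density(x, adjust = 0.8)  # bandwidth multiplied by 0.8

# Ratio of the bandwidths actually used
d_narrow$bw / d_default$bw
```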

3. Exercise

  • Q3.1:
    Referring to 2.4 Add another dimension, draw a histogram of vote share (voteshare) for the LDP (Liberal Democratic Party) and CDP (Constitutional Democratic Party) candidates in the 2021 lower house election.
    ・Party names are stored in the variable seito in hr96-21.csv
    ・You need to use geom_histogram() to draw a histogram.
    ・When drawing the histogram, pay attention to values hidden behind other bars.

  • Q3.2:
    Referring to 2.4 Add another dimension, draw a histogram of vote share (voteshare) for the LDP (Liberal Democratic Party), CDP (Constitutional Democratic Party), and JCP (Japanese Communist Party) candidates in the 2021 lower house election.
    ・Party names are stored in the variable seito in hr96-21.csv
    ・You need to use geom_histogram() to draw a histogram.
    ・When drawing the histogram, pay attention to values hidden behind other bars.

To answer Q3.1 and Q3.2, use hr96-21.csv
hr96-21.csv is a collection of Japanese lower house election data covering 9 national elections (1996, 2000, 2003, 2005, 2009, 2012, 2014, 2017, 2021)
・To draw the histograms, you need the following three variables, which are included in hr96-21.csv:

| variable | detail |
|---|---|
| year | Election year (1996-2021) |
| voteshare | Vote share (%) |
| seito | Candidate’s affiliated party (in Japanese) |

hr96-21.csv contains the following 23 variables:

| variable | detail |
|---|---|
| year | Election year (1996-2021) |
| pref | Prefecture |
| ku | Electoral district name |
| kun | Number of electoral district |
| rank | Ascending order of votes |
| wl | 0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner |
| nocand | Number of candidates in each district |
| seito | Candidate’s affiliated party (in Japanese) |
| j_name | Candidate’s name (Japanese) |
| name | Candidate’s name (English) |
| previous | Previous wins |
| gender | Candidate’s gender: “male”, “female” |
| age | Candidate’s age |
| exp | Election expenditure (yen) spent by each candidate |
| status | 0 = challenger / 1 = incumbent / 2 = former incumbent |
| vote | Votes each candidate garnered |
| voteshare | Vote share (%) |
| eligible | Eligible voters in each district |
| turnout | Turnout in each district (%) |
| castvote | Total votes cast in each district |
| seshu_dummy | 0 = Not-hereditary candidate, 1 = hereditary candidate |
| jiban_seshu | Relationship between candidate and his predecessor |
| nojiban_seshu | Relationship between candidate and his predecessor |