R packages used in this section

library(DT)
library(gapminder)
library(gghighlight)
library(ggrepel)
library(stargazer)
library(tidyverse)

1. Histogram

A histogram is an approximate representation of the distribution of numerical data.

Difference between a bar chart and a histogram

Variable type	How to visualize	Feature
Discrete variable	Bar chart	Lines between bars
Continuous variable	Histogram	No lines between bars

When the x-axis is the year of elections

There are no values between 2017 and 2021 (because no election was held between them)
Election years look numeric, but they are not treated as numeric: they are a discrete variable, so a bar chart is the appropriate plot
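The contrast between the two plot types can be sketched as follows. This is only an illustration on simulated data: the data frame toy and its values are assumptions made up for this example and are not taken from hr96-21.csv.

```r
library(ggplot2)

# Hypothetical toy data (NOT hr96-21.csv): one row per candidate
set.seed(1)
toy <- data.frame(
  year      = sample(c(2012, 2014, 2017, 2021), size = 200, replace = TRUE),
  voteshare = runif(200, min = 0, max = 100)
)

# Discrete variable -> bar chart
# factor() tells ggplot2 to treat each election year as a category
bar_plot <- ggplot(toy) +
  geom_bar(aes(x = factor(year))) +
  labs(x = "Election year", y = "Number of candidates")

# Continuous variable -> histogram
hist_plot <- ggplot(toy) +
  geom_histogram(aes(x = voteshare), binwidth = 10, boundary = 0) +
  labs(x = "Vote share (%)", y = "Number of candidates")

bar_plot
hist_plot
```

The bar chart draws one bar per election year, while the histogram cuts the continuous vote share into adjacent bins.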

2. How to draw a histogram using ggplot2

You can draw a histogram by using geom_histogram()
You need to map a continuous variable to the x-axis
You don’t have to map anything to the y-axis, because geom_histogram() counts the observations in each bin for you
Let’s draw a histogram of vote share in the lower house elections in Japan between 1996 and 2021

2.1 Draw a simple histogram

Make a folder named data in your R Project folder
Download hr96-21.csv into the data folder in your R Project
Read the election data, hr96-21.csv, and name it df

df <- read_csv("data/hr96-21.csv",
               na = ".")

Using if_else(), make a dummy variable: ldp
mutate(ldp = if_else(seito == "自民", "LDP", "Non-LDP")) makes a new dummy variable named ldp
If the value of seito is “自民” (the LDP), ldp is set to “LDP”; for every other party name in seito, ldp is set to “Non-LDP”. The values in seito itself are not changed

df <- df %>% 
  mutate(ldp = if_else(seito == "自民", "LDP", "Non-LDP"))
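To see what the recoding does, it can help to try it on a few rows first. The sketch below uses a hypothetical four-row tibble (toy) that only borrows the column name seito; it is an assumption for illustration, not the real election data.

```r
library(dplyr)

# Hypothetical mini data set with the same column name as hr96-21.csv
toy <- tibble(seito = c("自民", "共産", "民主", "自民"))

# Same recoding as above: "自民" becomes "LDP", everything else "Non-LDP"
toy <- toy %>%
  mutate(ldp = if_else(seito == "自民", "LDP", "Non-LDP"))

# Count the rows in each category to confirm the recoding
toy %>%
  count(ldp)
```

Running count(ldp) on the real df in the same way is a quick check that the dummy variable was created as intended.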

Draw a histogram of vote share

df %>%
  ggplot() +
  geom_histogram(aes(x = voteshare))

2.2 What is a good histogram?

A good histogram clearly answers the following questions:

How many peaks does the distribution have?
If there are peaks, where are the values most densely (frequently) concentrated?

You need to adjust the number of bins to show an appropriate histogram
Which one is the best histogram among the following three histograms?
Usually a histogram does not have lines between bars, but white lines are added here to make the bars easier to tell apart.

The histogram on the left shows two peaks, but its bins are too narrow
The histogram on the right has bins that are too wide, so we fail to see the two peaks
The histogram in the middle has an appropriate bin width and shows the two peaks
If the bins are too narrow (as on the left), the histogram shows too much detail
If the bins are too wide (as on the right), you might lose important information that the data contain
You need to choose an appropriate bin size

How to customize the number of bins:

You can customize the number of bins using bins = XX
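A minimal sketch of bins = XX, assuming simulated values in place of the real voteshare column (the data frame toy is made up for this example):

```r
library(ggplot2)

# Hypothetical stand-in for the voteshare column (values between 0 and 100)
set.seed(123)
toy <- data.frame(voteshare = runif(500, min = 0, max = 100))

# bins = 20 asks for 20 bins; ggplot2 then works out the bin width itself
p_bins <- ggplot(toy) +
  geom_histogram(aes(x = voteshare), bins = 20)

p_bins
```

If you do not set bins, geom_histogram() falls back to its default of 30 bins and prints a message suggesting that you pick a better value.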

How to customize the width of bins:

You can customize the width of bins using binwidth = XX

How to customize colors

By adding lines between bins, you can make your histogram much easier to read.
You can set the line color (such as color = "white") outside aes() and inside geom_histogram()

bins and binwidth
・You can customize either binwidth (the width of bins) or bins (the number of bins)
・Do not set both at the same time; if both are given, binwidth overrides bins
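The two settings discussed here, binwidth and a white outline, can be combined as in the sketch below. The data frame toy is simulated for illustration, and boundary = 0 is an extra argument (not used elsewhere in these notes) that makes the bin edges fall on 0, 5, 10, and so on.

```r
library(ggplot2)

# Hypothetical stand-in for the voteshare column (values between 0 and 100)
set.seed(123)
toy <- data.frame(voteshare = runif(500, min = 0, max = 100))

p_width <- ggplot(toy) +
  geom_histogram(aes(x = voteshare),
                 binwidth = 5,       # each bin is 5 percentage points wide
                 boundary = 0,       # bin edges at 0, 5, 10, ...
                 color = "white")    # constant outline color, set outside aes()

p_width
```

Because color = "white" is a fixed value rather than a mapping to a variable, it goes outside aes().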

2.3 Customize the scales of the x-axis and y-axis

scale_x_continuous()	customize the x-axis
scale_y_continuous()	customize the y-axis

The following is the histogram generated in the previous section

hist_plot1 <- df %>%
  ggplot() +
  geom_histogram(aes(x = voteshare),
                 color = "white",
                 binwidth = 5) 
hist_plot1

Let’s change how the x-axis of this histogram is displayed
Using scale_x_continuous(), you can customize the scale of the x-axis
The breaks on the x-axis range from 0 to 100
The interval is 10

hist_plot2 <- hist_plot1 +
  scale_x_continuous(breaks = seq(0, 100, by = 10),
                     labels = seq(0, 100, by = 10))
hist_plot2
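The y-axis can be customized in the same way with scale_y_continuous(). The following is a sketch on simulated data; toy, hist_toy, and the break values 0 to 40 are assumptions chosen for this example, so you would pick breaks that suit the counts in your own histogram.

```r
library(ggplot2)

# Hypothetical stand-in for the voteshare column (values between 0 and 100)
set.seed(42)
toy <- data.frame(voteshare = runif(300, min = 0, max = 100))

hist_toy <- ggplot(toy) +
  geom_histogram(aes(x = voteshare),
                 color = "white",
                 binwidth = 5,
                 boundary = 0) +
  # Tick marks every 10 on the x-axis
  scale_x_continuous(breaks = seq(0, 100, by = 10)) +
  # Tick marks every 5 on the y-axis (the count of observations per bin)
  scale_y_continuous(breaks = seq(0, 40, by = 5))

hist_toy
```

When labels is not given, ggplot2 uses the break values themselves as labels, so labels only needs to be set when the displayed text should differ from the breaks.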

2.4 Add another dimension

When you want to add another dimension, there are two ways to do it: (1) facet_wrap(~) and (2) fill = ...
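The first way, facet_wrap(~), draws a separate panel for each value of the added variable. The sketch below is an illustration on simulated data: toy and its values are assumptions, and only the column names seito, voteshare, and ldp follow the ones used in these notes.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data (NOT hr96-21.csv)
set.seed(7)
toy <- tibble(
  seito     = sample(c("自民", "共産", "民主"), size = 300, replace = TRUE),
  voteshare = runif(300, min = 0, max = 100)
) %>%
  mutate(ldp = if_else(seito == "自民", "LDP", "Non-LDP"))

facet_plot <- toy %>%
  ggplot() +
  geom_histogram(aes(x = voteshare),
                 binwidth = 10,
                 boundary = 0,
                 color = "white") +
  # One panel per value of ldp: "LDP" and "Non-LDP"
  facet_wrap(~ ldp) +
  labs(x = "Vote Share",
       y = "The Number of Candidates")

facet_plot
```

Since each group gets its own panel, no bars can hide behind the bars of another group.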

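(2) fill = ...

Let’s add another dimension, ldp, to the histogram above by using fill = ...
You need to add position = "identity" so that the bars of the two groups are overlaid; without it, geom_histogram() stacks them (position = "stack" is the default)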

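df %>%
  # Recreate the dummy variable so that this chunk also works on its own
  mutate(ldp = if_else(seito == "自民", "LDP", "Non-LDP")) %>%
  ggplot() +
  geom_histogram(aes(x = voteshare,
                     fill = ldp),
                 position = "identity",
                 binwidth = 10,
                 color = "white") +
  labs(x = "Vote Share",
       y = "The Number of Candidates",
       fill = "") +
  # HiraKakuProN-W3 is a macOS font; change or drop base_family on other systems
  theme_bw(base_family = "HiraKakuProN-W3")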

This histogram has a serious problem!
The bars of one group are drawn on top of the other, so you cannot see the non-LDP candidates hidden behind the green bars.

Solution:

Using alpha = ..., you can make the bars transparent
・alpha = 0・・・fully transparent
・alpha = 1・・・fully opaque (not transparent)
Let’s set alpha = 0.5 here

df %>%
  mutate(ldp = if_else(seito == "自民", "LDP", "Non-LDP")) %>%
  ggplot() +
  geom_histogram(aes(x = voteshare,
                     fill = ldp),
                 position = "identity",
                 binwidth = 10,
                 color = "white",
                 alpha = 0.5) +
  labs(x = "Vote Share",
       y = "The Number of Candidates",
       fill = "") +
  theme_bw(base_family = "HiraKakuProN-W3")

You had better not do this!!

Let’s generate a histogram of vote share for each party in the lower house elections between 1996 and 2021

df %>%
  ggplot() +
  geom_histogram(aes(x = voteshare,
                     fill = seito),
                 position = "identity",
                 binwidth = 10,
                 color = "white",
                 alpha = 0.5) +
  labs(x = "Vote Share",
       y = "The Number of Candidates",
       fill = "") +
  theme_bw(base_family = "HiraKakuProN-W3")

There is too much information here to make sense of the plot.

Solution:

Reduce the number of parties, say to three.

df %>%
  # Keep only the three parties to compare
  filter(seito %in% c("自民", "共産", "民主")) %>%
  # Recode the Japanese party names into English abbreviations
  mutate(party = case_when(seito == "自民" ~ "LDP",
                           seito == "共産" ~ "JCP",
                           seito == "民主" ~ "DPJ")) %>%   # 民主 = Democratic Party of Japan
  ggplot() +
  geom_histogram(aes(x = voteshare,
                     fill = party),
                 position = "identity",
                 binwidth = 10,
                 color = "white",
                 alpha = 0.5) +
  labs(x = "Vote Share",
       y = "The Number of Candidates",
       fill = "") +
  theme_bw(base_family = "HiraKakuProN-W3")

If you delete position = "identity", the bars of the three parties are stacked on top of each other in each bin (the default is position = "stack")

df %>%
  filter(seito %in% c("自民", "共産", "民主")) %>%
  mutate(party = case_when(seito == "自民" ~ "LDP",
                           seito == "共産" ~ "JCP",
                           seito == "民主" ~ "DPJ")) %>%   # 民主 = Democratic Party of Japan
  ggplot() +
  # No position argument: the default position = "stack" is used
  geom_histogram(aes(x = voteshare,
                     fill = party),
                 binwidth = 10,
                 color = "white",
                 alpha = 0.5) +
  labs(x = "Vote Share",
       y = "The Number of Candidates",
       fill = "") +
  theme_bw(base_family = "HiraKakuProN-W3")

If you want to overlap the three distributions in one panel, you should use geom_density() instead of geom_histogram()

df %>%
  filter(seito %in% c("自民", "共産", "民主")) %>%
  mutate(party = case_when(seito == "自民" ~ "LDP",
                           seito == "共産" ~ "JCP",
                           seito == "民主" ~ "DPJ")) %>%   # 民主 = Democratic Party of Japan
  ggplot() +
  geom_density(aes(x = voteshare,
                   fill = party),
               color = "white",
               alpha = 0.5,
               adjust = 0.8) +   # adjust < 1 gives a less smooth curve than the default
  labs(x = "Vote Share",
       y = "Density",
       fill = "") +
  theme_bw(base_family = "HiraKakuProN-W3")

3 Exercise

Q3.1:
In reference to 2.4 Add another dimension, draw a histogram of vote share (voteshare) for the LDP (Liberal Democratic Party) and CDP (Constitutional Democratic Party) candidates in the 2021 lower house election.
・The party name is stored in the variable seito in hr96-21.csv
・Use geom_histogram() to draw the histogram.
・Make sure that no values are hidden behind other bars in the histogram.
Q3.2:
In reference to 2.4 Add another dimension, draw a histogram of vote share (voteshare) for the LDP (Liberal Democratic Party), CDP (Constitutional Democratic Party), and JCP (Japanese Communist Party) candidates in the 2021 lower house election.
・The party name is stored in the variable seito in hr96-21.csv
・Use geom_histogram() to draw the histogram.
・Make sure that no values are hidden behind other bars in the histogram.

In answering Q3.1 and Q3.2, use hr96-21.csv
・hr96-21.csv is a collection of Japanese lower house election data covering 9 national elections (1996, 2000, 2003, 2005, 2009, 2012, 2014, 2017, 2021)
・You need the following three variables, which are included in hr96-21.csv, to draw the histograms:

variable	detail
year	Election year (1996-2021)
voteshare	Vote share (%)
seito	Candidate’s affiliated party (in Japanese)

・hr96-21.csv contains the following 23 variables:

variable	detail
year	Election year (1996-2021)
pref	Prefecture
ku	Electoral district name
kun	Number of electoral district
rank	Ascending order of votes
wl	0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner
nocand	Number of candidates in each district
seito	Candidate’s affiliated party (in Japanese)
j_name	Candidate’s name (Japanese)
name	Candidate’s name (English)
previous	Previous wins
gender	Candidate’s gender: “male”, “female”
age	Candidate’s age
exp	Election expenditure (yen) spent by each candidate
status	0 = challenger / 1 = incumbent / 2 = former incumbent
vote	Votes each candidate garnered
voteshare	Vote share (%)
eligible	Eligible voters in each district
turnout	Turnout in each district (%)
castvote	Total votes cast in each district
seshu_dummy	0 = non-hereditary candidate / 1 = hereditary candidate
jiban_seshu	Relationship between candidate and his predecessor
nojiban_seshu	Relationship between candidate and his predecessor


5. ggplot2 (Histogram)

Masahiko Asano

2022-10-26
