• R Packages we use in this section
library(DT)
library(gapminder)
library(gghighlight)
library(ggrepel)
library(stargazer)
library(tidyverse)

1. What is a Boxplot?

  • A boxplot is one of the methods used to visualize the distribution of continuous variables.
Variable Type | Visualization Method | Features
Continuous    | Boxplot              | ① Shows the median, interquartile range, minimum, and maximum at a glance
              |                      | ② Allows comparison of distributions across groups within a single plot
  • Along with the histogram, the boxplot is one of the most widely used visualization methods.
  • Histograms show “where the density is highest,” while boxplots allow for quick comparison of descriptive statistics for specific variables.
  • With a boxplot, you can see the minimum value, maximum value, first quartile, second quartile (median), third quartile, and identify outliers.
  • You can view the distribution of variables across groups without the need for color-coding or facet splitting.

How to Read a Boxplot

Source: Asano and Yanai,『Rによる計量政治学』, p. 107.

Quartiles

  • Create a variable x
x <- c(10, 7, 9, 1, 0, 2, 5, 8, 3, 4, 6)
x
 [1] 10  7  9  1  0  2  5  8  3  4  6
  • Arrange x in ascending order
sort(x)
 [1]  0  1  2  3  4  5  6  7  8  9 10
  • Bottom end (minimum) … 0
  • Middle (second quartile = median) … 5
  • Upper end (maximum) … 10
  • Try drawing a boxplot of x using Base R
boxplot(x)
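
The five numbers that the boxplot summarizes can also be checked numerically. The following is a minimal sketch using base R's quantile(), fivenum(), and IQR() on the same x; the object names q and five are introduced here only for illustration.

```r
x <- c(10, 7, 9, 1, 0, 2, 5, 8, 3, 4, 6)

# Minimum, first quartile, median, third quartile, maximum
q <- quantile(x)
q
#   0%  25%  50%  75% 100% 
#  0.0  2.5  5.0  7.5 10.0 

# Tukey's five-number summary, which boxplot() uses for the box and whiskers
five <- fivenum(x)
five
# [1]  0.0  2.5  5.0  7.5 10.0

# Interquartile range = third quartile - first quartile
IQR(x)
# [1] 5
```

For this x the two summaries agree, so the box spans 2.5 to 7.5 and the line inside it sits at 5.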

2. Creating a Box Plot

df <- read_csv("data/hr96-21.csv",
               na = ".")  

2.1 Simple Boxplot

  • Using df, let’s create a boxplot for the vote share (voteshare).
  • The geometric object name is geom_boxplot().
  • Similar to a histogram, only one mapping is required.
  • You can map it to either y or x.
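
The code for this step is not shown above, so what follows is only a sketch of what a single-variable boxplot could look like. To keep it runnable on its own, it uses a small made-up data frame (toy, with a hypothetical voteshare column) instead of df; with the real data you would replace toy with df.

```r
library(ggplot2)

# Hypothetical stand-in for df: 50 made-up vote shares between 0 and 100
set.seed(1)
toy <- data.frame(voteshare = runif(50, min = 0, max = 100))

# Map voteshare to y for a vertical box ...
p_vertical <- ggplot(toy) +
  geom_boxplot(aes(y = voteshare)) +
  labs(y = "Vote Share")

# ... or to x for a horizontal box
p_horizontal <- ggplot(toy) +
  geom_boxplot(aes(x = voteshare)) +
  labs(x = "Vote Share")

p_vertical
```

Either mapping produces a single box, because no grouping variable has been mapped yet.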

The pipe operator: %>% or |>
  • R 4.1.0, released in May 2021, added |> as a built-in (native) pipe operator.
  • You can write the pipe as either %>% or |>; for the code in this section, the results are the same whichever you use.
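
As a quick check of this equivalence, here is a minimal sketch that runs the same two steps with each pipe. It assumes R 4.1 or later and the magrittr package, which provides %>% (the operator is also available after library(tidyverse)); the object names are introduced here for illustration.

```r
library(magrittr)  # provides %>%

x <- c(10, 7, 9, 1, 0, 2, 5, 8, 3, 4, 6)

# Sort x, then keep the three smallest values, once with each pipe
with_magrittr <- x %>% sort() %>% head(3)
with_native   <- x |> sort() |> head(3)

with_native
# [1] 0 1 2

identical(with_magrittr, with_native)
# [1] TRUE
```

The two pipes do differ in some details, such as placeholder syntax, but simple chains like the ones in this section behave the same way.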

2.2 Changing the Color of Boxplots

  • Bar graphs and histograms represent the distribution of a single variable.
  • Boxplots are often used when showing the distribution of a variable across different groups.
  • Let’s use a boxplot to illustrate the party-wise vote share in the 2021 general election.
df %>%
  filter(year == 2021) %>% #  Narrowing down to 2021 general election data
  ggplot() +
  geom_boxplot(aes(y = voteshare, x = seito)) + 
  labs(x = "Party", y = "Voteshare") +
  ggtitle("Vote Share by Party: 2021 General Election") +
  theme_bw(base_family = "HiraKakuProN-W3")

How to Change the Color of Boxplot

(1) If you want to change the ‘border color’ of the boxplot: color = ""
  • Specify color = "skyblue" outside of aes().
df %>%
  filter(year == 2021) %>% # Narrowing down to 2021 general election 
  ggplot() +
  geom_boxplot(aes(y = voteshare, 
                   x = seito),
               color = "skyblue") + 
  labs(x = "Party", y = "Voteshare") +
  ggtitle("Vote Share by Party: 2021 General Election") +
  theme_bw(base_family = "HiraKakuProN-W3")

(2) If you want to change the color inside the box: fill = ""
  • You should specify fill = "skyblue" outside of aes().
  • If you set color = "skyblue" instead, it changes the color of the box outline, not the color inside the box.
df %>%
  filter(year == 2021) %>% # Narrowing down to 2021 general election 
  ggplot() +
  geom_boxplot(aes(y = voteshare, 
                   x = seito),
               fill = "skyblue") + 
  labs(x = "Party", y = "Voteshare") +
  ggtitle("Vote Share by Party: 2021 General Election") +
  theme_bw(base_family = "HiraKakuProN-W3")

(3) If you want to change the color for each party:
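  • To give each party its own fill color, map fill = seito inside aes(). ggplot2 then assigns a default color to each party.
  • To choose the fill colors yourself, add a scale_fill_manual() layer.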
df %>%
  filter(year == 2021) %>% # Narrowing down to 2021 general election 
  ggplot() +
  geom_boxplot(aes(y = voteshare, 
                   x = seito,
                   fill = seito)) + 
  labs(x = "Party", y = "Voteshare") +
  ggtitle("Vote Share by Party: 2021 General Election") +
  theme_bw(base_family = "HiraKakuProN-W3")

(4) If you want to customize colors for each party:
  • To choose the colors yourself, map color = seito inside aes() and set a color for each party with scale_color_manual().
df %>%
  filter(year == 2021) %>% # Narrowing down to 2021 general election 
  ggplot() +
  geom_boxplot(aes(y = voteshare, 
                   x = seito,
                   color = seito)) + 
  labs(x = "Party", y = "Voteshare") +
  ggtitle("Vote Share by Party: 2021 General Election") +
  theme_bw(base_family = "HiraKakuProN-W3") +
  theme(legend.position = "none") +
  scale_color_manual(values = c("N党" = "grey", 
                               "れい" = "grey",
                               "公明" = "red",
                               "共産" = "grey",
                               "国民" = "grey", 
                               "無所" = "grey",
                               "社民" = "grey",
                               "立憲" = "grey",
                               "維新" = "grey",
                               "自民" = "red",
                               "諸派" = "grey"))

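Typing every party by hand is error-prone when only a few parties are highlighted. As a sketch, the named vector passed to scale_color_manual() can instead be built from the party labels. The objects parties, highlight, and party_colors are names introduced here for illustration, and the labels are copied from the code above.

```r
# Party labels as written in the scale_color_manual() call above
parties <- c("N党", "れい", "公明", "共産", "国民", "無所",
             "社民", "立憲", "維新", "自民", "諸派")

# Parties to highlight in red; every other party becomes grey
highlight <- c("公明", "自民")

party_colors <- ifelse(parties %in% highlight, "red", "grey")
names(party_colors) <- parties

party_colors[c("自民", "立憲")]
```

The result can then be passed as scale_color_manual(values = party_colors), which should give the same plot as listing all eleven parties explicitly.
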
2.3 Add a dimension to a box plot

  • To add a dimension to a box plot, use color-coding.
  • For example, try adding the candidates' gender (gender) as a new dimension.
  • Specify it inside aes() as fill = gender (the variable name, without quotation marks).
df %>%
  filter(year == 2021) %>% # Focus only on the 2021 general election data
  ggplot() +
  geom_boxplot(aes(y = voteshare, 
                   x = seito,
                   fill = gender)) + 
  labs(x = "Party", y = "Voteshare") +
  ggtitle("Vote Share by Party and Gender: 2021 General Election") +
  theme_bw(base_family = "HiraKakuProN-W3") +
  theme(legend.position = "bottom")

2.4 Customizing the Presentation of Box Plots

2.4.1 Narrowing or Widening the Width of the Box

  • To narrow or widen the box, specify width outside of aes().
  • If width is not specified, the default box width is 0.75. Here, we set the box width to 0.25.
df %>%
  filter(year == 2021) %>% # Focus only on the 2021 general election data
  ggplot() +
  geom_boxplot(aes(y = voteshare,
    x = seito),
    width = 0.25) + # Set width to 0.25
  labs(x = "Party", y = "Vote Share") +
  ggtitle("Vote Share by Party: 2021 General Election") +
  theme_bw(base_family = "HiraKakuProN-W3")

2.4.2 Vertical Display ↔︎ Horizontal Display

  • When there are too many boxes, the figure becomes horizontally elongated.
df %>%
  filter(year == 2021) %>% # Focus only on the 2021 general election data
  ggplot() +
  geom_boxplot(aes(y = voteshare, 
                   x = seito)) +              
  labs(x = "Party", y = "Vote Share") +
  ggtitle("Vote Share by Party: 2021 General Election") +
  theme_bw(base_family = "HiraKakuProN-W3")

Solution:

  • Swapping x and y draws the boxes horizontally, with the parties listed along the vertical axis.
df %>%
  filter(year == 2021) %>% # Focus only on the 2021 general election data
  ggplot() +
  geom_boxplot(aes(y = seito, 
                   x = voteshare)) +              
  labs(x = "Vote Share", y = "Party") +
  ggtitle("Vote Share by Party: 2021 General Election") +
  theme_bw(base_family = "HiraKakuProN-W3")

2.4.3 Reverse the order of party names display

  • Use scale_y_discrete(limits = rev)
df %>%
  filter(year == 2021) %>% # Focus only on the 2021 general election data
  ggplot() +
  geom_boxplot(aes(y = seito, 
                   x = voteshare)) +              
  labs(x = "Vote Share", y = "Party") +
  theme_bw(base_family = "HiraKakuProN-W3") +
  ggtitle("Vote Share by Party: 2021 General Election") +
  scale_y_discrete(limits = rev)

2.4.4 Customizing the Order of Party Names Display

  • Suppose you want to display the parties in order of vote share.
  • For example, you want to draw the box plots so that the parties are ordered by their average vote share in the 2021 general election.

Check the average vote share for each party.

df |> 
  filter(year == 2021) |> 
  group_by(seito) |> 
  summarize(ave_vs_2021 = mean(voteshare)) |> 
  DT::datatable()
  • Before passing to ggplot(), convert seito into a factor and set the order.
  • Try drawing the plot after converting seito to a factor
df %>%
  filter(year == 2021) %>% # Focus only on the 2021 general election data
  mutate(seito = factor(seito,
                        levels = c("公明", 
                                       "自民", 
                                       "立憲", 
                                       "国民", 
                                       "維新",
                                       "社民", 
                                       "無所", 
                                       "共産", 
                                       "れい", 
                                       "諸派",
                                       "N党"))) %>%    # 'N' of "N Party" is a full-width character
  ggplot() +
  geom_boxplot(aes(y = seito, 
                   x = voteshare)) +              
  labs(x = "Vote Share", y = "Party") +
  ggtitle("Vote Share by Party: 2021 General Election") +
  theme_bw(base_family = "HiraKakuProN-W3") 

Mean and median
  • The thick vertical line inside each box represents the median.
  • Here, the parties are ordered by their mean vote share.
  • The mean and the median do not always coincide.
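
Instead of typing the factor levels by hand, the order can also be computed from the data. The following is a minimal sketch, assuming the forcats package (attached by library(tidyverse)) and its fct_reorder() function; toy is a made-up data frame standing in for the filtered df, and its party labels A, B, and C are hypothetical.

```r
library(forcats)

# Hypothetical stand-in for the 2021 data: three parties, two candidates each
toy <- data.frame(
  seito     = c("A", "A", "B", "B", "C", "C"),
  voteshare = c(10, 20, 50, 70, 30, 40)
)

# Mean vote share: A = 15, B = 60, C = 35
# Reorder the levels of seito by the mean of voteshare (ascending)
toy$seito <- fct_reorder(toy$seito, toy$voteshare, .fun = mean)

levels(toy$seito)
# [1] "A" "C" "B"
```

When seito is mapped to y, the first level is drawn at the bottom, so in this sketch the party with the highest mean (B) would appear at the top of the plot.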

  • Box plots allow for quick comparison of the distributions of variables across groups.
  • In the 2021 House of Representatives election, the party with the highest average vote share was Komeito (公明党).
  • Visually, the party with the largest variance is the Democratic Party for the People (国民民主党).
  • Compared to the Liberal Democratic Party (自民党), Komeito (公明党) has less variance.
  • N Party (N党), Other Factions (諸派), Reiwa (れい), and Komeito (公明党) have smaller variances.

Check the standard deviation for each party

df |> 
  filter(year == 2021) |> 
  group_by(seito) |> 
  summarize(ave_sd_2021 = sd(voteshare)) |> 
  DT::datatable()
  • The far end of the line (whisker) extending to the right represents the maximum value.
  • The far end of the whisker extending to the left represents the minimum value.
  • However, some values are greater than this “maximum” or smaller than this “minimum”.
    → The dots “●” beyond the ends of the whiskers are outliers.
  • The minimum value in the box plot: the smallest value that is no less than “first quartile - 1.5 × interquartile range”.
  • The maximum value in the box plot: the largest value that is no greater than “third quartile + 1.5 × interquartile range”.
  • Cases that fall outside these “minimum” and “maximum” values are displayed as outliers.
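
The rule above can be traced by hand on a small vector. This is a minimal sketch in base R using made-up numbers; all object names are introduced here for illustration. Note that boxplot.stats() computes the box edges as hinges, which can differ slightly from quantile(), although for this x both approaches flag the same outlier.

```r
# Made-up data with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q1  <- quantile(x, 0.25, names = FALSE)  # first quartile
q3  <- quantile(x, 0.75, names = FALSE)  # third quartile
iqr <- q3 - q1                           # interquartile range

# Limits beyond which a value is treated as an outlier
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whisker ends: the most extreme values that still lie within the fences
inside      <- x[x >= lower_fence & x <= upper_fence]
whisker_min <- min(inside)
whisker_max <- max(inside)

# Outliers: values beyond the fences
outliers <- x[x < lower_fence | x > upper_fence]

c(whisker_min, whisker_max)
# [1] 1 9
outliers
# [1] 30

# Base R's own outlier detection, for comparison
boxplot.stats(x)$out
# [1] 30
```

Here the right whisker ends at 9 rather than at the true maximum of 30, which is drawn as a separate dot.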
  • The Liberal Democratic Party (自民) has a few candidates with extremely high vote shares (outliers).
  • Let’s display the names of these candidates.
df %>% 
  filter(year == 2021) %>%            # Focus only on the 2021 general election data
  filter(voteshare > 75) %>%  # Filter only candidates with a vote share over 75%
  select(pref, kun, seito, age, voteshare, vote, j_name) %>% 
  head(10)
# A tibble: 10 × 7
   pref    kun seito   age voteshare   vote j_name    
   <chr> <dbl> <chr> <dbl>     <dbl>  <dbl> <chr>     
 1 宮崎      3 自民     56      80.7 111845 古川禎久  
 2 宮城      6 自民     61      83.2 119555 小野寺五典
 3 群馬      5 自民     47      76.6 125702 小渕優子  
 4 広島      1 自民     64      80.7 133704 岸田文雄  
 5 香川      3 自民     53      79.8  94437 大野敬太郎
 6 山口      2 自民     62      76.9 109914 岸信夫    
 7 山口      3 自民     60      76.9  96983 林芳正    
 8 秋田      3 自民     57      78.0 134734 御法川信英
 9 石川      2 自民     47      78.4 137032 佐々木紀  
10 鳥取      1 自民     64      84.1 105441 石破茂    
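
The names can also be drawn directly on the plot. The following is only a sketch, assuming the ggrepel package loaded at the top of this section and its geom_text_repel() layer. It uses a made-up data frame (toy, with hypothetical party labels and candidate names) so that it runs on its own; with the real data you would use df and j_name instead.

```r
library(ggplot2)
library(ggrepel)

# Hypothetical data: two parties, with one extreme vote share in party A
set.seed(1)
toy <- data.frame(
  seito     = rep(c("A", "B"), each = 20),
  voteshare = c(runif(19, 30, 50), 85, runif(20, 10, 30)),
  name      = paste0("cand_", 1:40)
)

# Candidates to label: same threshold as above (vote share over 75%)
top <- subset(toy, voteshare > 75)

p <- ggplot(toy, aes(x = voteshare, y = seito)) +
  geom_boxplot() +
  geom_text_repel(data = top, aes(label = name)) +  # label only the selected rows
  labs(x = "Vote Share", y = "Party")

p
```

Passing the filtered rows through the data argument of geom_text_repel() keeps the labels limited to the extreme cases, while the boxes are still drawn from the full data.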

  • By adding fill = seito as a mapping element, it can be displayed in color.
df %>%
  filter(year == 2021) %>% # Focus only on the 2021 general election data
  mutate(seito = factor(seito,
                        levels = c("公明", 
                                   "自民", 
                                   "立憲", 
                                   "国民", 
                                   "維新",
                                   "社民", 
                                   "れい", 
                                   "共産", 
                                   "無所", 
                                   "N党", 
                                   "諸派"))) %>% # 'N' of "N Party" is a full-width character
  ggplot() +
  geom_boxplot(aes(x = voteshare,
                   y = seito,
                   fill = seito)) +              
  labs(x = "Vote Share", y = "Party") +
  ggtitle("Vote Share by Party: 2021 General Election") +
  theme_bw(base_family = "HiraKakuProN-W3") +
  theme(legend.position = "none")

3. Exercise

Q3.1:

  • Using the analysis method from 2.2 Changing the Color of Boxplots, draw a box plot for the 2009 general election vote share for each political party.
  • Customize the colors for each party, with Democratic Party candidates in blue, Liberal Democratic Party candidates in red, and candidates from other parties in grey.

Q3.2:

  • Using the analysis method from 2.4.4 Customizing the Order of Party Names Display, draw a box plot for the 2009 general election vote share for each political party. Display the box plots in black and white, in order of highest vote share from the top.

Q3.3:

  • Using the analysis method from 2.4.4 Customizing the Order of Party Names Display, display a list of the names, electoral districts, ages, vote shares, and vote counts of the Democratic Party candidates with the highest vote share in the 2009 general election.

Reference

  • Tidy Animated Verbs
  • 宋財泫 (Jaehyun Song)・矢内勇生 (Yuki Yanai)「私たちのR: ベストプラクティスの探究」
  • 宋財泫「ミクロ政治データ分析実習(2022年度)」
  • 土井翔平(北海道大学公共政策大学院)「Rで計量政治学入門」
  • 矢内勇生(高知工科大学)授業一覧
  • 浅野正彦, 矢内勇生.『Rによる計量政治学』オーム社、2018年
  • 浅野正彦, 中村公亮.『初めてのRStudio』オーム社、2018年
  • Winston Chang, R Graphics Cookbook, O’Reilly Media, 2012.
  • Kieran Healy, DATA VISUALIZATION, Princeton, 2019
  • Kosuke Imai, Quantitative Social Science: An Introduction, Princeton University Press, 2017