• R Packages we use in this section
library(DT)
library(gapminder)
library(gghighlight)
library(ggrepel)
library(stargazer)
library(tidyverse)

1. What is a Scatter plot?

  • A scatter plot is a common method of visualizing the relationship between two continuous variables (variables measured on an interval or ratio scale).
  • A scatter plot visualizes multiple points on a two-dimensional plane.
  • The functions used here are ggplot() and geom_point(); geom_point() is the geometric object (geom) that draws the points.

| Function | Meaning |
|---|---|
| ggplot() | Prepare the canvas for drawing the figure |
| geom_point() | Draw the scatter plot |

  • To illustrate, let’s extract the 2009 election data from the 1996-2021 general election data and create a scatter plot of ‘Election Expenses’ and ‘Vote Share’.
  • The variables needed for this are as follows:
  1. Election Expenses (exp)
  2. Vote Share (voteshare)
  • Load the data.

Download hr96-21.csv

df <- read_csv("data/hr96-21.csv",
               na = ".")  
  • Extract the data from 2009 and name it df09.
df09 <- df %>%
  filter(year == 2009)

1.1 The Simplest Scatter Plot

  • Map the horizontal axis position to x and the vertical axis position to y.
  • When considering a cause-and-effect relationship (causal relationship) between two variables:
  • Set the factor thought to be the cause on the x axis (here, Election Expenses: exp).
  • Set the factor thought to be the result on the y axis (here, Vote Share: voteshare).
df09 %>%
  ggplot() +
  geom_point(aes(x = exp, 
                 y = voteshare)) 

  • The horizontal and vertical axis titles are exp and voteshare.
  • These titles may not be clear to the viewer in terms of what they represent.
    => We need to specify what they mean.

1.2 Customizing Labels and Dot Colors

  • Add labels to the x and y axes.
  • Use the ggtitle() function to add a main title.
  • Specify the color of the dots.
df09 %>%
   ggplot() +
   geom_point(aes(x = exp, 
                  y = voteshare), 
              color = "royalblue") +
   labs(x = "Election Expenses", 
        y = "Vote Share") +
  ggtitle("Scatter Plot of Election Expenses and Vote Share: 2009 HR Election") +
  theme_bw(base_family = "HiraKakuProN-W3")

1.3 Customizing the Shape of Dots

  • You can specify the shape of the dots using shape = number.
  • Here, let’s try setting it to a hollow triangle (△) with shape = 2.
df09 %>%
   ggplot() +
   geom_point(aes(x = exp, 
                  y = voteshare), 
              color = "royalblue",
              shape = 2) +
   labs(x = "Election Expenses", 
        y = "Vote Share") +
  ggtitle("Scatter Plot of Election Expenses and Vote Share: 2009 HR Election") +
  theme_bw(base_family = "HiraKakuProN-W3")

  • You can choose from 26 dot shapes, numbered 0 to 25.
  • The default shape of the dot is 19, which is a solid circle (●).

| Shapes | Details |
|---|---|
| 0 to 14 | Shapes with transparent insides and only outlines. The outline color is set with the color argument. |
| 15 to 20 | Solid shapes without a separate outline. The color of the whole dot is set with the color argument. |
| 21 to 25 | The outline is set with color, and the inside fill is set with fill. |
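
As a visual reference for the shape codes, the following sketch draws all 26 shapes in a grid. It uses a small made-up data frame (shapes, with the illustrative columns code, col, and row) rather than the election data, and sets color = "violet" and fill = "aquamarine" on every point so that you can see which shapes respond to fill.

```r
library(ggplot2)

# Toy data (not the election data): one row per shape code, laid out on a grid
shapes <- data.frame(code = 0:25,
                     col  = 0:25 %% 7,      # grid column
                     row  = -(0:25 %/% 7))  # grid row (negative so that 0 is at the top)

p_shapes <- ggplot(shapes, aes(x = col, y = row)) +
  geom_point(aes(shape = code),
             size = 5,
             color = "violet",       # outline (or the whole dot for shapes 0-20)
             fill = "aquamarine") +  # interior, used only by shapes 21-25
  geom_text(aes(label = code), nudge_y = -0.35, size = 3) +  # print the code under each dot
  scale_shape_identity() +  # use the numbers in `code` directly as shape codes
  theme_void()

p_shapes
```

Only shapes 21 to 25 should show an aquamarine interior; all other shapes are drawn in violet only.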

Let’s try using shape = 22 to display a square (□) with an ‘aquamarine’ inside and a ‘violet’ outline.

df09 %>%
   ggplot() +
   geom_point(aes(x = exp, 
                  y = voteshare), 
              color = "violet",    # Specify the color of the outline
              fill = "aquamarine",  # Specify the color for the inside fill
              shape = 22) +        # Specify the shape of the dot
   labs(x = "Election Expenses", 
        y = "Vote Share") +
  ggtitle("Scatter Plot of Election Expenses and Vote Share: 2009 General Election") +
  theme_bw(base_family = "HiraKakuProN-W3")

2. Adding Dimensions

  • The scatter plot created in the previous section represents two pieces of information (= 2 dimensions): exp and voteshare.
  • Scatter plots can expand the number of pieces of information (= dimensions) they represent.
  • Here, let’s use the if_else() function to create a Democratic Party dummy variable dpj and add a dimension to the scatter plot.
  • Inside the aes() function, specify shape = dpj.

2.1 Adding Dimension by Changing the Shape of Dots

  • Specify shape = dpj inside aes().
    → This allows changing the shape of the dot depending on whether it’s a Democratic Party candidate or not.
  • seito contains the name of the party each candidate is affiliated with.
  • 民主 is the Japanese abbreviation for the Democratic Party of Japan (DPJ).
df09 %>%
  mutate(dpj = if_else(seito == "民主", "Democratic Party", "Non-Democratic Party")) %>%
   ggplot() +
   geom_point(aes(x = exp, 
                  y = voteshare,
                  shape = dpj)) +  # Differentiate dpj by the shape of the dot
   labs(x = "Election Expenses", 
        y = "Vote Share") +
  ggtitle("Scatter Plot of Election Expenses and Vote Share: 2009 HR Election") +
  theme_bw(base_family = "HiraKakuProN-W3")

  • When specifying shape = dpj, R automatically assigns shapes like ‘solid circle’ (●) and ‘triangle’ (▲).
  • Although it’s possible to distinguish between Democratic Party candidates and others upon closer inspection, it’s hard to see the overall trend of Democratic Party candidates.
    → Try specifying the shapes of the dots:
    ・Democratic Party: ‘empty circle’ (○)
    ・Non-Democratic Party: ‘cross’ (×)
  • Use scale_shape_manual().
df09 %>%
  mutate(dpj = if_else(seito == "民主", "Democratic Party", "Non-Democratic Party")) %>%
   ggplot() +
   geom_point(aes(x = exp, 
                  y = voteshare,
                  shape = dpj)) +  # Differentiate dpj by the shape of the dot
   labs(x = "Election Expenses", 
        y = "Vote Share") +
  ggtitle("Scatter Plot of Election Expenses and Vote Share: 2009 General Election") +
  theme_bw(base_family = "HiraKakuProN-W3")  +
  theme(legend.position = "bottom") + # Position the legend at the bottom
  scale_shape_manual(values = c("Democratic Party" = 1,    # 'empty circle' is 1  
                                "Non-Democratic Party" = 4))  # 'cross' is 4    

  • It has become much easier to read than ‘solid circle’ (●) and ‘triangle’ (▲).

2.2 Adding Dimension by Changing the Dot Colors

  • To make the trend of Democratic Party candidates more visible among all candidates:
    → Try displaying with different colors.
  • Inside the aes() function, specify color = dpj.
df09 %>%
  mutate(dpj = if_else(seito == "民主", "Democratic Party", "Non-Democratic Party")) %>%
   ggplot() +
   geom_point(aes(x = exp, 
                  y = voteshare,
                  color = dpj),   # Differentiate dpj by color
              alpha = 0.5) +      # Add transparency (a constant, so set it outside aes())
   labs(x = "Election Expenses", 
        y = "Vote Share") +
  ggtitle("Scatter Plot of Election Expenses and Vote Share: 2009 General Election") +
  theme_bw(base_family = "HiraKakuProN-W3") +
  theme(legend.position = "bottom") # Position the legend at the bottom

  • In the 2009 general election, where there was a power shift from the Liberal Democratic Party to the Democratic Party, it is clear that Democratic Party candidates tended to receive more votes.

2.3 Specifying Dot Colors

  • If you want to specify the dots in your preferred colors:
  • Add a scale_color_manual() layer.
  • The argument is values.
  • Specify as c("value1" = "color1", "value2" = "color2", …):
    → Specify a character type vector.
df09 %>%
  mutate(dpj = if_else(seito == "民主", "Democratic Party", "Non-Democratic Party")) %>%
   ggplot() +
   geom_point(aes(x = exp, 
                  y = voteshare,
                  color = dpj),   # Differentiate dpj by color
              alpha = 0.5) +      # Add transparency (a constant, so set it outside aes())
   labs(x = "Election Expenses", 
        y = "Vote Share") +
  ggtitle("Scatter Plot of Election Expenses and Vote Share: 2009 General Election") +
  theme_bw(base_family = "HiraKakuProN-W3") +
  theme(legend.position = "bottom") + # Position the legend at the bottom
  scale_color_manual(values = c("Democratic Party" = "blue",
                                "Non-Democratic Party" = "gold"))

  • R provides 657 named colors, all of which can be used in ggplot2!
  • Colors can be specified by name, like “red”, “skyblue”, or “royalblue”.
  • A list of colors available in R can be viewed by typing colors() in the console.
  • Here, let’s show the first 6 colors.
head(colors())
[1] "white"         "aliceblue"     "antiquewhite"  "antiquewhite1"
[5] "antiquewhite2" "antiquewhite3"
  • In addition to named colors, you can also specify colors using RGB values (HEX codes; hexadecimal notation).
  • For example, red can be specified as #FF0000, and royal blue as #4169E1.
  • When using HEX codes, you can specify colors very precisely.
    → This allows for 16,777,216 different colors!!!
  • The following examples show a subset of the colors that can be used in R.
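
As a quick check of the numbers in this list, the sketch below counts the named colors and converts color names to HEX codes using base R only. The helper name_to_hex() is a hypothetical function written for this illustration, not part of R or ggplot2.

```r
# Number of named colors built into R
length(colors())          # 657

# Convert a color name to its HEX code via its RGB components
name_to_hex <- function(name) {
  rgb_values <- col2rgb(name)  # 3 x 1 matrix: red, green, blue (each 0-255)
  rgb(rgb_values[1, ], rgb_values[2, ], rgb_values[3, ], maxColorValue = 255)
}

name_to_hex("red")        # "#FF0000"
name_to_hex("royalblue")  # "#4169E1"

# Each of the three channels takes 256 values, so HEX codes cover
256^3                     # 16777216 colors
```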

3. Scatter Plot with Regression Line (1)

  • Let’s add a regression line to the scatter plot of ‘Election Expenses’ and ‘Vote Share’ from the 2009 general election.
  • To add a regression line, include geom_smooth(method = lm).
  • Specify the aes() function inside ggplot(), and set the x axis, y axis, and color.
plot_vs_09 <- df09 %>%
   ggplot(aes(x = exp,
              y = voteshare,
              color = seito)) +
   geom_point(alpha = 0.5) +   # Specify the transparency of the dots (a constant, so outside aes())
   geom_smooth(method = lm) +  # Draw the regression line
   labs(x = "Election Expenses", 
        y = "Vote Share") +
  ggtitle("Scatter Plot of Election Expenses and Vote Share: 2009 HR Election") +
  theme_bw(base_family = "HiraKakuProN-W3")

plot_vs_09

  • Let’s make it more readable by party using the facet_wrap() function.
plot_vs_09 +
  facet_wrap(~seito)    # Facet by each political party. 

  • The labels on the x axis are long and overlapping each other.
    → Let’s try rotating the x axis labels by 40 degrees.
  • Add a theme() layer and specify axis.text.x.
plot_vs_09 +
  facet_wrap(~seito)  +  # Facet by political party
  theme(legend.position = "none") + # Hide the legend
  theme(axis.text.x  = element_text(angle = 40, vjust = 1, hjust = 1)) # Rotate by 40 degrees

  • The 2009 general election was a power shift from the Liberal Democratic Party (LDP) to the Democratic Party.
  • We want to compare the LDP and the Democratic Party.
  • However, the plots for the LDP and the Democratic Party are not next to each other.
  • Convert seito to a factor before passing it to ggplot() to change the order.
df09 %>%
   mutate(seito = factor(seito,
                         levels = c("民主", "自民", "公明", "みんな",
                                    "共産", "国民新党", "幸福", "新党日本", 
                                    "無所", "社民"))) %>% 
   ggplot(aes(x = exp,
              y = voteshare,
              color = seito)) +
   geom_point(alpha = 0.5) +   # Specify the transparency of the dots (a constant, so outside aes())
   geom_smooth(method = lm) +  # Draw the regression line
   labs(x = "Election Expenses", 
        y = "Vote Share") +
  ggtitle("Scatter Plot of Election Expenses and Vote Share: 2009 HR Election") +
  theme_bw(base_family = "HiraKakuProN-W3") +
  facet_wrap(~seito, ncol = 4) + # Display in 4 columns
  theme(legend.position = "none") + # Hide the legend
  theme(axis.text.x  = element_text(angle = 40, vjust = 1, hjust = 1)) # Rotate by 40 degrees
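
The conversion to a factor changes the panel order because facet_wrap() lays out panels in the order of the factor levels, whereas a character variable is sorted alphabetically. A minimal sketch with made-up party labels (toy, "A", "B", and "C" are illustrative, not the election data):

```r
library(ggplot2)

# Toy data (made-up values, not the election data)
toy <- data.frame(party = c("C", "A", "B", "A", "C", "B"),
                  x = 1:6,
                  y = c(2, 4, 1, 5, 3, 6))

# Without `levels`, the levels are ordered alphabetically
levels(factor(toy$party))          # "A" "B" "C"

# Giving `levels` explicitly fixes the order, and facet_wrap() follows it
toy$party <- factor(toy$party, levels = c("B", "A", "C"))
levels(toy$party)                  # "B" "A" "C"

p_toy <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~party)  # panels appear in the order B, A, C

p_toy
```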

Visually Observable Insights
  • Democratic Party candidates tended to win higher vote shares than Liberal Democratic Party candidates.
  • Democratic Party candidates tended to spend less on election expenses than Liberal Democratic Party candidates.

Calculating Average Vote Share by Party

  • Use the group_by() and summarise() functions to calculate the average vote share for each party.
df09_ave <- df09 %>%
  group_by(seito) %>%
  summarise(ave_vs = mean(voteshare, 
                          na.rm = TRUE))
DT::datatable(df09_ave)

4. Scatter Plot with Regression Line (2)

  • Scatter plot of ‘GDP’ and ‘Life Expectancy’.
  • Here, we will use the {gapminder} data available in R to introduce a method for visualizing information in more than three dimensions on a two-dimensional plane.
  • Load the {gapminder} package (if it is not installed yet, install it first with install.packages("gapminder")).
library(gapminder)
  • The {gapminder} package includes the following variables:

| Variable Name | Details |
|---|---|
| country | Country Name |
| continent | Continent Name |
| year | Year |
| lifeExp | Life Expectancy |
| pop | Population |
| gdpPercap | GDP per Capita, in US dollars as of 2005 |

  • Show descriptive statistics.
library(stargazer)
  • Remember to specify {r, results = "asis"} in the chunk options.
stargazer(as.data.frame(gapminder), 
          type ="html",
          digits = 2)

| Statistic | N | Mean | St. Dev. | Min | Max |
|---|---|---|---|---|---|
| year | 1,704 | 1,979.50 | 17.27 | 1,952 | 2,007 |
| lifeExp | 1,704 | 59.47 | 12.92 | 23.60 | 82.60 |
| pop | 1,704 | 29,601,212.00 | 106,157,897.00 | 60,011 | 1,318,683,096 |
| gdpPercap | 1,704 | 7,215.33 | 9,857.45 | 241.17 | 113,523.10 |

DT::datatable(gapminder)

4.1 Simple Scatter Plot {gapminder}

  • First, let’s draw a scatter plot by specifying ‘GDP per Capita (gdpPercap)’ for the x axis and ‘Average Life Expectancy (lifeExp)’ for the y axis.
  • Before passing to ggplot(), create a variable pop_m that converts the population into ‘millions of people’.
gapminder %>%
  mutate(pop_m = pop / 1000000) %>%  # Create variable pop_m, converting population to millions
  ggplot() +
  geom_point(aes(x = gdpPercap, 
                 y = lifeExp)) +
  labs(x = "GDP per Capita (USD)", 
       y = "Life Expectancy") +
  theme_bw(base_family = "HiraKakuProN-W3")

  • When looking at the descriptive statistics for ‘GDP per Capita’, the data range is very wide, extending from $241 (lowest) to $113,523 (highest).
  • The relationship between ‘GDP per Capita’ and ‘Average Life Expectancy’ appears more logarithmic than linear.
    → For better visibility, transform the x axis to log scale. Specify as x = log(gdpPercap).
gapminder %>%
  mutate(pop_m = pop / 1000000) %>% 
  ggplot() +
  geom_point(aes(x = log(gdpPercap), # Log transform gdpPercap
                 y = lifeExp)) +
  labs(x = "Logarithmic Value of GDP per Capita (USD)", 
       y = "Life Expectancy") +
  theme_bw(base_family = "HiraKakuProN-W3")
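
One rough way to check numerically whether the log scale gives a more linear pattern is to compare the correlation of lifeExp with gdpPercap and with log(gdpPercap). This is only a sketch: the correlation coefficient measures linear association, so a larger value after the transformation is consistent with, but does not prove, a linear relationship. The names cor_raw and cor_log are illustrative.

```r
library(gapminder)

# Correlation with GDP per capita on the original scale
cor_raw <- cor(gapminder$gdpPercap, gapminder$lifeExp)

# Correlation after log-transforming GDP per capita (natural log, as in the plot)
cor_log <- cor(log(gapminder$gdpPercap), gapminder$lifeExp)

c(raw = cor_raw, log = cor_log)
```

If cor_log is clearly larger than cor_raw, a straight line summarizes the transformed data better than the untransformed data.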

  • After log-transforming gdpPercap, a quite clear linear relationship became apparent.
  • Let’s check out the super-rich countries with exceptionally high ‘GDP per Capita’ located at the right end of the screen.
gapminder %>% 
  dplyr::filter(gdpPercap > 60000) %>% 
  dplyr::select(year, country, gdpPercap)
# A tibble: 5 × 3
   year country gdpPercap
  <int> <fct>       <dbl>
1  1952 Kuwait    108382.
2  1957 Kuwait    113523.
3  1962 Kuwait     95458.
4  1967 Kuwait     80895.
5  1972 Kuwait    109348.

4.2 Add continents and population to the scatter plot.

  • Add color = continent inside aes() to color-code by continent.
  • Specify size = pop_m inside aes() to scale the size of the dots with the population.
  • When the dots get larger, they overlap and become difficult to see, so set the dots to be semi-transparent (alpha = 0.5).
  • The code below keeps the log-transformed x axis, x = log(gdpPercap). An alternative that leaves the analysis results unchanged is to map x = gdpPercap and add coord_trans(x = "log10"), which transforms only the axis.
gapminder %>%
  mutate(pop_m = pop / 1000000) %>% 
  ggplot() +
  geom_point(aes(x = log(gdpPercap), 
                 y = lifeExp, 
                 color = continent,
                 size = pop_m),
             alpha = 0.5) +
  labs(x = "Logarithmic Value of GDP Per Capita (USD)", 
       y = "Life Expectancy",
       size = "Population",
       color = "Continent") +
  theme_bw(base_family = "HiraKakuProN-W3")
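
As an alternative to transforming the variable inside aes(), the axis itself can be put on a log scale. The sketch below uses scale_x_log10(), a different function from the coord_trans() mentioned above, so that the tick labels stay in US dollars. The object name p_log10 and the axis wording are illustrative choices, and the font setting is omitted so that the code also runs on systems without the Hiragino font.

```r
library(ggplot2)
library(gapminder)

p_log10 <- ggplot(gapminder,
                  aes(x = gdpPercap,            # untransformed GDP per capita
                      y = lifeExp,
                      color = continent,
                      size = pop / 1000000)) +  # population in millions
  geom_point(alpha = 0.5) +
  scale_x_log10(labels = scales::comma) +       # log10 axis, labels in dollars
  labs(x = "GDP per Capita (USD, log scale)",
       y = "Life Expectancy",
       size = "Population (millions)",
       color = "Continent") +
  theme_bw()

p_log10
```

The point pattern should look the same as with log(gdpPercap) apart from the base of the logarithm; only the axis labels differ.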

  • Add geom_smooth() to draw a smoothed fit line for each continent.
gapminder %>% 
  mutate(pop_m = pop / 1000000) %>% 
  ggplot(aes(x = log(gdpPercap), 
             y = lifeExp, 
             col = continent, 
             size = pop_m)) +
  geom_point(alpha = 0.5) +
  labs(x = "Logarithmic Value of GDP Per Capita (USD)", 
       y = "Life Expectancy",
       size = "Population",
       color = "Continent") +
  theme_bw(base_family = "HiraKakuProN-W3") +
  geom_smooth()

4.3 Drawing a Straight Line

  • Specify geom_smooth(method = lm) to draw a straight (linear regression) line.
gapminder %>% 
  mutate(pop_m = pop / 1000000) %>% 
  ggplot(aes(x = log(gdpPercap), 
             y = lifeExp, 
             col = continent, 
             size = pop_m)) +
  geom_point(alpha = 0.5) +
  labs(x = "Logarithmic Value of GDP Per Capita (USD)", 
       y = "Life Expectancy",
       size = "Population",
       color = "Continent") +
  theme_bw(base_family = "HiraKakuProN-W3") +
  geom_smooth(method = lm)

4.4 Display categorized by dimensions

  • Use facet_wrap(~continent) to draw separate scatter plots for each continent.
gapminder %>% 
  mutate(pop_m = pop / 1000000) %>% 
  ggplot(aes(x = log(gdpPercap), 
             y = lifeExp, 
             col = continent, 
             size = pop_m)) +
  geom_point(alpha = 0.5) +
  labs(x = "Logarithmic Value of GDP Per Capita (USD)", 
       y = "Life Expectancy",
       size = "Population",
       color = "Continent") +
  theme_bw(base_family = "HiraKakuProN-W3") +
  geom_smooth(method = lm) +
  facet_wrap(~continent)

Insights from the gapminder Data from 1952 to 2007
・In the continents of Africa, Asia, and Europe, ‘GDP per capita’ has a comparable impact on ‘Life Expectancy’.
・The continents where ‘GDP per capita’ has a significant impact on ‘Life Expectancy’ (i.e., where the slope is steeper) are the Americas and Oceania.
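
The slopes compared above can also be checked numerically. A minimal sketch, assuming an unweighted linear regression of lifeExp on log(gdpPercap) fitted separately for each continent, which should correspond to the lines that geom_smooth(method = lm) draws in each panel (slopes is an illustrative name):

```r
library(gapminder)

# Fit lifeExp ~ log(gdpPercap) separately for each continent and keep the slope
slopes <- sapply(split(gapminder, gapminder$continent), function(d) {
  fit <- lm(lifeExp ~ log(gdpPercap), data = d)
  unname(coef(fit)[2])  # coefficient on log(gdpPercap)
})

# Continents sorted from the steepest to the flattest slope
round(sort(slopes, decreasing = TRUE), 2)
```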

4.5 Highlighting Specific Countries

  • Try identifying data for Japan, the United States, and China. Use the {gghighlight} package.
library(gghighlight)
gapminder %>% 
  mutate(pop_m = pop / 1000000) %>% 
  ggplot(aes(x = log(gdpPercap), 
             y = lifeExp, 
             col = country, 
             size = pop_m)) +
  geom_point(alpha = 0.5) +
  gghighlight(country %in% c("Japan", "China", "United States"),
              label_params = list(size = 3)) +
  labs(x = "Logarithmic Value of GDP Per Capita (USD)", 
       y = "Life Expectancy",
       size = "Population",
       color = "Country") +
  theme_bw(base_family = "HiraKakuProN-W3") 

Conclusions
・In all these countries, life expectancy increases as per capita GDP increases.
・In China, life expectancy rises sharply while the log of per capita GDP is around 6 to 7, followed by a more gradual increase thereafter.
・In the United States (represented by blue dots), life expectancy also increases as per capita GDP increases, but the life expectancy is shorter than in Japan.

5. Displaying Text in a Scatter Plot

5.1 Data Preparation

  • In this section, we draw scatter plots using data on the voting rate of eligible voters aged 18.
  • Download vote_18.csv, the data from the 24th (2016) House of Councillors election, and save it in the data folder inside the RProject folder.
  • Load the data.

hc2016 <- read_csv("data/vote_18.csv")
  • The variables are as follows:

| Variable Name | Details |
|---|---|
| pref | Prefecture |
| age18 | Voting rate of eligible voters aged 18 |
| age19 | Voting rate of eligible voters aged 19 |
| age1819 | Voting rate of eligible voters aged 18 and 19 |
| all | Voting rate of the prefecture |
| did | Population density of the prefecture |

  • Displaying Descriptive Statistics of the Data.
  • Note: Enter {r, results = "asis"} in the chunk options.
stargazer(data.frame(hc2016), 
          type = "html")
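
| Statistic | N | Mean | St. Dev. | Min | Max |
|---|---|---|---|---|---|
| serial | 47 | 24.000 | 13.711 | 1 | 47 |
| age18 | 47 | 48.389 | 5.307 | 35.290 | 62.230 |
| age19 | 47 | 38.419 | 6.033 | 26.580 | 53.800 |
| age1819 | 47 | 43.446 | 5.558 | 30.930 | 57.840 |
| all | 47 | 54.968 | 3.887 | 45.520 | 62.860 |
| did | 47 | 655.374 | 1,194.258 | 68.650 | 6,168.040 |
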
  • Use the {DT} package to display the data as an interactive table.
DT::datatable(hc2016)

For reference:

  • If you just want to display the entire data frame on the screen at once, enter the command knitr::kable(hc2016).

  • Data sources: Data on the voting rate of 18 and 19-year-olds in the 24th House of Councillors ordinary election: Ministry of Internal Affairs and Communications, conducted on July 10, 2016 (Heisei 28).

  • The population density data (did) are available for download (data as of April 1, 2016).

5.2 Adding dimensions to a scatter plot.

5.2.1 Making Scatter Plots More Readable

  • “Urbanization index” and “Turnout”.
  • Add dimensions (prefecture names) to the scatter plot of “Urbanization index” and “Turnout”.
  • Specify “Urbanization index (did)” on the x-axis and “Turnout” on the y-axis to create the scatter plot.
  • Use the geom_text() function to display prefecture names.
hc2016 %>% 
  ggplot(aes(did, all)) +
  geom_point() +
  stat_smooth(method = lm) +
  geom_text(aes(y = all + 0.5, 
                label = pref),
            size = 2, 
            family = "HiraKakuPro-W3") +
  labs(x = "Urbanization index (did)", y = "Turnout") +
  ggtitle("Voting rate by prefecture in the 2016 Japanese House of Councillors election") +
  theme_bw(base_family = "HiraKakuProN-W3")

  • The text overlaid on the prefectures is difficult to distinguish in the graph:
  • The three prefectures with the highest population density, Tokyo, Osaka, and Kanagawa, are isolated on the right as outliers.
  • This makes the graph hard to read.

Solutions:

  1. Exclude outliers (in this case, Tokyo(東京), Osaka(大阪), and Kanagawa(神奈川)) from the data.
  2. Apply a logarithmic transformation to variables with large value differences (in this case, the degree of urbanization did).

Try each method

(1) Scatter Plot Excluding Specific ‘Outliers’.

  • Draw a scatter plot excluding the top three prefectures with high population density: Tokyo, Osaka, and Kanagawa.
  • The ‘degree of urbanization’ of these three prefectures is visually over 2000.
    → Specify dplyr::filter(did < 2000).

{ggrepel}:

  • Use the {ggrepel} package to spread out the prefecture names and make them more readable.
hc2016 %>%
  filter(did < 2000) %>% # Exclude Tokyo, Osaka, Kanagawa
  ggplot(aes(did, age18)) +
  geom_point() +
  stat_smooth(method = lm) +
  ggrepel::geom_text_repel(aes(label = pref),
            size = 2, 
            family = "HiraKakuPro-W3") +
  labs(x = "Degree of Urbanization", y = "Voting Rate of 18-year-olds") +
  ggtitle("Voting Rate in the 2016 House of Councillors Election (Excluding Tokyo, Osaka, Kanagawa)") +
  theme_bw(base_family = "HiraKakuProN-W3")

  • This makes the graph much easier to read.
  • However, the downside is that the three outliers are left out.

(2) Scatter Plot with Logarithmic Transformation of ‘Degree of Urbanization (did)’.

  • To improve readability, transform the x-axis ‘Degree of Urbanization (did)’ to a logarithmic scale.
  • Specify x = log(did).

hc2016 %>%
  ggplot(aes(log(did), age18)) +
  geom_point() +
  stat_smooth(method = lm) +
  geom_text(aes(y = age18 + 0.7, 
                label = pref),
            size = 2, 
            family = "HiraKakuPro-W3") +
  labs(x = "Degree of Urbanization (Log Transformed)", y = "Voting Rate of 18-year-olds") +
  ggtitle("Voting Rate in the 2016 House of Councillors Election (Degree of Urbanization Log Transformed)") +
  theme_bw(base_family = "HiraKakuProN-W3")

  • This significantly improves the readability of the graph.
  • It has the advantage of including all cases.
  • However, there is a downside: if conducting regression analysis, the interpretation of coefficients is not straightforward.
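
To make the interpretation point concrete: in a model y = a + b × log(x), the coefficient b is the change in y for a one-unit increase in log(x), so a 1% increase in x corresponds to a change of b × log(1.01), roughly b/100. The sketch below uses simulated data (not the election data; the true slope of 3 and all variable names are arbitrary choices for illustration) to show the calculation.

```r
set.seed(123)

# Simulated data: y depends linearly on log(x) with a true slope of 3
x <- exp(runif(200, min = 4, max = 9))
y <- 40 + 3 * log(x) + rnorm(200)

fit <- lm(y ~ log(x))
b <- unname(coef(fit)["log(x)"])

b               # estimated change in y per one-unit increase in log(x)
b * log(1.01)   # approximate change in y when x increases by 1%
b * log(2)      # change in y when x doubles
```

With did on the log scale, the same reading would apply: the coefficient describes the change in the voting rate per relative (percentage) change in population density, not per unit of density.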

Placing Prefecture Names in Boxes.

  • There is also a method to place prefecture names in boxes using the geom_label_repel() function.
hc2016 %>%
  ggplot(aes(log(did), age18)) +
  geom_point() +
  stat_smooth(method = lm) +
  ggrepel::geom_label_repel(aes(label = pref),
            size = 2, 
            family = "HiraKakuPro-W3") +
  labs(x = "Degree of Urbanization (Log Transformed)", y = "Voting Rate of 18-year-olds") +
  ggtitle("Voting Rate in the 2016 House of Councillors Election (Degree of Urbanization Log Transformed)") +
  theme_bw(base_family = "HiraKakuProN-W3")

Summary of the 24th House of Councillors Election Data (2016)
・There is a very weak negative correlation between ‘voting rate by prefecture’ and ‘degree of urbanization’.
→ Almost no correlation.
・However, there is a positive correlation between ‘voting rate of 18-year-olds’ and ‘degree of urbanization’.

5.2.2 Highlighting Specific Points with Label Display

  • It is possible to highlight only the data that meets certain conditions and also display the names of the prefectures.
  • In this case, we will highlight and display only the six prefectures of the Tohoku region.
hc2016 %>%
  ggplot(aes(log(did), age18)) +
  geom_point() +
  geom_text(aes(y = age18 + 0.7, 
                label = pref),
                size = 2, 
    family = "HiraKakuPro-W3") +
  labs(x = "Degree of Urbanization (Log Transformed)", 
    y = "Voting Rate of 18-year-olds") +
  ggtitle("Voting Rate of 18-year-olds (2016 HC Election)") +
  theme_bw(base_family = "HiraKakuProN-W3") +
  gghighlight::gghighlight(
    pref == "青森"|
      pref == "秋田"|
      pref == "岩手"| 
      pref == "山形"| 
      pref == "宮城"| 
      pref == "福島")
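
The chain of pref == ... | conditions can be written more compactly with %in%. A small sketch with a made-up vector of prefecture names (pref_example and tohoku are illustrative names, not objects from the data) showing that the two forms select the same elements:

```r
# Made-up example vector (not the full data)
pref_example <- c("青森", "東京", "秋田", "大阪", "福島")

# The six Tohoku prefectures
tohoku <- c("青森", "秋田", "岩手", "山形", "宮城", "福島")

# Chain of == joined by |, as in the code above
by_or <- pref_example == "青森" | pref_example == "秋田" | pref_example == "岩手" |
  pref_example == "山形" | pref_example == "宮城" | pref_example == "福島"

# Equivalent test with %in%
by_in <- pref_example %in% tohoku

identical(by_or, by_in)  # TRUE
by_in                    # TRUE FALSE TRUE FALSE TRUE
```

Since gghighlight() is already used with %in% in section 4.5, gghighlight::gghighlight(pref %in% tohoku) should highlight the same six prefectures.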

  • If stat_smooth(method = lm) is added, it’s possible to simultaneously display the regression line for all prefectures and a separate regression line just for the six Tohoku prefectures.
hc2016 %>%
  ggplot(aes(log(did), age18)) +
  geom_point() +
  stat_smooth(method = lm) +
  geom_text(aes(y = age18 + 0.7, 
                label = pref),
            size = 2, 
            family = "HiraKakuPro-W3") +
  labs(x = "Degree of Urbanization (Log Transformed)", y = "Voting Rate of 18-year-olds") +
  ggtitle("Voting Rate in the 2016 House of Councillors Election (Degree of Urbanization Log Transformed)") +
  theme_bw(base_family = "HiraKakuProN-W3") +
  gghighlight::gghighlight(
    pref == "青森"|
      pref == "秋田"|
      pref == "岩手"| 
      pref == "山形"| 
      pref == "宮城"| 
      pref == "福島")

6. Scatterplot of Political Polarization in the US House of Representatives

6.1 Data Preparation

  • Use data on the ideological positions (ideal points) of all members of the US House of Representatives, estimated from their votes on bills, from the 80th Congress (1947-1948) to the 112th Congress (2011-2012).

DW-NOMINATE Score

  • dwnom1 (x-axis): Economic Issues・・・-1 (Liberal) to 1 (Conservative)
  • dwnom2 (y-axis): Racial Issues・・・-1 (Liberal) to 1 (Conservative)
  • Sample size: 14552

  • Download the data on members of the US House of Representatives (congress.csv).
  • Load the data and name it congress
congress <- read_csv("data/congress.csv")
  • Check the variables in congress.
names(congress)
[1] "congress" "district" "state"    "party"    "name"     "dwnom1"   "dwnom2"  
  • Display the first and last rows of the data.
head(congress)
# A tibble: 6 × 7
  congress district state   party    name          dwnom1 dwnom2
     <dbl>    <dbl> <chr>   <chr>    <chr>          <dbl>  <dbl>
1       80        0 USA     Democrat TRUMAN      -0.276   0.0160
2       80        1 ALABAMA Democrat BOYKIN  F.  -0.0260  0.796 
3       80        2 ALABAMA Democrat GRANT  G.   -0.0420  0.999 
4       80        3 ALABAMA Democrat ANDREWS  G. -0.00800 1.00  
5       80        4 ALABAMA Democrat HOBBS  S.   -0.0820  1.07  
6       80        5 ALABAMA Democrat RAINS  A.   -0.170   0.870 
tail(congress)
# A tibble: 6 × 7
  congress district state   party      name     dwnom1   dwnom2
     <dbl>    <dbl> <chr>   <chr>      <chr>     <dbl>    <dbl>
1      112        4 WISCONS Democrat   MOORE    -0.538 -0.458  
2      112        5 WISCONS Republican SENSENBR  1.20  -0.438  
3      112        6 WISCONS Republican PETRI     0.776 -0.00300
4      112        7 WISCONS Republican DUFFY     0.781 -0.270  
5      112        8 WISCONS Republican RIBBLE    0.886 -0.193  
6      112        1 WYOMING Republican LUMMIS    0.932 -0.211  
  • Use the filter() function to extract data for each Congress session.
eighty <- congress %>% 
  filter(congress == 80)  # 80th Congress

twelve <- congress %>% 
  filter(congress == 112) # 112th Congress
  • Display the beginning of the extracted data for the 80th Congress.
head(eighty)
# A tibble: 6 × 7
  congress district state   party    name          dwnom1 dwnom2
     <dbl>    <dbl> <chr>   <chr>    <chr>          <dbl>  <dbl>
1       80        0 USA     Democrat TRUMAN      -0.276   0.0160
2       80        1 ALABAMA Democrat BOYKIN  F.  -0.0260  0.796 
3       80        2 ALABAMA Democrat GRANT  G.   -0.0420  0.999 
4       80        3 ALABAMA Democrat ANDREWS  G. -0.00800 1.00  
5       80        4 ALABAMA Democrat HOBBS  S.   -0.0820  1.07  
6       80        5 ALABAMA Democrat RAINS  A.   -0.170   0.870 

6.2 Scatterplot of Data for the 80th Congress

  • Let’s draw a scatter plot of data between economic issues (x-axis) and racial issues (y-axis) for the 80th Congress.
eighty %>% 
  ggplot(aes(x = dwnom1, y = dwnom2)) +
  geom_point(aes(color = party)) +
  labs(x = "Economic Issues(dwnom1)",
       y = "Racial Issues(dwnom2)") +
  ggtitle("US 80th Congress") +
  theme_bw(base_family = "HiraKakuProN-W3") 

  • Regarding racial issues (y-axis), Democratic members tend to be more conservative, resulting in a concentration towards the upper side (+) of the graph.
  • Regarding economic issues (x-axis), Republican members tend to be more conservative, resulting in a concentration towards the right side (+) of the graph.

6.3 Scatterplot of Data for the 112th Congress

  • Draw the same scatter plot for the 112th Congress.
twelve %>% 
  ggplot(aes(x = dwnom1, y = dwnom2)) +
  geom_point(aes(color = party)) +
  labs(x = "Economic Issues(dwnom1)",
       y = "Racial Issues(dwnom2)") +
  ggtitle("US 112th Congress") +
  theme_bw(base_family = "HiraKakuProN-W3") 

Analysis Results of the 80th/112th US House of Representatives Data (1947-2012)

Compared to the 80th Congress, in the 112th Congress:

  • Regarding racial issues (y-axis), both Democratic and Republican members are concentrated around 0.
  • The difference between Democratic and Republican members regarding racial issues (y-axis) has disappeared.
    → The difference in views on racial issues no longer serves as a significant factor in distinguishing between Democrats and Republicans.
  • Regarding economic issues (x-axis), Republican members are more conservative, resulting in a concentration on the right side (+) of the graph.
    → The difference in views on economic issues has become more important in distinguishing between Democrats and Republicans.
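
The visual comparison above could be backed up by computing the gap between the party means on each dimension. The sketch below uses a small toy data frame with made-up scores (toy_congress is NOT the real congress.csv, and the numbers carry no substantive meaning); it only illustrates the group_by() and summarise() pattern, with party_gap, gap_dwnom1, and gap_dwnom2 as illustrative names.

```r
library(dplyr)

# Toy data with made-up scores (NOT the real congress.csv)
toy_congress <- data.frame(
  congress = c(80, 80, 80, 80, 112, 112, 112, 112),
  party    = c("Democrat", "Democrat", "Republican", "Republican",
               "Democrat", "Democrat", "Republican", "Republican"),
  dwnom1   = c(-0.2, 0.0, 0.2, 0.4, -0.5, -0.3, 0.6, 0.8),
  dwnom2   = c(0.8, 0.4, -0.1, 0.1, 0.0, -0.1, 0.0, 0.1)
)

# Republican mean minus Democratic mean, for each Congress and each dimension
party_gap <- toy_congress %>%
  group_by(congress) %>%
  summarise(
    gap_dwnom1 = mean(dwnom1[party == "Republican"]) - mean(dwnom1[party == "Democrat"]),
    gap_dwnom2 = mean(dwnom2[party == "Republican"]) - mean(dwnom2[party == "Democrat"])
  )

party_gap
```

Replacing toy_congress with congress would apply the same calculation to the real data, assuming party takes the values "Democrat" and "Republican" as in the head() output; any other party labels would simply be ignored by this calculation.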

7. Exercise

Q7.1:

  • Refer to 1.3 Customizing the Shape of Dots and draw a scatter plot of ‘election expenses’ and ‘vote percentage’ in the 2009 House of Representatives election.
  • Use shape = 23 to display a diamond (◇) with a ‘yellow’ inside and a ‘magenta’ outline.

Q7.2:

  • Refer to 2.2 Adding Dimension by Changing Dot Colors and draw a scatter plot of ‘election expenses’ and ‘vote percentage’ in the 2009 House of Representatives election.
  • Distinguish between candidates from the Liberal Democratic Party and others by using different colors.

Q7.3:

  • Refer to 2.3 Specifying Dot Colors and draw a scatter plot of ‘election expenses’ and ‘vote percentage’ in the 2009 House of Representatives election.
  • Color the Liberal Democratic Party candidates in ‘red’ and all other candidates in ‘grey’.

Q7.4:

  • Q1: Refer to 3. Scatter Plot with Regression Line (1) and draw a scatter plot of ‘election expenses’ and ‘vote percentage’ in the 2005 House of Representatives election.
  • Use the facet_wrap() function to display scatter plots for each political party, ensuring that the Liberal Democratic Party and the Democratic Party are displayed next to each other.
  • Q2: Use the group_by() function to calculate the vote percentage by political party and display the results using the DT::datatable() function.

Q7.5:

  • Q1: Refer to 4.5 Highlighting Specific Countries using {gapminder} to display a scatter plot of ‘log value of GDP per capita (USD)’ and ‘life expectancy’.
  • Choose three countries of interest and display them in different colors.
  • Q2: Summarize concisely what can be understood from the above graph.

Q7.6:

  • Q1: Refer to 5.2 Adding dimensions to a scatter plot and draw a scatter plot for the 24th (2016) House of Councillors election, with ‘population density of prefectures (did)’ on the x-axis and ‘voting rate of 19-year-old voters (age19)’ on the y-axis.
  • Apply logarithmic transformation to variables as needed.
  • Q2: Summarize concisely what can be understood from the above graph.