• R packages we use in this section
library(tidyverse)
library(broom)
library(patchwork)
library(DT)
library(ggbeeswarm)
library(ggsignif)
library(rcompanion)
library(rmarkdown)

1. Why do we conduct a t-test?

  • A t-test is a statistical test that is used to compare the means of two groups.
  • It is often used in hypothesis testing to determine whether a process or treatment actually has an effect on the population of interest, or whether two groups differ from each other.
  • The Central Limit Theorem states that even if the original variable is not normally distributed, the distribution of the sample mean approaches a normal distribution as the sample size increases.
    → This is the theoretical foundation that lets us conduct statistical inference and hypothesis tests using a normal distribution.
  • However, we do not always have access to a large sample.
  • The t-test offers a solution to this small-N problem in hypothesis testing: it uses the t-distribution, whose heavier tails account for the extra uncertainty of estimating the standard deviation from few observations.
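As a rough illustration of the small-N problem, the sketch below compares the two-tailed 5% critical value of the t-distribution with that of the standard normal distribution. The sample sizes are arbitrary choices for illustration, and only base R is used.

```r
# Two-tailed 5% critical values: t-distribution vs. standard normal
n <- c(5, 10, 30, 100, 1000)      # illustrative sample sizes (assumption)
t_crit <- qt(0.975, df = n - 1)   # t critical value with n - 1 degrees of freedom
z_crit <- qnorm(0.975)            # normal critical value (about 1.96)

data.frame(n = n,
           t_crit = round(t_crit, 3),
           z_crit = round(z_crit, 3))
```

The t critical value is noticeably larger than the normal one for small n and approaches it as n grows, which is why the normal approximation is only safe with large samples.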

2. What type of t-test should we use?

When you conduct a t-test, you need to consider two things:
(1) whether the groups being compared come from a single population or from two different populations
(2) whether you want to test for a difference in a specific direction or in both directions

One-sample, two-sample, or paired t-test?

  • If the groups come from a single population (for example, measuring the same people before and after a medical treatment), perform a paired t-test.
  • If the groups come from two different populations (for example, comparing which hamburger tastes better between McDonald’s and In-N-Out Burger, with different people rating each), perform a two-sample t-test.
  • If there is one group being compared against a standard value (for example, comparing your test score to the average test score of your friends in the same class), perform a one-sample t-test.
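The three cases map onto three ways of calling t.test(). The sketch below uses made-up scores (all vectors and the reference value 65 are hypothetical, for illustration only) and prints the name of the test that R actually ran.

```r
# Hypothetical scores, made up for illustration only
before  <- c(62, 70, 58, 75, 66, 71, 64, 69)  # same people, first measurement
after   <- c(66, 74, 57, 80, 70, 73, 69, 72)  # same people, second measurement
group_a <- c(62, 70, 58, 75, 66, 71, 64, 69)  # independent group A
group_b <- c(68, 77, 61, 79, 73, 70, 75, 74)  # independent group B

paired_res <- t.test(after, before, paired = TRUE)  # paired t-test
two_res    <- t.test(group_a, group_b)              # two-sample (Welch) t-test
one_res    <- t.test(group_a, mu = 65)              # one-sample t-test against 65

# The method element records which test was run
c(paired_res$method, two_res$method, one_res$method)
```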

One-tailed or two-tailed t-test?

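  • If you only care whether the two population means are different from one another, perform a two-tailed t-test.
  • If you want to know whether one population mean is greater than (or less than) the other, perform a one-tailed t-test.
  • A one-tailed t-test puts the whole rejection region in one tail, so it rejects the null hypothesis more easily in the hypothesized direction but cannot detect a difference in the opposite direction. Choose the direction before looking at the data.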
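The choice is made through the alternative argument of t.test(). The sketch below uses made-up scores (x and y are hypothetical) to show how the p-value changes with the alternative hypothesis while the data stay the same.

```r
# Hypothetical scores, made up for illustration only
x <- c(82, 85, 78, 90, 88, 84, 79, 86)
y <- c(80, 79, 75, 85, 83, 82, 77, 80)

two_sided <- t.test(x, y, alternative = "two.sided")  # H1: the means differ
greater   <- t.test(x, y, alternative = "greater")    # H1: mean of x > mean of y
less      <- t.test(x, y, alternative = "less")       # H1: mean of x < mean of y

c(two_sided = two_sided$p.value,
  greater   = greater$p.value,
  less      = less$p.value)
```

Because the sample mean of x is larger than that of y here, the "greater" p-value is exactly half the two-sided one, while the "less" p-value is close to 1.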

3. How to show the results of a t-test

  • Four ways of showing the result of a t-test

How to show the results        Features
(1) Simple output              Rich information, but hard to read
(2) Boxplot/violin plot        Easy to see the result and its statistical significance
(3) Bar chart                  Easy to see
(4) Show the difference        Easy to see
  • Recently, researchers in academia tend to prefer (3) and (4)

4. Paired data

  • Paired-samples t-test
  • Respondents 1 to 10 (10 people) eat both a Mos burger and a McDonald’s burger
  • Respondents 1 to 10 (10 people) rate each burger on a 0-100 scale.
A sample: (mos_mc.csv)
  • Sample means: Mos (80.5), Mc (79.5) => Mos scores higher in this sample
  • Is this also the case in the population?
  • We need to conduct a t-test to find out.

4.1 Load mos_mc.csv

df_mos_mc <- read_csv("data/mos_mc_paired.csv")
  • Check the data
DT::datatable(df_mos_mc)
  • The data in df_mos_mc are in wide format (one column per burger)
  • For analysis and plotting in R, wide format data should be converted to long format

4.2 Wide format => Long format

→ Using the tidyr::pivot_longer() function, we convert df_mos_mc to long format

df_long <- df_mos_mc %>% 
    tidyr::pivot_longer(mos:mc,
                 names_to = "burger",   
                 values_to = "score")   
  • Check df_long
DT::datatable(df_long)
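To make the reshaping step concrete without the CSV file, the sketch below types in the scores printed in Section 4.5.2 by hand and checks the dimensions before and after. The object names df_wide and df_long_demo are hypothetical.

```r
library(tidyr)
library(tibble)

# Scores as printed in Section 4.5.2, typed in by hand so the file is not needed
df_wide <- tibble(
  mos = c(80, 75, 80, 90, 85, 80, 75, 85, 85, 70),
  mc  = c(75, 70, 80, 85, 90, 75, 85, 80, 80, 75)
)

# Wide -> long: one row per (respondent, burger) combination
df_long_demo <- df_wide %>%
  pivot_longer(cols = c(mos, mc), names_to = "burger", values_to = "score")

dim(df_wide)       # 10 rows, 2 columns
dim(df_long_demo)  # 20 rows, 2 columns
```

Each wide row becomes two long rows (mos first, then mc), so the order of rows within each burger still follows the order of respondents.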

4.3 Visualization

4.3.1 Boxplot

  • In drawing a Box Plot, we use long format data
df_long %>% 
  mutate(burger = fct_inorder(burger)) %>% 
  ggplot(aes(x = burger, y = score)) +
  geom_boxplot() +
  scale_x_discrete(labels = c( "Mos Burger", "McDonald's")) +
  labs(x = "Shop names", y = "Evaluation")

4.3.2 Violin Plot

  • A violin plot is similar to a box plot, but it also shows the shape of the distribution
  • Researchers tend to use Violin Plot more often than Box Plot
df_long %>% 
  mutate(burger = fct_inorder(burger)) %>% 
  ggplot(aes(x = burger, y = score)) +
  geom_violin() + 
  scale_x_discrete(labels = c( "Mos Burger", "McDonald's")) +
  labs(x = "Shop names", y = "Evaluation")

4.3.3 Box Plot + Violin Plot

  • It is possible to simultaneously draw a Box Plot and Violin Plot
df_long %>% 
  mutate(burger = fct_inorder(burger)) %>% 
  ggplot(aes(x = burger, y = score, color = burger)) +
  geom_violin() + 
  geom_boxplot(width = .1) + # Set the width of Box Plot as 0.1  
  stat_summary(fun = mean, geom = "point") + # Show the mean as a point (fun replaces the deprecated fun.y)
  scale_x_discrete(labels = c( "Mos Burger", "McDonald's")) +
  labs(x = "Shop names", y = "Evaluation")

  • Show the descriptive statistics of the data for Mos Burger and McDonald’s
summary(df_mos_mc)
      mos              mc       
 Min.   :70.00   Min.   :70.00  
 1st Qu.:76.25   1st Qu.:75.00  
 Median :80.00   Median :80.00  
 Mean   :80.50   Mean   :79.50  
 3rd Qu.:85.00   3rd Qu.:83.75  
 Max.   :90.00   Max.   :90.00  
  • In this sample, we get the following results:
  • Sample means: Mos (80.5), Mc (79.5) => Mos scores higher in this sample
  • What we want to know:
    => Is this also the case in the population?
  • The difference may be a result we got by chance
  • To check this, we need to conduct a paired t-test

4.4 Process of t-test: by hand

  • Null hypothesis: \(H_0\):
    There is no difference in tastes between McDonald’s and Mos Burger
  • Alternative hypothesis: \(H_1\):
    There is a difference in tastes between McDonald’s and Mos Burger
  • Calculate t-value
  • Identify the critical value with which we reject the null hypothesis
  • Check if the t-value we get lies within the rejection area
  • Conclusion
  • Equation for calculating the t-value with paired data

    \[T = \frac{\bar{d} - d_0}{u_d / \sqrt{n}}\] where

    \[\bar{d} = \frac{\sum (x_i - y_i)}{n}\]

    • \(n\) : Sample size (10)

    • \(u_d\) : Unbiased standard deviation of the differences \(d_i = x_i - y_i\)

    • \(\bar{d}\) : Mean of the differences in evaluation between Mos Burger and McDonald’s

    • \(d_0\) : The mean difference assumed under the null hypothesis (0)

    • \(x_i\) : Evaluation for Mos Burger

    • \(y_i\) : Evaluation for McDonald’s

    • Using the equation above, we can calculate the t-value

    x <- df_mos_mc$mos   # evaluations for Mos Burger
    y <- df_mos_mc$mc    # evaluations for McDonald's
    d <- x - y           # paired differences
    t_value <- (mean(d) - 0) / (sd(d) / sqrt(length(d)))
    t_value
    [1] 0.557086
    • This is the t-value (0.557086) we calculated by hand
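Plugging the ten differences (Mos minus McDonald’s, as printed in Section 4.5.3) into the formula reproduces this value step by step. Intermediate values are rounded. First the mean of the differences:

\[\bar{d} = \frac{5 + 5 + 0 + 5 - 5 + 5 - 10 + 5 + 5 - 5}{10} = 1\]

Then the unbiased standard deviation of the differences:

\[\sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}} = \sqrt{\frac{290}{9}} \approx 5.677\]

And finally the t-value:

\[T = \frac{1 - 0}{5.677 / \sqrt{10}} \approx \frac{1}{1.795} \approx 0.557\]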

    • [Table: t-distribution critical values by significance level and degrees of freedom]

    • The table shows the critical values corresponding to several significance levels
    • The significance level is denoted \(α\)
    • When the absolute value of the t-value calculated from the sample data is larger than the critical value you chose, you reject the null hypothesis.
    • In this case, we want to know whether there is a difference in taste between McDonald’s and Mos Burger.
    • It is common to use a significance level of 5% (\(α = 0.05\)).
    • It is also common to use a two-tailed t-test (because what we want to know is NOT whether McDonald’s is tastier than Mos Burger, or vice versa).
      => What we want to know is whether McDonald’s differs from Mos Burger in taste.
      => We should use a two-tailed t-test.
      => A two-tailed test splits \(α = 0.05\) across both tails, so we look at the column for \(α/2 = 0.025\)
    • The number of pairs we use in this comparison is 10
      => degrees of freedom (v) = 10 - 1 = 9
      => We should look at the cell where the \(α = 0.025\) column and the \(v = 9\) row meet
    • The critical value for our t-test is 2.262
    • Because the test is two-tailed, we have two critical values (-2.262 and 2.262)

    Interpretation
    The t-value we calculated from our sample (0.557086) lies between -2.262 and 2.262
    => We cannot reject the null hypothesis
    => We cannot conclude that McDonald’s and Mos Burger differ in taste.
    The difference in sample means (Mos 80.5, Mc 79.5) could easily have arisen by chance, so we have no evidence that it holds in the population.
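The table lookup can also be done in R. The sketch below is one way to do it with base R's qt() and pt(), taking the hand-calculated t-value and the degrees of freedom from the text; the variable names are hypothetical.

```r
t_value <- 0.557086   # t-value calculated by hand above
dof     <- 10 - 1     # degrees of freedom
alpha   <- 0.05       # significance level

# Two-tailed critical value: upper alpha / 2 quantile of the t-distribution
t_crit <- qt(1 - alpha / 2, df = dof)
t_crit    # about 2.262

# Two-tailed p-value: probability of a t-value at least this extreme under H0
p_value <- 2 * pt(abs(t_value), df = dof, lower.tail = FALSE)
p_value   # about 0.5911

# Reject H0 only if |t| exceeds the critical value
abs(t_value) > t_crit
```

Comparing |t| with the critical value and comparing the p-value with α are two views of the same decision, so they always agree.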

    4.5 Process of t-test: by R

    4.5.1 In case you use long format data

    • Let’s use the data frame (df_long) we created in Section 4.2
    • Check df_long
    df_long
    # A tibble: 20 × 2
       burger score
       <chr>  <dbl>
     1 mos       80
     2 mc        75
     3 mos       75
     4 mc        70
     5 mos       80
     6 mc        80
     7 mos       90
     8 mc        85
     9 mos       85
    10 mc        90
    11 mos       80
    12 mc        75
    13 mos       75
    14 mc        85
    15 mos       85
    16 mc        80
    17 mos       85
    18 mc        80
    19 mos       70
    20 mc        75
    • Conduct a t-test
    t.test(df_long$score[df_long$burger == "mos"],
           df_long$score[df_long$burger == "mc"]) # unpaired is default  
    
        Welch Two Sample t-test
    
    data:  df_long$score[df_long$burger == "mos"] and df_long$score[df_long$burger == "mc"]
    t = 0.37354, df = 18, p-value = 0.7131
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -4.624301  6.624301
    sample estimates:
    mean of x mean of y 
         80.5      79.5 
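    Interpretation
    ・Null hypothesis: There is no difference in taste between McDonald’s and Mos Burger.
    ・This call uses the default, an unpaired (Welch) t-test, so it treats the two sets of scores as independent groups and ignores the pairing.
    ・See t = 0.37354 and p-value = 0.7131 in the output
    => We cannot reject the null hypothesis because 0.7131 is larger than 0.05
    => We cannot conclude that McDonald’s and Mos Burger differ in taste.
    ・Because the same 10 respondents rated both burgers, the paired t-test (paired = TRUE, see Section 4.5.2) is the appropriate test here; it gives t = 0.55709 and p-value = 0.5911, which leads to the same conclusion.
    The difference in sample means (Mos 80.5, Mc 79.5) could easily have arisen by chance, so we have no evidence that it holds in the population.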
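To see why the paired and unpaired tests give different t-values on the same scores, the sketch below runs both and computes each standard error by hand. The scores are typed in from the output printed in Section 4.5.2; the object names are hypothetical.

```r
# Scores as printed in Section 4.5.2
mos <- c(80, 75, 80, 90, 85, 80, 75, 85, 85, 70)
mc  <- c(75, 70, 80, 85, 90, 75, 85, 80, 80, 75)

welch_res  <- t.test(mos, mc)                 # default: unpaired (Welch)
paired_res <- t.test(mos, mc, paired = TRUE)  # uses within-respondent differences

# Same mean difference (1 point), but different standard errors
se_welch  <- sqrt(var(mos) / length(mos) + var(mc) / length(mc))
se_paired <- sd(mos - mc) / sqrt(length(mos))

round(c(t_welch   = unname(welch_res$statistic),
        t_paired  = unname(paired_res$statistic),
        se_welch  = se_welch,
        se_paired = se_paired), 4)
```

The numerator is the same in both tests. The paired test divides by the standard error of the differences, which is smaller here because each respondent's two scores tend to move together.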
    • Using the ggsignif::geom_signif() function, we can draw a box plot annotated with the p-value of the test
    • geom_signif() treats the data as unpaired by default; the test.args argument controls this
    ・If you use paired data → test.args = list(paired = TRUE)
    ・If you use unpaired data → test.args = list(paired = FALSE), or simply omit test.args

    df_long %>% 
      mutate(burger = fct_inorder(burger)) %>% 
      ggplot(aes(x = burger, y = score, color = burger)) +
      geom_violin() +
      geom_boxplot(width = .1) + 
      stat_summary(fun = mean, geom = "point") + # fun replaces the deprecated fun.y
      scale_x_discrete(labels = c( "MOS Burger", "McDonald's")) +
      labs(x = "Store", y = "Evaluation") +
        ggsignif::geom_signif(comparisons = combn(sort(unique(df_long$burger)), 2, FUN = list),
                              test = "t.test",
                              test.args = list(paired = TRUE), 
                              na.rm = T,
                              step_increase = 0.1)

    → For unpaired data, delete test.args = list(paired = TRUE)

    4.5.2 In case you use wide format data

    df_mos_mc
    # A tibble: 10 × 2
         mos    mc
       <dbl> <dbl>
     1    80    75
     2    75    70
     3    80    80
     4    90    85
     5    85    90
     6    80    75
     7    75    85
     8    85    80
     9    85    80
    10    70    75
    t.test(df_mos_mc$mos,
           df_mos_mc$mc, 
           paired = TRUE) 
    
        Paired t-test
    
    data:  df_mos_mc$mos and df_mos_mc$mc
    t = 0.55709, df = 9, p-value = 0.5911
    alternative hypothesis: true mean difference is not equal to 0
    95 percent confidence interval:
     -3.060696  5.060696
    sample estimates:
    mean difference 
                  1 

    4.5.3 t-test using One Sample t-test

    • Using a one-sample t-test on the differences, we can reproduce the paired t-test of Section 4.5.2

    • Calculate the difference (diff) between each respondent's scores for Mos Burger and McDonald’s

    • Here, we use wide format data: df_mos_mc

    • Null hypothesis: mean of diff = 0
      There is no difference in taste between Mos Burger and McDonald’s

    diff <- df_mos_mc$mos - df_mos_mc$mc
    diff
     [1]   5   5   0   5  -5   5 -10   5   5  -5
    • Calculate mean of diff
    mean(diff)
    [1] 1
    t.test(diff)
    
        One Sample t-test
    
    data:  diff
    t = 0.55709, df = 9, p-value = 0.5911
    alternative hypothesis: true mean is not equal to 0
    95 percent confidence interval:
     -3.060696  5.060696
    sample estimates:
    mean of x 
            1 
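The equivalence can be checked programmatically rather than by reading the two printouts. The sketch below re-enters the scores printed in Section 4.5.2 and compares the elements of the two result objects; the object names are hypothetical.

```r
# Scores as printed in Section 4.5.2
mos <- c(80, 75, 80, 90, 85, 80, 75, 85, 85, 70)
mc  <- c(75, 70, 80, 85, 90, 75, 85, 80, 80, 75)

paired_res <- t.test(mos, mc, paired = TRUE)  # paired t-test
one_res    <- t.test(mos - mc, mu = 0)        # one-sample t-test on the differences

# t-value, p-value and confidence interval should all match
all.equal(unname(paired_res$statistic), unname(one_res$statistic))
all.equal(paired_res$p.value, one_res$p.value)
all.equal(as.numeric(paired_res$conf.int), as.numeric(one_res$conf.int))
```

All three comparisons return TRUE: a paired t-test is a one-sample t-test applied to the within-pair differences.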

    5. Unpaired data

    • Unpaired-samples t-test
    • 20 people eat either a Mos burger or a McDonald’s burger.
    • Respondents 1 to 10 eat a Mos burger.
    • Respondents 11 to 20 eat a McDonald’s burger.
    • These 20 respondents rate the burger they ate on a 0-100 scale.
    A sample: (mos_mc.csv)
    • Sample means: Mos (80.5), Mc (79.5) => Mos scores higher in this sample
    • Is this also the case in the population?
    • We need to conduct a t-test to find out.

    Summary: paired and unpaired data

    Type of data    Details
    Paired          10 people eat both burgers
    Unpaired        10 people eat Mos and the other 10 people eat Mc burgers

    5.1 mos_mc_paired.csv

    df_mos_mc_paired <- read_csv("data/mos_mc_paired.csv")
    • Check the data
    DT::datatable(df_mos_mc_paired)

    5.2 Wide format => Long format

    • The data in df_mos_mc_paired are in wide format
    • For analysis and plotting in R, wide format data should be converted to long format
      → Using the tidyr::pivot_longer() function, we convert df_mos_mc_paired to long format
    df_long <- df_mos_mc_paired %>% 
        tidyr::pivot_longer(mos:mc,
                     names_to = "burger",  
                     values_to = "score") 
    • Check df_long
    DT::datatable(df_long)

    5.3 t-test by R

    5.3.1 In case you use long format data

    t.test(df_long$score[df_long$burger == "mos"],
           df_long$score[df_long$burger == "mc"]) # unpaired is default 
    
        Welch Two Sample t-test
    
    data:  df_long$score[df_long$burger == "mos"] and df_long$score[df_long$burger == "mc"]
    t = 0.37354, df = 18, p-value = 0.7131
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -4.624301  6.624301
    sample estimates:
    mean of x mean of y 
         80.5      79.5 
    Interpretation
    ・Null hypothesis: There is no difference in taste between McDonald’s and Mos Burger.
    ・See t = 0.37354 and p-value = 0.7131 in the output
    ・This is the unpaired (Welch) t-test, so the t-value differs from the paired t-value (0.557086) we calculated by hand in Section 4.4
    => We cannot reject the null hypothesis because 0.7131 is larger than 0.05
    => We cannot conclude that McDonald’s and Mos Burger differ in taste.
    The difference in sample means (Mos 80.5, Mc 79.5) could easily have arisen by chance, so we have no evidence that it holds in the population.
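For comparison with the hand calculation of the paired test in Section 4.4, the sketch below computes the Welch t-value, its Welch-Satterthwaite degrees of freedom and the p-value by hand. The scores are those printed in Section 4.5.2, and the variable names are hypothetical.

```r
# Scores as printed in Section 4.5.2, treated here as two independent groups
mos <- c(80, 75, 80, 90, 85, 80, 75, 85, 85, 70)
mc  <- c(75, 70, 80, 85, 90, 75, 85, 80, 80, 75)

n_x <- length(mos)
n_y <- length(mc)
v_x <- var(mos) / n_x   # squared standard error of the Mos mean
v_y <- var(mc) / n_y    # squared standard error of the Mc mean

# Welch t-value: difference in means over the combined standard error
t_welch <- (mean(mos) - mean(mc)) / sqrt(v_x + v_y)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_x + v_y)^2 / (v_x^2 / (n_x - 1) + v_y^2 / (n_y - 1))

# Two-tailed p-value
p_welch <- 2 * pt(abs(t_welch), df = df_welch, lower.tail = FALSE)

round(c(t = t_welch, df = df_welch, p = p_welch), 4)
```

The two groups happen to have the same sample variance and the same size, so the approximate degrees of freedom come out at exactly 18, matching the t.test() output.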

    5.3.2 Visualize the results

    • Using the ggsignif::geom_signif() function, we can draw a box plot + violin plot annotated with the p-value of the test
    • geom_signif() treats the data as unpaired by default, so test.args = list(paired = TRUE) is not needed here
    df_long %>% 
      mutate(burger = fct_inorder(burger)) %>% 
      ggplot(aes(x = burger, y = score, fill = burger)) +
      geom_violin() + 
      geom_boxplot(width = .1) + # Set the width of the box plot to 0.1
      stat_summary(fun = mean, geom = "point") + # Show the mean as a point
        ggbeeswarm::geom_beeswarm() +
      scale_x_discrete(labels = c( "Mos Burger", "McDonald's")) +
      labs(x = "Shop names", y = "Evaluation") +
        ggsignif::geom_signif(comparisons = combn(sort(unique(df_long$burger)), 2, FUN = list),
                              test = "t.test",
                              na.rm = T,
                              step_increase = 0.1)

    5.3.3 In case you use wide format data

    df_mos_mc
    # A tibble: 10 × 2
         mos    mc
       <dbl> <dbl>
     1    80    75
     2    75    70
     3    80    80
     4    90    85
     5    85    90
     6    80    75
     7    75    85
     8    85    80
     9    85    80
    10    70    75
    t.test(df_mos_mc$mos,
           df_mos_mc$mc) # unpaired is default 
    
        Welch Two Sample t-test
    
    data:  df_mos_mc$mos and df_mos_mc$mc
    t = 0.37354, df = 18, p-value = 0.7131
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -4.624301  6.624301
    sample estimates:
    mean of x mean of y 
         80.5      79.5 
    • We get exactly the same result as with the long format data above
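With long format data, t.test() also accepts a formula, which avoids the manual subsetting. The sketch below rebuilds the 20 long format rows printed in Section 4.5.1 (the name df_long_demo is hypothetical) and compares the formula call with the two-vector call.

```r
# Long format rows as printed in Section 4.5.1 (mos and mc alternate)
df_long_demo <- data.frame(
  burger = rep(c("mos", "mc"), times = 10),
  score  = c(80, 75, 75, 70, 80, 80, 90, 85, 85, 90,
             80, 75, 75, 85, 85, 80, 85, 80, 70, 75)
)

# Formula interface: outcome ~ grouping variable
res_formula <- t.test(score ~ burger, data = df_long_demo)

# Two-vector interface, as used in the text
res_vectors <- t.test(df_long_demo$score[df_long_demo$burger == "mos"],
                      df_long_demo$score[df_long_demo$burger == "mc"])

c(formula = unname(res_formula$statistic),
  vectors = unname(res_vectors$statistic))
```

The formula interface orders the groups alphabetically (mc before mos), so it tests mc minus mos and the t-value changes sign. The p-value is the same either way.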

    6. Visualize the results by Bar Chart

    • Recently, researchers tend to show their t-test results as bar charts in major academic journals in the social sciences

    • We use hypothetical survey data on Mos Burger and McDonald’s.

    • Unpaired-samples t-test

    • 20 people eat the fried potatoes of either Mos Burger or McDonald’s.

    • Respondents 1 to 10 eat Mos Burger’s fried potatoes.

    • Respondents 11 to 20 eat McDonald’s fried potatoes.

    • These 20 respondents rate them on a 0-100 scale.

    • What we want to know here:

    Whose fried potatoes taste better, Mos Burger’s or McDonald’s?

    6.1 Load menu.csv

    df_menu <- read_csv("data/menu.csv")
    • Check the data
    DT::datatable(df_menu)
    • df_menu is already in long format
      => We can use the data as it is

    6.2 t-test by R (long format data)

    • Unpaired-samples t-test
    potato_ttest <- t.test(fried_potato ~ mosmc, data = df_menu)
    potato_ttest
    
        Welch Two Sample t-test
    
    data:  fried_potato by mosmc
    t = 4.4463, df = 15.32, p-value = 0.0004489
    alternative hypothesis: true difference in means between group mc and group mos is not equal to 0
    95 percent confidence interval:
     3.233228 9.166772
    sample estimates:
     mean in group mc mean in group mos 
                 80.9              74.7 
    Interpretation
    ・Null hypothesis: There is no difference in taste between McDonald’s fried potatoes and Mos Burger’s fried potatoes.
    ・See t = 4.4463 and p-value = 0.0004489 in the output
    => We can reject the null hypothesis because 0.0004489 is far smaller than 0.05
    => There IS a difference between them.
    The difference in sample means (Mc 80.9, Mos 74.7) is unlikely to be due to chance alone. The 95% confidence interval for the difference (3.2 to 9.2) lies entirely above 0, so McDonald’s fried potatoes are rated higher than Mos Burger’s.

    6.3 Preparation to draw a bar chart

    • Make a function and prepare for a data to draw a bar chart
    • Make a function to define 95% confidence intervals (mean_ci)
    mean_ci <- function(data, by, vari){
        # Standard error of the mean
        se <- function(x) sqrt(var(x)/length(x))
        meanci <- data %>% 
            group_by({{by}}) %>%
            summarise(n = n(),
                      mean_out = mean({{vari}}),
                      se_out = se({{vari}}),
                      .groups = "drop"
                      ) %>%
            # 95% confidence interval using the normal approximation (1.96)
            mutate(
                lwr = mean_out - 1.96 * se_out,
                upr = mean_out + 1.96 * se_out
                ) %>%
            mutate(across(where(is.double), ~ round(.x, 1))) %>%
            mutate(mean_label = format(round(mean_out, 1), nsmall = 1)) %>% 
            select({{by}}, mean_out, lwr, upr, mean_label) %>% 
            mutate(across(.cols = {{by}}, as.factor))
        return(meanci)
    }
    • Using the function made above, let’s calculate the mean and 95% confidence intervals of fried potatoes for Mos Burger and McDonald’s
    potato_mean <- df_menu %>% 
      mean_ci(mosmc, fried_potato)
    
    potato_mean
    # A tibble: 2 × 5
      mosmc mean_out   lwr   upr mean_label
      <fct>    <dbl> <dbl> <dbl> <chr>     
    1 mc        80.9  79.4  82.4 80.9      
    2 mos       74.7  72.4  77   74.7      
    • lwr : the lower bound of the 95% confidence interval

    • upr : the upper bound of the 95% confidence interval

    • mean_label : the label shown on the bar chart
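The bounds from mean_ci() use the normal multiplier 1.96. With only 10 observations per group, a t-based multiplier gives a somewhat wider interval. The sketch below compares the two on the Mos burger scores printed in Section 4.5.2, reused here only because the contents of menu.csv are not shown. The variable names are hypothetical.

```r
# Mos burger scores as printed in Section 4.5.2 (stand-in data for illustration)
mos <- c(80, 75, 80, 90, 85, 80, 75, 85, 85, 70)

n  <- length(mos)
se <- sd(mos) / sqrt(n)   # standard error of the mean

# Normal approximation, as in mean_ci()
ci_normal <- mean(mos) + c(-1, 1) * 1.96 * se

# t-based alternative with n - 1 degrees of freedom
ci_t <- mean(mos) + c(-1, 1) * qt(0.975, df = n - 1) * se

round(rbind(normal = ci_normal, t_based = ci_t), 1)
```

The t-based bounds are the ones t.test(mos)$conf.int reports. Which multiplier to use for the error bars is a presentation choice, but it is worth stating in the figure note.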

    • Using the broom::tidy() function, we convert the t-test result into a tibble

    • We need the p-value to show on the bar chart

    • We define the conditions that map the p-value to a label (p_label)

    • We use long format data here (df_menu)

    potato_ttest <- t.test(fried_potato ~ mosmc, data = df_menu)
    potato_tidy <- tidy(potato_ttest) %>% 
      select(estimate, p.value, conf.low, conf.high) %>%  
      mutate(
        p_label = case_when(                           
          p.value <= 0.01 ~ "p < .01",                  
          p.value > 0.01 & p.value <= 0.05 ~ "p < .05", 
          p.value > 0.05 & p.value <= 0.1 ~ "p < .1",  
          p.value > 0.1 ~ "N.S"                        
          )
        )
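The labelling rule can be tried on its own before it is used in the pipeline. The sketch below wraps the same thresholds in a small helper (p_to_label is a hypothetical name, not part of the original code) and applies it to the two p-values reported in this document plus two made-up ones.

```r
library(dplyr)

# Hypothetical helper wrapping the labelling rule used above.
# case_when() checks the conditions in order and uses the first that is TRUE.
p_to_label <- function(p) {
  case_when(
    p <= 0.01 ~ "p < .01",
    p <= 0.05 ~ "p < .05",
    p <= 0.1  ~ "p < .1",
    p > 0.1   ~ "N.S"
  )
}

# 0.0004489 and 0.7131 come from the outputs above; 0.03 and 0.08 are made up
p_to_label(c(0.0004489, 0.03, 0.08, 0.7131))
```

Because the conditions are checked in order, the lower bounds written out in the original code (for example p.value > 0.01) are implied and can be omitted.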
    • Merge the two data frame, potato_mean and potato_tidy by using bind_cols() function
    • Name it df_potato
    • Make a new variable, mosmc
    • Make a new variable, p_label
    df_potato <- bind_cols(potato_mean, potato_tidy) %>% 
      mutate(
        mosmc = as.factor(if_else(mosmc == "mc", 
                                  "McDonalds", "Mos Burger")),
        p_label = if_else(mosmc == "McDonalds", p_label, NA_character_),
        menu = "Fried Potato"
        )
    • check df_potato
    df_potato
    # A tibble: 2 × 11
      mosmc      mean_out   lwr   upr mean_label estimate p.value conf.low conf.high
      <fct>         <dbl> <dbl> <dbl> <chr>         <dbl>   <dbl>    <dbl>     <dbl>
    1 McDonalds      80.9  79.4  82.4 80.9            6.2 4.49e-4     3.23      9.17
    2 Mos Burger     74.7  72.4  77   74.7            6.2 4.49e-4     3.23      9.17
    # ℹ 2 more variables: p_label <chr>, menu <chr>
    • Show the bar chart using the data generated
    pl_potato <- df_potato %>% 
      ggplot(aes(x = mosmc, y = mean_out, fill = mosmc)) +
      geom_bar(stat = "identity") +
      geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.3) +
      geom_label(aes(label = mean_label),
                 size = 7.5, position = position_stack(vjust = 0.5),
                 show.legend = FALSE, fill = "white") +
      geom_segment(aes(x = 1, y = 90, xend = 1, yend = 95)) +
      geom_segment(aes(x = 1, y = 95, xend = 2, yend = 95)) +
      geom_segment(aes(x = 2, y = 90, xend = 2, yend = 95)) +
      geom_text(aes(x = 1.5, y = 100, label = p_label), 
                size = 4.5, family = "Times New Roman", inherit.aes = FALSE) +
      scale_fill_manual(values = c("red", "green4")) +
      scale_y_continuous(expand = c(0, 0),
                         limits = c(0, 105)) +
      labs(x = NULL, y = "Average of Evaluation",
           title = "Comparing the Average Score of Fried Potato") +
      theme(legend.position = "none",
            plot.title = element_text(size = 12, hjust = 0.5),
            axis.title = element_text(size = 13),
            axis.text = element_text(size = 13))
    
    pl_potato

    7. Visualize the results by difference

    7.1 Data Preparation

    df_potato <- df_potato %>% 
      mutate(across(where(is.double), ~ round(.x, 1))) %>% 
      mutate(
        diff_x = "difference",  
        diff_label = format(round(estimate, 1),  
                            nsmall = 1)
        )
    • Check the data
    df_potato
    # A tibble: 2 × 13
      mosmc      mean_out   lwr   upr mean_label estimate p.value conf.low conf.high
      <fct>         <dbl> <dbl> <dbl> <chr>         <dbl>   <dbl>    <dbl>     <dbl>
    1 McDonalds      80.9  79.4  82.4 80.9            6.2       0      3.2       9.2
    2 Mos Burger     74.7  72.4  77   74.7            6.2       0      3.2       9.2
    # ℹ 4 more variables: p_label <chr>, menu <chr>, diff_x <chr>, diff_label <chr>
    • Visualize the average values with 95% confidence intervals
    pl_mean_poteto <- df_potato %>% 
      ggplot(aes(x = mosmc, 
                 y = mean_out, 
                 ymin = lwr, 
                 ymax = upr)) +
      geom_pointrange(size = 1) +
      geom_text(aes(label = mean_label), 
                size = 6.5, 
                nudge_x = .13) +  
      ylim(70, 100) +
      labs(x = NULL, y = NULL, title = "Average of Evaluation") +
      theme(plot.title  = element_text(hjust = 0.5, size = 16),
            axis.text = element_text(size = 17),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
            strip.background = element_blank(),
            strip.text.y = element_blank())

    7.2 Visualize the difference between Mos and Mc

    pl_diff_poteto <- df_potato %>% 
      ggplot(aes(x = diff_x, y = estimate)) +
      geom_hline(yintercept = 0, col = "red") +
      geom_pointrange(aes(ymin = conf.low, ymax = conf.high), size = 1) +
      geom_text(aes(label = diff_label), 
                size = 6.5, 
                nudge_x = .19) +  
      labs(x = NULL, y = NULL, title = "difference") +
      theme(plot.title  = element_text(hjust = 0.5, size = 16),
            axis.text = element_text(size = 17),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
            strip.text.y = element_text(size = 17),
            strip.background = element_blank())
    • Using the patchwork package, show the two plots side by side
    pl_mean_diff <- pl_mean_poteto + pl_diff_poteto + plot_layout(widths = c(3, 1))
    
    pl_mean_diff

    7.3 Boxplot and Violin (diff)

    df_menu %>% 
      ggplot(aes(mosmc, fried_potato, fill = mosmc)) + 
      geom_violin() +
      geom_boxplot(width = .1) + 
      stat_summary(fun = mean, geom = "point") + # fun replaces the deprecated fun.y
      # Named labels, so each label is matched to its group whatever the axis order
      scale_x_discrete(labels = c(mos = "Mos Burger", mc = "McDonald's")) +
      labs(x = "Shop Name", y = "Evaluation") +
      ggsignif::geom_signif(comparisons = combn(sort(unique(df_menu$mosmc)), 2, FUN = list),
                            test = "t.test", na.rm = TRUE,
                            step_increase = 0.1)

    8. Exercise

    Exercise 8.1

    • Using the general election data hr96-21.csv, we want to find out if there is a difference in the average vote share of the Liberal Democratic Party (LDP) and Komeito in the 2021 general election (single-member districts).

    Q1:

    • Write the null hypothesis for this test.

    Q2:

    • Write the alternative hypothesis for this test.

    Q3:

    • Use the t.test() function to output the test result, and explain the result in simple terms.

    Q4:

    • Create a boxplot + violin plot with the geom_signif() function to show the analysis results, including statistical significance.

    Q5:

    • Show the results of the t-test using a bar graph.

    Q6:

    • Show the results of the t-test using ‘difference’.

    Exercise 8.2

    • There is fictional data ( menu.csv ) where 20 respondents rated ‘Cheeseburger’, ‘Fries’, ‘Teriyaki Burger’, and ‘Shake’ at Mos Burger and McDonald’s.
    • Assume that 20 respondents ate all four types of food at either Mos Burger or McDonald’s.
    • The question here is “Which is tastier for the ‘Cheeseburger’, Mos or Mc?”

    Q1:

    • Write the null hypothesis for this test.

    Q2:

    • Write the alternative hypothesis for this test.

    Q3:

    • Use the t.test() function to output the test result, and explain the result in simple terms.

    Q4:

    • Create a boxplot + violin plot with the geom_signif() function to show the analysis results, including statistical significance.

    Q5:

    • Show the results of the t-test using a bar graph.

    Q6:

    • Show the results of the t-test using ‘difference’.

    Exercise 8.3

    • The following data is the test results (fictional data) of 30 students in a “Quantitative Analysis (Politics)” course: test_score.csv

    • The test results on the first day of class are shown as ‘before’, and those on the last day as ‘after’.
    • The aim here is to find out through a t-test if the scores in quantitative political science improved as a result of attending the class.

    Q1:

    • Write the null hypothesis for this test.

    Q2:

    • Write the alternative hypothesis for this test.

    Q3:

    • Use the t.test() function to output the test result, and explain the result in simple terms.

    Q4:

    • Create a boxplot + violin plot with the geom_signif() function to show the analysis results, including statistical significance.

    Q5:

    • Show the results of the t-test using a bar graph.

    Q6:

    • Show the results of the t-test using ‘difference’.

    Exercise 8.4

    • We want to investigate whether there is a difference in the ages (age) of candidates from the Liberal Democratic Party and the Constitutional Democratic Party in the 2021 general election.
    • You can download the data here.

    Q1:

    • Write the null hypothesis for this test.

    Q2:

    • Write the alternative hypothesis for this test.

    Q3:

    • Use the t.test() function to output the test result, and explain the result in simple terms.

    Q4:

    • Create a boxplot + violin plot with the geom_signif() function to show the analysis results, including statistical significance.

    Q5:

    • Show the results of the t-test using a bar graph.

    Q6:

    • Show the results of the t-test using ‘difference’.