R packages we use in this section
library(broom)
library(car)
library(jtools)
library(patchwork)
library(QuantPsyc)
library(sjPlot)
library(stargazer)
library(summarytools)
library(tidyverse)
X (Cause) | Y (Effect) |
---|---|
explanatory variable |
response variable |
independent variable |
dependent variable |
predictor |
outcome |
explanatory variable |
explained variable |
regressor |
regressand |
\[\hat{Y} = \hat{α} + \hat{β}x\]
\[\hat{ε} = Y - \hat{Y}\]
\[\hat{Y} = \hat{α} + \hat{β}x \]
\[\hat{Y} = \hat{α} + \hat{β}\bar{X}\]
\[\hat{Y} = (\bar{Y} - \hat{β}\bar{X}) + \hat{β}\bar{X} = \bar{Y}\]
\[\bar{Y} - \hat{β}\bar{X} = \hat{α}\]
When the value of \(x\) is equal
to its mean (\(\bar{X}\)), the value of
\(Y\) is also equal to its mean (\(\bar{Y}\)).
In the above plot, we see that this is indeed the case.
The regression line runs through the intersection of the vertical and horizontal dotted lines, which represent the means of X and Y, \((\bar{X}, \bar{Y})\), respectively.
A linear regression model always has zero average prediction error across all data points in the sample, but this does not necessarily mean that the linear regression model accurately represents the actual data-generating process.
Source: Imai, Kosuke. Quantitative Social Science: An Introduction (p.141-147). Princeton University Press. Kindle.
\[\hat{Y} = \hat{α} + \hat{β}x\]
\[\hat{β} = r \frac{s_y}{s_x}\]
\(r\) → ピアソンの相関係数
\(s_y\) → y の標準偏差
\(s_x\) → x の標準偏差
\[\hat{α} = \bar{Y} - β\bar{x}\]
\[r = \frac{\sum((x-\bar{x})(y-\bar{y}))}{\sqrt{Σ(x-\bar{x})^2Σ(y-\bar{y})^2}}= \frac{422.95}{\sqrt{(745.55)(297.55)}}= \frac{422.95}{{471}}= 0.89\]
\(s_y\) → y’s standard deviation
\[s_y= \sqrt{\frac{Σ(y-\bar{y})^2}{{n-1}}}= \sqrt{\frac{297.55}{{19}}}=3.96\]
\(s_x\) → x’s standard deviasion
\[s_x=
\sqrt{\frac{Σ(x-\bar{x})^2}{{n-1}}}=
\sqrt{\frac{745.55}{{19}}}=6.26\]
\[\hat{β} = r \frac{s_y}{s_x}=
0.89*\frac{3.96}{6.26}=0.56\]
\[\hat{α} = \bar{Y} - β\bar{x}= 9.15 - 0.55*11.35 =2.9\]
① Find a puzzle
② Present your theory to explain the puzzle
③ Present a testable hypothesis drawn from your theory
④ Test your hypothesis with data
Research Question
Why do we witness the differences on local government’s performance in
Italy?
Data:
region
: Abbreviation of Italian local governmentsgov_p
: Performance of Italian local governmentsTheory: Social capital enhances local government’s performance.
Source: Robert Putnam, (1994) Making Democracy Work: Civic Traditions
in Modern Italy,
Princeton, NJ: Princeton University Press)
Theory
The differences on local government’s performance can be explained by the degree of social capital in each local government.
Social capital can be defined as “the networks of relationships
among people who live and work in a particular society, enabling that
society to function effectively”.
It involves the effective functioning of social groups through
interpersonal relationships, a shared sense of identity, a shared
understanding, shared norms, shared values, trust, cooperation, and
reciprocity.
Social capital help people build cooperation one another.
→ In the area with more social capital, the more people trust
and cooperate one another, which leads to high quality government
performance.
Hypothesis:
putnam.csv
)data
folder in your
RProject
folderputnam
<- read_csv("data/putnam.csv") putnam
names(putnam)
[1] "region" "gov_p" "cc" "econ" "location"
Data:
Types of variables | Variables | Details |
---|---|---|
Outcome | gov_p |
Performance of Italian local governments |
Predictor | region |
Abbreviation of Italian local governments |
Predictor | cc |
Civic Community Index |
Predictor | econ |
Economy Index (the larger, the better) |
Predictor | location |
Area dummy (north ,south ) |
putnum
::datatable(putnam) DT
summary()
summary(putnam)
region gov_p cc econ
Length:20 Min. : 1.50 Min. : 1.000 Min. : 2.50
Class :character 1st Qu.: 6.25 1st Qu.: 3.875 1st Qu.: 6.25
Mode :character Median :10.00 Median :15.000 Median :11.75
Mean : 9.15 Mean :11.350 Mean :10.43
3rd Qu.:11.25 3rd Qu.:16.250 3rd Qu.:14.50
Max. :16.00 Max. :18.000 Max. :19.00
location
Length:20
Class :character
Mode :character
stargazer()
library("stargazer")
stargazer(as.data.frame(putnam),
type = "text",
title = "Descriptve Statistics on Local Government's Performance",
digits = 2)
Descriptve Statistics on Local Government's Performance
======================================
Statistic N Mean St. Dev. Min Max
--------------------------------------
gov_p 20 9.15 3.96 1.50 16.00
cc 20 11.35 6.26 1.00 18.00
econ 20 10.43 5.08 2.50 19.00
--------------------------------------
cc
and
gov_p
ggplot(putnam, aes(cc, gov_p)) +
geom_point() +
labs(x = "Civic Community Index (cc)", y = "Performance of Italian local governments (gov_p)",
title = "erformance of Italian local governments & Civic Community Index") +
stat_smooth(method = lm) +
geom_text(aes(y = gov_p + 0.2, label = region), size = 4, vjust = 0)
gov_p
and
cc
cor.test(putnam$gov_p, putnam$cc)
Pearson's product-moment correlation
data: putnam$gov_p and putnam$cc
t = 8.6583, df = 18, p-value = 7.806e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7558102 0.9593028
sample estimates:
cor
0.8979883
gov_p
and
cc
is 0.8979883<- lm(gov_p ~ cc, data = putnam) Model_1
Summary()
summary(Model_1)
Call:
lm(formula = gov_p ~ cc, data = putnam)
Residuals:
Min 1Q Median 3Q Max
-2.5043 -1.3481 -0.2087 0.9764 3.4957
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.71115 0.84443 3.211 0.00485 **
cc 0.56730 0.06552 8.658 7.81e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.789 on 18 degrees of freedom
Multiple R-squared: 0.8064, Adjusted R-squared: 0.7956
F-statistic: 74.97 on 1 and 18 DF, p-value: 7.806e-08
tidy()
tidy(Model_1)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2.71 0.844 3.21 0.00485
2 cc 0.567 0.0655 8.66 0.0000000781
stargazer()
{r}
with
{r, results = "asis"}
as the chunk optionstargazer(Model_1, type = "html")
Dependent variable: | |
gov_p | |
cc | 0.567*** |
(0.066) | |
Constant | 2.711*** |
(0.844) | |
Observations | 20 |
R2 | 0.806 |
Adjusted R2 | 0.796 |
Residual Std. Error | 1.789 (df = 18) |
F Statistic | 74.967*** (df = 1; 18) |
Note: | p<0.1; p<0.05; p<0.01 |
\[\hat{gov_p}\ = 2.7 + .567cc\]
tidy(Model_1)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2.71 0.844 3.21 0.00485
2 cc 0.567 0.0655 8.66 0.0000000781
cc
(.567):cc
increases one unit, then gov_p
increases by .567 pointsp-value
is 0.0000000781cc
= 0
in population) with 95% confidence intervalp-value
(0.0000000781) is way
smaller than 0.05.cc
= 0 in population) with 99% confidence intervalcc
is not zero in populationNull Hypothesis \(H_0: β_1 =
0\)
Alternative Hypothesis \(H_a: β_1 ≠
0\)
cc
is positively related to gov_p
in
populationgov_p
is explained by
cc
in Model_1R-squared
, we need to add
another variable which affects the ourcome, gov_p
cc
increases one unit, then
gov_p
increases by .567 pointsA Solution to avoid selection bias →
Multiple Regression Analysis
Multiple Regression analysis enables us to simultaneously control the
effect of other variables
→ We can avoid selection bias problem to a certain degree
→ Multiple regression analysis is a most basic way of avoiding
selection bias
econ
as the control variable
to our modelData:
Types of variables | Variables | Details |
---|---|---|
Outcome | gov_p |
Performance of Italian local governments |
Predictor | region |
Abbreviation of Italian local governments |
Predictor | cc |
Civic Community Index |
Predictor | econ |
Economy Index (the larger, the better) |
Predictor | location |
Area dummy (north ,south ) |
F test
::datatable(putnam) DT
stargazer(as.data.frame(putnam),
type = "text",
title = "Descriptive Statistics on putnam",
digits = 2)
Descriptive Statistics on putnam
======================================
Statistic N Mean St. Dev. Min Max
--------------------------------------
gov_p 20 9.15 3.96 1.50 16.00
cc 20 11.35 6.26 1.00 18.00
econ 20 10.43 5.08 2.50 19.00
--------------------------------------
gov_p
(Y-axis) and
cc
(X-axis)<- ggplot(putnam, aes(cc, gov_p)) +
plt_1 geom_point() +
labs(x = "Civic Community Index (cc)", y = "Performance of Italian local governments (gov_p)",
title = "Performance of Italian local governments & Civic Community Index") +
stat_smooth(method = lm) +
geom_text(aes(y = gov_p + 0.2, label = region), size = 4, vjust = 0)
<- ggplot(putnam, aes(econ, gov_p)) +
plt_2 geom_point() +
labs(x = "Economy Index (econ)", y = "Performance of Italian local governments (gov_p)",
title = "Performance of Italian local governments & Economy Index") +
stat_smooth(method = lm) +
geom_text(aes(y = gov_p + 0.2, label = region), size = 4, vjust = 0)
+ plt_2 plt_1
<- lm(gov_p ~ cc, data = putnam) Model_1
<- lm(gov_p ~ econ, data = putnam) Model_2
stargazer(Model_1, Model_2,
digits = 3,
style = "ajps",
dep.var.caption = "Outcome",
dep.var.labels = "Gov.Performance",
title = "Results of Model_1 & Model_2",
type ="html")
Gov.Performance | ||
Model 1 | Model 2 | |
cc | 0.567*** | |
(0.066) | ||
econ | 0.589*** | |
(0.120) | ||
Constant | 2.711*** | 3.011** |
(0.844) | (1.385) | |
N | 20 | 20 |
R-squared | 0.806 | 0.572 |
Adj. R-squared | 0.796 | 0.549 |
Residual Std. Error (df = 18) | 1.789 | 2.659 |
F Statistic (df = 1; 18) | 74.967*** | 24.097*** |
p < .01; p < .05; p < .1 |
Stargazer()
automatically add asterisks to show the
level of statistical significant levels (1%, 5%, and 10%)
p < .01
(1%)- - - 「\(***\)」
p < .05
(5%)- - - 「\(**\)」
p < .1
(10%)- - - 「\(*\)」
Results (Model_1
)
cc
(.567):cc
increases one unit, then gov_p
increases by .567 pointscc
cc
= 0
in population) with 99% confidence intervalResults (Model_2
)
econ
(.589):econ
increases one unit, then gov_p
increases by .589 pointsecon
cc
= 0
in population) with 99% confidence interval
cc
and econ
are statistically
significantNote
t-value
and p-value
,
you should use summary()
or tidy()
as
follows:tidy(Model_1, conf.int = TRUE)
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2.71 0.844 3.21 0.00485 0.937 4.49
2 cc 0.567 0.0655 8.66 0.0000000781 0.430 0.705
tidy(Model_2, conf.int = TRUE)
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.01 1.38 2.17 0.0433 0.102 5.92
2 econ 0.589 0.120 4.91 0.000113 0.337 0.841
<- lm(gov_p ~ cc + econ, data = putnam) Model_3
stargazer(Model_1, Model_2, Model_3,
digits = 3,
style = "ajps",
dep.var.caption = "Outcome",
dep.var.labels = "Gov.Performance",
title = "Results of Model_1 & Model_2 & Model_3",
type ="html")
Gov.Performance | |||
Model 1 | Model 2 | Model 3 | |
cc | 0.567*** | 0.754*** | |
(0.066) | (0.152) | ||
econ | 0.589*** | -0.253 | |
(0.120) | (0.187) | ||
Constant | 2.711*** | 3.011** | 3.236*** |
(0.844) | (1.385) | (0.912) | |
N | 20 | 20 | 20 |
R-squared | 0.806 | 0.572 | 0.825 |
Adj. R-squared | 0.796 | 0.549 | 0.805 |
Residual Std. Error | 1.789 (df = 18) | 2.659 (df = 18) | 1.749 (df = 17) |
F Statistic | 74.967*** (df = 1; 18) | 24.097*** (df = 1; 18) | 40.126*** (df = 2; 17) |
p < .01; p < .05; p < .1 |
Sample Regression Function formula on Model_3
\[\hat{gov_p}\ = 3.236 + 0.754cc - 0.253econ\]
Interpretation :Model_2, Model_2, and Model_3
cc
is statistically significant.econ
is statistically significant.cc
is statistically significant,
but econ
is not.gov_p
and econ
is
spurious correlationcc
(.754):cc
increases one unit, then gov_p
increases by .754 pointsAn F-test is any statistical test in which the test statistic has
an F-distribution under the null hypothesis.
It is most often used when comparing statistical models that have been fitted to a data set, in order to identify the model that best fits the population from which the data were sampled.
Check the values on F-statistics on Model_3
F-statistic: 40.13 on 2 and 17 DF, p-value: 3.645e-07
F-statistic
enables us to test whether the coefficient
of cc
and econ
are simultaneously equal to
0cc
and econ
are simultaneously equal to 0F-statistic
= 40.13, p-value
=
3.645e-07Model_3
Adjusted R-squared: 0.805
gov_p
is explained by
cc
and econ
in Model_3cm
,
kg
, US dollors
, yen
,
%
, etc.Standardized Beta Coefficient
), we can compare the
relative impact of each variable on the outcome. library("QuantPsyc")
<- lm(gov_p ~ cc + econ, data = putnam)
Model_3 lm.beta(Model_3)
cc econ
1.1932019 -0.3255233
Interpretation of β (Model_3)
The coefficient (1.1932019) → when cc
increases one
standard deviation, gov_p
increases its standard deviation
by 1.19
→ Statistically significant with the 1% significant level
The coefficient (-0.3255233) → when econ
increases
one standard deviation, gov_p
decreases its standard
deviation by 0.33
→ Not statistically significant
Tables
stargazer(Model_1, Model_2, Model_3,
digits = 3,
style = "ajps",
dep.var.caption = "Outcome",
dep.var.labels = "Gov.Performance",
title = "Results of Model_1 & Model_2 & Model_3",
type ="html")
Gov.Performance | |||
Model 1 | Model 2 | Model 3 | |
cc | 0.567*** | 0.754*** | |
(0.066) | (0.152) | ||
econ | 0.589*** | -0.253 | |
(0.120) | (0.187) | ||
Constant | 2.711*** | 3.011** | 3.236*** |
(0.844) | (1.385) | (0.912) | |
N | 20 | 20 | 20 |
R-squared | 0.806 | 0.572 | 0.825 |
Adj. R-squared | 0.796 | 0.549 | 0.805 |
Residual Std. Error | 1.789 (df = 18) | 2.659 (df = 18) | 1.749 (df = 17) |
F Statistic | 74.967*** (df = 1; 18) | 24.097*** (df = 1; 18) | 40.126*** (df = 2; 17) |
p < .01; p < .05; p < .1 |
Caterpillar plots
(1) sjPlot
<- sjPlot::plot_model(Model_3,
coef_sjplot show.values = T,
show.p = T,
vline.color = "black",
order.terms = c(1, 2),
axis.labels = c("Economy Index (econ)", "Civic Community Index (cc)"),
axis.title = "Estimates", title = "Performance of Italian local governments, Civic Community, and Economy",
digits = 3)
coef_sjplot
【How to interpret sjPlot】:
dot (●) is the estimate (= coefficient) for each explanatory
variable
Holizontal lines (blue and red) show the 95% confidence intervals
with α = 0.05 (= 5%)
Blue
line・・・The blue line does not cross the black
vertical line
→ Statistically significant with 5% significant level
Red line・・・The
blue line does cross the black vertical line
→ Not statistically significant with 5% significant level
(2) jtools
::plot_summs(Model_3) jtools
data
folder in your
RProject
folderhr
<- read_csv("data/hr96-21.csv",
hr na = ".")
hr
names(hr)
[1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu"
df1
contains the following 23 variablesvariable | detail |
---|---|
year | Election year (1996-2017) |
pref | Prefecture |
ku | Electoral district name |
kun | Number of electoral district |
rank | Ascending order of votes |
wl | 0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner |
nocand | Number of candidates in each district |
seito | Candidate’s affiliated party (in Japanese) |
j_name | Candidate’s name (Japanese) |
name | Candidate’s name (English) |
previous | Previous wins |
gender | Candidate’s gender:“male”, “female” |
age | Candidate’s age |
exp | Election expenditure (yen) spent by each candidate |
status | 0 = challenger / 1 = incumbent / 2 = former incumbent |
vote | votes each candidate garnered |
voteshare | Voteshare (%) |
eligible | Eligible voters in each district |
turnout | Turnout in each district (%) |
castvote | Total votes cast in each district |
seshu_dummy | 0 = Not-hereditary candidates, 1 = hereditary candidate |
jiban_seshu | Relationship between candidate and his predecessor |
nojiban_seshu | Relationship between candidate and his predecessor |
hr96-21.csv
, and
answer the following questions.Q1: Using seito
variable, make a party
variable (DPJ dummy: dpj
) where 民主党 = 1, others = 0. Add
dpj
to the dataframe, hr
, as a new
variable.
Q2: Using exp
variable, make a new
variable (expm
) where the unit is not “Yen” but “million
Yen”. Add expm
to the dataframe, hr
, as a new
variable.
Q3: Using status
variable, make a new
variable, inc
, where incumbent = 1, challenger and former
incumbent = 0. Add inc
to the dataframe, hr
,
as a new variable.
Q4: Show the descriptive statistics of
hr
using stargazer
package.
Q5: Draw a scatter plot between expm
(x-axis) and voteshare
(y-axis)
Q6: Briely explain the relationship between
expm
(x-axis) and voteshare
(y-axis)
Q7: Suppose you want to know whether campaign money
(expm
) affects votes (voteshare
).
○ Explanatory Variable — expm
○ Response Variable — voteshare
summary()
,
tidy()
, and stargazer
package.voteshare.
variable | detail |
---|---|
1. mag | District magnitude (Number of candidate elected) |
2. rank | Ascending order of votes |
3. nocand | Number of candidates in each district |
4. j_name | Candidate’s name (Japanese) |
5. previous | Previous wins |
6. gender | Candidate’s gender:“male”, “female” |
7. age | Candidate’s age |
8. wl | 0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner |
9. vote | votes each candidate garnered |
10. eligible | Eligible voters in each district |
11.turnout | Turnout in each district (%) |
12.castvote | Total votes cast in each district |
13. seshu_dummy | 0 = Not-hereditary candidates, 1 = hereditary candidate |
14. dpj_dummy | 0 = Not-dpj candidate, 1 = dpj candidate |
15. inc | challenger and former incumbent = 0, incumbent = 1 |
Q8: Present your statistical results using
sjPlot
or jtools
.
Q9: Interpret the results of multiple regression analysis.
References