Main purposes of checking model fit:
- Selection of the model (1PL, 2PL, 3PL, etc.)
- Detection of misfitting items
- Verification of the model’s reliability and predictive accuracy
library(DT)
library(glpkAPI)
library(irt)
library(irtoys)
library(ltm)
library(plink)
library(plyr)
library(psych)
library(reactable)
library(tidyverse)
The Aim of Item Response Theory
・By evaluating the quality of test items, IRT characterizes each item with the following parameters:

Item Parameter | Symbol | Description
---|---|---
Discrimination | a | The sensitivity of the response to the latent trait θ (slope of the ICC)
Difficulty | b | The ability level required to answer the item correctly (center of the ICC)
ICC
○ Reading difficulty and discrimination from the ICC:
→ The difficulty of an item can be determined by identifying the point on the horizontal axis where the ICC reaches a 50% probability of a correct response.
→ The steeper the slope of the curve, the better the item distinguishes between examinees with low ability and those with high ability.
Features of Item Response Theory
・Test results composed of different items can be
compared with each other.
・Test results obtained from different groups can be
compared with each other.
Advantages of Item Response Theory
→ The set of items presented changes depending on the examinee’s
responses.
→ Ensures that the most suitable items for estimating an examinee’s
ability are administered.
→ Allows estimation of ability with fewer items and in less time.
・Main methods:
1. Correlation analysis with test scores
2. Calculation of a discrimination index
3. Application of Item Response Theory
Solution
・To compare School A and School B, estimate the item parameters of each
using an IRT model.
→ Use common items (or common examinees) to equate the scales.
→ This allows examinee ability (θ) to be compared on the same scale.
・Because the difficulty of the items differs between the two tests:
Solution
・Separately estimate the difficulty of the test items and the ability of the examinees, and evaluate them on a common scale.
・Example: In a 2PL model, estimate the discrimination parameter (a) and the difficulty parameter (b) for each item.
・Include at least a few identical items across last year’s and this
year’s tests.
・This makes it possible to estimate transformation coefficients (A and
B) to align the scales.
・(Main methods: Stocking–Lord method, Haebara method, etc.)
・On the equated scale, compare the ability distributions (θ) of
students from last year and this year.
・If there is a significant difference in the means or distributions of
θ, then it can be concluded that “academic ability has improved.”
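To make the transformation concrete, here is a minimal base-R sketch; the coefficients A and B below are hypothetical placeholders (in practice they would be estimated by the Stocking–Lord or Haebara method, e.g. with the plink package):

# Hypothetical equating coefficients (normally estimated by the
# Stocking–Lord or Haebara method, e.g. with the plink package)
A <- 1.1
B <- -0.2

# Put last year's ability estimates onto this year's scale
theta_last_year <- c(-1.5, 0.0, 0.8)
theta_equated <- A * theta_last_year + B

# Item parameters transform accordingly under a linear scale change:
# a* = a / A, b* = A * b + B
a_old <- c(0.8, 1.2)
b_old <- c(-0.5, 0.3)
a_new <- a_old / A
b_new <- A * b_old + B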
Summary
・Even if raw test scores are converted into standardized scores
(deviation values), examinees’ abilities cannot be accurately compared
in the following cases:
(1) The ability levels of examinee groups differ
across schools
(2) Because the tests were administered at
different times, the difficulty of the items differs
Classical Test Theory | Item Response Theory |
Raw score | Latent trait value \(θ\) (theta) = examinee’s ability |
Standardized score (deviation value) | Item characteristics (item difficulty and discrimination) |
In classical test theory, item characteristics and examinee ability are confounded.

Classical Test Theory | Item Response Theory
---|---
Cannot distinguish between item difficulty and examinee ability | Can represent item characteristics and examinee ability separately
Assumption | Description |
1. Local Independence | The assumption that there are no unnecessary dependencies among items |
2. Monotonicity of the Probability of a Correct Response | The assumption that item difficulty is appropriate (the probability of a correct response increases with ability) |
3. Goodness of Fit | The assumption that the data fit the item response model |
Probability | How likely an event is to occur in the future |
Likelihood | How plausible a hypothesis is, given what has already been observed |
\[\frac{3}{5} × \frac{2}{5} × \frac{3}{5} = \frac{18}{125}\]
→ the assumption of local independence, meaning that each draw is independent and does not influence the others.
Key Point
・The assumption of local independence
is a fundamental prerequisite for using IRT.
・If this assumption is not satisfied, calculating scores using IRT
becomes meaningless.
Fill in the blank with the correct form:
He ___ his homework before dinner every day.
A. do
B. does
C. doing
D. done
Both Question 1 and Question 2 ask about the third-person
singular present tense verb form (subject–verb agreement), and thus
depend on the same grammatical knowledge.
→ Examinees who answer Question 1 correctly are also highly likely to
answer Question 2 correctly.
→ This is not explained solely by ability θ, but rather by a substantive
relationship between the two items (dependence on a common skill).
→ In this case, conditional dependence arises between the items, and the
assumption of local independence is violated.
Number of Balls | Likelihood |
1 Red, 4 White | \(\frac{1}{5}×\frac{4}{5}×\frac{1}{5}= \frac{4}{125}\) |
2 Red, 3 White | \(\frac{2}{5}×\frac{3}{5}×\frac{2}{5}= \frac{12}{125}\) |
3 Red, 2 White | \(\frac{3}{5}×\frac{2}{5}×\frac{3}{5}= \frac{18}{125}\) |
4 Red, 1 White | \(\frac{4}{5}×\frac{1}{5}×\frac{4}{5}= \frac{16}{125}\) |
Key Point
・In IRT analysis, parameters are estimated by Maximum Likelihood Estimation (MLE).
・For example, suppose a student’s responses are: “correct on Item 1,
incorrect on Item 2, correct on Item 3.”
→ In the bag analogy, this corresponds to “1st draw: red, 2nd draw:
white, 3rd draw: red.”
・The test results are observed (just as the colors of the drawn balls
are known).
・At this point, we estimate the student’s ability θ (analogous to
estimating the unseen mix of red and white balls in the bag).
・The contents of the bag are not directly observable.
・Likewise, the student's ability θ is unobservable.
・In IRT, the student's ability θ is estimated in the same way.
MLE in IRT: Illustrated with the Bag
Example
・Observed data: “Red, White, Red”
・Hypothesized parameters: possible combinations of red and white balls
in the bag (e.g., 1 Red & 4 White, 2 Red & 3 White, 3 Red &
2 White, 4 Red & 1 White)
・ For each hypothesis (parameter set), calculate the probability
(likelihood) of obtaining the observed outcome.
👉 Adopt the parameter (ball combination) that yields the highest
likelihood.
・In other words:
→ From among the candidate parameter sets, choose the one that best
explains the observed data.
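This ball-and-bag calculation can be reproduced directly; a minimal R sketch that evaluates the likelihood of "Red, White, Red" under each hypothesized bag composition and adopts the maximum:

# Likelihood of drawing "Red, White, Red" (with replacement) for each
# hypothesized number of red balls among the 5 in the bag
red <- 1:4
lik <- (red / 5) * ((5 - red) / 5) * (red / 5)
data.frame(red = red, white = 5 - red, likelihood = lik)
red[which.max(lik)] # 3 red, 2 white maximizes the likelihood (18/125)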
Level of Fit | Overview | Main Methods |
Overall Model | Whether the entire model fits the data | AIC, BIC, M2, RMSEA |
Item Level | Whether each item is consistent with the model | Residual analysis, S-X² test, infit/outfit |
Person Level | Whether each examinee’s responses follow the model | Person-fit statistics (e.g., l_z) |
Scoring Criteria | May reflect the test administrator’s judgment |
Scoring | Must not involve subjectivity |
1. Number-correct score | The number of items answered correctly |
2. Weighted number-correct score | The total score based on assigned weights for each item |
3. Standardization (Z-scores) / Deviation scores | Numerical representation of one’s relative standing within the same group |
1. When tests contain different items, number-correct scores and
weighted number-correct scores cannot be compared.
2. In number-correct scores, the difficulty and importance of items are
not reflected.
3. In weighted number-correct scores, there is no clear, rational, and
objective way to determine weights.
4. Deviation scores from different groups cannot be compared.
Regarding the problems of conventional test
scores
・When tests contain different items, number-correct scores and weighted
number-correct scores cannot be compared.
→ With IRT, test scores can be compared even
when the tests contain different items
(However, the assumptions of “local independence” and
“unidimensionality” must be satisfied.)
・In number-correct scores, the difficulty and importance of items are
not reflected.
→ IRT allows item difficulty to be analyzed and
quantified in a rational way
・(You can determine the “difficulty” of items.)
→ IRT shows how much correct/incorrect
responses differ depending on examinees’ ability
(You can determine the “discrimination” or “importance” of items.)
・In weighted number-correct scores, there is no clear, rational, or
objective way to assign weights.
→ In IRT analysis, it is not necessary to
decide weights either before or after the test
→ Instead, IRT uses parameters in place of weights
・Deviation scores from different groups cannot be compared.
→ With IRT, results from different groups can
be compared
Data Structure Used in IRT
・Indicates whether examinee \(i\)
answered item \(j\) correctly or
incorrectly.
・The starting point of test theory is “response data” (1 = correct, 0 =
incorrect).
・Suppose there are 3 examinees: A = 1st, B = 2nd, C = 3rd → the
examinee index is represented by \(i\).
・Suppose there are 3 items: Q1 = 1st, Q2 = 2nd, Q3 = 3rd → the item
index is represented by \(j\).
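As a toy illustration of this data structure (the 0/1 values are made up for illustration):

# Response matrix: rows = examinees i (A, B, C), columns = items j (Q1–Q3)
resp <- matrix(c(1, 0, 1,
                 1, 1, 0,
                 0, 0, 1),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("A", "B", "C"), c("Q1", "Q2", "Q3")))
resp["A", "Q2"] # u_ij with i = 1 (examinee A), j = 2 (item Q2): returns 0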
→ No matter how many easy items are answered
correctly, the value of ability θ cannot be considered high
→ The more difficult the items that are answered
correctly, the higher the ability θ is judged (i.e.,
estimated)
Method of Analysis | Indicator for Measuring Item Difficulty |
Vision Test | Ring size (Guttman scale) |
IRT | Item characteristics |
The probability of correctly answering an item
according to differences in ability θ
・Examinees with low ability θ
→ the probability of a correct response rises gradually
・Examinees with medium ability θ
→ the probability of a correct response rises
sharply
→ The slope is steepest
→ the increase in the probability of a correct response per one-unit rise in ability θ is greatest here (this is what discrimination captures)
・Examinees with high ability θ
→ the probability of a correct response rises gradually
Item characteristics are obtained by conducting
IRT analysis on test results
The item characteristics derived from the analysis are unique to each individual item → The shape of the Item Characteristic Curve differs from item to item
In IRT analysis, various models are used, but here we introduce
the most commonly applied two-parameter logistic model.
The shape of the Item Characteristic Curve (ICC) in the
two-parameter logistic model is determined by the item’s difficulty and
discrimination.
Because the shape of the ICC is determined by these two
parameters, the model is called the two-parameter logistic model.
Difficulty and Discrimination = Item parameters
Notation | Details |
\(P(\theta)\) | The probability that an examinee with ability θ correctly answers a given item |
\(\theta\) | Examinee’s ability, assumed to follow a normal distribution with mean 0 and standard deviation 1 |
\(a\) | Discrimination parameter |
\(b\) | Difficulty parameter = location parameter |
\(j\) | Item number |
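With this notation, the 2PL model introduced above (and tabulated again in the model comparison later) gives the probability of a correct response to item \(j\) as:

\[P_j(\theta) = \frac{1}{1 + e^{-a_j(\theta - b_j)}}\]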
Recommended range of difficulty \(b\) | −3 to 3 |
Range where estimation is most stable | −2 to 2 |
Appropriate Magnitude of the Difficulty
Parameter \(b\)
−3 〜 3
(Source:芝祐順編『項目反応理論』p.34)
Key Point about Discrimination
・Discrimination serves as a criterion for
judging how sensitively an item can distinguish between correct and
incorrect responders based on differences in ability θ.
・However, this applies only when ability θ is near the item's difficulty \(b\) (the region where the ICC is steepest).
The figure below shows Item Characteristic Curves for three cases where discrimination is 0, −0.5, and 10.
Appropriate Magnitude of Discrimination
(a)
0.3 〜
2.0
(Source:芝祐順編『項目反応理論』p.34)
Method for Estimating Ability Using IRT ・Assume that ability is distributed with a mean of 0 and a standard deviation of 1
・Measure that ability with a “ruler” that has an origin of 0 and tick marks from −3 to 3
Variable Name | Details |
ID | Examinee ID |
Item1 | Response to Item 1 (0 or 1) |
Item2 | Response to Item 2 (0 or 1) |
Item3 | Response to Item 3 (0 or 1) |
Item4 | Response to Item 4 (0 or 1) |
Item5 | Response to Item 5 (0 or 1) |
SS | Total score from Items 1–5 |
class | Classification of examinees based on raw score |
class: Ranges from 1 to 5, with higher values indicating higher
raw score groups.
This dataset is analyzed using IRT to evaluate the characteristics of both items and the test as a whole.
Load the necessary packages for the analysis.
Check the variable names of the LSAT data:

colnames(LSAT)
[1] "Item 1" "Item 2" "Item 3" "Item 4" "Item 5"
Rename Item 1 through Item 5 to item1 through item5 (remove the half-width space from each variable name):

LSAT <- LSAT |>
  rename("item1" = "Item 1",
         "item2" = "Item 2",
         "item3" = "Item 3",
         "item4" = "Item 4",
         "item5" = "Item 5")
The variables are now named item1 through item5.

Purpose | Appropriate Analysis
---|---
Examine item difficulty and discrimination | ltm() and ICC plots
Evaluate the effectiveness of the test across the θ-range | ICC or Test Information Function (TIF) analysis
Assign ability scores to individuals | mlebme() or factor.scores()

The functions used below come from the irtoys, psych, and ltm packages (mlebme() is in the irtoys package).

Analysis Method | Package::Function Used | Content of Analysis
---|---|---
1. Calculation of correct answer rates | colMeans() | Understand item difficulty by average correct answer rate |
2. Calculation of item–total (I–T) correlation | cor() | Check whether each item aligns with the total score (an indicator of discrimination) |
3. Examination of unidimensionality | psych::fa.parallel() | Check whether all items are measured by a single latent trait |
4. Estimation of item parameters | irtoys::est() | Numerically estimate each item’s discrimination a and difficulty b |
5. Item Characteristic Curve (ICC) | irtoys::irf() | Visualize how the probability of a correct response changes with ability θ |
6. Estimation of latent trait values (item side) | ltm::ltm() | “Item-side” estimation of θ (model fit, estimation of a and b) |
7. Test Information Curve (TIC) | irtoys::tif() | Determine at which ability levels this test measures most precisely |
8. Examination of local independence | irtoys::irf(), base::cor() | Check whether there are extra dependencies among items (violation of model assumptions) |
9. Examination of item fit | irtoys::itf() | Assess how well each item fits the IRT model |
10. Test Characteristic Curve (TCC) | irtoys::trf() | How many points can a person with ability θ expect to score? |
11. Estimation of latent trait values (examinee side) | irtoys::mlebme() | “Examinee-side” estimation of θ (how capable each person is) |
Note 1: The “I–T correlation discrimination” is only a simple indicator of discrimination within classical test theory and is not the same as the IRT discrimination (\(a\) parameter) in the strict sense.
Note 2: When estimating latent trait values, keep in mind that there are θ values from both the “item side” and the “examinee side.”
Calculate the correct answer rate for each item with colMeans(), saving the result as crr (used below):

crr <- colMeans(LSAT[, 1:5]) # proportion correct = mean of the 0/1 responses
crr
item1 item2 item3 item4 item5
0.924 0.709 0.553 0.763 0.870
df_crr <- data.frame( # Specify the name of the data frame (here, df_crr)
item = names(crr), # Specify the variable name (here, item)
seikai = as.numeric(crr) # Specify the variable name (here, seikai)
)
item seikai
1 item1 0.924
2 item2 0.709
3 item3 0.553
4 item4 0.763
5 item5 0.870
ggplot(df_crr, aes(x = seikai, y = item)) +
geom_bar(stat = "identity", fill = "skyblue") +
geom_text(aes(label = round(seikai, 2)), # Round to 2 decimal places
hjust = 1.2, size = 6) + # Display inside the bars
labs(
title = "Correct Answer Rate for Each Item",
x = "Correct Answer Rate", # x maps to seikai (the rate)
y = "Item" # y maps to item
) +
theme_minimal() +
theme_bw(base_family = "HiraKakuProN-W3") # Prevent garbled characters
Key Points for Calculating Correct Answer Rates
・Check whether there are items with extremely high or low correct answer rates
・If there are items with extremely high/low rates → Problematic
・If there are no such items
→ No problem
→ In this case, there are no items with extreme rates
→ No problem
→ Proceed to the next analysis
Use the cor() function to calculate the item–total (I–T) correlation between each item (item1–item5) and the total score:

# I–T correlation: each item vs. the total score (here, the sum of item1–item5)
it <- cor(LSAT[, 1:5], rowSums(LSAT[, 1:5]))
it

[,1]
item1 0.3620104
item2 0.5667721
item3 0.6184398
item4 0.5344183
item5 0.4353664
# Convert the matrix to a data frame
df_it <- as.data.frame(it)
# Add the row names (item names) as a column
df_it$item <- rownames(df_it)
# Rename the columns for clarity (optional)
colnames(df_it) <- c("correlation", "item")
ggplot(df_it, aes(x = item, y = correlation)) +
geom_bar(stat = "identity", fill = "orange") +
geom_text(aes(label = round(correlation, 3)),
vjust = -0.5, size = 4) +
ylim(0, 0.7) +
labs(
title = "Item–Total Correlation",
x = "Item",
y = "Correlation Coefficient"
) +
theme_minimal() +
theme_bw(base_family = "HiraKakuProN-W3") # Prevent garbled characters
The I–T correlations between each item (item1–item5) and the total score range from about 0.36 to 0.62.

Key Points in Calculating I–T Correlations
・Check whether correlations are observed between responses to each item (item1–item5) and the total score (ss)
I–T Correlation Value | Evaluation | Treatment of Item |
~ 0.2 | Extremely low (caution) | Consider excluding |
0.2–0.3 | Somewhat low | Re-examine depending on content |
0.3–0.4 | Reasonable level | Pending / context-dependent |
0.4 and above | Good (desirable) | Acceptable to retain |
→ Here, all items show correlations above 0.2
→ No problem
→ Proceed to the next analysis without excluding items
Examine unidimensionality with the psych::fa.parallel() function, which implements parallel analysis (Horn, 1965).

library(psych)
data <- read.csv("data/IRT_LSAT.csv")
item_data <- data[, -1] # Exclude the first column from the analysis
# Examination of unidimensionality (principal components + parallel analysis)
fa.parallel(item_data,
fa = "pc",
n.iter = 100) # Specify 100 simulations
Parallel analysis suggests that the number of factors = NA and the number of components = 1
Axis | What it Represents | Interpretation Point |
X-axis | Factor number (principal component number) | Factors 1–5 |
Y-axis | Eigenvalue (amount of variance explained) | Greater than 1 → Potentially meaningful factor |
→ Eigenvalue
=> In principal component analysis (PCA) or factor analysis, this is
the value that indicates “the amount of total variance in the data
explained by that factor.”
→ It represents how much variance each factor explains.
→ The larger the eigenvalue, the more information (i.e., variation,
structure) the factor captures.
→ A factor with a large eigenvalue can summarize and explain the scores
of many items.
Line Color/Type | What It Represents | Interpretation |
Blue line | Eigenvalues from actual data | Strength of factors derived from real item correlations |
Red dashed line (thin) | Simulated | Average eigenvalues from completely random data |
Red dashed line (thicker) | Resampled | Average eigenvalues from resampled versions of the original data |
Black line | Eigenvalue = 1 | If greater than 1 → potentially meaningful factor |
・Item parameters are estimated with dedicated model-fitting functions (e.g., ltm() or mirt()).
・Here, the two-parameter logistic model (2PL: generalized logistic model) is used for analysis.
・It is generally assumed for estimation that the latent trait values follow a normal distribution with mean 0 and variance 1 (the standard normal distribution).
Target of Estimation | Information Required for Estimation |
Ability | The response pattern of the examinee whose ability is to be estimated |
Item Parameters | Response patterns of all test-takers |
・Item parameters are derived through IRT analysis based on the correct/incorrect response data of examinees who took the test containing the item.
What can and cannot be done regarding the estimation of item parameters
・Not possible: writing an item in advance so that it has predetermined values such as "discrimination = ○○, difficulty = △△"
← item parameters cannot be obtained without conducting IRT analysis on actual test results
・Possible: developing a large number of items, administering them to examinees, and then conducting IRT analysis
→ accumulating verification and evaluation of each individual item
Feature | 1PL Model (Rasch Model) | 2PL Model (Generalized Logistic Model) |
Model Formula | \(P(\text{Correct}) = \frac{1}{1+e^{-(\theta-b)}}\) | \(P(\text{Correct}) = \frac{1}{1+e^{-a(\theta-b)}}\) |
Discrimination a | Same for all items (fixed) | Estimated for each item |
Difficulty b | Estimated for each item | Estimated for each item |
Number of Parameters | Number of items (only b) + 1 (fixed a) | Number of items × 2 (both a and b estimated) |
Focus of Analysis | Relationship between ability and item difficulty b | Relationship among ability, difficulty b, and discrimination a (a more flexible model) |
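To see the contrast in practice, only the model argument changes between the two fits; a sketch assuming the LSAT data prepared above and the irtoys est() function used later in this document:

# 1PL (Rasch) fit for comparison: discrimination is common to all items
ex_1pl <- irtoys::est(resp = LSAT[, 1:5],
                      model = "1PL",  # only this argument changes vs. "2PL"
                      engine = "ltm")
ex_1pl$est # column 1: (common) discrimination, column 2: difficulty b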
Notation | Description |
\(P(\theta)\) | Probability that an examinee with ability θ answers an item correctly |
\(\theta\) | Examinee’s ability, assumed to follow a normal distribution with mean 0 and standard deviation 1 |
\(a\) | Discrimination parameter |
\(b\) | Difficulty parameter = location parameter |
\(j\) | Item number |
Estimate the item parameters with the est() function from the irtoys package (with engine = "ltm", estimation is carried out internally by the ltm package), and save the result as ex1.
ex1 <- est(resp = LSAT[, 1:5], # Argument specifying the test data
model = "2PL", # Assume the 2PL model
engine = "ltm") # Specify estimation using the ltm package
ex1
$est
[,1] [,2] [,3]
item1 0.8253717 -3.3597333 0
item2 0.7229498 -1.3696501 0
item3 0.8904752 -0.2798981 0
item4 0.6885500 -1.8659193 0
item5 0.6574511 -3.1235746 0
$se
[,1] [,2] [,3]
[1,] 0.2580641 0.86694584 0
[2,] 0.1867055 0.30733661 0
[3,] 0.2326171 0.09966721 0
[4,] 0.1851659 0.43412010 0
[5,] 0.2100050 0.86998187 0
$vcm
$vcm[[1]]
[,1] [,2]
[1,] 0.06659708 0.2202370
[2,] 0.22023698 0.7515951
$vcm[[2]]
[,1] [,2]
[1,] 0.03485894 0.05385658
[2,] 0.05385658 0.09445579
$vcm[[3]]
[,1] [,2]
[1,] 0.05411071 0.012637572
[2,] 0.01263757 0.009933553
$vcm[[4]]
[,1] [,2]
[1,] 0.03428641 0.07741096
[2,] 0.07741096 0.18846026
$vcm[[5]]
[,1] [,2]
[1,] 0.04410211 0.1799518
[2,] 0.17995180 0.7568684
ex1 <- est(resp = LSAT[, 1:5], # Argument specifying the test data
model = "2PL", # Assume the 2PL model
engine = "ltm") # Specify estimation using the ltm package

# Compute the ICCs from the estimated item parameters and save as P1
P1 <- irtoys::irf(ip = ex1$est)

plot(x = P1, # Specify the result estimated by irf() as the argument x
co = NA, # Draw the ICCs in different colors for each item
label = TRUE) # Attach item numbers to each ICC
abline(v = 0, lty = 2) # Draw a vertical dotted line at x = 0
(Horizontal axis: ability \(θ\); vertical axis: probability of a correct response)
What Can Be Understood from the Item
Characteristic Curve (ICC)
● The curve for item3 is located
around the center of the figure, with relatively high discrimination (a)
→ a strong item
● item1 and item5
have curves shifted to the left → items that are too easy
● The steeper the curve, the better it
distinguishes differences in ability (item3 is a typical
example)
Appropriate Range of Discrimination
\(a\)
0.3 〜 2.0
(Source:芝祐順編『項目反応理論』p.34)
# Required packages
library(irtoys) # est() is provided by irtoys
library(ggplot2)
library(dplyr)
# Model estimation (repeated)
ex1 <- est(resp = LSAT[, 1:5],
model = "2PL",
engine = "ltm")
# Extract discrimination (1st column)
disc <- ex1$est[, 1]
# Convert to data frame and reorder (ascending order!)
disc_df <- data.frame(
Item = names(disc),
Discrimination = disc
) %>%
arrange(Discrimination) %>% # ★ Fixed here
mutate(Item = factor(Item, levels = Item)) # Apply ordering
# Plot
ggplot(disc_df, aes(x = Item, y = Discrimination)) +
geom_bar(stat = "identity", fill = "darkgreen") +
geom_text(aes(label = round(Discrimination, 2)), vjust = -0.5, size = 4) +
labs(title = "Items Ordered by Increasing Discrimination",
x = "Item", y = "Discrimination") +
theme_minimal() +
theme_bw(base_family = "HiraKakuProN-W3") # Prevent garbled text
→ Range of discrimination a ≈ 0.66 – 0.89 → No issues
Appropriate Range of Difficulty
\(b\)
-3 〜 3
(Source:芝祐順編『項目反応理論』p.34)
# Required packages
library(irtoys) # est() is provided by irtoys
library(ggplot2)
library(dplyr)
# Model estimation (repeated)
ex1 <- est(resp = LSAT[, 1:5],
model = "2PL",
engine = "ltm")
# Extract difficulty (2nd column)
difficulty <- ex1$est[, 2]
# Get item names and convert to data frame
diff_df <- data.frame(
Item = rownames(ex1$est),
Difficulty = difficulty
) %>%
arrange(Difficulty) %>% # Sort in ascending order
mutate(Item = factor(Item, levels = Item)) # Fix the order
# Plot (bar chart)
ggplot(diff_df, aes(x = Item, y = Difficulty)) +
geom_bar(stat = "identity", fill = "magenta") +
geom_text(aes(label = round(Difficulty, 2)), vjust = -0.5, size = 4) +
labs(title = "Items Ordered by Increasing Difficulty",
x = "Item", y = "Difficulty") +
theme_minimal() +
theme_bw(base_family = "HiraKakuProN-W3") # Prevent garbled text
- item1 and item5 are too easy (difficulty b ≤ −3)
- item4, item2, and item3 have moderate difficulty
→ Range of difficulty b ≈ −3.36 to −0.28
→ Overall, the test is too easy
Appropriate Range of Standard error
(Discrimination a)
0.1 〜 0.4
(Source:芝祐順編『項目反応理論』p.34)
# Required packages
library(irtoys) # est() is provided by irtoys
library(ggplot2)
library(dplyr)
# Model estimation
ex1 <- est(resp = LSAT[, 1:5],
model = "2PL",
engine = "ltm")
# Extract standard errors (SE) of discrimination
disc_se <- ex1$se[, 1] # First column = SE of discrimination
# Get item names
item_names <- rownames(ex1$est)
# Convert to data frame and reorder in ascending order
disc_se_df <- data.frame(
Item = item_names,
Discrimination_SE = disc_se
) %>%
arrange(Discrimination_SE) %>% # Sort ascending
mutate(Item = factor(Item, levels = Item)) # Reflect order in the plot
# Plot (bar chart: left = small → right = large)
ggplot(disc_se_df, aes(x = Item, y = Discrimination_SE)) +
geom_bar(stat = "identity", fill = "purple") +
geom_text(aes(label = round(Discrimination_SE, 3)),
vjust = -0.5, size = 4) +
labs(title = "Items Ordered by Increasing Standard Error (SE) of Discrimination",
x = "Item",
y = "Standard Error of Discrimination") +
theme_minimal() +
theme_bw(base_family = "HiraKakuProN-W3") # Prevent garbled text
Range of standard errors for discrimination a: 0.185 –
0.258
→ All within the recommended range
→ No issues
Visualizing the Standard Error of Discrimination \(a\)
Appropriate Range of Standard error
(Difficulty \(b\))
0.2 〜 0.5
(Source:芝祐順編『項目反応理論』p.34)
# Required packages
library(irtoys) # est() is provided by irtoys
library(ggplot2)
library(dplyr)
# Model estimation
ex1 <- est(resp = LSAT[, 1:5],
model = "2PL",
engine = "ltm")
# Extract standard errors (SE) of difficulty (2nd column)
diff_se <- ex1$se[, 2]
# Get item names
item_names <- rownames(ex1$est)
# Convert to data frame and reorder in ascending order
diff_se_df <- data.frame(
Item = item_names,
Difficulty_SE = diff_se
) %>%
arrange(Difficulty_SE) %>% # Sort ascending
mutate(Item = factor(Item, levels = Item)) # Fix the order
# Plot
ggplot(diff_se_df, aes(x = Item, y = Difficulty_SE)) +
geom_bar(stat = "identity", fill = "red") +
geom_text(aes(label = round(Difficulty_SE, 3)),
vjust = -0.5, size = 4) +
labs(title = "Items Ordered by Increasing Standard Error (SE) of Difficulty",
x = "Item",
y = "Standard Error of Difficulty") +
theme_minimal() +
theme_bw(base_family = "HiraKakuProN-W3") # Prevent garbled text
・The difficulty estimates for item1 and item5 are unstable (both SE = 0.87)
・Overall, the test is biased toward easy items.
・item3
has moderate difficulty and high discrimination,
making it an excellent item from an IRT perspective.
・item1
and item5 are too easy and have relatively low
discrimination.
→ Depending on the purpose of the test, they should be reconsidered or
excluded.
・From the perspective of difficulty, item1 and item5 (difficulty ≤ −3) are candidates for deletion.
・Adding more difficult items would improve balance.
・Including items with high discrimination (a > 1.2) would enhance the precision of ability differentiation.
・Adding items with difficulty around b ≈ 0 to +2 would strengthen discrimination among high-ability examinees.
Next, fit the same 2PL model with the ltm() function from the ltm package.

# IRT analysis using the 2PL model (limited to 5 items)
mod <- ltm(LSAT[, 1:5] ~ z1, IRT.param = TRUE)
# Specify the range of ability values θ
theta_vals <- seq(-3, 3, by = 0.1)
# Extract item parameters (discrimination a, difficulty b) from the model
coefs <- coef(mod) # Returned values: column 1 = Dffclt (b), column 2 = Discr (a)
# Create a data frame to store the probabilities of correct responses
icc_df <- data.frame(theta = theta_vals)
# Calculate the probability of a correct response for each item (2PL model formula)
for (i in 1:nrow(coefs)) {
b <- coefs[i, 1] # Difficulty b
a <- coefs[i, 2] # Discrimination a
P_theta <- 1 / (1 + exp(-a * (theta_vals - b))) # 2PL ICC formula
icc_df[[paste0("Item", i)]] <- round(P_theta, 4)
}
Plot the ICCs for item1 through item5:
plot(mod, type = "ICC", items = 1:5)
# Add a vertical dotted line (θ = −3)
abline(v = -3, col = "red", lty = 2, lwd = 1)
At θ = −3 (the red dotted line), the ICC for item1 indicates a probability of 0.57, while the ICC for item3 indicates a probability of 0.0815.
item1 | item2 | item3 | item4 | item5 |
Correct | Correct | Correct | Correct | Incorrect |
- Probability of responses for the lowest-ability student (θ = −3)

θ (ability) | item1 | item2 | item3 | item4 | item5
---|---|---|---|---|---
−3 | 0.5737 | 0.2353 | 0.0815 | 0.3141 | 0.5203

Writing \(P_j\) for the probability of a correct response to item \(j\), the probability of this response pattern is:

\[P(\text{pattern}) = P_1 \times P_2 \times P_3 \times P_4 \times (1 - P_5)\]
\[= 0.5737 \times 0.2353 \times 0.0815 \times 0.3141 \times (1 - 0.5203) = 0.001658\ (= 0.17\%)\]
・The probability that the lowest-ability student (θ = −3) answers 4 out of 5 items correctly and misses only 1 is 0.17%
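This product can be verified with a one-line check in R:

# Check: likelihood of the pattern (Correct ×4, Incorrect ×1) at theta = -3
prod(c(0.5737, 0.2353, 0.0815, 0.3141, 1 - 0.5203))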
- Probability of responses for the average-ability student (θ = 0)

θ (ability) | item1 | item2 | item3 | item4 | item5
---|---|---|---|---|---
0 | 0.9412 | 0.7291 | 0.562 | 0.7833 | 0.8863

The probability that this student answers item1 through item4 correctly and item5 incorrectly is:

\[P(\text{pattern}) = P_1 \times P_2 \times P_3 \times P_4 \times (1 - P_5)\]
\[= 0.9412 \times 0.7291 \times 0.562 \times 0.7833 \times (1 - 0.8863) = 0.034343\ (= 3.4\%)\]
・The probability that an average-ability
student (θ = 0) answers 4 out of 5 items correctly and misses only 1 is
3.4%
- Probability of responses for the highest-ability student (θ = 3)
θ (ability) | item1 | item2 | item3 | item4 | item5
---|---|---|---|---|---
3 | 0.9948 | 0.9593 | 0.9489 | 0.9661 | 0.9825

\[P(\text{pattern}) = P_1 \times P_2 \times P_3 \times P_4 \times (1 - P_5)\]
\[= 0.9948 \times 0.9593 \times 0.9489 \times 0.9661 \times (1 - 0.9825) = 0.015338\ (= 1.5\%)\]
・The probability that the highest-ability
student (θ = 3) answers 4 out of 5 items correctly and misses only 1 is
1.5%
- Based on the “probabilities of correct responses for item1 through
item5” shown above,
- For each ability value θ (from −3 to 3), we can calculate the
probability of this student’s response pattern (Correct, Correct,
Correct, Correct, Incorrect).
・When the model is fitted with the ltm() function, the likelihood evaluated at each ability level θ represents the probability of observing this response pattern.

# IRT analysis using the 2PL model (limited to 5 items)
mod <- ltm(LSAT[, 1:5] ~ z1, IRT.param = TRUE)
# Specify the range of ability values θ
theta_vals <- seq(-3, 3, by = 0.1)
# Extract item parameters (a, b) from the model
coefs <- coef(mod) # col1 = b (difficulty), col2 = a (discrimination)
# Response pattern (1 = correct, 0 = incorrect)
response_pattern <- c(1, 1, 1, 1, 0)
# Initialize: list to store results
result_list <- list()
# Calculate for each θ
for (j in seq_along(theta_vals)) {
theta <- theta_vals[j]
item_probs <- numeric(length(response_pattern))
for (i in 1:length(response_pattern)) {
b <- coefs[i, 1]
a <- coefs[i, 2]
P <- 1 / (1 + exp(-a * (theta - b)))
# Save P if correct, 1 - P if incorrect
item_probs[i] <- ifelse(response_pattern[i] == 1, P, 1 - P)
}
# Likelihood = product of the probabilities across the 5 items
likelihood <- prod(item_probs)
# Record results in data frame format
result_list[[j]] <- data.frame(
theta = theta,
Item1 = round(item_probs[1], 4),
Item2 = round(item_probs[2], 4),
Item3 = round(item_probs[3], 4),
Item4 = round(item_probs[4], 4),
Item5 = round(item_probs[5], 4),
likelihood = round(likelihood, 6)
)
}
# Combine results for all θ values and display
result_df <- do.call(rbind, result_list)
theta | item1 | item2 | item3 | item4 | item5 | likelihood |
0.5 | 0.9603 | 0.7944 | 0.667 | 0.836 | 0.0845 | 0.035957 |
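The θ that maximizes the likelihood column is the maximum-likelihood estimate for this response pattern; for example:

# theta at which the likelihood of the pattern (1,1,1,1,0) peaks
result_df$theta[which.max(result_df$likelihood)]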
Test Information Function
・For the two-parameter logistic (2PL) model, the test information is expressed by the following formula:
\[I(\theta) = 1.7^2\sum_{j=1}^na_j^2P_j(\theta)Q_j(\theta)\]
Variable | Description
---|---
\(I(\theta)\) | Test information
\(1.7\) | Scaling constant
\(a_j\) | Discrimination of item \(j\)
\(P_j(\theta)\) | Probability of a correct response to item \(j\) at ability level θ
\(Q_j(\theta)\) | Probability of an incorrect response to item \(j\) at ability level θ \((= 1 - P_j(\theta))\)
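The formula can be transcribed directly into R; a sketch computing \(I(\theta)\) at θ = 0 from the 2PL estimates in ex1 (whether the 1.7 constant is also applied inside \(P_j(\theta)\) depends on the metric in which the parameters were estimated, so treat the absolute value as illustrative):

# Direct transcription of I(theta) for the 2PL model at theta = 0
a <- ex1$est[, 1] # discrimination a_j
b <- ex1$est[, 2] # difficulty b_j
theta <- 0
P <- 1 / (1 + exp(-1.7 * a * (theta - b))) # P_j(theta), with the 1.7 constant
Q <- 1 - P                                 # Q_j(theta)
sum(1.7^2 * a^2 * P * Q)                   # test information I(theta)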
Compute the test information with the tif() function.

I <- irtoys::tif(ip = ex1$est) # Assume the 2PL model for the data
# ip: argument specifying the item parameters of the test
plot(x = I) # x: argument specifying the results estimated by tif()
・Horizontal axis … latent trait value \(θ\) (ability)
・Vertical axis … test information (measurement precision; higher information corresponds to a smaller standard error)
- Solid line … Test Information Curve
→ Obtained by connecting the test information values at each latent
trait level
What the Test Information Curve (TIC) Shows
1. Which ability levels are measured accurately?
・Where information is high → the test is more precise at that ability level θ
→ Information is highest around \(\theta =
-2\)
→ The test is most accurate at the level of \(\theta = -2\)
・Where information is low → the test is less precise at that ability
level θ
→ Information is lowest at \(\theta =
4\)
→ The test is least accurate at the level of \(\theta = 4\)
・Example: If the TIC peaks around θ = 0
→ It can be said to be “an optimal test for measuring average
examinees.”
2. Reveals the design intent of the
test
・By observing where the TIC peaks, we can see what type of examinees
the test is designed for. If the TIC peaks around \(\theta = -2\) → The test is aimed at
relatively low-ability examinees.
・In general:
TIC Peak Location | Meaning | Target |
---|---|---|
Around θ = 0 | For average examinees | General ability tests |
θ > 0 (right side) | For high-ability examinees | Advanced/professional exams |
θ < 0 (left side) | For beginners/low-ability examinees | Basic skills assessments |
・In this case, the TIC peak is to the left
→ It indicates that the test is designed for below-average ability
examinees.
3. Indicates reliability (precision) as
well
・Higher information = smaller standard error (SE) in that range
・Relationship with standard error:
\[SE(\theta) = \frac{1}{\sqrt{I(\theta)}}\]
・In other words, when the amount of information is large, the
estimation of \(\theta\) is less prone
to fluctuation (= more reliable).
4. Setting a criterion for test
information: 0.5 or higher
・For example, if we set the criterion for test information at “0.5 or
higher,”
→ we can identify the range of ability levels θ that meet the standard
for measurement precision.
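As a sketch of how this criterion can be checked numerically with the tif() output (the θ grid below is arbitrary):

# Standard error of theta over a grid, from the test information function
theta_grid <- seq(-4, 4, by = 0.5)
info <- irtoys::tif(ip = ex1$est, x = theta_grid)$f
se <- 1 / sqrt(info)
data.frame(theta = theta_grid,
           info = round(info, 3),
           se = round(se, 3),
           meets_criterion = info >= 0.5) # TRUE where information >= 0.5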
Examination of Local Independence with irf() and cor()
In other words, the assumption is that “whether an examinee
answers correctly is determined solely by their ability.”
Examination of local independence is often conducted using the \(Q_3\) statistic.
The \(Q_3\) statistic is
obtained by subtracting the expected value from each observed item
response,
→ then calculating the correlations between the resulting residual
scores.
Specifically, the \(Q_3\)
statistic is defined as the correlation between residuals obtained by
subtracting the expected probability of a correct response (calculated
from the item response model) from the observed response.
The closer the absolute value is to 0, the more reasonable it is
to assume local independence between item responses.
・For example, in this case, the residual score \(d_1\) for item1 can be expressed as follows:
\[d_1 = u_1 - \hat{P}_1(\theta)\]
\(Q_3\) Statistic (Yen, 1984)
・The \(Q_3\) statistic (Yen, 1984) makes use of what corresponds to a partial correlation coefficient in regression analysis.
・A partial correlation coefficient is “the correlation between \(x\) and \(y\) after removing the effect of variable
\(z\).”
→ It is often used to examine the influence of spurious
correlations.
・Applied to the context of IRT:
→ “The correlation between \(y_{pi}\)
and \(y_{pj}\) after removing the
effect of \(\theta_p\).” → If this
value is 0, local independence holds.
Source: https://journals.sagepub.com/doi/10.1177/014662168400800201
・Estimate latent trait values with the mlebme() function
⇒ The mlebme() function is included in the irtoys package
head(mlebme(resp = LSAT[, 1:5], # Specify the test data
ip = ex1$est, # Assume the 2PL model for the data
method = "BM")) # Specify Bayes Modal (BM) estimation of latent trait values
est sem n
[1,] -1.895392 0.7954829 5
[2,] -1.895392 0.7954829 5
[3,] -1.895392 0.7954829 5
[4,] -1.479314 0.7960948 5
[5,] -1.479314 0.7960948 5
[6,] -1.479314 0.7960948 5
- ip: argument specifying the item parameters of each item in the test
- method: specifies which estimation method to use for estimating latent trait values
→ With method = "BM" (Bayes Modal), estimates can be obtained even for examinees with all-correct or all-incorrect responses (Bayesian estimation)
Calculate the expected probability of a correct response with the irf() function.
・The irf() function assumes the 2PLM
・Specify ex1$est → assign the estimated item parameters under the 2PLM as the item parameters for each item
・Specify theta.est[, 1] → assign the estimated latent trait values under the 2PLM as the latent trait estimates
⇒ Save the result as P

# Save the ability estimates (the earlier call only printed head())
theta.est <- mlebme(resp = LSAT[, 1:5], ip = ex1$est, method = "BM")

P <- irtoys::irf(ip = ex1$est, # Specify item parameters
x = theta.est[, 1]) # Specify latent trait values for each examinee
Variable | Description |
\(x\) | Latent trait value \(\theta\) (ability) of each examinee |
\(f\) | Estimated probability of a correct response |
Rows | Examinees (1,000 individuals) |
Columns | Items (item1–item5) |
Compute the residual scores \(d_{ij}\) (observed response minus expected probability of a correct response):

# Residual scores: observed responses minus model-implied probabilities
d <- LSAT[, 1:5] - P$f
head(d)

item1 item2 item3 item4 item5
1 -0.7700558 -0.4061064 -0.1917689 -0.4949268 -0.6915701
2 -0.7700558 -0.4061064 -0.1917689 -0.4949268 -0.6915701
3 -0.7700558 -0.4061064 -0.1917689 -0.4949268 -0.6915701
4 -0.8252089 -0.4801900 -0.2557741 -0.5661590 0.2533129
5 -0.8252089 -0.4801900 -0.2557741 -0.5661590 0.2533129
6 -0.8252089 -0.4801900 -0.2557741 -0.5661590 0.2533129
Check the residual score \(d_{11}\) for examinee 1, item1
・For example, the residual score \(d_1\) for examinee 1, item1 is
−0.7700558
・Confirm the response \(u_{ij} = u_{11}\) of examinee 1 to item1 and the corresponding residual:

LSAT[1, 1] # response of examinee 1 to item1
[1] 0
d[1, 1] # residual score d_11
[1] -0.7700558
Calculate the \(Q_3\) matrix as the correlations among the residual scores:

# Q3: correlation matrix of the residual scores
Q3 <- cor(d)
Q3

item1 item2 item3 item4 item5
item1 1.00000000 -0.04142824 -0.04101429 -0.064167975 -0.062538809
item2 -0.04142824 1.00000000 -0.11322248 -0.097060194 -0.029585197
item3 -0.04101429 -0.11322248 1.00000000 -0.092262203 -0.104216701
item4 -0.06416797 -0.09706019 -0.09226220 1.000000000 -0.003656669
item5 -0.06253881 -0.02958520 -0.10421670 -0.003656669 1.000000000
# Take the absolute values of Q3
Q3_abs <- abs(Q3)
# Replace diagonal elements (self-correlation = 1.0) with NA to exclude them
diag(Q3_abs) <- NA
# Count the number of absolute values ≥ 0.2
count_0.2_or_more <- sum(Q3_abs >= 0.2, na.rm = TRUE)
# Total number of elements (after excluding diagonals)
total_elements <- sum(!is.na(Q3_abs))
# Calculate the proportion
proportion_0.2_or_more <- count_0.2_or_more / total_elements
# Display the results
cat("Number of absolute values ≥ 0.2:", count_0.2_or_more, "\n")
Number of absolute values ≥ 0.2: 0
Proportion: 0
library(ggplot2)
library(reshape2)
# Convert Q3 matrix to absolute values
Q3_abs <- abs(Q3)
diag(Q3_abs) <- NA # Exclude diagonal elements
# Melt into long format
Q3_long <- melt(Q3_abs, varnames = c("Item1", "Item2"), value.name = "Q3_value")
# Add a flag for values greater than 0.2
Q3_long$Violation <- Q3_long$Q3_value > 0.2
# Create heatmap
ggplot(Q3_long, aes(x = Item1, y = Item2, fill = Violation)) +
geom_tile(color = "white") +
scale_fill_manual(values = c("white", "deeppink")) +
labs(
title = "Heatmap of Local Independence Violations (Pink = Violation)",
x = "Item",
y = "Item"
) +
theme_bw(base_family = "HiraKakuProN-W3") + # White background + Japanese font
theme(
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1), # Rotate labels
axis.text.y = element_text(hjust = 1)
)
Examination of Item Fit with the itf() Function
Item Fit evaluates “whether each item properly follows the
theoretical model (e.g., 2PL model, 3PL model).”
In IRT, it is also important to examine the degree of fit to the
item response model.
Here, we will use the itf()
function to examine the
item fit of item1.
resp: Argument specifying the test data
irtoys::itf(resp = LSAT[, 1:5], # Response data [examinees, items]
item = 1, # Specify examination of item fit for the 1st item
ip = ex1$est, # Item parameters estimated under the 2PL model
theta = theta.est[, 1]) # Estimated ability values (θ) for each examinee
Statistic DF P-value
10.0741811 6.0000000 0.1215627
What Item Fit Reveals
Whether the item “fits” the IRT
model or not
・Compare the “actual response patterns from the data” for each item
with the “theoretically predicted response patterns” from the IRT
model
→ If there is a discrepancy, it means the model assumptions do not fit
that item.
・Using items that do not fit the model may lead to inaccurate
estimation of the latent trait value \(θ\) (ability).
・Serves as a basis for checking item quality and deciding whether to
revise or remove inappropriate items.
・Provides clues for detecting bias (DIF: Differential Item
Functioning).
Phenomenon | Possible Explanation |
---|---|
Actual proportion correct is lower than the model prediction | Item wording is unclear / answer choices are confusing |
Abnormal behavior only for specific ability groups | Item is biased or misleading |
Proportion correct is close to random | Guessing strongly influences responses (c-parameter insufficient) |
Cognitively too complex | Cannot be explained by a single latent trait θ |
Indicator for Evaluating Fit: S–X²
Statistic (Orlando & Thissen’s Item Fit Index)
・A more precise goodness-of-fit test (especially used for 2PL and 3PL
models)
・Ability is divided into groups (typically deciles)
→ Differences between the model’s expected proportion correct and the
actual proportion correct are calculated for each group
→ Fit is then evaluated as a chi-square type statistic
= Although it follows a chi-square distribution, this is a method unique
to S–X²
→ It is distinct from the conventional chi-square goodness-of-fit
test
Interpretation of Results
・If the p-value is greater than the significance level (0.05):
・p-value
= 0.1215627
→ Fail to reject the null hypothesis
→ The fitted item response model is judged to adequately represent the
data.
・In the figure output by the itf function:
・The greater the discrepancy between the solid line and the circles,
the worse the model fit to the data.
Draw the Test Characteristic Curve (TCC) with the trf() and plot() functions.
・For each latent trait value, calculate and plot the expected raw score.

E <- trf(ip = ex1$est) # Assume the 2PL model for the data
# ip: argument specifying the item parameters included in the test
plot(x = E) # Expected raw scores at various latent trait values
What the Test Characteristic Curve (TCC) Shows
1. Relationship between examinees' ability θ and expected scores
・The solid line slopes upward
→ As examinees’ ability increases, their scores also tend to
increase
2. Test Difficulty and Distribution
Characteristics
・If examinees’ ability θ shows a sudden increase in scores around
0
→ The test is designed for average-ability examinees.
・If examinees’ ability θ shows a sudden increase in scores around
2–4
→ The test is designed for above-average examinees.
・If examinees’ ability θ shows a sudden increase in scores around −4 to
−2
→ The test is designed for below-average examinees.
3. Skewness and Limitations of the Score
Distribution
・Where the slope of the TCC is shallow
→ Score changes are less responsive in that ability range (= difficult
to distinguish examinees).
・If the expected total score flattens out near the upper or lower
ends
→ It becomes difficult to distinguish between high scorers and low
scorers.
head(mlebme(resp = LSAT[, 1:5], # Specify the test data
ip = ex1$est, # Assume the 2PL model for the data
method = "BM")) # Specify Bayes Modal (BM) estimation of latent trait values
est sem n
[1,] -1.895392 0.7954829 5
[2,] -1.895392 0.7954829 5
[3,] -1.895392 0.7954829 5
[4,] -1.479314 0.7960948 5
[5,] -1.479314 0.7960948 5
[6,] -1.479314 0.7960948 5
- resp: argument specifying the test data
- ip: argument specifying the item parameters of each item in the test
- method: specifies which estimation method to use for estimating latent trait values

Item | Meaning
---|---
est | Estimated ability value (θ)
sem | Standard error of measurement
n | Number of items used for estimation (here, 5 items for every examinee)