library(DT)
library(glpkAPI)
library(irt)
library(irtoys)
library(ltm)
library(plink)
library(plyr)
library(psych)
library(reactable)
library(tidyverse)

1. Overview of Item Response Theory (IRT)

The Aim of Item Response Theory

・By evaluating the quality of test items,
→ To better measure the latent ability of respondents

1.1 What Can Be Understood Through Item Response Theory

A Measurement Theory for Quantifying Academic Ability

  • Item Response Theory (IRT) is one branch of test theory.
  • It represents the global standard for the development and administration of large-scale ability tests.
  • Test theory has been systematically developed within the field of psychometrics.
  • IRT is a statistical (probabilistic) modeling approach that expresses responses to test items.
  • The fundamental framework was established by Lord (1952) and has since been advanced.
  • IRT mathematically models the probability that a given examinee correctly answers a given item.
  • It enables fair evaluation that is not influenced by the difficulty level of specific test items or by differences in ability distribution across groups of examinees.
  • Beyond testing, IRT is widely applied in psychology research for developing and improving psychological scales, with the aim of evaluating and enhancing measurement performance and efficiency.
  • For each test-taker, a unidimensional latent scale is assumed, and an individual’s latent trait value on this scale is estimated.
  • The latent trait value represents the examinee’s ability, θ.
  • At the same time, several types of item parameters are estimated for each item.
Item Parameter Symbol Description
Discrimination a The sensitivity of the response to the latent trait θ (slope of the ICC)
Difficulty b The ability level required to answer the item correctly (center of the ICC)

Item Characteristic Curve (ICC)

  • The ICC illustrates how the probability of a correct response, \(P(θ)\), changes with the examinee’s ability level \(θ\).
  • Horizontal axis: Latent trait value \(θ\) (the assumed “ability” of the examinee)
  • Vertical axis: Probability of a correct response
    = From the response data of examinees, an item characteristic curve is estimated for each test item.
    → This allows estimation of an examinee’s ability independent of the difficulty of specific items.

○ If the difficulty level of the item is the same・・・

  • The lower the examinee’s ability, the lower the probability of a correct response.
  • The higher the examinee’s ability, the higher the probability of a correct response.
    ・This relationship is estimated using logistic regression analysis.

  • By using the ICC, the following two indicators become clear:
Difficulty Parameter

→ The difficulty of an item can be determined by identifying the point on the horizontal axis where the ICC reaches a 50% probability of a correct response.

Discrimination Parameter

→ The steeper the slope of the curve, the better the item distinguishes between examinees with low ability and those with high ability.

  • Let us now consider the item characteristic curves (ICCs) for five items (test questions) shown below.

  • Item 1 (black): A relatively easy question, since even examinees with low ability have a high probability of answering it correctly.
  • Item 3 (green): A relatively difficult question, since examinees need to possess a fairly high level of ability to answer it correctly.
  • When the items are ordered from most difficult to easiest, the sequence is: 3 → 2 → 4 → 5 → 1.
  • The item with the steepest slope is Item 3.
    → Item 3 best discriminates between examinees with low ability and those with high ability.

Features of Item Response Theory
・Test results composed of different items can be compared with each other.
・Test results obtained from different groups can be compared with each other.

Advantages of Item Response Theory

1. Individualized Precision Assessment

  • Traditional precision evaluation: “This test has an error margin of ±X points” (i.e., an overall evaluation).
  • IRT-based precision evaluation: “For the high-ability group, the error margin is ±Y points; for the low-ability group, the error margin is ±Z points” (i.e., individualized precision evaluation).

2. Ability-Specific Evaluation of Test Measurement Precision

  • Enables rational selection of truly useful items.
  • Allows creation of shortened test versions while maintaining item precision.
  • Makes it possible to control test difficulty and score distributions in advance.
  • Ensures reliability of scores even when examinees solve different sets of items.
  • Enables item selection tailored to each examinee’s ability and response pattern.
    → This makes possible the development of Computer-Adaptive Testing (CAT).
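  • As a rough illustration of the item-selection idea behind CAT, here is a minimal base-R sketch (the item bank, its parameter values, and the provisional ability estimate are all hypothetical and not taken from any specific CAT system): at each step, the item with the largest information at the current ability estimate is presented next.

# Hypothetical item bank: discrimination a and difficulty b for six items
bank <- data.frame(item = paste0("Q", 1:6),
                   a = c(0.8, 1.2, 1.0, 1.5, 0.7, 1.1),
                   b = c(-1.5, -0.5, 0.0, 0.5, 1.0, 1.8))

# 2PL probability of a correct response (with the customary D = 1.7 scaling constant)
p2pl <- function(theta, a, b) 1 / (1 + exp(-1.7 * a * (theta - b)))

# Item information at ability theta: (D * a)^2 * P * (1 - P)
item_info <- function(theta, a, b) {
  P <- p2pl(theta, a, b)
  (1.7 * a)^2 * P * (1 - P)
}

theta_hat <- 0.3                       # current provisional ability estimate
info <- item_info(theta_hat, bank$a, bank$b)
bank$item[which.max(info)]             # the item a CAT would present next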

1.2 Specific Analytical Methods

◎ Estimation of Difficulty and Discrimination

  • Construct a response matrix in which correct answers are coded as 1 and incorrect answers as 0.
  • For each item, fit a logistic curve that best approximates the observed response pattern.
  • In doing so, use the criterion that is most probable statistically (i.e., likelihood).
    → Likelihood will be explained later.
  • Both the examinee’s ability and the item’s difficulty can be estimated.
    → This makes it possible to distinguish between good and poor items.

◎ “Equating”

  • The process of standardizing scores or scales across different tests (i.e., putting them on the same “yardstick”).
  • Based on item difficulty and discrimination parameters, estimate the examinee’s latent trait value (ability θ).
    → The origin and unit of ability θ can be set arbitrarily.
    → A reference scale is defined, and other scales are then placed on it (i.e., equating).
    → This enables calculation of scores with a common meaning across different tests.
    → Even when test content differs, abilities can be compared on the same scale.

Example of Equating:

  • Ensuring that scores from the 2024 version of a test and the 2025 version can be compared on the same scale.
    → To achieve this, it is necessary to use anchor items or similar procedures to align the scales.

◎ Building an Item Bank

  • Record item parameters estimated by IRT along with the content of each item.
    → Later, the same items can be reused, allowing the selection of item sets tailored to the examinee’s ability level.
  • This collection of item data is referred to as an item bank.
  • From the item parameters, the test information function can be derived.
    → This shows which ability levels a test is most appropriate for.
    → Enables item-level evaluation using indices that are independent of the examinee population.
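  • As a sketch of how the test information function can be computed directly from stored item parameters (the parameter values below are hypothetical; under the 2PL model the test information is simply the sum of the item informations):

# Hypothetical item parameters taken from an item bank
ip <- data.frame(a = c(0.9, 1.2, 0.7), b = c(-1.0, 0.0, 1.2))

# Test information under the 2PL model: I(theta) = sum over items of (D * a)^2 * P * (1 - P)
test_info <- function(theta, a, b, D = 1.7) {
  sapply(theta, function(t) {
    P <- 1 / (1 + exp(-D * a * (t - b)))
    sum((D * a)^2 * P * (1 - P))
  })
}

theta <- seq(-4, 4, by = 0.1)
plot(theta, test_info(theta, ip$a, ip$b), type = "l",
     xlab = "Ability θ", ylab = "Test information",
     main = "Test information function (hypothetical items)")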

◎ Using Adaptive IRT Tests

→ The set of items presented changes depending on the examinee’s responses.
→ Ensures that the most suitable items for estimating an examinee’s ability are administered.
→ Allows estimation of ability with fewer items and in less time.

◎ Statistically Evaluating the Appropriateness of Test Items

・Main methods:
1. Correlation analysis with test scores
2. Calculation of a discrimination index
3. Application of Item Response Theory

  • Divide examinees into upper and lower groups based on performance.
  • Calculate the selection rate for each response option.
  • Check whether each option appropriately distinguishes between the upper and lower groups.
    → Revise options if necessary.
  • Evaluate internal consistency of the test using correlation coefficients and factor analysis.
  • Use IRT to identify items with low discrimination.

2. Why Use Item Response Theory?

  • Because classical test theory has two major limitations.
Two Problems Faced by Classical Test Theory
Issue of sample dependence
Issue of item dependence

2.1 Issue of Sample Dependence

  • Student X at School A has a standardized score of 50.
  • Student Y at School B has a standardized score of 60.
  • Can we really conclude that Student Y, with the higher standardized score, has greater ability?
  • Not necessarily so.

The reason:

  • Because the ability levels of the examinee groups differ
  • If the levels of School A and School B are the same
    → Student Y with a standardized score of 60 can be said to have higher ability.
  • However, if School A has a higher overall level than School B
    → It is possible that Student X with a score of 50 actually has higher ability than Student Y with a score of 60.
    → The evaluation of test difficulty depends on the level of the sample (group) that took the test.
  • When the group’s ability level is high → individual test scores are high → the overall group average is high.
  • When the group’s ability level is low → individual test scores are low → the overall group average is low.
  • This phenomenon is called sample dependence of test scores.
    → Care is needed when comparing across different groups.

Solution
・To compare School A and School B, estimate the item parameters of each using an IRT model.
→ Use common items (or common examinees) to equate the scales.
→ This allows examinee ability (θ) to be compared on the same scale.

However, in practice・・・
  • Since the tests at Schools A and B are different, there are usually no common items or common examinees.
    → Equating is theoretically impossible.
Lesson:
  • If future comparisons are anticipated, anchor items should be embedded from the beginning across tests.

2.2 The Problem of Item Dependence

  • The average score on the test administered at JHS A last year was 50 points.
  • The average score on the test administered at JHS A this year was 70 points.
  • In this case, can we say that the academic ability of the students at JHS A has improved?
  • Not necessarily so

The reason:

・Because the difficulty of the test items differs

  • If the test items were identical last year and this year
    → then students’ ability improved by 20 points compared to last year.
  • However, if last year’s test was more difficult than this year’s,
    → the lower average score on last year’s test compared to this year’s may simply reflect higher item difficulty,
    → meaning the students from last year could actually have had higher ability.
  • This issue is called item dependence in test scores.
    → It implies that scores from different tests cannot be directly compared.
  • Test scores of examinees are also influenced by the difficulty of the individual items (questions) included in the test.

Solution

Equating

・Separately estimate the difficulty of the test items and the ability of the examinees, and evaluate them on a common scale.

Step 1: Fit an IRT model to each year’s test and estimate the item parameters.

・Example: In a 2PL model, estimate the discrimination parameter (a) and the difficulty parameter (b) for each item.

Step 2: Use common items (anchor items) to “equate” the scales.

・Include at least a few identical items across last year’s and this year’s tests.
・This makes it possible to estimate transformation coefficients (A and B) to align the scales.
・(Main methods: Stocking–Lord method, Haebara method, etc.)
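・The Stocking–Lord and Haebara methods require dedicated routines (e.g., in the plink package). As a simpler illustration of how the coefficients A and B can be obtained from anchor items, the following is a minimal base-R sketch of the mean/sigma method; all anchor-item difficulty values and θ values here are hypothetical.

# Hypothetical difficulty estimates of the same anchor items on the two scales
b_last_year <- c(-0.8, 0.1, 0.9, 1.4)   # reference scale (last year)
b_this_year <- c(-1.1, -0.2, 0.6, 1.0)  # scale to be transformed (this year)

# Mean/sigma method: theta* = A * theta + B places this year's scale on last year's
A <- sd(b_last_year) / sd(b_this_year)
B <- mean(b_last_year) - A * mean(b_this_year)
round(c(A = A, B = B), 3)

# Transform this year's ability estimates (hypothetical) onto last year's scale
theta_this_year <- c(-0.5, 0.3, 1.2)
theta_equated   <- A * theta_this_year + B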

Step 3: Compare the means of ability (θ)

・On the equated scale, compare the ability distributions (θ) of students from last year and this year.
・If there is a significant difference in the means or distributions of θ, then it can be concluded that “academic ability has improved.”

Summary
・Even if raw test scores are converted into standardized scores (deviation values), examinees’ abilities cannot be accurately compared in the following cases:
(1) The ability levels of examinee groups differ across schools
(2) Because the tests were administered at different times, the difficulty of the items differs

2.3 The Problem of Not Being Able to Distinguish Between Difficulty and Examinee Ability

  • Under classical test theory, when the average score on a test administered at School A is 70 points…
    → It is unclear why the average was 70 points.
  • Was it because the students’ academic ability was high?
  • Or was it because the test items were too easy?
  • However, by using Item Response Theory (IRT), we can gain clues as to why the average score was 70 points.
  • Average student ability = 2.2
  • Average item difficulty = 0.7

Score Representation and Interpretation under Different Theories

Classical Test Theory: Raw score; Standardized score (deviation value)
Item Response Theory: Latent trait value \(θ\) (theta) = examinee’s ability; Item characteristics (item difficulty and discrimination)

When item characteristics and examinee ability are confounded:
Classical Test Theory: Cannot distinguish between item difficulty and examinee ability
Item Response Theory: Can represent item characteristics and examinee ability separately
  • Latent trait value \(θ\) is a score independent of the ability distribution of the examinee group.
  • Item characteristics refer to an item’s difficulty and discrimination.
  • Based on these item characteristics (difficulty and discrimination), we estimate the latent trait value \(θ\) (i.e., the examinee’s ability).

3. Assumptions Required for Item Response Theory (IRT)

  • To conduct analyses based on IRT, three assumptions must be satisfied.
Assumption Description
1. Local Independence The assumption that there are no unnecessary dependencies among items
2. Monotonicity of the Probability of a Correct Response The assumption that item difficulty is appropriate (the probability of a correct response increases with ability)
3. Goodness of Fit The assumption that the data fit the item response model

3.1 Local Independence

  • In IRT, local independence means the following assumption:

“Given that an examinee’s ability value θ is fixed, responses to different items are statistically independent.”

  • For example, suppose two examinees with the same ability θ respond to items Q1 and Q2 on a test.
    When local independence holds:
  • Whether or not a student answers Q1 correctly does not affect whether they answer Q2 correctly.
  • In other words, as long as correct/incorrect responses can be explained by ability θ, there are no extra dependencies among items.
    When local independence is violated:
  • A student who answers Q1 correctly is more likely to also answer Q2 correctly.
  • This indicates the presence of common factors beyond θ (e.g., similar wording of the questions or requiring knowledge of the same unit).

Difference Between Probability and Likelihood

Probability  How plausible a future (not yet observed) outcome is
Likelihood   How plausible an underlying hypothesis is, given an outcome that has already been observed

Probability

  • Suppose there is a bag containing 3 red balls and 2 white balls, for a total of 5.
  • One ball is drawn from the bag, its color is observed, and then it is returned to the bag.
  • This process is repeated three times.

Result:
  • 1st draw … Red
  • 2nd draw … White
  • 3rd draw … Red
Question:
  • What is the probability that the result will be “Red, White, Red”?
  • It can be calculated using the multiplication rule:

\[\frac{3}{5} × \frac{2}{5} × \frac{3}{5} = \frac{18}{125}\]

  • However, to apply the multiplication rule, we must assume

local independence, meaning that each draw is independent and does not influence the others.

Key Point
・The assumption of local independence is a fundamental prerequisite for using IRT.
・If this assumption is not satisfied, calculating scores using IRT becomes meaningless.

Examples Where the Assumption of Local Independence Is Not Met:

Example 1
  • Choose the correct form of the verb:
    She ___ to the store every morning.
    A. go
    B. goes
    C. going
    D. gone
Example 2
  • Fill in the blank with the correct form:
    He ___ his homework before dinner every day
    A. do
    B. does
    C. doing
    D. done

  • Both Question 1 and Question 2 ask about the third-person singular present tense verb form (subject–verb agreement), and thus depend on the same grammatical knowledge.
    → Examinees who answer Question 1 correctly are also highly likely to answer Question 2 correctly.
    → This is not explained solely by ability θ, but rather by a substantive relationship between the two items (dependence on a common skill).
    → In this case, conditional dependence arises between the items, and the assumption of local independence is violated.

Likelihood

  • Suppose there is a bag containing a total of 5 balls, red and white combined, though the exact numbers of each are unknown.
  • One ball is drawn from the bag, its color observed, and then returned to the bag.
  • This process is repeated three times.
Result:
  • 1st draw: Red
  • 2nd draw: White
  • 3rd draw: Red
Question:
  • Given this result, what is the most likely original number of red and white balls in the bag?

  • Unlike probability, which predicts what will happen in the future, here the balls have already been drawn.
  • From the observed outcome (Red, White, Red), we want to estimate how many red and white balls were originally in the unseen bag.
  • Estimating the magnitude of the underlying possibility behind what has already happened is called likelihood.
  • So, what numbers of red and white balls are most plausible?
  • Let us calculate the likelihood for each possible composition of balls.
Number of Balls Likelihood
1 Red, 4 White \(\frac{1}{5}×\frac{4}{5}×\frac{1}{5}= \frac{4}{125}\)
2 Red, 3 White \(\frac{2}{5}×\frac{3}{5}×\frac{2}{5}= \frac{12}{125}\)
3 Red, 2 White \(\frac{3}{5}×\frac{2}{5}×\frac{3}{5}= \frac{18}{125}\)
4 Red, 1 White \(\frac{4}{5}×\frac{1}{5}×\frac{4}{5}= \frac{16}{125}\)
  • Based on the observed result “Red, White, Red,”
    → the most plausible case is when the bag contains 3 red balls and 2 white balls.

Key Point
・In IRT analysis, the estimation method used is Maximum Likelihood Estimation (MLE).
・MLE = Maximum Likelihood Estimation.
・For example, suppose a student’s responses are: “correct on Item 1, incorrect on Item 2, correct on Item 3.”
→ In the bag analogy, this corresponds to “1st draw: red, 2nd draw: white, 3rd draw: red.”
・The test results are observed (just as the colors of the drawn balls are known).
・At this point, we estimate the student’s ability θ (analogous to estimating the unseen mix of red and white balls in the bag).
・The contents of the bag are not directly observable.
・Likewise, the student’s ability θ is unobservable.
・In IRT, the student’s ability θ is estimated in this way.  

MLE in IRT: Illustrated with the Bag Example
・Observed data: “Red, White, Red”
・Hypothesized parameters: possible combinations of red and white balls in the bag (e.g., 1 Red & 4 White, 2 Red & 3 White, 3 Red & 2 White, 4 Red & 1 White)
・ For each hypothesis (parameter set), calculate the probability (likelihood) of obtaining the observed outcome.
👉 Adopt the parameter (ball combination) that yields the highest likelihood.
・In other words:
→ From among the candidate parameter sets, choose the one that best explains the observed data.
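・The bag example can be checked directly in R. This minimal sketch evaluates the likelihood of the observed outcome “Red, White, Red” under each candidate composition and picks the one with the largest likelihood.

# Candidate numbers of red balls out of 5 (the rest are white)
red   <- 1:4
p_red <- red / 5                          # probability of drawing red under each hypothesis

# Likelihood of "Red, White, Red" (draws with replacement, assumed independent)
likelihood <- p_red * (1 - p_red) * p_red
data.frame(red = red, white = 5 - red, likelihood = likelihood)

red[which.max(likelihood)]                # 3 red balls (and 2 white) maximizes the likelihood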

3.2 Monotonicity of the Probability of a Correct Response

  • Monotonicity means that
    → As an examinee’s ability (θ) increases, the probability of answering an item correctly also increases.

  • Why monotonicity is important:
  • If the basic assumption of the IRT model (the higher the θ, the higher the probability of a correct response) is violated, the accuracy of ability estimation decreases.
  • The reliability and validity of the test may be undermined, making fair evaluation impossible.
  • If such an item is mistakenly regarded as a “good item” and continues to be used, there is a risk of unfairly underestimating high-ability examinees.

3.3 Model Fit

  • In IRT, model fit refers to an important index for determining how well the estimated model explains the observed data.

Main purposes of checking model fit:

  1. Selection of the model (1PL, 2PL, 3PL, etc.)
  2. Detection of misfitting items
  3. Verification of the model’s reliability and predictive accuracy

Levels of model fit in IRT:

Level of Fit Overview Main Methods
Overall Model Whether the entire model fits the data AIC, BIC, M2, RMSEA
Item Level Whether each item is consistent with the model Residual analysis, S-X² test, infit/outfit
Person Level Whether each examinee’s responses follow the model Person-fit statistics (e.g., l_z)

4. Quantifying Academic Ability and Its Issues

4.1 Is “academic ability” = “test score”?

\[\text{Test Score} \neq \text{Academic Ability}\]

  • The belief that “a test score is academic ability, and that it is absolute” is mistaken. The human mind is a black box.
    → Academic ability cannot be measured directly.
    It can only be measured indirectly.
  • Test scores always contain a degree of uncertainty.
    → For example, even if a student scores 90 points on a test,
    → that is merely an “estimate” of ability.
    → Proper “estimation” is required
    → Maximum Likelihood Estimation (the estimation method used in IRT).

  • Weight and height can be measured directly
    → The same result will be obtained regardless of the measuring instrument used.

4.2 Difference Between “Scoring Criteria” and “Scoring”

  • Scoring Criteria: The standards set before grading to classify responses.
  • Scoring: The act of comparing an examinee’s responses with the scoring criteria and classifying them as correct, partially correct, or incorrect.
Scoring Criteria May reflect the test administrator’s judgment
Scoring Must not involve subjectivity
  • Scoring criteria may include the test administrator’s decisions.
    → For example: which answers are considered “correct,” which are awarded partial credit, and which are classified as “incorrect.”
  • Scoring itself must not include the test administrator’s subjectivity.
    → It must be conducted strictly according to the predetermined criteria.
    → Regardless of who scores, when, or where, the scoring results must be the same.
    → What is obtained through scoring is the “score”.

4.3 Problems with Conventional Test Scores

How conventional test scores are calculated

1. Number-correct score: the number of items answered correctly
2. Weighted number-correct score: the total score based on the weights (point values) assigned to each item
3. Standardization (Z-scores) / deviation scores: a numerical representation of one’s relative standing within the same group

1. Problems with the number-correct score
  • If all items have the same level of difficulty and importance, then using the number-correct score poses no issue.
  • However, when items of differing difficulty are included:
    → “Answering a difficult item correctly”
    = “Answering an easy item correctly”
    → treating them as equal is problematic
2. Problems with the weighted number-correct score
  • How to assign weights (i.e., point values) to individual items is critically important.
  • Depending on how weights are assigned, an examinee’s total score may vary.
  • It is difficult to assign weights “based on rational and objective grounds”.
  • Ranking items strictly by difficulty (or ease) is not straightforward.
  • There is no rational method for determining weights.
  • Nor is it possible to sufficiently verify whether the weights assigned are valid.
  • For example, suppose a student scores 90 out of 100 points…

  • This is the total of the assigned points for the items answered correctly.
  • Even if the scoring is done according to the criteria and the total is calculated accurately,
    → the weights assigned to individual items are not determined “based on rational and objective grounds”.
    → The point value for a given item inevitably contains an element of ambiguity, such as “perhaps 90 points would be about right.”
  • If the degree of this “ambiguity” could be determined objectively,
    → then test scores could be interpreted accordingly.
3. Problems with Standardization (Z-scores) and Deviation Scores
  • A deviation score represents how far above or below the mean a score is, using the standard deviation as the unit.
  • If two groups have the same mean score → then deviation scores can be used to compare students’ performance across those groups.
  • However, if the groups are different, their means will also differ.
    → Deviation scores from different groups cannot be compared.
  • For details on standardization (Z-scores) and deviation scores, see the following sections:
    2.3 How to Read the Standard Normal Distribution Table
    2.4 偏差値 (Hensachi: deviation value)

Problems with Conventional Test Scores


1. When tests contain different items, number-correct scores and weighted number-correct scores cannot be compared.
2. In number-correct scores, the difficulty and importance of items are not reflected.
3. In weighted number-correct scores, there is no clear, rational, and objective way to determine weights.
4. Deviation scores from different groups cannot be compared.

Contributions of IRT

Regarding the problems of conventional test scores
・When tests contain different items, number-correct scores and weighted number-correct scores cannot be compared.
→ With IRT, test scores can be compared even when the tests contain different items
(However, the assumptions of “local independence” and “unidimensionality” must be satisfied.)
・In number-correct scores, the difficulty and importance of items are not reflected.
→ IRT allows item difficulty to be analyzed and quantified in a rational way
・(You can determine the “difficulty” of items.)

→ IRT shows how much correct/incorrect responses differ depending on examinees’ ability
(You can determine the “discrimination” or “importance” of items.)

・In weighted number-correct scores, there is no clear, rational, or objective way to assign weights.
→ In IRT analysis, it is not necessary to decide weights either before or after the test.
→ Instead, IRT uses item parameters in place of weights.
・Deviation scores from different groups cannot be compared.
→ With IRT, results from different groups can be compared.

5. Method of Estimating Academic Ability Using IRT I

Features of Item Response Theory
・Test results composed of different items can be compared with each other.
・Test results obtained from different groups can be compared with each other.
  • In this section, we explain why IRT makes these two types of comparisons possible.

5.1 The “Measuring Scale” Prepared in IRT

  • Unlike traditional methods of measuring academic ability, IRT provides a “scale with units” as a tool for measurement.

The Scale for Measuring Academic Ability … Ability θ

  • In IRT, the ability being measured is called “ability θ” = “latent trait value θ” (θ is pronounced theta).
  1. Test developers do not need to assign weights (point values) to items.
  2. There is no need to calculate the total number of correct answers.
  3. The formula for deviation scores, \(\frac{\text{Score} - \text{Mean}}{\text{Standard Deviation}} \times 10 + 50\), is not required.
  • How ability θ is estimated:
Estimate the most likely “ability θ” from the examinee’s response pattern of correct and incorrect answers.

Data Structure Used in IRT

Meaning of \(u_{ij}\)

・Indicates whether examinee \(i\) answered item \(j\) correctly or incorrectly.
・The starting point of test theory is “response data” (1 = correct, 0 = incorrect).
・Suppose there are 3 examinees: A = 1st, B = 2nd, C = 3rd → the examinee index is represented by \(i\).
・Suppose there are 3 items: Q1 = 1st, Q2 = 2nd, Q3 = 3rd → the item index is represented by \(j\).

  • \(u_{ij}\) represents the response outcome (correct/incorrect) of the i-th examinee on the j-th item.

  • For example, \(u_{12}\) denotes the response of the 1st examinee to the 2nd item.
    → If that response is correct, the response data take the value \(u_{12} = 1\).
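  • For concreteness, here is a minimal sketch of such a response matrix in R (the 0/1 values are made up for illustration):

# Response matrix u[i, j]: rows = examinees (A, B, C), columns = items (Q1, Q2, Q3)
u <- matrix(c(1, 1, 0,    # examinee A
              1, 0, 0,    # examinee B
              1, 1, 1),   # examinee C
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("Q1", "Q2", "Q3")))
u["A", "Q2"]  # u_12: the 1st examinee's response to the 2nd item (1 = correct)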

Guttman Scale

  • The estimation of “ability θ” in IRT is similar to a vision test.
  • The figure on the left below shows the Landolt C chart used in vision testing.
  • Participants are presented with “C”-shaped stimuli (Landolt rings) of various sizes.
    → They are asked to indicate the direction of the gap in the “C” (up, down, left, right).
    → What matters is not “how many were answered correctly,” but “how small the C can be while still being answered correctly.”
    → No matter how many large “C”s a participant answers correctly, that does not mean they have good eyesight.
    → The smaller the “C” that can be correctly identified, the better the eyesight is judged (i.e., estimated).
The same principle applies in IRT

→ No matter how many easy items are answered correctly, the value of ability θ cannot be considered high
→ The more difficult the items that are answered correctly, the higher the ability θ is judged (i.e., estimated)

  • The table summarizing this test data is shown in the figure on the left above.
  • This is called the Guttman scale.
  • The vision test was taken by 12 participants, labeled A, B, C, …, L.
  • The “visibility” of the “C” stimuli ranged from 0.1 (largest ring, easiest to see) to 1.5 (smallest ring, hardest to see).
    → This “visibility” serves as the stimulus.
  • Participant A answered all items correctly (= 1) ⇒ Good eyesight.
  • Participant B answered all items correctly (= 1) except for “visibility = 1.5.”
  • Participant L answered all items incorrectly (= 0) ⇒ Poor eyesight.
  • The Guttman scale constructs a measurement scale that converts a physical quantity (stimulus size) into a psychological quantity (eyesight).
  • It describes the relationship between the correct response rate for each stimulus and the participant’s “true eyesight.”
    → This provides an index of how well the stimulus can discriminate levels of eyesight (= discrimination power).
  • In IRT, the probability of a correct response is modeled as a function of the latent trait value \(θ\) (here, eyesight).
  • Item response data are treated as binary (0 = incorrect, 1 = correct), and factor analysis is applied to extract a single factor.
    → For each examinee, the latent trait value \(θ\) (here, “eyesight”) is estimated. The latent trait value has indeterminacy in origin and scale.
    → By setting mean = 0 and standard deviation = 1, it can be treated like a standardized score (z-score).
    → This makes interpretation of results easier.

Similarities Between Vision Tests and IRT

  • In academic tests, item difficulty is determined by the correct response rate.
  • However, if the correct response rate is used directly, the same item will appear to differ in difficulty depending on the ability level of the group:
  • In a high-ability group → correct response rate increases (the item seems easier).
  • In a low-ability group → correct response rate decreases (the item seems harder).
  • In contrast, vision tests use the Guttman scale, where “ring size” is predetermined to correspond to eyesight levels.

Therefore, testing can be conducted consistently for both high-ability and low-ability groups.

Method of Analysis Indicator for Measuring Item Difficulty
Vision Test Ring size (Guttman scale)
IRT Item characteristics

5.2 Item Characteristics

  • A method of determining item difficulty regardless of the level of the test-taking group … Item Characteristics

What are Item Characteristics?


The probability of correctly answering an item according to differences in ability θ

Item Characteristic Curve (ICC)

  • For individuals at which ability level is this item difficult? How sharply does the probability of a correct answer change? (Discrimination)
    ・For a given item (question), the curve represents the probability of a correct response according to the examinee’s ability θ.
    ・The probability of a correct answer is calculated for each value of the latent trait θ and plotted
    ・Horizontal axis … Ability θ (latent trait value)
    ・Vertical axis … Probability of a correct answer (0 to 1)

  • A visualization of how the probability of correctly answering an item changes depending on the examinee’s ability θ
    ・As examinees move from low to high ability θ → the probability of a correct response gradually increases

Focusing on the slope of the curve:

・Examinees with low ability θ
→ the probability of a correct response rises gradually
・Examinees with medium ability θ
→ the probability of a correct response rises sharply
→ The slope is steepest
→ the increase in the probability of a correct response for a one-unit rise in ability θ is greatest here (this is what discrimination captures)
・Examinees with high ability θ
→ the probability of a correct response rises gradually

5.3 IRT Model

5.3.1 What Are Item Characteristics?


Item characteristics are obtained by conducting IRT analysis on test results

  • The item characteristics derived from the analysis are unique to each individual item → The shape of the Item Characteristic Curve differs from item to item

  • In IRT analysis, various models are used, but here we introduce the most commonly applied two-parameter logistic model.

  • The shape of the Item Characteristic Curve (ICC) in the two-parameter logistic model is determined by the item’s difficulty and discrimination.

  • Because the shape of the ICC is determined by these two parameters, the model is called the two-parameter logistic model.

  • Difficulty and Discrimination = Item parameters

Notation Details
\(P(\theta)\) The probability that an examinee with ability θ correctly answers a given item
\(\theta\) Examinee’s ability, assumed to follow a normal distribution with mean 0 and standard deviation 1
\(a\) Discrimination parameter
\(b\) Difficulty parameter = location parameter
\(j\) Item number
Two-Parameter Logistic Model (2PLM)
\[P_j(\theta) = \frac{1}{1 + \exp[-1.7a_j(\theta-b_j)]}\]

Visualizing the latent trait value \(θ\) (e.g., eyesight) using the Item Characteristic Curve (ICC)

  • Horizontal axis … Examinee’s ability \(θ\)
  • Vertical axis … Probability of a correct response (0–1)
  • Discrimination parameter is the same for the three items = 1.2
  • Difficulty parameters differ across the three items: –0.5 to 1.0
  • The larger the difficulty parameter, the further the ICC shifts to the right
  • An ICC shifted to the right = a higher ability \(θ\) is required to answer correctly
    → A more difficult item
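  • The curves described here can be reproduced with a short base-R sketch (a = 1.2 as in the text; the three difficulty values span −0.5 to 1.0, with the intermediate value chosen arbitrarily; the D = 1.7 scaling follows the 2PLM formula above).

# 2PL item characteristic curve with the D = 1.7 scaling constant
icc <- function(theta, a, b) 1 / (1 + exp(-1.7 * a * (theta - b)))

theta <- seq(-4, 4, by = 0.05)
a <- 1.2                    # common discrimination (from the text)
b <- c(-0.5, 0.25, 1.0)     # difficulties spanning -0.5 to 1.0 (middle value assumed)

plot(theta, icc(theta, a, b[1]), type = "l", col = 1, ylim = c(0, 1),
     xlab = "Ability θ", ylab = "Probability of a correct response")
lines(theta, icc(theta, a, b[2]), col = 2)
lines(theta, icc(theta, a, b[3]), col = 3)
abline(h = 0.5, lty = 2)    # each ICC crosses P = 0.5 exactly at θ = b
legend("topleft", legend = paste("b =", b), col = 1:3, lty = 1)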

5.3.2 Changes in the ICC Due to Differences in Difficulty

  • Fixing the Probability of a Correct Response at 50%
  • Keeping the discrimination parameter fixed (\(a = 1.2\)), examine how the ICC changes as the difficulty parameter varies (–0.5 to 1.0)
  • Draw a dashed line at the point where the probability of a correct response equals 50%
  • To achieve a 50% probability of a correct response for a difficult item (\(b = 1.0\)), a high ability level (\(θ = 1.0\)) is required
  • To achieve a 50% probability of a correct response for an easy item (\(b = -1.0\)), a lower ability level (\(θ = -1.0\)) is sufficient

  • The difficulty parameter is the ability level \(θ\) at which the probability of a correct response is 50%.
  • For a given item, the ability level \(θ\) at which the proportion of examinees who answer correctly and incorrectly is evenly split represents the difficulty of that item.

5.3.3 Appropriate Magnitude of the Difficulty Parameter \(b\)

Recommended range of difficulty \(b\) −3 to 3
Range where estimation is most stable −2 to 2

Appropriate Magnitude of the Difficulty Parameter \(b\)

−3 〜 3

(Source:芝祐順編『項目反応理論』p.34)

5.3.4 Changes in the Item Characteristic Curve Due to Differences in Discrimination

  • Fix the difficulty parameter (\(b = 0\)) and examine how the ICC changes as the discrimination parameter varies (0.5–1.5).
  • The three items have the same difficulty = the ability level \(θ\) at which the probability of a correct response is 50% is the same.
    → They all result in a correct response at the same ability level \(θ\).
    → At \(θ = 0\), the probability of a correct response = 0.5 (50%), so the curves intersect at a single point.
  • Looking closely at the vicinity of this intersection, the steepness (slope) of the ICC differs.
  • A steeper slope
    → the influence of ability \(θ\) on the probability of a correct response is larger (i.e., the difference in probability of a correct response is greater).

  • Probability of a correct response on Item 1 when ability θ = −4・・・0.416
  • Probability of a correct response on Item 1 when ability θ = 4 ・・・0.584
    Difference in probabilities of correct response: 0.584 − 0.416 = 0.168

  • Probability of a correct response for Item 3 when ability θ = −4・・・ 0.265
  • Probability of a correct response for Item 3 when ability θ = 4 ・・・0.735
    Difference in probability of a correct response: 0.735 − 0.265 = 0.47

  • When ability θ is around the difficulty level of 0, the same difference in θ results in a larger difference in the probability of a correct response for items with higher discrimination (i.e., Item 3).
    → Items with higher discrimination (i.e., Item 3) can more clearly distinguish between those who are likely to answer incorrectly and those who are likely to answer correctly.

Key Point about Discrimination ・Discrimination serves as a criterion for judging how sensitively an item can distinguish between correct and incorrect responders based on differences in ability θ.
・However, this applies only when ability θ is around the difficulty level of 0.

5.3.5 What Constitutes an Appropriate Level of Discrimination

The figure below shows Item Characteristic Curves for three cases where discrimination is 0, −0.5, and 10.

When discrimination = 0
  • Item 2 (discrimination = 0)
  • The Item Characteristic Curve is flat.
    → Regardless of whether ability is high or low, the probability of a correct response is the same
    → The test has no meaning.
    → Such an item should not be included in the test.
When discrimination is negative
  • Item 1 (discrimination = −0.5)
  • The Item Characteristic Curve slopes downward to the right.
    → The probability of a correct response is higher for those with low ability, and lower for those with high ability.
    → The test has no meaning.
    → Such an item should not be included in the test.
When discrimination is excessively large
  • Item 3 (discrimination = 10)
  • The slope of the Item Characteristic Curve around 0 increases explosively.
    → It cannot be categorically labeled as “inappropriate.”
    → However, it is certainly “atypical” compared with other items.
    → It should be carefully examined within the overall context of the test to determine whether it is appropriate.

Appropriate Magnitude of Discrimination (a)

0.3 〜 2.0

(Source:芝祐順編『項目反応理論』p.34)

5.3.6 The “Ruler” Used in IRT

  • Measuring height or weight is different from estimating academic ability.
  • Height … because we have a “ruler” to measure height, there is no need to estimate it.
  • With a height scale that has both an [origin (= 0)] and [units (cm)], we can measure it directly.

  • Academic ability cannot be measured directly.
  • Both the “ability” to be measured and the “ruler” itself are hypothetical constructs.
    → The content of ability is left unspecified (or cannot be clearly defined).
    → We use a hypothetical ruler to estimate the hypothetical ability θ.
    → For this, a “ruler (= scale)” is required.
    → A “ruler” needs both an origin and units.

Method for Estimating Ability Using IRT ・Assume that ability is distributed with a mean of 0 and a standard deviation of 1

・Measure that ability with a “ruler” that has an origin of 0 and tick marks from −3 to 3

6. Method II: Estimating Ability with IRT

  • Estimation of ability θ by IRT … θ is estimated as the most likely ability level based on each examinee’s response pattern of correct and incorrect answers.
  • In this section, we use the two-parameter logistic model for the analysis.

Data Preparation

  • We use data included in the ltm package.
  • The data are based on the scored responses to the Law School Admission Test.
  • Number of examinees: 1000
  • Number of items: 5 items from Section IV
  • A correct response is coded as 1, an incorrect response as 0
Variable Name Details
ID Examinee ID
Item1 Response to Item 1 (0 or 1)
Item2 Response to Item 2 (0 or 1)
Item3 Response to Item 3 (0 or 1)
Item4 Response to Item 4 (0 or 1)
Item5 Response to Item 5 (0 or 1)
SS Total score from Items 1–5
class Classification of examinees based on raw score
  • class: Ranges from 1 to 5, with higher values indicating higher raw score groups.

  • This dataset is analyzed using IRT to evaluate the characteristics of both items and the test as a whole.

  • Load the necessary packages for the analysis.

library(ltm)
  • Load the data
data(LSAT)
  • Check the variable names contained in the data frame LSAT
names(LSAT)
[1] "Item 1" "Item 2" "Item 3" "Item 4" "Item 5"
  • Rename Item 1 to item1 (remove the half-width space from the variable name)
LSAT <- LSAT |> 
  rename("item1" = "Item 1",
    "item2" = "Item 2",
    "item3" = "Item 3",
    "item4" = "Item 4",
    "item5" = "Item 5")
  • Create the variable total, which represents the sum of item1 through item5
LSAT <- LSAT |> 
  dplyr::mutate(total = rowSums(dplyr::across(item1:item5), 
    na.rm = TRUE)) # Specify to ignore missing values (NA) when calculating the sum
DT::datatable(LSAT)

6.0 Choosing Analyses According to Purpose

Purpose Appropriate Analysis
Examine item difficulty and discrimination ltm() and ICC plots
Evaluate the effectiveness of the test across the θ-space ICC or Test Information Function (TIF) analysis
Assign ability scores to individuals mlebme() or factor.scores()
  • In this section, we proceed with the analysis following these steps.
  • Primarily, we will use the irtoys package.
  • Step 3 uses the psych package.
  • Step 11 uses the mlebme() function from the irtoys package.
Analysis Method Package::Function Used Content of Analysis
1. Calculation of correct answer rates colMeans() Understand item difficulty by average correct answer rate
2. Calculation of item–total (I–T) correlation cor() Check whether each item aligns with the total score (an indicator of discrimination)
3. Examination of unidimensionality psych::fa.parallel() Check whether all items are measured by a single latent trait
4. Estimation of item parameters irtoys::est() Numerically estimate each item’s discrimination a and difficulty b
5. Item Characteristic Curve (ICC) irtoys::irf() Visualize how the probability of a correct response changes with ability θ
6. Estimation of latent trait values (item side) ltm::ltm() “Item-side” estimation of θ (model fit, estimation of a and b)
7. Test Information Curve (TIC) irtoys::tif() Determine at which ability levels this test measures most precisely
8. Examination of local independence irtoys::irf(), base::cor() Check whether there are extra dependencies among items (violation of model assumptions)
9. Examination of item fit irtoys::itf() Assess how well each item fits the IRT model
10. Test Characteristic Curve (TCC) irtoys::trf() How many points can a person with ability θ expect to score?
11. Estimation of latent trait values (examinee side) irtoys::mlebme() “Examinee-side” estimation of θ (how capable each person is)
  • Note 1: The “I–T correlation discrimination” is only a simple indicator of discrimination within classical test theory and is not the same as the IRT discrimination (\(a\) parameter) in the strict sense.

  • Note 2: When estimating latent trait values, keep in mind that there are θ values from both the “item side” and the “examinee side.”

6.1 Calculation of Correct Answer Rates: colMeans()

Understanding item difficulty through average correct answer rates

  • Calculation of correct answer rates
  • Use the colMeans() function to calculate the correct response rate (Correct Response Rate: crr).
crr <- base::colMeans(x = LSAT[, 1:5],
  na.rm = TRUE)
crr
item1 item2 item3 item4 item5 
0.924 0.709 0.553 0.763 0.870 
  • The result crr obtained here is a named numeric vector.
    → Since it is not very convenient to use, convert this vector into a data frame.
df_crr <- data.frame(      # Specify the name of the data frame (here, df_crr)  
  item = names(crr),       # Specify the variable name (here, item)  
  seikai = as.numeric(crr) # Specify the variable name (here, seikai)  
)
  • Check the data frame
df_crr
   item seikai
1 item1  0.924
2 item2  0.709
3 item3  0.553
4 item4  0.763
5 item5  0.870
  • Display items sorted by lower correct response rates
  • Specify the factor order of item based on descending values of seikai
df_crr$item <- factor(df_crr$item, 
  levels = df_crr$item[order(df_crr$seikai)])
ggplot(df_crr, aes(x = seikai, y = item)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_text(aes(label = round(seikai, 2)),  # Round to 2 decimal places
            hjust = 1.2, size = 6) +        # Display inside the bars
  labs(
    title = "Correct Answer Rate for Each Item",
    x = "Item",
    y = "Correct Answer Rate"
  ) +
  theme_minimal() +
  theme_bw(base_family = "HiraKakuProN-W3") # Prevent garbled characters 

  • It is clear that item1 has the highest correct answer rate (92%), while item3 has the lowest (55%)

Key Points for Calculating Correct Answer Rates ・Check whether there are items with extremely high or low correct answer rates
・If there are items with extremely high/low rates → Problematic
・If there are no such items
→ No problem
→ In this case, there are no items with extreme rates
→ No problem
→ Proceed to the next analysis

6.2 Calculation of I–T Correlations: cor()

Checking whether each item aligns with the total score (an indicator of discrimination)

  • Use the cor() function to calculate the correlation between the item responses (item1–item5) and the total score total.
it <- cor(x = LSAT[, 1:5],
  y = LSAT[, 6],
  use = "pairwise.complete.obs")
it
           [,1]
item1 0.3620104
item2 0.5667721
item3 0.6184398
item4 0.5344183
item5 0.4353664
  • The result it obtained here is a “single-column matrix with row names.”
    → Since this is inconvenient to use, convert the matrix into a data frame and add the row names as a column for item names.
# Convert the matrix into a data frame
df_it <- as.data.frame(it)

# Add the row names as a column of item names
df_it$item <- rownames(df_it)

# Rename the columns for clarity (optional)
colnames(df_it) <- c("correlation", "item")
DT::datatable(df_it)
  • Display the items sorted by lower correlation coefficients
  • Specify the factor order based on the values of correlation
df_it$item <- factor(df_it$item, 
  levels = df_it$item[order(df_it$correlation)])
ggplot(df_it, aes(x = item, y = correlation)) +
  geom_bar(stat = "identity", fill = "orange") +
  geom_text(aes(label = round(correlation, 3)), 
    vjust = -0.5, size = 4) +
  ylim(0, 0.7) +
  labs(
    title = "Item–Total Correlation",
x = "Item",
y = "Correlation Coefficient"
) +
theme_minimal() +
theme_bw(base_family = "HiraKakuProN-W3") # Prevent garbled characters

  • The correlations between the item responses (item1–item5) and the total score (total) range from 0.36 to 0.62.

Key Points in Calculating I–T Correlations
・Check whether correlations are observed between responses to each item (item1–item5) and the total score (total)

I–T Correlation Value Evaluation Treatment of Item
~ 0.2 Extremely low (caution) Consider excluding
0.2–0.3 Somewhat low Re-examine depending on content
0.3–0.4 Reasonable level Pending / context-dependent
0.4 and above Good (desirable) Acceptable to retain

→ Here, all items show correlations above 0.2
→ No problem
→ Proceed to the next analysis without excluding items

6.3 Examination of Unidimensionality: psych::fa.parallel()

Checking whether all items are measured by a single latent trait

  • This is often examined using methods from factor analysis.
  • Here, we draw a parallel analysis scree plot using psych::fa.parallel() (Horn, 1965).
  • If the first eigenvalue is large and the subsequent ones drop sharply
    → strong unidimensionality
library(psych)

data <- read.csv("data/IRT_LSAT.csv")
item_data <- data[, -1] # Exclude the first column from the analysis  
# Examination of unidimensionality (principal components + parallel analysis)
fa.parallel(item_data, 
  fa = "pc", 
  n.iter = 100) # Specify 100 simulations

Parallel analysis suggests that the number of factors =  NA  and the number of components =  1 

Meaning of the X- and Y-axes

Axis What it Represents Interpretation Point
X-axis Factor number (principal component number) Factors 1–5
Y-axis Eigenvalue (amount of variance explained) Greater than 1 → Potentially meaningful factor

→ Eigenvalue
=> In principal component analysis (PCA) or factor analysis, this is the value that indicates “the amount of total variance in the data explained by that factor.”
→ It represents how much variance each factor explains.
→ The larger the eigenvalue, the more information (i.e., variation, structure) the factor captures.
→ A factor with a large eigenvalue can summarize and explain the scores of many items.

Key Point・・・focus on the blue line → eigenvalues from the actual data

  • How many factors have the blue line above the red line?
  • Only Factor 1’s blue line (actual data) lies far above the red line (random data).
  • For Factors 2–5, the blue line falls below the red line.

Meaning of the line colors

Line Color/Type What It Represents Interpretation
Blue line Eigenvalues from actual data Strength of factors derived from real item correlations
Red dashed line (thin) Simulated Average eigenvalues from completely random data
Red dashed line (thicker) Resampled Average eigenvalues from resampled versions of the original data
Black line Eigenvalue = 1 If greater than 1 → potentially meaningful factor
Analysis Results
・Only Factor 1 from the actual data (blue line) has an eigenvalue larger than that of the random data (red line).
→ Unidimensionality can be assumed to hold.
→ It is appropriate to assume a one-factor IRT model such as 2PL or Rasch.
→ Safe to proceed with IRT model analysis (e.g., ltm() or mirt()).
  • The three data checks (correct answer rate, I–T correlation, unidimensionality) have been passed
    → Proceed to estimation of item parameters (discrimination a and difficulty b) and latent trait values (ability θ).

6.4 Estimation of Item Parameters

Estimating the values of “Discrimination (a)” and “Difficulty (b)” for each item

・Here, the two-parameter logistic model (2PL: generalized logistic model) is used for analysis.

Purpose of the Two-Parameter Logistic Model
To estimate the ability level θ at which a test-taker’s response pattern is most likely to occur, based on the item parameters of the test questions.

• It is generally assumed that the latent trait values follow a normal distribution with a mean of 0 and variance of 1 (standard normal distribution) for estimation.

Methods of Estimating Item Parameters

  • The ability θ is estimated solely from the response pattern of each test-taker.
  • The item parameters are estimated from the response patterns of all test-takers.
Target of Estimation Information Required for Estimation
Ability The response pattern of the examinee whose ability is to be estimated
Item Parameters Response patterns of all test-takers

・Item parameters are derived through IRT analysis based on the correct/incorrect response data of examinees who took the test containing the item.

  • Item parameters (discrimination and difficulty) and examinees’ ability levels θ are estimated simultaneously.
    → Using the joint maximum likelihood estimation (JMLE) method

joint maximum likelihood estimation (JMLE)

What can and cannot be done regarding the estimation of item parameters

What we cannot do:

・Create a problem in advance with predetermined values such as “discrimination = ○○, difficulty = △△”
← Because item parameters cannot be obtained without conducting IRT analysis using actual test results  

What we can do:

・Develop a large number of items, administer them to examinees, and then conduct IRT analysis
→ Accumulate verification and evaluation of each individual item

Differences Between the 1PL Model and the 2PL Model

1PL Model (Rasch Model)
・Model formula: \(P(\text{Correct}) = \frac{1}{1+e^{-(\theta-b)}}\)
・Discrimination a: same for all items (fixed)
・Difficulty b: estimated for each item
・Number of parameters: number of items (only b) + 1 (fixed a)
・Focus of analysis: relationship between ability and item difficulty b

2PL Model (Generalized Logistic Model)
・Model formula: \(P(\text{Correct}) = \frac{1}{1+e^{-a(\theta-b)}}\)
・Discrimination a: estimated for each item
・Difficulty b: estimated for each item
・Number of parameters: number of items × 2 (both a and b estimated)
・Focus of analysis: relationship among ability, difficulty b, and discrimination a (a more flexible model)
  • The 1PL model (Rasch model) assumes that “all items have equal discrimination.”
  • The 2PL model relaxes this assumption and reflects actual differences in item discrimination.
    → In the 1PL model, only difficulty (b parameter) is estimated.
    → In the 2PL model, both difficulty (b parameter) and discrimination (a parameter) are estimated.
    → The 2PL model thus reveals how easily each item differentiates among levels of ability.
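  • To see this difference in practice, one possible check is to fit both models to the LSAT items prepared above and compare them. This is a sketch: rasch() and ltm() are the 1PL and 2PL fitting functions of the ltm package, and anova() compares the two fits via a likelihood-ratio test together with AIC/BIC.

fit_1pl <- ltm::rasch(LSAT[, 1:5])        # 1PL: common discrimination for all items
fit_2pl <- ltm::ltm(LSAT[, 1:5] ~ z1)     # 2PL: item-specific discrimination

coef(fit_2pl)            # difficulty and discrimination for each item
anova(fit_1pl, fit_2pl)  # does estimating a separate a per item improve the fit?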

Two-Parameter Logistic Model (2PL)

  • The 2PL model is more flexible than the 1PL Rasch model.
    → However, because the number of parameters increases, the stability of estimation decreases somewhat.
  • Formula of the 2PL model:
\[P_j(\theta) = \frac{1}{1 + \exp[-1.7a_j(\theta-b_j)]}\]
Notation Description
\(P(\theta)\) Probability that an examinee with ability θ answers an item correctly
\(\theta\) Examinee’s ability, assumed to follow a normal distribution with mean 0 and standard deviation 1
\(a\) Discrimination parameter
\(b\) Difficulty parameter = location parameter
\(j\) Item number
  • Estimate item parameters using the est() function in the irtoys package (with engine = "ltm")
    → Save the result as ex1
ex1 <- est(resp = LSAT[, 1:5], # Argument specifying the test data
  model = "2PL",               # Assume the 2PL model
  engine = "ltm")              # Specify estimation using the ltm package

ex1
$est
           [,1]       [,2] [,3]
item1 0.8253717 -3.3597333    0
item2 0.7229498 -1.3696501    0
item3 0.8904752 -0.2798981    0
item4 0.6885500 -1.8659193    0
item5 0.6574511 -3.1235746    0

$se
          [,1]       [,2] [,3]
[1,] 0.2580641 0.86694584    0
[2,] 0.1867055 0.30733661    0
[3,] 0.2326171 0.09966721    0
[4,] 0.1851659 0.43412010    0
[5,] 0.2100050 0.86998187    0

$vcm
$vcm[[1]]
           [,1]      [,2]
[1,] 0.06659708 0.2202370
[2,] 0.22023698 0.7515951

$vcm[[2]]
           [,1]       [,2]
[1,] 0.03485894 0.05385658
[2,] 0.05385658 0.09445579

$vcm[[3]]
           [,1]        [,2]
[1,] 0.05411071 0.012637572
[2,] 0.01263757 0.009933553

$vcm[[4]]
           [,1]       [,2]
[1,] 0.03428641 0.07741096
[2,] 0.07741096 0.18846026

$vcm[[5]]
           [,1]      [,2]
[1,] 0.04410211 0.1799518
[2,] 0.17995180 0.7568684

6.5 Item Characteristic Curve (ICC)

Visualizing how the probability of a correct response changes with ability θ

ex1 <- est(resp = LSAT[, 1:5], # Argument specifying the test data
  model = "2PL",               # Assume the 2PL model
  engine = "ltm")              # Specify estimation using the ltm package
  • Here, the discrimination (a) and difficulty (b) parameters for 5 items are estimated.
  • Let’s plot the characteristic curves for these 5 items.
P1 <- irf(ip = ex1$est) # irf() function to calculate the probability of correct response  
plot(x = P1,        # Specify the result estimated by irf() as the argument x
  co = NA,          # Specify ICC colors / draw ICCs in different colors for each item
  label = TRUE)     # Attach item numbers to each ICC

abline(v = 0, lty = 2) # Draw a vertical dotted line at x = 0

  • Horizontal axis・・・Latent trait value \(θ\) (Ability)
  • Vertical axis・・・Probability of a correct response
  • The numbers displayed here represent both the item names (item1, …, item5) and the column numbers (1, …, 5).
    → They can be used as they are, without modification.

What Can Be Understood from the Item Characteristic Curve (ICC)
The curve for item3 is located around the center of the figure, with relatively high discrimination (a) → a strong item
● item1 and item5 have curves shifted to the left → items that are too easy
The steeper the curve, the better it distinguishes differences in ability (item3 is a typical example)

  • For item3, the curve rises sharply around θ ≈ −1
    → As ability increases, the probability of a correct response increases steeply
    → High discrimination
    → Such items are effective in accurately differentiating examinees with average ability

Interpretation of Results

Interpretation of Discrimination a

Appropriate Range of Discrimination a

0.3 〜 2.0

(Source:芝祐順編『項目反応理論』p.34)

  • Visualizing Discrimination
# Required packages
library(irt)
library(ggplot2)
library(dplyr)

# Model estimation (repeated)
ex1 <- est(resp = LSAT[, 1:5], 
           model = "2PL",        
           engine = "ltm")

# Extract discrimination (1st column)
disc <- ex1$est[, 1]

# Convert to data frame and reorder (ascending order!)
disc_df <- data.frame(
  Item = names(disc),
  Discrimination = disc
) %>%
  arrange(Discrimination) %>%   # Sort in ascending order
  mutate(Item = factor(Item, levels = Item))  # Apply ordering

# Plot
ggplot(disc_df, aes(x = Item, y = Discrimination)) +
  geom_bar(stat = "identity", fill = "darkgreen") +
  geom_text(aes(label = round(Discrimination, 2)), vjust = -0.5, size = 4) +
  labs(title = "Items Ordered by Increasing Discrimination", 
       x = "Item", y = "Discrimination") +
  theme_bw(base_family = "HiraKakuProN-W3") # White background + Japanese font (prevents garbled labels)

→ Range of discrimination a ≈ 0.66 – 0.89 → No issues

Interpretation of Difficulty b

Appropriate Range of Difficulty b

-3 〜 3

(Source:芝祐順編『項目反応理論』p.34)

  • Visualizing Difficulty
# Required packages
library(irtoys)   # provides est()
library(ggplot2)
library(dplyr)

# Model estimation (repeated)
ex1 <- est(resp = LSAT[, 1:5], 
           model = "2PL",       
           engine = "ltm")

# Extract difficulty (2nd column)
difficulty <- ex1$est[, 2]

# Get item names and convert to data frame
diff_df <- data.frame(
  Item = rownames(ex1$est),
  Difficulty = difficulty
) %>%
  arrange(Difficulty) %>%   # Sort in ascending order
  mutate(Item = factor(Item, levels = Item))  # Fix the order

# Plot (bar chart)
ggplot(diff_df, aes(x = Item, y = Difficulty)) +
  geom_bar(stat = "identity", fill = "magenta") +
  geom_text(aes(label = round(Difficulty, 2)), vjust = -0.5, size = 4) +
  labs(title = "Items Ordered by Increasing Difficulty", 
       x = "Item", y = "Difficulty") +
  theme_bw(base_family = "HiraKakuProN-W3") # White background + Japanese font (prevents garbled labels)

- item1 and item5 are too easy (difficulty b ≤ −3)
- item4, item2, and item3 have moderate difficulty
→ Range of difficulty b ≈ −3.36 to −0.28
→ Overall, the item set is too easy

Interpretation of Standard Errors

Appropriate Range of Standard Error (Discrimination \(a\))

0.1 〜 0.4

(Source:芝祐順編『項目反応理論』p.34)

  • Visualizing Standard error of Discrimination \(a\)
# Required packages
library(irtoys)   # provides est()
library(ggplot2)
library(dplyr)

# Model estimation
ex1 <- est(resp = LSAT[, 1:5], 
           model = "2PL",       
           engine = "ltm")

# Extract standard errors (SE) of discrimination
disc_se <- ex1$se[, 1]  # First column = SE of discrimination

# Get item names
item_names <- rownames(ex1$est)

# Convert to data frame and reorder in ascending order
disc_se_df <- data.frame(
  Item = item_names,
  Discrimination_SE = disc_se
) %>%
  arrange(Discrimination_SE) %>%   # Sort ascending
  mutate(Item = factor(Item, levels = Item))  # Reflect order in the plot

# Plot (bar chart: left = small → right = large)
ggplot(disc_se_df, aes(x = Item, y = Discrimination_SE)) +
  geom_bar(stat = "identity", fill = "purple") +
  geom_text(aes(label = round(Discrimination_SE, 3)), 
            vjust = -0.5, size = 4) +
  labs(title = "Items Ordered by Increasing Standard Error (SE) of Discrimination", 
       x = "Item", 
       y = "Standard Error of Discrimination") +
  theme_bw(base_family = "HiraKakuProN-W3") # White background + Japanese font (prevents garbled labels)

  • Range of standard errors for discrimination a: 0.185 – 0.258
    → All within the recommended range
    → No issues

  • Visualizing Standard Error of Difficulty \(b\)

Appropriate Range of Standard error (Difficulty \(b\))

0.2 〜 0.5

(Source:芝祐順編『項目反応理論』p.34)

# Required packages
library(irtoys)   # provides est()
library(ggplot2)
library(dplyr)

# Model estimation
ex1 <- est(resp = LSAT[, 1:5], 
           model = "2PL",       
           engine = "ltm")

# Extract standard errors (SE) of difficulty (2nd column)
diff_se <- ex1$se[, 2]

# Get item names
item_names <- rownames(ex1$est)

# Convert to data frame and reorder in ascending order
diff_se_df <- data.frame(
  Item = item_names,
  Difficulty_SE = diff_se
) %>%
  arrange(Difficulty_SE) %>%   # Sort ascending
  mutate(Item = factor(Item, levels = Item))  # Fix the order

# Plot
ggplot(diff_se_df, aes(x = Item, y = Difficulty_SE)) +
  geom_bar(stat = "identity", fill = "red") +
  geom_text(aes(label = round(Difficulty_SE, 3)), 
            vjust = -0.5, size = 4) +
  labs(title = "Items Ordered by Increasing Standard Error (SE) of Difficulty", 
       x = "Item", 
       y = "Standard Error of Difficulty") +
  theme_bw(base_family = "HiraKakuProN-W3") # White background + Japanese font (prevents garbled labels)

  • Estimation of difficulty for item1 and item5 is unstable (SE ≈ 0.87 for both, above the recommended upper limit of 0.5)
  • The other items have acceptable standard errors (0.10 – 0.43)

Summary of Analysis Results and Suggestions for Improvement  

Summary of Results

・Overall, the test is biased toward easy items.
・item3 has moderate difficulty and high discrimination, making it an excellent item from an IRT perspective.
・item1 and item5 are too easy and have relatively low discrimination.
→ Depending on the purpose of the test, they should be reconsidered or excluded.

Suggestions for Improvement

・From the perspective of difficulty, item1 and item5 (difficulty ≤ −3) should be candidates for deletion.
・Adding more difficult items would improve balance.
・Including items with high discrimination (a > 1.2) would enhance the precision of ability differentiation.
・Adding items with difficulty around b ≈ 0 to +2 would strengthen discrimination among high-ability examinees (see the sketch below).
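To illustrate the last two suggestions, the sketch below overlays the ICC of a hypothetical new item with a = 1.2 and b = 1.0 (illustrative values, not estimated from the data) on the estimated ICC of item3, using the same logistic form as the probabilities elsewhere in this document.

# Estimated ICC of item3 vs. a hypothetical harder, more discriminating item
theta   <- seq(-4, 4, by = 0.1)
p_item3 <- 1 / (1 + exp(-0.8904752 * (theta + 0.2798981)))  # a, b from ex1$est
p_new   <- 1 / (1 + exp(-1.2 * (theta - 1.0)))              # hypothetical item (a = 1.2, b = 1.0)

plot(theta, p_item3, type = "l", ylim = c(0, 1),
     xlab = "theta", ylab = "Probability of a correct response")
lines(theta, p_new, lty = 2)
legend("topleft", lty = c(1, 2),
       legend = c("item3 (estimated)", "hypothetical item (a = 1.2, b = 1.0)"))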

6.6 Estimation of Latent Trait Values (Item Side)

Estimating θ from the “item side” (model fit, estimation of a and b)

  • In an IRT model, the observed response data (correct/incorrect) are used
    → to estimate examinees’ ability θ (latent trait values).
  • One of the most common methods is Maximum Likelihood Estimation (MLE).
Basic Idea of Maximum Likelihood Estimation
・Examine candidate values of ability θ from −3 to +3 in increments of 0.1
・Identify the θ at which the “likelihood” of the observed response pattern occurring is maximized
・That “most likely θ” becomes the estimated ability (latent trait value)

Probability Calculation for Estimating Ability θ

  • Probability calculations are performed to estimate ability θ.
  • Use the data included in the ltm package.
    • Scoring results of responses to the Law School Admission Test (LSAT)
  • Number of examinees: 1000
  • Number of items: 5 items from Section IV
  • Scoring: 1 = correct, 0 = incorrect

Observed Response Data (Correct/Incorrect)

DT::datatable(LSAT)
  • Calculate the item-by-item probability of a correct response for θ (ability level) values ranging from −3 to 3 in increments of 0.1.
library(ltm)
# IRT analysis using the 2PL model (limited to 5 items)
mod <- ltm(LSAT[, 1:5] ~ z1, IRT.param = TRUE)

# Specify the range of ability values θ
theta_vals <- seq(-3, 3, by = 0.1)

# Extract item parameters (discrimination a, difficulty b) from the model
coefs <- coef(mod)  # Returned values: column 1 = Dffclt (b), column 2 = Discr (a)

# Create a data frame to store the probabilities of correct responses
icc_df <- data.frame(theta = theta_vals)

# Calculate the probability of a correct response for each item (2PL model formula)
for (i in 1:nrow(coefs)) {
  b <- coefs[i, 1]  # Difficulty b
  a <- coefs[i, 2]  # Discrimination a
  P_theta <- 1 / (1 + exp(-a * (theta_vals - b)))  # 2PL ICC formula (logistic metric; ltm's parameters do not include the 1.7 constant)
  icc_df[[paste0("Item", i)]] <- round(P_theta, 4)
}
  • The probabilities of correct responses for each item have been calculated for θ (ability) values ranging from −3 to 3 in increments of 0.1.

Probabilities of Correct Responses for item1 through item5

DT::datatable(icc_df)
  • The probabilities of correct responses for each item can be checked for every 0.1 increment of θ (ability).
  • These results can then be visualized as item characteristic curves (ICCs).
plot(mod, type = "ICC", items = 1:5)

# Add a vertical dotted line (θ = −3)
abline(v = -3, col = "red", lty = 2, lwd = 1)

  • When θ = −3, the value for item1 is 0.5737
    → The black line (item1) indicates Probability = 0.57.
  • When θ = −3, the value for item3 is 0.0815
    → The green line (item3) indicates Probability = 0.0815.  

For example, calculate the probability of this student’s response pattern (Correct, Correct, Correct, Correct, Incorrect) for each ability level θ.

| item1 | item2 | item3 | item4 | item5 |
|---|---|---|---|---|
| Correct | Correct | Correct | Correct | Incorrect |

Probability of responses for the lowest-ability student (θ = −3)

  • Calculate the probability that this student answers item1 through item4 correctly and item5 incorrectly.
  • The probabilities of this student answering each of the five items correctly are as follows:

| θ (ability) | item1 | item2 | item3 | item4 | item5 |
|---|---|---|---|---|---|
| −3 | 0.5737 | 0.2353 | 0.0815 | 0.3141 | 0.5203 |
  • Assumption: Whether the student answers each item correctly is independent of the others (= local independence).
  • Now, let’s calculate the probability that this student answers item1 through item4 correctly and item5 incorrectly.

\[P(\text{pattern}) = P_1(\theta) \times P_2(\theta) \times P_3(\theta) \times P_4(\theta) \times \{1 - P_5(\theta)\}\]
\[= 0.5737 \times 0.2353 \times 0.0815 \times 0.3141 \times (1 - 0.5203) = 0.001658\ (= 0.17\%)\]

・The probability that the lowest-ability student (θ = −3) answers 4 out of 5 items correctly and misses only 1 is 0.17%
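The same product can be checked in one line of R, using the correct-response probabilities from the table above.

# Likelihood of the pattern (1, 1, 1, 1, 0) at theta = -3
p <- c(0.5737, 0.2353, 0.0815, 0.3141, 0.5203)  # P(correct) for item1-item5 at theta = -3
prod(p[1:4]) * (1 - p[5])
# roughly 0.00166 (about 0.17%), matching the calculation above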

Probability of responses for an average-ability student (θ = 0)

  • Calculate the probability that this student answers item1 through item4 correctly and item5 incorrectly.
  • The probabilities of this student answering each of the five items correctly are as follows:

| θ (ability) | item1 | item2 | item3 | item4 | item5 |
|---|---|---|---|---|---|
| 0 | 0.9412 | 0.7291 | 0.562 | 0.7833 | 0.8863 |
  • Assumption: Whether the student answers each item correctly is independent of the others (= local independence).
  • Let’s calculate the probability that this student answers item1 through item4 correctly and item5 incorrectly.

\[P(\text{pattern}) = P_1(\theta) \times P_2(\theta) \times P_3(\theta) \times P_4(\theta) \times \{1 - P_5(\theta)\}\]
\[= 0.9412 \times 0.7291 \times 0.562 \times 0.7833 \times (1 - 0.8863) = 0.034343\ (= 3.4\%)\]

・The probability that an average-ability student (θ = 0) answers 4 out of 5 items correctly and misses only 1 is 3.4%
Probability of responses for the highest-ability student (θ = 3)

  • Calculate the probability that this student answers item1 through item4 correctly and item5 incorrectly.
  • The probabilities of this student answering each of the five items correctly are as follows:

| θ (ability) | item1 | item2 | item3 | item4 | item5 |
|---|---|---|---|---|---|
| 3 | 0.9948 | 0.9593 | 0.9489 | 0.9661 | 0.9825 |
  • Assumption: Whether the student answers each item correctly is independent of the others (= local independence).
  • Let’s calculate the probability that this student answers item1 through item4 correctly and item5 incorrectly.

\[P(\text{pattern}) = P_1(\theta) \times P_2(\theta) \times P_3(\theta) \times P_4(\theta) \times \{1 - P_5(\theta)\}\]
\[= 0.9948 \times 0.9593 \times 0.9489 \times 0.9661 \times (1 - 0.9825) = 0.015338\ (= 1.5\%)\]

・The probability that the highest-ability student (θ = 3) answers 4 out of 5 items correctly and misses only 1 is 1.5%
・Based on the “probabilities of correct responses for item1 through item5” shown above, we can calculate, for each ability value θ (from −3 to 3), the probability of this student’s response pattern (Correct, Correct, Correct, Correct, Incorrect).

Using the item parameters estimated by the ltm() function, the likelihood computed at each ability θ represents the probability of observing this response pattern.

# IRT analysis using the 2PL model (limited to 5 items)
mod <- ltm(LSAT[, 1:5] ~ z1, IRT.param = TRUE)

# Specify the range of ability values θ
theta_vals <- seq(-3, 3, by = 0.1)

# Extract item parameters (a, b) from the model
coefs <- coef(mod)  # col1 = b (difficulty), col2 = a (discrimination)

# Response pattern (1 = correct, 0 = incorrect)
response_pattern <- c(1, 1, 1, 1, 0)

# Initialize: list to store results
result_list <- list()

# Calculate for each θ
for (j in seq_along(theta_vals)) {
  theta <- theta_vals[j]
  item_probs <- numeric(length(response_pattern))
  
  for (i in 1:length(response_pattern)) {
    b <- coefs[i, 1]
    a <- coefs[i, 2]
    P <- 1 / (1 + exp(-a * (theta - b)))
    
    # Save P if correct, 1 - P if incorrect
    item_probs[i] <- ifelse(response_pattern[i] == 1, P, 1 - P)
  }
  
  # Likelihood = product of the probabilities across the 5 items
  likelihood <- prod(item_probs)
  
  # Record results in data frame format
  result_list[[j]] <- data.frame(
    theta = theta,
    Item1 = round(item_probs[1], 4),
    Item2 = round(item_probs[2], 4),
    Item3 = round(item_probs[3], 4),
    Item4 = round(item_probs[4], 4),
    Item5 = round(item_probs[5], 4),
    likelihood = round(likelihood, 6)
  )
}

# Combine results for all θ values and display
result_df <- do.call(rbind, result_list)
DT::datatable(result_df)
  • The likelihood is highest when ability θ = 0.5 (likelihood = 0.035957).

| theta | item1 | item2 | item3 | item4 | item5 | likelihood |
|---|---|---|---|---|---|---|
| 0.5 | 0.9603 | 0.7944 | 0.667 | 0.836 | 0.0845 | 0.035957 |

  (Each item column shows the probability of the observed response: the probability of a correct answer for item1–item4 and the probability of an incorrect answer, \(1 - P_5(\theta)\), for item5.)
  • This student’s response pattern (correct on item1, correct on item2, correct on item3, correct on item4, incorrect on item5)
    → is most likely to occur when the student’s ability θ = 0.5 (= 0.035957)
    → Therefore, the student’s ability is estimated as 0.5.
Basic Idea of Maximum Likelihood Estimation ・Examine candidate values of ability θ from −3 to 3 in increments of 0.1
・Identify the θ at which the likelihood of that response pattern occurring is the highest
・That “most likely θ” becomes the estimated ability (latent trait value)

Cases of “All Correct” or “All Incorrect”

  • No matter how large the ability θ is, the probability of a correct response never reaches 1
    → It only approaches 1 asymptotically.
  • If a student answers all items correctly, maximum likelihood estimation cannot estimate ability θ.
  • If a student answers all items incorrectly, maximum likelihood estimation likewise cannot estimate ability θ.
  • In practice, “all correct” or “all incorrect” cases can occur.
    → Therefore, if using maximum likelihood estimation to estimate ability θ,
    → it is necessary to predefine the following based on the ability distribution of the examinee group:
    ・The value of θ for the all-correct case
    ・The value of θ for the all-incorrect case
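The Bayesian workaround described above is what the mlebme() function with method = "BM" (used later in this document) provides. As a minimal sketch, assuming the item parameters ex1$est estimated earlier and that mlebme() accepts a small 0/1 response matrix:

# Bayes Modal (BM) estimation returns finite ability estimates even for
# all-correct and all-incorrect response patterns
extreme <- rbind(all_correct   = rep(1, 5),
                 all_incorrect = rep(0, 5))
irtoys::mlebme(resp = extreme, ip = ex1$est, method = "BM")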

6.7 Test Information Curve (TIC)

6.7.1 What the Test Information Curve Shows

  • At which ability levels does this test measure most accurately?
  • In IRT, the estimation precision of a test can be calculated.
    = It shows the degree of “error” included at a specified ability level θ.
  • The measurement precision of a test is also referred to as “test information.”
  • Test information can be calculated for each ability level θ.

Test Information Function ・For the two-parameter logistic (2PL) model, the test information is expressed by the following formula:

\[I(\theta) = 1.7^2\sum_{j=1}^na_j^2P_j(\theta)Q_j(\theta)\]

| Variable | Description |
|---|---|
| \(I(\theta)\) | Test information |
| 1.7 | Scaling constant |
| \(a_j\) | Discrimination of item \(j\) |
| \(P_j(\theta)\) | Probability of a correct response to item \(j\) at ability level θ |
| \(Q_j(\theta)\) | Probability of an incorrect response to item \(j\) at ability level θ \((= 1 - P_j(\theta))\) |
  • In IRT, measurement precision can be calculated for each ability level θ.
  • The test information calculated for each ability θ
    → can be visualized with the Test Information Curve (TIC).
  • This allows us to visualize “how accurately the test measures at different ability levels (θ).”
    • Compute the test information for each latent trait value and plot it.
    = Use the tif function to calculate test information
    → Use the plot function to create the TIC.
  • Test information at various latent trait values θ (abilities) can be calculated with the irtoys package tif() function.
I <- irtoys::tif(ip = ex1$est) # ip: argument specifying the item parameters of the test
                               # (the 2PL item parameters estimated above are assumed)
plot(x = I)                    # x: argument specifying the result estimated by tif()

・Horizontal axis … Latent trait value \(θ\) (ability)
・Vertical axis … Test information (measurement precision; the reciprocal of the squared standard error)
・Solid line … Test Information Curve
→ Obtained by connecting the test information values at each latent trait level

What the Test Information Curve (TIC) Shows
1. Which ability levels are measured accurately?
・Where information is high → the test is more precise at that ability level θ
→ Information is highest around \(\theta = -2\)
→ The test is most accurate at the level of \(\theta = -2\)
・Where information is low → the test is less precise at that ability level θ
→ Information is lowest at \(\theta = 4\)
→ The test is least accurate at the level of \(\theta = 4\)
・Example: If the TIC peaks around θ = 0
→ It can be said to be “an optimal test for measuring average examinees.”

2. Reveals the design intent of the test
・By observing where the TIC peaks, we can see what type of examinees the test is designed for. If the TIC peaks around \(\theta = -2\) → The test is aimed at relatively low-ability examinees.
・In general:

| TIC Peak Location | Meaning | Target |
|---|---|---|
| Around θ = 0 | For average examinees | General ability tests |
| θ > 0 (right side) | For high-ability examinees | Advanced/professional exams |
| θ < 0 (left side) | For beginners/low-ability examinees | Basic skills assessments |

・In this case, the TIC peak is to the left
→ It indicates that the test is designed for below-average ability examinees.

3. Indicates reliability (precision) as well
・Higher information = smaller standard error (SE) in that range
・Relationship with standard error:

\[SE(\theta) = \frac{1}{\sqrt{I(\theta)}}\]
・In other words, when the amount of information is large, the estimation of \(\theta\) is less prone to fluctuation (= more reliable).

4. Setting a criterion for test information: 0.5 or higher
・For example, if we set the criterion for test information at “0.5 or higher,”
→ we can identify the range of ability levels θ that meet the standard for measurement precision.
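As a sketch under these definitions, the test information and the corresponding standard error can be computed on a grid of θ values directly from the item parameters in ex1$est, and the θ range satisfying the 0.5 criterion read off. The calculation below uses the logistic metric (no 1.7² factor), consistent with the probabilities computed elsewhere in this document; including 1.7² would rescale the values but not change the shape of the curve.

# Test information I(theta) = sum_j a_j^2 P_j(theta) Q_j(theta), and SE(theta) = 1 / sqrt(I(theta))
theta <- seq(-4, 4, by = 0.1)
a <- ex1$est[, 1]
b <- ex1$est[, 2]

info <- sapply(theta, function(t) {
  P <- 1 / (1 + exp(-a * (t - b)))
  sum(a^2 * P * (1 - P))
})
se <- 1 / sqrt(info)

# Range of theta for which the test information is at least 0.5
range(theta[info >= 0.5])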

Key Point of the Test Information Results
At which latent trait value θ does the test information reach its maximum?
・The test information is maximized around a latent trait value of −2.
⇒ The estimation precision is highest for examinees with low ability levels (around θ = −2).

6.7.2 Test Purpose and Test Information

Left Figure:

  • Test information when “difficulty” is the same but “discrimination” differs
  • A steep, high peak around ability θ = 0
    → Suitable for measuring a narrow range of ability around θ = 0 with high precision (e.g., for selection tests)

Right Figure:

  • Test information when “discrimination” is the same but “difficulty” differs
  • A lower, broader curve spread over a wide range of ability around θ = 0. → Suitable for measuring a wide range of ability with consistent precision (e.g., academic tests that assess basic skills)

6.8 Examination of Local Independence: irf() & cor()

Checking whether there are extra dependencies between items (violations of model assumptions)

6.8.1 What is Local Independence?

  • When analyzing test data or questionnaire responses based on IRT,
    ⇒ the assumption of local independence is made
    • Examination of local independence in IRT
  • When the value of the latent trait θ is fixed, responses to different items are assumed to occur independently.
  • The “assumption of local independence” is essentially the same as the “assumption of unidimensionality.”
Assumption of Unidimensionality
・Variation in whether each item is answered correctly depends only on the magnitude of the latent trait value θ
  • In other words, the assumption is that “whether an examinee answers correctly is determined solely by their ability.”

  • Examination of local independence is often conducted using the \(Q_3\) statistic.

  • The \(Q_3\) statistic is obtained by subtracting the expected value from each observed item response,
    → then calculating the correlations between the resulting residual scores.

  • Specifically, the \(Q_3\) statistic is defined as the correlation between residuals obtained by subtracting the expected probability of a correct response (calculated from the item response model) from the observed response.

  • The closer the absolute value is to 0, the more reasonable it is to assume local independence between item responses.
    • For example, in this case, the residual score \(d_1\) for item1 can be expressed as follows:

\[d_1 = u_1 - \hat{P}_1(\theta)\]

  • \(u_1\): Response to item1 (1 if correct, 0 if incorrect)
  • \(\hat{P}_1(\theta)\): Probability of a correct response calculated from the estimated item parameters and latent trait value

\(Q_3\) Statistic (Yen, 1984)

・The \(Q_3\) statistic (Yen, 1984) makes use of what corresponds to a partial correlation coefficient in regression analysis.
・A partial correlation coefficient is “the correlation between \(x\) and \(y\) after removing the effect of variable \(z\).”
→ It is often used to examine the influence of spurious correlations.
・Applied to the context of IRT:
→ “The correlation between \(y_{pi}\) and \(y_{pj}\) after removing the effect of \(\theta_p\).” → If this value is 0, local independence holds.

Source: https://journals.sagepub.com/doi/10.1177/014662168400800201

6.8.2 How to Calculate the \(Q_3\) Statistic in R

Step 1: Estimate latent trait values using the mlebme function

• Estimate latent trait values with the mlebme() function
⇒ The mlebme() function is included in the irtoys package

head(mlebme(resp = LSAT[, 1:5],   # Specify the test data
  ip = ex1$est,                   # Assume the 2PL model for the data
  method = "BM"))                 # Specify estimation of latent trait values by Maximum Likelihood (ML)
           est       sem n
[1,] -1.895392 0.7954829 5
[2,] -1.895392 0.7954829 5
[3,] -1.895392 0.7954829 5
[4,] -1.479314 0.7960948 5
[5,] -1.479314 0.7960948 5
[6,] -1.479314 0.7960948 5
  • resp: Argument specifying the test data
  • ip: Argument specifying the item parameters of each item in the test
  • Specify ex1$est
    → The estimated item parameters under the 2PL model are assigned as the item parameters of each item.
  • After estimating the item parameters, the latent trait values are then estimated
  • method: Specifies which estimation method to use for estimating latent trait values
  • Specify ML
    → Latent trait values are estimated using the Maximum Likelihood method
    ⇒ When there are examinees who answered all items correctly or all items incorrectly, their estimates cannot be obtained
  • Specify BM
    → Incorporates prior information by assuming that latent trait values follow a standard normal distribution, allowing estimation even in such cases.

→ It is possible to obtain estimates even for examinees with all-correct or all-incorrect responses (Bayesian estimation)

theta.est <- mlebme(resp = LSAT[,1:5],
  ip = ex1$est,
  method="BM")
DT::datatable(theta.est)
  • Column 1: Estimated value (est)
  • Column 2: Standard error (sem)
  • Column 3: Number of items answered (n)

Step 2. Estimate the probability of correct responses using the irf() function

• The irf() function assumes the 2PLM
• Specify ex1$est
→ Assign the estimated item parameters under the 2PLM as the item parameters for each item
• Specify theta.est[, 1]
→ Assign the estimated latent trait values under the 2PLM as the latent trait estimates
⇒ Save the result as P

P <- irtoys::irf(ip = ex1$est,  # Specify item parameters
  x = theta.est[, 1])           # Specify latent trait values for each examinee
| Variable | Description |
|---|---|
| \(x\) | Latent trait value \(\theta\) (ability) of each examinee |
| \(f\) | Estimated probability of a correct response |
| Rows | Examinees (1,000 individuals) |
| Columns | Items (item1–item5) |

Step 3. Calculate residual scores \(d\) from the estimated probabilities of correct responses and the test data

  • Compute \(d_j = u_j - \hat{P_j(\theta)}\) and save as \(d\) (1 ≦ j ≦ 5)
  • By specifying P$f, the estimated probabilities of correct responses can be extracted
    ⇒ Subtract the estimated probabilities from the test data (LSAT[, 1:5]) and save the residual scores as \(d\)
d <- LSAT[, 1:5] - P$f # Specify P$f to obtain the estimated probabilities of correct responses
  • Display the results for the first 6 of the 1,000 examinees
head(d)
       item1      item2      item3      item4      item5
1 -0.7700558 -0.4061064 -0.1917689 -0.4949268 -0.6915701
2 -0.7700558 -0.4061064 -0.1917689 -0.4949268 -0.6915701
3 -0.7700558 -0.4061064 -0.1917689 -0.4949268 -0.6915701
4 -0.8252089 -0.4801900 -0.2557741 -0.5661590  0.2533129
5 -0.8252089 -0.4801900 -0.2557741 -0.5661590  0.2533129
6 -0.8252089 -0.4801900 -0.2557741 -0.5661590  0.2533129

Check the residual score \(d_{11}\) for examinee 1, item1
・For example, the residual score \(d_1\) for examinee 1, item1 is −0.7700558
・Confirm the response \(u_{ij} = u_{11}\) of examinee 1 to item1

LSAT[1,1]
[1] 0
  • Examinee 1’s response to item1 is 0 (incorrect).
  • The estimated probability of a correct response for Examinee 1 on item1, \(\hat{P}_{11}(\theta)\), is 0.7700558 (see above).
    → Therefore, Examinee 1’s residual score \(d_{11}\) is:
LSAT[1,1] - 0.7700558
[1] -0.7700558

Step 4. Calculate the value of the \(Q_3\) statistic using the cor function

Q3 <- cor(x = d, 
  y = d, 
  use = "pairwise.complete.obs")
Q3
            item1       item2       item3        item4        item5
item1  1.00000000 -0.04142824 -0.04101429 -0.064167975 -0.062538809
item2 -0.04142824  1.00000000 -0.11322248 -0.097060194 -0.029585197
item3 -0.04101429 -0.11322248  1.00000000 -0.092262203 -0.104216701
item4 -0.06416797 -0.09706019 -0.09226220  1.000000000 -0.003656669
item5 -0.06253881 -0.02958520 -0.10421670 -0.003656669  1.000000000
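Before visualizing, a one-line summary: the largest off-diagonal value of |Q3| shows how far the residual correlations stray from 0.

# Largest absolute residual correlation (off-diagonal elements only)
max(abs(Q3[upper.tri(Q3)]))
# about 0.11 here, well below the 0.2 guideline used below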

Step 5. Create a heatmap of the \(Q_3\) statistic values

  • Take the absolute values of Q3
  • Replace the diagonal elements (self-correlation = 1.0) with NA to exclude them
  • Count the number of absolute values ≥ 0.2 and calculate their proportion of the total
# Take the absolute values of Q3
Q3_abs <- abs(Q3)

# Replace diagonal elements (self-correlation = 1.0) with NA to exclude them
diag(Q3_abs) <- NA

# Count the number of absolute values ≥ 0.2
count_0.2_or_more <- sum(Q3_abs >= 0.2, na.rm = TRUE)

# Total number of elements (after excluding diagonals)
total_elements <- sum(!is.na(Q3_abs))

# Calculate the proportion
proportion_0.2_or_more <- count_0.2_or_more / total_elements

# Display the results
cat("Number of absolute values ≥ 0.2:", count_0.2_or_more, "\n")
Number of absolute values ≥ 0.2: 0 
cat("Proportion:", proportion_0.2_or_more, "\n")
Proportion: 0 
  • The number of cases where the absolute value of Q3 is ≥ 0.2 is 0
  • Let’s visualize the results in a heatmap
library(ggplot2)
library(reshape2)

# Convert Q3 matrix to absolute values
Q3_abs <- abs(Q3)
diag(Q3_abs) <- NA  # Exclude diagonal elements

# Melt into long format
Q3_long <- melt(Q3_abs, varnames = c("Item1", "Item2"), value.name = "Q3_value")

# Add a flag for values greater than 0.2
Q3_long$Violation <- Q3_long$Q3_value > 0.2

# Create heatmap
ggplot(Q3_long, aes(x = Item1, y = Item2, fill = Violation)) +
  geom_tile(color = "white") +
  scale_fill_manual(values = c("white", "deeppink")) +
  labs(
    title = "Heatmap of Local Independence Violations (Pink = Violation)",
    x = "Item",
    y = "Item"
  ) +
  theme_bw(base_family = "HiraKakuProN-W3") +   # White background + Japanese font
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1),  # Rotate labels
    axis.text.y = element_text(hjust = 1)
  )

Key Points in Examining Local Independence

  • Whether the variation in item responses depends only on the magnitude of the latent trait value \(\theta\)
  • The closer the absolute value of \(Q_3\) is to 0, the more justifiable it is to assume local independence among item responses.
  • If the absolute value of \(Q_3\) ≥ 0.2 → Potential problem
    → Suggests possible local dependence (Chen & Thissen, 1997)
  • If the absolute value of \(Q_3\) < 0.2 → No problem
  • When the absolute values of \(Q_3\) were calculated here:
    → Number of cases with absolute \(Q_3\) ≥ 0.2 = 0
    → Suggests that local independence holds
    → Proceed to the next analysis
  • The \(Q_3\) matrix is a residual correlation matrix created from the discrepancies (= residuals) between the “probabilities of correct responses estimated by the IRT model” and the “actual response data.”
  • Therefore, an absolute \(Q_3\) ≥ 0.2 indicates that the assumption of local independence is violated, suggesting one of the following:
    🔵 Strong dependencies remain that cannot be explained by ability θ
    🔵 Items are connected for some other reason
  • Local dependence: a state in which local independence does not hold
    ⇒ Note that \(Q_3\) is only one reference criterion and should be interpreted with caution

6.9 Examination of Item Fit: itf()

Checking how well each item fits the IRT model

  • Item Fit evaluates “whether each item properly follows the theoretical model (e.g., 2PL model, 3PL model).”

  • In IRT, it is also important to examine the degree of fit to the item response model.

  • Here, we will use the itf() function to examine the item fit of item1.

  • resp: Argument specifying the test data

irtoys::itf(resp = LSAT[, 1:5], # Response data [examinees, items]
  item = 1,                     # Specify examination of item fit for the 1st item
  ip = ex1$est,                 # Item parameters estimated under the 2PL model
  theta = theta.est[, 1])       # Estimated ability values (θ) for each examinee

 Statistic         DF    P-value 
10.0741811  6.0000000  0.1215627 
  • The figure shows the degree of discrepancy between the fitted item response model and the data.
  • Horizontal axis … Latent trait value (Ability)
  • Vertical axis … Probability of a correct response (Proportion right)
  • Solid line・・・Predicted probability of a correct response based on the fitted item response model
  • Circles … Actual proportion of correct responses for each group when examinees are grouped by their estimated latent trait values
  • Latent trait value \(θ\) (ability) is assumed to follow a normal distribution with mean 0 and variance 1

What Item Fit Reveals
Whether the item “fits” the IRT model or not
・Compare the “actual response patterns from the data” for each item with the “theoretically predicted response patterns” from the IRT model
→ If there is a discrepancy, it means the model assumptions do not fit that item.

Why is this important?

・Using items that do not fit the model may lead to inaccurate estimation of the latent trait value \(θ\) (ability).
・Serves as a basis for checking item quality and deciding whether to revise or remove inappropriate items.
・Provides clues for detecting bias (DIF: Differential Item Functioning).

Possible reasons when item fit is poor

| Phenomenon | Possible Explanation |
|---|---|
| Actual proportion correct is lower than the model prediction | Item wording is unclear / answer choices are confusing |
| Abnormal behavior only for specific ability groups | Item is biased or misleading |
| Proportion correct is close to random | Guessing strongly influences responses (the model lacks a guessing (c) parameter) |
| Cognitively too complex | Cannot be explained by a single latent trait θ |

Indicator for Evaluating Fit: S–X² Statistic (Orlando & Thissen’s Item Fit Index)
・A more precise goodness-of-fit test (especially used for 2PL and 3PL models)
・Ability is divided into groups (typically deciles)
→ Differences between the model’s expected proportion correct and the actual proportion correct are calculated for each group
→ Fit is then evaluated as a chi-square type statistic
= Although it follows a chi-square distribution, this is a method unique to S–X²
→ It is distinct from the conventional chi-square goodness-of-fit test

Interpretation of Results

Null Hypothesis: “The fitted model adequately represents the data.”

Obtained Results:

・If the p-value is greater than the significance level (0.05):
p-value = 0.1215627
→ Fail to reject the null hypothesis
→ The fitted item response model is judged to adequately represent the data.
・In the figure output by the itf function:
・The greater the discrepancy between the solid line and the circles, the worse the model fit to the data.
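The call above examines item1 only. A minimal sketch for screening all five items is shown below; it assumes, as in the printout above, that itf() returns the fit statistic, the degrees of freedom, and the p-value in that order.

# Item-fit p-values for all five items (third element of each itf() result)
fit_p <- sapply(1:5, function(i) {
  irtoys::itf(resp = LSAT[, 1:5], item = i,
              ip = ex1$est, theta = theta.est[, 1])[3]
})
names(fit_p) <- paste0("item", 1:5)
fit_p
# p-values above 0.05 give no evidence of misfit for the corresponding item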

6.10 Creating the Test Characteristic Curve (TCC)

How many points can a person with ability \(\theta\) score?

  • Analysis using the trf() & plot() functions.
  • For each latent trait value, calculate and plot the expected raw score.
  • Use the trf() function to calculate the expected score
    → Use the plot() function to create the TCC
E <- trf(ip = ex1$est)   # Assume the 2PL model for the data
                         # ip: argument specifying the item parameters included in the test
plot(x = E)              # Expected raw scores at various latent trait values

  • Horizontal axis・・・Latent trait value (Ability)
  • Vertical axis・・・Expected total score (Expected Score)

What the Test Characteristic Curve (TCC) Shows

1. Relationship between examinees’ ability θ and expected scores
・The solid line slopes upward
→ As examinees’ ability increases, their scores also tend to increase

2. Test Difficulty and Distribution Characteristics
・If the expected score rises sharply around θ = 0
→ The test is designed for average-ability examinees.
・If the expected score rises sharply around θ = 2 to 4
→ The test is designed for above-average examinees.
・If the expected score rises sharply around θ = −4 to −2
→ The test is designed for below-average examinees.

3. Skewness and Limitations of the Score Distribution
・Where the slope of the TCC is shallow
→ Score changes are less responsive in that ability range (= difficult to distinguish examinees).
・If the expected total score flattens out near the upper or lower ends
→ It becomes difficult to distinguish between high scorers and low scorers.
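As a numeric companion to the TCC, the expected total score at a given θ is simply the sum of the five correct-response probabilities. A minimal sketch on the logistic metric, using the parameters in ex1$est:

# Expected total score E(theta) = sum_j P_j(theta) at a few ability levels
expected_score <- function(t) {
  a <- ex1$est[, 1]
  b <- ex1$est[, 2]
  sum(1 / (1 + exp(-a * (t - b))))
}
round(sapply(c(-2, 0, 2), expected_score), 2)
# the expected score increases with theta and flattens out toward the maximum of 5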

6.11 Estimation of Latent Trait Values (Examinee Side)

Estimating θ from the “examinee side” (how capable the person is)

  • Estimate latent trait values in R
    • Use the mlebme function to estimate latent trait values
    ⇒ The mlebme function is included in the irtoys package
head(mlebme(resp = LSAT[, 1:5],   # Specify the test data
  ip = ex1$est,                   # Assume the 2PL model for the data
  method = "BM"))                 # Specify estimation of latent trait values by Maximum Likelihood (ML)
           est       sem n
[1,] -1.895392 0.7954829 5
[2,] -1.895392 0.7954829 5
[3,] -1.895392 0.7954829 5
[4,] -1.479314 0.7960948 5
[5,] -1.479314 0.7960948 5
[6,] -1.479314 0.7960948 5
  • resp: Argument specifying the test data
  • ip: Argument specifying the item parameters of each item in the test
  • Specify ex1$est
    → The item parameters estimated under the 2PL model are assigned as the parameters for each item
  • After estimating the item parameters, the latent trait values are then estimated
  • method: Specifies which estimation method to use for estimating latent trait values
  • Specify ML
    → Latent trait values are estimated using the Maximum Likelihood method
    ⇒ When there are examinees who answered all items correctly or all items incorrectly, their estimates cannot be obtained
  • Specify BM
    → Incorporates prior information by assuming that latent trait values follow a standard normal distribution. → Makes it possible to obtain estimates even for examinees with all-correct or all-incorrect responses (Bayesian estimation)
theta.est <- mlebme(resp = LSAT[,1:5],
  ip = ex1$est,
  method="BM")
DT::datatable(theta.est)
What Can Be Learned from These Analysis Results

  • This table shows the following information for each examinee:

| Column | Meaning |
|---|---|
| est | Estimated ability value (θ) |
| sem | Standard error of measurement |
| n | Number of items used for estimation (here, 5 items for every examinee) |
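To see the ability distribution implied by these estimates, the est column can be summarized directly (a minimal sketch; theta.est is the mlebme() result created above, whose columns are named est, sem, and n).

# Distribution of the estimated latent trait values
summary(theta.est[, "est"])
hist(theta.est[, "est"], breaks = 20,
     main = "Distribution of estimated ability",
     xlab = "Estimated theta")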

Precision of θ estimates is uniformly low

  • The standard error (sem) is almost constant, around 0.80. → Estimation precision is not high
  • This is due to the small number of items (n = 5)

Significance of Estimating θ from the Examinee Side

  • Quantifies the “latent ability” of examinees
    → Ability is measured not simply by the proportion of correct responses, but by taking into account the difficulty of the items the examinee was able to solve