Exploring Person-Item Interaction with Multidimensional Multilevel Rasch Models
Wen-Chung Wang, Department of Psychology, National Chung Cheng University, Taiwan

Outline
Person-item interaction
Situations where person-item interaction may occur: testlet-based items, rater effects, rating scales, missing data
Parameter estimation
Concluding remark

Person-Item Interaction
Standard Rasch models (and other IRT models) assume that item parameters remain constant over persons (local item independence), meaning there is no person-item interaction. If item responses fit the model's expectations, good measurement quality is achieved, e.g., test-independent ability estimation and sample-independent item calibration. If persons do interact with items, e.g., an item is easy for one person but difficult for another, then models that ignore this interaction are no longer appropriate, in that their parameter estimates are biased. Fit statistics and DIF analysis may be able to identify this kind of misfit.

If person-item interaction is detected, one may simply remove the offending items or persons, or build the interaction directly into the model and thereby obtain more information for further test revision. The first approach may sometimes be inappropriate because too many items or persons are removed. The second approach requires extending standard IRT models; we focus on it here.

Situations where person-item interaction may occur
Testlet-based items (items that share a common stimulus): items within a testlet may have chaining effects, which may vary over persons.
Rater effects: raters may show different severities in judging different persons' responses.
Rating scales: persons may apply different subjective judgments in selecting categories of rating scale items.
Missing data: persons may not miss items at random.

1. Testlet-based items
Testlet designs are advocated partly because they correspond to real-life situations, in which problems are always interrelated. There are two approaches to modeling the person-item interaction.
The fixed-effect approach. Assumption: the interaction remains constant over persons. Analysis unit: the response pattern of a whole testlet. Parameterization: adding a set of fixed-effect parameters into the model. Examples: Hoskens & De Boeck (1997).
The random-effect approach. Assumption: the interaction varies over persons. Parameterization: adding a set of random-effect parameters into the model. Examples: Wang & Wilson (2005), as in the Rasch testlet model below.

1. The Rasch testlet model
The Rasch model is

\log\left(\frac{p_{ni1}}{p_{ni0}}\right) = \theta_n - \delta_i,

where p_{ni1} and p_{ni0} are the probabilities of person n scoring 1 and 0 on item i, \theta_n is the latent trait of person n, and \delta_i is the difficulty of item i. When differential person-item interaction is suspected, we can add a random variable \gamma_{nd(i)} \sim N(0, \sigma^2_d), the effect of testlet d(i) on person n, into the model to form

\log\left(\frac{p_{ni1}}{p_{ni0}}\right) = \theta_n - \delta_i + \gamma_{nd(i)}.

For polytomous items, the partial credit model (PCM; Masters, 1982) and the rating scale model (RSM; Andrich, 1978) are

\log\left(\frac{p_{nij}}{p_{ni(j-1)}}\right) = \theta_n - \delta_{ij} (PCM), and
\log\left(\frac{p_{nij}}{p_{ni(j-1)}}\right) = \theta_n - \delta_i - \tau_j (RSM).

For testlet d (e.g., a scenario followed by several Likert-type items), the differential person-item interaction can be modeled in the same way, e.g.,

\log\left(\frac{p_{nij}}{p_{ni(j-1)}}\right) = \theta_n - \delta_{ij} + \gamma_{nd(i)}.
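To make the random testlet effect concrete, here is a minimal simulation sketch (not from the original slides; the design of three four-item testlets and the variance values are hypothetical) showing how \gamma_{nd} enters the response probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)

n_persons, n_items = 1000, 12
testlet_of_item = np.repeat([0, 1, 2], 4)    # items 0-3, 4-7, 8-11 share a stimulus
sigma2_testlet = np.array([0.5, 1.0, 1.5])   # hypothetical testlet-effect variances

theta = rng.normal(0.0, 1.0, n_persons)      # latent trait
delta = rng.normal(0.0, 1.0, n_items)        # item difficulties
# person-specific testlet effects gamma_{nd} ~ N(0, sigma2_d)
gamma = rng.normal(0.0, np.sqrt(sigma2_testlet), (n_persons, 3))

# Rasch testlet model: logit P(X_ni = 1) = theta_n - delta_i + gamma_{n, d(i)}
logit = theta[:, None] - delta[None, :] + gamma[:, testlet_of_item]
prob = 1.0 / (1.0 + np.exp(-logit))
responses = rng.binomial(1, prob)
print(responses.mean(axis=0))                # proportion correct per item
```

Fitting the standard Rasch model to such data treats the shared within-testlet variation as if it were measurement information, which is why reliability is overestimated in the example below.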
1. An example
We analyzed the 2001 English test of the Basic Competence Test for Junior High School Students in Taiwan. The test contained 44 multiple-choice items: 17 independent items and 27 items within 11 testlets. A total of 5,000 examinees were randomly selected from a population of more than 300,000 examinees. The Rasch testlet model and the standard Rasch model were fitted to the data.

[Table: variance estimates of the theta and testlet effects under the testlet model and the standard model.]

The IRT reliability was .93 under the standard model and .92 under the testlet model; that is, reliability was overestimated when the person-item interaction within testlets was ignored under the standard model. According to the Spearman-Brown prophecy formula, the test would have to be lengthened by approximately 17.4% to raise the reliability from .92 to .93, once the person-item interaction was taken into account appropriately under the testlet model.

2. Rater effects
A typical Rasch model for a three-facet situation is

\log\left(\frac{p_{nijk}}{p_{ni(j-1)k}}\right) = \theta_n - B_i - C_j - D_k,

where \theta_n is the latent trait of person n; B_i is the difficulty of item i; C_j is the difficulty of category j relative to category j-1 (B and C belong to the same item facet); D_k is the severity of rater k; and p_{nijk} and p_{ni(j-1)k} are the probabilities of scoring j and j-1, respectively. In this model, rater k is assumed to hold a constant severity when judging different persons' responses.
When rater k is suspected to show differential severities when judging different persons, we may add a random variable to model this rater-person interaction:

D_{nk} = D_k + \epsilon_{nk}, with \epsilon_{nk} \sim N(0, \sigma^2_k),

and

\log\left(\frac{p_{nijk}}{p_{ni(j-1)k}}\right) = \theta_n - B_i - C_j - D_{nk}.

If \sigma^2_k > 0, then rater k shows differential severities when giving ratings to persons; the larger the variance, the greater the variation in severity. The variance therefore represents intra-rater inconsistency. Inter-rater inconsistency, in turn, can be depicted by the dispersion of D_k: the more diverse D_k is across raters, the more variation in severity between raters.
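A small sketch of how constant versus random severity plays out under this adjacent-category formulation (the threshold and variance values below are hypothetical, not taken from the slides):

```python
import numpy as np

def category_probs(theta, b_item, c_thresholds, d_rater):
    """Category probabilities under the three-facet model:
    log(p_j / p_{j-1}) = theta - b_item - c_j - d_rater."""
    steps = theta - b_item - c_thresholds - d_rater   # one step per threshold
    cum = np.concatenate(([0.0], np.cumsum(steps)))   # numerator exponents
    expc = np.exp(cum - cum.max())                    # stabilized softmax
    return expc / expc.sum()

rng = np.random.default_rng(1)
c = np.array([-1.0, -0.4, 0.2, 0.8, 1.5])   # hypothetical thresholds, 6-point scale

# Fixed severity: rater k applies the same D_k to every person.
print(category_probs(theta=0.5, b_item=0.0, c_thresholds=c, d_rater=0.7))

# Random severity: D_nk = D_k + eps_nk, eps_nk ~ N(0, sigma2_k), redrawn per person.
sigma2_k = 2.5
d_nk = 0.7 + rng.normal(0.0, np.sqrt(sigma2_k))
print(category_probs(theta=0.5, b_item=0.0, c_thresholds=c, d_rater=d_nk))
```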
2. An example
The data set contained 1,797 Year 6 students' responses to a single writing task. Each of the 1,797 writing scripts was graded by 2 of the 8 raters. When assessing a script, each rater was required to provide two ratings, one on overall performance and the other on textual features; both criteria were judged on a six-point scale. The data set was analyzed with the random-effects three-facet model and with the standard three-facet model. According to the likelihood ratio test, adding the 8 random variables in the random-effects model significantly improved model-data fit.

The IRT reliability was .87 under the standard model and .79 under the random-effects model; reliability was overestimated under the standard model. According to the Spearman-Brown prophecy formula, the test would have to be lengthened by approximately 77.9% to achieve a reliability of .87 starting from the .79 of the random-effects model. The overestimation of reliability was thus serious once the rater-person interaction was ignored.
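The lengthening figure quoted here follows from the Spearman-Brown prophecy formula; a minimal check with the rounded reliabilities of this example:

```python
def lengthening_factor(rho_current, rho_target):
    """Spearman-Brown: factor by which a test must be lengthened to raise
    its reliability from rho_current to rho_target."""
    return rho_target * (1 - rho_current) / (rho_current * (1 - rho_target))

# .79 under the random-effects model vs. .87 under the standard model:
print(f"{lengthening_factor(0.79, 0.87) - 1:.1%}")   # -> 77.9%
```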
The severity estimates for the 8 raters ranged from -1.93 to 1.23, indicating that the raters did not agree in severity, so inter-rater consistency was not very high. The variance estimates of severity for the eight raters ranged from 0.48 to 7.10, with a mean of 2.53. Compared with the variance estimate of 8.12 for the person distribution, some raters showed substantial intra-rater variation in severity (low consistency) while others showed little.

3. Rating scales
In the RSM, each item is assumed to have its own difficulty parameter \delta_i, while all items share the same set of threshold parameters \tau_j. Likert-type or rating-scale items usually require persons' subjective judgments, which are very likely to vary over persons. For example, person A may consider the distance from "Strongly disagree" to "Disagree" a huge gap, whereas person B may consider it a minor one. The RSM (or even the PCM) may thus be too stringent to fit real data from Likert-type or rating-scale items, because these subjective judgments are not properly taken into account.

Acknowledging the subjective nature of such items, test analysts usually adopt more liberal criteria in assessing model-data fit when applying the RSM to Likert-type items of self-report inventories than when applying the Rasch model to objective items of ability tests. Another way of acknowledging the subjective nature is to take it into account directly in the model.
Instead of treating the threshold parameter \tau_j as a fixed effect, one may treat it as a random effect over persons by imposing a distribution on it, \tau_{nj} \sim N(\tau_j, \sigma^2_j), and thus form the random-effects rating scale model (RE-RSM):

\log\left(\frac{p_{nij}}{p_{ni(j-1)}}\right) = \theta_n - \delta_i - \tau_{nj}.

It is also possible to form the random-effects partial credit model (RE-PCM):

\log\left(\frac{p_{nij}}{p_{ni(j-1)}}\right) = \theta_n - \delta_{nij}, with \delta_{nij} \sim N(\delta_{ij}, \sigma^2_{ij}).

In general, the RE-PCM is not identifiable because there are too many random effects. Therefore, we may constrain the random-effect variances to be equal across items,

\sigma^2_{ij} = \sigma^2_j for all i,

which is called the constrained random-effects partial credit model (CRE-PCM).
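As a sketch of what the RE-RSM implies for data generation (the fixed threshold values below are hypothetical; the threshold variances and the trait variance are the ones reported in the example that follows):

```python
import numpy as np

rng = np.random.default_rng(2)
n_persons, n_items = 476, 30
tau = np.array([-1.0, 0.0, 1.0])           # hypothetical fixed threshold values
sigma2_tau = np.array([1.28, 0.99, 3.08])  # threshold variances from the example

theta = rng.normal(0.0, np.sqrt(0.85), n_persons)   # latent trait, variance 0.85
delta = rng.normal(0.0, 1.0, n_items)               # item difficulties
# person-specific thresholds tau_nj ~ N(tau_j, sigma2_j)
tau_n = rng.normal(tau, np.sqrt(sigma2_tau), (n_persons, 3))

def sample_response(theta_n, delta_i, tau_row, rng):
    steps = theta_n - delta_i - tau_row              # RE-RSM adjacent-category logits
    cum = np.concatenate(([0.0], np.cumsum(steps)))
    p = np.exp(cum - cum.max())
    p /= p.sum()
    return rng.choice(4, p=p)                        # categories 0..3

data = np.array([[sample_response(theta[n], delta[i], tau_n[n], rng)
                  for i in range(n_items)] for n in range(n_persons)])
print(np.bincount(data.ravel(), minlength=4) / data.size)  # category proportions
```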
3. An example
The data consisted of the responses of 476 college students who rated how they would perform on 30 teaching-related jobs on four-point Likert-type scales (very poor, poor, good, and very good), assuming they were beginning teachers. The CRE-PCM, RE-RSM, PCM, and RSM were fitted to the data. According to likelihood ratio tests, the CRE-PCM fitted the data best; however, the RE-RSM also fitted well. Both random-effects models fit better than the RSM and the PCM.

The estimated IRT reliabilities under the RSM (and PCM) and under the RE-RSM (and CRE-PCM) were .91 and .48, respectively. Under the RE-RSM, the variance estimates for the three thresholds were 1.28, 0.99, and 3.08, which were very large compared with the variance estimate of 0.85 for the latent trait. A large proportion of the test-score variance was attributable to the random threshold effects over persons, which led to the much smaller test reliability of .48.

4. Missing data
Item non-response frequently occurs in educational and psychological testing, as well as in surveys in other social sciences.
Despite careful test design, any item may conceivably be left unanswered by some persons at some time. Three kinds of missing data are distinguished:
Missing completely at random (MCAR): the missing data are missing at random and the observed data are observed at random.
Missing at random (MAR): the missingness is related to the values of the observed data in the data set, but not to the value of the item itself; that is, conditional on the observed data, the missingness is random.
Not missing at random (NMAR): the missingness is related to the value of the item itself, and perhaps to other variables in the data or even to variables that are not measured, so that the missing data are not ignorable.

IRT models are especially suited to handling MAR data, because the estimation of a person's latent trait is based on that person's observed item responses. For some commonly used testing designs, such as alternating test forms, targeted testing, and adaptive testing, the missing-data mechanism is ignorable for likelihood inferences about the latent trait parameters. When data are NMAR, owing to systematic differences between respondents and nonrespondents, standard IRT models are no longer appropriate.

We need an IRT model that takes the non-random missingness into account. A feasible way of handling NMAR data is to assume a latent trait (e.g., response propensity or temperament) that underlies the binary responses of missingness and that may be correlated with the target latent trait \theta the test intends to measure.
To model non-ignorable missing data along the same IRT logic, one can treat the binary responses of missingness as a function of item and person, following a dichotomous IRT model (e.g., the Rasch model). Every item has its own parameter \delta^*_i describing the "difficulty" of being missed (analogous to the standard difficulty parameter), and every person has a parameter \gamma_n describing his or her tendency not to respond:

\log\left(\frac{p^*_{ni1}}{p^*_{ni0}}\right) = \gamma_n - \delta^*_i,

where p^*_{ni1} and p^*_{ni0} are the probabilities of not responding (scoring 1) and responding (scoring 0) to item i for person n, respectively. The larger the item parameter \delta^*_i, the less likely the item is to be missed; the larger the person parameter \gamma_n, the more likely the person is to leave items unanswered.
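A minimal sketch of this missingness model (the numerical values are hypothetical):

```python
import numpy as np

def prob_missing(gamma_n, delta_star_i):
    """Rasch model for missingness:
    logit P(person n misses item i) = gamma_n - delta*_i."""
    return 1.0 / (1.0 + np.exp(-(gamma_n - delta_star_i)))

# The same person facing an item that is rarely skipped (large delta*)
# versus one that is often skipped (small delta*):
print(prob_missing(gamma_n=1.0, delta_star_i=2.5))   # ~0.18
print(prob_missing(gamma_n=1.0, delta_star_i=-0.5))  # ~0.82
```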
39、now”, or “Refuse to Answer”) and each category can be treated as driven by a distinct latent trait, then the above equation can be generalized to multiple equations, one for each kind of missingness. The equations for the missingness and the standard equation for the IRT model (e.g., the Rasch, PCM
40、or RSM) are simultaneously used to fit the data, so that the combined model is multidimensional. You may view the original observed responses as Test A, and the nonresponded data (recoded as 1 and 0 for missing and 0 otherwise) as Test B, and then calibrate both data sets jointly with a multidimensi
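A minimal sketch of that Test A / Test B recoding (hypothetical data, and a single kind of missingness for brevity):

```python
import numpy as np

# Hypothetical response matrix: integers are observed ratings,
# np.nan marks a non-response.
raw = np.array([[2.0, 1.0, np.nan, 3.0],
                [0.0, np.nan, np.nan, 1.0],
                [3.0, 2.0, 1.0, 0.0]])

test_a = np.ma.masked_invalid(raw)      # "Test A": the observed ratings only
test_b = np.isnan(raw).astype(int)      # "Test B": 1 = missing, 0 = otherwise
print(test_b)

# test_a and test_b would then be calibrated jointly with a multidimensional
# model (e.g., an RSM for Test A and a Rasch model for Test B), letting
# theta and gamma correlate.
```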
If \gamma is uncorrelated with \theta, meaning that the missingness is driven by a latent trait independent of \theta, then knowing \gamma will not help predict \theta, and the data are in fact MAR. On the other hand, when \gamma is correlated with \theta, the missingness contains collateral information about \theta. As a result, the data are NMAR, and the standard unidimensional approach is no longer appropriate. It follows that the higher the correlation between \gamma and \theta, the more efficient the multidimensional approach and the less appropriate the unidimensional one.
4. An example
670 adults in Taiwan responded to six 4-point Likert-type items (strongly disagree, disagree, agree, and strongly agree) about government administration (i.e., policy-making, saving money, freedom from corruption, long-term goal setting, public security, and economic growth) in face-to-face interviews. When a respondent chose not to select any category from the four-point scale, the respondent's attitude was classified by the interviewer into one of three kinds of missingness: No Opinion, Don't Know, or Refuse to Answer.

[Table: category percentages of the six items.]

The data were analyzed in two ways: (a) the RSM was fitted to the observed data; (b) a four-dimensional model was fitted, in which the RSM described the observed data and three Rasch models described the non-response data, one for each kind of missingness.
The multidimensional approach yielded a statistically better fit than the unidimensional approach.

[Table: estimated variances, covariances, correlations, and test reliabilities for the four latent traits under the multidimensional approach. Note: \gamma_N, \gamma_D, and \gamma_R denote the three latent traits for the "No Opinion", "Don't Know", and "Refuse to Answer" missingness, respectively; values on the diagonal are variances, those in the upper triangle are covariances, and those in the lower triangle are correlations.]

\gamma_N, \gamma_D, and \gamma_R were only slightly correlated with \theta, indicating that the unidimensional approach (which assumes the data are MAR) was appropriate here. \gamma_N, \gamma_D, and \gamma_R were also almost uncorrelated with one another, which means that combining the three kinds of missingness into a single category would have been inappropriate. The multidimensional and unidimensional approaches yielded IRT reliabilities of .70 and .71, respectively. In brief, for this particular data set the two approaches made little difference.

Parameter Estimation
It can be shown that all the models in the four situations above are special cases of the multidimensional random coefficients multinomial logit model (MRCMLM; Adams, Wilson, & Wang, 1997). The parameters of these models can be estimated with the computer program ConQuest (Wu, Adams, & Wilson, 1998).
The MRCMLM, being a member of the exponential family of distributions, can be viewed as a generalized linear mixed model. In addition to ConQuest, the SAS NLMIXED procedure and the Stata program GLLAMM are alternatives for fitting many common nonlinear and generalized linear mixed models, including the MRCMLM.
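These programs all maximize the marginal likelihood, integrating the latent traits out of the response probabilities. No ConQuest, NLMIXED, or GLLAMM syntax is reproduced here; instead, a minimal numpy/scipy sketch of that machinery for the plain Rasch model, with \theta \sim N(0, 1) integrated out by Gauss-Hermite quadrature:

```python
import numpy as np
from scipy.optimize import minimize
from numpy.polynomial.hermite_e import hermegauss  # probabilists' Gauss-Hermite

def neg_marginal_loglik(delta, data, nodes, weights):
    """Negative marginal log-likelihood of a Rasch model with theta ~ N(0, 1)
    integrated out by quadrature (the core of MML estimation)."""
    # P(X = 1 | theta_q) for each quadrature node and item: shape (Q, I)
    p = 1.0 / (1.0 + np.exp(-(nodes[:, None] - delta[None, :])))
    # Likelihood of each person's response vector at each node: shape (N, Q)
    lik = np.prod(np.where(data[:, None, :] == 1, p, 1.0 - p), axis=2)
    return -np.sum(np.log(lik @ weights))

rng = np.random.default_rng(3)
true_delta = np.array([-1.0, 0.0, 1.0, 0.5])
theta = rng.normal(size=2000)
data = (rng.random((2000, 4)) <
        1.0 / (1.0 + np.exp(-(theta[:, None] - true_delta)))).astype(int)

nodes, weights = hermegauss(21)
weights = weights / weights.sum()        # normalize into N(0, 1) prior weights
res = minimize(neg_marginal_loglik, np.zeros(4), args=(data, nodes, weights))
print(res.x)                             # should be close to true_delta
```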