
Health outcomes researchers are increasingly applying Item Response Theory (IRT) methods to questionnaire development, evaluation, and refinement efforts.


To provide a brief overview of IRT, to review some of the critical issues associated with IRT applications, and to demonstrate the basic features of IRT with an example.


Example data come from 6,504 adolescent respondents in the National Longitudinal Study of Adolescent Health public use data set who completed the 19-item Feelings Scale for depression. The sample was split into development and validation samples. Scale items were calibrated in the development sample with the Graded Response Model, and the results were used to construct a 10-item short form. The short form was evaluated in the validation sample by examining the correspondence between IRT scores from the short form and the original, and by comparing the proportion of respondents identified as depressed according to the original and short form observed cut scores.


The 19 items varied in their discrimination (slope parameter range: .86–2.66), and item location parameters reflected a considerable range of depression (−.72–3.39). However, the item set is most discriminating at higher levels of depression. In the validation sample IRT scores generated from the short and long forms were correlated at .96 and the average difference in these scores was −.01. In addition, nearly 90% of the sample was classified identically as at risk or not at risk for depression using observed score cut points from the short and long forms.
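As a rough illustration of how slope and location parameters act in the Graded Response Model, the sketch below computes category probabilities for one item. The slope of 2.66 is the largest value reported above; the three thresholds, the value of theta, and the function name are hypothetical choices made only for this example.

```python
import math


def grm_category_probs(theta, slope, thresholds):
    """Return P(X = k) for k = 0..K-1 under the graded response model."""
    # Boundary curves P(X >= k): 1 for the lowest category, then one
    # two-parameter logistic curve per ordered threshold, then 0.
    boundaries = [1.0]
    for b in thresholds:
        boundaries.append(1.0 / (1.0 + math.exp(-slope * (theta - b))))
    boundaries.append(0.0)
    # A category probability is the difference of adjacent boundary curves.
    return [boundaries[k] - boundaries[k + 1] for k in range(len(boundaries) - 1)]


# Slope 2.66 comes from the abstract; the thresholds are hypothetical.
probs = grm_category_probs(theta=1.0, slope=2.66, thresholds=[0.5, 1.5, 2.5])
print([round(p, 3) for p in probs])
```

A steeper slope makes the boundary curves sharper, so neighbouring categories are separated over a narrower range of the latent trait.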


When used appropriately, IRT can be a powerful tool for questionnaire development, evaluation, and refinement, resulting in precise, valid, and relatively brief instruments that minimize response burden.


Quality of Life Research - The Multiple Sclerosis Walking Scale (MSWS-12) is the predominant patient-reported measure of multiple sclerosis (MS)-related walking ability, yet it had not been...

Evaluation of patient-reported outcomes (PRO) is increasingly performed in health sciences. PRO differs from other measurements because such patient characteristics cannot be directly observed. Item response theory (IRT) is an attractive approach to PRO analysis. However, in the IRT framework, sample size justification is rarely provided, or it relies on formulas developed for observed variables and so ignores the fact that PRO measures are latent variables. Such justification might therefore be inappropriate and might yield inadequately sized studies. The objective was to develop a valid sample size methodology for comparing PRO between two groups of patients using IRT. The proposed approach takes into account the questionnaire's item parameters, the difference between the latent variable means, and its variance, whose derivation is approximated using the Cramér-Rao bound (CRB). We also computed the associated power. We conducted a simulation study that varied sample size, number of items, and the value of the group effect. We compared the power obtained from the CRB with that obtained from simulations (SIM) and with the power based on observed variables (OBS). For a given sample size, power from CRB and SIM was similar and always lower than power from OBS. The number of items had a strong impact on CRB and SIM, with power increasing with questionnaire length, but not on OBS. In the context of latent variables, it seems important to use an adapted sample size formula, because the formula developed for observed variables seems inadequate and leads to an underestimated study size.
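The qualitative pattern reported here (power that is lower for a latent variable than for an observed one, and that rises with the number of items) can be sketched with a deliberately simplified approximation. This is not the CRB derivation of the study: it assumes Rasch items, evaluates test information at theta = 0 only, and treats the inverse information as extra per-person variance. All function names and numeric inputs are hypothetical.

```python
import math
from statistics import NormalDist


def rasch_test_information(theta, difficulties):
    """Fisher information about theta carried by a set of Rasch items."""
    info = 0.0
    for b in difficulties:
        p = 1.0 / (1.0 + math.exp(-(theta - b)))
        info += p * (1.0 - p)
    return info


def two_group_power(effect, n_per_group, variance, alpha=0.05):
    """Normal-approximation power of a two-sided two-sample comparison."""
    se = math.sqrt(2.0 * variance / n_per_group)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # The negligible rejection probability in the opposite tail is ignored.
    return NormalDist().cdf(abs(effect) / se - z_crit)


def latent_power(effect, n_per_group, latent_variance, difficulties, alpha=0.05):
    """Power when each person's trait is measured with error (assumption)."""
    # Assumed simplification: measurement error variance is 1 / I(0).
    error_variance = 1.0 / rasch_test_information(0.0, difficulties)
    return two_group_power(effect, n_per_group, latent_variance + error_variance, alpha)


observed = two_group_power(0.5, 100, 1.0)
short_form = latent_power(0.5, 100, 1.0, [0.0] * 5)
long_form = latent_power(0.5, 100, 1.0, [0.0] * 20)
print(round(short_form, 3), round(long_form, 3), round(observed, 3))
```

Under these assumptions the observed-variable formula always gives the highest power, which is why using it to plan a latent-variable study would underestimate the required sample size.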

Health status assessment is frequently used to evaluate the combined impact of human immunodeficiency virus (HIV) disease and its treatment on functioning and well-being from the patient's perspective. No single health status measure can efficiently cover the range of problems in functioning and well-being experienced across HIV disease stages. Item response theory (IRT), item banking and computer adaptive testing (CAT) provide a solution to measuring health-related quality of life (HRQoL) across different stages of HIV disease. IRT allows us to examine the response characteristics of individual items and the relationship between responses to individual items and the responses to each other item in a domain. With information on the response characteristics of a large number of items covering a HRQoL domain (e.g. physical function, and psychological well-being), and information on the interrelationships between all pairs of these items and the total scale, we can construct more efficient scales. Item banks consist of large sets of questions representing various levels of a HRQoL domain that can be used to develop brief, efficient scales for measuring the domain. CAT is the application of IRT and item banks to the tailored assessment of HRQoL domains specific to individual patients. Given the results of IRT analyses and computer-assisted test administration, more efficient and brief scales can be used to measure multiple domains of HRQoL for clinical trials and longitudinal observational studies.
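The tailoring step of CAT described above can be sketched as maximum-information item selection from a calibrated bank. The example assumes dichotomous items under a two-parameter logistic model, which is a simplification (HRQoL items are usually polytomous); the item bank values and function names are hypothetical.

```python
import math


def item_information_2pl(theta, slope, location):
    """Fisher information of one two-parameter logistic item at theta."""
    p = 1.0 / (1.0 + math.exp(-slope * (theta - location)))
    return slope * slope * p * (1.0 - p)


def select_next_item(theta_hat, bank, administered):
    """Pick the unused item that is most informative at the current estimate."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: item_information_2pl(theta_hat, *bank[i]))


# Hypothetical bank of (slope, location) pairs covering a range of severity.
bank = [(1.0, -2.0), (1.5, 0.0), (1.2, 1.0), (2.0, 2.0)]
first = select_next_item(0.0, bank, administered=set())
second = select_next_item(0.0, bank, administered={first})
print(first, second)
```

A full CAT would re-estimate theta after each response and stop once the standard error falls below a chosen target; only the selection step is shown here.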



The present study investigates the properties of the French version of the OUT-PATSAT35 questionnaire, which evaluates the outpatients’ satisfaction with care in oncology using classical analysis (CTT) and item response theory (IRT).


This cross-sectional multicenter study includes 692 patients who completed the questionnaire at the end of their ambulatory treatment. CTT analyses tested the main psychometric properties (convergent and divergent validity, and internal consistency). IRT analyses were conducted separately for each OUT-PATSAT35 domain (the doctors, the nurses or the radiation therapists and the services/organization) by models from the Rasch family. We examined the fit of the data to the model expectations and tested whether the model assumptions of unidimensionality, monotonicity and local independence were respected.
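Internal consistency in CTT analyses of this kind is commonly summarised with Cronbach's alpha. A minimal sketch of that coefficient follows, assuming complete data arranged as one list of item scores per respondent; the small data set and the function name are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents' item-score lists."""
    n_items = len(scores[0])
    # Sample variance of each item across respondents.
    item_variances = [variance([row[j] for row in scores]) for j in range(n_items)]
    # Sample variance of the total (sum) score; must be non-zero.
    total_variance = variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1.0 - sum(item_variances) / total_variance)


# Hypothetical responses from three people to two items.
alpha = cronbach_alpha([[1, 2], [2, 1], [3, 3]])
print(round(alpha, 3))
```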


A total of 605 (87.4%) respondents were analyzed with a mean age of 64 years (range 29–88). Internal consistency for all scales separately and for the three main domains was good (Cronbach’s α 0.74–0.98). IRT analyses were performed with the partial credit model. No disordered thresholds of polytomous items were found. Each domain showed high reliability but fitted the Rasch models poorly. Three items in particular showed the greatest lack of fit: the item about “promptness” in the doctors’ domain and the items about “accessibility” and “environment” in the services/organization domain. An adequate fit to the Rasch model can be obtained by dropping these items. Most of the local dependence concerned items about “information provided” in each domain. A major deviation from unidimensionality was found in the nurses’ domain.
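To make the partial credit model and the notion of disordered thresholds concrete, the sketch below computes category probabilities for one polytomous item and checks whether its step parameters increase. The step values and function names are hypothetical and are not estimates from the OUT-PATSAT35 data.

```python
import math


def pcm_category_probs(theta, steps):
    """Category probabilities P(X = k), k = 0..m, under the partial credit model."""
    # The log-numerator of category k is the sum of (theta - step_j) for j <= k.
    log_numerators = [0.0]
    running = 0.0
    for step in steps:
        running += theta - step
        log_numerators.append(running)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_numerators)
    weights = [math.exp(v - top) for v in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]


def thresholds_ordered(steps):
    """True when every step parameter exceeds the previous one."""
    return all(low < high for low, high in zip(steps, steps[1:]))


steps = [-1.0, 0.5]  # hypothetical step parameters for a 3-category item
probs = pcm_category_probs(0.5, steps)
print([round(p, 3) for p in probs], thresholds_ordered(steps))
```

At a trait level equal to a step parameter, the two adjacent categories are equally likely, which is why out-of-order steps signal a category that is never the most probable response.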


CTT showed good psychometric properties of the OUT-PATSAT35. However, the Rasch analysis revealed some misfitting and redundant items. Taking the above problems into consideration, it could be interesting to refine the questionnaire in a future study.

Context: A test score is a number which purportedly reflects a candidate’s proficiency in some clearly defined knowledge or skill domain. A test theory model is necessary to help us better understand the relationship that exists between the observed (or actual) score on an examination and the underlying proficiency in the domain, which is generally unobserved. Common test theory models include classical test theory (CTT) and item response theory (IRT). The widespread use of IRT models over the past several decades attests to their importance in the development and analysis of assessments in medical education. Item response theory models are used for a host of purposes, including item analysis, test form assembly and equating. Although helpful in many circumstances, IRT models make fairly strong assumptions and are mathematically much more complex than CTT models. Consequently, there are instances in which it might be more appropriate to use CTT, especially when common assumptions of IRT cannot be readily met, or in more local settings, such as those that may characterise many medical school examinations. Objectives: The objective of this paper is to provide an overview of both CTT and IRT to the practitioner involved in the development and scoring of medical education assessments. Methods: The tenets of CTT and IRT are initially described. Then, the main uses of both models in test development and psychometric activities are illustrated via several practical examples. Finally, general recommendations pertaining to the use of each model in practice are outlined. Discussion: Classical test theory and IRT are widely used to address measurement-related issues that arise from commonly used assessments in medical education, including multiple-choice examinations, objective structured clinical examinations, ward ratings and workplace evaluations. The present paper provides an introduction to these models and how they can be applied to answer common assessment questions.
Medical Education 2010: 44: 109–117



We review the papers presented at the NCI/DIA conference, to identify areas of controversy and uncertainty, and to highlight those aspects of item response theory (IRT) and computer adaptive testing (CAT) that require theoretical or empirical research in order to justify their application to patient reported outcomes (PROs).


IRT and CAT offer exciting potential for the development of a new generation of PRO instruments. However, most of the research into these techniques has been in non-healthcare settings, notably in education. Educational tests are very different from PRO instruments, and consequently problematic issues arise when adapting IRT and CAT to healthcare research.


Clinical scales differ appreciably from educational tests, and symptoms have characteristics distinctly different from examination questions. This affects the transfer of IRT technology. Particular areas of concern when applying IRT to PROs include inadequate software, difficulties in selecting models and communicating results, insufficient testing of local independence and other assumptions, and a need for guidelines on estimating sample size requirements. Similar concerns apply to differential item functioning (DIF), which is an important application of IRT. Multidimensional IRT is likely to be advantageous only for closely related PRO dimensions.


Although IRT and CAT provide appreciable potential benefits, there is a need for circumspection. Not all PRO scales are necessarily appropriate targets for this methodology. Traditional psychometric methods, and especially qualitative methods, continue to have an important role alongside IRT. Research should be funded to address the specific concerns that have been identified.



We review the NCI/DIA conference, “Improving health outcomes assessment based on modern measurement theory and computerized adaptive testing,” and suggest next steps in use of item response theory (IRT) to assess health outcomes.


In recent years the level of interest and use of IRT methods has increased dramatically among health outcomes researchers. The NCI/DIA conference on June 24–25, 2004, was one of the first systematic opportunities to examine many challenging issues in applying IRT to the health outcomes field.


Based on the conference presentations, we identified five issues important to future applications of IRT to health outcomes.


The five key issues are as follows: (1) collaboration between academia, government and industry; (2) common versus unique item banks; (3) educating and establishing standards for use and reporting of IRT; (4) demonstrating the value of IRT; and (5) continuing efforts to improve the user friendliness of IRT software.


Moving forward will require a collaborative effort between academia, government agencies, and industry to design and conduct IRT research. A common item bank developed with collaboration from investigators from multiple institutions could be very valuable to the field. The establishment of consensus standards for use and reporting of IRT results would help users and consumers of the methodology. Clear documentation of how IRT can lead to better patient-reported outcome measures and more accurate understanding of substantive issues is essential. Academia, government and industry should continue current work to enhance the user-friendliness of the IRT software.



The International Classification of Functioning, Disability and Health (ICF) proposes three main health outcomes, Impairment (I), Activity Limitation (A) and Participation Restriction (P), but good measures of these constructs are needed. The aim of this study was to use both Classical Test Theory (CTT) and Item Response Theory (IRT) methods to carry out an item analysis to improve measurement of these three components in patients having joint replacement surgery mainly for osteoarthritis (OA).

Noerholm V., Groenvold M., Watt T., Bjorner J.B., Rasmussen N.-A., Bech P. Quality of Life Research, 2004, 13(2): 531-540
BACKGROUND: The main objective of this study was to investigate the construct validity of the WHOQOL-BREF by use of Rasch and Item Response Theory models and to examine the stability of the model across high/low scoring individuals, gender, education, and depressive illness. Furthermore, the objective of the study was to estimate the reference data for the quality of life questionnaire WHOQOL-BREF in the general Danish population and in subgroups defined by age, gender, and education. METHODS: Mail-out-mail-back questionnaires were sent to a randomly selected sample of the Danish general population. The response rate was 68.5%, and the sample reported here contained 1101 respondents: 578 women and 519 men (four respondents did not indicate their genders). RESULTS: Each of the four domains of the WHOQOL-BREF scale fitted a two-parameter IRT model, but did not fit the Rasch model. Due to multidimensionality, the total score of 26 items fitted neither model. Regression analysis was carried out, showing a level of explained variance of between 10 and 14%. The mean scores of the WHOQOL-BREF are reported as normative data for the general Danish population. CONCLUSION: The profile of the four WHOQOL-BREF domains is a more adequate expression of quality of life than the total score of all 26 items. Although none of the subscales are statistically sufficient measures of their domains, the profile scores seem to be adequate approximations to the optimal score.
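The remark about statistical sufficiency can be illustrated with two dichotomous items, which is a simplification of the polytomous WHOQOL-BREF items. Under the Rasch model, two response patterns with the same sum score have a likelihood ratio that does not depend on the trait level, so the sum score carries all the information; with unequal slopes (a two-parameter model) the ratio changes with the trait level. The item parameters and function names below are hypothetical.

```python
import math


def pattern_loglik(theta, pattern, items):
    """Log-likelihood of a 0/1 response pattern; items are (slope, location)."""
    loglik = 0.0
    for response, (slope, location) in zip(pattern, items):
        p = 1.0 / (1.0 + math.exp(-slope * (theta - location)))
        loglik += math.log(p if response == 1 else 1.0 - p)
    return loglik


def same_score_log_ratio(theta, items):
    """Log-likelihood ratio of patterns (1, 0) and (0, 1), both with sum score 1."""
    return pattern_loglik(theta, (1, 0), items) - pattern_loglik(theta, (0, 1), items)


rasch_items = [(1.0, -0.5), (1.0, 0.5)]    # equal slopes (Rasch)
two_pl_items = [(0.8, -0.5), (2.0, 0.5)]   # unequal slopes (2PL)

for theta in (-1.0, 2.0):
    print(theta,
          round(same_score_log_ratio(theta, rasch_items), 3),
          round(same_score_log_ratio(theta, two_pl_items), 3))
```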



To use the item response theory (IRT) methods to examine the degree to which the four selected tools reflect sarcopenia and to arrange them according to their ability to estimate sarcopenia severity.


A cross-sectional study aimed at verifying the possibilities of using diagnostic tools for sarcopenia.

Setting and Participants

The study included residents living in an assisted living unit at the Senior Centre in Blansko (South Moravia, Czech Republic) (n=77). Sarcopenia was estimated according to the proposals of the European Working Group on Sarcopenia in Older People (EWGSOP) using calf circumference, the EWGSOP algorithm, hand grip strength, and the Short Physical Performance Battery (SPPB).


The results from the IRT model showed that these four methods display strong unidimensionality, so they measure the same latent variable. Ranked by discrimination from high to low, calf circumference was the most discriminating (Hi = 0.86), and the SPPB and hand grip strength were the least discriminating (both Hi = 0.44).


We recommend identifying mild sarcopenia by SPPB or hand grip strength, moderate sarcopenia by the EWGSOP algorithm, and severe sarcopenia by calf circumference.

The current study examined the psychometric characteristics of the College-Oriented Eating Disorders Screen (COEDS), a college-student-focused screening measure to assess and identify individuals at-risk for the development of eating disordered pathology. By screening a large pool of pilot questions and using methods based in item response theory (IRT), seven items were identified with well-targeted contents that discriminated well across the continuum of eating disorder severity. The resulting measure evidenced a unidimensional factor structure and correlated highly with the original COEDS, standard measures of eating disorders pathology, and a measure of associated symptomatology (e.g., depressive symptoms). Based on these results, we discuss the utility of the COEDS as a prognostic indicator for risk of eating disordered pathology among college students.

The Severity of Disabilities Scale (SDS) of the ICIDH reflects the degree to which an individual's ability to perform a certain activity is restricted. This paper describes the application of two models from item response theory (IRT), the graded response model and the partial credit model, in order to derive a tentative proposal for a revised SDS. The key ingredient of the approach is to scale existing disability items obtained in different studies on a common scale by exploiting the overlap. Both IRT models are fitted to a linked data set containing items for measuring walking disability. Based on these solutions, a tentative SDS is constructed. The paper concludes with a discussion of the implications, limitations and advantages of the approach.

Objectives: Determining the minimal clinically important difference (MCID) of questionnaires on an interval scale, the trait level (TL) scale, using item response theory (IRT) models could overcome its association with baseline severity. The aim of this study was to compare the sensitivity (Se), specificity (Sp), and predictive values (PVs) of the MCID determined on the score scale (MCID-Sc) or the TL scale (MCID-TL). Study Design and Setting: The MCID-Sc and MCID-TL of the MOS-SF36 general health subscale were determined for deterioration and improvement on a cohort of 1,170 patients using an anchor-based method and a partial credit model. The Se, Sp, and PV were calculated using the global rating of change (the anchor) as the gold standard test. Results: The MCID-Sc magnitude was smaller for improvement (1.58 points) than for deterioration (−7.91 points). The Se, Sp, and PV were similar for MCID-Sc and MCID-TL in both cases. However, if the MCID was defined on the score scale as a function of a range of baseline scores, its Se, Sp, and PV were consistently higher. Conclusion: This study reinforces the recommendations concerning the use of an MCID-Sc defined as a function of a range of baseline scores.
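The accuracy measures used here can be sketched by classifying each patient's score change against an MCID threshold and comparing the result with the anchor. The 1.58-point improvement threshold is taken from the abstract; the score changes, anchor ratings, and function name are hypothetical illustration data, not study results.

```python
def diagnostic_accuracy(predicted, reference):
    """Se, Sp, PPV and NPV of boolean predictions against a gold standard."""
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)
    fp = sum(1 for p, r in pairs if p and not r)
    fn = sum(1 for p, r in pairs if not p and r)
    tn = sum(1 for p, r in pairs if not p and not r)
    # Assumes every denominator is non-zero (both classes present and predicted).
    return {
        "se": tp / (tp + fn),
        "sp": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


MCID_IMPROVEMENT = 1.58  # score-scale threshold reported in the abstract
changes = [5.0, 2.0, 0.0, -3.0, 4.0, 1.0]                 # hypothetical score changes
anchor_improved = [True, False, False, False, True, True]  # hypothetical anchor ratings

predicted = [change >= MCID_IMPROVEMENT for change in changes]
result = diagnostic_accuracy(predicted, anchor_improved)
print(result)
```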

BACKGROUND AND OBJECTIVE: The objective of the study was to enhance the clinical interpretation and practicality of the widely used comprehensive Sickness Impact Profile. METHOD: Item Response Theory (extension of the Rasch model) was used to calibrate the severity of the SIP items, to assess item bias and to construct equally severe short forms of the SIP that can be used interchangeably. The scores of 1507 subjects were analyzed. RESULTS: Of the 127 SIP items, 82 items fitted the extended Rasch model, i.e., the observed proportions of sickness level groups endorsing the items corresponded to the proportions expected by the model. The item severity hierarchy allowed a more straightforward interpretation of the calibrated SIP-82 scores. Some items showed bias in age, gender, or diagnosis groups. The equivalent short forms agreed sufficiently well with the calibrated SIP-82 item pool to be used interchangeably. We observed a moderate correlation between the original SIP item severity weights and the Rasch item severity calibrations (r=0.53). CONCLUSION: The interpretability and practicality of the SIP was enhanced by the IRT calibration. Using the item calibrations, short forms can be assembled that can be used interchangeably.

Background: Measurement of headache impact is important in clinical trials, case detection, and the clinical monitoring of patients. Computerized adaptive testing (CAT) of headache impact has potential advantages over traditional fixed-length tests in terms of precision, relevance, real-time quality control and flexibility. Objective: To develop an item pool that can be used for a computerized adaptive test of headache impact. Methods: We analyzed responses to four well-known tests of headache impact from a population-based sample of recent headache sufferers (n = 1016). We used confirmatory factor analysis for categorical data and analyses based on item response theory (IRT). Results: In factor analyses, we found very high correlations between the factors hypothesized by the original test constructors, both within and between the original questionnaires. These results suggest that a single score of headache impact is sufficient. We established a pool of 47 items which fitted the generalized partial credit IRT model. By simulating a computerized adaptive health test we showed that an adaptive test of only five items had a very high concordance with the score based on all items and that different worst-case item selection scenarios did not lead to bias. Conclusion: We have established a headache impact item pool that can be used in CAT of headache impact.



Cancer survivors frequently experience worry about a variety of topics, including fear of recurrence. However, general measures of worry still require examination of reliability for this vulnerable population. This study utilized modern psychometric methods to examine the reliability of a worry measure in women with breast or gynecologic cancer.


Women with cancer (n = 332) completed the 16-item Penn State Worry Questionnaire (PSWQ), which has an abbreviated 8-item version (PSWQ-A). Categorical confirmatory factor analysis (CCFA) was used to determine the factor structure and item response theory (IRT) was used to examine score reliability.


CCFA supported a two-factor structure with 11 positively worded items and the 5 negatively worded items loading on different factors. IRT analysis of the 11 positively worded items showed that each was contributing meaningful information to the overall scores. The 11 positively worded items and the PSWQ-A produced the most reliable scores for levels of worry ranging from one θ below to two θ above the mean.


The 11 positively worded items of the PSWQ and the 8-item PSWQ-A were suitable for use in cancer patients while the full PSWQ was unsuitable due to inclusion of the negatively worded items. Future research should consider measuring worry when examining distress in cancer survivors.

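Objective: To analyze and evaluate the quality of the Chinese Version of the Core Occupational Stress Scale (《中国版职业紧张核心量表》) using item response theory (IRT), and to provide a reference for later use and revision of the scale. Methods: Using convenience sampling, 1,261 medical staff from two tertiary grade-A hospitals and several primary and secondary hospitals in Hubei Province were recruited, and their occupational stress was surveyed with the Chinese Version of the Core Occupational Stress Scale. Principal component analysis was used to verify the unidimensionality of the scale's four dimensions. Samejima's graded response model was used to compute the discrimination, difficulty coefficients, and information of each item, so as to evaluate the measurement properties of the scale at the item level. Results: All four dimensions of the scale satisfied the unidimensionality assumption. The IRT results showed good discrimination for all items, with values ranging from 0.67 to 3.10. Of the 17 items, 13 had difficulty coefficients between -2.78 and 2.30 with no difficulty reversal; items 9 and 11 were too difficult and showed difficulty reversal; items 15 and 16 had difficulty coefficients that were both too low and too high and also showed difficulty reversal, indicating that they need improvement. Items 9, 11 and 15 provided moderate information and items 16 and 17 provided poor information, while all remaining items provided good information. Conclusion: All items of the Chinese Version of the Core Occupational Stress Scale show good discrimination. In terms of both difficulty coefficients and information, items 9, 11, 15, 16 and 17 need improvement, while the remaining items perform well. It is recommended that the problem items be revised on the basis of these results together with expert opinion.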

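Objective: To analyze and evaluate the items of the quality of life instrument for patients with cervical cancer, QLICP-CE (V2.0), using both CTT and IRT. Methods: QLICP-CE (V2.0) was administered to 186 patients with cervical cancer. Four statistical methods from classical test theory (CTT), namely the variability method, the correlation coefficient method, factor analysis, and Cronbach's alpha, were used to evaluate item quality. Samejima's graded response model from item response theory (IRT) was also used to compute the difficulty, discrimination coefficient, and information of each item. Results: The CTT analysis indicated that 9 items in the general module of QLICP-CE (V2.0) had relatively low correlations with their own domains, as did 3 items in the specific module. The IRT results showed good discrimination for all items, with values ranging from 0.64 to 1.33. Of the 44 items, 35 had difficulty coefficients in the range -3.49 to 3.76 that increased monotonically with difficulty level (B1→B4). The average information was good for all but 3 items. Conclusion: All items of the QLICP-CE (V2.0) scale show good discrimination and most items perform well, but a small number of items still need further revision and validation.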
