Similar Articles
20 similar articles found
1.
CONTEXT: Item response theory (IRT) measurement models are discussed in the context of their potential usefulness in various medical education settings such as assessment of achievement and evaluation of clinical performance. PURPOSE: The purpose of this article is to compare and contrast IRT measurement with the more familiar classical measurement theory (CMT) and to explore the benefits of IRT applications in typical medical education settings. SUMMARY: CMT, the more common measurement model used in medical education, is straightforward and intuitive. Its limitation is that it is sample-dependent, in that all statistics are confounded with the particular sample of examinees who completed the assessment. Examinee scores from IRT are independent of the particular sample of test questions or assessment stimuli. Also, item characteristics, such as item difficulty, are independent of the particular sample of examinees. The IRT characteristic of invariance permits easy equating of examination scores, which places scores on a constant measurement scale and permits the legitimate comparison of student ability change over time. Three common IRT models and their statistical assumptions are discussed. IRT applications in computer-adaptive testing and as a method for adjusting rater error in clinical performance assessments are reviewed. CONCLUSIONS: IRT measurement is a powerful tool used to solve a major problem of CMT, that is, the confounding of examinee ability with item characteristics. IRT measurement addresses important issues in medical education, such as eliminating rater error from performance assessments.
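The invariance property described above is easiest to see through the item response function itself. Below is a minimal Python sketch of the two-parameter logistic (2PL) model, one of the common IRT models the article discusses; the parameter values are purely illustrative.

```python
import numpy as np

def p_correct_2pl(theta, a, b):
    """2PL item response function: probability of a correct response
    given ability theta, item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Illustrative: a fairly discriminating, moderately hard item.
theta = np.linspace(-3, 3, 7)   # ability grid in logits
print(p_correct_2pl(theta, a=1.5, b=1.0).round(3))
```

Because a and b are properties of the item and theta of the person, the curve is the same whichever sample answered the item; this separation is what makes equating across test forms possible.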

2.
张昊, 尚磊. 《实用预防医学》2019, 26(3): 381-385
Rating scales are widely used in psychology and education, and in medical research many disease states and unhealthy behaviors can only be measured and assessed indirectly through scales; yet scale development in medicine still lags behind. This article introduces the three measurement theories most often used in validated psychological and educational scale development: classical test theory (CTT), generalizability theory (GT), and item response theory (IRT). CTT is widely applied in medical scale development and is simple to use and computationally straightforward, but it fits only a simple linear model, which limits its use in medicine, whereas more advanced measurement theories have already been adopted for educational and psychological tests and scales. GT, as a complement to CTT, estimates reliability while accounting for multiple sources of error simultaneously, giving finer and more accurate reliability estimates than CTT. IRT introduces nonlinear models and associated parameters, allowing more detailed and deeper analysis of scale items and improving item quality. Because these two theories are complex to apply and disciplinary barriers persist, they have so far seen little use in medical scale development; their full and validated application to future medical scales would be of substantial theoretical and practical significance.
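To make the CTT side of this comparison concrete, the sketch below computes Cronbach's alpha, the internal-consistency estimate most often reported for medical scales, on simulated data; the five-item design and all values are illustrative.

```python
import numpy as np

def cronbach_alpha(X):
    """Cronbach's alpha for an (n_persons x n_items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total)."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1).sum()
    total_var = X.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(0)
true_score = rng.normal(size=(200, 1))                     # latent trait
items = true_score + rng.normal(scale=1.0, size=(200, 5))  # 5 noisy items
print(round(float(cronbach_alpha(items)), 3))
```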

3.

Purpose

The present study investigates the properties of the French version of the OUT-PATSAT35 questionnaire, which evaluates outpatients' satisfaction with care in oncology, using classical test theory (CTT) and item response theory (IRT).

Methods

This cross-sectional multicenter study includes 692 patients who completed the questionnaire at the end of their ambulatory treatment. CTT analyses tested the main psychometric properties (convergent and divergent validity, and internal consistency). IRT analyses were conducted separately for each OUT-PATSAT35 domain (the doctors, the nurses or the radiation therapists and the services/organization) by models from the Rasch family. We examined the fit of the data to the model expectations and tested whether the model assumptions of unidimensionality, monotonicity and local independence were respected.

Results

A total of 605 (87.4 %) respondents were analyzed, with a mean age of 64 years (range 29–88). Internal consistency for all scales separately and for the three main domains was good (Cronbach’s α 0.74–0.98). IRT analyses were performed with the partial credit model. No disordered thresholds of polytomous items were found. Each domain showed high reliability but fitted poorly to the Rasch models. Three items in particular, the item about “promptness” in the doctors’ domain and the items about “accessibility” and “environment” in the services/organization domain, showed the greatest misfit. An acceptable fit to the Rasch model could be obtained by dropping these items. Most of the local dependence concerned items about “information provided” in each domain. A major deviation from unidimensionality was found in the nurses’ domain.

Conclusions

CTT showed good psychometric properties of the OUT-PATSAT35. However, the Rasch analysis revealed some misfitting and redundant items. Given these problems, it would be worthwhile to refine the questionnaire in a future study.
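The partial credit model used in this study builds each category's probability from cumulative step difficulties; ordered step estimates are what the "no disordered thresholds" finding refers to. A minimal sketch, with illustrative step difficulties for a 5-category satisfaction item:

```python
import numpy as np

def pcm_probs(theta, deltas):
    """Partial credit model: probabilities of response categories 0..m
    given ability theta and m step difficulties deltas."""
    deltas = np.asarray(deltas, dtype=float)
    # Log-numerators: category 0 is 0; category k sums (theta - delta_j).
    log_num = np.concatenate(([0.0], np.cumsum(theta - deltas)))
    num = np.exp(log_num - log_num.max())   # subtract max for stability
    return num / num.sum()

print(pcm_probs(theta=0.5, deltas=[-1.5, -0.5, 0.5, 1.5]).round(3))
```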

4.
5.
Objective: To analyze the items of the quality-of-life scale for patients with chronic gastritis, QLICD-CG (V2.0), using classical test theory and item response theory. Methods: The QLICD-CG (V2.0) was administered to 163 patients with chronic gastritis to assess quality of life. Item response theory analyses in Multilog 7.03 yielded each item's difficulty, discrimination, and information, and four statistical criteria from classical test theory were applied jointly to judge item quality. Results: By CTT, all items except three (GPH3, GPS3, CG11) met at least three of the four statistical criteria. By IRT, all difficulty parameters fell between -6.42 and 4.36 and increased monotonically across the difficulty levels (B1→B4); all discriminations fell between 1.37 and 1.69, and all average item informations lay between 0.356 and 0.780. Of the 39 items, 37 performed well and two (GPH3, GPS3) need refinement. Conclusion: Most items of the QLICD-CG (V2.0) perform well, but a few still require improvement.
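The per-item "information" figures reported here can be read through the 2PL item information function, I(θ) = a²P(θ)(1 − P(θ)), which peaks where item difficulty matches the respondent's level. A small sketch with illustrative parameters (not the scale's actual estimates):

```python
import numpy as np

def item_information_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta:
    I(theta) = a^2 * P(theta) * (1 - P(theta))."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

theta = np.linspace(-4, 4, 9)
info = item_information_2pl(theta, a=1.5, b=0.0)
print(info.round(3))                 # peaks at theta == b
print(round(float(info.mean()), 3))  # an "average information" summary
```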

6.
Objective: To analyze and evaluate the items of the quality-of-life scale for cervical cancer patients, QLICP-CE (V2.0), using both CTT and IRT. Methods: The QLICP-CE (V2.0) was administered to 186 cervical cancer patients. Item quality was judged with four statistical methods from classical test theory (variability, correlation coefficient, factor analysis, and Cronbach's alpha), and Samejima's graded response model from item response theory was used to compute each item's difficulty, discrimination, and information. Results: CTT analysis showed that nine items in the general module, and three in the disease-specific module, correlated weakly with their own domains. IRT results showed good discrimination for all items, with values between 0.64 and 1.33; for 35 of the 44 items the difficulty parameters ranged from -3.49 to 3.76 and increased monotonically across the difficulty levels (B1→B4); and average item information was good for all but three items. Conclusion: All items of the QLICP-CE (V2.0) discriminate well and most perform well, but a small number of items still need revision and re-validation.
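Samejima's graded response model, used in this study, obtains each category's probability as the difference of two cumulative boundary curves. A minimal sketch for one item with an illustrative discrimination and thresholds B1–B4 (the monotone ordering of the thresholds corresponds to the monotone trend noted above):

```python
import numpy as np

def grm_probs(theta, a, bs):
    """Samejima graded response model: P(category k), k = 0..m, as
    differences of cumulative 2PL boundary curves P(X >= k)."""
    bs = np.asarray(bs, dtype=float)                 # ordered thresholds
    cum = 1.0 / (1.0 + np.exp(-a * (theta - bs)))    # P(X >= 1..m)
    cum = np.concatenate(([1.0], cum, [0.0]))
    return cum[:-1] - cum[1:]

# Illustrative: discrimination 1.0, thresholds B1..B4 increasing.
print(grm_probs(theta=0.0, a=1.0, bs=[-2.0, -0.5, 0.8, 2.2]).round(3))
```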

7.
Construct-irrelevant variance (CIV), the erroneous inflation or deflation of test scores due to certain types of uncontrolled or systematic measurement error, and construct underrepresentation (CUR), the under-sampling of the achievement domain, are discussed as threats to the meaningful interpretation of scores from objective tests developed for local medical education use. Several sources of CIV and CUR are discussed and remedies are suggested. Test score inflation or deflation, due to the systematic measurement error introduced by CIV, may result from poorly crafted test questions, insecure test questions and other types of test irregularities, testwiseness, guessing, and test item bias. Using indefensible passing standards can interact with test scores to produce CIV. Sources of construct underrepresentation are associated with tests that are too short to support legitimate inferences to the domain and that are composed of trivial questions written at low levels of the cognitive domain. "Teaching to the test" is another frequent contributor to CUR in examinations used in medical education. Most sources of CIV and CUR can be controlled or eliminated from the tests used at all levels of medical education, given proper training and support of the faculty who create these important examinations.

8.
Item banks and computerized adaptive testing (CAT) have the potential to greatly improve the assessment of health outcomes. This review describes the unique features of item banks and CAT and discusses how to develop item banks. In CAT, a computer selects the items from an item bank that are most relevant for and informative about the particular respondent, thus optimizing test relevance and precision. Item response theory (IRT) provides the foundation for selecting the items that are most informative for the particular respondent and for scoring responses on a common metric. The development of an item bank is a multi-stage process that requires a clear definition of the construct to be measured, good items, a careful psychometric analysis of the items, and a clear specification of the final CAT. The psychometric analysis needs to evaluate the assumptions of the IRT model, such as unidimensionality and local independence; whether the items function the same way in different subgroups of the population; and whether there is an adequate fit between the data and the chosen item response models. Also, interpretation guidelines need to be established to help the clinical application of the assessment. Although medical research can draw upon expertise from educational testing in the development of item banks and CAT, the medical field also encounters unique opportunities and challenges.
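The selection step described above is most often implemented as maximum-information selection: at the current ability estimate, administer the remaining bank item with the highest Fisher information. A minimal sketch over an invented 2PL item bank:

```python
import numpy as np

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def select_next_item(theta_hat, bank, administered):
    """Index of the unadministered item with maximum information
    at the current ability estimate theta_hat."""
    best, best_info = None, -np.inf
    for idx, (a, b) in enumerate(bank):
        if idx in administered:
            continue
        i = info_2pl(theta_hat, a, b)
        if i > best_info:
            best, best_info = idx, i
    return best

bank = [(1.2, -1.0), (0.8, 0.0), (1.5, 0.4), (1.1, 1.5)]  # invented (a, b)
print(select_next_item(theta_hat=0.3, bank=bank, administered={0}))
```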

9.
Background: Item response theory (IRT) is a powerful framework for analyzing multi-item scales and is central to the implementation of computerized adaptive testing. Objectives: To explain the use of IRT to examine measurement properties and to apply IRT to a questionnaire for measuring migraine impact, the Migraine Specific Questionnaire (MSQ). Methods: Data from three clinical studies that employed the MSQ-version 1 were analyzed by confirmatory factor analysis for categorical data and by IRT modeling. Results: Confirmatory factor analyses showed very high correlations between the factors hypothesized in the original test construction. Further, high item loadings on one common factor suggest that migraine impact may be adequately assessed by only one score. IRT analyses of the MSQ were feasible and provided several suggestions as to how to improve the items and, in particular, the response choices. Out of 15 items, 13 showed adequate fit to the IRT model. In general, IRT scores were strongly associated with the scores proposed by the original test developers and with the total item sum score. Analysis of response consistency showed that more than 90% of the patients answered consistently according to a unidimensional IRT model. For the remaining patients, scores on the dimension of emotional function were less strongly related to the overall IRT scores, which mainly reflected role limitations. Such response patterns can be detected easily using response consistency indices. Analysis of test precision across score levels revealed that the MSQ was most precise at one standard deviation worse than the mean impact level for migraine patients who are not in treatment. Thus, gains in test precision can be achieved by developing items aimed at less severe levels of migraine impact. Conclusions: IRT proved useful for analyzing the MSQ. The approach warrants further testing in a more comprehensive item pool for headache impact that would enable computerized adaptive testing.
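The "test precision across score levels" analysis follows from the test information function: total information is the sum of item informations, and the standard error of measurement at θ is 1/√I(θ). A minimal 2PL sketch (item parameters invented, clustered below the mean to echo the finding above):

```python
import numpy as np

def test_information(theta, items):
    """Total test information: the sum of 2PL item informations."""
    total = np.zeros_like(theta, dtype=float)
    for a, b in items:
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        total += a**2 * p * (1.0 - p)
    return total

items = [(1.4, -1.2), (1.1, -0.8), (1.3, -1.0), (0.9, 0.2)]  # invented
theta = np.linspace(-3, 3, 13)
se = 1.0 / np.sqrt(test_information(theta, items))
print(se.round(2))   # smallest SE where the item difficulties cluster
```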

10.
Differential item functioning (DIF) in tests and multi-item surveys occurs when a lack of conditional independence exists between the response to one or more items and membership in a particular group, given equal levels of proficiency. We develop an approach to detecting DIF in the context of item response theory (IRT) models based on computing a diagnostic that is the posterior mean of a p-value. IRT models are fit in a Bayesian framework, and simulated proficiency parameters from the posterior distribution are retained. Monte Carlo estimates of the p-value diagnostic are then computed by comparing the fit of nonparametric regressions of item responses on simulated proficiency parameters and group membership. Some properties of our approach are examined through a simulation experiment. We apply our method to the analysis of responses from two separate studies to the BASIS-24, a widely used self-report mental health assessment instrument, to examine DIF between the English and Spanish-translated versions of the survey.
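The posterior-mean p-value diagnostic is specific to this paper, but the underlying question, whether group membership still predicts an item response after conditioning on proficiency, also drives the simpler classical logistic-regression DIF test. A sketch of that alternative on simulated data (statsmodels assumed available; this is not the authors' Bayesian procedure):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def logistic_dif_test(resp, total, group):
    """Logistic-regression DIF: likelihood-ratio test of whether group
    membership predicts the item response given the matching score."""
    base = sm.Logit(resp, sm.add_constant(total)).fit(disp=0)
    full = sm.Logit(resp, sm.add_constant(np.column_stack([total, group]))).fit(disp=0)
    lr = 2 * (full.llf - base.llf)
    return lr, chi2.sf(lr, df=1)

rng = np.random.default_rng(1)
total = rng.normal(size=500)                    # proficiency proxy
group = rng.integers(0, 2, size=500)            # group membership
p = 1 / (1 + np.exp(-(total + 0.6 * group)))    # item with built-in DIF
resp = (rng.random(500) < p).astype(int)
print(logistic_dif_test(resp, total, group))
```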

11.
This article provides an overview of item response theory (IRT) models and how they can be appropriately applied to patient-reported outcomes (PROs) measurement. Specifically, the following topics are discussed: (a) basics of IRT, (b) types of IRT models, (c) how IRT models have been applied to date, and (d) new directions in applying IRT to PRO measurements.

12.
Patient-relevant outcomes, such as cognitive functioning and functional status, measured using questionnaires, have become important endpoints in medical studies. Traditionally, responses to individual items are simply summed to obtain a score for each patient. Recently, there has been interest in another paradigm, item response theory (IRT), proposed as an alternative to summed scores. The benefits of the use of IRT are greatest when it is used in conjunction with a calibrated item bank. This is a collection of items which have been presented to large groups of patients, whose responses are used to estimate the measurement properties of the individual items. This article examines the methodology surrounding the use of IRT to construct and calibrate an item bank and uses the AMC Linear Disability Score project, which aims to develop an item bank to measure functional status as expressed by the ability to perform activities of daily life, as an illustration.

13.

Objectives

We review the papers presented at the NCI/DIA conference to identify areas of controversy and uncertainty and to highlight those aspects of item response theory (IRT) and computer adaptive testing (CAT) that require theoretical or empirical research in order to justify their application to patient-reported outcomes (PROs).

Background

IRT and CAT offer exciting potential for the development of a new generation of PRO instruments. However, most of the research into these techniques has been in non-healthcare settings, notably in education. Educational tests are very different from PRO instruments, and consequently problematic issues arise when adapting IRT and CAT to healthcare research.

Results

Clinical scales differ appreciably from educational tests, and symptoms have characteristics distinctly different from examination questions. This affects the transfer of IRT technology. Particular areas of concern when applying IRT to PROs include inadequate software, difficulties in selecting models and communicating results, insufficient testing of local independence and other assumptions, and a need for guidelines for estimating sample size requirements. Similar concerns apply to differential item functioning (DIF), which is an important application of IRT. Multidimensional IRT is likely to be advantageous only for closely related PRO dimensions.
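Local independence, one of the under-tested assumptions flagged above, is commonly checked with Yen's Q3: remove the model's expected scores and correlate the residuals; item pairs with residual correlations far from zero are suspect. A minimal sketch under a Rasch-type model, with simulated data standing in for a fitted model's expectations:

```python
import numpy as np

def q3_matrix(responses, expected):
    """Yen's Q3: correlations of IRT residuals (observed - expected)
    across items; large off-diagonal values suggest local dependence."""
    resid = responses - expected
    return np.corrcoef(resid, rowvar=False)

rng = np.random.default_rng(2)
theta = rng.normal(size=(300, 1))
b = np.array([-1.0, 0.0, 1.0])              # illustrative difficulties
p = 1 / (1 + np.exp(-(theta - b)))          # Rasch-type expected scores
responses = (rng.random((300, 3)) < p).astype(float)
print(q3_matrix(responses, p).round(2))
```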

Conclusions

Although IRT and CAT provide appreciable potential benefits, there is a need for circumspection. Not all PRO scales are necessarily appropriate targets for this methodology. Traditional psychometric methods, and especially qualitative methods, continue to have an important role alongside IRT. Research should be funded to address the specific concerns that have been identified.

14.

Background  

The International Classification of Functioning, Disability and Health (ICF) proposes three main health outcomes, Impairment (I), Activity Limitation (A) and Participation Restriction (P), but good measures of these constructs are needed. The aim of this study was to use both classical test theory (CTT) and item response theory (IRT) methods to carry out an item analysis to improve measurement of these three components in patients having joint replacement surgery, mainly for osteoarthritis (OA).

15.
Health status assessment is frequently used to evaluate the combined impact of human immunodeficiency virus (HIV) disease and its treatment on functioning and well-being from the patient's perspective. No single health status measure can efficiently cover the range of problems in functioning and well-being experienced across HIV disease stages. Item response theory (IRT), item banking and computer adaptive testing (CAT) provide a solution to measuring health-related quality of life (HRQoL) across different stages of HIV disease. IRT allows us to examine the response characteristics of individual items and the relationship between responses to individual items and the responses to each other item in a domain. With information on the response characteristics of a large number of items covering a HRQoL domain (e.g. physical function and psychological well-being), and information on the interrelationships between all pairs of these items and the total scale, we can construct more efficient scales. Item banks consist of large sets of questions representing various levels of a HRQoL domain that can be used to develop brief, efficient scales for measuring the domain. CAT is the application of IRT and item banks to the tailored assessment of HRQoL domains specific to individual patients. Given the results of IRT analyses and computer-assisted test administration, more efficient and brief scales can be used to measure multiple domains of HRQoL for clinical trials and longitudinal observational studies.

16.
Objective: The objective of this study was to develop a questionnaire that could integrate patient and provider items on mobility and self-care into unidimensional scales. The instrument should be suitable for various measurement models (patient and provider data [PAT–PRO], only patient data [PAT], only provider data [PRO]). Study Design and Setting: The existing instruments, MOSES-Patient and MOSES-Provider, were integrated into the MOSES-Combi and completed by a total of 1,019 neurology, cardiac, or musculoskeletal patients and/or their physicians (MOSES = acronym for “mobility and self-care”). Results: After selection of 18 items, all 12 scales of the MOSES-Combi (87 items) were largely unidimensional, met the standards for a 1-parameter item response theory (IRT) model, were sufficiently reliable, and showed no differential item functioning (DIF) for age or gender. The person parameters set in the PAT–PRO measurement model show at least moderate, but usually substantial, agreement with those set in the PRO and PAT measurement models. Conclusion: The advantages of the MOSES-Combi are that it can be used for various measurement models and is suitable for studying agreement between patient and provider assessments because of its psychometric properties (same scaling for patient and provider items). Integration of various data sources in an IRT scale can be extended to other assessments.

17.
Background: As part of a larger study whose objective is to develop an abbreviated version of the EORTC QLQ-C30 suitable for research in palliative care, analyses were conducted to determine the feasibility of generating a shorter version of the 4-item emotional functioning (EF) scale that could be scored in the original metric. Methods: We used data from 24 European cancer studies conducted in 10 different languages (n=8242). Item selection was based on analyses by item response theory (IRT). Based on the IRT results, a simple scoring algorithm was developed to predict the original 4-item EF sum scale score from a reduced number of items. Results: Both a 3-item and a 2-item version (the latter comprising item 21, ‘Did you feel tense?’, and item 24, ‘Did you feel depressed?’) predicted the total score with excellent agreement and very little bias. In group comparisons, the 2-item scale led to the same conclusions as those based on the original 4-item scale with little or no loss of measurement efficiency. Conclusion: Although these results are promising, confirmatory studies are needed based on independent samples. If such additional studies yield comparable results, incorporation of the 2-item EF scale in an abbreviated version of the QLQ-C30 for use in palliative care research settings would be justified. The analyses reported here demonstrate the usefulness of the IRT-based methodology for shortening questionnaire scales.
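The study's scoring algorithm was derived from IRT results; as a simplified stand-in that shows the shape of the task, the sketch below predicts a simulated 4-item sum score from two of its items by ordinary least squares (all data are simulated, not EORTC data):

```python
import numpy as np

rng = np.random.default_rng(3)
ef = rng.normal(size=(1000, 1))   # latent emotional functioning
# Four 4-category items (1..4) driven by the same latent score.
items = np.clip(np.round(2.5 + 1.2 * ef + rng.normal(scale=0.8, size=(1000, 4))), 1, 4)
full_score = items.sum(axis=1)    # original 4-item sum

# Predict the 4-item sum from two items only (cf. items 21 and 24).
X = np.column_stack([np.ones(1000), items[:, 1], items[:, 3]])
coef, *_ = np.linalg.lstsq(X, full_score, rcond=None)
pred = X @ coef
print(round(float(np.corrcoef(pred, full_score)[0, 1]), 3))  # agreement
```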

18.
OBJECTIVE: To demonstrate the value of item response theory (IRT) and differential item functioning (DIF) methods in examining a health-related quality-of-life measure in children and adolescents. STUDY DESIGN AND SETTING: This illustration uses data from 5,429 children using the four subscales of the PedsQL 4.0 Generic Core Scales. The IRT model-based likelihood ratio test was used to detect and evaluate DIF between healthy children and children with a chronic condition. RESULTS: DIF was detected for a majority of items but canceled out at the total test score level due to opposing directions of DIF. Post hoc analysis indicated that this pattern of results may be due to multidimensionality. We discuss issues in detecting and handling DIF. CONCLUSION: This article describes how to perform DIF analyses in validating a questionnaire to ensure that scores have equivalent meaning across subgroups. It offers insight into ways information gained through the analysis can be used to evaluate an existing scale.
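The IRT likelihood-ratio DIF test used here compares a model that constrains an item's parameters to be equal across groups against one that frees them; twice the log-likelihood difference is referred to a chi-square distribution. A minimal sketch of the arithmetic (the log-likelihood values are invented):

```python
from scipy.stats import chi2

def irt_lr_dif(llf_constrained, llf_free, n_freed_params):
    """Likelihood-ratio DIF test: constrained vs. group-specific
    item parameters; df = number of parameters freed."""
    lr = 2 * (llf_free - llf_constrained)
    return lr, chi2.sf(lr, df=n_freed_params)

# Freeing a 2PL item's a and b across two groups adds 2 parameters.
print(irt_lr_dif(llf_constrained=-4321.7, llf_free=-4316.2, n_freed_params=2))
```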

19.
Cognitive screening tests and items have been found to perform differently across groups that differ in terms of education, ethnicity and race. Despite the profound implications that such bias holds for studies in the epidemiology of dementia, little research has been conducted in this area. Using the methods of modern psychometric theory (in addition to those of classical test theory), we examined the performance of the Attention subscale of the Mattis Dementia Rating Scale. Several item response theory models, including the two- and three-parameter dichotomous response logistic models as well as a polytomous response model, were compared. (Log-likelihood ratio tests showed that the three-parameter model was not an improvement over the two-parameter model.) Data were collected as part of the ten-study National Institute on Aging Collaborative investigation of special dementia care in institutional settings. The subscale KR-20 estimate for this sample was 0.92. IRT model-based reliability estimates, provided at several points along the latent attribute, ranged from 0.65 to 0.97; the measure was least precise at the less disabled tail of the distribution. Most items performed in a similar fashion across education groups; the item characteristic curves were almost identical, indicating little or no differential item functioning (DIF). However, four items were problematic. One item (digit span backwards) demonstrated a large error term in the confirmatory factor analysis; item-fit chi-square statistics developed using BIMAIN confirm this result for the IRT models. Further, the discrimination parameter for that item was low for all education subgroups. Generally, persons with the highest education had a greater probability of passing the item for most levels of theta. Model-based tests of DIF using MULTILOG identified three other items with significant, albeit small, DIF. One item, for example, showed non-uniform DIF in that at the impaired tail of the latent distribution, persons with higher education had a higher probability of correctly responding to the item than did lower education groups, but at less impaired levels, they had a lower probability of a correct response than did lower education groups. Another method of detection identified this item as having DIF (unsigned area statistic=3.05, p<0.01, and 2.96, p<0.01). On average, across the entire score range, the lower education group's probability of answering the item correctly was 0.11 higher than the higher education group's probability. A cross-validation with larger subgroups confirmed the overall result of little DIF for this measure. The methods used for detecting differential item functioning (which may, in turn, be indicative of bias) were applied to a neuropsychological subtest. These methods have been used previously to examine bias in screening measures across education and ethnic and racial subgroups. In addition to the important epidemiological applications of ensuring that screening measures and neuropsychological tests used in diagnoses are free of bias so that more culture-fair classifications will result, these methods are also useful for the examination of site differences in large multi-site clinical trials. It is recommended that these methods receive wider attention in the medical statistical literature.
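The unsigned area statistic quoted above quantifies DIF as the area between two groups' item characteristic curves over the ability range; non-uniform DIF shows up when the curves cross. A minimal 2PL sketch with invented group parameters:

```python
import numpy as np

def unsigned_area(a1, b1, a2, b2, lo=-4.0, hi=4.0, n=2001):
    """Unsigned area between two groups' 2PL item characteristic
    curves: the integral of |P1(theta) - P2(theta)| over [lo, hi]."""
    theta = np.linspace(lo, hi, n)
    p1 = 1 / (1 + np.exp(-a1 * (theta - b1)))
    p2 = 1 / (1 + np.exp(-a2 * (theta - b2)))
    return np.abs(p1 - p2).mean() * (hi - lo)   # Riemann approximation

# Invented non-uniform DIF: curves cross because discriminations differ.
print(round(float(unsigned_area(a1=1.5, b1=0.0, a2=0.7, b2=-0.3)), 3))
```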

20.
Objective: We tested the item response theory (IRT) model assumptions of the original item bank and evaluated the practical and psychometric adequacy of a computerized adaptive test (CAT) for patients with foot or ankle impairments seeking rehabilitation in outpatient therapy clinics. Methods: Data from 10,287 patients with foot or ankle impairments receiving outpatient physical therapy were analyzed. We first examined the unidimensionality, fit, and invariance IRT assumptions of the CAT item bank. Then we evaluated the efficiency of the CAT administration and the construct validity and sensitivity to change of the foot/ankle CAT measure of lower-extremity functional status (FS). Results: Results supported unidimensionality, model fit, and invariance of item parameters and patient ability estimates. On average, the CAT used seven items to produce precise estimates of FS that adequately covered the content range with negligible floor and ceiling effects. Patients who were older, had more chronic symptoms, had more surgeries, had more comorbidities, and did not exercise prior to receiving rehabilitation reported worse discharge FS. Seventy-one percent of patients obtained statistically significant change at follow-up. A change of 8 FS units (scale 0–100) represented the minimal clinically important improvement. Conclusions: We concluded that the foot/ankle item bank met IRT assumptions and that the CAT FS measure was precise, valid, and responsive, supporting its use in routine clinical application.
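Scoring in a CAT of this kind typically updates an expected-a-posteriori (EAP) ability estimate after each response, by numerical integration over a grid. A minimal 2PL sketch with a standard normal prior (item parameters and responses are illustrative):

```python
import numpy as np

def eap_theta(responses, items, grid=np.linspace(-4, 4, 81)):
    """Expected-a-posteriori ability estimate under a 2PL model with
    a standard normal prior, via numerical integration over a grid."""
    prior = np.exp(-grid**2 / 2)
    like = np.ones_like(grid)
    for x, (a, b) in zip(responses, items):
        p = 1 / (1 + np.exp(-a * (grid - b)))
        like = like * (p if x == 1 else 1 - p)
    post = prior * like
    post /= post.sum()
    return float((grid * post).sum())

items = [(1.2, -0.5), (1.0, 0.0), (1.4, 0.6)]   # illustrative (a, b)
print(round(eap_theta([1, 1, 0], items), 2))
```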
