Similar literature
 20 similar articles found (search time: 31 ms)
1.
Patient-relevant outcomes, such as cognitive functioning and functional status, measured using questionnaires, have become important endpoints in medical studies. Traditionally, responses to individual items are simply summed to obtain a score for each patient. Recently, there has been interest in another paradigm, item response theory (IRT), proposed as an alternative to summed scores. The benefits of IRT are greatest when it is used in conjunction with a calibrated item bank: a collection of items that have been presented to large groups of patients, whose responses are used to estimate the measurement properties of the individual items. This article examines the methodology surrounding the use of IRT to construct and calibrate an item bank and uses the AMC Linear Disability Score project, which aims to develop an item bank to measure functional status as expressed by the ability to perform activities of daily life, as an illustration.
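The contrast between summed scores and IRT scoring against a calibrated bank can be sketched as follows. This is a minimal illustration, not the method of the AMC Linear Disability Score project: it assumes a two-parameter logistic (2PL) model, a grid-search maximum likelihood estimate, and a hypothetical five-item bank whose parameters (`BANK`) are invented for the example.

```python
import math

# Hypothetical calibrated bank: (discrimination a, difficulty b) per item.
# These values are invented for illustration only.
BANK = [(1.2, -1.5), (0.8, -0.5), (1.5, 0.0), (1.0, 0.7), (1.3, 1.6)]

def p_positive(theta, a, b):
    """2PL probability of a positive response at ability theta (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response pattern given fixed item parameters."""
    total = 0.0
    for x, (a, b) in zip(responses, items):
        p = p_positive(theta, a, b)
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total

def ml_theta(responses, items, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum likelihood estimate of ability on [lo, hi]."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))

# Two patients with the same summed score but different response patterns.
easy_items_endorsed = [1, 1, 0, 0, 0]
hard_items_endorsed = [0, 0, 0, 1, 1]
print(sum(easy_items_endorsed), sum(hard_items_endorsed))
print(ml_theta(easy_items_endorsed, BANK), ml_theta(hard_items_endorsed, BANK))
```

Both patterns have the same summed score, yet the IRT estimates differ because the calibrated item parameters enter the likelihood. Patterns with all zeros or all ones have no finite maximum likelihood estimate, so the grid search would simply return a boundary value for them.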

2.

Purpose

Most multidimensional patient-reported outcomes (PRO) measures are lengthy to complete. Computerized adaptive testing (CAT) that selects the most informative items can potentially reduce respondent burden without sacrificing measurement accuracy. The commonly used maximum Fisher information item selection method has been reported to lead to highly unbalanced item bank usage and potentially imprecise trait estimation. This study employs the content-balancing strategy in a bifactor-modeled CAT item selection and examines its impact on measurement accuracy and item bank usage.

Methods

Item responses from a population-based SF-36 survey were first calibrated using the bifactor graded response model. Four post hoc CATs using items and responses from the SF-36 data set were then created. The content-balancing strategy was adopted in the item selection procedure of the bifactor-modeled CAT. The measurement accuracy and usage of items of the CAT were compared between the tests with and without the content-balancing strategy.

Results

The results indicate that the CAT implemented with the content-balancing strategy offers a better overall measurement accuracy of both the general health status and the two health domains (physical and mental) of the SF-36.

Conclusions

The content-balancing strategy helps the CAT–PRO to balance the selection of items and achieve improved measurement accuracy. Its implementation in real-time CAT administration to measure multidimensional PRO traits merits further study.
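The difference between plain maximum Fisher information selection and a content-balanced variant can be sketched as below. This is a simplified illustration under stated assumptions, not the study's procedure: it uses a unidimensional 2PL model instead of the bifactor graded response model, holds the trait estimate fixed instead of updating it after each response, balances content by picking the domain furthest below its target share, and runs on a hypothetical eight-item bank (`BANK`).

```python
import math

# Hypothetical bank: physical items are more discriminating than mental ones.
BANK = [
    {"domain": "physical", "a": 2.0, "b": 0.0},
    {"domain": "physical", "a": 1.8, "b": 0.2},
    {"domain": "physical", "a": 1.9, "b": -0.1},
    {"domain": "physical", "a": 1.7, "b": 0.1},
    {"domain": "mental", "a": 1.0, "b": 0.0},
    {"domain": "mental", "a": 0.9, "b": 0.3},
    {"domain": "mental", "a": 1.1, "b": -0.2},
    {"domain": "mental", "a": 0.8, "b": 0.1},
]

def information(theta, item):
    """Fisher information of a 2PL item at theta: a^2 * p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-item["a"] * (theta - item["b"])))
    return item["a"] ** 2 * p * (1.0 - p)

def next_item(theta, bank, used, targets=None):
    """Pick the most informative unused item, optionally within one domain."""
    remaining = [i for i in range(len(bank)) if i not in used]
    if targets:
        def deficit(domain):
            # Target share minus the share of items administered so far.
            given = sum(1 for i in used if bank[i]["domain"] == domain)
            share = given / len(used) if used else 0.0
            return targets[domain] - share
        open_domains = sorted({bank[i]["domain"] for i in remaining})
        chosen = max(open_domains, key=deficit)
        remaining = [i for i in remaining if bank[i]["domain"] == chosen]
    return max(remaining, key=lambda i: information(theta, bank[i]))

def administer(theta, bank, n_items, targets=None):
    """Return the domains of the items selected at a fixed theta."""
    used = []
    for _ in range(n_items):
        used.append(next_item(theta, bank, used, targets))
    return [bank[i]["domain"] for i in used]

print(administer(0.0, BANK, 4))
print(administer(0.0, BANK, 4, targets={"physical": 0.5, "mental": 0.5}))
```

In this toy bank, unconstrained selection draws only on the highly discriminating physical items, which is the unbalanced bank usage the abstract describes, while the balanced variant splits the test evenly between domains.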

3.
BACKGROUND AND OBJECTIVE: Measuring physical functioning (PF) within and across postacute settings is critical for monitoring outcomes of rehabilitation; however, most current instruments lack sufficient breadth and feasibility for widespread use. Computer adaptive testing (CAT), in which item selection is tailored to the individual patient, holds promise for reducing response burden while maintaining measurement precision. We calibrated a PF item bank via item response theory (IRT), administered items with a post hoc CAT design, and determined whether CAT would improve accuracy and precision of score estimates over random item selection. METHODS: A total of 1,041 adults were interviewed during postacute care rehabilitation episodes in either hospital or community settings. Responses for 124 PF items were calibrated using IRT methods to create a PF item bank. We examined the accuracy and precision of CAT-based scores compared to a random selection of items. RESULTS: CAT-based scores had higher correlations with the IRT-criterion scores, especially with short tests, and resulted in narrower confidence intervals than scores based on a random selection of items; gains, as expected, were especially large for low and high performing adults. CONCLUSION: The CAT design may have important precision and efficiency advantages for point-of-care functional assessment in rehabilitation practice settings.

4.
Objectives

To develop an item pool for constructing a future computerized adaptive test (CAT) for fatigue in rheumatoid arthritis (RA). The item pool was based on the patients' perspective and had previously been examined for face and content validity. This study assessed the fit of the items with seven predefined dimensions and examined the item pool's dimensionality structure in statistical terms.

Study Design and Setting

A total of 551 patients with RA participated in this study. Several steps were conducted to move from an exploratory item pool to a psychometrically sound item bank. Item response theory (IRT) analysis using the generalized partial credit model was conducted for each of the seven predefined dimensions. Poorly fitting items were removed. Finally, the best possible multidimensional IRT (MIRT) model for the data was identified.

Results

In IRT analysis, 49 items showed insufficient item characteristics. Items with a discriminative ability below 0.60 and/or model misfit effect sizes greater than 0.10 were removed. Factor analysis on the 196 remaining items revealed three dimensions, namely severity, impact, and variability of fatigue. The dimensions were further confirmed in MIRT model analysis.

Conclusion

This study provided an initially calibrated item bank and showed which dimensions and items can be used for the development of a multidimensional CAT for fatigue in RA.

5.
To make meaningful cross-cultural comparisons of health-related quality of life (HRQOL) or to pool international research data, it is essential to create culturally unbiased measures that detect clinically important differences between patients. We evaluated the measurement properties of the Functional Assessment of Cancer Therapy-Breast (FACT-B) in 111 Austrian and 144 U.S. patients with breast cancer using item response theory (IRT) methods. A small number of items were identified as displaying statistically significant differential item functioning (DIF), suggesting possible measurement bias. The majority of the items functioned similarly between the two cultural groups. U.S. patients reported lower (worse) physical function and well-being compared with Austrian patients, higher (better) social/family well-being and similar emotional well-being, before and after adjustment for DIF. IRT and related measurement models provide useful methods for assessing cross-cultural equivalence and determining which items can be pooled across languages before analyzing HRQOL data. Determination of clinically significant cross-cultural differences will require additional investigation.

6.
BACKGROUND AND OBJECTIVES: Most health-related quality-of-life questionnaires include multi-item scales. Scale scores are usually estimated as simple sums of the item scores. However, scoring procedures utilizing more information from the items might improve measurement abilities, and thereby reduce the needed sample sizes. We investigated whether item response theory (IRT)-based scoring improved the measurement abilities of the EORTC QLQ-C30 physical functioning, emotional functioning, and fatigue scales. METHODS: Using a database of 13,010 subjects we estimated the relative validities of IRT scoring compared to sum scoring of the scales. RESULTS: The mean relative validities were 1.04 (physical), 1.03 (emotional), and 0.97 (fatigue). None of these were significantly larger than 1. Thus, no gain in measurement abilities using IRT scoring was found for these scales. Possible explanations include that the items in the scales are not constructed for IRT scoring and that the scales are relatively short. CONCLUSION: IRT scoring of the three longest EORTC QLQ-C30 scales did not improve measurement abilities compared to the traditional sum scoring of the scales.

7.
Mungas D, Reed BR. Statistics in Medicine. 2000;19(11-12):1631-1644.
An ideal measure of global functioning for patients with dementia would discriminate at very high and very low levels of functioning and would have linear measurement properties such that a given change in score corresponds to the same amount of change in underlying ability at any part of the ability continuum. Using item response theory methods, linearity of test measurement can be directly assessed and items can be selected to construct a test with desired measurement characteristics. The purpose of this study was to apply item response theory methods to evaluating and developing global functioning scales. Subjects were 1207 patients who had received comprehensive dementia evaluations. Items were selected from two measures of cognitive functioning (Mini Mental State Examination, MMS; Blessed Information Memory Concentration Test, BIMCT) and one measure of independent functioning (Blessed-Roth Dementia Rating Scale, BRDRS). The MMS and BIMCT showed significant non-linearity of measurement, especially at low and high ability levels. A brief composite measure was created by selecting from the three instruments 25 items that fit a uniform distribution of item difficulty across the entire range of ability measured by the three instruments. This composite measure and the BRDRS showed better linearity of measurement than the other two instruments. Results have implications for development of a psychometrically sophisticated, brief measure of global functioning for clinical and research use in dementia.

8.
Value in Health. 2022;25(9):1566-1574.
Objectives

In economic evaluations, quality of life is measured using patient-reported outcome measures (PROMs), such as the EQ-5D-5L. A key assumption for the validity of PROMs data is measurement invariance, which requires that PROM items and response options are interpreted the same across respondents. If measurement invariance is violated, PROMs exhibit differential item functioning (DIF), whereby individuals from different groups with the same underlying health respond differently, potentially biasing scores. One important group of healthcare consumers who have been shown to have different views or priorities over health is older adults. This study investigates age-related DIF in the EQ-5D-5L using item response theory (IRT) and ordinal logistic regression approaches.

Methods

Multiple-group IRT models were used to investigate DIF, by assessing whether older adults aged 65+ years and younger adults aged 18 to 64 years with the same underlying health had different IRT parameter estimates and expected item and EQ-5D-5L level sum scores. Ordinal logistic regression was also used to examine whether DIF resulted in meaningful differences in expected EQ level sum scores. Effect sizes examined whether DIF indicated meaningful score differences.

Results

The anxiety/depression item exhibited meaningful DIF in both approaches, with older adults less likely to report problems. Pain/discomfort and mobility exhibited DIF to a lesser extent.

Conclusions

When using the EQ-5D-5L to evaluate interventions and make resource allocation decisions, scoring bias due to DIF should be controlled for to prevent inefficient service provision, where the most cost-effective services are not provided, which could be detrimental to patients and the efficiency of health budgets.

9.
Objectives

Patient-reported outcomes (PROs) are essential when evaluating many new treatments in health care; yet, current measures have been limited by a lack of precision, standardization, and comparability of scores across studies and diseases. The Patient-Reported Outcomes Measurement Information System (PROMIS) provides item banks that offer the potential for efficient (minimizes item number without compromising reliability), flexible (enables optional use of interchangeable items), and precise (has minimal error in estimate) measurement of commonly studied PROs. We report results from the first large-scale testing of PROMIS items.

Study Design and Setting

Fourteen item pools were tested in the U.S. general population and clinical groups using an online panel and clinic recruitment. A scale-setting subsample was created reflecting demographics proportional to the 2000 U.S. census.

Results

Using item-response theory (graded response model), 11 item banks were calibrated on a sample of 21,133, measuring components of self-reported physical, mental, and social health, along with a 10-item Global Health Scale. Short forms from each bank were developed and compared with the overall bank and with other well-validated and widely accepted (“legacy”) measures. All item banks demonstrated good reliability across most of the score distributions. Construct validity was supported by moderate to strong correlations with legacy measures.

Conclusion

PROMIS item banks and their short forms provide evidence that they are reliable and precise measures of generic symptoms and functional reports comparable to legacy instruments. Further testing will continue to validate and test PROMIS items and banks in diverse clinical populations.

10.
BACKGROUND AND OBJECTIVE: The objective of the study was to enhance the clinical interpretation and practicality of the widely used comprehensive Sickness Impact Profile (SIP). METHOD: Item response theory (an extension of the Rasch model) was used to calibrate the severity of the SIP items, to assess item bias and to construct equally severe short forms of the SIP that can be used interchangeably. The scores of 1507 subjects were analyzed. RESULTS: Of the 127 SIP items, 82 items fitted the extended Rasch model, i.e., the observed proportions of sickness level groups endorsing the items corresponded to the proportions expected by the model. The item severity hierarchy allowed a more straightforward interpretation of the calibrated SIP-82 scores. Some items showed bias in age, gender, or diagnosis groups. The equivalent short forms agreed sufficiently well with the calibrated SIP-82 item pool to be used interchangeably. We observed a moderate correlation between the original SIP item severity weights and the Rasch item severity calibrations (r=0.53). CONCLUSION: The interpretability and practicality of the SIP were enhanced by the IRT calibration. Using the item calibrations, short forms can be assembled that can be used interchangeably.

11.
Liu X, Jin Z. Statistics in Medicine. 2007;26(23):4311-4327.
This paper presents a non-parametric approach for the selection of items in a scale for screening, with the score defined as the sum of item response indicators. Without specifying parametric models for binary classification probabilities, the proposed item selection method evaluates the change in classification accuracy due to adding or deleting one item for a scale with k items. It first removes the least useful items from the scale and then applies a forward stepwise selection procedure to the remaining items to identify a subset of items for a reduced scale. The reduced scale usually retains or improves classification accuracy compared to the full scale. The variation in items selected can be assessed with bootstrap samples. In a simulation study, the proposed procedure shows a fairly good finite sample performance. The method is illustrated with a data set on patients with and without high risk of developing Alzheimer's disease who were administered a 40-item test of olfactory function.

12.
Objective

Computer adaptive tests (CATs) offer a flexible, test-fair, and economical opportunity for accurate measurement of anxiety in patients with cardiovascular diseases (CVDs). The objective of this study was to develop and calibrate an item bank [anxiety item bank for cardiovascular patients (AIB-cardio)] as a prerequisite for an anxiety-CAT in CVD patients.

Study Design and Setting

After pretesting for relevance and comprehension, a pool of 155 anxiety items was answered on a five-point Likert scale. The sample consisted of 715 CVD patients, who were recruited in 14 German cardiac rehabilitation centers. A confirmatory factor analysis (CFA), Mokken analysis, and Rasch analysis were conducted.

Results

The results of CFA and Mokken analysis confirmed a one-factor structure and double monotonicity. In Rasch analysis, merging response categories and removing items with misfit, differential item functioning or local response dependency reduced the AIB-cardio to 37 items. The AIB-cardio fitted the Rasch model with a nonsignificant item–trait interaction (chi-square, 133.89; degrees of freedom, 111; P = 0.07). Person separation reliability was 0.85, and unidimensionality could be verified.

Conclusion

The calibrated, unidimensional AIB-cardio provides the basis for a CAT to assess anxiety in rehabilitation patients with CVD with good psychometric properties. Further testing in other cardiovascular patients is needed to increase generalizability.

13.
Although psychosocial aspects of skin diseases are well known, disease-specific questionnaires validated for use in clinical trials are not available to assess the impact of facial acne on health-related quality of life or to evaluate therapeutic change. Development of such an instrument was undertaken and included item generation, reduction and pilot-testing phases. By interviewing acne subjects and dermatologists and reviewing the literature, 168 possible items were identified. Next, 165 acne subjects identified which items affected them and rated importance on a 5-point scale. Reduction to a brief questionnaire was performed by evaluating patient-perceived importance and factor analysis; four domains were identified (self-perception, role-emotional, role-social, acne symptoms). After pilot-testing for comprehension in acne subjects, further revisions were made to improve clarity and applicability. The resulting instrument takes 10 minutes to complete, and consists of 24 questions assessing how acne affected certain aspects of patients' lives during the past week on a 7-point scale. Thus, an instrument with excellent content validity was developed to assess health-related quality of life in patients with facial acne; it comprises statistically meaningful items of importance to patients. Other measurement characteristics are being assessed in a recently initiated study to evaluate test-retest reliability and responsiveness to therapy.

14.
Background

Many clinical scales contain items that are scored separately prior to being compiled into a single score. However, if the items have different degrees of importance, they should be weighted differently before being compiled. The principal aims of this study were to show how the “analytic hierarchy process” (AHP), which has never been used for this purpose, can be applied to weighting the six items of the “London handicap scale”, and to compare the AHP to the “conjoint analysis” (CA), which was previously implemented by Harwood et al. (1994) [1].

Design

In order to assess the relative importance of the six items, we submitted AHP and CA to a group of 10 physiatrists. We compared the methods in terms of item ranking according to importance, assessment of fictitious patients based on weights determined by each method, and perceived difficulty by the physiatrist.

Results

For both techniques, “Physical independence” (PHY) was the best-weighted item, but other ranks varied depending on the technique. AHP was better than CA in terms of accuracy (global assessment of the clinical status) and perceived difficulty.

Conclusion

AHP may be used to reveal the importance that experts assign to the items of a multidimensional scale, and to calculate the appropriate weights for specific items. For this purpose, AHP seems to be more accurate than CA.
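How AHP turns pairwise importance judgments into item weights can be sketched as below. This is a generic illustration of the standard AHP calculation (principal eigenvector of a reciprocal comparison matrix, here found by power iteration), not the study's own analysis. The three-item matrix `COMPARISONS` is hypothetical and deliberately perfectly consistent; the actual scale has six items and its comparison data are not given in the abstract.

```python
def ahp_weights(matrix, iterations=100):
    """Return (weights, consistency_index) for a reciprocal comparison matrix.

    Weights are the normalized principal eigenvector, approximated by power
    iteration. The consistency index is (lambda_max - n) / (n - 1).
    """
    n = len(matrix)
    weights = [1.0 / n] * n
    for _ in range(iterations):
        product = [sum(matrix[i][j] * weights[j] for j in range(n))
                   for i in range(n)]
        total = sum(product)
        weights = [value / total for value in product]
    # Estimate the principal eigenvalue as the mean of (A w)_i / w_i.
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    return weights, consistency_index

# Hypothetical judgments: entry [i][j] says how much more important item i is
# than item j. The lower triangle holds the reciprocals.
COMPARISONS = [
    [1.0, 2.0, 6.0],
    [1.0 / 2.0, 1.0, 3.0],
    [1.0 / 6.0, 1.0 / 3.0, 1.0],
]

weights, ci = ahp_weights(COMPARISONS)
print([round(w, 3) for w in weights], round(ci, 6))
```

Because the example matrix is perfectly consistent, the consistency index is essentially zero; with real expert judgments it is typically positive, and larger values signal contradictory comparisons.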

15.
Longitudinal surveys measuring physical or mental health status are a common method to evaluate treatments. Multiple items are administered repeatedly to assess changes in the underlying health status of the patient. Traditional models to analyze the resulting data assume that the characteristics of at least some items are identical over measurement occasions. When this assumption is not met, this can result in ambiguous latent health status estimates. Changes in item characteristics over occasions are allowed in the proposed measurement model, which includes truncated and correlated random effects and a growth model for item parameters. In a joint estimation procedure adopting MCMC methods, both item and latent health status parameters are modeled as longitudinal random effects. Simulation study results show accurate parameter recovery. Data from a randomized clinical trial concerning the treatment of depression by increasing psychological acceptance showed significant item parameter shifts. For some items, the probability of responding in the middle category versus the highest or lowest category increased significantly over time. The resulting latent depression scores decreased more over time for the experimental group than for the control group and the amount of decrease was related to the increase in acceptance level. Copyright © 2012 John Wiley & Sons, Ltd.

16.
BACKGROUND: Many measurement instruments, particularly measures of hand functional ability, comprise a large number of items. Reduced versions of these instruments can facilitate their use. This work proposes a new method for shortening an instrument. METHODS: The proposed method was based on a scale of item difficulty calculated using the Rasch model. It was applied to a hand functional measure comprising 67 tests. The sample included 194 patients with hand lesions. The shortened instrument obtained was compared with those provided by classic methods used in the literature, with random item selection, and with shortened versions proposed by four independent experts (two rehabilitation physicians and two occupational therapists) who are clinicians familiar with the tool. All the statistical analyses were carried out on a random subgroup of two-thirds of the sample. A cross-validation was then carried out on the remaining third. RESULTS: The scores of the reduced instrument did not differ significantly from those of the original instrument. In addition, the intra-class correlation coefficient and the Cronbach alpha coefficient were high. Among the different degrees of reduction investigated, the 12-item version seemed to be appropriate. Our method appeared to provide better results in terms of discriminant validity and internal validity than the choices of the four experts. The reductions produced were also better than those obtained by classic methods based on principal component analysis and multiple linear regression, as well as those obtained by random choices of items. CONCLUSION: The method presented is pertinent and useful. The reduction obtained appeared to be better than the choices of experts and the reductions provided by classic methods. The method could be used in other fields.

17.

Purpose

In multiple sclerosis (MS), the use of preference-based measures is limited to generic measures such as Health Utilities Index Mark 2 and 3, the EQ-5D and the SF-6D. However, the challenge of using such generic preference-based measures in people with MS is that they may not capture all domains of health relevant to the disease. Therefore, the main aim of this paper is to describe the development of a health state classification system for MS patients. The specific objectives are: (1) to identify items best reflecting the domains of quality of life important to people with MS and (2) to provide evidence for the discriminative capacity of the response options by cross-walking onto a visual analog scale of health rating.

Methods

The data come from an epidemiologically sampled population of people with MS diagnosed post-1994. The dataset consisted of 206 items relating to impairments, activity limitations, participation restrictions, health perception and quality of life. Important domains were identified from the responses to the Patient Generated Index, an individualized measure of quality of life. The extent to which the items formed a uni-dimensional, linear construct was estimated using Rasch analysis, and the best item was selected using the threshold map.

Results

The sample was young (mean age 43) and predominantly female (n = 140/189; 74%). The P-PBMSI classification system consisted of five items, with three response levels per item, producing a total of 243 possible health states. Regression coefficient values consistently decreased between response levels and the linear tests for trend were statistically significant for all items. The linear test for trend indicated that for each item the response options provided the same discriminative ability within the magnitude of their capacity. A scoring algorithm was estimated using a simple additive formula. The classification system demonstrated convergent validity against other measures of similar constructs and known-groups validity between different clinical subgroups.

Conclusion

This study produced a health state classifier system based on items impacted upon by MS, and demonstrated the potential to discriminate the health impact of the disease.
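The count of 243 health states follows from five items with three levels each (3^5 = 243). The sketch below enumerates those states and applies a simple additive score. The item labels and the points per level (`LEVEL_POINTS`) are hypothetical placeholders, since the abstract does not report the estimated scoring coefficients.

```python
from itertools import product

# Hypothetical labels for the five items of the classification system.
ITEMS = ["item_1", "item_2", "item_3", "item_4", "item_5"]

# Hypothetical points per response level (level 1 = best, level 3 = worst).
LEVEL_POINTS = {1: 20, 2: 10, 3: 0}

# Every health state is one response level per item.
STATES = list(product((1, 2, 3), repeat=len(ITEMS)))

def additive_score(state):
    """Sum the points of each item's response level (simple additive formula)."""
    return sum(LEVEL_POINTS[level] for level in state)

print(len(STATES))
print(additive_score((1, 1, 1, 1, 1)), additive_score((3, 3, 3, 3, 3)))
```

With estimated coefficients, each item would normally carry its own set of level weights instead of the shared placeholder values used here.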

18.
Objectives

To investigate the validity of a common depression metric in independent samples.

Study Design and Setting

We applied a common metrics approach based on item-response theory for measuring depression to four German-speaking samples that completed the Patient Health Questionnaire (PHQ-9). We compared the PHQ item parameters reported for this common metric to reestimated item parameters that derived from fitting a generalized partial credit model solely to the PHQ-9 items. We calibrated the new model on the same scale as the common metric using two approaches (estimation with shifted prior and Stocking–Lord linking). By fitting a mixed-effects model and using Bland–Altman plots, we investigated the agreement between latent depression scores resulting from the different estimation models.

Results

We found different item parameters across samples and estimation methods. Although differences in latent depression scores between different estimation methods were statistically significant, these were clinically irrelevant.

Conclusion

Our findings provide evidence that it is possible to estimate latent depression scores by using the item parameters from a common metric instead of reestimating and linking a model. The use of common metric parameters is simple, for example, using a Web application (http://www.common-metrics.org) and offers a long-term perspective to improve the comparability of patient-reported outcome measures.

19.
Background: Item response theory (IRT) is becoming increasingly popular for item analysis. Theoretical considerations and simulation studies suggest that parameter estimates will become precise only by utilizing many items in large samples. Method: A simulation study focusing on a single scale was performed on data with (a) n = 40, 60, 80, 120, 200, 300, 500, and 900 cases utilizing (b) 4, 8, 16, or 32 items. The items were (c) symmetrically distributed vs. skewed (skewness 0, 1, and 2). Item loadings were (d) homogeneous vs. heterogeneous. Item loadings were (e) low vs. high. Half of the items had (f) a correlated error or not. The number of answering categories (g) was four vs. five. A total of 10% of each item had missing values. The ability estimates from the IRT model and the simple sum score served as criteria for evaluating the results. Results: The ability estimate from the IRT model outperformed the sum score when there were many items, skewed item distributions, and heterogeneous and high item loadings. The sum score outperformed the ability estimate when there were few items, nonskewed items, and homogeneous and low item loadings. However, convergence rates were partly low in small samples. Correlated errors negatively affected both the ability estimate and the sum score. Conclusion: With skewed item distributions and heterogeneous item loadings, utilizing an IRT model is recommended. However, with few items many cases are required and, conversely, with few cases many items. With few items and few cases, the sum score performs better.

20.
Ordinal response data are commonly observed in health and medical investigations that include several items. The primary goal in the modelling of item response data is to find a unique measurement of the person's abilities and of the item difficulties that satisfies the properties of the fundamental measurement. One such analytic method in item response theory is the Rasch measurement, which is a way to convert ordinal observations into linear measures. Current estimation strategies assume the independence of the Rasch model parameters. In this paper, based on the conditional maximum likelihood, we implemented a simultaneous estimation method that can compare the Rasch parameters more efficiently. We also obtained the asymptotic properties of these estimators and developed the conditional likelihood ratio test for the goodness-of-fit of the model. Simulation studies were used to demonstrate the improved performance of our estimators as compared to that of the currently used conditional method known as the CON procedure. We conclude that our estimation method outperforms CON in both model fit and the precision of the Rasch estimators.
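The idea behind conditional maximum likelihood can be sketched for the simplest case. This is a minimal illustration of the textbook conditional probability for the dichotomous Rasch model, not the simultaneous estimation method or the CON procedure discussed in the abstract. Given a person's total score r, the probability of a response pattern depends only on the item difficulties, through the elementary symmetric functions of exp(-b_i). The difficulties in `DIFFICULTIES` are hypothetical.

```python
import math
from itertools import product

def elementary_symmetric(eps):
    """Return [gamma_0, ..., gamma_n] for the values in eps (summation recurrence)."""
    gamma = [1.0] + [0.0] * len(eps)
    for k, e in enumerate(eps, start=1):
        # Update from high order to low so each value is used exactly once.
        for r in range(k, 0, -1):
            gamma[r] += e * gamma[r - 1]
    return gamma

def conditional_pattern_prob(pattern, difficulties):
    """P(pattern | total score) under the dichotomous Rasch model."""
    eps = [math.exp(-b) for b in difficulties]
    gamma = elementary_symmetric(eps)
    numerator = math.prod(e for e, x in zip(eps, pattern) if x == 1)
    return numerator / gamma[sum(pattern)]

# Hypothetical item difficulties, invented for illustration.
DIFFICULTIES = [-1.0, 0.0, 0.5, 1.5]

# All response patterns with total score 2, with their conditional probabilities.
patterns = [p for p in product((0, 1), repeat=len(DIFFICULTIES)) if sum(p) == 2]
for p in patterns:
    print(p, round(conditional_pattern_prob(p, DIFFICULTIES), 4))
```

The person parameter never appears in `conditional_pattern_prob`, which is why conditioning on the total score allows item difficulties to be estimated without estimating abilities at the same time.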
