Similar Documents
20 similar documents found (search time: 859 ms)
1.
OBJECTIVE: In health research, ordinal scales are extensively used. Reproducibility of ratings using these scales is important to assess their quality. This study aimed to compare two methods for analyzing reproducibility: the weighted Kappa statistic and log-linear models. STUDY DESIGN AND SETTING: Contributions of each method to the reproducibility assessment of ratings using ordinal scales were compared using intra- and interobserver data chosen in three different fields: the Crow's feet scale in dermatology, the dysplasia scale in oncology, and the updated Sydney scale in gastroenterology. RESULTS: Both methods provided an agreement level. In addition, log-linear models allowed evaluation of the structure of agreement. For the Crow's feet scale, both methods gave equivalently high agreement levels. For the dysplasia scale, log-linear models highlighted scale defects while the Kappa statistic showed moderate agreement. For the updated Sydney scale, log-linear models underlined null distinguishability between two adjacent categories, whereas the Kappa statistic gave a high global agreement level. CONCLUSION: Methods that can investigate both the level and structure of agreement between ordinal ratings are valuable tools, since they may highlight heterogeneities within the scale's structure and suggest modifications to improve reproducibility.
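As a concrete illustration of the weighted Kappa statistic discussed above, the sketch below computes a linearly weighted kappa for two raters' scores on a 5-point ordinal scale using scikit-learn. The ratings are hypothetical, not data from the study.

```python
# Illustrative sketch (hypothetical ratings, not study data): weighted kappa
# for two raters scoring the same cases on a 5-point ordinal scale.
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 2, 2, 3, 4, 4, 5, 1, 3, 2]
rater_b = [1, 2, 3, 3, 4, 5, 5, 1, 2, 2]

# Linear weights penalize disagreements in proportion to their
# distance on the ordinal scale.
kappa = cohen_kappa_score(rater_a, rater_b, weights="linear")
print(round(kappa, 3))
```

Note that weighted kappa condenses agreement into a single number; it cannot by itself reveal the structure of agreement, which is exactly where the log-linear approach in the abstract adds value.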

2.
This article uses log-linear models to describe pairwise agreement among several raters who classify a sample on a subjective categorical scale. The models describe agreement structure simultaneously for second-order marginal tables of a multidimensional cross-classification of ratings. Practical difficulties arise in fitting the models, because models refer to pairwise marginal tables of a very large and sparse table. A standard analysis that treats the marginal tables as independent yields consistent estimates of model parameters, but not of the covariance matrix of the estimates. We estimate the covariance matrix using the jackknife. We apply the models to describe agreement between evaluations made by seven pathologists of carcinoma in situ of the uterine cervix, using a five-level ordinal scale. Previous analyses showed differences among the pathologists in their pairwise levels of agreement, but we observe near homogeneity in the dependence structure of their ratings.

3.
New models that are useful in the assessment of rater agreement, particularly when the rating scale is ordered or partially ordered, are presented. The models are parameterized to address two important aspects of rater agreement: (i) agreement in terms of the overall frequency in which raters assign categories; and (ii) the extent to which raters agree on the category assigned to individual subjects or items. We present methodology for the simultaneous modelling of univariate marginal responses and bivariate marginal associations in the K-way contingency table representing the joint distribution of K rater responses. The univariate marginal responses provide information for evaluating agreement in terms of the overall frequency of responses, and the bivariate marginal associations provide information on category-wise agreement among pairs of raters. In addition, estimated scores within a generalized log non-linear model for bivariate associations facilitate the assessment of category distinguishability.

4.
Widespread inconsistencies are commonly observed between physicians' ordinal classifications in screening tests results such as mammography. These discrepancies have motivated large‐scale agreement studies where many raters contribute ratings. The primary goal of these studies is to identify factors related to physicians and patients' test results, which may lead to stronger consistency between raters' classifications. While ordered categorical scales are frequently used to classify screening test results, very few statistical approaches exist to model agreement between multiple raters. Here we develop a flexible and comprehensive approach to assess the influence of rater and subject characteristics on agreement between multiple raters' ordinal classifications in large‐scale agreement studies. Our approach is based upon the class of generalized linear mixed models. Novel summary model‐based measures are proposed to assess agreement between all, or a subgroup of raters, such as experienced physicians. Hypothesis tests are described to formally identify factors such as physicians' level of experience that play an important role in improving consistency of ratings between raters. We demonstrate how unique characteristics of individual raters can be assessed via conditional modes generated during the modeling process. Simulation studies are presented to demonstrate the performance of the proposed methods and summary measure of agreement. The methods are applied to a large‐scale mammography agreement study to investigate the effects of rater and patient characteristics on the strength of agreement between radiologists. Copyright © 2017 John Wiley & Sons, Ltd.

5.
We present a model‐based approach to the analysis of agreement between different raters in a situation where all raters have supplied ordinal ratings of the same cases in a sample. It is assumed that no “gold standard” is available. The model is an ordinal regression model with random effects—a so‐called rating scale model. The model includes case‐specific parameters that allow each case its own level (disease severity). It also allows raters to have different propensities to score a given set of individuals more or less positively—the rater level. Based on the model, we suggest quantifying the rater variation using the median odds ratio. This allows expressing the variation on the same scale as the observed ordinal data. An important example that serves to motivate and illustrate the proposed model is the study of breast cancer diagnosis based on screening mammograms. The purpose of the assessment is to detect early breast cancer in order to obtain improved cancer survival. In the study, mammograms from 148 women were evaluated by 110 expert radiologists. The experts were asked to rate each mammogram on a 5‐point scale ranging from “normal” to “probably malignant.”

6.
Many large‐scale studies have recently been carried out to assess the reliability of diagnostic procedures, such as mammography for the detection of breast cancer. The large numbers of raters and subjects involved raise new challenges in how to measure agreement in these types of studies. An important motivator of these studies is the identification of factors that contribute to the often wide discrepancies observed between raters' classifications, such as a rater's experience, in order to improve the reliability of the diagnostic process of interest. Incorporating covariate information into the agreement model is a key component in addressing these questions. Few agreement models are currently available that jointly model larger numbers of raters and subjects and incorporate covariate information. In this paper, we extend a recently developed population‐based model and measure of agreement for binary ratings to incorporate covariate information using the class of generalized linear mixed models with a probit link function. Important information on factors related to the subjects and raters can be included as fixed and/or random effects in the model. We demonstrate how agreement can be assessed between subgroups of the raters and/or subjects, for example, comparing agreement between experienced and less experienced raters. Simulation studies are carried out to test the performance of the proposed models and measures of agreement. Application to a large‐scale breast cancer study is presented. Copyright © 2010 John Wiley & Sons, Ltd.
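The probit random-effects setup described above can be sketched by simulation. Everything below is hypothetical (variance components, sample sizes, and the crude agreement summary are illustrative choices, not the paper's estimator): subjects and raters contribute random effects on a latent probit scale, and agreement is summarized as the average proportion of rater pairs that classify each subject identically.

```python
import numpy as np

rng = np.random.default_rng(42)
n_subj, n_raters = 200, 10

subj = rng.normal(0.0, 1.5, n_subj)      # subject effects (latent severity)
rater = rng.normal(0.0, 0.5, n_raters)   # rater effects (threshold shifts)

# Probit-style latent score; rater j calls subject i positive if score > 0.
latent = subj[:, None] + rater[None, :] + rng.standard_normal((n_subj, n_raters))
y = (latent > 0).astype(int)

# Per-subject proportion of concordant rater pairs, averaged over subjects.
k = y.sum(axis=1)                        # positive calls per subject
n = n_raters
pair_agree = (k * (k - 1) + (n - k) * (n - k - 1)) / (n * (n - 1))
print(round(pair_agree.mean(), 3))
```

Increasing the subject variance relative to the rater and residual variances pushes this summary toward 1, mirroring the intuition that agreement is high when subjects are easy to separate on the latent scale.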

7.
Two examples demonstrate how one can use association models to analyse agreement data. The first example concerns intra-rater variability in the classification of sputum cytology slides, and the second deals with variability associated with the reporting of passive smoking histories. The paper emphasizes models in which one estimates category scores from the data, that is models that are not in the log-linear family of models. Such models have use in assessment of category distinguishability and provide insights not easily obtained with log-linear models.

8.
Annals of Epidemiology, 2017, 27(10):677-685.e4
Purpose: Interpretation of screening tests such as mammograms usually requires a radiologist's subjective visual assessment of images, often resulting in substantial discrepancies between radiologists' classifications of subjects' test results. In clinical screening studies to assess the strength of agreement between experts, multiple raters are often recruited to assess subjects' test results using an ordinal classification scale. However, using traditional measures of agreement in some studies is challenging because of the presence of many raters, the use of an ordinal classification scale, and unbalanced data. Methods: We assess and compare the performances of existing measures of agreement and association, as well as a newly developed model-based measure of agreement, on three large-scale clinical screening studies involving many raters' ordinal classifications. We also conduct a simulation study to demonstrate the key properties of the summary measures. Results: The assessment of agreement and association varied according to the choice of summary measure. Some measures were influenced by the underlying prevalence of disease and raters' marginal distributions, and/or were limited to balanced data sets in which every rater classifies every subject. Our simulation study indicated that popular measures of agreement and association are sensitive to the underlying disease prevalence. Conclusions: Model-based measures provide a flexible approach for calculating agreement and association and are robust to missing and unbalanced data as well as the underlying disease prevalence.

9.
It is valuable in many studies to assess both intrarater and interrater agreement. Most measures of intrarater agreement do not adjust for unequal estimates of prevalence between the separate rating occasions for a given rater, and measures of interrater agreement typically ignore data from the second set of assessments when raters make duplicate assessments. When both measures are assessed, there are instances where interrater agreement is larger than at least one of the corresponding intrarater agreements. This implies that a rater agrees less with him/herself and more with another rater. In the situation of multiple raters making duplicate assessments on all subjects, the authors propose properties for an agreement measure based on the odds ratio for a dichotomous trait: (i) estimate a single prevalence across two reading occasions for each rater; (ii) estimate pairwise interrater agreement from all available data; (iii) bound the pairwise interrater agreement above by the corresponding intrarater agreements. Estimation of odds ratios under these properties is done by maximizing the multinomial likelihood with constraints, using generalized log-linear models in combination with a generalization of the Lemke-Dykstra iterative-incremental algorithm. An example from a mammography examination reliability study is used to demonstrate the new method.

10.
Screening and diagnostic procedures often require a physician's subjective interpretation of a patient's test result using an ordered categorical scale to define the patient's disease severity. Because of wide variability observed between physicians' ratings, many large‐scale studies have been conducted to quantify agreement between multiple experts' ordinal classifications in common diagnostic procedures such as mammography. However, very few statistical approaches are available to assess agreement in these large‐scale settings. Many existing summary measures of agreement rely on extensions of Cohen's kappa. These are prone to prevalence and marginal distribution issues, become increasingly complex for more than three experts, or are not easily implemented. Here we propose a model‐based approach to assess agreement in large‐scale studies based upon a framework of ordinal generalized linear mixed models. A summary measure of agreement is proposed for multiple experts assessing the same sample of patients' test results according to an ordered categorical scale. This measure avoids some of the key flaws associated with Cohen's kappa and its extensions. Simulation studies are conducted to demonstrate the validity of the approach with comparison with commonly used agreement measures. The proposed methods are easily implemented using the software package R and are applied to two large‐scale cancer agreement studies. Copyright © 2015 John Wiley & Sons, Ltd.

11.
Scoring systems are used in nearly all fields of medicine for evaluation of the state of a disease. The prediction performance of scoring systems with respect to an ordinal outcome scale is investigated, based on grouped continuous logistic models as well as on an extension of the stereotype logistic regression model. The latter is a canonical approach, which allows assessment of properties of outcome categories such as partial and total ordering, distinguishability and allocatability. The approach is applied to a data set of patients with injuries of the head.

12.
The purpose of this study was to evaluate the inter-rater reliability of hand diagrams, which are commonly used in research case definitions of carpal tunnel syndrome (CTS). To evaluate the potential of non-random misclassification of cases, we also studied predictors of rater disagreement as a function of personal and work factors, and of hand symptoms not classic for CTS. Participants in a longitudinal study investigating the development of CTS completed repeated self-administered questionnaires. Three experienced clinicians, blind to subjects' work or personal history, independently rated all hand diagrams on an ordinal scale from 0 to 3. Disagreements between ratings were resolved by consensus. Reliability was measured by the weighted kappa statistic. Logistic regression models evaluated predictors of disagreement. Three hundred and thirty-three subjects completed 494 hand diagrams. Eighty-five percent were completed by self-administered questionnaire and 15% by telephone interview. Weighted kappa values representing agreement among the three raters, were 0.83 (95% CI: 0.78, 0.87) for right hand diagrams and 0.88 (95% CI: 0.83, 0.91) for left hand diagrams. Ratings from hand diagrams obtained by telephone interview produced better agreement. Agreement among raters was not affected by subjects' personal or work factors. Disagreement among raters was associated with the presence of hand/wrist symptoms other than classic CTS symptoms. Overall, high levels of agreement were attained by independent raters of hand diagrams. Personal factors did not affect agreement among raters, but presence of non-CTS symptoms seemed to affect results and should be considered in studies focused on diverse populations with heterogeneity of upper extremity symptoms.

13.
It is common practice to assess consistency of diagnostic ratings in terms of ‘agreement beyond chance’. To explore the interpretation of such a term we consider relevant statistical techniques such as Cohen's kappa and log-linear models for agreement on nominal ratings. We relate these approaches to a special latent class concept that decomposes observed ratings into a class of systematically consistent and a class of fortuitous ratings. This decomposition provides a common framework in which the specific premises of Cohen's kappa and of log-linear models can be identified and put into perspective. As a result it is shown that Cohen's kappa may be an inadequate and biased index of chance-corrected agreement in studies of intra-observer as well as inter-observer consistency. We suggest a more critical use and interpretation of measures gauging observer reliability by the amount of agreement beyond chance. © 1998 John Wiley & Sons, Ltd.

14.
OBJECTIVE: Although rapid epidemiologic investigations of toxic exposures require estimates of individual exposure levels, objective measures of exposure are often unavailable. We investigated whether self-reported exposure histories, when reviewed and classified by a panel of raters, provided a useful exposure metric. METHODS: A panel reviewed exposure histories as reported by people who experienced a chlorine release. The panelists received no information about health-care requirements or specific health effects. To each exposure case, each panelist assigned one of five possible exposure severity ratings. When assigned ratings were not in initial agreement, the panelists discussed the case and assigned a consensus rating. Percent agreement and kappa statistics assessed agreement among panelists, Kendall's W measured agreement among panelists in their overall ordering of the exposure histories, and Spearman's rho compared the resultant rankings with individual health outcome. RESULTS: In 48% of the cases, the panelists' initial ratings agreed completely. Overall, initial ratings for a given case matched the consensus rating 69% to 89% of the time. Pair-wise comparisons revealed 85% to 95% agreement among panelists, with weighted kappa statistics between 0.69 and 0.83. In their overall ranking of the exposure histories, the panelists reached significant agreement (W = 0.90, p < 0.0001). Disagreement arose most frequently regarding probable chlorine concentration and duration of exposure. This disagreement was most common when panelists differentiated between adjacent categories of intermediate exposure. Panel-assigned exposure ratings significantly correlated with health outcome (Spearman's rho = 0.56; p < 0.0001). CONCLUSION: Epidemiologists and public health practitioners can elicit and review self-reported exposure histories and assign exposure severity ratings that predict medical outcome. When objective markers of exposure are unavailable, panel-assigned exposure ratings may be useful for rapid epidemiologic investigations.
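The Kendall's W statistic used above to measure agreement in the panelists' overall ordering can be sketched directly from its definition. The ratings matrix below is hypothetical (three panelists, five exposure cases), and the formula omits the tie correction for brevity.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical severity ratings: rows = panelists, columns = exposure cases.
ratings = np.array([
    [1, 3, 2, 5, 4],
    [1, 2, 3, 5, 4],
    [2, 3, 1, 5, 4],
])
m, n = ratings.shape

ranks = np.vstack([rankdata(r) for r in ratings])  # rank within each panelist
col_sums = ranks.sum(axis=0)
s = ((col_sums - col_sums.mean()) ** 2).sum()
w = 12.0 * s / (m**2 * (n**3 - n))  # Kendall's W, no tie correction
print(round(w, 3))
```

W ranges from 0 (no agreement in ordering) to 1 (identical rankings from every panelist), which is why the study's W = 0.90 indicates very strong concordance in the overall ordering.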

15.
Rating scales are common for self‐assessments of qualitative variables and also for expert‐rating of the severity of disability, outcomes, etc. Scale assessments and other ordered classifications generate ordinal data having rank‐invariant properties only. Hence, statistical methods are often based on ranks. The aim is to focus on the differences in ranking approaches between measures of association and of disagreement in paired ordinal data. The Spearman correlation coefficient is a measure of association between two variables, when each data set is transformed to ranks. The augmented ranking approach to evaluate disagreement takes account of the information given by the pairs of data, and provides identification and measures of systematic disagreement, when present, separately from measures of additional individual variability in assessments. The two approaches were applied to empirical data regarding the relationship between perceived pain and physical health and reliability in pain assessments made by patients. The pattern of disagreement between the patients' perceived levels of outcome after treatment and the doctor's criterion‐based scoring was also evaluated. The comprehensive evaluation of observed disagreement in terms of systematic and individual disagreement provides valuable, interpretable information about their sources. The presence of systematic disagreement can be adjusted for and/or understood. Large individual variability could be a sign of poor quality of a scale or heterogeneity among raters. It was also demonstrated that a measure of association must not be used as a measure of agreement, even though such misuse of correlation coefficients is common. Copyright © 2012 John Wiley & Sons, Ltd.
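The closing point above — that a measure of association must not be used as a measure of agreement — can be demonstrated in a few lines. In this hypothetical example, rater B systematically scores one category higher than rater A: the Spearman correlation is perfect, yet a weighted kappa shows agreement is far from perfect.

```python
# Association vs. agreement (hypothetical ratings): a systematic one-category
# shift preserves perfect rank correlation but degrades agreement.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 2, 3, 4, 5]
rater_b = [2, 3, 4, 5, 6]  # B is always exactly one category higher

rho, _ = spearmanr(rater_a, rater_b)                    # association: perfect
kappa = cohen_kappa_score(rater_a, rater_b, weights="linear")  # agreement: not
print(rho, round(kappa, 3))
```

This is precisely the systematic disagreement the augmented ranking approach is designed to detect and separate from random individual variability.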

16.
A method is proposed for classification to ordinal categories by applying the search partition analysis (SPAN) approach. It is suggested that SPAN be repeatedly applied to binary outcomes formed by collapsing adjacent categories of the ordinal scale. By a simple device, whereby successive binary partitions are constrained to be nested, a partition for classification to the ordinal states is obtained. The approach is applied to ordinal categories of glucose tolerance to discriminate between diabetes, impaired glucose tolerance and normal states. The results are compared with analysis by ordinal logistic regression and by classification trees.

17.
The advantage of modelling agreement on a categorical scale among observers rather than using summarizing indices is now well established. However, analysis of agreement among more than two observers is essentially based on pairs of observers. We present a global and partial agreement modelling approach derived from the quasi-independence and quasi-symmetry log-linear models. This approach addresses high order interactions in the contingency table rather than two-way interaction in the pairwise agreement approach. Pairwise, global and partial agreement models were applied to the detection by six pathologists of three lesions in biopsy specimens arising from patients suspected to be affected by Crohn's disease. The global and partial agreement approach surpasses the pairwise agreement approach, especially if there is heterogeneity among ratings. © 1998 John Wiley & Sons, Ltd.

18.
OBJECTIVE: To evaluate test-retest reliability of social network-related information in the "Pró-Saúde" (Pro-Health) study. METHODS: A test-retest reliability study was conducted using a multidimensional questionnaire applied to a cohort of university employees. The same questionnaire was filled out twice by 192 non-permanent employees, two weeks apart. Agreement was estimated using kappa statistics (categorical variables), weighted kappa statistics, log-linear models (ordinal variables), and the intraclass correlation coefficient (discrete variables). RESULTS: Estimates of reliability were higher than 0.70 for most variables. Stratified analyses revealed no consistently varying patterns of reliability according to gender, age or schooling strata. Log-linear modelling showed that, for the study's ordinal variables, the model of best fit was "diagonal agreement plus linear by linear association". CONCLUSIONS: The high level of reliability estimated in this study suggests that the process of measurement of social network-related aspects was adequate. Validation studies, which are currently being conducted, will complete the quality assessment of this information.

19.
This article describes the test-retest reliability of a scale comprising five dimensions of social support: material, emotional, informational, affective, and positive social interaction. In the study, a sample of 192 employees at a university in Rio de Janeiro filled out the same questionnaire on two occasions, 15 days apart. Measures of stability used were the intraclass correlation coefficient (ICC), weighted kappa statistic, and log-linear models. Internal consistency was evaluated using Cronbach's alpha coefficient. Social support dimensions showed internal consistency ranging from 0.75 to 0.91 at test, and 0.86 to 0.93 at retest. The ICC ranged from 0.78 to 0.87 in the five dimensions of the scale, with no substantial differences by gender, age, or level of schooling. For most questions, the "agreement plus linear by linear association" and "quasi-association" log-linear models gave the best fit. According to these results, the reliability of the instrument was considered adequate, enabling it to be used in ongoing assessment of associations between social support and health-related outcomes in a cohort study (the Pro-Health Study) recently begun in Rio de Janeiro.

20.
The concordance correlation coefficient (CCC), a measure of concordance in ratings from multiple raters, was used to study inter-rater agreement in measurements of time to event, generally not observed with perfect consistency among raters. As a function of the first two moments of rating measures, the CCC can be estimated with data subject to censoring, using a likelihood-based estimation method applied under the assumptions of random censoring and parametric distribution models for the ratings of time to event. A simulation study was conducted for small sample performance under various censoring proportions. The use of the CCC with censored data is illustrated with an example taken from a data set containing data on time to an event with two raters per subject.
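Since the CCC is a function of the first two moments, it can be sketched directly from its moment definition for the uncensored case (the likelihood machinery for censored data in the abstract is beyond a short example). The times-to-event below are hypothetical.

```python
import numpy as np

def ccc(x, y):
    """Lin's concordance correlation coefficient from first two moments."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()                  # population (biased) variances
    cov = ((x - mx) * (y - my)).mean()
    # Penalizes both imperfect correlation and location/scale shifts.
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)

# Hypothetical uncensored times-to-event from two raters (same subjects).
t1 = [5.1, 7.4, 3.2, 9.0, 6.5]
t2 = [5.5, 7.1, 3.0, 9.4, 6.0]
print(round(ccc(t1, t2), 3))
```

Unlike the Pearson correlation, the CCC drops below 1 whenever the two raters differ systematically in mean or scale, even if their measurements are perfectly correlated.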
