Similar Articles
20 similar articles found
1.
CONTEXT: Factors that interfere with the ability to interpret assessment scores or ratings in the proposed manner threaten validity. To be interpreted in a meaningful manner, all assessments in medical education require sound, scientific evidence of validity. PURPOSE: The purpose of this essay is to discuss 2 major threats to validity: construct under-representation (CU) and construct-irrelevant variance (CIV). Examples of each type of threat for written, performance and clinical performance examinations are provided. DISCUSSION: The CU threat to validity refers to undersampling the content domain. Using too few items, cases or clinical performance observations to adequately generalise to the domain represents CU. Variables that systematically (rather than randomly) interfere with the ability to meaningfully interpret scores or ratings represent CIV. Issues such as flawed test items written at inappropriate reading levels or statistically biased questions represent CIV in written tests. For performance examinations, such as standardised patient examinations, flawed cases or cases that are too difficult for student ability contribute CIV to the assessment. For clinical performance data, systematic rater error, such as halo or central tendency error, represents CIV. The term face validity is rejected as representative of any type of legitimate validity evidence, although it is acknowledged that the appearance of the assessment may be an important characteristic other than validity. CONCLUSIONS: There are multiple threats to validity in all types of assessment in medical education. Methods to eliminate or control validity threats are suggested.

2.
AIM: Because it deals with qualitative information, portfolio assessment inevitably involves some degree of subjectivity. The use of stricter assessment criteria or more structured and prescribed content would improve interrater reliability, but would obliterate the essence of portfolio assessment in terms of flexibility, personal orientation and authenticity. We resolved this dilemma by using qualitative research criteria, rather than reliability, in the evaluation of portfolio assessment. METHODOLOGY/RESEARCH DESIGN: Five qualitative research strategies were used to achieve credibility and dependability of assessment: triangulation, prolonged engagement, member checking, audit trail and dependability audit. Mentors read portfolios at least twice during the year, providing feedback and guidance (prolonged engagement). Their recommendation for the end-of-year grade was discussed with the student (member checking) and submitted to a member of the portfolio committee. Information from different sources was combined (triangulation). Portfolios causing persistent disagreement were submitted to the full portfolio assessment committee. Quality assurance procedures with external auditors were used (dependability audit) and the assessment process was thoroughly documented (audit trail). RESULTS: A total of 233 portfolios were assessed. Students and mentors disagreed on 7 portfolios (3%), and 9 portfolios were submitted to the full committee. The final decision on 29 portfolios (12%) differed from the mentor's recommendation. CONCLUSION: We think we have devised an assessment procedure that safeguards the characteristics of portfolio assessment, with credibility and dependability built into the judgement procedure. Further support for credibility and dependability might be sought by means of a study involving different assessment committees.

3.
INTRODUCTION: Structured assessment, embedded in a training programme, with systematic observation, feedback and appropriate documentation may improve the reliability of clinical assessment. This type of assessment format is referred to as in-training assessment (ITA). The feasibility and reliability of an ITA programme in an internal medicine clerkship were evaluated. The programme comprised 4 ward-based test formats and 1 outpatient clinic-based test format. Of the 4 ward-based test formats, 3 were single-sample tests, consisting of 1 student-patient encounter, 1 critical appraisal session and 1 case presentation. The other ward-based test and the outpatient-based test were multiple-sample tests, consisting of 12 ward-based case write-ups and 4 long cases in the outpatient clinic. In all, the ITA programme consisted of 19 assessments. METHODS: Over 41 months, data were collected from 119 clerks. Feasibility was defined as over two thirds of the students obtaining 19 assessments. Reliability was estimated by performing generalisability analyses, once with the 19 assessments as items and once with the 5 test formats as items. RESULTS: A total of 73 students (69%) completed 19 assessments. Reliability, expressed as a generalisability coefficient, was 0.81 for 19 assessments and 0.55 for 5 test formats. CONCLUSIONS: The ITA programme proved to be feasible. Feasibility may be improved by scheduling protected time for assessment for both students and staff. Reliability may be improved by more frequent use of some of the test formats.

4.
INTRODUCTION: This study describes the development of an instrument to measure the ability of medical students to reflect on their performance in medical practice. METHODS: A total of 195 Year 4 medical students attending a 9-hour clinical ethics course filled in a semi-structured questionnaire consisting of reflection-evoking case vignettes. Two independent raters scored their answers. Respondents were scored on a 10-point scale for overall reflection and on a scale of 0-2 for the extent to which they mentioned a series of perspectives in their reflections. We analysed the distribution of scores, the internal validity and the effect on scores of being pre-tested with an alternate form of the test. The relationships between overall reflection score and perspective score, and between overall reflection score and gender, career preference and work experience, were also calculated. RESULTS: The interrater reliability was sufficient. The range of scores on overall reflection was large (1-10), with a mean reflection score of 4.5-4.7 for each case vignette. This means that only 1 or 2 perspectives were mentioned, and hardly any weighing of perspectives took place. The values over the 2 measurements were comparable and strongly related. Women had slightly higher scores than men, as did students with work experience in health care and students considering general practice as a career. CONCLUSIONS: Reflection in medical practice can be measured using this semi-structured questionnaire built on case vignettes. The mean score allows for the measurement of improvement by future educational efforts. The wide range of individual differences allows for comparisons between groups. The differences found between groups of students were as expected and support the validity of the instrument.

5.
OBJECTIVES: To evaluate the development, validity and reliability of a multimodality objective structured clinical examination (OSCE) in undergraduate psychiatry, integrating interactive face-to-face and telephone history-taking and communication skills stations, videotaped mental state examinations and problem-oriented written stations. METHODS: The development of the OSCE on a restricted budget is described. This study evaluates the validity and reliability of 4 OSCEs of 15-18 stations each, taken by 128 students over 1 year. Face and content validity were assessed by a panel of clinicians and from feedback from OSCE participants. Correlations with consultant clinical 'firm grades' were performed. Interrater reliability and internal consistency (interstation reliability) were assessed using generalisability theory. RESULTS: The OSCE was feasible to conduct and had a high level of perceived face and content validity. Consultant firm grades correlated moderately with scores on interactive stations and poorly with written and video stations. Overall reliability was moderate to good, with G coefficients in the range 0.55-0.68 for the 4 OSCEs. CONCLUSIONS: Integrating a range of modalities into an OSCE in psychiatry appears to represent a feasible, generally valid and reliable method of examination on a restricted budget. Different types of stations appear to have different advantages and disadvantages, supporting the integration of both interactive and written components into the OSCE format.

6.
Downing SM. Medical Education 2004;38(9):1006-1012
CONTEXT: All assessment data, like other scientific experimental data, must be reproducible in order to be meaningfully interpreted. PURPOSE: The purpose of this paper is to discuss applications of reliability to the most common assessment methods in medical education. Typical methods of estimating reliability are discussed intuitively and non-mathematically. SUMMARY: Reliability refers to the consistency of assessment outcomes. The exact type of consistency of greatest interest depends on the type of assessment, its purpose and the consequential use of the data. Written tests of cognitive achievement look to internal test consistency, using estimation methods derived from the test-retest design. Rater-based assessment data, such as ratings of clinical performance on the wards, require interrater consistency or agreement. Objective structured clinical examinations, simulated patient examinations and other performance-type assessments generally require generalisability theory analysis to account for various sources of measurement error in complex designs and to estimate the consistency of the generalisations to a universe or domain of skills. CONCLUSIONS: Reliability is a major source of validity evidence for assessments. Low reliability indicates that large variations in scores can be expected upon retesting. Inconsistent assessment scores are difficult or impossible to interpret meaningfully and thus reduce validity evidence. Reliability coefficients allow the quantification and estimation of the random errors of measurement in assessments, such that overall assessment can be improved.
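For written tests, the internal-consistency reliability this abstract refers to (e.g. Cronbach's alpha) is computed directly from an examinee-by-item score matrix. A minimal sketch, using simulated data rather than any real examination:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (examinees x items) score matrix.

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))
    """
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical data: 200 examinees x 40 dichotomously scored items,
# generated from a simple ability/difficulty model.
rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
difficulty = rng.normal(size=(1, 40))
scores = (ability - difficulty + rng.normal(size=(200, 40))) > 0
print(f"alpha = {cronbach_alpha(scores.astype(float)):.2f}")
```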

7.
CONTEXT: Reliability is defined as the extent to which a result reflects all possible measurements of the same construct. It is an essential measurement characteristic. Unfortunately, there are few objective tests for the most important aspects of the professional role because they are complex and intangible. In addition, professional performance varies markedly from setting to setting and case to case. Both these factors threaten reliability. AIM: This paper describes the classical approach to evaluating reliability and points out the limitations of this approach. It goes on to describe how generalisability theory solves many of these limitations. CONDITIONS: A G-study uses variance component analysis to measure the contributions that all relevant factors make to the result (observer, situation, case, assessee and their interactions). This information can be combined to reflect the reliability of a single observation as a reflection of all possible measurements - a true reflection of reliability. It can also be used to estimate the reliability of a combined sample of several different observations, or to predict how many observations are required with different test formats to achieve a given level of reliability. Worked examples are used to illustrate the concepts.
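The key move described here - recombining estimated variance components to predict reliability for any number of observations (a "decision study") - is simple arithmetic once the G-study is done. A sketch with invented variance components for a persons-by-cases design:

```python
# Illustrative variance components from a hypothetical person x case
# G-study (values are made up for demonstration).
var_person = 0.30        # true differences between assessees
var_case = 0.50          # case difficulty differences
var_interaction = 1.20   # person x case interaction + residual error

def g_coefficient(n_cases: int) -> float:
    """Relative (norm-referenced) G coefficient for a mean over n_cases.

    Only error that changes the rank-ordering of persons (the person x
    case interaction) counts as relative error; it shrinks with more cases.
    """
    rel_error = var_interaction / n_cases
    return var_person / (var_person + rel_error)

# Decision study: how many cases are needed for G >= 0.80?
for n in (1, 4, 8, 16, 32):
    print(f"{n:2d} cases: G = {g_coefficient(n):.2f}")
```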

8.
PURPOSE: To examine the validity of a written knowledge test of skills for predicting performance on an OSCE in postgraduate training for general practice. METHODS: A randomly selected sample of 47 trainees in general practice took a knowledge test of skills, a general knowledge test and an OSCE. The OSCE included technical stations and stations comprising complete patient encounters. Each station was rated with both a checklist and a global rating. RESULTS: The knowledge test of skills correlated better with the OSCE than the general knowledge test did. Technical stations correlated better with the knowledge test of skills than stations comprising complete patient encounters. For the technical stations the rating system had no influence on the correlation. For the stations comprising complete patient encounters, the checklist rating correlated better with the knowledge test of skills than the global rating. CONCLUSION: The results of this study support the predictive validity of the knowledge test of skills. In postgraduate training for general practice, a written knowledge test of skills can be used as an instrument to estimate the level of clinical skills, especially for group evaluation, such as in studies examining the efficacy of a training programme, or as a screening instrument for deciding about courses to be offered. This estimation is more accurate when the content of the test matches the skills under study. However, written testing of skills cannot replace direct observation of the performance of skills.

9.
The long case     
BACKGROUND: The long case has gradually been replaced by the objective structured clinical examination (OSCE) as a summative assessment of clinical skills. Its demise occurred against a background of scant psychometric research. This article reviews the current status of the long case, appraising its strengths and weaknesses as an assessment tool. ISSUES: There is a conflict between validity and reliability. The long case assesses an integrated clinical interaction between doctor and real patient and has high face validity. Intercase reliability is the prime problem. As most examinations traditionally used a single case only, problems of content specificity and standardisation were not addressed. DISCUSSION: Recent research suggests that testing across more cases does improve reliability. Better structuring of tests and direct observation increase validity. Substituting standardised cases for real patients may be of little benefit compared to increasing the sample of cases. CONCLUSIONS: Observed long cases can be useful for assessment, depending on the sample size of cases and examiners. More research is needed into the exact nature of intercase and interexaminer variance and into consequential validity. Feasibility remains a key problem. More exploration of combined assessments using real patients with OSCEs is suggested.

10.
CONTEXT: The College of Medicine and Medical Sciences at the Arabian Gulf University, Bahrain, replaced the traditional long case/short case clinical examination in the final MD examination with a direct observation clinical encounter examination (DOCEE). Each student encountered four real patients. Two pairs of examiners from different disciplines observed the students taking histories and conducting physical examinations, and jointly assessed their clinical competence. OBJECTIVES: To determine the reliability and validity of the DOCEE by investigating whether examiners agree when scoring, ranking and classifying students; to determine the number of cases and examiners necessary to produce a reliable examination; and to establish whether the examination has content and concurrent validity. SUBJECTS: Fifty-six final year medical students and 22 examiners (in pairs) participated in the DOCEE in 2001. METHODS: Generalisability theory, intraclass correlation, Pearson correlation and kappa were used to study reliability and agreement between the examiners. Case content and Pearson correlations between the DOCEE and other examination components were used to study validity. RESULTS: Cronbach's alpha for the DOCEE was 0.85. The intraclass and Pearson correlations of scores given by specialists and non-specialists ranged from 0.82 to 0.93. Kappa scores ranged from 0.56 to 1.00. The overall intraclass correlation of students' scores was 0.86. The generalisability coefficient with four cases and two raters was 0.84. Decision studies showed that increasing the number of cases from one to four improved reliability to above 0.8, whereas increasing the number of raters had little impact on reliability. The use of a pre-examination blueprint for selecting the cases improved the content validity. The disattenuated Pearson correlations between the DOCEE and other performance measures, as a measure of concurrent validity, ranged from 0.67 to 0.79. CONCLUSIONS: The DOCEE was shown to have good reliability and interrater agreement between two independent specialist and non-specialist examiners on the scoring, ranking and pass/fail classification of student performance. It has adequate content and concurrent validity and provides unique information about students' clinical competence.
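The "disattenuated" correlations reported as concurrent validity evidence use the classical correction for attenuation, which scales an observed correlation up by the reliabilities of the two measures. A small sketch with hypothetical numbers (only the DOCEE alpha of 0.85 comes from the abstract; the rest are invented):

```python
import math

def disattenuate(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Correct an observed correlation for unreliability in both
    measures (classical correction for attenuation)."""
    return r_xy / math.sqrt(rel_x * rel_y)

# Hypothetical: observed r = 0.60 between the DOCEE and a written exam,
# with reliabilities 0.85 (DOCEE, as reported) and 0.75 (assumed).
print(f"disattenuated r = {disattenuate(0.60, 0.85, 0.75):.2f}")  # ~0.75
```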

11.
PURPOSE: This investigation aimed to explore the measurement properties of scores from a patient simulator exercise. METHODS: Analytic and holistic scores were obtained for groups of medical students and residents. Item analysis techniques were used to explore the nature of specific examinee actions. Interrater reliability was calculated. Scores were contrasted for third year medical students, fourth year medical students and emergency department residents. RESULTS: Interrater reliabilities for analytic and holistic scores were 0.92 and 0.81, respectively. Based on item analysis, proper timing and sequencing of actions discriminated between low- and high-ability examinees. In general, examinees with more advanced training obtained higher scores on the simulation exercise. CONCLUSION: Reliable and valid measures of clinical performance can be obtained from a trauma simulation provided that care is taken in the development and scoring of the scenario.
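The item analysis described - checking which specific actions separate low- from high-ability examinees - is typically quantified with a corrected point-biserial discrimination index. A sketch with fabricated checklist data:

```python
import numpy as np

def point_biserial(item: np.ndarray, total: np.ndarray) -> float:
    """Point-biserial discrimination: correlation between a 0/1 item
    (action performed correctly or not) and examinees' total scores,
    with the item removed from the total to avoid inflating the value."""
    rest = total - item
    return float(np.corrcoef(item, rest)[0, 1])

# Hypothetical data: 1 = action performed in the correct sequence.
item = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0], dtype=float)
total = np.array([18, 16, 9, 15, 8, 7, 17, 10, 14, 6], dtype=float)
print(f"discrimination = {point_biserial(item, total):.2f}")
```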

12.
PURPOSE: Earlier studies of absolute standard setting procedures for objective structured clinical examinations (OSCEs) show inconsistent results. This study compared a rational and an empirical standard setting procedure. Reliability and credibility were examined first. The impact of a reality check was then established. METHODS: The OSCE included 16 stations and was taken by trainees in their final year of postgraduate training in general practice and experienced general practitioners. A modified Angoff (independent judgements, no group discussion) with and without a reality check was used as a rational procedure. A method related to the borderline group procedure, the borderline regression (BR) method, was used as an empirical procedure. Reliability was assessed using generalisability theory. Credibility was assessed by comparing pass rates and by relating the passing scores to test difficulty. RESULTS: The passing scores were 73.4% for the Angoff procedure without reality check (Angoff I), 66.0% for the Angoff procedure with reality check (Angoff II) and 57.6% for the BR method. The reliabilities (expressed as root mean square errors) were 2.1% for Angoffs I and II, and 0.6% for the BR method. The pass rates of the trainees and GPs were 19% and 9% for Angoff I, 66% and 46% for Angoff II, and 95% and 80% for the BR method, respectively. The correlation between test difficulty and passing score was 0.69 for Angoff I, 0.88 for Angoff II and 0.86 for the BR method. CONCLUSION: The BR method provides a more credible and reliable standard for an OSCE than a modified Angoff procedure. A reality check improves the credibility of the Angoff procedure but does not improve its reliability.
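For readers unfamiliar with the borderline regression (BR) method: checklist scores are regressed on the examiners' global grades, and the predicted checklist score at the borderline grade becomes the station standard. A minimal sketch with invented station data (the grade scale and all values are assumptions, not taken from this study):

```python
import numpy as np

def borderline_regression(checklist: np.ndarray, grades: np.ndarray,
                          borderline: float = 2.0) -> float:
    """Borderline regression standard for one OSCE station.

    Fit checklist% = a + b * global_grade by least squares and return
    the predicted checklist score at the borderline grade (here grade 2
    on a hypothetical 1 = fail .. 5 = excellent scale).
    """
    slope, intercept = np.polyfit(grades, checklist, deg=1)
    return intercept + slope * borderline

# Hypothetical station data for 10 candidates.
checklist = np.array([45, 52, 58, 60, 66, 70, 74, 80, 85, 92], dtype=float)
grades = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 5], dtype=float)
print(f"station passing score = {borderline_regression(checklist, grades):.1f}%")
# The test-level standard is typically the mean of the station standards.
```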

13.
PURPOSE: At the Faculty of Medicine at the Katholieke Universiteit Leuven, Belgium, we have developed a final examination that consists of extended matching multiple-choice questions. Extended matching questions (EMQs) originate from a case and have 1 correct answer within a list of at least 7 alternatives. If EMQs assess clinical reasoning, we can assume there will be a difference between the ways students and experienced doctors solve the problems within the questions. This study compared students' and residents' processes of solving EMQs. METHODS: Twenty final year students and 20 fourth or fifth year residents specialising in internal medicine solved 20 EMQs while thinking aloud. All questions concerned diagnosis or pathogenesis. Ten EMQs related to internal medicine and 10 questions to other medical disciplines. The sessions were audiotaped and transcribed. RESULTS: The residents correctly answered significantly more questions concerning internal medicine than did the students. Their reasoning was more "forward" and less "backward". No difference between residents and students was found for the other questions. The residents scored better on internal medicine than on the other questions. They used more backward and less forward reasoning when solving the other questions than they did with the internal medicine questions. The better half of the respondents used significantly more forward and less backward reasoning than did the poorer half. CONCLUSION: In accordance with the literature, medical expertise was characterised by forward reasoning, whereas outside their area of expertise the subjects switched to backward reasoning. It is possible to assess processes of clinical reasoning using EMQs.

14.
INTRODUCTION: Inventories to quantify approaches to studying try to determine how students approach academic tasks. Medical curricula usually aim to promote a deep approach to studying, which is associated with academic success and which may predict desirable traits postqualification. AIMS: This study aimed to validate a revised Approaches to Learning and Studying Inventory (ALSI) in medical students and to explore its relation to student characteristics and performance. METHODS: Confirmatory factor analysis was used to validate the reported constructs in a sample of 128 Year 1 medical students. Models were developed to investigate the effect of age, graduate status and gender, and the relationships between approaches to studying and assessment outcomes. RESULTS: The ALSI performed as anticipated in this population, thus validating its use in our sample, but a 4-factor solution had a better fit than the reported 5-factor one. Medical students scored highly on deep approach compared with other students in higher education. Graduate status and gender had significant effects on approach to studying and a deep approach was associated with higher academic scores. CONCLUSIONS: The ALSI is valid for use in medical students and can uncover interesting relationships between approaches to studying and student characteristics. In addition, the ALSI has potential as a tool to predict student success, both academically and beyond qualification.
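The study's 4-factor versus 5-factor comparison was done with confirmatory factor analysis; as a rough illustrative analogue only, an exploratory factor analysis can be fitted at k = 4 and k = 5 and the solutions compared by model log-likelihood. A sketch on simulated responses (not the ALSI data):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Hypothetical item responses: 128 students x 20 Likert-type items,
# generated from 4 latent factors so the 4-factor model should win.
rng = np.random.default_rng(1)
latent = rng.normal(size=(128, 4))
loadings = rng.normal(size=(4, 20))
X = latent @ loadings + rng.normal(scale=1.0, size=(128, 20))

for k in (4, 5):
    fa = FactorAnalysis(n_components=k, random_state=0).fit(X)
    # score() returns the average per-sample log-likelihood; higher is
    # better, though a fair comparison would also penalise the extra
    # parameters of the 5-factor model (e.g. via BIC).
    print(f"{k}-factor solution: mean log-likelihood = {fa.score(X):.2f}")
```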

15.
BACKGROUND: The reproducibility of authentic assessment methods has been investigated for objective structured clinical examinations (OSCEs) and video assessment in general practice, but not for assessment with incognito standardized patients. PURPOSE: To investigate the reproducibility of assessment with incognito standardized patients. METHODS: A total of 27 Dutch rheumatologists in 16 hospitals were each visited by 8 incognito standardized patients presenting with different rheumatological disorders. After each visit, the standardized patient completed a case-specific checklist containing items on medical history, physical examination and management. Over a 20-month period, 254 incognito visits took place, of which 201 were first visits. The standardized patient was detected by the rheumatologist in 2 cases only. These encounters were not included in the analysis. Generalizability theory was used to investigate the reproducibility of the assessment. RESULTS: One fifth of the variance can be attributed to variation between rheumatologists. The largest variance is due to the variation in difficulty among cases. A reproducible assessment requires 3 hours of testing time (6 cases) if it is obtained through a norm-referenced interpretation of scores and 7 hours of testing time (14 cases) if it is obtained through an absolute interpretation of scores. CONCLUSION: The reproducibility of performance assessment in clinical practice by incognito standardized patients is similar to that of other authentic measurements for the assessment of clinical competence and performance.
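The gap between 6 cases (norm-referenced) and 14 cases (absolute) follows from generalisability theory: case difficulty drops out of the error term for relative decisions but stays in it for absolute decisions. A sketch with invented variance components chosen only to echo the abstract's proportions:

```python
# Illustrative variance components for a doctor x case design (made up;
# the paper reports only the resulting case counts).
var_doctor = 0.20   # ~one fifth of total variance, as in the abstract
var_case = 0.55     # case difficulty: the largest component
var_resid = 0.25    # doctor x case interaction + residual

def g_relative(n: int) -> float:
    """Norm-referenced: case difficulty cancels out because every
    doctor is compared on cases drawn from the same pool."""
    return var_doctor / (var_doctor + var_resid / n)

def phi_absolute(n: int) -> float:
    """Absolute: case difficulty stays in the error term, so more
    cases are needed to reach the same coefficient."""
    return var_doctor / (var_doctor + (var_case + var_resid) / n)

for n in (6, 14):
    print(f"{n:2d} cases: G = {g_relative(n):.2f}, Phi = {phi_absolute(n):.2f}")
```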

16.
BACKGROUND: While much is now known about how to assess the competence of medical practitioners in a controlled environment, less is known about how to measure the performance in practice of experienced doctors working in their own environments. The performance of doctors depends increasingly on how well they function in teams and on how well the health care system around them functions. METHODS: This paper reflects the combined experiences of a group of experienced education researchers and the results of literature searches on performance assessment methods. CONCLUSION: Measurement of competence is different from measurement of performance. Components of performance could be re-conceptualised within a different domain structure. Assessment methods may have a different utility from that in competence assessment and, indeed, a different utility according to the purpose of the assessment. An exploration of the utility of potential performance assessment methods suggests significant gaps that indicate priority areas for research and development.

17.
INTRODUCTION: An earlier study showed that an Angoff procedure with ≥ 10 recently graduated students as judges can be used to estimate the passing score of a progress test. As the acceptability and feasibility of this approach are questionable, we conducted an Angoff procedure with test item writers as judges. This paper reports on the reliability and credibility of this procedure and compares the standards set by the two different panels. METHODS: Fourteen item writers judged 146 test items. Recently graduated students had assessed these items in a previous study. Generalisability was investigated as a function of the number of items and judges. Credibility was judged by comparing the pass/fail rates associated with the Angoff standard, a relative standard and a fixed standard. The Angoff standards obtained by item writers and graduates were compared. RESULTS: The variance associated with consistent variability of item writers across items was 1.5%; for graduate students it was 0.4%. An acceptable error required a panel of 39 judges. Item-Angoff estimates of the two panels and item P-values correlated highly. Failure rates of 57%, 55% and 7% were associated with the item writers' standard, the fixed standard and the graduates' standard, respectively. CONCLUSION: The graduates' and the item writers' standards differed substantially, as did the associated failure rates. A panel of 39 item writers is not feasible. The item writers' passing score appears to be less credible. The credibility of the graduates' standard needs further evaluation. The acceptability and feasibility of a panel consisting of both students and item writers may be worth investigating.
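As background, an Angoff standard is each judge's summed item probability estimates, averaged over judges; the required panel size follows from how the standard error of that mean shrinks with more judges. A toy sketch (all ratings invented):

```python
import numpy as np

# Hypothetical Angoff ratings: 5 judges x 8 items, each entry the judged
# probability that a borderline candidate answers the item correctly.
ratings = np.array([
    [0.6, 0.7, 0.4, 0.8, 0.5, 0.6, 0.7, 0.5],
    [0.5, 0.8, 0.5, 0.7, 0.4, 0.6, 0.6, 0.6],
    [0.7, 0.6, 0.3, 0.9, 0.5, 0.7, 0.8, 0.4],
    [0.6, 0.7, 0.5, 0.8, 0.6, 0.5, 0.7, 0.5],
    [0.5, 0.6, 0.4, 0.7, 0.4, 0.6, 0.6, 0.5],
])

judge_totals = ratings.sum(axis=1)        # each judge's implied passing score
passing_score = judge_totals.mean()       # the panel standard
sem = judge_totals.std(ddof=1) / np.sqrt(len(judge_totals))
print(f"standard = {passing_score:.1f} of 8 items, SEM = {sem:.2f}")
# More judges shrink the SEM by 1/sqrt(n); a required panel size (such
# as the 39 judges above) is derived from a target error in this way.
```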

18.
BACKGROUND: Assessment plays a key role in the learning process. The validity of any given assessment tool should ideally be established. If an assessment is to act as a guide to future teaching and learning, then its predictive validity must be established. AIM: To assess the ability of an objective structured clinical examination (OSCE) taken at the end of the first clinical year of an undergraduate medical degree to predict later performance in clinical examinations. METHODS: The performance of two consecutive cohorts of Year 3 medical undergraduates (n=138 and n=128) in a 23-station OSCE was compared with their performance in 5 subsequent clinical examinations in Years 4 and 5 of the course. RESULTS: Poor performance in the OSCE was strongly associated with later poor performance in other clinical examinations. Students in the lowest three deciles of OSCE performance were 6 times more likely to fail another clinical examination. Receiver operating characteristic (ROC) curves were constructed as a method of criterion-referencing the cut point for future examinations. CONCLUSION: Performance in an OSCE taken early in the clinical course strongly predicts later clinical performance. Assessing subsequent student performance is a powerful tool for assessing examination validity. The use of ROC curves represents a novel method for determining future criterion-referenced examination cut points.
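A ROC-based cut point of the kind described can be derived by treating later examination failure as the outcome and the OSCE score as the predictor, then picking the threshold that maximises Youden's J. A sketch with fabricated scores, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical data: Year 3 OSCE percentage scores, and whether the
# student later failed any clinical examination (1 = later failure).
osce = np.array([48, 52, 55, 58, 60, 62, 65, 68, 72, 75, 78, 82], dtype=float)
later_fail = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0])

# roc_curve expects higher scores to indicate the positive class, so
# feed the negated OSCE score (low OSCE -> high risk of later failure).
fpr, tpr, thresholds = roc_curve(later_fail, -osce)
youden = tpr - fpr
cut = -thresholds[np.argmax(youden)]
print(f"suggested OSCE cut score: {cut:.0f}%")
```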

19.
Peer assessment of professional competence
BACKGROUND: Current assessment formats for medical students reliably test core knowledge and basic skills. Methods for assessing other important domains of competence, such as interpersonal skills, humanism and teamwork skills, are less well developed. This study describes the development, implementation and results of peer assessment as a measure of professional competence of medical students to be used for formative purposes. METHODS: Year 2 medical students assessed the professional competence of their peers using an online assessment instrument. Fifteen randomly selected classmates were assigned to assess each student. The responses were analysed to determine the reliability and validity of the scores and to explore relationships between peer assessments and other assessment measures. RESULTS: Factor analyses suggest a 2-dimensional conceptualisation of professional competence: 1 factor represents Work Habits, such as preparedness and initiative, and the other factor represents Interpersonal Habits, including respect and trustworthiness. The Work Habits factor had moderate, yet statistically significant correlations ranging from 0.21 to 0.53 with all other performance measures that were part of a comprehensive assessment of professional competence. Approximately 6 peer raters were needed to achieve a generalisability coefficient of 0.70. CONCLUSIONS: Our findings suggest that it is possible to introduce peer assessment for formative purposes in an undergraduate medical school programme that provides multiple opportunities to interact with and observe peers.
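The finding that about 6 peer raters yield a generalisability coefficient of 0.70 is what the Spearman-Brown prophecy formula predicts if a single peer rating has a coefficient of roughly 0.28 (an assumed value, not reported in the abstract):

```python
import math

def raters_needed(single_rater_g: float, target_g: float) -> int:
    """Spearman-Brown prophecy: number of raters whose averaged ratings
    reach the target coefficient, given the single-rater coefficient."""
    n = (target_g * (1 - single_rater_g)) / (single_rater_g * (1 - target_g))
    return math.ceil(n)

# If one peer rating has G ~= 0.28 (assumed), six raters reach 0.70:
print(raters_needed(0.28, 0.70))  # -> 6
```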

20.
BACKGROUND: In medical education, and in the assessment of medical competence and performance, important changes have taken place over the last 5 decades. These changes have affected the basic concepts in all 3 domains. DEVELOPMENTS IN EDUCATION AND ASSESSMENT: In education, constructivism has provided a completely new view of how students learn best. In assessment, the change from trait-orientated to competency- or role-orientated thinking has given rise to a whole range of new approaches. However, certain methods of education, such as problem-based learning (PBL), and of assessment are often seen as almost synonymous with the underlying concepts, and one tends to forget that it is the concept that is important and that a particular method is but 1 way of using that concept. In doing so, one runs the risk of confusing means and ends, which may hamper or slow down new developments. LESSONS FOR RESEARCH: A similar problem often occurs in research on medical education. Here too, methods, or rather methodologies, are confused with research questions. This may lead to an overemphasis on research that fits well known methodologies (e.g. the randomised controlled trial) and to neglect of what are sometimes even more important research questions because they do not fit well known methodologies. CONCLUSION: In this paper we advocate a return to the underlying concepts and careful reflection on their use in various situations.
