Similar Articles
20 similar articles found
1.
CONTEXT: Factors that interfere with the ability to interpret assessment scores or ratings in the proposed manner threaten validity. To be interpreted meaningfully, all assessments in medical education require sound, scientific evidence of validity. PURPOSE: The purpose of this essay is to discuss 2 major threats to validity: construct under-representation (CU) and construct-irrelevant variance (CIV). Examples of each type of threat are provided for written, performance and clinical performance examinations. DISCUSSION: The CU threat to validity refers to undersampling of the content domain; using too few items, cases or clinical performance observations to generalise adequately to the domain represents CU. Variables that systematically (rather than randomly) interfere with the ability to interpret scores or ratings meaningfully represent CIV. In written tests, flawed items, items written at inappropriate reading levels and statistically biased questions all introduce CIV. For performance examinations, such as standardised patient examinations, flawed cases or cases that are too difficult for the students' ability contribute CIV to the assessment. For clinical performance data, systematic rater error, such as halo or central tendency error, represents CIV. The term 'face validity' is rejected as representative of any legitimate type of validity evidence, although it is acknowledged that the appearance of an assessment may be an important characteristic in its own right. CONCLUSIONS: There are multiple threats to validity in all types of assessment in medical education. Methods to eliminate or control validity threats are suggested.

2.
PURPOSE: This investigation aimed to explore the measurement properties of scores from a patient simulator exercise. METHODS: Analytic and holistic scores were obtained for groups of medical students and residents. Item analysis techniques were used to explore the nature of specific examinee actions. Interrater reliability was calculated. Scores were contrasted for third-year medical students, fourth-year medical students and emergency department residents. RESULTS: Interrater reliabilities for analytic and holistic scores were 0.92 and 0.81, respectively. Based on item analysis, proper timing and sequencing of actions discriminated between low- and high-ability examinees. In general, examinees with more advanced training obtained higher scores on the simulation exercise. CONCLUSION: Reliable and valid measures of clinical performance can be obtained from a trauma simulation, provided that care is taken in the development and scoring of the scenario.
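As an illustration of the interrater and item-analysis computations this abstract refers to, the following Python sketch estimates interrater reliability as a correlation between two raters and the discrimination of a single checklist action against the total score. All numbers are invented for illustration; they are not the study's data.

```python
import numpy as np

# Hypothetical analytic scores given by two raters to the same 10 examinees.
rater_a = np.array([78, 85, 62, 90, 71, 88, 55, 95, 67, 80], dtype=float)
rater_b = np.array([75, 88, 60, 92, 70, 85, 58, 93, 65, 82], dtype=float)

# Interrater reliability estimated as the Pearson correlation between raters.
interrater_r = np.corrcoef(rater_a, rater_b)[0, 1]
print(f"interrater reliability (Pearson r): {interrater_r:.2f}")

# Item discrimination: point-biserial correlation between one pass/fail
# action (hypothetical, e.g. 'secured airway before transport') and the
# examinee's total score.
item = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 1], dtype=float)
total = (rater_a + rater_b) / 2
discrimination = np.corrcoef(item, total)[0, 1]
print(f"item discrimination (point-biserial): {discrimination:.2f}")
```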

3.
Downing SM. Medical Education 2004;38(9):1006-1012
CONTEXT: All assessment data, like other scientific experimental data, must be reproducible in order to be meaningfully interpreted. PURPOSE: The purpose of this paper is to discuss applications of reliability to the most common assessment methods in medical education. Typical methods of estimating reliability are discussed intuitively and non-mathematically. SUMMARY: Reliability refers to the consistency of assessment outcomes. The exact type of consistency of greatest interest depends on the type of assessment, its purpose and the consequential use of the data. Written tests of cognitive achievement look to internal test consistency, using estimation methods derived from the test-retest design. Rater-based assessment data, such as ratings of clinical performance on the wards, require interrater consistency or agreement. Objective structured clinical examinations, simulated patient examinations and other performance-type assessments generally require generalisability theory analysis to account for various sources of measurement error in complex designs and to estimate the consistency of the generalisations to a universe or domain of skills. CONCLUSIONS: Reliability is a major source of validity evidence for assessments. Low reliability indicates that large variations in scores can be expected upon retesting. Inconsistent assessment scores are difficult or impossible to interpret meaningfully and thus reduce validity evidence. Reliability coefficients allow the quantification and estimation of the random errors of measurement in assessments, such that overall assessment can be improved.
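Two of the quantities this abstract discusses, internal consistency and the size of the score variations to be expected on retesting, can be made concrete with a short sketch. Cronbach's alpha and the standard error of measurement (SEM) below follow the standard formulas; the response matrix itself is synthetic and purely illustrative.

```python
import numpy as np

# Synthetic response matrix: 8 examinees x 5 dichotomously scored items.
X = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
], dtype=float)

k = X.shape[1]
total = X.sum(axis=1)

# Cronbach's alpha: internal-consistency reliability of the total score.
alpha = (k / (k - 1)) * (1 - X.var(axis=0, ddof=1).sum() / total.var(ddof=1))

# Standard error of measurement: SD of the random error around a score,
# so low reliability means wide confidence bands on retesting.
sem = total.std(ddof=1) * np.sqrt(1 - alpha)

print(f"alpha = {alpha:.2f}, SEM = {sem:.2f}")
print(f"95% CI for an observed score of 4: {4 - 1.96 * sem:.1f} to {4 + 1.96 * sem:.1f}")
```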

4.
Context A test score is a number which purportedly reflects a candidate's proficiency in some clearly defined knowledge or skill domain. A test theory model is necessary to help us better understand the relationship that exists between the observed (or actual) score on an examination and the underlying proficiency in the domain, which is generally unobserved. Common test theory models include classical test theory (CTT) and item response theory (IRT). The widespread use of IRT models over the past several decades attests to their importance in the development and analysis of assessments in medical education. Item response theory models are used for a host of purposes, including item analysis, test form assembly and equating. Although helpful in many circumstances, IRT models make fairly strong assumptions and are mathematically much more complex than CTT models. Consequently, there are instances in which it might be more appropriate to use CTT, especially when common assumptions of IRT cannot be readily met, or in more local settings, such as those that may characterise many medical school examinations. Objectives The objective of this paper is to provide an overview of both CTT and IRT to the practitioner involved in the development and scoring of medical education assessments. Methods The tenets of CTT and IRT are initially described. Then, main uses of both models in test development and psychometric activities are illustrated via several practical examples. Finally, general recommendations pertaining to the use of each model in practice are outlined. Discussion Classical test theory and IRT are widely used to address measurement-related issues that arise from commonly used assessments in medical education, including multiple-choice examinations, objective structured clinical examinations, ward ratings and workplace evaluations. The present paper provides an introduction to these models and how they can be applied to answer common assessment questions. Medical Education 2010;44:109-117
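To make the contrast concrete: CTT models an observed score X as a true score T plus random error E (X = T + E), whereas IRT models the probability of a correct response as a function of examinee ability and item parameters. The sketch below evaluates a two-parameter logistic (2PL) item response function; the item parameters are invented for illustration and are not drawn from the paper.

```python
import numpy as np

def p_correct_2pl(theta, a, b):
    """2PL IRT model: probability that an examinee of ability theta
    answers correctly an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# An easy, highly discriminating item versus a hard, weakly discriminating one.
for theta in (-1.0, 0.0, 1.0):
    easy = p_correct_2pl(theta, a=1.5, b=-0.5)
    hard = p_correct_2pl(theta, a=0.7, b=1.0)
    print(f"theta = {theta:+.1f}: P(easy item) = {easy:.2f}, P(hard item) = {hard:.2f}")
```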

5.
CONTEXT: The conceptualisation and measurement of competence in patient care are critical to the design of medical education programmes and outcome assessment. OBJECTIVE: We aimed to examine the major components and correlates of postgraduate competence in patient care. METHODS: A 24-item rating form with additional questions about resident doctors' performance and future residency offers was used. Study participants comprised 4560 subjects who graduated from Jefferson Medical College between 1975 and 2004. They pursued their graduate medical education in 508 hospitals. We used a longitudinal study design in which the rating form was completed by programme directors to evaluate residents at the end of the first postgraduate year. Factor analysis was used to identify the underlying components of postgraduate ratings. Multiple regression, t-test and correlational analyses were used to study the validity of the components that emerged. RESULTS: Two major components emerged, which we labelled 'Knowledge and Clinical Capabilities' and 'Professionalism', and which addressed the science and art of medicine, respectively. Performance measures during medical school, scores on medical licensing examinations, and global assessments of Medical Knowledge, Clinical Judgement and Data-gathering Skills correlated more highly with scores on the Knowledge and Clinical Capabilities component. Global assessments of Professional Attitudes and ratings of Empathic Behaviour correlated more highly with scores on the Professionalism component. Offers of continued residency and evaluations of desirable qualities were associated with both components. CONCLUSIONS: Psychometric support for measuring the Knowledge and Clinical Capabilities and Professionalism components provides medical educators in search of such a tool with an instrument for empirically evaluating educational outcomes.
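As a sketch of the factor-analytic step described in the methods, the snippet below fits a 2-factor model to simulated ratings in which 4 items are driven by one latent trait and 2 by another, mirroring the Knowledge and Clinical Capabilities versus Professionalism split. It uses scikit-learn's FactorAnalysis rather than the study's specific extraction and rotation choices, and every number is simulated.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n = 500  # simulated residents

# Items 0-3 load on a 'knowledge and clinical capabilities' trait,
# items 4-5 on a 'professionalism' trait (all values synthetic).
knowledge = rng.normal(size=(n, 1))
professionalism = rng.normal(size=(n, 1))
ratings = np.hstack([
    knowledge + 0.5 * rng.normal(size=(n, 4)),
    professionalism + 0.5 * rng.normal(size=(n, 2)),
])

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(ratings)

# Loadings: rows are factors, columns are items; items 0-3 and items 4-5
# should separate onto different factors.
print(np.round(fa.components_, 2))
```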

6.
INTRODUCTION: Assessment of medical student clinical skills is best carried out using multiple assessment methods. A programme was developed to obtain parent evaluations of medical student paediatric interview skills, both for feedback and to identify students at risk of poor performance in summative assessments. METHOD: A total of 130 parent evaluations were obtained for 67 students (parent participation 72%, student participation 58%). Parents completed a 13-item questionnaire, the Interpersonal Skills Rating Scale (IPS; maximum score 91, with higher scores indicating a higher student skill level). Students received their individual parent scores and de-identified class mean scores as feedback, and participants were surveyed regarding the programme. Parent evaluation scores were compared with student performance in formative and summative faculty assessments of clinical interview skills. RESULTS: Parents supported the programme and participating students valued parent feedback. Students with a parent score more than 1 standard deviation (SD) below the class mean (low IPS score students) obtained lower faculty summative assessment scores than did other students (mean ± SD, 59% ± 5 versus 64% ± 7; P < 0.05). Obtaining 1 low IPS score was associated with a subsequent faculty summative assessment score below the class mean (sensitivity 0.38, specificity 0.88). Parent evaluations combined with faculty formative assessments identified 50% of the students who subsequently performed below the class mean in summative assessments. CONCLUSIONS: Parent evaluations provided useful feedback to students and identified a group of students at increased risk of weaker performance in summative assessments. They could be combined with other methods of formative assessment to enhance screening procedures for clinically weak students.
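The screening figures quoted above come from a standard 2 × 2 table. The counts in this sketch are hypothetical, chosen only so that the arithmetic reproduces the reported sensitivity of 0.38 and specificity of 0.88; they are not the study's data.

```python
# Hypothetical screening table for 'low IPS score' as a flag for
# below-class-mean summative performance (counts invented).
tp, fn = 10, 16  # weak students flagged / missed by a low IPS score
fp, tn = 5, 36   # non-weak students flagged / correctly not flagged

sensitivity = tp / (tp + fn)  # 10/26 ~ 0.38
specificity = tn / (tn + fp)  # 36/41 ~ 0.88
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```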

7.
Objectives  To evaluate the reliability and feasibility of assessing the performance of medical specialist registrars (SpRs) using three methods: the mini-clinical evaluation exercise (mini-CEX), directly observed procedural skills (DOPS) and multi-source feedback (MSF) to help inform annual decisions about the outcome of SpR training.
Methods  We conducted a feasibility study and generalisability analysis based on the application of these assessment methods and the resulting data. A total of 230 SpRs (from 17 specialties) in 58 UK hospitals took part from 2003 to 2004. Main outcome measures included the time taken for each assessment, variance component analysis of mean scores, and derivation of 95% confidence intervals for individual doctors' scores based on the standard error of measurement. Responses to direct questions on questionnaires were analysed, as were the themes emerging from open-comment responses.
Results  The methods can provide reliable scores with appropriate sampling. In our sample, all trainees who completed the number of assessments recommended by the Royal Colleges of Physicians had scores that were 95% certain to be better than unsatisfactory. The mean time taken to complete the mini-CEX (including feedback) was 25 minutes. The DOPS required the duration of the procedure being assessed plus an additional third of this time for feedback. The mean time required for each rater to complete his or her MSF form was 6 minutes.
Conclusions  This is the first attempt to evaluate the use of comprehensive workplace assessment across the medical specialties in the UK. The methods are feasible to conduct and can make reliable distinctions between doctors' performances. With adaptation, they may be appropriate for assessing the workplace performance of other grades and specialties of doctor. This may be helpful in informing foundation assessment.
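The dependence of reliability on sampling that the results highlight can be illustrated with a generalisability-style calculation: the reliability of a doctor's mean score rises as the residual error variance is averaged over more assessments. The variance components below are assumptions chosen for illustration, not values from the study.

```python
def g_coefficient(var_person, var_residual, n):
    """Generalisability of a mean over n assessments: the proportion of
    observed-score variance attributable to true differences between doctors."""
    return var_person / (var_person + var_residual / n)

# Assumed variance components (illustrative only).
for n in (1, 4, 8, 12):
    print(f"{n:2d} assessments -> G = {g_coefficient(0.30, 1.20, n):.2f}")
```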

8.
CONTEXT: We are unaware of studies examining the stability of teaching assessment scores across different medical specialties. A recent study showed that clinical teaching assessments of general internists reduce to 3 domains: interpersonal, clinical teaching and efficiency. We sought to determine the factor stability of this 3-dimensional model among cardiologists and to compare domain-specific scores between general internists and cardiologists. METHODS: A total of 2000 general internal medicine and cardiology hospital teaching assessments carried out from January 2000 to March 2004 were analysed using principal factor analysis. Internal consistency and inter-rater reliability were calculated. Mean item scores were compared between general internists and cardiologists. RESULTS: The interpersonal and clinical teaching domains previously demonstrated among general internists collapsed into 1 domain among cardiologists, whereas the efficiency domain remained stable. Internal consistency of domains (Cronbach's alpha range 0.89-0.93) and inter-rater reliability of items (range 0.65-0.87) were good to excellent for both specialties. General internists scored significantly higher (P < 0.05) than cardiologists on most items, except for 4 items that more accurately assessed the cardiology teaching environment. CONCLUSIONS: We observed factor instability of clinical teaching assessment scores from the same instrument administered to general internists and cardiologists. This finding was attributed to salient differences between these specialties' educational environments and highlights the importance of validating assessments for the specific contexts in which they are to be used. Future research should determine whether interpersonal domain scores identify superior teachers and examine why the interpersonal and clinical teaching domains are unstable across different educational settings.

9.
BACKGROUND: While much is now known about how to assess the competence of medical practitioners in a controlled environment, less is known about how to measure the performance in practice of experienced doctors working in their own environments. The performance of doctors depends increasingly on how well they function in teams and how well the health care system around them functions. METHODS: This paper reflects the combined experiences of a group of experienced education researchers and the results of literature searches on performance assessment methods. CONCLUSION: Measurement of competence is different from measurement of performance. Components of performance could be re-conceptualised within a different domain structure. Assessment methods may have a different utility from that which they have in competence assessment and, indeed, their utility may differ according to the purpose of the assessment. An exploration of the utility of potential performance assessment methods suggests significant gaps that indicate priority areas for research and development.

10.
CONTEXT: Standardised assessments of practising doctors are receiving growing support, but theoretical and logistical issues pose serious obstacles. OBJECTIVES: To obtain reference performance levels from experienced doctors on computer-based case simulation (CCS) and standardised patient-based (SP) methods, and to evaluate the utility of these methods in diagnostic assessment. SETTING AND PARTICIPANTS: The study was carried out at a military tertiary care facility and involved 54 residents and credentialed staff from the emergency medicine, general surgery and internal medicine departments. MAIN OUTCOME MEASURES: Doctors completed 8 CCS and 8 SP cases targeted at doctors entering the profession. Standardised patient performances were compared to archived Year 4 medical student data. RESULTS: While staff doctors and residents performed well on both CCS and SP cases, a wide range of scores was exhibited on all cases. There were no significant differences between the scores of participants from differing specialties or of varying experience. Among participants who completed both CCS and SP testing (n = 44), a moderate positive correlation between CCS and SP checklist scores was observed. There was a negative correlation between doctor experience and SP checklist scores. Whereas the time students spent with SPs varied little with clinical task, doctors appeared to spend more time on communication/counselling cases than on cases involving acute/chronic medical problems. CONCLUSION: Computer-based case simulations and standardised patient-based assessments may be useful as part of a multimodal programme to evaluate practising doctors. Additional study is needed on SP standard setting and scoring methods. Establishing empirical likelihoods for a range of performances on assessments of this character should receive priority.

11.
CONTEXT: Multiple-choice questions (MCQs) are frequently used to assess students in health science disciplines. However, few educators have formal instruction in writing MCQs, and MCQ items often have item-writing flaws. The purpose of this study was to examine the impact of item-writing flaws on student achievement in high-stakes assessments in a nursing programme in an English-language university in Hong Kong. METHODS: From a larger sample, we selected 10 summative test papers that were administered to undergraduate nursing students in 1 nursing department. All test items were reviewed for item-writing flaws by a 4-person consensus panel. Items were classified as 'flawed' if they contained ≥ 1 flaw. Items not containing item-writing violations were classified as 'standard'. For each paper, 2 separate scales were computed: a total scale, which reflected the characteristics of the assessment as administered, and a standard scale, which reflected the characteristics of a hypothetical assessment including only unflawed items. RESULTS: The proportion of flawed items on the 10 test papers ranged from 28% to 75%; 47.3% of all items were flawed. Fewer examinees passed the standard scale than the total scale (748 [90.6%] versus 779 [94.3%]). Conversely, the proportion of examinees obtaining a score ≥ 80% was higher on the standard scale than on the total scale (173 [20.9%] versus 120 [14.5%]). CONCLUSIONS: Flawed MCQ items were common in these high-stakes nursing assessments but, in contrast to previous reports, did not disadvantage borderline students. Rather, high-achieving students were more likely than borderline students to be penalised by flawed items.
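The total and standard scales described in the methods amount to scoring the same answer matrix twice, once over all items and once over the unflawed subset. A minimal sketch with synthetic responses (the flaw classification, cohort size and pass mark are all assumptions, not the study's data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic 0/1 answer matrix: 100 examinees x 50 items, of which the
# first 30 are taken to be 'standard' and the remaining 20 'flawed'.
answers = rng.integers(0, 2, size=(100, 50))
standard_items = np.arange(30)

total_scale = answers.mean(axis=1) * 100                        # all items
standard_scale = answers[:, standard_items].mean(axis=1) * 100  # unflawed only

pass_mark = 50.0  # assumed pass mark, for illustration
print("pass rate, total scale:   ", (total_scale >= pass_mark).mean())
print("pass rate, standard scale:", (standard_scale >= pass_mark).mean())
```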

12.
INTRODUCTION: An essential element of practice performance assessment involves combining the results of various procedures in order to see the whole picture. This must be derived from both objective and subjective assessment, as well as from a combination of quantitative and qualitative assessment procedures. Because of the severe consequences an assessment of practice performance may have, it is essential that the procedure is both defensible to the stakeholders and fair, in that it distinguishes well between good performers and underperformers. LESSONS FROM COMPETENCE ASSESSMENT: Large samples of behaviour are always necessary because of the domain specificity of competence and performance. The test content is considerably more important in determining which competency is being measured than the test format, and it is important to recognise that the problem-solving process is more idiosyncratic than its outcome. It is advisable to add some structure to the assessment but to refrain from over-structuring, as this tends to trivialise the measurement. IMPLICATIONS FOR PRACTICE PERFORMANCE ASSESSMENT: A practice performance assessment should use multiple instruments. The reproducibility of the subjective parts should be increased not by over-structuring, but by sampling across sources of bias. As many sources of bias may exist, sampling across all of them may not prove feasible. Therefore, a more project-orientated approach is suggested, using a range of instruments. At various time points during any assessment with a particular instrument, questions should be raised as to whether the sampling is sufficient with respect to the quantity and quality of the observations, and whether the totality of assessments across instruments is sufficient to see 'the whole picture'. This policy is embedded within a larger organisational and health care context.

13.
BACKGROUND: Ward rounds are an essential responsibility for doctors in hospital settings. Tools for guiding and assessing trainees' performance of ward rounds are needed. A checklist was developed for that purpose for use with trainees in internal medicine. OBJECTIVE: To assess the content and construct validity of the task-specific checklist. METHODS: To determine content validity, a questionnaire was mailed to 295 internists. They were requested to give their opinion on the relevance of each item included on the checklist and to indicate the comprehensiveness of the checklist. To determine construct validity, an observer assessed 4 groups of doctors during performance of a complete ward round (n = 32). The nurse who accompanied the doctor on rounds made a global assessment of the performance. RESULTS: The response rate to the questionnaire was 80.7%. The respondents found that all 10 items on the checklist were relevant to ward round performance and that the item collection was comprehensive. Checklist mean-item scores, reported as median (range), differed between levels of expertise: junior house officers 1.4 (1.0-1.9); senior house officers 2.0 (1.5-2.9); specialist trainees 2.5 (1.8-2.8); specialists 2.7 (2.3-3.5) (P < 0.001). A significant correlation was found between global observer scores and nurse scores (r = 0.56, P < 0.001). CONCLUSION: The checklist, developed for assessing trainees' performance of ward rounds in internal medicine, showed high content validity. Construct validity was supported by the higher scores of experienced doctors compared with those of less experienced doctors, and by the significant correlation between the observer's and the nurses' global scores. The checklist should be valuable in guiding and assessing trainees on ward round performance.

14.
OBJECTIVES: This study aimed to compare an essay-style undergraduate medical assessment with modified essay, multiple-choice question (MCQ) and objective structured clinical examination (OSCE) undergraduate medical assessments in predicting students' clinical performance (predictive validity), and to determine the relative contributions of the written (modified essay and MCQ) assessment and OSCE to predictive validity. DESIGN: Before and after cohort study. SETTING: One medical school running a 6-year undergraduate course. PARTICIPANTS: Study participants included 137 Year 5 medical students followed into their trainee intern year. MAIN OUTCOME MEASURES: Aggregated global ratings by senior doctors, junior doctors and nurses as well as comprehensive structured assessments of performance in the trainee intern year. RESULTS: Students' scores in the new examinations predicted performance significantly better than scores in the old examinations, with correlation coefficients increasing from 0.05-0.44 to 0.41-0.81. The OSCE was a stronger predictor of subsequent performance than the written assessments but combining assessments had the strongest predictive validity. CONCLUSION: Using more comprehensive, more reliable and more authentic undergraduate assessment methods substantially increases predictive validity.

15.
BACKGROUND: In medical education, and in the assessment of medical competence and performance, important changes have taken place over the last 5 decades. These changes have affected the basic concepts in all 3 domains. DEVELOPMENTS IN EDUCATION AND ASSESSMENT: In education, constructivism has provided a completely new view of how students learn best. In assessment, the change from trait-orientated to competency- or role-orientated thinking has given rise to a whole range of new approaches. Certain methods of education, such as problem-based learning (PBL), and of assessment, however, are often seen as almost synonymous with their underlying concepts, and one tends to forget that it is the concept that is important and that a particular method is but 1 way of using that concept. In doing so, one runs the risk of confusing means and ends, which may hamper or slow down new developments. LESSONS FOR RESEARCH: A similar problem seems to occur often in research on medical education. Here too, methods (or, rather, methodologies) are confused with research questions. This may lead to an overemphasis on research that fits well known methodologies (e.g. the randomised controlled trial) and to the neglect of research questions that are sometimes even more important but do not fit well known methodologies. CONCLUSION: In this paper we advocate a return to the underlying concepts and careful reflection on their use in various situations.

16.
The assessment of the performance of doctors in practice is becoming more widely accepted. While there are many potential purposes for such assessments, sometimes the consequences of the assessments will be 'high stakes'. In these circumstances, any of the many elements of the assessment programme may potentially be challenged. These assessment programmes therefore need to be robust, fair and defensible from the perspectives of consumer, assessee and assessor. In order to inform the design of defensible programmes for assessing practice performance, a group of education researchers at the 10th Cambridge Conference adopted a project management approach to designing practice performance assessment programmes. This paper describes issues to consider in articulating the purposes and outcomes of the assessment, planning the programme, and managing the administrative processes involved, including communication with and preparation of assessees. Examples of key questions to be answered are provided, but further work is needed to test validity.

17.
OBJECTIVE: Examinations based on standardised patients (SPs) commonly use checklist recordings to evaluate students' clinical performance. This paper examines whether, and to what extent, item and rater characteristics affect the reliability of history checklist recording in an SP-based assessment. METHODS: Checklist items were reviewed for the presence or absence of 5 item characteristics and for the use of a 2-point versus a 3-point scoring scale. Agreement between checklist recordings obtained from SPs and clinician-examiners (CEs) was compared by item characteristics, scoring scale and the CEs' level of involvement in the assessment. RESULTS: Based on 3179 pairs of recordings, the overall percentage of agreement between SPs and CEs was 83% (kappa = 0.64). Agreement was significantly higher for items scored on a 2-point than on a 3-point scale, and when the CE was also the author and the trainer of the station. After controlling for other factors, item characteristics were only marginally associated with the level of interrater agreement. CONCLUSIONS: This study suggests that attention should be paid to specific aspects of checklist development and checklist recording training when an SP or CE is used as a recorder.
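Kappa, the agreement statistic reported above, corrects raw percentage agreement for the agreement expected by chance alone. In the sketch below, the 2 × 2 table for a single history item is invented, with counts picked so that the arithmetic reproduces the reported 83% agreement and kappa of 0.64.

```python
import numpy as np

# Hypothetical counts: rows = SP recording (asked / not asked),
# columns = clinician-examiner (CE) recording.
table = np.array([[53, 8],
                  [9, 30]], dtype=float)

n = table.sum()
p_observed = np.trace(table) / n  # raw proportion of agreement
# Chance agreement from the marginal totals of each rater.
p_chance = (table.sum(axis=1) @ table.sum(axis=0)) / n ** 2
kappa = (p_observed - p_chance) / (1 - p_chance)
print(f"agreement = {p_observed:.2f}, kappa = {kappa:.2f}")  # 0.83, 0.64
```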

18.
Validity: on meaningful interpretation of assessment data
CONTEXT: All assessments in medical education require evidence of validity to be interpreted meaningfully. In contemporary usage, all validity is construct validity, which requires multiple sources of evidence; construct validity is the whole of validity, but has multiple facets. Five sources (content, response process, internal structure, relationship to other variables, and consequences) are noted by the Standards for Educational and Psychological Testing as fruitful areas in which to seek validity evidence. PURPOSE: The purpose of this article is to discuss construct validity in the context of medical education and to summarise, through example, some typical sources of validity evidence for a written and a performance examination. SUMMARY: Assessments are not valid or invalid; rather, the scores or outcomes of assessments have more or less evidence to support (or refute) a specific interpretation (such as passing or failing a course). Validity is approached as a hypothesis, using theory, logic and the scientific method to collect and assemble data that support, or fail to support, the proposed score interpretations at a given point in time. Data and logic are assembled into arguments, pro and con, for some specific interpretation of assessment data. Examples of types of validity evidence, data and information from each source are discussed in the context of a high-stakes written and performance examination in medical education. CONCLUSION: All assessments require evidence of the reasonableness of the proposed interpretation, as test data in education have little or no intrinsic meaning. The constructs purported to be measured by our assessments are important to students, faculty, administrators, patients and society, and require solid scientific evidence of their meaning.

19.
OBJECTIVES: To perform internal and external evaluations of all 5 medical schools in Bosnia and Herzegovina against international standards. METHODS: We carried out a 2-stage survey study using the same 5-point Likert scale for internal and external evaluations of the 5 medical schools in Bosnia and Herzegovina (Banja Luka, Foca/East Sarajevo, Mostar, Sarajevo and Tuzla). Participants consisted of managerial staff, teaching staff and students of the medical schools, and external expert assessors. Main outcome measures included scores on internal and external evaluation forms for 10 items concerning aspects of school curriculum and functioning: 'School mission and objectives'; 'Curriculum'; 'Management'; 'Staff'; 'Students'; 'Facilities and technology'; 'Financial issues'; 'International relationships'; 'Internal quality assurance', and 'Development plans'. RESULTS: During internal assessment, schools consistently either overrated their overall functioning (Foca/East Sarajevo, Mostar and Tuzla) or markedly overrated or underrated their performance on individual items on the survey (Banja Luka and Sarajevo). Scores for internal assessment differed from those for external assessment. These differences were not consistent, except for the sections 'School mission and objectives', 'Curriculum' and 'Development plans', which were consistently overrated in the internal assessments. External assessments were more positive than internal assessments on 'Students' and 'Facilities and technology' in 3 of the 5 schools. CONCLUSIONS: This assessment exercise in 5 medical schools showed that constructive and structured evaluation of medical education is possible, even in complex and unfavourable conditions. Medical schools in Bosnia and Herzegovina have successfully formed a national consortium for formal collaboration in curriculum development and reform.

20.
Peer assessment of professional competence
BACKGROUND: Current assessment formats for medical students reliably test core knowledge and basic skills. Methods for assessing other important domains of competence, such as interpersonal skills, humanism and teamwork skills, are less well developed. This study describes the development, implementation and results of peer assessment as a measure of the professional competence of medical students, to be used for formative purposes. METHODS: Year 2 medical students assessed the professional competence of their peers using an online assessment instrument. Fifteen randomly selected classmates were assigned to assess each student. The responses were analysed to determine the reliability and validity of the scores and to explore relationships between peer assessments and other assessment measures. RESULTS: Factor analyses suggest a 2-dimensional conceptualisation of professional competence: 1 factor represents Work Habits, such as preparedness and initiative, and the other represents Interpersonal Habits, including respect and trustworthiness. The Work Habits factor showed moderate yet statistically significant correlations, ranging from 0.21 to 0.53, with all other performance measures that were part of a comprehensive assessment of professional competence. Approximately 6 peer raters were needed to achieve a generalisability coefficient of 0.70. CONCLUSIONS: Our findings suggest that it is possible to introduce peer assessment for formative purposes in an undergraduate medical school programme that provides multiple opportunities to interact with and observe peers.
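The 'approximately 6 raters' figure reflects Spearman-Brown logic: the reliability of the mean of n ratings grows with n. The single-rater reliability used below (0.28) is an assumption chosen so that n = 6 yields a coefficient of 0.70; the study's actual variance components are not reproduced here.

```python
def reliability_of_mean(r1, n):
    """Spearman-Brown: reliability of the mean of n parallel ratings,
    given single-rating reliability r1."""
    return n * r1 / (1 + (n - 1) * r1)

r1 = 0.28  # assumed single-rater reliability (illustrative)
for n in range(1, 9):
    print(f"{n} raters -> reliability = {reliability_of_mean(r1, n):.2f}")
```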
