Global Descriptive Evaluations Are More Responsive than Global Numeric Ratings in Detecting Students' Progress during the Inpatient Portion of an Internal Medicine Clerkship

Abstract

Grade inflation by clinical instructors is a constant concern for clerkship directors.1 Achieving evaluation of students' performances that is credible, reliable, valid, and provides meaningful feedback to students is a central goal for clinical clerkships.2 However, most efforts to improve descriptive evaluations of medical students have focused on revising existing evaluation forms, despite the well-recognized limitations of commonly used numeric rating scales.3

Recently, the University of Utah School of Medicine implemented a revised curriculum, which included substantive changes in the evaluation of clinical performance in the medicine clerkship.4,5 The new approach uses regularly scheduled, formal evaluation and feedback sessions, coupled with a vocabulary of global terms describing progressive levels of students' performance from “reporter,” to “interpreter,” to “manager/educator” (R-I-M-E).6,7 This system has been used for over 15 years at the Uniformed Services University of the Health Sciences (USUHS), and has been shown to be reliable8 and valid.9–11 We have demonstrated the feasibility and acceptability of implementing this system in our institution.5 Medical students enthusiastically support this method.5,12 Preliminary work has suggested that this evaluation system can detect students' progress during a clinical clerkship.13 To study this in our own clerkship, we hypothesized that the R-I-M-E vocabulary, used in the setting of formal evaluation and feedback sessions, would be more responsive to changes in student performance than would a method of global numeric ratings.

Method

The internal medicine core clerkship at the University of Utah, required of all students in the third year, is a 12-week course divided into two six-week periods; one is taken in each semester. Typically, 25 students (roughly one fourth of each third-year medical school class) are in the course at any time.
The first nine weeks are spent in a traditional inpatient setting. Students are assigned to one of three hospitals: the University Hospital, a Veterans Affairs Health Center, or a community hospital. In general, a ward team at each site includes two third-year students, two interns (PGY1, rotating by the calendar month), one resident (PGY2 or PGY3, rotating by four-week intervals), and an attending physician (rotating by three-week intervals). In all, there are 13 teams, 11 of which are considered general medicine. Though some students also rotate on the two subspecialty teams (cardiology and hematology-oncology), each student has at least six weeks of inpatient general medicine by the end of nine weeks. If his or her performance at this point is satisfactory (i.e., at least at the “reporter” level), the student advances to the three-week ambulatory portion of the clerkship, which concludes his or her clinical assignments for the clerkship.

Formal evaluation sessions were scheduled at regular three-week intervals throughout the inpatient portion of the clerkship, coinciding with the conclusion of the attending physician's tour of duty. The clerkship director (or an assistant), who had received training in running the evaluation sessions, moderated these meetings. The ward residents (PGY2 and PGY3), attending physicians, chief medical residents, and the medicine chair (or designee) attended the evaluation sessions. Beginning with the residents, each evaluator was asked to describe his or her students' performances using the R-I-M-E vocabulary, and to provide specific observations to support the ratings.

Briefly, a “reporter” is consistently good in interpersonal skills, and reliably obtains and communicates clinical findings. An “interpreter” is able to prioritize and analyze patients' problems. A “manager” consistently proposes reasonable options, which incorporate patients' preferences.
An “educator” has a consistent level of knowledge of current medical evidence, and can critically apply knowledge to specific patients.6

We modified Pangaro's descriptors6 by adding the term “observer” to describe expectations of students upon entry into the clerkship. An “observer” demonstrates reliability in keeping appointments and other behavior appropriate for a clinical shadowing experience, but neither independently nor consistently meets the performance level of the reporter. In addition to the five terms, evaluators were permitted to use intermediate steps between descriptors (e.g., “reporter-interpreter”) to identify students who, at times, but not consistently, performed at a higher level. The inclusion of intermediate steps yielded a nine-point scale. Evaluators were directed to conclude their critiques by identifying a specific “next step” to which students' future efforts might be directed.

Instructors also completed a standardized clerkship evaluation form. This form included 12 domains of students' performances (e.g., reliability and commitment, fund of knowledge, history-taking skills). Within each domain, instructors used a five-point Likert scale, with specific behavioral descriptors anchoring each level of performance, to rate the students. Written comments and a final global numeric rating (0 = failing, 1 = marginal, 2 = competent, 3 = very good, 4 = excellent) were also required. Instructors were told that they could use intermediate ratings on the numeric scale, such as “1.5,” thus yielding the potential for a nine-point scale. The numeric ratings were neither verbally requested by the clerkship director nor discussed at the session. Evaluation forms were turned in at the end of each evaluation session. When instructors changed during a student's clerkship experience, the new instructors were unaware of prior instructors' verbal or written evaluations.
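The two nine-point scales described above can be placed on a common numeric footing for analysis. The following is a minimal sketch, not the software used in the study: it assumes the descriptors are coded in half-point steps from observer = 0 to educator = 4, and that session-to-session comparisons pair each student's ratings. The function names and the four-student data set are hypothetical and are included only to illustrate the arithmetic.

```python
from math import sqrt
from statistics import mean, stdev

# The five descriptive terms, in order of increasing performance.
TERMS = ["observer", "reporter", "interpreter", "manager", "educator"]


def build_scale(terms):
    """Map each term, and each intermediate step, to a score in 0.5 steps."""
    scale = {}
    for i, term in enumerate(terms):
        scale[term] = float(i)
        if i + 1 < len(terms):
            # Intermediate step, e.g. "reporter-interpreter" = 1.5
            scale[f"{term}-{terms[i + 1]}"] = i + 0.5
    return scale


RIME_SCALE = build_scale(TERMS)  # nine levels, 0.0 through 4.0


def mean_rating(descriptors):
    """Mean numeric score for a list of descriptive ratings."""
    return mean(RIME_SCALE[d] for d in descriptors)


def paired_t(before, after):
    """Paired t statistic for matched scores (degrees of freedom = n - 1)."""
    diffs = [a - b for b, a in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical ratings for four students at two sessions (not study data).
session1 = ["observer-reporter", "reporter", "reporter", "reporter-interpreter"]
session2 = ["reporter", "reporter-interpreter", "interpreter", "interpreter"]

scores1 = [RIME_SCALE[d] for d in session1]
scores2 = [RIME_SCALE[d] for d in session2]

print(mean_rating(session1), mean_rating(session2))
print(paired_t(scores1, scores2))
```

The sketch returns only the t statistic; a two-tailed p value would additionally require the t distribution with n - 1 degrees of freedom, which a statistics package supplies.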
For statistical analysis, the R-I-M-E descriptors were transformed to the following numeric scale: observer = 0, observer-reporter = 0.5, reporter = 1, reporter-interpreter = 1.5, interpreter = 2, interpreter-manager = 2.5, manager = 3, manager-educator = 3.5, and educator = 4. Mean ratings for both descriptive and numeric systems were calculated separately for residents and attendings for each of the three evaluation sessions. Since not all students rotated on a subspecialty service during the nine-week clerkship, we excluded evaluations from these services because there were too few of them. Differences in mean ratings across the sessions were compared by paired t-test (two-tailed). A statistical software package was used for the analysis.

Results

Ninety-seven students participated in the clerkship during the study period. At each of the three evaluation sessions, the evaluation completion rate by residents and attending physicians was >75% for each evaluation method. Table 1 shows the mean ratings for the R-I-M-E descriptors and the numeric values given by residents and attending physicians at each evaluation session. The R-I-M-E descriptors demonstrated greater changes in mean ratings over subsequent evaluation sessions than did the global numeric method. The changes in R-I-M-E ratings were statistically significant (p < .05 for both residents' and attending physicians' evaluations) across all three evaluation sessions, while numeric ratings did not consistently change until the third session.

TABLE 1: Descriptive and Numeric Ratings at Three Evaluation Sessions of 97 Students in an Internal Medicine Clerkship, University of Utah School of Medicine, 1999–2000

The frequency distribution of descriptive and numeric ratings provided by residents and attending physicians at the second evaluation session is shown in Figure 1.
In contrast to the numeric system, descriptive evaluations were distributed more normally and had greater range; this finding was observed at each evaluation session. The frequency analysis also demonstrated a rapid “ceiling effect” for the numeric ratings, beginning at the first evaluation session and persisting over the subsequent evaluations.

Figure 1: Frequency distribution of ratings of students from the second evaluation session. Shown are the numbers of ratings from instructors (residents and attending physicians combined) for each category. Descriptive evaluations were more normally and more widely distributed, and did not demonstrate the ceiling effect seen with numeric evaluations. On the horizontal axis, O = observer, R = reporter, I = interpreter, M = manager, and E = educator; intermediate ratings (e.g., I/M = interpreter/manager) are also included.

Discussion

There is a need for clinical performance evaluation methods that are reliable, valid, based on educational objectives, and free from influences such as grade inflation. Our results show that, using a method of descriptive assessment, instructors were able to detect and describe significant changes in students' performances over each successive three-week interval of observation, in contrast to evaluations using the global numeric system. Furthermore, the magnitudes of the changes observed using the descriptive method were consistently greater than those produced by the numeric ratings. Although the increase in numeric ratings was statistically significant by the final evaluation session, the mean ratings clustered around “3, very good.” In contrast, the descriptive ratings clearly showed growth from reporter, through interpreter, and toward manager. This information is more educationally meaningful to clerkship directors and students alike.

There may be several reasons for these findings.
First, the R-I-M-E descriptors are based on specific behaviors and performances and thus may be less prone to an individual instructor's interpretation. The global numeric descriptors are less specific and often provide little guidance as to their application in relation to the core objectives of the clerkship. Second, participation in the evaluation sessions allows the clerkship director the opportunity to discuss an instructor's comments and consider them in the context of the clerkship goals. Comments that do or do not substantiate a given level of a student's performance can be quickly identified, and feedback to the instructor can be given immediately. This “case-based” faculty development aspect of the evaluation sessions has been described in detail elsewhere.14

Another important finding of our study is that the frequency plots of the R-I-M-E descriptive ratings show a remarkably broad use of the rating range by evaluators. This was true at each evaluation session, but it is most striking for the data collected at the second session, in which all nine categories contained at least one evaluation. In contrast, the numeric ratings were concentrated at the upper end of the scale (higher numeric ratings) from the time of the first evaluation session, with an impressive ceiling effect observed in frequency distributions at the third session.

Considering our results in the context of theory offers additional insights. Over the past two decades, the transtheoretical model (TTM) has emerged as a meaningful and useful framework with which to understand and investigate the mechanisms of intentional behavioral changes.15 Though it has been studied and used primarily in the setting of chemical addiction, it provides a useful paradigm for understanding how and why the learners in our study changed. According to the TTM, the process of real, sustained change occurs over five sequential stages: precontemplation, contemplation, preparation, action, and maintenance.
It is reasonable to believe that, in the setting of clinical training, medical students are at least in the contemplation stage of change (i.e., they are thinking seriously about improving their performances). The power of this form of descriptive evaluation is that it provides a rhetoric, a stable language, by which learners and teachers may consider current performance (contemplation), discuss goals and expectations, including the “next step” (preparation), and share an environment where progress can be demonstrated (action and maintenance).

We acknowledge several assumptions and limitations to our work. First, this was not a randomized, controlled study testing a preconceived hypothesis grounded in current theory. The significance of our findings is enhanced when the findings are considered within the context of existing stage theory (e.g., TTM), though this was postulated post hoc, after the data had been collected and the analyses completed. Second, we assumed that students actually do progress over the course of the clerkship. This seems reasonable, though it is not proved by our work. Third, transforming the R-I-M-E descriptors to numbers for analysis is based on the assumption that the “distances” between successive descriptors (e.g., “reporter” → “interpreter” and “interpreter” → “manager”) are equal. While this may or may not be true, it does not affect our comparisons between global descriptors and global numeric ratings, because the assumption is applied equally to both systems. Fourth, our study is limited to one clerkship in a single institution. Nevertheless, preliminary work at another institution has also demonstrated that the R-I-M-E descriptors can detect student growth during an internal medicine clerkship.14 We do not know whether the findings are generalizable across disciplines.
Fifth, if the numeric rating scale had been used as the evaluating template for verbal discussion, it is possible that the evaluators might have shown improvement in discrimination and greater change over time. Our study does not address the question of whether discrimination and progress are driven by the descriptors per se, the interactive evaluation and feedback sessions, or both. Finally, this one-year cohort study reports changes observed in mean ratings for a population of students, not for individual students. Work is under way to determine whether factors can be identified that predict individual student growth or academic “failure to thrive.”

Despite these limitations, our findings are important for several reasons. To our knowledge, this is the first report of a “real-time” clinical assessment tool capable of producing ratings that span the range of the scale. In addition, we have shown that it is a method that can limit grade inflation and ensure that evaluations of students' performances are systematic and based on course expectations, rather than the nuances of individual raters. Perhaps most important, this improvement in evaluation is directly linked to feedback to students that is honest, direct, and timely, and can form the basis for helping each individual student achieve his or her “next step.”
