Measuring Academic Production—Caveat Inventor

Abstract

In response to the impact of the rapidly changing economic environment of health care on the financing of academic medicine, medical school faculties have started to develop systems that measure their increasingly complex and diverse clinical, teaching, and scholarly activities. These efforts at individual institutions were given a boost in 1998 when the Association of American Medical Colleges (AAMC) formed the Mission-based Management (MBM) Program.1,2 The AAMC guidelines integrate and enhance features of previously published relative value systems developed at individual medical schools.3–7 Expert panels were established to develop a framework that schools could adapt to create their own measurement systems, with separate panels for medical education, research, and patient care. The panels envisioned that these systems would provide data useful for developing careers, rewarding faculty contributions, and aiding budgetary decision making.1,2

Central in the AAMC guidelines for developing such systems is the concept of judging the relative values of faculty activities. Based on our reading of the AAMC reports and earlier literature, and our experience in developing such a system for measuring the relative values of educational, research, clinical, and administrative activities in one department, we have some concerns that we wish to share with the readers of Academic Medicine.

OUR EXPERIENCE

In June 1998, the administration of the University of Oklahoma College of Medicine requested each department to develop a system for measuring the relative values of faculty members' contributions. The purposes of the exercise included demonstrating for the state legislature and potential private donors how much work the medical school faculty does, evaluating individual faculty members or programs, and comparing departments.
During the process of developing the relative-value-based system for our department, we observed some features of previous models and our own that threaten their acceptability and validity. Indeed, many of these problems were anticipated by the AAMC panels.1,2 While some of these problems may be due to the particular features of these systems or conditions unique to the process of their development, others seem inherent in the very nature of the endeavor of measuring the relative values of academic activities.

Measuring the Full Range of Relative Values

Medical school faculty carry out a broad range of activities. A numerical measure used as an “activity weight”1 must be able to reflect both the smallest and the largest values of these activities. It has been suggested that in determining an “activity weight,” consideration be given to the time required to produce a “unit” of the educational activity, the time and effort involved in preparation, the skill required, and the priority of the activity with regard to the educational mission of the school. Though the AAMC panel recommendations do not outline a specific process for assigning activity weights, previous measurement systems have judged the value of the activity's product on a 1-to-10 scale, and then multiplied it by an estimate of the amount of time required.3–5 The range of the resulting products can be broad enough to cover the range of academic activities. However, there is a question whether such constrained “activity weight” scales have the qualities appropriate for multiplying. Although each number on such scales has more value than the number below it, an “8” is not necessarily twice as good as a “4,” and because the step from “1” to “2” is not necessarily the same as the step from “9” to “10,” the “relative value unit” scale can be no better than an ordered categorical scale.
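The difficulty with multiplying ordered-category scores can be illustrated with a small sketch. All numbers here are hypothetical, not data from any department, and `convex_relabel` is just one arbitrary order-preserving way to renumber a 1-to-10 scale. If the scores carry only rank information, any such renumbering is as defensible as the original labels, yet it can reverse which of two faculty members appears more productive.

```python
# Hypothetical illustration: products of ordinal scores and hours are not
# stable under order-preserving relabeling of the scores.

def total_value(portfolio, weight=lambda score: score):
    """Sum of weight(score) * hours over (score, hours) pairs."""
    return sum(weight(score) * hours for score, hours in portfolio)

def convex_relabel(score):
    """An arbitrary renumbering that keeps the order of the scores 1..10."""
    return score ** 2 / 10

# Each portfolio is a list of (score on a 1-to-10 scale, hours spent).
faculty_a = [(8, 10)]   # fewer hours on a highly rated activity
faculty_b = [(4, 22)]   # more hours on a lower rated activity

linear = (total_value(faculty_a), total_value(faculty_b))
relabeled = (
    total_value(faculty_a, convex_relabel),
    total_value(faculty_b, convex_relabel),
)

print("original labels: A =", linear[0], " B =", linear[1])
print("relabeled scale: A =", relabeled[0], " B =", relabeled[1])
```

With the original labels, B's total exceeds A's; after the relabeling, A's exceeds B's, even though the order of the ten scores never changed. A scale with true ratio properties would not show this reversal under a simple change of units.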
Thus there is reason for concern about the validity of sums and products of relative value units, such as an individual's or a department's annual production. Hoping to avoid this problem, our system used an open-ended ratio scale (any value greater than or equal to 0). We observed, however, that there seems to be a natural tendency to collapse the numerical range of judgments to a familiar range, such as 1 to 10. Despite the potentially broader range offered by our ratio scale, the relative values derived from our faculty members' judgments still seemed to underestimate the differences in value between diverse faculty activities. For example, a full-time book writer would need to solo-author 19 books a year to produce value equal to that created by a full-time clinician.

Arbitrariness of Certain Elements of the Scaling Process

The design of a relative-value scaling system may favor some faculty activities over others. We had the choice of whether to use means or medians of faculty relative value estimates to derive “activity weights” and found that the approach we chose could favor some activities over others. Many administrative activities were assigned higher relative values by the median-based transformation, while many scholarly activities were favored by the mean-based transformation. Though each of these resulting scales may look valid, subtle differences would make substantial differences in career income if translated into sustained salary changes.8,9 Differences in the weighting scheme, or indeed in the judgments of the relative values of particular activities, may have a larger impact when applied to groups of faculty who have very diverse portfolios of activities.

Who Shall Judge the Relative Values of Faculty Activities?

Should the judgments of relative value be made by department leaders or by the entire faculty?
Our system involved the faculty in identifying the list of activities, rating all activities, and counting the activities they did personally in a year. The resulting relative-value scales and estimates of departmental production were based on the faculty members' judgments. There are several disadvantages of having all faculty members rate the activities; these disadvantages may offset the legitimacy provided by the wide participation. Though the faculty ratings may be very pertinent for the allocation of work or rewards within a department, for other activities, such as assessing adherence to the objectives of an institutional strategic plan, perhaps the central administration's assessment is more pertinent.

In our experience, the faculty initially resented and resisted the rating task, partly because the mandate to develop the system was imposed from outside the department and the eventual use of the system was unclear. This may have affected the quality of the estimates, which points out the need for a careful approach to the introduction of any system recommended by an educational panel.1 Another problem was that not all individuals followed the instructions to use a ratio scale when they rated the activities. A few individuals, for example, gave exaggerated ratings for particular classes of activity. Consistent and appropriate use of a ratio scale for magnitude estimation may require more than written instructions. Perhaps faculty members could be given training or make judgments under the guidance of an interviewer trained to check for consistency. Such considerations support delegating the judgments to a highly motivated committee, despite the bias that some committee processes may produce.

A second problem is the tendency to game the system. Faculty members may make ratings to their own benefit, to say “What I do is more valuable.” This tendency would surely be exacerbated if the answers had budgetary consequences.
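A toy calculation can make the gaming concern concrete. The sketch below uses invented numbers only: a hypothetical department of seven clinical and three nonclinical raters, an assumed "honest" value of 10 for every activity, and self-serving raters who double the rating of their own class of activity. It is not a reconstruction of any department's survey data.

```python
from statistics import mean, median

# All figures are hypothetical and chosen only for illustration.
HONEST = 10
N_CLINICAL, N_NONCLINICAL = 7, 3

def ratings(own_group_size, other_group_size, inflation):
    """Ratings of one activity class: its own practitioners inflate the
    honest value; everyone else reports the honest value."""
    return [HONEST * inflation] * own_group_size + [HONEST] * other_group_size

# Every rater doubles the rating of his or her own class of activity.
clinical_activity = ratings(N_CLINICAL, N_NONCLINICAL, 2)
nonclinical_activity = ratings(N_NONCLINICAL, N_CLINICAL, 2)

# One nonclinical rater replaces a doubled rating with an extreme one.
nonclinical_with_outlier = [1000] + nonclinical_activity[1:]

for label, data in [
    ("clinical activity", clinical_activity),
    ("nonclinical activity", nonclinical_activity),
    ("nonclinical activity, one extreme rater", nonclinical_with_outlier),
]:
    print(f"{label}: mean={mean(data)}, median={median(data)}")
```

Under these assumptions the median for clinical activity lands on the inflated value while the median for nonclinical activity stays at the honest value, because the larger group sets the median. The mean behaves differently: a single extreme rating is enough to lift the nonclinical mean above the clinical one, while leaving the median unchanged.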
One might think that when all faculty members were surveyed, the exaggerations would cancel each other out. However, that would depend on the relative numbers of people engaged in the various activities. If the mean of all responses is used, then the larger the numbers one uses the more one can influence the total. If the median is used and all faculty members exaggerate to serve themselves, the majority will control the median. It is not clear that a technical solution can be found for this problem. We saw some evidence of gaming the system by our faculty. Overall, when both clinical and nonclinical faculty members rated clinical and nonclinical activities, both groups of faculty rated the clinical activities about equally, but the nonclinical faculty rated nonclinical activities higher than the clinical faculty did.

Potential Impact of Allocating Resources Based on the Relative Values of Faculty Activities

If our system or any other relative value measure were to be taken as the basis for a system for allocating resources to individuals, such as determining salary changes, one must consider how individuals would allocate their efforts. Rewarded, measured activities might drive out less-rewarded, less-measured activities. Indeed, this is explicitly recognized as a fundamental concern by the AAMC panel on the evaluation of research.2 It is not certain that individuals' best contributions to science, to teaching, or to service, given their unique backgrounds, would be elicited in the context of a system that rewarded only countable activities. Instead, individuals might choose those activities that are most easily counted or most certain to be completed. For example, the uncertainty of pursuing grants and publications might make that a poor bet compared with the certainty of signing up to teach or to see patients.
As more faculty members choose those activities, would they also give them higher values during subsequent periodic reviews and further drive down the relative value of scholarly activities?

The problems of gaming the system would also apply at the institutional level, where department chairs might exaggerate the value of their department's contributions. The motivation for this could be quite powerful. While the budget changes based on a relative-value measurement system have been as small as +2.5% or −3.1%,10 they have also been as high as +18% or −30%.7

CONCLUSIONS

In conclusion, the results of our department's experience in developing a relative-value-based system measuring faculty contributions demonstrate the need for each organization to take care to ensure that relative values used to measure production reflect the judgments of their members in a valid and equitable manner. Because of the problems we have identified with previous systems and the fact that most of these problems are general to the attempt to measure the relative values of faculty activities, we think that the application of such measurements to budgetary decisions is premature.

References
