A Generalizability Study of a Standardized Rating Form Used to Evaluate Instructional Quality in Clinical Ambulatory Sites

Abstract

Social, economic, and scientific developments have shifted much of clinical training from teaching hospitals to outpatient settings in diverse communities.1–3 These training sites labor under the same economic pressures as academic health centers. They must generate income and satisfy patients while also partnering in medical education, possibly as a secondary mission. Thus, concerns exist about the quality of instruction outside the academic health center.3 There is a need to measure instructional quality accurately and reliably in diverse clinical teaching sites.4

Traditionally, course directors have relied upon student ratings of teacher performance, and these instruments have been viewed as measures of teachers' effectiveness. However, the measures have neither been standardized across institutions nor extensively validated. In addition, research findings on their reliability have been mixed,5,6 and they may miss other important contributors to clinical instruction.7

A different assessment strategy uses measures of instructional processes in the outpatient setting and attempts to focus on student-centered learning.8 The MedEd IQ© is a student questionnaire derived from quality improvement theory and grounded in experiential learning theory.7 It measures four constructs that contribute to instruction in the office setting: precepting activities and three more novel constructs called learning opportunities, the learning environment, and learner involvement.9 The instrument measures these constructs as processes and is not intended to assess students' performances. Significant research has gone into establishing the content validity of this instrument.8 Questions remain, however, about whether the instrument can reliably measure the three process constructs that are separate from the traditional notions of teaching.

Understanding and compensating for error is essential for designing a reliable and valid approach to measuring the quality of clinical teaching sites. Generalizability theory10,12 allows the sources and relative magnitudes of errors in the rating process to be quantified. This study examined the generalizability of the MedEd IQ instrument ratings for multiple clinical teaching sites. It addressed three specific questions: (1) To what degree is the MedEd IQ measuring “true” score differences in sites, and to what extent is the score based on identifiable error? (2) What is the effect on measurement precision of varying the numbers of student raters and MedEd IQ items for each subscale? (3) What is the minimum number of observations required to ensure a reasonably precise measurement of instructional quality within the three domains at the clinical site?

Methods

The MedEd IQ, a 33-item questionnaire, is completed by students at the conclusion of their clinical training experience. This research examined the measurement of three constructs of instructional quality within the ambulatory setting. These constructs included the learning environment (adequacy of the clinical site as a primary care classroom), the learning opportunities (availability of situations to experience clinical learning), and learner involvement (ability of the learner to participate in relevant clinical experiences).8,9

The role of the site as an effective learning environment was examined by asking the students about their experiences at the clinical site. Six items assessed this construct, for example: “The site was set up so that I could easily join in patient care” and “I felt like my time was ‘wasted’ due to the way things were run.” A five-point response scale assessed agreement from disagree strongly to agree strongly. The second construct measured the availability of learning opportunities using six items.
Again, a five-point agreement scale was used to measure responses to items such as “I had the opportunity to increase my independence in providing patient care.” Conversely, another item assessed whether the available opportunities “were repetitive without offering new learning experiences.” The third scale measured learner involvement in clinically relevant experiences through seven items. Items studied included participation in taking patients' histories and doing physical examinations as well as level of decision making through using laboratory testing and radiology services, assessing pathophysiologic mechanisms, and arriving at diagnoses and treatment plans. The response scale for these items was a three-point scale: “no participation,” “minimal participation,” and “good deal of participation.” For ease of comparison with the two other scales, the involvement scale was converted to a five-point scale (one, three, and five).

The MedEd IQ was administered in first- and third-year courses at two publicly funded medical schools in the Northeast. For first-year Introduction to Clinical Medicine courses, students completed the instrument at the end of the academic year. For the third-year course, students completed the instrument at the end of the six-to-eight-week family medicine clerkship. In both courses, students learn in clinical settings where they are exposed to patients, their problems, and clinical teachers who guide instruction. Responses were confidential, and no feedback was reported to the preceptor or site until at least four surveys had been completed.

A total of 1,872 students were enrolled in the two courses at the two medical schools. A response rate of 78.8% was achieved, with 1,475 questionnaires available for analysis assessing 249 clinical sites. An average of 5.7 ratings was completed for each site (median = 3, range 1–35). As described earlier, each MedEd IQ questionnaire was composed of four sections.
Data from the three site-specific sections represent the three subscales used in this study, and separate analyses were performed for each subscale. Missing values within a subscale resulted in removal of the subscale observation from the analysis. Sites that had been rated four or more times (n = 116) were eligible for inclusion in the generalizability study (G study). The final balanced data set employed in estimating the variance components within the G study contained ratings of 116 sites (464 observations) for the learning environment and learning opportunity subscales and 114 sites (456 observations) for the learner involvement subscale. For sites with more than four observations, observations were randomly sampled within the site using special code in the statistical software with a random-number generator. A balanced sample of four ratings per site was used in the analysis.

Estimation of the magnitude of each variance component represents the outcome of the G study (see Table 1). In this case, estimates for the object of measurement (sites), items, raters nested within sites, the site-by-item interaction, and the residual effects were obtained using a (r:s) × i design with a random-effects analysis of variance (ANOVA) procedure to partition variance into the sums of squares appropriate to each effect. The decision study (D study) was used to estimate projected reliabilities under various conditions of measurement, in this case, varying the numbers of raters and items. The D study calculated generalizability coefficients (G coefficients), which are analogous to reliability coefficients, to understand the outcomes for various conditions of measurement.

TABLE 1: Generalizability Study: Variance Component Estimates for (r:s) × i Design

Technical aspects of analysis. The model characterizing the analysis follows.
Variance attributable to site is represented by “s,” item is represented by “i,” and student rater by “r.” A tilde (∼) is used to represent an effect. With raters nested within sites, and both raters and sites crossed with items, the score $X_{sir}$ assigned to any site (s) by any rater (r) on any item (i) can be represented as:

$$X_{sir} = \mu + \tilde{\mu}_{s} + \tilde{\mu}_{i} + \tilde{\mu}_{r:s} + \tilde{\mu}_{si} + \tilde{\mu}_{ri:s}$$

The term μ is the overall mean across all sites, raters, and items. The remaining terms in the equation are score effects. For example, the site score effect is:

$$\tilde{\mu}_{s} = \mu_{s} - \mu$$

where $\mu_{s}$ is the mean for site s across all raters and items. The generalizability coefficients were calculated as follows:

$$\rho^2 = \frac{\sigma^2(s)}{\sigma^2(s) + \dfrac{\sigma^2(r{:}s)}{k} + \dfrac{\sigma^2(si)}{m} + \dfrac{\sigma^2(ri{:}s)}{km}}$$

where the σ²(·) terms are the estimates of variance components, k is the number of raters for the decision study, and m is the number of items for the decision study.

Results

Generalizability study. In total, five sources of variance were evaluated. The variance components are displayed in Table 1 along with the standard error (SE) and the percentage of variance accounted for. The percentage of variance attributable to the true measurable differences between sites is shown on the “sites” line. This is variance not otherwise explained by the other sources of error and is sometimes referred to as “true” score variance. For example, 17% of the variance in the involvement subscale scores was due to true differences between sites. The “item” variance component (representing the degree to which the questions themselves were the source of variance) represented a relatively small source of error (4–17%). Student raters nested within sites represented a large source of error variance (25–30%), along with the residual random effects (34–54%).

Decision study. Projected G coefficients (ρ²) for each of the three scales are displayed graphically in Figure 1. Values of ρ² are shown for five to 20 items and for two to 18 ratings.
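The G-study estimation and D-study projection described above can be sketched in a few lines. This is a minimal sketch and not the authors' code: it assumes a balanced data set indexed as x[site][rater][item], the usual expected-mean-square solutions for a random-effects (r:s) × i design, and the relative-error form of the G coefficient, ρ² = σ²(s) / [σ²(s) + σ²(r:s)/k + σ²(si)/m + σ²(ri:s)/(km)]. The function names and any numeric inputs are hypothetical.

```python
from statistics import fmean


def variance_components(x):
    """Estimate variance components for a balanced (r:s) x i random-effects design.

    x[s][r][i] is the rating given to site s by its r-th rater on item i.
    Returns a dict keyed by effect: "s", "i", "r:s", "si", "ri:s".
    """
    n_s, n_r, n_i = len(x), len(x[0]), len(x[0][0])
    sites, raters, items = range(n_s), range(n_r), range(n_i)

    grand = fmean(x[s][r][i] for s in sites for r in raters for i in items)
    m_s = [fmean(x[s][r][i] for r in raters for i in items) for s in sites]
    m_i = [fmean(x[s][r][i] for s in sites for r in raters) for i in items]
    m_rs = [[fmean(x[s][r]) for r in raters] for s in sites]
    m_si = [[fmean(x[s][r][i] for r in raters) for i in items] for s in sites]

    # Mean squares: sum of squares divided by degrees of freedom.
    ms_s = n_r * n_i * sum((m_s[s] - grand) ** 2 for s in sites) / (n_s - 1)
    ms_i = n_s * n_r * sum((m_i[i] - grand) ** 2 for i in items) / (n_i - 1)
    ms_rs = n_i * sum((m_rs[s][r] - m_s[s]) ** 2
                      for s in sites for r in raters) / (n_s * (n_r - 1))
    ms_si = n_r * sum((m_si[s][i] - m_s[s] - m_i[i] + grand) ** 2
                      for s in sites for i in items) / ((n_s - 1) * (n_i - 1))
    ms_res = sum((x[s][r][i] - m_rs[s][r] - m_si[s][i] + m_s[s]) ** 2
                 for s in sites for r in raters for i in items
                 ) / (n_s * (n_r - 1) * (n_i - 1))

    # Solve the expected-mean-square equations. Estimates can come out
    # negative in small samples; they are left unclamped in this sketch.
    return {
        "s": (ms_s - ms_rs - ms_si + ms_res) / (n_r * n_i),
        "i": (ms_i - ms_si) / (n_s * n_r),
        "r:s": (ms_rs - ms_res) / n_i,
        "si": (ms_si - ms_res) / n_r,
        "ri:s": ms_res,
    }


def g_coefficient(vc, k, m):
    """Projected relative G coefficient for k raters per site and m items."""
    rel_error = vc["r:s"] / k + vc["si"] / m + vc["ri:s"] / (k * m)
    return vc["s"] / (vc["s"] + rel_error)
```

Calling g_coefficient over a grid of k and m values yields the kind of projection plotted in Figure 1. With purely illustrative components (sites 0.2, raters within sites 0.3, site by item 0.1, residual 0.4), four raters and five items give a projected coefficient of about .63.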
A dramatic advantage in terms of ρ² is observed with the use of additional raters, whereas only a small advantage is obtained with the use of more items. The gain from additional raters is apparent with up to 14 raters, after which diminishing returns in reliability are obtained. For example, for learner involvement, four to six raters marking five items would achieve a reliability of approximately .6.

Figure 1: Reliabilities of scores for (A) learning environment, (B) learning opportunity, and (C) learner involvement as measures of instruction quality.

Discussion

The purpose of this study was to examine the generalizability of the MedEd IQ in diverse clinical training sites and assess its ability to reliably measure differences among those sites for three constructs important to instructional quality: learning environment, learning opportunities, and learner involvement. The decision study provides information about the reliability of the instrument under various conditions of measurement.

Results from the generalizability study of the MedEd IQ questionnaire confirm that reliabilities from a single form are clearly too low for use in evaluating a site. Variation among raters with different preferences and motivations can overwhelm the true differences among the sites. Therefore, as demonstrated in the decision study analysis, increasing the number of raters would be an effective method for reducing this error. In practice, multiple students rotating through a given site should complete this questionnaire over time, because averaging across multiple raters reduces the rater-related error. The decision study indicates that when a site is observed multiple times by multiple raters, an acceptably reliable measure of site quality can be obtained. For this instrument, the constructs of learning opportunities and learning environment would require ten to 14 ratings to achieve a reliability of .7.
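Rater counts of this kind can be read off a D-study projection by searching for the smallest number of raters whose projected coefficient reaches a target. A minimal sketch follows, assuming the relative-error G coefficient for the (r:s) × i design; the function names, the search cap, and the variance components used in the note below are hypothetical placeholders, not the Table 1 estimates.

```python
def g_coefficient(var_s, var_rs, var_si, var_res, k, m):
    """Projected relative G coefficient for k raters per site and m items."""
    error = var_rs / k + var_si / m + var_res / (k * m)
    return var_s / (var_s + error)


def min_raters(var_s, var_rs, var_si, var_res, m, target=0.7, max_raters=200):
    """Return the smallest k whose projected coefficient reaches the target.

    Returns None if the target is not reached within max_raters, which
    happens when the site-by-item term alone keeps the ceiling below it.
    """
    for k in range(1, max_raters + 1):
        if g_coefficient(var_s, var_rs, var_si, var_res, k, m) >= target:
            return k
    return None
```

With illustrative components of 0.2 (sites), 0.3 (raters within sites), 0.1 (site by item), and 0.4 (residual) and six items, min_raters(0.2, 0.3, 0.1, 0.4, m=6) returns 6. A target of .95 returns None for the same inputs, because the site-by-item term does not shrink as raters are added.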
Interestingly, the learner involvement subscale achieves a reliability of .7 with only six to eight raters. A follow-up study obtaining a second random sample shows this result to be stable. Further investigation is ongoing and focuses on the reliability differential and the influence of the response scale.

For sites that have low volumes of learners, the lower reliability of the MedEd IQ would suggest interpretive caution. However, for sites with high volumes of learners, course directors may be reassured that an instrument is available to monitor these components of instructional quality.

Limitations. In this study, we were unable to assess the instrument's properties with respect to differentiating outstanding preceptors. While we recognize the importance of this endeavor, it will require a different analytic approach and will provide the opportunity in a later study to examine how preceptor activities vary in relation to other site characteristics. Our sample subset represents 116 clinical teaching sites affiliated with two medical schools in the Northeast. It is possible that our sample somewhat limits the generalizability, but we know of no evidence that suggests that these clinical sites are different from those in other locales. In this study, we used those sites that were more frequently used as teaching sites, and this may also represent a bias.

Conclusions

We believe that this study provides strong evidence that the MedEd IQ, when used with a properly designed assessment strategy (i.e., multiple ratings of ten to 12 per site), can be a useful adjunct in evaluating sites. The variance associated with relevant aspects of the measurement process was quantified. Depending on the measurement-precision requirements, we identified methods to achieve acceptable reliability. For sites that train at least five students per year, reasonable assessments of those sites can be made within two years.
Additionally, our study confirms the importance of site qualities that appear to differentiate clinical instructional sites separately from traditional notions of teaching effectiveness. Instruction in clinical teaching sites should be a focus of continued research and ongoing quality assessment strategies.
