Measuring health-related quality of life in childhood cancer: Lessons from the workshop (discussion)

Abstract

In this article, I focus on some of the themes that emerged from the reports presented at the International Workshop and from the discussion that followed. In particular, I address the need for investigators to be clear about their purposes and the implications of different goals, alternative strategies for instrument development, and the uses of parent and child ratings of health-related quality of life (HRQL).

One goal of HRQL measurement is to differentiate between people who have a better HRQL and those who have a worse HRQL; an instrument designed for this purpose is a discriminative instrument (Kirshner and Guyatt, 1985). We may be interested in the range of problems experienced by those with childhood cancer and their parents; in how that range differs among children with different cancers, or between children with cancer and either normal children or children with other health problems; or in the correlates of good and poor HRQL (such as family function or socio-economic status).

There are 3 key measurement properties necessary for a well-functioning discriminative instrument (Guyatt et al., 1993). The first is a high ratio of signal to noise (Guyatt et al., 1992). For discriminative instruments, the signal-to-noise ratio is quantified as “reliability”. If the variability in scores between patients (the signal) is much greater than the variability within patients (the noise), an instrument will prove reliable. Reliable instruments will generally yield more or less the same results when administered repeatedly to stable patients. Investigators quantify test–retest reliability using an intraclass correlation coefficient, which is the ratio of the between-patient variability to the total explainable variability (the between-patient variability plus the within-patient variability). Test–retest reliability should be distinguished from internal consistency, which is also sometimes called “reliability”.
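The ratio of between-patient to total variability described above can be illustrated with a minimal sketch. It assumes a one-way random-effects model with the same number of administrations per patient; this is only one of several forms of the intraclass correlation coefficient. The function name and the scores are hypothetical and are not taken from any workshop report.

```python
from statistics import mean


def icc_test_retest(scores):
    """One-way random-effects ICC (illustrative sketch).

    scores: list of lists, one inner list per patient, each holding that
    patient's scores on k repeated administrations (same k for everyone).
    """
    n = len(scores)
    k = len(scores[0])
    grand_mean = mean(x for row in scores for x in row)
    patient_means = [mean(row) for row in scores]

    # Mean squares from a one-way ANOVA with patients as the groups.
    ms_between = k * sum((m - grand_mean) ** 2 for m in patient_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, patient_means) for x in row
    ) / (n * (k - 1))

    # Variance components: between-patient (signal), within-patient (noise).
    var_between = (ms_between - ms_within) / k
    var_within = ms_within
    return var_between / (var_between + var_within)


# Hypothetical data: 4 stable patients, each tested twice.
consistent = [[10, 11], [20, 19], [30, 31], [40, 40]]
inconsistent = [[10, 30], [20, 5], [30, 12], [40, 22]]
print(icc_test_retest(consistent))
print(icc_test_retest(inconsistent))
```

When repeated scores within a patient barely move relative to the spread between patients, the estimate approaches 1. When within-patient noise dominates, it falls towards 0 and, in small samples, the estimate can even be negative.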
A questionnaire or a domain of a questionnaire is internally consistent to the extent that its items measure the same underlying construct. The usual measure of internal consistency, Cronbach's alpha, increases as the number of items in a domain or questionnaire increases and as the mean inter-item correlation increases. Thus, questionnaires with redundant items may achieve a higher Cronbach's alpha, and very efficient questionnaires with few items may have relatively low values. Direct measurement of test–retest reliability is therefore preferable. Cronbach's alpha is particularly unhelpful for evaluative instruments, which I will discuss below.

The second key property of discriminative instruments is validity. Validity concerns whether the instrument measures what it is intended to measure. One can categorize validity as face validity (does the instrument appear to measure what it is intended to measure?), content validity (do the items, or questions, in the instrument comprehensively sample the domain of interest?) and construct validity. A construct is a theoretically derived notion of the domain(s) we wish to measure. An understanding of the construct will lead to expectations about how an instrument should behave if it is valid. Construct validity therefore involves comparisons between measures and examination of the logical relationships that should exist between a measure and characteristics of patients and patient groups.

The first step in construct validation is to establish a “model”, or theoretical framework, that represents an understanding of what investigators are trying to measure. That theoretical framework provides a basis for understanding how the system being studied behaves and allows hypotheses or predictions about how the instrument being tested should relate to other measures. For example, Eiser et al. (1999) have developed an instrument with a very strong theoretical basis built around measuring discrepancies between actual and ideal self-concepts in relation to illness-related activities and goals.

Once they have developed their theoretical model, investigators test the performance of the instrument. A relatively weak form of validation is to demonstrate that the instrument can distinguish between groups in a predictable manner. For instance, Seid et al. (1999) demonstrated that their instrument showed different scores in children on and off treatment. Another relatively weak approach is to examine correlations between measures but only in terms of the extent to which the correlations are unlikely to be due to the play of chance (the p value). A stronger approach is to look at the magnitude of the correlation between measures. For example, Parsons et al. (1999) report a correlation of 0.49 between children's and physicians' ratings of physical function. The strongest validation comes when investigators make a priori predictions about the relationships they expect if instruments perform the way they should. For instance, Speechley et al. (1999) found the expected strong correlation (0.60) between pain as measured by the Child Health Questionnaire (CHQ) and by the Health Utilities Index Mark 3 (HUI 3); the correlation between the 2 measures' emotional function ratings, however, was also strong (0.54) rather than moderate as predicted. In general, the greater the extent to which the hypotheses are confirmed, the stronger the evidence for validity.

Two basic approaches characterize the measurement of HRQL: generic instruments (including single indicators, health profiles and utility measures) and specific instruments (Patrick and Deyo, 1989). Generic instruments attempt to measure all important aspects of HRQL. One sub-category of generic measures, health profiles, typically has a number of items in each of a number of domains.
For instance, one version of the CHQ includes 6 items related to physical functioning, 2 items related to bodily pain, 6 items related to self-esteem and so on (Landgraf et al., 1998).

Another type of generic instrument, the utility measure of HRQL, is derived from economic and decision theory and reflects relative preferences for treatment process and outcome states. The key elements of utility measures are that they incorporate preference measurements and that the preferences for health states are expressed relative to the preference for being dead. This allows them to be used in cost-utility analyses, which combine duration and quality of life. In utility measures, HRQL is summarized as a single number along a continuum that usually extends from being dead (0.0) to full health (1.0). Multi-attribute utility measures typically include only 1 item per domain.

Generic measures are well suited for discriminative purposes, in large part because they are applicable to different populations suffering from very different sorts of HRQL impairment. For widely used instruments, the range of scores that one would expect in a general population has been determined. Both advantages are relevant to the third key property of discriminative instruments, their interpretability. The interpretation of HRQL discriminative measures is enhanced if we know what children in the general population might score. In addition, we can compare, for example, the HRQL of survivors of Hodgkin's disease with that of survivors of neonatal intensive care units (Feeny et al., 1999). For health profiles that have been used widely in adults, quite extensive comparisons of this sort are available. For instance, we know how patients in various health states score on a generic health profile, the Sickness Impact Profile, when it is used as a discriminative instrument. Shortly after hip replacement, patients have scores of 30, which decrease to <5 after full convalescence (Bergner et al., 1981).
Scores in patients with chronic airflow limitation, severe enough to require home oxygen, are approximately 24 (McSweeney et al., 1982). Scores in patients with chronic, stable angina are approximately 11.5 (Fletcher et al., 1988). Scores in those with arthritis vary from 8.2 in patients with American Rheumatism Association arthritis class I to 25.8 in class IV (Deyo et al., 1983). The availability of data that enhance the interpretability of HRQL measures is likely to increase with more extensive use of generic health profiles in children.

Reliable instruments can be used not only for discrimination but also for prediction. HRQL measures may predict mortality or the likelihood of cancer relapse. In adults, perceived health, measured through self-ratings, has proved successful in predicting mortality (Mossey and Shapiro, 1982). HRQL may also indicate that intervention, particularly psycho-social intervention, would be of use. Poor ratings of emotional function or adjustment could prompt intervention. In addition, discrepancies between ratings, parent vs. child or care-giver vs. child, may be an index of dysfunctional relationships in need of improved communication.

Evaluative instruments (those designed to measure changes within individuals over time) require the same key measurement properties as discriminative instruments: a high ratio of signal to noise, validity and interpretability. For evaluative instruments, the signal-to-noise ratio is expressed as “responsiveness”. Responsiveness refers to an instrument's ability to detect change. If a treatment results in an important difference in HRQL, investigators wish to be confident that they will detect that difference even if it is small. Responsiveness will be related directly to the magnitude of the difference in scores in patients who have improved or deteriorated (the signal) and the extent to which patients who have not changed obtain more or less the same scores (the noise).
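The signal-to-noise view of responsiveness just described can be sketched as a simple ratio. The particular statistic used here (mean change in patients who changed, divided by the standard deviation of change in patients who remained stable) is an assumption chosen for illustration; it is one common formulation and not necessarily the one used in any of the workshop reports. The function name and the data are hypothetical.

```python
from statistics import mean, stdev


def responsiveness_ratio(changed_deltas, stable_deltas):
    """Illustrative signal-to-noise ratio for an evaluative instrument.

    changed_deltas: score changes in patients known to have improved
                    (or deteriorated).
    stable_deltas:  score changes in patients known not to have changed.
    """
    signal = abs(mean(changed_deltas))  # size of the true change
    noise = stdev(stable_deltas)        # scatter when nothing has changed
    return signal / noise


# Hypothetical change scores on the same questionnaire.
improved = [8, 10, 12, 9, 11]
stable_quiet = [1, -1, 2, -2, 0]   # stable patients, little random change
stable_noisy = [6, -6, 8, -8, 0]   # stable patients, much random change

print(responsiveness_ratio(improved, stable_quiet))
print(responsiveness_ratio(improved, stable_noisy))
```

With the same true improvement, the instrument whose scores fluctuate less in stable patients yields the larger ratio and would need fewer patients to detect the change.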
For instance, Phipps et al. (1999) have demonstrated that their instruments, rating change in somatic distress, nausea/vomiting and mucositis in inpatients, are responsive to (at least) moderate changes in patient status.

The principles of validation are identical for evaluative and discriminative instruments, but investigators demonstrate the validity of evaluative instruments by showing that changes in the instrument being investigated correlate with changes in other related measures in the theoretically derived predicted direction and magnitude. For instance, Phipps et al. (1999) demonstrated similar trajectories of change in important domains of HRQL when rated by patients, parents and nurses. These investigators also demonstrated strong cross-sectional correlations between some of the ratings (e.g., somatic distress) but not others (quality of interactions as rated by nurses vs. either parents or children). Phipps et al.'s (1999) validation of their instruments for measuring change would have been even stronger had they examined correlations of change over time between the different raters.

To address the final key measurement property of an evaluative instrument, its interpretability, we may ask whether a particular change in score represents a trivial, small but important, moderate or large improvement or deterioration. When discussing discriminative instruments, I described how their interpretability is enhanced when we know ranges of scores from a variety of patient groups in different health states. A valuable strategy for enhancing the interpretability of evaluative instruments involves comparing scores on a new measure to an independent standard that has 2 key properties: first, it bears a substantial correlation with the new measure and, second, the standard is itself interpretable.
One such standard that we have used in our own work is the patient's global rating of change: the minimally important difference in questionnaire score is the change observed when patients make a global rating suggesting that they have improved or deteriorated by a small but important degree (Jaeschke et al., 1989; Juniper et al., 1994, 1996). Osoba (1999) has used a similar strategy in cancer patients. Other possible independent standards include elements of functional status familiar to patients and clinicians (such as changes in mobility) or marker states (such as heart failure or arthritis functional status ratings) that are familiar to clinicians (Guyatt et al., 1991).

Generic instruments that are very useful for discriminating between persons and groups according to the degree of HRQL impairment may be of limited use in the clinical trial setting. The reason is that, because they explore only to a limited extent the specific areas that are problems for patients, generic instruments may lack responsiveness. As a result, investigators have developed an alternative approach in which they focus on aspects of health status that are specific to the area of primary interest of a particular patient group (such as children with a particular type of cancer). The rationale for this approach lies in the potential for increased responsiveness that may result from including only important aspects of HRQL that are relevant to the patients being studied. Indeed, head-to-head comparisons in the context of randomized trials have shown that specific measures are often more responsive than their generic counterparts (Tandon et al., 1989; Tugwell et al., 1990; Chang et al., 1991; Laupacis et al., 1991; Smith et al., 1993). In addition to the likelihood of improved responsiveness, specific measures have the advantage of relating closely to areas routinely explored by clinicians.
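Returning to the global rating of change as an independent standard, the estimation of a minimally important difference can be sketched as follows. This is an illustrative simplification: the rating scale, the set of ratings treated as “small but important”, the function name and the data are all hypothetical assumptions, not values from the cited studies.

```python
from statistics import mean


def minimally_important_difference(records, small_change_ratings):
    """Anchor-based estimate of the minimally important difference.

    records: (global_rating, score_change) pairs, one per patient.
             Positive ratings mean improvement, negative mean deterioration,
             0 means no change.
    small_change_ratings: absolute rating values treated as "small but
             important" change (a hypothetical choice made by the analyst).
    """
    aligned = [
        # Flip the sign for deteriorated patients so that all changes
        # point in the direction the patient reported.
        change if rating > 0 else -change
        for rating, change in records
        if abs(rating) in small_change_ratings
    ]
    if not aligned:
        raise ValueError("no patients reported a small but important change")
    return mean(aligned)


# Hypothetical data: (global rating of change, change in questionnaire score).
patients = [(2, 0.5), (3, 0.6), (-2, -0.4), (0, 0.1), (6, 1.5), (1, 0.5)]
print(minimally_important_difference(patients, small_change_ratings={1, 2, 3}))
```

Patients reporting no change or a large change are excluded, so the estimate reflects only those whose own judgement places them at the threshold of importance.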
Investigators may be interested in understanding the nature of children's experiences and the nature of the parent–child interaction for their own sake. In pursuing this goal, they may delve into the determinants of how the child reacts to the challenges presented by having cancer and what determines how well the parents and the family unit adapt. While quantitative discriminative instruments may be of some use in enhancing understanding, qualitative methods are likely to provide the most important insights (Haase et al., 1999). Interventions to enhance both child and parental coping may be the ultimate result of such curiosity-based research.

Confusion and difficulty in understanding can result from a profusion of instruments. If every descriptive study and every clinical trial uses a different instrument, it will be much more difficult for the clinicians, who should be the consumers of this research, to develop a good grasp of the area. By contrast, use of a limited number of instruments will facilitate user familiarity as well as instrument interpretability. This argues for a moratorium on new instrument development and a focus on studying the measurement properties of the available questionnaires. This phase is likely to culminate in the finding that a relatively small number of instruments out-perform their competitors, informing clinicians and investigators of the tools that they should use.

There may, however, still be instances in which new instrument development is warranted. These are likely to be situations in which a new specific instrument is needed to deal with a particular type of cancer or a particular treatment regimen. Previous experience provides insight into some of the dos and don'ts of new instrument development. Some organizations have been slow in adopting direct patient and care-giver input into the standard development of new HRQL measures (Watson et al., 1999).
Nevertheless, the risk of missing important areas and including less important items at the expense of items that have greater impact on patients' lives dictates full patient participation, both in generating new items and in reducing the number of items to a manageable number. Investigators have found focus groups and qualitative and semi-structured one-to-one interviews with patients to be helpful in the initial steps of instrument development.

The psychometric tradition suggests using factor analysis to guide the item-reduction process. Factor analysis uses the pattern of inter-item correlations to produce clusters of items that have substantial correlations with one another. Items within each cluster have low correlations with items in other clusters. If one uses factor analysis to reduce items, one will omit orphan items (those which do not correlate highly with any of the clusters or which correlate equally well with 2 or more clusters) from the final questionnaire. Using factor analysis to reduce items risks dropping items that are very important to patients (Juniper et al., 1997).

Armstrong et al. (1999) show how reliance on factor analysis can result in significant problems, even in the context of very careful and rigorous instrument development. Typically, patients with chronic disease rate symptoms associated with the disease or with serious side effects of treatment as having a large impact on their HRQL. As Armstrong et al. (1999) put it in reference to their instrument, the Miami Pediatric Quality of Life Questionnaire: “One area that is usually included as a major component in the quality of life of a child is physical functioning and comfort, but this area is glaringly missing from this measure.” How did such an omission happen? There were 7 physical functioning items in the investigators' initial pool of 56 candidate items. However, these 7 did not cluster into an independent factor; thus, 5 of the 7 items were rejected for the final questionnaire.
It is not surprising that the symptom or physical function items did not correlate highly with one another: different cancers and different chemotherapy regimens lead to different patterns of symptoms and different limitations in physical functioning. However, it does not follow that these problems are unimportant to patients and their parents. On the contrary, such items are usually rated as extremely important. Investigators would be wise to choose items on the basis of their importance to patients and to restrict the use of factor analysis to helping decide which items to place in each of their instrument's domains.

Several reports in this workshop focused on comparisons of parents' and children's ratings of the child's HRQL. While investigators found varying degrees of agreement (Sawyer et al., 1999; Levi and Drotar, 1999), 2 general findings emerged. Firstly, even when agreement was strong, it was far from perfect. Secondly, parents generally rate their child as having more problems and poorer HRQL than does the child.

The discussion at the Workshop emphasized that, rather than being a problem, discrepant ratings represent an opportunity. Investigators can use parent and child ratings in a variety of ways. Firstly, results of parent and child ratings, if they show moderate correlations, support one another's validity. If the correlations are very weak, this raises questions about the validity of one measure or the other. Secondly, if the correlations are sufficiently high, the parental rating can be used as a substitute for the child's rating if the latter is unobtainable. Thirdly, ratings can provide insight into the information needed for clinical management. For example, we found that, in children with asthma up to age 11, parental ratings provided complementary information; after that age, parental ratings added little additional insight into the state of the child's asthma or its impact on the child's life (Guyatt et al., 1997).
Fourthly, discrepancies can provide insight into the communication between the child and parent and suggest the need for psycho-social intervention (though views varied on whether very high or very low correlations were more suggestive of problems).

If correlations between parent and child ratings are very low, how might one establish which rating (if either) is invalid? One approach is to administer a number of other measures simultaneously. For instance, one could elicit an independent rating from a health worker and administer to both parent and child one or more other questionnaires that tap into HRQL. If correlations with these other measures are moderate with, e.g., the child's rating and very low with the parent's, this would suggest problems with the validity of the parent's rating. Another approach would be to conduct qualitative interviews to determine the reasons for the discrepancy.

Investigators working in the field of childhood cancer can take advantage of lessons learned from HRQL investigation in adults and in other children's diseases. The lessons include the following: determine whether you are more interested in measuring differences between patients at a point in time or longitudinal change within patients over time; do not expect a single instrument necessarily to achieve both goals; choose instruments appropriate to your goal and establish the methodological properties of the instruments you use in keeping with that goal; in developing new instruments, maintain the primacy of what patients say are the most important determinants of their HRQL; finally, focus on a small number of instruments and conduct head-to-head comparisons to determine their relative merits. By attending to these suggestions, investigators can move HRQL methodology in childhood cancer ahead at a rapid pace.
