Reliability, repeatability and reproducibility: analysis of measurement errors in continuous variables

Abstract

Clinical practice involves measuring quantities for a variety of purposes, such as aiding diagnosis, predicting future patient outcomes, and serving as endpoints in studies or randomized trials. Measurements are almost always prone to various sorts of errors, which cause the measured value to differ from the true value; accordingly, studies investigating measurement error frequently appear in this and other journals. The importance of measurement error depends upon the context in which the measurements in question are to be used. For example, a certain degree of measurement error may be acceptable if measurements are to be used as an outcome in a comparative study such as a clinical trial, but the same measurement errors may be unacceptably large to make measurements usable in individual patient management, such as screening or risk prediction.

In the past 20 years many papers have been published advocating how studies of measurement error should be analyzed, with a paper by Bland and Altman¹ being one of the most cited and well-known examples. There has been much controversy concerning the choice of parameter to be estimated and reported, and consequently confusion surrounding the meaning and interpretation of results from studies investigating measurement error. In this paper we first distinguish between the general concepts of agreement and reliability to aid researchers in considering which are relevant for their particular application. We then review the statistical methods that can be used to investigate and quantify agreement and reliability, dealing separately with the different types of measurement error study, while emphasizing the largely common techniques that should be used for data analysis. We reiterate that the judgment of whether agreement or reliability is acceptable must be related to the clinical application, and cannot be proven by a statistical test.
We highlight the fact that reliability depends on the population in which measurements are made, and not just on the measurement errors of the measurement method. We discuss the advantages of method comparison studies making at least two measurements with each measurement method on each subject. A key advantage is that the cause of a correlation between paired differences and means in the so-called Bland–Altman plot can be determined, in contrast to when only a single measurement is made with each method. Throughout the paper, we try to emphasize that calculated values of agreement and reliability from measurement error studies are estimates of parameters, and as such we should report such estimates with CIs to indicate the uncertainty with which they have been estimated. We restrict our attention to measurements of a continuous quantity; alternative methods are required for categorical data².

One difficulty in the measurement error field is the number of different terms used to describe studies of measurement error. The terms ‘agreement’, ‘reliability’, ‘reproducibility’ and ‘repeatability’ are used with varying degrees of consistency in the medical literature. We first make clear the distinction between the statistical concepts of agreement and reliability³. Agreement quantifies how close two measurements made on the same subject are, and is measured on the same scale as the measurements themselves. Two measurements of the same subject may differ for a number of reasons, depending on the conditions under which the measurements were made. In a method comparison study there will be differences because of inherent variability in each of the measurement methods, as well as potentially a bias between the measurements from the methods. If the measurements are made by different observers or raters, differences may be due to biases between the observers.
Agreement between measurements is a characteristic of the measurement method(s) involved, which does not depend on the population in which measurements are made, unless bias or measurement precision varies with the true value being measured. One popular way of quantifying agreement is to estimate the 95% limits of agreement, as proposed by Bland and Altman¹. These limits are defined such that we expect that, in the long run, 95% of future differences between measurements made on the same subject will lie within the limits.

Reliability, by contrast, relates the magnitude of the measurement error to the true variability between subjects. If reliability is high, measurement errors are small in comparison to the true differences between subjects, so that subjects can be relatively well distinguished (in terms of the quantity being measured) on the basis of the error-prone measurements. Conversely, if measurement errors tend to be large compared with the true differences between subjects, reliability will be low because differences between measurements of two subjects could be due purely to error rather than to a genuine difference in their true values. The reliability parameter is also known as an intraclass correlation (ICC), as it equals the correlation between any two measurements made on the same subject. Reliability takes values between zero and one, with a value of one corresponding to zero measurement error and a value of zero meaning that all the variability in measurements is due to measurement error. As a dimensionless quantity, it is arguably quite difficult to interpret, and the decision as to what value constitutes sufficiently high reliability is often made subjectively.

Repeatability of measurements refers to the variation in repeat measurements made on the same subject under identical conditions⁴. This means that measurements are made by the same instrument or method, the same observer (or rater) if human input is required, and that the measurements are made over a short period of time, over which the underlying value can be considered to be constant.
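The 95% limits of agreement described above can be illustrated with a minimal Python sketch. It assumes one measurement per subject from each of two methods, no relationship between the differences and the magnitude of the measurement, and approximately normally distributed differences. The function name `limits_of_agreement` and the data are hypothetical and are not taken from any study discussed in this paper.

```python
import statistics


def limits_of_agreement(method_a, method_b, z=1.96):
    """Return (bias, lower, upper) for paired measurements.

    method_a and method_b hold one measurement per subject, in the same
    subject order. The limits are mean difference +/- z * SD of the
    differences, which assumes roughly normal differences whose spread
    does not change with the size of the measurement.
    """
    if len(method_a) != len(method_b):
        raise ValueError("both methods must measure the same subjects")
    if len(method_a) < 2:
        raise ValueError("at least two subjects are needed")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(diffs)   # estimated bias between the methods
    sd = statistics.stdev(diffs)    # sample SD of the paired differences
    return bias, bias - z * sd, bias + z * sd


# Hypothetical measurements of the same six subjects by two methods.
a = [10.1, 12.3, 9.8, 11.4, 13.0, 10.7]
b = [9.9, 12.0, 10.1, 11.0, 12.6, 10.9]
bias, lower, upper = limits_of_agreement(a, b)
print(f"bias={bias:.3f}, limits of agreement=({lower:.3f}, {upper:.3f})")
```

The limits returned are themselves estimates, so in practice they would be reported with CIs.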
Variability in measurements made on the same subject in a repeatability study can then be ascribed only to errors due to the measurement process itself. Reproducibility refers to the variation in measurements made on a subject under changing conditions⁴. The changing conditions may be due to different measurement methods or instruments being used, measurements being made by different observers or raters, or measurements being made over a period of time, within which the underlying value of the quantity could change.

The first type of study we consider is a repeatability study, in which we investigate and quantify the repeatability of measurements made by a single instrument or method, and in which the conditions of measurement remain constant. The second type of study we consider is a method comparison study, in which measurements are made by two measurement methods on a sample of subjects. The use of two different methods means that this is a reproducibility study, but the term method comparison is used as it specifies the changing conditions under which measurements have been made. In contrast to a repeatability study, bias may exist between measurements made by two different methods, and their measurement errors may have different standard deviations (SDs). The third type of study we consider is one in which measurements are made by different observers or raters; this is also a type of reproducibility study. As with method comparison studies, biases may exist between observers, and their measurement errors may differ. As we discuss later for measurements made by different observers or raters, if interest lies in quantifying the measurement error of the particular observers in the study, the same methods should be used as for a method comparison study.

To investigate the repeatability of a method or instrument, we make at least two measurements per subject under identical conditions. This means that the measurements must be made by the same measurement method, or the same observer or rater. The aim is then to quantify the agreement and reliability of measurements made by that particular method or observer. If the differences between two measurements made on a subject are normally distributed, then in the long run we expect two measurements on a subject to differ by less than the repeatability coefficient on 95% of occasions. To estimate the repeatability coefficient, we can fit a one-way analysis of variance (ANOVA) model to the data containing the repeat measurements made on subjects; this can be done in all major statistical packages.
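The one-way analysis of a repeatability study can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' own procedure: it assumes a balanced design (the same number of repeats on every subject), uses the standard moment estimators from the one-way ANOVA mean squares, and takes the repeatability coefficient to be 1.96 × √2 × the within-subject SD, which is the usual definition but is not legible in the text as extracted. The function name `repeatability_summary` and the data are hypothetical.

```python
import math


def repeatability_summary(measurements):
    """One-way ANOVA summary for a balanced repeatability study.

    measurements: list of per-subject lists, each holding the same
    number (>= 2) of repeat measurements made under identical conditions.
    """
    n = len(measurements)
    m = len(measurements[0]) if n else 0
    if n < 2 or m < 2 or any(len(row) != m for row in measurements):
        raise ValueError("need >= 2 subjects with the same number (>= 2) of repeats")

    subject_means = [sum(row) / m for row in measurements]
    grand_mean = sum(subject_means) / n

    # Within-subject and between-subject sums of squares.
    ss_within = sum((x - mu) ** 2
                    for row, mu in zip(measurements, subject_means)
                    for x in row)
    ss_between = m * sum((mu - grand_mean) ** 2 for mu in subject_means)

    ms_within = ss_within / (n * (m - 1))
    ms_between = ss_between / (n - 1)

    var_within = ms_within
    # Moment estimator; truncated at zero if the mean squares imply a negative value.
    var_between = max((ms_between - ms_within) / m, 0.0)

    total = var_between + var_within
    if total == 0:
        raise ValueError("no variability in the data")

    within_sd = math.sqrt(var_within)
    return {
        "within_sd": within_sd,
        "between_sd": math.sqrt(var_between),
        # Assumed definition: 1.96 * sqrt(2) * within-subject SD.
        "repeatability_coefficient": 1.96 * math.sqrt(2) * within_sd,
        "icc": var_between / total,
    }


# Hypothetical data: five subjects, two repeat measurements each.
data = [[10.2, 10.6], [12.1, 11.8], [9.4, 9.9], [11.0, 11.3], [13.2, 12.7]]
print(repeatability_summary(data))
```

The estimates returned are point estimates only; CIs would need to be computed separately.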
variability in data that which can be ascribed to differences between and that to within For a repeatability study, the are defined by the subjects under and this must be in the statistical used. The estimates how much variation in measurements can be to differences in the or values of subjects, with the measurement error. the results in estimates of the and (or the corresponding which are the The estimate of the can be used in the to an estimate of the repeatability We with a study by the repeatability of measurements of made by one observer from The estimated repeatability meaning that the difference between any two future measurements made by that particular observer on a particular are estimated to be than on 95% of is to that the repeatability of observer may be because of differences in the and of observers. the repeatability calculated is an it is to a for it to indicate how it has been estimated. A may be by statistical but in the we review how a 95% for the can be a can be used to a for the repeatability by the limits by If the for the is by the limits must first be to a for the The that the measurement errors are of the true and that the of the errors is the of values. the of errors with the true value being measured. This should be by paired differences between measurements their the so Bland–Altman We this in the context of a method comparison study the and describe how such errors can often be with by making a variability in measurement of the repeatability on the differences between measurements being This can be by a or plot of the paired differences in measurements on each subject. 
made in the of the for the is that the measurement errors are If this is in the may be to The reliability of a measurement method is often of when measurements are to be used to between subjects or of For example, if we have a choice of two measurement methods that could be used to an outcome in a clinical or study, the method with reliability will statistical to differences between for a In this we describe how measurements from a single method can be used to estimate the reliability in a we discuss the of reliability to measurement methods Reliability in method comparison and different observers in The same as can be used to estimate statistical will the estimated which is the of we can the estimates of the and the Agreement and to estimate the For the data of an estimated of This means that of the variability in measurements of the estimated to be due to genuine differences in between with the being due to errors in the measurement and the observer the measurements were all made by one the reliability may be to as As with the repeatability the calculated is an and a 95% should be statistical a but in the we review the of a 95% for the For the data of the estimate of a 95% of to we the same to estimate reliability as we describe for agreement, the same of 95% CIs for the estimate on of the true values. The reliability of a measurement method depends upon the of the population in which the measurements are made. the of reliability Agreement and we that the of true values in the measured by the the value of agreement between repeat measurements is a characteristic of the method or instrument the of measurement errors is the of true reliability depends on the of measurement errors and the true in the population in which measurements are made. The is with a a method for measuring has a of which does not with the underlying value being measured. 
we a study to estimate the reliability, and we from a population in which the in true is 20 This a reliability or of we our subjects from a in which the of true values is the same as the variability of the error. In this the reliability is to If studies report only an estimate of the reliability (ICC), can only make of the estimate if the population in which the to such measurements has We that report estimates of and in to the In this can whether the measurement method will be sufficiently for their application, in which the between subjects may be we a measurement method or in clinical we must that the measurements it are to by the measurement method that measurements made with the method are the method. If the measurements from the two methods are sufficiently close and the of a patient on their basis be the the method could the method in clinical because the method is or to The as to what constitutes depends on how the measurements are to be used. If the two measurements are made on the same or we can quantify their investigate and quantify the agreement between measurements made by two methods, we must at a a of subjects the measurement method and the proposed method. The data from such a study of of measurements from each with the the measurements from the two measurement methods. we making two measurements on each subject each method because the one measurement method is the most common we by analysis. 
The first to such a is to plot the The plot is of measurements from the method from the method (or If measurements were from we expect the to lie on the of the of can be used to if there is if data are the than this that the method on the measurements on the of for this plot and report a statistical for whether the of the plot from the of As by Bland and in a paper in this we expect the to have a than one if the method on the any measurement and so a of the that the true is to one is not a it the same as a plot with the of of the between the measurements from two methods is often by the difference in a measurements from the two methods the of their as first by and this is to as the Bland–Altman and is frequently in of measurement error We the plot an from the literature. measurement in with and measured and also by We have their plot of differences means and it is as only data from the in measured by and by their from study by the and the each with 95% The plot the difference between calculated and the of two values. not their plot or of results to indicate which measurements were from is to be and always the same measurement from the The the of the paired from zero an estimate of the bias between the two methods. which measurements have been from which to which method, on measurements. The indicate the estimated limits of agreement their which we describe The plot can be used to the differences between measurements made by the two methods. The variability of the differences between the two methods how well the methods If the variability of the paired differences is the of and there is between the difference and we can quantify the agreement between the methods by the limits of the of the limits of agreement, we how the for limits of agreement may be and how they can be on we expect the difference in as measured by the two methods to lie between and for 95% of future measurements. 
is to that the limits of agreement calculated in this are just and as with any of estimate it is to quantify and report how the limits are estimated of 95% The estimated limits of agreement indicate how large the between the measurements will be on 95% of The whether the methods sufficiently well must then be made depending on the context in which the measurements will be used. In contrast to the repeatability which bias between the limits of agreement method this The of the paired differences whether on one method to or measurements to the measurements of the method, which we to as a bias between the methods. In the data in from there of a bias between the two methods, because the difference between the methods close to at We can a statistical to whether there is of bias by a of the differences paired of the measurements from each we the that the true of the differences is corresponding to bias between the methods. For the data in this that there of a bias between and made The limits of agreement method that the of the differences is the of measurements. 
this will not be the In frequently the of the differences will with the In this for small values the limits of agreement will be than for values the limits will be of the plot of differences from the study by that the variability of the differences may be when the value being measured is The of the paired differences the of the means it for the This can often be by of the measurements of the two If the difference in the of the measurements results in we can the limits of agreement and CIs for the limits in the the difference of two is to the of the we can all the estimates made on the scale by of the limits of agreement to the of one measurements to the For the data in the of the difference in measurements with The limits of agreement on the scale are to the of with 95% limits of agreement for the of to The 95% CIs can be calculated on the scale to the which measurements were made by which method, we cannot whether the is for measurements by to by or a plot of the of the measurements by the two methods to their a with the estimated 95% limits of agreement and their 95% This that the of a for the paired differences is of measured by and by of on data published by the and the each with 95% to can also be used to try to For example, estimate limits of agreement paired differences in measurements as a of the of the two measurements. 
The key is that a should be used which that are and have the of is for there to be an between the paired differences and of this is in a study by the agreement between as measured and the calculated in We have the data from their Bland–Altman plot of the difference between the and the and it in There to be a between the paired differences and differences between measured by and by from study by correlation the and the each with 95% We can a statistical to the for a whether the correlation between the paired differences and means from zero or by of the differences the The is known as in the context of the of measurements from two methods, which for the data in is for an between paired differences and how should we than for the difference as a of the we should be to the cause of the A between the paired differences and means when the of measurements from one method from the of measurements from the other method. There are at least two of an between paired differences and The first is that there is between the difference in measurements from the two methods and true value being the bias between methods over the of true values. The is that the of the two methods This will in the of bias if a method has or measurement errors than the method. This is and when a or measurement method is compared with a If the cause of the correlation is and we were to plot the paired differences the true value being there be The between the Bland–Altman plot and a plot true values because the paired means measurement errors, in contrast to the true values. as by and with only one measurement method we cannot which of two is the we must make an that the of the two methods are so that the correlation is to or that the two are and that there in between bias and true value we may have a of For the of we with the from by that the correlation is due to a difference in the of the two methods. 
A correlation between differences and means will when the method with measurement error is from measurements of the method with error. For the data in the correlation is the errors than the measured by which and to the measured by as the true that such measurements were to an that there is we can 95% limits of agreement, as by because the limits of agreement method does not that the two are In to from the data whether bias is we must make at least two measurements with each method on each subject. This study has often been but is we describe how such studies can be to whether a correlation between paired differences and means is due to bias between the methods with two measurements from each As reliability may be a parameter with which to two different measurement methods. estimate each reliability, we must make at least two measurements of each subject with each of the two methods. The repeat measurements from each method can then be as two repeatability studies estimates of each reliability, which can be advantage of reliability to measurement methods is that it can be used to methods when their measurements are on different or as the reliability is a dimensionless reliability depends on the of the true values in the population Reliability depends on population it is that reliability are compared only if they have been estimated from the same report a single reliability estimate from method comparison studies in which it that only a single measurement subject method is appear that the estimates are a as The that there are biases between measurements within subjects, and that the are for all measurements measurements are from two different methods, of may be bias may between the methods and their may If is the is and the estimates of within and between subject are estimates of different The reliability estimate then has a different interpretation from the We the estimate from the to data from two different measurement methods. 
As making two measurements with each method an of whether bias between the methods is or whether the measurement error the measurements from each method can be separately as two repeatability the methods estimates of the repeatability and reliability for each method. The Bland–Altman plot of paired differences means can be by the of a two measurements from each method in of the single As with the Bland–Altman plot on a single measurement from each method, an correlation between differences and means could be due to bias or a difference in measurement error between the methods. We describe a to whether any bias between measurements made by a method with the as measured by the method. We by the of the two measurements made the method and we by the measurement made on subject by the measurement method. A plot of can then be used to investigate whether the bias between the and methods with the as measured by the method. is that is rather than because the a to the common measurement error in and is for the same that paired differences between measurements from two methods the measurements from one method is in method comparison between values of and that the bias between the methods with the true value as measured by the method. A statistical of this can be by a for with as the The for the that the in this from zero the of the of If of bias with true value is we may be in the at which the bias between the methods We can estimate this by the estimate from the by our estimate of the This quantity is an estimate of the in bias method for a in the true value as measured by the method. 
The choice to is We could have and have a different and estimate of the at which the bias This choice is because we will two different and because it means the is The underlying is a of and as such can be for The is to and for measurements are made by observers or measurements made by two different observers are than are two measurements made by the same as with two methods, measurements from two observers may differ due to bias between the observers observer and their measurement errors may also have different For example, measurements from an observer can make measurements will have a than made by a If measurements in the future are to be made by different we to describe and quantify the differences between such measurements in to whether differences are genuine or may be due to measurement error. As with a method comparison study, the way to study this is for each observer to make at least two measurements of a of The of such a study and the of statistical that are should be by whether in a particular of or whether we are in a population of observers or If future measurements are to be made by a of the measurement error study should each of observers making measurements of a of subjects, with at least two measurements observer subject. We can then the same methods as for method different observers as different measurement methods. If each observer two or measurements on each we can whether there is bias or bias between observers. We can the bias of future measurements (in by measurements the corresponding estimated We can also estimate repeatability and reliability for each the may be of to which observers are and if differences in reliability can be related to observer such as of or If we are to that biases between observers are we can a so-called to such a for a subject and observer The estimates indicate the and for bias between observers. such can be to the measurement error to differ between observers. 
The observers in a measurement error study can often be considered as a sample of observers from a larger population of observers who may be used in future studies or clinical practice. In this case we are not interested in the particular observers in the measurement error study, but only in the information that they provide about the population of observers. In this situation it is important that a sufficient number of observers is used in the measurement error study. For example, if the measurement error study uses only two observers, little can be learned about the population of observers, because we have a sample of just two.

When the observers in the measurement error study are considered a sample from the population of observers, we analyze such a study using a model that treats the observer as a random effect. We fit a model with random subject and observer effects. Such a model gives estimates of the between-subject, between-observer and within-subject SDs, and assumes the measurement error SD to be the same for different observers. The between-observer SD represents variability in measurements due to systematic differences or biases between the observers. Using such estimates one can calculate the SD of the difference in two measurements made by the same observer or by two different observers. Unless the between-observer SD is zero, the inter-observer reliability will be lower than the intra-observer reliability. This is because biases between observers tend to make measurements from different observers more variable: it is more difficult to distinguish between subjects on the basis of measurements made by two different observers than if the subjects had been measured by the same observer.

In this paper we have distinguished between the concepts of agreement and reliability. Neither parameter is superior to the other, as they describe different aspects of the measurement process. The choice of what to report in a particular study should be guided by how measurements are to be used in the future, and also by the fact that readers may wish to apply a measurement method in a different setting. We have highlighted the fact that the reliability of a method depends not only on the magnitude of the measurement errors but on the variability of true values in the population in which measurements are made.
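The dependence of reliability on the population can be made concrete with a short Python sketch. It assumes the usual definition of reliability as the between-subject variance divided by the sum of the between-subject and measurement error variances. The function name `reliability` and the SD values are hypothetical and chosen only for illustration.

```python
def reliability(sd_between, sd_within):
    """Reliability (ICC) from the between-subject and within-subject SDs.

    Assumed definition: between-subject variance divided by the total of
    between-subject and measurement error (within-subject) variance.
    """
    var_between = sd_between ** 2
    var_within = sd_within ** 2
    return var_between / (var_between + var_within)


# The same hypothetical method (measurement error SD of 1.0) applied in
# populations with progressively less variability in true values.
for sd_between in (3.0, 2.0, 1.0):
    print(f"between-subject SD {sd_between}: reliability "
          f"{reliability(sd_between, 1.0):.2f}")
```

The measurement error is identical in all three cases; only the spread of true values changes, which is why a reliability estimate is hard to transfer to a different population unless the SDs are reported alongside it.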
As measurement techniques potentially may be used in a variety of clinical settings and different populations, it is advisable to report estimates of the between-subject and within-subject SDs. We have described which methods we believe are appropriate for the analysis of repeatability studies, method comparison studies and studies with measurements made by different observers or raters. In particular, we have argued that a single reliability coefficient should not be used for method comparison studies. If the reliabilities of two methods are to be compared, each method's reliability should be estimated separately by making at least two measurements on each subject with each measurement method.

We have shown how an association between paired differences and means may not be caused by non-constant bias between two methods. Such an association may also be caused by a difference in the measurement error SDs of the methods, but with only one measurement per subject per method it is not possible to determine which of these is the cause. Method comparison studies should therefore make at least two measurements per subject per method. This permits an investigation of the cause of any association between paired differences and means for measurements made by the two methods, and also allows the repeatability and reliability of each method to be estimated. When measurements involve an observer or rater, measurement error studies must include an adequate number of observers if interest lies in making inferences about a population of observers. Finally, we provide results for the calculation of CIs in the case in which two measurements are available from each subject.
