When and why listeners disagree in voice quality assessment tasks

TLDR

Modeling listener variability in voice quality assessment is essential for developing reliable protocols and understanding why listeners disagree. The study tested whether a model could explain interrater variability by quantifying the contributions of four factors: instability of listeners' internal standards, difficulty isolating individual attributes, scale resolution, and attribute magnitude. One hundred twenty listeners took part in six experiments that varied scale resolution, the presence of comparison stimuli, and the match between comparison and target voices. The four factors explained 84.2% of the variance in the likelihood of exact agreement. Matched comparison stimuli doubled that likelihood, and listeners agreed significantly better on continuous than on six-point scales, indicating that interrater variability stems from task design rather than listener unreliability.

Abstract

Modeling sources of listener variability in voice quality assessment is the first step in developing reliable, valid protocols for measuring quality, and provides insight into the reasons that listeners disagree in their quality assessments. This study examined the adequacy of one such model by quantifying the contributions of four factors to interrater variability: instability of listeners' internal standards for different qualities, difficulties isolating individual attributes in voice patterns, scale resolution, and the magnitude of the attribute being measured. One hundred twenty listeners in six experiments assessed vocal quality in tasks that differed in scale resolution, in the presence/absence of comparison stimuli, and in the extent to which the comparison stimuli (if present) matched the target voices. These factors accounted for 84.2% of the variance in the likelihood that listeners would agree exactly in their assessments. Providing listeners with comparison stimuli that matched the target voices doubled the likelihood that they would agree exactly. Listeners also agreed significantly better when assessing quality on continuous versus six-point scales. These results indicate that interrater variability is an issue of task design, not of listener unreliability.
