Tradeoff between quality and quantity of emotional annotations to characterize expressive behaviors

Abstract

Emotional descriptors collected from perceptual evaluations are important in the study of emotions. Many studies on emotion recognition depend on these labels to train classifiers. The reliability of the emotion descriptors varies with the number and quality of the raters. Conducting perceptual evaluations used to be an expensive and time-consuming task, resulting in emotional databases with poor labels annotated by few raters. Nowadays, crowdsourcing services have simplified the process, reducing the cost and facilitating more evaluations per stimulus. The key challenge in using crowdsourcing for perceptual evaluation is the quality of the annotations, which varies significantly across workers. Is it better to have many annotations with lower inter-evaluator agreement or few annotations with higher inter-evaluator agreement? This study explores this tradeoff between quality and quantity in emotional annotations to characterize expressive behaviors. The analysis relies on emotional labels from the MSP-IMPROV database, where each video was evaluated by over 20 workers. We discuss the theoretical concept of effective reliability to address this problem. We demonstrate that a reduced set of labels with higher inter-evaluator agreement can provide classification performance similar to that of an unfiltered set of labels from multiple workers. We discuss best practices for collecting annotations for emotion recognition tasks using crowdsourcing.
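The abstract does not state how effective reliability is computed. A common formulation, assumed here rather than taken from the paper, is the Spearman-Brown formula, R = n r / (1 + (n - 1) r), where n is the number of raters and r is the mean inter-rater correlation. The sketch below uses that assumption to illustrate the tradeoff; the function name and the rater counts and correlations are hypothetical values chosen for illustration, not results from the study.

```python
def effective_reliability(n_raters: int, mean_r: float) -> float:
    """Spearman-Brown estimate of the reliability of an average over n raters.

    Assumption: this is one standard definition of effective reliability;
    the abstract itself does not give the formula.
    """
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    return n_raters * mean_r / (1.0 + (n_raters - 1) * mean_r)


# Hypothetical scenarios (illustrative numbers only):
# many raters with low agreement vs. few raters with higher agreement.
many_noisy = effective_reliability(n_raters=20, mean_r=0.2)
few_consistent = effective_reliability(n_raters=5, mean_r=0.5)

print(f"20 raters, r = 0.2: R = {many_noisy:.3f}")
print(f" 5 raters, r = 0.5: R = {few_consistent:.3f}")
```

Under this formula, the two hypothetical settings reach the same effective reliability, which is the kind of equivalence between quantity and quality that the abstract's question is about.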
