Impact of data imbalance caused by inactive frames and difference in sound duration on sound event detection performance

Abstract

Sound event detection (SED) is a major topic in machine listening research. In many SED methods, each segmented time frame is treated as one data sample for model training. The duration of a sound event depends strongly on its class: for example, “fan” is a long-lasting sound, whereas “mouse clicking” and “glass jingling” are instantaneous sounds. This difference in duration produces a large difference in the number of data samples per event class and therefore causes a severe data imbalance problem in SED. Moreover, inactive time frames far outnumber active frames because most sound events occur only occasionally, which causes a second serious imbalance, between active and inactive frames. In this paper, we study in detail the impact of sound duration and inactive frames on detection performance by introducing five loss functions: simple reweighting loss, inverse frequency loss, class-balanced loss, asymmetric focal loss, and focal batch Tversky loss. Evaluation experiments using the TUT Acoustic Scenes 2016/2017 and Sound Events 2016/2017 datasets show that in SED, inactive frames tend to overwhelm model training, and that the imbalance between active and inactive frames is more severe than the imbalance between sound event classes. The experiments also show that the introduced loss functions can alleviate these data imbalance problems and improve SED performance considerably.
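
To make the frame-level imbalance concrete, the following is a minimal sketch of how per-class active-frame counts could be turned into inverse-frequency weights for a frame-wise binary cross-entropy. The abstract does not give the paper's exact loss definitions, so the weighting rule (total frames divided by active frames), the function names, and the toy labels are all assumptions made for illustration only.

```python
import math

def frame_counts(labels):
    """Count active frames per class in a (frames x classes) 0/1 matrix."""
    n_classes = len(labels[0])
    return [sum(frame[c] for frame in labels) for c in range(n_classes)]

def inverse_frequency_weights(labels):
    """Assumed rule: weight = total frames / active frames of the class."""
    n_frames = len(labels)
    return [n_frames / max(count, 1) for count in frame_counts(labels)]

def weighted_bce(labels, probs, weights, eps=1e-7):
    """Frame-wise binary cross-entropy; only the active-frame term is
    scaled by the class weight, the inactive-frame term is left as is."""
    total = 0.0
    for y_row, p_row in zip(labels, probs):
        for y, p, w in zip(y_row, p_row, weights):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= w * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / (len(labels) * len(weights))

# Hypothetical 10-frame clip with classes ["fan", "mouse clicking"]:
# "fan" is active in 9 frames, "mouse clicking" in only 1.
labels = [[1, 0]] * 8 + [[1, 1]] + [[0, 0]]
probs = [[0.5, 0.5]] * 10  # an uninformative model output

weights = inverse_frequency_weights(labels)
unweighted = weighted_bce(labels, probs, [1.0, 1.0])
weighted = weighted_bce(labels, probs, weights)
```

In this toy clip the short event receives a weight nine times larger than the long one, so its single active frame contributes as much to the loss as all nine active "fan" frames combined. This shows only the general reweighting idea; the focal and Tversky variants named in the abstract are not covered by the sketch.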
