Concepedia

Publication | Closed Access

The bag-of-frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music

Citations: 227

References: 23

Year: 2007

TLDR

The bag‑of‑frames (BOF) approach models audio signals as long‑term statistical distributions of local spectral features. It has proven nearly optimal for simulating auditory perception of natural and human environments, and it is also the predominant paradigm for extracting high‑level descriptions from music signals. The study compares how well the BOF approach models urban soundscapes and polyphonic music. The authors find that BOF models soundscapes with near‑perfect precision but performs poorly on polyphonic music. Two custom homogeneity transforms reveal differences in the temporal and statistical structure of the two signal types; these differences may explain the disparity and suggest that soundscapes and music are perceived through different cognitive processes.

Abstract

The "bag-of-frames" approach (BOF) to audio pattern recognition represents signals as the long-term statistical distribution of their local spectral features. This approach has proved nearly optimal for simulating the auditory perception of natural and human environments (or soundscapes), and is also the predominant paradigm for extracting high-level descriptions from music signals. However, recent studies show that, in contrast to its application to soundscape signals, BOF provides only limited performance when applied to polyphonic music signals. This paper proposes to explicitly examine the difference between urban soundscapes and polyphonic music with respect to their modeling with the BOF approach. First, the application of the same measure of acoustic similarity to both soundscape and music data sets confirms that the BOF approach can model soundscapes to near-perfect precision, and exhibits none of the limitations observed in the music data set. Second, the modification of this measure by two custom homogeneity transforms reveals critical differences in the temporal and statistical structure of the typical frame distribution of each type of signal. Such differences may explain the uneven performance of BOF algorithms on soundscapes and music signals, and suggest that their human perception relies on cognitive processes of a different nature.
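
To make the idea of a "long-term statistical distribution of local spectral features" concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the feature (log energy in a few linear frequency bands), the model (a single diagonal-covariance Gaussian per signal), the distance (closed-form symmetric Kullback-Leibler divergence), and all function names and parameter values are assumptions chosen for brevity. The point it illustrates is that a BOF model discards the order of the frames and keeps only their statistics.

```python
import numpy as np

def frame_features(signal, frame_len=512, hop=256, n_bands=8):
    """Cut the signal into overlapping frames; return log band energies per frame.

    Illustrative feature only (an assumption of this sketch), one row per frame.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    feats = np.empty((n_frames, n_bands))
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        # Split the power spectrum into n_bands contiguous linear bands.
        bands = np.array_split(power, n_bands)
        feats[i] = np.log([b.sum() + 1e-10 for b in bands])
    return feats

def bag_of_frames(feats):
    """Summarize frames by per-feature mean and variance, ignoring frame order."""
    return feats.mean(axis=0), feats.var(axis=0) + 1e-8

def symmetric_kl(model_a, model_b):
    """Symmetric KL divergence between two diagonal Gaussians (closed form)."""
    (mu_a, var_a), (mu_b, var_b) = model_a, model_b
    d2 = (mu_a - mu_b) ** 2
    kl_ab = 0.5 * np.sum(np.log(var_b / var_a) + (var_a + d2) / var_b - 1.0)
    kl_ba = 0.5 * np.sum(np.log(var_a / var_b) + (var_b + d2) / var_a - 1.0)
    return kl_ab + kl_ba

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sr = 8000  # hypothetical sample rate for the synthetic signals
    t = np.arange(sr) / sr
    noise = rng.standard_normal(sr)                      # broadband, texture-like
    tone = np.sin(2 * np.pi * 440 * t) + 0.01 * rng.standard_normal(sr)
    model_noise = bag_of_frames(frame_features(noise))
    model_tone = bag_of_frames(frame_features(tone))
    print("noise vs tone :", symmetric_kl(model_noise, model_tone))
    print("noise vs noise:", symmetric_kl(model_noise, model_noise))
```

Because `bag_of_frames` keeps only the mean and variance, shuffling the rows of the feature matrix leaves the model unchanged. That order-invariance is the property the abstract's "temporal structure" comparison is probing, although the homogeneity transforms themselves are not reproduced here.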
