HuBERT: Self-Supervised Speech Representation Learning by Masked\n Prediction of Hidden Units

Abstract

Self-supervised approaches for speech representation learning are challenged\nby three unique problems: (1) there are multiple sound units in each input\nutterance, (2) there is no lexicon of input sound units during the pre-training\nphase, and (3) sound units have variable lengths with no explicit segmentation.\nTo deal with these three problems, we propose the Hidden-Unit BERT (HuBERT)\napproach for self-supervised speech representation learning, which utilizes an\noffline clustering step to provide aligned target labels for a BERT-like\nprediction loss. A key ingredient of our approach is applying the prediction\nloss over the masked regions only, which forces the model to learn a combined\nacoustic and language model over the continuous inputs. HuBERT relies primarily\non the consistency of the unsupervised clustering step rather than the\nintrinsic quality of the assigned cluster labels. Starting with a simple\nk-means teacher of 100 clusters, and using two iterations of clustering, the\nHuBERT model either matches or improves upon the state-of-the-art wav2vec 2.0\nperformance on the Librispeech (960h) and Libri-light (60,000h) benchmarks with\n10min, 1h, 10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model,\nHuBERT shows up to 19% and 13% relative WER reduction on the more challenging\ndev-other and test-other evaluation subsets.\n