Lattice-Based Search for Spoken Utterance Retrieval

TLDR

Spoken document retrieval typically relies on the single‑best ASR output, which is adequate for long broadcast news stories with low word‑error rates but fails for short teleconference snippets where WER can reach 50%. This work introduces an indexing method that operates on speech lattices instead of single‑best transcripts. The method flexibly represents both word and phone lattices, enabling efficient search for phrases that include out‑of‑vocabulary words. Experiments show that the lattice‑based approach raises F‑scores by more than five points over single‑best retrieval on low‑redundancy, high‑WER tasks.

Abstract

Recent work on spoken document retrieval has suggested that it is adequate to take the single-best output of ASR and perform text retrieval on this output. This is reasonable enough for the task of retrieving broadcast news stories, where word error rates (WERs) are relatively low and the stories are long enough to contain much redundancy. But it is patently not reasonable if one’s task is to retrieve a short snippet of speech in a domain where WERs can be as high as 50%; such would be the situation with teleconference speech, where one’s task is to find if and when a participant uttered a certain phrase. In this paper we propose an indexing procedure for spoken utterance retrieval that works on lattices rather than just single-best text. We demonstrate that this procedure can improve F-scores by over five points compared to single-best retrieval on tasks with poor WER and low redundancy. The representation is flexible, so that we can represent both word lattices and phone lattices, the latter being important for improving performance when searching for phrases containing out-of-vocabulary (OOV) words.
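To make the idea of indexing lattices rather than single-best text concrete, the following is a minimal Python sketch. It is not the paper's procedure: the lattice format (arcs as `(start, end, label, posterior)` tuples), the function names `build_index` and `search`, the threshold value, and the toy data are all hypothetical. In particular, scoring a phrase by multiplying arc posteriors along connected arcs is a simplifying assumption made here for illustration. Because labels are just strings, the same structure could hold phone labels instead of words.

```python
from collections import defaultdict


def build_index(lattices):
    """Build an inverted index from arc label to arc occurrences.

    lattices: {utt_id: [(start_state, end_state, label, posterior), ...]}
    (hypothetical format; posteriors are assumed to be precomputed).
    """
    index = defaultdict(list)
    for utt_id, arcs in lattices.items():
        for start, end, label, post in arcs:
            index[label].append((utt_id, start, end, post))
    return index


def search(index, phrase, threshold=0.1):
    """Return {utt_id: score} for utterances whose lattice contains the phrase.

    The score multiplies arc posteriors along connected arcs and sums over
    alternative matches. This is an illustrative approximation only.
    """
    words = phrase.split()
    if not words:
        return {}

    # partial[(utt_id, end_state)] = accumulated score of matches ending there
    partial = defaultdict(float)
    for utt_id, _start, end, post in index.get(words[0], []):
        partial[(utt_id, end)] += post

    for word in words[1:]:
        extended = defaultdict(float)
        for utt_id, start, end, post in index.get(word, []):
            score = partial.get((utt_id, start))
            if score:
                # Extend matches whose end state is this arc's start state.
                extended[(utt_id, end)] += score * post
        partial = extended

    totals = defaultdict(float)
    for (utt_id, _end), score in partial.items():
        totals[utt_id] += score
    return {u: s for u, s in totals.items() if s >= threshold}


# Toy data (invented for illustration): two utterances with competing arcs.
lattices = {
    "utt1": [
        (0, 1, "speech", 0.6), (0, 1, "peach", 0.4),
        (1, 2, "lattice", 0.7), (1, 2, "lettuce", 0.3),
    ],
    "utt2": [
        (0, 1, "speech", 0.9), (1, 2, "retrieval", 1.0),
    ],
}
index = build_index(lattices)
hits = search(index, "peach lettuce")
```

In this toy example the single-best path through `utt1` would read "speech lattice", so a text search over single-best output could not find "peach lettuce", whereas the lattice index still returns `utt1` with a lower score. Raising `threshold` trades recall for precision.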
