Experiments in Spoken Document Retrieval at CMU

Abstract

We describe our submission to the TREC-7 Spoken Document Retrieval (SDR) track and the speech recognition and information retrieval engines. We present SDR evaluation results and a brief analysis. A few developments are also described in greater detail including:

- A new, probabilistic retrieval engine based on language models.
- A new, TFIDF-based weighting function that incorporates word error probability.
- The use of a simple confidence estimate for word probability based on speech recognition lattices.

Although improvements over a development test set were promising, the new techniques failed to yield significant gains in the evaluation test set.

1. The SDR Data and Task

The entire set of speech data for the 1998 TREC-7 spoken document retrieval track consisted of 153 hours of broadcast news, approximately 80 for training and 73 for testing. The data had been segmented into stories and manually transcribed. In the test set, there were three "versions" of the data available: a manually generated transcript, speech recognition transcripts based on IBM and CMU recognizers, and the raw audio data, to be transcribed by our own recognizer.

The entire training set was used to train acoustic models for the speech recognition system. The remainder was held out as unseen test data. There were about 3245 stories in the training data set and 2866 in the test set.

To develop and debug the system, the TREC-6 evaluation set was used in a Known-Item Retrieval system -- where every query has only one document assigned as relevant. In our experiments on the evaluation test set, the average precision of the retrieval for each of the relevant documents was used to judge the quality of the retrieval. However, since relevance judgements were not available for the development test se...
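As an illustration of the average-precision measure mentioned above, the following is a minimal sketch of the standard (non-interpolated) definition, not the evaluation code used for the track. The function name `average_precision` and the document identifiers are hypothetical, chosen only for this example.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one query.

    ranked_ids:   document ids in the order the retrieval engine returned them.
    relevant_ids: ids judged relevant for the query.
    Both argument names are illustrative assumptions, not from the paper.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # no relevant documents: define the score as zero

    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # precision measured at the rank of each relevant document
            precision_sum += hits / rank

    # relevant documents that were never retrieved contribute zero
    return precision_sum / len(relevant)


if __name__ == "__main__":
    # Two relevant stories, found at ranks 2 and 4: (1/2 + 2/4) / 2 = 0.5
    print(average_precision(["d3", "d1", "d7", "d2"], {"d1", "d2"}))
```

When a query has exactly one relevant document, as in a known-item setting, this quantity reduces to the reciprocal of the rank at which that document is retrieved.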
