Statistical models for topic identification using phoneme substrings

Abstract

Phoneme substrings that are recurrent within training data are detected and logged using dynamic programming procedures. The resulting keystrings (cluster centroids) are awarded a usefulness rating based on smoothed occurrence probabilities in wanted and unwanted data. The rankings of the keystrings by usefulness measured on training, development test and final test data for three language-pairs from the OGI multi-language corpus are highly consistent, showing that language-specific features are being found. Statistical measures of local association also suggest that keystring occurrences can be correlated in a manner similar to that of keywords for a particular topic. With improved recognition accuracy it should be possible to exploit this information in order to enhance performance in topic identification.
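The usefulness rating described above can be illustrated with a minimal sketch. The abstract does not give the formula, so the log-ratio of add-alpha smoothed occurrence probabilities used here is an assumption, as are the function names, the `alpha` parameter, and the toy keystrings and counts; it is not the authors' stated method.

```python
import math


def usefulness(wanted_hits, wanted_total, unwanted_hits, unwanted_total, alpha=1.0):
    """Hypothetical usefulness score for one keystring.

    Assumption: the score is the log-ratio of add-alpha smoothed
    occurrence probabilities in wanted and unwanted data.
    """
    # Smoothing keeps both probabilities strictly between 0 and 1,
    # so the ratio is finite even when a keystring was never seen.
    p_wanted = (wanted_hits + alpha) / (wanted_total + 2 * alpha)
    p_unwanted = (unwanted_hits + alpha) / (unwanted_total + 2 * alpha)
    return math.log(p_wanted / p_unwanted)


def rank_keystrings(counts, wanted_total, unwanted_total, alpha=1.0):
    """Return keystrings ordered from most to least useful.

    counts maps each keystring to (wanted_hits, unwanted_hits).
    """
    scores = {
        key: usefulness(w, wanted_total, u, unwanted_total, alpha)
        for key, (w, u) in counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (invented for illustration): number of utterances containing
# each keystring, out of 100 wanted and 100 unwanted utterances.
toy_counts = {
    "s t r": (30, 5),
    "ah n": (20, 20),
    "k ae": (2, 25),
}
ranking = rank_keystrings(toy_counts, wanted_total=100, unwanted_total=100)
```

Under this assumed score, a keystring that occurs equally often in wanted and unwanted data scores zero, and comparing the rankings produced on training, development test and final test data would be one way to check the consistency the abstract reports.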
