Word Segmentation in the Spoken Dutch Corpus

Abstract

This paper describes the aims of the word segmentation in the Spoken Dutch Corpus (Corpus Gesproken Nederlands, CGN) and the procedures used to create it. A manually verified segmentation will be created for one million words, whereas the remaining nine million words will come with an automatically generated segmentation only. We describe our efforts to create the best possible automatic word segmentation from an auditorily verified phonetic transcription, and the development of a protocol for manually verifying that automatic segmentation. The paper also reports some figures on the manual verification of the first hundred thousand words.
