Part of speech tagging and lemmatisation for the spoken Dutch corpus

Abstract

This paper describes the lemmatisation and tagging guidelines developed for the "Spoken Dutch Corpus", and lays out the philosophy behind the high-granularity tagset that was designed for the project. To bootstrap the annotation of large quantities of material (10 million words) with this new tagset, we tested several existing taggers and tagger generators on initial samples of the corpus. The results show that the most effective method, when trained on the small samples, is a high-quality implementation of a Hidden Markov Model tagger generator.

1. Introduction

The Dutch-Flemish project "Corpus Gesproken Nederlands" (1998-2003) aims at the collection, transcription and annotation of ten million words of spoken Dutch (Oostdijk, 2000). The first layer of linguistic annotation concerns the assignment of base forms and morphosyntactic tags to each of those ten million words. The first part of this paper presents the lemmatisation guidelines and the tagset which have been devised for thi...
