Publication | Closed Access
Turning Digitised Material into a Diachronic Corpus
Citations: 17
References: 5
Year: 2019
Unknown Venue
Historical Corpus Data, Engineering, Corpus Linguistics, Text Mining, Natural Language Processing, Correct Metadata, Language Documentation, Information Retrieval, Computational Linguistics, Diachronic Corpus, Language Studies, Machine Translation, Nederlab Corpus, Terminology Extraction, Meta Data, Lexical Resource, Language Corpus, Text Processing, Linguistics, Document Processing
In this paper, we argue that exploiting historical corpus data requires text-level metadata that the metadata accompanying digital objects from digital libraries, archives or other electronic text collections does not provide. Most text collections describe in their metadata the object (book, newspaper) that contains the text. To do research on the style of an author, to study the language of a certain time period, or to follow a phenomenon through time, correct metadata is needed for each word in the text, which leads to a very intricate metadata scheme for some text collections. We focus on the Nederlab corpus. Nederlab is a research environment that gives access to a large diachronic corpus of Dutch texts from the 6th to the 21st century, comprising more than 10 billion words. The corpus has been compiled from existing digitised text material supplied by researchers, research organisations, archives and libraries. The aim of Nederlab is to provide tools and data that enable researchers to trace long-term changes in Dutch language, culture and society. This type of research places high demands on the metadata accompanying the texts. Since the Nederlab corpus consists of different collections, each with its own metadata, adding the appropriate metadata was not straightforward, all the more so because content providers and corpus builders approach metadata from different perspectives. We describe the desired metadata scheme and how we tried to realise it for a corpus of the size of Nederlab.