Publication | Closed Access
Corpus CesCa
Citations: 12
References: 0
Year: 2012
Applied Linguistics, Semantic Fields, Syntax, Language Documentation, Lexical Resource, CesCa Corpus, Corpus Linguistics, Computational Lexicology, Computational Linguistics, Language Corpus, Lexicon, Grammar, Textual Data, Language Studies, Spanish, Linguistics
This paper outlines the compilation of a corpus of written production in Catalan. The CesCa corpus presents a picture of the Catalan written language throughout compulsory schooling. It contains two kinds of data: vocabularies from five semantic fields, comprising 242,404 lexical forms, and textual data from four discourse genres, consisting of 207,028 tokens. Both the vocabularies and the textual data have been morphologically analyzed and lemmatized. The corpus is freely available. The paper describes the main features of the corpus and suggests some of the uses to which it can be put.