
Abstract

Within a strictly corpus-driven paradigm, in-depth profiling of many linguistic phenomena requires fast access to massive amounts of data derived from very large corpora. This poster presentation describes the CCDB, an empirical baseline framework established for this purpose in 2001 at the Institute for the German Language (IDS) in Mannheim. We use the CCDB to study, develop, and evaluate methods for the data-driven exploration and modelling of language use. The CCDB can be accessed through a public web interface at http://corpora.ids-mannheim.de/ccdb/. The paper is structured as follows. We first describe the kind of data that the framework provides. We then briefly discuss the notion of similarity of collocation profiles. Finally, we give examples of specific CCDB-based methods that we have recently been working on.