ATILF (Analyse et Traitement Informatique de la Langue Française) is a public research laboratory in the humanities and social sciences, run jointly by the CNRS and the Université de Lorraine and specializing in language sciences.

This channel hosts videos of ATILF seminars and other events organized by the laboratory.

[ATILF Seminar] Marc Kupietz: Linguistic Corpora and Research Tools: News from IDS Mannheim

Sept. 29, 2023
Duration: 01:06:12
Marc Kupietz (Head of the Corpus Linguistics department at the IDS): Linguistic Corpora and Research Tools: News from IDS Mannheim


Electronic corpora have been built and used at the Leibniz Institute for the German Language (IDS) in Mannheim since its foundation in 1964, when (with interesting parallels to today) booming research in artificial intelligence and the striving for strict empiricism were already major sources of motivation and inspiration. Starting with the Mannheimer Korpus I in 1969, a series of corpora has been created, used and made available at the IDS. Since 2004, this collection has been called the (Mannheim) German Reference Corpus DeReKo (Kupietz et al. 2010), and it has been continuously expanded to what is now more than 55 billion words, growing by 2 billion words per year. In the 1960s and 1970s, users had to punch their own Fortran programs to analyze the corpora. Since then, the IDS has also provided specialized corpus-linguistic software tools for DeReKo's now more than 40,000 users, beginning with REFER in 1983 and followed by COSMAS I (1992), COSMAS II (2003) and KorAP (2016) (Bański et al. 2013).
In the first part of my talk, I will outline the foundations of the German Reference Corpus DeReKo, including its goals and design principles, its expansion strategy, and its strategies for dealing with legal challenges and making the corpus data as accessible as possible without infringing the interests of rights holders.
In the second part of my talk, I will present KorAP, the current corpus analysis platform developed at the IDS, covering its goals and design principles and its key features, such as support for multiple query languages, user-definable virtual corpora and arbitrary annotation layers, as well as the extended possibilities offered by its client libraries for R and Python (Kupietz/Diewald/Margaretha 2020).
The final part of my talk will focus on contrastive linguistic research based on DeReKo and KorAP. I will introduce the complementary initiatives European Reference Corpus (EuReCo) (Kupietz et al. 2020) and International Comparable Corpus (ICC) (Čermáková et al. 2021), both of which aim to provide readily usable comparable corpora at manageable cost, and report on recent developments and current plans.
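
To make the notion of a user-definable virtual corpus mentioned above more concrete, here is a minimal, self-contained Python sketch. It is a toy model only, not KorAP's API or data model: the `Text` class, its metadata fields, the helper functions and the sample texts are all hypothetical and invented for illustration. The idea shown is simply that a virtual corpus is a metadata-defined subset of a larger collection, and that frequencies are then computed relative to that subset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    """One corpus text with minimal metadata (hypothetical fields)."""
    text_id: str
    pub_year: int
    text_type: str
    tokens: tuple


def virtual_corpus(texts, predicate):
    """Return the texts whose metadata satisfy the predicate."""
    return [t for t in texts if predicate(t)]


def relative_frequency(texts, word):
    """Occurrences of `word` per million tokens in the given texts."""
    total = sum(len(t.tokens) for t in texts)
    hits = sum(t.tokens.count(word) for t in texts)
    return 1_000_000 * hits / total if total else 0.0


# Toy data, invented purely for illustration.
TEXTS = [
    Text("A01", 1995, "newspaper", ("das", "Haus", "ist", "alt")),
    Text("A02", 2010, "newspaper", ("das", "neue", "Haus", "am", "See")),
    Text("B01", 2010, "fiction", ("ein", "Haus", "ein", "Baum")),
]

# A virtual corpus: newspaper texts published in 2000 or later.
newspapers_since_2000 = virtual_corpus(
    TEXTS, lambda t: t.text_type == "newspaper" and t.pub_year >= 2000
)

if __name__ == "__main__":
    print([t.text_id for t in newspapers_since_2000])
    print(relative_frequency(newspapers_since_2000, "Haus"))
```

Defining the subset through a predicate over metadata, rather than by copying texts, mirrors the general idea that the same underlying collection can serve many differently composed samples.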


Bański, Piotr/Bingel, Joachim/Diewald, Nils/Frick, Elena/Hanl, Michael/Kupietz, Marc/Pęzik, Piotr/Schnober, Carsten/Witt, Andreas (2013): KorAP: the new corpus analysis platform at IDS Mannheim. In: Vetulani, Zygmunt/Uszkoreit, Hans (eds.): Human Language Technologies as a Challenge for Computer Science and Linguistics. Proceedings of the 6th Language and Technology Conference. Poznań: Fundacja UAM. 586-587.
Čermáková, Anna/Jantunen, Jarmo/Jauhiainen, Tommi/Kirk, John/Křen, Michal/Kupietz, Marc/Uí Dhonnchadha, Elaine (2021): The International Comparable Corpus: Challenges in building multilingual spoken and written comparable corpora. In: Research in Corpus Linguistics 9(1). Murcia: Spanish Association for Corpus Linguistics. 89-103. https://doi.org/10.32714/ricl.09.01.06
Kupietz, Marc/Belica, Cyril/Keibel, Holger/Witt, Andreas (2010): The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, Nicoletta et al. (eds.): Proceedings of the seventh conference on International Language Resources and Evaluation (LREC 2010). Paris/La Valetta: ELRA. 1848-1854.
Kupietz, Marc/Diewald, Nils/Margaretha, Eliza (2020): RKorAPClient: An R Package for Accessing the German Reference Corpus DeReKo via KorAP. In: Calzolari, Nicoletta et al. (eds.): Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC). Marseille/Paris: ELRA. 7016-7021.
Kupietz, Marc/Diewald, Nils/Trawiński, Beata/Cosma, Ruxandra/Cristea, Dan/Tufiş, Dan/Váradi, Tamás/Wöllstein, Angelika (2020): Recent developments in the European Reference Corpus EuReCo. In: Granger, Sylviane/Lefer, Marie-Aude (eds.): Translating and Comparing Languages: Corpus-based Insights. (= Corpora and Language in Use, Proceedings 6). Louvain-la-Neuve: Presses universitaires de Louvain. 257-273.


