COReX 2018 feature set and databases

COReX, the COW & DeReKo Extractor, is a tool that enriches corpus documents processed with the COWTek18 toolchain with document-level annotations. These annotations capture the distribution of a large number of lexico-grammatical features within each document. COReX is distributed as part of COWTek18. We have created databases of those features for DECOW16B, RanDECOW, and a subset of DeReKo. You can download them freely here:

https://www.webcorpora.org/opendata/corex18/

The development of COReX and the databases provided here was funded by the DFG (SCHA1916/1-1). The work is a joint effort by Roland Schäfer (FU Berlin/DFG-funded) and Felix Bildhauer (IDS Mannheim, project “Corpus Grammar”).

The following table provides a short overview of the COReX feature set in its 2018 iteration. For more details, see Bildhauer & Schäfer (2019, in prep.) and Schäfer & Bildhauer (2019, in prep.).

Feature         Explanation
crx_adj         number of adjectives per 1,000 words
crx_adv         number of adverbs per 1,000 words
crx_alltokc     overall token count
crx_answ        number of answering particles per 1,000 words
crx_card        number of cardinals per 1,000 words
crx_clausevf*   number of clausal Vf per 1,000 Vf
crx_clitindef   number of clitic indefinite articles per 1,000 indef. articles
crx_cmpnd       number of compounds per 1,000 common nouns
crx_cn          number of common nouns per 1,000 words
crx_cnloan*     number of loan nouns with recognizable suffix ('-ik', '-um') per 1,000 nouns
crx_conj        number of coordinating particles per 1,000 words
crx_def         number of definite articles per 1,000 words
crx_dem         number of demonstratives per 1,000 words
crx_dq*         number of double quotes per 1,000 words
crx_emo         number of emoticons per 1,000 words
crx_esvf*       number of expletive 'es' in Vf per 1,000 Vf
crx_gen         number of genitives per 1,000 nouns
crx_imp         number of imperatives per 1,000 words
crx_indef       number of indefinite articles per 1,000 words
crx_inf         number of infinitives per 1,000 words
crx_itj         number of interjections per 1,000 words
crx_mod         number of modal verbs per 1,000 words
crx_neg         number of negative particles per 1,000 words
crx_neloc       number of location names per 1,000 words
crx_neorg       number of organization names per 1,000 words
crx_neper       number of person names per 1,000 words
crx_nonwrd      number of non-words per 1,000 words
crx_parta       number of comparison particles per 1,000 words
crx_pass*       number of passive constructions per clause
crx_perf        number of perfect constructions per clause
crx_plu         number of pluperfect constructions per clause
crx_poss        number of possessive pronouns per 1,000 words
crx_pper_1st*   number of 1st person pronouns per 1,000 words
crx_pper_2nd*   number of 2nd person pronouns per 1,000 words
crx_pper_3rd    number of 3rd person pronouns per 1,000 words
crx_prep        number of prepositions per 1,000 words
crx_psimpx      number of <psimpx> constituents per sentence
crx_qsvoc       number of other short/contracted forms ('nich', 'schomma') per 1,000 words
crx_rsimpx      number of <rsimpx> constituents per sentence
crx_sapos       number of apostrophized 's per 1,000 words
crx_sentc       sentence count
crx_short       number of non-standard contracted verbs and prepositions ('gehts', 'aufm') per 1,000 words
crx_simpx       number of <simpx> constituents per sentence
crx_slen*       average sentence length in words
crx_subji       number of infin.-embedding ptcls per 1,000 words
crx_subjs       number of subjunctors per 1,000 words
crx_tokc        count of tokens within sentences
crx_ttrat       type-token ratio (using crx_tokc)
crx_unkn        number of unknown lemmas per 1,000 words
crx_v2          number of verb-second sentences per sentence
crx_vaux        number of auxiliaries per 1,000 words
crx_vfin        number of finite verbs per 1,000 words
crx_vflen       average length of the pre-field ('Vorfeld', Vf)
crx_vlast       number of verb-last sentences per sentence
crx_vpast       number of past verbs per 1,000 words
crx_vpres       number of present verbs per 1,000 words
crx_vpressubj   number of pres. subjunctive verbs per 1,000 words
crx_vv          number of lexical verbs per 1,000 words
crx_vvieren*    number of '-ieren' verbs per 1,000 verbs
crx_vvpastsubj  number of past subjunctive per 1,000 words
crx_wh          number of w(h) pronouns per 1,000 words
crx_wlen*       average word length in characters (using crx_tokc)
crx_wpastsubj   number of 'werden' subjunctives per 1,000 words
crx_zuinf       number of 'zu' infinitives per 1,000 words
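
Most features in the table are rates normalized to a base of 1,000 units (words, nouns, verbs, Vf), while a few are ratios or averages such as the type-token ratio and the average word length. As a rough illustration of how such values can be read, the following Python sketch computes three of them for a toy sentence. The function names and the example data are hypothetical and not part of COReX. COReX's actual tokenization, case handling, and treatment of punctuation may differ, so this should be taken as a sketch of the arithmetic only.

```python
from typing import Sequence


def per_thousand(count: int, base: int) -> float:
    """Scale a raw count to a rate per 1,000 units of the base (e.g. words)."""
    if base == 0:
        # Assumption: an empty base yields 0.0 instead of raising an error.
        return 0.0
    return 1000.0 * count / base


def type_token_ratio(tokens: Sequence[str]) -> float:
    """Number of distinct token forms divided by the total number of tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def average_word_length(tokens: Sequence[str]) -> float:
    """Mean token length in characters."""
    if not tokens:
        return 0.0
    return sum(len(t) for t in tokens) / len(tokens)


# Toy document: 7 tokens, 2 of which we treat as adjectives ('kleine', 'alte').
tokens = ["die", "kleine", "Katze", "sieht", "die", "alte", "Katze"]
adjective_count = 2

adj_rate = per_thousand(adjective_count, len(tokens))  # analogous to crx_adj
ttr = type_token_ratio(tokens)                         # analogous to crx_ttrat
wlen = average_word_length(tokens)                     # analogous to crx_wlen

print(f"adjectives per 1,000 words: {adj_rate:.1f}")
print(f"type-token ratio: {ttr:.3f}")
print(f"average word length: {wlen:.2f}")
```

Because the rates are normalized per document, values for very short documents can be extreme (two adjectives in seven words already correspond to a rate of several hundred per 1,000 words), which is worth keeping in mind when comparing documents of very different lengths.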