DECOW12

A web-derived corpus crawled in the DE top-level domain in late 2011. Created with texrex-mrvain (deprecated) and TreeTagger. DECOW12Q is a subset that contains only documents written in a quasi-spontaneous register, selected by the occurrence of cliticized variants of the indefinite article, as described in Schäfer and Sayatz (2014), Die Kurzformen des Indefinitartikels im Deutschen, Zeitschrift für Sprachwissenschaft 33(2).

  • status: legacy (use DECOW14 instead)
  • standard versions: DECOW12 (full documents), DECOW12X (sentence shuffle)
  • “quasi-spontaneous”: DECOW12Q (documents), DECOW12QX (sentence shuffle)
  • format: UTF-8 (converted from ISO-8859-1)
  • size (DECOW12A): 9,108,097,177 tokens, 552,259,011 sentences, 7,632,384 documents
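
The conversion from ISO-8859-1 to UTF-8 mentioned in the list above can be illustrated with a minimal Python sketch. The tool actually used to convert the corpora is not stated here, so the function name `latin1_to_utf8` and the sample string are purely illustrative assumptions.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8 (hypothetical helper)."""
    # Every byte value is a valid ISO-8859-1 character, so decoding cannot fail.
    text = raw.decode("iso-8859-1")
    return text.encode("utf-8")


# Illustrative sample: "ä" is one byte in ISO-8859-1 and two bytes in UTF-8.
sample = "Schäfer".encode("iso-8859-1")
converted = latin1_to_utf8(sample)
```

Non-ASCII characters such as umlauts take one byte in ISO-8859-1 but two bytes in UTF-8, so byte offsets computed on the old files do not carry over to the converted ones.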

What happened to the older corpora called “COW2012”?

This table translates the old names (ISO-8859-1 corpora) into the current names of the legacy corpora (converted to UTF-8):

Old name                        UTF-8 version               Type
DECOW2012-00 – DECOW2012-00     DECOW12A01 – DECOW12A08     full documents
DECOW2012X-00 – DECOW2012X-07   DECOW12AX01 – DECOW12AX08   sentence shuffle
DECOW2012QS                     DECOW12Q                    full documents
DECOW2012QSX                    DECOW2012QX                 sentence shuffle
DECOW2012-C00X1MS               subset of DECOW12AX         sentence shuffle
DECOW2012-C02X3MS               subset of DECOW12AX         sentence shuffle
DECOW2012-C04X5MS               subset of DECOW12AX         sentence shuffle
DECOW2012-C06X7MS               subset of DECOW12AX         sentence shuffle
ESCOW2012                       ESCOW12A                    full documents
ESCOW2012XS                     ESCOW12AX                   sentence shuffle
FRCOW2011                       FRCOW11A                    full documents
FRCOW2011XS                     FRCOW11AX                   sentence shuffle
NLCOW2012-00 – NLCOW2012-01     NLCOW12A01 – NLCOW12A01     full documents
NLCOW2012-00X – NLCOW2012-01X   NLCOW12AX01 – NLCOW12AX02   sentence shuffle
SECOW2012-00 – SECOW2012-01     SECOW12A01 – SECOW12A02     full documents
SECOW2012-00X – SECOW2012-01X   SECOW12AX01 – SECOW12AX02   sentence shuffle
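
For scripts that still refer to the old names, the table above could be turned into a small lookup, sketched here in Python. This is only an illustration: the names `SINGLE_NAMES`, `SLICED` and `translate` are hypothetical, and the sketch assumes that old slice numbers (starting at 00) map in order onto new slice numbers (starting at 01), which the table suggests but does not state. Only rows whose old and new ranges have the same length are covered.

```python
import re

# One-to-one rows copied from the table above.
SINGLE_NAMES = {
    "DECOW2012QS": "DECOW12Q",
    "ESCOW2012": "ESCOW12A",
    "ESCOW2012XS": "ESCOW12AX",
    "FRCOW2011": "FRCOW11A",
    "FRCOW2011XS": "FRCOW11AX",
}

# Sliced corpora: (pattern for the old name, prefix of the new name).
# ASSUMPTION: old slice NN corresponds to new slice NN+1, in order.
SLICED = [
    (re.compile(r"DECOW2012X-(\d\d)"), "DECOW12AX"),
    (re.compile(r"NLCOW2012-(\d\d)X"), "NLCOW12AX"),
    (re.compile(r"SECOW2012-(\d\d)X"), "SECOW12AX"),
    (re.compile(r"SECOW2012-(\d\d)"), "SECOW12A"),
]


def translate(old_name):
    """Return the current name for an old corpus name, or None if unknown."""
    if old_name in SINGLE_NAMES:
        return SINGLE_NAMES[old_name]
    for pattern, prefix in SLICED:
        match = pattern.fullmatch(old_name)
        if match:
            # Shift the slice number by one and keep two digits.
            return f"{prefix}{int(match.group(1)) + 1:02d}"
    return None
```

Under these assumptions, `translate("DECOW2012X-07")` yields `DECOW12AX08`, and names not covered by the sketch return `None` instead of a guess.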