A web-derived corpus crawled in the DE top-level domain in late 2011. Created with texrex-mrvain (deprecated) and TreeTagger. DECOW12Q is a subset that contains only documents written in a quasi-spontaneous register, selected by the occurrence of cliticized variants of the indefinite article, as described in Schäfer and Sayatz (2014), Die Kurzformen des Indefinitartikels im Deutschen, Zeitschrift für Sprachwissenschaft 33(2).
- status: legacy (use DECOW14 instead)
- standard versions: DECOW12A (full documents), DECOW12AX (sentence shuffle)
- “quasi-spontaneous”: DECOW12Q (documents), DECOW12QX (sentence shuffle)
- format: UTF-8 (converted from ISO-8859-1)
- size (DECOW12A): 9,108,097,177 tokens, 552,259,011 sentences, 7,632,384 documents
What happened to the older corpora called “COW2012”?
This table maps the old names (ISO-8859-1 corpora) to the current names of the legacy corpora (converted to UTF-8):
Old name | UTF-8 version | Type |
---|---|---|
DECOW2012-00 – DECOW2012-07 | DECOW12A01 – DECOW12A08 | full documents |
DECOW2012X-00 – DECOW2012X-07 | DECOW12AX01 – DECOW12AX08 | sentence shuffle |
DECOW2012QS | DECOW12Q | full documents |
DECOW2012QSX | DECOW12QX | sentence shuffle |
DECOW2012-C00X1MS, DECOW2012-C02X3MS, DECOW2012-C04X5MS, DECOW2012-C06X7MS | subsets of DECOW12AX | sentence shuffle |
ESCOW2012 | ESCOW12A | full documents |
ESCOW2012XS | ESCOW12AX | sentence shuffle |
FRCOW2011 | FRCOW11A | full documents |
FRCOW2011XS | FRCOW11AX | sentence shuffle |
NLCOW2012-00 – NLCOW2012-01 | NLCOW12A01 – NLCOW12A02 | full documents |
NLCOW2012-00X – NLCOW2012-01X | NLCOW12AX01 – NLCOW12AX02 | sentence shuffle |
SECOW2012-00 – SECOW2012-01 | SECOW12A01 – SECOW12A02 | full documents |
SECOW2012-00X – SECOW2012-01X | SECOW12AX01 – SECOW12AX02 | sentence shuffle |
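
For scripts that still refer to the old names, the renaming in the table can be sketched as a small lookup. This is only an illustration, not an official tool: the function name `translate` and the regular expression are hypothetical, and it assumes that slice numbers shift by one (old `00` becomes new `01`), as the sentence-shuffle rows suggest. The `DECOW2012-C…MS` subsets have no one-to-one new name and are deliberately rejected.

```python
import re

# One-to-one renamings copied from the table.
DIRECT = {
    "DECOW2012QS": "DECOW12Q",
    "DECOW2012QSX": "DECOW12QX",
    "ESCOW2012": "ESCOW12A",
    "ESCOW2012XS": "ESCOW12AX",
    "FRCOW2011": "FRCOW11A",
    "FRCOW2011XS": "FRCOW11AX",
}

# Numbered slices. The old DECOW names put the X before the slice number
# (DECOW2012X-00), the old NLCOW/SECOW names put it after (NLCOW2012-00X).
# Assumption: the pattern is lenient and accepts either position for any
# of the three corpora; it does not check that the slice actually exists.
SLICE = re.compile(r"^(DE|NL|SE)COW2012(X?)-(\d{2})(X?)$")


def translate(old_name: str) -> str:
    """Return the UTF-8 era name for an old ISO-8859-1 era corpus name."""
    if old_name in DIRECT:
        return DIRECT[old_name]
    match = SLICE.match(old_name)
    if match is None:
        raise KeyError(old_name)
    lang, x_before, index, x_after = match.groups()
    shuffled = "X" if (x_before or x_after) else ""
    # Old slices count from 00, new slices count from 01.
    return f"{lang}COW12A{shuffled}{int(index) + 1:02d}"
```

For example, `translate("NLCOW2012-01X")` yields `NLCOW12AX02`, while `translate("DECOW2012-C00X1MS")` raises `KeyError` because that subset only corresponds to part of DECOW12AX.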