A web-derived Spanish corpus crawled in the ES top-level domain in late 2011. Created with texrex-mrvain (deprecated) and TreeTagger.
- status: legacy (use ESCOW14 instead)
- versions: ESCOW12A (full documents), ESCOW12AX (sentence shuffle)
- format: UTF-8 (converted from ISO-8859-1)
- size (ESCOW12A): 1,234,592,102 tokens, 1,006,506 documents
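The encoding conversion mentioned in the list above can be illustrated with a minimal Python sketch. It only shows what re-encoding ISO-8859-1 text as UTF-8 means at the byte level; the function name `latin1_to_utf8` and the sample word are hypothetical and are not part of the tooling used to build the corpus.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Decode ISO-8859-1 bytes and re-encode the text as UTF-8."""
    # Every byte value is valid in ISO-8859-1, so decoding cannot fail.
    return data.decode("iso-8859-1").encode("utf-8")


# Hypothetical sample: one Spanish word with a non-ASCII character.
sample = "corazón".encode("iso-8859-1")  # 7 bytes, "ó" is a single byte
converted = latin1_to_utf8(sample)       # 8 bytes, "ó" becomes two bytes
```

Non-ASCII characters such as "ó" take one byte in ISO-8859-1 and two in UTF-8, so byte offsets computed on the old files generally do not carry over to the converted ones.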
What happened to the older corpora called “COW2012”?
This table maps the old names (ISO-8859-1 corpora) to the current names of the legacy corpora (converted to UTF-8):
Old name | UTF-8 version | Type |
---|---|---|
DECOW2012-00 – DECOW2012-07 | DECOW12A01 – DECOW12A08 | full documents |
DECOW2012X-00 – DECOW2012X-07 | DECOW12AX01 – DECOW12AX08 | sentence shuffle |
DECOW2012QS | DECOW12Q | full documents |
DECOW2012QSX | DECOW12QX | sentence shuffle |
DECOW2012-C00X1MS, DECOW2012-C02X3MS, DECOW2012-C04X5MS, DECOW2012-C06X7MS | subsets of DECOW12AX | sentence shuffle |
ESCOW2012 | ESCOW12A | full documents |
ESCOW2012XS | ESCOW12AX | sentence shuffle |
FRCOW2011 | FRCOW11A | full documents |
FRCOW2011XS | FRCOW11AX | sentence shuffle |
NLCOW2012-00 – NLCOW2012-01 | NLCOW12A01 – NLCOW12A02 | full documents |
NLCOW2012-00X – NLCOW2012-01X | NLCOW12AX01 – NLCOW12AX02 | sentence shuffle |
SECOW2012-00 – SECOW2012-01 | SECOW12A01 – SECOW12A02 | full documents |
SECOW2012-00X – SECOW2012-01X | SECOW12AX01 – SECOW12AX02 | sentence shuffle |
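
For scripts that still refer to the old names, the renaming scheme in the table can be sketched in Python. This is an illustration, not an official tool: the names `SINGLE`, `PART` and `translate` are hypothetical, and the sketch assumes that the ranges map in order, so that old part `00` corresponds to new part `01`. It leaves out DECOW2012QSX and the DECOW2012-C… subsets.

```python
import re

# Corpora renamed one-to-one (taken from the table above).
SINGLE = {
    "DECOW2012QS": "DECOW12Q",
    "ESCOW2012": "ESCOW12A",
    "ESCOW2012XS": "ESCOW12AX",
    "FRCOW2011": "FRCOW11A",
    "FRCOW2011XS": "FRCOW11AX",
}

# Multi-part corpora: old parts are numbered from 00, new parts from 01.
# DECOW puts the shuffle marker X before the part number (DECOW2012X-00),
# NLCOW and SECOW put it after the part number (NLCOW2012-00X).
PART = re.compile(r"^(DE|NL|SE)COW2012(X?)-(\d\d)(X?)$")


def translate(old_name: str) -> str:
    """Return the current name for an old COW2012-style corpus name."""
    if old_name in SINGLE:
        return SINGLE[old_name]
    match = PART.match(old_name)
    if match is None:
        raise KeyError(old_name)
    lang, x_before, part, x_after = match.groups()
    shuffled = "X" if (x_before or x_after) else ""
    # Assumption: part numbers shift by one (00 -> 01, 01 -> 02, ...).
    return f"{lang}COW12A{shuffled}{int(part) + 1:02d}"
```

For example, `translate("DECOW2012X-07")` yields `DECOW12AX08` under these assumptions. The sketch does not check that a part number actually exists for the given corpus.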