NLCOW12

A web-derived Dutch corpus crawled from the .nl top-level domain in 2012. Created with texrex-mrvain (deprecated) and TreeTagger.

  • status: legacy (use NLCOW14 instead)
  • versions: NLCOW12 (full documents), NLCOW12X (sentence shuffle)
  • format: UTF-8 (converted from ISO-8859-1)
  • size (NLCOW12A): 2,366,453,439 tokens, 121,582,724 sentences, 1,594,241 documents

What happened to the older corpora called “COW2012”?

This table maps the old names (ISO-8859-1 corpora) to the current names of the legacy corpora (converted to UTF-8):

Old name                        UTF-8 version               Type
DECOW2012-00 – DECOW2012-00     DECOW12A01 – DECOW12A08     full documents
DECOW2012X-00 – DECOW2012X-07   DECOW12AX01 – DECOW12AX08   sentence shuffle
DECOW2012QS                     DECOW12Q                    full documents
DECOW2012QSX                    DECOW2012QX                 sentence shuffle
DECOW2012-C00X1MS               subsets of DECOW12AX        sentence shuffle
DECOW2012-C02X3MS
DECOW2012-C04X5MS
DECOW2012-C06X7MS
ESCOW2012                       ESCOW12A                    full documents
ESCOW2012XS                     ESCOW12AX                   sentence shuffle
FRCOW2011                       FRCOW11A                    full documents
FRCOW2011XS                     FRCOW11AX                   sentence shuffle
NLCOW2012-00 – NLCOW2012-01     NLCOW12A01 – NLCOW12A01     full documents
NLCOW2012-00X – NLCOW2012-01X   NLCOW12AX01 – NLCOW12AX02   sentence shuffle
SECOW2012-00 – SECOW2012-01     SECOW12A01 – SECOW12A02     full documents
SECOW2012-00X – SECOW2012-01X   SECOW12AX01 – SECOW12AX02   sentence shuffle
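
If old corpus names still appear in scripts or file paths, the one-to-one renames in the table can be kept in a small lookup. The following is a minimal sketch in Python, not part of the COW tooling: the names `LEGACY_NAMES` and `current_name` are hypothetical. The ranged entries (for example NLCOW2012-00 to NLCOW2012-01) are left out because the table does not state how the individual parts correspond.

```python
# One-to-one renames copied from the table above (old name -> UTF-8 name, type).
# Ranged entries are omitted: their part-by-part correspondence is not given.
LEGACY_NAMES = {
    "DECOW2012QS": ("DECOW12Q", "full documents"),
    "DECOW2012QSX": ("DECOW2012QX", "sentence shuffle"),
    "ESCOW2012": ("ESCOW12A", "full documents"),
    "ESCOW2012XS": ("ESCOW12AX", "sentence shuffle"),
    "FRCOW2011": ("FRCOW11A", "full documents"),
    "FRCOW2011XS": ("FRCOW11AX", "sentence shuffle"),
}


def current_name(old_name):
    """Return (UTF-8 name, type) for an old corpus name, or None if unlisted."""
    # Normalise whitespace and case so " frcow2011 " still matches.
    return LEGACY_NAMES.get(old_name.strip().upper())
```

A call such as `current_name("ESCOW2012")` returns the new name together with its type, and any name not covered by the lookup yields `None`, so callers can fall back to the table.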