DECOW16 (A and B)

DECOW16 is the German web corpus of the COW initiative, built with the initiative's 2016 toolchain. DECOW16A was released in 2017 and includes the COReX document feature annotation. DECOW16B, released in 2018, contains minor fixes as well as significantly improved topological parses and COReX data. DECOW16 is available through NoSketchEngine at

The development of DECOW16A/B was funded by the DFG (SCHA1916/1-1).

Both iterations come in two versions: the full version (“A” or “B”, web access) and a sentence shuffle (“AX” or “BX”, download); cf. the comparison table below. texrex, rofl and HyDRA are part of texrex-behindthecow. Other software used in the COW16 toolchain: Heritrix, Ucto, TreeTagger, Marmot, Stanford CRF-NER, MateTools. This product includes GeoLite data created by MaxMind, available from

All access (download/NoSketchEngine) is routed through!

Additional documentation

Corpus property overview

* Feature not available in NoSketchEngine.

| Type | Full documents | Sentence shuffle |
|---|---|---|
| Tokens | 20,495,087,352 | ca. 11 bn |
| Sentences | 807,782,354 | ca. 600 mn |
| Format | XML with inline VRT, UTF-8 | XML with inline VRT, UTF-8 |
| License | COW Terms of Use 2; NoSkE only | COW Terms of Use 2 |
| TLDs | at, ch, de | at, ch, de |
| Crawled in | 2011, 2014 | 2011, 2014 |
| Crawling | Heritrix 1.14 BFS | Heritrix 1.14 BFS |
| Boilerplate removal | texrex MLP, non-destructive | texrex MLP, destructive |
| Near-duplicate removal | texrex, w-Shingling | texrex, w-Shingling |
| In-document deduping | texrex, paragraphs | texrex, paragraphs |
| Hyphenation removal | HyDRA, destructive | HyDRA, destructive |
| Run-together repair | rofl, destructive | rofl, destructive |
| URL meta | | |
| IP meta | | |
| Last-modified meta | | |
| Crawl date meta | | |
| Document quality meta | texrex Badness | texrex Badness |
| HTML title meta | | |
| HTML keywords meta | | |
| Boilerplate score meta | | |
| Country meta | | |
| Region geoloc meta | | |
| City geoloc meta | | |
| Forum detection | | |
| Word tokenization | Ucto + custom | Ucto + custom |
| Sentence tokenization | Ucto + custom | Ucto + custom |
| Sentence-level language + + custom | | |
| POS | TreeTagger (STTS) | TreeTagger (STTS) |
| Lemma | TreeTagger + custom + SMOR | TreeTagger + custom + SMOR |
| Full nominal compound analysis | SMOR + custom | SMOR + custom |
| Base lemma for verbs and nouns | SMOR + custom | SMOR + custom |
| Morphological features | Marmot, standardized COW features | Marmot, standardized COW features |
| Named Entity | Stanford/Pado | Stanford/Pado |
| Topological parsing | Cheung & Penn (Stanford, TÜBA/DZ) | Cheung & Penn (Stanford, TÜBA/DZ) |
| Dependency Head | Mate by IMS Stuttgart trained on Tiger dependencies | Mate by IMS Stuttgart trained on Tiger dependencies |
| Dependency Relation | Mate by IMS Stuttgart trained on Tiger dependencies | Mate by IMS Stuttgart trained on Tiger dependencies |
| Document-level lexico-grammatical feature annotation | COReX 1.0 | COReX 1.0 (via external database) |
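
Both versions are listed as XML with inline VRT. As a minimal sketch of how such a file might be read, the Python snippet below assumes the usual vertical layout: one token per line with tab-separated annotations, and XML tags on lines of their own. The column order (word, POS, lemma), the `<s>` sentence element, the sample data and the function name `iter_sentences` are assumptions made for illustration and are not taken from the DECOW16 documentation; the actual column inventory should be checked against the corpus files.

```python
from typing import Dict, Iterable, Iterator, List, Sequence

# Made-up sample in a VRT-like layout (assumed columns: word, POS, lemma).
SAMPLE = """<doc url="http://example.org/">
<s>
Das\tART\tdie
ist\tVAFIN\tsein
gut\tADJD\tgut
.\t$.\t.
</s>
</doc>
"""


def iter_sentences(
    lines: Iterable[str],
    columns: Sequence[str] = ("word", "pos", "lemma"),  # hypothetical order
) -> Iterator[List[Dict[str, str]]]:
    """Yield each <s>...</s> region as a list of token dictionaries."""
    sentence = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line == "<s>" or line.startswith("<s "):
            sentence = []          # a new sentence region opens
        elif line.startswith("</s>"):
            if sentence is not None:
                yield sentence     # the sentence region closes
            sentence = None
        elif line.startswith("<"):
            # Any other structural tag (e.g. <doc ...>) is skipped here.
            # Assumes a literal "<" in token text is escaped as "&lt;".
            continue
        elif sentence is not None:
            fields = line.split("\t")
            sentence.append(dict(zip(columns, fields)))


if __name__ == "__main__":
    for sent in iter_sentences(SAMPLE.splitlines()):
        print(" ".join(tok["word"] for tok in sent))
```

Because the function only consumes an iterable of lines, the same loop could be fed from an open file handle, so a large corpus file would not need to be loaded into memory at once.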