DECOW16

DECOW16 is the German web corpus by COW created with the 2016 technology of the COW initiative. It is available through NoSketchEngine at webcorpora.org.

It comes in two versions: the full version “A” (web access) and a sentence shuffle “AX” (download); cf. the comparison table below. The tools texrex, rofl and HyDRA are part of texrex-behindthecow. Other software used in the COW16 toolchain: Heritrix, Ucto, TreeTagger, Marmot, Stanford CRF-NER, MateTools. This product includes GeoLite data created by MaxMind, available from http://www.maxmind.com.

All access (download/NoSketchEngine) is routed through webcorpora.org!

Corpus property overview

* Feature not available in NoSketchEngine.

| Property | A | AX |
|---|---|---|
| Type | full documents | sentence shuffle |
| Tokens | 20,495,087,352 | ca. 11 bn |
| Sentences | 807,782,354 | ca. 600 mn |
| Documents | 17,147,104 | |
| Slices | 20 | 20 |
| Format | XML with inline VRT, UTF-8 | XML with inline VRT, UTF-8 |
| License | COW Terms of Use 2; NoSkE only | COW Terms of Use 2 |
| Release | Q1/2017 | Q1/2017 |
| TLDs | at, ch, de | at, ch, de |
| Crawled in | 2011, 2014 | 2011, 2014 |
| Crawling | Heritrix 1.14 BFS | Heritrix 1.14 BFS |
| Boilerplate removal | texrex MLP, non-destructive | texrex MLP, destructive |
| Near-duplicate removal | texrex, w-Shingling | texrex, w-Shingling |
| In-document deduping | texrex, paragraphs | texrex, paragraphs |
| Hyphenation removal | HyDRA, destructive | HyDRA, destructive |
| Run-together repair | rofl, destructive | rofl, destructive |
| URL meta | | |
| IP meta | | |
| Last-modified meta | | |
| Crawldate meta | | |
| Document quality meta | texrex Badness | texrex Badness |
| HTML title meta | | |
| HTML keywords meta | | |
| Boilerplate score meta | | |
| Country meta | | |
| Region geoloc meta | | |
| City geoloc meta | | |
| Forum detection | | |
| Word tokenization | Ucto + custom | Ucto + custom |
| Sentence tokenization | Ucto + custom | Ucto + custom |
| Sentence-level language filtering | langid.py + custom | langid.py + custom |
| POS | TreeTagger (STTS) | TreeTagger (STTS) |
| Lemma | TreeTagger + custom + SMOR | TreeTagger + custom + SMOR |
| Full nominal compound analysis | SMOR + custom | SMOR + custom |
| Base lemma for verbs and nouns | SMOR + custom | SMOR + custom |
| Morphological features | Marmot, standardized COW features | Marmot, standardized COW features |
| Named Entity | Stanford/Pado | Stanford/Pado |
| Topological parsing | Cheung & Penn (Stanford, TÜBA/DZ) | Cheung & Penn (Stanford, TÜBA/DZ) |
| Dependency Head | Mate by IMS Stuttgart trained on Tiger dependencies | Mate by IMS Stuttgart trained on Tiger dependencies |
| Dependency Relation | Mate by IMS Stuttgart trained on Tiger dependencies | Mate by IMS Stuttgart trained on Tiger dependencies |
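
Both versions are distributed as XML with inline VRT, that is, structural tags on their own lines and one token per line with tab-separated annotation columns. As a rough illustration of how such a file could be read, here is a minimal Python sketch. The function name `read_sentences`, the sentence tag `<s>`, and the three-column layout (word, POS, lemma) in the sample are assumptions made for this example only; the actual DECOW16 files carry more columns and attributes, so check the column order against the downloaded data before relying on it.

```python
import html
import io
from typing import Iterator, List, TextIO


def read_sentences(stream: TextIO, sentence_tag: str = "s") -> Iterator[List[List[str]]]:
    """Yield each sentence as a list of token rows (one list of columns per token).

    Assumes a VRT-style layout: XML tags alone on a line, tokens as
    tab-separated columns. The tag name is a parameter because it is an
    assumption, not taken from the corpus documentation.
    """
    open_prefix = "<" + sentence_tag
    close_tag = "</" + sentence_tag + ">"
    current = None  # token rows of the sentence being read, or None outside a sentence
    for raw in stream:
        line = raw.rstrip("\n")
        if not line:
            continue
        # A structural line looks like an XML tag and carries no tab-separated columns.
        is_tag = line.startswith("<") and line.endswith(">") and "\t" not in line
        if is_tag:
            if line == close_tag:
                if current is not None:
                    yield current
                current = None
            elif line == open_prefix + ">" or line.startswith(open_prefix + " "):
                current = []
            continue  # other tags (e.g. document boundaries) are skipped here
        if current is not None:
            # Undo XML escaping such as &amp; in each annotation column.
            current.append([html.unescape(col) for col in line.split("\t")])


# Hypothetical sample with invented columns: word, POS tag, lemma.
sample = (
    '<doc url="http://example.org/">\n'
    "<s>\n"
    "Das\tART\tdas\n"
    "ist\tVAFIN\tsein\n"
    "gut\tADJD\tgut\n"
    ".\t$.\t.\n"
    "</s>\n"
    "<s>\n"
    "A&amp;B\tNE\tA&amp;B\n"
    "</s>\n"
    "</doc>\n"
)

sentences = list(read_sentences(io.StringIO(sample)))
for sentence in sentences:
    print(" ".join(token[0] for token in sentence))
```

Reading line by line keeps memory use flat, which matters for slices of this size. Since AX is a sentence shuffle, the sentence is the natural unit to iterate over in that version.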