DECOW14

DECOW14 is the German web corpus built by the COW initiative with its 2014 technology. It is available through the portal for COW corpora at webcorpora.org.

DECOW14 comes in two versions: A and AX (cf. comparison table below).

The software tools texrex, rofl and HyDRA are part of texrex-neuedimensionen. Other software used: Heritrix, Ucto, TreeTagger, mate-tools, MaltParser. Parsed versions will be released later in collaboration with IMS Stuttgart.

This product includes GeoLite data created by MaxMind, available from http://www.maxmind.com. We would like to thank the high performance computing (HPC) service of the ZEDAT data center at Freie Universität Berlin for their CPU time.

Data downloads and access

Tool downloads

DECOW14A(X) properties

Property                 | A                           | AX
-------------------------+-----------------------------+---------------------------
Type                     | full documents              | sentence shuffle
Tokens                   | 20,495,087,352              | 11,660,894,000
Sentences                | 807,782,354                 | 624,767,747
Documents                | 17,147,104                  |
Slices                   | 21                          | 21
Format                   | XML with inline VRT, UTF-8  | XML with inline VRT, UTF-8
License                  | not publicly available      | COW Terms of Use 2
Release                  | Q1/2015                     | Q1/2015
TLDs                     | at, ch, de                  | at, ch, de
Crawled in               | 2011, 2014                  | 2011, 2014
Crawling                 | Heritrix 1.14 BFS           | Heritrix 1.14 BFS
Boilerplate removal      | texrex MLP, non-destructive | texrex MLP, destructive
Near-duplicate removal   | texrex, w-Shingling         | texrex, w-Shingling
In-document deduping     | texrex, paragraphs          | texrex, paragraphs
Hyphenation removal      | HyDRA, destructive          | HyDRA, destructive
Run-together repair      | rofl, destructive           | rofl, destructive
URL meta                 |                             |
IP meta                  |                             |
Last-modified meta       |                             |
crawldate meta           |                             |
Document quality meta    | texrex Badness              | texrex Badness
HTML title meta          |                             |
HTML keywords meta       |                             |
Boilerplate score meta   |                             |
Country meta             |                             |
Region geoloc meta       |                             |
City geoloc meta         |                             |
"Quasi-spontaneous" meta |                             |
Word tokenization        | Ucto + custom               | Ucto + custom
Sentence tokenization    | Ucto                        | Ucto
POS                      | TreeTagger (STTS)           | TreeTagger (STTS)
Lemma                    | TreeTagger + custom         | TreeTagger + custom
Chunks                   | TreeTagger                  | TreeTagger
Morphology               | Mate-Tools                  | Mate-Tools
Named Entity             | Stanford/Pado               | Stanford/Pado
Dependency Head          | – (planned)                 | – (planned)
Dependency Relation      | – (planned)                 | – (planned)
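
The "XML with inline VRT" format listed above can be read line by line: structural markup such as sentence boundaries sits on lines of its own, and each token occupies one line with tab-separated annotations. The following Python sketch shows one possible way to collect sentences from such input. It is not part of the COW tooling. The column layout (word, POS, lemma), the tag names <doc> and <s>, and the sample data are assumptions made for illustration only; the actual column order and element names should be taken from the corpus documentation.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical column layout; the real DECOW14 column order may differ.
COLUMNS = ("word", "pos", "lemma")

# Made-up sample in the general VRT style: one token per line,
# tab-separated annotations, XML tags on their own lines.
SAMPLE = """<doc url="http://example.org/">
<s>
Das\tPDS\tdie
ist\tVAFIN\tsein
gut\tADJD\tgut
.\t$.\t.
</s>
<s>
Danke\tNN\tDank
!\t$.\t!
</s>
</doc>
"""


def read_sentences(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each <s>...</s> region as a list of token dictionaries.

    Assumes that a literal "<" at the start of a token is escaped
    (e.g. as &lt;), so any line starting with "<" is markup.
    """
    sentence = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("<s>") or line.startswith("<s "):
            sentence = []            # open a new sentence
        elif line.startswith("</s>"):
            if sentence is not None:
                yield sentence       # close the current sentence
            sentence = None
        elif line.startswith("<"):
            continue                 # other structural tags are skipped
        elif sentence is not None:
            fields = line.split("\t")
            sentence.append(dict(zip(COLUMNS, fields)))


if __name__ == "__main__":
    for sent in read_sentences(SAMPLE.splitlines()):
        print(" ".join(tok["word"] for tok in sent))
```

For a real slice, the same generator could be fed an open file handle (opened with encoding="utf-8") instead of the sample string, so that the file is streamed rather than loaded into memory at once.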