DECOW14 is the German web corpus created with the 2014 technology of the COW initiative. It is available through the portal for COW corpora at webcorpora.org.
DECOW14 comes in two versions: A and AX (cf. comparison table below).
The software tools texrex, rofl and HyDRA are part of texrex-neuedimensionen. Other software used: Heritrix, Ucto, TreeTagger, mate-tools, Malt Parser. Parsed versions will be released later in collaboration with IMS Stuttgart.
This product includes GeoLite data created by MaxMind, available from http://www.maxmind.com.
We would like to thank the high performance computing (HPC) service of the ZEDAT data center at Freie Universität Berlin for the CPU time they provided.
Data downloads and access
- main access (frontend and download) at webcorpora.org (requires registration)
- ngrams (raw and aggregated)
- word and lemma frequency lists
Tool downloads
- COW’s TreeTagger lexicon additions
- Malt Parser model tigercow-de – not optimized, do not use for production
- SVN dump of everything – undocumented
DECOW14A(X) properties
Property | A | AX |
---|---|---|
Type | full documents | sentence shuffle |
Tokens | 20,495,087,352 | 11,660,894,000 |
Sentences | 807,782,354 | 624,767,747 |
Documents | 17,147,104 | – |
Slices | 21 | 21 |
Format | XML with inline VRT, UTF-8 | XML with inline VRT, UTF-8 |
License | not publicly available | COW Terms of Use 2 |
Release | Q1/2015 | Q1/2015 |
TLDs | at, ch, de | at, ch, de |
Crawled in | 2011, 2014 | 2011, 2014 |
Crawling | Heritrix 1.14 BFS | Heritrix 1.14 BFS |
Boilerplate removal | texrex MLP, non-destructive | texrex MLP, destructive |
Near-duplicate removal | texrex, w-Shingling | texrex, w-Shingling |
In-document deduping | texrex, paragraphs | texrex, paragraphs |
Hyphenation removal | HyDRA, destructive | HyDRA, destructive |
Run-together repair | rofl, destructive | rofl, destructive |
URL meta | ✓ | ✓ |
IP meta | ✓ | – |
Last-modified meta | ✓ | ✓ |
Crawl date meta | ✓ | ✓ |
Document quality meta | texrex Badness | texrex Badness |
HTML title meta | ✓ | – |
HTML keywords meta | ✓ | – |
Boilerplate score meta | ✓ | ✓ |
Country meta | ✓ | ✓ |
Region geoloc meta | ✓ | – |
City geoloc meta | ✓ | ✓ |
"Quasi-spontaneous" meta | ✓ | ✓ |
Word tokenization | Ucto + custom | Ucto + custom |
Sentence tokenization | Ucto | Ucto |
POS | TreeTagger (STTS) | TreeTagger (STTS) |
Lemma | TreeTagger + custom | TreeTagger + custom |
Chunks | TreeTagger | TreeTagger |
Morphology | Mate-Tools | Mate-Tools |
Named Entity | Stanford/Pado | Stanford/Pado |
Dependency Head | – (planned) | – (planned) |
Dependency Relation | – (planned) | – (planned) |
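
Since both versions are distributed as XML with inline VRT (one token per line, annotations in tab-separated columns, structural tags on lines of their own), a reader for the files might look roughly like the following Python sketch. The tag names `<doc>` and `<s>`, the `url` attribute, and the column order (word, POS, lemma) in the sample are assumptions made for illustration; check them against the actual files before relying on this.

```python
import io
from typing import Iterator, List, TextIO


def iter_sentences(stream: TextIO) -> Iterator[List[List[str]]]:
    """Yield each sentence as a list of token rows (one list of columns per token).

    Assumes sentences are wrapped in <s> ... </s> lines and that every
    other line starting with "<" is a structural tag, not a token.
    """
    sentence = None
    for raw in stream:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("<s>") or line.startswith("<s "):
            sentence = []                      # opening sentence tag
        elif line.startswith("</s>"):
            if sentence is not None:
                yield sentence                 # closing sentence tag
            sentence = None
        elif line.startswith("<"):
            continue                           # other structural tags, e.g. <doc>
        elif sentence is not None:
            sentence.append(line.split("\t"))  # token line: tab-separated columns


# Hypothetical sample; the column layout is an assumption, not the documented one.
SAMPLE = """<doc url="http://example.org/">
<s>
Das\tART\tdie
ist\tVAFIN\tsein
ein\tART\teine
Test\tNN\tTest
.\t$.\t.
</s>
<s>
Noch\tADV\tnoch
einer\tPIS\teine
.\t$.\t.
</s>
</doc>
"""

sentences = list(iter_sentences(io.StringIO(SAMPLE)))
token_count = sum(len(s) for s in sentences)
print(len(sentences), "sentences,", token_count, "tokens")
```

For a real slice, the `io.StringIO(SAMPLE)` stream would be replaced by a file opened with `encoding="utf-8"`. Reading line by line keeps memory use flat, which matters at the corpus sizes listed in the table.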