RanDECOW17 is a German web corpus from the COW initiative, created with the initiative's 2016 technology. It is not based on breadth-first crawls; instead, it was “crawled” using the ClaraX research crawler developed in Roland Schäfer’s third-party funded project Linguistic Web Characterisation.
RanDECOW17 was released in 2019 and includes the COReX document feature annotation. A version of RanDECOW17 is available through NoSketchEngine at webcorpora.org. It is not suitable for most ordinary corpus studies.
The development of RanDECOW17 was funded by the DFG (SCHA1916/1-1).
texrex (integrated with ClaraX), rofl and HyDRA are part of texrex-behindthecow. Other software used in the COW16 toolchain: Ucto, TreeTagger, Marmot, Stanford CRF-NER, MateTools. This product includes GeoLite data created by MaxMind, available from http://www.maxmind.com.
All access (download/NoSketchEngine) is routed through webcorpora.org!
- Overviews of the dependency annotation labels and the topological annotation labels can be found by clicking on the respective link.
- The COReX document-level feature set is described at https://corporafromtheweb.org/corex18/.
Corpus property overview
* Feature not available in NoSketchEngine.
|Format||XML with inline VRT, UTF-8|
|TLDs||at, ch, de|
|Boilerplate removal||texrex MLP, non-destructive|
|In-document deduping||texrex, paragraphs|
|Hyphenation removal||HyDRA, destructive|
|Run-together repair||rofl, destructive|
|Document quality meta||texrex Badness|
|HTML title meta||✓|
|HTML keywords meta||✓|
|Boilerplate score meta||✓|
|Region geoloc meta||✓|
|City geoloc meta||✓|
|Word tokenization||Ucto + custom|
|Sentence tokenization||Ucto + custom|
|Sentence-level language filtering||langid.py + custom|
|Lemma||TreeTagger + custom + SMOR|
|Full nominal compound analysis||SMOR + custom|
|Base lemma for verbs and nouns||SMOR + custom|
|Morphological features||Marmot, standardized COW features|
|Topological parsing||Cheung & Penn (Stanford, TüBa-D/Z)|
|Dependency Head||Mate by IMS Stuttgart trained on Tiger dependencies|
|Dependency Relation||Mate by IMS Stuttgart trained on Tiger dependencies|
|Document-level lexico-grammatical feature annotation||COReX 1.0|
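The table gives the format as XML with inline VRT in UTF-8, meaning one token per line with tab-separated annotation columns, and structural XML tags on lines of their own. A minimal Python sketch of reading such a file sentence by sentence could look as follows. The element names `doc` and `s`, the column order (token, POS, lemma) and the sample content are assumptions made for illustration only; the actual tag names and column layout should be taken from the downloaded files.

```python
import io


def iter_sentences(stream, sentence_tag="s"):
    """Yield each sentence as a list of token rows (one list of columns per token).

    `sentence_tag` is a hypothetical default; adjust it to the tag used in the corpus.
    """
    open_tag = "<" + sentence_tag
    close_tag = "</" + sentence_tag + ">"
    current = None
    for raw in stream:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith(close_tag):
            # End of a sentence: hand the collected rows to the caller.
            if current is not None:
                yield current
            current = None
        elif line.startswith(open_tag + ">") or line.startswith(open_tag + " "):
            # Start of a sentence, with or without attributes.
            current = []
        elif line.startswith("<") and line.endswith(">"):
            # Any other structural tag (e.g. a document element) is skipped here.
            continue
        elif current is not None:
            # Token line: split the tab-separated annotation columns.
            current.append(line.split("\t"))


# Hypothetical sample; real files carry many more columns and attributes.
sample = (
    '<doc url="http://example.org/">\n'
    "<s>\n"
    "Das\tART\tdie\n"
    "ist\tVAFIN\tsein\n"
    "gut\tADJD\tgut\n"
    "</s>\n"
    "</doc>\n"
)

sentences = list(iter_sentences(io.StringIO(sample)))
tokens = [row[0] for row in sentences[0]]
print(tokens)
```

For a file on disk, the same function can be applied to `open(path, encoding="utf-8")`. Reading line by line instead of building a full XML tree keeps memory use flat, which matters for corpus-sized files.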