RanDECOW17

RanDECOW17 is a German web corpus from the COW initiative, created with COW's 2016 technology. It is not based on breadth-first crawls; instead, it was “crawled” using the ClaraX research crawler, developed in Roland Schäfer’s third-party funded project Linguistic Web Characterisation.

RanDECOW17 was released in 2019 and includes the COReX document feature annotation. A version of RanDECOW17 is available through NoSketchEngine at webcorpora.org. It is not useful for most ordinary corpus studies.

The development of RanDECOW17 was funded by the DFG (SCHA1916/1-1).

texrex (integrated with ClaraX), rofl, and HyDRA are part of texrex-behindthecow. Other software used in the COW16 toolchain: Ucto, TreeTagger, Marmot, Stanford CRF-NER, MateTools. This product includes GeoLite data created by MaxMind, available from http://www.maxmind.com.

All access (download/NoSketchEngine) is routed through webcorpora.org!

Additional documentation

Corpus property overview

* Feature not available in NoSketchEngine.

| Feature | Value (RanDECOW17) |
| --- | --- |
| Type | full documents |
| Tokens | 956,877,102 |
| Sentences | 52,338,081 |
| Documents | 1,031,171 |
| Slices | 1 |
| Format | XML with inline VRT, UTF-8 |
| License | COW Terms of Use 2; NoSkE only |
| Release | Q1/2019 |
| TLDs | at, ch, de |
| Crawled in | 2016, 2017 |
| Crawling | ClaraX |
| Boilerplate removal | texrex MLP, non-destructive |
| Near-duplicate removal | |
| In-document deduping | texrex, paragraphs |
| Hyphenation removal | HyDRA, destructive |
| Run-together repair | rofl, destructive |
| URL meta | |
| IP meta | |
| Last-modified meta | |
| crawldate meta | |
| Document quality meta | texrex Badness |
| HTML title meta | |
| HTML keywords meta | |
| Boilerplate score meta | |
| Country meta | |
| Region geoloc meta | |
| City geoloc meta | |
| Forum detection | |
| Word tokenization | Ucto + custom |
| Sentence tokenization | Ucto + custom |
| Sentence-level language filtering | langid.py + custom |
| POS | TreeTagger (STTS) |
| Lemma | TreeTagger + custom + SMOR |
| Full nominal compound analysis | SMOR + custom |
| Base lemma for verbs and nouns | SMOR + custom |
| Morphological features | Marmot, standardized COW features |
| Named Entity | Stanford/Pado |
| Topological parsing | Cheung & Penn (Stanford, TÜBA/DZ) |
| Dependency Head | Mate by IMS Stuttgart trained on Tiger dependencies |
| Dependency Relation | Mate by IMS Stuttgart trained on Tiger dependencies |
| Document-level lexico-grammatical feature annotation | COReX 1.0 |
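
The format listed above is XML with inline VRT, which generally means one token per line with tab-separated annotations, interleaved with XML structural tags on lines of their own. The following is a minimal Python sketch of reading such a file sentence by sentence. The column order in `COLUMNS`, the use of `<s>` for sentences and `<doc>` for documents, and the sample data are all assumptions made for illustration; the actual RanDECOW17 layout is not documented here and may differ, and this sketch is not the COW project's own tooling.

```python
import io
from typing import Dict, Iterable, Iterator, List, Sequence

# Hypothetical column layout; the real column order in RanDECOW17 may differ.
COLUMNS = ("word", "pos", "lemma")


def iter_sentences(
    lines: Iterable[str], columns: Sequence[str] = COLUMNS
) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts from VRT-style input."""
    sentence = None  # None means we are currently outside any <s> element
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Lines without tabs that start with "<" are treated as structural tags.
        if "\t" not in line and line.startswith("<"):
            if line == "<s>" or line.startswith("<s "):
                sentence = []
            elif line == "</s>" and sentence is not None:
                yield sentence
                sentence = None
            continue  # other tags (e.g. <doc ...>) are skipped in this sketch
        # Everything else inside a sentence is a tab-separated token line.
        if sentence is not None and line:
            sentence.append(dict(zip(columns, line.split("\t"))))


# Invented toy data, not taken from the corpus.
SAMPLE = (
    '<doc url="http://example.org/">\n'
    "<s>\n"
    "Das\tART\tdie\n"
    "ist\tVAFIN\tsein\n"
    "gut\tADJD\tgut\n"
    ".\t$.\t.\n"
    "</s>\n"
    "</doc>\n"
)

if __name__ == "__main__":
    for sent in iter_sentences(io.StringIO(SAMPLE)):
        print(" ".join(tok["word"] for tok in sent))
```

Because the function takes any iterable of lines, an open file handle (opened with `encoding="utf-8"`) can be passed in place of the `io.StringIO` wrapper, so a large corpus file would be streamed rather than loaded into memory.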