Category Archives: Web Characterization

RanDECOW17

RanDECOW17 is a German web corpus created with the 2016 technology of the COW initiative. It is not based on breadth-first crawls; instead, it was “crawled” using the ClaraX research crawler developed in Roland Schäfer’s third-party-funded project Linguistic Web Characterisation.

RanDECOW17 was released in 2019 and includes the COReX document feature annotation. A version of RanDECOW17 is available through NoSketchEngine at webcorpora.org. It is not useful for most normal corpus studies.

The development of RanDECOW17 was funded by the DFG (SCHA1916/1-1).

COReX 2018 feature set and databases

The COW & DeReKo Extractor (COReX) is a tool that adds document-level features to corpus documents already processed with the COWTek18 toolchain. These document-level features represent the distribution of a large number of lexico-grammatical features within each document. COReX is distributed as part of COWTek18. We have created databases of those features for DECOW16B, RanDECOW, and a subset of DeReKo. You can download them freely here:

https://www.webcorpora.org/opendata/corex18/

The development of COReX and the databases provided here was funded by the DFG (SCHA1916/1-1). It is a joint effort by Roland Schäfer (FU Berlin/DFG-funded) and Felix Bildhauer (IDS Mannheim, project “Corpus Grammar”).

The following table provides a short overview of the COReX feature set in its 2018 iteration. For more, see Bildhauer & Schäfer (2019, in prep.) and Schäfer & Bildhauer (2019, in prep.).