Category Archives: Research

RanDECOW17

RanDECOW17 is a German web corpus created with the 2016 technology of the COW initiative. It is not based on breadth-first crawls. Instead, it was “crawled” using the ClaraX research crawler developed in Roland Schäfer’s third-party-funded project Linguistic Web Characterisation.

RanDECOW17 was released in 2019 and includes the COReX document feature annotation. A version of RanDECOW17 is available through NoSketchEngine at webcorpora.org. It is not useful for most normal corpus studies.

The development of RanDECOW17 was funded by the DFG (SCHA1916/1-1).

COReX 2018 feature set and databases

The COW & DeReKo Extractor (COReX) is a tool that takes corpus documents processed with the COWTek18 toolchain and adds document-level features representing the distribution of a large number of lexico-grammatical features. It is distributed as part of COWTek18. We have created databases of those features for DECOW16B, RanDECOW, and a subset of DeReKo. You can download them freely here:

https://www.webcorpora.org/opendata/corex18/

The development of COReX and the databases provided here was funded by the DFG (SCHA1916/1-1). It is a joint effort by Roland Schäfer (FU Berlin/DFG-funded) and Felix Bildhauer (IDS Mannheim, project “Corpus Grammar”).

The following table provides a short overview of the COReX feature set in its 2018 iteration. For more, see Bildhauer & Schäfer (2019, in prep.) and Schäfer & Bildhauer (2019, in prep.).

COReX lexico-grammatical feature extractor

COReX is our lexico-grammatical feature extraction system, developed at FU Berlin and IDS Mannheim. Please go here for preliminary information. Data will be released in 2017.

Based in part on the previous COWCat experiments.

COReCO content classification

COReCO is our document topic classification system, developed at FU Berlin and IDS Mannheim. Please go here for preliminary information. Data will be released in 2017.

Based in part on the previous COWCat experiments.