RanDECOW17 is a German web corpus by COW created with the 2016 technology of the COW initiative. It is not based on breadth-first crawls; instead, it was “crawled” using the ClaraX research crawler developed in Roland Schäfer’s third-party-funded project Linguistic Web Characterisation.
RanDECOW17 was released in 2019 including the COReX document feature annotation. A version of RanDECOW17 is available through NoSketchEngine at webcorpora.org. It is not useful for most normal corpus studies.
The development of RanDECOW17 was funded by the DFG (SCHA1916/1-1).
DECOW16 is the German web corpus by COW created with the 2016 technology of the COW initiative. DECOW16A was released in 2017 including the COReX document feature annotation. DECOW16B was released in 2018. The B iteration contains minor fixes and significantly improved topological parses and COReX data. DECOW16 is available through NoSketchEngine at webcorpora.org.
The development of DECOW16A/B was funded by the DFG (SCHA1916/1-1).
As explained in Section 4.1 of Roland Schäfer (2015) Processing and querying large web corpora with the COW14 architecture, we have to take certain measures in order to stay within the bounds of German copyright law. This means that we only release sentence shuffles, i.e., corpora which are just bags of sentences. In other words, there are no documents in released versions of COW corpora, just single sentences without context. The original URL plus some other metadata are recorded for each sentence, however.
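To make the idea of a sentence shuffle concrete, here is a minimal sketch in Python. It is not the COW toolchain; the names `ShuffledSentence` and `sentence_shuffle`, the input layout (a mapping from URL to a list of sentences), and the example URLs are all assumptions made for illustration. It only shows the principle: documents are dissolved into single sentences, each sentence keeps its source URL as metadata, and the original order is discarded.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ShuffledSentence:
    """One released unit: a single sentence plus its source URL (hypothetical layout)."""
    text: str
    url: str


def sentence_shuffle(documents, seed=None):
    """Flatten documents into single sentences and shuffle them.

    `documents` maps a source URL to the list of sentences of that document.
    This input format is an assumption for the sketch, not the COW format.
    """
    records = [
        ShuffledSentence(text=sentence, url=url)
        for url, sentences in documents.items()
        for sentence in sentences
    ]
    # Shuffle across all documents so that no document context survives.
    random.Random(seed).shuffle(records)
    return records


if __name__ == "__main__":
    # Made-up example data; URLs and sentences are placeholders.
    docs = {
        "http://example.org/a": ["Erster Satz.", "Zweiter Satz."],
        "http://example.org/b": ["Dritter Satz."],
    }
    for record in sentence_shuffle(docs, seed=42):
        print(record.url, record.text, sep="\t")
```

Because each record carries its URL, a user can still trace a sentence back to its source page, but cannot reconstruct the running text of a document from the released corpus.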
DECOW14 is the German web corpus by COW created with the 2014 technology of the COW initiative. It is available through the portal for COW corpora at webcorpora.org.
A web-derived corpus crawled in the DE top-level domain in late 2011. Created with texrex-mrvain (deprecated) and TreeTagger. The DECOW12Q is a subset that contains only documents written in a quasi-spontaneous register, selected based on the occurrences of cliticized variants of the indefinite article as described in: Schäfer and Sayatz (2014) Die Kurzformen des Indefinitartikels im Deutschen. Zeitschrift für Sprachwissenschaft 33(2).