Author Archives: Roland Schäfer

What is COW?

The COW (COrpora from the Web) corpora are the result of an ongoing project which has the goal of determining the value of linguistic material collected from the World Wide Web for fundamental linguistic research. The data are made available to a limited audience of collaborators within the linguistic community. Work on COW is supported by the German Research Council (Deutsche Forschungsgemeinschaft, DFG) in the form of the project Linguistic web characterization and web corpus creation (SCHA1916/1-1). In essence, COW is a collection of linguistically processed gigatoken web corpora created by Felix Bildhauer and Roland Schäfer at Freie Universität Berlin.

We have corpora in Dutch, English, French, German, Spanish, and Swedish. The fourth-generation COW16 corpora, available for English, French, German, and Spanish, add a lot of linguistic annotation and provide much higher data quality, especially for German and French.

Access to COW is provided at


RanDECOW17 is a German web corpus by COW created with the 2016 technology of the COW initiative. It is not based on breadth-first crawls. Instead, it was “crawled” using the ClaraX research crawler developed in Roland Schäfer’s third-party funded project Linguistic Web Characterisation.

RanDECOW17 was released in 2019 including the COReX document feature annotation. A version of RanDECOW17 is available through NoSketchEngine. It is not useful for most normal corpus studies.

The development of RanDECOW17 was funded by the DFG (SCHA1916/1-1).

COReX 2018 feature set and databases

COReX (the COW & DeReKo Extractor) is a tool that takes corpus documents processed with the COWTek18 toolchain and annotates them with document-level features representing the distribution of a large number of lexico-grammatical features. It is distributed as part of COWTek18. We have created databases of those features for DECOW16B, RanDECOW, and a subset of DeReKo. You can download them freely here:

The development of COReX and the databases provided here was funded by the DFG (SCHA1916/1-1). It is a joint effort by Roland Schäfer (FU Berlin/DFG-funded) and Felix Bildhauer (IDS Mannheim, project “Corpus Grammar”).

The following table provides a short overview of the COReX feature set in its 2018 iteration. For more, see Bildhauer & Schäfer (2019, in prep.) and Schäfer & Bildhauer (2019, in prep.).

DECOW16 (A and B)

DECOW16 is the German web corpus by COW created with the 2016 technology of the COW initiative. DECOW16A was released in 2017 including the COReX document feature annotation. DECOW16B was released in 2018. The B iteration contains minor fixes and significantly improved topological parses and COReX data. DECOW16 is available through NoSketchEngine at

The development of DECOW16A/B was funded by the DFG (SCHA1916/1-1).

Read this: You can only query and download sentence shuffles!

As explained in Section 4.1 of Roland Schäfer (2015), “Processing and querying large web corpora with the COW14 architecture”, we have to take certain measures in order to stay within the bounds of German copyright law. This means that we only release sentence shuffles, i.e., corpora which are just bags of sentences. In other words, there are no documents in released versions of COW corpora, just single sentences without context. The original URL and some other metadata are recorded for each sentence, however.
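To illustrate what a sentence shuffle means in practice, here is a minimal Python sketch. It is not the COW toolchain's actual implementation. The function name `sentence_shuffle`, the dictionary keys `url`, `sentences`, and `text`, and the example URLs are all hypothetical and chosen only for illustration.

```python
import random


def sentence_shuffle(documents, seed=0):
    """Flatten documents into a shuffled bag of sentences.

    `documents` is a list of dicts with the hypothetical keys
    "url" (source address) and "sentences" (list of strings).
    Each output item keeps the URL of the document it came from,
    but document boundaries and sentence order are discarded.
    """
    bag = [
        {"url": doc["url"], "text": sentence}
        for doc in documents
        for sentence in doc["sentences"]
    ]
    # A seeded generator makes the shuffle reproducible.
    random.Random(seed).shuffle(bag)
    return bag


# Toy input; real corpora contain billions of tokens.
docs = [
    {"url": "http://example.org/a", "sentences": ["A1.", "A2.", "A3."]},
    {"url": "http://example.org/b", "sentences": ["B1.", "B2."]},
]

shuffled = sentence_shuffle(docs)
for item in shuffled:
    print(item["url"], item["text"])
```

After shuffling, every sentence can still be traced back to its source URL, but neighbouring sentences in the output generally do not come from the same document, so the original text cannot be read in context.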


FRCOW16 is the French web corpus by COW created with the 2016 technology of the COW initiative. It is available through the portal for COW corpora. STATUS UPDATE: Planned release date is 30 June 2017 for NoSketchEngine and 15 July 2017 for the shuffle XML version.

RStudio Server and Python

We have an RStudio Server installation running. This allows users to:

  • use Python with the convenient ManaCOW wrappers (in development) to make scripted queries (using RStudio as a minimal Python IDE), and
  • do statistical analyses of the results in RStudio directly without downloading them.

Users who have already registered can apply for an RStudio account by writing us an email.

Note: The open-source version of RStudio Server, which we use, cannot be integrated into a single sign-on environment. Therefore, your NoSkE/download account and your RStudio account are two separate things with two separate passwords. If you have lost your RStudio Server password, please write us an email to have it reset. Please use the email address that you used when you registered. If you use a different address, we cannot help you for security reasons.


ENCOW14 is the English web corpus by COW created with the 2014 technology of the COW initiative. It is available through the portal for COW corpora, which runs our custom Colibri² corpus portal software.


FRCOW14 was the planned French web corpus by COW created with the 2014 technology of the COW initiative. Its release was delayed several times due to high workload and the improvements to annotation quality that we needed to implement. Since it is now made entirely with COW16 technology, it is released as FRCOW16.