The COW (COrpora from the Web) corpora are the result of an ongoing project which has the goal of determining the value of linguistic material collected from the World Wide Web for fundamental linguistic research. The data are made available to a limited audience of collaborators within the linguistic community. Work on COW is supported by the German Research Council (Deutsche Forschungsgemeinschaft, DFG) in the form of the project Linguistic web characterization and web corpus creation (SCHA1916/1-1). In essence, COW is a collection of linguistically processed gigatoken web corpora created by Felix Bildhauer and Roland Schäfer at Freie Universität Berlin.
We have corpora in Dutch, English, French, German, Spanish, and Swedish. The fourth-generation COW16 corpora, available for English, French, German, and Spanish, add a large amount of linguistic annotation and provide much higher data quality, especially for German and French.
Access to COW is provided at webcorpora.org.
RanDECOW17 is a German web corpus by COW created with the 2016 technology of the COW initiative. Rather than being based on breadth-first crawls, it was “crawled” using the ClaraX research crawler developed in Roland Schäfer’s third-party-funded project Linguistic Web Characterisation.
RanDECOW17 was released in 2019 including the COReX document feature annotation. A version of RanDECOW17 is available through NoSketchEngine at webcorpora.org. It is not useful for most normal corpus studies.
The development of RanDECOW17 was funded by the DFG (SCHA1916/1-1).
COReX (the COW & DeReKo Extractor) is a tool that takes corpus documents processed with the COWTek18 toolchain and annotates them with document-level features representing the distribution of a large number of lexico-grammatical features. It is distributed as part of COWTek18. We have created databases of those features for DECOW16B, RanDECOW, and a subset of DeReKo. You can download them freely here:
The development of COReX and the databases provided here was funded by the DFG (SCHA1916/1-1). It is a joint effort by Roland Schäfer (FU Berlin/DFG-funded) and Felix Bildhauer (IDS Mannheim, project “Corpus Grammar”).
The following table provides a short overview of the COReX feature set in its 2018 iteration. For more, see Bildhauer & Schäfer (2019, in prep.) and Schäfer & Bildhauer (2019, in prep.).
DECOW16 is the German web corpus by COW created with the 2016 technology of the COW initiative. DECOW16A was released in 2017 including the COReX document feature annotation. DECOW16B was released in 2018. The B iteration contains minor fixes and significantly improved topological parses and COReX data. DECOW16 is available through NoSketchEngine at webcorpora.org.
The development of DECOW16A/B was funded by the DFG (SCHA1916/1-1).
As explained in Section 4.1 of Roland Schäfer (2015), “Processing and querying large web corpora with the COW14 architecture”, we have to take certain measures in order to stay within the bounds of German copyright law. This means that we only release sentence shuffles, i.e., corpora which are just bags of sentences. In other words, released versions of COW corpora contain no documents, just single sentences without context. The original URL and some other metadata are recorded for each sentence, however.
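To make the idea of a sentence shuffle concrete, here is a minimal Python sketch. It is not the COW toolchain and makes no claim about how the released shuffles are actually produced. The `Sentence` class, the `sentence_shuffle` function, and the example URLs are hypothetical names introduced only for illustration. It assumes that documents are already split into sentences and keyed by their source URL.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentence:
    """One sentence plus the metadata that survives the shuffle (hypothetical)."""
    text: str
    url: str  # original URL of the document the sentence came from


def sentence_shuffle(documents, seed=None):
    """Flatten all documents into one bag of sentences and shuffle it.

    `documents` maps a source URL to the list of sentences of that document.
    After shuffling, document boundaries and sentence order are gone;
    only the per-sentence URL remains.
    """
    bag = [
        Sentence(text=text, url=url)
        for url, sentences in documents.items()
        for text in sentences
    ]
    random.Random(seed).shuffle(bag)
    return bag


# Illustrative input; the URLs and sentences are made up.
docs = {
    "http://example.org/a": ["Erster Satz.", "Zweiter Satz."],
    "http://example.org/b": ["Dritter Satz."],
}
bag = sentence_shuffle(docs, seed=0)
for sentence in bag:
    print(sentence.url, sentence.text)
```

Each item in the resulting list still carries its source URL, but nothing in the list says which sentences were adjacent in the original document, which is the property the copyright argument above relies on.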
FRCOW16 is the French web corpus by COW created with the 2016 technology of the COW initiative. It is available through the portal for COW corpora at webcorpora.org. STATUS UPDATE: The planned release date is 30 June 2017 for NoSketchEngine and 15 July 2017 for the shuffle XML version.
We have an RStudio Server installation running on webcorpora.org. This allows users to:
- make scripted queries in Python with the convenient ManaCOW wrappers (in development), using RStudio as a minimal Python IDE, and
- do statistical analyses of the results in RStudio directly without downloading them.
Users who have already registered on webcorpora.org can apply for an RStudio account by writing us an email.
Note: The open-source version of RStudio Server, which we use, cannot be integrated into a single sign-on environment. Therefore, your NoSkE/download account and your RStudio account are two separate things with two separate passwords. If you have lost your RStudio Server password, please write us an email to have it reset.
Please use the email address that you used when you registered on webcorpora.org. If you use a different address, we cannot help you for security reasons.
ENCOW14 is the English web corpus by COW created with the 2014 technology of the COW initiative. It is available through the portal for COW corpora at webcorpora.org, which runs our custom Colibri² corpus portal software.
FRCOW14 was the planned French web corpus by COW created with the 2014 technology of the COW initiative. Its release was delayed several times due to a high workload and the improvements to annotation quality that we needed to implement. Since it is now made entirely with COW16 technology, it is released as FRCOW16.