COW (COrpora from the Web) is a collection of linguistically processed gigatoken web corpora created by Felix Bildhauer and Roland Schäfer of the German Grammar Group at Freie Universität Berlin. Roland Schäfer’s work on the COW corpora is currently supported by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) through the project "Linguistic web characterization and web corpus creation" (SCHA1916/1-1).
We provide corpora for six major European languages: Dutch, English, French, German, Spanish, and Swedish. The third-generation corpora (COW14) are all larger than their predecessors; some contain 10 billion tokens, and DECOW14A contains 20 billion. Beyond sheer size, we focus on corpus quality at every stage: data collection, post-processing, and linguistic annotation.
All access to the corpora is routed through webcorpora.org. To comply with German intellectual property law, the corpora are published as sentence shuffles, that is, the sentences are distributed in randomized order rather than as intact documents, and access is restricted to people working in academia.
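
For readers unfamiliar with the term, the idea of a sentence shuffle can be pictured roughly as follows. This is a minimal sketch in Python, not the actual COW processing pipeline: the function name `shuffle_sentences`, the `seed` parameter, and the sample documents are all hypothetical and chosen only for illustration.

```python
import random


def shuffle_sentences(documents, seed=0):
    """Flatten documents into one list of sentences and randomize their order.

    `documents` is a list of documents, each given as a list of sentence
    strings. The result no longer records which document a sentence came
    from or where in that document it stood.
    """
    # Pool all sentences from all documents.
    sentences = [sentence for doc in documents for sentence in doc]
    # A seeded generator makes the shuffle reproducible (an assumption of
    # this sketch, not a documented property of the COW corpora).
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return sentences


# Hypothetical input: two tiny "documents".
docs = [
    ["This is the first sentence of document A.", "This is its second sentence."],
    ["Document B has only one sentence."],
]
shuffled = shuffle_sentences(docs)
for sentence in shuffled:
    print(sentence)
```

In a shuffle of this kind every sentence is preserved, so sentence-level linguistic analysis remains possible, while the original documents can no longer be read as continuous texts.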