COW (COrpora from the Web) is a collection of linguistically processed gigatoken web corpora created by Felix Bildhauer and Roland Schäfer at Freie Universität Berlin, German Grammar Group. We have corpora in major European languages (Dutch, English, French, German, Spanish, Swedish).
The third-generation corpora COW14 are all larger than their predecessors, some containing 10 billion tokens or even 20 billion (DECOW14A). We focus on corpus quality in all areas (data collection as well as post-processing and linguistic annotation), not just larger corpus sizes.
All access to the corpora is routed through webcorpora.org. To comply with the German laws on intellectual property, the published corpora are sentence shuffles, and we only allow people working in the academia to access the corpora.