COW (COrpora from the Web) is a collection of linguistically processed gigatoken web corpora created by Felix Bildhauer and Roland Schäfer of the German Grammar Group at Freie Universität Berlin. Roland Schäfer’s work on the COW corpora is currently supported by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) through the project "Linguistic web characterization and web corpus creation" (SCHA1916/1-1).
We provide corpora for six major European languages: Dutch, English, French, German, Spanish, and Swedish. The fourth-generation corpora (COW16) are maintenance releases that add substantial linguistic annotation and offer considerably higher data quality than earlier generations.
Access to the corpora is provided at webcorpora.org, both as downloads and through NoSketchEngine. To comply with German intellectual property law, the published corpora are sentence shuffles, and we grant access only to people working in academia.