COW (COrpora from the Web) is a collection of linguistically processed gigatoken web corpora created by Felix Bildhauer and Roland Schäfer at Freie Universität Berlin. Roland Schäfer’s work on the COW corpora is currently supported by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) through the project “Linguistic web characterization and web corpus creation” (SCHA1916/1-1).
We provide corpora in major “European” languages (Dutch, English, French, German, Spanish, Swedish). The fourth-generation COW16 corpora are maintenance releases that add substantial linguistic annotation and offer much higher data quality.
Access to the corpora is provided at webcorpora.org (download and NoSketchEngine). To comply with German intellectual property law, the downloadable corpora are sentence shuffles, that is, their sentences appear in randomized order, and access is restricted to people working in academia.