What is COW?

The COW (COrpora from the Web) corpora are the result of an ongoing project that aims to determine the value of linguistic material collected from the World Wide Web for fundamental linguistic research. The data are made available to a limited audience of collaborators within the linguistic community. Work on COW is supported by the German Research Council (Deutsche Forschungsgemeinschaft, DFG) through the project "Linguistic web characterization and web corpus creation" (SCHA1916/1-1). In essence, COW is a collection of linguistically processed gigatoken web corpora created by Felix Bildhauer and Roland Schäfer at Freie Universität Berlin.

Corpora are available for Dutch, English, French, German, Spanish, and Swedish. The fourth-generation COW16 corpora, available for English, French, German, and Spanish, add considerably more linguistic annotation and offer much higher data quality, especially for German and French.

Access to COW is provided at webcorpora.org.