Moving corporafromtheweb.org to webcorpora.org!

Bookmark http://www.webcorpora.org now to stay in touch with the COW!

We are in the process of reorganising our infrastructure. Our corpus server webcorpora.org will move to a site at Humboldt-Universität zu Berlin in 2022. (Registered users will be informed in time, and there won’t be any significant downtime.) The information provided here on corporafromtheweb.org will also be moved to the new server in the form of concise technical descriptions of the corpora provided. Please update your bookmarks!

What is COW?

The COW (COrpora from the Web) corpora are the result of an ongoing project that aims to determine the value of linguistic material collected from the World Wide Web for fundamental linguistic research. The data are made available to a limited audience of collaborators within the linguistic community. Work on COW is supported by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) through the project "Linguistic web characterization and web corpus creation" (SCHA1916/1-1). In essence, COW is a collection of linguistically processed gigatoken web corpora created by Felix Bildhauer and Roland Schäfer at Freie Universität Berlin.

We have corpora in Dutch, English, French, German, Spanish, and Swedish. The fourth-generation COW16 corpora, available for English, French, German, and Spanish, add substantially more linguistic annotation and offer much higher data quality, especially for German and French.

Access to COW is provided at webcorpora.org.