ENCOW14 is the English web corpus by COW, created with the 2014 technology of the COW initiative. It is available through the portal for COW corpora at webcorpora.org, which runs our custom Colibri² corpus portal software.

It comes in two versions: the full version “A” and a sentence shuffle “AX” (cf. the comparison table below). The AX version has the advantage of including dependency parses. Only ENCOW14AX is available in the Colibri² web interface and for download.

texrex, rofl and HyDRA are part of texrex-neuedimensionen. Other software used: Heritrix, Ucto, TreeTagger, Malt Parser. This product includes GeoLite data created by MaxMind, available from http://www.maxmind.com.
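To illustrate what a sentence shuffle such as the AX version amounts to, here is a minimal sketch. It is not the code used to build ENCOW14AX. The function name `sentence_shuffle`, the dictionary layout and the example URLs are hypothetical, and the sketch assumes that each sentence keeps a copy of its document's URL once the document context is gone.

```python
import random


def sentence_shuffle(documents, seed=0):
    """Flatten documents into sentences and shuffle them corpus-wide.

    Each sentence carries a copy of its document's URL (assumed layout),
    because the original document order is lost after shuffling.
    """
    sentences = [
        {"url": doc["url"], "text": sent}
        for doc in documents
        for sent in doc["sentences"]
    ]
    # A seeded generator keeps the shuffle reproducible.
    random.Random(seed).shuffle(sentences)
    return sentences


# Hypothetical toy input: two documents with pre-split sentences.
docs = [
    {"url": "http://a.example/", "sentences": ["A1.", "A2."]},
    {"url": "http://b.example/", "sentences": ["B1."]},
]
shuffled = sentence_shuffle(docs, seed=42)
```

After shuffling, the original documents can no longer be read as running text, but per-sentence metadata remains usable.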

Data downloads and access


Corpus property overview

| Property | Version “A” | Version “AX” |
|---|---|---|
| Type | full documents | sentence shuffle |
| Formats | XML with inline VRT | XML with inline VRT |
| License | not publicly available | COW Terms of Use 2 |
| TLDs | uk, ca, com, org, ... | uk, ca, com, org, ... |
| Crawled in | 2012, 2014 | 2012, 2014 |
| Crawling | Heritrix 1.14 BFS | Heritrix 1.14 BFS |
| Boilerplate removal | texrex MLP, non-destructive | texrex MLP, destructive |
| Near-duplicate removal | texrex w-Shingling | texrex w-Shingling |
| In-document deduping | texrex, paragraphs | texrex, paragraphs |
| Hyphenation removal | HyDRA, destructive | HyDRA, destructive |
| Run-together repair | rofl, destructive | rofl, destructive |
| URL meta | Yes | Yes |
| IP meta | Yes | |
| Last-modified meta | Yes | Yes |
| crawldate meta | Yes | Yes |
| Document quality meta | texrex Badness | texrex Badness |
| HTML title meta | Yes | |
| HTML keywords meta | Yes | |
| Boilerplate score meta | Yes | Yes |
| Country meta | GeoLite | GeoLite |
| Region meta | GeoLite | |
| City meta | GeoLite | GeoLite |
| Register meta | | |
| Word tokenization | Ucto + custom | Ucto + custom |
| Sentence tokenization | Ucto | Ucto |
| POS | TreeTagger, Penn | TreeTagger, Penn |
| Named Entity | | |
| Dependency Head | | Malt, standard |
| Dependency Relation | | Malt, standard |
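
Both versions are distributed as XML with inline VRT, i.e. one token per line with tab-separated annotations between XML structure tags. The following is a minimal reading sketch, not an official tool. The column order (`word`, `pos`, `head`, `deprel`), the `<s>` element name and the sample sentence with its labels are assumptions made for illustration; check the actual column layout of the download before relying on it.

```python
def read_vrt(lines, columns=("word", "pos", "head", "deprel")):
    """Yield sentences as lists of token dicts from VRT-style lines.

    Assumes one token per line with tab-separated annotations, and that
    lines starting with "<" are XML structure tags (literal "<" tokens
    are assumed to be escaped as entities in the data).
    """
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("<"):
            # A closing sentence tag ends the current sentence.
            if line.startswith("</s") and sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        # Emit trailing tokens if the final closing tag is missing.
        yield sentence


# Made-up sample in the assumed layout; tags and labels are illustrative.
SAMPLE = "\n".join([
    "<s>",
    "The\tDT\t2\tdet",
    "corpus\tNN\t3\tnsubj",
    "grows\tVBZ\t0\troot",
    "</s>",
])

sentences = list(read_vrt(SAMPLE.splitlines()))
words = [tok["word"] for tok in sentences[0]]
```

Passing a different `columns` tuple adapts the reader to another annotation layout, for instance one without the dependency columns.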