ENCOW14

ENCOW14 is the English web corpus by COW, created with the 2014 technology of the COW initiative. It is available through the portal for COW corpora at webcorpora.org, which runs our custom Colibri² corpus portal software.

ENCOW14 comes in two versions: the full version “A” and a sentence shuffle “AX” (see the comparison table below). The AX version has the advantage of including dependency parses. Only ENCOW14AX is available in the Colibri² web interface and for download.

The tools texrex, rofl and HyDRA are part of texrex-neuedimensionen. Other software used: Heritrix, Ucto, TreeTagger, Malt Parser. This product includes GeoLite data created by MaxMind, available from http://www.maxmind.com.

Data downloads and access

 

Corpus property overview

| Property | A | AX |
| Type | full documents | sentence shuffle |
| Tokens | 16,821,840,292 | 9,578,828,861 |
| Sentences | 608,385,401 | 425,374,806 |
| Documents | 9,216,176 | |
| Slices | 16 | 16 |
| Formats | XML with inline VRT | XML with inline VRT |
| License | not publicly available | COW Terms of Use 2 |
| Release | Q1/2015 | Q1/2015 |
| TLDs | uk, ca, com, org, ... | uk, ca, com, org, ... |
| Crawled in | 2012, 2014 | 2012, 2014 |
| Crawling | Heritrix 1.14 BFS | Heritrix 1.14 BFS |
| Boilerplate removal | texrex MLP, non-destructive | texrex MLP, destructive |
| Near-duplicate removal | texrex w-Shingling | texrex w-Shingling |
| In-document deduping | texrex, paragraphs | texrex, paragraphs |
| Hyphenation removal | HyDRA, destructive | HyDRA, destructive |
| Run-together repair | rofl, destructive | rofl, destructive |
| URL meta | Yes | Yes |
| IP meta | Yes | |
| Last-modified meta | Yes | Yes |
| crawldate meta | Yes | Yes |
| Document quality meta | texrex Badness | texrex Badness |
| HTML title meta | Yes | |
| HTML keywords meta | Yes | |
| Boilerplate score meta | Yes | Yes |
| Country meta | GeoLite | GeoLite |
| Region meta | GeoLite | |
| City meta | GeoLite | GeoLite |
| Register meta | | |
| Word tokenization | Ucto + custom | Ucto + custom |
| Sentence tokenization | Ucto | Ucto |
| POS | TreeTagger, Penn | TreeTagger, Penn |
| Lemma | TreeTagger | TreeTagger |
| Named Entity | | |
| Chunks | TreeTagger | TreeTagger |
| Dependency Head | | Malt, standard |
| Dependency Relation | | Malt, standard |
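
Both versions are listed as "XML with inline VRT", i.e. XML structural tags with one token per line and tab-separated annotations in between. The following is a minimal Python sketch of how such a file could be read sentence by sentence. It is not part of the COW tooling: the function name read_vrt_sentences, the <doc> and <s> tag names, and the column order (word, POS, lemma) in the sample are assumptions for illustration only. Check the actual corpus files for the real tags and annotation layout.

```python
import io


def read_vrt_sentences(stream):
    """Yield each sentence as a list of token rows (lists of column strings).

    Assumes sentences are wrapped in <s ...> ... </s> tags on their own lines
    and that special characters inside tokens are XML-escaped, so a line that
    starts with "<" and ends with ">" is always a structural tag.
    """
    sentence = None
    for raw in stream:
        line = raw.rstrip("\n")
        stripped = line.strip()
        if stripped == "<s>" or stripped.startswith("<s "):
            sentence = []  # a new sentence begins
        elif stripped == "</s>":
            if sentence is not None:
                yield sentence
            sentence = None
        elif stripped.startswith("<") and stripped.endswith(">"):
            continue  # any other structural tag, e.g. a document wrapper
        elif stripped and sentence is not None:
            sentence.append(line.split("\t"))  # one token with its annotations


# Hypothetical sample; the column order (word, POS, lemma) is assumed.
SAMPLE = (
    '<doc url="http://example.org/">\n'
    "<s>\n"
    "The\tDT\tthe\n"
    "corpus\tNN\tcorpus\n"
    "grows\tVBZ\tgrow\n"
    "</s>\n"
    "<s>\n"
    "Fine\tJJ\tfine\n"
    "</s>\n"
    "</doc>\n"
)

if __name__ == "__main__":
    for tokens in read_vrt_sentences(io.StringIO(SAMPLE)):
        print(" ".join(row[0] for row in tokens))
```

Reading the file as a stream of lines rather than with a full XML parser keeps memory use flat, which matters for slices of this size. For a downloaded slice, pass an open file object instead of the in-memory sample.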