ENCOW14 is the English web corpus by COW created with the 2016 technology of the COW initiative. It is available through the portal for COW corpora at webcorpora.org, which runs our custom Colibri² corpus portal software.

It comes in two versions: the full version “A” (web access) and a sentence shuffle “AX” (download); cf. the comparison table below. texrex, rofl and HyDRA are part of texrex-behindthecow. Other software used in the COW16 toolchain: Heritrix, Ucto, TreeTagger, Marmot, Stanford CRF-NER, Malt Parser. This product includes GeoLite data created by MaxMind, available from http://www.maxmind.com.

All access (download/NoSketchEngine) is routed through webcorpora.org!

Corpus property overview

* Feature not available in NoSketchEngine.

| Type | full documents | sentence shuffle |
|---|---|---|
| Formats | XML with inline VRT | XML with inline VRT |
| License | COW Terms of Use 3 only in NoSketchEngine | COW Terms of Use 3 (download) |
| TLDs | uk, ca, com, org, ... | uk, ca, com, org, ... |
| Crawled in | 2012, 2014 | 2012, 2014 |
| Crawling | Heritrix 1.14 BFS | Heritrix 1.14 BFS |
| Boilerplate removal | texrex MLP, non-destructive | texrex MLP, destructive |
| Near-duplicate removal | texrex w-Shingling | texrex w-Shingling |
| In-document deduping | texrex, paragraphs | texrex, paragraphs |
| Hyphenation removal | HyDRA, destructive | HyDRA, destructive |
| Run-together repair | rofl, destructive | rofl, destructive |
| URL meta | Yes | Yes |
| IP meta | Yes | |
| Last-modified meta | Yes | Yes |
| crawldate meta | Yes* | Yes |
| Document quality meta | texrex Badness | texrex Badness |
| Sentence-wise language detection | langid.py plus heuristics | langid.py plus heuristics |
| HTML title meta | Yes* | |
| HTML keywords meta | Yes* | |
| Boilerplate score meta | Yes | Yes |
| Country meta | GeoLite | GeoLite |
| Region meta | GeoLite* | |
| City meta | GeoLite* | |
| Forum detection | Yes | Yes |
| Word tokenization | Ucto + custom | Ucto + custom |
| Sentence tokenization | Ucto + custom | Ucto + custom |
| POS | TreeTagger (Penn) | TreeTagger (Penn) |
| Named Entities | Stanford | Stanford |
| Dependency Head | Malt | Malt |
| Dependency Relation | Malt | Malt |
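
Both versions are distributed as XML with inline VRT, i.e. structural XML tags on their own lines and one token per line with tab-separated annotation columns. The following is a minimal sketch of how such a file could be read sentence by sentence in Python. The column order in `COLUMNS`, the `<s>` sentence element, the `<doc>` attributes and the sample tokens are assumptions made for illustration only; they are not taken from the corpus documentation and should be checked against the actual files.

```python
import html
import re
from typing import Iterator

# Hypothetical column order; the real layout must be checked against the corpus files.
COLUMNS = ("word", "pos", "ne", "dep_head", "dep_rel")

# Matches a line that consists of exactly one XML opening or closing tag.
TAG_LINE = re.compile(r"^</?[A-Za-z][^>]*>$")


def read_sentences(lines) -> Iterator[list[dict[str, str]]]:
    """Yield each sentence as a list of token dicts keyed by COLUMNS."""
    sentence = None  # None means we are currently outside an <s> element
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if TAG_LINE.match(line):
            if line == "<s>" or line.startswith("<s "):
                sentence = []
            elif line == "</s>" and sentence is not None:
                yield sentence
                sentence = None
            continue  # other structural tags (e.g. <doc>) are skipped here
        if sentence is not None:
            # Token line: tab-separated annotations, XML entities unescaped.
            fields = [html.unescape(f) for f in line.split("\t")]
            sentence.append(dict(zip(COLUMNS, fields)))


# Invented sample data in the assumed layout.
sample = """<doc url="http://example.org/" country="GB">
<s>
Cows\tNNS\tO\t2\tnsubj
graze\tVBP\tO\t0\troot
.\t.\tO\t2\tpunct
</s>
</doc>
""".splitlines()

sentences = list(read_sentences(sample))
```

For a downloaded file, the same function could be fed an open file handle instead of `sample`, since it only iterates over lines and never loads the whole corpus into memory.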