ENCOW14 is the English web corpus of the COW initiative, created with the initiative's 2014 technology. It is available through the portal for COW corpora at webcorpora.org, which runs our custom Colibri² corpus portal software.
It comes in two versions: the full version “A” and a sentence shuffle “AX” (see the comparison table below). The AX version has the advantage of including dependency parses. Only ENCOW14AX is available in the Colibri² web interface and for download.
texrex, rofl and HyDRA are part of texrex-neuedimensionen. Other software used: Heritrix, Ucto, TreeTagger, MaltParser. This product includes GeoLite data created by MaxMind, available from http://www.maxmind.com.
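As a rough illustration of what a sentence shuffle means, the sketch below pools the sentences of all documents and returns them in random order, so that the original documents can no longer be reassembled. It is a minimal sketch and not the procedure used to build ENCOW14AX: the function name `sentence_shuffle`, the `seed` parameter and the toy documents are assumptions made for the example.

```python
import random


def sentence_shuffle(documents, seed=0):
    """Pool the sentences of all documents and return them in random order.

    `documents` is a list of documents, each a list of sentence strings.
    Hypothetical helper for illustration only.
    """
    pool = [sentence for doc in documents for sentence in doc]
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    rng.shuffle(pool)          # in-place shuffle of the pooled sentences
    return pool


# Toy input: two tiny "documents" (invented for the example).
docs = [
    ["First sentence of document one .", "Second sentence of document one ."],
    ["Only sentence of document two ."],
]
shuffled = sentence_shuffle(docs)
for sentence in shuffled:
    print(sentence)
```

After shuffling, document boundaries are gone, which is consistent with the AX column of the table below listing no document count.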
Data downloads and access
- main access (frontend and download) at webcorpora.org (requires registration)
- ngrams (raw and aggregated)
- word and lemma frequency lists
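The downloadable data is distributed as XML with inline VRT (see the Formats row in the overview). The following is a minimal sketch of how such a file could be read line by line. The element names `<doc>` and `<s>`, the column order (word, POS, lemma) and the sample lines are assumptions for illustration; check the actual files for the real layout before relying on it.

```python
import re

# Matches an opening sentence tag such as "<s>" or "<s id='1'>" (assumed tag name).
S_OPEN = re.compile(r"<s[\s>]")


def read_sentences(lines, columns=("word", "pos", "lemma")):
    """Yield each sentence as a list of dicts, one dict per token line.

    The column names are hypothetical; adjust them to the real column order.
    """
    sentence = None
    for raw in lines:
        line = raw.rstrip("\n")
        if S_OPEN.match(line):
            sentence = []                 # a new sentence starts
        elif line.startswith("</s>"):
            if sentence is not None:
                yield sentence            # sentence complete
            sentence = None
        elif line.startswith("<") or not line:
            continue                      # other markup or blank line
        elif sentence is not None:
            # Token line: tab-separated annotation columns.
            sentence.append(dict(zip(columns, line.split("\t"))))


# Invented sample in the assumed layout.
sample = [
    '<doc url="http://example.org/">\n',
    "<s>\n",
    "The\tDT\tthe\n",
    "corpora\tNNS\tcorpus\n",
    "</s>\n",
    "</doc>\n",
]
sentences = list(read_sentences(sample))
for token in sentences[0]:
    print(token["word"], token["pos"], token["lemma"])
```

Reading line by line rather than loading a whole file into an XML parser keeps memory use flat, which matters for slices of this size.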
Corpus property overview
Property | A | AX |
---|---|---|
Type | full documents | sentence shuffle |
Tokens | 16,821,840,292 | 9,578,828,861 |
Sentences | 608,385,401 | 425,374,806 |
Documents | 9,216,176 | – |
Slices | 16 | 16 |
Formats | XML with inline VRT | XML with inline VRT |
License | not publicly available | COW Terms of Use 2 |
Release | Q1/2015 | Q1/2015 |
TLDs | uk, ca, com, org, ... | uk, ca, com, org, ... |
Crawled in | 2012, 2014 | 2012, 2014 |
Crawling | Heritrix 1.14 BFS | Heritrix 1.14 BFS |
Boilerplate removal | texrex MLP, non-destructive | texrex MLP, destructive |
Near-duplicate removal | texrex w-Shingling | texrex w-Shingling |
In-document deduping | texrex, paragraphs | texrex, paragraphs |
Hyphenation removal | HyDRA, destructive | HyDRA, destructive |
Run-together repair | rofl, destructive | rofl, destructive |
URL meta | Yes | Yes |
IP meta | Yes | – |
Last-modified meta | Yes | Yes |
crawldate meta | Yes | Yes |
Document quality meta | texrex Badness | texrex Badness |
HTML title meta | Yes | – |
HTML keywords meta | Yes | – |
Boilerplate score meta | Yes | Yes |
Country meta | GeoLite | GeoLite |
Region meta | GeoLite | – |
City meta | GeoLite | GeoLite |
Register meta | – | – |
Word tokenization | Ucto + custom | Ucto + custom |
Sentence tokenization | Ucto | Ucto |
POS | TreeTagger, Penn | TreeTagger, Penn |
Lemma | TreeTagger | TreeTagger |
Named Entity | – | – |
Chunks | TreeTagger | TreeTagger |
Dependency Head | – | Malt, standard |
Dependency Relation | – | Malt, standard |
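The table lists w-Shingling as the near-duplicate removal method. As a generic illustration of the technique (not texrex's actual implementation), the sketch below represents each document as the set of its word w-grams and compares two documents by the Jaccard resemblance of those sets. The function names, the default `w=4` and the sample texts are assumptions for the example.

```python
def shingles(tokens, w=4):
    """Return the set of all contiguous w-token sequences (w-shingles)."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(text_a, text_b, w=4):
    """Jaccard resemblance of the w-shingle sets of two whitespace-tokenized texts."""
    set_a = shingles(text_a.split(), w)
    set_b = shingles(text_b.split(), w)
    union = set_a | set_b
    if not union:
        return 0.0  # both texts are shorter than w tokens
    return len(set_a & set_b) / len(union)


# Invented sample texts.
doc_a = "a rose is a rose is a rose"
doc_b = "a rose is a flower which is a rose"
score = resemblance(doc_a, doc_b)
print(f"resemblance: {score:.3f}")
```

In a deduplication pipeline, document pairs whose resemblance exceeds a chosen threshold would be treated as near-duplicates. Large-scale systems usually approximate this comparison with hashing rather than comparing full shingle sets.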