ESCOW14

ESCOW14 is the Spanish web corpus by COW, created with the 2014 technology of the COW initiative. It is available through the portal for COW corpora at webcorpora.org.

It comes in two versions, A and AX (cf. the comparison table below). In the table, texrex refers to our own texrex-neuedimensionen web corpus cleaning software. Other software used: Heritrix, Ucto, FreeLing. This product includes GeoLite data created by MaxMind, available from http://www.maxmind.com.

Data downloads and access

ESCOW14A(X) properties

Property                  A                                AX
Type                      full documents                   sentence shuffle
Tokens                    7,309,251,441                    3,680,794,644
Sentences                 329,229,514                      149,891,577
Documents                 5,000,000
Slices                    7                                7
Format                    XML with inline VRT, UTF-8       XML with inline VRT, UTF-8
License                   not publicly available           COW Terms of Use 2
Release                   Q2/2015                          Q2/2015
TLDs                      most Spanish-speaking countries  most Spanish-speaking countries
Crawled in                2012, 2014                       2012, 2014
Crawling                  Heritrix 1.14 BFS                Heritrix 1.14 BFS
Boilerplate removal       texrex MLP, non-destructive      texrex MLP, destructive
Near-duplicate removal    texrex, w-Shingling              texrex, w-Shingling
In-document deduping      texrex, paragraphs               texrex, paragraphs
Hyphenation removal
Run-together repair       RTWords, destructive             RTWords, destructive
URL meta
IP meta
Last-modified meta
crawldate meta
Document quality meta     texrex Badness                   texrex Badness
HTML title meta
HTML keywords meta
Boilerplate score meta
Country meta
Region geoloc meta
City geoloc meta
"Quasi-spontaneous" meta
Word tokenization         Ucto + custom                    Ucto + custom
Sentence tokenization     Ucto                             Ucto
Language identification   FreeLing                         FreeLing
POS                       FreeLing (EAGLES + custom)       FreeLing (EAGLES + custom)
Lemma                     FreeLing                         FreeLing
Chunks
Morphology                FreeLing                         FreeLing
Named Entity              – (planned)                      – (planned)
Dependency Head           – (planned)                      – (planned)
Dependency Relation       – (planned)                      – (planned)
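
The table lists w-Shingling as the near-duplicate removal method. As a rough illustration of the general idea only, the following Python sketch compares two token sequences by the Jaccard resemblance of their sets of w-token shingles. It is not the texrex implementation: the function names, the shingle width w=4, and the example sentences are assumptions made for this illustration, and a production system would typically hash and sample the shingles rather than compare full sets.

```python
def shingles(tokens, w=4):
    """Return the set of contiguous w-token shingles of a token list.

    w=4 is an illustrative choice, not a documented texrex setting.
    """
    if len(tokens) < w:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(tokens_a, tokens_b, w=4):
    """Jaccard resemblance of the shingle sets of two token lists."""
    a, b = shingles(tokens_a, w), shingles(tokens_b, w)
    if not a and not b:
        return 1.0  # two empty texts are trivially identical
    return len(a & b) / len(a | b)


# Hypothetical example texts, tokenized naively on whitespace.
doc1 = "el corpus fue recopilado de la web en 2014".split()
doc2 = "el corpus fue recopilado de la web en 2012".split()
doc3 = "un texto completamente distinto sobre otro tema".split()

print(round(resemblance(doc1, doc2), 3))  # 5 shared of 7 distinct shingles
print(resemblance(doc1, doc3))            # no shared shingles
```

A pair of documents whose resemblance exceeds some chosen threshold would be treated as near-duplicates, and one of the two dropped. The threshold is a tuning decision that this sketch leaves open.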