NLCOW14

NLCOW14 is the Dutch web corpus by COW, created with the 2014 technology of the COW initiative. It is available through the portal for COW corpora at webcorpora.org.

NLCOW14 comes in two versions, A and AX (cf. comparison table below). The tools texrex, rofl and HyDRA are part of texrex-neuedimensionen. Other software used: Heritrix, Ucto, TreeTagger. This product includes GeoLite data created by MaxMind, available from http://www.maxmind.com.

| | A | AX |
|---|---|---|
| Type | full documents | sentence shuffle |
| Tokens | 6,887,226,290 | 4,732,581,841 |
| Sentences | 311,755,017 | 259,717,960 |
| Documents | 5,468,755 | |
| Slices | 7 | 7 |
| Format | XML with inline VRT, UTF-8 | XML with inline VRT, UTF-8 |
| License | not publicly available | COW Terms of Use 2 |
| Release | Q3/2014 | Q3/2014 |
| TLDs | be, nl | be, nl |
| Crawled in | 2011, 2014 | 2011, 2014 |
| Crawling | Heritrix 1.14 BFS | Heritrix 1.14 BFS |
| Boilerplate removal | texrex MLP, non-destructive | texrex MLP, destructive |
| Near-duplicate removal | texrex, w-Shingling | texrex, w-Shingling |
| In-document deduping | texrex, paragraphs | texrex, paragraphs |
| Hyphenation removal | HyDRA, destructive | HyDRA, destructive |
| Run-together repair | rofl, destructive | rofl, destructive |
| URL meta | Yes | Yes |
| IP meta | Yes | |
| Last-modified meta | Yes | Yes |
| crawldate meta | Yes | Yes |
| Document quality meta | texrex Badness | texrex Badness |
| HTML title meta | Yes | |
| HTML keywords meta | Yes | |
| Boilerplate score meta | Yes | Yes |
| Country meta | GeoLite | GeoLite |
| Region meta | GeoLite | |
| City meta | GeoLite | GeoLite |
| Register meta | | |
| Word tokenization | Ucto + custom | Ucto + custom |
| Sentence tokenization | Ucto | Ucto |
| POS | TreeTagger | TreeTagger |
| Lemma | TreeTagger | TreeTagger |
| Named Entity | | |
| Chunks | | |
| Dependency Head | | |
| Dependency Relation | | |
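
The table lists w-Shingling as the near-duplicate removal method. As a rough illustration of the general technique only, the following Python sketch compares two token sequences by the overlap of their w-token shingles (Jaccard resemblance). It is not the texrex implementation: the function names, the shingle width and the sample sentences are assumptions made for this example, and the page does not state which parameters or thresholds texrex uses.

```python
def shingles(tokens, w=5):
    """Return the set of all contiguous w-token subsequences (shingles).

    Sequences shorter than w yield a single shingle holding all tokens;
    an empty sequence yields an empty set.
    """
    if len(tokens) < w:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(tokens_a, tokens_b, w=5):
    """Jaccard resemblance of the shingle sets of two token sequences."""
    set_a, set_b = shingles(tokens_a, w), shingles(tokens_b, w)
    if not set_a and not set_b:
        return 1.0  # two empty documents are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    # Hypothetical sample sentences, differing only in the last token.
    doc_a = "de kat zit op de mat in de kamer".split()
    doc_b = "de kat zit op de mat in de keuken".split()
    # 6 shared shingles out of 8 distinct ones for w=3.
    print(resemblance(doc_a, doc_b, w=3))  # prints 0.75
```

In a deduplication pipeline, document pairs whose resemblance exceeds a chosen threshold would be treated as near-duplicates and all but one copy discarded. Large-scale systems usually approximate this comparison with hashing rather than comparing full shingle sets pairwise.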