SVCOW14 is the Swedish web corpus by COW created with the 2014 technology of the COW initiative. Available through the portal for COW corpora at

It comes in two versions, A and AX (cf. the comparison table below). texrex, rofl and HyDRA are part of texrex-neuedimensionen. Other software used: Heritrix, Ucto, HunPos. This product includes GeoLite data created by MaxMind, available from

Data downloads and access

Type | full documents | sentence shuffle
Format | XML with inline VRT, UTF-8 | XML with inline VRT, UTF-8
License | not publicly available | COW Terms of Use 2
TLDs | fi, se | fi, se
Crawled in | 2011, 2014 | 2011, 2014
Crawling | Heritrix 1.14 BFS | Heritrix 1.14 BFS
Boilerplate removal | texrex MLP, non-destructive | texrex MLP, destructive
Near-duplicate removal | texrex, w-Shingling | texrex, w-Shingling
In-document deduping | texrex, paragraphs | texrex, paragraphs
Hyphenation removal | HyDRA, destructive | HyDRA, destructive
Run-together repair | rofl, destructive | rofl, destructive
URL meta | Yes | Yes
IP meta | Yes
Last-modified meta | Yes | Yes
crawldate meta | Yes | Yes
Document quality meta | texrex Badness | texrex Badness
HTML title meta | Yes
HTML keywords meta | Yes
Boilerplate score meta | Yes | Yes
Country meta | GeoLite | GeoLite
Region meta | GeoLite
City meta | GeoLite | GeoLite
Register meta
Word tokenization | Ucto + custom | Ucto + custom
Sentence tokenization | Ucto | Ucto
POS | HunPos, Parole | HunPos, Parole
Named Entity
Dependency Head
Dependency Relation
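
Since both versions are distributed as XML with inline VRT, a reader for that layout may be a useful starting point. The following is a minimal sketch, not the COW project's own tooling. It assumes the usual VRT convention: structural tags such as <doc> and <s> stand on their own lines, and every other line is one token with tab-separated annotation columns. The sample data, the attribute names (url, country), the column order (word form, then POS) and the tag values are hypothetical placeholders; check them against the actual corpus files before relying on them.

```python
import html
import re
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical sample in an assumed layout: attribute names, column order
# (word form, POS tag) and the tag values are placeholders, not SVCOW14 data.
SAMPLE = (
    '<doc url="http://example.se/" country="SE">\n'
    "<s>\n"
    "Det\tPOS_A\n"
    "är\tPOS_B\n"
    "bra\tPOS_C\n"
    ".\tPOS_D\n"
    "</s>\n"
    "</doc>\n"
)

# A structural line is a single complete tag; anything else is a token line.
TAG_RE = re.compile(r"^<(/?)([A-Za-z][\w-]*)[^>]*>$")
ATTR_RE = re.compile(r'([\w:-]+)="([^"]*)"')

Token = Tuple[str, ...]
Sentence = Tuple[Dict[str, str], List[Token]]


def read_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield (document attributes, tokens) for each <s> element."""
    doc_meta: Dict[str, str] = {}
    tokens: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        tag = TAG_RE.match(line)
        if tag is None:
            # Token line: split the annotation columns and undo XML escaping.
            tokens.append(tuple(html.unescape(c) for c in line.split("\t")))
            continue
        closing, name = bool(tag.group(1)), tag.group(2)
        if name == "doc" and not closing:
            doc_meta = {k: html.unescape(v) for k, v in ATTR_RE.findall(line)}
        elif name == "s" and not closing:
            tokens = []
        elif name == "s" and closing:
            yield doc_meta, tokens
            tokens = []


if __name__ == "__main__":
    for meta, sentence in read_sentences(SAMPLE.splitlines()):
        print(meta.get("country"), [word for word, *_ in sentence])
```

Reading line by line instead of building a full XML tree keeps memory use flat on very large files. For a real corpus file, pass an open file handle (opened with encoding="utf-8") in place of SAMPLE.splitlines().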