SVCOW14

SVCOW14 is the Swedish web corpus from the COW initiative, created with its 2014 technology. It is available through the portal for COW corpora at webcorpora.org.

It comes in two versions, A and AX (cf. the comparison table below). The tools texrex, rofl and HyDRA are part of texrex-neuedimensionen. Other software used: Heritrix, Ucto, HunPos. This product includes GeoLite data created by MaxMind, available from http://www.maxmind.com.

Data downloads and access

Version: A | AX
Type: full documents | sentence shuffle
Tokens: 8,569,582,868 | 4,842,753,707
Sentences: 406,369,940 | 306,599,971
Documents: 6,357,446
Slices: 99
Format: XML with inline VRT, UTF-8 | XML with inline VRT, UTF-8
License: not publicly available | COW Terms of Use 2
Release: Q3/2014 | Q3/2014
TLDs: fi, se | fi, se
Crawled in: 2011, 2014 | 2011, 2014
Crawling: Heritrix 1.14 BFS | Heritrix 1.14 BFS
Boilerplate removal: texrex MLP, non-destructive | texrex MLP, destructive
Near-duplicate removal: texrex, w-Shingling | texrex, w-Shingling
In-document deduping: texrex, paragraphs | texrex, paragraphs
Hyphenation removal: HyDRA, destructive | HyDRA, destructive
Run-together repair: rofl, destructive | rofl, destructive
URL meta: Yes | Yes
IP meta: Yes
Last-modified meta: Yes | Yes
crawldate meta: Yes | Yes
Document quality meta: texrex Badness | texrex Badness
HTML title meta: Yes
HTML keywords meta: Yes
Boilerplate score meta: Yes | Yes
Country meta: GeoLite | GeoLite
Region meta: GeoLite
City meta: GeoLite | GeoLite
Register meta:
Word tokenization: Ucto + custom | Ucto + custom
Sentence tokenization: Ucto | Ucto
POS: HunPos, Parole | HunPos, Parole
Lemma: custom | custom
Named Entity:
Chunks:
Dependency Head:
Dependency Relation:
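
The Format row above lists XML with inline VRT, i.e. one token per line with tab-separated annotations, wrapped in XML structure tags. The following is a minimal sketch of how such a file might be read sentence by sentence. It is not the official COW tooling: the function name read_sentences, the tag names <doc> and <s>, the column order (word, POS, lemma) and the sample tags X1 to X3 are all assumptions made for illustration, so check them against the actual corpus files before relying on this.

```python
import io


def read_sentences(lines, columns=("word", "pos", "lemma")):
    """Yield each sentence as a list of dicts, one dict per token line.

    Assumptions (not confirmed by the corpus documentation):
    - sentences are wrapped in <s ...> ... </s> tags, each on its own line
    - token lines are tab-separated in the order given by `columns`
    - literal '<' and '>' inside tokens are XML-escaped, so any line that
      starts with '<' and ends with '>' is a structure tag
    """
    sentence = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "<s>" or line.startswith("<s "):
            sentence = []                      # a new sentence opens
        elif line == "</s>":
            if sentence is not None:
                yield sentence                 # the sentence is complete
            sentence = None
        elif line.startswith("<") and line.endswith(">"):
            continue                           # other structure tags, e.g. <doc ...>
        elif sentence is not None and line:
            fields = line.split("\t")
            sentence.append(dict(zip(columns, fields)))


# Made-up sample in the assumed layout; X1..X3 are placeholder tags,
# not real Parole tags.
sample = (
    '<doc url="http://example.se/">\n'
    "<s>\n"
    "Det\tX1\tden\n"
    "regnar\tX2\tregna\n"
    ".\tX3\t.\n"
    "</s>\n"
    "</doc>\n"
)

sentences = list(read_sentences(io.StringIO(sample)))
for sent in sentences:
    print(" ".join(token["word"] for token in sent))
```

Because the function only needs an iterable of lines, the same code can be pointed at an opened file, for example open(path, encoding="utf-8"), which matches the UTF-8 encoding stated in the table. Reading line by line keeps memory use flat, which matters for a corpus of several billion tokens.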