SVCOW14 is the Swedish web corpus by COW, created with the 2014 technology of the COW initiative. It is available through the portal for COW corpora at webcorpora.org.
It comes in two versions: A and AX (cf. comparison table below). texrex, rofl and HyDRA are part of texrex-neuedimensionen. Other software used: Heritrix, Ucto, HunPos. This product includes GeoLite data created by MaxMind, available from http://www.maxmind.com.
Data downloads and access
- main access (frontend and download) at webcorpora.org (requires registration)
- ngrams (raw and aggregated)
- word and lemma frequency lists
Property | A | AX |
---|---|---|
Type | full documents | sentence shuffle |
Tokens | 8,569,582,868 | 4,842,753,707 |
Sentences | 406,369,940 | 306,599,971 |
Documents | 6,357,446 | – |
Slices | 9 | 9 |
Format | XML with inline VRT, UTF-8 | XML with inline VRT, UTF-8 |
License | not publicly available | COW Terms of Use 2 |
Release | Q3/2014 | Q3/2014 |
TLDs | fi, se | fi, se |
Crawled in | 2011, 2014 | 2011, 2014 |
Crawling | Heritrix 1.14 BFS | Heritrix 1.14 BFS |
Boilerplate removal | texrex MLP, non-destructive | texrex MLP, destructive |
Near-duplicate removal | texrex, w-Shingling | texrex, w-Shingling |
In-document deduping | texrex, paragraphs | texrex, paragraphs |
Hyphenation removal | HyDRA, destructive | HyDRA, destructive |
Run-together repair | rofl, destructive | rofl, destructive |
URL meta | Yes | Yes |
IP meta | Yes | – |
Last-modified meta | Yes | Yes |
crawldate meta | Yes | Yes |
Document quality meta | texrex Badness | texrex Badness |
HTML title meta | Yes | – |
HTML keywords meta | Yes | – |
Boilerplate score meta | Yes | Yes |
Country meta | GeoLite | GeoLite |
Region meta | GeoLite | – |
City meta | GeoLite | GeoLite |
Register meta | – | – |
Word tokenization | Ucto + custom | Ucto + custom |
Sentence tokenization | Ucto | Ucto |
POS | HunPos, Parole | HunPos, Parole |
Lemma | custom | custom |
Named Entity | – | – |
Chunks | – | – |
Dependency Head | – | – |
Dependency Relation | – | – |
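
The "XML with inline VRT" format listed above can be read line by line: structural XML tags sit on their own lines, and each token occupies one line with tab-separated annotation columns. The following is a minimal sketch of such a reader, not the official COW tooling. The sample data, the tag names (`doc`, `s`), the attribute names, the column order (word, POS, lemma) and the tag labels in the sample are all assumptions made for illustration; check the actual files for the real layout.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical sample in a VRT-like layout. Tag names, attributes, column
# order and the POS labels are illustrative assumptions, not SVCOW14 data.
SAMPLE = (
    '<doc url="http://example.se/" country="SE">\n'
    "<s>\n"
    "Det\tPRON\tdet\n"
    "är\tVERB\tvara\n"
    "bra\tADJ\tbra\n"
    ".\tPUNCT\t.\n"
    "</s>\n"
    "<s>\n"
    "Hej\tINTJ\thej\n"
    "</s>\n"
    "</doc>\n"
)

Token = Tuple[str, ...]


def iter_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield each sentence as a list of token tuples (one tuple per line)."""
    sentence: List[Token] = []
    inside = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        # A line without tabs that starts with "<" is treated as a structural tag.
        is_tag = line.startswith("<") and "\t" not in line
        if is_tag and (line == "<s>" or line.startswith("<s ")):
            sentence, inside = [], True
        elif is_tag and line == "</s>":
            if inside:
                yield sentence
            inside = False
        elif is_tag:
            continue  # other structural tags such as <doc ...> are skipped here
        elif inside:
            sentence.append(tuple(line.split("\t")))


if __name__ == "__main__":
    for sent in iter_sentences(SAMPLE.splitlines()):
        print(" ".join(tok[0] for tok in sent))
```

Because the reader streams over lines rather than building an XML tree, it can be pointed at an open file handle for one slice at a time, which matters for corpora of this size.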