Author Archives: Roland Schäfer

RanDECOW17

RanDECOW17 is a German web corpus created with the 2016 technology of the COW initiative. It is not based on breadth-first crawls; instead, it was “crawled” using the ClaraX research crawler developed in Roland Schäfer’s third-party funded project Linguistic Web Characterisation.

RanDECOW17 was released in 2019 including the COReX document feature annotation. A version of RanDECOW17 is available through NoSketchEngine at webcorpora.org. It is not useful for most normal corpus studies.

The development of RanDECOW17 was funded by the DFG (SCHA1916/1-1).

COReX 2018 feature set and databases

The COW & DeReKo Extractor COReX is a tool that takes corpus documents processed with the COWTek18 toolchain and adds document-level features, which represent the distribution of a large number of lexico-grammatical features. It is distributed as part of COWTek18. We have created databases of those features for DECOW16B, RanDECOW, and a subset of DeReKo. You can download them freely here:

https://www.webcorpora.org/opendata/corex18/

The development of COReX and the databases provided here was funded by the DFG (SCHA1916/1-1). It is a joint effort by Roland Schäfer (FU Berlin/DFG-funded) and Felix Bildhauer (IDS Mannheim, project “Corpus Grammar”).

The following table provides a short overview of the COReX feature set in its 2018 iteration. For more, see Bildhauer & Schäfer (2019, in prep.) and Schäfer & Bildhauer (2019, in prep.).

DECOW16 (A and B)

DECOW16 is the German web corpus by COW created with the 2016 technology of the COW initiative. DECOW16A was released in 2017 including the COReX document feature annotation. DECOW16B was released in 2018. The B iteration contains minor fixes and significantly improved topological parses and COReX data. DECOW16 is available through NoSketchEngine at webcorpora.org.

The development of DECOW16A/B was funded by the DFG (SCHA1916/1-1).

Read this: You can only query and download sentence shuffles!

As explained in Section 4.1 of Roland Schäfer (2015) Processing and querying large web corpora with the COW14 architecture, we have to take certain measures in order to stay within the bounds of German copyright law. This means that we only release sentence shuffles, i.e., corpora which are just bags of sentences. In other words, there are no documents in released versions of COW corpora, just single sentences without context. The original URL and some other metadata are recorded for each sentence, however.
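To make the idea of a sentence shuffle concrete, here is a minimal Python sketch. It is not the COW toolchain's actual implementation: the `Sentence` class, the `sentence_shuffle` function, and the example URLs are hypothetical names chosen for illustration. It only shows the principle that documents are dissolved into a single shuffled pool while each sentence keeps its source URL.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentence:
    """One released unit: a sentence plus its per-sentence metadata."""
    text: str
    url: str  # original URL of the document the sentence came from


def sentence_shuffle(documents, seed=None):
    """Flatten documents into one pool of sentences and shuffle the pool.

    `documents` maps a source URL to the list of sentences in that document.
    After shuffling, document order and sentence context are no longer
    recoverable from the order of the pool.
    """
    pool = [
        Sentence(text, url)
        for url, sentences in documents.items()
        for text in sentences
    ]
    random.Random(seed).shuffle(pool)
    return pool


# Hypothetical toy input; real corpora contain millions of documents.
docs = {
    "http://example.org/a": ["Erster Satz.", "Zweiter Satz."],
    "http://example.org/b": ["Dritter Satz."],
}
shuffled = sentence_shuffle(docs, seed=1)
```

Every sentence survives exactly once, and its URL still points back to the page it was taken from, but the text of any single page can no longer be read in sequence.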

FRCOW16

FRCOW16 is the French web corpus by COW created with the 2016 technology of the COW initiative. It is available through the portal for COW corpora at webcorpora.org. STATUS UPDATE: Planned release date is 30 June 2017 for NoSketchEngine and 15 July 2017 for the shuffle XML version.

RStudio Server and Python

We have an RStudio Server installation running on webcorpora.org. This allows users to:

  • make scripted queries in Python with the convenient ManaCOW wrappers (in development), using RStudio as a minimal Python IDE, and
  • do statistical analyses of the results in RStudio directly without downloading them.

Users who have already registered on webcorpora.org can apply for an RStudio account by writing us an email.

Note: The open-source version of RStudio Server, which we use, cannot be integrated into a single sign-on environment. Therefore, your NoSkE/download account and your RStudio account are two separate things with two separate passwords. If you have lost your RStudio Server password, please write us an email to have it reset. Please use the email address that you used when you registered on webcorpora.org. If you use a different address, we cannot help you for security reasons.

ENCOW16

ENCOW16 is the English web corpus by COW created with the 2016 technology of the COW initiative. It is available through the portal for COW corpora at webcorpora.org, which runs our custom Colibri² corpus portal software.

FRCOW14

FRCOW14 was the planned French web corpus by COW created with the 2014 technology of the COW initiative. Its release was delayed several times due to high workload and the improvements to annotation quality that we needed to implement. Since it is now made entirely with COW16 technology, it is released as FRCOW16.