
Hungarian Webcorpus 2.0

2020 -

The new version of the Hungarian Webcorpus was built from Common Crawl and includes a little over 9 billion words. As opposed to the original Webcorpus, the document structure is left intact. This ensures that long-range dependencies inside documents are retained, and allows the training of models that can exploit them (e.g. BERT, Transformer-XL, etc.).

Usage

The corpus is disseminated as tsv files in the CoNLL-U format. The annotations were produced by emtsv. The corpus can be downloaded in two versions, which differ in the number of fields in the tsv files:

  • The clean corpus contains the surface form, the whitespace after each token, and the lemma and POS tag assigned by emtsv (a reading sketch follows this list).
  • The analyzed corpus additionally includes the ana field, which lists all possible morphological analyses of the surface form.
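
A minimal sketch of how one might iterate over a clean-corpus file is shown below. The field order (form, whitespace, lemma, POS tag) and the example file name are assumptions based on the description above; check the downloaded files before relying on them.

    import gzip

    # Assumed field order for the clean corpus; verify against the actual files.
    FIELDS = ["form", "wsafter", "lemma", "xpostag"]

    def sentences(path):
        """Yield sentences as lists of {field: value} token dicts."""
        opener = gzip.open if path.endswith(".gz") else open
        sentence = []
        with opener(path, "rt", encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:                # a blank line ends a sentence
                    if sentence:
                        yield sentence
                        sentence = []
                elif line.startswith("#"):  # comment / document metadata lines
                    continue
                else:
                    sentence.append(dict(zip(FIELDS, line.split("\t"))))
        if sentence:
            yield sentence

    # Hypothetical file name; substitute one of the downloaded files.
    for sent in sentences("webcorpus2_clean_0000.tsv.gz"):
        print(" ".join(token["form"] for token in sent))
        break

Streaming the files one sentence at a time like this keeps memory use flat, which matters at this corpus size.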

For most use cases, the clean version should be sufficient. At only 83 GB, it is also easier to handle and faster to process than the 511 GB analyzed version. In any case, please note that the corpus is very large, so make sure you have the appropriate infrastructure to process it. Also, please refrain from streaming the data from our repository. The best way to start is to download a few files, familiarize yourself with the format, and iron out any bugs in your code before running it on the whole corpus.

Both folders include a sha256sums file that contains the sha256 checksums of all tsv files. The integrity of the downloaded files can then be checked by running sha256sum -c sha256sums.
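
If the sha256sum utility is not available, the same check can be scripted. The sketch below assumes the usual "<hex digest>  <file name>" layout of the sha256sums file and that the listed files sit in the same directory; it reads each file in chunks so that even multi-gigabyte files do not have to fit in memory.

    import hashlib
    from pathlib import Path

    def sha256_of(path, chunk_size=1 << 20):
        """Compute the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify(sums_file="sha256sums"):
        base = Path(sums_file).parent
        for line in Path(sums_file).read_text().splitlines():
            expected, name = line.split(maxsplit=1)
            actual = sha256_of(base / name.lstrip("*"))
            print(f"{name}: {'OK' if actual == expected else 'MISMATCH'}")

    verify()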

Code for processing the corpus can be downloaded from the cc_corpus GitHub repository. Feature requests should also be submitted there.

If you use the corpus in your research, please cite the corresponding publication:

Nemeskey, Dávid Márk (2020). “Natural Language Processing methods for Language Modeling”. PhD thesis. Eötvös Loránd University.

License

The two subcorpora of Webcorpus 2.0 are available under the following licenses:

  1. The Common Crawl subcorpus is available under the same terms of use as Common Crawl itself; see the Common Crawl Terms of Use for details.
  2. The Wikipedia subcorpus and any columns in the CC subcorpus after the first are licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

By downloading the corpus you agree to use it according to the terms and licenses listed above. In particular:

  • We didn’t produce the crawled content; we just found it on the web. So we do not vouch for the content and are not liable if there is something wrong with it.
  • We will take appropriate action if you let us know about a copyright infringement or other legal concern (inappropriate content, etc).
  • When you tell us about a copyright infringement, you have to: notify us in writing, sign the notification, describe the copyrighted work being infringed, and give us your contact information.
  • Please contact the resource owner (below) for the postal address and for an initial evaluation of the claim.
Resource owner