
Hungarian Webcorpus 2.0

2020 -

The new version of the Hungarian Webcorpus was built from Common Crawl and contains a little over 9 billion words. Unlike in the original Webcorpus, the document structure is left intact. This ensures that long-range dependencies inside documents are retained and allows the training of models that can exploit them (e.g. BERT, Transformer-XL).

Usage

The corpus is disseminated either as plain text files or as tsv files in the CoNLL-U format. The annotations were produced by emtsv. The corpus can be downloaded in three versions; the analyzed versions differ in the number of fields in the tsv files:

  • The text corpus consists of text files in the BERT training format: each sentence is on a separate line, and documents are separated by an empty line; a short reading sketch follows this list. (25GB)
  • The clean corpus contains the surface form, the whitespaces after each token, and the lemma and the POS tag assigned by emtsv. (83GB)
  • The analyzed corpus includes the ana field in addition to the others, which lists all possible morphological analyses of the surface form. (511GB)
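
For the text version, reading the files is straightforward. The following is a minimal sketch of how one might iterate over documents, assuming only the layout described above (one sentence per line, an empty line between documents); the file name is just an illustration, and the gzip handling is only needed if the files you downloaded are compressed.

    import gzip

    def read_documents(path):
        """Yield each document as a list of sentence strings."""
        # Fall back to plain open() for uncompressed files.
        opener = gzip.open if path.endswith(".gz") else open
        doc = []
        with opener(path, "rt", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    doc.append(line)
                elif doc:          # an empty line closes the current document
                    yield doc
                    doc = []
        if doc:                    # the last document may lack a trailing empty line
            yield doc

    # Example: count the sentences in each document of one downloaded file
    # (the file name is hypothetical).
    for i, doc in enumerate(read_documents("webcorpus2_text_0001.txt.gz")):
        print(i, len(doc))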

Which version to choose? If morphological information is not required (e.g. for training word embeddings), the text version (or a subset of it) is the best choice, as it takes up the least space. Otherwise, the clean version should be sufficient for most use cases; it is much easier to handle and faster to process than the 511GB analyzed version. In any case, please note that the corpus is very large, so make sure you have the appropriate infrastructure to process it, and please refrain from streaming the data from our repository. The best way to start is to download a few files, familiarize yourself with the format, and iron out any bugs in your code before running it on the whole corpus.
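
For the clean (and analyzed) versions, a simple tab-separated parser is usually enough. The sketch below makes a few assumptions that should be checked against the actual files: that each file starts with a tab-separated header line naming the columns (e.g. form, wsafter, lemma and xpostag in the clean version), that sentences are separated by empty lines, and that lines starting with '#' carry document-level comments.

    import gzip

    def read_sentences(path):
        """Yield each sentence as a list of {column: value} dicts."""
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", encoding="utf-8") as f:
            header, sentence = None, []
            for line in f:
                line = line.rstrip("\n")
                if line.startswith("#"):      # document boundaries / comments
                    continue
                if not line:                  # an empty line ends a sentence
                    if sentence:
                        yield sentence
                        sentence = []
                    continue
                fields = line.split("\t")
                if header is None:            # treat the first data-like line as the header
                    header = fields
                    continue
                sentence.append(dict(zip(header, fields)))
            if sentence:
                yield sentence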

All folders include a sha256sums file that contains the sha256 checksums for all tsv files. The integrity of the corpus can then be checked by running sha256sum -c sha256sums in each folder.
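
If the sha256sum utility is not available on your platform, the same check can be done with a few lines of Python. This is only a sketch, assuming the usual "<hash>  <filename>" layout of the sha256sums file, and it should be run inside the downloaded folder.

    import hashlib
    from pathlib import Path

    def verify(sums_file="sha256sums"):
        """A minimal stand-in for 'sha256sum -c' run inside one corpus folder."""
        base = Path(sums_file).parent
        all_ok = True
        for line in Path(sums_file).read_text().splitlines():
            expected, name = line.split(maxsplit=1)
            name = name.lstrip("*")           # binary-mode entries are prefixed with '*'
            digest = hashlib.sha256()
            with open(base / name, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            ok = digest.hexdigest() == expected
            all_ok = all_ok and ok
            print(f"{name}: {'OK' if ok else 'FAILED'}")
        return all_ok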

Code for processing the corpus can be downloaded from the cc_corpus GitHub repository. Feature requests should also be reported there.

If you use the corpus in your research, please cite the corresponding publication:

Nemeskey, Dávid Márk (2020). “Natural Language Processing methods for Language Modeling”. PhD thesis. Eötvös Loránd University.

License

The two subcorpora of Webcorpus 2.0 are available under the following licenses:

  1. The Common Crawl subcorpus is available under the same terms of use as Common Crawl itself; see the Common Crawl terms of use for details.
  2. The Wikipedia subcorpus, as well as any columns in the CC subcorpus after the first, is licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

By downloading the corpus you agree to use it according to the terms and licenses listed above. In particular:

  • We did not produce the crawled content; we only found it on the web. Consequently, we do not vouch for the content and are not liable if there is something wrong with it.
  • We will take appropriate action if you let us know about a copyright infringement or other legal concern (inappropriate content, etc.).
  • When you tell us about a copyright infringement, you have to: notify us in writing, sign the notification, describe the copyrighted work being infringed, and give us your contact information.
  • Please contact the resource owner (below) for the postal address and for an initial evaluation of the claim.
Resource owner