SZTAKI HLT | Hungarian Webcorpus 2.0

Hungarian Webcorpus 2.0

2020 -

The new version of the Hungarian Webcorpus was built from Common Crawl and contains a little over 9 billion words. Unlike the original Webcorpus, it leaves the document structure intact. This retains long-range dependencies inside documents and allows the training of models that can exploit them (e.g. BERT, Transformer-XL).

Usage

The corpus can be downloaded from this repository.

Please note that the corpus is about 500 GB in size, so make sure you have the appropriate infrastructure to process it. Also, please refrain from streaming the data from our repository. The best way to start is to download a few files, familiarize yourself with them, and iron out any bugs in your code before running it on the whole corpus.
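As a minimal sketch of that "start small" workflow, the Python snippet below counts the first-column entries in a handful of files that have already been downloaded to a local directory. Several things here are assumptions rather than documented facts: that the files are tab-separated text (the license section only mentions "columns"), that they may be gzip-compressed, and that lines starting with "#" are comments to skip. The function names and the sample directory are hypothetical; adjust them to the actual file layout.

```python
import gzip
from pathlib import Path
from typing import Iterator, TextIO


def open_text(path: Path) -> TextIO:
    """Open a plain or gzip-compressed file as UTF-8 text."""
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def first_column(path: Path) -> Iterator[str]:
    """Yield the first tab-separated field of each data line.

    Assumption: empty lines and lines starting with '#' carry no token.
    """
    with open_text(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            yield line.split("\t", 1)[0]


def count_tokens(sample_dir: Path, limit: int = 3) -> int:
    """Count first-column entries in at most `limit` files of a local sample."""
    files = sorted(p for p in sample_dir.iterdir() if p.is_file())[:limit]
    return sum(1 for path in files for _ in first_column(path))
```

Calling count_tokens(Path("sample"), limit=3) on a small local directory is a cheap way to check that parsing works before the same code is pointed at the full download.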

Code for processing the corpus can be downloaded from the cc_corpus GitHub repository. Feature requests should also be reported there.

If you use the corpus in your research, please cite the corresponding publication:

Nemeskey, Dávid Márk (2020). “Natural Language Processing methods for Language Modeling”. PhD thesis. Eötvös Loránd University.

License

The two subcorpora of Webcorpus 2.0 are available under the following licenses:

  1. The Common Crawl subcorpus is available under the same terms of use as Common Crawl itself is; see here for details.
  2. The Wikipedia subcorpus, as well as any columns in the CC subcorpus after the first, are licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

By downloading the corpus you agree to use it according to the terms and licenses listed above. In particular:

  • We didn’t produce the crawled content; we just found it on the web. So we are not vouching for the content, nor are we liable if there is something wrong with it.
  • We will take appropriate action if you let us know about a copyright infringement or other legal concern (inappropriate content, etc.).
  • When you tell us about a copyright infringement, you have to: notify us in writing, sign the notification, describe the copyrighted work being infringed, and give us your contact information.
  • Please contact the resource owner (below) for the postal address and for an initial evaluation of the claim.

Resource owner