emLam
A Language Modeling Benchmark Corpus for Hungarian, similar to the One Billion Word corpus (Chelba et al., 2014) for English.
The emLam corpus is a filtered and preprocessed version of the Hungarian Webcorpus. It is available under the Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0) license.
The two versions of the corpus below have been preprocessed with the e-magyar
pipeline, implemented as a GATE plugin. Due to errors in the pipeline, the contents of one file (web2-4p-2-05.gz) are missing from the corpus. Please note that the results in the MSZNY paper were based on a different version of the corpus, which was parsed with the deprecated hun* tools.
The corpus is available in a 90%-5%-5% train/validation/test split, in two formats:
- A raw, word-level version; approx. 480M tokens:
- A "gluten-free" (GLF) version, in which each word has been split into lemma and inflection tokens (see the sketch below); approx. 660M tokens:
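To make the GLF format concrete: an inflected form such as házakban ('in houses') would appear as the lemma ház followed by separate tokens for the plural and inessive markers, with the exact tag notation determined by the e-magyar analysis. Below is a minimal sketch of how the gzipped shards of either format might be consumed, assuming one sentence per line with space-separated tokens; the emLam/train path and shard layout are illustrative, not the documented ones:

```python
import gzip
from pathlib import Path

def iter_sentences(split_dir):
    """Yield token lists from every .gz shard in a split directory.

    Assumes one sentence per line with space-separated tokens, as in
    the One Billion Word corpus; the directory layout is an assumption.
    """
    for shard in sorted(Path(split_dir).glob("*.gz")):
        with gzip.open(shard, "rt", encoding="utf-8") as f:
            for line in f:
                tokens = line.split()
                if tokens:
                    yield tokens

# Example: count tokens in a (hypothetical) training split directory.
if __name__ == "__main__":
    total = sum(len(sent) for sent in iter_sentences("emLam/train"))
    print(f"{total:,} training tokens")
```

The same loop serves both formats; in the GLF version each yielded token is either a lemma or an inflection tag rather than a surface word.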
The preprocessing scripts used to generate the corpus are available in the emLam repository.
If you use the corpus or the repository in your project, please cite the following paper (see the link for the BibTeX entry):
Dávid Márk Nemeskey. 2017. emLam – a Hungarian Language Modeling baseline. In Proceedings of the 13th Conference on Hungarian Computational Linguistics (MSZNY 2017).