
huBERT: Hungarian BERT models

2020 -

This page lists deep neural language models (BERT, Electra, etc.) trained on the Hungarian Webcorpus 2.0.

Models

huBERT

A cased model trained on Webcorpus 2.0 and a snapshot of the Hungarian Wikipedia. It can be downloaded in two formats; see the Usage section below for how to load each.

Wikipedia

Models trained on the Hungarian Wikipedia. The details of the training can be found in the paper mentioned at the bottom of the page.

  • huBERT cased: the usual cased model
  • huBERT lowercased: a lowercased (not to be confused with the usual uncased) model. Unlike in English, diacritics are distinctive in Hungarian, so we kept them intact; see the example below.
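
Since diacritics survive lowercasing, Python's built-in str.lower() is all that is needed to preprocess input for the lowercased models. A quick illustration (the example word is arbitrary):

print("Árvíztűrő Tükörfúrógép".lower())  # prints: árvíztűrő tükörfúrógép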

Performance

All models outperform the multilingual BERT model on masked LM, NER and NP chunking, and the full huBERT outperforms the Wikipedia models by about 0.5% F1 on both extrinsic tasks. In particular, it is the current state of the art in NP chunking (table from the emBERT repository):

Task                      Training corpus       multi-BERT F1   huBERT wiki F1   huBERT F1
named entity recognition  Szeged NER corpus     97.08%          97.03%          97.62%
base NP chunking          Szeged TreeBank 2.0   95.58%          96.64%          97.14%
maximal NP chunking       Szeged TreeBank 2.0   95.05%          96.41%          96.97%
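
Both extrinsic tasks are token classification problems. As a minimal sketch of fine-tuning huBERT for NER with transformers (the label list here is illustrative, not the Szeged tag set):

from transformers import AutoModelForTokenClassification, AutoTokenizer

# Illustrative BIO label set; substitute the tags of your training corpus
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained("SZTAKI-HLT/hubert-base-cc")
model = AutoModelForTokenClassification.from_pretrained(
    "SZTAKI-HLT/hubert-base-cc", num_labels=len(labels)
)
# Fine-tune with your own training loop or the transformers Trainer API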

Usage

The BERT checkpoints can be loaded into TensorFlow as-is. The recommended way to use the models, however, is to convert them to PyTorch and load (and fine-tune) them with the transformers library.
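
For BERT-style TF1 checkpoints, the conversion can be done directly with transformers. A minimal sketch, assuming the checkpoint was unpacked to a hypothetical hubert/ directory (TensorFlow must be installed for from_tf=True to work):

from transformers import BertConfig, BertModel

# The paths below are placeholders for wherever the downloaded checkpoint lives
config = BertConfig.from_json_file("hubert/bert_config.json")
model = BertModel.from_pretrained("hubert/model.ckpt.index", from_tf=True, config=config)
model.save_pretrained("hubert-pytorch")  # writes config.json and pytorch_model.bin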

huBERT has also been uploaded to the Hugging Face Model Hub, so it can be used under both PyTorch and TF 2.0 like any other official transformers model:

from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model weights from the Hugging Face Model Hub
tokenizer = AutoTokenizer.from_pretrained("SZTAKI-HLT/hubert-base-cc")
model = AutoModel.from_pretrained("SZTAKI-HLT/hubert-base-cc")
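
For a quick sanity check of the masked LM head, the fill-mask pipeline can be used; the example sentence below is ours, not taken from the evaluation data:

from transformers import pipeline

# Loads the model with its LM head and predicts the [MASK] token
fill_mask = pipeline("fill-mask", model="SZTAKI-HLT/hubert-base-cc")
print(fill_mask("Budapest Magyarország [MASK]."))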

Lowercased models

A note on lowercased models: unfortunately, transformers does not support lowercased models out of the box. Make sure to pass do_lower_case=False to BertTokenizer when loading either the cased or the lowercased models, and lowercase the input text manually for the latter. E.g.:

from transformers import BertTokenizer

# do_lower_case=False prevents the tokenizer from stripping accents;
# lowercase the input manually instead
tokenizer = BertTokenizer.from_pretrained('path/to/hubert_wiki_lower', do_lower_case=False)
tokens = tokenizer.tokenize('My Hungarian text'.lower())

Citation

If you use the models on this page, please cite (chapters 4-5 of) the following publication:

Nemeskey, Dávid Márk (2020). “Natural Language Processing methods for Language Modeling”. PhD thesis. Eötvös Loránd University.
