
huBERT: Hungarian BERT models

2020 -

This page lists deep neural language models (BERT, Electra, etc.) trained on the Hungarian Webcorpus 2.0.

Models

huBERT

A cased model trained on Webcorpus 2.0 and a snapshot of the Hungarian Wikipedia. It can be downloaded in two formats; see the Usage section below for how to load each.

Wikipedia

Models trained on the Hungarian Wikipedia. The details of the training can be found in the paper mentioned at the bottom of the page.

  • huBERT cased: the usual cased model
  • huBERT lowercased: a lowercased (not to be confused with the usual uncased) model. Unlike in English, diacritics are distinctive in Hungarian, so we kept them intact; see the example below.
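
Since diacritics survive lowercasing, Python's built-in str.lower() is all that is needed to preprocess input for the lowercased models. A quick illustration (the example word is arbitrary):

print("Árvíztűrő Tükörfúrógép".lower())  # prints: árvíztűrő tükörfúrógép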

Performance

All models outperform the multilingual BERT model on masked LM, NER and NP chunking, and the full huBERT outperforms the Wikipedia models by about 0.5% F1 on both extrinsic tasks. In particular, it is the current state of the art in NP chunking (table from the emBERT repository):

Task                      Training corpus       multi-BERT F1   huBERT wiki F1   huBERT F1
named entity recognition  Szeged NER corpus     97.08%          97.03%          97.62%
base NP chunking          Szeged TreeBank 2.0   95.58%          96.64%          97.14%
maximal NP chunking       Szeged TreeBank 2.0   95.05%          96.41%          96.97%
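
Both extrinsic tasks are token classification problems. As a minimal sketch of fine-tuning huBERT for NER with transformers (the label list here is illustrative, not the Szeged tag set):

from transformers import AutoModelForTokenClassification, AutoTokenizer

# Illustrative BIO label set; substitute the tags of your training corpus
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained("SZTAKI-HLT/hubert-base-cc")
model = AutoModelForTokenClassification.from_pretrained(
    "SZTAKI-HLT/hubert-base-cc", num_labels=len(labels)
)
# Fine-tune with your own training loop or the transformers Trainer API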

Usage

The BERT checkpoints can be loaded into TensorFlow as-is. The recommended way to use the models, however, is to convert them to PyTorch and load (and fine-tune) them with the transformers library.
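
For BERT-style TF1 checkpoints, the conversion can be done directly with transformers. A minimal sketch, assuming the checkpoint was unpacked to a hypothetical hubert/ directory (TensorFlow must be installed for from_tf=True to work):

from transformers import BertConfig, BertModel

# The paths below are placeholders for wherever the downloaded checkpoint lives
config = BertConfig.from_json_file("hubert/bert_config.json")
model = BertModel.from_pretrained("hubert/model.ckpt.index", from_tf=True, config=config)
model.save_pretrained("hubert-pytorch")  # writes config.json and pytorch_model.bin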

huBERT has also been uploaded to the Hugging Face Model Hub, so it can be used under both PyTorch and TF 2.0 like any other official transformers model:

from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model weights from the Hugging Face Model Hub
tokenizer = AutoTokenizer.from_pretrained("SZTAKI-HLT/hubert-base-cc")
model = AutoModel.from_pretrained("SZTAKI-HLT/hubert-base-cc")
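
For a quick sanity check of the masked LM head, the fill-mask pipeline can be used; the example sentence below is ours, not taken from the evaluation data:

from transformers import pipeline

# Loads the model with its LM head and predicts the [MASK] token
fill_mask = pipeline("fill-mask", model="SZTAKI-HLT/hubert-base-cc")
print(fill_mask("Budapest Magyarország [MASK]."))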

Lowercased models

A note on lowercased models: unfortunately, transformers does not support lowercased models out of the box. Make sure to pass do_lower_case=False to BertTokenizer when loading either the cased or the lowercased models, and lowercase the input text manually for the latter. E.g.:

from transformers import BertTokenizer

# do_lower_case=False prevents the tokenizer from stripping accents;
# lowercase the input manually instead
tokenizer = BertTokenizer.from_pretrained('path/to/hubert_wiki_lower', do_lower_case=False)
tokens = tokenizer.tokenize('My Hungarian text'.lower())

Citation

If you use the models on this page, please cite (chapters 4-5 of) the following publication:

Nemeskey, Dávid Márk (2020). “Natural Language Processing methods for Language Modeling”. PhD thesis. Eötvös Loránd University.
