huBERT: Hungarian BERT models
This page lists deep neural language models (BERT, ELECTRA, etc.) trained on the Hungarian Webcorpus 2.0.
Models
huBERT
A cased model trained on Webcorpus 2.0 and a snapshot of the Hungarian Wikipedia. It can be downloaded in two formats:
- as a raw TensorFlow checkpoint output by the official BERT training code.
- as a named Hugging Face model, which can be downloaded, fine-tuned and used like any other model in the Hugging Face Model Hub.
Wikipedia
Models trained on the Hungarian Wikipedia. The details of the training can be found in the paper mentioned at the bottom of the page.
- huBERT cased: the usual cased model
- huBERT lowercased: a lowercased (not to be confused with the usual uncased) model. Unlike in English, diacritics in Hungarian are distinctive, so we kept them intact (see the example below).
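For illustration (a minimal Python sketch; the sample phrase is arbitrary), plain Unicode lowercasing changes only the case and leaves the diacritics in place:

# Only the case changes; the diacritics survive
print('Árvíztűrő Tükörfúrógép'.lower())  # árvíztűrő tükörfúrógép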
Performance
All models outperform the multilingual BERT model on masked LM, NER and NP chunking, and the full huBERT
outperforms the Wikipedia models by 0.5% on both extrinsic tasks. In particular, it is the current state of the art in NP chunking (table from the emBERT repository):
Task | Training corpus | multi-BERT F1 | huBERT wiki F1 | huBERT F1 |
---|---|---|---|---|
named entity recognition | Szeged NER corpus | 97.08% | 97.03% | 97.62% |
base NP chunking | Szeged TreeBank 2.0 | 95.58% | 96.64% | 97.14% |
maximal NP chunking | Szeged TreeBank 2.0 | 95.05% | 96.41% | 96.97% |
Usage
The BERT checkpoints can be loaded into TensorFlow as-is. The recommended way to use the models, however, is to convert them to PyTorch and load (and fine-tune) them with the transformers library.
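One way to do the conversion (a minimal sketch, assuming TensorFlow is installed; the file names are placeholders for the standard outputs of the BERT training code) is the load_tf_weights_in_bert helper shipped with transformers:

from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

# Placeholder paths for the files in the downloaded checkpoint directory
config = BertConfig.from_json_file('checkpoint/bert_config.json')
model = BertForPreTraining(config)
# Copies the TF checkpoint weights into the PyTorch model
load_tf_weights_in_bert(model, config, 'checkpoint/bert_model.ckpt')
model.save_pretrained('hubert-pytorch')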
huBERT has also been uploaded to the Hugging Face Model Hub. It can be used under both PyTorch and TF 2.0 like any other official transformers model:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("SZTAKI-HLT/hubert-base-cc")
model = AutoModel.from_pretrained("SZTAKI-HLT/hubert-base-cc")
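As a quick sanity check (a sketch assuming PyTorch; the input sentence is arbitrary), the loaded model can be run on a piece of Hungarian text:

import torch

inputs = tokenizer("Jó napot kívánok!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# (batch_size, sequence_length, hidden_size); hidden_size is 768 for the base model
print(outputs.last_hidden_state.shape)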
Lowercased models
A note on lowercased models: unfortunately, transformers does not support lowercased models out of the box. Make sure to pass do_lower_case=False to BertTokenizer when loading either the cased or lowercased models, and lowercase the input text manually for the latter. E.g.:
from transformers import BertTokenizer

# do_lower_case=False keeps the tokenizer from lowercasing and stripping the diacritics itself
tokenizer = BertTokenizer.from_pretrained('path/to/hubert_wiki_lower', do_lower_case=False)
tokens = tokenizer.tokenize('My Hungarian text'.lower())
Citation
If you use the models on this page, please cite (chapter 4/5 of) the following publication:
Nemeskey, Dávid Márk (2020). “Natural Language Processing methods for Language Modeling”. PhD thesis. Eötvös Loránd University.