huBERT: Hungarian BERT models
This page lists deep neural language models (BERT, Electra, etc.) trained on the Hungarian Webcorpus 2.0.
huBERT is a cased model trained on Webcorpus 2.0 and a snapshot of the Hungarian Wikipedia. It can be downloaded in two formats:
- as a raw TensorFlow checkpoint output by the official BERT training code;
- as a named Hugging Face model, which can be downloaded, fine-tuned, and used like any other model in the Hugging Face Model Hub.
The Wikipedia models were trained on the Hungarian Wikipedia alone. The details of the training can be found in the paper cited at the bottom of the page.
- huBERT cased: the usual cased model
- huBERT lowercased: a lowercased (not to be confused with the usual uncased) model. Unlike in English, diacritics are distinctive in Hungarian, so we kept them intact.
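To illustrate why lowercasing leaves the diacritics intact: Python's built-in `str.lower()` maps accented uppercase letters to their accented lowercase counterparts, which is exactly the preprocessing the lowercased model expects (the sample string below is illustrative, not taken from the training data):

```python
# str.lower() lowercases accented letters while keeping their diacritics,
# so "Á" becomes "á" rather than "a".
text = "Árvíztűrő Tükörfúrógép"
print(text.lower())  # árvíztűrő tükörfúrógép
```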
All models outperform the multilingual BERT model on masked LM, NER, and NP chunking, and the full huBERT outperforms the Wikipedia models by 0.5% on both extrinsic tasks. In particular, it is the current state of the art in NP chunking (table from the emBERT repository):
| Task | Training corpus | multi-BERT F1 | huBERT wiki F1 | huBERT F1 |
|------|-----------------|---------------|----------------|-----------|
| named entity recognition | Szeged NER corpus | 97.08% | 97.03% | 97.62% |
| base NP chunking | Szeged TreeBank 2.0 | 95.58% | 96.64% | 97.14% |
| maximal NP chunking | Szeged TreeBank 2.0 | 95.05% | 96.41% | 96.97% |
huBERT has also been uploaded to the Hugging Face Model Hub. It can be used with both PyTorch and TensorFlow 2.0 like any other model:
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("SZTAKI-HLT/hubert-base-cc")
model = AutoModel.from_pretrained("SZTAKI-HLT/hubert-base-cc")
```
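A minimal sketch of running the loaded model under PyTorch, assuming `torch` is installed; the Hungarian example sentence is illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("SZTAKI-HLT/hubert-base-cc")
model = AutoModel.from_pretrained("SZTAKI-HLT/hubert-base-cc")

# Encode an example sentence and run a forward pass without gradients.
inputs = tokenizer("Jó reggelt kívánok!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state holds one 768-dimensional contextual embedding per
# input token (768 is the BERT-base hidden size).
print(outputs.last_hidden_state.shape)
```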
A note on lowercased models: unfortunately, transformers does not support lowercased models out of the box. Make sure to pass `do_lower_case=False` to BertTokenizer when loading either the cased or the lowercased models, and lowercase the input text manually for the latter, e.g.:
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('path/to/hubert_wiki_lower', do_lower_case=False)
tokens = tokenizer.tokenize('My Hungarian text'.lower())
```
If you use the models on this page, please cite (chapter 4/5 of) the following publication:
Nemeskey, Dávid Márk (2020). “Natural Language Processing methods for Language Modeling”. PhD thesis. Eötvös Loránd University.