SZTAKI HLT | Introducing huBERT

Introducing huBERT

Dávid Márk Nemeskey
In XVII. Magyar Számítógépes Nyelvészeti Konferencia, 2021


This paper introduces the huBERT family of models. The flagship is the eponymous BERT Base model trained on the new Hungarian Webcorpus 2.0, a 9-billion-token corpus of Web text collected from the Common Crawl. This model outperforms the multilingual BERT in masked language modeling by a huge margin, and achieves state-of-the-art performance in named entity recognition and NP chunking. The models are freely downloadable.

@InProceedings{ Nemeskey:2021a,
  author = {Nemeskey, Dávid Márk},
  title = {Introducing \texttt{huBERT}},
  booktitle = {{XVII}.\ Magyar Sz{\'a}m{\'i}t{\'o}g{\'e}pes Nyelv{\'e}szeti Konferencia ({MSZNY}2021)},
  year = 2021,
  pages = {TBA},
  address = {Szeged},