SZTAKI HLT | Autoencoder experiments on Hungarian words

Judit Ács

May 11, 2017, 8:15
MTA SZTAKI (Lágymányosi u. 11, Budapest) Room 306 or 506

Judit Ács will present her autoencoder experiments on Hungarian words.

Autoencoders are widely used for dimensionality reduction and compression. In NLP, however, they are mostly applied to word-level features, and character-level features are rarely exploited. We present a series of autoencoder and variational autoencoder experiments on Hungarian words using character unigrams. We extract character unigram features and, in the case of variational autoencoders, add Gaussian noise. We also add ‘realistic’ noise by randomly editing words to within an edit distance of one, in the hope that the autoencoder will learn to behave like a spell checker. Our manual error analysis gives insight into common Hungarian morphological phenomena that could be exploited for text compression. Our results suggest that Hungarian words can be compressed dramatically with little loss in accuracy. Our methods can be applied to other languages with relatively small alphabets.
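
As a rough illustration of the two input-side steps mentioned in the abstract, the sketch below builds a character unigram vector for a word and produces a noisy copy within one edit of it. It is not the code used in the experiments. The alphabet string, the bag-of-characters encoding, the uniform choice among edit operations, and the names `ALPHABET`, `unigram_counts` and `random_edit` are all assumptions made for this example.

```python
import random

# Assumption: lowercase Hungarian letters; digraphs such as "sz" count as two characters.
ALPHABET = "aábcdeéfghiíjklmnoóöőpqrstuúüűvwxyz"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def unigram_counts(word):
    """Bag-of-characters vector: one count per alphabet letter (hypothetical encoding)."""
    vec = [0] * len(ALPHABET)
    for ch in word.lower():
        if ch in INDEX:  # characters outside the assumed alphabet are ignored
            vec[INDEX[ch]] += 1
    return vec


def random_edit(word, rng=random):
    """Return a string within edit distance 1 of `word` (it may be unchanged)."""
    op = rng.choice(["none", "insert", "delete", "substitute"])
    if op == "none" or (not word and op != "insert"):
        return word
    if op == "insert":
        pos = rng.randrange(len(word) + 1)
        return word[:pos] + rng.choice(ALPHABET) + word[pos:]
    pos = rng.randrange(len(word))
    if op == "delete":
        return word[:pos] + word[pos + 1:]
    # substitute: the replacement letter may equal the original one
    return word[:pos] + rng.choice(ALPHABET) + word[pos + 1:]


if __name__ == "__main__":
    rng = random.Random(0)
    word = "alma"
    noisy = random_edit(word, rng)
    print(word, unigram_counts(word))
    print(noisy, unigram_counts(noisy))
```

In a denoising setup along these lines, the vector of the noisy word would serve as the input and the vector of the clean word as the reconstruction target.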