Filtering Wiktionary triangles by linear mapping between distributed word models

Márton Makrai

In Proceedings of the 10th Edition of the Language Resources and Evaluation Conference (LREC), 2016


Triangulation infers word translations between a pair of languages from their translations into other, typically better-resourced languages, called pivots. The method can introduce noise when words in the pivot language are polysemous. The reliability of each triangulated translation is usually estimated by the number of pivot languages that yield it (Tanaka and Umemura, 1994).
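As a rough illustration of triangulation and pivot counting, the following minimal sketch composes source-to-pivot and pivot-to-target dictionaries and records which pivot languages support each candidate pair. The function name `triangulate`, the data layout, and the toy German, English, French and Hungarian entries are assumptions made for this example; they are not taken from the paper or from Wiktionary.

```python
from collections import defaultdict


def triangulate(src_to_pivot, pivot_to_tgt):
    """Return {(source_word, target_word): set of supporting pivot languages}.

    src_to_pivot: {pivot_lang: {source_word: set of pivot words}}
    pivot_to_tgt: {pivot_lang: {pivot_word: set of target words}}
    """
    support = defaultdict(set)
    for lang, src_dict in src_to_pivot.items():
        tgt_dict = pivot_to_tgt.get(lang, {})
        for src_word, pivot_words in src_dict.items():
            for pivot_word in pivot_words:
                # Every target translation of the pivot word closes a triangle.
                for tgt_word in tgt_dict.get(pivot_word, ()):
                    support[(src_word, tgt_word)].add(lang)
    return dict(support)


# Toy data, invented for illustration: German -> pivot -> Hungarian.
de_to_pivot = {
    "en": {"Bank": {"bank", "bench"}},
    "fr": {"Bank": {"banque", "banc"}},
}
pivot_to_hu = {
    # English "bank" is polysemous (financial institution, river bank),
    # so it also yields the unwanted Hungarian "part" (shore).
    "en": {"bank": {"bank", "part"}, "bench": {"pad"}},
    "fr": {"banque": {"bank"}, "banc": {"pad"}},
}

triangles = triangulate(de_to_pivot, pivot_to_hu)
pivot_counts = {pair: len(langs) for pair, langs in triangles.items()}
```

In this toy case the noisy pair ("Bank", "part") is supported by one pivot only, while the two intended translations are supported by two, which is the intuition behind using the pivot count as a reliability estimate.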

Mikolov et al. (2013b) introduce a method for scoring word translations. Translation is formalized as a linear mapping between the distributed vector space models (VSMs) of the two languages. The VSMs are trained on monolingual data, while the mapping is learned in a supervised fashion from a seed dictionary of a few thousand word pairs.
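A minimal sketch of learning such a mapping is given below. It assumes the seed dictionary has already been turned into two row-aligned matrices, `X` (source-language vectors) and `Z` (target-language vectors), and it fits the mapping with closed-form least squares via NumPy rather than with whatever optimizer the original work used. The random toy data, the dimensions and the name `learn_mapping` are assumptions for illustration only.

```python
import numpy as np


def learn_mapping(X, Z):
    """Return W minimizing ||X @ W - Z||^2 (Frobenius norm).

    X: (n_pairs, d_source) source-language vectors of the seed pairs
    Z: (n_pairs, d_target) target-language vectors of the same pairs
    """
    W, *_ = np.linalg.lstsq(X, Z, rcond=None)
    return W


# Synthetic seed dictionary: targets are a noisy linear image of the sources.
rng = np.random.default_rng(0)
d_source, d_target, n_pairs = 5, 4, 50
true_W = rng.normal(size=(d_source, d_target))
X = rng.normal(size=(n_pairs, d_source))
Z = X @ true_W + 0.01 * rng.normal(size=(n_pairs, d_target))

W = learn_mapping(X, Z)

# A source vector x is then mapped into the target space as x @ W.
mapped = X[0] @ W
```

With real embeddings, `X` and `Z` would hold the vectors of the seed word pairs, and a mapped source vector would be compared with target-language vectors to score candidate translations.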

We apply the linear mapping to filter triangulated translations, and show that the scores assigned by the mapping are a smoother measure of merit than the number of pivots. The methods we use are language-independent, and the training data is easy to obtain for many languages. We chose the German-Hungarian pair for evaluation; for this pair, the filtered triangles resulting from our experiments form the largest freely available list of word translations we are aware of.
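The filtering step might look roughly like the sketch below. It assumes that a candidate pair is scored by the cosine similarity between the mapped source vector and the target vector, and that pairs under a chosen threshold are discarded; the paper's exact scoring and cutoff are not stated here, so both are assumptions, as are the names `score_pairs` and `filter_pairs` and the two-dimensional toy vectors.

```python
import numpy as np


def score_pairs(pairs, src_vecs, tgt_vecs, W):
    """Score each (source, target) pair by cosine(src_vec @ W, tgt_vec).

    Pairs with a word missing from either vector model are skipped.
    """
    scores = {}
    for src_word, tgt_word in pairs:
        if src_word in src_vecs and tgt_word in tgt_vecs:
            mapped = src_vecs[src_word] @ W
            target = tgt_vecs[tgt_word]
            norm = np.linalg.norm(mapped) * np.linalg.norm(target)
            scores[(src_word, tgt_word)] = float(mapped @ target / norm)
    return scores


def filter_pairs(scores, threshold):
    """Keep the pairs whose score reaches the (assumed) threshold."""
    return {pair for pair, score in scores.items() if score >= threshold}


# Toy vectors and an identity mapping, invented for illustration.
W = np.eye(2)
src_vecs = {"Bank": np.array([1.0, 0.0])}
tgt_vecs = {
    "bank": np.array([0.9, 0.1]),
    "part": np.array([0.1, 0.9]),
}
candidates = [("Bank", "bank"), ("Bank", "part"), ("Bank", "missing")]

scores = score_pairs(candidates, src_vecs, tgt_vecs, W)
kept = filter_pairs(scores, threshold=0.5)
```

Unlike an integer pivot count, such a score is continuous, so the threshold can be tuned to trade precision against coverage.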

MTA NYI