Hungarian word embeddings
At the first and second exhibitions of the Artificial Intelligence Coalition, human language understanding was represented by researchers from the MTA Research Institute for Linguistics. Visitors could learn about two technologies: the e-magyar Hungarian text-analysis chain and Hungarian word embeddings, which represent words as vectors in systems based on machine learning (i.e. artificial intelligence). For the latter, there were three demos:
- searching for similar words, answering analogy questions, and picking the odd one out with a Hungarian word embedding
- the neighborhood of words, i.e. similar words in the syntactic-semantic space, visualized by Dániel Varga (MTA Rényi)
- a galactic journey in English by Andrei Kashcha.
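All three demo operations reduce to cosine similarity in the embedding space. A minimal sketch with hand-crafted, hypothetical 2-dimensional vectors (a real embedding has hundreds of dimensions learned from a large corpus; the words and coordinates below are chosen purely for illustration):

```python
from math import sqrt

# Toy 2-dimensional "embeddings" (hypothetical, hand-crafted):
# dimension 0 ~ gender, dimension 1 ~ royalty.
vecs = {
    "férfi":    (1.0, 0.0),   # man
    "nő":       (-1.0, 0.0),  # woman
    "király":   (1.0, 2.0),   # king
    "királynő": (-1.0, 2.0),  # queen
    "kutya":    (0.0, -1.0),  # dog
}

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def most_similar(target, exclude=()):
    """Word whose vector has the highest cosine similarity to `target`."""
    return max((w for w in vecs if w not in exclude),
               key=lambda w: cos(vecs[w], target))

# 1. Similar words: nearest neighbour of "király"
print(most_similar(vecs["király"], exclude={"király"}))        # → királynő

# 2. Analogy: király - férfi + nő ≈ ?
query = tuple(k - f + n for k, f, n in
              zip(vecs["király"], vecs["férfi"], vecs["nő"]))
print(most_similar(query, exclude={"király", "férfi", "nő"}))  # → királynő

# 3. Odd one out: the word least similar to the average of the others
def odd_one_out(words):
    def mean_of_rest(w):
        rest = [vecs[x] for x in words if x != w]
        return tuple(sum(c) / len(rest) for c in zip(*rest))
    return min(words, key=lambda w: cos(vecs[w], mean_of_rest(w)))

print(odd_one_out(["király", "királynő", "férfi", "kutya"]))   # → kutya
```

The same three operations, run on vectors trained from text rather than hand-crafted ones, are what the exhibition demos performed.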
Natural language processing covers tasks like machine translation; another task is mining structured information from text wherever it is needed in large quantities, e.g. measuring consumer or voter satisfaction and opinion. If we want to apply machine learning to texts, we first need features: vectors provide these.

Most methods of language technology work for the hundred or more languages with enough text, but there are differences. In Hungarian, because of the many word forms (inflectional and derivational), it is worth using a morphological analyzer (stemmer) based on linguistic knowledge; the most important module of e-magyar is such a morphological analyzer.

Vectors model the frequency (probability, "naturalness") of words in each narrow context, and the same vectors (and more recently the early layers of deep networks) can be reused for quite different tasks. The term artificial intelligence became fashionable around 2012, when the models of connectionism (based on the activations of nodes thought of as neurons and the associations between them, researched since 1974) met the solid machine-learning methodology developed in the 1990s.
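The idea that a word is characterized by the contexts it occurs in can be sketched with simple co-occurrence counts. The four-sentence Hungarian mini-corpus and the one-word window below are illustrative assumptions; real embeddings such as word2vec are trained on millions of tokens with larger windows and dimensionality reduction:

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical four-sentence mini-corpus.
corpus = [
    "a kutya ugat".split(),     # "the dog barks"
    "a macska nyávog".split(),  # "the cat meows"
    "a kutya fut".split(),      # "the dog runs"
    "a macska fut".split(),     # "the cat runs"
]

# Count how often each word occurs next to each other word (window of 1).
cooc = defaultdict(Counter)
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if j != i:
                cooc[w][sent[j]] += 1

vocab = sorted({w for sent in corpus for w in sent})
# A word's vector is its row of co-occurrence counts over the vocabulary.
vectors = {w: [cooc[w][c] for c in vocab] for w in vocab}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# "kutya" and "macska" share contexts ("a", "fut"), so their vectors are
# close; "kutya" and "ugat" share no context here, so similarity is 0.
print(cosine(vectors["kutya"], vectors["macska"]))  # ≈ 0.83
print(cosine(vectors["kutya"], vectors["ugat"]))    # 0.0
```

Rows of such a count matrix (after weighting and compression) are one classical route to the word vectors described above.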