SZTAKI HLT | Topic discovery in the diaries of Antarctica winteroverers with multilingual deep sentence encoders

Márton Makrai, Bea Ehmann, László Balázs
In 7th International Conference on Research, Technology and Education of Space (H-SPACE 2022) “New trends in the space sector”, 2022

Slides (PDF)

Overwintering in Antarctic outposts is one of the terrestrial models of Isolated, Confined and Extreme (ICE) environments pertinent to long-duration spaceflight. In the CoALa project, overwintering crews of Antarctic research stations recorded weekly video diaries. The transcripts of these French, Italian and English diaries have been analyzed with language processing tools, with the long-term goal of automatically monitoring psychological processes and dynamics in isolated groups working in extreme circumstances.

We applied a fully data-driven method for discovering topics in the multilingual collection of diaries. In the past few years, deep language models have revolutionized all aspects of language processing and understanding, including sentence embedding, the task of representing sentences in a linear vector space of a few hundred dimensions. Multilingual models map the sentences of 50 to 200 languages to a common latent space. To avoid the bias of research preconceptions, we apply unsupervised clustering to form groups of similar sentences in the semantic space. We opt for the hierarchical density-based clustering method HDBSCAN. For the density of the cloud of sentence encoding vectors to be meaningful, it is advantageous to first map the vectors to a lower-dimensional space, say of 32 dimensions. Non-linear dimension reduction methods like t-SNE, UMAP and DensMAP aim to optimize the trade-off between preserving global and local properties of the sentence cloud. DensMAP improves on its forerunners by better preserving density as well, which is especially important for HDBSCAN. We interpret the clusters approximately by assigning them keywords.
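The three steps described above (multilingual sentence encoding, density-preserving dimension reduction, density-based clustering) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: it assumes the third-party packages `sentence-transformers`, `umap-learn` and `hdbscan`, and the model name, `min_cluster_size` and `seed` values are placeholders chosen for the example, not settings reported in the paper.

```python
from typing import List, Sequence


def discover_topics(
    sentences: Sequence[str],
    model_name: str = "paraphrase-multilingual-MiniLM-L12-v2",  # assumed model choice
    n_components: int = 32,      # target dimensionality mentioned in the text
    min_cluster_size: int = 15,  # assumed value, a hyperparameter to tune
    seed: int = 0,               # assumed value, fixes the random initialization
) -> List[int]:
    """Return one cluster label per sentence; -1 marks unclustered (noise) sentences."""
    # Heavy third-party dependencies are imported lazily, only when the function runs.
    from sentence_transformers import SentenceTransformer
    import umap
    import hdbscan

    # 1. Embed sentences of all languages into one shared vector space.
    embeddings = SentenceTransformer(model_name).encode(list(sentences))

    # 2. Non-linear reduction; densmap=True switches UMAP to DensMAP,
    #    which additionally tries to preserve local density.
    reducer = umap.UMAP(n_components=n_components, densmap=True, random_state=seed)
    reduced = reducer.fit_transform(embeddings)

    # 3. Hierarchical density-based clustering in the reduced space.
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    return clusterer.fit_predict(reduced).tolist()
```

Calling `discover_topics(sentences)` on a list of diary sentences would return a label per sentence. Rerunning with different `seed` values is one simple way to see how much the clusters depend on the random initialization of the reduction step.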

While dimension reduction is non-deterministic and our method has many hyperparameters that may influence the results, some large clusters of sentences seem to be stable, and these correspond to interesting topics. The examples below show one of the keywords (French, English or Italian) for each cluster, with a short post-hoc description.

  • sommeil: how the speaker has slept
  • groupe/comunque: the speaker's emotions about people; roles and conflicts in the group
  • winter: snow and winter
  • lune: sky phenomena (sun, moon, horizon, Aurora, etc.)
  • cuisinier: cooking and eating
  • settembre: expedition time on the large scale, plans (e.g. It's been six months, six months since I left home, since I left Lyon.)
  • koala: meta-sentences of diary writing (Hallo, CoALa. This week's been an OK week. ... I still have five minutes.)
  • vent: weather (wind, temperature, etc.)
  • engine: equipment (engine, tank, cleaning, computer, pump, etc.)
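The paper does not say here how cluster stability was assessed. One possible check, sketched under that caveat, is to cluster the same sentences in two runs and, for each cluster of the first run, find the cluster of the second run with the highest Jaccard overlap of member sentences. The function names and the toy labels are hypothetical, and -1 is assumed to mark noise, as in HDBSCAN-style outputs.

```python
from typing import Dict, List, Set

NOISE = -1  # label assumed to mark unclustered sentences


def clusters_as_sets(labels: List[int]) -> Dict[int, Set[int]]:
    """Group sentence indices by cluster label, skipping noise points."""
    clusters: Dict[int, Set[int]] = {}
    for index, label in enumerate(labels):
        if label != NOISE:
            clusters.setdefault(label, set()).add(index)
    return clusters


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Overlap of two non-empty index sets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)


def best_matches(run_a: List[int], run_b: List[int]) -> Dict[int, float]:
    """For each cluster of run_a, its best Jaccard overlap with any cluster of run_b."""
    clusters_a = clusters_as_sets(run_a)
    clusters_b = clusters_as_sets(run_b)
    return {
        label: max((jaccard(members, other) for other in clusters_b.values()), default=0.0)
        for label, members in clusters_a.items()
    }


# Toy labels for eight sentences from two hypothetical runs.
# Cluster ids are arbitrary per run, so clusters are compared by membership only.
run_1 = [0, 0, 0, 1, 1, -1, 2, 2]
run_2 = [5, 5, 5, 3, -1, -1, 3, 4]
print(best_matches(run_1, run_2))
```

In this toy case the first cluster is recovered exactly (overlap 1.0), while the other two are only partially matched. A cluster that keeps a high best-match score across many runs and hyperparameter settings would count as stable in this sense.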