SZTAKI HLT | Emergency Vocabulary

Emergency Vocabulary

2017 -

Manually created and automatically derived emergency vocabulary lists.

Manual lists

The lists in this section were created manually, based on the 4lang concept dictionary. For information on how the lists were assembled, refer to the journal paper.

Word list | Download link
BEV0 ∩ BV | bev_0_bv
BEV | bev

Small seed lists

Two small seed lists, which were used to bootstrap automatic vocabulary extraction, are presented below. The minimum seed list was itself created automatically from the New Reuters corpus, starting from the seed "emergency urgent". A short explanation of the method is given in the next section.

The Wikipedia list was created by extracting terms from the section titles of the Wikipedia page on natural disasters. This initial list was then expanded with a few terms that refer to human-induced emergency situations, such as terrorism and massacre.

Word list | Download link
Minimum seed | minimum
Wikipedia | wikipedia

Automatically generated from Common Crawl

The word lists below were automatically generated from the Common Crawl News Dataset. The method is described in the SMERP paper. In a nutshell, the corpus was loaded into a search engine, a seed word list was used as a query, and the most representative terms from the resulting subcorpus were (after some filtering) designated as the emergency vocabulary.

The lists were bootstrapped from manually created lexica, including the ones introduced above and CrisisLex, an emergency lexicon distilled from Twitter messages as a crowdsourcing effort.
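The retrieve-then-rank idea described above can be illustrated with a small, self-contained sketch. It is not the implementation from the SMERP paper. The search engine is replaced by a plain in-memory filter, and the scoring function (how over-represented a term is in the retrieved documents compared with the whole corpus) is an assumption chosen for illustration. The function name `extract_vocabulary`, the `min_df` threshold, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def extract_vocabulary(corpus, seeds, top_k=10, min_df=2):
    """Rank terms by how over-represented they are in seed-matching documents.

    ASSUMPTION: the score below is an illustrative choice, not the paper's.
    """
    docs = [set(tokenize(d)) for d in corpus]
    seeds = set(seeds)

    # "Query" step: keep documents that contain at least one seed word.
    sub = [d for d in docs if d & seeds]
    if not sub:
        return []

    # Document frequencies in the whole corpus and in the retrieved subcorpus.
    df_all = Counter(t for d in docs for t in d)
    df_sub = Counter(t for d in sub for t in d)

    scores = {}
    for term, n_sub in df_sub.items():
        # "Filtering" step: drop the seeds themselves and rare terms.
        if term in seeds or n_sub < min_df:
            continue
        p_sub = n_sub / len(sub)
        p_all = df_all[term] / len(docs)
        scores[term] = p_sub * math.log(p_sub / p_all)

    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k]


corpus = [
    "emergency crews rescue flood victims",
    "urgent evacuation as flood waters rise",
    "emergency evacuation ordered after earthquake",
    "stock markets rise as earnings beat forecasts",
    "football team wins the cup final",
]
vocabulary = extract_vocabulary(corpus, ["emergency", "urgent"])
print(vocabulary)
```

On this toy corpus only "evacuation" and "flood" occur in at least two of the retrieved documents, so they are the only terms that survive the filter. A real run would use a proper index and a much larger corpus.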

Vocabulary types

There are two vocabulary types, which differ in the kinds of bigrams they include. In post, stopword filtering was performed on the corpus after bigram creation, so it contains only regular bigrams. In pre, stopword filtering happened first, and bigrams were created second. As a result, the bigrams in the pre lists can contain words that were originally separated by stopwords; that is, skip-grams.
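The difference between the two orderings can be seen in a minimal sketch. The stopword list and the example sentence are made up for illustration. The sketch also assumes that, in the post variant, any bigram containing a stopword is discarded; the source does not spell out this detail.

```python
# Toy stopword list, assumed for illustration only.
STOPWORDS = {"of", "in", "the", "a"}


def bigrams(tokens):
    """Return all pairs of adjacent tokens."""
    return list(zip(tokens, tokens[1:]))


def post_bigrams(tokens, stopwords=STOPWORDS):
    """'post': build bigrams first, then drop those containing a stopword."""
    return [(a, b) for a, b in bigrams(tokens)
            if a not in stopwords and b not in stopwords]


def pre_bigrams(tokens, stopwords=STOPWORDS):
    """'pre': drop stopwords first, then build bigrams from what remains."""
    return bigrams([t for t in tokens if t not in stopwords])


tokens = "state of emergency declared in flood zone".split()

print(post_bigrams(tokens))
# [('emergency', 'declared'), ('flood', 'zone')]

print(pre_bigrams(tokens))
# [('state', 'emergency'), ('emergency', 'declared'),
#  ('declared', 'flood'), ('flood', 'zone')]
```

The pre variant yields the skip-grams ("state", "emergency") and ("declared", "flood"), whose words were separated by a stopword in the original text. The post variant keeps only pairs that were truly adjacent.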

Word lists
Base word list | Pre | Post
Basic Emergency Vocabulary (BEV) | bev_pre | bev_post
Minimum Seed | minimum_pre | minimum_post
Wikipedia Seed | wiki_pre | wiki_post
CrisisLex | crisislex_pre | crisislex_post
Words common to all of the above | common_pre | common_post
(Skip-)bigrams common to all of the above | bigrams_pre |

Citation policy

If you use any of the lists above (with the exception of CrisisLex, of course), please cite the papers below:

Dávid Márk Nemeskey, András Kornai (2018). Emergency Vocabulary. In: Information Systems Frontiers (TBP). (bib here)

Judit Ács, Dávid Márk Nemeskey, András Kornai (2017). Identification of disaster-implicated named entities. In: Proceedings of the First International Workshop on Exploitation of Social Media for Emergency Relief and Preparedness. (bib here)

Resource owner