Emergency Vocabulary
Manually created and automatically derived emergency vocabulary lists.
Manual lists
The lists in this section were created manually, based on the 4lang concept dictionary. For information on how the lists were assembled, refer to the journal paper.
Word list | Download link |
---|---|
BEV0 ∩ BV | bev_0_bv |
BEV | bev |
Small seed lists
Two small seed lists, which were used to bootstrap automatic vocabulary extraction, are presented below. The minimum seed list was itself created automatically from the New Reuters corpus, starting from the seeds emergency and urgent. The method is explained briefly in the next section.
The Wikipedia list was created by extracting terms from the section titles of the Wikipedia page on natural disasters. This initial list was then expanded with a few terms that refer to human-induced emergencies, such as terrorism and massacre.
Word list | Download link |
---|---|
Minimum seed | minimum |
Wikipedia | wikipedia |
Automatically generated from Common Crawl
The word lists below were automatically generated from the Common Crawl News Dataset. The method is described in the SMERP paper. In a nutshell, the corpus was loaded into a search engine, a seed word list was used as a query, and the most representative terms from the resulting subcorpus were (after some filtering) designated as the emergency vocabulary.
The lists were bootstrapped from manually created lexica, including the ones introduced above and CrisisLex, an emergency lexicon distilled from Twitter messages as a crowdsourcing effort.
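The pipeline outlined above (query a corpus with a seed list, then pick the terms most characteristic of the retrieved subcorpus) can be illustrated with a toy sketch. This is not the method from the SMERP paper: the search engine is replaced by a plain "contains any seed word" filter, the representativeness score is assumed to be a log ratio of relative frequencies, and the names `retrieve`, `representative_terms`, `top_n` and `min_count` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def retrieve(docs, seeds):
    """Stand-in for the search engine: keep documents containing any seed word."""
    seeds = set(seeds)
    return [doc for doc in docs if seeds & set(tokenize(doc))]


def representative_terms(docs, seeds, top_n=5, min_count=2):
    """Rank subcorpus terms by how over-represented they are versus the whole corpus.

    The score (log ratio of relative frequencies) is an assumption made for
    illustration; the actual scoring and filtering are described in the paper.
    """
    seeds = set(seeds)
    corpus = Counter(t for doc in docs for t in tokenize(doc))
    sub = Counter(t for doc in retrieve(docs, seeds) for t in tokenize(doc))
    corpus_total = sum(corpus.values())
    sub_total = sum(sub.values())

    scores = {}
    for term, count in sub.items():
        # Crude filtering: skip rare terms and the seed words themselves.
        if count < min_count or term in seeds:
            continue
        scores[term] = math.log((count / sub_total) / (corpus[term] / corpus_total))
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


docs = [
    "emergency crews rescue flood victims",
    "urgent flood warning issued as river rises",
    "emergency shelter opens for flood evacuees",
    "local team wins the football match",
    "the football season opens next week",
]
print(representative_terms(docs, ["emergency", "urgent"]))
```

On a real corpus, the filter in `retrieve` would be replaced by a ranked query against an actual index, and the frequency threshold would be much higher than in this five-document example.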
Vocabulary types
There are two vocabulary types, which differ in the kinds of bigrams they include. In post, stopword filtering was performed on the corpus after bigram creation, so it contains only regular bigrams. In pre, stopwords were filtered first and bigrams were created afterwards. As a result, the bigrams in pre can span words that were originally separated by stopwords, i.e. they are skip-grams.
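The difference between the two orderings can be seen in a minimal sketch. The stopword list and the example sentence are made up for illustration, and the sketch assumes that filtering after bigram creation means discarding every bigram that contains a stopword; the function names are hypothetical.

```python
# Toy stopword list; the list actually used for the datasets is not given here.
STOPWORDS = {"the", "of", "a", "in", "was", "by"}


def bigrams(tokens):
    """Return all pairs of adjacent tokens."""
    return list(zip(tokens, tokens[1:]))


def post_bigrams(tokens, stopwords=STOPWORDS):
    """'post': create bigrams first, then drop those containing a stopword."""
    return [(a, b) for a, b in bigrams(tokens)
            if a not in stopwords and b not in stopwords]


def pre_bigrams(tokens, stopwords=STOPWORDS):
    """'pre': drop stopwords first, then create bigrams from what is left."""
    return bigrams([t for t in tokens if t not in stopwords])


tokens = "state of emergency declared in the flooded city".split()
print(post_bigrams(tokens))
print(pre_bigrams(tokens))
```

In this example, only the pre variant yields the pair (state, emergency), because the two words are separated by the stopword "of" in the original sentence.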
Word lists
Base word list | Pre | Post |
---|---|---|
Basic Emergency Vocabulary (BEV) | bev_pre | bev_post |
Minimum Seed | minimum_pre | minimum_post |
Wikipedia Seed | wiki_pre | wiki_post |
CrisisLex | crisislex_pre | crisislex_post |
Words common to all of the above | common_pre | common_post |
(Skip-)bigrams common to all of the above | bigrams_pre | |
Citation policy
If you use any of the lists above (with the exception of CrisisLex, of course), please cite the papers below:
Dávid Márk Nemeskey, András Kornai (2018). Emergency Vocabulary. In: Information Systems Frontiers (TBP). (bib here)
Judit Ács, Dávid Márk Nemeskey, András Kornai (2017). Identification of disaster-implicated named entities. In: Proceedings of the First International Workshop on Exploitation of Social Media for Emergency Relief and Preparedness. (bib here)