Extractive summarization - methods and problems
Gábor Recski
Document summarization is the task of creating, given one or more input documents, a significantly shorter text that summarizes the main points of the original document(s). Practical applications rarely go beyond extractive summarization, the simplified task of selecting a subset of the original sentences (as opposed to abstractive summarization, which is what humans and some recent experimental systems do).
Extractive summarization can be performed either with supervised learning (see e.g. Kedzie et al. 2018) or with unsupervised methods such as TextRank (Mihalcea & Tarau 2004). Given a ground truth dataset for summarization, i.e. documents with human-written (abstractive) summaries, training data for extractive summarization can be created programmatically by selecting the sentences that maximize the overlap of word n-grams between the extractive and the abstractive summary, as measured by the ROUGE score, an established and validated metric for evaluating summaries (see Lin 2004 and more recently Graham 2015).
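As an illustration of how such training labels might be derived, the following is a minimal sketch and not necessarily the procedure used in the work discussed here. It assumes whitespace tokenization, a simplified ROUGE-N F1 (clipped n-gram overlap, no stemming), and a greedy search that keeps adding sentences while the score improves. The function names `ngrams`, `rouge_n_f1` and `greedy_oracle` are hypothetical, introduced only for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the word n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate_tokens, reference_tokens, n=1):
    """Simplified ROUGE-N F1: clipped n-gram overlap, no stemming (an assumption)."""
    cand, ref = ngrams(candidate_tokens, n), ngrams(reference_tokens, n)
    overlap = sum((cand & ref).values())  # '&' takes the minimum count per n-gram
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(sentences, reference, max_sentences=3, n=1):
    """Greedily select sentence indices whose concatenation (in document order)
    maximizes the simplified ROUGE-N F1 against the abstractive reference.
    Stops when no remaining sentence improves the score."""
    ref_tokens = reference.lower().split()
    tokenized = [s.lower().split() for s in sentences]
    selected, best_score = [], 0.0
    while len(selected) < max_sentences:
        best_idx = None
        for i in range(len(tokenized)):
            if i in selected:
                continue
            candidate = [tok for j in sorted(selected + [i]) for tok in tokenized[j]]
            score = rouge_n_f1(candidate, ref_tokens, n)
            if score > best_score:
                best_idx, best_score = i, score
        if best_idx is None:  # no sentence improves the current summary
            break
        selected.append(best_idx)
    return sorted(selected), best_score
```

The selected indices can then be turned into binary per-sentence labels for a supervised extractor. Greedy selection is only a heuristic: it does not guarantee the subset with the globally highest score, which is one of the reasons the construction of such training data deserves scrutiny.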
We discuss several methodological issues raised by this state of affairs, including ways to construct training data for extractive summarization and possible training objectives for learning to extract important sentences. We also present results of preliminary experiments on both of these topics, some of which are the product of joint work with Virág Kulcsár (BME Math).