SpanFact: Fix your factually incorrect summaries
A suite of models that fixes factual inaccuracies in generated summaries of long texts
When I’m trying to decide whether I should read a book, the first thing I look at is the summary on the back. The same goes for deciding whether a scientific paper is relevant to my work, whether an article is worth my time, or anything else that involves reading a lot of text. For this to work, the summary needs to be concise and accurate.
Machine learning researchers have long worked on summarizing large chunks of text and have developed models that can generate grammatically correct and coherent summaries. These summarization models fall into two broad categories: extractive and abstractive. Extractive models try to identify the important phrases in the text and stitch them together in a coherent way without modifying them. In contrast, abstractive models try to generate a summary one word at a time after processing and understanding the whole text.
Historically, abstractive summaries have tended to be more concise and original, since they are not limited to phrases that appear in the source text. They are also more error-prone, however, because each word is generated conditioned on the words generated so far. With the advent of large transformer models like BERT, which understand text better, abstractive models are now able to outperform extractive models even on grammaticality and fluency. They remain prone to one kind of error though: factual inaccuracy. The Seq2Seq models that form the backbone of most abstractive summarizers are trained to generate coherent, well-formed sentences; when asked to be factually accurate as well, they struggle to balance the two objectives and generally sacrifice accuracy.
There are two ways to fix this accuracy problem: retrain the model to focus more on generating accurate information, or fix the summary the model has already generated. Researchers have taken multiple approaches to the former. Cao et al. and Zhu et al. train their summarization models with fact triplets to generate more accurate summaries. Li et al. and Falke et al. take a different approach, using textual entailment to favor summary sentences that are actually supported by the source text.
While retraining seems like a more comprehensive fix, it requires a lot of resources and data, and it still faces the problem of balancing accuracy against coherence. The researchers at Microsoft Dynamics 365 AI Research take the latter approach of fixing an already generated summary. Introducing SpanFact! It is a suite of two models that use question-answering (QA) style span selection and auto-regressive span selection to correct summaries after they have been generated. By doing this, they aim to preserve coherence while improving the factual accuracy of the summary.
SpanFact focuses on correcting inaccuracies related to entities in the summary (names, numbers, places and more). Both SpanFact models use BERT to generate initial embeddings for the passage to be summarized and for the summary itself. They also iterate through all the entities in the summary, first masking them and then trying to recover them from the passage. Each model, however, makes different assumptions and takes a different approach.
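To make the shared first step concrete, here is a minimal sketch of finding and masking the entities in a generated summary. spaCy is an assumed choice of NER tool here, not necessarily what the paper uses, and the single [MASK] placeholder per entity is a simplification.

```python
# Illustrative sketch only: locate named entities in a summary and mask them.
# spaCy is an assumed NER tool, not necessarily the paper's choice.
import spacy

nlp = spacy.load("en_core_web_sm")

def find_and_mask_entities(summary: str):
    """Return the summary with each entity replaced by [MASK],
    plus the list of original entity strings (in order)."""
    doc = nlp(summary)
    masked, entities = summary, []
    # Replace from the end so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        entities.insert(0, ent.text)
        masked = masked[:ent.start_char] + "[MASK]" + masked[ent.end_char:]
    return masked, entities

masked, entities = find_and_mask_entities("Obama met Merkel in Paris.")
# masked   -> something like "[MASK] met [MASK] in [MASK]."
# entities -> ["Obama", "Merkel", "Paris"]
```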
The QA-span model assumes that the entities in the generated summary are independent of one another. This works well for shorter summaries with a small number of errors. It masks one entity at a time, uses BERT to encode the masked summary together with the passage, and predicts the span of the passage that corresponds to the masked entity. Once it has made its selection, the model replaces the masked part of the summary with the selected span and proceeds to mask the next entity.
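Here is a hedged sketch of that correction loop, with several assumptions: entities are masked one at a time with a single [MASK] placeholder, and HuggingFace's generic BertForQuestionAnswering class stands in for the paper's span-selection head, which would need to be trained for this task before the predictions are meaningful.

```python
# Hedged sketch of QA-span correction: mask one entity, let a BERT
# span-prediction head pick a replacement span from the source passage,
# substitute it in, and move to the next entity. The model class and
# checkpoint are illustrative stand-ins for the paper's trained head.
import torch
from transformers import BertTokenizer, BertForQuestionAnswering

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

def correct_entity(masked_summary: str, passage: str) -> str:
    """Return the passage span predicted for the single [MASK]ed entity."""
    inputs = tokenizer(masked_summary, passage, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    start = outputs.start_logits.argmax().item()
    end = outputs.end_logits.argmax().item()
    # A trained head would be constrained to spans inside the passage;
    # this sketch simply decodes whatever span scores highest.
    span_ids = inputs["input_ids"][0][start:max(start, end) + 1]
    return tokenizer.decode(span_ids)

def qa_span_correct(entities, summary: str, passage: str) -> str:
    """Iteratively replace each entity with the predicted passage span."""
    corrected = summary
    for entity in entities:
        masked = corrected.replace(entity, "[MASK]", 1)
        prediction = correct_entity(masked, passage)
        corrected = masked.replace("[MASK]", prediction, 1)
    return corrected
```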

The auto-regressive model, on the other hand, assumes that the entities in the summary are related, and tries to minimize later errors by correcting earlier ones first. To do this, it masks all the entities at once before passing the masked summary through BERT. It then uses the BERT embedding of the current masked entity, combined with the embeddings of all previously predicted entities, to predict the current entity's start and end positions in the passage. The first entity is predicted using the embedding of the [CLS] token. The main difference from the QA-span model is that the prediction layer only sees the entities predicted so far, while the QA-span model sees the entire summary with just one entity masked.
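The auto-regressive prediction layer is harder to capture with off-the-shelf classes, but the core idea can be sketched as follows. The architectural details here (the LSTM cell carrying the running state, the additive query, the mean-pooled entity embedding) are assumptions for illustration, not the paper's exact design; the point is only that each span prediction conditions on the ones made before it.

```python
# Rough, assumption-laden sketch of auto-regressive span prediction.
# Each masked entity's query mixes its [MASK] embedding with a running
# state built from previously predicted entities (seeded by [CLS]).
import torch
import torch.nn as nn

class AutoRegressiveSpanPredictor(nn.Module):
    def __init__(self, hidden: int = 768):
        super().__init__()
        # Illustrative choice: an LSTM cell carries the previously
        # predicted entities forward as a running state.
        self.state_cell = nn.LSTMCell(hidden, hidden)
        self.start_proj = nn.Linear(hidden, hidden)
        self.end_proj = nn.Linear(hidden, hidden)

    def forward(self, passage_emb, mask_embs, cls_emb):
        """
        passage_emb: (P, H) BERT embeddings of the source passage tokens
        mask_embs:   (E, H) embeddings of the masked entity positions, in order
        cls_emb:     (H,)   [CLS] embedding used to seed the first prediction
        Returns a list of (start, end) token indices into the passage.
        """
        h = cls_emb.unsqueeze(0)
        c = torch.zeros_like(h)
        spans = []
        for mask_emb in mask_embs:
            # Query = current mask embedding conditioned on the running state.
            query = mask_emb.unsqueeze(0) + h
            start = (self.start_proj(query) @ passage_emb.T).argmax().item()
            end = max(start, (self.end_proj(query) @ passage_emb.T).argmax().item())
            spans.append((start, end))
            # Feed the predicted entity back in (mean-pooled passage span).
            entity_emb = passage_emb[start:end + 1].mean(dim=0, keepdim=True)
            h, c = self.state_cell(entity_emb, (h, c))
        return spans
```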
To evaluate SpanFact, the researchers test it on summaries generated for three major summarization datasets: CNN/DailyMail, Gigaword and XSum. Both models significantly improve the summaries' factual accuracy without noticeably sacrificing coherence. In fact, the corrected summaries achieve state-of-the-art factual accuracy scores on all three datasets.
Here’s a link to the paper if you want to go into more detail about SpanFact, and you can find more of our publications and other work here.
References
- Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2018. Faithful to the original: Fact aware neural abstractive summarization. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Chenguang Zhu, William Hinthorn, Ruochen Xu, Qingkai Zeng, Michael Zeng, Xuedong Huang, and Meng Jiang. 2020. Boosting factual correctness of abstractive summarization with knowledge graph. arXiv preprint arXiv:2003.08612.
- Haoran Li, Junnan Zhu, Jiajun Zhang, and Chengqing Zong. 2018. Ensure the correctness of the summary: Incorporate entailment knowledge into abstractive sentence summarization. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1430–1441.
- Tobias Falke, Leonardo FR Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214–2220.
- Yue Dong, Shuohang Wang, Zhe Gan, Yu Cheng, Jackie Chi Kit Cheung, and Jingjing Liu. 2020. Multi-fact correction in abstractive text summarization. arXiv preprint arXiv:2010.02443.