Natural Language Processing For Automatic text summarization [Datasets] - Survey
DOI:
https://doi.org/10.31185/wjcm.72
Keywords:
Natural Language Processing, Automatic Text Summarization, Abstractive Text Summarization, Extractive Text Summarization, Text Summarization Datasets
Abstract
Natural language processing has advanced significantly in recent years, and the text summarization task has advanced with it. Summarization is no longer limited to reducing text size or extracting helpful information from a long document. It is now also used for obtaining answers from summaries, measuring the quality of sentiment analysis systems, research and mining techniques, document categorization, and natural language inference, which has increased the importance of research into producing good summaries. This paper reviews the most widely used text summarization datasets across different languages and types, together with the most effective methods reported for each dataset. Results are reported using text summarization metrics. The review indicates that pre-trained models achieved the highest scores on the summarization metrics in most of the surveyed works. English-language datasets made up about 75% of the datasets available to researchers, owing to the extensive use of English. Other languages, such as Arabic and Hindi, suffer from a scarcity of datasets, which has limited academic progress.
License
Copyright (c) 2022 Alaa Ahmed AL-Banna, Abeer K. AL-Mashhadany

This work is licensed under a Creative Commons Attribution 4.0 International License.