Natural Language Processing For Automatic text summarization [Datasets] - Survey


  • Alaa Ahmed AL-Banna, Department of Computer Science, College of Science, Al-Nahrain University, Baghdad, Iraq
  • Abeer K. AL-Mashhadany, Department of Computer Science, College of Science, Al-Nahrain University, Baghdad, Iraq



Natural Language Processing, Automatic Text Summarization, Abstractive Text Summarization, Extractive Text Summarization, Text Summarization Datasets


Natural language processing has advanced significantly in recent years, and the text summarization task has advanced with it. Summarization is no longer limited to shortening a text or extracting useful information from a long document. It is now also used for obtaining answers from summaries, measuring the quality of sentiment analysis systems, research and mining techniques, document categorization, and natural language inference, which has increased the importance of research into producing good summaries. This paper reviews the most widely used text summarization datasets across different languages and types, together with the most effective methods reported for each dataset. Results are reported using text summarization metrics. The review indicates that pre-trained models achieved the highest scores on these metrics in most of the surveyed work. English-language datasets made up about 75% of the datasets available to researchers, owing to the extensive use of English. Other languages, such as Arabic and Hindi, suffered from a scarcity of datasets, which has limited academic progress.


How to Cite

A. A. AL-Banna and A. K. AL-Mashhadany, “Natural Language Processing For Automatic text summarization [Datasets] - Survey”, WJCMS, vol. 1, no. 4, pp. 102–110, Dec. 2022, doi: 10.31185/wjcm.72.