Cluster-Based Word Weighting for Automatic Multi-Document Summarization
Abstract
Multi-document summarization is a technique for extracting information from a set of documents. The summary consists of several sentences that aim to describe the contents of the entire document set relevantly. Several algorithms with various criteria have been proposed. In general, they consist of preprocessing, clustering, and representative sentence selection stages that together produce summaries with high relevance. In some conditions, the clustering stage is one of the most important stages in producing a summary. Existing research cannot determine the number of clusters to be formed. Therefore, we propose a clustering technique based on a cluster hierarchy. This technique measures the similarity between sentences using cosine similarity. Sentences are clustered based on their similarity values. The clusters with the highest similarity to each other are merged into one cluster, and this merging process continues until one cluster remains. Experiments on the 2004 Document Understanding Conference (DUC) dataset under two scenarios, using 132, 135, 137, and 140 clusters, produced fluctuating values. A smaller number of clusters does not guarantee an increase in the ROUGE-1 value. With the same number of clusters, the proposed method has a lower ROUGE-1 value than the previous method. This is because, with 140 clusters, the similarity values within each cluster decreased.
Keywords: cluster, cosine similarity, multi-document, summarization
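The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the sentence representation or the linkage criterion, so this example assumes raw term-count vectors and represents each cluster by the summed counts of its sentences. The names `cosine_similarity`, `agglomerate`, and `target_clusters` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def agglomerate(sentences, target_clusters=1):
    """Repeatedly merge the two most similar clusters.

    Assumption: a cluster is represented by the summed term counts of
    its sentences; the source does not state the linkage criterion.
    Returns a list of clusters, each a sorted list of sentence indices.
    """
    # Start with one cluster per sentence (whitespace tokenization is an
    # assumption; the paper's preprocessing is not described here).
    clusters = [([i], Counter(s.lower().split())) for i, s in enumerate(sentences)]
    while len(clusters) > max(target_clusters, 1):
        best = None
        # Find the pair of clusters with the highest cosine similarity.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine_similarity(clusters[i][1], clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        del clusters[j]  # j > i, so index i is still valid
        clusters[i] = merged
    return [sorted(members) for members, _ in clusters]
```

With `target_clusters=1` the loop runs until a single cluster remains, matching the merging process in the abstract; stopping earlier (for example at 132 or 140 clusters) corresponds to cutting the hierarchy at a chosen level.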