Morphological Analysis for Handling Out-of-Vocabulary Words in an Indonesian Part-of-Speech Tagger Using a Hidden Markov Model

  • Febyana Ramadhanti Universitas Pendidikan Indonesia
  • Yudi Wibisono Ilmu Komputer, Universitas Pendidikan Indonesia
  • Rosa Ariani Sukamto Ilmu Komputer, Universitas Pendidikan Indonesia

Abstract

Part-of-speech (PoS) tagging is a natural language processing (NLP) task that assigns a part-of-speech tag to each word in an input sentence. The hidden Markov model (HMM) is a probabilistic PoS tagging algorithm, so its performance depends heavily on the training corpus. The limited coverage of the training corpus, combined with the breadth of the Indonesian vocabulary, gives rise to the out-of-vocabulary (OOV) word problem. This research compares an HMM PoS tagger that uses morphological analysis (abbreviated AM, from the Indonesian "analisis morfologi") with an HMM PoS tagger without AM, using the same training and test corpora. The test corpus consists of 6,676 tokens in 740 sentences, with an OOV level of 30%. The plain HMM system achieved an accuracy of 97.54%, while the HMM system with morphological analysis reached 99.14% at its best.
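
The general idea in the abstract can be illustrated with a minimal sketch: a bigram HMM tagger decoded with Viterbi, where the emission score of an OOV word is taken from an affix-based guess instead of plain smoothing. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy corpus, the tag names (`PRP`, `VB`, `NN`), the affix tables `PREFIX_HINTS` and `SUFFIX_HINTS`, the 0.9 confidence weight, and the use of add-one smoothing. The abstract does not describe the authors' actual morphological rules.

```python
import math
from collections import Counter, defaultdict

# Toy tagged corpus, invented for illustration (not the paper's data).
TRAIN = [
    [("saya", "PRP"), ("membaca", "VB"), ("buku", "NN")],
    [("dia", "PRP"), ("menulis", "VB"), ("surat", "NN")],
    [("saya", "PRP"), ("membeli", "VB"), ("makanan", "NN")],
]

# Hypothetical affix hints; a real morphological analyzer is far richer.
PREFIX_HINTS = {"meng": "VB", "mem": "VB", "men": "VB", "ber": "VB"}
SUFFIX_HINTS = {"an": "NN"}


def train(corpus):
    """Count tag transitions and word emissions from a tagged corpus."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sent in corpus:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    vocab = {w for sent in corpus for w, _ in sent}
    return trans, emit, sorted(emit), vocab


def guess_tag(word):
    """Return a tag suggested by the word's affixes, or None."""
    for prefix, tag in PREFIX_HINTS.items():
        if word.startswith(prefix):
            return tag
    for suffix, tag in SUFFIX_HINTS.items():
        if word.endswith(suffix):
            return tag
    return None


def log_trans(prev, tag, trans, tags):
    """Add-one smoothed log P(tag | prev)."""
    total = sum(trans[prev].values())
    return math.log((trans[prev][tag] + 1) / (total + len(tags)))


def log_emit(word, tag, emit, tags, vocab, use_morphology):
    """Log P(word | tag); OOV words may use the affix guess instead."""
    if use_morphology and word not in vocab:
        guess = guess_tag(word)
        if guess is not None:
            # Assumed weighting: most of the mass goes to the guessed tag.
            p = 0.9 if tag == guess else 0.1 / (len(tags) - 1)
            return math.log(p)
    total = sum(emit[tag].values())
    return math.log((emit[tag][word] + 1) / (total + len(vocab) + 1))


def viterbi(words, model, use_morphology=True):
    """Return the most probable tag sequence for a non-empty word list."""
    trans, emit, tags, vocab = model
    # best[t] = (log-probability, path) of the best sequence ending in tag t
    best = {
        t: (log_trans("<s>", t, trans, tags)
            + log_emit(words[0], t, emit, tags, vocab, use_morphology), [t])
        for t in tags
    }
    for word in words[1:]:
        step = {}
        for t in tags:
            score, path = max(
                (best[p][0] + log_trans(p, t, trans, tags), best[p][1])
                for p in tags
            )
            e = log_emit(word, t, emit, tags, vocab, use_morphology)
            step[t] = (score + e, path + [t])
        best = step
    return max(best.values())[1]


model = train(TRAIN)
# "membawa" does not occur in TRAIN, so it is treated as an OOV word.
print(viterbi(["dia", "membawa", "buku"], model))
```

Passing `use_morphology=False` falls back to add-one smoothing for every word, which gives a baseline tagger to compare against on the same data. On a corpus this small the two variants can easily agree, so the sketch shows the mechanism only and says nothing about the accuracy figures reported above.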

Published
2019-03-25
How to Cite
RAMADHANTI, Febyana; WIBISONO, Yudi; SUKAMTO, Rosa Ariani. Analisis Morfologi untuk Menangani Out-of-Vocabulary Words pada Part-of-Speech Tagger Bahasa Indonesia Menggunakan Hidden Markov Model. Jurnal Linguistik Komputasional, [S.l.], v. 2, n. 1, p. 6 - 12, mar. 2019. ISSN 2621-9336. Available at: <http://inacl.id/journal/index.php/jlk/article/view/13>. Date accessed: 21 oct. 2019. doi: https://doi.org/10.26418/jlk.v2i1.13.
Section
Articles