Year: 2022 Volume: 30 Issue: 6 Page Range: 2433 - 2445 Text Language: English DOI: 10.55730/1300-0632.3948 Index Date: 09-12-2022

Diacritics correction in Turkish with context-aware sequence to sequence modeling

Abstract:
Digital texts in many languages contain missing or misused diacritics, which makes it hard for natural language processing applications to disambiguate the meaning of words. Diacritics restoration is therefore a crucial step in natural language processing applications for many languages. In this study, we approach the problem as a bidirectional transformation between diacritical letters and their ASCII counterparts, rather than as unidirectional diacritic restoration. We propose a context-aware, character-level sequence-to-sequence model for this transformation. The model is language independent in the sense that it requires no language-specific feature extraction beyond word embeddings, and it is directly applicable to other languages. We trained the model for the Turkish diacritics correction task and evaluated it on the Turkish tweets benchmark dataset. Our best setting for the proposed model improves on the state-of-the-art results in terms of F1 score by 4.7% on ambiguous words and 1.24% over all cases.
Keywords: Natural language processing; diacritics restoration; diacritics correction; sequence to sequence learning; LSTM
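
The bidirectional view described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the names `PAIRS`, `asciify`, `corrupt`, and `make_pair`, the flip probability `p`, and the idea of building training pairs by random flipping are all assumptions made here for illustration. The sketch shows why an ASCII form can be ambiguous and how character-level (noisy input, correct target) pairs might be constructed when diacritics can be both missing and misused.

```python
import random

# Turkish letter pairs: (diacritical form, ASCII counterpart).
PAIRS = [("ç", "c"), ("ğ", "g"), ("ı", "i"), ("ö", "o"), ("ş", "s"), ("ü", "u"),
         ("Ç", "C"), ("Ğ", "G"), ("İ", "I"), ("Ö", "O"), ("Ş", "S"), ("Ü", "U")]
TO_ASCII = {d: a for d, a in PAIRS}
# Bidirectional map: each letter points to its counterpart in the other set.
FLIP = {**TO_ASCII, **{a: d for d, a in PAIRS}}


def asciify(text):
    """Replace every diacritical letter with its ASCII counterpart."""
    return "".join(TO_ASCII.get(ch, ch) for ch in text)


def corrupt(text, p=0.3, seed=0):
    """Flip each eligible letter to its counterpart with probability p.

    Hypothetical noise model: it covers both missing diacritics (ç -> c)
    and misused ones (c -> ç), keeping the sequence length unchanged.
    """
    rng = random.Random(seed)
    return "".join(
        FLIP[ch] if ch in FLIP and rng.random() < p else ch for ch in text
    )


def make_pair(text, p=0.3, seed=0):
    """Return (source_chars, target_chars) for a character-level model."""
    return list(corrupt(text, p, seed)), list(text)


if __name__ == "__main__":
    # Two different Turkish words collapse to the same ASCII string,
    # so the correct form can only be chosen from context.
    print(asciify("açı"), asciify("acı"))
    src, tgt = make_pair("çok güzel bir gün")
    print("".join(src), "->", "".join(tgt))
```

Because the flip is applied letter by letter, source and target stay aligned character for character, which is one simple way to frame the task at the character level; a real system would still need the surrounding context to resolve ambiguous words.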

Document Type: Article Article Type: Research Article Access Type: Open Access
APA Köksal A, Bozal O, Özge U (2022). Diacritics correction in Turkish with context-aware sequence to sequence modeling. Turkish Journal of Electrical Engineering and Computer Sciences, 30(6), 2433-2445. 10.55730/1300-0632.3948
Chicago Köksal Asiye Tuba, Bozal Ozge, Özge Umut. Diacritics correction in Turkish with context-aware sequence to sequence modeling. Turkish Journal of Electrical Engineering and Computer Sciences 30, no. 6 (2022): 2433-2445. 10.55730/1300-0632.3948
MLA Köksal Asiye Tuba, Bozal Ozge, Özge Umut. Diacritics correction in Turkish with context-aware sequence to sequence modeling. Turkish Journal of Electrical Engineering and Computer Sciences, vol. 30, no. 6, 2022, pp. 2433-2445. 10.55730/1300-0632.3948
AMA Köksal A, Bozal O, Özge U. Diacritics correction in Turkish with context-aware sequence to sequence modeling. Turkish Journal of Electrical Engineering and Computer Sciences. 2022; 30(6): 2433-2445. 10.55730/1300-0632.3948
Vancouver Köksal A, Bozal O, Özge U. Diacritics correction in Turkish with context-aware sequence to sequence modeling. Turkish Journal of Electrical Engineering and Computer Sciences. 2022; 30(6): 2433-2445. 10.55730/1300-0632.3948
IEEE Köksal A, Bozal O, Özge U, "Diacritics correction in Turkish with context-aware sequence to sequence modeling." Turkish Journal of Electrical Engineering and Computer Sciences, 30, pp. 2433-2445, 2022. 10.55730/1300-0632.3948
ISNAD Köksal, Asiye Tuba et al. "Diacritics correction in Turkish with context-aware sequence to sequence modeling". Turkish Journal of Electrical Engineering and Computer Sciences 30/6 (2022), 2433-2445. https://doi.org/10.55730/1300-0632.3948