Year: 2023 Volume: 6 Issue: 1 Page Range: 59 - 66 Text Language: English DOI: 10.35377/saucis...1207742 Index Date: 09-05-2023

The Effects of Preprocessing on Turkish and English News Data

Abstract:
In a typical text classification (TC) study, preprocessing is one of the key components for improving performance. This study examines how preprocessing affects TC with respect to news text, text language, and feature selection. To this end, all possible combinations of commonly used preprocessing techniques were compared within a single domain, news data, on two different news datasets. In this way, the contribution of each preprocessing technique to classification performance at multiple feature sizes, possible interactions among the techniques, and the dependency of each technique on the language were evaluated. The effect of two important preprocessing techniques on two different common news datasets was examined. The highest performance is a 0.781 F1 score for the Turkish dataset and a 0.980 F1 score for the English dataset.
Keywords: Feature selection, news data, preprocessing, text classification
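
The exhaustive comparison described in the abstract can be pictured as enumerating every on/off setting of a set of preprocessing steps. The sketch below is only an illustration under stated assumptions: the abstract does not name the techniques, so the two steps shown (lowercase conversion and stop-word removal), the toy stop-word list, and all function names are hypothetical and are not taken from the paper.

```python
from itertools import product

# Hypothetical toy stop-word list; a real study would use a full list per language.
STOP_WORDS = {"the", "a", "of", "in", "on"}


def lowercase(tokens):
    """Convert every token to lowercase."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


# Assumed set of techniques; the abstract does not say which ones were used.
STEPS = {"lowercase": lowercase, "stop_words": remove_stop_words}


def all_combinations(steps):
    """Yield every on/off assignment over the given step names."""
    names = list(steps)
    for flags in product([False, True], repeat=len(names)):
        yield dict(zip(names, flags))


def preprocess(text, config, steps=STEPS):
    """Tokenize on whitespace, then apply each enabled step in order."""
    tokens = text.split()
    for name, enabled in config.items():
        if enabled:
            tokens = steps[name](tokens)
    return tokens


if __name__ == "__main__":
    sample = "The Effects of Preprocessing on News Data"
    for config in all_combinations(STEPS):
        print(config, preprocess(sample, config))
```

With n steps this produces 2^n configurations, so each configuration would then be passed to the same feature selection and classification pipeline for comparison.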

Document Type: Article Article Type: Research Article Access Type: Open Access
APA PARLAK B (2023). The Effects of Preprocessing on Turkish and English News Data. Sakarya University Journal of Computer and Information Sciences (Online), 6(1), 59 - 66. 10.35377/saucis...1207742
Chicago PARLAK Bekir The Effects of Preprocessing on Turkish and English News Data. Sakarya University Journal of Computer and Information Sciences (Online) 6, no.1 (2023): 59 - 66. 10.35377/saucis...1207742
MLA PARLAK Bekir The Effects of Preprocessing on Turkish and English News Data. Sakarya University Journal of Computer and Information Sciences (Online), vol.6, no.1, 2023, pp.59 - 66. 10.35377/saucis...1207742
AMA PARLAK B The Effects of Preprocessing on Turkish and English News Data. Sakarya University Journal of Computer and Information Sciences (Online). 2023; 6(1): 59 - 66. 10.35377/saucis...1207742
Vancouver PARLAK B The Effects of Preprocessing on Turkish and English News Data. Sakarya University Journal of Computer and Information Sciences (Online). 2023; 6(1): 59 - 66. 10.35377/saucis...1207742
IEEE PARLAK B "The Effects of Preprocessing on Turkish and English News Data." Sakarya University Journal of Computer and Information Sciences (Online), 6, pp.59 - 66, 2023. 10.35377/saucis...1207742
ISNAD PARLAK, Bekir. "The Effects of Preprocessing on Turkish and English News Data". Sakarya University Journal of Computer and Information Sciences (Online) 6/1 (2023), 59-66. https://doi.org/10.35377/saucis...1207742