Year: 2021 Volume: 29 Issue: 2 Page Range: 514 - 530 Text Language: English DOI: 10.3906/elk-1911-116 Index Date: 07-06-2022

Development of majority vote ensemble feature selection algorithm augmented with rank allocation to enhance Turkish text categorization

Abstract:
The increase in the number of texts available as digital documents from numerous sources such as customer reviews, news, and social media has made text categorization crucial for managing this enormous amount of data. The high-dimensional nature of these texts requires a preliminary feature selection task to reduce the feature space, with a potential increase in prediction accuracy. In this study, an ensemble feature selection method, namely majority vote rank allocation, was developed for Turkish text categorization. The method uses a majority voting ensemble strategy in combination with a rank allocation approach to combine weak filters such as information gain, symmetric uncertainty, relief, and correlation-based feature selection. Thus, the proposed method measures the quality of each feature among all features using the majority votes of the filters and rank allocation. The feature selection efficacy of the method was tested on two datasets, one from the literature and one newly collected. The effect of the selected features on classification performance was evaluated with the naive Bayes, support vector machine, J48, and random forest algorithms. It was empirically observed that the developed method improved the prediction accuracies of the classifiers compared to the mentioned filters. The statistical significance of the experimental results was also validated with a two-way analysis of variance test.

Document Type: Article Article Type: Research Article Access Type: Open Access
APA BORANDAĞ E, ÖZÇİFT A, KAYGUSUZ Y (2021). Development of majority vote ensemble feature selection algorithm augmented with rank allocation to enhance Turkish text categorization. Turkish Journal of Electrical Engineering and Computer Sciences, 29(2), 514 - 530. 10.3906/elk-1911-116
Chicago BORANDAĞ Emin, ÖZÇİFT Akın, KAYGUSUZ Yeşim. Development of majority vote ensemble feature selection algorithm augmented with rank allocation to enhance Turkish text categorization. Turkish Journal of Electrical Engineering and Computer Sciences 29, no.2 (2021): 514 - 530. 10.3906/elk-1911-116
MLA BORANDAĞ Emin, ÖZÇİFT Akın, KAYGUSUZ Yeşim. Development of majority vote ensemble feature selection algorithm augmented with rank allocation to enhance Turkish text categorization. Turkish Journal of Electrical Engineering and Computer Sciences, vol.29, no.2, 2021, pp.514 - 530. 10.3906/elk-1911-116
AMA BORANDAĞ E, ÖZÇİFT A, KAYGUSUZ Y. Development of majority vote ensemble feature selection algorithm augmented with rank allocation to enhance Turkish text categorization. Turkish Journal of Electrical Engineering and Computer Sciences. 2021; 29(2): 514 - 530. 10.3906/elk-1911-116
Vancouver BORANDAĞ E, ÖZÇİFT A, KAYGUSUZ Y. Development of majority vote ensemble feature selection algorithm augmented with rank allocation to enhance Turkish text categorization. Turkish Journal of Electrical Engineering and Computer Sciences. 2021; 29(2): 514 - 530. 10.3906/elk-1911-116
IEEE BORANDAĞ E, ÖZÇİFT A, KAYGUSUZ Y, "Development of majority vote ensemble feature selection algorithm augmented with rank allocation to enhance Turkish text categorization." Turkish Journal of Electrical Engineering and Computer Sciences, 29, pp.514 - 530, 2021. 10.3906/elk-1911-116
ISNAD BORANDAĞ, Emin et al. "Development of majority vote ensemble feature selection algorithm augmented with rank allocation to enhance Turkish text categorization". Turkish Journal of Electrical Engineering and Computer Sciences 29/2 (2021), 514-530. https://doi.org/10.3906/elk-1911-116