Türkçe İstenmeyen E-postaların Farklı Öznitelik Seçim Yöntemleri Kullanılarak Makine Öğrenmesi Algoritmaları ile Tespit Edilmesi

Eryılmaz, Ersin Enes; Şahin, Durmuş Özkan; Kılıç, Erdal

E. Enes ERYILMAZ (Ondokuz Mayıs University, Department of Computer Engineering, Samsun, Türkiye)

Durmuş Özkan ŞAHİN (Ondokuz Mayıs University, Department of Computer Engineering, Samsun, Türkiye)

Erdal KILIÇ (Ondokuz Mayıs University, Department of Computer Engineering, Samsun, Türkiye)

TBV Bilgisayar Bilimleri ve Mühendisliği Dergisi

Year: 2020; Volume: 13; Issue: 2; Page Range: 57-77; Text Language: Turkish; Index Date: 29-07-2022

Abstract:

Because e-mail is easy to use and cheap to send, it is heavily used by individuals and groups who want to spread propaganda, advertise, or run phishing campaigns. To reach their goals, these senders deliver unnecessary, unsolicited (spam) messages to e-mail accounts they have no connection with. Such messages cause serious material and non-material harm to internet users and also tie up internet traffic. Spam e-mail is sent without the recipient's consent, typically by parties with malicious or promotional aims. In this study, spam e-mails are detected using seven machine learning algorithms on two Turkish e-mail datasets. Before the algorithms are applied, pre-processing steps are carried out on the datasets, followed by feature extraction and feature selection. After feature selection, a feature vector is built to obtain values in a machine-readable format. The feature vector is then tested with the machine learning algorithms, and the performance of the spam filtering process is evaluated. The filter-based feature selection methods used are Chi-square (CHI), Information Gain (IG), Document Frequency Thresholding (DF), Odds Ratio (OR), and ACC, all of which are common in text classification studies. When the results of the CHI, IG, ACC, OR, and DF methods are examined across the classification algorithms on the two Turkish e-mail datasets, the most successful results are obtained with chi-square feature selection. On the “TurkishEmail” dataset, the Support Vector Machine-based SMO algorithm with CHI feature selection reaches an F-measure of 0.985. On the “TRHamSpamEmailv1.0” dataset, CHI feature selection with the Random Forest (RF) and Naive Bayes (NB) algorithms reaches an F-measure of 0.748. Classification results obtained with all features, without any feature selection, are also given: the RF algorithm yields an F-measure of 0.514 on the “TurkishEmail” dataset and 0.535 on the “TRHamSpamEmailv1.0” dataset.
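The filter-based feature selection step described in the abstract can be illustrated with a minimal sketch. It scores each term with the standard chi-square statistic computed from a 2x2 term/class contingency table and keeps the k highest-scoring terms. The function names, the toy tokens, and the `spam`/`ham` label strings are assumptions made for illustration; the study's own implementation, tokenization, and number of selected features are not stated in this abstract.

```python
from collections import Counter


def chi_square(tp, fp, fn, tn):
    """Chi-square score of one term from a 2x2 term/class table.

    tp: spam documents containing the term, fp: ham documents containing it,
    fn: spam documents without it,          tn: ham documents without it.
    """
    n = tp + fp + fn + tn
    denom = (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)
    if denom == 0:
        return 0.0  # term occurs in every document or in none
    return n * (tp * tn - fp * fn) ** 2 / denom


def select_top_k(docs, labels, k):
    """Rank terms by chi-square and return the k best (ties broken by name)."""
    n_spam = sum(1 for y in labels if y == "spam")
    n_ham = len(labels) - n_spam
    spam_df, ham_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Count each term once per document (document frequency).
        (spam_df if y == "spam" else ham_df).update(set(tokens))
    scores = {}
    for term in set(spam_df) | set(ham_df):
        tp, fp = spam_df[term], ham_df[term]
        scores[term] = chi_square(tp, fp, n_spam - tp, n_ham - fp)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy, hypothetical data: already tokenized e-mails with class labels.
docs = [
    ["bedava", "kampanya"],
    ["bedava", "indirim"],
    ["toplanti", "rapor"],
    ["rapor", "proje"],
]
labels = ["spam", "spam", "ham", "ham"]
print(select_top_k(docs, labels, k=2))
```

Only the selected terms would then be used as dimensions of the feature vector passed to a classifier. The other metrics named in the abstract (IG, DF, OR, ACC) can be computed from the same four counts.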

Keywords: e-mail classification; Turkish e-mail classification; Turkish spam filtering; feature selection; spam e-mail; text classification; feature extraction; machine learning; spam filtering

Detection of Turkish Spam Emails with Machine Learning Algorithms Using Different Feature Selection Methods

Abstract:

Because e-mail is easy to use and inexpensive, it is widely exploited by people and groups for propaganda, advertising, and phishing. To achieve their aims, these senders deliver junk and spam messages to e-mail accounts they do not know. Such messages cause serious material and moral damage to internet users and also consume internet traffic. Spam e-mail is sent without the recipient's consent, usually by parties with malicious or promotional intent. In this study, seven machine learning algorithms were applied to two Turkish e-mail datasets to detect spam e-mails. Before the algorithms were applied, pre-processing steps were performed on the datasets, followed by feature extraction and feature selection. After feature selection, a feature vector was created so that the values were in a format the machine can process. The feature vector was then tested with the machine learning algorithms, and the performance of the spam filtering process was evaluated. The filter-based feature selection methods used, all common in text classification studies, are Chi-square (CHI), Information Gain (IG), Document Frequency Thresholding (DF), Odds Ratio (OR), and ACC. Across the two Turkish e-mail datasets and the classification algorithms tested, the most successful results among the CHI, IG, ACC, OR, and DF methods were obtained with chi-square feature selection. On the “TurkishEmail” dataset, the Support Vector Machine-based SMO algorithm with CHI feature selection achieved an F-measure of 0.985. On the “TRHamSpamEmailv1.0” dataset, CHI feature selection with the Random Forest (RF) and Naive Bayes (NB) algorithms achieved an F-measure of 0.748. Classification results obtained with all features, without any feature selection, are also given: the RF algorithm achieved an F-measure of 0.514 on the “TurkishEmail” dataset and 0.535 on the “TRHamSpamEmailv1.0” dataset.
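The F-measure values reported in the abstract combine precision and recall into a single score. The sketch below assumes the balanced F1 form computed for the spam class; the abstract does not say which variant or averaging the study used, and the confusion counts are hypothetical rather than taken from the paper.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-measure from true positives, false positives, and false negatives.

    beta = 1.0 gives the balanced F1 score (harmonic mean of precision
    and recall); this choice is an assumption, not stated in the source.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 90 spam caught, 10 ham flagged as spam, 30 spam missed.
score = f_measure(tp=90, fp=10, fn=30)
print(round(score, 3))
```

Because the score is driven by both false positives and false negatives, a classifier cannot reach a high F-measure simply by labeling every message as spam or as ham.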

Document Type: Article; Article Type: Research Article; Access Type: Open Access

APA	Eryılmaz E, Şahin D, Kılıç E (2020). Türkçe İstenmeyen E-postaların Farklı Öznitelik Seçim Yöntemleri Kullanılarak Makine Öğrenmesi Algoritmaları ile Tespit Edilmesi. TBV Bilgisayar Bilimleri ve Mühendisliği Dergisi, 13(2), 57-77.
Chicago	Eryılmaz Ersin Enes, Şahin Durmuş Özkan, Kılıç Erdal. Türkçe İstenmeyen E-postaların Farklı Öznitelik Seçim Yöntemleri Kullanılarak Makine Öğrenmesi Algoritmaları ile Tespit Edilmesi. TBV Bilgisayar Bilimleri ve Mühendisliği Dergisi 13, no. 2 (2020): 57-77.
MLA	Eryılmaz Ersin Enes, Şahin Durmuş Özkan, Kılıç Erdal. Türkçe İstenmeyen E-postaların Farklı Öznitelik Seçim Yöntemleri Kullanılarak Makine Öğrenmesi Algoritmaları ile Tespit Edilmesi. TBV Bilgisayar Bilimleri ve Mühendisliği Dergisi, vol. 13, no. 2, 2020, pp. 57-77.
AMA	Eryılmaz E, Şahin D, Kılıç E. Türkçe İstenmeyen E-postaların Farklı Öznitelik Seçim Yöntemleri Kullanılarak Makine Öğrenmesi Algoritmaları ile Tespit Edilmesi. TBV Bilgisayar Bilimleri ve Mühendisliği Dergisi. 2020; 13(2): 57-77.
Vancouver	Eryılmaz E, Şahin D, Kılıç E. Türkçe İstenmeyen E-postaların Farklı Öznitelik Seçim Yöntemleri Kullanılarak Makine Öğrenmesi Algoritmaları ile Tespit Edilmesi. TBV Bilgisayar Bilimleri ve Mühendisliği Dergisi. 2020; 13(2): 57-77.
IEEE	Eryılmaz E, Şahin D, Kılıç E. "Türkçe İstenmeyen E-postaların Farklı Öznitelik Seçim Yöntemleri Kullanılarak Makine Öğrenmesi Algoritmaları ile Tespit Edilmesi." TBV Bilgisayar Bilimleri ve Mühendisliği Dergisi, 13, pp. 57-77, 2020.
ISNAD	Eryılmaz, Ersin Enes et al. "Türkçe İstenmeyen E-postaların Farklı Öznitelik Seçim Yöntemleri Kullanılarak Makine Öğrenmesi Algoritmaları ile Tespit Edilmesi". TBV Bilgisayar Bilimleri ve Mühendisliği Dergisi 13/2 (2020), 57-77.