A comprehensive review on data preprocessing techniques in data analysis

ÇETİN, Volkan; YILDIZ, Oktay

doi:10.5505/pajes.2021.62687

Volkan ÇETİN (Department of Computer Engineering, Faculty of Engineering, Gazi University, Ankara, Türkiye)

Oktay YILDIZ (Department of Computer Engineering, Faculty of Engineering, Gazi University, Ankara, Türkiye)

Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi

Year: 2022 Volume: 28 Issue: 2 Pages: 299-312 Language: English DOI: 10.5505/pajes.2021.62687 Index Date: 25-06-2022

Abstract:

With technological developments, the amount of data stored in computer environments is increasing very rapidly. Data analysis has become an important research subject for evaluating these data correctly and transforming them into useful information. Data naturally play an important role in data analysis, and model performance depends heavily on the characteristics of the data. For this reason, it is essential to preprocess the data before starting any data analysis process. Data preprocessing produces accurate and useful datasets by resolving erroneous, incomplete, or otherwise problematic records. In this study, papers on data preprocessing published in the last 5 years were surveyed systematically, and it was observed that the widely used preprocessing methods fall under three main branches: data cleaning, data transformation, and data reduction. These methods and their various algorithms are examined, their frequency of use is presented, and they are compared in terms of accuracy. As the results of the study show, when data preprocessing methods are not applied to raw data, or when the wrong preprocessing methods are applied, data analysis methods alone cannot achieve sufficient performance.
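The three branches named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names (`impute_mean`, `min_max_scale`, `drop_low_variance`, `preprocess`) and the toy data are hypothetical, and each function stands in for only one simple representative of its branch (mean imputation for data cleaning, min-max normalization for data transformation, removal of constant features for data reduction). Missing values are assumed to be encoded as `None`, and every column is assumed to contain at least one observed value.

```python
from statistics import mean, pvariance


def impute_mean(rows):
    """Data cleaning: replace missing values (None) with the column mean."""
    cols = list(zip(*rows))
    # Assumption: every column has at least one non-missing value.
    means = [mean(v for v in col if v is not None) for col in cols]
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def min_max_scale(rows):
    """Data transformation: rescale each column to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(col) for col in cols]
    spans = [max(col) - min(col) for col in cols]
    # Constant columns (span 0) are mapped to 0.0 to avoid division by zero.
    return [[(v - lows[j]) / spans[j] if spans[j] else 0.0
             for j, v in enumerate(row)]
            for row in rows]


def drop_low_variance(rows, threshold=0.0):
    """Data reduction: keep only columns whose variance exceeds threshold."""
    cols = list(zip(*rows))
    keep = [j for j, col in enumerate(cols) if pvariance(col) > threshold]
    return [[row[j] for j in keep] for row in rows]


def preprocess(rows):
    """Apply cleaning, then transformation, then reduction."""
    return drop_low_variance(min_max_scale(impute_mean(rows)))


# Hypothetical toy dataset: one missing value and one constant column.
raw = [
    [1.0, 10.0, 5.0],
    [2.0, None, 5.0],
    [3.0, 30.0, 5.0],
]
clean = preprocess(raw)
print(clean)  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```

The order of the steps is a design choice of this sketch: imputing first ensures the scaling step never sees `None`, and reducing last lets the constant third column be detected after scaling.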

Turkish title: Veri analizinde veri ön işleme teknikleri üzerine kapsamlı bir inceleme

Document Type: Article Article Type: Correction Access Type: Open Access


APA	ÇETİN V, YILDIZ O (2022). A comprehensive review on data preprocessing techniques in data analysis. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi, 28(2), 299-312. 10.5505/pajes.2021.62687
Chicago	ÇETİN Volkan, YILDIZ Oktay. A comprehensive review on data preprocessing techniques in data analysis. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi 28, no. 2 (2022): 299-312. 10.5505/pajes.2021.62687
MLA	ÇETİN Volkan, YILDIZ Oktay. A comprehensive review on data preprocessing techniques in data analysis. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi, vol. 28, no. 2, 2022, pp. 299-312. 10.5505/pajes.2021.62687
AMA	ÇETİN V, YILDIZ O. A comprehensive review on data preprocessing techniques in data analysis. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi. 2022; 28(2): 299-312. 10.5505/pajes.2021.62687
Vancouver	ÇETİN V, YILDIZ O. A comprehensive review on data preprocessing techniques in data analysis. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi. 2022; 28(2): 299-312. 10.5505/pajes.2021.62687
IEEE	ÇETİN V, YILDIZ O. "A comprehensive review on data preprocessing techniques in data analysis." Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi, 28, pp. 299-312, 2022. 10.5505/pajes.2021.62687
ISNAD	ÇETİN, Volkan - YILDIZ, Oktay. "A comprehensive review on data preprocessing techniques in data analysis". Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi 28/2 (2022), 299-312. https://doi.org/10.5505/pajes.2021.62687