Year: 2019 Volume: 27 Issue: 2 Pages: 1534 - 1545 Text Language: English DOI: 10.3906/elk-1807-212 Indexing Date: 15-05-2020

Optimal training and test sets design for machine learning

Abstract:
In this paper, we describe histogram matching, a metric for measuring the distance between two datasets with exactly the same features, and embed it into a mixed integer programming formulation that partitions a dataset into fixed-size training and test subsets. The partition is constructed so that the pairwise histogram-matching distances between the dataset and the subsets are minimized. We then conduct a numerical study using a well-known machine learning dataset. We demonstrate that the training set constructed with our approach provides feature distributions almost identical to those of the whole dataset, whereas training sets constructed via random sampling end up with significant differences. We also show that our method introduces neither positive nor negative bias in the prediction accuracy of a decision tree, used as a representative machine learning method.
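The idea in the abstract can be illustrated with a small sketch. This is not the paper's exact metric or its mixed integer programming formulation; it only assumes a simple per-feature L1 distance between normalized histograms to show why a subset that tracks the full dataset's feature distributions scores lower than a skewed one:

```python
import numpy as np

def histogram_distance(full, subset, bins=10):
    """Sum over features of the L1 distance between normalized histograms.

    `full` and `subset` are 2-D arrays sharing the same feature columns.
    The bin count and L1 aggregation are illustrative assumptions, not
    the formulation used in the paper.
    """
    total = 0.0
    for j in range(full.shape[1]):
        lo, hi = full[:, j].min(), full[:, j].max()
        h_full, edges = np.histogram(full[:, j], bins=bins, range=(lo, hi))
        h_sub, _ = np.histogram(subset[:, j], bins=edges)
        # Normalize to probability mass so subsets of different sizes
        # compare fairly against the full dataset.
        total += np.abs(h_full / len(full) - h_sub / len(subset)).sum()
    return total

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))
even_subset = data[::5]                         # evenly spaced, tracks the distribution
skewed_subset = data[np.argsort(data[:, 0])[:200]]  # only the lowest values of feature 0
assert histogram_distance(data, even_subset) < histogram_distance(data, skewed_subset)
```

In the paper this distance is minimized exactly over all fixed-size partitions via a mixed integer program; the sketch above only evaluates the objective for two hand-picked subsets.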
Keywords:

Subjects: Engineering, Electrical and Electronic; Computer Science, Software Engineering; Computer Science, Cybernetics; Computer Science, Information Systems; Computer Science, Hardware and Architecture; Computer Science, Theory and Methods; Computer Science, Artificial Intelligence
Document Type: Article Article Type: Research Article Access Type: Open Access
APA Genc, B., & Tunç, H. (2019). Optimal training and test sets design for machine learning. Turkish Journal of Electrical Engineering and Computer Sciences, 27(2), 1534-1545. https://doi.org/10.3906/elk-1807-212
Chicago Genc, Burkay, and Hüseyin Tunç. "Optimal training and test sets design for machine learning." Turkish Journal of Electrical Engineering and Computer Sciences 27, no. 2 (2019): 1534-1545. https://doi.org/10.3906/elk-1807-212
MLA Genc, Burkay, and Hüseyin Tunç. "Optimal training and test sets design for machine learning." Turkish Journal of Electrical Engineering and Computer Sciences, vol. 27, no. 2, 2019, pp. 1534-1545. https://doi.org/10.3906/elk-1807-212
AMA Genc B, Tunç H. Optimal training and test sets design for machine learning. Turkish Journal of Electrical Engineering and Computer Sciences. 2019; 27(2): 1534-1545. doi:10.3906/elk-1807-212
Vancouver Genc B, Tunç H. Optimal training and test sets design for machine learning. Turkish Journal of Electrical Engineering and Computer Sciences. 2019; 27(2): 1534-1545. doi:10.3906/elk-1807-212
IEEE B. Genc and H. Tunç, "Optimal training and test sets design for machine learning," Turkish Journal of Electrical Engineering and Computer Sciences, vol. 27, no. 2, pp. 1534-1545, 2019. doi:10.3906/elk-1807-212
ISNAD Genc, Burkay - Tunç, Hüseyin. "Optimal training and test sets design for machine learning". Turkish Journal of Electrical Engineering and Computer Sciences 27/2 (2019), 1534-1545. https://doi.org/10.3906/elk-1807-212