TY  - JOUR
TI  - Optimal training and test sets design for machine learning
AB  - In this paper, we describe histogram matching, a metric for measuring the distance between two datasets with exactly the same features, and embed it into a mixed integer programming formulation to partition a dataset into fixed-size training and test subsets. The partition is done such that the pairwise distances between the dataset and the subsets are minimized with respect to histogram matching. We then conduct a numerical study using a well-known machine learning dataset. We demonstrate that the training set constructed with our approach provides feature distributions almost the same as those of the whole dataset, whereas training sets constructed via random sampling end up with significant differences. We also show that our method introduces neither positive nor negative bias in the prediction accuracy of a decision tree, used as a representative example of a machine learning method.
AU  - Genc, Burkay
AU  - TUNÇ, HÜSEYİN
DO  - 10.3906/elk-1807-212
PY  - 2019
JO  - Turkish Journal of Electrical Engineering and Computer Sciences
VL  - 27
IS  - 2
SN  - 1300-0632
SP  - 1534
EP  - 1545
DB  - TRDizin
UR  - http://search/yayin/detay/336815
ER  - 