Year: 2022 Volume: 14 Issue: 3 Page Range: 168-179 Text Language: English DOI: 10.5336/biostatic.2022-88932 Index Date: 13-05-2023

TwoClsBalancer: An Interactive Web Application for Handling the Class Imbalance Problem Based on Machine Learning

Abstract:
Objective: The main purpose of this research is to develop a novel, user-friendly web tool based on machine learning approaches that applies a variety of techniques to address the class imbalance problem. Material and Methods: Shiny, an open-source R package, was used to develop the proposed web tool. The interactive tool handles the class imbalance problem for binary classification datasets by implementing sampling-based methods. As a clinical application, a dataset retrospectively obtained from the database of the Cardiovascular Surgery Department of Turgut Özal Medical Center, İnönü University, Malatya, Türkiye was used in this web-based software. To overcome the class imbalance problem, sampling-based methods were applied to the original dataset. After this process, hypertension in patients with coronary artery disease was classified with three classification models. Results: According to the outputs of the developed web application, the best classification performance was obtained by the support vector machine with radial basis function kernel (SVM-RBF) model after applying the density-based synthetic minority over-sampling technique (DBSMOTE). The accuracy, sensitivity, specificity, precision, F-measure, and G-mean of this model were 0.99, 0.99, 0.99, 0.95, 0.97, and 0.97, respectively. Conclusion: The oversampling methods used in this research contributed more to the classification performance of the models than the undersampling methods. With undersampling, none of the three classification models classified successfully, whereas with oversampling the SVM-RBF model outperformed the other two models. The interactive web application is freely accessible at http://biostatapps.inonu.edu.tr/twoclsbalancer.
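The two ideas the abstract relies on, rebalancing a binary dataset and scoring a classifier with the six listed metrics, can be illustrated with a minimal Python sketch. This is not the TwoClsBalancer implementation (the tool itself is built with R and Shiny), and the function names `random_oversample` and `binary_metrics` are hypothetical. The sketch uses plain random duplication of minority rows as a simpler stand-in for DBSMOTE, which generates synthetic points instead. Because the abstract does not state which G-mean definition was used, both common variants are computed.

```python
import math
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes have equal size.

    Assumption: a simple stand-in for the synthetic oversamplers named in the
    abstract; it copies existing rows rather than interpolating new ones.
    """
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    (_, n_major), (minority, n_minor) = counts.most_common(2)
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    extra = [rng.choice(minority_rows) for _ in range(n_major - n_minor)]
    return list(rows) + extra, list(labels) + [minority] * len(extra)


def binary_metrics(tp, fp, tn, fn):
    """Compute the metrics listed in the abstract from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
        # Two common G-mean definitions; which one applies is an assumption.
        "g_mean_sens_spec": math.sqrt(sensitivity * specificity),
        "g_mean_prec_sens": math.sqrt(precision * sensitivity),
    }


if __name__ == "__main__":
    # Hypothetical toy data: four majority rows, one minority row.
    rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
    labels = [0, 0, 0, 0, 1]
    _, balanced_labels = random_oversample(rows, labels)
    print(Counter(balanced_labels))
    # Hypothetical confusion-matrix counts, not taken from the study.
    print(binary_metrics(tp=8, fp=2, tn=88, fn=2))
```

Oversampling should be applied to the training portion only, so that duplicated or synthetic rows never leak into the data used to compute the metrics.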

Turkish title: TwoClsBalancer: Sınıf Dengesizliği Problemi İçin Makine Öğrenmesine Dayalı Etkileşimli Bir Web Uygulaması


Document Type: Article Article Type: Research Article Access Type: Open Access
APA: ARSLAN A, ÇOLAK C, COLAK M (2022). TwoClsBalancer: An Interactive Web Application for Handling the Class Imbalance Problem Based on Machine Learning. Türkiye Klinikleri Biyoistatistik Dergisi, 14(3), 168-179. 10.5336/biostatic.2022-88932
Chicago: ARSLAN Ahmet Kadir, ÇOLAK Cemil, COLAK Mehmet Cengiz. TwoClsBalancer: An Interactive Web Application for Handling the Class Imbalance Problem Based on Machine Learning. Türkiye Klinikleri Biyoistatistik Dergisi 14, no. 3 (2022): 168-179. 10.5336/biostatic.2022-88932
MLA: ARSLAN Ahmet Kadir, ÇOLAK Cemil, COLAK Mehmet Cengiz. TwoClsBalancer: An Interactive Web Application for Handling the Class Imbalance Problem Based on Machine Learning. Türkiye Klinikleri Biyoistatistik Dergisi, vol. 14, no. 3, 2022, pp. 168-179. 10.5336/biostatic.2022-88932
AMA: ARSLAN A, ÇOLAK C, COLAK M. TwoClsBalancer: An Interactive Web Application for Handling the Class Imbalance Problem Based on Machine Learning. Türkiye Klinikleri Biyoistatistik Dergisi. 2022; 14(3): 168-179. 10.5336/biostatic.2022-88932
Vancouver: ARSLAN A, ÇOLAK C, COLAK M. TwoClsBalancer: An Interactive Web Application for Handling the Class Imbalance Problem Based on Machine Learning. Türkiye Klinikleri Biyoistatistik Dergisi. 2022; 14(3): 168-179. 10.5336/biostatic.2022-88932
IEEE: ARSLAN A, ÇOLAK C, COLAK M, "TwoClsBalancer: An Interactive Web Application for Handling the Class Imbalance Problem Based on Machine Learning," Türkiye Klinikleri Biyoistatistik Dergisi, vol. 14, no. 3, pp. 168-179, 2022. 10.5336/biostatic.2022-88932
ISNAD: ARSLAN, Ahmet Kadir et al. "TwoClsBalancer: An Interactive Web Application for Handling the Class Imbalance Problem Based on Machine Learning". Türkiye Klinikleri Biyoistatistik Dergisi 14/3 (2022), 168-179. https://doi.org/10.5336/biostatic.2022-88932