THE IMPACT OF DATA BALANCING METHODS ON THE QUALITY AND COST-EFFECTIVENESS OF BANK CUSTOMER CLASSIFICATION

Keywords: data balancing, credit risk, customer classification, economic efficiency, SMOTE, ROSE, cutoff threshold optimization

Abstract

This article examines how five data balancing methods (Oversampling, Undersampling, Over&Under, ROSE, and SMOTE) affect the classification of unreliable clients in the financial sector. The primary focus is on finding the classification cutoff threshold that maximizes bank profit while maintaining predictive accuracy. The study uses a dataset of 700 clients, 26.3% of whom are identified as unreliable, and builds a logistic regression model for each balancing technique. The findings indicate that SMOTE has the largest economic effect among the methods compared, with favorable results in both F1-Score and economic return. Despite some misclassifications, SMOTE improves the identification of high-risk clients, with higher sensitivity and specificity. ROSE proved the least effective of the techniques studied, showing lower classification quality and economic efficiency. The choice of balancing method therefore significantly influences both financial outcomes and forecast accuracy in banking.
The study also emphasizes the need to balance data in order to address class imbalance, a common problem in real-world financial applications such as credit risk prediction. Models trained on balanced datasets outperform those trained on unbalanced data, showing improved accuracy and reduced bias towards the dominant class. Selecting an appropriate data balancing strategy is thus a critical factor in the predictive capacity and economic efficiency of classification models in the banking sector. Adopting methods such as SMOTE can help banks reduce credit risk and improve profitability in lending, and can support risk management and decision-making.
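
The idea of a profit-maximizing cutoff threshold can be illustrated with a minimal Python sketch. It is not the study's implementation: the abstract does not state the profit function, so the payoff values, the sample probabilities and labels, and the function names below are all hypothetical. The sketch assumes a model has already produced a default probability for each client, approves every client whose probability falls below the threshold, and scans a grid of thresholds for the one with the highest total profit.

```python
from typing import List, Sequence, Tuple

# Hypothetical per-client payoffs (assumptions, not values from the study).
PROFIT_GOOD_LOAN = 1.0   # reliable client approved: bank earns interest
LOSS_BAD_LOAN = -5.0     # unreliable client approved: bank loses principal


def profit_at_threshold(probs: Sequence[float], labels: Sequence[int],
                        threshold: float) -> float:
    """Total profit when clients with default probability below
    `threshold` are approved. Label 1 marks an unreliable client."""
    total = 0.0
    for p, y in zip(probs, labels):
        if p < threshold:  # client is approved
            total += LOSS_BAD_LOAN if y == 1 else PROFIT_GOOD_LOAN
    return total


def best_threshold(probs: Sequence[float], labels: Sequence[int],
                   grid: Sequence[float]) -> Tuple[float, float]:
    """Return (threshold, profit) for the most profitable grid point.
    Ties go to the threshold that appears first in `grid`."""
    best = max(grid, key=lambda t: profit_at_threshold(probs, labels, t))
    return best, profit_at_threshold(probs, labels, best)


# Made-up predicted default probabilities and true labels for 8 clients.
example_probs: List[float] = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]
example_labels: List[int] = [0, 0, 0, 1, 0, 0, 1, 1]
threshold_grid: List[float] = [i / 10 for i in range(1, 10)]  # 0.1 ... 0.9

chosen_threshold, chosen_profit = best_threshold(
    example_probs, example_labels, threshold_grid
)
print(chosen_threshold, chosen_profit)
```

With these made-up payoffs, one approved defaulter cancels the gain from five reliable clients, so the most profitable cutoff sits well below the conventional 0.5. In practice the payoffs would come from the bank's own loan terms, and the probabilities from a model trained on balanced data.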

Published: 2025-03-31
Pages: 66-74
Section: SECTION 4 MATHEMATICAL METHODS, MODELS AND INFORMATION TECHNOLOGIES IN ECONOMY