Early Classification of Diabetes Risk in Productive Age Groups Using Machine Learning
DOI:
https://doi.org/10.23917/saintek.v2i1.16430
Keywords:
Diabetes, Classification, Machine Learning, SMOTE, Logistic Regression
Abstract
This research develops a classification model for early detection of diabetes risk in the productive age group (18–44 years) using a machine learning approach. Following the CRISP-DM methodology, the study uses the Diabetes Health Indicators Dataset from CDC BRFSS 2015, filtered to 48,867 observations. Class imbalance (4.51% diabetes-positive) was addressed with the Synthetic Minority Over-sampling Technique (SMOTE), which produced a 1:1 class ratio in the training set. Elbow curve analysis combined with mutual information identified 10 features that balance model performance against system usability. Three algorithms (Logistic Regression, Random Forest, and XGBoost) were evaluated and validated using stratified 5-fold cross-validation. Logistic Regression performed best for health screening purposes, with a recall of 75.06% and a ROC-AUC of 83.62%; that is, it detected roughly three out of four diabetes cases, and did so consistently (cross-validation recall 75.02% ± 2.35%). Of the three models evaluated, it was the most effective early screening tool for diabetes risk, supporting early detection and medical intervention in the productive age population.
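To make the oversampling step concrete, the following is a minimal sketch of the idea behind SMOTE and of the recall metric, written with the Python standard library only. It is an illustration under stated assumptions, not the authors' pipeline: the function names (smote_oversample, recall), the neighbour count k, the seed, and the two-feature toy data are all hypothetical, and a real study would more likely rely on an established implementation such as the one in imbalanced-learn.

```python
import math
import random


def smote_oversample(minority, n_majority, k=5, seed=0):
    """Create synthetic minority rows until the classes reach a 1:1 ratio.

    Each synthetic row is a random point on the line segment between a
    minority row and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    n_needed = max(n_majority - len(minority), 0)
    synthetic = []
    for _ in range(n_needed):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base row (excluding itself)
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(row, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def recall(y_true, y_pred):
    """Share of actual positives (label 1) that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical toy training set: 20 negative rows, 4 positive rows.
majority = [(float(i), float(i % 3)) for i in range(20)]
minority = [(1.0, 5.0), (2.0, 6.0), (3.0, 5.5), (2.5, 7.0)]

synthetic = smote_oversample(minority, n_majority=len(majority), k=3)
balanced_minority = minority + synthetic
print(len(majority), len(balanced_minority))  # both classes now equal in size

# Detecting three of four true cases gives a recall of 0.75.
print(recall([1, 1, 1, 1, 0], [1, 1, 1, 0, 0]))
```

As in the abstract, oversampling of this kind is applied to the training set only; validation folds and test data keep the original class distribution so that recall and ROC-AUC reflect real-world prevalence.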
License
Copyright (c) 2026 Muhammad Ghifari Zaki, Helmi Imaduddin

This work is licensed under a Creative Commons Attribution 4.0 International License.





