Application of machine learning in screening for β-thalassemia using complete blood count (CBC) parameters
Abstract
This study is among the first in Vietnam to apply machine learning (ML) to β-thalassemia screening based solely on complete blood count (CBC) parameters. A regionally imbalanced dataset of 515 CBC samples from students in Lai Châu province was collected at Chemedic Laboratory (Hanoi) between October and December 2023. A validation set of 111 samples, including 55 β-thalassemia cases confirmed by high-performance liquid chromatography (HPLC), was analyzed using an XP-100 hematology analyzer. Three supervised ML models, Decision Tree (DT), Random Forest (RF), and Logistic Regression (LR), were developed with Python libraries (scikit-learn, TensorFlow). Synthetic Minority Oversampling Technique (SMOTE) and Adaptive Synthetic Sampling (ADASYN) were used for data balancing, and Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) for feature extraction. The SMOTE-PCA/SVD combinations achieved high accuracy (0.95 for DT and RF; 0.93 for LR), with ROC AUC of 0.94–0.96 and an F1-score of approximately 0.95. Using ADASYN with PCA/SVD improved DT accuracy to 0.97 but reduced RF accuracy to 0.85. Optimal performance occurred with a 500-sample training set, a 60:40 class ratio, and test sizes of 0.05–0.2. These findings demonstrate that ML, particularly DT and RF, can serve as a cost-effective, non-invasive screening tool for β-thalassemia in resource-limited regions of Vietnam, although further validation with larger and genetically confirmed datasets is warranted.
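To illustrate the balancing step named in the abstract, the following is a minimal pure-Python sketch of the core SMOTE idea: each synthetic minority sample is an interpolation between a minority row and one of its nearest minority neighbours. It is not the authors' implementation (the abstract only states that Python libraries were used). The function name `smote`, the two-feature rows, and all numeric values are hypothetical toy inputs, not data from the study. The number of rows to generate is chosen so that the toy set reaches the 60:40 class ratio mentioned above.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic rows interpolated between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other rows
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base row (the base itself excluded)
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment from base to mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Hypothetical toy rows with two CBC-like features; values are made up.
minority = [[62.0, 19.5], [65.5, 20.1], [60.2, 18.8], [67.0, 21.0]]
majority_count = 12  # assumed size of the majority class in this toy example

# Number of synthetic rows needed for a 60:40 majority:minority ratio.
n_new = round(majority_count * 40 / 60) - len(minority)
synthetic = smote(minority, n_new, k=3, seed=42)

print(f"{len(minority)} original + {len(synthetic)} synthetic minority rows")
for row in synthetic:
    print([round(value, 2) for value in row])
```

Because every synthetic row lies on a line segment between two real minority rows, the oversampled class stays inside the range of the observed values. In practice a maintained library implementation would normally be preferred over a hand-written sketch like this one.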
Keywords
β-thalassemia, machine learning, complete blood count, screening