Application of machine learning in screening for β-thalassemia using complete blood count (CBC) parameters

Ta Van Thao, Tran Hai Yen, Dang Thi Thuy Hong

Main content of the article

Abstract

This study is among the first in Vietnam to apply machine learning (ML) to β-thalassemia screening based solely on complete blood count (CBC) parameters. A regionally imbalanced dataset of 515 CBC samples was collected from students in Lai Châu province at Chemedic Laboratory (Hanoi) between October and December 2023. A validation set of 111 samples, including 55 β-thalassemia cases confirmed by high-performance liquid chromatography (HPLC), was analyzed using an XP-100 hematology analyzer. Three supervised ML models, Decision Tree (DT), Random Forest (RF), and Logistic Regression (LR), were developed with Python libraries (scikit-learn, TensorFlow). The Synthetic Minority Oversampling Technique (SMOTE) and Adaptive Synthetic Sampling (ADASYN) were used for data balancing, and Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) for feature extraction. The SMOTE-PCA/SVD combinations achieved high accuracy (0.95 for DT and RF; 0.93 for LR), with ROC AUC of 0.94–0.96 and F1-scores of approximately 0.95. Replacing SMOTE with ADASYN in the PCA/SVD pipelines raised DT accuracy to 0.97 but lowered RF accuracy to 0.85. Optimal performance occurred with a 500-sample training set, a 60:40 class ratio, and test sizes of 0.05–0.2. These findings indicate that ML, particularly DT and RF, can serve as a cost-effective, non-invasive screening tool for β-thalassemia in resource-limited regions of Vietnam, although further validation with larger and genetically confirmed datasets is warranted.
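To make the balancing step concrete, the following is a minimal, standard-library sketch of SMOTE-style oversampling: each synthetic row is placed at a random point on the segment between a minority-class row and one of its k nearest minority neighbours. It is not the authors' code, and the abstract does not say which SMOTE implementation was used. The function names (`smote_oversample`, `rows_needed`), the three-feature rows, and all numeric values are hypothetical and chosen only for illustration. The sketch also assumes that the 60 in the reported 60:40 ratio refers to the majority class, which the abstract does not state.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Return n_new synthetic rows interpolated between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Simplification: the base row is drawn at random instead of
        # cycling through every minority row in turn.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def rows_needed(n_majority, n_minority, majority_parts=60, minority_parts=40):
    """Synthetic rows required to reach a majority:minority ratio (default 60:40)."""
    # Integer ceiling division avoids floating-point rounding in the target size.
    target = -(-n_majority * minority_parts // majority_parts)
    return max(0, target - n_minority)


# Hypothetical minority-class rows: [MCV (fL), MCH (pg), RBC (10^12/L)].
carriers = [
    [62.0, 19.5, 5.9],
    [65.1, 20.2, 5.6],
    [60.3, 18.9, 6.1],
    [66.4, 21.0, 5.5],
]
extra = smote_oversample(carriers, n_new=rows_needed(12, len(carriers)), k=2)
balanced_carriers = carriers + extra
```

In practice the features would normally be standardised before the neighbour search, because Euclidean distance is otherwise dominated by the parameter with the largest numeric range. Oversampling should also be applied to the training split only, so that synthetic rows never leak into the test set.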
