Development of a machine learning model predicting the incidence of newly diagnosed HIV infection in the subjects of the Russian Federation

DOI: 10.25881/18110193_2023_3_16

Aim: to develop a model predicting the incidence of newly diagnosed HIV infection in the subjects of the Russian Federation using machine learning methods.
Materials and methods: The source data were taken from Federal Statistical Observation Form No. 61 and from Rosstat data on the average annual population of 85 subjects of the Russian Federation (2016-2022). We compared machine learning methods and their ensembles for building a regression model that predicts the incidence of newly diagnosed HIV infection in the subjects of the Russian Federation.
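As an illustration of how a case count and an average annual population combine into an incidence value, here is a minimal sketch. It assumes incidence is expressed per 100,000 population, which the abstract does not state; the function name and the sample figures are hypothetical and are not taken from Form No. 61 or Rosstat.

```python
def incidence_per_100k(new_cases: int, avg_population: float) -> float:
    """Newly diagnosed cases per 100,000 population (assumed unit)."""
    if avg_population <= 0:
        raise ValueError("average annual population must be positive")
    return new_cases * 100_000 / avg_population


# Hypothetical rows: (region, year, newly diagnosed cases, average annual population).
rows = [
    ("Region A", 2016, 450, 1_500_000),
    ("Region B", 2016, 120, 800_000),
]

# One incidence value per (region, year) pair.
incidence = {
    (region, year): incidence_per_100k(cases, population)
    for region, year, cases, population in rows
}
```

Normalizing by population makes regions of different size comparable, which is why a rate rather than a raw count would serve as the regression target under this assumption.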
Results: Models were built using the following methods: linear regression, decision tree, random forest, gradient boosting on decision trees (GBDT), and bagging. The Jupyter Notebook interactive computing environment (6.5.2) and the software libraries Pandas (1.5.3), Scikit-learn (1.0.2), Statsmodels (0.13.5), and CatBoost were used. Optimal hyperparameters were selected using the Optuna framework. The following quality metrics were used: root mean square error (RMSE), coefficient of determination (R²), mean absolute error (MAE), mean absolute percentage error (MAPE), and median absolute error (MedAE).
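The five quality metrics listed above can be written out by hand to make their definitions explicit. The sketch below is not the authors' code: the function name and sample values are hypothetical, MAPE is returned as a fraction rather than a percentage, and it assumes the true values are nonzero and not all equal. Scikit-learn provides equivalent functions in `sklearn.metrics` (`mean_squared_error`, `r2_score`, `mean_absolute_error`, `mean_absolute_percentage_error`, `median_absolute_error`).

```python
import math
import statistics


def regression_metrics(y_true, y_pred):
    """Return RMSE, R2, MAE, MAPE (as a fraction) and MedAE for two sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    mean_true = statistics.fmean(y_true)
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "RMSE": math.sqrt(ss_res / len(errors)),
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": statistics.fmean(abs_errors),
        "MAPE": statistics.fmean([a / abs(t) for a, t in zip(abs_errors, y_true)]),
        "MedAE": statistics.median(abs_errors),
    }


# Hypothetical observed and predicted incidence values.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
report = regression_metrics(y_true, y_pred)
```

RMSE penalizes large errors more heavily than MAE, while MedAE is insensitive to a few outlying regions, so the metrics can rank the same models differently.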
Conclusions: Different machine learning methods and algorithms yield different values of the model accuracy metrics. Linear regression showed the worst values of all quality metrics (MAPE 67%). The combination (bagging) of two ensemble methods, random forest and GBDT, performed best, achieving the best values on the largest number of quality metrics. It is therefore reasonable to test all available machine learning methods and algorithms and then select the best-performing model from the results obtained.
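One way to combine a random forest with a GBDT model is to average their predictions. The sketch below does this with scikit-learn's `VotingRegressor` on synthetic data. This is an assumption for illustration only: the abstract does not describe how the authors combined the two models, `GradientBoostingRegressor` stands in for CatBoost, and the data and hyperparameters are made up rather than tuned with Optuna.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.model_selection import train_test_split

# Synthetic regression data; not the Form No. 61 / Rosstat data.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Unweighted average of the two ensemble regressors.
ensemble = VotingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("gbdt", GradientBoostingRegressor(random_state=0)),
    ]
)
ensemble.fit(X_train, y_train)
pred = ensemble.predict(X_test)

# Predictions of the fitted base models, for comparison with the average.
rf_pred = ensemble.named_estimators_["rf"].predict(X_test)
gbdt_pred = ensemble.named_estimators_["gbdt"].predict(X_test)
```

Averaging tends to help when the base models make partly different errors; whether it beats each model alone should be checked on held-out data with the same metrics used for the individual models.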

References

1. HIV infection and AIDS. National guidelines. Acad. RAS, Professor V.V. Pokrovsky, editor. Moscow: GEOTAR-MEDIA, 2020. 686 p. (In Russ.)
2. Bodrin KA, Krasnoperova AA. The use of machine learning technologies in medicine. Theory and practice of modern science. 2018; 10(40): 52-56. (In Russ.)
3. Vostroknutov ME, Dyuzheva EV, Kuznetsova AV, Senko OV. Risk factors of hospital mortality of patients with a combination of tuberculosis and HIV infection in institutions of the penal system. Tuberculosis and lung diseases. 2019; 97(7): 34-41. (In Russ.) doi: 10.21292/2075-1230-2019-97-7-34-41.
4. Tarasova OA, Filimonov DA, Poroikov VV. Computer prediction of human immunodeficiency virus resistance to HIV reverse transcriptase inhibitors. Biomedical chemistry. 2017; 63(5): 457-460. (In Russ.) doi: 10.18097/PBMC20176305457.
5. Rajendran M, Ferran MC, Mouli L, Babbitt GA, Lynch ML. Evolution of drug resistance drives destabilization of flap region dynamics in HIV-1 protease. Biophys Rep (NY). 2023; 3(3): 100121. doi: 10.1016/j.bpr.2023.100121.
6. Bukic E, Milasin J, Toljic B, Jadzic J, Jevtovic D, Obradovic B, Dragovic G. Association between Combination Antiretroviral Therapy and Telomere Length in People Living with Human Immunodeficiency Virus. Biology (Basel). 2023; 12(9): 1210. doi: 10.3390/biology12091210.
7. Birri Makota RB, Musenge E. Predicting HIV infection in the decade (2005-2015) pre-COVID-19 in Zimbabwe: A supervised classification-based machine learning approach. PLOS Digit Health. 2023; 2(6): e0000260. doi: 10.1371/journal.pdig.0000260.
8. Mamo DN, Yilma TM, Fekadie M, Sebastian Y, Bizuayehu T, Melaku MS, Walle AD. Machine learning to predict virological failure among HIV patients on antiretroviral therapy in the University of Gondar Comprehensive and Specialized Hospital, in Amhara Region, Ethiopia, 2022. BMC Med Inform Decis Mak. 2023; 23(1): 75. doi: 10.1186/s12911-023-02167-7.
9. Jupyter Notebook. Available at: https://docs.jupyter.org/en/latest/. Accessed 10.10.2023.
10. Pandas. Available at: https://pandas.pydata.org/docs/. Accessed 10.10.2023.
11. Scikit-learn. Documentation. Available at: https://scikit-learn.org/stable/index.html. Accessed 10.10.2023.
12. Statsmodels. Available at: https://www.statsmodels.org/stable/user-guide.html. Accessed 10.10.2023.
13. CatBoost. Available at: https://catboost.ai/en/docs/. Accessed 10.10.2023.
14. Optuna. Available at: https://optuna.org/#key_features. Accessed 10.10.2023.
15. Scikit-learn. Evaluation of models. Available at: https://scikit-learn.org/stable/modules/model_evaluation.html. Accessed 10.10.2023.
16. Lysenko AA. Introduction to regression analysis of data and regression models. Proceedings of the St. Petersburg State Maritime Technical University. 2020; 1(S2): 15. (In Russ.)
17. Pernebai BA. Python. Decision tree regression using sklearn. Polish Journal of Science. 2021; 38-1(38): 51-56. (In Russ.)
18. Scikit-learn. Linear models. Available at: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model. Accessed 10.10.2023.
19. Scikit-learn. Decision tree, regressor. Available at: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor. Accessed 10.10.2023.
20. Scikit-learn. Common errors in the interpretation of linear model coefficients. Available at: https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html#sphx-glr-auto-examples-inspection-plot-linear-model-coefficient-interpretation-py. Accessed 10.10.2023.
21. Scikit-learn. Robust scaling. Available at: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html. Accessed 10.10.2023.
22. Scikit-learn. Lasso regression. Available at: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html. Accessed 10.10.2023.
23. Scikit-learn. Cross-validation. Available at: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html. Accessed 10.10.2023.
24. Nosova GS, Abdullin AH. Machine learning based on nonparametric and nonlinear Random Forest (RF) algorithm. Innovation. The science. Education. 2021; 35: 33-39. (In Russ.)
25. Scikit-learn. Random forest, regressor. Available at: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html. Accessed 10.10.2023.
26. Zhang B, Ren J, Wei Z, et al. Health data driven on continuous blood pressure prediction based on gradient boosting decision tree algorithm. IEEE Access. 2019; 7: 32423-32433. doi: 10.1109/ACCESS.2019.2902217.
27. Plaia A, Buscemi S, Fürnkranz J, Mencía EL. Comparing Boosting and Bagging for Decision Trees of Rankings. Journal of Classification. 2022; 39(1): 78-99. doi: 10.1007/s00357-021-09397-2.

For citation

Kotlovsky M.Yu., Tsybikova E.B., Lorsanov S.M., Fadeev P.A., Fadeeva S.O., Gusev A.V. Development of a machine learning model predicting the incidence of newly diagnosed HIV infection in the subjects of the Russian Federation. Medical doctor and information technology. 2023; 3: 16-29. doi: 10.25881/18110193_2023_3_16.

Authors

Kotlovsky M.Yu.
Central Research Institute of Organization and Informatization of Healthcare of the Ministry of Health of Russia, Moscow, Russia

Tsybikova E.B.
Central Research Institute of Organization and Informatization of Healthcare of the Ministry of Health of Russia, Moscow, Russia

Lorsanov S.M.
Ministry of Health of the Chechen Republic, Grozny, Russia

Fadeev P.A.
Ministry of Health of the Chechen Republic, Grozny, Russia

Fadeeva S.O.
Republican Center for Public Health and Medical Prevention, Grozny, Russia

Gusev A.V.
Central Research Institute of Organization and Informatization of Healthcare of the Ministry of Health of Russia, Moscow, Russia