Aim: to develop a model for predicting the incidence of newly diagnosed HIV infection in the federal subjects of the Russian Federation using machine learning methods.
Materials and methods: The source data were taken from Federal Statistical Observation Form No. 61 and from Rosstat data on the average annual population of 85 federal subjects of the Russian Federation (2016-2022). We compared machine learning methods and their ensembles for building a regression model that predicts the incidence of newly diagnosed HIV infection in the federal subjects of the Russian Federation.
Results: Models were built using the following methods: linear regression, decision tree, random forest, gradient boosting on decision trees (GBDT), and bagging. The work was carried out in the interactive computing environment Jupyter Notebook (6.5.2) with the software libraries pandas (1.5.3), scikit-learn (1.0.2), statsmodels (0.13.5), and CatBoost. Optimal hyperparameters were selected using the Optuna framework. The following quality metrics were used: root mean square error (RMSE); coefficient of determination (R²); mean absolute error (MAE); mean absolute percentage error (MAPE); median absolute error (MedAE).
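For reference, the five quality metrics listed above can be written out directly from their standard definitions. The following is a minimal sketch in plain Python, not the code used in the study; the function name `regression_metrics` and the sample values are illustrative assumptions, and MAPE is returned as a fraction (multiply by 100 for a percentage).

```python
import math
from statistics import mean, median


def regression_metrics(y_true, y_pred):
    """Return RMSE, R2, MAE, MAPE and MedAE for two equal-length sequences.

    Hypothetical helper for illustration; assumes no true value is zero
    (otherwise MAPE is undefined) and that y_true is not constant.
    """
    errors = [p - t for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    y_bar = mean(y_true)
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)    # total sum of squares
    return {
        "RMSE": math.sqrt(ss_res / len(errors)),
        "R2": 1 - ss_res / ss_tot,
        "MAE": mean(abs_errors),
        "MAPE": mean(a / abs(t) for a, t in zip(abs_errors, y_true)),
        "MedAE": median(abs_errors),
    }


# Made-up values for illustration only, not data from the study.
metrics = regression_metrics([10, 20, 30, 40], [12, 18, 33, 40])
```

RMSE, MAE, MAPE and MedAE are error measures, so lower values are better, whereas a higher R² indicates a better fit.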
Conclusions: Different machine learning methods and algorithms yield different model accuracy metrics. Linear regression showed the worst values on all quality metrics (MAPE 67%). The combination (bagging) of two ensemble methods, random forest and GBDT, performed best, achieving the best values on the largest number of quality metrics. It is therefore reasonable to test all available machine learning methods and algorithms and then select the best-performing model from the results obtained.
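The selection rule described here, choosing the model that is best on the largest number of quality metrics, can be sketched as a simple win count. This is one possible reading of the rule, not the authors' procedure; the function name `count_metric_wins`, the model labels, and all numeric scores below are hypothetical placeholders rather than results from the study.

```python
# R2 is the only listed metric where a higher value is better;
# RMSE, MAE, MAPE and MedAE are errors, so lower is better.
HIGHER_IS_BETTER = {"R2"}


def count_metric_wins(scores):
    """Count, for each model, the number of metrics on which it ranks first.

    `scores` maps a model name to a dict of metric name -> value.
    Ties go to the model listed first (an arbitrary assumption).
    """
    wins = {name: 0 for name in scores}
    metric_names = next(iter(scores.values())).keys()
    for metric in metric_names:
        pick = max if metric in HIGHER_IS_BETTER else min
        best = pick(scores, key=lambda name: scores[name][metric])
        wins[best] += 1
    return wins


# Placeholder scores for three unnamed models (illustrative only).
scores = {
    "model_a": {"RMSE": 12.0, "R2": 0.80, "MAE": 8.0, "MAPE": 0.30, "MedAE": 5.0},
    "model_b": {"RMSE": 10.0, "R2": 0.86, "MAE": 7.5, "MAPE": 0.25, "MedAE": 5.5},
    "model_c": {"RMSE": 10.5, "R2": 0.85, "MAE": 7.0, "MAPE": 0.27, "MedAE": 4.0},
}
wins = count_metric_wins(scores)
best_model = max(wins, key=wins.get)
```

A win count treats every metric as equally important; if one metric matters more for the application, ranking on that metric alone may be the more defensible choice.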