Background. The development and implementation of medical information systems make it possible to simplify and automate many processes in medical organizations. At the same time, data on patients’ health are constantly accumulating, which makes it possible to address many problems in the prediction and diagnosis of diseases.
Aim. To study approaches to processing unstructured medical texts in Russian and to predicting certain groups of diseases using machine learning methods.
Materials and methods. The initial data consisted of an array of depersonalized data from medical organizations in the Orenburg region containing 119,780 records. Three approaches to probabilistic forecasting of groups of diseases based on unstructured medical texts of patient complaints in Russian were studied: a rule-based approach, a logistic regression-based approach, and an approach using BERT transformer models.
Results. Comparative analysis showed that the logistic regression-based approach combined with the TfidfVectorizer method had the best results in Precision (0.8296), F1-score (0.8269) and Matthews correlation coefficient (0.7695).
Conclusion. The traditional rule-based approach was the least effective of the studied methods (Precision = 0.7182), but it allowed the classifier's results to be interpreted by visualizing the decision tree. The logistic regression-based approach (Precision = 0.8296) and the approach using BERT transformer models (Precision = 0.8164) showed the best classification results. They can serve as a basis for building and developing medical decision support systems and may find application in medical practice.
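For readers unfamiliar with the three reported metrics, the following minimal sketch shows how Precision, F1-score and the Matthews correlation coefficient can be computed for a multi-class classifier. It is an illustration only, not the authors' evaluation code: the abstract does not state which averaging scheme was used, so macro averaging is an assumption here, and the function names and the toy labels are hypothetical.

```python
from collections import Counter
from math import sqrt


def macro_precision_f1(y_true, y_pred):
    """Return (macro precision, macro F1). Macro averaging is an assumption."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, f1_scores = [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        precisions.append(precision)
        f1_scores.append(f1)
    return sum(precisions) / len(labels), sum(f1_scores) / len(labels)


def matthews_corrcoef(y_true, y_pred):
    """Multi-class Matthews correlation coefficient from label counts."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)   # how often each class truly occurs
    pred_counts = Counter(y_pred)   # how often each class is predicted
    labels = set(true_counts) | set(pred_counts)
    numerator = correct * n - sum(true_counts[k] * pred_counts[k] for k in labels)
    denominator = sqrt(
        (n * n - sum(v * v for v in true_counts.values()))
        * (n * n - sum(v * v for v in pred_counts.values()))
    )
    return numerator / denominator if denominator else 0.0


# Hypothetical toy labels standing in for disease groups.
y_true = ["cardio", "cardio", "resp", "resp"]
y_pred = ["cardio", "resp", "resp", "resp"]
precision, f1 = macro_precision_f1(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
print(round(precision, 4), round(f1, 4), round(mcc, 4))
```

Unlike Precision and F1-score, the Matthews correlation coefficient ranges from -1 to 1 and takes the whole confusion matrix into account, which makes it a useful complement when disease groups are unevenly represented.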