Keywords:
Cardiovascular disease
Machine learning
Prediction model
Abstract:
Objective: Cardiovascular disease accounts for the highest proportion of deaths among both urban and rural residents in China. Patients typically seek medical care only after symptoms appear, and conventional diagnostic methods for cardiovascular disease are complex and costly. This study therefore aimed to identify patients with cardiovascular disease using general demographic characteristics, comorbidities, and routine blood test indicators from physical examinations. Methods: The sample was drawn from 13,420 participants in the CHARLS database. After records with missing values were removed, models were built using logistic regression, decision tree, K-nearest neighbor, random forest, and neural network algorithms. The best-performing model was selected by comparing the area under the receiver operating characteristic curve (ROC_AUC) and was then used to build subgroup models for each cardiovascular disease subtype. The SHAP algorithm was used to interpret the models. Results: The logistic regression model performed best, with an ROC_AUC of 0.7644 (95% CI: 0.7397~0.7890). Among the subgroup models, identification of heart disease was comparatively good, with an ROC_AUC of 0.7747. SHAP-based interpretation showed that age, body mass index, diabetes, and smoking history made important contributions to identifying cardiovascular disease. Conclusion: Machine learning methods can identify patients with cardiovascular disease, so that results from simple examinations can be used to identify high-risk individuals early and to intervene.
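The model-selection criterion described in the Methods (compare ROC_AUC across candidate models, report a 95% CI) can be illustrated with a minimal sketch. The abstract does not state how the confidence interval was obtained, so the percentile bootstrap used here is an assumption, as are the function names `roc_auc`, `bootstrap_ci`, and `select_best` and the toy scores. None of this is the authors' code or data.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed method,
    not taken from the paper)."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample contains a single class; AUC is undefined
        aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def select_best(labels, model_scores):
    """Return (name, auc) for the candidate model with the highest AUC."""
    aucs = {name: roc_auc(labels, s) for name, s in model_scores.items()}
    best = max(aucs, key=aucs.get)
    return best, aucs[best]


# Toy predicted probabilities from two hypothetical models on four cases.
y_true = [0, 0, 1, 1]
candidates = {
    "logistic_regression": [0.1, 0.4, 0.35, 0.8],
    "decision_tree": [0.6, 0.5, 0.4, 0.7],
}
best_name, best_auc = select_best(y_true, candidates)
ci_low, ci_high = bootstrap_ci(y_true, candidates[best_name], n_boot=200)
```

In practice the scores would be held-out predicted probabilities from each fitted model, and the pairwise loop would be replaced by a rank-based computation for a sample of this size.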