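[Keywords]
[Abstract]
Classification models were built with machine learning algorithms to study the relationship between the physicochemical structure of chemical contaminants commonly found in food and their neurotoxicity. A compound database was compiled from the literature, comprising 57 compounds with various neurotoxic mechanisms, such as effects on neural differentiation and maturation or on neuronal migration and spatial orientation, and 50 compounds without neurotoxicity. Using R and SPSS, machine learning algorithms including random forests (RF), artificial neural network (ANN) and support vector machine (SVM) were used to select molecular descriptors and build classification models that predict compound neurotoxicity. The random forest model performed best overall, with a 10-fold cross-validation accuracy of 70.24%, prediction accuracies of 95.51% and 83.33% on the training and test sets, and areas under the curve of 0.99 and 0.85, respectively, making it a fairly suitable algorithm. The classification models built in this study can accurately predict the neurotoxicity of compounds from their molecular descriptors. Among the machine learning algorithms compared, the model based on random forests performed best. Descriptor importance results indicate that compound neurotoxicity is mainly related to the largest eigenvalue of the mass-weighted Burden matrix.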
[Keywords]
[Abstract]
Classification models based on machine learning algorithms were established to predict the neurotoxicity of chemical pollutants in food. A database of 57 neurotoxic and 50 non-neurotoxic compounds was compiled from the published literature. Using R and SPSS, random forest (RF), artificial neural network (ANN), support vector machine (SVM) and other algorithms were applied to molecular descriptors to build the classification models. The random forest model showed the best overall performance: its accuracy was 95.51% on the training set and 83.33% on the test set, the areas under the curve were 0.99 and 0.85, respectively, and the accuracy of 10-fold cross-validation was 70.24%. In this study, prediction models built on machine learning algorithms and cheminformatics can accurately distinguish neurotoxic from non-neurotoxic compounds. Among the models, the one constructed with the random forest algorithm performed best, and the highest eigenvalue of the Burden matrix, as a molecular descriptor, contributed most to the classification of chemicals with neurotoxic potential.
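The evaluation quantities reported above (accuracy, area under the ROC curve, and 10-fold cross-validation) can be illustrated with a minimal standard-library Python sketch. This is not the authors' code, which was written in R and SPSS. The function names `stratified_folds`, `accuracy` and `roc_auc` are hypothetical, and stratifying the folds by class is an assumption, since the abstract does not say how the folds were formed. Only the class counts (57 neurotoxic, 50 non-neurotoxic) come from the text.

```python
import random
from typing import List, Sequence


def stratified_folds(labels: Sequence[int], k: int = 10, seed: int = 0) -> List[List[int]]:
    """Deal sample indices into k folds so each fold keeps roughly the class ratio.

    Assumption: stratification is used here for illustration only.
    """
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    offset = 0  # continue dealing where the previous class stopped
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[(j + offset) % k].append(i)
        offset += len(idx)
    return folds


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Class counts taken from the abstract: 57 neurotoxic (1), 50 non-neurotoxic (0).
    labels = [1] * 57 + [0] * 50
    folds = stratified_folds(labels, k=10)
    print([len(f) for f in folds])
    # Toy predictions, purely illustrative (not data from the study).
    print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

In a cross-validation loop, each fold would serve once as the held-out set while a classifier is fitted on the other nine, and the fold accuracies would then be averaged. The classifier itself is left out of the sketch.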
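[CLC number]
[Funding]
National Key Research and Development Program of China, Key Special Project (2018YFC1602405); Scientific Research Program of the Science and Technology Commission of Shanghai Municipality (19140901200)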