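Prediction of the Influencing Factors of Protein Thermal Stability using Random Forest and Feature Selection Techniques

Author biography:

Zhang Li (张力, born 1988), male, doctoral candidate; research interest: biostatistics and health statistics. Ai Haixin (艾海新) is co-first author. Corresponding author: Liu Hongsheng (刘宏生, born 1963), male, professor and doctoral supervisor; research interests: bioinformatics and functional food science.

Funding:

Liaoning Provincial Department of Education Fund (L2014001); Liaoning Provincial Department of Science and Technology Fund (2014001015; 2013225086); Shenyang Science and Technology Bureau Key Science and Technology Project (F14-154-9-00); National Natural Science Foundation of China (31570160)
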
Abstract:

Thermal stability is crucial for the application of enzymes in the food industry. In this study, a random forest algorithm was used to predict enzyme thermostability from protein sequences, and the features that most strongly influence thermostability were analyzed. A total of 430 features were calculated for each of 1600 enzymes with thermal stability annotations obtained from the Swiss-Prot database. Class imbalance was handled by repeated under-sampling, and the 30 most important features were selected by backward recursive feature elimination (RFE). Random forest models built on different feature subsets were compared by cross-validation and independent testing. The model built on amino acid composition alone gave the best predictive performance (accuracy = 85.83%, sensitivity = 89.16%, specificity = 73.33%, precision = 77.00%, and F1-measure = 74.87%), suggesting that amino acid composition has the greatest influence on enzyme thermostability. Thermophilic enzymes contained more glutamic acid, isoleucine, and lysine, whereas mesophilic enzymes contained more glutamine, serine, and threonine. This study provides a theoretical and methodological basis for protein engineering of the thermostability of enzymes used in the food industry.
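The abstract names amino acid composition as the most informative feature set and repeated under-sampling as the remedy for class imbalance, but it does not describe how either was computed. The following is a minimal Python sketch of both steps under stated assumptions: composition is taken as the fraction of each of the 20 standard residues, and each under-sampling round keeps all minority-class items plus an equally sized random draw from the majority class. The function names `amino_acid_composition` and `repeated_undersample` are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a protein sequence.

    Assumption: non-standard symbols (e.g. X, B, Z) are ignored.
    """
    seq = sequence.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def repeated_undersample(majority, minority, n_repeats, seed=0):
    """Build n_repeats balanced training sets (illustrative scheme, assumed).

    Each set keeps every minority-class item and a fresh random draw,
    without replacement, of equally many majority-class items.
    """
    if len(majority) < len(minority):
        raise ValueError("majority class must be at least as large as minority class")
    rng = random.Random(seed)
    return [rng.sample(majority, len(minority)) + list(minority)
            for _ in range(n_repeats)]


# Toy usage: an 8-residue sequence and a 10-vs-3 class imbalance.
composition = amino_acid_composition("MKEEIKLE")
balanced_sets = repeated_undersample(list(range(10)), ["a", "b", "c"], n_repeats=4)
```

In a pipeline like the one the abstract outlines, one classifier would typically be trained per balanced set and the predictions combined; how the authors combined them is not stated here, so that step is left out of the sketch.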

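Cite this article

Zhang Li, Ai Haixin, Zhang Jikuan, Hu Huan, Liu Hongsheng, Ma Shucai. Prediction of the Influencing Factors of Protein Thermal Stability using Random Forest and Feature Selection Techniques[J]. Modern Food Science and Technology (现代食品科技), 2016, 32(7): 103-108.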

History
  • Received: 2015-08-14
  • Published online: 2016-07-29