Prediction of the Influencing Factors of Protein Thermal Stability using Random Forest and Feature Selection Techniques
Abstract:
Thermal stability is crucial for the application of an enzyme in the food industry. The thermostability of enzymes was predicted from protein sequences using a random forest algorithm, and the factors that most influence protein thermal stability were analyzed. A total of 430 protein features were calculated for 1600 enzymes with thermal stability information extracted from the Swiss-Prot database. Data imbalance was addressed by repeated under-sampling, and the 30 most important features were selected by backward recursive feature elimination (RFE). The classification performance of random forest models built on different feature subsets was evaluated by cross-validation and independent testing. The results indicated that the model built on amino acid composition exhibited the best performance (accuracy = 85.83%, sensitivity = 89.16%, specificity = 73.33%, precision = 77.00%, and F-measure = 74.87%), suggesting that amino acid composition had the most significant impact on the thermal stability of an enzyme. Furthermore, thermophilic enzymes contained relatively high contents of glutamic acid, isoleucine, and lysine, whereas mesophilic enzymes contained high contents of glutamine, serine, and threonine. These results provide a theoretical basis and a method for engineering proteins to improve enzyme thermostability for the food industry.
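As an illustration of two quantities named in the abstract, the following minimal Python sketch computes the amino acid composition of a sequence and derives the five reported metrics from confusion-matrix counts. It is not the authors' implementation. The function names, the toy sequence, and the example counts are hypothetical, and the choice of which class (thermophilic or mesophilic) counts as "positive" is an assumption that the abstract does not state.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a protein sequence.

    Non-standard characters (e.g. X, gaps) are ignored; this is an
    assumption of the sketch, not a rule taken from the abstract.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def classification_metrics(tp, fp, tn, fn):
    """Compute the five metrics from confusion-matrix counts.

    Assumes every denominator is non-zero. F-measure is taken here as the
    harmonic mean of precision and sensitivity (recall).
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
    }


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from Swiss-Prot.
    composition = amino_acid_composition("MKKLEEIKKE")
    print({aa: frac for aa, frac in composition.items() if frac > 0})

    # Hypothetical counts, unrelated to the results reported in the abstract.
    print(classification_metrics(tp=8, fp=2, tn=6, fn=4))
```

A 20-element composition vector of this kind would be one feature subset supplied to a random forest classifier. The under-sampling and RFE steps are not shown here.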