Abstract: Predicting trypsin-catalyzed specific proteolysis is vital for guiding the theoretical degradation and structural analysis of proteins. In this study, a deep learning model combining a convolutional neural network (CNN) with a long short-term memory (LSTM) network was constructed to predict trypsin-catalyzed protein cleavage, and the impact of several hyperparameters on its performance was explored. The results showed that a lower learning rate, a larger batch size, and fewer convolutional layers improved training stability and led to better predictions. The optimal configuration was a learning rate of 0.001, a batch size of 512, and one convolutional layer. With these settings, the model achieved an accuracy of 0.950, a specificity of 0.987, a precision of 0.986, a recall of 0.961, and an F1 score of 0.973 on dataset PXD010627, demonstrating excellent predictive capability and stability. Furthermore, when applied to publicly available datasets from different species, the model maintained an accuracy above 0.920, with AUC, precision, recall, and F1 score all between 0.900 and 0.989. This indicates that the model generalizes well in predicting trypsin-specific cleavage sites in proteins, which could significantly enhance the accuracy and reliability of cleavage site prediction. We hope this work offers a new approach to protein identification and spatial structure analysis in proteomics, promoting advances in proteolysis research and bioinformatics.
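For readers less familiar with the evaluation metrics reported above, the following minimal sketch shows how accuracy, specificity, precision, recall, and F1 score are conventionally derived from the confusion-matrix counts of a binary classifier (cleaved versus not cleaved). The names `ConfusionCounts`, `classification_metrics`, and `f1_score`, and the example counts, are illustrative assumptions and are not taken from the study's code or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts (hypothetical container)."""
    tp: int  # cleavage sites correctly predicted as cleaved
    fp: int  # non-cleaved sites wrongly predicted as cleaved
    tn: int  # non-cleaved sites correctly predicted as non-cleaved
    fn: int  # cleavage sites wrongly predicted as non-cleaved


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return _ratio(2 * precision * recall, precision + recall)


def classification_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute the five metrics named in the abstract from raw counts."""
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)
    return {
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


if __name__ == "__main__":
    # Made-up counts, used only to exercise the formulas.
    example = ConfusionCounts(tp=90, fp=10, tn=80, fn=20)
    for name, value in classification_metrics(example).items():
        print(f"{name}: {value:.3f}")
```

Because F1 is the harmonic mean of precision and recall, `f1_score(0.986, 0.961)` rounds to 0.973, matching the F1 score reported for PXD010627.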