Citation:

Tang Tian-Yi, Xiong Yi-Ming, Zhang Rui-Ge, Zhang Jian, Li Wen-Fei, Wang Jun, Wang Wei
The AI revolution, sparked by natural language and image processing, has brought new ideas and research paradigms to the field of protein computing. One significant advance is the development of pre-trained protein language models through self-supervised learning on massive collections of protein sequences. These pre-trained models encode diverse information about protein sequences, evolution, structures, and even functions; this information transfers readily to downstream tasks and shows robust generalization. Recently, researchers have further developed multimodal pre-trained models that integrate more diverse types of data. This paper summarizes and reviews recent studies in this direction from the following aspects. First, protein pre-trained models that integrate structural information into language models are reviewed; this is particularly important because structure is the primary determinant of protein function. Second, pre-trained models that integrate protein dynamic information are introduced; these models may benefit downstream tasks such as protein-protein interactions, soft docking of ligands, and interactions involving allosteric proteins and intrinsically disordered proteins. Third, pre-trained models that integrate knowledge such as the gene ontology are described. Fourth, pre-trained models in the RNA field are briefly introduced. Finally, the most recent developments in protein design are presented, and the relationship between these models and the aforementioned structure-integrating pre-trained models is discussed.
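The self-supervised objective used by most of the sequence models reviewed here is masked language modeling (MLM): a fraction of residues is hidden and the model is trained to reconstruct them. The sketch below illustrates only the masking step; the mask token, 15% rate, and function names are illustrative conventions, not any specific model's actual tokenizer.

```python
import random

# The 20 standard amino acids; "<mask>" is an assumed placeholder token,
# mirroring the masked-language-model (MLM) objective the review describes.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
MASK = "<mask>"

def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Randomly mask ~15% of residues. Returns the corrupted token list
    and a dict mapping masked positions to the residues the model
    would be trained to reconstruct."""
    rng = rng or random.Random(0)
    tokens = list(seq)
    targets = {}
    for i, aa in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = aa      # ground-truth residue to predict
            tokens[i] = MASK     # corrupt the input
    return tokens, targets
```

During pre-training, a Transformer or LSTM encoder consumes `tokens` and a cross-entropy loss is computed only at the positions stored in `targets`.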
Corresponding authors: Zhang Jian, jzhang@nju.edu.cn; Wang Wei, wangwei@nju.edu.cn
Funds: Project supported by the Science and Technology Innovation Project of the Ministry of Science and Technology (Grant No. 2030-2021ZD0201300) and the National Natural Science Foundation of China (Grant No. 11934008).
General protein pre-trained models integrating structural information

| Model | Year | Architecture | Modalities | Pre-training method | Training data | Params | Compute | Downstream tasks | Ref. |
|---|---|---|---|---|---|---|---|---|---|
| Bepler & Berger | 2019 | Bi-LSTM | Sequence, structure | MLM for sequences; supervised learning on 3D structures | 76M sequences, 28K structures | | 1x 32G V100, 13-51 days | Fold classification, transmembrane region prediction | [19,42] |
| Guo model | 2022 | CNN | Structure | Self-supervised pre-training on noised pairwise distances | 73K structures | | | QA, PPI | [43] |
| New IEConv | 2022 | GCN | Sequence, structure | Contrastive learning between randomly sampled 3D substructures | 476K chains | 30M | | Function prediction, fold classification, structural similarity prediction, protein-ligand binding affinity prediction | [44] |
| GearNet | 2023 | ESM-1b, GearNet | Sequence, structure | PLM; contrastive learning | 805K structures from AlphaFoldDB | | 4x A100 | Fold classification, EC, GO | |
| STEPS | 2023 | BERT, GCN | Sequence, structure | PLM; supervised learning on 3D structures | 40K structures | | | Membrane protein classification, cellular location prediction, EC | |
| Uni-Mol | 2023 | Transformer | Sequence, structure | Atom 3D position denoising; masked atom type prediction | 209M molecular conformations, 3.2M protein pocket structures | | 8x 32G V100, 3 days | Molecular property prediction, conformation generation, pocket property prediction, protein-ligand binding pose prediction | |
| SaProt | 2023 | BERT | Sequence, structure | Structures converted to a structure-aware vocabulary; MLM | 40M sequences and structures from PDB/AlphaFoldDB | 650M | 64x 80G A100, 3 months | Thermostability, HumanPPI, metal ion binding, EC, GO, DeepLoc, contact prediction | [51] |

Task-specific protein pre-trained models integrating structural information

| Model | Year | Architecture | Modalities | Pre-training method | Training data | Params | Compute | Downstream tasks | Ref. |
|---|---|---|---|---|---|---|---|---|---|
| Evoformer | 2021 | Evoformer | Sequence, structure | MLM; supervised learning | BFD+Uniclust30, PDB | | 128x TPU-v3, 11 days | Structure prediction | [2] |
| DeepFRI | 2021 | LSTM+GCN | Sequence, structure | PLM (pre-trained, frozen); supervised learning on 3D structures | 10M sequences for pre-training | | | GO, EC, PPI interaction sites | [47] |
| LM-GVP | 2022 | Transformer+GVP | Sequence, structure | PLM (trainable); supervised learning on 3D structures | | | 8x 32G V100 | Fluorescence, protease stability, GO, mutational effects | [48] |
| ProNet | 2023 | GCN | Sequence, structure | Supervised learning | | | | Fold classification, reaction classification, binding affinity, PPI | |
| HoloProt | 2022 | MPN | Sequence, structure, surface | Supervised learning | | 1.8M | 1x 1080Ti, 1 day | Ligand binding affinity, EC | [56] |

Pre-trained models encoding dynamic 3D structural information

| Model | Year | Architecture | Modalities | Pre-training method | Training data | Params | Compute | Downstream tasks | Ref. |
|---|---|---|---|---|---|---|---|---|---|
| ProtMD | 2022 | E(3)-equivariant graph matching network | Sequence, structure trajectory | Self-supervised learning: atom-level prompt-based denoising generation; conformation-level snapshot ordering | 62.8K MD snapshots of 64 protein-ligand pairs | 5.2M | 4x V100 | Binding affinity prediction, binary classification of ligand efficacy | [58] |

Protein pre-trained models integrating knowledge

| Model | Year | Architecture | Modalities | Pre-training method | Training data | Params | Compute | Downstream tasks | Ref. |
|---|---|---|---|---|---|---|---|---|---|
| OntoProtein | 2022 | ProtBert, Gu model | Sequence, knowledge | MLM; contrastive learning | ProteinKG25 with 5M knowledge triples | | V100 | TAPE, PPI, protein function prediction | [60] |
| KeAP | 2023 | ProtBert, Gu model | Sequence, knowledge | MLM | ProteinKG25 | | | TAPE, PPI, protein function prediction | [62] |
| ProtST | 2023 | ProtBert, ESM-1b, ESM-2, PubMedBERT | Sequence, knowledge | MLM; multimodal representation alignment; multimodal mask prediction | ProtDescribe with 553K sequence-property pairs | | 4x V100 | Protein localization prediction, fitness landscape prediction, protein function annotation | [63] |

RNA language models

| Model | Year | Architecture | Modalities | Pre-training method | Training data | Params | Compute | Downstream tasks | Ref. |
|---|---|---|---|---|---|---|---|---|---|
| RNA-FM | 2022 | BERT | Sequence | MLM | RNAcentral, 23.7M ncRNA sequences | | 8x 80G A100, 1 month | SS prediction, 3D contact/distance maps, 3D reconstruction, evolutionary study, RNA-protein interaction, MRL prediction | [78] |
| RNABert | 2022 | BERT | Sequence | MLM | RNAcentral (762K) and Rfam 14.3 | | V100 | Structural alignment, clustering | [86] |
| SpliceBERT | 2023 | BERT | Sequence | MLM | Pre-mRNA of 72 vertebrates, 2M sequences, 64B nucleotides | 19.4M | 8x V100, 1 week | Multi-species splice site prediction, human branch point prediction | [79] |
| RNA-MSM | 2023 | MSA Transformer | Sequence | MLM | 4069 RNA families from Rfam 14.7 | | 8x 32G V100 | SS prediction, solvent accessibility prediction | [83] |
| Uni-RNA | 2023 | BERT | Sequence | MLM | RNAcentral, nt, and GWH (1 billion sequences) | 25M-400M | 128x A100 | SS prediction, 3D structure prediction, MRL, isoform percentage prediction on 3'UTR, splice site prediction, ncRNA functional family classification, modification site prediction | [84] |
| RNAErnie | 2024 | ERNIE | Sequence, motif information | MLM with base-, subsequence-, and motif-level masking | RNAcentral, 23M ncRNA sequences | 105M | 4x 32G V100, 250 hours | Sequence classification, RNA-RNA interaction, SS prediction | [85] |

*PLM, protein language model; MLM, masked language modeling; GCN, graph convolutional network; GVP, geometric vector perceptron; EC, enzyme commission number prediction; GO, gene ontology term prediction; PPI, protein-protein interaction; TAPE, tasks assessing protein embeddings benchmark; QA, quality assessment of structures; SS, secondary structure; MRL, mean ribosome load prediction for mRNA.
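SaProt's approach of converting structures into a "structure-aware vocabulary" can be illustrated with a minimal sketch: each token pairs an amino-acid letter with a discretized local-structure state, so a standard MLM objective sees both modalities at once. The 10-letter structure alphabet below is hypothetical, not SaProt's actual Foldseek-derived structure states.

```python
from itertools import product

# Residue alphabet (20 standard amino acids) and an assumed 10-state
# local-structure alphabet; SaProt's real structure states differ.
SEQ_ALPHABET = list("ACDEFGHIKLMNPQRSTVWY")
STRUCT_ALPHABET = list("abcdefghij")

# Joint vocabulary: one token per (residue, structure-state) pair,
# giving 20 * 10 = 200 structure-aware tokens.
VOCAB = {aa + s: idx
         for idx, (aa, s) in enumerate(product(SEQ_ALPHABET, STRUCT_ALPHABET))}

def tokenize(seq, struct):
    """Fuse a residue string and its per-residue structure states
    into joint structure-aware token ids."""
    return [VOCAB[aa + s] for aa, s in zip(seq, struct)]
```

With such a vocabulary, masking a token forces the model to reconstruct the residue identity and its local structural context jointly, which is how structural information enters an otherwise sequence-only MLM.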
    Publishing process
    • Received Date:07 June 2024
    • Accepted Date:12 July 2024
    • Available Online:09 August 2024
    • Published Online:20 September 2024
