The AI revolution sparked by natural language and image processing has brought new ideas and research paradigms to the field of protein computing. One major advance is the pre-training of protein language models by self-supervised learning on massive collections of protein sequences. These pre-trained models encode information about protein sequences, evolution, structure, and even function; they can be readily transferred to a wide range of downstream tasks and exhibit strong generalization. Building on this foundation, researchers have been developing multimodal pre-trained models that integrate more diverse types of data. This review summarizes recent studies in this direction from the following aspects. First, it reviews pre-trained models that integrate protein structure into language models; this is of particular importance because protein structure is the primary determinant of protein function. Second, pre-trained models that integrate protein dynamic information are introduced; such models may benefit downstream tasks such as protein-protein interaction, soft docking of ligands, and interactions involving allosteric or intrinsically disordered proteins. Third, pre-trained models that integrate prior knowledge such as gene ontology are described. Fourth, pre-trained models in the RNA field are briefly introduced. Finally, the most recent developments in protein design are presented, and the relations of these models to the aforementioned structure-integrating pre-trained models are discussed. The current status of these areas, the difficulties they face, and possible solutions are also discussed.