General-purpose protein pre-trained models incorporating structural information

| Model | Year | Architecture | Input | Training objective | Training data | Params | Compute | Downstream tasks | Ref. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bepler & Berger | 2019 | Bi-LSTM | Sequence, structure | MLM for sequences, supervised learning for 3D structures | 76M sequences, 28K structures | — | 1X 32G-V100, 13 to 51 days | Fold classification, transmembrane region prediction | [19,42] |
| Guo model | 2022 | CNN | Structure | Self-supervised pre-training on noised pairwise distances | 73K structures | — | — | QA, PPI | [43] |
| New IEConv | 2022 | GCN | Sequence, structure | Contrastive learning between randomly sampled 3D substructures | 476K chains | 30M | — | Protein function prediction, protein fold classification, structural similarity prediction, protein-ligand binding affinity prediction | [44] |
| GearNet | 2023 | ESM-1b, GearNet | Sequence, structure | PLM, contrastive learning | 805K structures from AlphaFoldDB | — | 4X A100 | Fold classification, EC, GO | |
| STEPS | 2023 | BERT, GCN | Sequence, structure | PLM, supervised learning from 3D structures | 40K structures | — | — | Membrane protein classification, cellular location prediction, EC | |
| Uni-Mol | 2023 | Transformer | Sequence, structure | Atom 3D position denoising, masked atom type prediction | 209M molecule conformations, 3.2M protein pocket structures | — | 8X 32G-V100, 3 days | Molecular property prediction, molecular conformation generation, pocket property prediction, protein-ligand binding pose prediction | |
| SaProt | 2023 | BERT | Sequence, structure | Conversion of structures to a structure-aware vocabulary, MLM | 40M sequences and structures from PDB/AlphaFoldDB | 650M | 64X 80G-A100, 3 months | Thermostability, HumanPPI, Metal Ion Binding, EC, GO, DeepLoc, contact prediction | [51] |
Non-general-purpose protein pre-trained models incorporating structural information

| Model | Year | Architecture | Input | Training objective | Training data | Params | Compute | Downstream tasks | Ref. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Evoformer | 2021 | Evoformer | Sequence, structure | MLM, supervised learning | BFD + Uniclust30, PDB | — | 128 TPU-v3, 11 days | Structure prediction | [2] |
| DeepFRI | 2021 | LSTM + GCN | Sequence, structure | PLM (pretrained, frozen), supervised learning for 3D structures | 10M sequences for pre-training | — | — | GO, EC, PPI interaction sites | [47] |
| LM-GVP | 2022 | Transformer + GVP | Sequence, structure | PLM (changeable), supervised learning for 3D structures | — | — | 8X 32G-V100 | Fluorescence, protease stability, GO, mutational effects | [48] |
| ProNet | 2023 | GCN | Sequence, structure | Supervised learning | — | — | — | Fold classification, reaction classification, binding affinity, PI | |
| HoloProt | 2022 | MPN | Sequence, structure surface | Supervised learning | — | 1.8M | 1X 1080Ti, 1 day | Ligand binding affinity, EC | [56] |
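Pre-trained models encoding dynamic 3D structural information

| Model | Year | Architecture | Input | Training objective | Training data | Params | Compute | Downstream tasks | Ref. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ProtMD | 2022 | E(3)-Equivariant Graph Matching Network | Sequence, structure trajectory | Self-supervised learning, atom-level prompt-based denoising generative task, conformation-level snapshot ordering task | 62.8K snapshots from MD for 64 protein-ligand pairs | 5.2M | 4X V100 | Binding affinity prediction, binary classification of ligand efficacy | [58] |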
Protein pre-trained models incorporating knowledge

| Model | Year | Architecture | Input | Training objective | Training data | Params | Compute | Downstream tasks | Ref. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OntoProtein | 2022 | ProtBert, Gu-model | Sequence, knowledge | MLM, contrastive learning | ProteinKG25 with 5M knowledge triples | — | V100 | TAPE, PPI, protein function prediction | [60] |
| KeAP | 2023 | ProtBert, Gu-model | Sequence, knowledge | MLM | ProteinKG25 | — | — | TAPE, PPI, protein function prediction | [62] |
| ProtST | 2023 | ProtBert, ESM-1b, ESM-2, PubMedBert | Sequence, knowledge | MLM, multimodal representation alignment, multimodal mask prediction | ProtDescribe with 553K sequence-property pairs | — | 4X V100 | Protein localization prediction, fitness landscape prediction, protein function annotation | [63] |
RNA language models

| Model | Year | Architecture | Input | Training objective | Training data | Params | Compute | Downstream tasks | Ref. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RNA-FM | 2022.8 | BERT | Sequence | MLM | RNAcentral, 23.7M ncRNA sequences | — | 8X A100 80G, 1 month | SS prediction, 3D contact/distance map, 3D reconstruction, evolution study, RNA-protein interaction, MRL prediction | [78] |
| RNABert | 2022 | BERT | Sequence | MLM | RNAcentral (762K) & Rfam 14.3 dataset | — | V100 | Structural alignment, clustering | [86] |
| SpliceBERT | 2023 | BERT | Sequence | MLM | Pre-mRNA of 72 vertebrates, 2M sequences, 64B nucleotides | 19.4M | 8X V100, 1 week | Multi-species splice site prediction, human branch point prediction | [79] |
| RNA-MSM | 2023 | MSA-transformer | Sequence | MLM | 4069 RNA families from Rfam 14.7 | — | 8X V100 32G | SS prediction, solvent accessibility prediction | [83] |
| Uni-RNA | 2023 | BERT | Sequence | MLM | RNAcentral & nt & GWH (1 billion sequences) | 25M to 400M | 128X A100 | SS prediction, 3D structure prediction, MRL, isoform percentage prediction on 3' UTR, splice site prediction, classification of ncRNA functional families, modification site prediction | [84] |
| RNAErnie | 2024 | ERNIE | Sequence, motif information | MLM with base-, subsequence-, and motif-level masking | RNAcentral, 23M ncRNA sequences | 105M | 4X V100 32G, 250 hours | Sequence classification, RNA-RNA interaction, SS prediction | [85] |