ESM3 is an evolution of the ESM model family that scales up parameters, data, and compute relative to ESM2, which allows it to improve on sequence, structure, and function representations of proteins.
ESM3 is a multimodal, bidirectional transformer that models sequence, structure, and function using discrete token representations for each modality. It merges these representations into a single latent space and is trained with a masked language model objective, allowing it to generate and predict across different modalities.
ESM3's largest model has 98 billion parameters.
The model was trained with 1.07 × 10²⁴ floating point operations (FLOPs) over a dataset of 771 billion tokens from 2.78 billion proteins.
Structural tokens in ESM3 are encoded by a discrete autoencoder that compresses three-dimensional protein structures into a sequence of discrete tokens. This is done by encoding local atomic environments around each amino acid and representing them in a simplified form that captures geometric properties.
The structure vocabulary contains 4096 tokens in total.
The structural autoencoder tokenizes protein structures by encoding local neighborhoods around each amino acid into discrete tokens. It uses a geometric attention mechanism that operates in local reference frames, based on bond geometry. This mechanism encodes and reconstructs the atomic structure, supervised by a geometric loss that preserves distances and orientations of bonds and atoms.
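The discretization step can be pictured as a nearest-neighbour lookup in a learned codebook. The sketch below is a minimal illustration, assuming a vector-quantization style bottleneck; the function name `quantize`, the toy 2-D embeddings, and the 4-entry codebook are hypothetical stand-ins (the real vocabulary has 4096 entries and the real embeddings come from the geometric encoder).

```python
import math

def quantize(embedding, codebook):
    """Return the index of the codebook vector closest to `embedding`."""
    distances = [math.dist(embedding, code) for code in codebook]
    return min(range(len(codebook)), key=distances.__getitem__)

# Toy codebook with 4 entries (hypothetical; stands in for the 4096-entry vocabulary).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# One made-up encoder output per residue.
residue_embeddings = [(0.1, -0.2), (0.9, 0.8), (0.2, 1.1)]

# Each residue's local environment becomes a single integer token.
structure_tokens = [quantize(e, codebook) for e in residue_embeddings]
print(structure_tokens)  # [0, 3, 2]
```

A decoder would then map the token sequence back to coordinates; that part is omitted here.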
ESM3 can be used to generate novel proteins: it produces protein sequences and structures when prompted with sequence or structural tokens. It uses iterative sampling, starting from a masked context, where tokens are predicted and unmasked progressively until a full sequence or structure is generated. This allows the model to create novel proteins that respect the given prompts or constraints.
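The iterative sampling loop can be sketched as follows. This is a toy illustration, not ESM3's actual decoding code: `toy_predict` is a hypothetical stand-in for the model that returns a random residue and a random confidence for each masked position, and the "commit the most confident proposals first" schedule is an assumption.

```python
import math
import random

MASK = "_"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_predict(tokens, rng):
    """Hypothetical model stand-in: (token, confidence) for every masked position."""
    return {i: (rng.choice(AMINO_ACIDS), rng.random())
            for i, t in enumerate(tokens) if t == MASK}

def iterative_generate(prompt, steps, seed=0):
    """Unmask a prompt over roughly `steps` rounds, keeping prompted residues fixed."""
    rng = random.Random(seed)
    tokens = list(prompt)
    per_step = max(1, math.ceil(tokens.count(MASK) / steps))
    while MASK in tokens:
        proposals = toy_predict(tokens, rng)
        # Commit only the most confident proposals; the rest stay masked
        # and are re-predicted in the next round with more context.
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in best[:per_step]:
            tokens[i] = proposals[i][0]
    return "".join(tokens)

prompt = "M__K____G_"  # fixed residues act as the prompt, "_" marks masked positions
design = iterative_generate(prompt, steps=4)
print(design)
```

With a real model, the prompt could equally contain structure or function tokens; the loop itself stays the same.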
The model was verified experimentally by generating novel proteins, including a green fluorescent protein that was synthesized and tested for fluorescence in the laboratory. The novel protein had a sequence identity of 58% to the nearest known fluorescent protein.
A foundational model following in the footsteps of AlphaFold3, aimed at predicting molecular interactions.
The model architecture is closely modeled on that of AF3; however, it was not benchmarked against it (nor against ESM3) because of use restrictions.
Training the model takes 30 days on 128 A100s. A back-of-the-envelope estimate using Google Colab Pro+ pricing for 128 GPUs (which is NOT a distributed setup) puts it at ca. 120k USD :)
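The estimate can be unpacked with a few lines of arithmetic. The inputs are the figures quoted above; the implied hourly rate is simply what falls out of them, not a published price.

```python
days = 30
gpus = 128
quoted_cost_usd = 120_000  # rough figure quoted in the text

gpu_hours = days * 24 * gpus
implied_rate = quoted_cost_usd / gpu_hours

print(f"{gpu_hours:,} A100-hours")               # 92,160 A100-hours
print(f"~{implied_rate:.2f} USD per A100-hour")  # ~1.30 USD per A100-hour
```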
For antibodies, they introduce constraints (e.g. a known epitope residue) to help the model out. Adding even a single residue makes a big difference relative to the unconstrained baseline, which is tantamount to global docking.
A single residue constraint is enough to improve ab-ag complex prediction. In ‘local’ mode, with just one residue, the success rate is about 50% for DockQ > 0.21, 30% for DockQ > 0.49, and less than 10% for high quality hits.
In ‘global’ mode, i.e. without constraints, the success rate is about 35% for DockQ > 0.21, 20% for DockQ > 0.49, and less than 5% for high quality hits.
So altogether, if I want to ‘hit the epitope’, the model has ca. 30% success rate.
If I want a high quality ab-ag complex structure, unfortunately it seems that constraints do not help much currently.
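For readers who want to reproduce this kind of tally on their own predictions, here is a minimal sketch of turning a list of DockQ scores into success rates per cutoff. The 0.21 and 0.49 cutoffs are the ones quoted above; the 0.80 cutoff for "high", the tier labels, and the scores themselves are assumptions made for illustration.

```python
def success_rates(dockq_scores, thresholds):
    """Fraction of predictions whose DockQ exceeds each cutoff."""
    n = len(dockq_scores)
    return {name: sum(score > cutoff for score in dockq_scores) / n
            for name, cutoff in thresholds.items()}

# 0.21 and 0.49 come from the text; 0.80 for "high" is an assumed conventional cutoff.
THRESHOLDS = {"acceptable": 0.21, "medium": 0.49, "high": 0.80}

# Made-up DockQ scores for ten predicted complexes.
scores = [0.05, 0.10, 0.25, 0.30, 0.55, 0.60, 0.85, 0.15, 0.02, 0.45]

rates = success_rates(scores, THRESHOLDS)
print(rates)  # {'acceptable': 0.6, 'medium': 0.3, 'high': 0.1}
```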
GearBind - Novel framework to predict the effect of mutations on an antibody-antigen complex
The architecture is graph-based and trained in a contrastive fashion on real atoms and their surroundings versus randomly sampled atoms (from rotamer libraries) within the same environment. They use real proteins from CATH for this purpose. The real atoms serve as ‘positives’ for contrastive learning and the randomly sampled ones as ‘negatives’.
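The positives-versus-negatives idea can be sketched with a generic softmax-style contrastive loss. This is an assumption about the general shape of such an objective, not the authors' exact loss: the function name and the scalar "energies" (lower means more plausible) are hypothetical.

```python
import math

def contrastive_loss(energy_real, energies_decoys):
    """Negative log-probability of the real placement under a softmax over -energy."""
    logits = [-energy_real] + [-e for e in energies_decoys]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]

# Real placement scored as more favourable than the sampled decoys: small loss.
good = contrastive_loss(energy_real=-2.0, energies_decoys=[1.0, 0.5, 2.0])
# Real placement scored as less favourable than the decoys: large loss.
bad = contrastive_loss(energy_real=2.0, energies_decoys=[-1.0, -0.5, -2.0])
print(good < bad)  # True
```

Minimizing such a loss pushes the network to score native atomic environments above rotamer-sampled alternatives, which is the behaviour the pretraining is after.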
The method shows improvement on previous datasets: SKEMPI and the Absci HER2 dataset.
The authors demonstrated the effectiveness of the method by performing in silico affinity maturation on two existing binders.
The study evaluates a number of generative models on datasets of antibodies with reported affinities.
The methods tested were: MEAN, dyMEAN, IgBlend, AbLang, AbLang2, AntiBERTy, ESM, AntiFold, ESM-IF, AbX, DiffAb + their own version of DiffAb.
Datasets used were the Absci HER2 datasets (100s of binders) and a number of datasets with tens of binders each.
All models have some correlation with the affinity data, though weak.
Adding epitope information is not a game changer, which suggests that the models mostly capture the fitness of the antibody first and information about the antigen second, if at all.
Employing structural information helps as compared to purely sequence approaches.
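To make "some correlation with the affinity data" concrete, here is a minimal sketch of a rank correlation between model scores and measured affinities. Using Spearman's rho is an assumption (the text does not say which statistic was reported), and both data lists are made up.

```python
def ranks(values):
    """1-based ranks of `values` (assumes no ties)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman rho for tie-free data: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Made-up model log-likelihoods and made-up measured affinities (e.g. pKd).
model_scores = [-1.2, -0.8, -1.5, -0.3, -0.9, -1.1]
affinities = [7.1, 6.9, 7.4, 7.8, 8.2, 6.5]

rho = spearman(model_scores, affinities)
print(round(rho, 2))  # 0.2, i.e. a weak positive correlation
```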
Novel humanization software, allowing for rapid re-design of both heavy and light chains.
Unlike other tools such as Hu-mAb and Sapiens, which humanize heavy and light chains separately, Humatch jointly humanizes both chains, improving stability and reducing the risk of immunogenic epitopes between chains.
Humatch consists of three lightweight Convolutional Neural Networks (CNNs). Each CNN is trained for a specific task: one for heavy chains (CNN-H), one for light chains (CNN-L), and one for assessing natural heavy/light chain pairing (CNN-P). The CNNs are designed to output multiclass predictions for identifying human V-genes and classifying chain pairings.
The CNNs were trained on data from the Observed Antibody Space (OAS), which includes millions of human and non-human antibody sequences.
Humatch's performance was measured through precision-recall and ROC-AUC metrics, achieving near-perfect accuracy in classifying human and non-human sequences. Performance was also tested by humanizing 25 precursor antibodies and comparing the mutations with experimentally derived humanized versions, showing high overlap (77-82%) with experimental designs.
The authors describe how a structure predictor can be used to re-design the binding site while maintaining binding.
They use the proprietary GaluxDesign method, which achieves 1.4 Å Cα RMSD in predicting CDR-H3 loop structures and leverages a unique scoring metric (G-pass rate) that assesses both confidence and structural consistency for antibody design.
The method outperforms AlphaFold 2.3, ABlooper, and ImmuneBuilder in predicting CDR-H3 loop structures, with significantly lower RMSD values (1.4 Å compared to 2.4-3.7 Å), particularly on a more challenging, time-separated dataset.
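For reference, the RMSD numbers above are root-mean-square deviations over Cα atoms of the loop. A minimal sketch is given below, assuming the predicted and reference structures have already been superimposed (for CDR-H3 this is commonly done on the framework, but the text does not specify the protocol); the coordinates are made up.

```python
import math

def ca_rmsd(pred, ref):
    """RMSD between two equal-length lists of (x, y, z) C-alpha coordinates.

    Assumes the two structures are already superimposed; no fitting is done here.
    """
    if len(pred) != len(ref):
        raise ValueError("coordinate lists must have the same length")
    squared = sum(math.dist(p, r) ** 2 for p, r in zip(pred, ref))
    return math.sqrt(squared / len(pred))

# Made-up three-residue loop, coordinates in angstroms.
ref = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
pred = [(0.0, 1.0, 0.0), (3.8, 0.0, 2.0), (7.6, 0.0, 0.0)]

print(f"{ca_rmsd(pred, ref):.2f} A")  # 1.29 A
```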
The binding propensity to HER2 was evaluated using a large mutant library and calculated via the G-pass rate, which outperformed AlphaFold's PAE-based scoring. The model showed strong discrimination with an AUROC of 0.758, compared to 0.529 for AlphaFold. Each novel loop is scored using their metric (G-pass rate) in complex with HER2.
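As a reminder of what the AUROC figures mean: AUROC is the probability that a randomly chosen binder receives a higher score than a randomly chosen non-binder, so 0.5 is chance level. A minimal pairwise sketch follows; the scores are made up and merely stand in for something like a G-pass rate.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up scores for known binders and non-binders.
binders = [0.9, 0.7, 0.6, 0.4]
non_binders = [0.8, 0.5, 0.3, 0.2]

print(auroc(binders, non_binders))  # 0.75
```

The pairwise loop is quadratic in the number of variants; for a large mutant library a rank-based implementation would be preferable.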
Novel antibody sequences were designed by predicting six CDR loops in antibody-protein complexes, using GaluxDesign models. These designs were experimentally tested, achieving high success rates, including a 13.2% success rate for HER2 antibody designs using yeast display methods.
The authors demonstrate that scores from DeepAb can be used to rank mutations in an antibody that improve affinity and a series of other properties.
The authors used the DeepAb structure prediction model to rank mutations based on their impact on structure prediction confidence, leading to the design of 200 novel anti-hen egg lysozyme (HEL) antibody variants.
Single-point mutations from a deep mutational scanning (DMS) dataset (Warszawski et al.) were combined into multi-mutation variants (up to 7 mutations), and these variants were selected based on DeepAb scores for experimental testing.
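The combination step can be sketched as a simple enumeration. This is an illustration only: the `(position, residue)` encoding, the example mutations, and the rule of allowing at most one mutation per position are assumptions, and the DeepAb scoring that would rank the resulting variants is not shown.

```python
from itertools import combinations

def combine_mutations(single_mutations, max_size):
    """Yield variants of 2..max_size mutations with at most one mutation per position."""
    for k in range(2, max_size + 1):
        for combo in combinations(single_mutations, k):
            positions = {pos for pos, _ in combo}
            if len(positions) == k:  # skip combos that hit the same position twice
                yield combo

# Hypothetical beneficial single-point mutations as (position, new residue).
singles = [(31, "Y"), (31, "F"), (52, "D"), (100, "W")]

variants = list(combine_mutations(singles, max_size=3))
print(len(variants))  # 7
```

With a realistic number of single mutations and up to 7 per variant, the space grows combinatorially, which is why a score is needed to pick a few hundred candidates for testing.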
The designed variants were expressed and tested for thermostability, colloidal stability, and binding affinity to HEL.
A large percentage of the variants showed improved thermostability (91%) and affinity (94%), with 10% showing significant increases in binding affinity.
A subset of 27 high-performing variants was further tested for developability characteristics, including nonspecific binding, aggregation propensity, and self-association, ensuring their practical usability.
Novel language model applied to predicting antibody binding affinity without using antigen information.
AntiFormer is a graph-based large language model that combines sequence information with graph structures to predict antibody binding affinity. Its dual-flow architecture includes a transformer-based encoder for sequence features and a graph convolutional network (GCN) for capturing structural relationships (from sequence!), offering enhanced prediction accuracy.
AntiFormer was compared against advanced models like AntiBERTy and AntiBERTa, as well as basic transformer models with 6 and 12 layers. It performs better across all evaluation metrics, but not by a huge margin.
The model's performance was evaluated using affinity datasets, including the Observed Antibody Space (OAS) database and an additional dataset containing 104,972 antibody sequences with annotated affinity values, highlighting its accuracy and efficiency.
Novel language model incorporating structural information, with demonstrated experimental ability to improve design of therapeutic antibodies.
The new language model, ProseLM, builds upon the ProGen family of models from the same authors.
Structural information is incorporated through structural adapter layers placed after language model layers, encoding the backbone and associated functional annotations.
Models with more parameters achieve much better perplexity. There is also some improvement by adding tangential context information such as ligands etc.
They trained an antibody-specific version of ProseLM on SAbDab data only, and it does much better on sequence recovery than even the larger models.
They use the model to propose mutations for Nivolumab and Secukinumab, with mutations in both CDRs and frameworks. They used structures from the PDB as the basis for the designs.
They found better binders; however, when CDRs were re-designed, the overall success rate of maintaining binding was lower (25% for Nivolumab) than when frameworks were re-designed (92%).
Novel generative model for antibody sequences that supports Vh/Vl pairing and generation of developable sequences.
Three models were created: IgGen (unpaired model), p-IgGen (unpaired fine-tuned on pairs) and developable p-IgGen (paired fine-tuned on developable sequences).
They used ca. 250m unpaired sequences and 1.8m paired sequences for training.
The model is based on GPT-2 but with rotary position embedding.
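For readers unfamiliar with rotary position embeddings: instead of adding a position vector to the input, each pair of query/key dimensions is rotated by an angle proportional to the token position, so attention scores depend only on relative offsets. Below is a minimal sketch, assuming the interleaved-pair convention and the customary base of 10000; the model's actual implementation details are not given in the text.

```python
import math

def rotary_embed(vec, position, base=10000.0):
    """Rotate consecutive pairs (vec[i], vec[i+1]) by position * base**(-i/d)."""
    d = len(vec)
    if d % 2:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up query and key vectors.
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.1]

# Same relative offset (2) at very different absolute positions.
near = dot(rotary_embed(q, 5), rotary_embed(k, 3))
far = dot(rotary_embed(q, 105), rotary_embed(k, 103))
print(abs(near - far) < 1e-9)  # True: the score depends only on the offset
```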
Developable sequences were defined as those of the 1.8m paired sequences whose structural models had good TAP metrics (900,000 in total).
The model is much smaller than many of the models out there (17m params), so it is more lightweight in training and application.
The model performs better on immunogenicity prediction than other models but worse on expression prediction.