Computational Antibody Papers

    • Novel epitope-paratope prediction method, ImaPEp.
    • The method projects the epitope and paratope onto 2D images and then uses a ResNet to classify interacting vs non-interacting pairs.
    • The negative set was constructed by pairing non-cognate antibody-antigen pairs, applying rotations, etc.
    • The method was not benchmarked against epitope prediction tools, which arguably do not take pairs into account, but against docking tools, scoring 13 out of the 18 methods tested.
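A minimal sketch of the negative-set construction described above - pairing each antibody with a non-cognate antigen. The function name and the derangement-style reshuffle are my own illustration, not the authors' code:

```python
import random

def make_negative_pairs(cognate_pairs, seed=0):
    """Pair each antibody with a non-cognate antigen, mirroring the
    negative-set construction described for ImaPEp (illustrative only)."""
    rng = random.Random(seed)
    antibodies = [ab for ab, _ in cognate_pairs]
    antigens = [ag for _, ag in cognate_pairs]
    cognate = set(cognate_pairs)
    shuffled = antigens[:]
    # Reshuffle until no antibody is matched with its own cognate antigen.
    while any((ab, ag) in cognate for ab, ag in zip(antibodies, shuffled)):
        rng.shuffle(shuffled)
    return list(zip(antibodies, shuffled))

pairs = [("ab1", "ag1"), ("ab2", "ag2"), ("ab3", "ag3")]
negatives = make_negative_pairs(pairs)
```

In practice the paper also augments with rotations of the 2D projections; the pairing step alone is shown here.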
    • AlphaBind: a deep learning model designed to optimize antibody sequences by leveraging large-scale pre-trained affinity datasets and fine-tuning on experimental data.
    • AlphaBind was pre-trained on a dataset of 7.5 million antibody-antigen affinity measurements, which includes data from yeast display systems and diverse antibody libraries obtained from multiple experimental sources, focusing on quantitative affinity measurements.
    • The model utilizes transformer-based architecture with protein sequence embeddings generated using ESM-2nv (Evolutionary Scale Model) embeddings. The model consists of 4 attention heads, 7 layers, and about 15 million parameters.
    • The model fine-tunes on specific antibody-antigen systems using AlphaSeq data, then performs stochastic greedy optimization by generating sequence mutations (using ESM-2nv logits) to explore sequence space and predict binding affinity. This process generates thousands of candidate sequences, which are filtered based on affinity predictions and developability metrics before in vitro validation.
    • The novel sequences for three systems were verified experimentally.
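The optimization loop described above can be sketched as a stochastic greedy search: propose point mutations, score them, keep improvements. This is a toy version under my own assumptions - the real pipeline proposes mutations from ESM-2nv logits and scores with the fine-tuned affinity model, stood in for here by a dummy `score_fn`:

```python
import random

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def stochastic_greedy_optimize(seq, score_fn, n_steps=50, seed=0):
    """Toy version of AlphaBind-style optimization: propose random point
    mutations and greedily keep those that improve the predicted score."""
    rng = random.Random(seed)
    best, best_score = seq, score_fn(seq)
    for _ in range(n_steps):
        pos = rng.randrange(len(best))
        cand = best[:pos] + rng.choice(AA) + best[pos + 1:]
        s = score_fn(cand)
        if s > best_score:  # greedy acceptance of improvements only
            best, best_score = cand, s
    return best, best_score

# Dummy scorer rewarding alanine content - purely for illustration.
seq, score = stochastic_greedy_optimize("QVQLVQSG", lambda s: s.count("A"))
```

The real pipeline additionally filters the resulting candidates on developability metrics before in vitro validation.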
  • 2024-11-22

    Structure Language Models for Protein Conformation Generation

    • generative methods
    • language models
    • non-antibody stuff
    • Novel model for sampling the structural space of proteins - applied to nanobodies.
    • Protein structures are encoded into latent tokens using a discrete variational auto-encoder (dVAE), which captures residue-level local geometric features in a roto-translation invariant manner.
    • The model combines a dVAE encoder-decoder with a Structure Language Model (SLM) to model sequence-to-structure relationships.
    • One can sample alternate conformations by providing an amino acid sequence as input, using the SLM to generate latent tokens representing potential conformations, and decoding these latent tokens back into 3D structures with the dVAE decoder to produce diverse structural ensembles.
    • Altogether, though they develop ESMdiff in the paper, the core of the message is the structural language within the broader framework of language models, rather than any single method.
    • The model was evaluated on tasks like generating equilibrium dynamics, fold-switching, and intrinsically disordered proteins, using metrics like JS-divergence, TM-score, and RMSD. It outperformed existing methods in speed and accuracy, generating structures 20-100× faster.
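Of the metrics mentioned above, the Jensen-Shannon divergence is easy to make concrete - it compares a generated ensemble's distribution over some observable against the reference ensemble's. A minimal implementation (base-2 logs, so the value is in bits and bounded by 1):

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions,
    one of the ensemble-comparison metrics used in the evaluation."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

same = js_divergence([0.5, 0.5], [0.5, 0.5])  # identical distributions -> 0.0
far = js_divergence([1.0, 0.0], [0.0, 1.0])   # disjoint distributions -> 1.0
```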
    • Novel method for antibody library design.
    • Given a structural complex, they employ AntiFold and ProtBERT to explore the ‘fitness’ space and generate a set of mutants.
    • To achieve multi-parameter optimization they use linear programming, rather than the neural nets that are currently in vogue.
    • The method was not validated experimentally, but rather with computational proxies.
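The multi-parameter selection problem they solve with linear programming can be illustrated at toy scale by brute force: pick a bounded number of mutants maximizing total predicted fitness subject to a liability cap. The data, cap, and objective here are my own illustration, not the paper's formulation:

```python
from itertools import combinations

def select_library(mutants, budget):
    """Brute-force stand-in for the LP selection: choose at most `budget`
    mutants maximizing total fitness while keeping mean liability <= 0.5.
    `mutants` is a list of (name, fitness, liability) tuples."""
    best, best_fit = (), float("-inf")
    for k in range(1, budget + 1):
        for combo in combinations(mutants, k):
            fit = sum(m[1] for m in combo)
            liability = sum(m[2] for m in combo) / k
            if liability <= 0.5 and fit > best_fit:
                best, best_fit = combo, fit
    return [m[0] for m in best]

mutants = [("m1", 0.9, 0.6), ("m2", 0.8, 0.2), ("m3", 0.7, 0.4), ("m4", 0.3, 0.1)]
picked = select_library(mutants, budget=2)
```

An LP/ILP solver handles the same objective at realistic library sizes, where enumeration is infeasible.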
    • Pilot single-shot computational antibody design, where known binders were taken and new ones computationally generated on their basis, maintaining binding with a good developability profile.
    • The pipeline starts with known binders to the SARS-CoV-2 RBD.
    • Novel binders were generated using a combination of computational approaches:
        • Observed Antibody Space (OAS): paired and unpaired sequences from the OAS dataset were used to identify antibody candidates within a certain edit distance from the starting antibodies.
        • Inverse folding (AbMPNN): generated new antibody sequences maintaining structural features compatible with binding to the SARS-CoV-2 RBD.
        • ESM: guided the mutation of sequences to retain or improve binding affinity while enhancing developability.
    • The developability properties were assessed using Rosetta scoring to evaluate antibody stability and interface energetics, alongside TAP.
    • Experimental methods for screening included size-exclusion chromatography (SEC) to assess aggregation propensity and differential scanning fluorimetry (DSF) for thermal stability. Antibodies that passed these criteria were deemed suitable for development​.
    • Success rate of the method: The pipeline demonstrated a success rate of 54% for generating binding antibodies that retained affinity against escape mutations on the SARS-CoV-2 RBD.
    • Authors tested RFDiffusion for the design task, but with a poor success rate - though it appears this was not the antibody-fine-tuned version, which should work better.
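The OAS edit-distance filter in the pipeline above amounts to computing Levenshtein distances against the starting antibody. A plain implementation (my own sketch, not the paper's code):

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence,
    keeping only the previous row for O(len(b)) memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # (mis)match
        prev = cur
    return prev[-1]

def within_edit_distance(candidates, reference, max_dist):
    """Keep candidate sequences within `max_dist` edits of the reference."""
    return [c for c in candidates if edit_distance(c, reference) <= max_dist]

hits = within_edit_distance(["QVQLVQ", "QVKLVQ", "AAAAAA"], "QVQLVQ", max_dist=1)
```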
    • Novel method to predict stability of proteins, based on ProteinMPNN.
    • The method employs ProteinMPNN embeddings with a stability prediction module to gauge the effect of single point mutations on protein stability.
    • The stability prediction module is composed of a light attention module (which learns which parts of the embeddings to upweight) followed by a shallow multi-layer perceptron.
    • For training/evaluation they employed the Megascale and FireProt datasets with measured protein stability data - though only after much pre-processing, because the original datasets either contained many unreliable data points or carried a risk that the mutations would change the structure too much.
    • Ablations show that all the elements of the network are important and contribute to the prediction, with ProteinMPNN having quite some predictive power out of the box.
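A minimal numpy sketch of the head described above - per-position attention weights reweight the embeddings, and the pooled vector passes through a shallow MLP. Shapes and weights are illustrative; this is not the authors' implementation:

```python
import numpy as np

def light_attention_head(embeddings, w_attn, w_mlp, b_out):
    """Sketch of a light-attention stability head: softmax attention over
    residue positions pools the embeddings, then a one-hidden-layer MLP
    maps the pooled vector to a scalar stability prediction."""
    scores = embeddings @ w_attn              # (L,) one score per residue
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over positions
    pooled = weights @ embeddings             # (d,) attention-weighted mean
    hidden = np.maximum(0.0, pooled @ w_mlp)  # ReLU hidden layer
    return float(hidden @ b_out)              # scalar output

rng = np.random.default_rng(0)
L, d, h = 10, 8, 4                            # residues, embed dim, hidden dim
emb = rng.normal(size=(L, d))                 # stand-in for ProteinMPNN embeddings
pred = light_attention_head(emb, rng.normal(size=d),
                            rng.normal(size=(d, h)), rng.normal(size=h))
```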
    • Authors perform humanization of VHHs and generate experimental data confirming their designs.
    • The protocol involves grafting CDRs 1-3 and then systematically modifying Hallmark/Vernier residues and others to make them more human.
    • Positions 49 and 50 (e.g., E49G, H50L in VHH1): These were generally well-tolerated, allowing for humanization without major impact on binding affinity or stability.
    • Position 52 (e.g., S52W in VHH2): In some cases, changing this residue even improved affinity.
    • Position 42: Humanizing residue F42 to a more human-like amino acid (e.g., F42V) in VHH2 led to a significant reduction in binding affinity. This residue plays a key role in stabilizing the CDR3 loop through interactions with other regions, making it essential for maintaining the bioactive conformation.
    • Position 52 (in some contexts): In VHH1, the mutation G52W led to a loss of binding due to steric clashes, demonstrating that this position can be critical depending on the structural context.
    • They measured binding affinities, expression yields, and purities of the humanized variants. Crystal structures confirmed the effects of humanization on binding; non-canonical disulfides stabilize CDR3.
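Mechanically, the grafting step above is just copying loop regions into a human framework; the position-specific mutations then follow. A toy string-level illustration (index ranges are made up, not Kabat numbering):

```python
def graft_cdrs(human_framework, vhh, cdr_ranges):
    """Toy CDR graft: copy the VHH's CDR loops into a human framework
    at the given (start, end) index ranges. Illustrative positions only."""
    out = list(human_framework)
    for start, end in cdr_ranges:
        out[start:end] = vhh[start:end]
    return "".join(out)

# Framework residues 'H', VHH residues 'V'; graft two pretend CDR ranges.
grafted = graft_cdrs("HHHHHHHHHH", "VVVVVVVVVV", [(2, 4), (6, 8)])
```

Real grafting works on a numbered alignment (e.g. Kabat/IMGT) rather than raw string indices.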
    • Novel humanization method, employing diffusion.
    • The model first learns diffusion on human sequences (with CDRs kept intact): the framework residues are noised and diffused back. The network is then fine-tuned on mouse sequences.
    • There are two flavors of the model - one nanobody, another antibody.
    • They curate a great dataset from patents with over 300 paired humanized/native sequences.
    • They demonstrate in silico and in vitro that their designs make sense.
    • An evolution of the ESM model family that scales up in terms of parameters, data, and computational power compared to ESM2, which allows it to improve on sequence, structure, and function representations of proteins.
    • ESM3 is a multimodal, bidirectional transformer that models sequence, structure, and function using discrete token representations for each modality. It merges these representations into a single latent space and is trained with a masked language model objective, allowing it to generate and predict across different modalities.
    • ESM3's largest model has 98 billion parameters.
    • The model was trained with 1.07 × 10²⁴ floating point operations (FLOPs) over a dataset of 771 billion tokens from 2.78 billion proteins.
    • Structural tokens in ESM3 are encoded by a discrete autoencoder that compresses three-dimensional protein structures into a sequence of discrete tokens. This is done by encoding local atomic environments around each amino acid and representing them in a simplified form that captures geometric properties.
    • There are a total of 4096 structural tokens.
    • The structural autoencoder tokenizes protein structures by encoding local neighborhoods around each amino acid into discrete tokens. It uses a geometric attention mechanism that operates in local reference frames, based on bond geometry. This mechanism encodes and reconstructs the atomic structure, supervised by a geometric loss that preserves distances and orientations of bonds and atoms.
    • ESM3 can be used to generate novel sequences/proteins by prompting the model with sequence or structural tokens. It uses iterative sampling, starting from a masked context, where tokens are predicted and unmasked progressively until a full sequence or structure is generated. This allows the model to create novel proteins that respect the given prompts or constraints.
    • The model was verified experimentally by generating novel proteins, including a green fluorescent protein that was synthesized and tested for fluorescence in laboratory conditions. The novel protein had a sequence identity of 58% to the nearest known fluorescent protein.
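The structural tokenization described above boils down to vector quantization: each residue's local geometric feature vector is mapped to its nearest entry in a learned codebook of 4096 tokens. A sketch with a random codebook standing in for the trained one:

```python
import numpy as np

def quantize_structure(features, codebook):
    """Map each residue's feature vector to the index of its nearest
    codebook entry - the vector-quantization step behind ESM3's
    discrete structural tokens (codebook here is random, not learned)."""
    # Squared distances between every residue feature and every code vector.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)  # one token id per residue

rng = np.random.default_rng(0)
codebook = rng.normal(size=(4096, 16))  # 4096 structural tokens, 16-dim features
feats = rng.normal(size=(25, 16))       # features for a 25-residue chain
tokens = quantize_structure(feats, codebook)
```

In the real model the features come from the geometric attention encoder, and the dVAE decoder maps token sequences back to 3D coordinates.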
    • Foundational model following in the footsteps of AlphaFold3 attempting prediction of molecular interactions.
    • Model architecture is closely modeled on that of AF3 - however it was not benchmarked against it (nor against ESM3) because of use restrictions.
    • It takes 30 days on 128 A100s to train the model. Back-of-the-envelope Google Colab Pro+ pricing for 128 A100s (which is NOT a distributed setup) puts it at ca. 120k USD :)
    • Addressing antibodies, they introduce constraints (e.g. known epitope residue) to help the model out. Adding even a single residue makes a big difference with respect to baseline which is tantamount to global docking.
    • A single residue constraint is enough to improve ab-ag complex prediction. The success rate for ‘local’ mode, with just one constrained residue, is about 50% for DockQ > 0.21, 30% for DockQ > 0.49, and less than 10% for high-quality hits.
    • The success rate for ‘global’ mode, i.e. without constraints, is about 35% for DockQ > 0.21, 20% for DockQ > 0.49, and less than 5% for high-quality hits.
    • So altogether if I want to ‘hit the epitope’, the model has ca. 30% success rate.
    • If I want a high quality ab-ag complex structure, unfortunately it seems that constraints do not help much currently.
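The success rates quoted above are fractions of predictions clearing DockQ cutoffs. A small helper that computes them, using the standard DockQ bin boundaries (0.23 acceptable, 0.49 medium, 0.80 high - slightly different from the 0.21 used in the bullets above):

```python
def success_rates(dockq_scores):
    """Fraction of predicted complexes above the standard DockQ cutoffs
    for acceptable-, medium-, and high-quality models."""
    n = len(dockq_scores)
    return {
        "acceptable": sum(s >= 0.23 for s in dockq_scores) / n,
        "medium": sum(s >= 0.49 for s in dockq_scores) / n,
        "high": sum(s >= 0.80 for s in dockq_scores) / n,
    }

rates = success_rates([0.1, 0.25, 0.5, 0.85])
```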