Computational Antibody Papers

  • 2024-08-15

    Active learning for affinity prediction of antibodies

    • binding prediction
    • experimental techniques
    • An active learning framework is proposed to efficiently identify antibody mutations that enhance binding affinity, minimizing wet-lab experiments.
    • From the paper: “Active learning is a framework from experimental design that focuses on making informed decisions about which experiments to perform next”. This matters increasingly for generating prediction-first data that maximizes the effectiveness of models.
    • Bayesian optimization is used with relative binding free energy (RBFE) methods to iteratively propose and evaluate new antibody sequences, improving binding affinity predictions (a toy sketch of such a loop follows these notes).
    • Various encoding schemes, including one-hot, bag of amino acids, BLOSUM, and AbLang2, are tested. The best-performing methods are identified through validation on pre-computed data.
    • The study uses two RBFE methods: NQFEP for accurate but costly simulations, and Schrödinger Res Scan for faster but less precise results. The active learning loop consistently finds better binding sequences using these methods.
    • AbLang2 encoding with the Tanimoto kernel consistently outperformed other methods in the validation phase, indicating its effectiveness in predicting improved binding affinities.
    • Despite its higher computational cost, NQFEP may be preferable when high accuracy is crucial.
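A minimal sketch of the kind of active-learning loop described above, assuming non-negative sequence embeddings (e.g., one-hot or language-model features) and a Tanimoto (min/max) kernel. The GP update, UCB acquisition, and all names are illustrative, not the paper's implementation; `oracle` stands in for an expensive RBFE evaluation.

```python
import numpy as np

def tanimoto_kernel(X, Y):
    """Tanimoto (min/max) kernel for non-negative feature vectors."""
    K = np.zeros((len(X), len(Y)))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            K[i, j] = np.minimum(x, y).sum() / np.maximum(x, y).sum()
    return K

def gp_posterior(K_train, K_cross, K_test_diag, y, noise=1e-3):
    """Standard GP regression posterior mean and variance."""
    L = np.linalg.cholesky(K_train + noise * np.eye(len(y)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_cross.T @ alpha
    v = np.linalg.solve(L, K_cross)
    var = K_test_diag - np.sum(v**2, axis=0)
    return mean, np.maximum(var, 1e-9)

def active_learning_loop(embeddings, oracle, seed_idx, rounds=5, batch=8, beta=2.0):
    """Iteratively propose the sequences with the highest UCB on predicted affinity."""
    chosen = list(seed_idx)
    y = [oracle(i) for i in chosen]                  # expensive RBFE calls
    for _ in range(rounds):
        rest = [i for i in range(len(embeddings)) if i not in chosen]
        K_tr = tanimoto_kernel(embeddings[chosen], embeddings[chosen])
        K_x = tanimoto_kernel(embeddings[chosen], embeddings[rest])
        diag = np.ones(len(rest))                    # k(x, x) = 1 for Tanimoto
        mu, var = gp_posterior(K_tr, K_x, diag, np.array(y))
        ucb = mu + beta * np.sqrt(var)
        picks = [rest[i] for i in np.argsort(-ucb)[:batch]]
        chosen += picks
        y += [oracle(i) for i in picks]              # evaluate proposed mutants
    return chosen, y
```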
    • Novel docking score based on a graph model and language-model embeddings, with an antibody-specific variant.
    • It is based on Equiformer models, which give the structure an equivariant representation. Compared with previous work, the authors moved from atom-level to residue-level representations, take complexes rather than single chains as input, and add language-model embeddings.
    • They employ DistilBERT for the language-model embeddings, training it on interaction data from BioGRID.
    • The antibody-specific model is trained on antibody data downloaded from AbDb to distinguish native from non-native poses after local docking with RosettaFold.
    • The antibody-specific model is in fact heavy-chain-antigen only, as the authors saw no benefit from a light-chain-only model, and a three-way heavy-light-antigen model was too computationally expensive.
    • Although the antibody-specific model did not outperform AF2-Multimer, it has predictive power in distinguishing native from non-native poses.
    • The authors note this predictive power can be harnessed to rescore AF2-Multimer outputs (a toy re-ranking sketch follows these notes).
    • The code is available at https://gitlab.com/mcfeemat/eudockscore
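The rescoring idea mentioned above could look roughly like the following; this is a hypothetical illustration, with `learned_score` standing in for the trained scoring model and the equal weighting being an arbitrary assumption, not the paper's method.

```python
def rescore_poses(poses, af2_confidences, learned_score, w_af2=0.5, w_model=0.5):
    """Re-rank AF2-Multimer poses by a weighted blend of AF2 confidence
    and a learned antibody-antigen docking score (both min-max normalized)."""
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo + 1e-9) for x in xs]
    model_scores = norm([learned_score(p) for p in poses])
    af2_scores = norm(af2_confidences)
    combined = [w_af2 * a + w_model * m for a, m in zip(af2_scores, model_scores)]
    return sorted(zip(poses, combined), key=lambda t: -t[1])
```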
    • The authors introduce a large antibody-specific language model, FAbCon (2.4B parameters), that demonstrably improves antibody specificity prediction.
    • The model is based on the Falcon LLM and is trained with the causal language modeling (CLM) objective: predict the next amino acid going from the N- to the C-terminus (toy example below).
    • The model was trained on paired (2.5M) and unpaired (821M) sequences from OAS.
    • The pre-trained model was tested on its ability to be fine-tuned for binder prediction using three datasets: anti-HER2, anti-SARS-CoV-2, and anti-IL-6.
    • When compared against multiple other models on binder prediction, the largest FAbCon model fares best, showing the benefit of overparametrization.
    • Since the model was trained with the CLM objective, it can be used for sequence generation, producing sequences that are very human-like when compared to sequences from human PBMCs.
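A toy illustration of the CLM objective on amino-acid sequences: predict each residue from its N-terminal context. FAbCon itself is Falcon-based and 2.4B parameters; everything here (sizes, architecture) is deliberately miniature and only demonstrates the objective.

```python
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"
STOI = {a: i for i, a in enumerate(AA)}

class TinyCLM(nn.Module):
    """Miniature causal LM over the 20 amino acids."""
    def __init__(self, d=64, nhead=4, nlayers=2):
        super().__init__()
        self.emb = nn.Embedding(len(AA), d)
        layer = nn.TransformerEncoderLayer(d, nhead, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(d, len(AA))

    def forward(self, tokens):
        T = tokens.size(1)
        # Causal mask: a position may not attend to anything C-terminal of it.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        return self.head(self.enc(self.emb(tokens), mask=mask))

def clm_loss(model, seq):
    """Cross-entropy of predicting residue t+1 from residues <= t (N-to-C)."""
    tokens = torch.tensor([[STOI[a] for a in seq]])
    logits = model(tokens)
    return nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, len(AA)), tokens[:, 1:].reshape(-1)
    )
```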
    • IgDiff - antibody-specific diffusion method to generate antibody-like coordinates.
    • The method is the result of fine-tuning FrameDiff on ~150,000 antibody models from OAS, built with ABodyBuilder2 (ABB2).
    • The method supports several design scenarios, such as generating the whole VH/VL, just the CDRs, CDR-H3 only, or light-chain redesign (see the mask sketch after these notes).
    • Several antibody coordinates that were generated using IgDiff and whose sequence was predicted using AbMPNN were successfully synthesized in the lab.
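As referenced in the scenarios bullet, the design modes reduce to choosing which residues stay fixed and which are regenerated. The sketch below is illustrative only; region names, index ranges, and `igdiff_sample` are hypothetical placeholders, not IgDiff's API.

```python
def design_mask(n_res, regions, scenario):
    """regions: dict of name -> (start, end) ranges, e.g. {"CDRH3": (95, 110)}.
    Returns a per-residue mask: True = regenerate, False = keep fixed."""
    targets = {
        "full_fv": list(regions),                                 # regenerate everything
        "all_cdrs": [r for r in regions if r.startswith("CDR")],
        "cdrh3_only": ["CDRH3"],
        "light_chain": [r for r in regions if r.endswith("_L")],  # naming assumption
    }[scenario]
    mask = [False] * n_res
    for name in targets:
        start, end = regions[name]
        for i in range(start, end):
            mask[i] = True
    return mask

# Hypothetical usage: coords = igdiff_sample(coords, design_mask(230, regions, "cdrh3_only"))
```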
    • Novel tool to automatically annotate immunoglobulin (and T-cell receptor) genes from assemblies.
    • They compared manual human annotations from IMGT with the tool's output and found close agreement for functional and open-reading-frame genes.
    • By automating annotation, strict rules can be enforced consistently, avoiding manual curation errors (a toy rule check follows these notes).
    • Software is available at: https://github.com/williamdlees/Digger
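A toy example of the kind of deterministic rule that automated annotation can enforce consistently; this is not Digger's actual rule set, only an illustration of rule-based gene classification.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def classify_gene(coding_seq):
    """Toy classification: a premature stop codon or broken reading frame
    marks a pseudogene; otherwise further checks (e.g. RSS) would follow."""
    if len(coding_seq) % 3 != 0:
        return "pseudogene"                      # reading frame is broken
    codons = [coding_seq[i:i + 3] for i in range(0, len(coding_seq), 3)]
    if any(c in STOP_CODONS for c in codons):
        return "pseudogene"                      # premature stop codon
    return "functional_or_orf"
```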
  • 2024-07-30

    Fast and accurate modeling and design of antibody-antigen complex using tFold

    • binding prediction
    • structure prediction
    • docking
    • Update on tFold-Ab, now including modeling of the complex with the antigen, with applications to virtual screening.
    • They mostly use SAbDab/CoV-AbDab as reference datasets.
    • Modeling proceeds by generating antibody and antigen features, supplemented by a large language model, followed by flexible docking.
    • For antigen feature generation, they use AF2.
    • They train on several tasks simultaneously (antibody structure prediction, complex prediction, etc.), making it a multi-task setup.
    • On global docking, their method achieves a DockQ of 0.217 vs AlphaFold-Multimer's 0.158.
    • When local docking information is given, constraining paratope/epitope sites, their algorithm achieves a DockQ of 0.416.
    • They demonstrate that filtering antibodies by predicted modeling confidence gives moderate enrichment against PD-1 and SARS-CoV-2 antigens, showing promise for virtual screening (an enrichment sketch follows these notes).
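A minimal sketch of the confidence-based screening evaluation mentioned above: rank candidates by the predicted complex-confidence score and measure binder enrichment in the top fraction. The data layout and the top-fraction choice are assumptions.

```python
def enrichment_at(candidates, top_frac=0.1):
    """candidates: list of (predicted_confidence, is_binder) pairs.
    Returns enrichment of binders in the top fraction vs the base rate."""
    ranked = sorted(candidates, key=lambda c: -c[0])
    k = max(1, int(len(ranked) * top_frac))
    top_rate = sum(is_binder for _, is_binder in ranked[:k]) / k
    base_rate = sum(is_binder for _, is_binder in ranked) / len(ranked)
    return top_rate / base_rate        # > 1 means confidence filtering enriches binders
```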
    • Inverse folding method specific to antibodies.
    • They employed ESM-IF as the base model for fine-tuning.
    • Fine-tuning made one pass through ABodyBuilder2 models of paired OAS sequences (~150k) and then through ~2,000 crystal structures.
    • They tested whether shotgun masking (random individual residues) is better than span masking. Though shotgun performed better in general, span masking is better when entire CDRs are obscured, the realistic case for design (see the masking sketch after these notes).
    • AntiFold improves upon the authors' earlier antibody-specific inverse folding method AbMPNN (a fine-tuned ProteinMPNN): 43% vs 60% sequence recovery on CDR-H3.
    • The authors took a handful of native structures, sampled sequences using different methods, and modeled them with ABodyBuilder2 to check whether the sampled sequences maintain the same fold. AntiFold achieves a better RMSD to the original backbone (0.67) than AbMPNN (0.74) and ESM-IF (0.75).
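A minimal illustration of the two masking regimes compared above; the sequence length and CDR positions are hypothetical.

```python
import random

def shotgun_mask(seq_len, rate=0.5):
    """Mask individual positions independently ('shotgun' masking)."""
    return {i for i in range(seq_len) if random.random() < rate}

def span_mask(span_ranges):
    """Mask whole contiguous regions, e.g. an entire CDR loop."""
    masked = set()
    for start, end in span_ranges:
        masked.update(range(start, end))
    return masked

# Hide a (hypothetical) CDR-H3 at positions 95-110 in its entirety:
cdrh3_masked = span_mask([(95, 110)])
```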
    • Inverse folding method for redesign of binding antibodies from crystal structure.
    • IgDesign draws from LM-Design, which introduced structural conditioning of language models. They fine-tune ProteinMPNN on SAbDab to get an antibody-specific IgMPNN; the structural embeddings from IgMPNN are then used in ESM2-3B (a toy conditioning sketch follows these notes). The method receives antigen and antibody coordinates but not the CDRs, as these are designed.
    • They experimentally validated the protocol on 8 antigens with co-complexes in the PDB, some of which are shown in the poster presenting the method: https://www.absci.com/antibody-inverse-folding/
    • They design either CDR-H3 alone or all heavy-chain CDRs. The rate at which designs still bind after CDR-H3 redesign is in the ballpark of 20-30%, with two outliers at 70% and 90%. The success rate of redesigning all heavy-chain CDRs is lower, but in some cases comparable to that of CDR-H3 redesign alone.
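A toy sketch of LM-Design-style structural conditioning in the spirit described above: per-residue embeddings from an inverse-folding encoder (IgMPNN in the paper) are projected and added to the token embeddings of a protein language model. Dimensions and wiring are illustrative, not IgDesign's actual architecture.

```python
import torch
import torch.nn as nn

class StructureConditionedLM(nn.Module):
    """Miniature stand-in for structural conditioning of a protein LM."""
    def __init__(self, lm_dim=320, struct_dim=128, vocab=20):
        super().__init__()
        self.tok = nn.Embedding(vocab, lm_dim)
        self.struct_proj = nn.Linear(struct_dim, lm_dim)   # inject structure features
        layer = nn.TransformerEncoderLayer(lm_dim, 4, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, 2)
        self.head = nn.Linear(lm_dim, vocab)

    def forward(self, tokens, struct_emb):
        # tokens: (B, T) amino-acid ids; struct_emb: (B, T, struct_dim)
        h = self.tok(tokens) + self.struct_proj(struct_emb)
        return self.head(self.lm(h))   # per-position logits over amino acids
```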
    • The authors propose a new training regime for antibody language models, noting that previous approaches borrowed heavily from natural language.
    • They use a masking rate of 50-70%, as opposed to the typical 15% used in natural-language pre-training.
    • Rather than masking only individual residues, they also mask entire spans of sequence, with a focus on CDR-H3.
    • Infilling using PARA is much more accurate in the CDR-H3 region: 48.7% accuracy versus 36.4% for the closest SOTA, AntiBERTy (a sketch of the accuracy computation follows these notes).
    • They applied their model to downstream tasks such as heavy/light chain calling and prediction of trastuzumab binding; however, the gains were modest relative to other models (e.g., HER2 binding: a simple CNN reached 82.8% accuracy vs 83.7% with their method).
    • The model is available at: https://github.com/xtalpi-xic/PARA/tree/main
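For reference, infilling accuracy on a masked span, such as the CDR-H3 numbers quoted above, can be computed as simple per-residue recovery; the paper's exact evaluation protocol may differ, so treat this as an assumption.

```python
def infill_accuracy(pred_seqs, true_seqs, cdr_ranges):
    """Fraction of correctly recovered residues inside the masked spans."""
    correct = total = 0
    for pred, true, (start, end) in zip(pred_seqs, true_seqs, cdr_ranges):
        for p, t in zip(pred[start:end], true[start:end]):
            correct += p == t
            total += 1
    return correct / total
```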
  • 2024-07-23

    Baselining the Buzz: Trastuzumab-HER2 Affinity, and Beyond

    • databases
    • protein design
    • generative methods
    • binding prediction
    • Novel dataset of 0.5M anti-HER2 trastuzumab variants, together with benchmarking of affinity classification methods.
    • They generated a dataset of ~500,000 anti-HER2 trastuzumab variants by modifying the CDR-H3. Binding affinity is divided into high/medium/low with a reasonably even split (178,160; 196,392; and 171,732, respectively).
    • They split their dataset into positives and negatives by putting medium and low binders into the negative set.
    • They contrast their dataset with that of Mason et al. (~39k variants vs their 0.5M) to show that in a relatively small number of cases, binders in one set can be labeled as negatives in the other.
    • They test whether the predictor developed by Mason et al. for binder/non-binder classification works as intended on the novel 0.5M dataset, and likewise whether a model trained on Mason data and tested on theirs (and vice versa) has predictive power. It does, but to a much lesser extent than when training on data from the same experiment.
    • As benchmark methods they used FLAML (https://arxiv.org/abs/1911.04706), a CNN, and an EGNN.
    • CNN and FLAML are the top performers, with the CNN performing well even on small data (signal starting from ~170 sequences).
    • Performance drops radically when train/validation splits are done by clonotype (see the split sketch below).
    • They tested AbLang, ProteinMPNN, ESM, and BLOSUM on their ability to generate binding trastuzumab variants, with randomly generated sequences as a control. The percentages of sequences with CNN-HER2-max binding probabilities greater than 90% were: 13% for random, 26% for BLOSUM, 27% for AbLang (masking all ten residues simultaneously), 29% for AbLang (masking one residue at a time), 19% for ProteinMPNN, and 30% for ESM (masking one residue at a time).
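A sketch of the clonotype-aware splitting behind that last observation, using scikit-learn's GroupShuffleSplit so that sequences sharing a clonotype never straddle the train/validation boundary; variable names are illustrative.

```python
from sklearn.model_selection import GroupShuffleSplit

def clonotype_split(sequences, labels, clonotypes, test_size=0.2, seed=0):
    """Split so that all members of a clonotype land on the same side."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, val_idx = next(splitter.split(sequences, labels, groups=clonotypes))
    return train_idx, val_idx
```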