Using language models & structural predictions to predict antibody-antigen interactions.
AntiBinder integrates sequence and structural information using IgFold for antibodies and ESM-2 for antigens, employing specialized encoders to extract meaningful features before passing them through multiple Bidirectional Attention Blocks (BidAttBlock) and a classifier.
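A minimal sketch of how such a bidirectional cross-attention fusion could look (PyTorch; the dimensions, pooling and classifier head are assumptions for illustration, not the authors' exact implementation):

```python
import torch
import torch.nn as nn

class BidirectionalAttentionBlock(nn.Module):
    """Cross-attend antibody and antigen token embeddings in both directions (sketch)."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.ab_to_ag = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ag_to_ab = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_ab = nn.LayerNorm(dim)
        self.norm_ag = nn.LayerNorm(dim)

    def forward(self, ab, ag):
        # antibody tokens query antigen tokens, and vice versa
        ab_upd, _ = self.ab_to_ag(query=ab, key=ag, value=ag)
        ag_upd, _ = self.ag_to_ab(query=ag, key=ab, value=ab)
        return self.norm_ab(ab + ab_upd), self.norm_ag(ag + ag_upd)

class BindingClassifier(nn.Module):
    """Stack a few blocks, pool, and predict a binding probability (sketch)."""
    def __init__(self, dim: int = 512, n_blocks: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList([BidirectionalAttentionBlock(dim) for _ in range(n_blocks)])
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, ab, ag):
        # ab: antibody features (e.g. from an IgFold-informed encoder), ag: antigen features (e.g. from ESM-2)
        for blk in self.blocks:
            ab, ag = blk(ab, ag)
        pooled = torch.cat([ab.mean(dim=1), ag.mean(dim=1)], dim=-1)
        return torch.sigmoid(self.head(pooled)).squeeze(-1)
```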
The model was trained and evaluated on four datasets: COVID-19 (CoV-AbDab), HIV (LANL database), BioMap, and MET. These datasets contain antigen–antibody interaction pairs across multiple species and applications, covering viruses such as SARS-CoV-2 and HIV and a large number of antigenic variants in total.
AntiBinder was benchmarked against 11 state-of-the-art models, including AttABseq, DG-affinity, DeepAAI, and general protein–protein interaction (PPI) models, and came out ahead.
The authors test generalizability, but chiefly within antigenic species, e.g. across different SARS-CoV-2 variants or HIV mutants.
Inverse folding and thus antibody design via database search.
The authors build a vector retrieval database on SAbDab; in this way, for a single query one can figure out where it falls structurally and pull the nearest entries.
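A minimal sketch of the general structure-vector retrieval idea (using FAISS as a stand-in index; the placeholder embeddings, sequences and function names are assumptions, not the paper's actual pipeline):

```python
import numpy as np
import faiss  # nearest-neighbour index, used here as a stand-in for whatever index the authors use

# assume each SAbDab entry has already been embedded into a fixed-size structure vector
dim = 256
db_vectors = np.random.rand(10_000, dim).astype("float32")   # placeholder embeddings
db_sequences = [f"EVQLV...{i}" for i in range(10_000)]        # placeholder paired sequences

index = faiss.IndexFlatIP(dim)      # inner-product (cosine-like) search
faiss.normalize_L2(db_vectors)
index.add(db_vectors)

def retrieve_sequences(query_vec: np.ndarray, k: int = 5) -> list[str]:
    """Return sequences of the k structurally closest database entries."""
    q = query_vec.astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)
    _, idx = index.search(q, k)
    return [db_sequences[i] for i in idx[0]]

# query with the embedding of a new antibody
hits = retrieve_sequences(np.random.rand(dim))
```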
They benchmark against state-of-the-art inverse folding tools such as AbMPNN, AntiFold, ProteinMPNN and ESM-IF; their tool comes out on top in terms of sequence retrieval.
The database search is orders of magnitude faster than the state of the art inverse folding tools.
They compare IgSeek against FoldSeek: their tool achieves higher sequence-retrieval accuracy for most CDRs, except CDR-H3. FoldSeek therefore seems like a very good choice alongside IgSeek in such a database-driven inverse folding protocol.
Novel generative antibody method, CloneLM/CloneBO, following clonally plausible evolutionary paths.
They train CloneLM, an autoregressive language model, on antibody clonal family data from OAS, with two separate models for heavy and light chain sequences. Clonal families are called with FastBCR.
CloneLM generates new clonal families by conditioning on a given antibody sequence, and a martingale posterior approach is used to ensure the sampled sequences follow plausible evolutionary paths. So the antigen is taken into account, but only implicitly, by virtue of the clonal family.
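A minimal sketch of the conditional-generation idea, i.e. seeding an autoregressive model with an existing heavy chain and sampling clonal relatives (generic HuggingFace causal-LM interface; the checkpoint name and tokenisation are assumptions, and the martingale-posterior machinery is omitted):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# hypothetical checkpoint name; CloneLM's real weights/tokeniser may differ
tok = AutoTokenizer.from_pretrained("my-org/clone-lm-heavy")
model = AutoModelForCausalLM.from_pretrained("my-org/clone-lm-heavy")

seed_heavy = "EVQLVESGGGLVQPGGSLRLSCAAS..."  # antibody we want to evolve around

def sample_clonal_relatives(seed: str, n: int = 8) -> list[str]:
    """Condition on a seed sequence and sample plausible clonal-family members."""
    inputs = tok(seed, return_tensors="pt")
    out = model.generate(
        **inputs,
        do_sample=True,          # stochastic sampling rather than greedy decoding
        top_p=0.95,
        num_return_sequences=n,
        max_new_tokens=50,
    )
    return [tok.decode(seq, skip_special_tokens=True) for seq in out]
```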
For benchmarking they train a language model oracle on a real human clonal family and use it as a simulated fitness function.
They further train oracles on affinity and stability data and show that newly generated sequences can be steered towards higher stability or affinity.
Demonstration that general-purpose language models, like GPT-3.5, can reason about antibody engineering tasks.
The authors explore in-context learning, i.e. few-shot learning where several examples are given and, on the basis of these, the model has to provide a prediction for a new case.
They tested an array of general-purpose models, such as GPT, Llama and Mistral variants.
They tested three antibody tasks: mouse/human discrimination, specificity prediction (from NGS data) and isotype identification. In theory not that difficult tasks, but remember we are dealing with a general-purpose language model.
They literally prompt the model with examples of, say, mouse and human antibodies, and then provide a new one to predict.
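A minimal sketch of what such a few-shot prompt might look like for the mouse/human task (the prompt wording and example sequences are assumptions, not the paper's exact template):

```python
# label each example sequence, then ask the model to label a new one
examples = [
    ("EVQLVESGGGLVQPGGSLRLSCAAS...", "human"),
    ("QVQLQQSGAELVRPGASVKLSCKAS...", "mouse"),
    # ... more labelled examples (up to ~16 shots)
]
query = "EVQLLESGGGLVQPGGSLRLSCAVS..."

prompt = "Classify each antibody heavy chain as human or mouse.\n\n"
for seq, label in examples:
    prompt += f"Sequence: {seq}\nSpecies: {label}\n\n"
prompt += f"Sequence: {query}\nSpecies:"

# `prompt` is then sent to a general-purpose LLM (GPT, Llama, Mistral, ...)
# and the generated continuation is read off as the prediction.
```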
They find that the predictions are not bad, especially in the few-shot scenario (16 or so examples).
In one test it even achieved accuracy on par with AntiBERTy.
Heavy–light chain pairing has long been posited to be random, or at the very least VERY promiscuous. The authors test this by training their model on different portions of the variable region and showing that there is signal when full sequences are used.
The authors curated a set of ca. 233k positive heavy/light chain pairs from OAS. Negative samples were made by random shuffling, so they could occur in nature; they just were not observed in this dataset.
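A minimal sketch of shuffling-based negative sampling (pandas; column names and sequences are placeholders, not the paper's curation code):

```python
import pandas as pd

# positives: observed heavy/light pairs, e.g. curated from OAS
positives = pd.DataFrame({
    "heavy": ["EVQLV...", "QVQLQ...", "EVQLL..."],
    "light": ["DIQMT...", "EIVLT...", "DIVMT..."],
})
positives["label"] = 1

# negatives: same chains, but light chains permuted so each heavy is matched with a
# light it was never observed with (such a pair could still occur in nature)
negatives = positives.copy()
negatives["light"] = negatives["light"].sample(frac=1.0, random_state=0).values
negatives = negatives[negatives["light"] != positives["light"]]  # drop accidental true pairs
negatives["label"] = 0

dataset = pd.concat([positives, negatives], ignore_index=True)
```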
They use AntiBERTa2 as the basis for training the classification model.
The model achieves 0.75 and 0.66 ROC AUC on two test sets - so there seems to be some signal there.
When the model is split between lambdas and kappas, it does better, though the lambda model still carries signal for kappas (remember that lambda is a rescue rearrangement for a non-functional kappa).
Naive B-cell pairs have less predictability than mature ones.
One of the first studies showing that introducing structure into protein language models improves their predictive ability.
They fed ProteinMPNN (structural) features into ESM-1b and showed that this improved sequence recovery compared with using ESM-1b's masked predictions alone.
To marry ProteinMPNN and ESM-1b they use an ‘adapter’. Adapters in machine learning are lightweight modules that modify or extend a model’s functionality without retraining all parameters; in LM-DESIGN, a structural adapter integrates structural information into protein sequence predictions by bridging the structure encoder and a pretrained language model (pLM).
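A minimal sketch of the adapter idea, i.e. a small trainable module that injects structure-encoder features into a frozen pLM's hidden states (dimensions and wiring here are assumptions; LM-DESIGN's actual adapter has its own design):

```python
import torch.nn as nn

class StructuralAdapter(nn.Module):
    """Lightweight bridge: let pLM token states attend to structure-encoder features."""
    def __init__(self, plm_dim: int = 1280, struct_dim: int = 128, heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(struct_dim, plm_dim)            # map structure features to pLM width
        self.cross_attn = nn.MultiheadAttention(plm_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(plm_dim)

    def forward(self, plm_hidden, struct_feats):
        struct = self.proj(struct_feats)
        upd, _ = self.cross_attn(query=plm_hidden, key=struct, value=struct)
        return self.norm(plm_hidden + upd)                    # residual: pLM states enriched with structure

# training: freeze the pLM and the structure encoder and update only the adapter
# (plus the output head), which is what keeps the approach lightweight
```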
LM-DESIGN was benchmarked against state-of-the-art protein inverse folding models, including ProteinMPNN, PiFold, GVP-Transformer, Structured Transformer, and GVP, while utilizing pretrained language models such as ESM-1b 650M and the ESM-2 series.
LM-DESIGN was evaluated on CATH 4.2 and CATH 4.3 datasets using sequence recovery rates and perplexity, compared against baselines.
LM-DESIGN outperformed the individual models, improving sequence recovery by 4–12 percentage points and surpassing ProteinMPNN and PiFold.
Method to employ low-N data for biologic engineering.
Assuming we have a dataset of ~100 affinity data points, we can form (100 choose 2) = 4,950 pairs for which we know which member has the larger readout (e.g. stronger affinity), giving a combinatorially larger number of data points to train on.
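A minimal sketch of turning ~100 labelled points into pairwise comparisons and training with a ranking loss (PyTorch; the toy scorer below is a placeholder, not the paper's CNN-on-language-model architecture):

```python
from itertools import combinations
import torch
import torch.nn as nn

# toy data: embeddings and measured affinities for ~100 antibodies
n, dim = 100, 64
emb = torch.randn(n, dim)
affinity = torch.randn(n)

# all (100 choose 2) = 4,950 ordered pairs where we know which member is better
pairs = [(i, j) if affinity[i] > affinity[j] else (j, i)
         for i, j in combinations(range(n), 2)]

scorer = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))  # placeholder model
loss_fn = nn.MarginRankingLoss(margin=0.1)
opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)

for better, worse in pairs:
    s_better = scorer(emb[better])
    s_worse = scorer(emb[worse])
    # target = 1 means "the first score should be higher than the second"
    loss = loss_fn(s_better, s_worse, torch.ones(1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```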
The architecture used is a CNN on top of a language model.
Benchmarked on three internal campaigns: IL-6, EGFR, and an undisclosed target.
New (old :) ) therapeutic antibody database, several times larger than what is available from other sources.
Includes over 2,900 investigational antibody candidates and more than 450 approved or late-stage molecules.
It tracks molecular format, target antigen, development status, clinical history, and company data, along with antibody isotype, conjugation status, and mechanism of action.
Analysis highlights a rise in bispecifics, ADCs, and immunoconjugates, with most clinical-stage antibodies targeting cancer and originating from China or the U.S.
The data are collected from public sources beyond INN lists, including company websites, press releases, clinical trial registries, regulatory agencies, and literature reports.
Architecturally, it is a mix of language models, diffusion and structure prediction methods.
Training uses denoising diffusion: first the structure is perturbed and the model learns to recover it, and afterwards the same is done for sequences.
After these two steps the model is distilled into a consistency model, which can produce the final coordinates/sequence in a single step rather than through iterative denoising.
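A purely schematic sketch of the difference between iterative denoising and a distilled one-step consistency call (function names and the toy denoisers are placeholders, not the paper's API):

```python
import torch

def iterative_denoise(denoiser, x_noisy, n_steps: int = 50):
    """Standard diffusion-style sampling: many small denoising steps."""
    x = x_noisy
    for t in reversed(range(n_steps)):
        x = denoiser(x, t)          # each call removes a little noise
    return x

def consistency_sample(consistency_model, x_noisy):
    """Distilled consistency model: jump straight to the clean sample in one call."""
    return consistency_model(x_noisy, t=0)

# toy placeholders illustrating only the call pattern
denoiser = lambda x, t: 0.95 * x
consistency_model = lambda x, t: torch.zeros_like(x)
x0_iter = iterative_denoise(denoiser, torch.randn(10, 3))
x0_fast = consistency_sample(consistency_model, torch.randn(10, 3))
```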
Method achieves comparable accuracy to many methods out there, such as DiffAb, dyMEAN and others.
On docking, the best performance is on the order of 4 Å iRMSD when using an AlphaFold3 antibody model, so some challenges still remain.