Authors employ patent data to develop a humanness model (selfPAD) that achieves state-of-the-art performance in immunogenicity prediction.
They employ data from PAD, which at the time comprised roughly 290k sequences from 16,000 patent families.
They recognize the noisiness inherent in patent data and design a training procedure that learns a latent representation of patent sequences associated with function, in this case the target the sequence binds.
In the first stage they use contrastive learning: sequences against the same target are trained to lie ‘closer’ in latent space, and sequences against different targets ‘farther apart’.
In the second stage, they fine-tune on humanness detection.
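A minimal sketch of what the first-stage objective could look like, assuming a PyTorch encoder that maps each sequence to a fixed-size embedding; the function name, temperature, and SupCon-style formulation are illustrative assumptions, not details taken from the paper:

```python
import torch
import torch.nn.functional as F

def target_contrastive_loss(embeddings, target_ids, temperature=0.1):
    """Pull sequences against the same target together in latent space,
    push sequences against different targets apart (SupCon-style sketch).

    embeddings: (N, D) encoder outputs; target_ids: (N,) target labels.
    """
    z = F.normalize(embeddings, dim=1)               # work in cosine space
    sim = z @ z.T / temperature                      # (N, N) similarities
    self_mask = torch.eye(len(z), dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float("-inf"))  # drop self-pairs
    pos = (target_ids[:, None] == target_ids[None, :]) & ~self_mask
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    # negated mean log-likelihood of each anchor's positives
    loss = -log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return loss.mean()
```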
They tested their method on humanness prediction, ADA (anti-drug antibody) prediction, and agreement with humanization choices. Taken together, their method achieves the best overall performance.
Authors expand the existing IMGT/mAb-DB into a knowledge graph (IMGT/mAb-KG) queryable via a user-friendly interface.
As of February 2024, IMGT/mAb-KG contains 139,629 triplets, 1,867 concepts, 114 properties, and links 21,842 entities. It includes detailed information on approximately 1,500 mAbs, 500 targets, and 500 clinical indications.
It is linked to various external resources, such as Thera-SAbDab, PharmGKB, PubMed, and HGNC, making it a valuable tool for researchers and developers working on therapeutic mAbs.
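As an illustration of how a triplet store like this can be queried programmatically, a minimal rdflib sketch; the file name, prefix, and property names below are hypothetical placeholders, not the actual IMGT/mAb-KG schema:

```python
from rdflib import Graph

# Hypothetical local Turtle export of the knowledge graph.
g = Graph()
g.parse("imgt_mab_kg.ttl", format="turtle")

# Prefix and property names are invented for illustration only.
query = """
PREFIX kg: <http://example.org/imgt-mab-kg#>
SELECT ?mab ?indication WHERE {
    ?mab kg:hasTarget ?target .
    ?target kg:label "PD-1" .
    ?mab kg:treatsIndication ?indication .
}
"""
for row in g.query(query):
    print(row.mab, row.indication)
```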
An active learning framework is proposed to efficiently identify antibody mutations that enhance binding affinity, while minimizing the number of wet-lab experiments.
From the paper: “Active learning is a framework from experimental design that focuses on making informed decisions about which experiments to perform next”. That is increasingly important for generating data in a prediction-first way, choosing the experiments that teach the model the most per round.
Bayesian optimization is used with relative binding free energy (RBFE) methods to iteratively propose and evaluate new antibody sequences, improving binding affinity predictions.
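A minimal sketch of such a loop, using a scikit-learn Gaussian process surrogate over one-hot features (one of the encodings mentioned below) and a UCB-style acquisition; the `rbfe_oracle` stub stands in for an expensive RBFE calculation, and all names and hyperparameters here are illustrative assumptions rather than the paper's setup:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

AAS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    """Flat one-hot encoding over the 20 amino acids (fixed-length seqs)."""
    x = np.zeros((len(seq), len(AAS)))
    for i, aa in enumerate(seq):
        x[i, AAS.index(aa)] = 1.0
    return x.ravel()

def active_learning_loop(pool, rbfe_oracle, n_rounds=5, batch=4, seed=0):
    """Alternate between fitting a surrogate on all RBFE results so far
    and spending the next batch of expensive evaluations on the most
    promising (optimistic upper-confidence-bound) candidates."""
    rng = np.random.default_rng(seed)
    pool, X, y = list(pool), [], []
    # seed round: a random batch gets the surrogate started
    for i in sorted(rng.choice(len(pool), batch, replace=False), reverse=True):
        s = pool.pop(int(i))
        X.append(one_hot(s)); y.append(rbfe_oracle(s))
    for _ in range(n_rounds):
        gp = GaussianProcessRegressor(normalize_y=True).fit(np.array(X), y)
        mu, sd = gp.predict(np.array([one_hot(s) for s in pool]),
                            return_std=True)
        acq = -mu + 1.96 * sd          # lower predicted ddG is better
        for i in sorted(np.argsort(acq)[-batch:], reverse=True):
            s = pool.pop(int(i))
            X.append(one_hot(s)); y.append(rbfe_oracle(s))
    return X, y
```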
Various encoding schemes are tested, including one-hot, bag of amino acids, BLOSUM, and AbLang2. The best-performing methods are identified through validation on pre-computed data.
The study uses two RBFE methods: NQFEP for accurate but costly simulations, and Schrödinger residue scanning for faster but less precise results. The active learning loop consistently finds better-binding sequences with both.
AbLang2 encoding with the Tanimoto kernel consistently outperformed other methods in the validation phase, indicating its effectiveness in predicting improved binding affinities.
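For reference, the real-valued Tanimoto kernel scores two embedding vectors as k(x, y) = <x, y> / (|x|^2 + |y|^2 - <x, y>); a minimal NumPy version (applying it to AbLang2 embeddings is this paper's combination, but the code itself is a generic sketch):

```python
import numpy as np

def tanimoto_kernel(X, Y):
    """Pairwise Tanimoto kernel between embedding matrices
    X (n, d) and Y (m, d): <x, y> / (|x|^2 + |y|^2 - <x, y>)."""
    dot = X @ Y.T
    sq_x = (X ** 2).sum(axis=1)[:, None]
    sq_y = (Y ** 2).sum(axis=1)[None, :]
    return dot / (sq_x + sq_y - dot)
```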
Despite its higher computational cost, the NQFEP method may therefore be preferable when high accuracy is crucial.
Novel docking score based on graph-model and language-model embeddings, with an antibody-specific variant.
It is based on the Equiformer models, giving the structure an equivariant representation. Relative to previous work, the authors move from atom-level to residue-level representations, take complexes rather than single chains as input, and add language-model embeddings.
They employ DistilBERT for the embeddings, training it on interaction data from BioGRID.
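A sketch of how such embeddings are typically extracted with the Hugging Face transformers API; the checkpoint name is a placeholder for a DistilBERT model fine-tuned as described, and the mean-pooling choice is an assumption, not a detail from the paper:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Placeholder name for a DistilBERT checkpoint fine-tuned on BioGRID data.
tok = AutoTokenizer.from_pretrained("my-distilbert-biogrid")
model = AutoModel.from_pretrained("my-distilbert-biogrid")

def embed(text):
    """Mean-pooled last-layer embedding for one input string."""
    inputs = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs).last_hidden_state  # (1, L, hidden_dim)
    return out.mean(dim=1).squeeze(0)            # (hidden_dim,)
```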
The antibody-specific model is trained on antibody data downloaded from AbDb to distinguish native from non-native poses after local docking with RoseTTAFold.
The antibody-specific model is in fact heavy-antigen only: the authors saw no benefit from a light-only model, and a three-way heavy-light-antigen model was too computationally expensive.
The antibody-specific scoring method does have predictive power in distinguishing native from non-native poses. It did not outperform AF2-Multimer, but the authors note that its predictive power can be harnessed to rescore AF2-Multimer outputs.
The code is available at https://gitlab.com/mcfeemat/eudockscore
Authors introduce a large antibody-specific language model, FAbCon (2.4B parameters), that demonstrably improves antibody specificity prediction.
Model is based on the Falcon LLM and is trained with the causal language modeling (CLM) objective (predict the next amino acid, going N- to C-terminal).
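A minimal sketch of the CLM objective, assuming `model` is any callable returning per-position logits over the amino-acid vocabulary; the names are illustrative:

```python
import torch
import torch.nn.functional as F

def clm_loss(model, token_ids):
    """Causal LM objective: at each position, predict the next residue
    reading N- to C-terminal. token_ids: (batch, length) indices;
    model(token_ids): (batch, length, vocab) logits."""
    logits = model(token_ids)
    pred = logits[:, :-1].reshape(-1, logits.size(-1))
    target = token_ids[:, 1:].reshape(-1)   # targets shifted left by one
    return F.cross_entropy(pred, target)
```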
The model was trained on paired (2.5M) and unpaired (821M) sequences from OAS.
The pre-trained model was tested on its ability to fine-tune on binder prediction using three datasets: anti-HER2, anti-SARS-CoV-2, and anti-IL-6.
When compared against multiple other models on binder prediction, the largest FAbCon model fares best, showing the benefit of overparameterization.
Since the model was trained with the CLM objective, it can be used for sequence generation, producing sequences that are very human-like compared with antibodies from human PBMCs.
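Because a CLM factorizes the sequence left to right, generation is just repeated next-residue sampling; a minimal sketch under the same `model` assumption as above (temperature, length, and seeding are illustrative, not FAbCon's actual decoding setup):

```python
import torch

def sample_sequence(model, start_ids, max_len=130, temperature=1.0):
    """Autoregressively extend a seed: sample the next residue from the
    model's last-position distribution and append it."""
    ids = start_ids.clone()                       # (1, t) seed tokens
    for _ in range(max_len - ids.size(1)):
        logits = model(ids)[:, -1] / temperature  # last-position logits
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, 1)         # (1, 1) sampled token
        ids = torch.cat([ids, nxt], dim=1)
    return ids
```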
Novel tool (Digger) to automatically annotate immunoglobulin (and T-cell receptor) genes from assemblies.
They compared Digger's output against the human manual annotations from IMGT and found close agreement for functional and open-reading-frame genes.
By automating annotation, strict rules can be enforced, avoiding manual-curation errors.
Software is available at: https://github.com/williamdlees/Digger
Update on tFold-Ab, now including modeling of the complex with the antigen, with applications to virtual screening.
They mostly use SAbDab/CoV-AbDab as reference datasets.
Modeling proceeds by generating antibody and antigen features, supplemented by a large language model, followed by flexible docking.
For antigen feature generation, they use AF2.
They train on several tasks simultaneously, e.g. antibody structure prediction and complex prediction, making it multi-task training.
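Multi-task training here just means scalarizing the per-task losses into a single objective; a trivial sketch (the task names and weights are illustrative assumptions, not values from the paper):

```python
def multitask_loss(task_losses, task_weights):
    """Weighted sum over per-task losses, e.g.
    {"ab_structure": l1, "complex": l2}; weights are hyperparameters."""
    return sum(task_weights[t] * l for t, l in task_losses.items())
```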
On global docking, their method achieves a DockQ of 0.217 vs AlphaFold-Multimer's 0.158.
When local docking information is given, constraining the paratope/epitope sites, their algorithm achieves a DockQ of 0.416.
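For reference, DockQ (Basu & Wallner, 2016) combines the fraction of native contacts with scaled interface and ligand RMSDs; a direct transcription of the published formula:

```python
def dockq(fnat, irmsd, lrmsd):
    """DockQ = (Fnat + scaled iRMSD + scaled LRMSD) / 3.
    Roughly: >= 0.23 acceptable, >= 0.49 medium, >= 0.80 high quality."""
    scaled_irmsd = 1.0 / (1.0 + (irmsd / 1.5) ** 2)
    scaled_lrmsd = 1.0 / (1.0 + (lrmsd / 8.5) ** 2)
    return (fnat + scaled_irmsd + scaled_lrmsd) / 3.0
```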
They demonstrate that filtering antibodies by their predicted modeling confidence score gives moderate enrichment against PD-1 and SARS-CoV-2 antigens, showing promise for virtual screening.
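One simple way to quantify that kind of enrichment is to rank candidates by confidence and compare the hit rate in the top fraction to the base rate; the function and the 10% cutoff below are illustrative, not the paper's protocol:

```python
import numpy as np

def enrichment_factor(confidences, is_binder, top_frac=0.1):
    """Hit rate among the top `top_frac` of confidence-ranked candidates,
    divided by the overall hit rate (assumes at least one true binder)."""
    order = np.argsort(confidences)[::-1]          # highest confidence first
    k = max(1, int(len(order) * top_frac))
    top_rate = np.mean(np.asarray(is_binder)[order[:k]])
    base_rate = np.mean(is_binder)
    return top_rate / base_rate
```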