Novel language model incorporating structural information, with experimentally demonstrated ability to improve the design of therapeutic antibodies.
The new language model, ProseLM, builds upon the ProGen family of models from the same authors.
Structural information is incorporated via structural adapter layers inserted after the language model layers, encoding the backbone and associated functional annotations.
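For intuition, a minimal PyTorch sketch of such an adapter (the names, dimensions, and wiring here are assumptions for illustration, not ProseLM's actual implementation): a bottleneck layer that injects encoded structure features into the frozen language model's hidden states through a residual connection.

```python
import torch
import torch.nn as nn

class StructuralAdapter(nn.Module):
    """Hypothetical bottleneck adapter placed after a frozen LM layer,
    conditioning token states on per-residue structure features
    (e.g. encoded backbone plus functional annotations)."""
    def __init__(self, d_model: int, d_struct: int, d_bottleneck: int = 64):
        super().__init__()
        self.struct_proj = nn.Linear(d_struct, d_model)  # encode structure context
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, hidden: torch.Tensor, struct: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) output of the frozen LM layer
        # struct: (batch, seq, d_struct) per-residue structural features
        x = hidden + self.struct_proj(struct)
        return hidden + self.up(self.act(self.down(x)))  # residual update
```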
Models with more parameters achieve markedly better perplexity. There is also some improvement from adding further context information such as ligands.
They trained an antibody-specific version of ProseLM on SAbDab data only, and it does much better on sequence recovery, even compared to the larger models.
They use the model to propose mutations for Nivolumab and Secukinumab, with mutations in both CDRs and frameworks. They used structures from the PDB as the basis for designs.
They found better binders; however, when CDRs were redesigned, the overall success rate of maintaining binding was lower (25% for Nivolumab) than when frameworks were redesigned (92%).
Novel generative model for antibody sequences that supports VH/VL pairing and generation of developable sequences.
Three models were created: IgGen (an unpaired model), p-IgGen (fine-tuned on paired sequences), and developable p-IgGen (further fine-tuned on developable paired sequences).
They used ca. 250m unpaired sequences and 1.8m paired sequences for training.
The model is based on GPT-2 but with rotary position embedding.
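For reference, a minimal sketch of rotary position embeddings in the rotate-half formulation (an illustration of the mechanism, not p-IgGen's code); it would be applied to queries and keys before attention:

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embeddings for x of shape (batch, seq, dim), dim even."""
    _, seq_len, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-torch.arange(half, dtype=x.dtype) / half)     # (half,)
    angles = torch.arange(seq_len, dtype=x.dtype)[:, None] * inv_freq  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # rotate each (x1, x2) channel pair by a position-dependent angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```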
Developable sequences were defined as the subset of the 1.8m paired sequences whose structural models had good TAP metrics (900,000 in total).
The model is much smaller than many of the models out there (17m parameters), making it more lightweight in training and application.
The model performs better on immunogenicity prediction than other models but worse on expression prediction.
Novel CDR-H3 structure prediction method, ComMat, based on ensemble sampling.
Rather than generating a single structure, the method generates several solutions that all inform the next iteration.
The method was integrated into the structure module of AlphaFold2.
Crucially, with the introduction of a second prediction into the ‘community’, the predictions become better. However, they quickly plateau, showing the limits of the approach.
The method does not produce better results than ABodyBuilder2 and EquiFold.
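The community idea can be caricatured in a few lines (a toy, runnable sketch; `refine` stands in for ComMat's actual structure-module update, and candidates here are plain numbers rather than structures):

```python
import random

def community_sampling(candidates, refine, n_iters=10):
    """Toy community-based iterative sampling: instead of refining one
    solution, a population is kept and every update sees the whole population."""
    for _ in range(n_iters):
        candidates = [refine(member, candidates) for member in candidates]
    return candidates

def refine(member, community):
    # drift toward the community consensus, plus a little exploration noise
    mean = sum(community) / len(community)
    return member + 0.5 * (mean - member) + random.gauss(0, 0.01)

print(community_sampling([random.uniform(-1, 1) for _ in range(4)], refine))
```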
Novel humanization protocol employing language models and large-scale repertoire data.
Human OAS and germline sequences are embedded using ESM-2.
A k-nearest neighbors algorithm is then used to introduce mutations into the ESM-2-embedded query sequence, taken from its closest functional neighbors in the ESM-2-embedded OAS+germline space.
Results of the humanized antibodies are validated experimentally via ELISA.
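A minimal sketch of the embed-then-kNN step (placeholder data and illustrative mutation-transfer logic, assuming per-sequence ESM-2 embeddings have already been computed; not the paper's code):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
reference_embeddings = rng.normal(size=(1000, 1280))  # placeholder ESM-2 embeddings
reference_seqs = ["QVQLVQSGAEVKKPG"] * 1000           # placeholder aligned human sequences

knn = NearestNeighbors(n_neighbors=5).fit(reference_embeddings)

def propose_humanizing_mutations(query_embedding, query_seq):
    """Suggest mutations taken from the query's nearest human neighbors."""
    _, idx = knn.kneighbors(query_embedding[None, :])
    proposals = {}
    for neighbor_seq in (reference_seqs[i] for i in idx[0]):
        for pos, (q, h) in enumerate(zip(query_seq, neighbor_seq)):
            if q != h:  # human residue differs from query: candidate mutation
                proposals.setdefault(pos, []).append(h)
    return proposals

print(propose_humanizing_mutations(rng.normal(size=1280), "QVKLVQSGAEVKKPG"))
```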
Novel language model, AntiBARTy, with a demonstration of how to use it to diffuse novel antibodies with favorable solubility properties.
The core model is a BART-based transformer, with 16m parameters.
It was first trained on all human heavy and light chains from OAS (254m heavies and 342m lights <- yes, more lights). This was followed by fine-tuning on the higher-quality paired data from OAS.
The diffusion model was based on a U-Net (a CNN architecture used for segmentation of medical images), totaling 3m parameters.
They define low- and high-solubility classes as predicted by protein-sol on paired OAS, with roughly 20k samples in each class.
Overall, one can sample from a multivariate Gaussian to get a vector in the AntiBARTy latent space and use it to obtain an antibody sequence with either high or low predicted (protein-sol) solubility.
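Schematically, sampling might look like the toy DDPM-style loop below (stub denoiser; `antibarty_decoder` is hypothetical, and class guidance toward high or low solubility would enter as an extra denoiser input):

```python
import torch

def sample_latent(denoiser, dim=64, n_steps=50):
    """Toy diffusion-style sampler over a latent space (illustrative only):
    start from Gaussian noise and iteratively denoise toward a latent vector."""
    z = torch.randn(1, dim)                      # sample from multivariate Gaussian
    for t in reversed(range(n_steps)):
        z = z - denoiser(z, t)                   # schematic noise-removal step
        if t > 0:
            z = z + 0.05 * torch.randn_like(z)   # re-inject a little noise
    return z

toy_denoiser = lambda z, t: 0.1 * z              # stub for the 3m-parameter U-Net
z = sample_latent(toy_denoiser)
# sequence = antibarty_decoder(z)  # hypothetical: decode latent into a sequence
```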
Authors introduce AntPack - software for rapid numbering of antibody sequences, germline identification and humanization.
Authors use a mixture model (so no deep learning!) fitted on millions of sequences from NGS.
The sequences are pre-numbered to standardize them and then assigned to clusters, which offers explainability on germline assignment and residue probability at a given position.
The method is very fast in comparison to HMM-based approaches such as ANARCI.
Method is available via https://github.com/Wang-lab-UCSD/AntPack
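Conceptually, cluster responsibilities in such a mixture model yield both germline assignment and per-position residue probabilities; a toy sketch over pre-numbered, fixed-length sequences (not AntPack's actual model):

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(0)
n_clusters, seq_len = 3, 10
weights = np.full(n_clusters, 1 / n_clusters)                   # mixture weights
emissions = rng.dirichlet(np.ones(len(AA)), size=(n_clusters, seq_len))

def responsibilities(seq):
    """P(cluster | sequence) under a categorical mixture: the basis for
    explainable germline assignment and per-position residue probabilities."""
    idx = [AA.index(a) for a in seq]
    log_lik = np.log(weights) + sum(
        np.log(emissions[:, pos, aa]) for pos, aa in enumerate(idx)
    )
    p = np.exp(log_lik - log_lik.max())
    return p / p.sum()

print(responsibilities("QVQLVQSGAE"))
```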
Authors demonstrate that, using inverse folding, one can affinity-mature antibodies, and confirm this experimentally.
Authors employ ESM-IF as the inverse folding algorithm.
They take two existing antibodies, bebtelovimab and BD55-5840, both instrumental in COVID-19.
They introduce all possible single point mutations into the VH and VL regions (about 4,300) and pick those with the best perplexity for experimental characterization.
The best-perplexity ones have many framework mutations (bebtelovimab 10/14 and BD55-5840 3/6). There was only one mutation to CDR-H3, in bebtelovimab.
The inverse folding method achieves much better performance when the antigen is used as well.
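The mutation scan itself is simple to sketch (`score_sequence` is a stand-in for ESM-IF log-likelihood scoring conditioned on the fixed backbone, and ideally the antigen; higher score = lower perplexity):

```python
AA = "ACDEFGHIKLMNPQRSTVWY"

def rank_point_mutants(wt_seq, score_sequence, top_k=10):
    """Enumerate all single point mutants of wt_seq and rank them by model score."""
    scored = []
    for pos, wt in enumerate(wt_seq):
        for aa in AA:
            if aa != wt:
                mutant = wt_seq[:pos] + aa + wt_seq[pos + 1:]
                scored.append((score_sequence(mutant), f"{wt}{pos + 1}{aa}"))
    return sorted(scored, reverse=True)[:top_k]

# Toy usage with a dummy scorer (real use: an inverse folding model's likelihood).
print(rank_point_mutants("QVQLVQ", lambda s: s.count("A")))
```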
Proposal for modeling antibodies using language that is more fit-for-purpose than current approaches.
It is plausible to represent antibodies/proteins as language, to draw from the existing trove of research on natural language.
Current approaches that port models from natural language to proteins/antibodies verbatim might not realize their full potential, because they do not focus on key differences between natural language and proteins.
Authors propose a more fit-for-purpose formalization, in which an important part is better token definition and association of tokens with function: for instance, not simply amino acids or k-mers, but something more complex such as C*U and RA*, associated with hydrophobicity, zinc-finger binding, or similar.
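To make this concrete, a toy motif-aware tokenizer in which some multi-residue tokens carry functional annotations (the motifs and annotations below are invented placeholders, not from the paper):

```python
# Longest-match tokenizer over a vocabulary where some tokens are
# multi-residue motifs associated with a functional annotation.
MOTIF_VOCAB = {
    "CAR": "CDR-H3 stem (placeholder annotation)",
    "WGQG": "J-gene framework anchor (placeholder annotation)",
}

def tokenize(seq):
    tokens, i = [], 0
    while i < len(seq):
        for motif in sorted(MOTIF_VOCAB, key=len, reverse=True):
            if seq.startswith(motif, i):       # prefer annotated motif tokens
                tokens.append((motif, MOTIF_VOCAB[motif]))
                i += len(motif)
                break
        else:                                  # fall back to single amino acids
            tokens.append((seq[i], None))
            i += 1
    return tokens

print(tokenize("CARDYWGQG"))
```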
Authors employ patent data to develop a model of humanness (selfPAD) that achieves state of the art in immunogenicity prediction.
They employ data from PAD, which at the time comprised roughly 290k sequences from 16,000 patent families.
They recognize the noisiness inherent to patent data and employ a training procedure that learns a latent representation of patent sequences associated with function - in this case, the target of the sequence.
In the first stage of training they employ contrastive learning, with sequences against the same target trained to be ‘closer’ in latent space and those against different targets to be ‘farther away’.
In the second stage, they perform fine-tuning on humanness detection.
They tested their method on humanness prediction, ADA prediction, and agreement with humanization choices. Taking all the tests together, their method achieves the best performance.
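A minimal sketch of the first-stage objective (a generic supervised-contrastive loss over target labels; selfPAD's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def target_contrastive_loss(embeddings, targets, temperature=0.1):
    """Pull embeddings of sequences against the same target together and push
    different-target embeddings apart (SupCon/InfoNCE-style, illustrative)."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.T / temperature                          # pairwise similarities
    same = (targets[:, None] == targets[None, :]).float()
    off_diag = 1 - torch.eye(len(z))                     # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim + torch.log(off_diag + 1e-12), dim=1, keepdim=True)
    pos = same * off_diag                                # same-target, non-self pairs
    return -(log_prob * pos).sum() / pos.sum().clamp(min=1)

# Toy usage: four embeddings, two targets.
print(target_contrastive_loss(torch.randn(4, 8), torch.tensor([0, 0, 1, 1])))
```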
Authors expand the existing IMGT/mAb-DB with a knowledge graph (IMGT/mAb-KG) that supports querying via a user-friendly interface.
As of February 2024, IMGT/mAb-KG contains 139,629 triplets, 1,867 concepts, 114 properties, and links 21,842 entities. It includes detailed information on approximately 1,500 mAbs, 500 targets, and 500 clinical indications.
It is linked to various external resources, such as Thera-SAbDab, PharmGKB, PubMed, and HGNC, making it a valuable tool for researchers and developers working on therapeutic mAbs.
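For orientation, querying such a triplet store could look like the rdflib sketch below (the namespace, entities, and predicate are invented stand-ins, not IMGT/mAb-KG's real schema):

```python
from rdflib import Graph, Namespace

# Tiny in-memory stand-in for the knowledge graph's triplets.
IMGT = Namespace("http://example.org/imgt#")  # placeholder namespace
g = Graph()
g.add((IMGT.nivolumab, IMGT.hasTarget, IMGT.PD1))
g.add((IMGT.secukinumab, IMGT.hasTarget, IMGT.IL17A))

query = """
PREFIX imgt: <http://example.org/imgt#>
SELECT ?mab ?target WHERE { ?mab imgt:hasTarget ?target . }
"""
for mab, target in g.query(query):
    print(mab, target)
```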