Computational Antibody Papers

    • Authors revisit computational metrics calculated from sequence and structure to distinguish clinical-stage therapeutics from natural antibodies, as an alternative/refinement to the popular TAP metrics.
    • Authors explain why the FvCSP charge-asymmetry metric calculated in TAP might not be the ideal formulation.
    • They introduce FV_CHML which, as opposed to FvCSP (a product of the net charges), is a difference between the net charges.
    • Of the several computational metrics employed, they show that FV_CHML captures most of the clinical-stage therapeutics.
    • They analyse the effect of the isotype, demonstrating that for accurate pI calculations the constant region should be modeled, not only the Fv.
    • They propose four descriptors that show a good degree of separation between natural and clinical antibodies, and some correlation with experimental values: 1. Patch_cdr_hyd (hydrophobicity of CDRs, not the same as in TAP); 2. ens_charge_Fv (in lieu of PPC and PNC from TAP); 3. Cdr_len (separates repertoire from clinical antibodies); 4. Fv_chml (in lieu of FvCSP from TAP).
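To make the contrast concrete, a minimal sketch, assuming (per TAP) that FvCSP is the product of the VH and VL net charges and taking Fv_chml as their difference; the charge values are invented:

```python
# Invented net charges for the VH and VL domains (e.g. at formulation pH)
q_vh, q_vl = 2.1, -0.8

fvcsp = q_vh * q_vl    # TAP-style FvCSP: product of the two net charges
fv_chml = q_vh - q_vl  # difference of the net charges, the idea behind Fv_chml

print(round(fvcsp, 2), round(fv_chml, 2))  # -1.68 2.9
```

The product vanishes or flips sign whenever either chain is near neutral, while the difference varies smoothly with the charge imbalance between the chains.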
    • Novel experimental/computational workflow that demonstrates how little data might be needed to develop antibody affinity predictors.
    • Mice were immunized with hen egg white lysozyme and, via a computational procedure of clustering with known binders, 35 antibodies were characterized together with their affinities.
    • These 35 antibodies were used to train the methods: Gaussian Process (GP) models with Matern and RBF kernels, Kernel Ridge Regression (KRR), Random Forest (RF) and Linear Regression (used as a baseline).
    • Seed sequences were point- or double-mutated and their affinity predicted using the GP model (which performed best). Eight mutants predicted to span the whole range of affinities were selected for experimental testing and showed very good agreement with the predictions.
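A minimal sketch of this low-data regression setup using scikit-learn; the one-hot featurization and synthetic affinities below are stand-ins, not the paper's pipeline:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
AA = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    """Toy featurization: flattened one-hot encoding of the sequence."""
    x = np.zeros((len(seq), len(AA)))
    for i, a in enumerate(seq):
        x[i, AA.index(a)] = 1.0
    return x.ravel()

# ~35 synthetic fixed-length sequences with fake log-affinities (stand-in data)
seqs = ["".join(rng.choice(list(AA), 12)) for _ in range(35)]
y = rng.normal(size=35)

X = np.array([one_hot(s) for s in seqs])
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)

# Predict affinity with uncertainty for candidate mutants (here: training points)
mu, sigma = gp.predict(X[:5], return_std=True)
```

The GP's predictive standard deviation is what makes it attractive in this few-shot regime: mutants can be ranked not just by predicted affinity but by confidence.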
    • Novel masking scheme for antibody sequences with applications to Vh-Vl pairing and specificity prediction.
    • Since antibodies have intrinsically biased mutation patterns in favor of CDRs, the authors questioned the canonical 15% uniform masking procedure in antibody language models.
    • They focused the masking on the CDR3 regions during training which resulted in faster convergence.
    • They tested pairing prediction of Vh/Vl, reporting about 60% accuracy at telling random from non-random pairings.
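The biased masking idea can be sketched as follows, assuming the simple scheme of concentrating most of a 15% masking budget on CDR3 positions; the bias fraction and the sequence/span below are illustrative, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
MASK = "<mask>"

def mask_sequence(tokens, cdr3_span, rate=0.15, cdr3_bias=0.8):
    """Spend most of the masking budget inside CDR3 instead of masking uniformly.

    cdr3_bias is the fraction of masked positions drawn from the CDR3 span;
    both numbers are illustrative, not the paper's exact values.
    """
    tokens = list(tokens)
    n_mask = max(1, int(rate * len(tokens)))
    cdr3 = list(range(*cdr3_span))
    rest = [i for i in range(len(tokens)) if i not in cdr3]
    n_cdr3 = min(len(cdr3), round(cdr3_bias * n_mask))
    picks = list(rng.choice(cdr3, n_cdr3, replace=False))
    picks += list(rng.choice(rest, n_mask - n_cdr3, replace=False))
    return [MASK if i in picks else t for i, t in enumerate(tokens)], picks

# Toy VH-like sequence with an assumed CDR3 span (positions 15-21)
masked, picks = mask_sequence("EVQLVESGGGLVQPGGSARELLM", cdr3_span=(15, 22))
```

Since CDR3 carries most of the mutational diversity, concentrating the masking budget there gives the model harder, more informative prediction targets per batch, which is consistent with the reported faster convergence.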
    • Novel algorithm for structural search of proteins that at the same time introduces an innovative way to encode protein structures.
    • It encodes protein structures as sequences over a 20-state 3Di alphabet, representing tertiary residue-residue interactions (Ca of neighboring residues) rather than backbone conformations, enabling faster sequence-based comparisons.
    • The 3Di alphabet and substitution matrix were trained on the SCOPe40 dataset (~11k structures), which consists of manually classified single-domain protein structures clustered at 40% sequence identity.
    • FoldSEEK is thousands of times faster than structural alignment tools like Dali, TM-align, and CE, being 184,600 times faster than Dali and 23,000 times faster than TM-align on the AlphaFoldDB.
    • FoldSEEK achieves sensitivities of 86%, 88%, and 133% relative to Dali, TM-align, and CE, respectively, and ranks among the top tools in precision-recall benchmarks.
    • FoldSEEK produces alignments with accuracy comparable to Dali and TM-align, is 15% more sensitive than CE, and excels in detecting homologous multi-domain structures efficiently.
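Abstractly, the encoding reduces to quantizing per-residue geometric descriptors against a learned 20-state codebook; the sketch below uses random stand-in centroids and descriptors rather than Foldseek's actual learned 3Di states:

```python
import numpy as np

rng = np.random.default_rng(0)
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 structural states, written with letters

# Stand-in codebook: 20 centroids in a toy 6-D descriptor space.
# Foldseek's real states are learned from Ca-based residue-residue
# interaction features; random centroids are used here for illustration.
centroids = rng.normal(size=(20, 6))

def structure_to_3di_like(descriptors):
    """Assign each residue's descriptor to its nearest centroid: one letter per residue."""
    d = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=-1)
    return "".join(ALPHABET[i] for i in d.argmin(axis=1))

protein = rng.normal(size=(30, 6))   # toy descriptors for a 30-residue protein
s = structure_to_3di_like(protein)   # 1-D string, searchable with sequence tools
```

Once a structure is a string, decades of fast sequence-search machinery (k-mer prefilters, substitution matrices, banded alignment) apply directly, which is where the speedups over Dali/TM-align come from.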
    • ProstT5: a novel language model using the Foldseek structural representation to add a structural dimension to the model.
    • The foldseek 3Di representation is used to encode 3D protein structures as 1D token sequences, enabling seamless translation between amino acid sequences and structural representations.
    • The model was fine-tuned on 17 million AlphaFoldDB structures using ProtT5 as a base, with bi-directional translation tasks to map between amino acid (AA) and 3Di sequences.
    • ProstT5 achieves 3600-fold faster remote homology detection compared to AlphaFold-based methods, while maintaining near-experimental accuracy and improving fold classification tasks like CATH.
    • ProstT5 embeddings outperform ProtT5, ESM-1b, and Ankh for structure-related tasks and show competitive performance in inverse folding, generating diverse sequences with preserved structural similarity, though in most cases ProteinMPNN still performs better for inverse folding.
    • Novel language model for antibodies, blending sequence and structural information.
    • The model encodes sequence ‘as usual’ and uses a GVP-GNN (as in ESM-IF) for structure; only the three backbone atoms (C, N, Ca) per residue are used for the structural representation.
    • The data is a mix of sequence data and X-ray structures. The sequence datasets were modeled using ImmuneBuilder to increase structural coverage.
    • The model has an MLM objective on sequence & structure with three losses: sequence only, sequence + structure, and structure only.
    • On sequence infilling IgBLEND performs better than other methods (e.g. AbLang, Nanobert), though arguably CDR-H3 predictions look very ‘close’ across the board.
    • On inverse folding the method performs substantially better, with large gaps in CDR-H3 and notable improvements for nanobodies, which other methods like ESM-IF or AntiFold did not handle natively.
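The three-part objective can be sketched in numpy; the random logits below stand in for the three forward passes (sequence only, sequence + structure, structure only), and for simplicity all three are scored over the same toy vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_ce(logits, targets, mask):
    """Mean cross-entropy over masked positions only."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    nll = -np.log(probs[np.arange(len(targets)), targets])
    return float((nll * mask).sum() / mask.sum())

L, V = 10, 20                                  # toy length and vocabulary size
targets = rng.integers(0, V, L)
mask = np.zeros(L, dtype=bool)
mask[rng.choice(L, 2, replace=False)] = True   # mask two positions

# Random logits stand in for the three forward passes:
# sequence only, sequence + structure, and structure only.
losses = [masked_ce(rng.normal(size=(L, V)), targets, mask) for _ in range(3)]
total_loss = sum(losses)                       # joint training objective
```

Summing the three losses forces the shared encoder to be useful whether or not structure is available at inference time, which is the point of the blended training.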
    • Novel epitope-paratope prediction method, ImaPEp.
    • The method projects the epitope and paratope onto 2D images and then uses a ResNet to predict interacting vs non-interacting pairs.
    • The negative set was constructed by pairing non-cognate antibody-antigen pairs, applying rotations, etc.
    • The method was not benchmarked against epitope prediction tools, which arguably do not take pairs into account, but against docking tools, scoring 13 out of the 18 methods tested.
    • AlphaBind: a deep learning model designed to optimize antibody sequences by leveraging large-scale pre-trained affinity datasets and fine-tuning on experimental data.
    • AlphaBind was pre-trained on a dataset of 7.5 million antibody-antigen affinity measurements, which includes data from yeast display systems and diverse antibody libraries obtained from multiple experimental sources, focusing on quantitative affinity measurements.
    • The model uses a transformer-based architecture with protein sequence embeddings generated using ESM-2nv (Evolutionary Scale Model); it consists of 4 attention heads, 7 layers, and about 15 million parameters.
    • The model fine-tunes on specific antibody-antigen systems using AlphaSeq data, then performs stochastic greedy optimization by generating sequence mutations (using ESM-2nv logits) to explore sequence space and predict binding affinity. This process generates thousands of candidate sequences, which are filtered based on affinity predictions and developability metrics before in vitro validation.
    • Novel sequences for three antibody-antigen systems were verified experimentally.
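The optimization loop can be sketched with placeholder components; the toy scorer and uniform point mutations below stand in for the fine-tuned affinity model and the ESM-2nv-logit-guided proposals:

```python
import random

random.seed(0)
AA = "ACDEFGHIKLMNPQRSTVWY"

def predicted_affinity(seq):
    """Toy stand-in for the fine-tuned affinity model."""
    return sum(AA.index(a) * (i + 1) for i, a in enumerate(seq)) % 100

def propose_mutants(seq, n=50):
    """Uniform random point mutants; AlphaBind instead biases proposals with model logits."""
    out = []
    for _ in range(n):
        i = random.randrange(len(seq))
        out.append(seq[:i] + random.choice(AA) + seq[i + 1:])
    return out

def greedy_optimize(seed_seq, rounds=5):
    """Keep the best-scoring candidate each round (stochastic greedy hill climb)."""
    best = seed_seq
    for _ in range(rounds):
        best = max(propose_mutants(best) + [best], key=predicted_affinity)
    return best

seed_seq = "EVQLVESGGGLV"
optimized = greedy_optimize(seed_seq)
```

In the real workflow the surviving candidates would additionally be filtered on developability metrics before any in vitro testing.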
  • 2024-11-22

    Structure Language Models for Protein Conformation Generation

    • generative methods
    • language models
    • non-antibody stuff
    • Novel model for sampling the structural space of proteins - applied to nanobodies.
    • Protein structures are encoded into latent tokens using a discrete variational auto-encoder (dVAE), which captures residue-level local geometric features in a roto-translation invariant manner.
    • The model combines a dVAE encoder-decoder with a Structure Language Model (SLM) to model sequence-to-structure relationships.
    • One can sample alternate conformations by providing an amino acid sequence as input, using the SLM to generate latent tokens representing potential conformations, and decoding these latent tokens back into 3D structures with the dVAE decoder to obtain diverse structural ensembles.
    • Altogether, though they develop ESMdiff in the paper, the core of the message is the structure language within the broader framework of language models, rather than any single method.
    • The model was evaluated on tasks like generating equilibrium dynamics, fold-switching, and intrinsically disordered proteins, using metrics like JS-divergence, TM-score, and RMSD. It outperformed existing methods in speed and accuracy, generating structures 20-100× faster.
    • Novel method for antibody library design.
    • Given a structural complex, they employ AntiFold and ProtBERT to explore the ‘fitness’ space and generate a set of mutants.
    • To achieve multi-parameter optimization they use linear programming, rather than the neural nets currently in vogue.
    • The method was not validated experimentally, only via computational proxies.
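Multi-parameter selection as a linear program can be sketched with scipy; the fitness scores, liability penalties, budget, and library size below are invented placeholders (standing in for e.g. AntiFold/ProtBERT-derived scores), and the LP relaxation is thresholded rather than solved as an integer program:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n = 40                        # candidate mutants
fitness = rng.random(n)       # placeholder model-derived fitness per mutant
liability = rng.random(n)     # placeholder developability penalty per mutant

# Maximize total fitness subject to a liability budget and a library of <= 10,
# as an LP relaxation with 0 <= x_i <= 1 (threshold afterwards for a hard pick).
res = linprog(
    c=-fitness,                                # linprog minimizes, so negate
    A_ub=np.vstack([liability, np.ones(n)]),
    b_ub=[5.0, 10.0],
    bounds=[(0.0, 1.0)] * n,
    method="highs",
)
selected = np.where(res.x > 0.5)[0]            # approximate library members
```

The appeal of the LP formulation is that constraints (liability budgets, library size, composition rules) are explicit and the optimum is globally certified, in contrast to neural-net-based samplers.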