They built a siamese EGNN: one branch takes the wild type (WT) and the other the mutant, and the difference between the two outputs is the ddG prediction.
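The siamese set-up can be sketched as below. This is only an illustration of the shared-branch idea, not the authors' EGNN: `score_complex` and the residue weights are hypothetical stand-ins for the network, and the sign convention (mutant minus WT) is an assumption.

```python
# Hypothetical per-residue weights standing in for a trained network.
HYPOTHETICAL_WEIGHTS = {"A": 0.1, "W": 0.9, "D": -0.4, "K": 0.3}

def score_complex(interface_residues):
    """Shared branch: map a complex (here just residue letters) to a scalar."""
    return sum(HYPOTHETICAL_WEIGHTS.get(r, 0.0) for r in interface_residues)

def predict_ddg(wt_residues, mutant_residues):
    """Siamese prediction: mutant branch output minus WT branch output."""
    return score_complex(mutant_residues) - score_complex(wt_residues)

wt = "AWDK"
mut = "AADK"  # W -> A substitution
forward = predict_ddg(wt, mut)
reverse = predict_ddg(mut, wt)
```

Because both branches share the same weights, swapping WT and mutant flips the sign of the prediction by construction.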
They used the AB-Bind dataset, which consists of 645 mutants from 29 complexes.
They created a set of non-redundant antibody-antigen binders with 1475 complexes. They imposed 70% clustering on antigens.
They mutated one complex per cluster and ran FoldX, resulting in 942,723 FoldX ddG data points.
On the AB-Bind dataset they achieve a Pearson correlation of 0.8; however, when they impose stringent CDR cutoffs the correlation drops dramatically, indicating overtraining.
When they train on the synthetic dataset, the model stops being sensitive to overtraining.
Using AF2 they developed a pipeline to fold and dock proteins simultaneously. The pipeline shows good performance in distinguishing interacting and non-interacting proteins.
Acceptable models are those with DockQ > 0.23. Success rate is defined as percentage of acceptable poses.
The best version of their model achieves a success rate of 39.4%.
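The success-rate definition above amounts to a simple count. A minimal sketch, using made-up DockQ values purely for illustration:

```python
ACCEPTABLE_DOCKQ = 0.23  # threshold quoted in these notes

def success_rate(dockq_scores, threshold=ACCEPTABLE_DOCKQ):
    """Percentage of models whose DockQ is strictly above the threshold."""
    if not dockq_scores:
        raise ValueError("need at least one DockQ score")
    acceptable = sum(1 for s in dockq_scores if s > threshold)
    return 100.0 * acceptable / len(dockq_scores)

example_scores = [0.05, 0.31, 0.23, 0.80, 0.10]  # made-up values
rate = success_rate(example_scores)
```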
AlphaFold2 outperforms other docking methods.
Using the number of Cβ atoms in contact (within 8 Å) or the pLDDT of the interface gives a ROC AUC of around 0.9 for distinguishing interacting from non-interacting proteins.
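A minimal sketch of the contact-count feature, assuming Cβ coordinates are already extracted as (x, y, z) tuples. Counting atom pairs (rather than unique residues) is an assumption; the paper's exact definition may differ.

```python
import math

def count_interface_contacts(cb_chain_a, cb_chain_b, cutoff=8.0):
    """Count C-beta pairs, one atom from each chain, closer than `cutoff` angstroms."""
    return sum(
        1
        for a in cb_chain_a
        for b in cb_chain_b
        if math.dist(a, b) < cutoff
    )

# Toy coordinates along the x axis, for illustration only.
chain_a = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
chain_b = [(5.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
n_contacts = count_interface_contacts(chain_a, chain_b)
```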
As input, they insert a chain break of 200 residues between the two chains so that the complex can be modeled as a single sequence.
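One way to read the chain break is as a jump in the residue numbering between the two chains. The sketch below assumes that interpretation; the helper name and the exact offset convention are hypothetical.

```python
def joined_residue_index(len_a, len_b, gap=200):
    """Residue indices for two chains fed as one sequence, with a numbering gap between them."""
    index_a = list(range(len_a))
    # Chain B continues where chain A stopped, shifted by `gap` positions.
    index_b = [len_a + gap + i for i in range(len_b)]
    return index_a + index_b

idx = joined_residue_index(3, 2)
```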
They note that it is very important to create the right MSAs for AF2.
As negative cases for interactions (non-interacting proteins) they employ data from Negatome.
They draw on the MaSIF method in that they define a triangular mesh. Each vertex is encoded with physicochemical information and then each patch of a defined radius is encoded numerically.
They train overlapping patches to have similar embeddings, as these are assumed to have overlapping functions as well.
They employed contrastive learning, annotating a pair of patches as positive if their center vertices were within 1.5 Å of each other and as negative if the centers were at least 5 Å apart.
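The pair-labelling rule can be sketched as follows. Treating the thresholds as distances between patch centers, and leaving pairs in between unlabelled, are assumptions made for this illustration.

```python
import math

POSITIVE_MAX = 1.5  # angstroms, from the notes
NEGATIVE_MIN = 5.0  # angstroms, from the notes

def label_patch_pair(center_a, center_b):
    """Return 1 for a positive pair, 0 for a negative pair, None if the pair is unused."""
    d = math.dist(center_a, center_b)
    if d <= POSITIVE_MAX:
        return 1
    if d >= NEGATIVE_MIN:
        return 0
    return None  # ambiguous distance: assumed to be left out of training

origin = (0.0, 0.0, 0.0)
```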
Their learned similarity distances cluster by curvature, hydropathy and charge.
They compared SurfaceID to structure-based similarity measures, with SurfaceID performing slightly better.
They clustered the antibody-epitope patches simultaneously. The clustering recovered the binding modes for HIV-1 gp120, two for influenza HA and one for SARS-CoV-2 RBD. The anti-HA clusters shared the same epitope but had different paratopes, showing that the algorithm distinguishes at that level.
They proposed a design scheme for antibodies: look for similar epitopes using SurfaceID and use the corresponding antibodies as putative binders to the query.
Uses a siamese network and SAbDab to predict antibody-antigen binding in a binary fashion.
They clustered the antigens at 0.9 sequence identity. They assumed that similar antibodies from the same antigen group bind in the same manner. This resulted in 3,892 antigen pairs.
They also created a dataset of COVID-specific antibodies with 9,309 positive samples and 1,710 negatives.
They used the CKSAAP encoding, but compared it against others such as one-hot, PSSM or their own trained word2vec.
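For reference, CKSAAP (composition of k-spaced amino acid pairs) counts, for each gap k, how often each ordered pair of residues occurs k positions apart, normalised by the number of windows. A minimal sketch; the choice of `max_gap` is an assumption, not the value used in the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs

def cksaap(sequence, max_gap=3):
    """Return 400 * (max_gap + 1) pair frequencies for gaps k = 0..max_gap."""
    features = []
    for k in range(max_gap + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n_windows = len(sequence) - k - 1
        for i in range(max(n_windows, 0)):
            pair = sequence[i] + sequence[i + k + 1]
            if pair in counts:  # skip non-standard residues
                counts[pair] += 1
        total = max(n_windows, 1)  # avoid division by zero for short sequences
        features.extend(counts[p] / total for p in PAIRS)
    return features

vec = cksaap("ACAC", max_gap=1)
```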
They benchmark the different encodings and models to show that CKSAAP + CNN comes out on top.
Their siamese CNN with CKSAAP achieves a staggering PR AUC of 0.85.
A mildly flexible docking tool that runs very fast compared to traditional docking methods.
They used the DIPS dataset of about 42,000 binary complexes from the PDB.
They represent proteins as graphs. Nodes carry the ESM2 650M embeddings; edges carry the distances alongside the orientation distributions from trRosetta.
The graph module serves as input to the structural module that, similar to AF2, performs recycling of the rotation of the two proteins.
The number of trainable parameters is 4.3M.
Losses are taken from AF2-Multimer: FAPE, lDDT-Cα and structure violation loss.
DockQ score of 0.23 is seen as a successful dock.
The method runs in seconds, which is significantly faster than typical docking methods.
Though faster, it does not perform better than traditional docking methods.
They show that GeoDock can induce minimal backbone movement even though its training data consists exclusively of bound protein complexes. The predicted structures closely resemble the initial unbound structures.
Tour de force on the impact of NGS sequencing depth and clustering on picking hits from display campaigns. A repository of information for a detailed walk-through of a display campaign.
Altogether, NGS is a big help compared with random colony picking: better binders can be produced, with larger epitope diversity.
Campaigns against three (related) antigens: SARS-CoV-2 trimer protein, monomeric S1 and RBD.
They define a set of research questions on the relation between NGS statistics and kinetics, e.g. does higher frequency in NGS correlate with higher affinity?
Sequences for VH and VL of 200 unique antibodies were synthesized, cloned into expression vectors for mammalian IgG, and subsequently expressed and purified as complete IgG molecules. Out of the 200 antibodies tested, 169 (84.5%) exhibited affinities <1 µM for RBD, S1, or the trimer. The selection of these 200 distinct antibody sequences was based on 57 well-defined clusters, which were identified at the convergence of three target populations (41 clusters), exclusive to either the S1 (1 cluster) or RBD (1 cluster) populations, or originating from 14 clusters derived from the trimer NGS population. The selection criteria considered the most abundant representative per cluster, regardless of whether they intersected with S1 or RBD.
Clustering methods used were 100% identity, clonotyping and their own unsupervised clustering (AbScan).
AbScan is based on an in-house unsupervised method.
The AbScan clustering method typically results in higher diversity relative to traditional clonotyping.
The abundance of the top representative in each AbScan cluster gives best correlation to binding affinity.
One of the chief advantages of clustering is identifying sequences within a cluster of interest that have a lower number of liabilities.
They trained an XGBoost model on NGS statistics etc. to discriminate binders from non-binders (though the dataset is very small).
Same cluster = same epitope
They employ an abundance threshold of 0.005% on concatenated CDRs as a basic way to discriminate binders from non-binders.
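A minimal sketch of such an abundance filter, assuming read counts keyed by concatenated CDR sequence. The sequence labels and counts are made up, and whether the threshold is inclusive is an assumption.

```python
def relative_abundance_percent(read_counts):
    """Convert raw read counts into percentages of the total."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads")
    return {seq: 100.0 * n / total for seq, n in read_counts.items()}

def putative_binders(read_counts, threshold_percent=0.005):
    """Keep sequences whose relative abundance reaches the threshold (inclusive, by assumption)."""
    abundance = relative_abundance_percent(read_counts)
    return {seq for seq, pct in abundance.items() if pct >= threshold_percent}

# Made-up counts: the second sequence is a singleton among 30,000 reads (~0.0033%).
counts = {"CDRSET_A": 29999, "CDRSET_B": 1}
binders = putative_binders(counts)
```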
AbScan can be described (at a high level) as follows: they use an unsupervised machine learning approach to cluster specific regions of interest, such as HCDR3. The clustering is based on various sequence-related properties and NGS statistics (including relative abundance and round-to-round enrichment) for different regions of interest (HCDR3, HCDR3 + LCDR3, concatenated CDRs), and employs diverse algorithms (such as the Elbow method, Ordering Points to Identify the Clustering Structure (OPTICS), physicochemical reduction of the amino acid space, traditional clonotyping, and Levenshtein distance (LD)).
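Of the ingredients listed, the Levenshtein distance is the easiest to show in isolation. The sketch below is the standard dynamic-programming edit distance, not AbScan itself, which is an in-house method; the CDR-like strings in the usage line are made up.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

d = levenshtein("ARDYW", "ARDFW")  # made-up HCDR3-like strings
```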
An update on SPACE1, employing ABodyBuilder2. Better coverage of the structural method.
They used binders against coronavirus, Ebola and lysozyme, among others.
Structures are modeled using ABodyBuilder2. Structures are sorted by CDR lengths, frameworks are aligned on Cα atoms, and RMSD is calculated for the CDR loops. Finally, a clustering algorithm is applied.
The clustering algorithms benchmarked were DBSCAN, OPTICS-xi, OPTICS-DBSCAN, K-means, Butina clustering, greedy clustering.
Two more variants were developed, SPACE2-HC, for heavy chains only as well as SPACE2-Paratope, for paratyping.
Two accuracy metrics were used: the fraction of epitope-consistent clusters (number of epitope-consistent multiple-occupancy clusters / number of multiple-occupancy clusters) and the fraction of clustered antibodies in epitope-consistent clusters (number of antibodies in epitope-consistent multiple-occupancy clusters / number of antibodies in multiple-occupancy clusters).
Two coverage metrics were used: the number of multiple-occupancy clusters and the number of antibodies in multiple-occupancy clusters. To examine accuracy and coverage with one measure, they calculated the number of antibodies in consistent multiple-occupancy clusters.
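The accuracy and coverage metrics can be computed as in the sketch below. It assumes each cluster is given as a list of epitope labels (one per antibody) and that "epitope-consistent" means all members share one label; the example labels are made up.

```python
def cluster_metrics(clusters):
    """clusters: list of lists of epitope labels, one inner list per cluster."""
    multi = [c for c in clusters if len(c) > 1]            # multiple-occupancy clusters
    consistent = [c for c in multi if len(set(c)) == 1]    # all members share an epitope
    n_multi_abs = sum(len(c) for c in multi)
    n_consistent_abs = sum(len(c) for c in consistent)
    return {
        "frac_consistent_clusters": len(consistent) / len(multi) if multi else 0.0,
        "frac_abs_in_consistent": n_consistent_abs / n_multi_abs if n_multi_abs else 0.0,
        "n_multi_clusters": len(multi),
        "n_abs_in_multi": n_multi_abs,
        "n_abs_in_consistent": n_consistent_abs,
    }

example = [["E1", "E1", "E1"], ["E1", "E2"], ["E3"], ["E2", "E2"]]
metrics = cluster_metrics(example)
```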
They selected agglomerative clustering as the best; it is not better than OPTICS-xi, but it provides larger clusters.
SPACE2 using all loops was better than SPACE2-HC or SPACE2-Paratope.
SPACE2 improves the coverage over SPACE1, thanks to the ABodyBuilder2 protocol.
SPACE2 increases coverage with respect to just clonotyping, but clonotyping remains much more accurate.
The ClusPro server with the AbEMap module for epitope mapping. It employs homology modeling if the antibody structure is unavailable and predicts epitopes by ranking the most common contacting residues in its docking poses.
For epitope prediction, the 1,000 structures are used to calculate the frequency of each antigen surface atom’s occurrence in the antibody–antigen interface. To map an epitope, AbEMap defines the atomic epitope likelihood score as the Boltzmann weighted atomic interface occurrence frequency averaged over the ensemble of antibody structures.
If the structure of the antibody is not known, it is modeled using homology methods, with completion by MODELLER.
They count atoms within 5 Å as being in contact.
The epitope frequency/energy scores are calculated for each atom.
To assess the precision of epitope prediction, they transform atom likelihoods into residue likelihoods through the summation of atomic contributions attributed to each residue. While incorporating atomic likelihood values implies that larger residues with a greater number of surface-accessible atoms receive higher scores, it's important to note that the residue likelihood values remain unadjusted for size, and therefore, users may need to address this potential bias.
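The atom-to-residue summation, and one possible way a user might correct for residue size, can be sketched as below. Dividing by the number of contributing atoms is an assumption of this example, not something AbEMap does; residue names and scores are made up.

```python
from collections import defaultdict

def residue_likelihoods(atom_scores, normalize_by_atom_count=False):
    """atom_scores: iterable of (residue_id, atom_likelihood) pairs."""
    totals = defaultdict(float)
    n_atoms = defaultdict(int)
    for residue_id, score in atom_scores:
        totals[residue_id] += score
        n_atoms[residue_id] += 1
    if normalize_by_atom_count:
        # Hypothetical size correction: mean atomic likelihood per residue.
        return {r: totals[r] / n_atoms[r] for r in totals}
    return dict(totals)

atoms = [("TRP45", 0.2), ("TRP45", 0.2), ("TRP45", 0.2), ("GLY46", 0.5)]
summed = residue_likelihoods(atoms)
averaged = residue_likelihoods(atoms, normalize_by_atom_count=True)
```

In this toy case the larger residue ranks first under plain summation but second after the size correction.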
AbEMap gets F1 ~0.2 for the top 10 residues ranked by score.
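A minimal sketch of a top-k F1 evaluation, assuming residue scores in a dictionary and the true epitope as a set of residue identifiers (both made up here):

```python
def top_k_f1(residue_scores, true_epitope, k=10):
    """F1 between the k highest-scoring residues and the true epitope residues."""
    ranked = sorted(residue_scores, key=residue_scores.get, reverse=True)[:k]
    predicted = set(ranked)
    true_positives = len(predicted & set(true_epitope))
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(true_epitope)
    return 2 * precision * recall / (precision + recall)

scores = {"R1": 0.9, "R2": 0.8, "R3": 0.1, "R4": 0.05}
f1 = top_k_f1(scores, {"R1", "R3"}, k=2)
```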
Antibody diversity treatise, arguing that the repertoire cannot possibly be ‘that big’; rather, there is some as yet unknown commonality across independent repertoires.
The human body has 10^11 B-cells.
B-cells are produced at a rate of 10^9 per day, but the majority are removed due to self-reactivity etc.
The potential naive antibody repertoire is estimated at 10^15.
The number of pathogenic species thought to be infectious for humans has been estimated at ~1400.
It would not be feasible for an organism to go through 10^15 possible antibodies in mounting an immune response.
The author suggests that the antibody repertoire is highly redundant.
The author suggests that N individuals have different but significantly overlapping fractions, M1-n.
The author suggests that one should identify the convergent motifs responsible for responses.
They benchmarked a range of experimental and computational measures to see which ones correlated with therapeutics progressing through clinical trials.
Table 2 - cheat sheet of experimental methods and what they do.
Some of the experimental results are highly correlated with one another (The retention time on the FcRn column was highly correlated with the affinity-capture self-interaction nanoparticle spectroscopy (AC-SINS), polyspecificity reagent (PSR), clone self-interaction using bio-layer interferometry (CSI) and cross-interaction chromatography (CIC) assays, which constituted one of the polyspecificity clusters in our prior work)
For each experimental assay they update the 90% intervals with respect to their previous experimental recommendations.
Table 4 shows the descriptors that need to be calculated for developability assessments.
Some of the in silico metrics also correlate with one another, forming conceptual groups (e.g. charge calculations).
They notice a slight trend for mAbs that progress in trials to have fewer violations of their experimental descriptors than those that regressed.
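Counting violations per antibody can be sketched as below. The assay names, limits and directions are hypothetical placeholders, not the thresholds from the paper.

```python
def count_violations(measurements, thresholds):
    """thresholds: name -> (limit, direction), where direction is 'max' or 'min'."""
    violations = 0
    for name, value in measurements.items():
        if name not in thresholds:
            continue  # no recommended interval for this descriptor
        limit, direction = thresholds[name]
        if direction == "max" and value > limit:
            violations += 1
        elif direction == "min" and value < limit:
            violations += 1
    return violations

# Hypothetical limits: assay_a should stay below 10, assay_b above 60.
hypothetical_thresholds = {"assay_a": (10.0, "max"), "assay_b": (60.0, "min")}
antibody = {"assay_a": 12.0, "assay_b": 65.0, "assay_c": 1.0}
n_flags = count_violations(antibody, hypothetical_thresholds)
```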