AbDiver – a tool for exploring the natural sequence diversity of antibodies for therapeutic design

Executive summary: AbDiver is an online platform that helps scientists navigate the vast NGS datasets on antibody mutations. By letting users draw parallels between natural and therapeutic antibodies, it supports and accelerates decision-making in the rational design of therapeutics at the lead optimization stage, shortening the therapeutic pipeline.

Note: This article covers content published in Jakub Młokosiewicz, Piotr Deszyński, Wiktoria Wilman, Igor Jaszczyszyn, Rajkumar Ganesan, Aleksandr Kovaltsuk, Jinwoo Leem, Jacob Galson, Konrad Krawczyk. AbDiver – A tool to explore the natural antibody landscape to aid therapeutic design. bioRxiv 2021.11.03.467080; doi: https://doi.org/10.1101/2021.11.03.467080.

Introduction

To design therapeutic antibodies, scientists need to harness the broad sequence diversity of these molecules. Studying the antibody mutational diversity is easier thanks to the hundreds of millions of human antibody sequences from next-generation sequencing (NGS) repositories.

By contrasting a query antibody sequence to naturally observed diversity in similar antibody sequences from NGS, researchers can build a mutational roadmap for antibody engineers designing biotherapeutics.

Problem

Due to the sheer volume of NGS data, generating such insights for a single sequence is a computational challenge. Ad-hoc investigation of antibody diversity is mostly limited by time-consuming bioinformatic activities.

Solution

To address this issue, we created AbDiver, a tool that allows scientists to contrast their query sequences with those available in NGS repertoires.

We mapped out the three most common use cases:

  1. Contrasting the query antibody to positional variability extracted from multiple independent studies,
  2. Finding close variable region matches to the query variable region sequence,
  3. Identifying matching CDRH3s or clonotypes (combination of CDRH3 and germline gene).

Next, we benchmarked AbDiver on a set of 742 therapeutic antibodies. For each use case, our system retrieved relevant results for the majority of query sequences.

AbDiver provides an accessible way to quickly compare any therapeutic query sequence against a natural reference, offering insights into sequence selection and rational design of therapeutic antibodies.

Implementation

Data

We employed the publicly curated B-cell receptor (BCR) NGS datasets from the Observed Antibody Space as underlying data. We identified 81 studies in total, and we used the IMGT numbering throughout the study for consistency.

The foundation of our calculations was the set of unique variable region sequences: 906,933,358 sequences in total, without counting redundancies between the datasets. We will update AbDiver as more datasets become available.

We used a set of 742 therapeutic antibodies for benchmarking purposes.

V-region profiling service

We developed the V-region natural profiling service to annotate a query variable region sequence with the naturally-observed positional amino acid frequency statistics (independent of study-specific biases).

We developed the diversity profiles at the fundamental level of complexity: the allelic combination of an organism (V-gene and J-gene alleles). To reflect the ongoing effort in allele annotation and to account for the granularity of annotations, we also calculated gene-based profiles.

Every IMGT position in each profile includes statistics from amino acid frequencies calculated for each study separately. An amino acid positional frequency was incorporated only if it was based on at least 100 observations at a given position.

For each position, we also calculated the study-specific Shannon entropy and the ranks of the amino acids. We then annotated each IMGT position of a query sequence that shares the germline genes of a given profile with the rank and entropy of its amino acid, averaged over the values from individual studies. That way, we mitigated the effects of different numbers of sequences, techniques, and disease states contributed by different studies.
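The per-position calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not AbDiver's actual code: the function names, the input layout (a mapping from study name to the residues observed at one IMGT position), the base-2 logarithm, the alphabetical tie-breaking of ranks, and the choice to skip studies where the query residue was never observed are all assumptions made for the example. Only the 100-observation threshold and the averaging across studies come from the text.

```python
import math
from collections import Counter

MIN_OBSERVATIONS = 100  # threshold stated in the article


def position_stats(residues):
    """Entropy and amino acid ranks for one IMGT position in one study.

    Returns None when the study has fewer than MIN_OBSERVATIONS residues.
    """
    total = len(residues)
    if total < MIN_OBSERVATIONS:
        return None
    freqs = {aa: n / total for aa, n in Counter(residues).items()}
    # Shannon entropy in bits (log base 2 is an assumption).
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    # Rank 1 = most frequent; ties broken alphabetically (assumption).
    ordered = sorted(freqs, key=lambda aa: (-freqs[aa], aa))
    ranks = {aa: i + 1 for i, aa in enumerate(ordered)}
    return entropy, ranks


def annotate_position(query_aa, residues_by_study):
    """Average the entropy and the rank of query_aa over qualifying studies."""
    entropies, ranks = [], []
    for residues in residues_by_study.values():
        stats = position_stats(residues)
        if stats is None:
            continue  # too few observations in this study
        entropy, study_ranks = stats
        entropies.append(entropy)
        if query_aa in study_ranks:  # unseen residues are skipped (assumption)
            ranks.append(study_ranks[query_aa])

    def mean(values):
        return sum(values) / len(values) if values else None

    return {"mean_entropy": mean(entropies), "mean_rank": mean(ranks)}


# Hypothetical data: three studies, one of them below the threshold.
studies = {
    "study_A": ["S"] * 50 + ["T"] * 50,
    "study_B": ["S"] * 75 + ["T"] * 25,
    "study_C": ["S"] * 10,
}
result = annotate_position("T", studies)
```

Averaging per-study values, rather than pooling all sequences first, keeps a single very large study from dominating the annotation.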

What’s more, clicking on an annotated position shows a box plot of the frequency distribution, which indicates the frequencies of amino acids common across the independent studies.

Sequence retrieval service

We developed antibody-specific indexes for full variable-region sequences and CDRs separately to improve the identification of close-sequence and clonotype matches in NGS.

Variable sequence matches are identified among similar sequences whose CDR1 and CDR2 have the same lengths as those of the query, with one residue discrepancy allowed for CDR-H3. This was designed to identify only very high-quality matches while still allowing us to recapitulate the results of our previous study where such strict constraints were not imposed (Krawczyk et al., 2019).
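As a rough illustration of these constraints, the sketch below filters candidate sequences by their CDRs. It reads the rule as "CDR1, CDR2, and CDR3 lengths equal, and at most one substituted residue in CDR3", which matches the benchmark description of up to one amino acid mismatch in CDR3; that reading, the dictionary layout, and the function names are assumptions for the example rather than AbDiver's implementation.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def passes_cdr_constraints(query: dict, candidate: dict,
                           max_cdr3_mismatches: int = 1) -> bool:
    """Check the length and CDR3 mismatch constraints for one candidate.

    Both arguments are hypothetical dicts with keys "cdr1", "cdr2", "cdr3".
    """
    # All three CDRs must have the same length as in the query (assumed reading).
    for cdr in ("cdr1", "cdr2", "cdr3"):
        if len(query[cdr]) != len(candidate[cdr]):
            return False
    # CDR3 may differ from the query by at most one residue.
    return hamming_distance(query["cdr3"], candidate["cdr3"]) <= max_cdr3_mismatches


# Made-up sequences, for illustration only.
query = {"cdr1": "GFTFSSYA", "cdr2": "ISGSGGST", "cdr3": "ARDYYGSSYFDY"}
candidates = [
    {"cdr1": "GYTFTSYA", "cdr2": "ISGSGGST", "cdr3": "ARDYYGSSYFDV"},  # 1 CDR3 mismatch
    {"cdr1": "GFTFSSYA", "cdr2": "ISGSGGST", "cdr3": "ARDWYGSSYFDV"},  # 2 CDR3 mismatches
    {"cdr1": "GFTFSSY", "cdr2": "ISGSGGST", "cdr3": "ARDYYGSSYFDY"},   # shorter CDR1
]
kept = [c for c in candidates if passes_cdr_constraints(query, c)]
```

A cheap length-and-mismatch filter like this would run before any full sequence identity calculation, so that only plausible candidates are aligned.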

Variable sequence matches are IMGT-aligned and presented as a multiple sequence alignment using JSAV. The best CDR3 matches are pairwise aligned with the Biopython library to achieve an ad-hoc optimal alignment.

Benchmarks

1. Profiling benchmark

We benchmarked the allele profiling on the number of therapeutic antibodies for which we could find suitable profiles.

Allele-based profiles

  • We could find 688/738 (93.22%) heavy chains with profiles exceeding 10,000 sequences
  • We could find 486/707 (68.74%) light chains with profiles exceeding 10,000 sequences


Gene-based profiles

  • We could find 699/738 (94.71%) heavy chains with profiles exceeding 10,000 sequences
  • We could find 496/707 (70.15%) light chains with profiles exceeding 10,000 sequences


These results show that AbDiver can find profiles with sufficient sequences to inform on the mutational landscape of a given sequence across independent studies for the majority of therapeutic antibodies.

2. Variable region sequence retrieval benchmark


We also benchmarked AbDiver's ability to retrieve high-sequence-identity matches (>90% sequence identity) from NGS repositories.

In our previous study, we demonstrated that for 90 therapeutic heavy chains and 158 therapeutic light chains, we could find NGS matches with better than 90% sequence identity (Krawczyk et al., 2019), but without strict CDR-H3 length constraints.

Our current AbDiver variable region search allows only up to one amino acid mismatch in CDR3.

Using the more restrictive service, we found:

  • 189/738 (25.60%) heavy chains with more than 90% sequence identity, including 4/738 (0.54%) perfect matches (Dusigitumab, Edrecolomab, Melredableukin, and Zanolimumab).
  • 288/707 (40.73%) light chains with a sequence identity of 90% or more, including 50/707 (7.07%) perfect matches.

What about the chains for which AbDiver couldn’t find any matches? We confirmed that the differences stemmed from CDR3 length discrepancies. So, despite the restrictive length constraints, which produce more relevant results, AbDiver identified most of the high-quality matches.

3. CDR3 or clonotype retrieval benchmark

We benchmarked the ability of AbDiver’s CDR3/clonotype service to retrieve CDR3-germline combinations with CDR3 sequence identity above 80% (the mark for length-normalized clonotype studies).
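To make the thresholds concrete, here is a minimal sketch of how a CDR3 identity could be computed and binned. It assumes identity is the fraction of identical positions between equal-length CDR3s, that a clonotype match requires the same V gene, and that NGS records arrive as (V gene, CDR3) pairs; these choices, the function names, and the gene and sequence values are illustrative assumptions, not the service's actual logic.

```python
def cdr3_identity(a: str, b: str) -> float:
    """Fraction of identical positions; 0.0 if lengths differ (assumption)."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_clonotype_identity(query_v_gene, query_cdr3, ngs_records):
    """Highest CDR3 identity among NGS records that share the query V gene."""
    return max(
        (cdr3_identity(query_cdr3, cdr3)
         for v_gene, cdr3 in ngs_records
         if v_gene == query_v_gene),
        default=0.0,
    )


def classify(identity: float) -> str:
    """Place an identity into the tightest bin it satisfies."""
    if identity == 1.0:
        return "perfect"
    if identity > 0.9:
        return ">90%"
    if identity > 0.8:
        return ">80%"
    return "no match"


# Made-up records, for illustration only.
records = [
    ("IGHV3-23", "ARDYYGSSYFDV"),  # same V gene, 11/12 identical
    ("IGHV1-69", "ARDYYGSSYFDY"),  # identical CDR3 but different V gene
    ("IGHV3-23", "ARDWWGSSYFDV"),  # same V gene, 9/12 identical
]
best = best_clonotype_identity("IGHV3-23", "ARDYYGSSYFDY", records)
label = classify(best)
```

Note that the reported counts are cumulative: a chain with a match above 90% also counts toward the above-80% total, whereas `classify` returns only the tightest bin.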

Here’s what we found:

  • For 409/686 (59.62%) therapeutic heavy chains with unique CDRH3s, we found matches with more than 80% sequence identity, and for 172/686 (25.07%), matches with more than 90%.
  • In 35/686 (5.10%) instances, a combination of V gene and CDRH3 could be matched to an identical CDRH3 and V gene in an NGS sequence.
  • For 384/573 (67.01%) therapeutic light chains with unique CDRL3s, we found matches with more than 80% sequence identity, and for 279/573 (48.69%), matches with more than 90%, including 244/573 (42.58%) perfect matches in V gene and CDRL3 sequence.


AbDiver successfully found a high number of relevant matches for most of the therapeutics in our dataset.

Results


AbDiver facilitates the navigation of the antibody mutational space available via NGS studies. We believe that the solution will enable drawing parallels between natural and therapeutic antibodies.

The identification of alternative mutations and similar naturally-observed sequences can be used to support the decision-making process in the rational design of therapeutics at the lead optimization stage.

Want to achieve a similar result? Take a look at AbStudio, a solution that allows teams to create, collate, and discover antibody-specific datasets to accelerate research decision-making.