AbNGS Database

Large database of immunoglobulin sequences for immunological and machine learning applications.

Introduction

Immunoglobulins are a highly versatile class of proteins, with an estimated theoretical diversity close to 10^18. Next-generation sequencing studies now allow us to sample portions of this diversity, creating large datasets that facilitate novel immunological studies as well as machine learning applications such as training large language models.

We automatically mined the Sequence Read Archive repository for depositions containing immunoglobulin sequences. We found more than 220 bioprojects across multiple disease states. We annotated the ~11,000 biosamples associated with these bioprojects and processed the sequences using a uniform pipeline. Of the 220+ bioprojects, more than 130 contain human immunoglobulin sequences, and we make these available in a public version accompanying our manuscript.

Database statistics

130+

source bioprojects

~60GB

database size in .fasta

~3B

number of unique sequences

The entire database of 220 bioprojects in .airr format was more than 1 TB compressed, making it impractical for most applications. To facilitate public access, we make a subset of human bioprojects available to accompany our publication, with sequences and metadata annotations in .fasta format. This public dataset contains approximately 3 billion unique sequences, comprising heavy and light chain sequences, and is a more manageable 60 GB in size.

Each sequence entry in the FASTA file carries the following metadata in its header line:

  • V region call (detected by us)

  • J region call (detected by us)

  • Isotype (detected by us)

  • CDR-H3 sequence according to IMGT

  • Whether a large part of fw1 is missing (True if the fw1 length is less than 20)

  • Redundancy of the sequence within bioproject

  • Bioproject-annotated isotype info

  • Bioproject-annotated disease state

  • Bioproject-annotated B-cell type

  • Internal bioproject index to distinguish between studies

For instance, consider the following header:

IGHV4-59*05,IGHJ4*02,IGHM,ALTWIQLWLAPHSFDY,True,1.0,None,healthy,naive,30

Here the V call is IGHV4-59*05, the J call is IGHJ4*02, the isotype we detected is IGHM, the CDR-H3 sequence is ALTWIQLWLAPHSFDY, the fw1 is largely incomplete (True), there was only one copy of the sequence in the bioproject (1.0), there was no bioproject isotype annotation (None), the disease state was "healthy", the B-cell type was "naive", and the bioproject index is 30. A sample of the data is given here:

Download sample data
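The header layout described above can be read with a few lines of Python. This is a minimal sketch, not the authors' tooling: it assumes the ten fields always appear in the listed order, that no field contains a comma, that a missing annotation is written as the literal string None, and that the header line may start with the usual FASTA ">" character. The names AbNGSHeader and parse_header are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

FIELD_COUNT = 10  # number of comma-separated fields listed for the header


@dataclass
class AbNGSHeader:
    """One parsed header line (hypothetical container, not an official schema)."""
    v_call: str
    j_call: str
    isotype: Optional[str]
    cdr_h3: str
    fw1_incomplete: bool
    redundancy: float
    annotated_isotype: Optional[str]
    disease_state: Optional[str]
    b_cell_type: Optional[str]
    bioproject_index: int


def _none_if_missing(value: str) -> Optional[str]:
    # Assumption: a missing annotation is stored as the literal text "None".
    return None if value == "None" else value


def parse_header(line: str) -> AbNGSHeader:
    """Split one header line into typed fields."""
    # Assumption: a leading ">" may be present, as in standard FASTA files.
    fields = line.strip().lstrip(">").split(",")
    if len(fields) != FIELD_COUNT:
        raise ValueError(f"expected {FIELD_COUNT} fields, got {len(fields)}")
    v, j, iso, cdr3, fw1, redundancy, ann_iso, disease, cell, index = fields
    return AbNGSHeader(
        v_call=v,
        j_call=j,
        isotype=_none_if_missing(iso),
        cdr_h3=cdr3,
        fw1_incomplete=(fw1 == "True"),
        redundancy=float(redundancy),
        annotated_isotype=_none_if_missing(ann_iso),
        disease_state=_none_if_missing(disease),
        b_cell_type=_none_if_missing(cell),
        bioproject_index=int(index),
    )


# The example header from the text above.
header = parse_header(
    "IGHV4-59*05,IGHJ4*02,IGHM,ALTWIQLWLAPHSFDY,True,1.0,None,healthy,naive,30"
)
print(header.v_call, header.cdr_h3, header.b_cell_type)
```

Raising an error on an unexpected field count is a deliberate choice here: if a disease-state annotation ever did contain a comma, the mismatch would surface immediately rather than silently shifting the fields.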

B-cell types

We distinguish the following B-cell types in our metadata:

  • Naive: B-cells that have not previously been exposed to an antigen

  • Memory: B-cells produced in response to T-cell-dependent antigens. They enable recognition of previously encountered antigens and trigger secondary immune responses

  • Pre: the last stage of B-cell development, just before the naive B-cell

  • Plasmablast: immature, short-lived effector cells that are capable of antibody production, but in smaller amounts than plasma cells

  • Germline center (germinal center): B-cells that are the source of high-affinity, class-switched antibodies

  • Regulatory: B-cells that take part in immune suppression and immunomodulation. They act by secreting anti-inflammatory cytokines (IL-10, TGF-beta, etc.) and producing Granzyme B

  • Plasma cells: long-lived B-cells that produce antibodies after exposure to a specific antigen

  • Follicular: freely recirculating B-cells that home to the lymphoid follicles in the secondary lymphoid organs

  • ProB - PreB: the first stage of B-cell development just after CLP (common lymphoid progenitor)

  • Transitional: bone marrow-derived, immature B-cells, which are also considered to be precursors of mature B-cells

  • Marginal zone: B-cells occurring in the marginal (peripheral) zone of the lymph node

  • Mature: B-cell that has been exposed to an antigen
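To see how the B-cell types listed above are represented in a downloaded file, one could tally the B-cell type field across all header lines. The sketch below is illustrative only: it assumes header lines start with ">" as in standard FASTA, that the B-cell type is the ninth comma-separated field (as in the header example), and that labels are stored as they appear in that example (lowercase "naive"). The function name count_b_cell_types is hypothetical, and the toy records are invented for the demonstration, not taken from the database.

```python
import io
from collections import Counter
from typing import Iterable

B_CELL_TYPE_FIELD = 8  # zero-based position of the B-cell type in the header
FIELD_COUNT = 10       # number of comma-separated header fields


def count_b_cell_types(lines: Iterable[str]) -> Counter:
    """Count entries per B-cell type, reading the file line by line."""
    counts: Counter = Counter()
    for line in lines:
        # Assumption: only header lines begin with ">"; sequence lines are skipped.
        if not line.startswith(">"):
            continue
        fields = line[1:].strip().split(",")
        if len(fields) != FIELD_COUNT:
            continue  # skip headers that do not match the assumed layout
        counts[fields[B_CELL_TYPE_FIELD]] += 1
    return counts


# Toy input: headers modelled on the documented example, with placeholder
# sequences. These records are made up for illustration.
toy_fasta = io.StringIO(
    ">IGHV4-59*05,IGHJ4*02,IGHM,ALTWIQLWLAPHSFDY,True,1.0,None,healthy,naive,30\n"
    "EVQLVESGGGLVQPGG\n"
    ">IGHV4-59*05,IGHJ4*02,IGHM,ALTWIQLWLAPHSFDY,False,2.0,None,healthy,naive,30\n"
    "EVQLVESGGGLVQPGG\n"
    ">IGHV4-59*05,IGHJ4*02,IGHM,ALTWIQLWLAPHSFDY,False,1.0,None,healthy,memory,30\n"
    "EVQLVESGGGLVQPGG\n"
)

counts = count_b_cell_types(toy_fasta)
print(counts.most_common())
```

Because the function consumes any iterable of lines, an open file handle can be passed in place of the toy input, which avoids loading a multi-gigabyte file into memory.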

Citing this work

We make the public version of the AbNGS database available as a companion to our paper on convergence between natural and therapeutic antibodies. If you use this work, please cite the following paper:

Large-scale data mining of four billion human antibody variable regions reveals convergence between therapeutic and natural antibodies that constrains search space for biologics drug discovery.

Pawel Dudzic, Dawid Chomicz, Jarosław Kończak, Tadeusz Satława, Bartosz Janusz, Sonia Wrobel, Tomasz Gawłowski, Igor Jaszczyszyn, Weronika Bielska, Samuel Demharter, Roberto Spreafico, Lukas Schulte, Kyle Martin, Stephen R. Comeau, Konrad Krawczyk

Under review