PairedNGS Database

Large database of paired heavy and light chains for machine learning and data mining applications.

Introduction

Next-generation sequencing (NGS) produces copious amounts of sequence data that deepen our understanding of biological processes and allow for the development of ever more efficient modeling techniques. Antibodies are composed of heavy and light chains, but the first NGS technologies applied to these molecules produced exclusively unpaired entries. The largest existing repositories of antibody sequences hold billions of entries, but these are unpaired (see naturalantibody.com/ngs).

To address this gap, we developed the PairedNGS dataset, a large repository of natural paired antibodies drawn from a heterogeneous set of studies.

Database statistics

58 bioprojects

2 organisms (human, mouse)

~7M paired sequences

Our dataset currently covers 58 bioprojects. We divided the dataset into a .fasta version and an .airr version. The .fasta version is designed to be lightweight and to give users an immediate entry point to our data. The .airr version is designed to meet the community standard requirements.

.fasta dataset

There is one .fasta file for each independent study, and its name gives the organism and the bioproject ID. For instance, human_PRJNA1024473.fasta contains human sequences from bioproject PRJNA1024473. Each entry in the .fasta file has a header with the following metadata:

  • Bioproject ID

  • SRA ID

  • Organism

  • Barcode

  • Heavy gene calls

  • Kappa gene calls

  • CDR-H3 sequence

  • CDR-L3 sequence

The sequence record gives the heavy chain amino acid sequence first, followed by the separator ’/’ and then the light chain amino acid sequence.

An example of a sequence entry formatted in this fashion is given below:

>PRJNA1024473|SRR26292911|human|ATGCTACGTAAGTAGT|IGHV1-69*06,IGHD1-14*01,IGHJ5*02,IGHA1|IGKV1D-39*01,IGKJ2*01,IGKC|ASKGSLVSHYFDP|QQSYSTPTYT
QVQLVQSGAEVKKPGSSVKVSCKASGDTFSSYGITWVRQAPGQGLEWMGGIIPMFGTTNYARKFQGRVTITADKSTSTAYMELSSLRSDDTAVYYCASKGSLVSHYFDPWGQGTLVTVSS/DIQMTQSPSSLSASVGDRVTITCRASQNIISYLNWYQQKPGKAPKLLIHAASTLQSGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQSYSTPTYTFGQGTKLEIK
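
The file naming convention and entry layout described above can be unpacked with a few lines of Python. The sketch below is illustrative only: it assumes the header fields are pipe-separated in the order seen in the example entry, that gene calls within a field are comma-separated, and that the header and the sequence are available as two separate strings. The names parse_study_filename, parse_paired_record and PairedRecord are hypothetical and are not part of any PairedNGS tooling.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Tuple

# Field order inferred from the example entry; this is an assumption,
# not a formal specification of the header.
HEADER_FIELDS = (
    "bioproject", "sra", "organism", "barcode",
    "heavy_genes", "light_genes", "cdr_h3", "cdr_l3",
)


@dataclass
class PairedRecord:
    bioproject: str
    sra: str
    organism: str
    barcode: str
    heavy_genes: List[str]
    light_genes: List[str]
    cdr_h3: str
    cdr_l3: str
    heavy_aa: str
    light_aa: str


def parse_study_filename(filename: str) -> Tuple[str, str]:
    """Split '<organism>_<bioproject>.fasta' into (organism, bioproject)."""
    organism, _, bioproject = Path(filename).stem.partition("_")
    return organism, bioproject


def parse_paired_record(header: str, sequence: str) -> PairedRecord:
    """Parse one header line and its 'heavy/light' amino acid sequence line."""
    parts = header.strip().lstrip(">").split("|")
    if len(parts) != len(HEADER_FIELDS):
        raise ValueError(f"expected {len(HEADER_FIELDS)} header fields, got {len(parts)}")
    meta = dict(zip(HEADER_FIELDS, parts))
    # Gene calls are comma-separated within their field.
    meta["heavy_genes"] = meta["heavy_genes"].split(",")
    meta["light_genes"] = meta["light_genes"].split(",")
    # The heavy chain comes first, then '/', then the light chain.
    heavy_aa, sep, light_aa = sequence.strip().partition("/")
    if not sep:
        raise ValueError("sequence line has no '/' separator")
    return PairedRecord(**meta, heavy_aa=heavy_aa, light_aa=light_aa)


example_header = (
    ">PRJNA1024473|SRR26292911|human|ATGCTACGTAAGTAGT|"
    "IGHV1-69*06,IGHD1-14*01,IGHJ5*02,IGHA1|IGKV1D-39*01,IGKJ2*01,IGKC|"
    "ASKGSLVSHYFDP|QQSYSTPTYT"
)
example_sequence = (
    "QVQLVQSGAEVKKPGSSVKVSCKASGDTFSSYGITWVRQAPGQGLEWMGGIIPMFGTTNYARKFQGRVTITADKS"
    "TSTAYMELSSLRSDDTAVYYCASKGSLVSHYFDPWGQGTLVTVSS/"
    "DIQMTQSPSSLSASVGDRVTITCRASQNIISYLNWYQQKPGKAPKLLIHAASTLQSGVPSRFSGSGSGTDFTLTI"
    "SSLQPEDFATYYCQQSYSTPTYTFGQGTKLEIK"
)

record = parse_paired_record(example_header, example_sequence)
organism, bioproject = parse_study_filename("human_PRJNA1024473.fasta")
```

A quick sanity check on any parsed entry is that each CDR3 from the header occurs inside the corresponding chain, as it does for the example: record.cdr_h3 is a substring of record.heavy_aa, and record.cdr_l3 of record.light_aa.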

.airr dataset

The .airr dataset adheres to the AIRR Community data standard. The paired dataset contains additional fields, described in Table 1 below.

Table 1. PairedNGS database in the .airr format, together with extra fields.

Name                         Type    Definition
sra                          string  NCBI identifier of the read run
scheme                       string  The name of the numbering scheme applied.
v_frame                      int     V gene reading frame offset from v_alignment_start. Could be [0,1,2].
j_frame                      int     J gene reading frame offset from j_alignment_start. Could be [0,1,2].
sequence_aa_scheme_cigar     string  Alignment of the amino acid sequence to the scheme of choice specified in CIGAR format.
scheme_residue_mapping       string  Base64 encoded json object containing scheme to residue mapping.
positional_scheme_mapping    string  Base64 encoded json object containing amino acid sequence positional to scheme mapping.
additional_validation_flags  string  Base64 encoded json object containing results from validations performed by RIOT.
exc                          string  Exception (if any) thrown during numbering.
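
Several of the extra fields are stored as Base64 encoded JSON or as CIGAR strings, so they need a decoding step before use. The following is a minimal sketch, assuming the standard Base64 alphabet, UTF-8 encoded JSON, and conventional CIGAR syntax (a run length followed by a one-letter operation). The helper names and the sample payload are made up for illustration; the actual keys and values in the dataset may look different.

```python
import base64
import json
import re
from typing import Any, List, Optional, Tuple


def decode_b64_json(value: str) -> Optional[Any]:
    """Decode a Base64 encoded JSON string; return None for an empty field."""
    if not value:
        return None
    return json.loads(base64.b64decode(value).decode("utf-8"))


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as '3M1I2M' into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([A-Za-z=])", cigar)]


# Hypothetical payload, used only to demonstrate the round trip.
sample_mapping = {"1": "Q", "2": "V", "3": "Q"}
encoded = base64.b64encode(json.dumps(sample_mapping).encode("utf-8")).decode("ascii")

decoded = decode_b64_json(encoded)
operations = parse_cigar("3M1I2M")
```

The same decode_b64_json helper would apply to scheme_residue_mapping, positional_scheme_mapping and additional_validation_flags, since Table 1 describes all three as Base64 encoded JSON objects.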

Data availability & access

We make the dataset available to non-profit organizations for non-commercial purposes. Other organizations and usage types are kindly asked to contact us.

Citing this work

We make the PairedNGS database available as a companion to our paper, which will be posted here once it clears review.

PairedNGS - Large dataset of paired heavy and light chains for machine learning and data mining applications.

Under review