PairedAbNGS
Database

Conserved heavy/light contacts and germline preferences revealed by a large-scale analysis of natively paired human antibody sequences and structural data

Introduction

Next-generation sequencing (NGS) is producing copious amounts of sequence data that deepen our understanding of biological processes and allow the development of ever more efficient modeling techniques. Antibodies are composed of heavy and light chains, yet the first NGS technologies applied to these molecules produced exclusively unpaired entries. The largest existing repositories of antibody sequences hold billions of entries, but these are unpaired (see naturalantibody.com/ngs).

To address this gap, we developed the PairedAbNGS dataset, a large repository of natural paired antibodies drawn from a heterogeneous set of studies.

Database statistics

58 bioprojects
2 organisms (human, mouse)
~7M paired sequences

Our dataset currently covers 58 bioprojects. We provide it in two versions: .fasta and .airr. The .fasta version is designed to be lightweight and to give users an immediate entry point to our data. The .airr version is designed to fulfill the community standard requirements.

.fasta dataset

There is one .fasta file for each independent study, named after the organism and the bioproject ID. For instance, human_PRJNA1024473.fasta contains human sequences from PRJNA1024473. Each entry in the .fasta file has a header with the following metadata:

Bioproject ID
SRA ID
Organism
Barcode
Heavy gene calls
Kappa gene calls
CDR-H3 sequence
CDR-L3 sequence

The sequence record gives the heavy chain amino acid sequence first, followed by the separator '/' and then the light chain amino acid sequence.

An example of a sequence entry formatted in this fashion is given below:

>PRJNA1024473|SRR26292911|human|ATGCTACGTAAGTAGT|IGHV1-69*06,IGHD1-14*01,IGHJ5*02,IGHA1|IGKV1D-39*01,IGKJ2*01,IGKC|ASKGSLVSHYFDP|QQSYSTPTYT
QVQLVQSGAEVKKPGSSVKVSCKASGDTFSSYGITWVRQAPGQGLEWMGGIIPMFGTTNYARKFQGRVTITADKSTSTAYMELSSLRSDDTAVYYCASKGSLVSHYFDPWGQGTLVTVSS/DIQMTQSPSSLSASVGDRVTITCRASQNIISYLNWYQQKPGKAPKLLIHAASTLQSGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQSYSTPTYTFGQGTKLEIK
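The entry above could be read along the following lines. This is a minimal Python sketch, not an official parser: it assumes the header fields are separated by '|' and appear in the order seen in the example, that gene calls within a field are comma-separated, and that the header and the sequence are passed in as two separate strings. The names PairedEntry, HEADER_FIELDS and parse_entry are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Field order is an assumption read off the example header, not a published spec.
HEADER_FIELDS = (
    "bioproject", "sra", "organism", "barcode",
    "heavy_genes", "light_genes", "cdr_h3", "cdr_l3",
)


@dataclass
class PairedEntry:
    bioproject: str
    sra: str
    organism: str
    barcode: str
    heavy_genes: list   # e.g. V, D, J and constant gene calls
    light_genes: list
    cdr_h3: str
    cdr_l3: str
    heavy: str          # heavy chain amino acid sequence
    light: str          # light chain amino acid sequence


def parse_entry(header: str, sequence: str) -> PairedEntry:
    """Split one header/sequence pair into its metadata and the two chains."""
    parts = header.strip().lstrip(">").split("|")
    if len(parts) != len(HEADER_FIELDS):
        raise ValueError(f"expected {len(HEADER_FIELDS)} header fields, got {len(parts)}")
    meta = dict(zip(HEADER_FIELDS, parts))

    # The heavy chain comes first, then '/', then the light chain.
    heavy, sep, light = sequence.strip().partition("/")
    if not sep:
        raise ValueError("sequence record has no '/' separator")

    meta["heavy_genes"] = meta["heavy_genes"].split(",")
    meta["light_genes"] = meta["light_genes"].split(",")
    return PairedEntry(heavy=heavy, light=light, **meta)


# Usage with the example entry shown above.
header = (
    ">PRJNA1024473|SRR26292911|human|ATGCTACGTAAGTAGT|"
    "IGHV1-69*06,IGHD1-14*01,IGHJ5*02,IGHA1|IGKV1D-39*01,IGKJ2*01,IGKC|"
    "ASKGSLVSHYFDP|QQSYSTPTYT"
)
sequence = (
    "QVQLVQSGAEVKKPGSSVKVSCKASGDTFSSYGITWVRQAPGQGLEWMGGIIPMFGTTNYARKFQGRVTITADKSTSTAYMELSSLRSDDTAVYYCASKGSLVSHYFDPWGQGTLVTVSS"
    "/"
    "DIQMTQSPSSLSASVGDRVTITCRASQNIISYLNWYQQKPGKAPKLLIHAASTLQSGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQSYSTPTYTFGQGTKLEIK"
)
entry = parse_entry(header, sequence)
print(entry.organism, entry.heavy_genes[0], entry.light_genes[0])
```

A quick sanity check on a parsed entry is that the CDR-H3 string occurs in the heavy chain and the CDR-L3 string in the light chain, as is the case for the example.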

.airr dataset

The .airr dataset adheres to the community standard, as defined here. The paired dataset contains the additional fields described in Table 1 below.

Table 1. PairedAbNGS database in the .airr format, together with extra fields.

Name	Type	Definition
sra	string	NCBI identifier of the read run
scheme	string	The name of the numbering scheme applied.
v_frame	int	V gene reading frame offset from v_alignment_start. One of 0, 1, or 2.
j_frame	int	J gene reading frame offset from j_alignment_start. One of 0, 1, or 2.
sequence_aa_scheme_cigar	string	Alignment of the amino acid sequence to the chosen numbering scheme, in CIGAR format.
scheme_residue_mapping	string	Base64-encoded JSON object containing the scheme-to-residue mapping.
positional_scheme_mapping	string	Base64-encoded JSON object containing the mapping from amino acid sequence position to scheme position.
additional_validation_flags	string	Base64-encoded JSON object containing results from validations performed by RIOT.
exc	string	Exception (if any) thrown during numbering.
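Several of the extra fields in Table 1 are strings that wrap structured data. The Python sketch below shows one way they might be unpacked. It assumes standard (not URL-safe) Base64 over UTF-8 JSON, and the usual CIGAR run-length syntax of a count followed by a single operation character. The helper names and the sample payload are hypothetical; the actual keys and operations present in the dataset are not specified here.

```python
import base64
import json
import re


def decode_b64_json(value: str):
    """Decode a Base64-encoded JSON field such as scheme_residue_mapping."""
    return json.loads(base64.b64decode(value).decode("utf-8"))


# One CIGAR run: a positive count followed by one operation character.
CIGAR_RUN = re.compile(r"(\d+)([A-Za-z=])")


def parse_cigar(cigar: str):
    """Split a CIGAR string into (count, operation) runs."""
    runs = [(int(count), op) for count, op in CIGAR_RUN.findall(cigar)]
    # Reject strings that contain anything other than well-formed runs.
    if "".join(f"{count}{op}" for count, op in runs) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return runs


# Hypothetical payload, encoded here only to demonstrate the round trip.
payload = {"1": "Q", "2": "V", "3": "Q"}
encoded = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")

decoded = decode_b64_json(encoded)
runs = parse_cigar("26M1D5M")  # made-up CIGAR value for illustration
print(decoded, runs)
```

Decoding lazily, only for the rows and fields that are actually needed, keeps memory use low when scanning large files.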

Data availability & access

We make the dataset available to non-profit organizations for non-commercial purposes. Other organizations, and those with other usage types, are kindly asked to contact us.

.fasta version dataset .airr version dataset

Citing this work

We make the PairedAbNGS database available as a companion to our paper, which will be posted here once it clears review.

PairedAbNGS - Conserved heavy/light contacts and germline preferences revealed by a large-scale analysis of natively paired human antibody sequences and structural data.

Under review
