Large database of paired heavy and light chains for machine learning and data mining applications.
Introduction
Next-generation sequencing (NGS) is producing copious amounts of sequence data that deepen our understanding of biological processes and allow the development of ever more efficient modeling techniques. Antibodies are composed of heavy and light chains, but the first NGS technologies applied to these molecules produced exclusively unpaired entries. The largest existing repositories of antibody sequences hold billions of entries, but these are unpaired (see naturalantibody.com/ngs).
To address this gap, we developed the PairedNGS dataset, a large repository of natural paired antibodies from a heterogeneous set of studies.
58 bioprojects
2 organisms (human, mouse)
~7M paired sequences
Our dataset currently covers 58 bioprojects. We provide it in two versions: .fasta and .airr. The .fasta version is designed to be lightweight and to give users an immediate entry point to the data. The .airr version is designed to fulfill the community standard requirements.
.fasta dataset
There is one .fasta file for each independent study, named after the organism and the bioproject ID. For instance, human_PRJNA1024473.fasta contains human sequences from PRJNA1024473. Each entry in the .fasta file has a header with the following metadata:
Bioproject ID
SRA ID
Organism
Barcode
Heavy gene calls
Kappa gene calls
CDR-H3 sequence
CDR-L3 sequence
The sequence record gives the heavy chain amino acid sequence first, followed by the separator '/' and then the light chain amino acid sequence.
An example of a sequence entry formatted in this fashion is given below:
>PRJNA1024473|SRR26292911|human|ATGCTACGTAAGTAGT|IGHV1-69*06,IGHD1-14*01,IGHJ5*02,IGHA1|IGKV1D-39*01,IGKJ2*01,IGKC|ASKGSLVSHYFDP|QQSYSTPTYT
QVQLVQSGAEVKKPGSSVKVSCKASGDTFSSYGITWVRQAPGQGLEWMGGIIPMFGTTNYARKFQGRVTITADKSTSTAYMELSSLRSDDTAVYYCASKGSLVSHYFDPWGQGTLVTVSS/DIQMTQSPSSLSASVGDRVTITCRASQNIISYLNWYQQKPGKAPKLLIHAASTLQSGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQSYSTPTYTFGQGTKLEIK
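As an illustration of how such an entry might be consumed, the following Python sketch splits a file name into organism and bioproject ID and splits one entry into its header fields and two chains. It assumes that header fields are separated by '|', that gene calls are comma-separated, and that the fields appear in the order seen in the example entry. The names `PairedRecord`, `parse_filename` and `parse_record` are hypothetical helpers introduced here, not part of any PairedNGS tooling.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class PairedRecord:
    bioproject: str
    sra: str
    organism: str
    barcode: str
    heavy_genes: list[str]
    light_genes: list[str]
    cdr_h3: str
    cdr_l3: str
    heavy_aa: str
    light_aa: str


def parse_filename(name: str) -> tuple[str, str]:
    """Split '<organism>_<bioproject>.fasta' into (organism, bioproject)."""
    organism, bioproject = Path(name).stem.split("_", 1)
    return organism, bioproject


def parse_record(header: str, sequence: str) -> PairedRecord:
    """Parse one entry; the field order is assumed from the example entry."""
    fields = header.strip().lstrip(">").split("|")
    if len(fields) != 8:
        raise ValueError(f"expected 8 header fields, got {len(fields)}")
    chains = sequence.strip().split("/")
    if len(chains) != 2:
        raise ValueError("expected exactly one '/' between heavy and light chain")
    bioproject, sra, organism, barcode, heavy, light, cdr_h3, cdr_l3 = fields
    return PairedRecord(
        bioproject, sra, organism, barcode,
        heavy.split(","), light.split(","),
        cdr_h3, cdr_l3, chains[0], chains[1],
    )


# The example entry from this page.
HEADER = (
    ">PRJNA1024473|SRR26292911|human|ATGCTACGTAAGTAGT|"
    "IGHV1-69*06,IGHD1-14*01,IGHJ5*02,IGHA1|IGKV1D-39*01,IGKJ2*01,IGKC|"
    "ASKGSLVSHYFDP|QQSYSTPTYT"
)
SEQUENCE = (
    "QVQLVQSGAEVKKPGSSVKVSCKASGDTFSSYGITWVRQAPGQGLEWMGGIIPMFGTTNYARKFQGRVTITADKS"
    "TSTAYMELSSLRSDDTAVYYCASKGSLVSHYFDPWGQGTLVTVSS"
    "/"
    "DIQMTQSPSSLSASVGDRVTITCRASQNIISYLNWYQQKPGKAPKLLIHAASTLQSGVPSRFSGSGSGTDFTLTI"
    "SSLQPEDFATYYCQQSYSTPTYTFGQGTKLEIK"
)

record = parse_record(HEADER, SEQUENCE)
print(parse_filename("human_PRJNA1024473.fasta"), record.cdr_h3, record.cdr_l3)
```

Both CDR3 sequences from the header occur inside their respective chains, which makes a cheap sanity check when loading a file.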
.airr dataset
The .airr dataset adheres to the community standards as defined here. The paired dataset contains the additional fields described in Table 1 below.
Table 1. PairedNGS database in the .airr format, together with extra fields.
Name | Type | Definition |
---|---|---|
sra | string | NCBI identifier of the read run |
scheme | string | The name of the numbering scheme applied. |
v_frame | int | V gene reading frame offset from v_alignment_start. Could be [0,1,2]. |
j_frame | int | J gene reading frame offset from j_alignment_start. Could be [0,1,2]. |
sequence_aa_scheme_cigar | string | Alignment of the amino acid sequence to the scheme of choice specified in CIGAR format. |
scheme_residue_mapping | string | Base64 encoded json object containing scheme to residue mapping. |
positional_scheme_mapping | string | Base64 encoded json object containing amino acid sequence positional to scheme mapping. |
additional_validation_flags | string | Base64 encoded json object containing results from validations performed by RIOT. |
exc | string | Exception (if any) thrown during numbering. |
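Three of the extra fields are Base64 encoded JSON objects, so they need decoding before use. The Python sketch below shows one way to do that. It assumes the files are tab-separated (as AIRR rearrangement files usually are), that the encoding is standard Base64 over UTF-8 JSON, and that empty values mean "not set". The helper names are hypothetical, and the demo payload is invented because the internal structure of the JSON objects is not described here.

```python
import base64
import csv
import io
import json

# Extra fields that Table 1 describes as Base64 encoded JSON objects.
BASE64_JSON_FIELDS = (
    "scheme_residue_mapping",
    "positional_scheme_mapping",
    "additional_validation_flags",
)
INT_FIELDS = ("v_frame", "j_frame")


def decode_field(value):
    """Decode one Base64 encoded JSON value; empty values become None."""
    if not value:
        return None
    return json.loads(base64.b64decode(value))


def decode_row(row: dict) -> dict:
    """Return a copy of a row with the extra fields converted to Python types."""
    decoded = dict(row)
    for name in BASE64_JSON_FIELDS:
        if name in decoded:
            decoded[name] = decode_field(decoded[name])
    for name in INT_FIELDS:
        if decoded.get(name) not in (None, ""):
            decoded[name] = int(decoded[name])
    return decoded


def read_airr(handle):
    """Yield decoded rows from an open tab-separated file (assumed layout)."""
    for row in csv.DictReader(handle, delimiter="\t"):
        yield decode_row(row)


# Demo on an in-memory file. The mapping {"1": "Q"} is an invented payload.
payload = base64.b64encode(json.dumps({"1": "Q"}).encode("utf-8")).decode("ascii")
demo = io.StringIO(
    "sra\tv_frame\tscheme_residue_mapping\texc\n"
    f"SRR26292911\t0\t{payload}\t\n"
)
rows = list(read_airr(demo))
print(rows[0]["scheme_residue_mapping"])
```

Rows with a non-empty `exc` value record an exception thrown during numbering, so filtering on that column before decoding the mappings may be a sensible first step.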
We make the dataset available to non-profit organizations for non-commercial purposes. Other organizations and usage types are kindly asked to contact us.
Citing this work
We make the PairedNGS database available as a companion to our paper, which will be posted here once it clears review.
PairedNGS - Large dataset of paired heavy and light chains for machine learning and data mining applications.
Under review