This article is based on findings from Konrad Krawczyk, Andrew Buchanan & Paolo Marcatili (2021) Data mining patented antibody sequences, mAbs, 13:1, DOI: 10.1080/19420862.2021.1892366
Patent documents offer a glimpse into the past 30 years of antibody engineering geared toward developing monoclonal antibody therapeutics. The information in patents is potentially valuable for antibody design. However, patents are written to provide legal protection, not to communicate scientific knowledge.
Can antibody data from patent documents be helpful in sharing engineering know-how, or is it just a legal reference?
To answer this question, we quantified the number of antibody sequences in patents filed for medicinal purposes and checked how well they reflect the primary sequences of therapeutic antibodies in clinical use.
Our analysis of 245,109 antibody chains from patents showed that they closely reflect the primary sequences of antibody therapeutics in clinical use. This means that researchers can find therapeutically relevant information in patents, provided they can identify and extract the pertinent data points.
Accessing information about antibodies held in patent literature is challenging. Sequences are buried within documents, which makes it hard for researchers to quickly check whether sequences similar to the ones they are working on have already been developed by another entity.
To address this challenge, we developed the Patented Antibody Database: a collection of antibody data from patent documents encompassing major sources such as USPTO and WIPO.
Our patent database covers c. 250,000 sequences from c. 19,000 documents. We linked each antibody sequence to the text metadata of the document from which it originated, such as the patent title or abstract. This speeds up text-based searches for sequences associated with specific biological entities.
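To illustrate why linking sequences to document metadata helps, here is a minimal Python sketch of a text-based search over such records. It is not the database's actual schema or interface: the `PatentSequence` fields, the `search_by_text` helper, the document identifiers, and the sequence fragments are all hypothetical placeholders made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatentSequence:
    """One antibody chain plus metadata of its source document (hypothetical schema)."""
    sequence: str   # placeholder amino-acid fragment, not a real patented sequence
    chain: str      # "heavy" or "light"
    patent_id: str  # made-up document identifier
    title: str
    abstract: str


def search_by_text(records, term):
    """Return records whose title or abstract contains `term`, ignoring case."""
    needle = term.lower()
    return [
        r for r in records
        if needle in r.title.lower() or needle in r.abstract.lower()
    ]


# Illustrative records only; every value below is invented for this sketch.
RECORDS = [
    PatentSequence("EVQLVESGGGLVQ", "heavy", "DOC-001",
                   "Antibodies binding PD-1",
                   "Placeholder abstract about anti-PD-1 antibodies."),
    PatentSequence("DIQMTQSPSSLSA", "light", "DOC-001",
                   "Antibodies binding PD-1",
                   "Placeholder abstract about anti-PD-1 antibodies."),
    PatentSequence("QVQLQESGPGLVK", "heavy", "DOC-002",
                   "Anti-TNF antibody compositions",
                   "Placeholder abstract about TNF-alpha binding."),
]

if __name__ == "__main__":
    for hit in search_by_text(RECORDS, "PD-1"):
        print(hit.patent_id, hit.chain, hit.sequence)
```

Because each sequence carries the title and abstract of its source document, a keyword such as a target name is enough to pull out the associated chains without opening the documents themselves. A real search would likely need synonym handling for target names, which this sketch leaves out.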