Executive summary: The protein folding problem recently obtained a practical solution in AlphaFold2 and advances in deep learning. Some protein classes - for example, antibodies - are structurally unique, so the general solution still needs to be improved - especially around the prediction of the CDR-H3 loop, which is an instrumental part of an antibody in its antigen recognition abilities.
In this publication, we hypothesised that pre-training the network on a large, augmented set of models with correct physical geometries, rather than a small set of real antibody X-ray structures, would allow the network to learn better bond geometries. We show that fine-tuning such a pre-trained network on a task of shape prediction on real X-ray structures improves the number of correct peptide bond distances. We further demonstrate that pre-training allows the network to produce physically plausible shapes on an artificial set of CDR-H3s, showing the ability to generalise to the vast antibody sequence space.
We hope that our strategy will benefit the development of deep-learning antibody models that rapidly generate physically plausible geometries without the burden of time-consuming energy minimization.
Full publication:
Jarosław Kończak, Bartosz Janusz, Jakub Młokosiewicz, Tadeusz Satława, Sonia Wróbel, Paweł Dudzic, Konrad Krawczyk. Structural pre-training improves physical accuracy of antibody structure prediction using deep learning.
doi: https://doi.org/10.1101/2022.12.06.519288
The publication addresses the problem of improving antibody structure prediction by generating more physically plausible structures directly from the network.
With antibody diversity estimated at 1018 unique molecules, the publicly available 6,500 redundant experimentally resolved antibody structures is a relatively small sample. Most of the models that tackled antibody structure prediction were trained on such a small publicly available dataset of antibody structures, and, to the best of our knowledge, only IgFold employed AlphaFold to augment its input dataset. Such a small dataset of antibodies is clearly powerful enough to teach the broad features of an immunoglobulin shape but needs to make the structures physically plausible out-of-the-network.
To address this issue, one may augment the input dataset to expose the network to more correct bond geometries. We demonstrate that pre-training a network on a large, augmented dataset of antibody model structures allows it to learn such physical features.