Unconstrained generation of synthetic antibody-antigen structures to guide machine learning methodology for real-world antibody specificity prediction

Abstract

Abstract Machine learning (ML) is a key technology for accurate prediction of antibody-antigen binding. Two orthogonal problems hinder the application of ML to antibody-specificity prediction and the benchmarking thereof: The lack of a unified ML formalization of immunological antibody specificity prediction problems and the unavailability of large-scale synthetic benchmarking datasets of real-world relevance. Here, we developed the Absolut! software suite that enables parameter-based unconstrained generation of synthetic lattice-based 3D-antibody-antigen binding structures with ground-truth access to conformational paratope, epitope, and affinity. We formalized common immunological antibody specificity prediction problems as ML tasks and confirmed that for both sequence and structure-based tasks, accuracy-based rankings of ML methods trained on experimental data hold for ML methods trained on Absolut!-generated data. The Absolut! framework thus enables real-world relevant development and benchmarking of ML strategies for biotherapeutics design. Graphical abstract The software framework Absolut! enables (A,B) the generation of virtually arbitrarily large numbers of synthetic 3D-antibody-antigen structures, (C,D) the formalization of antibody specificity as machine learning (ML) tasks as well as the exploration of ML strategies for real-world antibody-antigen binding or paratope-epitope prediction. Highlights Software framework Absolut! to generate an arbitrarily large number of synthetic 3D-antibody-antigen structures that contain biological layers of antibody-antigen binding complexity that render ML predictions challenging Immunological antibody specificity prediction problems formalized as machine learning tasks for which the in silico complexes are immediately usable as benchmark datasets Exploration of machine learning prediction accuracy as a function of architecture, dataset size, choice of negatives, and sequence-structure encoding Relative ML performance learnt on Absolut! datasets transfers to experimental datasets

References

Page 1

	Year	Citations

Page 1