LIT-PCBA: An Unbiased Data Set for Machine Learning and Virtual Screening

TLDR

Benchmarking virtual screening methods requires unbiased, realistic data sets, yet commonly used sets such as DUD, DUD-E, and MUV carry chemical biases that lead to overestimated method accuracy. The authors introduce LIT-PCBA, a data set designed for virtual screening and machine-learning studies. It is built from 149 PubChem dose–response bioassays, processed to remove false positives and assay artifacts and to keep actives and inactives within similar molecular property ranges, and restricted to single protein targets with at least one X-ray structure in complex with a ligand of the same phenotype as the PubChem actives. Preliminary screening of the 21 remaining target sets with 2D fingerprint similarity, 3D shape similarity, and molecular docking retained 15 targets for which at least one method enriched the top 1% of ranked compounds in true actives by a factor of at least 2. These sets were then unbiased with asymmetric validation embedding (AVE), giving 15 targets with 7,844 confirmed actives and 407,381 confirmed inactives. The data set's hit rate and potency distribution mimic those of experimental screening decks; it is publicly available at http://drugdesign.unistra.fr/LIT-PCBA.

Abstract

Comparative evaluation of virtual screening methods requires a rigorous benchmarking procedure on diverse, realistic, and unbiased data sets. Recent investigations from numerous research groups unambiguously demonstrate that the artificially constructed ligand sets classically used by the community (e.g., DUD, DUD-E, MUV) suffer from both obvious and hidden chemical biases and therefore overestimate the true accuracy of virtual screening methods. We herewith present a novel data set (LIT-PCBA) specifically designed for virtual screening and machine learning. LIT-PCBA relies on 149 dose–response PubChem bioassays that were additionally processed to remove false positives and assay artifacts and to keep active and inactive compounds within similar molecular property ranges. To ascertain that the data set is suited to both ligand-based and structure-based virtual screening, target sets were restricted to single protein targets for which at least one X-ray structure is available in complex with ligands of the same phenotype (e.g., inhibitor, inverse agonist) as that of the PubChem active compounds. Preliminary virtual screening on the 21 remaining target sets with state-of-the-art orthogonal methods (2D fingerprint similarity, 3D shape similarity, molecular docking) enabled us to select 15 target sets for which at least one of the three screening methods is able to enrich the top 1% of ranked compounds in true actives by at least a factor of 2. The corresponding ligand sets (training, validation) were finally unbiased by the recently described asymmetric validation embedding (AVE) procedure to afford the LIT-PCBA data set, consisting of 15 targets and 7,844 confirmed active and 407,381 confirmed inactive compounds. The data set mimics experimental screening decks in terms of hit rate (ratio of active to inactive compounds) and potency distribution. It is available online at http://drugdesign.unistra.fr/LIT-PCBA for download and for benchmarking novel virtual screening methods, notably those relying on machine learning.
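
The selection criterion quoted in the abstract (enrichment of the top 1% of ranked compounds in true actives by a factor of at least 2) and the hit rate can be illustrated with a short Python sketch. It assumes one common definition of the enrichment factor, namely the fraction of actives in the top-ranked subset divided by the fraction of actives in the whole set; the abstract does not spell out the exact formula the authors used. The function name `enrichment_factor` and the toy scores and labels are hypothetical and serve only as illustration; the only figures taken from the abstract are the 7,844 actives and 407,381 inactives.

```python
from typing import Sequence


def enrichment_factor(scores: Sequence[float],
                      labels: Sequence[int],
                      fraction: float = 0.01) -> float:
    """Enrichment factor in the top `fraction` of a ranked list.

    Assumed definition (not confirmed by the abstract):
    (actives in top subset / size of top subset) /
    (actives in whole set / size of whole set).
    Higher scores are assumed to mean "more likely active".
    """
    if len(scores) != len(labels) or len(scores) == 0:
        raise ValueError("scores and labels must be non-empty and equally long")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_actives == 0:
        raise ValueError("at least one active compound is required")
    # Size of the top-ranked subset, never smaller than one compound.
    n_top = max(1, round(n_total * fraction))
    # Rank compounds by decreasing score and count actives among the top ones.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    return (hits_top * n_total) / (n_top * n_actives)


# Hypothetical toy screen: 1000 compounds, 20 actives, 4 of them in the top 10.
toy_scores = [1000.0 - i for i in range(1000)]
toy_labels = [0] * 1000
for idx in [0, 1, 2, 3] + list(range(100, 116)):
    toy_labels[idx] = 1
toy_ef = enrichment_factor(toy_scores, toy_labels, fraction=0.01)
print(f"toy EF1% = {toy_ef:.1f}")

# Hit rate as defined in the abstract: ratio of active to inactive compounds.
N_ACTIVES = 7844
N_INACTIVES = 407381
hit_rate = N_ACTIVES / N_INACTIVES
print(f"LIT-PCBA active-to-inactive ratio = {hit_rate:.4f}")
```

Under this definition, a random ranking gives an enrichment factor close to 1, so the threshold of 2 corresponds to finding twice as many actives in the top 1% as chance would. With an overall active-to-inactive ratio below 2%, the top 1% of a target set contains only a small number of actives even for a well-performing method.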
