Unsupervised Data Base Clustering Based on Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets

TLDR

Jarvis–Patrick clustering is widely used in the pharmaceutical industry, but its results depend heavily on its two parameters: it tends to produce clusters that are either large and heterogeneous or homogeneous but too small, and it requires time-consuming manual tuning, which hampers the clustering of large data sets. This study proposes an automated algorithm that identifies dense clusters in which the cluster centroid is similar to every other member at or above a specified Tanimoto similarity threshold.

Abstract

One of the most commonly used clustering algorithms within the worldwide pharmaceutical industry is Jarvis–Patrick's (J–P) (Jarvis, R. A. IEEE Trans. Comput. 1973, C-22, 1025–1034). The implementation of J–P under Daylight software, using Daylight's fingerprints and the Tanimoto similarity index, can process sets of 100 k molecules in a few hours. However, the J–P clustering algorithm has several problems that make it difficult to cluster large data sets in a consistent and timely manner. The clusters produced depend strongly on the choice of the two parameters needed to run J–P clustering, so the method tends to produce clusters that are either very large and heterogeneous or homogeneous but too small. In either case, J–P requires time-consuming manual tuning. This paper describes an algorithm that identifies dense clusters in a consistent and automated manner. The similarity within each cluster reflects the Tanimoto value used for the clustering and, more importantly, the cluster centroid is similar to every other molecule in the cluster at or above that Tanimoto value. Throughout this paper, the term similarity refers to the overall similarity between two given molecules, as defined by Daylight's fingerprints and the Tanimoto similarity index.
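
To make the centroid property concrete, the following is a minimal Python sketch, not the paper's actual algorithm. It assumes each fingerprint is given as a set of on-bit indices (real Daylight fingerprints are fixed-length bit strings) and uses the standard Tanimoto coefficient T(A, B) = c / (a + b - c), where a and b are the numbers of bits set in A and B and c is the number of bits set in both. The clustering step is a simple greedy pass in the spirit of sphere-exclusion methods, swapped in purely for illustration; the function names `tanimoto` and `centroid_clusters` and the toy data are hypothetical.

```python
from typing import FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def centroid_clusters(
    fps: List[FrozenSet[int]], threshold: float
) -> List[Tuple[int, List[int]]]:
    """Greedy illustration (an assumption, not the published method).

    Repeatedly pick the unassigned molecule with the most unassigned
    neighbours at or above `threshold`, make it a centroid, and assign
    those neighbours to its cluster. Every member is therefore at least
    `threshold`-similar to its centroid, though not necessarily to
    every other member.
    """
    n = len(fps)
    # Precompute, for each molecule, the set of others within the threshold.
    neighbours = {
        i: {j for j in range(n) if j != i and tanimoto(fps[i], fps[j]) >= threshold}
        for i in range(n)
    }
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Ties are broken by lowest index because max() keeps the first maximum.
        centroid = max(sorted(unassigned), key=lambda i: len(neighbours[i] & unassigned))
        members = (neighbours[centroid] & unassigned) | {centroid}
        clusters.append((centroid, sorted(members)))
        unassigned -= members
    return clusters


# Toy fingerprints (hypothetical data, for illustration only).
fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 3, 4, 5}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 12, 13}),
    frozenset({20, 21}),
]
clusters = centroid_clusters(fps, threshold=0.7)
for centroid, members in clusters:
    print(centroid, members)
```

In this toy set, molecules 0 and 1 have a Tanimoto similarity of only 0.6 to each other, yet both are 0.8-similar to molecule 2, so they share a cluster with molecule 2 as centroid at a threshold of 0.7. The precomputed neighbour sets cost O(n²) comparisons, which is acceptable for a sketch but would need a more careful design for data sets of the size discussed above.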
