Publication | Closed Access
Hashing Algorithms and Data Structures for Rapid Searches of Fingerprint Vectors
Citations: 29
References: 28
Year: 2010
Engineering, Biometric Privacy, Biometrics, Molecular Biology, Information Forensics, Computational Complexity, Bioinformatics Database, Data Structures, Fingerprint Analysis, Signature Vector, String-searching Algorithm, Information Retrieval, Data Science, Data Mining, Pattern Recognition, Perceptual Hashing, Knowledge Discovery, Magnitude Speedup, Hash Function, Computer Science, Bioinformatics, Intersection Bound, Protein Bioinformatics, Cryptography, Rapid Searches, Fingerprint Vectors, Computational Biology, Combinatorial Pattern Matching, Systems Biology, Similarity Search
In many large chemoinformatics database systems, molecules are represented by long binary fingerprint vectors whose components record the presence or absence of particular functional groups or combinatorial features. To speed up database searches, we propose adding to each fingerprint a short integer signature vector of length M. For a given fingerprint, the i-th component of the signature vector counts the number of 1-bits in the fingerprint that fall on components congruent to i modulo M. Given two signatures, we show how to rapidly compute a bound on the Jaccard-Tanimoto similarity of the two corresponding fingerprints, using the intersection bound. These signatures therefore allow one to prune the search space significantly by discarding molecules associated with unfavorable bounds. Analytical methods are developed to predict the resulting amount of pruning as a function of M. Data structures combining different values of M are also developed, together with methods for predicting the optimal values of M for a given implementation. Simulations using a particular implementation show that the proposed approach yields a speedup of one order of magnitude over a linear search and a 3-fold speedup over a previous implementation. All theoretical results and predictions are corroborated by large-scale simulations using molecules from ChemDB. Several possible algorithmic extensions are discussed.
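The signature-and-bound idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`signature`, `tanimoto_upper_bound`, `pruned_search`), the representation of a fingerprint as a set of 1-bit positions, and the threshold-search loop are all assumptions made for illustration. The bound used here follows from the fact that two fingerprints can share at most min(a_i, b_i) 1-bits within residue class i, so the intersection is at most S = sum of min(a_i, b_i), and the Jaccard-Tanimoto similarity is at most S / (A + B - S), where A and B are the total 1-bit counts. The paper's exact formulation of the intersection bound may differ.

```python
from typing import List, Set, Tuple


def signature(on_bits: Set[int], m: int) -> List[int]:
    """Count the 1-bits falling in each residue class modulo m."""
    sig = [0] * m
    for pos in on_bits:
        sig[pos % m] += 1
    return sig


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Exact Jaccard-Tanimoto similarity of two fingerprints (sets of 1-bit positions)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def tanimoto_upper_bound(sig_a: List[int], sig_b: List[int]) -> float:
    """Upper bound on the similarity, computed from the signatures alone."""
    # At most min(a_i, b_i) bits can be shared within residue class i.
    cap = sum(min(x, y) for x, y in zip(sig_a, sig_b))
    denom = sum(sig_a) + sum(sig_b) - cap
    return cap / denom if denom else 1.0


def pruned_search(
    query: Set[int], database: List[Set[int]], m: int, threshold: float
) -> Tuple[List[Tuple[int, float]], int]:
    """Return (hits, number_pruned) for a similarity-threshold search.

    Hypothetical search loop: signatures are recomputed here for brevity,
    whereas a real system would precompute and store them.
    """
    q_sig = signature(query, m)
    hits, pruned = [], 0
    for idx, fp in enumerate(database):
        if tanimoto_upper_bound(q_sig, signature(fp, m)) < threshold:
            pruned += 1  # the exact similarity cannot reach the threshold
            continue
        sim = tanimoto(query, fp)
        if sim >= threshold:
            hits.append((idx, sim))
    return hits, pruned
```

Because the bound never underestimates the true similarity, pruning on it cannot discard a molecule that would have passed the threshold; the choice of M only affects how many exact comparisons are avoided.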