Publication | Open Access
Sensitive clustering of protein sequences at tree-of-life scale using DIAMOND DeepClust
Citations: 22
References: 36
Year: 2023
Venue: Unknown
Keywords: Cluster Computing, Molecular Biology, Diamond Deepclust, Genomics, Sequence Alignment, Gene Recognition, Bioinformatics Database, Protein Sequences, Molecular Ecology, Proteomics, Sensitive Clustering, Protein Universe, Biological Database, Biosphere Genomics Era, Sequence Analysis, Present Diamond Deepclust, Omics, Bioinformatics, Functional Genomics, Protein Bioinformatics, Biology, Natural Sciences, Computational Biology, Systems Biology, Medicine
Abstract: The biosphere genomics era is transforming life science research, but existing methods struggle to efficiently reduce the vast dimensionality of the protein universe. We present DIAMOND DeepClust, an ultra-fast cascaded clustering method optimized to cluster the 19 billion protein sequences currently defining the protein biosphere. As a result, we detect 1.7 billion clusters, of which 32% hold more than one sequence. These 544 million multi-sequence clusters represent 94% of all known proteins, illustrating that clustering across the tree of life can significantly accelerate comparative studies in the Earth BioGenome era.
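The figures in the abstract can be cross-checked with a little arithmetic. The sketch below is not part of the publication; it assumes the rounded values quoted above (19 billion sequences, 1.7 billion clusters, 32%) are exact and that every remaining cluster is a singleton holding exactly one sequence. The variable names are chosen here for illustration only.

```python
# Rounded figures quoted in the abstract (treated as exact for this check).
total_sequences = 19_000_000_000
total_clusters = 1_700_000_000
multi_percent = 32  # share of clusters holding more than one sequence

# Integer arithmetic avoids floating-point rounding in the cluster counts.
multi_clusters = total_clusters * multi_percent // 100
singleton_clusters = total_clusters - multi_clusters

# Assumption: each singleton cluster holds exactly one sequence, so all
# other sequences must sit in the multi-sequence clusters.
covered_sequences = total_sequences - singleton_clusters
coverage = covered_sequences / total_sequences

# Average size of a multi-sequence cluster implied by these figures.
mean_size = covered_sequences / multi_clusters

print(f"multi-sequence clusters: {multi_clusters:,}")
print(f"singleton clusters:      {singleton_clusters:,}")
print(f"coverage:                {coverage:.1%}")
print(f"mean cluster size:       {mean_size:.1f}")
```

Under these assumptions the 544 million figure follows directly from 32% of 1.7 billion, and the coverage comes out close to the 94% stated in the abstract. The implied mean cluster size is only as precise as the rounded inputs.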