Publication | Closed Access
A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining.
Citations: 550 | References: 23 | Year: 1997
Unknown Venue
Topics: Engineering, Clustering Cost Function, Pattern Mining, Unsupervised Machine Learning, Text Mining, Optimization-based Data Mining, Data Science, Data Mining, Pattern Recognition, Biostatistics, Public Health, Statistics, Document Clustering, Homogeneous Clusters, Clustering (Nuclear Physics), Knowledge Discovery, Computer Science, Fast Clustering Algorithm, Evolutionary Data Mining, Frequent Pattern Mining, Classification, Clustering (Data Mining), Large Set, Fuzzy Clustering, Health Informatics, Big Data
Partitioning large sets of objects into homogeneous clusters is a fundamental data-mining operation, yet the efficient k-means algorithm works only on numeric data, which restricts its use on the categorical data sets common in the field. This paper introduces k-modes, an algorithm that extends the k-means paradigm to categorical domains. k-modes replaces cluster means with modes, employs new dissimilarity measures for categorical objects, and updates modes using a frequency-based method to minimize the clustering cost function. On the soybean disease data set, k-modes achieves very good classification performance, and on a health-insurance data set of half a million records and 34 categorical attributes it scales with both the number of clusters and the number of records.
Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. The k-means algorithm is best suited for implementing this operation because of its efficiency in clustering large data sets. However, working only on numeric values limits its use in data mining because data sets in data mining often contain categorical values. In this paper we present an algorithm, called k-modes, to extend the k-means paradigm to categorical domains. We introduce new dissimilarity measures to deal with categorical objects, replace means of clusters with modes, and use a frequency-based method to update modes in the clustering process to minimise the clustering cost function. Tested with the well-known soybean disease data set, the algorithm has demonstrated a very good classification performance. Experiments on a very large health insurance data set consisting of half a million records and 34 categorical attributes show that the algorithm is scalable in terms of both the number of clusters and the number of records.
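The three ingredients named in the abstract (a dissimilarity measure for categorical objects, modes in place of means, and a frequency-based mode update) can be illustrated with a minimal Python sketch. Several things here are assumptions, not details taken from the abstract: the dissimilarity is taken to be a simple count of mismatched attributes, modes are recomputed in batch after each full assignment pass, initial modes are supplied by the caller, and the function names and toy records are hypothetical.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Assumed measure: number of attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Frequency-based update: most frequent category of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(records, initial_modes, max_iter=100):
    """Batch k-modes sketch: assign to nearest mode, then recompute modes."""
    modes = [tuple(m) for m in initial_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each record goes to the mode it mismatches least.
        new_labels = [
            min(range(len(modes)),
                key=lambda j: matching_dissimilarity(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:  # no record changed cluster: converged
            break
        labels = new_labels
        # Update step: replace each mode by the per-attribute modes of its
        # members; an empty cluster keeps its previous mode.
        for j in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:
                modes[j] = column_modes(members)
    # Clustering cost: total mismatches between records and their modes.
    cost = sum(matching_dissimilarity(r, modes[lab])
               for r, lab in zip(records, labels))
    return labels, modes, cost


# Toy categorical records (hypothetical data, three attributes each).
records = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "oval"),
    ("blue", "small", "square"),
]
labels, modes, cost = k_modes(records, [records[0], records[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
print(modes)   # [('red', 'small', 'round'), ('blue', 'large', 'square')]
print(cost)    # 4
```

Each pass touches every record once per cluster, so the work per iteration grows linearly with both the number of records and the number of clusters, which is consistent with the scalability the abstract reports.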