Publication | Closed Access
Model-Based Gaussian and Non-Gaussian Clustering
Citations: 2.4K
References: 30
Year: 1993
Topics: Engineering, Approximate Bayesian Method, Unsupervised Machine Learning, Optimization-based Data Mining, Data Science, Data Mining, Pattern Recognition, Mixture Analysis, Biostatistics, Bayesian Methods, Public Health, Statistics, Document Clustering, Knowledge Discovery, Functional Data Analysis, Non-gaussian Clustering, Gaussian Process, Statistical Inference, Gaussian Distributions, Fuzzy Clustering
Summary: The classification maximum likelihood framework encompasses many clustering algorithms, but as currently implemented it cannot specify which features (orientation, size and shape) are shared across clusters, is restricted to Gaussian distributions, and does not allow for noise. The authors extend the framework on all three counts: they reparameterize the covariance matrix so that some features, but not all, can be held common to all clusters; outline a practical framework for non-Gaussian clustering; and incorporate noise in the form of a Poisson process. They also give an approximate Bayesian method for choosing the number of clusters. Simulations give encouraging results, and on a diabetes data set the results seem better than those of previous analyses.
Abstract: The classification maximum likelihood approach is sufficiently general to encompass many current clustering algorithms, including those based on the sum of squares criterion and on the criterion of Friedman and Rubin (1967). However, as currently implemented, it does not allow the specification of which features (orientation, size and shape) are to be common to all clusters and which may differ between clusters. Also, it is restricted to Gaussian distributions and it does not allow for noise. We propose ways of overcoming these limitations. A reparameterization of the covariance matrix allows us to specify that some features, but not all, be the same for all clusters. A practical framework for non-Gaussian clustering is outlined, and a means of incorporating noise in the form of a Poisson process is described. An approximate Bayesian method for choosing the number of clusters is given. The performance of the proposed methods is studied by simulation, with encouraging results. The methods are applied to the analysis of a data set arising in the study of diabetes, and the results seem better than those of previous analyses. (RH)
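Illustration (not from the paper's text): the abstract does not spell out the reparameterization, but one way to picture separating a cluster's size, shape and orientation is through the eigendecomposition of its covariance matrix. The sketch below is a minimal example under stated assumptions: the function names `decompose_covariance` and `compose_covariance` are hypothetical, and the normalization (size taken as the largest eigenvalue, shape as the eigenvalues divided by it) is one possible convention chosen here for illustration, not necessarily the authors'.

```python
import numpy as np


def decompose_covariance(sigma):
    """Split a symmetric positive-definite covariance matrix into
    (size, shape, orientation).

    Assumed convention (illustrative only):
      size        = largest eigenvalue (a scalar scale factor)
      shape       = eigenvalues divided by the largest, in descending order
      orientation = orthogonal matrix whose columns are the eigenvectors
    """
    sigma = np.asarray(sigma, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(sigma)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]
    size = eigvals[0]
    shape = eigvals / size
    return size, shape, eigvecs


def compose_covariance(size, shape, orientation):
    """Rebuild a covariance matrix from the three components."""
    return size * orientation @ np.diag(shape) @ orientation.T


# Two hypothetical clusters that share shape and orientation but differ in size.
theta = np.pi / 6
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
common_shape = np.array([1.0, 0.25])
sigma_small = compose_covariance(1.0, common_shape, rotation)
sigma_large = compose_covariance(4.0, common_shape, rotation)
```

Under this convention, constraining clusters to share a feature amounts to forcing the corresponding component (size, shape or orientation) to be equal across clusters while the remaining components vary freely.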