Publication | Closed Access

Model-Based Clustering, Discriminant Analysis, and Density Estimation

Citations: 4.2K

References: 127

Year: 2002

TLDR

Cluster analysis automatically searches for groups of related observations, but most existing methods are heuristic and lack systematic guidance for key practical questions such as determining the number of clusters, selecting methods, and handling outliers. The paper reviews a general model‑based clustering methodology that offers a principled statistical framework for addressing these practical questions. The methodology is illustrated with examples from medical diagnosis, minefield detection, noisy data cluster recovery, and spatial density estimation, and its limitations and recent developments for non‑Gaussian, high‑dimensional, large, and Bayesian settings are discussed. The authors demonstrate that model‑based clustering can also be applied to discriminant analysis and multivariate density estimation.

Abstract

Cluster analysis is the automated search for groups of related observations in a dataset. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures, and most clustering methods available in commercial software are also of this type. However, there is little systematic guidance associated with these methods for solving important practical questions that arise in cluster analysis, such as how many clusters are there, which clustering method should be used, and how should outliers be handled. We review a general methodology for model-based clustering that provides a principled statistical approach to these issues. We also show that this can be useful for other problems in multivariate analysis, such as discriminant analysis and multivariate density estimation. We give examples from medical diagnosis, minefield detection, cluster recovery from noisy data, and spatial density estimation. Finally, we mention limitations of the methodology and discuss recent developments in model-based clustering for non-Gaussian data, high-dimensional datasets, large datasets, and Bayesian estimation.
