Publication | Closed Access
Impact of Similarity Measures on Web-page Clustering
Citations: 662
References: 11
Year: 2000
Unknown Venue
Clustering of web documents enables (semi-)automated categorization, and facilitates certain types of search. Any clustering method has to embed the documents in a suitable similarity space. While several clustering methods and the associated similarity measures have been proposed in the past, there is no systematic comparative study of the impact of similarity metrics on cluster quality, possibly because the popular cost criteria do not readily translate across qualitatively different metrics. We observe that in domains such as Yahoo that provide a categorization by human experts, a useful criterion for comparisons across similarity metrics is indeed available. We then compare four popular similarity measures (Euclidean, cosine, Pearson correlation and extended Jaccard) in conjunction with several clustering techniques (random, self-organizing feature map, hyper-graph partitioning, generalized k-means, weighted graph partitioning), on high-dimensional sparse data rep...
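To make the four similarity measures named in the abstract concrete, the following is a minimal sketch using their common textbook forms on dense term-frequency vectors. The function names are hypothetical, and the abstract does not state the exact formulas used in the paper. In particular, converting Euclidean distance into a similarity via exp(-d^2) is an assumption (one common monotone transform), as is using the Tanimoto form for extended Jaccard. Inputs are assumed to be non-zero, non-constant vectors of equal length.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def euclidean_similarity(x, y):
    # Assumption: map squared Euclidean distance to (0, 1] with exp(-d^2).
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2)


def cosine_similarity(x, y):
    # Cosine of the angle between x and y; insensitive to vector length.
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


def pearson_similarity(x, y):
    # Pearson correlation equals the cosine of the mean-centered vectors.
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    xc = [a - mx for a in x]
    yc = [b - my for b in y]
    return cosine_similarity(xc, yc)


def extended_jaccard_similarity(x, y):
    # Tanimoto form: x.y / (|x|^2 + |y|^2 - x.y); reduces to the set-based
    # Jaccard coefficient when x and y are binary vectors.
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)


if __name__ == "__main__":
    # Two toy "documents" as term-frequency vectors; doc_b is doc_a scaled by 2.
    doc_a = [1.0, 0.0, 2.0, 0.0]
    doc_b = [2.0, 0.0, 4.0, 0.0]
    for measure in (euclidean_similarity, cosine_similarity,
                    pearson_similarity, extended_jaccard_similarity):
        print(measure.__name__, measure(doc_a, doc_b))
```

In this toy pair the second vector is a scaled copy of the first, so the cosine and Pearson measures treat the two documents as identical, while the Euclidean-based and extended Jaccard measures penalize the difference in length. This is the kind of qualitative difference between metrics that the abstract refers to.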