Abstract
This paper describes a new approach to clustering, pattern preserving clustering, which produces clusters that are easier to interpret and use. The approach is motivated by the following observation: data usually contain strong patterns, which may be key to analyzing and describing the data, yet current clustering approaches often split these patterns among different clusters. This is perhaps not surprising, since clustering algorithms have no built-in knowledge of these patterns and often have goals that conflict with preserving them, e.g., minimizing the distance of points to their nearest cluster centroids. Also, patterns typically overlap, i.e., they may involve some of the same objects, so if the clustering algorithm produces disjoint clusters, some patterns must be split when the objects are clustered. In this paper we describe a technique for pattern preserving clustering that first finds patterns composed of tightly connected groups of objects or attributes and then, starting from these patterns, performs agglomerative clustering using the Group Average (UPGMA) technique. We present the results of experiments on document data that compare our approach, HIerarchical Clustering with PAttern Preservation (HICAP), to two other clustering techniques: bisecting K-means and traditional UPGMA. These results show that, despite the extra constraint of pattern preservation, HICAP performs much like traditional UPGMA with respect to the cluster evaluation criteria of entropy and F-measure. More importantly, we also illustrate how patterns, if preserved, can aid cluster interpretation.
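The two-stage idea in the abstract (find patterns first, then run Group Average agglomeration starting from those patterns rather than from single objects) can be illustrated with a minimal sketch. This is not the HICAP implementation: the pattern-mining step is omitted, the seed groups are simply assumed to be given and disjoint (the abstract notes that real patterns typically overlap), and the names `group_average` and `seeded_upgma` are hypothetical.

```python
from itertools import combinations


def group_average(a, b, d):
    """UPGMA linkage: mean pairwise distance between members of clusters a and b."""
    return sum(d(i, j) for i in a for j in b) / (len(a) * len(b))


def seeded_upgma(points, seeds, k, dist):
    """Merge clusters down to k, starting from seed groups instead of singletons.

    points: list of objects.
    seeds:  disjoint lists of indices into points (assumed stand-ins for
            mined patterns; how they are found is outside this sketch).
    Points not covered by any seed start as singleton clusters.
    """
    covered = {i for s in seeds for i in s}
    clusters = [list(s) for s in seeds]
    clusters += [[i] for i in range(len(points)) if i not in covered]

    def d(i, j):
        return dist(points[i], points[j])

    while len(clusters) > k:
        # Pick the pair of clusters with the smallest group-average distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: group_average(clusters[p[0]], clusters[p[1]], d),
        )
        # a < b always holds, so deleting index b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


if __name__ == "__main__":
    pts = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
    seeds = [[0, 1], [3, 4]]  # assumed patterns, given as index groups
    print(seeded_upgma(pts, seeds, 3, lambda x, y: abs(x - y)))
    # → [[0, 1, 2], [3, 4], [5]]
```

Because merging only ever joins whole clusters, every seed group stays inside a single cluster at each level of the hierarchy, which is the sense in which this sketch preserves its input patterns.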
Original language | English (US) |
---|---|
Title of host publication | Proceedings of the Fourth SIAM International Conference on Data Mining |
Editors | M.W. Berry, U. Dayal, C. Kamath, D. Skillicorn |
Pages | 279-290 |
Number of pages | 12 |
State | Published - Jun 22 2004 |
Event | Proceedings of the Fourth SIAM International Conference on Data Mining - Lake Buena Vista, FL, United States; Duration: Apr 22 2004 → Apr 24 2004 |
Other
Other | Proceedings of the Fourth SIAM International Conference on Data Mining |
---|---|
Country/Territory | United States |
City | Lake Buena Vista, FL |
Period | 4/22/04 → 4/24/04 |
Keywords
- Cluster Analysis
- Hyperclique Pattern
- Pattern Preserving Clustering