This paper describes a new approach, pattern preserving clustering, which produces more easily interpretable and usable clusters. The approach is motivated by the following observation: while there are usually strong patterns in the data, patterns that may be key to its analysis and description, these patterns are often split among different clusters by current clustering approaches. This is perhaps not surprising, since clustering algorithms have no built-in knowledge of these patterns and often have goals that conflict with preserving them, e.g., minimizing the distance of points to their nearest cluster centroids. In this paper, our focus is to characterize (1) the benefits of pattern preserving clustering and (2) the most effective way of performing it. To that end, we propose and evaluate two clustering algorithms: HIerarchical Clustering with pAttern Preservation (HICAP) and bisecting K-means Clustering with pAttern Preservation (K-CAP). Experimental results on document data show that HICAP can produce overlapping clusters that preserve useful patterns, but performs relatively worse than bisecting K-means with respect to the clustering evaluation criterion of entropy. By contrast, in terms of entropy, K-CAP can perform substantially better than the bisecting K-means algorithm when data sets contain clusters of widely different sizes, a common situation in the real world. Most importantly, we also illustrate how patterns, if preserved, can aid cluster interpretation.
- Hierarchical clustering
- Hyperclique pattern
- K-means clustering
- Pattern preserving clustering
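The entropy criterion used above to compare HICAP, K-CAP, and bisecting K-means is a standard external clustering measure: for each cluster, compute the entropy of its class-label distribution, then take the size-weighted average over clusters (lower is better). A minimal sketch of that computation; the function name and interface here are illustrative, not taken from the paper:

```python
import math
from collections import Counter

def cluster_entropy(cluster_labels, class_labels):
    """Size-weighted average entropy of clusters; lower is better.

    For each cluster j, E_j = -sum_i p_ij * log2(p_ij), where p_ij is the
    fraction of cluster j's points that belong to class i. The overall
    entropy is sum_j (n_j / n) * E_j, with n_j the size of cluster j.
    """
    n = len(cluster_labels)
    # Group class labels by cluster assignment.
    clusters = {}
    for c, y in zip(cluster_labels, class_labels):
        clusters.setdefault(c, []).append(y)
    total = 0.0
    for members in clusters.values():
        counts = Counter(members)
        # Entropy of this cluster's class distribution.
        e = -sum((k / len(members)) * math.log2(k / len(members))
                 for k in counts.values())
        # Weight by the cluster's share of all points.
        total += (len(members) / n) * e
    return total
```

A perfectly pure clustering (each cluster containing a single class) scores 0, while a cluster mixing two classes evenly contributes 1 bit of entropy; this is why splitting a coherent pattern across clusters tends to raise the score.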