TY - GEN
T1 - Co-clustering as multilinear decomposition with sparse latent factors
AU - Papalexakis, Evangelos E.
AU - Sidiropoulos, Nicholas D.
PY - 2011
Y1 - 2011
N2 - The K-means clustering problem seeks to partition the columns of a data matrix into subsets, such that columns in the same subset are 'close' to each other. The co-clustering problem seeks to simultaneously partition the rows and columns of a matrix to produce 'coherent' groups called co-clusters. Co-clustering has recently found numerous applications in diverse areas. The concept readily generalizes to higher-way data sets (e.g., adding a temporal dimension). Starting from K-means, we show how co-clustering can be formulated as a constrained multilinear decomposition with sparse latent factors. In the case of three- and higher-way data, this corresponds to a PARAFAC decomposition with sparse latent factors. This is important, for PARAFAC is unique under mild conditions, and sparsity further improves identifiability. This allows us to uniquely unravel a large number of possibly overlapping co-clusters that are hidden in the data. Interestingly, the imposition of latent sparsity pays a collateral dividend: as one increases the number of fitted co-clusters, new co-clusters are added without affecting those previously extracted. An important corollary is that co-clusters can be extracted incrementally; this implies that the algorithm scales well for large datasets. We demonstrate the validity of our approach using the ENRON corpus, as well as synthetic data.
UR - http://www.scopus.com/inward/record.url?scp=80051642158&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80051642158&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2011.5946731
DO - 10.1109/ICASSP.2011.5946731
M3 - Conference contribution
AN - SCOPUS:80051642158
SN - 9781457705397
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 2064
EP - 2067
BT - 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011 - Proceedings
T2 - 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011
Y2 - 22 May 2011 through 27 May 2011
ER -