Abstract
Cluster analysis is a popular technique in statistics and computer science whose objective is to group similar observations into relatively distinct groups, generally known as clusters. Semi-supervised clustering assumes that some additional information about group memberships is available. Under the most frequently considered scenario, labels are known for some portion of the data and unavailable for the rest of the observations. In this paper, we discuss a general type of semi-supervised clustering defined by so-called positive and negative constraints. Under positive constraints, some data points are required to belong to the same cluster. Conversely, negative constraints specify that particular points must belong to different clusters. We outline a general framework for semi-supervised clustering with constraints that naturally incorporates the additional information into the EM algorithm traditionally used in mixture modeling and model-based clustering. The developed methodology is illustrated on synthetic and classification datasets. A dendrochronology application is considered and thoroughly discussed.
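To make the idea of positive constraints inside an EM step more concrete, the following is a minimal sketch and not the paper's implementation. It assumes a univariate Gaussian mixture and assumes that each positively constrained block is treated as a single unit sharing one component, with the mixing proportion applied once per block. The function name `block_posteriors`, its parameters, and the toy data are hypothetical and chosen only for illustration. Negative constraints are not handled here; they would additionally require excluding joint assignments that place the constrained points in the same component.

```python
import numpy as np

def block_posteriors(x, blocks, weights, means, sds):
    """E-step sketch under positive constraints (illustrative assumption).

    x       : (n,) observations
    blocks  : (n,) integer block id per observation, numbered 0..B-1;
              points sharing an id are positively constrained, and an
              unconstrained point forms a block of its own
    weights, means, sds : (K,) parameters of a univariate Gaussian mixture
    Returns a (B, K) array of block-level posterior probabilities.
    """
    x = np.asarray(x, dtype=float)
    blocks = np.asarray(blocks)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    sds = np.asarray(sds, dtype=float)

    # Log density of every point under every component, shape (n, K).
    log_pdf = (-0.5 * np.log(2 * np.pi) - np.log(sds)
               - 0.5 * ((x[:, None] - means) / sds) ** 2)

    # Sum the log densities within each block, then add the log mixing
    # proportions once per block.
    n_blocks = blocks.max() + 1
    log_joint = np.zeros((n_blocks, weights.size))
    np.add.at(log_joint, blocks, log_pdf)
    log_joint += np.log(weights)

    # Normalise each row in a numerically stable way.
    log_joint -= log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)

# Toy data: the first two points are constrained together, as are the last two.
x = [0.1, -0.2, 0.05, 4.9, 5.2]
blocks = [0, 0, 1, 2, 2]
post = block_posteriors(x, blocks, weights=[0.5, 0.5], means=[0.0, 5.0], sds=[1.0, 1.0])
labels = post.argmax(axis=1)  # one hard label per block
```

Because the posterior is computed per block rather than per point, every point in a block necessarily receives the same label, which is how the positive constraint is enforced in this sketch.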
Original language | English (US) |
---|---|
Pages (from-to) | 327-349 |
Number of pages | 23 |
Journal | Advances in Data Analysis and Classification |
Volume | 10 |
Issue number | 3 |
State | Published - Sep 1 2016 |
Externally published | Yes |
Bibliographical note
Publisher Copyright: © 2015, Springer-Verlag Berlin Heidelberg.
Keywords
- BIC
- Finite mixture models
- Model-based clustering
- Positive and negative constraints
- Semi-supervised clustering