Categorical data are ubiquitous in real-world databases. However, because such data lack an intrinsic proximity measure, many powerful algorithms for numerical data analysis may not work well on their categorical counterparts, which creates a bottleneck in practical applications. In this paper, we propose a novel method to transform categorical data into numerical representations, so that abundant numerical learning methods can be exploited in categorical data mining. Our key idea is to learn a pairwise dissimilarity among categorical symbols, and hence a continuous embedding, which can then be used for subsequent numerical treatment. There are two important criteria for learning the dissimilarities. First, they should capture the important "transitivity", which has been shown to be particularly useful in measuring the proximity relation in categorical data. Second, the pairwise sample geometry arising from the learned symbol distances should be maximally consistent with prior knowledge (e.g., class labels) to obtain good generalization performance. We achieve both through multiple transitive distance learning and embedding. Encouraging results are observed on a number of benchmark classification tasks against state-of-the-art methods.
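As a rough illustration of the final step only (turning symbol dissimilarities into a continuous embedding), the sketch below applies classical multidimensional scaling to a small dissimilarity matrix. This is not the paper's transitive distance learning: classical MDS is swapped in as a generic stand-in, and the `classical_mds` helper, the symbol names, and the matrix values are all hypothetical.

```python
import numpy as np

def classical_mds(D, dim=2):
    """Embed items with pairwise dissimilarity matrix D into `dim` dimensions.

    Generic classical MDS, used here as a stand-in; it is NOT the
    transitive distance learning described in the abstract.
    """
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:dim]       # indices of the largest `dim`
    vals = np.clip(vals[idx], 0.0, None)     # guard against small negatives
    return vecs[:, idx] * np.sqrt(vals)

# Hypothetical dissimilarities among three symbols of one attribute.
symbols = ["red", "green", "blue"]
D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])

coords = classical_mds(D, dim=1)
embedding = dict(zip(symbols, coords))       # symbol -> numeric vector
```

Each categorical value can then be replaced by its vector in `embedding` before a numerical learner is applied. In the paper's setting the dissimilarities would be learned from data and labels rather than fixed by hand as they are here.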
|Original language||English (US)|
|Title of host publication||SIAM International Conference on Data Mining 2015, SDM 2015|
|Editors||Suresh Venkatasubramanian, Jieping Ye|
|Publisher||Society for Industrial and Applied Mathematics Publications|
|Number of pages||9|
|State||Published - 2015|
|Event||SIAM International Conference on Data Mining 2015, SDM 2015 - Vancouver, Canada|
Duration: Apr 30 2015 → May 2 2015
|Period||4/30/15 → 5/2/15|
Bibliographical note
Publisher Copyright:
Copyright © SIAM.