Abstract
In this paper we present a simple linear-time centroid-based document classification algorithm that, despite its simplicity and robust performance, has not been extensively studied and analyzed. Our experiments show that this centroid-based classifier consistently and substantially outperforms other algorithms, such as Naive Bayesian, k-nearest-neighbors, and C4.5, on a wide range of datasets. Our analysis shows that the similarity measure used by the centroid-based scheme allows it to classify a new document based on how closely its behavior matches the behavior of the documents belonging to different classes. This matching allows it to dynamically adjust for classes with different densities and accounts for dependencies between the terms in the different classes.
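The abstract does not spell out the algorithm, so the following is only a minimal sketch of what a centroid-based classifier generally looks like, not the authors' implementation. It assumes raw term-frequency vectors, cosine similarity, and one length-normalized centroid per class; the function names (`normalize`, `train_centroids`, `classify`) are hypothetical and chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def normalize(vec):
    """Scale a sparse term-weight mapping to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def train_centroids(docs, labels):
    """Single pass over the training set: sum unit-length vectors per class.

    Assumption: raw term counts are used as weights; a real system would
    more likely use TF-IDF or a similar weighting.
    """
    sums = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term, w in normalize(Counter(tokens)).items():
            sums[label][term] += w
    # Normalizing the sum gives the same direction as normalizing the mean.
    return {label: normalize(vec) for label, vec in sums.items()}


def classify(tokens, centroids):
    """Return the class whose centroid has the highest cosine similarity."""
    doc = normalize(Counter(tokens))

    def cosine(centroid):
        # Both vectors are unit length, so the dot product is the cosine.
        return sum(w * centroid.get(t, 0.0) for t, w in doc.items())

    return max(centroids, key=lambda label: cosine(centroids[label]))
```

Under these assumptions, training touches each document once and classification compares the new document against one vector per class, which is where the linear-time behavior of this kind of scheme comes from.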
Original language | English (US) |
---|---|
Title of host publication | Principles of Data Mining and Knowledge Discovery - 4th European Conference, PKDD 2000, Proceedings |
Editors | Djamel A. Zighed, Jan Komorowski, Jan Zytkow |
Publisher | Springer Verlag |
Pages | 424-431 |
Number of pages | 8 |
ISBN (Print) | 9783540410669 |
State | Published - 2000 |
Event | 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 2000 - Lyon, France; Duration: Sep 13 2000 → Sep 16 2000 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 1910 |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Other
Other | 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 2000 |
---|---|
Country/Territory | France |
City | Lyon |
Period | 9/13/00 → 9/16/00 |
Bibliographical note
Publisher Copyright: © Springer-Verlag Berlin Heidelberg 2000.