Abstract
Automated clustering of text documents such as web pages is becoming increasingly important for organizing the vast amounts of information available on the internet. The problem is also very challenging, since text is typically represented by very high-dimensional (> 1000), normalized (unit-length) vectors. Moreover, documents are continually being created and their statistics change over time (for example, as news stories change), so one needs incremental learning algorithms that can adapt to non-stationary environments. We model high-dimensional, normalized data using a mixture of von Mises-Fisher distributions, and then modify this generative model in a principled way to yield frequency-sensitive competitive learning mechanisms that are applicable to streaming data and produce balanced clusters. Experimental results on clustering high-dimensional text data sets are provided to show the effectiveness and applicability of the proposed techniques.
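As a rough illustration of the kind of mechanism the abstract describes, the following Python sketch makes one online pass of frequency-sensitive competitive learning over unit-length vectors. It is not the paper's algorithm: the function name `fscl_spherical`, the count-scaled cosine distance used to pick the winner, and the fixed learning rate `lr` are all assumptions chosen for brevity, and the von Mises-Fisher mixture itself is not modeled.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit L2 length (zero vectors are returned unchanged)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def fscl_spherical(stream, k, dim, lr=0.05, seed=0):
    """One online pass of frequency-sensitive competitive learning on the sphere.

    Hypothetical sketch: the winner rule and the fixed learning rate are
    assumptions, not the update rules derived in the paper.
    """
    rng = np.random.default_rng(seed)
    # Start from k random unit-length centers.
    centers = np.array([normalize(rng.standard_normal(dim)) for _ in range(k)])
    counts = np.ones(k)  # how often each center has won (starts at 1)
    labels = []
    for x in stream:
        x = normalize(np.asarray(x, dtype=float))
        # Cosine distance in [0, 2], scaled by the win count so that
        # frequently winning centers are penalized (encourages balance).
        score = counts * (1.0 - centers @ x)
        w = int(np.argmin(score))
        # Move the winner toward x, then project back onto the unit sphere.
        centers[w] = normalize(centers[w] + lr * x)
        counts[w] += 1
        labels.append(w)
    return centers, counts, labels
```

Because each point is processed once and only the winning center is updated, a loop of this shape can be applied to streaming data; how well it balances cluster sizes depends on the assumed scoring rule.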
Original language | English (US) |
---|---|
Title of host publication | Proceedings of the International Joint Conference on Neural Networks |
Pages | 2697-2702 |
Number of pages | 6 |
Volume | 4 |
State | Published - Sep 25 2003 |
Event | International Joint Conference on Neural Networks 2003 - Portland, OR, United States. Duration: Jul 20 2003 → Jul 24 2003 |
Other
Other | International Joint Conference on Neural Networks 2003 |
---|---|
Country/Territory | United States |
City | Portland, OR |
Period | 7/20/03 → 7/24/03 |