A segment-based approach to clustering multi-topic documents

Andrea Tagarelli, George Karypis

Research output: Contribution to journal › Article › peer-review

40 Scopus citations

Abstract

Document clustering has been recognized as a central problem in text data management. Such a problem becomes particularly challenging when document contents are characterized by subtopical discussions that are not necessarily relevant to each other. Existing methods for document clustering have traditionally assumed that a document is an indivisible unit for text representation and similarity computation, which may not be appropriate to handle documents with multiple topics. In this paper, we address the problem of multi-topic document clustering by leveraging the natural composition of documents in text segments that are coherent with respect to the underlying subtopics. We propose a novel document clustering framework that is designed to induce a document organization from the identification of cohesive groups of segment-based portions of the original documents. We empirically give evidence of the significance of our segment-based approach on large collections of multi-topic documents, and we compare it to conventional methods for document clustering.
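The segment-based idea in the abstract can be illustrated with a minimal sketch. This is not the authors' framework: the paragraph-level segmentation, the term-frequency vectors, the single-pass "leader" clustering, the `threshold` parameter, and all function names below are assumptions chosen only to keep the example short. It shows how clustering segments rather than whole documents lets a multi-topic document land in more than one group.

```python
import math
import re
from collections import Counter, defaultdict


def segment(document):
    """Split a document into segments at blank lines (assumed segmentation)."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def vectorize(text):
    """Term-frequency vector over lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_segments(vectors, threshold=0.3):
    """Single-pass leader clustering (an assumed stand-in for any clusterer).

    Each segment joins the most similar existing cluster if the similarity
    reaches the threshold; otherwise it starts a new cluster.
    """
    summaries, labels = [], []
    for vec in vectors:
        best, best_sim = None, threshold
        for i, summary in enumerate(summaries):
            sim = cosine(vec, summary)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            summaries.append(Counter(vec))
            labels.append(len(summaries) - 1)
        else:
            summaries[best].update(vec)  # accumulate term counts for the cluster
            labels.append(best)
    return labels


def cluster_documents(documents, threshold=0.3):
    """Map each segment cluster to the sorted ids of documents that own its segments."""
    owners, vectors = [], []
    for doc_id, doc in enumerate(documents):
        for seg in segment(doc):
            owners.append(doc_id)
            vectors.append(vectorize(seg))
    groups = defaultdict(set)
    for doc_id, label in zip(owners, cluster_segments(vectors, threshold)):
        groups[label].add(doc_id)
    return {label: sorted(ids) for label, ids in groups.items()}


docs = [
    "stock market prices fell sharply\n\nthe football team won the match",
    "investors watched stock market prices",
    "the football match ended and the team celebrated",
]
print(cluster_documents(docs))
```

In this toy input, document 0 has one finance segment and one sports segment, so it appears in both groups, whereas a whole-document similarity measure would have to place it in a single one.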

Original language: English (US)
Pages (from-to): 563-595
Number of pages: 33
Journal: Knowledge and Information Systems
Volume: 34
Issue number: 3
DOIs
State: Published - Mar 2013

Bibliographical note

Funding Information:
Portions of this work appeared in the SDM 2008 Workshop on Text Mining. This work was supported in part by NSF grants ACI-0133464 and IIS-0431135 and by the Digital Technology Center at the University of Minnesota. Access to research and computing facilities was provided by the Digital Technology Center and the Minnesota Supercomputing Institute. This work was performed during a research fellowship of the first author at the University of Minnesota.

Keywords

  • Document clustering
  • Interdisciplinary documents
  • Text segmentation
  • Topic identification
