Topic modeling for segment-based documents

Giovanni Ponti, Andrea Tagarelli, George Karypis

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Statistical topic models have traditionally treated a document as an indivisible unit in the generative process, an assumption that may not suit documents that are relatively long and show an explicit multi-topic structure. In this paper we describe a generative model that exploits a given decomposition of documents into smaller, topically cohesive text units, or segments. The key idea is to introduce a new variable in the generative process to model the document segments, so that word generation is related not only to the topics but also to the segments. Moreover, the topic latent variable is directly associated with the segments, rather than with the document as a whole. Experimental results show the significance of the proposed model and its better support for the document clustering task compared to other existing generative models. Copyright (c) 2012 - Edizioni Libreria Progetto and the authors.
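
The generative idea in the abstract can be illustrated with a minimal sketch. This is not the authors' model or code: the abstract does not give the exact factorization, so the sketch assumes that each segment has its own topic distribution and that each word is drawn conditionally on both the sampled topic and the segment. The function name `sample_document` and the parameters `segment_topic_probs`, `word_probs`, and `words_per_segment` are hypothetical names introduced only for illustration.

```python
import random


def sample_document(segment_topic_probs, word_probs, words_per_segment, rng):
    """Sample one document as a list of segments, each a list of words.

    Assumed (hypothetical) parameterization, not taken from the paper:
      segment_topic_probs[s][z] -- P(topic z | segment s)
      word_probs[(z, s)][w]     -- P(word w | topic z, segment s)
    """
    document = []
    for s, topic_probs in enumerate(segment_topic_probs):
        segment_words = []
        for _ in range(words_per_segment):
            # The topic is tied to the segment, not to the whole document.
            topics = list(topic_probs)
            z = rng.choices(topics, weights=[topic_probs[t] for t in topics])[0]
            # The word depends on both the topic and the segment.
            dist = word_probs[(z, s)]
            vocab = list(dist)
            w = rng.choices(vocab, weights=[dist[v] for v in vocab])[0]
            segment_words.append(w)
        document.append(segment_words)
    return document
```

Calling `sample_document` with one topic distribution per segment and a seeded `random.Random` instance yields one word list per segment. A model that treats the document as an indivisible unit would instead draw every topic from a single document-level distribution.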

Original language: English (US)
Title of host publication: Proceedings of the 20th Italian Symposium on Advanced Database Systems, SEBD 2012
Pages: 205-212
Number of pages: 8
State: Published - 2012
Event: 20th Italian Symposium on Advanced Database Systems, SEBD 2012 - Venice, Italy
Duration: Jun 24 2012 to Jun 27 2012

Publication series

Name: Proceedings of the 20th Italian Symposium on Advanced Database Systems, SEBD 2012

Other

Other: 20th Italian Symposium on Advanced Database Systems, SEBD 2012
Country/Territory: Italy
City: Venice
Period: 6/24/12 to 6/27/12
