Anchor-free correlated topic modeling

Xiao Fu, Kejun Huang, Nicholas D. Sidiropoulos, Qingjiang Shi, Mingyi Hong

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

In topic modeling, identifiability of the topics is an essential issue. Many topic modeling approaches have been developed under the premise that each topic has a characteristic anchor word that only appears in that topic. The anchor-word assumption is fragile in practice, because words and terms have multiple uses; yet it is commonly adopted because it enables identifiability guarantees. Remedies in the literature include using three- or higher-order word co-occurrence statistics to come up with tensor factorization models, but such statistics need many more samples to obtain reliable estimates, and identifiability still hinges on additional assumptions, such as consecutive words being persistently drawn from the same topic. In this work, we propose a new topic identification criterion using second-order statistics of the words. The criterion is theoretically guaranteed to identify the underlying topics even when the anchor-word assumption is grossly violated. An algorithm based on alternating optimization and an efficient primal-dual algorithm are proposed to handle the resulting identification problem. The former exhibits high performance and is completely parameter-free; the latter affords up to 200 times speedup relative to the former, but requires step-size tuning and a slight sacrifice in accuracy. A variety of real text corpora are employed to showcase the effectiveness of the approach, where the proposed anchor-free method demonstrates substantial improvements compared to a number of anchor-word-based approaches under various evaluation metrics.
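To make the two notions in the abstract concrete, the following is a minimal sketch, not the authors' algorithm. It assumes a commonly used second-order model in which the word-word co-occurrence matrix factors as P = C E C^T, where C is a word-topic matrix with columns summing to one and E is a topic-topic correlation matrix; this model, the function names, and the toy numbers are illustrative assumptions that do not appear in the abstract. The sketch checks whether a word-topic matrix satisfies the anchor-word assumption and builds the second-order statistics for a matrix that violates it.

```python
import numpy as np


def has_anchor_words(C, tol=1e-12):
    """Return True if every topic (column of C) has at least one word (row)
    whose probability is nonzero in that topic only."""
    support = C > tol
    # Rows that are nonzero in exactly one topic are anchor-word candidates.
    anchor_rows = support.sum(axis=1) == 1
    # A topic is covered if some anchor row is nonzero in its column.
    topics_with_anchor = support[anchor_rows].any(axis=0)
    return bool(topics_with_anchor.all())


def cooccurrence(C, E):
    """Second-order word statistics under the assumed model P = C E C^T."""
    return C @ E @ C.T


# Toy word-topic matrix WITH anchor words: rows 0-2 each belong to one topic.
C_anchor = np.array([
    [0.6, 0.0, 0.0],
    [0.0, 0.7, 0.0],
    [0.0, 0.0, 0.5],
    [0.4, 0.3, 0.5],
])

# Toy word-topic matrix WITHOUT anchor words: every word serves two topics.
C_free = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
])

# Hypothetical topic-topic correlation matrix (symmetric, entries sum to 1).
E = np.array([
    [0.30, 0.05, 0.00],
    [0.05, 0.25, 0.05],
    [0.00, 0.05, 0.25],
])

P_free = cooccurrence(C_free, E)

print("anchors present (C_anchor):", has_anchor_words(C_anchor))
print("anchors present (C_free):  ", has_anchor_words(C_free))
print("co-occurrence matrix for the anchor-free case:")
print(P_free)
```

In this toy setting the second matrix has no anchor words at all, yet its co-occurrence matrix is still a valid symmetric joint distribution over word pairs. Recovering C and E from such a matrix is the kind of identification problem the abstract refers to; the criterion that achieves this is described in the paper itself and is not reproduced here.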

Original language: English (US)
Article number: 8338424
Pages (from-to): 1056-1071
Number of pages: 16
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume: 41
Issue number: 5
DOI: 10.1109/TPAMI.2018.2827377
State: Published - May 1, 2019

Keywords

  • Topic modeling
  • anchor free
  • identifiability
  • non-convex optimization
  • nonnegative matrix factorization
  • sufficiently scattered

PubMed: MeSH publication types

  • Journal Article

Cite this

Anchor-free correlated topic modeling. / Fu, Xiao; Huang, Kejun; Sidiropoulos, Nicholas D.; Shi, Qingjiang; Hong, Mingyi.

In: IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 41, No. 5, 8338424, 01.05.2019, p. 1056-1071.

Fu, Xiao ; Huang, Kejun ; Sidiropoulos, Nicholas D. ; Shi, Qingjiang ; Hong, Mingyi. / Anchor-free correlated topic modeling. In: IEEE Transactions on Pattern Analysis and Machine Intelligence. 2019 ; Vol. 41, No. 5. pp. 1056-1071.
@article{80010c7c50604acbb6ea9134f845a944,
title = "Anchor-free correlated topic modeling",
abstract = "In topic modeling, identifiability of the topics is an essential issue. Many topic modeling approaches have been developed under the premise that each topic has a characteristic anchor word that only appears in that topic. The anchor-word assumption is fragile in practice, because words and terms have multiple uses; yet it is commonly adopted because it enables identifiability guarantees. Remedies in the literature include using three- or higher-order word co-occurrence statistics to come up with tensor factorization models, but such statistics need many more samples to obtain reliable estimates, and identifiability still hinges on additional assumptions, such as consecutive words being persistently drawn from the same topic. In this work, we propose a new topic identification criterion using second-order statistics of the words. The criterion is theoretically guaranteed to identify the underlying topics even when the anchor-word assumption is grossly violated. An algorithm based on alternating optimization and an efficient primal-dual algorithm are proposed to handle the resulting identification problem. The former exhibits high performance and is completely parameter-free; the latter affords up to 200 times speedup relative to the former, but requires step-size tuning and a slight sacrifice in accuracy. A variety of real text corpora are employed to showcase the effectiveness of the approach, where the proposed anchor-free method demonstrates substantial improvements compared to a number of anchor-word-based approaches under various evaluation metrics.",
keywords = "Topic modeling, anchor free, identifiability, non-convex optimization, nonnegative matrix factorization, sufficiently scattered",
author = "Xiao Fu and Kejun Huang and Sidiropoulos, {Nicholas D.} and Qingjiang Shi and Mingyi Hong",
year = "2019",
month = "5",
day = "1",
doi = "10.1109/TPAMI.2018.2827377",
language = "English (US)",
volume = "41",
pages = "1056--1071",
journal = "IEEE Transactions on Pattern Analysis and Machine Intelligence",
issn = "0162-8828",
publisher = "IEEE Computer Society",
number = "5",

}

TY - JOUR

T1 - Anchor-free correlated topic modeling

AU - Fu, Xiao

AU - Huang, Kejun

AU - Sidiropoulos, Nicholas D.

AU - Shi, Qingjiang

AU - Hong, Mingyi

PY - 2019/5/1

Y1 - 2019/5/1

N2 - In topic modeling, identifiability of the topics is an essential issue. Many topic modeling approaches have been developed under the premise that each topic has a characteristic anchor word that only appears in that topic. The anchor-word assumption is fragile in practice, because words and terms have multiple uses; yet it is commonly adopted because it enables identifiability guarantees. Remedies in the literature include using three- or higher-order word co-occurrence statistics to come up with tensor factorization models, but such statistics need many more samples to obtain reliable estimates, and identifiability still hinges on additional assumptions, such as consecutive words being persistently drawn from the same topic. In this work, we propose a new topic identification criterion using second-order statistics of the words. The criterion is theoretically guaranteed to identify the underlying topics even when the anchor-word assumption is grossly violated. An algorithm based on alternating optimization and an efficient primal-dual algorithm are proposed to handle the resulting identification problem. The former exhibits high performance and is completely parameter-free; the latter affords up to 200 times speedup relative to the former, but requires step-size tuning and a slight sacrifice in accuracy. A variety of real text corpora are employed to showcase the effectiveness of the approach, where the proposed anchor-free method demonstrates substantial improvements compared to a number of anchor-word-based approaches under various evaluation metrics.

KW - Topic modeling

KW - anchor free

KW - identifiability

KW - non-convex optimization

KW - nonnegative matrix factorization

KW - sufficiently scattered

UR - http://www.scopus.com/inward/record.url?scp=85045694619&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85045694619&partnerID=8YFLogxK

U2 - 10.1109/TPAMI.2018.2827377

DO - 10.1109/TPAMI.2018.2827377

M3 - Article

VL - 41

SP - 1056

EP - 1071

JO - IEEE Transactions on Pattern Analysis and Machine Intelligence

JF - IEEE Transactions on Pattern Analysis and Machine Intelligence

SN - 0162-8828

IS - 5

M1 - 8338424

ER -