Word segmentation in Chinese language processing

Xinxin Shu, Junhui Wang, Xiaotong Shen, Annie Qu

Research output: Contribution to journal (Article)

4 Citations (Scopus)

Abstract

This paper proposes a new statistical learning method for word segmentation in Chinese language processing. Word segmentation is the crucial first step towards natural language processing. Despite progress, segmentation remains under-studied, particularly for the Chinese language, the second most popular language among all internet users. One major difficulty is that the Chinese language is highly context-dependent and ambiguous in terms of word representations. To overcome this difficulty, we cast the problem of segmentation into a framework of sequence classification, where an instance (observation) is a sequence of characters, and a class label is a sequence determining how each character is segmented. Given the class label, each character sequence can be segmented into linguistically meaningful words. The proposed method is investigated through the Peking University corpus of Chinese documents. Our numerical study shows that the proposed method compares favorably with the state-of-the-art segmentation methods in the literature.
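The sequence-classification framing in the abstract can be illustrated with a small sketch. The abstract does not say which label alphabet the paper uses, so the two-label scheme here ("B" for a character that begins a word, "I" for one that continues the current word) is an assumption made for illustration, as are the function names `labels_to_words` and `words_to_labels`. The sketch only shows how a label sequence determines a segmentation; it does not implement the proposed learning method.

```python
from typing import List


def labels_to_words(chars: str, labels: List[str]) -> List[str]:
    """Decode a per-character label sequence into a list of words.

    Assumed (hypothetical) label scheme, not taken from the paper:
    "B" = the character begins a new word,
    "I" = the character continues the current word.
    """
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    words: List[str] = []
    for ch, lab in zip(chars, labels):
        if lab not in ("B", "I"):
            raise ValueError(f"unknown label: {lab!r}")
        # Start a new word on "B", or when no word has been opened yet.
        if lab == "B" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


def words_to_labels(words: List[str]) -> List[str]:
    """Encode a segmented sentence as one label per character."""
    labels: List[str] = []
    for word in words:
        if not word:
            continue  # skip empty strings; they contribute no characters
        labels.extend(["B"] + ["I"] * (len(word) - 1))
    return labels


if __name__ == "__main__":
    sentence = "我爱北京"            # instance: a sequence of characters
    labels = ["B", "B", "B", "I"]   # class label: one tag per character
    print(labels_to_words(sentence, labels))
```

Under this assumed scheme, a classifier that predicts one label sequence per character sequence fully specifies the segmentation, since the two functions are inverses of each other on well-formed input.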

Original language: English (US)
Pages (from-to): 165-173
Number of pages: 9
Journal: Statistics and its Interface
Volume: 10
Issue number: 2
DOI: 10.4310/SII.2017.v10.n2.a1
State: Published - Jan 1 2017

Fingerprint

Labels
Segmentation
Processing
Internet
Statistical Learning
Ambiguous
Natural Language
Numerical Study
Language
Dependent
Character
Class

Keywords

  • Cutting-plane algorithm
  • Language processing
  • Support vector machines
  • Word segmentation

Cite this

Word segmentation in Chinese language processing. / Shu, Xinxin; Wang, Junhui; Shen, Xiaotong; Qu, Annie.

In: Statistics and its Interface, Vol. 10, No. 2, 01.01.2017, p. 165-173.

Shu, Xinxin ; Wang, Junhui ; Shen, Xiaotong ; Qu, Annie. / Word segmentation in Chinese language processing. In: Statistics and its Interface. 2017 ; Vol. 10, No. 2. pp. 165-173.
@article{7cde67012ad14121bbc556fcffa3fe7c,
title = "Word segmentation in Chinese language processing",
abstract = "This paper proposes a new statistical learning method for word segmentation in Chinese language processing. Word segmentation is the crucial first step towards natural language processing. Despite progress, segmentation remains under-studied, particularly for the Chinese language, the second most popular language among all internet users. One major difficulty is that the Chinese language is highly context-dependent and ambiguous in terms of word representations. To overcome this difficulty, we cast the problem of segmentation into a framework of sequence classification, where an instance (observation) is a sequence of characters, and a class label is a sequence determining how each character is segmented. Given the class label, each character sequence can be segmented into linguistically meaningful words. The proposed method is investigated through the Peking University corpus of Chinese documents. Our numerical study shows that the proposed method compares favorably with the state-of-the-art segmentation methods in the literature.",
keywords = "Cutting-plane algorithm, Language processing, Support vector machines, Word segmentation",
author = "Xinxin Shu and Junhui Wang and Xiaotong Shen and Annie Qu",
year = "2017",
month = "1",
day = "1",
doi = "10.4310/SII.2017.v10.n2.a1",
language = "English (US)",
volume = "10",
pages = "165--173",
journal = "Statistics and its Interface",
issn = "1938-7989",
publisher = "International Press of Boston, Inc.",
number = "2",

}

TY - JOUR

T1 - Word segmentation in Chinese language processing

AU - Shu, Xinxin

AU - Wang, Junhui

AU - Shen, Xiaotong

AU - Qu, Annie

PY - 2017/1/1

Y1 - 2017/1/1

AB - This paper proposes a new statistical learning method for word segmentation in Chinese language processing. Word segmentation is the crucial first step towards natural language processing. Despite progress, segmentation remains under-studied, particularly for the Chinese language, the second most popular language among all internet users. One major difficulty is that the Chinese language is highly context-dependent and ambiguous in terms of word representations. To overcome this difficulty, we cast the problem of segmentation into a framework of sequence classification, where an instance (observation) is a sequence of characters, and a class label is a sequence determining how each character is segmented. Given the class label, each character sequence can be segmented into linguistically meaningful words. The proposed method is investigated through the Peking University corpus of Chinese documents. Our numerical study shows that the proposed method compares favorably with the state-of-the-art segmentation methods in the literature.

KW - Cutting-plane algorithm

KW - Language processing

KW - Support vector machines

KW - Word segmentation

UR - http://www.scopus.com/inward/record.url?scp=84995561798&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84995561798&partnerID=8YFLogxK

U2 - 10.4310/SII.2017.v10.n2.a1

DO - 10.4310/SII.2017.v10.n2.a1

M3 - Article

AN - SCOPUS:84995561798

VL - 10

SP - 165

EP - 173

JO - Statistics and its Interface

JF - Statistics and its Interface

SN - 1938-7989

IS - 2

ER -