Classification With Unstructured Predictors and an Application to Sentiment Analysis

Junhui Wang, Xiaotong Shen, Yiwen Sun, Annie Qu

Research output: Contribution to journal (Article)

4 Citations (Scopus)

Abstract

Unstructured data refer to information that lacks certain structures and cannot be organized in a predefined fashion. Unstructured data often involve words, texts, graphs, objects, or multimedia types of files that are difficult to process and analyze with traditional computational tools and statistical methods. This work explores ordinal classification for unstructured predictors with ordered class categories, where imprecise information concerning strengths of association between predictors is available for predicting class labels. However, imprecise information here is expressed in terms of a directed graph, with each node representing a predictor and a directed edge containing pairwise strengths of association between two nodes. One of the targeted applications for unstructured data arises from sentiment analysis, which identifies and extracts the relevant content or opinion of a document concerning a specific event of interest. We integrate the imprecise predictor relations into linear relational constraints over classification function coefficients, where large margin ordinal classifiers are introduced, subject to many quadratically linear constraints. The proposed classifiers are then applied in sentiment analysis using binary word predictors. Computationally, we implement ordinal support vector machines and ψ-learning through a scalable quadratic programming package based on sparse word representations. Theoretically, we show that using relationships among unstructured predictors improves prediction accuracy of classification significantly. We illustrate an application for sentiment analysis using consumer text reviews and movie review data. Supplementary materials for this article are available online.
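The binary word predictors and ordered class categories mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the whitespace tokenizer, the coefficient values, and the threshold-based rule for turning a linear score into an ordered label are all assumptions made for illustration, and the relational constraints from the directed graph are omitted. Each document is stored sparsely as the set of vocabulary indices it contains, so a linear score only touches the words that are present.

```python
from typing import Dict, List, Set


def build_vocabulary(documents: List[str]) -> Dict[str, int]:
    """Assign each distinct lowercase word an integer index (hypothetical helper)."""
    vocab: Dict[str, int] = {}
    for doc in documents:
        for word in doc.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def binary_sparse_features(document: str, vocab: Dict[str, int]) -> Set[int]:
    """Sparse binary representation: indices of vocabulary words present in the document."""
    return {vocab[w] for w in document.lower().split() if w in vocab}


def linear_score(active: Set[int], coef: List[float]) -> float:
    """Linear classification function evaluated only on the active (present) words."""
    return sum(coef[j] for j in active)


def ordinal_label(score: float, thresholds: List[float]) -> int:
    """Map a score to an ordered class 1..K by counting thresholds it exceeds.

    Assumes thresholds are sorted in increasing order (illustrative rule only).
    """
    return 1 + sum(score > t for t in thresholds)


if __name__ == "__main__":
    docs = ["good movie", "bad movie"]
    vocab = build_vocabulary(docs)
    coef = [1.0, 0.0, -1.0]   # made-up coefficients for "good", "movie", "bad"
    thresholds = [-0.5, 0.5]  # made-up cut points separating three ordered classes
    for doc in docs:
        active = binary_sparse_features(doc, vocab)
        print(doc, ordinal_label(linear_score(active, coef), thresholds))
```

In a fitted model the coefficients and thresholds would come from the large margin optimization described in the abstract rather than being set by hand.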

Original language: English (US)
Pages (from-to): 1242-1253
Number of pages: 12
Journal: Journal of the American Statistical Association
Volume: 111
Issue number: 515
DOI: 10.1080/01621459.2015.1089771
State: Published - Jul 2 2016

Keywords

  • Large margin learners
  • Large n and p
  • Natural language processing
  • Sentiment analysis
  • Text and opinion mining
  • Unstructured data

Cite this

Classification With Unstructured Predictors and an Application to Sentiment Analysis. / Wang, Junhui; Shen, Xiaotong; Sun, Yiwen; Qu, Annie.

In: Journal of the American Statistical Association, Vol. 111, No. 515, 02.07.2016, p. 1242-1253.

@article{e23bd3088e964343a0b19a772804358a,
title = "Classification With Unstructured Predictors and an Application to Sentiment Analysis",
abstract = "Unstructured data refer to information that lacks certain structures and cannot be organized in a predefined fashion. Unstructured data often involve words, texts, graphs, objects, or multimedia types of files that are difficult to process and analyze with traditional computational tools and statistical methods. This work explores ordinal classification for unstructured predictors with ordered class categories, where imprecise information concerning strengths of association between predictors is available for predicting class labels. However, imprecise information here is expressed in terms of a directed graph, with each node representing a predictor and a directed edge containing pairwise strengths of association between two nodes. One of the targeted applications for unstructured data arises from sentiment analysis, which identifies and extracts the relevant content or opinion of a document concerning a specific event of interest. We integrate the imprecise predictor relations into linear relational constraints over classification function coefficients, where large margin ordinal classifiers are introduced, subject to many quadratically linear constraints. The proposed classifiers are then applied in sentiment analysis using binary word predictors. Computationally, we implement ordinal support vector machines and ψ-learning through a scalable quadratic programming package based on sparse word representations. Theoretically, we show that using relationships among unstructured predictors improves prediction accuracy of classification significantly. We illustrate an application for sentiment analysis using consumer text reviews and movie review data. Supplementary materials for this article are available online.",
keywords = "Large margin learners, Large n and p, Natural language processing, Sentiment analysis, Text and opinion mining, Unstructured data",
author = "Junhui Wang and Xiaotong Shen and Yiwen Sun and Annie Qu",
year = "2016",
month = "7",
day = "2",
doi = "10.1080/01621459.2015.1089771",
language = "English (US)",
volume = "111",
pages = "1242--1253",
journal = "Journal of the American Statistical Association",
issn = "0162-1459",
publisher = "Taylor and Francis Ltd.",
number = "515",

}

TY - JOUR
T1 - Classification With Unstructured Predictors and an Application to Sentiment Analysis
AU - Wang, Junhui
AU - Shen, Xiaotong
AU - Sun, Yiwen
AU - Qu, Annie
PY - 2016/7/2
Y1 - 2016/7/2

AB - Unstructured data refer to information that lacks certain structures and cannot be organized in a predefined fashion. Unstructured data often involve words, texts, graphs, objects, or multimedia types of files that are difficult to process and analyze with traditional computational tools and statistical methods. This work explores ordinal classification for unstructured predictors with ordered class categories, where imprecise information concerning strengths of association between predictors is available for predicting class labels. However, imprecise information here is expressed in terms of a directed graph, with each node representing a predictor and a directed edge containing pairwise strengths of association between two nodes. One of the targeted applications for unstructured data arises from sentiment analysis, which identifies and extracts the relevant content or opinion of a document concerning a specific event of interest. We integrate the imprecise predictor relations into linear relational constraints over classification function coefficients, where large margin ordinal classifiers are introduced, subject to many quadratically linear constraints. The proposed classifiers are then applied in sentiment analysis using binary word predictors. Computationally, we implement ordinal support vector machines and ψ-learning through a scalable quadratic programming package based on sparse word representations. Theoretically, we show that using relationships among unstructured predictors improves prediction accuracy of classification significantly. We illustrate an application for sentiment analysis using consumer text reviews and movie review data. Supplementary materials for this article are available online.

KW - Large margin learners
KW - Large n and p
KW - Natural language processing
KW - Sentiment analysis
KW - Text and opinion mining
KW - Unstructured data
UR - http://www.scopus.com/inward/record.url?scp=84991585670&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84991585670&partnerID=8YFLogxK
U2 - 10.1080/01621459.2015.1089771
DO - 10.1080/01621459.2015.1089771
M3 - Article
AN - SCOPUS:84991585670
VL - 111
SP - 1242
EP - 1253
JO - Journal of the American Statistical Association
JF - Journal of the American Statistical Association
SN - 0162-1459
IS - 515
ER -