Mind the gap

Accounting for measurement error and misclassification in variables generated via data mining

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

The application of predictive data mining techniques in information systems research has grown in recent years, likely because of their effectiveness and scalability in extracting information from large amounts of data. A number of scholars have sought to combine data mining with traditional econometric analyses. Typically, data mining methods are first used to generate new variables (e.g., text sentiment), which are added into subsequent econometric models as independent regressors. However, because prediction is almost always imperfect, variables generated from the first-stage data mining models inevitably contain measurement error or misclassification. These errors, if ignored, can introduce systematic biases into the second-stage econometric estimations and threaten the validity of statistical inference. In this commentary, we examine the nature of this bias, both analytically and empirically, and show that it can be severe even when data mining models exhibit relatively high performance. We then show that this bias becomes increasingly difficult to anticipate as the functional form of the measurement error or the specification of the econometric model grows more complex. We review several methods for error correction and focus on two simulation-based methods, SIMEX and MC-SIMEX, which can be easily parameterized using standard performance metrics from data mining models, such as error variance or the confusion matrix, and can be applied under a wide range of econometric specifications. Finally, we demonstrate the effectiveness of SIMEX and MC-SIMEX by simulations and subsequent application of the methods to econometric estimations employing variables mined from three real-world data sets related to travel, social networking, and crowdfunding campaign websites.
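The two-stage problem and the SIMEX idea described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' implementation. It assumes the simplest case: a single continuous regressor with classical additive error of known standard deviation `sigma_u`, a linear second-stage model, and a quadratic extrapolant. The function names, the lambda grid, and the synthetic data are illustrative assumptions, and MC-SIMEX (the misclassification variant) is not covered. With a true regressor variance of 1 and an error variance of 0.25, the naive slope is expected to shrink by a factor of about 1/1.25 = 0.8.

```python
import numpy as np


def ols_slope(x, y):
    """Slope of a least-squares regression of y on x (intercept included)."""
    x_centered = x - x.mean()
    return float(x_centered @ (y - y.mean()) / (x_centered @ x_centered))


def simex_slope(w, y, sigma_u, lambdas=(0.5, 1.0, 1.5, 2.0), n_rep=100, seed=0):
    """SIMEX estimate of the slope when w = x + u and u ~ N(0, sigma_u**2)."""
    rng = np.random.default_rng(seed)
    grid = [0.0]                      # lambda = 0 is the naive fit on w itself
    slopes = [ols_slope(w, y)]
    for lam in lambdas:
        # Simulation step: add extra noise so the total error variance
        # becomes (1 + lam) * sigma_u**2, then average the refitted slopes.
        sims = [
            ols_slope(w + np.sqrt(lam) * sigma_u * rng.standard_normal(w.size), y)
            for _ in range(n_rep)
        ]
        grid.append(lam)
        slopes.append(float(np.mean(sims)))
    # Extrapolation step: fit a quadratic in lambda and evaluate it at
    # lambda = -1, the hypothetical error-free case.
    coefs = np.polyfit(grid, slopes, deg=2)
    return float(np.polyval(coefs, -1.0))


def simulate(n=5000, beta=2.0, sigma_u=0.5, seed=42):
    """Synthetic data: y depends on x, but only the noisy proxy w is observed."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    y = 1.0 + beta * x + rng.standard_normal(n)
    w = x + sigma_u * rng.standard_normal(n)
    return w, y


w, y = simulate()
naive = ols_slope(w, y)
corrected = simex_slope(w, y, sigma_u=0.5)
print(f"naive slope: {naive:.3f}, SIMEX slope: {corrected:.3f}, true slope: 2.000")
```

In this setup the naive slope is biased toward zero and the SIMEX estimate moves back toward the true value. Because the true dependence of the slope on lambda is not exactly quadratic, the quadratic extrapolant generally removes most of the bias rather than all of it.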

Original language: English (US)
Pages (from-to): 4-24
Number of pages: 21
Journal: Information Systems Research
Volume: 29
Issue number: 1
DOI: 10.1287/isre.2017.0727
State: Published - Mar 1 2018

Fingerprint

Measurement errors
Data mining
econometrics
Specifications
trend
Error correction
performance standard
simulation
Measurement error
Misclassification
Scalability
Websites
systems research
Information systems
networking
website
information system
campaign
travel
performance

Keywords

  • Data mining
  • Econometrics
  • Measurement error
  • Misclassification
  • Statistical inference

Cite this

@article{82e0264ee86c460a951947fc7ea555d3,
title = "Mind the gap: Accounting for measurement error and misclassification in variables generated via data mining",
abstract = "The application of predictive data mining techniques in information systems research has grown in recent years, likely because of their effectiveness and scalability in extracting information from large amounts of data. A number of scholars have sought to combine data mining with traditional econometric analyses. Typically, data mining methods are first used to generate new variables (e.g., text sentiment), which are added into subsequent econometric models as independent regressors. However, because prediction is almost always imperfect, variables generated from the first-stage data mining models inevitably contain measurement error or misclassification. These errors, if ignored, can introduce systematic biases into the second-stage econometric estimations and threaten the validity of statistical inference. In this commentary, we examine the nature of this bias, both analytically and empirically, and show that it can be severe even when data mining models exhibit relatively high performance. We then show that this bias becomes increasingly difficult to anticipate as the functional form of the measurement error or the specification of the econometric model grows more complex. We review several methods for error correction and focus on two simulation-based methods, SIMEX and MC-SIMEX, which can be easily parameterized using standard performance metrics from data mining models, such as error variance or the confusion matrix, and can be applied under a wide range of econometric specifications. Finally, we demonstrate the effectiveness of SIMEX and MC-SIMEX by simulations and subsequent application of the methods to econometric estimations employing variables mined from three real-world data sets related to travel, social networking, and crowdfunding campaign websites.",
keywords = "Data mining, Econometrics, Measurement error, Misclassification, Statistical inference",
author = "Mochen Yang and Gediminas Adomavicius and Gordon Burtch and Yuqing Ren",
year = "2018",
month = "3",
day = "1",
doi = "10.1287/isre.2017.0727",
language = "English (US)",
volume = "29",
pages = "4--24",
journal = "Information Systems Research",
issn = "1047-7047",
publisher = "INFORMS Inst. for Operations Res. and the Management Sciences",
number = "1",

}

TY - JOUR

T1 - Mind the gap

T2 - Accounting for measurement error and misclassification in variables generated via data mining

AU - Yang, Mochen

AU - Adomavicius, Gediminas

AU - Burtch, Gordon

AU - Ren, Yuqing

PY - 2018/3/1

Y1 - 2018/3/1

N2 - The application of predictive data mining techniques in information systems research has grown in recent years, likely because of their effectiveness and scalability in extracting information from large amounts of data. A number of scholars have sought to combine data mining with traditional econometric analyses. Typically, data mining methods are first used to generate new variables (e.g., text sentiment), which are added into subsequent econometric models as independent regressors. However, because prediction is almost always imperfect, variables generated from the first-stage data mining models inevitably contain measurement error or misclassification. These errors, if ignored, can introduce systematic biases into the second-stage econometric estimations and threaten the validity of statistical inference. In this commentary, we examine the nature of this bias, both analytically and empirically, and show that it can be severe even when data mining models exhibit relatively high performance. We then show that this bias becomes increasingly difficult to anticipate as the functional form of the measurement error or the specification of the econometric model grows more complex. We review several methods for error correction and focus on two simulation-based methods, SIMEX and MC-SIMEX, which can be easily parameterized using standard performance metrics from data mining models, such as error variance or the confusion matrix, and can be applied under a wide range of econometric specifications. Finally, we demonstrate the effectiveness of SIMEX and MC-SIMEX by simulations and subsequent application of the methods to econometric estimations employing variables mined from three real-world data sets related to travel, social networking, and crowdfunding campaign websites.

KW - Data mining

KW - Econometrics

KW - Measurement error

KW - Misclassification

KW - Statistical inference

UR - http://www.scopus.com/inward/record.url?scp=85043991353&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85043991353&partnerID=8YFLogxK

U2 - 10.1287/isre.2017.0727

DO - 10.1287/isre.2017.0727

M3 - Article

VL - 29

SP - 4

EP - 24

JO - Information Systems Research

JF - Information Systems Research

SN - 1047-7047

IS - 1

ER -