Goodness-of-fit tests for categorical data

Rino Bellocco, Sara Algeri

Research output: Contribution to journal, Article

Abstract

A key issue in modeling data with categorical predictors is how the saturated model is defined. It can be specified in several ways (the casewise, contingency-table, and collapsing approaches), and the choice depends strictly on the unit of analysis. The analytical unit can be the individual subject or, alternatively, a group of subjects sharing the same covariate pattern. In the first case, the goal is to predict the probability of success (failure) for each individual; in the second, it is to predict the proportion of successes (failures) in each group. The choice of analytical unit does not affect estimation, but it does affect the definition of the saturated model. Consequently, goodness-of-fit measures and tests can lead to different results and interpretations, so the approach must be chosen with care. In this article, we focus on the deviance test for logistic regression models; however, the results and conclusions carry over readily to other linear models with categorical regressors. We show how Stata 12.1 implements goodness-of-fit testing. Here it is important to clarify which of the three approaches is implemented by default. The shape of the dataset (individual format or events-trials format) also plays a prominent role, in accordance with the choice of analytical unit: the same procedure applied to different data structures leads to different definitions of the saturated model. One must therefore attend to both practical and theoretical statistical issues to avoid inappropriate analyses.
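The dependence of the deviance on the analytical unit can be illustrated with a small sketch. It is written in Python rather than Stata and is not the authors' code; the counts are made-up toy values, and the helper names (`fit_logit`, `deviance`, `expand`) are hypothetical. The same logistic model is fitted to an events-trials dataset and to its expansion into one row per subject. The coefficients should agree, while the deviance differs because the implied saturated model differs.

```python
import math

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(rows, iters=25):
    """Newton-Raphson for logit(p) = b0 + b1*x.

    Each row is (x, successes, trials); individual data use trials = 1.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, n in rows:
            p = inv_logit(b0 + b1 * x)
            r = y - n * p              # score contribution
            w = n * p * (1.0 - p)      # information weight
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def xlogy(a, b):
    """a * log(a / b), with the convention 0 * log(0) = 0."""
    return 0.0 if a == 0 else a * math.log(a / b)

def deviance(rows, beta):
    """Deviance against the saturated model implied by the rows:
    one parameter per row (per subject or per covariate pattern)."""
    b0, b1 = beta
    d = 0.0
    for x, y, n in rows:
        p = inv_logit(b0 + b1 * x)
        d += 2.0 * (xlogy(y, n * p) + xlogy(n - y, n * (1.0 - p)))
    return d

def expand(rows):
    """Turn events-trials rows into one (x, 0/1, 1) row per subject."""
    out = []
    for x, y, n in rows:
        out += [(x, 1, 1)] * y + [(x, 0, 1)] * (n - y)
    return out

# Toy events-trials data (illustrative only): (score, successes, trials)
grouped = [(0, 5, 20), (1, 10, 20), (2, 12, 20)]
individual = expand(grouped)

beta_grp = fit_logit(grouped)
beta_ind = fit_logit(individual)
dev_grp = deviance(grouped, beta_grp)      # saturated model: one proportion per group
dev_ind = deviance(individual, beta_ind)   # saturated model: one probability per subject

print("coefficients (grouped):   ", beta_grp)
print("coefficients (individual):", beta_ind)
print("deviance (grouped):   ", dev_grp)
print("deviance (individual):", dev_ind)
```

In this sketch the casewise deviance equals minus twice the maximized log likelihood, whereas the grouped deviance compares fitted and observed proportions within each covariate pattern, so the two values differ even though the fitted model is the same.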

Original language: English (US)
Pages (from-to): 356-365
Number of pages: 10
Journal: Stata Journal
Volume: 13
Issue number: 2
State: Published - Jul 17 2013
Externally published: Yes

Keywords

  • Categorical data
  • Deviance
  • Goodness-of-fit tests
  • Saturated models
  • St0299

Cite this

Goodness-of-fit tests for categorical data. / Bellocco, Rino; Algeri, Sara.

In: Stata Journal, Vol. 13, No. 2, 17.07.2013, p. 356-365.

@article{b09ff552d67b4b3f987e51b6b6a7bcd6,
title = "Goodness-of-fit tests for categorical data",
keywords = "Categorical data, Deviance, Goodness-of-fit tests, Saturated models, St0299",
author = "Rino Bellocco and Sara Algeri",
year = "2013",
month = "7",
day = "17",
language = "English (US)",
volume = "13",
pages = "356--365",
journal = "Stata Journal",
issn = "1536-867X",
publisher = "DPC Nederland",
number = "2",
}

TY - JOUR

T1 - Goodness-of-fit tests for categorical data

AU - Bellocco, Rino

AU - Algeri, Sara

PY - 2013/7/17

Y1 - 2013/7/17

KW - Categorical data

KW - Deviance

KW - Goodness-of-fit tests

KW - Saturated models

KW - St0299

UR - http://www.scopus.com/inward/record.url?scp=84880077733&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84880077733&partnerID=8YFLogxK

M3 - Article

VL - 13

SP - 356

EP - 365

JO - Stata Journal

JF - Stata Journal

SN - 1536-867X

IS - 2

ER -