Abstract
The ability to synthesize realistic data in a parameterizable way is valuable for a number of reasons, including privacy, missing data imputation, and evaluating the performance of statistical and computational methods. When the underlying data generating process is complex, data synthesis requires approaches that balance realism and simplicity. In this article, we address the problem of synthesizing sequential categorical data of the type that is increasingly available from mobile applications and sensors that record participant status continuously over the course of multiple days and weeks. We propose the paired Markov Chain (paired-MC) method, a flexible framework that produces sequences that closely mimic real data while providing a straightforward mechanism for modifying characteristics of the synthesized sequences. We demonstrate the paired-MC method on two datasets, one reflecting daily human activity (time use) patterns collected via a smartphone application, and one encoding the intensities of physical activity measured by wearable accelerometers. In both settings, sequences synthesized by paired-MC better capture key characteristics of the real data than alternative approaches. Supplemental materials for this article are available online.
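The abstract does not spell out how paired-MC is constructed, so the sketch below is background only and is not the paper's method. It shows a plain first-order Markov chain baseline for synthesizing sequential categorical data: estimate transition counts from observed sequences, then sample new sequences from them. The function names (`fit_first_order_chain`, `synthesize`) and the toy activity labels are hypothetical, chosen purely for illustration.

```python
import random
from collections import Counter, defaultdict


def fit_first_order_chain(sequences):
    """Count initial states and state-to-state transitions.

    Generic first-order Markov chain baseline (an assumption for
    illustration; NOT the paired-MC method from the article).
    """
    initial = Counter()
    transitions = defaultdict(Counter)
    for seq in sequences:
        if not seq:
            continue
        initial[seq[0]] += 1
        for prev, nxt in zip(seq, seq[1:]):
            transitions[prev][nxt] += 1
    return initial, transitions


def _draw(counter, rng):
    """Sample one key from a Counter with probability proportional to its count."""
    states = list(counter.keys())
    weights = list(counter.values())
    return rng.choices(states, weights=weights, k=1)[0]


def synthesize(initial, transitions, length, seed=None):
    """Sample one synthetic categorical sequence of the requested length."""
    if length <= 0:
        return []
    rng = random.Random(seed)
    state = _draw(initial, rng)
    out = [state]
    while len(out) < length:
        successors = transitions.get(state)
        if successors:
            state = _draw(successors, rng)
        else:
            # State was never observed with a successor:
            # fall back to the initial-state distribution.
            state = _draw(initial, rng)
        out.append(state)
    return out


# Toy, made-up activity sequences (not data from the article).
toy = [
    ["sleep", "sleep", "eat", "work", "work", "eat", "sleep"],
    ["sleep", "eat", "work", "work", "work", "eat", "sleep"],
]
initial, transitions = fit_first_order_chain(toy)
synthetic = synthesize(initial, transitions, length=10, seed=0)
print(synthetic)
```

In a baseline like this, the transition counts are the only knobs available for changing the character of the synthesized sequences, which gives a rough sense of what "parameterizable" synthesis means in the simplest case.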
Original language | English (US) |
---|---|
Journal | Journal of Computational and Graphical Statistics |
State | Accepted/In press - 2025 |
Bibliographical note
Publisher Copyright: © 2025 American Statistical Association and Institute of Mathematical Statistics.
Keywords
- Categorical data
- Human activity sequences
- Sequence analysis
- Synthesis