Speculative distributed CSV data parsing for big data analytics

Chang Ge, Yinan Li, Eric Eilebrecht, Badrish Chandramouli, Donald Kossmann

Research output: Chapter in Book/Report/Conference proceedingConference contribution

18 Scopus citations

Abstract

There has been a recent flurry of interest in providing query capability on raw data in today's big data systems. These raw data must be parsed before processing or use in analytics. Thus, a fundamental challenge in distributed big data systems is that of efficient parallel parsing of raw data. The difficulties come from the inherent ambiguity while independently parsing chunks of raw data without knowing the context of these chunks. Specifically, it can be difficult to find the beginnings and ends of fields and records in these chunks of raw data. To parallelize parsing, this paper proposes a speculation-based approach for the CSV format, arguably the most commonly used raw data format. Due to the syntactic and statistical properties of the format, speculative parsing rarely fails and therefore parsing is efficiently parallelized in a distributed setting. Our speculative approach is also robust, meaning that it can reliably detect syntax errors in CSV data. We experimentally evaluate the speculative, distributed parsing approach in Apache Spark using more than 11,000 real-world datasets, and show that our parser produces significant performance benefits over existing methods.

Original languageEnglish (US)
Title of host publicationSIGMOD 2019 - Proceedings of the 2019 International Conference on Management of Data
PublisherAssociation for Computing Machinery
Pages883-899
Number of pages17
ISBN (Electronic)9781450356435
DOIs
StatePublished - Jun 25 2019
Externally publishedYes
Event2019 International Conference on Management of Data, SIGMOD 2019 - Amsterdam, Netherlands
Duration: Jun 30 2019Jul 5 2019

Publication series

NameProceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print)0730-8078

Conference

Conference2019 International Conference on Management of Data, SIGMOD 2019
Country/TerritoryNetherlands
CityAmsterdam
Period6/30/197/5/19

Bibliographical note

Publisher Copyright:
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.

Fingerprint

Dive into the research topics of 'Speculative distributed CSV data parsing for big data analytics'. Together they form a unique fingerprint.

Cite this