Optimizing grouped aggregation in geo-distributed streaming analytics

Benjamin Heintz, Abhishek Chandra, Ramesh K. Sitaraman

Research output: Chapter in Book/Report/Conference proceedingConference contribution

21 Scopus citations

Abstract

Large quantities of data are generated continuously over time and from disparate sources such as users, devices, and sensors located around the globe. This results in the need for efficient geo-distributed streaming analytics to extract timely information. A typical analytics service in these settings uses a simple hub-and-spoke model, comprising a single central data warehouse and multiple edges connected by a wide-area network (WAN). A key decision for a geodistributed streaming service is how much of the computation should be performed at the edge versus the center. In this paper, we examine this question in the context of windowed grouped aggregation, an important and widely used primitive in streaming queries. Our work is focused on designing aggregation algorithms to optimize two key metrics of any geo-distributed streaming analytics service: WAN traffic and staleness (the delay in getting the result). Towards this end, we present a family of optimal offline algorithms that jointly minimize both staleness and traffic. Using this as a foundation, we develop practical online aggregation algorithms based on the observation that grouped aggregation can be modeled as a caching problem where the cache size varies over time. This key insight allows us to exploit well known caching techniques in our design of online aggregation algorithms. We demonstrate the practicality of these algorithms through an implementation in Apache Storm, deployed on the PlanetLab testbed. The results of our experiments, driven by workloads derived from anonymized traces of a popular web analytics service offered by a large commercial CDN, show that our online aggregation algorithms perform close to the optimal algorithms for a variety of system configurations, stream arrival rates, and query types.

Original languageEnglish (US)
Title of host publicationHPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing
PublisherAssociation for Computing Machinery, Inc
Pages133-144
Number of pages12
ISBN (Electronic)9781450335508
DOIs
StatePublished - Jun 15 2015
Event24th ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC 2015 - Portland, United States
Duration: Jun 15 2015Jun 19 2015

Publication series

NameHPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing

Other

Other24th ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC 2015
CountryUnited States
CityPortland
Period6/15/156/19/15

Keywords

  • Aggregation
  • Geo-distributed systems
  • Storm
  • Stream processing

Fingerprint Dive into the research topics of 'Optimizing grouped aggregation in geo-distributed streaming analytics'. Together they form a unique fingerprint.

  • Cite this

    Heintz, B., Chandra, A., & Sitaraman, R. K. (2015). Optimizing grouped aggregation in geo-distributed streaming analytics. In HPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (pp. 133-144). (HPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing). Association for Computing Machinery, Inc. https://doi.org/10.1145/2749246.2749276