Optimizing Timeliness and Cost in Geo-Distributed Streaming Analytics

Benjamin Heintz, Abhishek Chandra, Ramesh K. Sitaraman

Research output: Contribution to journal › Article › peer-review

11 Scopus citations


Rapid data streams are generated continuously from diverse sources including users, devices, and sensors located around the globe. This results in the need for efficient geo-distributed streaming analytics to extract timely information. A typical geo-distributed analytics service uses a hub-and-spoke model, comprising multiple edges connected by a wide-area network (WAN) to a central data warehouse. In this paper, we focus on the widely used primitive of windowed grouped aggregation, and examine the question of how much computation should be performed at the edges versus the center. We develop algorithms to optimize two key metrics: WAN traffic and staleness (delay in getting results). We present a family of optimal offline algorithms that jointly minimize these metrics, and we use these to guide our design of practical online algorithms based on the insight that windowed grouped aggregation can be modeled as a caching problem where the cache size varies over time. We evaluate our algorithms through an implementation in Apache Storm deployed on PlanetLab. Using workloads derived from anonymized traces of a popular analytics service from a large commercial CDN, our experiments show that our online algorithms achieve near-optimal traffic and staleness for a variety of system configurations, stream arrival rates, and queries.
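As a rough illustration of the primitive named in the abstract, the sketch below performs windowed grouped aggregation at a single edge and counts how many updates would cross the WAN under two extreme policies: forwarding every record as it arrives, or aggregating at the edge and sending one update per (window, key) pair when the window closes. It is not the paper's offline or online algorithm. The function names `edge_aggregate` and `wan_updates`, the tumbling-window assumption, the sum aggregate, and the sample records are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def edge_aggregate(records, window):
    """Sum values per key within tumbling windows of length `window`.

    `records` is an iterable of (timestamp, key, value) tuples.
    Tumbling windows and a sum aggregate are assumptions of this sketch.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key, value in records:
        windows[ts // window][key] += value
    return {w: dict(groups) for w, groups in windows.items()}


def wan_updates(records, window, batch=True):
    """Count edge-to-center updates under two extreme (hypothetical) policies."""
    if not batch:
        # Pure streaming: every record is forwarded as soon as it arrives.
        return len(records)
    # Pure batching: one update per (window, key) pair, sent at window close.
    aggregated = edge_aggregate(records, window)
    return sum(len(groups) for groups in aggregated.values())


records = [(0, "a", 1), (1, "b", 2), (2, "a", 3), (5, "a", 4), (6, "a", 1)]
print(edge_aggregate(records, window=5))
print(wan_updates(records, 5, batch=False), wan_updates(records, 5, batch=True))
```

In this toy input, batching sends fewer updates than streaming because repeated keys within a window collapse into one aggregate, but every update then waits until its window closes. Policies between these two extremes are the kind of design space the abstract describes in terms of WAN traffic and staleness.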

Original language: English (US)
Article number: 8031021
Pages (from-to): 232-245
Number of pages: 14
Journal: IEEE Transactions on Cloud Computing
Issue number: 1
State: Published - Jan 1 2020

Bibliographical note

Funding Information:
We would like to thank Ravali Kandur for her help deploying Apache Storm on PlanetLab, and US National Science Foundation grants CNS-1413998 and CNS-1619254, as well as an IBM Faculty Award, which partially supported this research.

Publisher Copyright:
© 2017 IEEE.


Keywords

  • Geo-distributed data analytics
  • resource management
  • scheduling
  • stream computing
  • windowed aggregation

