Optimizing Timeliness and Cost in Geo-Distributed Streaming Analytics

Benjamin Heintz, Abhishek Chandra, Ramesh K. Sitaraman

Research output: Contribution to journal (Article)

Abstract

Rapid data streams are generated continuously from diverse sources including users, devices, and sensors located around the globe. This results in the need for efficient geo-distributed streaming analytics to extract timely information. A typical geo-distributed analytics service uses a hub-and-spoke model, comprising multiple edges connected by a wide-area network (WAN) to a central data warehouse. In this paper, we focus on the widely used primitive of windowed grouped aggregation, and examine the question of how much computation should be performed at the edges versus the center. We develop algorithms to optimize two key metrics: WAN traffic and staleness (the delay in getting results). We present a family of optimal offline algorithms that jointly minimize these metrics, and we use these to guide our design of practical online algorithms based on the insight that windowed grouped aggregation can be modeled as a caching problem where the cache size varies over time. We evaluate our algorithms through an implementation in Apache Storm deployed on PlanetLab. Using workloads derived from anonymized traces of a popular analytics service from a large commercial CDN, our experiments show that our online algorithms achieve near-optimal traffic and staleness for a variety of system configurations, stream arrival rates, and queries.
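
As a rough illustration of the primitive named in the abstract, the following sketch shows windowed grouped aggregation at a single edge and counts how many updates would cross the WAN under two extreme policies: forwarding every raw record, or aggregating at the edge and sending one update per group when each window closes. It is not one of the paper's algorithms. The function names, the tumbling-window assumption, the sum aggregate, and the sample records are all hypothetical choices made for this example.

```python
from collections import defaultdict


def edge_aggregate(records, window_size):
    """Sum values per key within tumbling windows (assumed window type).

    records: iterable of (timestamp, key, value) tuples.
    Returns {window_index: {key: total}}.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for timestamp, key, value in records:
        windows[timestamp // window_size][key] += value
    return {w: dict(groups) for w, groups in windows.items()}


def wan_updates(records, window_size, aggregate_at_edge):
    """Count the updates an edge sends to the center over the WAN."""
    if not aggregate_at_edge:
        # Forward every raw record as it arrives.
        return len(records)
    # Send one update per (window, key) pair once the window closes.
    aggregated = edge_aggregate(records, window_size)
    return sum(len(groups) for groups in aggregated.values())


# Hypothetical sample stream: (timestamp, key, value)
records = [
    (0, "us", 1), (1, "us", 1), (2, "eu", 1),
    (5, "us", 1), (11, "eu", 1), (12, "eu", 1),
]

print(edge_aggregate(records, window_size=10))
print(wan_updates(records, window_size=10, aggregate_at_edge=False))
print(wan_updates(records, window_size=10, aggregate_at_edge=True))
```

In this sketch, aggregating at the edge sends fewer updates, but nothing leaves the edge until a window closes, which is the kind of tension between WAN traffic and staleness that the abstract describes.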

Original language: English (US)
Article number: 8031021
Pages (from-to): 232-245
Number of pages: 14
Journal: IEEE Transactions on Cloud Computing
Volume: 8
Issue number: 1
DOIs
State: Published - Jan 1 2020

Keywords

  • Geo-distributed data analytics
  • resource management
  • scheduling
  • stream computing
  • windowed aggregation
