Towards WAN-Aware Join Sampling over Geo-Distributed Data

Dhruv Kumar, Joel Wolfrath, Abhishek Chandra, Ramesh K. Sitaraman

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Large-scale data analytics over geographically distributed data sources is challenging primarily due to constrained and heterogeneous resource availability, such as wide area network (WAN) bandwidth. In this work, we look at the problem of generating random samples over joins for geo-distributed data sources. Joins are one of the most fundamental yet expensive operations in data analytics. To reduce the cost of computing joins, existing techniques have looked at efficiently generating a random sample over the join result for centralized environments, where all the data is available in one location. These techniques fail to address the unique challenges posed by geo-distributed environments. To address these challenges, we propose a sampling technique which aims to reduce the WAN traffic and latency, thereby reducing the overall latency for generating samples over joins for geo-distributed data sources. We implement our geo-distributed sampling technique on top of Apache Spark and compare it with existing state-of-the-art sampling techniques to identify scenarios where the proposed approach gives significant benefits. Based on this exploration, we provide a detailed outline of additional factors which should be considered when designing a WAN-aware join sampling technique for geo-distributed environments.

Original language: English (US)
Title of host publication: EdgeSys 2022 - Proceedings of the 5th International Workshop on Edge Systems, Analytics and Networking, Part of EuroSys 2022
Publisher: Association for Computing Machinery, Inc
Pages: 13-18
Number of pages: 6
ISBN (Electronic): 9781450392532
DOIs
State: Published - Apr 5 2022
Event: 5th International Workshop on Edge Systems, Analytics and Networking, EdgeSys 2022, in conjunction with ACM EuroSys 2022 - Virtual, Online, France
Duration: Apr 5 2022 → Apr 8 2022

Publication series

Name: EdgeSys 2022 - Proceedings of the 5th International Workshop on Edge Systems, Analytics and Networking, Part of EuroSys 2022

Conference

Conference: 5th International Workshop on Edge Systems, Analytics and Networking, EdgeSys 2022, in conjunction with ACM EuroSys 2022
Country/Territory: France
City: Virtual, Online
Period: 4/5/22 → 4/8/22

Bibliographical note

Funding Information:
The authors thank the anonymous reviewers for many constructive comments and suggestions. This work was sponsored in part by NSF under Grants CNS-1717834 and CNS-1717179, as well as by DARPA contract HR001117C0049.

Publisher Copyright:
© 2022 ACM.

Keywords

  • cloud
  • edge
  • geo-distributed systems
  • join sampling
