AggFirstJoin: Optimizing Geo-Distributed Joins using Aggregation-Based Transformations

Dhruv Kumar, Sohaib Ahmad, Abhishek Chandra, Ramesh K. Sitaraman

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Abstract

Geo-distributed analytics (GDA) processes data stored across geographically distributed sites. Such analytics requires transferring data over wide area network (WAN) links. WAN links are highly constrained and heterogeneous, which makes data transfer over the WAN slow and costly. To tackle this issue, recent approaches have proposed WAN-aware scheduling and placement of geo-distributed analytics tasks. However, computing joins in a geo-distributed setting remains a challenging problem. In this work, we propose AggFirstJoin, an approach that minimizes the cost of geo-distributed joins using a theoretically sound query transformation technique. Our optimization approach takes a combined view of the join and aggregation operations, which are often part of the same query, and pushes a (transformed) aggregation before the join in a way that produces the same results as the original query. We augment our query transformation technique with WAN-aware task placement and a Bloom filtering approach to further reduce query execution time and WAN usage, respectively. We implement our proposed technique on top of Apache Spark, a popular engine for big data analytics. We extensively evaluate our proposed technique using synthetic, TPC-H, and AMPLab Big Data Benchmark datasets on a real geo-distributed testbed on AWS as well as on an emulated testbed. Our evaluations show that our proposed technique achieves up to a 300x reduction in query execution time and a 200x reduction in WAN usage compared to state-of-the-art GDA techniques.
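The idea of pushing a transformed aggregation ahead of a join can be illustrated with a small, self-contained sketch. This is not the paper's implementation (which runs on Apache Spark); it is a toy example, assuming a query of the form SELECT key, SUM(value) FROM R JOIN S ON key GROUP BY key. The function names and the sample rows are hypothetical. Each side is first reduced to one partial aggregate per join key (a sum for R, a row count for S), and the join then combines the partials, so far fewer rows would need to cross the WAN.

```python
from collections import defaultdict


def join_then_aggregate(r_rows, s_keys):
    """Baseline: join R and S on key, then compute SUM(value) per key."""
    totals = defaultdict(int)
    for r_key, r_value in r_rows:
        for s_key in s_keys:
            if r_key == s_key:
                # Each matching (R row, S row) pair contributes the R value once.
                totals[r_key] += r_value
    return dict(totals)


def aggregate_then_join(r_rows, s_keys):
    """Transformed plan: pre-aggregate each side per key, then join the partials."""
    # Partial aggregate at R's site: SUM(value) per join key.
    r_sums = defaultdict(int)
    for key, value in r_rows:
        r_sums[key] += value

    # Partial aggregate at S's site: number of rows per join key.
    s_counts = defaultdict(int)
    for key in s_keys:
        s_counts[key] += 1

    # Join on key. The original SUM becomes SUM(value) * COUNT(matching S rows),
    # which is the "transformed" aggregation in this toy setting.
    return {key: r_sums[key] * s_counts[key] for key in r_sums if key in s_counts}


# Hypothetical sample data: R holds (key, value) rows, S holds only join keys.
r_rows = [("a", 1), ("a", 2), ("b", 5), ("c", 7)]
s_keys = ["a", "a", "b", "d"]

baseline = join_then_aggregate(r_rows, s_keys)
transformed = aggregate_then_join(r_rows, s_keys)
assert baseline == transformed
print(transformed)
```

In this sketch the two plans agree because SUM distributes over the duplicated join matches; other aggregates would need their own transformation, and the sketch makes no claim about how AggFirstJoin handles them.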

Original language: English (US)
Title of host publication: Proceedings - 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2023
Editors: Yogesh Simmhan, Ilkay Altintas, Ana-Lucia Varbanescu, Pavan Balaji, Abhinandan S. Prasad, Lorenzo Carnevale
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 414-425
Number of pages: 12
ISBN (Electronic): 9798350301199
State: Published - 2023
Event: 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2023 - Bangalore, India
Duration: May 1 2023 → May 4 2023

Publication series

Name: Proceedings - 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2023

Conference

Conference: 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2023
Country/Territory: India
City: Bangalore
Period: 5/1/23 → 5/4/23

Bibliographical note

Publisher Copyright:
© 2023 IEEE.

Keywords

  • data aggregation
  • data join
  • edge cloud infrastructure
  • geo distributed analytics
