Abstract
Geo-distributed analytics (GDA) processes data stored across geographically distributed sites. Such analytics requires data transfer over wide area network (WAN) links, which are highly constrained and heterogeneous, making the transfer slow and costly. To tackle this issue, recent approaches have proposed WAN-aware scheduling and placement of geo-distributed analytics tasks. However, computing joins in a geo-distributed setting remains a challenging problem. In this work, we propose AggFirstJoin, an approach that minimizes the cost of geo-distributed joins using a theoretically sound query transformation technique. Our optimization takes a combined view of the join and aggregation operations, which are often part of the same query, and pushes a transformed aggregation before the join such that the query produces the same results as the original. We augment the query transformation with WAN-aware task placement and Bloom filtering to further reduce query execution time and WAN usage, respectively. We implement our technique on top of Apache Spark, a popular engine for big data analytics. We extensively evaluate it using synthetic, TPC-H, and AMPLab Big Data Benchmark datasets on a real geo-distributed testbed on AWS as well as an emulated testbed. Our evaluations show that it achieves up to 300x reduction in query execution time and 200x reduction in WAN usage compared to state-of-the-art GDA techniques.
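The idea of pushing an aggregation before a join can be illustrated with a small, generic sketch. This is not the AggFirstJoin transformation itself, whose details are not given in the abstract. It assumes a hypothetical query that joins two relations `R(key, value)` and `S(key, payload)` on `key` and computes `SUM(R.value)` per key. All function and variable names are illustrative. Each site first reduces its relation to one row per key (a partial sum for `R`, a row count for `S`), so only the compact summaries would need to cross the WAN.

```python
from collections import defaultdict

# Hypothetical example relations, each imagined to live at a different site.
R = [("a", 1), ("a", 2), ("b", 5), ("c", 7)]              # (key, value)
S = [("a", "x"), ("a", "y"), ("b", "z"), ("d", "w")]      # (key, payload)


def join_then_aggregate(r_rows, s_rows):
    """Baseline: join R and S on key, then compute SUM(R.value) per key."""
    totals = defaultdict(int)
    for r_key, r_val in r_rows:
        for s_key, _ in s_rows:
            if r_key == s_key:
                totals[r_key] += r_val
    return dict(totals)


def aggregate_then_join(r_rows, s_rows):
    """Aggregate each relation locally, then join the per-key summaries."""
    # Local pre-aggregation of R: one partial sum per key.
    r_sum = defaultdict(int)
    for key, val in r_rows:
        r_sum[key] += val
    # Local pre-aggregation of S: how many rows each key would match.
    s_count = defaultdict(int)
    for key, _ in s_rows:
        s_count[key] += 1
    # Every R row with a given key joins with s_count[key] rows of S,
    # so the joined sum equals partial sum times match count.
    return {k: r_sum[k] * s_count[k] for k in r_sum if k in s_count}


baseline = join_then_aggregate(R, S)
rewritten = aggregate_then_join(R, S)
print(baseline == rewritten)
```

In this sketch the rewritten plan exchanges at most one row per distinct key from each side rather than every raw row, which is where a reduction in WAN usage would come from when keys repeat often.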
Original language | English (US) |
---|---|
Title of host publication | Proceedings - 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2023 |
Editors | Yogesh Simmhan, Ilkay Altintas, Ana-Lucia Varbanescu, Pavan Balaji, Abhinandan S. Prasad, Lorenzo Carnevale |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 414-425 |
Number of pages | 12 |
ISBN (Electronic) | 9798350301199 |
State | Published - 2023 |
Event | 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2023 - Bangalore, India. Duration: May 1 2023 → May 4 2023 |
Publication series
Name | Proceedings - 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2023 |
---|---|
Conference
Conference | 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2023 |
---|---|
Country/Territory | India |
City | Bangalore |
Period | 5/1/23 → 5/4/23 |
Bibliographical note
Publisher Copyright: © 2023 IEEE.
Keywords
- data aggregation
- data join
- edge cloud infrastructure
- geo distributed analytics