Abstract
Modern applications are increasingly generating and persisting data across geo-distributed data centers or edge clusters rather than a single cloud. This paradigm introduces challenges for traditional query execution due to increased latency when transferring data over wide-area network links. Join queries in particular are heavily affected, due to their large output size and amount of data that must be shuffled over the network. Join sampling—computing a uniform sample from the join results—is a useful technique for reducing resource requirements. However, applying it to a geo-distributed setting is challenging, since acquiring independent samples from each location and joining on the samples does not produce uniform and independent tuples from the join result. To address these challenges, we first generalize an existing join sampling algorithm to the geo-distributed setting. We then present our system, Plexus, which introduces three additional optimizations to further reduce the network overhead and handle network and data heterogeneity: (i) weight approximation, (ii) heterogeneity awareness and (iii) sample prefetching. We evaluate Plexus on a geo-distributed system deployed across multiple AWS regions, with an implementation based on Apache Spark. Using three real-world datasets, we show that Plexus can reduce query latency by up to 80% over the default Spark join implementation on a wide class of join queries without substantially impacting sample uniformity.
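To illustrate why sampling each input independently and then joining fails, and what a join-aware sampler does instead, here is a minimal single-machine sketch in Python. It is not the Plexus algorithm or its Spark implementation. It shows the classic weighted approach, in which each tuple of one relation is drawn with probability proportional to its number of matches in the other relation. The function name `weighted_join_sample`, the key-extractor parameters, and the tuple layout are assumptions made for illustration.

```python
import random
from collections import defaultdict


def weighted_join_sample(r, s, key_r, key_s, n, rng=None):
    """Draw n tuples uniformly (with replacement) from the equi-join of r and s
    without materializing the join. Illustrative sketch only."""
    rng = rng or random.Random()

    # Index s by join key so the matches for any r-tuple can be looked up.
    s_by_key = defaultdict(list)
    for t in s:
        s_by_key[key_s(t)].append(t)

    # Weight of an r-tuple = number of join results it participates in.
    weights = [len(s_by_key.get(key_r(t), ())) for t in r]
    if sum(weights) == 0:
        return []  # the join is empty, so there is nothing to sample

    # Step 1: pick r-tuples proportionally to their weights.
    picks = rng.choices(r, weights=weights, k=n)
    # Step 2: pair each pick with a uniformly chosen matching s-tuple.
    # P(result) = (w / W) * (1 / w) = 1 / W, i.e. uniform over all W results.
    return [(t, rng.choice(s_by_key[key_r(t)])) for t in picks]
```

By contrast, sampling `r` and `s` uniformly on their own and joining the samples under-represents results whose keys are frequent in one input, and the resulting tuples are correlated. The weights above depend on key frequencies in the other relation, so in a geo-distributed setting some information has to cross the network before sampling can start; the abstract lists weight approximation among the optimizations aimed at reducing that overhead.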
Original language | English (US) |
---|---|
Title of host publication | SoCC 2023 - Proceedings of the 2023 ACM Symposium on Cloud Computing |
Publisher | Association for Computing Machinery, Inc |
Pages | 1-16 |
Number of pages | 16 |
ISBN (Electronic) | 9798400703874 |
State | Published - Oct 30 2023 |
Event | 14th ACM Symposium on Cloud Computing, SoCC 2023, Santa Cruz, United States (Oct 30 2023 → Nov 1 2023) |
Conference
Conference | 14th ACM Symposium on Cloud Computing, SoCC 2023 |
---|---|
Country/Territory | United States |
City | Santa Cruz |
Period | 10/30/23 → 11/1/23 |
Bibliographical note
Publisher Copyright: © 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
Keywords
- Distributed Systems
- Join Algorithms
- Query Optimization
- Wide Area Network