Abstract
MapReduce has proven remarkably effective for a wide variety of data-intensive applications, but it was designed to run on large, single-site, homogeneous clusters. Researchers have begun to explore the extent to which the original MapReduce assumptions can be relaxed to accommodate skewed workloads, iterative applications, and heterogeneous computing environments. This paper continues this exploration by applying MapReduce to geo-distributed data over geo-distributed computation resources. Using Hadoop, we show that network and node heterogeneity and the lack of data locality lead to poor performance, because the interaction of MapReduce phases becomes pronounced in the presence of heterogeneous network behavior. To address these problems, we take a two-pronged approach. We first develop a model-driven optimization that serves as an oracle, providing high-level insights. We then apply these insights to design cross-phase optimization techniques that we implement and demonstrate in a real-world MapReduce implementation. Experimental results on both Amazon EC2 and PlanetLab show the potential of these techniques: performance improves by 7-18 percent, depending on the execution environment and application.
| Original language | English (US) |
|---|---|
| Article number | 6893011 |
| Pages (from-to) | 293-306 |
| Number of pages | 14 |
| Journal | IEEE Transactions on Cloud Computing |
| Volume | 4 |
| Issue number | 3 |
| State | Published - Jul 1 2016 |
Bibliographical note
Publisher Copyright: © 2013 IEEE.
Keywords
- Batch processing systems
- Distributed systems
- Parallel systems