TY - JOUR
T1 - End-to-End Optimization for Geo-Distributed MapReduce
AU - Heintz, Benjamin
AU - Chandra, Abhishek
AU - Sitaraman, Ramesh K.
AU - Weissman, Jon
N1 - Publisher Copyright: © 2013 IEEE.
PY - 2016/7/1
Y1 - 2016/7/1
N2 - MapReduce has proven remarkably effective for a wide variety of data-intensive applications, but it was designed to run on large single-site homogeneous clusters. Researchers have begun to explore the extent to which the original MapReduce assumptions can be relaxed, including skewed workloads, iterative applications, and heterogeneous computing environments. This paper continues this exploration by applying MapReduce across geo-distributed data over geo-distributed computation resources. Using Hadoop, we show that network and node heterogeneity and the lack of data locality lead to poor performance, because the interaction of MapReduce phases becomes pronounced in the presence of heterogeneous network behavior. To address these problems, we take a two-pronged approach: We first develop a model-driven optimization that serves as an oracle, providing high-level insights. We then apply these insights to design cross-phase optimization techniques that we implement and demonstrate in a real-world MapReduce implementation. Experimental results in both Amazon EC2 and PlanetLab show the potential of these techniques as performance is improved by 7-18 percent depending on the execution environment and application.
KW - Batch processing systems
KW - distributed systems
KW - parallel systems
UR - http://www.scopus.com/inward/record.url?scp=84986550276&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84986550276&partnerID=8YFLogxK
U2 - 10.1109/TCC.2014.2355225
DO - 10.1109/TCC.2014.2355225
M3 - Article
AN - SCOPUS:84986550276
SN - 2168-7161
VL - 4
SP - 293
EP - 306
JO - IEEE Transactions on Cloud Computing
JF - IEEE Transactions on Cloud Computing
IS - 3
M1 - 6893011
ER -