End-to-End Optimization for Geo-Distributed MapReduce

Benjamin Heintz, Abhishek Chandra, Ramesh K. Sitaraman, Jon Weissman

Research output: Contribution to journal › Article

18 Scopus citations

Abstract

MapReduce has proven remarkably effective for a wide variety of data-intensive applications, but it was designed to run on large single-site homogeneous clusters. Researchers have begun to explore the extent to which the original MapReduce assumptions can be relaxed, including skewed workloads, iterative applications, and heterogeneous computing environments. This paper continues this exploration by applying MapReduce across geo-distributed data over geo-distributed computation resources. Using Hadoop, we show that network and node heterogeneity and the lack of data locality lead to poor performance, because the interaction of MapReduce phases becomes pronounced in the presence of heterogeneous network behavior. To address these problems, we take a two-pronged approach: We first develop a model-driven optimization that serves as an oracle, providing high-level insights. We then apply these insights to design cross-phase optimization techniques that we implement and demonstrate in a real-world MapReduce implementation. Experimental results in both Amazon EC2 and PlanetLab show the potential of these techniques as performance is improved by 7-18 percent depending on the execution environment and application.

Original language: English (US)
Article number: 6893011
Pages (from-to): 293-306
Number of pages: 14
Journal: IEEE Transactions on Cloud Computing
Volume: 4
Issue number: 3
State: Published - Jul 1 2016

Keywords

  • Batch processing systems
  • distributed systems
  • parallel systems
