CRESP: Towards Optimal Cloud Resource Provisioning for Large Scale Data Intensive Parallel Processing Programs

Data Intensive Analysis and Computing (DIAC) Lab
The Ohio Center of Excellence on Knowledge Enabled Computing (Kno.e.sis Center)
Wright State University

Problem:

With the deployment of web applications, scientific computing, and sensor networks, large datasets are being collected from users, simulations, and the environment. As a flexible and scalable parallel programming and processing model, MapReduce has recently been widely used for processing and analyzing such large-scale datasets. However, data analysts in most companies, research institutes, and government agencies do not have the luxury of access to large private Hadoop clusters. Running Hadoop/MapReduce on top of a public cloud has therefore become a realistic option for most users.

Running a Hadoop cluster on top of a public cloud differs in important ways from running on a private Hadoop cluster. First, a dedicated Hadoop cluster is started on a number of virtual nodes for each job, so there is no multi-user or multi-job resource competition within such a single-job Hadoop cluster. Second, it is now the user's responsibility to set the appropriate number of virtual nodes for the Hadoop cluster; this number may differ from application to application and depends on the amount of input data. To our knowledge, there is no effective method to help the user make this decision.

The problem of optimal resource provisioning involves two intertwined factors: the cost of provisioning the virtual nodes and the time needed to finish the job. Intuitively, with more resources the job finishes in less time; however, resources are provisioned at a cost. It is tricky to find the setting that minimizes the monetary cost, and with additional constraints such as a time deadline or a financial budget, the problem becomes even more complicated. Amazon has developed Elastic MapReduce, which runs on-demand Hadoop/MapReduce clusters on top of Amazon EC2 nodes, but it does not provide tools to address these decision problems.
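
To make these decision problems concrete, they can be sketched abstractly as follows, where T(M, m, r) denotes the predicted job running time for input size M with m Map slots and r Reduce slots, p the price per slot per unit time, D a deadline, and B a budget. This notation is introduced here only for illustration and is not taken verbatim from the publications listed below.

\min_{m,r} \; p\,(m+r)\,T(M,m,r) \quad \text{subject to} \quad T(M,m,r) \le D

\min_{m,r} \; T(M,m,r) \quad \text{subject to} \quad p\,(m+r)\,T(M,m,r) \le B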

 

Proposed Research:

We propose to study the optimal resource provisioning problem based on a MapReduce cost model. (1) Different from existing work on the performance analysis of MapReduce programs, our approach focuses on the relationship among the number of Map/Reduce slots, the amount of input data, and the complexity of application-specific components. The resulting cost model can be represented as a linear model in terms of transformed variables. Linear models provide robust generalization power, which allows us to determine the model parameters with data collected from small-scale tests. (2) Based on this cost model, we formulate the important decision problems as several optimization problems. The resource requirement is mapped to the number of Map/Reduce slots; the financial cost of provisioning resources is the product of the cost function and the number of acquired Map/Reduce slots.
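
As an illustration of how such a linear cost model could be fit and used, the Python sketch below fits a time model on transformed variables using timings from small-scale test runs, then searches slot settings to minimize monetary cost under a deadline. The particular transformed features (M/m, (M/m)log(M/m), M/r, m/r) and the price constant are illustrative assumptions, not the exact CRESP model.

# Minimal sketch (not the project's implementation) of the two steps described above:
# (1) fit a linear time model on transformed variables from small-scale test runs,
# (2) search Map/Reduce slot settings to minimize monetary cost under a deadline.
import math
import numpy as np

PRICE_PER_SLOT_HOUR = 0.10  # hypothetical $/slot/hour, an assumption for illustration

def features(M, m, r):
    """Transformed variables that make the time model linear in its parameters (assumed set)."""
    x = M / m
    return np.array([1.0, x, x * math.log(x), M / r, m / r])

def fit_time_model(samples):
    """samples: list of (M, m, r, measured_seconds) collected from small-scale test runs."""
    X = np.array([features(M, m, r) for M, m, r, _ in samples])
    y = np.array([t for *_, t in samples])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit of the linear model
    return beta

def predict_time(beta, M, m, r):
    return float(features(M, m, r) @ beta)

def cheapest_setting(beta, M, deadline_s, max_slots=200):
    """Grid-search (m, r) to minimize cost = price * slots * hours, subject to time <= deadline."""
    best = None
    for m in range(1, max_slots + 1):
        for r in range(1, max_slots + 1):
            t = predict_time(beta, M, m, r)
            if t <= 0 or t > deadline_s:
                continue
            cost = PRICE_PER_SLOT_HOUR * (m + r) * (t / 3600.0)
            if best is None or cost < best[0]:
                best = (cost, m, r, t)
    return best  # (dollars, map slots, reduce slots, predicted seconds) or None

In practice a user would collect a handful of (input size, slot setting, running time) measurements from small test runs, fit the model once, and then reuse it to answer deadline- or budget-constrained provisioning questions for larger inputs.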

This study will result in the CRESP resource decision wizard, which can be applied to public clouds such as Amazon EC2 or Elastic MapReduce. The research will also be extended to the MPI programming model, which is widely used for compute-intensive scientific computing.

  

Personnel:

Jim Powers (Ph.D. student), Shumin Guo (Ph.D. student), Fengguang Tian (Master's student), Huiqi Xu (Master's student), Keke Chen (Faculty)

 

References:

1. Fengguang Tian, Keke Chen: "Towards Optimal Resource Provisioning for Running MapReduce Programs in Public Clouds", in Proceedings of the IEEE International Conference on Cloud Computing (CLOUD 2011), 2011.

2. Keke Chen, James Powers, Shumin Guo, and Fengguang Tian: "CRESP: Towards Optimal Resource Provisioning for MapReduce Computing in Public Clouds", IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 25, Number 6, 2014.