Thursday, August 20, 2009

HPC Cloud platforms






Industry analysts have made bullish projections on how Cloud computing will transform the entire computing industry. As the computing industry shifts toward providing Platform as a Service (PaaS) and Software as a Service (SaaS) for consumers and enterprises to access on demand regardless of time and location, there will be an increase in the number of Cloud platforms available.

Recently, several academic and industrial organizations have started investigating and developing technologies and infrastructure for Cloud computing. Academic efforts include Virtual Workspaces and OpenNebula.


Here we compare three representative Cloud platforms with industry linkages that provide MapReduce as a platform.








Amazon Elastic MapReduce

Elastic MapReduce is a web service that makes it easy for researchers, data analysts, and developers to efficiently and cost-effectively process vast amounts of data using the Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (Amazon S3). Whether you already have an end-to-end data processing job flow or only a data set, Elastic MapReduce lets you focus on analyzing your data instead of the mechanics of the processing, including managing a cluster of computers in a complex distributed software development environment.
All Elastic MapReduce customers can use Amazon's web console or command-line interface to execute most of the functionality available in the Elastic MapReduce API. Alternatively, developers can programmatically access the distributed processing power of the Elastic MapReduce API to process large data sets using Hadoop.



Manjrasoft’s Aneka MapReduce
GRIDS Lab Aneka, which is being commercialized through Manjrasoft, is a .NET-based service-oriented platform for constructing enterprise Grids. It is designed to support multiple application models, persistence and security solutions, and communication protocols, such that the preferred selection can be changed at any time without affecting an existing Aneka ecosystem. To create an enterprise Grid, the service provider only needs to start an instance of the configurable Aneka container, hosting the required services, on each selected desktop computer. The container initializes those services and acts as the single point of interaction with the rest of the enterprise Grid.









MapReduce.NET, Aneka's MapReduce implementation, comprises four components:

Manager: the manager acts as the agent of a MapReduce computation. It submits applications to the MapReduce scheduler and collects the final results after execution completes successfully.

Scheduler: after users submit MapReduce.NET applications to the scheduler, it maps subtasks to available resources. During execution, it monitors the progress of each task and migrates tasks when some nodes are much slower than others due to heterogeneity.

Executor: each executor waits for task-execution commands from the scheduler. For a Map task, the input data is normally stored locally; otherwise, the executor fetches it from neighboring nodes. For a Reduce task, the executor has to fetch all of its input and merge it before execution. The executor also monitors the progress of the executing task and periodically reports it to the scheduler.

Storage: the storage component of MapReduce.NET provides a distributed storage service over the .NET platform. It organizes the disk space on all available resources as a virtual storage pool and exposes an object-based interface with a flat namespace for managing the data stored in it.
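The scheduler's locality-aware placement described above can be illustrated with a small Python sketch. This is a hypothetical simplification, not Aneka's .NET API: `schedule` and its parameters are names I made up. Each Map task is placed on the node holding its input data when that node has capacity, and falls back to the least-loaded executor otherwise (where the executor would fetch the data from a neighbor).

```python
def schedule(tasks, executors):
    """Assign tasks to executors, preferring data locality.

    tasks: list of (task_id, data_node) pairs, where data_node holds the input.
    executors: dict mapping node name -> task capacity.
    Returns a dict mapping task_id -> chosen node.
    """
    assignment = {}
    load = {node: 0 for node in executors}
    for task_id, data_node in tasks:
        if data_node in load and load[data_node] < executors[data_node]:
            # Local execution: the input data is already on this node.
            node = data_node
        else:
            # Fallback: least-loaded executor; it must fetch data from a neighbor.
            node = min(load, key=lambda n: (load[n], n))
        assignment[task_id] = node
        load[node] += 1
    return assignment

# Two tasks have data on node A (capacity 1), one on node B (capacity 2).
print(schedule([("t1", "A"), ("t2", "A"), ("t3", "B")], {"A": 1, "B": 2}))
```

With node A full after the first task, the second A-local task spills to B, while the B-local task stays local.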
Google MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as the original paper (Dean and Ghemawat, 2004) shows.
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
Google's implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented, and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
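The map/shuffle/reduce model described above can be sketched in a few lines of Python. This is a local, single-process illustration of the programming model, not Google's distributed C++ implementation; the function names (`map_fn`, `reduce_fn`, `run_mapreduce`) are my own. The classic word-count example: map emits (word, 1) pairs, the shuffle groups values by key, and reduce sums each group.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit an intermediate (word, 1) pair for each word in the document.
    for word in text.split():
        yield word, 1

def reduce_fn(word, values):
    # Merge all intermediate values associated with the same key.
    yield word, sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle phase: group intermediate values by intermediate key.
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    # Reduce phase: one reduce call per distinct intermediate key.
    results = {}
    for k, vs in sorted(groups.items()):
        for out_k, out_v in reduce_fn(k, vs):
            results[out_k] = out_v
    return results

docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
# → {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

The run-time system in the real implementation handles what this sketch glosses over: partitioning the input, scheduling across machines, fault tolerance, and inter-machine communication.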







