Location-aware replication in virtual Hadoop environment
Abstract
MapReduce is a framework for processing highly distributable tasks across huge datasets using a large number of compute nodes. Hadoop, an implementation of MapReduce, is widely used in industry as a software platform for the distributed processing of big data across a cluster of servers. Virtualizing a Hadoop cluster shows great potential because a virtual cluster is easy to configure and economical to use. With advantages such as rapid provisioning, security, and efficient resource utilization, virtualization can be an effective way to increase the efficiency of a Hadoop cluster. However, data redundancy, which is a critical part of the Hadoop architecture, can be compromised when traditional Hadoop data allocation methods are used. MapReduce, which is known for its I/O-intensive applications, suffers from reduced data redundancy and unbalanced load in a virtual Hadoop cluster. In this research, the authors consider a Hadoop cluster in which multiple virtual machines (VMs) co-exist on several physical machines and analyze the data allocation problem in this virtual environment. The authors also design a file block allocation strategy that is compatible with the native Hadoop data allocation method. This research shows the serious implications of the native Hadoop data redundancy method in such an environment and proposes a new algorithm that corrects the placement of data on the nodes and maintains redundancy in the Hadoop cluster.
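The redundancy problem described in the abstract can be illustrated with a small sketch. It is not the authors' algorithm, which the abstract does not specify. It assumes that redundancy is lost when replicas placed on different VMs end up on the same physical machine, and it uses a simple greedy rule of at most one replica per physical host. The names `physical_redundancy`, `place_replicas`, `vm_to_host`, and `vm_load` are hypothetical and are not part of any Hadoop API.

```python
def physical_redundancy(replica_vms, vm_to_host):
    """Count the distinct physical hosts that hold a replica of one block."""
    return len({vm_to_host[vm] for vm in replica_vms})


def place_replicas(vm_to_host, vm_load, replication=3):
    """Greedy location-aware placement (illustrative assumption only).

    Picks the least-loaded VM on each physical host, then keeps the
    `replication` least-loaded of those. If there are fewer physical hosts
    than requested replicas, fewer VMs are returned.
    """
    best_vm_on_host = {}
    for vm, host in vm_to_host.items():
        current = best_vm_on_host.get(host)
        # Break load ties by VM name so the result is deterministic.
        if current is None or (vm_load[vm], vm) < (vm_load[current], current):
            best_vm_on_host[host] = vm
    candidates = sorted(best_vm_on_host.values(), key=lambda vm: (vm_load[vm], vm))
    return candidates[:replication]


# Hypothetical cluster: five VMs on three physical hosts.
vm_to_host = {
    "vm1": "hostA", "vm2": "hostA", "vm3": "hostA",
    "vm4": "hostB", "vm5": "hostC",
}
# Hypothetical load: number of blocks already stored on each VM.
vm_load = {"vm1": 0, "vm2": 1, "vm3": 2, "vm4": 5, "vm5": 3}

# A placement that sees only VMs may put all three replicas on one host.
naive = ["vm1", "vm2", "vm3"]
aware = place_replicas(vm_to_host, vm_load)

print(physical_redundancy(naive, vm_to_host))  # 1: one host failure loses the block
print(physical_redundancy(aware, vm_to_host))  # 3: replicas survive a host failure
```

In this sketch the naive placement has three replicas but only one physical copy, while the location-aware placement trades some load balance (it uses the heavily loaded `vm4`) for three physically independent copies.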