Data replication in data intensive scientific applications with performance guarantee
Abstract
Data replication is widely adopted in data-intensive scientific applications to reduce data file transfer time and bandwidth consumption. However, the problem of data replication in Data Grids, an enabling technology for data-intensive applications, has been proven to be NP-hard and even non-approximable. Previous research in this field is either theoretical investigation without practical consideration or heuristics-based work with little or no theoretical grounding. In this paper, we propose a data replication algorithm that not only has a provable theoretical performance guarantee but can also be implemented in a distributed and practical manner. Specifically, we design a replication technique whose reduction in total job execution time is at least half of the reduction achieved by the optimal solution. Our centralized replication algorithm is amenable to distributed implementation and can be readily adopted in a distributed environment such as the Data Grid. We have performed extensive simulations to validate the proposed replication algorithms. Using our own simulator, we show that the centralized greedy replication algorithm performs comparably to the optimal algorithm under different network parameters. Using GridSim, a popular distributed Grid simulator, we demonstrate that the distributed replication technique significantly outperforms an existing replication technique; moreover, it adapts better to dynamic changes in file access patterns in Data Grids.
Description
Thesis (M.S.)--Wichita State University, College of Engineering, Dept. of Electrical Engineering and Computer Science