Integrated replication and scheduling in Data Grids with performance guarantee
A Data Grid consists of geographically distributed computing and storage resources that are used in large-scale scientific applications such as high energy physics, bioinformatics, and climate modeling. Job scheduling and data replication are two well-known techniques for boosting the performance of a Data Grid. Prior research has integrated the two techniques to improve performance, but most of that work is heuristic based. In that work, data replication is used to minimize the file transfer time, and thus the total job execution time across all sites, while scheduling is used to minimize the maximum job execution time among all sites (the so-called makespan). We propose to use both data replication and job scheduling to minimize the total job execution time in a Data Grid, and we formulate this as the Data Replication and Job Scheduling Problem. Unlike previous work, our problem seamlessly integrates both techniques into a single framework. The problem is NP-hard. We first propose a Job Scheduling and Data Replication algorithm whose performance is theoretically provable and whose time complexity is dramatically lower than that of the optimal algorithm. We then design a series of heuristic algorithms that further reduce the time complexity of the Job Scheduling and Data Replication algorithm. Using simulations, we demonstrate that the heuristic algorithms perform comparably to the Job Scheduling and Data Replication algorithm.
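To make the distinction between the two objectives concrete, the following minimal sketch evaluates the total job execution time and the makespan for a given schedule and replica placement. The cost model is an assumption made only for illustration and is not the formulation used in the paper: each job needs one file, a job pays a fixed transfer time when its file is not stored at the site that runs it, and jobs at a site run one after another. The function name `evaluate` and all data values are hypothetical.

```python
from collections import defaultdict


def evaluate(jobs, schedule, replicas, transfer_time):
    """Return (total_time, makespan) for one schedule and replica placement.

    jobs:          job name -> (required file, compute time)
    schedule:      job name -> site that runs the job
    replicas:      site -> set of files stored at that site
    transfer_time: file -> time to fetch the file from a remote site
    (Illustrative cost model only; not taken from the paper.)
    """
    per_site = defaultdict(float)
    for job, (needed_file, compute_time) in jobs.items():
        site = schedule[job]
        # A job pays the transfer time only if its file is not local.
        is_local = needed_file in replicas.get(site, set())
        fetch = 0.0 if is_local else transfer_time[needed_file]
        per_site[site] += fetch + compute_time
    total_time = sum(per_site.values())             # summed over all sites
    makespan = max(per_site.values(), default=0.0)  # busiest site
    return total_time, makespan


# Hypothetical instance: three jobs all read file f1, stored only at site A.
jobs = {"j1": ("f1", 4.0), "j2": ("f1", 4.0), "j3": ("f1", 4.0)}
transfer_time = {"f1": 1.0}
all_at_a = {"j1": "A", "j2": "A", "j3": "A"}
spread = {"j1": "A", "j2": "A", "j3": "B"}

no_replica = {"A": {"f1"}, "B": set()}
with_replica = {"A": {"f1"}, "B": {"f1"}}

result_all_at_a = evaluate(jobs, all_at_a, no_replica, transfer_time)
result_spread = evaluate(jobs, spread, no_replica, transfer_time)
result_replicated = evaluate(jobs, spread, with_replica, transfer_time)
```

In this toy instance, moving one job to site B lowers the makespan but raises the total execution time, because that job must now fetch its file. Replicating the file at site B removes the transfer and recovers the lower total. This is the kind of interaction between scheduling and replication that motivates treating the two together.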