Erasure codes for optimal performance in geographically distributed storage systems
AuthorJ Mohan, Lakshmi
AffiliationComputing and Information Systems
Document TypePhD thesis
Access StatusOpen Access
© 2018 Dr. Lakshmi J Mohan
The amount of digital data generated is overwhelmingly growing. Big data distributed storage systems will require a large storage backbone to store, process and analyse massive volumes of data. Such systems employ cheap commodity hardware for storage that are failure-prone, attributed to events like maintenance shut downs, software glitches or periodic outages. Replicating data is the default mechanism for ensuring reliability. With the exponential increase in the data that need to be stored, replication no longer serves to be a feasible option given the immense storage overhead incurred. Erasure codes turn out to be an attractive alternative to replication, because they are efficient in terms of storage overhead with flexible design choices and pluggable interfaces. They are also optimal in providing reliability. However, during recovery from node failures, erasure codes demand extra network communication, disk I/O and compute cycles. Spreading stored data out to geographically distributed locations is being adopted as an efficient mechanism to manage disaster recovery and crashes. This requires multiple storage clusters to be connected to the global shared Internet that inter-operate by passing messages, targeting to provide high levels of availability and performance to the user. Consequently, the assumptions about system’s performance parameters like latency, throughput and reliability that were made with the view of a storage cluster operating at a single location, become invalid. This results in changes to the performance of erasure codes when deployed for storage on a geo-distributed environment. There has been extensive research on codes that reduce the communication required during repair and codes that minimise the nodes contacted during repair, but code performance in a geo-distributed setting still remains an area to be explored further. This thesis addresses this challenge, exploring the usage of erasure codes in geo-diverse settings and providing methods for better performing codes during node failures. It provides effective strategies for improved repair performance in coded geo-distributed storage centres and proposes a novel optimisation problem formulation supplemented with simulation and actual implementation results to reduce average repair time in such storage centres. A study of erasure coded deployments is performed, experimenting with popular codes to understand their behaviour during repair on a real geo-diverse storage cluster setting. The results demonstrate that during repair process initiated upon node failures, popular codes do not perform in geo-diverse systems as reported in literature. Specifically, this thesis makes the following main contributions: Strategies for enhancing repair performance of codes in geo-distributed storage: Arguments and hypothesis that adapt to better performing erasure codes during node failures in a geo-distributed setting are presented in the thesis. An XOR-based code augmented with two simple techniques in Hadoop storage setting is presented that leads to better performance during repairs supported by actual implementation on a cluster with nodes spanning various locations across Australia. The thesis then presents a heuristic termed code aware blocks placement, in order to place the blocks of erasure coded storage onto data nodes in the storage system. Motivated by the idea of locality, it does intelligent placement based on the code structure, validated with implementation results on a code that helps in reducing repair time in a geo-distributed cluster. This code aware placement strategy is new and could be applied to all classes of local codes. Optimization model for repair efficient erasure code deployments in geo-distributed storage centres: Motivated by the observation that better placement of blocks lead to better repair performance from the proposed heuristic, the thesis takes a different route, exploring the application of optimisation theory in approaching the problem. The default blocks placement being sub-optimal, the thesis contributes in recognising that by placing the blocks appropriately, the repair process can be optimized for performance. It proposes a novel framework based on geometric programming to achieve optimal block placement with the objective of minimising the average single-block repair cost. The problem formulation, notations and assumptions thereof are presented, along with simulation and implementation results that validate the framework. A set of enhancements that make the presented model work with other practical scenarios is also presented.
Keywordsdistributed storage; reliability; erasure codes; Hadoop erasure coding; optimization; block placement
- Click on "Export Reference in RIS Format" and choose "open with... Endnote".
- Click on "Export Reference in RIS Format". Login to Refworks, go to References => Import References