Computing and Information Systems - Theses

Search Results

  • Item
    Similarity analysis with advanced relationships on big data
    Huang, Jin (2015)
    Similarity analytic techniques such as distance-based joins and regularized learning models are critical tools employed in numerous data mining and machine learning tasks. We focus on two such techniques in the context of large-scale data and distributed clusters. Advanced distance metrics such as the Earth Mover's Distance (EMD) are usually employed to capture the similarity between data dimensions. The high computational cost of EMD calls for a distributed solution, yet it is difficult to achieve a balanced workload given the skewed distribution of the EMDs. We propose efficient bounding techniques and effective workload scheduling strategies on the Hadoop platform to design a scalable solution, named HEADS-Join. We investigate both range joins and top-k joins, and explore different computation paradigms including MapReduce, BSP, and Spark. We conduct comprehensive experiments and confirm that the proposed techniques achieve an order of magnitude speedup over the state-of-the-art MapReduce join algorithms.

    The hypergraph model has been demonstrated to achieve excellent effectiveness in a wide range of applications where high-order relationships are of interest. When processing a large-scale hypergraph, the straightforward approach is to convert it to a graph and reuse the distributed graph frameworks. However, such an approach significantly increases the problem size, incurs excessive replicas due to partitioning, and makes it extremely difficult to achieve a balanced workload. We propose a novel scalable framework, named HyperX, that operates directly on a distributed hypergraph representation and minimizes the number of replicas while still maintaining good workload balance among the distributed machines. We closely investigate the optimization problem of partitioning a hypergraph in the context of distributed computation. With extensive experiments, we confirm that HyperX achieves an order of magnitude improvement over the graph conversion approach in terms of execution time, network communication, and memory consumption.
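To illustrate why converting a hypergraph to a graph can inflate the problem size, the sketch below uses clique expansion, one common conversion in which every hyperedge is replaced by pairwise edges among its vertices. This is only an assumed example of such a conversion; the abstract does not say which conversion the thesis evaluates, and the function name `clique_expand` and the toy data are hypothetical.

```python
from itertools import combinations


def clique_expand(hyperedges):
    """Replace each hyperedge with undirected edges between all pairs of its vertices.

    Assumption: clique expansion stands in for "converting a hypergraph to a
    graph"; it is not necessarily the conversion used in the thesis.
    """
    edges = set()
    for vertices in hyperedges:
        # Sorting makes (u, v) canonical, so shared pairs are stored once.
        for u, v in combinations(sorted(set(vertices)), 2):
            edges.add((u, v))
    return edges


# Two hypothetical hyperedges with 50 and 60 vertices, sharing 10 vertices.
hyperedges = [list(range(0, 50)), list(range(40, 100))]
graph_edges = clique_expand(hyperedges)
print(len(hyperedges), "hyperedges become", len(graph_edges), "graph edges")
```

A hyperedge with n vertices expands to n(n-1)/2 graph edges, so the edge count grows quadratically with hyperedge size, which is consistent with the size increase the abstract describes for the conversion approach.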
  • Item
    Energy-efficient management of virtual machines in data centers for cloud computing
    Beloglazov, Anton (2013)
    Cloud computing has revolutionized the information technology industry by enabling elastic on-demand provisioning of computing resources. The proliferation of Cloud computing has resulted in the establishment of large-scale data centers around the world containing thousands of compute nodes. However, Cloud data centers consume enormous amounts of electrical energy, resulting in high operating costs and carbon dioxide emissions. In 2010, energy consumption by data centers worldwide was estimated to be between 1.1% and 1.5% of global electricity use, and it is expected to grow further. This thesis presents novel techniques, models, algorithms, and software for distributed dynamic consolidation of Virtual Machines (VMs) in Cloud data centers. The goal is to improve the utilization of computing resources and reduce energy consumption under workload-independent quality of service (QoS) constraints. Dynamic VM consolidation leverages fine-grained fluctuations in the application workloads and continuously reallocates VMs using live migration to minimize the number of active physical nodes. Energy consumption is reduced by dynamically deactivating and reactivating physical nodes to meet the current resource demand. The proposed approach is distributed, scalable, and efficient in managing the energy-performance trade-off. The key contributions are:
    - Competitive analysis of dynamic VM consolidation algorithms and proofs of the competitive ratios of optimal online deterministic algorithms for the formulated single VM migration and dynamic VM consolidation problems.
    - A distributed approach to energy-efficient dynamic VM consolidation and several novel heuristics following the proposed approach, which lead to a significant reduction in energy consumption with a limited performance impact, as evaluated by a simulation study using real workload traces.
    - An optimal offline algorithm for the host overload detection problem, as well as a novel Markov chain model that allows a derivation of an optimal randomized control policy under an explicitly specified QoS goal for any known stationary workload and a given state configuration in the online setting.
    - A heuristically adapted host overload detection algorithm for handling unknown non-stationary workloads. The algorithm leads to approximately 88% of the mean inter-migration time produced by the optimal offline algorithm.
    - An open-source implementation of a software framework for distributed dynamic VM consolidation called OpenStack Neat. The framework can be applied both in further research on dynamic VM consolidation and in real OpenStack Cloud deployments to improve the utilization of resources and reduce energy consumption.
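The idea of packing VMs onto as few active physical nodes as possible can be sketched with a generic first-fit decreasing heuristic. This is a textbook bin-packing illustration, not one of the heuristics proposed in the thesis and not the OpenStack Neat API; the function name `consolidate`, the `max_utilization` threshold, and the demand figures are all hypothetical.

```python
def consolidate(vm_demands, host_capacity, max_utilization=0.8):
    """Place VMs on hosts with first-fit decreasing, keeping each host under a cap.

    Assumptions (not from the thesis): one resource dimension (CPU), identical
    hosts, and a fixed utilization cap used as a crude overload safeguard.
    """
    limit = host_capacity * max_utilization
    hosts = []  # each host is the list of VM demands placed on it
    for demand in sorted(vm_demands, reverse=True):
        if demand > limit:
            raise ValueError(f"VM demand {demand} exceeds the per-host limit {limit}")
        for host in hosts:
            if sum(host) + demand <= limit:
                host.append(demand)  # fits on an already active host
                break
        else:
            hosts.append([demand])  # no active host fits, so activate a new one
    return hosts


# Hypothetical CPU demands, in percent of one host's capacity.
placement = consolidate([30, 20, 50, 10, 40], host_capacity=100)
print(len(placement), "active hosts:", placement)
```

Hosts that receive no VMs in such a placement are the ones that could be switched off. A dynamic scheme of the kind the abstract describes would additionally need to re-evaluate placements as workloads fluctuate and to account for the cost of live migrations, which this static sketch ignores.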