Computing and Information Systems - Theses

  • Efficient Stateful Computations in Distributed Stream Processing
    Jayasekara, Weragoda Achchillage Sachini (2020)
    Stream processing is used in a plethora of applications that deal with high volumes and varieties of data. The focus on scalable and efficient stream processing solutions has been increasing due to the vast number of time-sensitive applications, such as electronic trading and fraud detection, that have low-latency requirements. While stream processing systems were originally envisaged to use only stateless computations, the use of stateful computations has grown to accommodate a greater range of complex stream processing applications in various domains. Unlike stateless computations, supporting stateful computations requires addressing new challenges, including state distribution to achieve scalability and state sharing among resources in a distributed environment. For instance, most stateful computations have synchronization requirements that must be satisfied to guarantee the correctness of the results. Therefore, efficient mechanisms that support scalable state distribution and state sharing while ensuring correctness are needed to satisfy the low-latency requirements of stream processing applications. Moreover, a fault-tolerance mechanism to recover state after failures is essential functionality for supporting stateful computations, and minimizing the overhead imposed by the fault-tolerance mechanism is another challenge associated with stateful computations. This thesis first focuses on providing models to support complex stateful use cases. Windows, which partition the continuous input streams expected in streaming applications, are a main component of stream processing systems. The existing models that define window semantics do not represent use cases that have a hierarchy of window stages; therefore, we propose a generic model for stream processing that supports a hierarchical approach to windowing.
    Then, we propose a communication model to support iterative computations, one of the most common types of stateful computation. Due to communication restrictions that limit the ways to share the state of iterative computations, existing approaches to representing iterative computations have limitations in terms of scalability and efficiency. We address these scalability issues and provide an efficient way to share the state of iterative computations in a distributed environment. We demonstrate that our model can support different iterative algorithms with complex communication patterns, and we show the scalability and high performance of the proposed model compared to the traditional approaches used for constructing iterative streaming applications. For example, our model outperforms existing state-of-the-art solutions by 72% in throughput and 65% in latency in some cases. Next, we investigate checkpointing, the most common fault-tolerance approach used by existing systems, and address how to minimize the overhead imposed by the checkpointing process. Using a theoretical model, we derive an expression for the optimal checkpoint interval that gives the maximum system utilization, and we validate the model using a set of simulations. To the best of our knowledge, this is the first theoretical optimization framework for stream processing systems that use a global checkpointing approach. Our model yields an elegant expression for the optimal checkpoint interval, interestingly showing that it depends only on the checkpoint cost and the failure rate of the system. Next, we apply the derived optimal checkpoint interval to real-world streaming applications and demonstrate that the theoretical optimum can improve the performance of practical applications.
    We demonstrate that our theoretical optimal checkpoint interval can achieve utilization improvements of 10-200% for failure rates ranging from 0.3 failures per hour to 0.075 failures per minute, compared to the default checkpoint interval of 30 minutes used by most systems. Moreover, we show that the optimal interval results in lower latency and higher throughput, with a 54% throughput increase and a 58% latency decrease in some cases. Then, we investigate the multi-level checkpointing approach, which was introduced to address the inefficiencies of single-level checkpointing, and derive the optimal checkpointing parameters that minimize the overhead of the multi-level checkpointing process. This work is the first to present a theoretical framework for determining optimal parameter settings in a multi-level global checkpointing system that uses a single periodic checkpoint interval. We demonstrate that our solution outperforms existing single-level optimizations in terms of utilization by as much as 36% in some cases.
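The abstract does not reproduce the derived expression, but its stated property (dependence only on checkpoint cost and failure rate) matches the classic first-order approximation due to Young, T_opt ≈ sqrt(2C/λ). A minimal sketch of computing such an interval, using Young's formula as a stand-in for the thesis's own expression:

```python
import math

def optimal_checkpoint_interval(checkpoint_cost_s: float,
                                failures_per_s: float) -> float:
    # Young's first-order approximation: T_opt = sqrt(2 * C / lambda).
    # A stand-in here, not the thesis's expression; it shares the dependence
    # the abstract describes (checkpoint cost and failure rate only).
    return math.sqrt(2.0 * checkpoint_cost_s / failures_per_s)

# Example: a 10 s checkpoint cost and 0.3 failures per hour.
lam = 0.3 / 3600.0
t_opt = optimal_checkpoint_interval(10.0, lam)
print(f"optimal interval ~ {t_opt:.0f} s ({t_opt / 60:.1f} min)")
# Far shorter than a fixed 30-minute default, so less work is lost per failure.
```

Note how a higher failure rate or a cheaper checkpoint both push the optimal interval down, which is the intuition behind the utilization gains reported above.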
  • Robust resource management in distributed stream processing systems
    Liu, Xunyun (2018)
    Stream processing is an emerging in-memory computing paradigm that ingests dynamic data streams with a process-once-arrival strategy. It yields real-time insights by applying continuous queries over data in motion, giving birth to a wide range of time-critical applications such as fraud detection, algorithmic trading, and health surveillance. Resource management is an integral part of the deployment process to ensure that the stream processing system meets the requirements articulated in the Service Level Agreement (SLA). It involves the construction of the system deployment stack over distributed resources, as well as its continuous adjustment to deal with the constantly changing runtime environment and the fluctuating workload. However, most existing resource management techniques are optimised towards a pre-configured deployment platform, and thus face a variety of challenges in resource provisioning, operator parallelisation, task scheduling, and state management to realise robustness, i.e. maintaining a certain level of performance and reliability guarantees with minimum resource costs. In this thesis, we investigate novel techniques and solutions for robust resource management to tackle the challenges arising from the cloud deployment of stream processing systems. The outcome is a series of research works that incorporate SLA-awareness into the resource management process and relieve developers of the burden of monitoring, analysing, and rectifying the performance and reliability problems encountered during execution. Specifically, we have advanced the state of the art by making the following contributions:
    • A stepwise profiling and controlling framework that improves application performance by automatically scaling up the parallelism degree of streaming operators. It also ensures proper resource allocation between data sources and data sinks to avoid processing backlogs and starvation.
    • A resource-efficient scheduler that monitors the application execution, models the resource consumption, and consolidates the task placement to improve cost efficiency without causing resource contention.
    • A replication-based state management framework that masks state loss in the cases of node crashes and JVM failures, and reduces the fault-tolerance overhead by eliminating the need for synchronisation with a remote state storage.
    • A performance-oriented deployment framework that conducts iterative scaling of the streaming application to reach its pre-defined targets on throughput and latency, regardless of the initial amount of resource allocation.
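The last contribution describes iterative scaling toward a pre-defined throughput target. A hypothetical sketch of such a control loop, with illustrative names that are not taken from the thesis:

```python
def scale_to_target(measure_tps, set_parallelism, target_tps, max_parallelism=64):
    """Increase operator parallelism until measured throughput meets the target.

    `measure_tps` and `set_parallelism` stand in for the profiling and
    redeployment hooks a real framework would provide (illustrative names).
    """
    parallelism = 1
    while parallelism <= max_parallelism:
        set_parallelism(parallelism)
        if measure_tps() >= target_tps:
            return parallelism
        parallelism *= 2  # geometric search keeps the number of trial deployments low
    raise RuntimeError("target throughput unreachable within the parallelism budget")

# Toy model: each parallel operator instance contributes ~150 tuples/s.
state = {"p": 1}
print(scale_to_target(lambda: state["p"] * 150,
                      lambda p: state.update(p=p),
                      target_tps=1000))  # reaches the target at parallelism 8
```

The key property, matching the abstract, is that the loop converges to the targets regardless of how much parallelism the initial deployment was given.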
  • Robust and fault-tolerant scheduling for scientific workflows in cloud computing environments
    Chandrashekar, Deepak Poola (2015)
    Cloud environments offer low-cost computing resources as a subscription-based service. These resources are elastically scalable and dynamically provisioned. Furthermore, cloud providers have pioneered new pricing models that allow users to provision resources and use them efficiently with significant cost reductions. As a result, scientific workflows are increasingly adopting cloud computing. Scientific workflows are used to model applications involving high-throughput computation and complex large-scale data analysis. However, existing work on workflow scheduling in the context of clouds focuses on either deadline or cost optimization, ignoring the necessity for robustness. The cloud is not a utopian environment: failures are inevitable in such large, complex distributed systems, and it is well studied that cloud resources experience fluctuations in delivered performance. Therefore, robust and fault-tolerant scheduling that handles performance variations of cloud resources and failures in the environment is essential in the context of clouds. This thesis presents novel workflow scheduling heuristics that are robust against performance variations and fault-tolerant towards failures. Here, we present and evaluate static and just-in-time heuristics using multiple fault-tolerance techniques. We use the different pricing models offered by cloud providers and propose schedules that are fault-tolerant and at the same time minimize time and cost. We also propose resource selection policies and bidding strategies for spot instances. The proposed heuristics are constrained by deadline, budget, or both. These heuristics are evaluated with prominent state-of-the-art scientific workflows. Finally, we have also developed a multi-cloud framework for the Cloudbus workflow management system, which has matured through years of research and development at the CLOUDS Lab at the University of Melbourne.
    This multi-cloud framework is demonstrated with a private and a public cloud using an astronomy workflow that creates a mosaic of astronomical images. In summary, this thesis provides effective fault-tolerant scheduling heuristics for workflows on cloud computing platforms, such that performance variations and failures can be mitigated whilst minimizing cost and time.
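A recurring fault-tolerance pattern in this setting is to attempt a task on cheap but revocable spot resources and fall back to reliable on-demand resources to protect the deadline. The sketch below is a hypothetical illustration of that trade-off, not the thesis's actual heuristics:

```python
def run_with_fallback(task, run_on_spot, run_on_demand, max_spot_attempts=2):
    """Try a low-cost spot resource first; fall back to on-demand.

    Retrying on spot saves money, while the on-demand fallback bounds the
    risk of missing the task's deadline after repeated spot failures.
    All names here are illustrative, not from the thesis.
    """
    for _ in range(max_spot_attempts):
        ok, result = run_on_spot(task)
        if ok:
            return result, "spot"
    return run_on_demand(task), "on-demand"

# Toy run: every spot attempt fails (e.g. an out-of-bid revocation), so the
# task completes on an on-demand instance instead.
result, where = run_with_fallback("mosaic-tile",
                                  run_on_spot=lambda t: (False, None),
                                  run_on_demand=lambda t: f"{t}: done")
print(where)  # on-demand
```

Choosing `max_spot_attempts` is itself a cost-versus-deadline decision, which is the kind of trade-off the scheduling heuristics above navigate.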
  • Resource provisioning in spot market-based cloud computing environments
    Voorsluys, William (2014)
    Recently, cloud computing providers have started offering unused computational resources in the form of dynamically priced virtual machines (VMs), also known as "spot instances". In spite of the apparent economic advantage, an intermittent nature is inherent to these biddable resources, which may cause VM unavailability. When an out-of-bid situation occurs, i.e. the current spot price rises above the user's maximum bid, spot instances are terminated by the provider without prior notice. This thesis presents a study on employing spot instances as a means of executing computational jobs on cloud resources. We start by proposing a resource management and job scheduling policy, named SpotRMS, which addresses the problem of running deadline-constrained compute-intensive jobs on a pool of low-cost spot instances, while also exploiting variations in price and performance to run applications in a fast and economical way. This policy relies on job runtime estimations to decide which types of spot instances are best suited to each job and when jobs should run. It is able to minimise monetary spending and ensure that jobs finish within their deadlines. We also propose an improvement to SpotRMS that addresses the problem of running compute-intensive jobs on a pool of intermittent virtual machines, again aiming to run applications in a fast and economical way. To mitigate potential unavailability periods, a multifaceted fault-aware resource provisioning policy is proposed. Our solution employs price and runtime estimation mechanisms, as well as three fault-tolerance techniques, namely checkpointing, task duplication, and migration. As a further improvement, we equip SpotRMS with prediction-assisted resource provisioning and bidding strategies. Our results demonstrate that both cost savings and strict adherence to deadlines can be achieved when the policy mechanisms are properly combined and tuned.
    In particular, the fault-tolerance mechanism that employs migration of VM state provides superior results on virtually all metrics. Finally, we employ a statistical model of spot price dynamics to artificially generate price patterns of varying volatility. We then analyse how SpotRMS performs in environments with highly variable price levels and more frequent changes. Fault tolerance is shown to be even more crucial in such scenarios.
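The out-of-bid condition described above is simple to state: revocation happens at the first price tick that exceeds the user's maximum bid. A minimal sketch with made-up prices:

```python
def first_out_of_bid(price_trace, max_bid):
    """Return the index of the first spot price above the bid, or None."""
    for i, price in enumerate(price_trace):
        if price > max_bid:
            return i  # the provider revokes the instance here, without notice
    return None

# Made-up spot prices in $/h; a bid of $0.045 is overtaken at tick 3.
trace = [0.031, 0.033, 0.040, 0.052, 0.038]
print(first_out_of_bid(trace, max_bid=0.045))  # 3
```

More volatile price traces cross any fixed bid sooner and more often, which is why fault tolerance matters more in the high-volatility scenarios generated above.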