Computing and Information Systems - Theses

  • Item
    Cost-efficient resource provisioning for large-scale graph processing systems in cloud computing environments
    Heidari, Safiollah ( 2018)
    A large amount of the data generated on the Internet every day is in the form of graphs. Many services and applications, such as social networks, the Internet of Things (IoT), mobile applications and business applications, fall into this category: every data entity can be considered a vertex, and the relationships between entities form the edges of a graph. Since 2010, dedicated large-scale graph processing frameworks have been developed to overcome the inefficiency of traditional processing solutions such as MapReduce. However, most of these frameworks are designed to run on high performance computing (HPC) clusters, which are available only to those who can afford such infrastructure. Cloud computing is a newer computing paradigm that offers features such as scalability, elasticity and a pay-as-you-go billing model, and is accessible to everyone. Nevertheless, the advantages that cloud computing can bring to the architecture of large-scale graph processing systems have received comparatively little study. Resource provisioning and management is a critical part of any processing system in cloud environments. To provision the right amount of resources for a particular operation, several factors such as monetary cost, throughput, scalability and network performance can be taken into consideration. In this thesis, we investigate and propose novel solutions and algorithms for cost-efficient resource provisioning for large-scale graph processing systems. The outcome is a series of research works that increase the performance of such processing by making it aware of the operating environment, while significantly decreasing the dollar cost. In particular, we made the following contributions:
    1. We introduced iGiraph, a cost-efficient framework for processing large-scale graphs on public clouds. iGiraph also provides a new categorization of graph algorithms and processes graphs accordingly.
    2. To demonstrate the impact of the network on processing in cloud environments, we developed two network-aware algorithms that exploit network factors such as traffic and bandwidth, as well as computation power.
    3. We developed an auto-scaling technique that takes advantage of resource heterogeneity on clouds (a small cost-model sketch follows this abstract).
    4. We introduced a large-scale graph processing service for clouds that takes service level agreement (SLA) requirements into account; the service can handle multiple processing requests through a new prioritization and provisioning approach.
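    The sketch below is a minimal, hypothetical illustration of the kind of cost-aware provisioning decision described above: given heterogeneous cloud VM types with assumed prices and measured throughputs, pick the cheapest configuration that can finish the remaining graph workload within a deadline. The VM names, prices and throughput figures are invented for illustration and are not taken from the thesis.

      # Illustrative sketch only; not iGiraph's actual provisioning code.
      from dataclasses import dataclass
      import math

      @dataclass
      class VmType:
          name: str
          hourly_price: float        # dollars per hour (hypothetical)
          vertices_per_hour: float   # measured processing throughput (hypothetical)

      def cheapest_feasible(vm_types, remaining_vertices, deadline_hours):
          """Return (vm_type, count, cost) for the lowest-cost configuration
          that can process `remaining_vertices` within `deadline_hours`."""
          best = None
          for vm in vm_types:
              capacity_per_machine = vm.vertices_per_hour * deadline_hours
              needed = max(1, math.ceil(remaining_vertices / capacity_per_machine))
              cost = needed * vm.hourly_price * deadline_hours
              if best is None or cost < best[2]:
                  best = (vm, needed, cost)
          return best

      if __name__ == "__main__":
          fleet = [VmType("small", 0.05, 2e6), VmType("large", 0.20, 9e6)]
          vm, count, cost = cheapest_feasible(fleet, remaining_vertices=5e7, deadline_hours=2)
          print(f"use {count} x {vm.name} for approximately ${cost:.2f}")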
  • Item
    Similarity analysis with advanced relationships on big data
    Huang, Jin ( 2015)
    Similarity analysis techniques such as distance-based joins and regularized learning models are critical tools in numerous data mining and machine learning tasks. We focus on two such techniques in the context of large-scale data and distributed clusters. Advanced distance metrics such as the Earth Mover's Distance (EMD) are often employed to capture the similarity between data records. The high computational cost of the EMD calls for a distributed solution, yet it is difficult to achieve a balanced workload given the skewed distribution of EMD values. We propose efficient bounding techniques and effective workload scheduling strategies on the Hadoop platform to build a scalable solution named HEADS-Join. We investigate both range joins and top-k joins, and explore different computation paradigms including MapReduce, BSP and Spark. Comprehensive experiments confirm that the proposed techniques achieve an order-of-magnitude speedup over state-of-the-art MapReduce join algorithms. The hypergraph model has been shown to be highly effective in a wide range of applications where high-order relationships are of interest. When processing a large-scale hypergraph, the straightforward approach is to convert it to a graph and reuse existing distributed graph frameworks. However, this approach significantly increases the problem size, incurs excessive replicas due to partitioning, and makes it extremely difficult to achieve a balanced workload. We propose a novel scalable framework, named HyperX, that operates directly on a distributed hypergraph representation and minimizes the number of replicas while still maintaining a good workload balance among the distributed machines. We also closely investigate the optimization problem of partitioning a hypergraph for distributed computation. Extensive experiments confirm that HyperX achieves an order-of-magnitude improvement over the graph-conversion approach in terms of execution time, network communication and memory consumption.
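    To make the filter-and-refine idea behind bounding-based EMD joins concrete, the following self-contained sketch (not the thesis code) prunes candidate pairs with a cheap lower bound before computing the exact distance. It is restricted to one-dimensional normalised histograms on unit-spaced bins, where the EMD equals the L1 distance between the cumulative distributions and the absolute difference of means is a valid lower bound; the data and threshold are assumptions for illustration.

      # Illustrative filter-and-refine EMD range join; not the HEADS-Join implementation.

      def emd_1d(p, q):
          """Exact EMD for 1-D normalised histograms on unit-spaced bins:
          the sum of absolute differences between the two running CDFs."""
          total, cdf_gap = 0.0, 0.0
          for pi, qi in zip(p, q):
              cdf_gap += pi - qi
              total += abs(cdf_gap)
          return total

      def mean_lower_bound(p, q):
          """|E_p[X] - E_q[X]| never exceeds EMD(p, q), so pruning with it is safe."""
          mean_p = sum(i * w for i, w in enumerate(p))
          mean_q = sum(i * w for i, w in enumerate(q))
          return abs(mean_p - mean_q)

      def emd_range_join(histograms, threshold):
          """Return index pairs whose EMD is at most `threshold`, filtering first."""
          results = []
          for i in range(len(histograms)):
              for j in range(i + 1, len(histograms)):
                  p, q = histograms[i], histograms[j]
                  if mean_lower_bound(p, q) > threshold:
                      continue                          # pruned without computing the exact EMD
                  if emd_1d(p, q) <= threshold:         # refine the surviving pairs
                      results.append((i, j))
          return results

      print(emd_range_join([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [1.0, 0.0, 0.0]], threshold=1.0))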
  • Item
    Anomaly detection in participatory sensing networks
    MONAZAM ERFANI, SARAH ( 2015)
    Anomaly detection or outlier detection aims to identify unusual values in a given dataset. In particular, there is growing interest in collaborative anomaly detection, where multiple data sources submit their data to an online data mining service in order to detect anomalies with respect to the wider population. By combining data from multiple sources, collaborative anomaly detection aims to improve detection accuracy through the construction of a more robust model of normal behaviour. Cloud-based collaborative architectures such as Participatory Sensing Networks (PSNs) provide an open distributed platform that enables participants to share and analyse their local data on a large scale. Two major issues with collaborative anomaly detection are how to ensure the privacy of participants’ data, and how to efficiently analyse the large-scale high-dimensional data collected in these networks. The first problem we address is data privacy in PSNs: we introduce a framework for privacy-preserving collaborative anomaly detection with efficient local data perturbation at participating nodes and global processing of the perturbed records at a data mining server. The data perturbation scheme that we propose enables the participants to perturb their data independently, without requiring the cooperation of other parties. As a result, our privacy-preservation approach is scalable to large numbers of participants and is computationally efficient. By collecting the participants’ data, the PSN server can generate a global anomaly detection model from these locally perturbed records. The global model identifies interesting measurements or unusual patterns in participants’ data without revealing the true values of the measurements. In terms of privacy, the proposed scheme thwarts several major types of attacks, namely Independent Component Analysis (ICA), distance-inference, Maximum a Posteriori (MAP) and collusion attacks. We further improve the privacy of our data perturbation scheme by: (i) redesigning the nonlinear transformation to better defend against MAP estimation attacks for normal and anomalous records, and (ii) supporting individual random linear transformations for each participant in order to provide the system with greater resistance to malicious collusion. A notable advantage of our perturbation scheme is that it preserves participants’ privacy while achieving accuracy comparable to non-privacy-preserving anomaly detection techniques.
    The second problem we address in the thesis is how to model and interpret the large volumes of high-dimensional data generated in participatory domains, using One-class Support Vector Machines (1SVMs). While 1SVMs are effective at producing decision surfaces for anomaly detection from well-behaved feature vectors, they can be inefficient at modelling the variations in large, high-dimensional datasets. We overcome this challenge by taking two different approaches. The first approach is an unsupervised hybrid architecture, in which a Deep Belief Network (DBN) is used to extract generic underlying features, in combination with a 1SVM that uses the features learned by the DBN. DBNs have important advantages as feature detectors for anomaly detection, as they use unlabelled data to capture higher-order correlations among features. Furthermore, using a DBN to reduce the number of irrelevant and redundant features improves the scalability of a 1SVM for use with large training datasets containing high-dimensional records. Our hybrid approach is able to generate an accurate anomaly detection model with lower computational and memory complexity than a 1SVM on its own. Alternatively, to overcome the shortcomings of 1SVMs in processing high-dimensional datasets, in our second approach we calculate a lower-rank approximation of the optimisation problem that underlies the 1SVM training task. Instead of performing the optimisation in a high-dimensional space, the optimisation is conducted in a space of reduced dimension but over a larger neighbourhood. We leverage the theory of nonlinear random projections and propose the Reduced 1SVM (R1SVM), an efficient and scalable anomaly detection technique that can be trained on large-scale datasets. The main idea of R1SVM is to replace a nonlinear kernel machine with the combination of randomised features and a linear machine. In summary, we have proposed efficient privacy-preserving anomaly detection approaches for PSNs, and scalable data modelling approaches for high-dimensional datasets, which lower the computational and memory complexity compared to traditional anomaly detection techniques. We have shown that the proposed methods achieve higher or comparable accuracy in detecting anomalies compared to existing state-of-the-art techniques.
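    The following sketch is a rough, hypothetical illustration of the random-feature idea described above: map the data through nonlinear random Fourier features and train a linear one-class SVM in the reduced space, rather than a kernel machine in the original high-dimensional space. It uses NumPy and scikit-learn's OneClassSVM with a linear kernel as a stand-in; the feature dimensions, gamma and nu values, and the synthetic data are assumptions, not the thesis' settings or the R1SVM implementation.

      # Rough sketch of random features + a linear one-class SVM for anomaly detection.
      import numpy as np
      from sklearn.svm import OneClassSVM

      rng = np.random.default_rng(0)

      def make_random_fourier_map(n_features, n_components=128, gamma=0.1):
          """Return a map x -> sqrt(2/D) * cos(Wx + b) approximating an RBF kernel."""
          W = rng.normal(scale=np.sqrt(2 * gamma), size=(n_features, n_components))
          b = rng.uniform(0, 2 * np.pi, size=n_components)
          return lambda X: np.sqrt(2.0 / n_components) * np.cos(X @ W + b)

      # Hypothetical data: "normal" training points, test points with a few outliers.
      X_train = rng.normal(size=(1000, 50))
      X_test = np.vstack([rng.normal(size=(20, 50)),
                          rng.normal(loc=6.0, size=(5, 50))])

      phi = make_random_fourier_map(n_features=50)      # fixed once, reused for all data
      detector = OneClassSVM(kernel="linear", nu=0.05).fit(phi(X_train))
      print(detector.predict(phi(X_test)))              # +1 = normal, -1 = anomaly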
  • Item
    Towards high-dimensional classification using Michigan-style genetic-based machine learning
    ABEDINI, MANI ( 2013)
    High-dimensional classification problems arise frequently in many areas of science and engineering. Examples include disease classification based on gene expression profiles, document classification, image recognition and fraud detection. Machine learning algorithms are widely used to tackle such problems. Typically, statistical techniques or artificial intelligence methods are used to build models that can learn from the data, so that new observations can subsequently be classified as belonging to a particular category of the data set. Genetic-based machine learning (GBML), which combines evolutionary algorithms and reinforcement learning, has been successfully applied in a wide range of complex problem-solving and classification tasks. The eXtended Classifier System (XCS), a well-known GBML technique, evolves a population of classifiers in the form of condition-action rules that can be used for classification problems. A key step in XCS’s iterative learning process is the rule discovery component, which creates new classifiers to be added to the bounded population. Unfortunately, population-based approaches are inherently slow when faced with large-scale problems. This may in part be attributed to the additional cost associated with maintaining relatively large populations of classifiers, often encapsulating multiple features. Consequently, few studies have examined the application of XCS in high-dimensional cases. The over-arching aim of this thesis is to develop new GBML classification models, based on an underlying XCS architecture, for high-dimensional classification problems. The objective is to improve the performance measured in terms of accuracy, population diversity and execution costs. To do this, three alternative approaches have been proposed and evaluated.
    In the first approach, we use “feature quality” to guide the XCS rule discovery process. Here, a pre-processing phase is used to extract feature quality information. Subsequently, the feature quality information is used to bias the evolutionary operators via a “gene mask” (a toy sketch of this gene-mask bias follows the abstract). Detailed numerical simulation experiments encapsulating a number of alternative feature ranking methods, parameter settings and benchmark data sets suggest that our proposed model can improve classification performance on data sets with large numbers of features.
    In the second approach, a hybrid multi-population ensemble classifier inspired by co-evolutionary algorithms and ensemble learning is used to improve the baseline-XCS performance. In this model, separate sub-populations are evolved in parallel using a binary gene mask to guide the evolutionary operators. Sub-population outputs are weighted in a typical ensemble style to produce an output. As is typical in parallel evolutionary models of this form, we investigate alternative “migration” methods between sub-populations as well as specific mechanisms to adapt the gene mask within a sub-population. Numerical simulation results show that the multi-population model has superior performance to the single-population model. Significantly, by combining feature quality information and the hybrid co-evolutionary XCS, superior performance can be achieved across the benchmark data sets.
    The focus of the third approach is different from the previous two, in that the emphasis is on investigating techniques to reduce the execution time of the proposed GBML models for high-dimensional, complex classification tasks using alternative parallel and distributed architectures. We start by comparing deployment techniques and the mapping of the multi-population GBML models onto parallel or distributed programming libraries, before listing the advantages and disadvantages of each approach. This part of the thesis concludes with a preliminary study investigating the use of cloud computing infrastructures to run our proposed model.
    Large-scale classification problems pose many challenges for traditional machine learning algorithms. In this thesis, we have proposed novel extensions to XCS to meet such challenges. Our contributions are based on the development of informed rule discovery mechanisms that employ gene masks, co-evolving sub-populations, and parallel and distributed architectures. The experimental results demonstrate that the proposed models have better performance in terms of accuracy, execution costs and promoting population diversity across the data sets considered: benchmark low-dimensional data sets, synthetic data sets and high-dimensional gene expression profile data sets. There are further opportunities to extend the proposed models in interesting ways, including examining feature interactions more closely and exploring the effectiveness of alternative evolutionary and statistical models.
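    As a toy illustration of the gene-mask idea from the first approach (referenced above), the sketch below biases an XCS-style mutation of a ternary condition toward features that a pre-processing step ranked as high quality, keeping low-quality features general. The quality scores, mask size and mutation rate are invented for illustration; this is not the thesis implementation.

      # Toy gene-mask-biased mutation for XCS-style ternary conditions ({'0', '1', '#'}).
      import random

      def gene_mask(feature_quality, top_k):
          """Indices of the top_k highest-quality features (from a pre-processing step)."""
          ranked = sorted(range(len(feature_quality)),
                          key=lambda i: feature_quality[i], reverse=True)
          return set(ranked[:top_k])

      def masked_mutation(condition, instance, mask, mu=0.2, rng=random):
          """Mutate a condition, specialising only the masked (high-quality) genes."""
          child = list(condition)
          for i in range(len(child)):
              if rng.random() >= mu:
                  continue                       # this gene is not mutated
              if i in mask:
                  # specialise toward the current training instance, or re-generalise
                  child[i] = instance[i] if child[i] == '#' else '#'
              else:
                  child[i] = '#'                 # low-quality feature stays general
          return ''.join(child)

      quality = [0.9, 0.1, 0.8, 0.05, 0.3]       # hypothetical feature-quality scores
      mask = gene_mask(quality, top_k=2)         # -> {0, 2}
      print(masked_mutation('#####', '10110', mask))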