Computing and Information Systems - Theses


Search Results

Now showing 1 - 10 of 27
  • Item
    Anomaly detection in streaming data from air quality monitoring system
    Cong, Yue ( 2015)
    Detection of abnormalities is an important aspect of air quality monitoring. Wireless Sensor Networks (WSNs) provide a flexible and low-cost solution for air quality monitoring. However, given the limited power, memory and computational resources available in these networks, obtaining a high anomaly detection rate while prolonging the life span of the network is a challenging task. In recent years, both parametric and non-parametric algorithms have been put forward to tackle this challenge, and researchers have been investigating iterative detection algorithms in order to save energy and memory. In this thesis, we propose a new, efficient parametric iterative algorithm in which the cumulative sum of the likelihood ratio is calculated and compared with a manually defined control limit. We evaluate the effectiveness of the proposed algorithm on both synthetic data and real sensor data and compare it with a recently proposed algorithm. In the evaluation on synthetic data, we design experimental cases that reflect real environments and set out principles for choosing between the two algorithms in practice. In the evaluation on real data, we analyse and discuss the results and compare the effectiveness and efficiency of the two algorithms.
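    The scheme described above is essentially a cumulative-sum (CUSUM) test on log-likelihood ratios. The sketch below is a minimal single-node illustration of that idea, not the thesis's algorithm: the Gaussian models, the parameters mu0, mu1 and sigma, and the control_limit value are all assumptions chosen for the example.

```python
import math

def cusum_llr(stream, mu0=0.0, mu1=3.0, sigma=1.0, control_limit=5.0):
    """One-sided CUSUM: accumulate log-likelihood ratios between an
    'anomalous' Gaussian N(mu1, sigma) and a 'normal' Gaussian N(mu0, sigma),
    clip at zero, and raise an alarm when the sum exceeds the control limit."""
    s = 0.0
    alarms = []
    for i, x in enumerate(stream):
        # log-likelihood ratio log p1(x)/p0(x) for two Gaussians with equal variance
        llr = (mu1 - mu0) * (x - (mu0 + mu1) / 2.0) / sigma ** 2
        s = max(0.0, s + llr)          # cumulative sum, clipped at zero
        if s > control_limit:
            alarms.append(i)           # observation i looks anomalous
            s = 0.0                    # restart after signalling
    return alarms

if __name__ == "__main__":
    data = [0.1, -0.2, 0.3, 0.0, 3.1, 2.8, 3.3, 0.2, -0.1]
    print(cusum_llr(data))   # indices flagged once the shifted values accumulate
```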
  • Item
    Anomaly detection in data streams: challenges and techniques
    Salehi, Mahsa ( 2015)
    Anomaly detection in data streams plays a vital role in online data mining applications, such as network intrusion detection, environmental monitoring and road traffic analysis. However, there are significant challenges with anomaly detection in streaming environments, and in this thesis we propose effective and efficient techniques to address these challenges. A major challenge for anomaly detection in these applications is the dynamically changing nature of the monitoring environments. This causes a problem for traditional anomaly detection techniques, which assume a relatively static monitoring environment and hence construct a static model of normal behaviour as the basis for anomaly detection. However, in an environment that is intermittently changing (known as a switching data stream), such an approach can yield a high error rate in terms of false positives. To cope with the challenge of dynamic environments, we require an approach that can learn from the history of normal behaviour in a data stream, while accounting for the fact that not all time periods in the past are equally relevant. Consequently, to address this problem we first propose a relevance-weighted ensemble model for learning normal behaviour, which forms the basis of our anomaly detection scheme (see the sketch after this item). The second challenge for anomaly detection in data streams is the high rate of incoming observations. Since traditional anomaly detection approaches require multiple passes over datasets, they are not applicable to data streams: processing each observation multiple times is not feasible due to the unbounded amount of data generated at a high rate. The advantage of our proposed relevance-weighted ensemble model is that it can improve detection accuracy by making use of relevant history while remaining computationally efficient, thus addressing both of these major challenges. We then propose two ensemble-based approaches for anomaly detection, called Biased SubSampling (BSS) and Diversity-based Biased SubSampling (DBSS), which improve the detection accuracy of each ensemble detector on the one hand and induce diversity among the detectors on the other, both in an unsupervised manner. We discuss the effectiveness of our approaches in terms of the bias-variance trade-off. Such an approach is effective in improving outlier detection accuracy and can potentially be used on streaming data. With the growing need to analyse high-speed data streams, the task of anomaly detection becomes even more challenging, as traditional anomaly detection techniques can no longer assume that all the data can be stored for processing. This motivates our third major challenge in anomaly detection in data streams, i.e., the unbounded quantity of data points. To address this challenge we propose a memory-efficient incremental local outlier detection algorithm for data streams called MiLOF, and a more flexible version called MiLOF_F, which achieve accuracy close to that of the incremental local outlier factor (iLOF) algorithm, a well-known density-based outlier detection algorithm, but within a fixed memory bound and with lower time complexity. Hence, our proposed methods are well suited to application environments with limited memory (e.g., wireless sensor networks) and can be applied to high-volume data streams. In addition, MiLOF_F is robust to changes in the number of data points, the number of underlying clusters and the number of dimensions in the data stream.
While a variety of these approaches have been widely used in supervised learning tasks, all of the approaches in this thesis provide novel contributions through the use of ensemble techniques and clustering models for anomaly detection in data streams in an unsupervised manner, the setting found in many real applications. Finally, the thesis concludes with a case study of car-racing driver distraction using the EEG brain signals of drivers, as an example of a potential application for anomaly detection in data streams.
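    As a rough reading of the relevance-weighted ensemble idea (not the formulation used in the thesis), the sketch below builds one crude model of normal behaviour per past window, weights each model by how closely it resembles the current window, and scores current points against the weighted ensemble. The model form (window mean and spread), the exponential relevance weighting and the bandwidth parameter are illustrative assumptions.

```python
import numpy as np

def relevance_weighted_scores(windows, current, bandwidth=1.0):
    """Score points in `current` against an ensemble of models of normal
    behaviour, one per past window. Each past model (here simply the window
    mean and its spread) is weighted by how relevant it is to the current
    window, so history that resembles the present counts more."""
    models = [(w.mean(axis=0), w.std()) for w in windows]   # one crude model per past window
    cur_mean = current.mean(axis=0)
    # relevance of each past model to the current window (closer => more relevant)
    weights = np.array([np.exp(-np.linalg.norm(m - cur_mean) / bandwidth) for m, _ in models])
    weights /= weights.sum()
    scores = np.zeros(len(current))
    for (m, r), w in zip(models, weights):
        # distance of each current point to this model of "normal", in units of its spread
        scores += w * (np.linalg.norm(current - m, axis=1) / (r + 1e-9))
    return scores   # larger => more anomalous

rng = np.random.default_rng(0)
past = [rng.normal(loc=i * 0.1, size=(200, 2)) for i in range(5)]   # 5 past windows
now = np.vstack([rng.normal(size=(50, 2)), [[6.0, 6.0]]])           # one obvious outlier
print(relevance_weighted_scores(past, now).argmax())                # index of the outlier (50)
```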
  • Item
    Improving the effectiveness of information sharing during nursing handover
    Alturki, Nazik Mohammad ( 2015)
    During handover in clinical settings, nurses share critical information associated with the planning, delivery and evaluation of patient care. Effective sharing of this handover information is vital to ensure continuity and safety of patient care. However, handover information sharing has been described as challenging, problematic and often ineffective. Recently suggested artefacts to improve the effectiveness of handover information sharing include the use of handover sheets and/or the Electronic Patient Record (EPR). However, these artefacts have not received adequate attention from researchers, while the majority of studies on nursing handover lack a holistic view of the handover information sharing activity. Thus, the broad aim of this study was to understand how the effectiveness of information sharing during nursing handover can be improved. To achieve this aim, this study investigated the range of information sharing problems that hinder effective handover information sharing; the respective roles of handover sheets and the EPR in facilitating effective handover information sharing; and the impact of other handover elements (excluding handover artefacts) on the effectiveness of handover information sharing. This study was designed as a qualitative study that employed a multiple-case study method. Twelve units, distributed across three hospitals located in Riyadh (Saudi Arabia), participated in this study. Data collection techniques included semi-structured interviews, observations of handover practices, and examination of handover artefacts. This study applied two theories: Activity Theory (AT) (Engeström, 1987) and Distributed Cognition theory (DCog) (Hutchins, 1995). AT formed the theoretical basis of analysis in this study by providing a broad and holistic perspective on the handover information sharing activity, through analysis of how the effectiveness of handover information sharing is influenced by different handover elements. The complementary application of a second theory, DCog, was useful for examining the role, content and properties of handover artefacts that facilitate handover information sharing. DCog also aided in examining the complex cognitive interactions between primary nurses and members of the nursing community and hence deepened the understanding of the distribution of knowledge among them. The findings of this study highlight the unique information sharing problems that nurses experience during handover and provide a deeper understanding of the impact and role of handover sheets and the EPR on the effectiveness of handover information sharing. In addition, the findings provide deep insight into the role of handover rules in regulating nurses’ actions during handover. The findings further identify the different nursing community members who play instrumental roles in nursing handover, and how the division of labour of these members influences the effectiveness of handover information sharing. The insights gained from this study led to the specification of nine key propositions that fulfil this study’s aim of understanding how the effectiveness of handover information sharing between nurses can be improved. Thus, a range of contributions to theory and practice is offered by this study.
In terms of its theoretical contributions, this study identifies and explores the four key handover elements that influence the effectiveness of handover information sharing: handover artefacts, handover rules, the nursing community members who assist nurses during handover meetings, and the division of labour between these members. Furthermore, the study extends the analysis of information sharing problems that nurses experience during handover by applying AT’s concept of secondary contradictions. This provided a systematic and holistic mechanism for revealing the root causes of handover information sharing problems. In terms of its practical contributions, this study provides insights into the ways in which the design of handover sheets can be improved and offers a better understanding of the EPR’s advantages and limitations with respect to handover information sharing practices.
  • Item
    Design and adjustment of dependency measures
    Romano, Simone ( 2015)
    Dependency measures are fundamental for a number of important applications in data mining and machine learning. They are ubiquitously used: for feature selection, for clustering comparison and validation, as splitting criteria in random forests, and to infer biological networks, to name a few. More generally, there are three important applications of dependency measures: detection, quantification, and ranking of dependencies. Dependency measures are estimated on finite data sets, and because of this the tasks above become challenging. This thesis proposes a series of contributions to improve performance on each of these three tasks. When differentiating between strong and weak relationships using information-theoretic measures, the variance plays an important role: the higher the variance, the lower the chance of correctly ranking the relationships. In this thesis, we discuss the design of a dependency measure based on the normalized mutual information whose estimation is based on many random discretization grids. This approach allows us to reduce its estimation variance. We show that a small estimation variance for the grid estimator of mutual information is beneficial for achieving higher power when detecting dependencies between variables and when ranking different noisy dependencies. Dependency measure estimates can be high by chance when the sample size is small, e.g. because of missing values, or when the dependency is estimated between categorical variables with many categories. These biases cause problems when the dependency must have an interpretable quantification and when ranking dependencies for feature selection. In this thesis, we formalize a framework to adjust dependency measures in order to correct for these biases. We apply our adjustments to existing dependency measures between variables and show how to achieve better interpretability in quantification. For example, when a dependency measure is used to quantify the amount of noise in functional dependencies between variables, we experimentally demonstrate that adjusted measures have a more interpretable range of variation. Moreover, we demonstrate that our approach is also effective for ranking attributes during the splitting procedure in random forests, where a dependency measure between categorical variables is employed. Finally, we apply our framework of adjustments to dependency measures between clusterings. In this scenario, we are able to analytically compute our adjustments. We propose a number of adjusted clustering comparison measures which reduce to well-known adjusted measures as special cases. This allows us to propose guidelines for the best applications of our measures as well as for existing ones for which guidelines are missing in the literature, e.g. for the Adjusted Rand Index (ARI).
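    The adjustment-for-chance pattern referred to above is commonly written as (measure - expected measure under a null of independence) / (maximum measure - expected measure). A minimal sketch of that pattern for mutual information between categorical variables is given below; estimating the expectation by Monte Carlo permutation and bounding the maximum by min(H(x), H(y)) are assumptions made for illustration, whereas the thesis derives such adjustments analytically for clustering comparison measures.

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of mutual information (in nats) between two
    categorical variables given as arrays of labels."""
    xs, ys = np.unique(x, return_inverse=True)[1], np.unique(y, return_inverse=True)[1]
    joint = np.zeros((xs.max() + 1, ys.max() + 1))
    np.add.at(joint, (xs, ys), 1)
    joint /= joint.sum()
    px, py = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def adjusted_for_chance(x, y, n_perm=200, seed=0):
    """Generic adjustment-for-chance scheme:
    (measure - expected measure under random permutations) / (max - expected).
    The expectation is estimated by Monte Carlo permutation of y; the maximum
    is bounded here by min(H(x), H(y))."""
    rng = np.random.default_rng(seed)
    mi = mutual_information(x, y)
    expected = np.mean([mutual_information(x, rng.permutation(y)) for _ in range(n_perm)])
    h = lambda v: mutual_information(v, v)          # H(v) = I(v; v)
    max_mi = min(h(x), h(y))
    return (mi - expected) / (max_mi - expected + 1e-12)

x = np.repeat([0, 1, 2], 30)
print(adjusted_for_chance(x, x))                                        # ~1: perfectly dependent
print(adjusted_for_chance(x, np.random.default_rng(1).permutation(x)))  # ~0: chance level
```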
  • Item
    Similarity analysis with advanced relationships on big data
    Huang, Jin ( 2015)
    Similarity analytic techniques such as distance based joins and regularized learning models are critical tools employed in numerous data mining and machine learning tasks. We focus on two typical such techniques in the context of large scale data and distributed clusters. Advanced distance metrics such as the Earth Mover's Distance (EMD) are usually employed to capture the similarity between data dimensions. The high computational cost of EMD calls for a distributed solution, yet it is difficult to achieve a balanced workloads given the skewed distribution of the EMDs. We propose efficient bounding techniques and effective workload scheduling strategies on the Hadoop platform to design a scalable solution, named HEADS-Join. We investigate both the range joins and the top-k joins, and explore different computation paradigms including MapReduce, BSP, and Spark. We conduct comprehensive experiments and confirm that the proposed techniques achieve an order of magnitude speedup over the state-of-the-art MapReduce join algorithms. The hypergraph model is demonstrated to achieve excellent effectiveness in a wide range of applications where high-order relationships are of interest. When processing a large scale hypergraph, the straightforward approach is to convert it to a graph and reuse the distributed graph frameworks. However, such an approach significantly increases the problem size, incurs excessive replicas due to partitioning, and renders it extremely difficult to achieve a balanced workloads. We propose a novel scalable framework, named HyperX, to directly operate on a distributed hypergraph representation and minimize the numbers of replicas while still maintain a great workload balance among the distributed machines. We closely investigate an optimization problem of partitioning a hypergraph in the context of distributed computation. With extensive experiments, we confirm that HyperX achieve an order of magnitude improvement over the graph conversion approach in terms of the execution time, network communication, and memory consumption.
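    A minimal, single-machine illustration of the filter-and-verify idea behind bounding techniques for EMD joins (not HEADS-Join itself, and without the Hadoop/Spark scheduling): for 1-D histograms, the absolute difference of the means never exceeds the true EMD, so it can safely prune candidate pairs in a range join before the exact distance is computed. The choice of bound and the threshold value are assumptions for the example.

```python
import numpy as np

def emd_1d(p, q):
    """Exact Earth Mover's Distance between two histograms defined on the
    same unit-spaced 1-D bins: the L1 norm of the difference of their CDFs."""
    return float(np.abs(np.cumsum(p - q)).sum())

def emd_lower_bound(p, q):
    """Cheap lower bound on the 1-D EMD: |difference of the means|
    (the identity map is 1-Lipschitz, so this never exceeds the true EMD)."""
    bins = np.arange(len(p))
    return float(abs(bins @ p - bins @ q))

def range_join(records, threshold):
    """Filter-and-verify range join: prune a pair with the cheap bound and
    compute the expensive exact EMD only for pairs that survive."""
    results = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            p, q = records[i], records[j]
            if emd_lower_bound(p, q) > threshold:
                continue                      # safely pruned, no exact EMD needed
            if emd_1d(p, q) <= threshold:
                results.append((i, j))
    return results

rng = np.random.default_rng(0)
hists = rng.random((20, 8))
hists /= hists.sum(axis=1, keepdims=True)     # normalise each histogram
print(range_join(hists, threshold=0.5))
```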
  • Item
    A white matter lesion load prediction model using retinal micro vascular features
    Roy, Pallab Kanti ( 2015)
    A white matter lesion (WML) is a prominent cerebral abnormality that indicates the death of white matter in the human brain. Recent studies have shown a significant correlation between brain WML and diseases such as stroke, dementia and Parkinson's disease. Early diagnosis of WML is important for preventing these life-threatening diseases. While Magnetic Resonance Imaging (MRI) of the brain is frequently used to diagnose WML volume, it is impractical for regular screening of a patient because of its high cost and limited availability. Thus, early diagnosis of the WML volume/load becomes extremely difficult, especially in rural and remote areas and in developing countries. Research studies have shown that changes in the retinal microvascular system reflect changes in the cerebral microvascular system. Therefore, we propose a retinal image-based WML volume and severity prediction model that is convenient and easy to operate. Our model can aid physicians in detecting patients who need immediate MRI screening for a detailed diagnosis of WML. The proposed model uses quantified measurements of retinal microvascular signs, such as arteriovenous nicking (AVN) and focal arteriolar narrowing (FAN), as input, estimates the WML volume/load, and classifies its severity. The main contribution of this research project is a novel and accurate WML volume and severity prediction model. Our model uses quantified retinal microvascular signs such as FAN and AVN to predict WML volume. Besides the WML prediction model, our research also contributes novel MRI and colour retinal image analysis algorithms, such as an automatic WML segmentation method and robust FAN and AVN quantification methods. We evaluated the proposed model on a dataset of 111 patients chosen from the ENVISion study, which holds retinal and MRI images for every patient. Our model shows a high degree of accuracy in estimating the WML volume. The mean square error (MSE) between our predicted WML load and the manually annotated WML load is 0.15. Moreover, the proposed method obtains an F1 score of 0.76 and an AUC of 0.80 in classifying patients with mild versus severe WML load. The results indicate that our retinal image-based WML prediction model can help physicians identify patients who need immediate MRI screening for further diagnosis of WML load.
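    A hedged sketch of the kind of two-stage pipeline the abstract describes: a regression model estimating WML volume from quantified retinal signs, followed by a classifier separating mild from severe load. The synthetic data, the random-forest model family and the scikit-learn API below are assumptions for illustration; they are not the model or dataset used in the thesis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, f1_score, roc_auc_score

# Synthetic stand-in data: columns play the role of quantified retinal signs
# (e.g. an AVN grade and a FAN grade); the real study uses fundus-image measurements.
rng = np.random.default_rng(0)
X = rng.random((111, 2))                                   # 111 patients, 2 retinal features
wml_volume = 2.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0, 0.2, 111)
severe = (wml_volume > np.median(wml_volume)).astype(int)  # mild vs severe label

X_tr, X_te, y_tr, y_te, s_tr, s_te = train_test_split(
    X, wml_volume, severe, test_size=0.3, random_state=0)

# Regression stage: predict WML volume/load from the retinal features.
reg = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
print("MSE:", mean_squared_error(y_te, reg.predict(X_te)))

# Classification stage: flag patients with severe WML load for MRI referral.
clf = RandomForestClassifier(random_state=0).fit(X_tr, s_tr)
print("F1:", f1_score(s_te, clf.predict(X_te)))
print("AUC:", roc_auc_score(s_te, clf.predict_proba(X_te)[:, 1]))
```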
  • Item
    Recommendation systems for travel destination and departure time
    Xue, Yuan ( 2015)
    People travel on a daily basis to various local destinations such as the office, home, restaurants, appointment venues, and sightseeing spots. A positive experience and high efficiency in daily travel are vital to most people. Motivated by this observation, this research strives to provide daily-travel-related recommendations by solving two optimisation problems: driving destination prediction and departure time recommendation for appointments. Our “SubSyn” destination prediction algorithm predicts potential destinations in real time for drivers on the road. Its applications include recommending sightseeing places, pushing targeted advertisements, and providing early warnings for road congestion. It employs a Bayesian inference framework and a second-order Markov model to compute a list of high-probability destinations. The key contributions include real-time processing and the ability to predict destinations with a very limited amount of training data. We also look into the problem of privacy protection against such prediction. The “iTIME” departure time recommendation system is a smart calendar that reminds users to depart in order to arrive at appointment venues on time. It also suggests the best transport mode based on users’ travel history and preferences. Currently, it is very inefficient for people to manually and repeatedly check the departure time and compare all transport modes using, for instance, Google Maps. The functionalities of iTIME were realised by machine learning algorithms that learn users’ habits, analyse the importance of appointments and the optimal mode of transport, and estimate the start location and travel time. Our field study showed that up to 40% of time can be saved by using iTIME. The system can also be extended easily to provide additional functionalities such as detection of clashing appointments and appointment scheduling, both taking into account the predicted start location and travel time of future appointments. Both problems can be categorised as recommender systems (or recommendation systems) that provide insightful suggestions in order to improve daily-travel experiences and efficiency.
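    A schematic of Bayesian destination prediction with second-order (consecutive cell-pair) evidence, in the spirit of, but not identical to, SubSyn: destinations are scored by their prior probability times the likelihood of the cell pairs driven so far. The grid-cell trips, the add-alpha smoothing and treating a trip's last cell as its destination are illustrative assumptions.

```python
from collections import defaultdict

def train(trips):
    """Estimate P(pair | destination) and the destination prior P(destination)
    from historical trips (each trip is a list of grid-cell ids; its last cell
    is treated as the destination)."""
    pair_given_dest = defaultdict(lambda: defaultdict(int))
    dest_count = defaultdict(int)
    for trip in trips:
        dest = trip[-1]
        dest_count[dest] += 1
        for a, b in zip(trip, trip[1:]):          # second-order evidence: consecutive cell pairs
            pair_given_dest[dest][(a, b)] += 1
    return pair_given_dest, dest_count

def predict(partial_trip, pair_given_dest, dest_count, alpha=1.0):
    """Bayesian posterior over destinations given the cells driven so far:
    P(d | trip) is proportional to P(d) times the product of P(pair | d),
    with crude add-alpha smoothing."""
    total_trips = sum(dest_count.values())
    posterior = {}
    for dest, n_d in dest_count.items():
        p = n_d / total_trips                     # prior P(d)
        pairs = pair_given_dest[dest]
        n_pairs = sum(pairs.values())
        for a, b in zip(partial_trip, partial_trip[1:]):
            p *= (pairs[(a, b)] + alpha) / (n_pairs + alpha * 100)
        posterior[dest] = p
    z = sum(posterior.values()) or 1.0
    return sorted(((d, p / z) for d, p in posterior.items()), key=lambda t: -t[1])

trips = [["A", "B", "C", "D"], ["A", "B", "C", "D"], ["A", "B", "E", "F"]]
model = train(trips)
print(predict(["A", "B", "C"], *model))    # destination D should rank first
```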
  • Item
    Brokering techniques for managing three-tier applications in distributed cloud computing environments
    Grozev, Nikolay ( 2015)
    Cloud computing is a model of acquiring and using preconfigured IT resources on demand. Cloud providers build and maintain large data centres and lease their resources to customers in a pay-as-you-go manner. This enables organisations to focus on their core lines of business instead of building and managing in-house infrastructure. Such in-house IT facilities can often be either under- or over-utilised given dynamic and unpredictable workloads. The cloud model resolves this problem by allowing organisations to flexibly resize/scale their rented infrastructure in response to demand. The confluence of these incentives has caused the recent widespread adoption of cloud services. However, cloud adoption has introduced challenges in terms of service unavailability, regulatory compliance, low network latency to end users, and vendor lock-in. These factors are of special importance for large-scale interactive web-facing applications, which experience unpredictable workload spikes and need to serve users worldwide with low latency. The utilisation of multiple cloud sites (i.e. a Multi-Cloud) has emerged as a promising solution. Multi-Cloud approaches also facilitate cost reduction by taking advantage of the diverging prices in different cloud sites. The 3-Tier architectural model is the de facto standard approach for building interactive web systems. It divides an application into three tiers: (i) the presentation tier, which implements the user interfaces; (ii) the domain tier, which implements the core business logic; and (iii) the data tier, which manages persistent storage. This logical division most often leads to deployment separation as well. This thesis investigates dynamic approaches for workload distribution and resource provisioning (a.k.a. brokering) of 3-Tier applications in a Multi-Cloud environment. It advances the field by making the following key contributions:
    1. A performance model and a simulator for 3-Tier applications in one and multiple clouds.
    2. A system architecture for brokering 3-Tier applications across clouds, which considers latency, availability, and regulatory requirements and minimises the overall operational costs.
    3. An approach for Virtual Machine (VM) type selection that reduces the total cost within a cloud site. It uses online machine learning techniques to address the variability of both the application requirements and the capacity of the underlying resources.
    4. A rule-based domain-specific model for regulatory requirements, which can be interpreted by a rule inference engine.
    5. Design and implementation of a workload redirection system that directs end users to individual cloud sites in a Multi-Cloud environment (a toy sketch of this brokering decision follows the list).
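    A toy sketch of the entry-point brokering decision touched on in contributions 2, 4 and 5, not the thesis's architecture: cloud sites are filtered by hypothetical regulatory rules and a latency bound, and the cheapest eligible site is chosen. The CloudSite fields, the RULES table and the 100 ms bound are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class CloudSite:
    name: str
    jurisdiction: str      # where the data centre is located
    latency_ms: float      # measured latency to the incoming user
    cost_per_hour: float   # price of the VM type the broker would use

# Hypothetical regulatory rules: a predicate per user class saying which
# jurisdictions may serve that user (a stand-in for a rule inference engine).
RULES = {
    "eu_citizen": lambda site: site.jurisdiction in {"DE", "NL", "FR"},
    "default":    lambda site: True,
}

def choose_site(sites, user_class, max_latency_ms=100.0):
    """Entry-point brokering decision: keep only sites allowed by the
    regulatory rules and within the latency bound, then minimise cost."""
    rule = RULES.get(user_class, RULES["default"])
    eligible = [s for s in sites if rule(s) and s.latency_ms <= max_latency_ms]
    if not eligible:
        raise RuntimeError("no cloud site satisfies the constraints")
    return min(eligible, key=lambda s: s.cost_per_hour)

sites = [CloudSite("frankfurt", "DE", 40, 0.12),
         CloudSite("sydney", "AU", 35, 0.10),
         CloudSite("amsterdam", "NL", 55, 0.11)]
print(choose_site(sites, "eu_citizen").name)   # amsterdam: cheapest eligible EU site
```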
  • Item
    Automatic memory management techniques for the go programming language
    Davis, Matthew ( 2015)
    Memory management is a complicated task. Many programming languages expose such complexities directly to the programmer. For instance, languages such as C or C++ require the programmer to explicitly allocate and reclaim dynamic memory. This opens the door to many software bugs (e.g., memory leaks and null pointer dereferences) which can cause a program to crash. Automated memory management techniques were introduced to relieve programmers from managing such complexities. Two automated techniques are garbage collection and region-based memory management. The more common technique, garbage collection, is primarily driven by a runtime analysis (e.g., scanning live memory and reclaiming the bits that are no longer reachable from the program), whereas the less common region-based technique performs a static analysis during compilation and determines program points where the compiler can insert memory reclaim operations. Each option has its drawbacks. In the case of garbage collection, scanning memory at runtime can be computationally expensive and often requires the program to halt execution during this stage. In contrast, region-based methods often require objects to remain resident in memory longer than they would under garbage collection, resulting in less than optimal use of a system’s resources. This thesis investigates the less common form of automated memory management (region-based) within the context of the relatively new concurrent language Go. We also investigate combining both techniques, in a new way, with the hope of achieving the benefits of a combined system without the drawbacks that each automated technique has alone. We conclude this work by applying our region-based system to a concurrent processing environment.
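    The region idea can be illustrated, very loosely, even in a garbage-collected language: everything allocated for a task is tied to one region and becomes reclaimable at a single program point when the region ends. The Python sketch below is only an analogy for that lifetime discipline (Python's runtime still performs the actual reclamation); it is not Go code and not the system built in the thesis.

```python
class Region:
    """Minimal arena/region illustration: objects allocated through a region
    live until the region itself is closed, at which point everything in it
    becomes reclaimable at a single, statically known program point (here the
    end of a `with` block). The point is the lifetime discipline, not a real allocator."""
    def __init__(self):
        self._objects = []

    def alloc(self, factory, *args, **kwargs):
        obj = factory(*args, **kwargs)
        self._objects.append(obj)      # the region keeps the only strong reference
        return obj

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self._objects.clear()          # one bulk reclaim point instead of per-object tracking

def process_request(payload):
    with Region() as r:                # everything allocated for this request dies together
        buf = r.alloc(bytearray, 4096)
        parts = r.alloc(list, payload.split(","))
        return len(buf), len(parts)

print(process_request("a,b,c"))        # (4096, 3)
```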
  • Item
    Answer set programming: founded bounds and model counting
    Aziz, Rehan Abdul ( 2015)
    Answer Set Programming (ASP) is a powerful modelling formalism that is very efficient for solving combinatorial problems. This work extends the state-of-the-art in theory and practice of ASP. It has two parts. The first part looks at the intersection of ASP and Constraint Programming and proposes theory and algorithms for implementing Bound Founded ASP, which is a generalization of both ASP and CP. The second part discusses model counting in the context of ASP.