Computing and Information Systems - Theses

Now showing 1 - 10 of 97
  • Item
    Gossip-based Asynchronous Aggregation Protocols for Unstructured P2P Systems
    Rao, Imran Ahmad (2017)
    The decentralized nature of Peer-to-Peer (P2P) networks has proven efficient and effective in providing scalable solutions for the implementation of large-scale distributed systems. However, this decentralization also poses significant challenges in resource discovery and management. To deploy and manage P2P networks efficiently, system administrators may need to identify the aggregate capabilities of all the nodes in the system from a global perspective. For example, for efficient scheduling of jobs, they may need to locate the least loaded nodes in the system. To execute such global functions, which compute the aggregate capabilities of nodes, P2P systems require decentralized and distributed protocols that work without the coordination of a central mediator. For these reasons, gossip-based protocols have emerged as a popular choice for computing aggregates in large-scale P2P systems. In a gossip-based push-pull aggregation protocol, each node, at a given frequency, exchanges its local information with one of its neighbor nodes. As a result of this exchange, both nodes update their local estimate of the global aggregate. These locally computed estimates at individual nodes asymptotically converge to a constant, provided that the network topology remains connected and the system mass is conserved. In existing aggregation protocols, the accuracy and convergence of the estimated aggregate at local nodes depend heavily upon the synchronization of aggregation rounds. Synchronization is not trivial to achieve and maintain in large-scale distributed P2P systems due to a number of factors such as differing process execution speeds, message transmission delays, and clock drift. Moreover, nodes joining and departing the system at random make it even harder to keep aggregation rounds synchronized. In this thesis, we investigate the synchronization behavior of popular existing gossip-based aggregation protocols. Through detailed simulations, we evaluate the impact of asynchrony on the accuracy and the diffusion speed of these protocols. We propose a number of push-pull aggregation protocols that improve accuracy in the presence of asynchrony, compare them with existing protocols, and list their respective pros and cons. Based upon these results, we identify the challenges in efficiently computing aggregates in the presence of communication delays and asynchrony. In particular, we identify the scenarios in a synchronous push-pull protocol that cause loss of system mass, and we measure this loss. We then propose a novel push-pull gossip-style aggregation protocol, called LAP, which addresses the above-mentioned issues and computes the system aggregate efficiently and accurately. This protocol is optimistic in nature and executes recovery procedures after an anomaly is detected. It strives to preserve the system mass in the presence of system asynchrony and dynamics. More precisely, it does not require coordination, so the start and end of an aggregation round can be asynchronous and a round can be arbitrarily long. Through detailed simulations, we evaluate the impact of asynchrony on the accuracy and the diffusion speed of the LAP protocol.
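    The core push-pull exchange described above is easy to see in simulation. The sketch below is not the thesis's LAP protocol; the topology, node count and atomic synchronous rounds are illustrative assumptions. It runs push-pull averaging on a small connected network and shows the local estimates converging to the global mean while the total mass (the sum of values) is conserved.

```python
import random

def push_pull_round(values, neighbors):
    """One round of push-pull averaging: every node contacts a random
    neighbour and both replace their local estimates with the pairwise
    mean, so the global sum (the system mass) is conserved."""
    for node in range(len(values)):
        peer = random.choice(neighbors[node])
        avg = (values[node] + values[peer]) / 2.0
        values[node] = avg
        values[peer] = avg

def simulate(n=50, rounds=30, seed=1):
    random.seed(seed)
    # A ring plus a few random shortcuts keeps the graph connected.
    neighbors = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
    for _ in range(n // 5):
        a, b = random.sample(range(n), 2)
        neighbors[a].append(b)
        neighbors[b].append(a)
    values = [random.uniform(0.0, 100.0) for _ in range(n)]
    true_mean = sum(values) / n
    for _ in range(rounds):
        push_pull_round(values, neighbors)
    max_error = max(abs(v - true_mean) for v in values)
    print(f"true mean {true_mean:.3f}, max local error after {rounds} rounds {max_error:.6f}")

if __name__ == "__main__":
    simulate()
```

    In this toy every exchange is atomic, which is exactly the assumption the thesis relaxes: once replies are delayed or rounds overlap, exchanges can double-count or drop value, and the mass conservation that the convergence argument relies on no longer holds.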
  • Item
    Privacy preserving protocols for large distributed systems
    Ramchen, Kim Sasha (2017)
    A fundamental problem in large distributed systems is how to enable parties to communicate securely while maintaining privacy. In this thesis we investigate the construction of privacy preserving protocols in three problem domains: secure group communications, secure outsourceable computation and secret sharing. Within these domains, flexible data sharing, low round complexity and the distribution of access control are guiding principles for our constructions. We present a novel construction of attribute-based encryption from correlation-relaxed two-to-one recodings. This construction is based upon the use of noisy cryptographic multilinear maps and entails replacing a correlation-secure encoding function with an indistinguishability property stating that a ciphertext is hard to decrypt without access to a certain input encoding. We construct the first noninteractive protocols for several tasks related to private set intersection. We provide efficient protocols for three related problems, each motivated by a particular kind of genomic testing. Set intersection with labelling hides the intersecting set itself and returns only the labels of the common elements, thus allowing a genomics company to return diagnoses without exposing the intellectual property of its database. Fuzzy matching with labelling extends this to allow matching at a particular Hamming distance, which solves the same problem while accommodating the possibility of genetic variation. Closest matching returns the item in the server's database closest to the client's query, which can be used for taxonomic classification. Our protocols are optimised for the matching of k-mers (sets of k-length strings) rather than individual nucleotides, which is particularly useful for representing the short reads produced by next generation sequencing technologies. We present a very simple universally verifiable MPC protocol. The first component is a threshold somewhat homomorphic cryptosystem that permits an arbitrary number of additions (in the source group), followed by a single multiplication, followed by an arbitrary number of additions in the target group. The second component is a black-box construction of universally verifiable distributed encryption switching between any public key encryption schemes supporting shared setup and key generation phases, as long as the schemes satisfy some natural additive-homomorphic properties. This allows us to switch back from the target group to the source group, and hence perform an arbitrary number of multiplications. The key generation algorithm of our prototypical cryptosystem, which is based upon concurrent verifiable secret sharing, permits robust reconstruction of powers of a shared secret. We demonstrate the scalability of distributed encryption switching as a viable approach to secure vote tallying by implementing a private, verifiable form of Instant Runoff Voting on real Australian election data comprising 40,000 votes. We investigate the construction of algebraic manipulation detection codes which are secure against general algebraic attacks, i.e., error-detecting/correcting codes which are secure against algebraic tampering functions. We prove that such codes exist when the families of tampering functions are point additions and polynomial functions modulo a prime. We prove both positive and negative results concerning the existence of general algebraic manipulation detection codes compared to non-malleable codes.
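    For concreteness, the sketch below shows only the ideal functionality of set intersection with labelling over k-mers, computed in the clear: the output is the labels attached to common elements, never the elements themselves. None of the thesis's cryptographic machinery appears here, and the sequences and labels are invented placeholders; the point is simply what a noninteractive private protocol for this task has to compute.

```python
def kmers(sequence: str, k: int) -> set[str]:
    """Split a sequence into its set of length-k substrings (k-mers)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def set_intersection_with_labelling(client_kmers: set[str],
                                    server_db: dict[str, str]) -> set[str]:
    """Ideal functionality: reveal only the labels of k-mers the client
    holds, not which k-mers matched."""
    return {label for kmer, label in server_db.items() if kmer in client_kmers}

# Toy usage: the server labels marker k-mers with diagnoses.
server_db = {"ACGTA": "marker-1", "TTGCA": "marker-2", "GGGCC": "marker-3"}
client = kmers("AACGTACCTTGCAT", k=5)
print(set_intersection_with_labelling(client, server_db))  # {'marker-1', 'marker-2'}
```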
  • Item
    Hybrid optimization of vehicle routing problems
    Lam, Edward (2017)
    Vehicle routing problems are combinatorial optimization problems that aspire to design vehicle routes that minimize some measure of cost, such as the total distance traveled or the time at which the last vehicle returns to a depot, while adhering to various restrictions. Vehicle routing problems are of profound interest in both academia and industry because they offer opportunities to study graph structures and algorithms, and because they underpin practical applications in a multitude of industries, most notably the transportation and logistics industries. This dissertation presents two applications relevant to industry and develops a fully hybrid method for solving a classical vehicle routing problem. The first application combines vehicle routing with crew scheduling. In industry, vehicle routing and crew scheduling are usually performed in stages due to the large search space of integrated models. The first stage finds minimal-cost vehicle routes with little consideration of crew constraints and objectives. The second stage schedules crews on the routes from the first stage. Disregarding crew constraints in the first stage can lead to suboptimality or even infeasibility of the overall problem in the second stage. To quantify the suboptimality of staged optimization models, two formulations of the integrated problem are developed. The first is an ordinary mixed integer programming model, and the second is a constraint programming model containing a linear relaxation global constraint that performs cost-based filtering. The two integrated models are solved using a branch-and-bound search and a highly specialized large neighborhood search. The large neighborhood search exploits the substructures linking the vehicle routing and crew scheduling elements of the problem and, when executed on the constraint programming model, is shown to perform significantly better than the other approaches. The second application introduces a number of scheduling constraints to the Vehicle Routing Problem with Pickup and Delivery and Time Windows. The scheduling constraints arise from a lack of loading bays or of equipment that unloads and loads incoming vehicles. These constraints limit the number of vehicles present or in service at any particular site by requiring the arrival of vehicles to be scheduled around the availability of a scarce resource. A mixed integer programming model, a constraint programming model and a sequential model are implemented for the problem but are shown to be inferior to a branch-and-price-and-check model, which hybridizes column generation and constraint programming with nogood learning. This thesis concludes with a hybrid method, named Branch-and-Check with Explanations, that unifies linear programming, constraint programming and Boolean satisfiability. The method begins with a linear programming model that omits several critical constraints. The solver uses the linear programming model to find objective bounds and candidate solutions, which are checked by a constraint programming model for feasibility of the omitted constraints. A Boolean satisfiability model performs conflict analysis on infeasible candidate solutions to derive nogood cuts, which are placed into the linear programming model and the constraint programming model. The method is implemented in a proof-of-concept solver for the Vehicle Routing Problem with Time Windows and is shown to be competitive against a branch-and-cut model while avoiding the intricacies of developing the cutting planes and separation algorithms required in branch-and-cut.
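    The generate-check-cut loop at the heart of branch-and-check can be illustrated without any LP, CP or SAT machinery. The sketch below is not the thesis's Branch-and-Check with Explanations solver: the master is a toy enumeration standing in for the linear programming model, the checker enforces a deliberately omitted capacity constraint (playing the role of the constraint programming model), and each failure yields a nogood that simply forbids the violating load pattern rather than a learned explanation.

```python
from itertools import product

# Toy instance: job durations and sizes, two vehicles, and a capacity
# limit that is deliberately left out of the master problem.
durations = [6, 1, 1, 1]
sizes = [1, 2, 2, 2]
CAPACITY = 4

def vehicle_jobs(assign, veh):
    return frozenset(i for i, v in enumerate(assign) if v == veh)

def master(nogoods):
    """Stand-in for the LP master: enumerate assignments, skip any whose
    vehicle load matches a recorded nogood, return the min-makespan one."""
    best = None
    for assign in product((0, 1), repeat=len(durations)):
        loads = [vehicle_jobs(assign, veh) for veh in (0, 1)]
        if any(load in nogoods for load in loads):
            continue
        makespan = max(sum(durations[i] for i in load) for load in loads)
        if best is None or makespan < best[0]:
            best = (makespan, assign)
    return best

def check(assign):
    """Checker for the omitted capacity constraint; on violation return
    the overloaded job set as a nogood (a stronger cut would also ban
    its supersets)."""
    for veh in (0, 1):
        load = vehicle_jobs(assign, veh)
        if sum(sizes[i] for i in load) > CAPACITY:
            return load
    return None

nogoods = set()
while True:
    makespan, assign = master(nogoods)
    violation = check(assign)
    if violation is None:
        print("feasible optimum:", assign, "makespan:", makespan)
        break
    nogoods.add(violation)  # cut off the infeasible pattern and re-solve
```

    On this instance the unconstrained optimum puts three jobs on one vehicle, the checker rejects that load, and the second master solve returns a capacity-feasible assignment with the next-best makespan.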
  • Item
    Understanding how cloud computing enables business model innovation in start-up companies
    Alrokayan, Mohammed (2017)
    Start-up companies contribute significantly to the national economies of many countries but their failure rate is notably high. Successful start-ups typically depend on innovative business models to be competitive and maintain profitability. This thesis explores how the new technologies of cloud computing might enable start-ups to create and maintain competitive advantage. A conceptual framework called Cloud-Enabled Business Model Innovation (CEBMI) is presented that identifies three research questions concerning how cloud computing might enable business model innovation, what form this innovation takes, and how the innovation leads to competitive advantage. These questions were then investigated through three empirical studies involving six case studies with start-ups and two qualitative studies involving interviews with 11 business consultants and three cloud service providers. The detailed findings are presented as a set of key propositions that offer answers to the research questions, and together sketch a view of how CEBMI might enable start-ups to achieve competitive advantage.
  • Item
    Investigating the evolution of structural variation in cancer
    Cmero, Marek (2017)
    Cancers arise from single progenitor cells that acquire mutations, eventually dividing into mixed populations with distinct genotypes. These populations can be estimated by identifying common mutational profiles, using computational techniques applied to sequencing data from tumour tissue samples. Existing methods have largely focused on single nucleotide variants (SNVs), despite growing evidence of the importance of structural variants (SVs) as drivers in certain subtypes of cancer. While some approaches use copy-number aberrant SVs, no method has incorporated balanced rearrangements. To address this, I developed a Bayesian inference approach, called SVclone, for estimating the cancer cell fraction of SVs. I validated SVclone using in silico mixtures of real samples in known proportions and found that clonal deconvolution using SV breakpoints can yield results comparable to SNV-based clustering. I then applied the method to 2,778 whole genomes across 39 distinct tumour types, uncovering a subclonal copy-number neutral rearrangement phenotype associated with decreased overall survival. This clinically relevant finding could not have been made using existing methods. To further expand the methodology, and to demonstrate its application in contexts of lower data quality, I developed a novel statistical approach to test for clonal differences in high-variance, formalin-fixed, paraffin-embedded (FFPE) samples. Together with variant curation strategies to minimise FFPE artefacts, I applied the approach to longitudinal samples from a cohort of neo-adjuvant-treated prostate cancer patients to investigate whether clonal differences can be inferred from highly noisy data. This thesis demonstrates that characterising the evolution of structural variation, particularly balanced rearrangements, yields clinically relevant insights. Identifying the patterns and dynamics of structural variation in the context of tumour evolution will ultimately help improve understanding of common pathways of tumour progression. Through this knowledge, cancers driven by SVs will have clearer prognoses, and clinical treatment decisions will ultimately be improved, leading to better patient outcomes.
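    A rough sense of what a cancer cell fraction means for an SV breakpoint can be given by a back-of-the-envelope point estimate. The sketch below is not SVclone's Bayesian model: it assumes a diploid, copy-number-neutral locus and a known tumour purity, and it ignores the read-count uncertainty and clustering that the thesis's method handles.

```python
def ccf_point_estimate(support_reads: int, reference_reads: int, purity: float) -> float:
    """Rough cancer cell fraction (CCF) of a breakpoint under a diploid,
    copy-number-neutral assumption: the variant allele fraction is scaled
    by 2 / purity and clipped to [0, 1]."""
    vaf = support_reads / (support_reads + reference_reads)
    return min(1.0, 2.0 * vaf / purity)

# Toy usage: 30 reads support the rearranged allele, 70 span the reference
# breakpoint, in a sample estimated to be 80% tumour.
print(round(ccf_point_estimate(30, 70, purity=0.8), 2))  # 0.75
```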
  • Item
    Practical approximation algorithms for clustering and covering
    McClintock, Jessica (2017)
    The aim of this thesis is to provide solutions to computationally difficult network optimisation problems on modern architectures. There are three types of problem that we study – clustering, covering and ordering – all of which are computationally hard to solve and in practice often involve very large data sets. These models are used in a range of real-world applications, and we therefore investigate both the practical and theoretical aspects of solving such problems in big data contexts. The main approach we consider for solving such instances is to obtain polynomial-time approximation algorithms, which efficiently solve these problems to within some constant factor of optimality. We also introduce several heuristics for a type of scheduling problem with graph-based constraints and demonstrate their performance in a practical setting, before providing an approximation algorithm and hardness results for a formalised version of the problem. For instances on big data, where the computational bottleneck is the available RAM, we consider models of algorithm design that allow such instances to be solved. For this purpose, we also design clustering algorithms using the MapReduce paradigm for parallelisation, giving experimental evaluations of their performance in comparison to existing approaches.
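    One standard way to fit clustering into the MapReduce paradigm is a two-round composition: each mapper shrinks its partition to k representatives using greedy (farthest-first) k-center, and a single reducer clusters the union of those representatives. The sketch below illustrates that pattern on one machine with one-dimensional points; it is an illustrative composition, not a specific algorithm or dataset from the thesis.

```python
import random

def farthest_first(points, k):
    """Greedy (Gonzalez) k-center on a single machine: repeatedly add the
    point farthest from the current centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(abs(p - c) for c in centers)))
    return centers

def mapreduce_k_center(points, k, num_partitions):
    """Two-round MapReduce-style k-center: the 'map' step reduces each
    partition to k representatives, the 'reduce' step clusters their union."""
    random.shuffle(points)
    size = (len(points) + num_partitions - 1) // num_partitions
    partitions = [points[i:i + size] for i in range(0, len(points), size)]
    representatives = [c for part in partitions for c in farthest_first(part, k)]  # map
    return farthest_first(representatives, k)                                      # reduce

random.seed(0)
data = [random.gauss(mu, 1.0) for mu in (0, 20, 50) for _ in range(200)]
print(sorted(round(c, 1) for c in mapreduce_k_center(data, k=3, num_partitions=4)))
```

    The appeal of this shape is that each partition fits in the RAM of a single worker and only the small set of representatives ever has to be gathered in one place, at the cost of a bounded loss in the approximation guarantee.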
  • Item
    Spectrum-based fault localization using machine learning
    Neelofar (2017)
    Debugging is critical in the production of reliable software. One effective bug localization technique is Spectrum-based Fault Localization (SBFL). This technique locates a buggy statement by applying an evaluation metric to program spectra and ranking program components on the basis of the scores it computes. Most SBFL research to date has had a strong bias toward single-bug programs, for which many good evaluation metrics are available. The same is true for deterministic bugs (which cause test cases to fail whenever they are executed). However, having multiple bugs has now become the norm in large software systems, and debugging metrics that perform well for such software are consequently in high demand. In this thesis, we propose a parametric class of metrics that adapts itself to single, multiple and deterministic bugs. The parameter values are learnt using optimization methods such as Genetic Programming or Simulated Annealing. We name our proposed class of metrics "Hyperbolic" because its contours resemble hyperbolas. We evaluate the performance of our proposed metric both on real programs and on model programs with single bugs, multiple bugs, deterministic bugs and non-deterministic bugs, and find that the proposed class of metrics performs as well as or better than the previous best-performing metrics over a broad range of data. SBFL is lightweight but has limited accuracy due to the limited amount of information it uses for localizing faults: it depends solely on the execution information of program components in passed and failed test cases. In this thesis, we propose a debugging approach that is a hybrid of SBFL and static analysis, providing SBFL with extra information about the programs under test. Program statements are categorized into various types, which are given weights based on how likely a category is to be related to a bug. We evaluate the performance of our technique both on small programs from the Siemens Test Suite (STS) and on the larger Space program. Results show that our technique improves the performance of a wide variety of fault localization metrics on single- and multiple-bug data. Statement type is one of many program features that can provide valuable clues about the location of a bug. Other features include statement length, nesting depth, cyclomatic complexity, and so on. However, it is very expensive, and thus impractical, to plug each of these features into our proposed technique to find out how effective it is in fault localization. We devise a statistical method that can be used to evaluate the importance of a program feature in fault localization without using it in a machine learning method. The similarity between the results obtained by our statistical method and those obtained by actually implementing the feature in machine learning techniques demonstrates the accuracy of our method in identifying feature importance. Usually, when two or more well-performing techniques are combined, the combination does not perform better than the best of the individual techniques. We combined our hyperbolic class of metrics with the statement-weighting technique, and the combined technique further improves the performance of the hyperbolic class of metrics. The improvement in performance is statistically significant for many single- and multi-bug datasets.
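    The SBFL scoring step itself is compact enough to show in a few lines. The sketch below computes the usual (ef, ep, nf, np) spectrum for each statement from per-test coverage and ranks statements with the classic Ochiai metric; the thesis's parametric hyperbolic metric is not reproduced here, so Ochiai simply stands in as one of the existing metrics it is compared against. The coverage matrix and outcomes are invented.

```python
import math

def spectrum_counts(coverage, outcomes):
    """From per-test coverage rows and pass/fail outcomes, compute the
    (ef, ep, nf, np) spectrum of each program statement."""
    n_stmts = len(coverage[0])
    total_fail = sum(1 for ok in outcomes if not ok)
    total_pass = len(outcomes) - total_fail
    spectra = []
    for s in range(n_stmts):
        ef = sum(1 for row, ok in zip(coverage, outcomes) if row[s] and not ok)
        ep = sum(1 for row, ok in zip(coverage, outcomes) if row[s] and ok)
        spectra.append((ef, ep, total_fail - ef, total_pass - ep))
    return spectra

def ochiai(ef, ep, nf, np):
    """Classic Ochiai suspiciousness score, used here as a stand-in metric."""
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

# Toy spectra: 3 statements, 4 tests (1 = statement executed by the test).
coverage = [[1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 1, 1]]
outcomes = [True, True, False, False]   # the last two tests fail
scores = [ochiai(*counts) for counts in spectrum_counts(coverage, outcomes)]
ranking = sorted(range(len(scores)), key=lambda s: -scores[s])
print("suspiciousness:", [round(x, 2) for x in scores], "ranking:", ranking)
```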
  • Item
    Extraction of neologisms from Japanese corpora
    Breen, James (2017)
    This thesis describes an exploration of the application of natural-language processing techniques to the extraction of neologisms from Japanese corpora. The research aim was to establish techniques which can be developed and exploited to assist significantly in neologism extraction for compiling Japanese monolingual and bilingual dictionaries. The particular challenge of the task stems from the lack of word boundaries in Japanese text, which complicates the identification of unrecorded words. Three broad approaches have been explored, using a variety of language processing and artificial intelligence techniques, and drawing on large-scale Japanese corpora and reference lexicons: synthesis of possible Japanese words by mimicking Japanese morphological processes, followed by testing for the presence of the candidate words in Japanese corpora; analysis of morpheme sequences in Japanese texts to detect potential new or unrecorded terms; and analysis of language patterns which are often used in Japanese in association with new and emerging terms. The research described in this thesis has identified a number of processes which can be used to assist lexicographers in the identification of unrecorded lexical items in Japanese texts.
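    The second approach, scanning morpheme sequences for unrecorded terms, can be sketched simply if we assume the text has already been segmented (in practice by a Japanese morphological analyser) and that a reference lexicon is available. The romanised tokens and tiny lexicon below are invented placeholders; on real data the candidates would be passed to a lexicographer for review, as the thesis describes.

```python
from collections import Counter

def candidate_neologisms(tokenised_sentences, lexicon, min_count=2, max_len=3):
    """Collect adjacent morpheme sequences that are frequent in the corpus
    but absent from the reference lexicon: candidate unrecorded terms."""
    counts = Counter()
    for tokens in tokenised_sentences:
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                candidate = "".join(tokens[i:i + n])
                if candidate not in lexicon:
                    counts[candidate] += 1
    return [(c, k) for c, k in counts.most_common() if k >= min_count]

# Toy segmented corpus with romanised placeholders and a tiny lexicon.
lexicon = {"neko", "kafe", "ni", "iku", "ga", "suki"}
sentences = [["neko", "kafe", "ni", "iku"], ["neko", "kafe", "ga", "suki"]]
print(candidate_neologisms(sentences, lexicon))  # [('nekokafe', 2)]
```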
  • Item
    Automatic understanding of unwritten languages
    Adams, Oliver (2017)
    Many of the world's languages are falling out of use without a written record and with minimal linguistic documentation. Language documentation is a slow process, and there are not enough linguists working to ensure the world's languages are documented before they die out. This thesis addresses automatic understanding of unwritten languages in order to perform tasks such as phonemic transcription and bilingual lexicon induction. The automation of such tasks promises to improve the leverage of field linguists and ultimately speed up the language documentation process. Modelling endangered languages is challenging due to the nature of the available data, which is typically not written text but limited quantities of recorded speech. Manually annotated information in the form of lexicons and grammars is typically also limited. Since the languages are spoken, the most efficient way of sourcing data is to collect speech in the language. Most speakers of endangered languages are bilingual or multilingual, so acquiring spoken translations plays to the strengths of the speakers. Key approaches described in this thesis make use of bilingual data, in particular translated speech, which consists of segments of endangered language speech paired with translations in a larger language. Such data is important for relating the source language speech to a larger language. Additionally, the application of monolingual phoneme transcription is explored, since it has direct applicability in more traditional phonemic transcription workflows. The overarching question is this: what can be automatically learnt about these languages with the data we have available, and how can this help automate language documentation? We first consider translation modelling under the assumption of accurate phoneme transcriptions. This assumption allows us to investigate the feasibility of phoneme-to-word translation and the effectiveness of inferring bilingual lexical items from such data in isolation from confounding acoustic factors. A second investigation explores how bilingual lexicons can be used to improve language models, which are crucial components of speech recognition and machine translation systems. In a third set of experiments we remove the assumption of accurate transcriptions and investigate operating in the face of acoustic uncertainty. Experiments in this space demonstrate that translated speech can improve automatic phoneme transcription even without a prior translation model. Finally, we make a step towards further generalisability, exploring acoustic modelling in resource-scarce environments without a lexicon or language model. In particular, we assess the use of automatic phoneme and tone transcription on Yongning Na, a threatened tonal language spoken in south-west China. Beyond quantitative investigation, we report on the use of this method in the linguistic documentation of Na. Its effectiveness has led to its incorporation into the language documentation workflow for Na.
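    The idea behind inferring lexical items from translated speech can be sketched with a simple co-occurrence statistic: phoneme spans that keep recurring in segments sharing the same translation word are lexicon candidates. The sketch below is not the thesis's translation or acoustic model; it assumes accurate phoneme transcriptions, scores n-gram/word pairs with a Dice coefficient over invented data, and prefers longer n-grams on ties.

```python
from collections import Counter
from itertools import product

def phoneme_ngrams(phonemes, n_max=4):
    """All contiguous phoneme substrings up to length n_max; with no word
    boundaries, any of them could correspond to a lexical item."""
    return {"".join(phonemes[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(phonemes) - n + 1)}

def induce_lexicon(pairs):
    """Pair each translation word with the phoneme n-gram whose set of
    segments overlaps best with the word's (Dice coefficient)."""
    ngram_df, word_df, joint = Counter(), Counter(), Counter()
    for phonemes, translation in pairs:
        ngrams = phoneme_ngrams(phonemes)
        words = set(translation.split())
        ngram_df.update(ngrams)
        word_df.update(words)
        joint.update(product(ngrams, words))
    best = {}
    for (ng, w), c in joint.items():
        dice = 2.0 * c / (ngram_df[ng] + word_df[w])
        if w not in best or (dice, len(ng)) > (best[w][1], len(best[w][0])):
            best[w] = (ng, dice)
    return {w: ng for w, (ng, _) in best.items()}

# Toy parallel segments: phoneme transcriptions paired with translations.
pairs = [
    (["t", "a", "k", "u"], "house"),
    (["t", "a", "k", "u", "m", "i"], "big house"),
    (["m", "i", "s", "o"], "big river"),
]
print(induce_lexicon(pairs))  # {'house': 'taku', 'big': 'mi', 'river': 'miso'}
```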
  • Item
    Building better predictive models for health-related outcomes
    Kankanige, Yamuna (2017)
    Predicting health-related outcomes is important for developing decision support systems that assist clinicians and other healthcare workers regularly faced with critical decisions. Such models will save time, help to manage healthcare resources and ultimately provide better quality of care for patients. Such systems are now made possible by the complex medical data routinely generated at hospitals and laboratories, and by developments in data mining methods. This thesis focusses on the development of such decision support systems as well as on techniques for improving the data, such as feature selection and acquisition, that are generically useful for building better prognostic models for predicting health-related outcomes. Data mining in healthcare is an interesting and unique domain. The available data is heterogeneous, including demographic and diagnostic information about patients, clinical notes, medical imaging results and whole genome sequence data. Since most data is not collected for research purposes, there can be issues with data quality such as missing, ambiguous and erroneous information. In addition, some data might not be available in electronic format, which makes it time consuming to collect. Missing values are a big problem in this domain, and they arise not only from data entry or collection issues: some information is simply not available for some records. For example, the pathology test results available for a patient depend on the laboratory tests ordered by the clinician for that patient. Another aspect of data mining in healthcare is that these models need to be sufficiently transparent for users to trust and use them. Therefore, which techniques and algorithms can be used for such models depends on how much trust users place in those methods. In particular, it is imperative that analysis of healthcare data generalizes. The topic of this thesis, building better predictive models for health-related data, can be divided roughly into two parts. The first part investigates various data mining techniques used to improve the performance of prediction models, especially with regard to healthcare data, which helps to build better prognostic models for health-related outcomes. The second part concerns applications of data mining models to clinical and biomedical data, to provide better health-related outcomes. A common occurrence in classification at test time is partially missing test-case features. Since obtaining all missing features is rarely cost effective or even feasible, identifying and acquiring those features that are most likely to improve prediction accuracy is of significant impact. This challenge arises frequently in health data, where clinicians order only a subset of test panels for a patient at a time. In this thesis, we propose a confidence-based solution to this generic scenario using random forests. We sequentially suggest the features that are most likely to improve the prediction accuracy of each test instance, using a set of existing training instances which may themselves have missing values. Density-based logistic regression is a recently introduced classification technique, which has been successful in real clinical settings, that performs a one-to-one non-linear transformation of the original feature space to another feature space based on density estimation. This new feature space is particularly well suited to learning a logistic regression model, a popular technique for predicting health-related outcomes. Whilst performance gains, good interpretability and time efficiency make density-based logistic regression attractive, there are limitations to its formulation. As another technique for improving features, we tackle these limitations of the feature transformation method and propose several new extensions in this thesis. Liver transplants are a common type of organ transplantation, second only to kidney transplantation in frequency. The ability to predict organ failure or primary non-function at liver transplant decision time facilitates utilization of the scarce resource of donor livers, while ensuring that patients who are urgently in need of a liver transplant are prioritized. An index derived to predict organ failure using donor as well as recipient characteristics, based on local datasets, is of benefit in the Australian context. In a study using real liver transplant data, we propose that by using donor, transplant and recipient characteristics known at the decision time of a transplantation, together with data mining techniques, we can achieve high accuracy in matching donors and recipients, potentially providing better organ survival outcomes. Serotyping is a common bacterial typing process in which isolated microorganism samples are grouped according to their distinctive surface structures, called antigens, which is important for public health and epidemiological surveillance. In a study using whole genome sequencing data from four publicly available Streptococcus pneumoniae datasets, we demonstrate that data mining approaches can be used to predict the serotypes of isolates quickly and accurately compared with the traditional approaches. In summary, this thesis focusses on techniques for improving data, such as feature selection, transformation and acquisition, that are generically useful for building better prognostic models for predicting health-related outcomes, as well as on applications of data mining techniques to clinical and biomedical data for improving health-related outcomes.
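    The confidence-based acquisition idea can be sketched with an off-the-shelf random forest: for each still-missing feature, plausible values are borrowed from the training data, and the feature whose acquisition is expected to make the forest most confident is suggested next. This is a simplified reading of the approach described above, not the thesis's implementation; it assumes scikit-learn, fully observed training data, and a toy dataset in which only the second feature is informative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def suggest_next_feature(model, train_X, test_x, missing):
    """For each missing feature, estimate how confident the forest would be
    if that feature were acquired, by averaging over values observed in the
    training data (other missing features are filled with training means)."""
    filled = np.where(np.isnan(test_x), np.nanmean(train_X, axis=0), test_x)
    best_feature, best_conf = None, -1.0
    for j in missing:
        candidates = filled[np.newaxis, :].repeat(len(train_X), axis=0)
        candidates[:, j] = train_X[:, j]          # plausible values for feature j
        conf = model.predict_proba(candidates).max(axis=1).mean()
        if conf > best_conf:
            best_feature, best_conf = j, conf
    return best_feature, best_conf

# Toy data: feature 1 separates the classes, feature 0 is noise.
rng = np.random.default_rng(0)
train_X = rng.normal(size=(200, 2))
train_y = (train_X[:, 1] > 0).astype(int)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(train_X, train_y)

test_x = np.array([np.nan, np.nan])               # nothing measured yet
print(suggest_next_feature(model, train_X, test_x, missing=[0, 1]))  # suggests feature 1
```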