Computing and Information Systems - Theses


Search Results

Now showing 1 - 10 of 312
  • Item
    Explicit Feature Interaction Modeling for Recommender Systems
    Su, Yixin (2022)
    Recommender systems play a key role in addressing information overload issues in many Web applications, such as e-commerce, social media platforms, and lifestyle apps. For example, the Amazon shopping website lists items that a user might be interested in at the top of the website. The core function of recommender systems is to predict how likely a user is to select an item (e.g., purchase, click). Typical recommender systems leverage users' historically selected items (i.e., user-item interactions) to infer users' interests. For example, matrix factorization (MF), one of the most common recommendation algorithms, learns user and item representations by factorizing the user-item interaction matrix as the product of a user matrix and an item matrix. Besides user-item interactions, studies show that features (e.g., user/item attributes, contexts), especially feature interactions (e.g., the co-occurrences of features), are important side information that can significantly enhance recommendation accuracy. Deep neural networks (DNNs) have achieved great success in many recent studies due to their powerful information analysis ability. As a result, recent recommendation models seek to leverage DNNs for better feature interaction modeling. However, they model feature interactions implicitly (i.e., they model all feature interactions together in a black-box DNN model without knowing which interactions are modeled or how they are modeled). Recent studies have shown that implicit interaction learning is less effective in extracting useful information about feature interactions for accurate recommendations. In this thesis, we explore how to effectively leverage feature interactions and improve the performance of recommender systems. Specifically, we focus on modeling feature interactions in an explicit manner. Unlike implicit modeling methods, explicit methods model each feature interaction individually, allowing us to choose which feature interactions to model and to decide how to model each feature interaction for more accurate recommendations. We then focus on three challenging research questions in improving recommender systems through explicit feature interaction modeling. The first research question explores how to improve the performance of MF by enabling it to explicitly model feature interactions. MF learns user and item representations from user-item interactions, which has a drawback: it cannot consider feature interactions. This limits MF's potential to perform fine-grained analysis for more accurate predictions. Meanwhile, MF may encounter the cold-start problem (i.e., an item has too few historical user-item interactions to conduct an effective analysis). Focusing on this drawback, instead of factorizing the user-item interaction matrix, we propose to factorize multiple user-attribute interaction matrices to learn attribute representations. The final prediction is an aggregation of all the ratings predicted in the user-attribute matrices. Our proposed method achieves higher accuracy than MF-based methods, while resolving the cold-start problem. The second research question explores which feature interactions should be modeled in recommender systems. Existing recommendation algorithms consider all feature interactions to generate predictions. However, not all feature interactions are relevant to the recommendation prediction, and capturing irrelevant feature interactions may introduce noise and decrease the prediction accuracy.
    Therefore, we propose to detect the set of most relevant feature interactions (which we formally define as beneficial feature interactions in terms of prediction accuracy) and to model only beneficial feature interactions for more accurate recommendations. We propose novel frameworks that leverage the relational reasoning ability of graph neural networks (GNNs) to achieve more effective explicit feature interaction modeling. Under these frameworks, beneficial feature interaction detection and recommendation prediction are achieved via an edge prediction task and a graph classification task, respectively. The third research question explores how to model different feature interactions in recommender systems. Existing recommendation algorithms model all feature interactions equally, neglecting their different impacts on the prediction. We explore how feature interactions can be categorized and modeled to fit their roles in recommendation. More specifically, for user attributes and item attributes (e.g., user gender, item color), we define two types of interactions: inner interactions for profile learning and cross interactions for preference matching. We propose a neural graph matching method, based on our GNN-based interaction modeling framework, to model the two types of interactions so that we can better analyze their impacts on recommendation predictions. For context features (e.g., weather, time), inspired by psychology, we leverage them to learn intrinsic factors and extrinsic factors that jointly influence users' selections. Contrastive learning and disentanglement learning algorithms are leveraged to learn these factors. In summary, this thesis has made several contributions to explicit feature interaction modeling for improving recommender systems: enabling feature interaction modeling in matrix factorization, detecting beneficial feature interactions, and categorizing and modeling different types of feature interactions to fit their roles in recommendation. All of this work is supported by theoretical analyses and empirical results.
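To make the matrix factorization baseline mentioned in this abstract concrete, here is a minimal NumPy sketch of MF that factorizes a toy user-item interaction matrix into a user matrix and an item matrix. The data, latent dimension, and hyperparameters are invented for illustration; this is not code from the thesis.

```python
# Illustrative sketch only: plain matrix factorization (MF), learning user and
# item vectors whose dot product approximates the observed user-item matrix.
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5, 3, 0, 1],          # toy user-item matrix; 0 = unobserved
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0                          # only fit observed interactions
k, lr, reg = 2, 0.01, 0.05            # latent dimension, learning rate, L2 weight
U = 0.1 * rng.standard_normal((R.shape[0], k))   # user matrix
V = 0.1 * rng.standard_normal((R.shape[1], k))   # item matrix

for _ in range(2000):
    E = mask * (R - U @ V.T)          # error on observed entries only
    U += lr * (E @ V - reg * U)       # gradient step for user factors
    V += lr * (E.T @ U - reg * V)     # gradient step for item factors

print(np.round(U @ V.T, 2))           # reconstructed preference scores
```

The zero entries of the reconstructed matrix are the model's predicted preferences; the thesis's point is that plain MF of this kind cannot take feature interactions into account.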
  • Item
    Straggler Mitigation in Distributed Behavioural Simulations
    Bin Khunayn, Eman Abdulaziz A (2021)
    Behavioural simulation (B-SIM) has been widely used to understand real-world phenomena. Running such simulations at large scale requires high computational power, which can be provided through parallel distributed computing. To preserve correct behaviour in the system, implementations follow the bulk synchronous parallel (BSP) computational model. The processing in BSP is divided into iterations running on multiple machines/workers, each followed by a global barrier synchronisation. Unfortunately, the BSP model is plagued by the straggler problem, where a delay in any process at any iteration leads to a slowdown of the entire simulation. Stragglers may occur for many reasons, including imbalanced workload distribution or communication and synchronisation delays. Moreover, this problem is further exacerbated as the system scales up. Industry and academia have been working on addressing the straggler problem in distributed systems for a long time. Though mitigating the straggler effect in distributed B-SIM is very challenging, it is fundamental to improving system performance, cost, and utilisation. This thesis presents straggler mitigation techniques for distributed B-SIM. It focuses on reducing the straggler effect caused by computation, communication, and synchronisation delays in BSP implementations that run on shared-nothing architectures, and proposes novel application-level techniques to mitigate the effect of computation and communication stragglers. The thesis makes the following key contributions to maximise B-SIM performance:
    - A GridGraph load balancing and partitioning method that reduces computation stragglers by balancing the workload distribution in B-SIM among multiple workers.
    - The Priority Synchronous Parallel (PSP) model, a novel parallel computational model that exploits data dependencies to reduce the communication stragglers' effect.
    - The Priority Asynchronous Parallel (PAP) computational model, which utilises the data dependencies further to reduce the computation stragglers' effect in addition to the communication stragglers' effect.
    - Neighbourhood Diffusion (NDiff), a dynamic load balancing (DLB) algorithm that dynamically mitigates computation stragglers caused by imbalanced workload distribution at runtime.
    - The On-demand Dynamic Resource Allocation (On-demand DRA) method, which dynamically allocates and releases computing resources (workers) to mitigate computation stragglers caused by highly imbalanced workload distribution.
    All proposed algorithms are implemented and evaluated using a microscopic traffic simulator, SMARTS [1], as an example of a B-SIM, running traffic simulations of three big cities (Melbourne, Beijing, and New York) with different traffic and straggler scenarios.
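For context on the BSP model described above, the following toy Python sketch (not from the thesis) shows why a global barrier makes one slow worker delay every superstep. The worker count, sleep times, and choice of straggler are arbitrary.

```python
# Minimal sketch of the bulk synchronous parallel (BSP) pattern: every
# iteration ends with a global barrier, so one straggling worker stalls all.
import threading
import time

NUM_WORKERS, NUM_ITERATIONS = 4, 3
barrier = threading.Barrier(NUM_WORKERS)

def worker(worker_id: int) -> None:
    for step in range(NUM_ITERATIONS):
        # Simulated compute phase; worker 0 is an artificial straggler.
        time.sleep(0.5 if worker_id == 0 else 0.1)
        barrier.wait()                      # global synchronisation barrier
        print(f"worker {worker_id} finished superstep {step}")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every superstep takes as long as the slowest worker, which is exactly the effect the thesis's load balancing, PSP/PAP, NDiff, and On-demand DRA techniques aim to mitigate.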
  • Item
    Network Architecture for Prediction of Emergence in Complex Biological Systems
    Ghosh Roy, Gourab (2022)
    Emergence of properties at the system level, where these properties are not observed at the individual entity level, is an important feature of complex systems. Emergent properties of biological systems play critical roles in the functioning of organisms and in disruptions to normal functioning, and are relevant to the treatment of diseases like cancer. Complex biological systems can be modeled by abstractions in the form of molecular networks like gene regulatory networks (GRNs) and signaling networks, with nodes representing molecules like genes and edges representing molecular interactions. This thesis explores the use of the architecture of these networks to predict the emergence of system properties. First, to better infer the network architecture with aspects that can be useful in predicting emergence, we propose a novel algorithm, Polynomial Lasso Bagging (PoLoBag), for signed GRN inference from gene expression data. The GRN edge signs represent the nature of the regulatory relationships, activating or inhibitory. Our algorithm gives more accurate signed inference than state-of-the-art algorithms, and overcomes their weaknesses by also inferring edge directions and cycles. We also show how combining signed GRN architecture with dynamical information in our proposed dynamical K-core method effectively predicts emergent states of the gene regulatory system. Second, we investigate the existence of the bow-tie architectural organization in the GRNs of species of widely varying complexity. Prior work has shown the existence of this bow-tie feature in the GRNs of only some eukaryotes. Our investigation covers GRNs of prokaryotes to unicellular and multicellular eukaryotes. We find that the observed bow-tie architecture is a characteristic feature of GRNs. Based on differences that we observe in the bow-tie architectures across species, we predict a trend in the emergence of the dynamical gene regulatory system property of controllability with varying species complexity. Third, from input genotype data we predict an emergent phenotype at the organism level: the cancer-specific survival risk. We propose a novel Mutated Pathway Visible Neural Network (MPVNN), designed using prior knowledge of signaling network architecture and additional mutation data-based edge randomization. This randomization models how the known signaling network architecture changes for a particular cancer type, which is not modeled by state-of-the-art visible neural networks. We suggest that MPVNN performs cancer-specific risk prediction better than other similarly sized NN and non-NN survival analysis methods, while also providing reliable interpretations of its predictions. These three research contributions taken together make significant advances towards our goal of using molecular network architecture for better prediction of emergence, which can inform treatment decisions, lead to novel therapeutic approaches, and is of value to computational biologists and clinicians.
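As a loose illustration of the "polynomial Lasso bagging" idea named above, the sketch below fits Lasso regressions with polynomial features over bootstrap bags and averages the signed linear coefficients as edge scores for one target gene. It is a simplified stand-in, not the actual PoLoBag algorithm, and the synthetic data and parameters are invented.

```python
# Rough sketch only, in the spirit of bagged Lasso regression with polynomial
# features; NOT the exact PoLoBag algorithm from the thesis.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
n_samples, n_regulators = 100, 5
X = rng.standard_normal((n_samples, n_regulators))        # regulator expression
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + 0.1 * rng.standard_normal(n_samples)

n_bags, weights = 30, np.zeros(n_regulators)
for _ in range(n_bags):
    idx = rng.choice(n_samples, size=n_samples, replace=True)   # bootstrap bag
    poly = PolynomialFeatures(degree=2, include_bias=False)
    Xp = poly.fit_transform(X[idx])                              # linear + pairwise terms
    model = Lasso(alpha=0.05).fit(Xp, y[idx])
    weights += model.coef_[:n_regulators]                        # keep linear part

weights /= n_bags
print("signed edge scores (regulator -> target):", np.round(weights, 2))
```

A positive averaged coefficient would be read as an activating edge and a negative one as inhibitory, which is the kind of signed inference the abstract describes.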
  • Item
    Software testing with Monte Carlo tree search
    Liu, Dongge (2022)
    Achieving high code coverage in modern software is an essential yet challenging task. The size and complexity of programs give rise to the exploitation-exploration dilemma for testers: how to balance investigating a) the parts of the program that appear to be interesting based on the existing information (exploitation) against b) other parts that are less well understood (exploration)? Monte Carlo tree search (MCTS) is an artificial intelligence (AI) search algorithm that offers a principled solution to balancing these two concerns and has demonstrated its efficiency in large search spaces. This thesis contributes to tailoring MCTS to automate test generation for coverage-based testing. First, we propose Legion to assist coverage-based whitebox testing with our variation of MCTS. Legion's APPFuzzing generalises concolic testing and fuzzing to cover more program states with amortised computation cost: it selectively solves the path constraints of intermediate program states for seed inputs and mutates them to cover the descendants of the selected states. Legion's variation of MCTS uses a best-first search strategy to learn the most productive program states to apply APPFuzzing to, thus mitigating path explosion in symbolic execution. We compare Legion's performance with KLEE, a well-known coverage-based concolic testing tool developed, maintained, and used by a large community across academia and industry. Legion outperformed KLEE in 7 out of 11 benchmark suites in a worldwide competition and achieved 90% of the best score in 5 suites. We then extend Legion's algorithm to seed selection in greybox fuzzing, searching for the most progressive seed input sequence for stateful protocol servers. We analyse the impact of multiple server models and selection algorithms with repeated experiments. This study shows that although our algorithm has a limited impact on the overall coverage performance, it vastly increases the probability of covering some representative code blocks in servers. It highlights that the bottleneck of overall coverage performance lies in input generation quality and fuzzing throughput. Finally, we design a contextual MCTS algorithm to reify and transfer the knowledge of past program state evaluations. Most software testing algorithms estimate a state's productivity only with state-specific statistics and neglect knowledge from other states. To use the statistics more efficiently, we identify two contextual features of each program state and quantify their effect on a state's productivity as feature coefficients. We then train a regression model to dynamically update the values of the coefficients and evaluate other program states based on their feature values. By analysing our experimental results and reviewing others' work, we point out current weaknesses and corresponding directions for future work on this topic.
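The exploitation-exploration balance referred to above is commonly handled with the UCB1 rule used during MCTS selection. The following toy sketch (not Legion's code) shows UCB1 choosing among a few hypothetical program states with invented reward probabilities.

```python
# Illustrative sketch only: the UCB1 rule that MCTS-style algorithms commonly
# use to balance exploitation of promising nodes against exploration of rarely
# visited ones. Node names and payoffs are made up.
import math
import random

class Node:
    def __init__(self, name: str):
        self.name = name
        self.visits = 0
        self.total_reward = 0.0

def ucb1(node: Node, parent_visits: int, c: float = math.sqrt(2)) -> float:
    if node.visits == 0:
        return float("inf")                       # always try unvisited nodes first
    exploitation = node.total_reward / node.visits
    exploration = c * math.sqrt(math.log(parent_visits) / node.visits)
    return exploitation + exploration

children = [Node("state_A"), Node("state_B"), Node("state_C")]
true_reward = {"state_A": 0.2, "state_B": 0.8, "state_C": 0.5}   # hidden payoffs

for t in range(1, 201):
    best = max(children, key=lambda n: ucb1(n, t))
    reward = random.random() < true_reward[best.name]             # noisy feedback
    best.visits += 1
    best.total_reward += reward

for n in children:
    print(n.name, n.visits, round(n.total_reward / max(n.visits, 1), 2))
```

Over many rounds the most "productive" state accumulates the most visits while the others are still sampled occasionally, which mirrors how Legion decides where to apply APPFuzzing.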
  • Item
    Traffic Optimization in The Era of Connected Autonomous Vehicles
    Motallebi, Sadegh (2021)
    Traffic optimization, especially at large scale, is an important and challenging problem due to the highly dynamic and unpredictable nature of traffic. The incoming era of connected autonomous vehicles (CAVs) offers a great opportunity for traffic optimization. The routes followed by CAVs can be assigned by a centralized system with better predictability. By collecting highly detailed real-time traffic data from sensors and CAVs, a traffic management system will have a full view of the entire road network, allowing it to plan traffic in a virtual world that replicates the real road network very closely. This thesis studies the problem of route assignment for CAVs, with the aim of optimizing traffic at the network level. We formulate the research problem as a streaming route assignment problem and address intersecting routes, which are a key cause of delays and long queues at junctions. We focus on finding road network routes with few intersections for vehicles. We propose two route assignment algorithms to solve the problem, the Local Detour Algorithm (LDA) and the Multiple Intersection Reduction Algorithm (MIRA). LDA uses traffic information at road links or road junctions during route allocation, while MIRA enlarges the scope of traffic information to obtain a bigger picture of traffic conditions. LDA is efficient because it searches for detours in a small area around congested junctions. On the other hand, MIRA is more complex as it searches for longer detours not only from congested junctions but also from their surrounding junctions that will be congested in the near future. The results show that the route allocator can decrease traffic congestion significantly by utilizing MIRA. As new routes are assigned to roads, traffic conditions can change significantly after a certain time, which means that the real-time traffic conditions will not be an accurate estimate of the near future. What if the route allocator could predict the effect of route assignments on roads? Traffic congestion prediction enables the route allocator to estimate traffic conditions accurately. For a specific type of road network in which intersections are signalized, we propose a predictive congestion model by modifying a real-time queue-based congestion model. The model predicts traffic congestion for each road link based on the corresponding number of route allocations. We also propose a route assignment algorithm, the Traffic Congestion Aware Route Assignment Algorithm (TCARA), which uses the predictive congestion model. TCARA is designed for a specific type of road network, while MIRA is a general route assignment algorithm with no prior assumption on road intersections (i.e., whether they are signalized or not), making it a good candidate for a wider range of road networks. Another factor that can change traffic conditions is traffic demand varying over time. Thus, if a route is optimized based on the current traffic conditions while the traffic demand is not stable, the route may be ineffective. What happens if prior temporal traffic data for particular traffic conditions of a road network is available? Temporal traffic data records historical traffic conditions collected at regular time intervals. The route allocator can thus be aware of changes in traffic demand that may occur in the near future, and can optimize traffic further by having access to such data.
    We propose a new route assignment algorithm, the Temporal Multiple Intersection Reduction Algorithm (T-MIRA), by extending MIRA to leverage prior temporal traffic data for reducing traffic congestion. In this thesis, MIRA is the main contribution. We extend it for situations where traffic conditions evolve due to the impact of route assignments or changes in traffic demand. We investigate two different approaches: using a predictive congestion model in TCARA and using temporal traffic data in T-MIRA. An advanced traffic management system may utilize different approaches and solutions in an integrated way to achieve the best outcome in the real world.
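As a rough illustration of the intuition behind reducing intersecting routes, the toy sketch below (not LDA, MIRA, or T-MIRA themselves) scores candidate routes by how many already-assigned traversals they share at each junction and picks the least intersecting one. All junction names and routes are made up.

```python
# Toy sketch only: prefer the candidate route that overlaps least with routes
# already assigned to other vehicles.
from collections import Counter

assigned_routes = [
    ["j1", "j2", "j3", "j4"],
    ["j2", "j5", "j6"],
    ["j3", "j6", "j7"],
]
candidates = [
    ["j1", "j2", "j3", "j8"],   # heavily overlaps with assigned traffic
    ["j1", "j9", "j10", "j8"],  # mostly uses quiet junctions
]

junction_load = Counter(j for route in assigned_routes for j in route)

def intersection_cost(route: list[str]) -> int:
    # Total number of already-assigned traversals over this route's junctions.
    return sum(junction_load[j] for j in route)

best = min(candidates, key=intersection_cost)
print("chosen route:", best, "cost:", intersection_cost(best))
```

The thesis's algorithms work on streaming assignments over real road networks and, in TCARA and T-MIRA, replace the static junction counts with predicted or historical congestion.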
  • Item
    Towards dialogue system training with little labeling effort
    Huang, Xinting (2022)
    Dialogue systems conduct natural and meaningful conversations with humans and are applied in diverse tasks including information seeking, task completion, and social chat. They have made great progress in recent years and have been widely integrated into commercial systems used by people every day, such as virtual assistants in smartphones like Apple Siri and Google Assistant. There are also efforts to develop domain-specific dialogue systems, such as for the e-commerce or education domains. However, building these dialogue systems requires great human effort, including task schema design, sample dialogue collection, and human annotation, and such a complex process may be needed for each new application. In this thesis, we aim to alleviate the costs of building a dialogue system by better exploiting both labeled and unlabeled dialogue data. We study three problems. The first is semi-supervised policy learning, which aims to reduce annotation effort for policy learning. Dialogue policy learning heavily relies on reward signals, which promote efficient task completion and penalize inappropriate decisions. Ground-truth decision-making examples (i.e., expert demonstrations) are widely utilized to provide fine-grained rewards by measuring the similarity between the behaviours of policies being learned and those shown in expert demonstrations. Existing approaches require labeling expert demonstrations in the form of state-action pairs, which requires extensive annotation. To reduce annotation effort, we study semi-supervised policy learning and unify both labeled and unlabeled dialogues to provide rewards for policy learning. We propose a novel sequence-model-based reward estimator for policy learning, which models dialogue progress (i.e., state transitions) either with or without annotations. Extensive experiments show that the proposed approach outperforms state-of-the-art approaches by up to 35.4 and 18 percent in terms of success rate and required turns, respectively, while reducing the amount of labeled dialogue data by 80 percent. The second problem is cross-domain dialogue generation, which aims to reduce dialogue collection effort for response generation. Response generation relies on large-scale dialogue data, which might be unavailable for certain domains. Although domain adaptation may alleviate the problem of lacking training data, a straightforward adaptation is easily prone to the domain gap due to the complex characteristics of the response generation task. To address such limitations, we break response generation into two core steps, content planning and surface realization, which are responsible for predicting response intention and ensuring language fluency, respectively. To better disentangle the two steps, we propose to learn semantic text representations that encode dialogue responses' underlying intentions, which facilitates knowledge transfer. The proposed approach achieves up to a 10.4 percent gain in task completion (measured by entity F1 score) compared to previous approaches. The third problem is data augmentation for information-seeking question answering (QA), which is a special task for a dialogue system. To answer information-seeking queries, the system needs to perform reasoning over external knowledge sources, e.g., textual paragraphs. Queries that require multiple reasoning steps to derive the answers, i.e., multi-hop questions, pose more challenges than those that only require a single reasoning step.
    However, collecting training data for multi-hop questions is costly due to their complex language expressions and underlying reasoning logic. To reduce data collection effort, we propose to generate synthetic multi-hop QA pairs using large-scale, easily obtained single-hop QA data. To unify single-hop and multi-hop QA data, we propose a generate-and-compose approach, which generates sub-questions and composes them into a target multi-hop question. Experiments show that the proposed approach outperforms competitive baselines on a benchmark multi-hop QA dataset and achieves up to a 9.3 percent performance gain in terms of F1 score for a downstream QA model.
  • Item
    Energy and Time Aware Scheduling of Applications in Edge and Fog Computing Environments
    Goudarzi, Mohammad (2022)
    The Internet of Things (IoT) paradigm is playing a principal role in the advancement of many application scenarios such as healthcare, smart city, transportation, entertainment, and agriculture, which significantly affect the daily life of humans. The smooth execution of these applications requires sufficient computing and storage resources to support the massive amount of data generated by IoT devices. However, IoT devices are intrinsically resource-limited and are not capable of efficiently processing and storing large volumes of data. Hence, IoT devices require surrogate resources for the smooth execution of their heterogeneous applications, which can be either computation-intensive or latency-sensitive. Cloud datacenters are among the potential resource providers for IoT devices. However, as they reside at a multi-hop distance from IoT devices, they cannot efficiently execute IoT applications, especially latency-sensitive ones. The Fog computing paradigm, which extends Cloud services to the edge of the network within the proximity of IoT devices, offers low-latency execution of IoT applications. Hence, it can improve the response time and service startup time of IoT applications, and reduce network congestion. Also, it can reduce the energy consumption of IoT devices by minimizing their active time. However, Fog servers are resource-limited compared to Cloud servers, preventing them from executing all types of IoT applications, especially extremely computation-intensive ones. Hence, Cloud servers are used to support Fog servers to create a robust computing environment with heterogeneous types of resources. Consequently, the Fog computing paradigm is highly dynamic, distributed, and heterogeneous. Thus, without efficient scheduling techniques for the management of IoT applications, it is difficult to harness the full potential of this computing paradigm for different IoT-driven application scenarios. This thesis focuses on different scheduling techniques for the management of IoT applications in Fog computing environments while considering: a) IoT devices' characteristics, b) the structure of IoT applications, c) the context of resource providers, d) the networking characteristics of the Fog servers, e) the execution cost of running IoT applications, and f) the dynamics of the computing environment. This thesis advances the state of the art by making the following contributions:
    1. A comprehensive taxonomy and literature review on the scheduling of IoT applications in Fog computing environments from different perspectives, namely application structure, environmental architecture, optimization properties, decision engine characteristics, and performance evaluation.
    2. A distributed Fog-driven scheduling technique for network resource allocation in dense and ultra-dense Fog computing environments to optimize throughput and satisfy users' heterogeneous demands.
    3. A distributed scheduling technique for the batch placement of concurrent IoT applications to optimize the execution time of IoT applications and the energy consumption of IoT devices.
    4. A distributed application placement and migration management technique to optimize the execution time of IoT applications, the energy consumption of IoT devices, and the migration downtime in hierarchical Fog computing environments.
    5. A Distributed Deep Reinforcement Learning (DDRL) technique for scheduling IoT applications in highly dynamic Fog computing environments to optimize the execution time of IoT applications and the energy consumption of IoT devices.
    6. A software system for scheduling IoT applications in multi-Cloud Fog computing environments.
    7. A detailed study outlining challenges and new research directions for the scheduling of IoT applications in Fog computing environments.
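To give a flavour of the kind of trade-off such scheduling techniques weigh, here is a deliberately naive sketch (not one of the thesis's algorithms) that scores candidate Fog and Cloud servers by an estimated execution time plus the energy an IoT device would spend uploading the task, two of the objectives the thesis optimizes. Every number and parameter below is invented.

```python
# Hedged sketch only: naive greedy placement of one IoT task onto the server
# with the lowest weighted time-plus-energy cost.
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    mips: float          # processing capacity
    uplink_mbps: float   # bandwidth from the IoT device to this server

@dataclass
class Task:
    cpu_mi: float        # required million instructions
    data_mb: float       # data to upload

TX_POWER_W = 0.5         # assumed device transmit power

def cost(task: Task, server: Server, w_time: float = 0.5, w_energy: float = 0.5) -> float:
    upload_s = task.data_mb * 8 / server.uplink_mbps      # transfer time
    exec_s = task.cpu_mi / server.mips                     # processing time
    energy_j = TX_POWER_W * upload_s                       # device energy spent sending
    return w_time * (upload_s + exec_s) + w_energy * energy_j

servers = [Server("fog-1", mips=4000, uplink_mbps=100),
           Server("fog-2", mips=2000, uplink_mbps=300),
           Server("cloud", mips=20000, uplink_mbps=20)]
task = Task(cpu_mi=8000, data_mb=50)

best = min(servers, key=lambda s: cost(task, s))
print("place task on:", best.name)
```

The thesis's contributions replace this single-task greedy choice with distributed, batch, migration-aware, and learning-based (DDRL) decision making over application structures and dynamic environments.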
  • Item
    Avoiding Bad Traffic: Analyzing the Past, Adapting the Present, and Estimating the Future
    Aldwyish, Abdullah Saleh (2021)
    Mobility is essential for modern societies. However, due to the increasing demand for mobility, traffic congestion poses a significant challenge to economic growth and advancement for many cities worldwide. At the same time, the widespread availability of location-aware devices has led to a sharp increase in the amount of traffic data generated, thereby providing an opportunity for intelligent transportation systems to emerge as one of the main cost-effective methods for traffic congestion mitigation. This boost in traffic data has led to a new generation of live navigation services that depend on traffic estimation to provide up-to-date navigation advice. Intelligent transportation systems increase the utilization of existing infrastructure and help drivers make better navigation decisions by providing actionable traffic information. However, a fundamental shortcoming of existing intelligent navigation systems is that they do not consider the evolution of traffic and instead route drivers based on snapshot traffic conditions. This is especially critical in the presence of traffic incidents, where the impact of the incident introduces significant variation in the traffic conditions around the incident as things unfold. This thesis proposes three contributions focusing on traffic estimation and forecasting to help drivers avoid bad traffic, especially around traffic incidents. The first contribution is an automated traffic management service to help drivers avoid traffic events based on analyses of historical trajectory data from other drivers. Users subscribe to the service and, when a traffic event occurs, the service provides advice based on all drivers' actions during a similar traffic event in the past. We present a solution that combines a graph search with a trajectory search to find the fastest path that was taken to avoid a similar event in the past. The intuition behind our solution is that we can avoid a traffic event by following the traces of the best driver from a similar situation in the past. The second contribution is a system that uses real-time traffic information and fast traffic simulations to adapt to traffic incident impact and generate navigation advice. In this work, we use faster-than-real-time simulations to model the evolution of traffic events and help drivers proactively avoid congestion caused by events. The system can subscribe to real-time traffic information and estimate the traffic conditions using fast simulations without the need for historical data. We evaluate our approach through extensive experiments that test the performance, accuracy, and quality of the navigation advice of our system with real data obtained from the TomTom Traffic API. For the third contribution, we propose effective deep learning models for large-scale citywide traffic forecasting. In addressing this problem, our goal is to predict traffic conditions for thousands of sites across the city. Such large-scale predictions can be used by navigation systems to help drivers avoid congestion. We propose a traffic forecasting model based on deep convolutional networks to improve the accuracy of citywide traffic forecasting. Our proposed model uses a hierarchical architecture that captures traffic dynamics at multiple spatial resolutions. We apply a multi-task learning scheme based on this architecture, which trains the model to predict traffic at different resolutions.
    Our model helps provide a coherent understanding of traffic dynamics by capturing short- and long-range spatial dependencies between different regions in a city. Experimental results on real datasets show that our model achieves competitive results while being more computationally efficient.
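As a small illustration of the multi-resolution idea mentioned above, the sketch below (not the thesis model) average-pools a synthetic citywide traffic grid into progressively coarser grids, the kind of targets a hierarchical, multi-task forecasting model could be trained to predict at each resolution. The grid values are random stand-ins for per-cell traffic speeds.

```python
# Illustrative sketch only: building coarser spatial resolutions of a citywide
# traffic grid by average pooling.
import numpy as np

rng = np.random.default_rng(0)
city = rng.uniform(0, 100, size=(32, 32))   # fine-grained grid (e.g., avg speed per cell)

def pool(grid: np.ndarray, factor: int) -> np.ndarray:
    h, w = grid.shape
    return grid.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

resolutions = {f"{32 // f}x{32 // f}": pool(city, f) for f in (1, 2, 4, 8)}
for name, grid in resolutions.items():
    print(name, "mean speed:", round(float(grid.mean()), 1))
```

In a hierarchical architecture, the coarse grids capture city-wide (long-range) dynamics while the fine grid captures local (short-range) dependencies.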
  • Item
    Computational models of visual search and attention in natural images: from neural circuits to search strategies
    Rashidi, Shima (2021)
    Humans use visual search constantly in their daily activities, from searching for their keys to looking out for pedestrians while driving. Cortical visual attention and human eye movements are among the mechanisms that enable effective visual search. In this thesis, we propose a theoretical model of eye movements in a visual search task as well as a computational model of the neural mechanisms underlying visual attention. Towards computationally modeling human eye movements, we propose a model of target detectability in natural scenes which can be used by a Bayesian ideal searcher. The model uses a convolutional neural network as a feature extractor to extract features of the target and the background, and uses signal detection theory to estimate detectability as the discriminability of the two feature distributions. We collect the detectability of a known target on 18 different backgrounds from human observers in a two-alternative forced-choice task. We use the collected detectabilities for verification and scaling of the ones estimated by our model. We further feed the collected detectabilities to a Bayesian ideal observer to predict human eye movements in a visual search task in natural scenes. We collect human eye movements on the same 18 natural backgrounds in a search for the known target and use them as ground truth. Our model closely follows some statistical parameters of the collected eye movements, supporting the idea that humans search near-optimally in natural images. We further generalize the detectability model to any target and background and apply the Bayesian ideal observer model to a real-world dataset of pedestrians in various contexts. To the best of our knowledge, we are the first to provide an unsupervised gaze prediction algorithm for natural scenes which uses the Bayesian ideal observer and does not need human eye movements for training. The presented model has potential applications in autonomous driving. Towards computationally modeling the neural mechanisms of visual attention, we propose a large-scale computational model of attentional enhancement of visual processing, based on the idea of neural oscillations being fed back to attended features or locations in a scene. Our proposed model supports the idea of neural oscillations mediating spatial attention and applies the same circuit to feature-based attention as well. The presented models can be used in various human-gaze-assisted artificial intelligence systems, as well as enhancing our knowledge of the human visual system.
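The signal detection step described above can be illustrated with a minimal d-prime computation over two feature distributions. This sketch uses synthetic Gaussian stand-ins rather than the CNN features and human data used in the thesis.

```python
# Minimal sketch: estimate a target's detectability (d') from how separable
# target-present and background-only feature responses are.
import numpy as np

rng = np.random.default_rng(0)
target_resp = rng.normal(loc=1.2, scale=1.0, size=5000)       # feature response with target
background_resp = rng.normal(loc=0.0, scale=1.0, size=5000)   # feature response without target

def d_prime(signal: np.ndarray, noise: np.ndarray) -> float:
    # Discriminability of the two distributions (equal-variance form).
    pooled_sd = np.sqrt((signal.var(ddof=1) + noise.var(ddof=1)) / 2)
    return float((signal.mean() - noise.mean()) / pooled_sd)

print("estimated detectability d' =", round(d_prime(target_resp, background_resp), 2))
```

A Bayesian ideal searcher then uses such per-location detectability values to decide where to fixate next, which is the role detectability plays in the thesis.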
  • Item
    Scalable contrast pattern mining for network traffic analysis
    Alipourchavary, Elaheh (2021)
    Contrast pattern mining is a data mining technique that characterises significant changes between datasets. Contrast patterns identify nontrivial differences between the classes of a dataset, interesting changes between multiple datasets, or emerging trends in a data stream. In this thesis, we focus on characterizing changes in Internet traffic using contrast pattern mining. For example, network managers require a compact yet informative report of significant changes in network traffic for security and performance management. In this context, contrast pattern mining is a promising approach to provide a concise and meaningful summary of significant changes in the network. However, the volume and high dimensionality of network traffic records introduce a range of challenges for contrast pattern mining. In particular, these challenges include the combinatorial search space for contrast patterns, the need to mine contrast patterns over data streams, and the need to identify new changes as opposed to rare recurring changes. In this thesis, we introduce a range of novel contrast mining algorithms to address these challenges. We first introduce the use of contrast pattern mining in network traffic analysis. We show that contrast patterns have strong discriminative power that makes them suitable for data summarization and for finding meaningful and important changes between different traffic datasets. We also propose several evaluation metrics that reflect the interpretability of patterns for security managers. In addition, we demonstrate on real-life datasets that the vast majority of extracted patterns are pure, i.e., most change patterns correspond to either attack traffic or normal traffic, but not a mixture of both. We propose a method to efficiently extract contrast patterns between two static datasets. We extract a high-quality set of contrast patterns by using only the most specific patterns to generate a compact and informative report of significant changes for network managers. By eliminating minimal patterns, our approach considerably reduces the overlap between generated patterns, and by reducing redundant patterns, we substantially improve the scalability of contrast pattern mining and achieve a significant speed-up. We also propose a novel approach to discriminate between new events and rare recurring events. Some changes in network traffic occur on a regular basis and show periodic behaviour, and these are already known to network analysts. Thus, network managers may want to filter out these known recurring changes and prioritize their focus on new changes, given their limited time and resources. Our approach to this problem is based on second-order contrast pattern mining. Our work demonstrates the importance of higher-order contrast pattern mining in practice, and provides an effective method for finding such higher-order patterns in large datasets. Finally, based on the approaches that we introduced for contrast pattern mining in static datasets, we propose two novel methods to extract contrast patterns over high-dimensional data streams. We consider two incremental scenarios: (i) when one dataset changes over time and the other dataset is static as a reference dataset, and (ii) when both datasets change over a data stream. In our approach, instead of regenerating all patterns from scratch, we reuse the previously generated contrast patterns wherever possible to mine the new set of patterns.
    Using this technique, we avoid expensive recomputation and increase the efficiency and scalability of mining on dense and high-dimensional data streams. As a result of this scalability, our method can also find solutions for datasets where the other algorithms cannot. In addition to the substantial improvements in the performance and scalability of our algorithm, we demonstrate that the quality of the patterns it generates is comparable with that of the other algorithms. In summary, we propose several efficient contrast pattern mining approaches to extract significant changes between two static datasets or over a data stream. We also introduce a novel approach to distinguish new changes from recurring changes. Our experimental results on different real-life datasets demonstrate the improvements in performance of our proposed algorithms compared to existing approaches.
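As a toy illustration of the underlying contrast measurement, the sketch below (not one of the thesis algorithms) computes the support of small itemsets in two synthetic traffic datasets and reports those whose growth rate between the datasets exceeds a threshold. The records and threshold are invented.

```python
# Toy sketch only: the core "contrast" measurement, i.e. how much more frequent
# an itemset is in one dataset than in another (its growth rate).
from itertools import combinations

baseline = [{"tcp", "port80", "small_pkt"}, {"udp", "port53"}, {"tcp", "port443"}]
current = [{"tcp", "port80", "large_pkt"}, {"tcp", "port80", "large_pkt"},
           {"udp", "port53"}, {"tcp", "port80", "large_pkt", "syn"}]

def support(pattern: frozenset, dataset: list[set]) -> float:
    return sum(pattern <= record for record in dataset) / len(dataset)

items = set().union(*current)
MIN_GROWTH = 3.0
for size in (1, 2):
    for pattern in map(frozenset, combinations(sorted(items), size)):
        s_cur, s_base = support(pattern, current), support(pattern, baseline)
        growth = s_cur / s_base if s_base > 0 else float("inf")
        if s_cur > 0 and growth >= MIN_GROWTH:
            print(set(pattern), "support now:", round(s_cur, 2), "growth:", growth)
```

Real network data makes this brute-force enumeration infeasible, which is why the thesis develops scalable, incremental, and higher-order algorithms instead of exhaustive search.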