Computing and Information Systems - Theses
Scalable ride-sharing through geometric algorithms and multi-hop matching
Thanks to ubiquitous Internet access, on-demand ride-sharing services have emerged to provide timely and convenient rides to passengers. Ride-sharing creates a win-win situation for all participants and brings substantial environmental benefits, and it has thus become a prevalent transportation mode. A fundamental problem in ride-sharing is determining how to dispatch vehicles to passengers. Efficiency and optimality are the two core requirements for dispatching algorithms: long response times significantly degrade the user experience, while sub-optimal matching results substantially lower operational efficiency. Finding optimal matches quickly is challenging due to the large number of participants involved, the highly dynamic scenario, and the expensive road network distance computations. This thesis proposes efficient and scalable algorithms that enable fast, high-quality dispatching. We observe that vehicles and passengers can only visit limited areas if they are to meet their latest arrival times, and these limited areas can be used to quickly filter out infeasible vehicles; when represented as rectangles, such areas can be easily indexed and updated. Inspired by this key observation, we first propose an efficient algorithm for a basic problem in ride-sharing: how to quickly prune infeasible vehicles for passengers. Our algorithm guarantees that all possible candidates are found, since we compute and index the confined visiting areas as ellipses, which provide a tight bound regardless of the underlying network. The effective pruning leaves only a small number of candidates, which substantially reduces the complexity of selecting the optimal matches and the overall matching time. We further investigate multi-hop ride-sharing algorithms that allow transfers on a user's trip. These algorithms search for possible transfers between vehicles by detecting whether the vehicles' visiting areas overlap, and they prune the combinations of vehicles and transfer points, thus overcoming the efficiency bottleneck and achieving real-time responses. We propose exact algorithms that guarantee finding optimal matches, and we further improve the matching time using speed-up strategies. The enabled multi-hop trips greatly enhance the flexibility of real-world ride-sharing, increase the number of successfully matched requests, and reduce the travel distance of vehicles. Lastly, we facilitate quick assignment of the nearest pick-up/drop-off locations to passengers by proposing an efficient and scalable all-nearest-neighbor algorithm for road networks. We exploit properties of the problem so that finding a nearest neighbor takes only constant time, while the preprocessing phase requires only a single graph traversal. The proposed algorithms achieve running times orders of magnitude faster than the state-of-the-art algorithms, and the auxiliary indices are lightweight, eliminating the high preprocessing and update costs of highly dynamic scenarios.
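As a rough illustration of the pruning idea described above, the sketch below works in the Euclidean plane: a point can lie on a feasible detour between an origin o and a destination d only if it falls inside the ellipse with foci o and d whose major axis equals the remaining distance budget (since dist(o, p) + dist(p, d) must not exceed the budget), and the ellipse's bounding rectangle gives a cheap filter suitable for a spatial index. All function names here are illustrative; a real dispatcher would verify exact road network distances on the surviving candidates.

```python
import math

def ellipse_contains(o, d, budget, p):
    """A point p can be visited on the way from o to d within the
    distance budget only if dist(o, p) + dist(p, d) <= budget,
    i.e. p lies inside the ellipse with foci o and d."""
    return math.dist(o, p) + math.dist(p, d) <= budget

def bounding_rect(o, d, budget):
    """Axis-aligned bounding rectangle of the ellipse, usable as a
    cheap pre-filter in a spatial index such as an R-tree."""
    cx, cy = (o[0] + d[0]) / 2, (o[1] + d[1]) / 2
    a = budget / 2  # semi-major axis; a box of half-extent a covers any orientation
    return (cx - a, cy - a, cx + a, cy + a)

def prune_vehicles(vehicle_positions, o, d, budget):
    """Keep only vehicles that could feasibly serve the request."""
    return [v for v in vehicle_positions if ellipse_contains(o, d, budget, v)]
```

Because Euclidean distance lower-bounds road network distance, a filter of this shape never discards a feasible vehicle, which matches the completeness guarantee claimed above.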
Multi-Granular Webpage Information Extraction and Analysis via Deep Joint Learning
The number of webpages is growing exponentially, which results in a great volume of unstructured information on the web. It takes time either to fully comprehend a webpage or to retrieve relevant information from a complex one, so analyzing unstructured webpages and automatically extracting structured information from them is crucial. In this study, we aim to develop algorithms for multi-granular webpage information extraction and analysis to facilitate webpage information understanding. We investigate the problem at three levels of granularity, i.e., the micro, meso, and macro levels. For every level, we focus on one extraction and analysis task, although the algorithms we developed are general and can be applied to many other similar tasks. At the micro level, we aim to extract webpage entities that have diverse forms, focusing on the application of person name recognition. We propose a fine-grained annotation scheme based on anthroponymy and create the first dataset for fine-grained name recognition. We propose a joint model that learns the different name form classes with two sub-neural networks while fusing the learned signals through co-attention and gated fusion mechanisms. Experimental results show that our annotations can be utilised in different ways to improve recognition performance. At the meso level, we study the relationships between webpage entities and blocks, with a focus on the joint recognition of names and publications. We address the person name recognition and publication string recognition tasks in academic homepages jointly, based on the insight that the two tasks are inherently correlated, and propose a joint model to capture the interdependencies between entities. We also capture global position patterns of blocks and local position patterns of entities during model learning. Empirical results on real datasets show that our model outperforms the state-of-the-art publication string recognition and person name recognition models, as well as baseline joint models. At the macro level, we aim to provide hierarchical analysis of webpages from diverse domains. We introduce the Webpage Briefing (WB) task, which aims to generate a summary of a webpage in a hierarchical manner: at the top is an abstract, general description of the topic of the webpage, followed by high-level key attributes extracted from the webpage, and then lower-level key attributes containing concrete and specific key information. We propose to perform webpage briefing by identifying and summarizing the informative contents, mimicking the human behaviour of understanding a complex webpage. We propose a novel Dual Distillation (Bi-Distill) method with a teacher-student architecture, and further propose a Triple Distillation (Tri-Distill) method to better exploit the inherent correlation between the specific key attributes and the general topics of webpages. We finally propose a novel Triple Joint (Tri-Join) model with a triple joint learning architecture featuring signal exchange and enhancement mechanisms. Experimental results show the superiority of Bi-Distill and Tri-Distill over baseline methods, and that Tri-Join outperforms both baseline single-task models and baseline jointly trained models.
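The gated fusion mechanism mentioned above can be pictured with a generic sketch like the following, in which a learned sigmoid gate interpolates between the two sub-networks' signals. This is the standard gating formulation rather than the exact architecture of the thesis model, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Generic gated fusion of two signal streams, e.g. the outputs
    of two sub-networks modelling different name form classes."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, a, b):
        # g in (0, 1) decides, per dimension, how much each stream
        # contributes to the fused representation
        g = torch.sigmoid(self.gate(torch.cat([a, b], dim=-1)))
        return g * a + (1 - g) * b

fusion = GatedFusion(dim=128)
a = torch.randn(4, 10, 128)   # signals from sub-network 1
b = torch.randn(4, 10, 128)   # signals from sub-network 2
fused = fusion(a, b)          # shape: (4, 10, 128)
```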
Do personality traits drive online commitment to vote in social networks?
This study examines social network effects in a community election, and the effects of actor attributes in fostering commitment-to-vote behaviour. Across Western societies, political participation is in decline, posing major challenges for democracy. Since 2000, political advocacy has undergone a rapid transformation led by a disruptive wave of IT-led innovation in infrastructure, predictive analytics, and online social networks. Political campaigns have harnessed these advances to target, influence, and mobilise partisan voters. Yet is political participation uniform across voters? Network interventions using online social networks and behavioural science have been found to increase voter turnout and to reveal individual differences in political behaviour (Bakshy, Messing, & Adamic, 2015; Bond et al., 2012). In particular, extroverted individuals may play an enhanced role relative to other users (Messing, 2015). Our current picture of personality traits and political behaviour is largely offline; few studies relate to online behaviour (Jordan, Pope, Wallis, & Iyer, 2015). Studies have examined individual responses to general and targeted information against two-step flow communication models (Lazarsfeld, Berelson, & Gaudet, 1948), yet opinion leaders and social networks still shape individual attitudes and behaviours (Contractor & DeChurch, 2014). How personality traits guide social interactions and social influence online during an election remains unclear. In this research we examine the interaction of personality traits, internal political efficacy, and human social motives in enacting social influence during a community election. Our research model investigates whether personality traits drive commitment-to-vote behaviour by eliciting implementation intentions through an online vote plan. The attribute of gender, often ignored, is incorporated to determine the influence processes operating within the network. Prior online political network analysis has relied on data collected from third-party platforms intended for other purposes. We overcome this limitation using a novel, mobile-first social media app operating as a social network to better connect community members to political information, increase engagement, and improve transparency. The app enables unobtrusive data collection based on voluntary user interaction with online information, and is supported by a scalable graph-database architecture for behaviour tracking, real-time analytics, and increased granularity. Validated measurement instruments for personality traits, internal political efficacy, and commitment to vote are integrated into the app, with a single response recorded for each user along with demographic information. To understand how an individual's attributes and their relationships with others affect commitment to vote, we adopt an autologistic actor attribute model (ALAAM), a social influence model for the statistical analysis of observed network data. With political disengagement endemic across Western democracies, new interventions that mitigate or reverse current trends are highly valued. Beyond political network analysis, social network research opportunities exist in national and international economic, institutional, and community settings. By deepening our understanding of small-world interactions, we hope to respond to the collective challenge of restoring the link between the meaning and purpose of voting, and to help reinvigorate democracy.
Optimisation of phasing: towards improved haplotype-based genetic investigations
Haplotype or phase information significantly adds to the ability to resolve genetic problems and is important for elucidating and interpreting the genetic basis underlying certain diseases or traits. The common approach to deriving phase information is through computational haplotype phasing or estimation methods, and current developments in phasing have paved the way for the wide use of haplotype information in population genetic investigations. This PhD thesis explores various ways to utilise haplotype information effectively to conduct precise haplotype-based genetic investigations. It provides evidence of the important role of haplotypes in detecting significant genetic associations with phenotypes that can otherwise be missed. In particular, it provides a comprehensive evaluation of state-of-the-art phasing approaches as well as different haplotype block determination methods (sliding window and linkage disequilibrium based). A rigorous analysis is conducted to improve phasing accuracy through a consensus haplotype estimator across datasets with different characteristics. Furthermore, phasing optimisation is utilised to develop a new approach to haplotype-based Expression Quantitative Trait Loci (eQTL) analysis, which is assessed against genotype-based eQTL methods (both single SNPs and combinations of SNPs). The main contributions of this PhD study are: 1. Novel evaluations and comparisons of haplotype phasing considering accuracy at the block scale, the most popular way to use phase information in genetic studies. 2. An improvement in phasing accuracy of up to 10% when using the proposed consensus approach. 3. Evidence that the consensus approach leads to the highest-accuracy genotype imputation when performed via the well-known tools Minimac3, pbwt, and Beagle5. 4. An approach for haplotype-based eQTL analysis, demonstrated to outperform standard eQTL methods when the causal genetic architecture involves multiple variations. Finally, the work in this PhD thesis highlights the fundamental role of haplotype information in genetic problems and provides guidance for other researchers interested in performing haplotype-related investigations. Two tools (consHap and eQTLHap) are also released publicly with this PhD to support other research studies.
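The consensus idea can be conveyed with a toy majority-vote sketch. A real consensus estimator such as consHap must handle switch errors and full haplotype data; here, each tool's phase calls are simplified to the allele carried on the first haplotype at heterozygous sites, and each estimate is first oriented against a reference estimate (phase being defined only up to a global flip) before voting site by site.

```python
from collections import Counter

def align_orientation(ref, hap):
    """Phase is only defined up to flipping both haplotypes, so
    orient each tool's output to agree with a reference estimate."""
    agree = sum(r == h for r, h in zip(ref, hap))
    return hap if agree >= len(hap) - agree else [1 - h for h in hap]

def consensus_phase(estimates):
    """Per-site majority vote over the orientation-aligned phase
    calls produced by several phasing tools."""
    ref = estimates[0]
    aligned = [align_orientation(ref, e) for e in estimates]
    return [Counter(col).most_common(1)[0][0] for col in zip(*aligned)]

# phase calls at heterozygous sites (allele on haplotype 1) from 3 tools
tools = [[0, 1, 1, 0], [0, 1, 0, 0], [1, 1, 1, 0]]
print(consensus_phase(tools))  # [0, 1, 1, 0]
```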
Secure voter authentication for poll-site elections in developing countries
In the developing world, secure and privacy-preserving voter authentication remains one of the main challenges of poll-site election administration. Aside from the unavailability of a robust and trustworthy national identity infrastructure, voter authentication in most developing countries has been largely undermined by issues arising from multiple registration, polling booth capture, voter impersonation, and vote buying, amongst others. This thesis first describes how these issues play a role in compromising electoral integrity in developing countries. Next, it presents an end-to-end verifiable poll-site attendance verification protocol designed to give voters, auditors, and independent election observers the opportunity to detect voter impersonation and the ballot-stuffing attacks mounted by colluding poll workers and dishonest election officials. For the second main contribution, the thesis extensively discusses the cryptographic techniques employed in a novel three-factor voter authentication protocol intended to disincentivise voter impersonation, vote buying, registration fraud, and identity theft. In particular, by drawing on the unique security properties of fuzzy extractors and anonymous credentials, this authentication protocol not only ensures that voter credentials are non-transferable, but also provides election officials with efficient mechanisms to revoke questionable credentials that could be used for multiple voting. Finally, we describe the results obtained from the formal modelling and verification of the designed protocols using the Tamarin prover. The symbolic analysis carried out shows that the two protocols satisfy the notion of participation privacy, provided that voters are honest.
Training accurate, diverse, and unbiased recommender systems by joint optimization
In today's era of information explosion, users are faced with an overwhelming number of alternative items to choose from. For example, the Netflix website has millions of movies for users to watch, whereas the Amazon Kindle Store has millions of books for users to read. Typically, only a small fraction of items satisfy a user's information needs, and it is difficult for users to find these items given the sheer number of items on offer. Recommender systems serve as a primary means of helping users easily and quickly find what they want, with the aim of presenting their desired items right in front of them. Unlike search systems, which require a user to explicitly formulate his or her information needs as a query and then retrieve items matching that query (for example, the Google search system provides a ranked list of relevant documents in response to a user's query), recommender systems require no explicit query and proactively present items that users may enjoy according to their preferences. Recommender systems have a wide range of real-world applications and are extensively used in our daily life: the Netflix recommendation system suggests movies to a user in the hope that the user will enjoy watching them, and the YouTube recommendation engine helps select videos that users want to watch from the huge number available online. Search and recommender systems are important in the era of information explosion for two reasons: first, they play a central role in helping users find results that are of interest; second, they contribute billions of dollars in net revenue to industry every year. In this thesis, we focus on improving recommender systems in terms of three different aspects, i.e., accuracy, diversity, and unbiasedness, each of which has its own unique challenges. (1) First, we aim to improve the accuracy of recommender systems. We identify an important type of resource in many applications of recommender systems, e.g., recommending tags for users to label images uploaded to image-hosting websites. Such resources, which we call privileged provisions, are available when using labeled data to train models but not when using models to predict unseen data. We propose two general classes of algorithms, collectively referred to as adversarial distillation, that can use privileged provisions to achieve accurate recommendation results. Moreover, we provide rigorous proofs showing that the proposed algorithms have non-trivial theoretical guarantees of achieving globally optimal recommendation results. (2) Then, we aim to improve the diversity of recommender systems. A related issue is well known in search, where most queries are ambiguous or too broad in specifying user intents; search result diversification addresses it by returning a ranked list of diverse results covering as many user intents of a query as possible, and to choose appropriate diversification algorithms we design widely applicable metrics that faithfully measure the diversity of search results.
We observe that preferences regarding diverse results vary across users: one user may be interested in only action movies, while another may enjoy a variety of movie categories. We propose a novel algorithm that explicitly considers such user preferences over different item categories when recommending a personalized ranked list of items. To effectively measure how well the recommended items satisfy an individual user's preferences, we design a metric that measures the personalized diversity of recommendation results. Extensive experiments show that the proposed algorithm outperforms state-of-the-art algorithms by up to 10.0% and 10.8% in terms of an existing metric and the proposed metric, respectively. (3) Last, we aim to improve the unbiasedness of recommender systems. Datasets for training recommender systems are often biased, and a widely recognized challenge is to produce accurate recommendation results given biased training data. To address this challenge, we propose a doubly robust estimator to unbiasedly measure the error of recommendation results, and develop a joint learning approach to obtain accurate recommender systems; the proposed approach achieves a 12% lower recommendation error than state-of-the-art approaches. Although recommendation datasets are often biased, it is usually possible in practice to gather a small unbiased dataset at reasonable cost. We propose a bi-level learning approach that effectively leverages the additional information in the small unbiased dataset to boost the accuracy of recommendation results, achieving up to a 7.9% drop in recommendation error compared to previous approaches.
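The doubly robust estimator mentioned above follows a standard form from the bias-correction literature: an imputation model supplies error estimates for all user-item pairs, and inverse-propensity-weighted residuals correct it on the observed entries, so the estimate is unbiased if either the imputed errors or the propensities are accurate. Below is a minimal sketch on synthetic data; all variable names are illustrative.

```python
import numpy as np

def doubly_robust_error(e_hat, e_obs, observed, propensity):
    """Doubly robust estimate of prediction error over all user-item
    pairs: imputed errors e_hat are corrected, on the observed
    entries only, by inverse-propensity-weighted residuals."""
    correction = observed * (e_obs - e_hat) / propensity
    return float(np.mean(e_hat + correction))

rng = np.random.default_rng(0)
n_users, n_items = 100, 50
e_hat = rng.random((n_users, n_items))                  # imputed errors
e_obs = e_hat + 0.1 * rng.standard_normal((n_users, n_items))
propensity = np.clip(rng.random((n_users, n_items)), 0.05, 1.0)
observed = (rng.random((n_users, n_items)) < propensity).astype(float)
print(doubly_robust_error(e_hat, e_obs, observed, propensity))
```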
Modelling search and session effectiveness
Search effectiveness metrics are used to quantify the quality of a ranked list of search results relative to a query. One line of argument suggests that incorporating user behaviour into the measurement of search effectiveness via a user model is useful, so that metric scores reflect what the user has experienced during the search process. A wide range of metrics has been proposed, and many of these correspond to user models. In reality, users often reformulate their queries during the course of a session; hence, it is desirable to involve both query- and session-level behaviours in the development of model-based metrics. In this thesis, we use interaction data from commercial search engines and laboratory-based user studies to model query- and session-level search behaviours and user satisfaction; to inform methods for the evaluation of search sessions; and to explore the interaction between user models, metric scores, and satisfaction. We consider two goals in session evaluation: the first is to develop an effectiveness model for session evaluation, and the second is to establish a fitted relationship between individual query scores and session-level satisfaction ratings. To achieve the first goal, we investigate factors that affect query- and session-level behaviours, and develop a new session-based user model that provides a closer fit to observed behaviour than previous models. This model is then used to devise a new session-based metric, sINST. In regard to the second goal, we explore variables influencing session-level satisfaction, and suggest that the combination of query positional and quality factors provides a better correlation with session satisfaction than query position alone. Based on this observation, we propose a novel query-to-session aggregation function that is useful for scoring sessions in which sequences of query reformulations are observed. We also propose a meta-evaluation framework that allows metric comparisons based on empirical evidence derived from search interaction logs, and investigate the connection between predicted and observed behaviour, and between metric scores and user satisfaction, at both the query and session levels.
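To convey the flavour of a query-to-session aggregation that mixes positional and quality factors, here is a toy weighting in which later queries receive more weight through a recency decay and higher-scoring queries are further up-weighted. The function actually fitted in the thesis differs, and the constants below are purely illustrative.

```python
def session_score(query_scores, decay=0.8):
    """Toy query-to-session aggregation: each query's effectiveness
    score is weighted by (a) recency, since later queries sit closer
    to the session outcome, and (b) the score itself, so good queries
    contribute more than position alone would suggest."""
    n = len(query_scores)
    weights = [decay ** (n - 1 - i) * (0.5 + s)
               for i, s in enumerate(query_scores)]
    return sum(w * s for w, s in zip(weights, query_scores)) / sum(weights)

print(session_score([0.2, 0.5, 0.9]))  # later, better queries dominate
```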
Knowledge base enrichment via deep neural networks
A knowledge base is a large repository that typically stores information about real-world entities. Several efforts have been made to develop knowledge bases in general and specific domains, such as DBpedia, YAGO, LinkedGeoData, and Wikidata. These knowledge bases contain millions of facts about entities; however, they are far from complete and mandate continuous enrichment and curation. In this thesis, we study three common methods to enrich a knowledge base. The first is knowledge base alignment, which aims to find entities in two knowledge bases that represent the same real-world entity, and then integrates the knowledge bases based on the aligned entities. Many knowledge bases have been created separately for particular purposes, with overlapping entity coverage; such knowledge bases are complementary in terms of completeness, and we may integrate them to form a more extensive knowledge base for knowledge inference. The second is relation extraction, which aims to extract entities and their relationships from the sentences of a corpus and map them to an existing knowledge base. Given the large amount of unstructured data sources (i.e., sentences), relation extraction is an essential method for extracting facts from any data source to enrich a knowledge base. The third is description generation, which aims to generate a sentence describing a target entity from its properties in a knowledge base. The generated description can be used to enrich the presentation of knowledge in a knowledge base, which in turn can be used in many downstream applications; for example, in question answering, the generated sentence can describe the entity in the answer. For knowledge base alignment, we propose an embedding-based entity alignment model that exploits attribute embeddings to capture the similarity between entities in different knowledge bases. We also propose an end-to-end relation extraction model for knowledge base enrichment that integrates the extraction and canonicalization tasks; this integration helps the model reduce the error propagation between relation extraction and named entity disambiguation to which existing approaches are prone. For description generation, we propose a content-plan-based attention model to generate sentences from knowledge base triples in the form of a star-shaped graph, and further propose a graph-based encoder that handles arbitrarily shaped graphs for generating entity descriptions. Extensive experimental results show that the proposed methods outperform the state-of-the-art methods on the knowledge base enrichment problems studied.
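A much-simplified sketch of embedding-based alignment follows: each entity receives a structural embedding and an attribute embedding, the two similarity signals are combined, and every entity in one knowledge base is matched to its nearest neighbour in the other. The weighting scheme and the randomly generated embeddings are placeholders, not the thesis's model.

```python
import numpy as np

def cosine_sim(A, B):
    """Pairwise cosine similarity between row-wise embedding matrices."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def align_entities(struct1, struct2, attr1, attr2, alpha=0.5):
    """Match each entity in KB1 to its most similar entity in KB2 by
    combining structural and attribute embedding similarity."""
    sim = (alpha * cosine_sim(struct1, struct2)
           + (1 - alpha) * cosine_sim(attr1, attr2))
    return sim.argmax(axis=1)

rng = np.random.default_rng(1)
s1, s2 = rng.random((5, 16)), rng.random((7, 16))  # structure embeddings
a1, a2 = rng.random((5, 16)), rng.random((7, 16))  # attribute embeddings
print(align_entities(s1, s2, a1, a2))  # best KB2 match per KB1 entity
```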
Foundations for reasoning about holistic specifications
Specifications of sufficient conditions may be enough for reasoning about the complete and unchanging programs of a closed system. In an open world, however, we do not have the luxury of trusting external components, often of unknown provenance, which may be buggy or even malicious. It is therefore critical to ensure that our components are robust when cooperating with a wide variety of external components. Holistic specifications, which are concerned with both sufficient and necessary conditions, can make programs more robust in an open-world setting. In this thesis, we lay the foundations for reasoning about holistic specifications. We give an Isabelle/HOL mechanization of holistic specifications focusing on object-based programs, and we pave the way for reasoning about holistic specifications by proving key lemmas that we hope will be useful in establishing a general logic for holistic specifications.
Design of secure circuits resistant to side channel attacks for lightweight cryptography
Side channel attacks (SCAs) have been found to be effective for breaching the security of lightweight cryptographic systems, extracting cryptographic secrets using signals and associated information gathered from side channels, predominantly power and timing. Since their discovery, they have been a major threat to many lightweight cryptographic devices, including smart cards, internet-of-things (IoT) devices, and mobile phones. Many contemporary devices, such as smart cards and IoT devices, rely on lightweight ciphers for encrypting and decrypting data. Accordingly, various design- and implementation-level countermeasures, broadly categorised as hiding and masking, have been proposed to address side channel attacks. In this thesis, we develop novel countermeasures against SCAs through circuits designed to minimise data-dependent variations in power and timing by way of hiding and masking, additionally using non-linear pseudo-random bits for extra security. Our key contribution is the use of Binary Decision Diagram (BDD) based path balancing in conjunction with dual-rail circuits with pre-charging (DRPC) to create circuits implementing ciphers that are resistant to power, timing, and early propagation effect (EPE) based attacks. This path-balanced DRPC-BDD technique has been successfully applied to masking to create circuits that retain all the benefits of the underlying scheme, with the additional benefit of offering resistance to electromagnetic (EM) attacks. Multi-bit masking has been proposed, using a separate mask bit per output function. Because a fixed mask bit is also unsafe, our masking scheme instead uses a non-linear feedback shift register (NLFSR) to generate uncorrelated mask bits dynamically. The basic path-balanced DRPC-BDD approach leads to relatively large circuits on account of the dual-rail scheme; this problem has been addressed by creating BDDs for multi-output functions with BDD node sharing to implement the cipher S-boxes, leading to a significant reduction in the number of BDD nodes and the associated number of transistors. Finally, we explore partial capacitive decoupling of the power supply to further obfuscate the correlation between the power supply line current and the operation of the circuit. Two decoupling schemes were developed and evaluated: a pass-transistor configuration and a push-pull configuration. Both schemes successfully obfuscated the circuit operation, obscuring the distinction between the pre-charge and evaluation phases and adding another layer of security. Overall, a novel DRPC-BDD approach to circuit design with path balancing has been developed, and further optimised and modified with partial capacitive decoupling. The developed methods have been evaluated through extensive experimentation on various implementations of cipher S-boxes, for 4, 8, 16, 32, and 64 bits, using industry-standard CAD tools and technology libraries.
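The hiding property that dual-rail pre-charge logic provides can be seen in a small behavioural model: each logical bit travels on a pair of complementary wires, both wires are low during the pre-charge phase, and exactly one rises during evaluation, so the switching activity is independent of the data being processed. The Python model below is purely illustrative; the thesis works at the transistor and BDD level.

```python
# Behavioural model of dual-rail pre-charge (DRPC) signalling.
PRECHARGE = (0, 0)  # both rails low between evaluations

def encode(bit):
    """A logical bit as a (true_rail, false_rail) pair."""
    return (1, 0) if bit else (0, 1)

def drpc_and(a, b):
    """Dual-rail AND: true rail asserted when both inputs are true,
    false rail asserted when either input is false."""
    at, af = a
    bt, bf = b
    return (at & bt, af | bf)

def transitions(value):
    """Wires toggled going from pre-charge to this value."""
    return sum(v != p for v, p in zip(value, PRECHARGE))

for x in (0, 1):
    for y in (0, 1):
        out = drpc_and(encode(x), encode(y))
        # every evaluation toggles exactly one output wire,
        # regardless of the data being processed
        assert transitions(out) == 1
        print(x, y, "->", out)
```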
Natural Language Processing for Improving Transparency in Representative Democracy
Language is the natural medium of politics; hence, among the many types of data, text is the central artifact for capturing political behaviour. In this thesis we focus on automating several political analyses using natural language processing techniques, which can improve the transparency of policy-making and thereby the voters' trust in representative democracy. Political scientists have observed that voters' trust in government is necessary for the successful implementation of policies, and that, in turn, their trust is based on the effective implementation of policies and services. In order to improve this trust, the policy-making process should be more transparent and receptive. The policy-making process typically consists of several stages, and we focus on the two primary stages involving political parties and voters: policy proposal and its implementation audit. Specifically, we target three major aspects of the policy-making process: (a) analyzing policy proposals during election campaigns: which policy goals are spoken about, in which specific contexts, and what promises are made; (b) post-election policy implementation audit: given the pre-election promises, which set the expectations of voters, does the government make progress towards those promises; and (c) public advocacy for policy changes: what changes do the voters want. This last aspect can be seen both as an evaluation of existing policies and as suggestions for changes; more importantly, active participation of voters in the process reflects their level of trust in the system. We define the individual research targets based on the political science literature and automate them using deep learning approaches. We use canonical sources of text for each of the tasks, for example, election manifestos released by political parties (more sources are discussed in Chapter 2). The challenges involved in this work are multi-fold, from defining the tasks, to dataset creation, to developing suitable models. We hope that this PhD thesis, dealing with political text analysis, will shed light on the available data sources, on the flavour of tasks at the intersection of natural language processing and political science, and on the techniques to handle the challenges involved.
Cost-efficient Management of Cloud Resources for Big Data Applications
Analyzing vast amounts of business and user data on big data analytics frameworks is becoming a common practice in organizations seeking a competitive advantage. These frameworks are usually deployed in a computing cluster to meet the analytics demands of every major domain, including business, government, financial markets, and health care. However, buying and maintaining a massive amount of on-premise resources is costly and difficult, especially for start-ups and small business organizations. Cloud computing provides infrastructure, platform, and software systems for storing and processing data; thus, Cloud resources can be used to set up a cluster with the required big data processing framework. However, several challenges need to be addressed for Cloud-based big data processing, including: deciding how many Cloud resources are needed for each application; maximizing the utilization of these resources to improve application performance; and reducing the monetary cost of resource usage. In this thesis, we take a user-centric view, where a user can be an individual or a small-to-medium business organization that wants to deploy a big data processing framework on the Cloud. We explore how resource management techniques can be tailored to various user demands, such as performance improvement and deadline guarantees for applications, all while reducing the monetary cost of using the cluster. In particular, we propose efficient resource allocation and scheduling mechanisms for Cloud-deployed Apache Spark clusters.