Computing and Information Systems - Theses

  • Item
    Task assignment using worker cognitive ability and context to improve data quality in crowdsourcing
    Hettiachchi Mudiyanselage, Danula Eranjith (2021)
    While crowd work on crowdsourcing platforms is becoming prevalent, there exists no widely accepted method to successfully match workers to different types of tasks. Previous work has considered using worker demographics, behavioural traces, and prior task completion records to optimise task assignment. However, optimum task assignment remains a challenging research problem, since proposed approaches lack an awareness of workers' cognitive abilities and context. This thesis investigates and discusses how to use these key constructs for effective task assignment: workers' cognitive ability, and an understanding of the workers' context. Specifically, the thesis presents 'CrowdCog', a dynamic online system for task assignment and task recommendations, that uses fast-paced online cognitive tests to estimate worker performance across a variety of tasks. The proposed task assignment method can achieve significant data quality improvements compared to a baseline where workers select preferred tasks. Next, the thesis investigates how worker context can influence task acceptance, and it presents 'CrowdTasker', a voice-based crowdsourcing platform that provides an alternative form factor and modality to crowd workers. Our findings inform how to better design crowdsourcing platforms to facilitate effective task assignment and recommendation, which can benefit both workers and task requesters.
  • Item
    Data quality and quantity in mobile experience sampling
    van Berkel, Niels (2019)
    The widespread availability of technologically advanced mobile devices has brought researchers the opportunity to observe human life in day-to-day circumstances. Rather than studying human behaviour through extensive surveys or in artificial laboratory situations, this research instrument allows us to systematically capture human life in naturalistic settings. Mobile devices can capture two distinct data streams. First, the data from sensors embedded within these devices can be appropriated to construct the context of study participants. Second, participants can be asked to actively and repeatedly provide data on phenomena which cannot be reliably collected using the aforementioned sensor streams. This method is known as Experience Sampling. Researchers employing this method ask participants to provide observations multiple times per day, across a range of contexts, and to reflect on current rather than past experiences. This approach brings a number of advantages over existing methods, such as the ability to observe shifts in participant experiences over time and context, and reducing reliance on the participant’s ability to accurately recall past events. As the onus of data collection lies with participants rather than researchers, there is a firm reliance on the reliability of participant contributions. While previous work has focused on increasing the number of participant contributions, the quality of these contributions has remained relatively unexplored. This thesis focuses on improving the quality and quantity of participant data collected through mobile Experience Sampling. Assessing and subsequently improving the quality of participant responses is a crucial step towards increasing the reliability of this increasingly popular data collection method. Previous recommendations for researchers are based primarily on anecdotal evidence or personal experience in running Experience Sampling studies.
    While such insights are valuable, it is challenging to replicate these recommendations and quantify their effect. Furthermore, we evaluate the application of this method in light of recent developments in mobile devices. The opportunities and challenges introduced by smartphone-based Experience Sampling studies remain underexplored in the current literature. Such devices can be utilised to infer participants’ context and optimise questionnaire scheduling and presentation to increase data quality and quantity. By deploying our studies on these devices, we explore the opportunities of mobile sensing and interaction in the context of mobile Experience Sampling studies. Our findings illustrate the feasibility of assessing and quantifying participant accuracy through the use of peer assessment, ground truth questions, and the assessment of cognitive skills. We empirically evaluate these approaches across a variety of study goals. Furthermore, our results provide recommendations on study design, motivation and data collection practices, and appropriate analysis techniques of participant data concerning response accuracy. Researchers can use our findings to increase the reliability of their data, to collect participant responses more evenly across different contexts in order to reduce the potential for bias, and to increase the total number of collected responses. The goal of this thesis is to improve the collection of human-labelled data in Experience Sampling Method (ESM) studies, thereby strengthening the role of smartphones as valuable scientific instruments. Our work reveals a clear opportunity in the combination of human and sensor data sensing techniques for researchers interested in studying human behaviour in situ.
  • Item
    Duplication in biological databases: definitions, impacts and methods
    Chen, Qingyu (2017)
    Duplication is a pressing issue in biological databases. This thesis concerns duplication, in terms of its definitions (what records are duplicates), impacts (why duplicates are significant) and solutions (how to address duplication). The volume of biological databases is growing at an unprecedented rate, populated by complex records drawn from heterogeneous sources; the huge data volume and the diverse types cause concern for the underlying data quality. A specific challenge is duplication, that is, the presence of redundant or inconsistent records. While existing studies concern duplicates, the definitions of duplicates are not clear; the foundational understanding of what records are considered as duplicates by database stakeholders is lacking. The impacts of duplication are not clear either; existing studies have different or even inconsistent views on the impacts. The unclear definitions and impacts of duplication in biological databases further limit the development of the related duplicate detection methods. In this work, we refine the definitions of duplication in biological databases through a retrospective analysis of merged groups in primary nucleotide databases – the duplicates identified by record submitters and database staff (or biocurators) – to understand what types of duplicates matter to database stakeholders. This reveals two primary representations of duplication in the context of biological databases: entity duplicates, multiple records belonging to the same entities, which particularly impact record submission and curation, and near duplicates (or redundant records), records sharing high similarities, which particularly impact database search. The analysis also reveals different types of duplicate records, showing that database stakeholders are concerned with diverse types of duplicates in reality, whereas previous studies mainly consider records with very high similarities as duplicates.
    Following this foundational analysis, we investigate both primary representations. For entity duplicates, we establish three large-scale benchmarks of labelled duplicates from different perspectives (submitter-based, expert curation and automatic curation), assess the effectiveness of an existing method, and develop a new supervised learning method that detects duplicates more precisely than previous approaches. For near duplicates, we assess the effectiveness and the efficiency of the best-known clustering-based methods in terms of database search result diversity (whether retrieved results are independently informative) and completeness (whether retrieved results miss potentially important records after de-duplication), and propose suggestions and solutions for more effective biological database search.