Computing and Information Systems - Research Publications

  • Item
    Benchmarks for measurement of duplicate detection methods in nucleotide databases
    Chen, Q ; Zobel, J ; Verspoor, K (OXFORD UNIV PRESS, 2023-12-18)
    Duplication of information in databases is a major data quality challenge. The presence of duplicates, implying either redundancy or inconsistency, can have a range of impacts on the quality of analyses that use the data. To provide a sound basis for research on this issue in databases of nucleotide sequences, we have developed new, large-scale validated collections of duplicates, which can be used to test the effectiveness of duplicate detection methods. Previous collections were either designed primarily to test efficiency, or contained only a limited number of duplicates of limited kinds. To date, duplicate detection methods have been evaluated on separate, inconsistent benchmarks, leading to results that cannot be compared and, due to limitations of the benchmarks, of questionable generality. In this study, we present three nucleotide sequence database benchmarks, based on information drawn from a range of resources, including information derived from mapping to two data sections within the UniProt Knowledgebase (UniProtKB), UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. Each benchmark has distinct characteristics. We quantify these characteristics and argue for their complementary value in evaluation. The benchmarks collectively contain a vast number of validated biological duplicates; the largest has nearly half a billion duplicate pairs (although this is probably only a tiny fraction of the total that is present). They are also the first benchmarks targeting the primary nucleotide databases. The records include the 21 most heavily studied organisms in molecular biology research. Our quantitative analysis shows that duplicates in the different benchmarks, and in different organisms, have different characteristics. It is thus unreliable to evaluate duplicate detection methods against any single benchmark. For example, the benchmark derived from UniProtKB/Swiss-Prot mappings identifies more diverse types of duplicates, showing the importance of expert curation, but is limited to coding sequences. Overall, these benchmarks form a resource that we believe will be of great value for development and evaluation of the duplicate detection or record linkage methods that are required to help maintain these essential resources. DATABASE URL: https://bitbucket.org/biodbqual/benchmarks.
  • Item
    On the impact of initialisation strategies on Maximum Flow algorithm performance
    Alipour, H ; Munoz, MA ; Smith-Miles, K (PERGAMON-ELSEVIER SCIENCE LTD, 2024-03)
  • Item
    Directive Explanations for Actionable Explainability in Machine Learning Applications
    Singh, R ; Miller, T ; Lyons, H ; Sonenberg, L ; Velloso, E ; Vetere, F ; Howe, P ; Dourish, P (ASSOC COMPUTING MACHINERY, 2023-12)
    In this article, we show that explanations of decisions made by machine learning systems can be improved by not only explaining why a decision was made but also explaining how an individual could obtain their desired outcome. We formally define the concept of directive explanations (those that offer specific actions an individual could take to achieve their desired outcome), introduce two forms of directive explanations (directive-specific and directive-generic), and describe how these can be generated computationally. We investigate people’s preference for and perception toward directive explanations through two online studies, one quantitative and the other qualitative, each covering two domains (the credit scoring domain and the employee satisfaction domain). We find a significant preference for both forms of directive explanations compared to non-directive counterfactual explanations. However, we also find that preferences are affected by many aspects, including individual preferences and social factors. We conclude that deciding what type of explanation to provide requires information about the recipients and other contextual information. This reinforces the need for a human-centered and context-specific approach to explainable AI.
  • Item
    Scalable Approximate Butterfly and Bi-triangle Counting for Large Bipartite Networks
    Zhang, F ; Chen, D ; Wang, S ; Yang, Y ; Gan, J (Association for Computing Machinery (ACM), 2023-12-08)
    A bipartite graph is a graph that consists of two disjoint sets of vertices and only edges between vertices from different vertex sets. In this paper, we study the counting problems of two common types of motifs in bipartite graphs: (i) butterflies (2x2 bicliques) and (ii) bi-triangles (length-6 cycles). Unlike most of the existing algorithms that aim to obtain exact counts, our goal is to obtain precise enough estimations of these counts in bipartite graphs, as such estimations are already sufficient and of great usefulness in various applications. While there exist approximate algorithms for butterfly counting, these algorithms are mainly based on the techniques designed for general graphs, and hence, they are less effective on bipartite graphs. Moreover, approximate bi-triangle counting remains largely unstudied. Motivated by this, we first propose a novel butterfly counting algorithm, called one-sided weighted sampling, which is tailored for bipartite graphs. The basic idea of this algorithm is to estimate the total butterfly count with the number of butterflies containing two randomly sampled vertices from the same side of the two vertex sets. We prove that our estimation is unbiased, and our technique can be further extended (non-trivially) for bi-triangle count estimation. Theoretical analyses under a power-law random bipartite graph model and extensive experiments on multiple large real datasets demonstrate that our proposed approximate counting algorithms can reach high accuracy, yet achieve up to three orders (resp. four orders) of magnitude speed-up over the state-of-the-art exact butterfly (resp. bi-triangle) counting algorithms. Additionally, we present an approximate clustering coefficient estimation framework for bipartite graphs, which shows a similar speed-up over the exact solutions with less than 1% relative error.
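The sampling idea in this abstract can be illustrated with a much simpler estimator. The sketch below is not the authors' one-sided weighted sampling: it assumes uniform sampling of same-side vertex pairs, and the function names and the adjacency-dictionary representation are hypothetical. A pair of left vertices with c common neighbours lies in C(c, 2) butterflies, so scaling the sample mean by the number of left pairs gives an unbiased estimate of the total.

```python
import random
from itertools import combinations
from math import comb


def exact_butterflies(adj):
    """Exact butterfly count. `adj` maps each left vertex to the set of
    its right-side neighbours (an assumed representation)."""
    return sum(comb(len(adj[u] & adj[v]), 2) for u, v in combinations(adj, 2))


def estimate_butterflies(adj, num_samples, rng):
    """Uniform pair-sampling estimate (a simplification, not the paper's
    weighted scheme): sample left-vertex pairs, count the butterflies each
    pair lies in, and scale the mean by the number of left pairs."""
    left = list(adj)
    if len(left) < 2:
        return 0.0
    total_pairs = comb(len(left), 2)
    acc = 0
    for _ in range(num_samples):
        u, v = rng.sample(left, 2)            # two distinct left vertices
        acc += comb(len(adj[u] & adj[v]), 2)  # butterflies containing u and v
    return total_pairs * acc / num_samples


if __name__ == "__main__":
    # Complete bipartite graph K_{3,3}.
    graph = {"a": {1, 2, 3}, "b": {1, 2, 3}, "c": {1, 2, 3}}
    print(exact_butterflies(graph))
    print(estimate_butterflies(graph, 100, random.Random(0)))
```

Uniform sampling has high variance on skewed degree distributions, which is presumably one motivation for a weighted scheme such as the one the abstract names.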
  • Item
    The Impact of Judgment Variability on the Consistency of Offline Effectiveness Measures
    Rashidi, L ; Zobel, J ; Moffat, A (ASSOC COMPUTING MACHINERY, 2024-01)
    Measurement of the effectiveness of search engines is often based on use of relevance judgments. It is well known that judgments can be inconsistent between judges, leading to discrepancies that potentially affect not only scores but also system relativities and confidence in the experimental outcomes. We take the perspective that the relevance judgments are an amalgam of perfect relevance assessments plus errors; making use of a model of systematic errors in binary relevance judgments that can be tuned to reflect the kind of judge that is being used, we explore the behavior of measures of effectiveness as error is introduced. Using a novel methodology in which we examine the distribution of “true” effectiveness measurements that could be underlying measurements based on sets of judgments that include error, we find that even moderate amounts of error can lead to conclusions such as orderings of systems that statistical tests report as significant but are nonetheless incorrect. Further, in these results the widely used recall-based measures AP and NDCG are notably more fragile in the presence of judgment error than is the utility-based measure RBP, but all the measures failed under even moderate error rates. We conclude that knowledge of likely error rates in judgments is critical to interpretation of experimental outcomes.
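As a reminder of what two of the measures named in this abstract compute, here is a minimal sketch of rank-biased precision (RBP) and average precision (AP) over a binary relevance list. The function names are illustrative, the persistence value `p` is a free parameter, and AP is normalised by the number of relevant documents in the ranking itself, which assumes no relevant documents are missing from it. The sketch does not implement the error model described in the abstract; it only shows how one changed judgment moves each score.

```python
def rbp(rels, p=0.8):
    """Rank-biased precision: (1 - p) * sum of rel_i * p**(i - 1),
    with ranks i starting at 1 and rel_i in {0, 1}."""
    return (1 - p) * sum(r * p ** i for i, r in enumerate(rels))


def average_precision(rels):
    """Average precision over a binary relevance list. Assumes every
    relevant document appears somewhere in the ranking."""
    total_relevant = sum(rels)
    if total_relevant == 0:
        return 0.0
    hits = 0
    score = 0.0
    for rank, r in enumerate(rels, start=1):
        if r:
            hits += 1
            score += hits / rank  # precision at this relevant rank
    return score / total_relevant


if __name__ == "__main__":
    judged = [1, 0, 1, 0, 0]
    flipped = [1, 1, 1, 0, 0]  # same ranking, one judgment changed at rank 2
    for name, rels in (("original", judged), ("one flip", flipped)):
        print(name, round(rbp(rels), 4), round(average_precision(rels), 4))
```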
  • Item
    TransCP: A Transformer Pointer Network for Generic Entity Description Generation With Explicit Content-Planning
    Trisedya, BD ; Qi, J ; Zheng, H ; Salim, FD ; Zhang, R (IEEE COMPUTER SOC, 2023-12-01)
  • Item
    Focused Contrastive Loss for Classification With Pre-Trained Language Models
    He, J ; Li, Y ; Zhai, Z ; Fang, B ; Thorne, C ; Druckenbrodt, C ; Akhondi, S ; Verspoor, K (Institute of Electrical and Electronics Engineers (IEEE), 2023-01-01)
  • Item
    Characterizing and predicting ccRCC-causing missense mutations in Von Hippel-Lindau disease
    Serghini, A ; Portelli, S ; Troadec, G ; Song, C ; Pan, Q ; Pires, DE ; Ascher, DB (OXFORD UNIV PRESS, 2024-01-20)
    BACKGROUND: Mutations within the Von Hippel-Lindau (VHL) tumor suppressor gene are known to cause VHL disease, which is characterized by the formation of cysts and tumors in multiple organs of the body, particularly clear cell renal cell carcinoma (ccRCC). A major challenge in clinical practice is determining tumor risk from a given mutation in the VHL gene. Previous efforts have been hindered by limited available clinical data and technological constraints. METHODS: To overcome this, we initially manually curated the largest set of clinically validated VHL mutations to date, enabling a robust assessment of existing predictive tools on an independent test set. Additionally, we comprehensively characterized the effects of mutations within VHL using in silico biophysical tools describing changes in protein stability, dynamics and affinity to binding partners to provide insights into the structure-phenotype relationship. These descriptive properties were used as molecular features for the construction of a machine learning model, designed to predict the risk of ccRCC development as a result of a VHL missense mutation. RESULTS: Analysis of our model showed an accuracy of 0.81 in the identification of ccRCC-causing missense mutations, and a Matthews correlation coefficient of 0.44 on a non-redundant blind test, a significant improvement in comparison to the previously available approaches. CONCLUSION: This work highlights the power of using protein 3D structure to fully explore the range of molecular and functional consequences of genomic variants. We believe this optimized model will better enable its clinical implementation and assist in guiding patient risk stratification and management.
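The two metrics reported in this abstract are standard functions of a binary confusion matrix. The sketch below shows how they are computed; the counts in the example are invented for illustration and do not reproduce the study's test set or results.

```python
from math import sqrt


def accuracy(tp, tn, fp, fn):
    """Fraction of predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.
    Returns 0.0 when any marginal total is zero (a common convention)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical, imbalanced counts chosen only to show that a high
    # accuracy can coexist with a much lower MCC.
    counts = dict(tp=10, tn=70, fp=10, fn=10)
    print(accuracy(**counts))
    print(matthews_corrcoef(**counts))
```

Because the MCC uses all four cells of the confusion matrix, it is less flattered by class imbalance than accuracy, which is one reason the two are often reported together.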
  • Item
    A Broad-Spectrum α-Glucosidase of Glycoside Hydrolase Family 13 from Marinovum sp., a Member of the Roseobacter Clade
    Li, J ; Mui, JW-Y ; da Silva, BM ; Pires, DEV ; Ascher, DB ; Soler, NM ; Goddard-Borger, ED ; Williams, SJ (SPRINGER, 2024-01-05)
    Glycoside hydrolases (GHs) are a diverse group of enzymes that catalyze the hydrolysis of glycosidic bonds. The Carbohydrate-Active enZymes (CAZy) classification organizes GHs into families based on sequence data and function, with fewer than 1% of the predicted proteins characterized biochemically. Consideration of genomic context can provide clues to infer possible enzyme activities for proteins of unknown function. We used the MultiGeneBLAST tool to discover a gene cluster in Marinovum sp., a member of the marine Roseobacter clade, that encodes homologues of enzymes belonging to the sulfoquinovose monooxygenase pathway for sulfosugar catabolism. This cluster lacks a gene encoding a classical family GH31 sulfoquinovosidase candidate, but instead includes an uncharacterized family GH13 protein (MsGH13) that we hypothesized could be a non-classical sulfoquinovosidase. Surprisingly, recombinant MsGH13 lacks sulfoquinovosidase activity and is a broad-spectrum α-glucosidase that is active on a diverse array of α-linked disaccharides, including maltose, sucrose, nigerose, trehalose, isomaltose, and kojibiose. Using AlphaFold, a 3D model for the MsGH13 enzyme was constructed that predicted its active site shared close similarity with an α-glucosidase from Halomonas sp. H11 of the same GH13 subfamily that shows narrower substrate specificity.
  • Item
    AI-driven GPCR analysis, engineering, and targeting
    Velloso, JPL ; Kovacs, AS ; Pires, DEV ; Ascher, DB (ELSEVIER SCI LTD, 2024-02)
    This article investigates the role of recent advances in Artificial Intelligence (AI) in revolutionising the study of G protein-coupled receptors (GPCRs). AI has been applied to many areas of GPCR research, including the application of machine learning (ML) in GPCR classification, prediction of GPCR activation levels, modelling GPCR 3D structures and interactions, understanding G-protein selectivity, aiding elucidation of GPCR structures, and drug design. Despite progress, challenges in predicting GPCR structures and addressing the complex nature of GPCRs remain, providing avenues for future research and development.