Chancellery Research - Research Publications

Search Results

Showing items 1–10 of 20
  • Item
    Benchmarks for measurement of duplicate detection methods in nucleotide databases
    Chen, Q ; Zobel, J ; Verspoor, K (OXFORD UNIV PRESS, 2023-12-18)
    Duplication of information in databases is a major data quality challenge. The presence of duplicates, implying either redundancy or inconsistency, can have a range of impacts on the quality of analyses that use the data. To provide a sound basis for research on this issue in databases of nucleotide sequences, we have developed new, large-scale validated collections of duplicates, which can be used to test the effectiveness of duplicate detection methods. Previous collections were either designed primarily to test efficiency, or contained only a limited number of duplicates of limited kinds. To date, duplicate detection methods have been evaluated on separate, inconsistent benchmarks, leading to results that cannot be compared and, due to limitations of the benchmarks, of questionable generality. In this study, we present three nucleotide sequence database benchmarks, based on information drawn from a range of resources, including information derived from mapping to two data sections within the UniProt Knowledgebase (UniProtKB), UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. Each benchmark has distinct characteristics. We quantify these characteristics and argue for their complementary value in evaluation. The benchmarks collectively contain a vast number of validated biological duplicates; the largest has nearly half a billion duplicate pairs (although this is probably only a tiny fraction of the total that is present). They are also the first benchmarks targeting the primary nucleotide databases. The records include the 21 most heavily studied organisms in molecular biology research. Our quantitative analysis shows that duplicates in the different benchmarks, and in different organisms, have different characteristics. It is thus unreliable to evaluate duplicate detection methods against any single benchmark. For example, the benchmark derived from UniProtKB/Swiss-Prot mappings identifies more diverse types of duplicates, showing the importance of expert curation, but is limited to coding sequences. Overall, these benchmarks form a resource that we believe will be of great value for development and evaluation of the duplicate detection or record linkage methods that are required to help maintain these essential resources. Database URL: https://bitbucket.org/biodbqual/benchmarks.
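    To illustrate how a benchmark of validated duplicate pairs can be used to score a detection method, here is a minimal Python sketch that evaluates a trivial exact-match detector; the records, pair set, and detector are invented stand-ins, not the published benchmark format.

        # Toy sketch: precision/recall of a duplicate detector against a
        # benchmark of validated duplicate pairs (unordered accession pairs).
        from itertools import combinations

        # accession -> nucleotide sequence (invented stand-in records)
        records = {
            "A1": "ATGCATGC",
            "A2": "ATGCATGC",   # exact duplicate of A1
            "A3": "ATGCATGA",   # near-duplicate: one base differs
            "A4": "ATGCATGC",
        }

        # validated duplicate pairs; A1-A3 is a *near* duplicate that exact
        # matching will miss, as curated benchmarks often contain
        benchmark = {frozenset(p) for p in
                     [("A1", "A2"), ("A1", "A4"), ("A2", "A4"), ("A1", "A3")]}

        # detector under test: flag pairs with identical sequences
        predicted = {frozenset((a, b))
                     for a, b in combinations(records, 2)
                     if records[a] == records[b]}

        tp = len(predicted & benchmark)
        print(f"precision={tp / len(predicted):.2f} recall={tp / len(benchmark):.2f}")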
  • Item
    Directive Explanations for Actionable Explainability in Machine Learning Applications
    Singh, R ; Miller, T ; Lyons, H ; Sonenberg, L ; Velloso, E ; Vetere, F ; Howe, P ; Dourish, P (ASSOC COMPUTING MACHINERY, 2023-12)
    In this article, we show that explanations of decisions made by machine learning systems can be improved by not only explaining why a decision was made but also explaining how an individual could obtain their desired outcome. We formally define the concept of directive explanations (those that offer specific actions an individual could take to achieve their desired outcome), introduce two forms of directive explanations (directive-specific and directive-generic), and describe how these can be generated computationally. We investigate people’s preference for and perception toward directive explanations through two online studies, one quantitative and the other qualitative, each covering two domains (the credit scoring domain and the employee satisfaction domain). We find a significant preference for both forms of directive explanations compared to non-directive counterfactual explanations. However, we also find that preferences are affected by many aspects, including individual preferences and social factors. We conclude that deciding what type of explanation to provide requires information about the recipients and other contextual information. This reinforces the need for a human-centered and context-specific approach to explainable AI.
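    The following Python sketch conveys the idea of a directive explanation in a toy credit-scoring setting; the scoring rule, threshold, and action set are invented for illustration and are not the authors' method. Only features the applicant can act on are perturbed, and the output is a set of concrete actions rather than just a counterfactual feature vector.

        # Brute-force search for directive explanations: combinations of
        # actionable changes that move a toy credit score over the threshold.
        from itertools import product

        def score(a):
            # invented linear credit model
            return 0.5 * (a["income"] / 100_000) + 0.5 * (1 - a["debt"] / 50_000)

        THRESHOLD = 0.7
        ACTIONS = {                          # actionable features, step sizes
            "income": [0, 20_000, 40_000],   # raise income by ...
            "debt":   [0, -10_000, -20_000], # pay down debt by ...
        }

        def directive_explanations(applicant):
            found = []
            for steps in product(*ACTIONS.values()):
                delta = dict(zip(ACTIONS, steps))
                candidate = {k: applicant[k] + delta[k] for k in applicant}
                if score(candidate) >= THRESHOLD:
                    found.append({k: v for k, v in delta.items() if v})
            return found

        applicant = {"income": 40_000, "debt": 30_000}   # scores 0.40: declined
        for actions in directive_explanations(applicant):
            print("to be approved:", actions)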
  • Item
    Near-Wall Flow Statistics in High-Reτ Drag-Reduced Turbulent Boundary Layers
    Deshpande, R ; Zampiron, A ; Chandran, D ; Smits, AJ ; Marusic, I (SPRINGER, 2023-01-01)
  • Item
    Disease Delineation for Multiple Sclerosis, Friedreich Ataxia, and Healthy Controls Using Supervised Machine Learning on Speech Acoustics
    Schultz, BG ; Joukhadar, Z ; Nattala, U ; Quiroga, MDM ; Noffs, G ; Rojas, S ; Reece, H ; van der Walt, A ; Vogel, AP (IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC, 2023)
    Neurodegenerative disease often affects speech. Speech acoustics can be used as objective clinical markers of pathology. Previous investigations of pathological speech have primarily compared controls with one specific condition and excluded comorbidities. We broaden the utility of speech markers by examining how multiple acoustic features can delineate diseases. We used supervised machine learning with gradient boosting (CatBoost) to delineate healthy speech from speech of people with multiple sclerosis or Friedreich ataxia. Participants performed a diadochokinetic task where they repeated alternating syllables. We subjected 74 spectral and temporal prosodic features from the speech recordings to machine learning. Results showed that Friedreich ataxia, multiple sclerosis and healthy controls were all identified with high accuracy (over 82%). Twenty-one acoustic features were strong markers of neurodegenerative diseases, falling under the categories of spectral qualia, spectral power, and speech rate. We demonstrated that speech markers can delineate neurodegenerative diseases and distinguish healthy speech from pathological speech with high accuracy. Findings emphasize the importance of examining speech outcomes when assessing indicators of neurodegenerative disease. We propose large-scale initiatives to broaden the scope for differentiating other neurological diseases and affective disorders.
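    A minimal Python sketch of this classification setup follows; the feature values are random stand-ins for the paper's 74 spectral and temporal prosodic features, so the printed accuracy is meaningless here and serves only to show the workflow.

        # Three-class delineation (control / MS / Friedreich ataxia) with a
        # CatBoost gradient-boosting classifier over acoustic features.
        import numpy as np
        from catboost import CatBoostClassifier
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        X = rng.normal(size=(300, 74))      # 74 acoustic features per recording
        y = rng.integers(0, 3, size=300)    # 0=control, 1=MS, 2=Friedreich ataxia

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

        model = CatBoostClassifier(loss_function="MultiClass", iterations=500,
                                   random_seed=0, verbose=False)
        model.fit(X_tr, y_tr)
        acc = (model.predict(X_te).ravel().astype(int) == y_te).mean()
        print(f"held-out accuracy: {acc:.2f}")

        # importances point at the acoustic markers driving the delineation
        top = np.argsort(model.get_feature_importance())[::-1][:21]
        print("top-21 feature indices:", top)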
  • Item
    Instance Space Analysis of Search-Based Software Testing
    Neelofar, N ; Smith-Miles, K ; Munoz, MA ; Aleti, A (IEEE COMPUTER SOC, 2023-04-01)
  • Item
    Thermal and reionization history within a large-volume semi-analytic galaxy formation simulation
    Balu, S ; Greig, B ; Qiu, Y ; Power, C ; Qin, Y ; Mutch, S ; Wyithe, JSB (OXFORD UNIV PRESS, 2023-02-15)
    We predict the 21-cm global signal and power spectra during the Epoch of Reionization using the meraxes semi-analytic galaxy formation and reionization model, updated to include X-ray heating and thermal evolution of the intergalactic medium. Studying the formation and evolution of galaxies together with the reionization of cosmic hydrogen using semi-analytic models (such as meraxes) requires N-body simulations with large volumes and high mass resolution. For this, we use a simulation of side-length $210\,h^{-1}$ Mpc with $4320^3$ particles, resolving dark matter haloes to masses of $5\times 10^{8}\,h^{-1}\,\mathrm{M_\odot}$. To reach the mass resolution of atomically cooled galaxies, thought to be the dominant population contributing to reionization, at z = 20 ($\sim 2\times 10^{7}\,h^{-1}\,\mathrm{M_\odot}$), we augment this simulation using the darkforest Monte Carlo merger tree algorithm (achieving an effective particle count of $\sim 10^{12}$). Using this augmented simulation, we explore the impact of mass resolution on the predicted reionization history, as well as the impact of X-ray heating on the 21-cm global signal and the 21-cm power spectra. We also explore the cosmic variance of 21-cm statistics within $70^3\,h^{-3}\,\mathrm{Mpc}^3$ sub-volumes. We find that the midpoint of reionization varies by $\Delta z \sim 0.8$ and that the cosmic variance on the power spectrum is underestimated by a factor of 2–4 at $k \sim 0.1$–$0.4\,\mathrm{Mpc}^{-1}$ due to the non-Gaussian nature of the 21-cm signal. To our knowledge, this work represents the first model of both reionization and galaxy formation which resolves low-mass atomically cooled galaxies while simultaneously sampling the large scales necessary for exploring the effects of X-rays in the early Universe.
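    For reference, the spherically averaged power spectrum quoted above is the standard FFT-based estimate $P(k) = \langle |\tilde{\delta T_b}(\mathbf{k})|^2 \rangle / V$, binned in $|\mathbf{k}|$. The Python sketch below computes it on a toy Gaussian cube standing in for a simulated brightness-temperature field; it is not meraxes code.

        # Spherically averaged power spectrum of a 3D field via FFT.
        import numpy as np

        def power_spectrum(cube, box_size, n_bins=16):
            n = cube.shape[0]
            volume = box_size ** 3
            # discrete FFT -> continuum Fourier convention (factor dx^3)
            ft = np.fft.fftn(cube) * (box_size / n) ** 3
            power = (np.abs(ft) ** 2 / volume).ravel()
            k1d = 2 * np.pi * np.fft.fftfreq(n, d=box_size / n)
            kx, ky, kz = np.meshgrid(k1d, k1d, k1d, indexing="ij")
            kmag = np.sqrt(kx**2 + ky**2 + kz**2).ravel()
            bins = np.linspace(kmag[kmag > 0].min(), kmag.max(), n_bins + 1)
            idx = np.digitize(kmag, bins)
            pk = np.array([power[idx == i].mean() for i in range(1, n_bins + 1)])
            return 0.5 * (bins[1:] + bins[:-1]), pk

        cube = np.random.default_rng(1).normal(size=(64, 64, 64))  # toy field
        k, pk = power_spectrum(cube, box_size=210.0)  # box in h^-1 Mpc, as above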
  • Item
    Disease progression modelling of Alzheimer's disease using probabilistic principal components analysis
    Saint-Jalmes, M ; Fedyashov, V ; Beck, D ; Baldwin, T ; Faux, NG ; Bourgeat, P ; Fripp, J ; Masters, CL ; Goudey, B (ACADEMIC PRESS INC ELSEVIER SCIENCE, 2023-09)
    The recent biological redefinition of Alzheimer's Disease (AD) has spurred the development of statistical models that relate changes in biomarkers with neurodegeneration and worsening condition linked to AD. The ability to measure such changes may facilitate earlier diagnoses for affected individuals and help in monitoring the evolution of their condition. Amongst such statistical tools, disease progression models (DPMs) are quantitative, data-driven methods that specifically attempt to describe the temporal dynamics of biomarkers relevant to AD. Due to the heterogeneous nature of this disease, with patients of similar age experiencing different AD-related changes, a challenge facing longitudinal mixed-effects-based DPMs is the estimation of patient-realigning time-shifts. These time-shifts are indispensable for meaningful biomarker modelling, but estimating them jointly with the biomarker trajectories increases fitting time and makes them sensitive to missing data. In this work, we estimate an individual's progression through Alzheimer's disease by combining multiple biomarkers into a single value using a probabilistic formulation of principal components analysis. Our results show that this variable, which summarises AD through observable biomarkers, is remarkably similar to jointly estimated time-shifts when we compute our scores for the baseline visit on cross-sectional data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Reproducing the expected properties of clinical datasets, we confirm that estimated scores are robust to missing data or unavailable biomarkers. In addition to cross-sectional insights, we can model the latent variable as an individual progression score by repeating estimations at follow-up examinations and refining long-term estimates as more data are gathered, which would be ideal in a clinical setting. Finally, we verify that our score can be used as a pseudo-temporal scale instead of age to abstract away some of the patient heterogeneity in cohort data and highlight the general trend in expected biomarker evolution in affected individuals.
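    A minimal Python sketch of the core idea follows: collapsing several biomarkers into one latent score with principal components analysis. scikit-learn's PCA fits the Tipping and Bishop probabilistic PCA model by maximum likelihood; the EM-style handling of missing biomarkers discussed above is omitted here, and the biomarker values are random stand-ins.

        # One latent "progression" score per individual via (probabilistic) PCA.
        import numpy as np
        from sklearn.decomposition import PCA
        from sklearn.preprocessing import StandardScaler

        rng = np.random.default_rng(0)
        # rows = individuals; columns = biomarkers (e.g. amyloid PET SUVR,
        # hippocampal volume, cognitive scores) -- invented values
        X = rng.normal(size=(200, 5))

        Z = StandardScaler().fit_transform(X)    # put biomarkers on one scale
        ppca = PCA(n_components=1)               # single latent dimension
        score = ppca.fit_transform(Z).ravel()    # one summary score per person
        print("average log-likelihood under the PPCA model:", ppca.score(Z))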
  • Item
    Multi-objective optimization in real-time operation of rainwater harvesting systems
    Zhen, Y ; Smith-Miles, K ; Fletcher, TD ; Burns, MJ ; Coleman, RA (ELSEVIER, 2023)
  • Item
    Bayesian coarsening: rapid tuning of polymer model parameters
    Weeratunge, H ; Robe, D ; Menzel, A ; Phillips, AW ; Kirley, M ; Smith-Miles, K ; Hajizadeh, E (Springer, 2023-10)
    A protocol based on Bayesian optimization is demonstrated for determining model parameters in a coarse-grained polymer simulation. This process takes as input the microscopic distribution functions and temperature-dependent density for a targeted polymer system. The process then iteratively considers coarse-grained simulations to sample the space of model parameters, aiming to minimize the discrepancy between the new simulations and the target. Successive samples are chosen using Bayesian optimization. Such a protocol can be employed to systematically coarse-grain expensive high-resolution simulations, extending the accessible length and time scales to make contact with rheological experiments. The Bayesian coarsening protocol is compared to a previous machine-learned parameterization technique which required a high volume of training data. The Bayesian coarsening process is found to precisely and efficiently discover appropriate model parameters, in spite of rough and noisy fitness landscapes, due to the natural balance of exploration and exploitation in Bayesian optimization.
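    The loop described above can be sketched in Python with an off-the-shelf Gaussian-process optimiser. Below, scikit-optimize's gp_minimize plays the role of the Bayesian optimiser, and the discrepancy function is a cheap invented stand-in for "run a coarse-grained simulation with these parameters and measure the mismatch against the target distribution functions and density".

        # Bayesian optimization of (toy) coarse-grained model parameters.
        import numpy as np
        from skopt import gp_minimize

        TARGET = np.array([1.2, 0.8])   # stand-in target observables

        def discrepancy(params):
            eps, sigma = params
            # placeholder for an expensive coarse-grained simulation run
            observed = np.array([np.sqrt(eps) + 0.1 * sigma, sigma / (1.0 + eps)])
            return float(np.sum((observed - TARGET) ** 2))

        result = gp_minimize(
            discrepancy,
            dimensions=[(0.1, 5.0), (0.1, 5.0)],  # bounds on (eps, sigma)
            n_calls=30,                           # total simulation budget
            random_state=0,
        )
        print("best parameters:", result.x, "discrepancy:", result.fun)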
  • Item
    Identification of herbarium specimen sheet components from high-resolution images using deep learning
    Thompson, KMM ; Turnbull, R ; Fitzgerald, E ; Birch, JLL (WILEY, 2023-08)
    Advanced computer vision techniques hold the potential to mobilise vast quantities of biodiversity data by facilitating the rapid extraction of text- and trait-based data from herbarium specimen digital images, and to increase the efficiency and accuracy of downstream data capture during digitisation. This investigation developed an object detection model using YOLOv5 and digitised collection images from the University of Melbourne Herbarium (MELU). The MELU-trained 'sheet-component' model (trained on 3371 annotated images, validated on 1000 annotated images, run using the 'large' model type, at 640 pixels, for 200 epochs) successfully identified most of the 11 component types of the digital specimen images, with an overall model precision of 0.983, recall of 0.969 and mean average precision (mAP@0.5-0.95) of 0.847. Specifically, 'institutional' and 'annotation' labels were predicted with mAP@0.5-0.95 of 0.970 and 0.878 respectively. It was found that annotating at least 2000 images was required to train an adequate model, likely due to the heterogeneity of specimen sheets. The full model was then applied to selected specimens from nine global herbaria (Biodiversity Data Journal, 7, 2019), quantifying its generalisability: for example, the 'institutional label' was identified with mAP@0.5-0.95 of between 0.68 and 0.89 across the various herbaria. Further detailed study demonstrated that starting with the MELU-model weights and retraining for as few as 50 epochs on 30 additional annotated images was sufficient to enable the prediction of a previously unseen component. As many herbaria are resource-constrained, the MELU-trained 'sheet-component' model weights are made available and their application is encouraged.
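    Since the trained weights are released, applying such a detector is a short exercise with YOLOv5's torch.hub interface. In the Python sketch below, the weights file and image name are placeholder filenames, not the actual distributed ones.

        # Run a custom-trained YOLOv5 detector on a specimen sheet image.
        import torch

        model = torch.hub.load("ultralytics/yolov5", "custom",
                               path="melu_sheet_component.pt")  # placeholder
        model.conf = 0.25                             # confidence threshold

        results = model("specimen_sheet.jpg", size=640)  # 640 px, as in training
        results.print()                                  # per-class detections
        boxes = results.pandas().xyxy[0]                 # boxes as a DataFrame
        print(boxes[["name", "confidence"]])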