Computing and Information Systems - Research Publications

  • Item
    Stratification bias in low signal microarray studies
    Parker, BJ ; Guenter, S ; Bedo, J (BMC, 2007-09-02)
    BACKGROUND: When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. We analyse a form of bias, stratification bias, that can substantially affect analyses using sample-reuse validation techniques and lead to inaccurate results. This bias is due to imperfect stratification of samples in the training and test sets and the dependency between these stratification errors, i.e. the variations in class proportions in the training and test sets are negatively correlated. RESULTS: We show that when estimating the performance of classifiers on low signal datasets (i.e. those which are difficult to classify), which are typical of many prognostic microarray studies, commonly used performance measures can suffer from a substantial negative bias. For error rate this bias is only severe in quite restricted situations, but can be much larger and more frequent when using ranking measures such as the receiver operating characteristic (ROC) curve and area under the ROC (AUC). Substantial biases are shown in simulations and on the van 't Veer breast cancer dataset. The classification error rate can have large negative biases for balanced datasets, whereas the AUC shows substantial pessimistic biases even for imbalanced datasets. In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5. Further experiments on the van 't Veer breast cancer dataset show these biases exist in practice. CONCLUSION: Stratification bias can substantially affect several performance measures. In computing the AUC, the strategy of pooling the test samples from the various folds of cross-validation can lead to large biases; computing it as the average of per-fold estimates avoids this bias and is thus the recommended approach. 
    As a more general solution applicable to other performance measures, we show that stratified repeated holdout and modified versions of k-fold cross-validation (balanced, stratified cross-validation and balanced leave-one-out cross-validation) avoid the bias. Therefore, for model selection and evaluation of microarray and other small biological datasets, these methods should be used and unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate AUC for small datasets.
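The pooling effect described in this abstract can be illustrated with a toy sketch. It is not the authors' simulation: the "classifier" here is an assumption chosen for simplicity, and it ignores all features, scoring every test sample with the fraction of positives in its training set. The names `auc` and `prior_only_cv` are hypothetical.

```python
import random


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Returns None when a class is absent.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def prior_only_cv(labels, k, seed=0):
    """Unstratified k-fold CV of a toy classifier with no features.

    Every test sample in a fold is scored with the training-set fraction
    of positives. Returns (pooled AUC, mean of per-fold AUCs).
    """
    n = len(labels)
    order = list(range(n))
    random.Random(seed).shuffle(order)       # unstratified random split
    folds = [order[i::k] for i in range(k)]
    scores = [0.0] * n
    fold_aucs = []
    for test_idx in folds:
        held_out = set(test_idx)
        train = [labels[i] for i in range(n) if i not in held_out]
        score = sum(train) / len(train)      # training-set positive fraction
        for i in test_idx:
            scores[i] = score
        a = auc([scores[i] for i in test_idx], [labels[i] for i in test_idx])
        if a is not None:                    # skip single-class folds
            fold_aucs.append(a)
    per_fold = sum(fold_aucs) / len(fold_aucs) if fold_aucs else None
    return auc(scores, labels), per_fold


labels = [1] * 20 + [0] * 20                 # balanced, no signal at all
print(prior_only_cv(labels, k=10))           # 10-fold: pooled vs per-fold
print(prior_only_cv(labels, k=len(labels)))  # leave-one-out
```

Within a fold all scores are tied, so each per-fold AUC is 0.5. Pooled across folds, a fold whose test set holds more positives gets a lower score, so the pooled AUC cannot exceed 0.5. Under leave-one-out every positive is scored below every negative, and the pooled AUC is 0.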
  • Item
    Measuring Success
    CULLEN, S ; Willocks, L (Caspian Publishing, 2007)
  • Item
    Graphical query for linguistic treebanks
    BIRD, STEVEN ; Lee, Haejoong (2007)
    Databases of hierarchically annotated text occupy a central place in linguistic research and language technology development. We describe a new approach to tree query which we call "Query by Annotation". Users express a query by annotating a tree, and the annotation is compiled into an expression in a path language. The result trees are overlaid with the original query, permitting the user to see why they match. Since queries and results are annotated trees, users can easily refine and resubmit their queries. The approach to Query by Annotation is motivated and exemplified using databases of linguistic trees, or treebanks.
  • Item
    Edge gradient correction of distorted images: a differential chain rule approach
    Islam, Mr Muhammad ; Kitchen, Dr Les (2007-11)
    Many camera lenses, particularly low-cost or wide-angle lenses, can cause significant image distortion. This means that measurements made on such images will be incorrect. A traditional approach to dealing with this problem is to digitally unwarp the image to correct the distortion, and then to apply computer vision processing to the corrected image. However, this is a relatively expensive operation, and can introduce additional interpolation errors. We propose instead to apply processing directly to the distorted image, modifying the algorithm to account for the distortion during processing. In this paper we focus on the particular classic problem of gradient-based extraction of straight edges. We propose a modification of the Burns [4] line extractor that works on a distorted image by correcting the gradients on the fly using the chain rule, and correcting the pixel positions during the line-fitting stage. Experimental results on both real and synthetic images show that our gradient-correction technique speeds up processing over the traditional unwarping approach, while retaining similar accuracy.
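The chain-rule idea in this abstract can be sketched as follows. The distortion model is an assumption made for illustration (a one-parameter radial model about the image centre, in normalised coordinates) and is not necessarily the model used by the authors; all function names are hypothetical. If the distorted image is D(x) = U(f(x)), where f maps distorted to undistorted coordinates, then grad D = J^T grad U with J the Jacobian of f, so the undistorted gradient is recovered by solving that 2x2 system at each pixel.

```python
def undistort_point(x, y, k):
    """Assumed radial model: f(x, y) = (x, y) * (1 + k * r^2)."""
    s = 1.0 + k * (x * x + y * y)
    return x * s, y * s


def jacobian(x, y, k):
    """Jacobian of undistort_point at (x, y), as ((a, b), (c, d))."""
    s = 1.0 + k * (x * x + y * y)
    return ((s + 2 * k * x * x, 2 * k * x * y),
            (2 * k * x * y, s + 2 * k * y * y))


def correct_gradient(gx_d, gy_d, x, y, k):
    """Recover the undistorted-image gradient from the distorted one.

    Chain rule: g_d = J^T g_u, so g_u = (J^T)^-1 g_d (2x2 inverse).
    """
    (a, b), (c, d) = jacobian(x, y, k)
    det = a * d - b * c
    gx_u = (d * gx_d - c * gy_d) / det
    gy_u = (-b * gx_d + a * gy_d) / det
    return gx_u, gy_u


# Gradient measured at a pixel of the distorted image (made-up numbers).
print(correct_gradient(1.0, 0.5, x=0.5, y=-0.3, k=0.1))
```

The per-pixel cost is a handful of multiplications, with no resampling of the image, which is consistent with the abstract's point that unwarping is the more expensive route and adds interpolation error.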
  • Item
    A monomial ν-SV method for regression
    SHILTON, ALISTAIR ; Lai, Daniel ; PALANISWAMI, MARIMUTHU (2007)
    In the present paper we describe a new formulation for Support Vector regression (SVR), namely monomial ν-SVR. Like the standard ν-SVR, the monomial ν-SVR method automatically adjusts the radius of insensitivity (the tube width, epsilon) to suit the training data. However, by replacing Vapnik’s epsilon-insensitive cost with a more general monomial epsilon-insensitive cost (and likewise replacing the linear tube shrinking term with a monomial tube shrinking term), the performance of the monomial ν-SVR is improved for data corrupted by a wider range of noise distributions. We focus on the quadric form of monomial ν-SVR and show that the dual form of this is simpler than the standard ν-SVR. We show that, like Suykens’ Least-Squares SVR (LS-SVR) method (and unlike standard ν-SVR), the quadric ν-SVR dual has a unique global solution. Comparisons are made between the asymptotic efficiency of our method and that of standard ν-SVR and LS-SVR which demonstrate the superiority of our method for the special case of higher order polynomial noise. These theoretical predictions are validated using experimental comparisons with the alternative approaches of standard ν-SVR, LS-SVR and weighted LS-SVR.
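As a rough illustration of the cost function mentioned in this abstract, the sketch below evaluates a monomial epsilon-insensitive cost of the form max(0, |e| - epsilon)^q. The exact scaling constants used in the paper are not given here, so the unscaled form and the name `monomial_eps_cost` are assumptions.

```python
def monomial_eps_cost(error, eps, q):
    """Monomial epsilon-insensitive cost (unscaled, assumed form).

    Errors inside the tube |error| <= eps cost nothing; outside it the
    excess is raised to the power q. q = 1 gives Vapnik's
    epsilon-insensitive cost, q = 2 the quadric case.
    """
    excess = max(0.0, abs(error) - eps)
    return excess ** q


for e in (0.05, 0.6, -0.6):
    print(e, monomial_eps_cost(e, eps=0.1, q=1), monomial_eps_cost(e, eps=0.1, q=2))
```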
  • Item
    Firebird database backup by serialized database table dump
    Ling, Maurice H. T. (2007)
    This paper presents a simple data dump and load utility for Firebird databases which mimics mysqldump in MySQL. The utility consists of two programs, fb_dump and fb_load, for dumping and loading respectively. It retrieves each database table using kinterbasdb and serializes the data using the marshal module. This utility has two advantages over the standard Firebird database backup utility, gbak. Firstly, it is able to back up and restore single database tables, which might help to recover corrupted databases. Secondly, the output is in text-coded format (from the marshal module), making it more resilient than a compressed text backup, as in the case of using gbak.
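The per-table dump and load approach can be sketched as below. This is not the fb_dump/fb_load code: to stay runnable without a Firebird server, it swaps in the standard-library sqlite3 module for kinterbasdb, and the function names `dump_table` and `load_table` are hypothetical. Only the idea is the same: fetch one table's rows through a DB-API cursor and serialize them with marshal.

```python
import marshal
import sqlite3


def dump_table(conn, table):
    """Serialize one table's column names and rows to bytes with marshal."""
    cur = conn.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cur.description]
    rows = [tuple(r) for r in cur.fetchall()]
    return marshal.dumps({"table": table, "columns": columns, "rows": rows})


def load_table(conn, blob):
    """Insert the rows of a dump into an existing table of the same name."""
    data = marshal.loads(blob)
    table = data["table"]
    cols = ", ".join(f'"{c}"' for c in data["columns"])
    marks = ", ".join("?" for _ in data["columns"])
    conn.executemany(
        f'INSERT INTO "{table}" ({cols}) VALUES ({marks})', data["rows"]
    )
    conn.commit()


# Round trip between two in-memory databases (sqlite3 stand-in).
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE genes (id INTEGER, name TEXT)")
src.executemany("INSERT INTO genes VALUES (?, ?)", [(1, "abc"), (2, None)])
dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE genes (id INTEGER, name TEXT)")
load_table(dst, dump_table(src, "genes"))
print(dst.execute("SELECT * FROM genes ORDER BY id").fetchall())
```

Because each table is dumped on its own, a single table can be restored without touching the rest of the database. Note that Python's documentation does not guarantee the marshal format to be stable across Python versions, so a dump is best loaded with the same interpreter version that wrote it.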
  • Item
    Seven tips for enhancing your research visibility and impact
    Buyya, Dr. Rajkumar (2007-02)
    This article presents 7 tips for enhancing your research visibility and impact.
  • Item
    A Framework for Evaluating Multimedia Software: Modeling Student Learning Strategies
    STERN, L ; LAM, K (Association for the Advancement of Computing in Education, 2007)
  • Item