Computing and Information Systems - Research Publications

Search Results

  • Item
    TREE-BASED STATISTICAL MACHINE TRANSLATION: EXPERIMENTS WITH THE ENGLISH AND BRAZILIAN PORTUGUESE PAIR
    Beck, D ; Caseli, H (SBIC, 2013)
    Machine Learning paradigms have dominated recent research in Machine Translation. Current state-of-the-art approaches rely only on statistical methods that gather all necessary knowledge from parallel corpora. However, this lack of explicit linguistic knowledge makes them unable to model some linguistic phenomena. In this work, we focus on models that take into account the syntactic information from the languages involved in the translation process. We follow a novel approach that preprocesses parallel corpora using syntactic parsers and uses translation models composed of Tree Transducers. We perform experiments with English and Brazilian Portuguese, providing the first known results in syntax-based Statistical Machine Translation for this language pair. These results show that this approach is able to better model phenomena like long-distance reordering and give directions for future improvements in building syntax-based translation models for this pair.
  • Item
    Stratification bias in low signal microarray studies
    Parker, BJ ; Guenter, S ; Bedo, J (BMC, 2007-09-02)
    BACKGROUND: When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. We analyse a form of bias, stratification bias, that can substantially affect analyses using sample-reuse validation techniques and lead to inaccurate results. This bias is due to imperfect stratification of samples in the training and test sets and the dependency between these stratification errors, i.e. the variations in class proportions in the training and test sets are negatively correlated. RESULTS: We show that when estimating the performance of classifiers on low signal datasets (i.e. those which are difficult to classify), which are typical of many prognostic microarray studies, commonly used performance measures can suffer from a substantial negative bias. For error rate this bias is only severe in quite restricted situations, but can be much larger and more frequent when using ranking measures such as the receiver operating characteristic (ROC) curve and area under the ROC (AUC). Substantial biases are shown in simulations and on the van 't Veer breast cancer dataset. The classification error rate can have large negative biases for balanced datasets, whereas the AUC shows substantial pessimistic biases even for imbalanced datasets. In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5. Further experiments on the van 't Veer breast cancer dataset show these biases exist in practice. CONCLUSION: Stratification bias can substantially affect several performance measures. In computing the AUC, the strategy of pooling the test samples from the various folds of cross-validation can lead to large biases; computing it as the average of per-fold estimates avoids this bias and is thus the recommended approach. 
    As a more general solution applicable to other performance measures, we show that stratified repeated holdout, a modified version of k-fold cross-validation (balanced, stratified cross-validation), and balanced leave-one-out cross-validation avoid the bias. Therefore, for model selection and evaluation of microarray and other small biological datasets, these methods should be used and unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate AUC for small datasets.
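The warning about unbalanced leave-one-out cross-validation can be illustrated with a small sketch. This is not the authors' experiment: it assumes a hypothetical no-signal "classifier" (`prior_score`) that scores every test sample with the positive rate of its training set, and a hand-built balanced fold layout. The helper names `auc`, `prior_score`, and `pooled_cv_auc` are made up for this illustration.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def prior_score(train_labels):
    """Assumed no-signal classifier: the score is the training-set positive rate."""
    return sum(train_labels) / len(train_labels)


def pooled_cv_auc(labels, folds):
    """Score each held-out fold, pool all test scores, then compute a single AUC."""
    scores = [None] * len(labels)
    for test_idx in folds:
        held_out = set(test_idx)
        train = [y for i, y in enumerate(labels) if i not in held_out]
        for i in test_idx:
            scores[i] = prior_score(train)
    return auc(scores, labels)


# Balanced dataset with no usable signal: 10 positives followed by 10 negatives.
labels = [1] * 10 + [0] * 10

# Unbalanced leave-one-out: each fold holds out exactly one sample.
loo_folds = [[i] for i in range(len(labels))]

# Balanced folds: each fold holds out one positive and one negative,
# so the class proportions of every training set are identical.
balanced_folds = [[k, 10 + k] for k in range(10)]

loo_auc = pooled_cv_auc(labels, loo_folds)
balanced_auc = pooled_cv_auc(labels, balanced_folds)
print("leave-one-out pooled AUC:", loo_auc)
print("balanced-fold pooled AUC:", balanced_auc)
```

In this toy setting, holding out a positive lowers the training positive rate and holding out a negative raises it, so every positive is scored below every negative and the pooled leave-one-out AUC collapses well below the chance level of 0.5. With balanced folds all training sets have the same class proportions, the scores tie, and the estimate sits at chance.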
  • Item
    Plasma lipid profiling in a large population-based cohort
    Weir, JM ; Wong, G ; Barlow, CK ; Greeve, MA ; Kowalczyk, A ; Almasy, L ; Comuzzie, AG ; Mahaney, MC ; Jowett, JBM ; Shaw, J ; Curran, JE ; Blangero, J ; Meikle, PJ (ELSEVIER, 2013-10)
    We have performed plasma lipid profiling using liquid chromatography electrospray ionization tandem mass spectrometry on a population cohort of more than 1,000 individuals. From 10 μl of plasma we were able to acquire comparative measures of 312 lipids across 23 lipid classes and subclasses including sphingolipids, phospholipids, glycerolipids, and cholesterol esters (CEs) in 20 min. Using linear and logistic regression, we identified statistically significant associations of lipid classes, subclasses, and individual lipid species with anthropometric and physiological measures. In addition to the expected associations of CEs and triacylglycerol with age, sex, and body mass index (BMI), ceramide was significantly higher in males and was independently associated with age and BMI. Associations were also observed for sphingomyelin with age but this lipid subclass was lower in males. Lysophospholipids were associated with age and higher in males, but showed a strong negative association with BMI. Many of these lipids have previously been associated with chronic diseases including cardiovascular disease and may mediate the interactions of age, sex, and obesity with disease risk.
  • Item
    Abstract Interpretation over Non-Lattice Abstract Domains
    Gange, G ; Navas, JA ; Schachte, P ; Søndergaard, H ; Stuckey, PJ ; Logozzo, F ; Fahndrich, M (Springer, 2013)
    The classical theoretical framework for static analysis of programs is abstract interpretation. Much of the power and elegance of that framework rests on the assumption that an abstract domain is a lattice. Nonetheless, and for good reason, the literature on program analysis provides many examples of non-lattice domains, including non-convex numeric domains. The lack of domain structure, however, has negative consequences, both for the precision of program analysis and for the termination of standard Kleene iteration. In this paper we explore these consequences and present general remedies.
  • Item
    BioCaster: detecting public health rumors with a Web-based text mining system.
    Collier, N ; Doan, S ; Kawazoe, A ; Goodwin, RM ; Conway, M ; Tateno, Y ; Ngo, Q-H ; Dien, D ; Kawtrakul, A ; Takeuchi, K ; Shigematsu, M ; Taniguchi, K (Oxford University Press (OUP), 2008-12-15)
    SUMMARY: BioCaster is an ontology-based text mining system for detecting and tracking the distribution of infectious disease outbreaks from linguistic signals on the Web. The system continuously analyzes documents reported from over 1700 RSS feeds, classifies them for topical relevance and plots them onto a Google map using geocoded information. The background knowledge for bridging the gap between layman's terms and formal coding systems is contained in the freely available BioCaster ontology, which includes information in eight languages focused on the epidemiological role of pathogens as well as geographical locations with their latitudes/longitudes. The system consists of four main stages: topic classification, named entity recognition (NER), disease/location detection and event recognition. Higher order event analysis is used to detect more precisely specified warning signals, which can then be sent to registered users via email alerts. Evaluation of the system for topic recognition and entity identification is conducted on a gold standard corpus of annotated news articles. AVAILABILITY: The BioCaster map and ontology are freely available via a web portal at http://www.biocaster.org.
  • Item
    Feasibility of using Clinical Element Models (CEM) to standardize phenotype variables in the database of genotypes and phenotypes (dbGaP).
    Lin, K-W ; Tharp, M ; Conway, M ; Hsieh, A ; Ross, M ; Kim, J ; Kim, H-E ; Raghava, GPS (Public Library of Science (PLoS), 2013)
    The database of Genotypes and Phenotypes (dbGaP) contains various types of data generated from genome-wide association studies (GWAS). These data can be used to facilitate novel scientific discoveries and to reduce cost and time for exploratory research. However, idiosyncrasies and inconsistencies in phenotype variable names are a major barrier to reusing these data. We addressed these challenges in standardizing phenotype variables by formalizing their descriptions using Clinical Element Models (CEM). Designed to represent clinical data, CEMs were highly expressive and thus were able to represent a majority (77.5%) of the 215 phenotype variable descriptions. However, their high expressivity also made it difficult to directly apply them to research data such as phenotype variables in dbGaP. Our study suggested that simplification of the template models makes it more straightforward to formally represent the key semantics of phenotype variables.
  • Item
    A highly optimized algorithm for continuous intersection join queries over moving objects
    Zhang, R ; Qi, J ; Lin, D ; Wang, W ; Wong, RC-W (SPRINGER, 2012-08)
  • Item
    A proximity-aware load balancing in peer-to-peer-based volunteer computing systems
    Ghafarian, T ; Deldari, H ; Javadi, B ; Buyya, R (SPRINGER, 2013-08)
  • Item
    A time decoupling approach for studying forum dynamics
    Kan, A ; Chan, J ; Hayes, C ; Hogan, B ; Bailey, J ; Leckie, C (SPRINGER, 2013-11)
  • Item
    An enhanced XCS rule discovery module using feature ranking
    Abedini, M ; Kirley, M (SPRINGER HEIDELBERG, 2013-06)