Generalized language identification
Lui, Marco (2014)
Language identification is the task of determining the natural language that a document or part thereof is written in. The central theme of this thesis is generalized language identification, which deals with eliminating the assumptions that limit the applicability of language identification techniques to specific settings that may not be representative of real-world use cases. Research to date has treated language identification as a supervised machine learning problem, and in this thesis I argue that such a characterization is inadequate, showing how standard document representations do not take into account the variation in a language between different sources of text, and developing a representation that is robust to such variation. I also develop a method that allows for language identification of multilingual documents, i.e. documents that contain text in more than one language. Finally, I investigate the robustness of existing off-the-shelf language identification methods on a novel and challenging domain.

Computational substructure querying and topology prediction of the beta-sheet
Ho, Hui Kian (2014)
Studying the three-dimensional structure of proteins is essential to understanding their function and, ultimately, the dysfunction that causes disease. The limitations of experimental protein structure determination present a need for computational approaches to protein structure prediction and analysis. The beta-sheet is a commonly occurring protein substructure that is important to many biological processes and is often implicated in neurological disorders. Targeted experimental studies of beta-sheets are especially difficult due to their general insolubility in isolation. This thesis presents a series of contributions to the computational analysis and prediction of beta-sheet structure, which are useful for knowledge discovery and for directing more detailed experimental work. Approaches for predicting the simplest type of beta-sheet, the beta-hairpin, are first described. Improvements over existing methods are obtained by using the most important beta-hairpin features identified through systematic feature selection. An examination of the most important features provides a physicochemical basis for their usefulness in beta-hairpin prediction. New methods for the more general problem of beta-sheet topology prediction are described. Unlike recent methods, ours are independent of multiple sequence alignments (MSAs) and therefore do not rely on the coverage of reference sequence databases or on sequence homology. Our evaluations showed that our methods do not exhibit the same reductions in performance as a state-of-the-art method for sequences with low-quality MSAs. A new method for the indexing and querying of beta-sheet substructures, called BetaSearch, is described. BetaSearch exploits the inherent planar constraints of beta-sheet structure to achieve significant speedups over existing graph indexing and conventional 3D structure search methods. Case studies are presented that demonstrate the potential of this method for the discovery of biologically interesting beta-sheet substructures. Finally, a purpose-built open-source toolkit for generating 2D protein maps is described, which is useful for the coarse-grained analysis and visualisation of 3D protein structures. It can be used in existing knowledge discovery pipelines for automated structural analysis and prediction tasks, as a standalone application, or imported into existing experimental applications.