Computing and Information Systems - Theses

Now showing 1 - 6 of 6
  • Item
    Statistical modeling of multiword expressions
    Su, Kim Nam ( 2008)
    In natural languages, words can occur in single units called simplex words or in a group of simplex words that function as a single unit, called multiword expressions (MWEs). Although MWEs are similar to simplex words in their syntax and semantics, they pose their own set of challenges (Sag et al. 2002). MWEs are arguably one of the biggest roadblocks in computational linguistics due to the bewildering range of syntactic, semantic, pragmatic and statistical idiomaticity they are associated with, and their high productivity. In addition, the large numbers in which they occur demand specialized handling. Moreover, dealing with MWEs has a broad range of applications, from syntactic disambiguation to semantic analysis in natural language processing (NLP) (Wacholder and Song 2003; Piao et al. 2003; Baldwin et al. 2004; Venkatapathy and Joshi 2006).

    Our goals in this research are: to use computational techniques to shed light on the underlying linguistic processes giving rise to MWEs across constructions and languages; to generalize existing techniques by abstracting away from individual MWE types; and finally to exemplify the utility of MWE interpretation within general NLP tasks. In this thesis, we target English MWEs due to resource availability. In particular, we focus on noun compounds (NCs) and verb-particle constructions (VPCs) due to their high productivity and frequency.

    Challenges in processing noun compounds are: (1) interpreting the semantic relation (SR) that represents the underlying connection between the head noun and modifier(s); (2) resolving syntactic ambiguity in NCs comprising three or more terms; and (3) analyzing the impact of word sense on noun compound interpretation. Our basic approach to interpreting NCs relies on the semantic similarity of the NC components, using firstly a nearest-neighbor method (Chapter 5), then verb semantics, based on the observation that it is often an underlying verb that relates the nouns in NCs (Chapter 6), and finally semantic variation within NC sense collocations, in combination with bootstrapping (Chapter 7).

    Challenges in dealing with verb-particle constructions are: (1) identifying VPCs in raw text data (Chapter 8); and (2) modeling the semantic compositionality of VPCs (Chapter 5). We place particular focus on identifying VPCs in context, and on measuring the compositionality of unseen VPCs in order to predict their meaning. Our primary approach to the identification task is to adapt localized context information derived from linguistic features of VPCs to distinguish between VPCs and simple verb-PP combinations. To measure the compositionality of VPCs, we use semantic similarity among VPCs, testing the semantic contribution of each component.

    Finally, we conclude the thesis with a chapter-by-chapter summary and outline of the findings of our work, suggestions of potential NLP applications, and a presentation of further research directions (Chapter 9).
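    The nearest-neighbor step above can be illustrated with a toy sketch: an unseen compound is tagged with the semantic relation (SR) of the training compound whose modifier and head are most similar. The vectors, SR labels and training pairs below are hypothetical placeholders, and cosine over toy vectors stands in for whatever lexical similarity measure is actually used.

        # Toy nearest-neighbor interpretation of noun compounds (NCs):
        # assign the SR of the most similar training compound, where
        # similarity combines modifier-modifier and head-head similarity.
        import numpy as np

        # Hypothetical component vectors (real work would use corpus-derived
        # vectors or a lexical-resource similarity).
        vectors = {
            "apple":  np.array([0.9, 0.1, 0.0]),
            "pie":    np.array([0.1, 0.9, 0.2]),
            "fruit":  np.array([0.8, 0.2, 0.1]),
            "cake":   np.array([0.2, 0.8, 0.3]),
            "steel":  np.array([0.0, 0.1, 0.9]),
            "bridge": np.array([0.1, 0.2, 0.8]),
        }

        def cos(u, v):
            return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

        # Training NCs labelled with illustrative semantic relations.
        train = [(("fruit", "cake"), "INGREDIENT"),
                 (("steel", "bridge"), "MATERIAL")]

        def interpret(modifier, head):
            """Return the SR of the nearest training compound."""
            def similarity(item):
                (m, h), _ = item
                return cos(vectors[modifier], vectors[m]) + cos(vectors[head], vectors[h])
            return max(train, key=similarity)[1]

        print(interpret("apple", "pie"))  # -> "INGREDIENT" under these toy vectors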
  • Item
    Structured classification for multilingual natural language processing
    Blunsom, Philip ( 2007-06)
    This thesis investigates the application of structured sequence classification models to multilingual natural language processing (NLP). Many tasks tackled by NLP can be framed as classification, where we seek to assign a label to a particular piece of text, be it a word, sentence or document. Yet often the labels we’d like to assign exhibit complex internal structure, such as labelling a sentence with its parse tree, and there may be an exponential number of them to choose from. Structured classification seeks to exploit the structure of the labels in order to allow both generalisation across labels which differ by only a small amount, and tractable search over all possible labels. In this thesis we focus on the application of conditional random field (CRF) models (Lafferty et al., 2001). These models assign an undirected graphical structure to the labels of the classification task and leverage dynamic programming algorithms to efficiently identify the optimal label for a given input. We develop a range of models for two multilingual NLP applications: word alignment for statistical machine translation (SMT), and multilingual supertagging for highly lexicalised grammars.
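    The dynamic programming mentioned above can be made concrete with a minimal Viterbi decoder for a linear-chain CRF, the simplest of the undirected structures the abstract describes. This is a generic sketch assuming precomputed per-position emission scores and label-transition scores; the thesis's actual features, training procedure and graph structures are not reproduced.

        # Viterbi decoding for a linear-chain CRF: dynamic programming finds
        # the label sequence maximising summed emission + transition scores.
        import numpy as np

        def viterbi(emissions, transitions):
            """emissions: (T, K) score of label k at position t;
            transitions: (K, K) score of moving from label i to label j."""
            T, K = emissions.shape
            score = emissions[0].copy()          # best score ending in each label
            back = np.zeros((T, K), dtype=int)   # backpointers
            for t in range(1, T):
                cand = score[:, None] + transitions + emissions[t][None, :]
                back[t] = cand.argmax(axis=0)    # best previous label, per label
                score = cand.max(axis=0)
            path = [int(score.argmax())]
            for t in range(T - 1, 0, -1):        # trace backpointers
                path.append(int(back[t, path[-1]]))
            return path[::-1]

        # Three positions, two labels, illustrative scores.
        emissions = np.array([[2.0, 0.5], [0.3, 1.5], [1.0, 1.2]])
        transitions = np.array([[0.5, -0.2], [-0.4, 0.8]])
        print(viterbi(emissions, transitions))   # -> [0, 1, 1]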
  • Item
    Automatic identification of locative expressions from informal text
    Liu, Fei ( 2013)
    Informal place descriptions that are rich in locative expressions can be found in various contexts. The ability to extract locative expressions from such informal place descriptions is central to improving the quality of services such as interpreting geographical queries and emergency calls. While much attention has been focused on the identification of formal place references (e.g., Rathmines Road) in natural language, people tend to make heavy use of informal place references (e.g., my bedroom). This research addresses the problem by developing a model that automatically identifies locative expressions in informal text. Moreover, we study which aspects of the text are helpful in the identification task. Utilising an existing manually annotated corpus, we re-annotate locative expressions and use them as the gold standard. With the gold standard in place, we take a machine learning approach to the identification task, using well-reasoned features based on observation and intuition. Further, we study the impact of various feature setups on the performance of the model and provide analyses of the experimental results. With the best-performing feature setup, the model achieves a significant increase in performance over the baseline systems.
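    A minimal sketch of the feature-based identification step: each token is represented by simple contextual features and a classifier marks whether it lies inside a locative expression. The features, training sentence and labels below are illustrative inventions; the thesis's actual gold standard and feature setups are much richer.

        # Token-level identification of locative expressions with simple
        # contextual features and a linear classifier.
        from sklearn.feature_extraction import DictVectorizer
        from sklearn.linear_model import LogisticRegression

        def features(tokens, i):
            w = tokens[i]
            return {
                "word": w.lower(),
                "is_capitalised": w[0].isupper(),
                "prev": tokens[i - 1].lower() if i > 0 else "<s>",
                "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
            }

        # Tiny hand-labelled sample: 1 = inside a locative expression.
        sent = "I left my keys in my bedroom yesterday".split()
        labels = [0, 0, 0, 0, 0, 1, 1, 0]

        X = [features(sent, i) for i in range(len(sent))]
        vec = DictVectorizer()
        clf = LogisticRegression().fit(vec.fit_transform(X), labels)

        # Label the tokens of an unseen description.
        test = "the keys are in my office".split()
        X_test = [features(test, i) for i in range(len(test))]
        print(clf.predict(vec.transform(X_test)))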
  • Item
    Towards high-dimensional classification using Michigan style genetic-based machine learning
    Abedini, Mani ( 2013)
    High-dimensional classification problems arise frequently in many areas of science and engineering. Examples include disease classification based on gene expression profiles, document classification, image recognition and fraud detection. Machine learning algorithms are widely used to tackle such problems. Typically, statistical techniques or artificial intelligence methods are used to build models that can learn from the data, so that new observations can subsequently be classified as belonging to a particular category of the data set. Genetic-based machine learning (GBML), which combines evolutionary algorithms and reinforcement learning, has been successfully applied in a wide range of complex problem solving and classification tasks. The eXtended Classifier System (XCS), a well-known GBML technique, evolves a population of classifiers in the form of condition-action rules that can be used for classification problems. A key step in XCS's iterative learning process is the rule discovery component, which creates new classifiers to be added to the bounded population. Unfortunately, population-based approaches are inherently slow when faced with large-scale problems. This may in part be attributed to the additional cost associated with maintaining relatively large populations of classifiers, often encapsulating multiple features. Consequently, few studies have examined the application of XCS in high-dimensional cases.

    The over-arching aim of this thesis is to develop new GBML classification models, based on an underlying XCS architecture, for high-dimensional classification problems. The objective is to improve performance measured in terms of accuracy, population diversity and execution costs. To this end, three alternative approaches have been proposed and evaluated.

    In the first approach, we use "feature quality" to guide the XCS rule discovery process. Here, a pre-processing phase is used to extract feature quality information. Subsequently, the feature quality information is used to bias the evolutionary operators via a "gene mask" (a toy sketch of this mask-guided mutation follows the abstract). Detailed numerical simulation experiments encapsulating a number of alternative feature ranking methods, parameter settings and benchmark data sets suggest that our proposed model can improve the classification performance on data sets with large numbers of features.

    In the second approach, a hybrid multi-population ensemble classifier inspired by co-evolutionary algorithms and ensemble learning is used to improve the baseline XCS performance. In this model, separate sub-populations are evolved in parallel, using a binary gene mask to guide the evolutionary operators. Sub-population outputs are weighted in a typical ensemble style to produce an output. As is typical in parallel evolutionary models of this form, we investigate alternative "migration" methods between sub-populations, as well as specific mechanisms to adapt the gene mask within a sub-population. Numerical simulation results have shown that the multi-population model has superior performance to the single-population model. Significantly, by combining feature quality information and the hybrid co-evolutionary XCS, superior performance can be achieved across the benchmark data sets.

    The focus of the third approach differs from the previous two, in that the emphasis is on techniques to reduce the execution time of the proposed GBML models for high-dimensional complex classification tasks using alternative parallel and distributed architectures. We start by comparing deployment techniques and the mapping of the multi-population GBML models using parallel or distributed programming libraries, before listing the advantages and disadvantages of each approach. This section concludes with a preliminary study investigating the use of cloud computing infrastructure to run our proposed model.

    Large-scale classification problems pose many challenges for traditional machine learning algorithms. In this thesis, we have proposed novel extensions to XCS to meet such challenges. Our contributions are based on the development of informed rule discovery mechanisms that employ gene masks, co-evolving sub-populations, and parallel and distributed architectures. The experimental results have demonstrated that the proposed models perform better in terms of accuracy, execution costs and population diversity across the data sets considered: benchmark low-dimensional data sets, synthetic data sets and high-dimensional gene expression profile data sets. There are further opportunities to extend the proposed models in interesting ways, including examining feature interactions more closely and exploring the effectiveness of alternative evolutionary and statistical models.
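    A minimal sketch of the first approach's gene mask, under stated assumptions: per-feature quality scores (e.g., from a feature-ranking pre-processing step) are turned into a mask of the top-ranked features, and mutation during rule discovery is restricted to masked-in positions. The rule encoding, scores and probabilities below are illustrative, not the thesis's actual XCS configuration.

        # Feature-quality-guided mutation for ternary condition rules
        # ('0', '1', '#' = don't care), as evolved by XCS-style classifiers.
        import random

        def make_gene_mask(feature_quality, top_k):
            """Mask in only the top_k highest-quality features."""
            ranked = sorted(range(len(feature_quality)),
                            key=lambda i: feature_quality[i], reverse=True)
            return set(ranked[:top_k])

        def mutate(rule, mask, p=0.5):
            """Mutate only gene positions allowed by the mask."""
            return "".join(
                random.choice("01#".replace(g, ""))
                if i in mask and random.random() < p else g
                for i, g in enumerate(rule))

        # Hypothetical quality scores for a six-feature problem.
        quality = [0.9, 0.1, 0.7, 0.05, 0.8, 0.2]
        mask = make_gene_mask(quality, top_k=3)   # -> {0, 2, 4}
        print(mutate("##01#1", mask))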
  • Item
    The effects of sampling and semantic categories on large-scale supervised relation extraction
    Willy ( 2012)
    The purpose of relation extraction is to identify novel pairs of entities which are related by a pre-specified relation, such as hypernymy or synonymy. The traditional approach to relation extraction is to build a dedicated system for a particular relation, meaning that significant effort is required to repurpose the approach to new relations. We propose a generic approach based on supervised learning, which provides a standardised process for performing relation extraction on different relations and domains. We explore the feasibility of the approach over a range of relations and corpora, focusing particularly on the development of a realistic evaluation methodology for relation extraction. In addition, we investigate the impact of semantic categories on extraction effectiveness.
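    A minimal sketch of the generic supervised setup: candidate entity pairs are represented by simple features and a classifier decides whether the pre-specified relation (here, hypernymy) holds. The training pairs and features are illustrative placeholders, not the thesis's actual corpora or feature set.

        # Generic supervised relation extraction over candidate word pairs.
        from sklearn.feature_extraction import DictVectorizer
        from sklearn.linear_model import LogisticRegression

        def pair_features(x, y):
            return {
                "x_suffix": x[-3:],          # crude morphology
                "y_suffix": y[-3:],
                "pair": f"{x}|{y}",          # lexical memorisation feature
            }

        # Hypothetical labelled pairs: 1 = x is a hypernym of y.
        train = [("animal", "dog", 1), ("fruit", "apple", 1),
                 ("dog", "animal", 0), ("car", "apple", 0)]

        vec = DictVectorizer()
        X = vec.fit_transform([pair_features(x, y) for x, y, _ in train])
        clf = LogisticRegression().fit(X, [label for _, _, label in train])

        # Score a novel candidate pair.
        print(clf.predict(vec.transform([pair_features("animal", "cat")])))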
  • Item
    Scalable approaches for analysing high density single nucleotide polymorphism array data
    Wong, Gerard Kum Peng ( 2012)
    Prior to making inferences from the raw data produced by these microarrays, several challenges need to be addressed. First, it is important to limit the impact of noise on microarray measurements while maintaining data integrity. An unexplored aspect of noise is the extent of probeset sequence identity in SNP microarrays. Second, microarray-based datasets often have at least two orders of magnitude more probesets than the number of samples they describe. This poses a challenge for traditional statistical tests when used in this context. Third, the number of features in each dataset is large even when sample sizes are small, thus computationally efficient approaches are required to analyse these datasets. Finally, with improving resolution of SNP arrays, there is a need to exploit this improvement in resolution to identify finer-scaled mutations in the human genome. Most existing approaches deal with these challenges at an individual sample level and do not look for consensus change across the population to identify sites of DNA mutation. Other approaches artificially smooth or segment the data to obtain uniform segments of copy number change, and lose possible fine-scaled copy number changes in the process. Others are based on computationally expensive approaches that do not scale well to array resolution and sample size. Our first contribution is a comprehensive survey of the sequence identity of all probesets for all variants of the Affymetrix Genome-Wide Human SNP array. This survey assesses the target uniqueness of every probeset and provides a basis for the development of a set of gold standard labels of copy number change between genders. The derived sets of gold standard labels are a benchmark for assessing the performance of various algorithms in detecting recurrent copy number change. This benchmark is utilised in the evaluation of our second and third contribution. Our second contribution is a statistical approach called Detecting Recurrent Copy Number Changes Using Rank Order Statistics (DRECS), which is designed to identify regions of consensus copy number change across multiple samples in SNP array datasets. Through the use of rank-based statistics, DRECS efficiently draws on the statistical power of using multiple samples to identify fine-scaled copy number changes down to the width of a single probe in a computationally efficient way. Our third contribution is called Sum of Ranks Exact Test (SoRET), a non-parametric extension of DRECS. SoRET addresses SNP datasets with small sample sizes and makes no assumptions about the distribution from which the data was sampled. Its performance in terms of Type I and Type II errors is superior to competitive parametric and non-parametric statistical tests for small sample sizes. Our fourth contribution is a feature set reduction approach called FSR. FSR enables existing filter-based feature selection approaches to handle large dimensional microarray-type datasets by pruning irrelevant and redundant features. A novel scoring measure is developed to assess the strength of each feature in terms of sample class discrimination. FSR uses measures of entropy to efficiently gauge the contribution of higher order feature patterns to avoid combinatorial explosions in assessing the utility of features. In our tested datasets, classifiers trained on features selected from FSR-reduced feature sets have shown notably better predictive accuracy than classifiers trained on features selected from complete feature sets.