Computing and Information Systems - Theses

Now showing 1 - 7 of 7
  • Item
    Unsupervised all-words sense distribution learning
    Bennett, Andrew (2016)
    There has recently been significant interest in unsupervised methods for learning word sense distributions, or most frequent sense information, in particular for applications where sense distinctions are needed. In addition to their direct application to word sense disambiguation (WSD), particularly where domain adaptation is required, these methods have successfully been applied to diverse problems such as novel sense detection or lexical simplification. Furthermore, they could be used to supplement or replace existing sources of sense frequencies, such as SemCor, which have many significant flaws. However, a major gap in past work on sense distribution learning is that it has never been optimised for large-scale application to the entire vocabulary of a language, as would be required to replace sense frequency resources such as SemCor. In this thesis, we develop an unsupervised method for all-words sense distribution learning which is suitable for language-wide application. We first optimise and extend HDP-WSI, an existing state-of-the-art sense distribution learning method based on HDP topic modelling. This is mostly achieved by replacing HDP with the more efficient HCA topic modelling algorithm, creating HCA-WSI, which is over an order of magnitude faster than HDP-WSI and more robust. We then apply HCA-WSI across the vocabularies of several languages to create LexSemTm, a multilingual sense frequency resource of unprecedented size. Of note, LexSemTm contains sense frequencies for approximately 88% of polysemous lemmas in Princeton WordNet, compared to only 39% for SemCor, and the quality of data in each is shown to be roughly equivalent. Finally, we extend our sense distribution learning methodology to multiword expressions (MWEs), which to the best of our knowledge is a novel task (as is applying any kind of general-purpose WSD method to MWEs). We demonstrate that sense distribution learning for MWEs is comparable to that for simplex lemmas in all important respects, and we expand LexSemTm with MWE sense frequency data.
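The topic-modelling route to sense distributions described above can be sketched in miniature. This is not the thesis's HDP-WSI/HCA-WSI pipeline: it assumes topics have already been induced over a lemma's usages, and aligns them with senses by simple gloss-word overlap; all names and data here are hypothetical.

```python
from collections import Counter

def sense_distribution(topic_weights, topic_top_words, sense_glosses):
    """Estimate a lemma's sense distribution by aligning induced topics
    with sense glosses.

    topic_weights:   {topic_id: prevalence of the topic in the usages}
    topic_top_words: {topic_id: the topic's highest-probability words}
    sense_glosses:   {sense_id: set of content words from the gloss}
    """
    scores = Counter()
    for topic, weight in topic_weights.items():
        top_words = set(topic_top_words[topic])
        for sense, gloss in sense_glosses.items():
            overlap = len(top_words & gloss)
            if overlap:
                # A topic votes for the senses its words resemble,
                # in proportion to how prevalent the topic is.
                scores[sense] += weight * overlap
    total = sum(scores.values())
    return {s: v / total for s, v in scores.items()} if total else {}

dist = sense_distribution(
    topic_weights={0: 0.7, 1: 0.3},
    topic_top_words={0: ["money", "loan", "deposit"],
                     1: ["river", "water", "shore"]},
    sense_glosses={
        "bank.n.1": {"money", "deposit", "institution"},
        "bank.n.2": {"river", "shore", "land"},
    },
)
```

Here the financial sense dominates because the more prevalent topic shares more words with that sense's gloss; summed over a corpus, such estimates yield SemCor-style sense frequencies without any sense-annotated data.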
  • Item
    Improving the utility of social media with Natural Language Processing
    Han, Bo (2014)
    Social media has been an attractive target for many natural language processing (NLP) tasks and applications in recent years. However, the unprecedented volume of data and the non-standard language register cause problems for off-the-shelf NLP tools. This thesis investigates the broad question of how NLP-based text processing can improve the utility (i.e., the effectiveness and efficiency) of social media data. In particular, text normalisation and geolocation prediction are closely examined in the context of Twitter text processing. Text normalisation is the task of restoring non-standard words to their standard forms. For instance, “earthquick” and “2morrw” should be transformed into “earthquake” and “tomorrow”, respectively. Non-standard words often cause problems for existing tools trained on edited text sources such as newswire text. By applying text normalisation to reduce unknown non-standard words, the accuracy of NLP tools and downstream applications is expected to increase. In this thesis, I explore and develop lexical normalisation methods for Twitter text. Based on an analysis of existing and newly developed text normalisation methods, I shift the focus from a cascaded token-based approach to a type-based approach using a combined lexicon. The type-based method achieved state-of-the-art end-to-end normalisation accuracy at the time of publication, i.e., 0.847 precision and 0.630 recall on a benchmark dataset. Furthermore, it is simple, lightweight and easily integrated, which makes it particularly well suited to large-scale data processing. Additionally, the effectiveness of the proposed normalisation method is demonstrated on non-English text normalisation and on other NLP tasks and applications. Geolocation prediction estimates a user’s primary location based on the text of their posts. It enables location-based data partitioning, which is crucial to a range of tasks and applications such as local event detection. The partitioned location data can improve both the efficiency and the effectiveness of NLP tools and applications. In this thesis, I identify and explore several factors that affect the accuracy of text-based geolocation prediction in a unified framework. In particular, an extensive range of feature selection methods is compared to determine the optimal feature set for the geolocation prediction model. The results suggest that feature selection is an effective means of improving prediction accuracy regardless of the geolocation model and location partitioning. Additionally, I examine the influence of other factors including non-geotagged data, user metadata, tweeting language, temporal influence, user geolocatability, and geolocation prediction confidence. The proposed stacking-based prediction model achieved 40.6% city-level accuracy and a 40 km median error distance for English Twitter users on a recent benchmark dataset. These investigations provide practical insights into the design of a text-based geolocation prediction system, as well as a basis for further research on this task. Overall, the exploration of these two text processing tasks enhances the utility of social media data for relevant NLP tasks and downstream applications. The developed methods and experimental results have an immediate impact on future social media research.
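At its core, the type-based normalisation approach reduces to a lexicon lookup over word types: every occurrence of a known non-standard type is rewritten, with no per-token context modelling. A minimal sketch, with an illustrative toy lexicon (the thesis's combined lexicon is derived automatically and at far larger scale):

```python
import re

# Toy lexicon; entries here are illustrative only.
LEXICON = {"earthquick": "earthquake", "2morrw": "tomorrow", "u": "you"}

def normalise(text, lexicon=LEXICON):
    """Type-based normalisation: replace each known non-standard word
    type with its standard form. The absence of per-token context
    modelling is what makes the approach simple, lightweight and fast."""
    return re.sub(r"[A-Za-z0-9']+",
                  lambda m: lexicon.get(m.group(0).lower(), m.group(0)),
                  text)
```

For example, `normalise("earthquick in Chile 2morrw")` yields `"earthquake in Chile tomorrow"`; unknown tokens pass through unchanged.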
  • Item
    Automatic identification of locative expressions from informal text
    Liu, Fei (2013)
    Informal place descriptions that are rich in locative expressions can be found in various contexts. The ability to extract locative expressions from such informal place descriptions is central to improving the quality of services such as interpreting geographical queries and emergency calls. While much attention has been focused on the identification of formal place references (e.g., Rathmines Road) in natural language, people tend to make heavy use of informal place references (e.g., my bedroom). This research addresses the problem by developing a model that automatically identifies locative expressions in informal text. Moreover, we investigate which aspects of the text are helpful in the identification task. Utilising an existing manually annotated corpus, we re-annotate locative expressions and use them as the gold standard. With the gold standard in place, we take a machine learning approach to the identification task, with well-reasoned features based on observation and intuition. Further, we study the impact of various feature setups on the performance of the model and analyse the experimental results. With the best-performing feature setup, the model achieves a significant increase in performance over the baseline systems.
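A machine-learning identifier of this kind rests on per-token features of the sort the abstract alludes to. The sketch below is purely illustrative of observation-driven features (the cue lists and feature names are invented, not the thesis's actual feature set):

```python
SPATIAL_PREPS = {"in", "at", "near", "on", "to", "from"}
PLACE_CUES = {"road", "street", "park", "room", "bedroom", "station"}

def token_features(tokens, i):
    """Features for judging whether token i belongs to a locative
    expression, formal ("Rathmines Road") or informal ("my bedroom")."""
    word = tokens[i]
    prev_word = tokens[i - 1].lower() if i > 0 else "<s>"
    return {
        "word": word.lower(),
        "capitalised": word[:1].isupper(),        # formal place names
        "place_cue": word.lower() in PLACE_CUES,  # informal place heads
        "prev_word": prev_word,
        "prev_spatial_prep": prev_word in SPATIAL_PREPS,
    }
```

Feature dictionaries like these would then be fed, with the re-annotated gold-standard labels, to any standard sequence or token classifier.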
  • Item
    Improving the utility of topic models: an uncut gem does not sparkle
    Lau, Jey Han (2013)
    This thesis concerns a type of statistical model known as a topic model. Topic modelling learns abstract “topics” in a collection of documents, where a “topic” is an idea, theme or subject. For example, we may have an article that discusses space exploration, or a book about crime; space exploration and crime are the “topics” in question. As one might imagine, topic modelling has a direct application in digital libraries, as it automates the learning and categorisation of topics in books and articles. A further merit of topic modelling is that its machinery is not limited to processing words: it handles symbols in general. As such, topic modelling has seen applications outside text processing, such as biomedical research for inferring protein families. Most applications, however, are small-scale and experimental, and much of the impact is still contained in academic research. The overarching theme of the thesis is thus to improve the utility of topic modelling. We achieve this in two ways: (1) by improving several aspects of topic modelling to make it more accessible and usable; and (2) by proposing novel applications of topic modelling to real-world problems. In the first step, we look into improving the preprocessing methodology that creates the input for topic models. We also experiment extensively with improving the visualisation of topics (one of the main outputs of topic models) to increase their usability for human users. In the second step, we apply topic modelling in lexicography-oriented work, to learn and detect new meanings that have emerged in words, and in the social media space, to identify popular social trends. Both were novel applications and delivered promising results, demonstrating the strength and wide applicability of topic models.
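The conventional visualisation that work like this starts from is the top-N word list per topic: rank each topic's words by in-topic probability and display the head of the ranking. A minimal sketch over an invented topic-word count table:

```python
def top_words(topic_word_counts, n=3):
    """For each topic, rank words by in-topic probability and keep the
    top n -- the word list conventionally used to label the topic."""
    out = {}
    for topic, counts in topic_word_counts.items():
        total = sum(counts.values())
        ranked = sorted(counts, key=lambda w: (-counts[w], w))
        out[topic] = [(w, counts[w] / total) for w in ranked[:n]]
    return out

topics = top_words({
    "t0": {"space": 5, "nasa": 3, "launch": 2, "the": 1},
    "t1": {"crime": 6, "police": 3, "court": 1},
}, n=2)
```

Raw lists like `["space", "nasa"]` are exactly the output whose usability for human users the thesis works to improve, e.g. by better preprocessing and topic labelling.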
  • Item
    Collective document classification using explicit and implicit inter-document relationships
    Burford, Clinton (2013)
    Information systems are transforming the ways in which people generate, store and share information. One consequence of this change is a massive increase in the quantity of digital content the average person needs to deal with. A large part of the information systems challenge is finding intelligent ways to help users locate and analyse this information. One tool available for building systems that address this challenge is automatic document classification. A document classifier is a statistical model for predicting a label for an input document that is represented as a set of features. The potential usefulness of such a generalised system for categorising documents based on their contents is considerable. There are direct applications for systems that can answer complex document categorisation questions such as: is this product review generally positive or negative? Document classification systems can also become critical parts of more complex systems that need input documents to be selected based on complex criteria. This thesis addresses the question of how document classifiers can exploit information about the relationships between the documents being classified. Normally, document classifiers work on a single document at a time: once the classifier has been trained from a set of labelled examples, it can be used to label single input documents as required. Collective document classifiers instead learn a classifier that is applied to a group of related documents, using the inter-document relationships in the group to improve labelling performance beyond what is possible when considering documents in isolation. Work on collective document classifiers is based on the observation that some types of documents have features which are either ambiguous or not present in the training data, but which have the special characteristic of indicating relationships between the labels of documents. Most often, an inter-document relationship indicates that two documents have the same label, but it may also indicate that they have different labels. In either case, classifiers gain an advantage if they can consider inter-document features. Inter-document features can be explicit, as when a document cites or quotes another, or implicit, as when documents exist in semantically related groups in which stylistic, structural or semantic similarities are informative, or when they are related by a spatial or temporal structure. In the first part of this thesis I survey the state of the art in collective document classification and explore approaches for adding collective behaviour to standard document classifiers. I present an experimental evaluation of these techniques for use with explicit inter-document relationships. In the second part I develop techniques for extracting implicit inter-document relationships. In total, the work in this thesis assesses and extends the capabilities of collective document classifiers. Its contribution has four main parts: (1) I introduce an approach that gives better-than-state-of-the-art performance for collective classification of political debate transcripts; (2) I provide a comparative overview of collective document classification techniques to assist practitioners in choosing an algorithm for collective document classification tasks; (3) I demonstrate effective and novel approaches for generating collective classifiers from standard classifiers; and (4) I introduce a technique for inferring inter-document relationships based on matching phrases, and show that these relationships can be used to improve overall document classification performance.
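One simple way to add collective behaviour to a standard classifier is score smoothing in the style of relaxation labelling: start from each document's isolated score and repeatedly interpolate it with its linked neighbours' scores. The sketch below assumes "same-label" relationships only; the function names and interpolation weight are illustrative, not the thesis's algorithms.

```python
def collective_classify(base_scores, edges, iterations=20, alpha=0.5):
    """base_scores: {doc: P(positive) from an isolated classifier}
    edges:       {doc: list of linked docs (e.g. citations, quotes)}

    Each round interpolates a document's own score with the mean score
    of its neighbours, letting confidently labelled documents pull
    ambiguous, linked ones toward the same label."""
    scores = dict(base_scores)
    for _ in range(iterations):
        updated = {}
        for doc, own in base_scores.items():
            nbrs = edges.get(doc, [])
            if nbrs:
                mean = sum(scores[n] for n in nbrs) / len(nbrs)
                updated[doc] = (1 - alpha) * own + alpha * mean
            else:
                updated[doc] = own
        scores = updated
    return {doc: s >= 0.5 for doc, s in scores.items()}

# Document "c" is ambiguous (0.45) in isolation, but links to two
# positively scored documents, which tip it to the positive label.
labels = collective_classify({"a": 0.9, "b": 0.55, "c": 0.45},
                             {"c": ["a", "b"]})
```

"Different-label" relationships would instead push a neighbour's score away from the mean; either way, the gain comes from features that only exist between documents.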
  • Item
    The effects of sampling and semantic categories on large-scale supervised relation extraction
    Willy (2012)
    The purpose of relation extraction is to identify novel pairs of entities which are related by a pre-specified relation such as hypernymy or synonymy. The traditional approach to relation extraction is to build a dedicated system for a particular relation, meaning that significant effort is required to repurpose the approach to new relations. We propose a generic approach based on supervised learning, which provides a standardised process for performing relation extraction across different relations and domains. We explore the feasibility of the approach over a range of relations and corpora, focusing particularly on the development of a realistic evaluation methodology for relation extraction. In addition, we investigate the impact of semantic categories on extraction effectiveness.
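What makes such an approach generic is that each candidate entity pair is turned into a relation-agnostic feature representation and scored by an ordinary supervised classifier, so only the labelled training pairs change between relations. A toy featuriser in that spirit (the specific features are illustrative, not the thesis's):

```python
def pair_features(e1, e2, sentence_tokens):
    """Relation-agnostic features for a candidate (e1, e2) pair: the
    same representation serves hypernymy, synonymy, or any other target
    relation -- only the labelled training pairs differ."""
    toks = [t.lower() for t in sentence_tokens]
    i, j = toks.index(e1.lower()), toks.index(e2.lower())
    lo, hi = min(i, j), max(i, j)
    return {
        "between": " ".join(toks[lo + 1:hi]),  # "such as" hints at hypernymy
        "distance": hi - lo,
        "e1_first": i < j,
    }

feats = pair_features("animals", "cats", "animals such as cats".split())
```

Here the intervening-text feature captures patterns like "X such as Y", the kind of signal a classifier can learn per relation from labelled pairs.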
  • Item
    Computing relationships and relatedness between contextually diverse entities
    Grieser, Karl (2011)
    When presented with a pair of entities such as a ball and a bat, a person may make the connection that both entities are involved in sport (e.g., baseball or cricket, depending on the individual's background), that the composition of the two entities is similar (e.g., a wooden ball and a wooden stick), or, if the person is especially creative, that there might be a fancy dress ball where someone has come dressed as a bat. All of these connections are equally valid, but depending on the context the person is familiar with (e.g., sport, wooden objects, fancy dress), a particular connection may be more apparent to that person. From a computational perspective, identifying these relationships and calculating the level of relatedness of entity pairs requires consideration of all the ways in which the entities can interact with one another. Existing approaches to identifying the relatedness of entities and the semantic relationships between them fail to take into account the multiple diverse ways in which these entities may interact, and hence do not explore all the potential ways in which entities may be related. In this thesis, I use the collaborative encyclopedia Wikipedia as the basis for a measure of semantic relatedness that takes into account the contextual diversity of entities (the Related Article Conceptual Overlap, or RACO, method), and describe several relationship extraction methods that utilise the taxonomic structure of Wikipedia to identify pieces of text describing relations between contextually diverse entities. I also describe the construction of a dataset of museum exhibit relatedness judgements used to evaluate the performance of RACO. I demonstrate that RACO outperforms state-of-the-art measures of semantic relatedness over a collection of contextually diverse entities (museum exhibits), and that the taxonomic structure of Wikipedia provides a basis for identifying valid relationships between such entities. As this work is set in the Cultural Heritage domain and uses Wikipedia as its representational basis, I additionally describe how the conceptual-overlap principle for calculating semantic relatedness, and the taxonomic-link-based relationship extraction methods, can be adapted to other contextually diverse domains and other representational resources.
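The conceptual-overlap intuition behind RACO can be sketched with a set-overlap coefficient over the Wikipedia articles related to each entity. The thesis's actual formulation is richer, so treat this Jaccard version, with invented article sets, purely as illustration:

```python
def overlap_relatedness(related_a, related_b):
    """Relatedness of two entities as the Jaccard overlap between the
    sets of Wikipedia articles related to each: two entities count as
    related to the extent that their surrounding concepts coincide."""
    related_a, related_b = set(related_a), set(related_b)
    union = related_a | related_b
    return len(related_a & related_b) / len(union) if union else 0.0

# A cricket ball and bat share sport-related articles, so they score
# highly even though the entities themselves are quite different.
score = overlap_relatedness({"Sport", "Cricket", "Leather"},
                            {"Sport", "Cricket", "Willow"})
```

Because the related-article sets can be drawn from different parts of Wikipedia for different contexts (sport, materials, fancy dress), a measure of this shape can reflect the contextual diversity of the entities being compared.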