Computing and Information Systems - Theses

  • Item
    Automatic identification of locative expressions from informal text
    Liu, Fei ( 2013)
    Informal place descriptions that are rich in locative expressions can be found in various contexts. The ability to extract locative expressions from such informal place descriptions is central to improving the quality of services such as interpreting geographical queries and emergency calls. While much attention has been focused on the identification of formal place references (e.g., Rathmines Road) in natural language, people tend to make heavy use of informal place references (e.g., my bedroom). This research addresses the problem by developing a model that automatically identifies locative expressions in informal text. Moreover, we investigate which aspects of the text are helpful in the identification task. Utilising an existing manually annotated corpus, we re-annotate locative expressions and use them as the gold standard. With this gold standard in place, we take a machine learning approach to the identification task, with well-reasoned features based on observation and intuition. Further, we study the impact of various feature setups on the performance of the model and analyse the experimental results. With the best-performing feature setup, the model achieves a significant increase in performance over the baseline systems.
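    The machine-learning setup lends itself to a short illustration. Below is a minimal sketch of feature-based token classification in the spirit described above, assuming scikit-learn; the feature set, spatial-cue list and toy training data are illustrative inventions, not the thesis's actual features or corpus.

    ```python
    # Minimal sketch: classify each token as inside/outside a locative
    # expression using hand-crafted features. All features and data here
    # are illustrative assumptions, not the thesis's actual setup.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    SPATIAL_CUES = {"in", "at", "near", "behind", "inside", "outside"}

    def token_features(tokens, i):
        """Simple observation-based features for token i."""
        tok = tokens[i]
        return {
            "lower": tok.lower(),
            "is_capitalised": tok[0].isupper(),
            "prev_is_spatial_cue": i > 0 and tokens[i - 1].lower() in SPATIAL_CUES,
            "suffix3": tok[-3:].lower(),
        }

    # Tiny hand-labelled corpus: 1 = part of a locative expression, 0 = not.
    sentences = [
        (["I", "am", "in", "my", "bedroom"], [0, 0, 0, 1, 1]),
        (["Meet", "me", "near", "Rathmines", "Road"], [0, 0, 0, 1, 1]),
        (["She", "likes", "green", "tea"], [0, 0, 0, 0]),
    ]

    X = [token_features(toks, i) for toks, _ in sentences for i in range(len(toks))]
    y = [lab for _, labs in sentences for lab in labs]

    model = make_pipeline(DictVectorizer(), LogisticRegression())
    model.fit(X, y)

    test = ["He", "waited", "outside", "the", "library"]
    print([(t, model.predict([token_features(test, i)])[0])
           for i, t in enumerate(test)])
    ```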
  • Item
    Improving the utility of topic models: an uncut gem does not sparkle
    LAU, JEY HAN ( 2013)
    This thesis concerns a type of statistical model known as a topic model. Topic modelling learns abstract “topics” in a collection of documents; by “topic” we mean an idea, theme or subject. For example, we may have an article that discusses space exploration, or a book about crime; space exploration and crime are the “topics” in question. As one might imagine, topic modelling has a direct application in digital libraries, as it automates the learning and categorisation of topics in books and articles. The merit of topic modelling, however, is that its machinery is not limited to processing just words but symbols in general. As such, topic modelling has seen applications in areas outside text processing, such as biomedical research for inferring protein families. Most applications, however, are small scale and experimental, and much of the impact is still contained within academic research. The overarching theme of the thesis is thus to improve the utility of topic modelling. We achieve this in two ways: (1) by improving a few aspects of topic modelling to make it more accessible and usable; and (2) by proposing novel applications of topic modelling to real-world problems. In the first part, we look into improving the preprocessing of documents that creates the input for topic models. We also experiment extensively with improving the visualisation of topics (one of the main outputs of topic models) to increase its usability for human users. In the second part, we apply topic modelling to a lexicography-oriented task, detecting new meanings that have emerged in words, and to the social media space, identifying popular social trends. Both are novel applications that delivered promising results, demonstrating the strength and wide applicability of topic models.
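    As an illustration of the core machinery, here is a minimal topic-modelling sketch, assuming scikit-learn's LDA implementation; the toy corpus, the preprocessing choices and the top-words display are illustrative assumptions, not the thesis's own experimental setup.

    ```python
    # Minimal sketch: learn two "topics" from a toy corpus and display each
    # as its top words. Corpus and settings are illustrative assumptions.
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "the rocket launch took the probe into orbit around mars",
        "astronauts aboard the station studied the effects of orbit",
        "the detective linked the robbery to an organised crime ring",
        "police charged the suspect with burglary and fraud",
    ]

    # Preprocessing (here just bag-of-words with stopword removal) creates
    # the input for the topic model, the step the first part of the thesis
    # studies.
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

    # Visualising a topic as its top words, the output whose usability the
    # thesis also investigates.
    terms = vec.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        top = [terms[i] for i in weights.argsort()[-4:][::-1]]
        print(f"topic {k}: {' '.join(top)}")
    ```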
  • Item
    Collective document classification using explicit and implicit inter-document relationships
    Burford, Clinton ( 2013)
    Information systems are transforming the ways in which people generate, store and share information. One consequence of this change is a massive increase in the quantity of digital content the average person needs to deal with. A large part of the information systems challenge is to find intelligent ways to help users locate and analyse this information. One tool available for building systems that address this challenge is automatic document classification. A document classifier is a statistical model for predicting a label for an input document that is represented as a set of features. Such a generalised system for categorising documents based on their contents has broad potential usefulness. There are direct applications for systems that can answer complex document categorisation questions like: is this product review generally positive or negative? Document classification systems can also become critical parts of more complex systems that need input documents to be selected based on complex criteria. This thesis addresses the question of how document classifiers can exploit information about the relationships between the documents being classified. Normally, document classifiers work on a single document at a time: once the classifier has been trained from a set of labelled examples, it can then be used to label single input documents as required. Collective document classifiers learn a classifier that can be applied to a group of related documents. The inter-document relationships in the group are used to improve labelling performance beyond what is possible when considering documents in isolation. Work on collective document classifiers is based on the observation that some types of documents have features which are either ambiguous or not present in training data, but which have the special characteristic of indicating relationships between the labels of documents. Most often, an inter-document relationship indicates that two documents have the same label, but it may also indicate that they have different labels. In either case, classifiers gain an advantage if they can consider inter-document features. Inter-document features can be explicit, as when a document cites or quotes another, or implicit, as when documents exist in semantically related groups in which stylistic, structural or semantic similarities are informative, or when they are related by a spatial or temporal structure. In the first part of this thesis I survey the state of the art in collective document classification and explore approaches for adding collective behaviour to standard document classifiers. I present an experimental evaluation of these techniques for use with explicit inter-document relationships. In the second part I develop techniques for extracting implicit inter-document relationships. In total, the work in this thesis assesses and extends the capabilities of collective document classifiers. Its contribution is in four main parts: (1) I introduce an approach that gives better-than-state-of-the-art performance for collective classification of political debate transcripts; (2) I provide a comparative overview of collective document classification techniques to assist practitioners in choosing an algorithm for collective document classification tasks; (3) I demonstrate effective and novel approaches for generating collective classifiers from standard classifiers; and (4) I introduce a technique for inferring inter-document relationships based on matching phrases, and show that these relationships can be used to improve overall document classification performance.
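    One common way to obtain a collective classifier from a standard one is iterative classification, in which neighbour labels are folded in as extra features and re-estimated until they stabilise. The sketch below illustrates that general idea under toy assumptions; the data, links and feature construction are invented for illustration, and the thesis itself evaluates a range of such techniques.

    ```python
    # Minimal sketch of iterative collective classification: a standard
    # classifier is augmented with a feature summarising neighbours' current
    # labels, then labels are re-estimated over the document graph.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Content features and gold labels for four toy documents.
    X = np.array([[1.0, 0.1], [0.9, 0.2], [0.2, 0.8], [0.1, 0.9]])
    y = np.array([1, 1, 0, 0])
    # Explicit inter-document links (e.g. citation or quotation): index -> neighbours.
    links = {0: [1], 1: [0], 2: [3], 3: [2]}

    def add_neighbour_feature(X, labels):
        """Append each document's mean neighbour label as an extra feature."""
        nb = np.array([np.mean([labels[j] for j in links[i]])
                       for i in range(len(X))])
        return np.hstack([X, nb.reshape(-1, 1)])

    content_clf = LogisticRegression().fit(X, y)  # standard, per-document
    collective_clf = LogisticRegression().fit(add_neighbour_feature(X, y), y)

    # Inference: bootstrap labels from content alone, then iterate.
    labels = content_clf.predict(X)
    for _ in range(5):
        labels = collective_clf.predict(add_neighbour_feature(X, labels))
    print(labels)
    ```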
  • Item
    The effects of sampling and semantic categories on large-scale supervised relation extraction
    Willy ( 2012)
    The purpose of relation extraction is to identify novel pairs of entities that are related by a pre-specified relation such as hypernymy or synonymy. The traditional approach to relation extraction is to build a dedicated system for a particular relation, meaning that significant effort is required to repurpose the approach to new relations. We propose a generic approach based on supervised learning, which provides a standardised process for performing relation extraction over different relations and domains. We explore the feasibility of the approach over a range of relations and corpora, focusing particularly on the development of a realistic evaluation methodology for relation extraction. In addition, we investigate the impact of semantic categories on extraction effectiveness.
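    A minimal sketch of the generic supervised setup the abstract describes: candidate entity pairs are mapped to features, and a standard classifier decides whether the pre-specified relation holds. The features and toy pairs below are illustrative assumptions, not the thesis's actual design.

    ```python
    # Minimal sketch: relation extraction as supervised classification of
    # candidate (specific, general) hypernym pairs. Features and data are
    # illustrative assumptions.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def pair_features(specific, general):
        """Simple string-level cues over a candidate hypernym pair."""
        return {
            "suffix_match": specific.endswith(general),  # "apple tree" / "tree"
            "len_diff": len(specific) - len(general),
            "head": specific.split()[-1],
        }

    train = [
        (("apple tree", "tree"), 1),
        (("sports car", "car"), 1),
        (("oak", "furniture"), 0),
        (("car", "driver"), 0),
    ]
    X = [pair_features(s, g) for (s, g), _ in train]
    y = [lab for _, lab in train]

    model = make_pipeline(DictVectorizer(), LogisticRegression()).fit(X, y)
    print(model.predict([pair_features("fruit tree", "tree")]))
    ```

    Swapping in a different relation or domain means changing only the training pairs, which is the sense in which such a pipeline is a standardised, repurposable process.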
  • Item
    Computing relationships and relatedness between contextually diverse entities
    GRIESER, KARL ( 2011)
    When presented with a pair of entities such as a ball and a bat, a person may make the connection that both of these entities are involved in sport (e.g., baseball or cricket, depending on the individual's background), that the composition of the two entities is similar (e.g., a wooden ball and a wooden stick), or, if the person is especially creative, a fancy dress ball where someone has come dressed as a bat. All of these connections are equally valid, but depending on the context the person is familiar with (e.g., sport, wooden objects, fancy dress), a particular connection may be more apparent to that person. From a computational perspective, identifying these relationships and calculating the level of relatedness of entity pairs requires consideration of all the ways in which the entities can interact with one another. Existing approaches to identifying the relatedness of entities and the semantic relationships that exist between them fail to take into account the multiple diverse ways in which these entities may interact, and hence do not explore all potential ways in which entities may be related. In this thesis, I use the collaborative encyclopedia Wikipedia as the basis for the formulation of a measure of semantic relatedness that takes into account the contextual diversity of entities (called the Related Article Conceptual Overlap, or RACO, method), and describe several methods of relationship extraction that utilise the taxonomic structure of Wikipedia to identify pieces of text that describe relations between contextually diverse entities. I also describe the construction of a dataset of museum exhibit relatedness judgements used to evaluate the performance of RACO. I demonstrate that RACO outperforms state-of-the-art measures of semantic relatedness over a collection of contextually diverse entities (museum exhibits), and that the taxonomic structure of Wikipedia provides a basis for identifying valid relationships between contextually diverse entities. As this work is presented with regard to the domain of Cultural Heritage, using Wikipedia as the basis for representation, I additionally describe how the principle of conceptual overlap for calculating semantic relatedness, and the relationship extraction methods based on taxonomic links, can be adapted to other contextually diverse domains and other representational resources.
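    As a rough illustration of conceptual overlap in the spirit of RACO, the toy sketch below scores two entities by the overlap between the category sets gathered from each entity's related articles. The data and the Dice-style formula are illustrative simplifications, not the thesis's exact definition.

    ```python
    # Toy sketch: relatedness as overlap between the Wikipedia categories of
    # the articles linked from each entity's article. Link and category data
    # are invented for illustration.
    links = {
        "Cricket bat": ["Cricket", "Willow", "Batting (cricket)"],
        "Cricket ball": ["Cricket", "Leather", "Bowling (cricket)"],
    }
    categories = {
        "Cricket": {"Ball games", "Sports"},
        "Willow": {"Trees"},
        "Batting (cricket)": {"Cricket terminology", "Sports"},
        "Leather": {"Materials"},
        "Bowling (cricket)": {"Cricket terminology", "Sports"},
    }

    def related_category_set(article):
        """Union of categories over the articles linked from `article`."""
        return set().union(*(categories[a] for a in links[article]))

    def relatedness(a, b):
        """Dice-style overlap between the two collected category sets."""
        ca, cb = related_category_set(a), related_category_set(b)
        return 2 * len(ca & cb) / (len(ca) + len(cb))

    print(relatedness("Cricket bat", "Cricket ball"))
    ```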
  • Item
    Orthographic support for passing the reading hurdle in Japanese
    YENCKEN, LARS ( 2010)
    Learning a second language is, for the most part, a day-in, day-out struggle against the mountain of new vocabulary a learner must acquire. Furthermore, since the number of new words to learn is so great, learners must acquire them autonomously. Evidence suggests that for languages with writing systems, native-like vocabulary sizes are only developed through reading widely, and that reading is only fruitful once learners have acquired the core vocabulary required for it to become smooth. Learners of Japanese face an especially high barrier in the form of the Japanese writing system, in particular its use of kanji characters. Recent work on dictionary accessibility has focused on compensating for learner errors in pronouncing unknown words; however, much difficulty remains. This thesis uses the rich visual nature of the Japanese orthography to support the study of vocabulary in several ways. Firstly, it proposes a range of kanji similarity measures and evaluates them over several new data sets, finding that the stroke edit distance and tree edit distance metrics best approximate human judgements. Secondly, it uses stroke edit distance to construct a model of kanji misrecognition, which serves as the basis for a new form of kanji search by similarity. Analysing query logs, we find that this new form of search was rapidly adopted by users, indicating its utility. We finally combine the kanji confusion and pronunciation models into a new adaptive testing platform, Kanji Tester, modelled after aspects of the Japanese Language Proficiency Test. As users test themselves, the system adapts to their error patterns and uses this information to make future tests more difficult. Investigating usage logs, we find a weak positive correlation between ability estimates and the length of time the system has been used. Furthermore, our adaptive models generated questions that were significantly more difficult than their control counterparts. Overall, these contributions make a concerted effort to improve tools for learner self-study, so that learners can successfully overcome the reading hurdle and propel themselves towards greater proficiency. The data collected from these tools also forms a useful basis for further study of learner error and vocabulary development.
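    As an illustration of the best-performing metric, the sketch below computes a stroke edit distance: each kanji is represented as a sequence of coarse stroke types and compared with Levenshtein distance. The stroke encodings here are simplified inventions; the thesis's measure operates over real stroke data.

    ```python
    # Minimal sketch of stroke edit distance: Levenshtein distance over
    # sequences of stroke types. Stroke encodings are toy simplifications.
    def edit_distance(a, b):
        """Standard dynamic-programming Levenshtein distance."""
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            cur = [i]
            for j, y in enumerate(b, 1):
                cur.append(min(prev[j] + 1,               # deletion
                               cur[j - 1] + 1,            # insertion
                               prev[j - 1] + (x != y)))   # substitution
            prev = cur
        return prev[-1]

    # Toy stroke sequences (h = horizontal, v = vertical, d = diagonal).
    strokes = {
        "土": ["h", "v", "h"],
        "士": ["h", "v", "h"],   # visually confusable with 土
        "川": ["d", "v", "v"],
    }

    for a in strokes:
        for b in strokes:
            if a < b:
                print(a, b, edit_distance(strokes[a], strokes[b]))
    ```

    A low distance between visually confusable characters such as 土 and 士 is exactly the behaviour a misrecognition model and similarity search would exploit.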