Computing and Information Systems - Theses

  • Item
    Analysing the interplay of location, language and links utilising geotagged Twitter content
    Afshin, Rahimi (2018)
    Language use and interactions on social media are geographically biased. In this work we utilise this bias in predictive models of user geolocation and lexical dialectology. User geolocation is an important component of applications such as personalised search and recommendation systems. We propose text-based and network-based geolocation models, and compare them over benchmark datasets, yielding state-of-the-art performance. We also propose hybrid and joint text and network geolocation models that improve upon text-only or network-only models, and show that the joint models achieve reasonable performance in minimal supervision scenarios, which are common in real-world datasets. Finally, we propose the use of continuous representations of location, which enables regression modelling of geolocation and lexical dialectology. We show that our proposed data-driven lexical dialectology model provides qualitative insights into geographical lexical variation.
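Geolocation models that output coordinates are commonly scored by the great-circle distance between the predicted and the true location. The sketch below is only an illustration of that kind of evaluation, not the thesis's own code: the function names `haversine_km` and `median_error_km` and the use of a mean Earth radius of 6371 km are assumptions made for the example.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant for this sketch)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def median_error_km(predicted, gold):
    """Median distance between paired predicted and gold (lat, lon) tuples."""
    return median(
        haversine_km(p[0], p[1], g[0], g[1]) for p, g in zip(predicted, gold)
    )
```

The median is less sensitive than the mean to a few users who are predicted on the wrong continent, which is one reason it is a popular summary for this kind of error.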
  • Item
    Improving the utility of social media with Natural Language Processing
    HAN, BO (2014)
    Social media has been an attractive target for many natural language processing (NLP) tasks and applications in recent years. However, the unprecedented volume of data and the non-standard language register cause problems for off-the-shelf NLP tools. This thesis investigates the broad question of how NLP-based text processing can improve the utility (i.e., the effectiveness and efficiency) of social media data. In particular, text normalisation and geolocation prediction are closely examined in the context of Twitter text processing.
    Text normalisation is the task of restoring non-standard words to their standard forms. For instance, “earthquick” and “2morrw” should be transformed into “earthquake” and “tomorrow”, respectively. Non-standard words often cause problems for existing tools trained on edited text sources such as newswire text. By applying text normalisation to reduce unknown non-standard words, the accuracy of NLP tools and downstream applications is expected to increase. In this thesis, I explore and develop lexical normalisation methods for Twitter text. I shift the focus of text normalisation from a cascaded token-based approach to a type-based approach using a combined lexicon, based on the analysis of existing and developed text normalisation methods. The type-based method achieved state-of-the-art end-to-end normalisation accuracy at the time of publication, namely 0.847 precision and 0.630 recall on a benchmark dataset. Furthermore, it is simple, lightweight and easily integrable, which makes it particularly well suited to large-scale data processing. Additionally, the effectiveness of the proposed normalisation method is shown in non-English text normalisation and other NLP tasks and applications.
    Geolocation prediction estimates a user’s primary location based on the text of their posts. It enables location-based data partitioning, which is crucial to a range of tasks and applications such as local event detection. The partitioned location data can improve both the efficiency and the effectiveness of NLP tools and applications. In this thesis, I identify and explore several factors that affect the accuracy of text-based geolocation prediction in a unified framework. In particular, an extensive range of feature selection methods is compared to determine the optimised feature set for the geolocation prediction model. The results suggest feature selection is an effective method for improving the prediction accuracy regardless of geolocation model and location partitioning. Additionally, I examine the influence of other factors including non-geotagged data, user metadata, tweeting language, temporal influence, user geolocatability, and geolocation prediction confidence. The proposed stacking-based prediction model achieved 40.6% city-level accuracy and a median error distance of 40 km for English Twitter users on a recent benchmark dataset. These investigations provide practical insights into the design of a text-based normalisation system, as well as the basis for further research on this task.
    Overall, the exploration of these two text processing tasks enhances the utility of social media data for relevant NLP tasks and downstream applications. The developed method and experimental results have immediate impact on future social media research.
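The idea of type-based normalisation with a lexicon, as described in the abstract above, can be illustrated with a plain dictionary lookup. This is a minimal sketch under stated assumptions, not the thesis's implementation: the toy lexicon `NORMALISATION_LEXICON`, the function name `normalise`, whitespace tokenisation and lowercased lookup are all hypothetical choices made for the example, and the two entries are taken from the abstract's own example words.

```python
# Hypothetical toy lexicon mapping non-standard word types to standard forms.
# A real lexicon would be far larger and built from data.
NORMALISATION_LEXICON = {
    "earthquick": "earthquake",
    "2morrw": "tomorrow",
}


def normalise(text, lexicon=NORMALISATION_LEXICON):
    """Replace each whitespace-separated token found in the lexicon.

    Lookup is done on the lowercased token; tokens that are not in the
    lexicon are passed through unchanged.
    """
    tokens = text.split()
    return " ".join(lexicon.get(tok.lower(), tok) for tok in tokens)


if __name__ == "__main__":
    print(normalise("big earthquick 2morrw"))
```

Because every occurrence of a word type is mapped the same way regardless of context, the lookup costs one dictionary access per token, which is consistent with the abstract's point that a type-based method is lightweight enough for large-scale processing.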