Computing and Information Systems - Theses

Search Results

  • Item
    Generalized language identification
    LUI, MARCO (2014)
    Language identification is the task of determining the natural language that a document or part thereof is written in. The central theme of this thesis is generalized language identification, which deals with eliminating the assumptions that limit the applicability of language identification techniques to specific settings that may not be representative of real-world use cases. Research to date has treated language identification as a supervised machine learning problem, and in this thesis I argue that such a characterization is inadequate, showing how standard document representations do not take into account the variation in a language between different sources of text, and developing a representation that is robust to such variation. I also develop a method that allows for language identification of multilingual documents, i.e. documents that contain text in more than one language. Finally, I investigate the robustness of existing off-the-shelf language identification methods on a novel and challenging domain.
  • Item
    Impact of user characteristics on online forum classification tasks
    LUI, MARCO (2009)
    We develop methods for describing users based on their posts to an online discussion forum. These methods build on existing techniques for describing other aspects of online discussion communities, but the application of these techniques to describing users is novel. We demonstrate the utility of our proposed methods by showing that they are superior to existing methods over distinct thread-level, post-level and user-level classification tasks, utilizing real-world datasets. In all cases, we attain statistically significant improvements over baseline results. In post-level classification, we also see statistically significant improvements over state-of-the-art benchmark methods.
    Our major contributions in this work are:
      • creation of a corpus with user-level annotations
      • detailed description and analysis of three relevant corpora
      • implementation of a data model for accessing forum data
      • implementation of feature extraction techniques
      • evaluation and analysis of user-level features over classification tasks
    Our work on preparing corpora and providing extensible implementations of feature extraction will be of particular value to researchers looking to work in this field.