Knowledge discovery and extraction of domain-specific web data
Affiliation: Computing and Information Systems
Document Type: PhD thesis
Access Status: Open Access
© 2014 Dr. Li Wang
Web user forums (or simply “forums”) are a valuable means for users to resolve specific information needs, both interactively for the participants and statically for users who search and browse over historical thread data. However, the complex structure of forum threads can make it difficult for users to extract relevant information. To address this problem, we propose to parse the discourse structure of forum threads for the purpose of enhancing information access and solution sharing over web user forums. The discourse structure of a forum thread is modelled as a rooted directed acyclic graph (DAG), and each post in the thread is represented as a node in this DAG. The reply-to relations between posts are denoted as directed edges (LINKs) between nodes in the DAG, and the type of a reply-to relation is defined as a dialogue act (DA).
To parse the discourse structure of threads, both LINKs and DAs need to be identified. The first method we propose uses conditional random fields either to classify the LINK and DA separately and compose them afterwards, or to classify the combined LINK and DA directly. The second method treats discourse structure parsing as a dependency parsing problem, because the joint classification of LINK and DA is a natural fit for dependency parsing. Our parsing methods not only perform significantly better than a strong heuristic baseline, but also handle growing threads robustly, achieving similar results over partial threads and complete threads. We also explore unsupervised approaches to LINK classification using lexical chaining.
We then explore ways of using thread discourse structure information to improve information access and solution sharing over web user forums. Specifically, we first demonstrate that the proposed discourse structure can help thread solvedness identification (i.e. automatically identifying whether the question asked in a forum thread has been resolved).
The basic idea is to use features derived from thread discourse structure as input to solvedness classification. For example, the last reply-to LINK and its DA type can indicate whether the question asked has been resolved. Experimental results show that simple features derived from thread discourse structure can greatly boost the accuracy of solvedness classification, a task that previous research has shown to be very difficult. We also investigate the utility of discourse structure in forum thread information retrieval (IR). The proposed method first parses the discourse structure of the targeted threads, then uses information from the parsed discourse structure to augment existing IR systems. For instance, if a post is linked to a question post with an answer-type DA, more weight should be given to this post during retrieval. Experimental results demonstrate that exploiting the characteristics of the discourse structure of forum threads benefits IR when compared with previously published state-of-the-art IR methods.
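As an illustration of the representation described in the abstract, the following minimal Python sketch stores a thread as posts plus DA-labelled reply-to LINKs, then derives the two kinds of signal mentioned: features from the last LINK for solvedness classification, and per-post weights for retrieval. It is not the thesis implementation. The class and function names, the DA labels ("Answer", "Resolution"), the boost value, and the assumption that post 0 is the question post are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Link:
    """A reply-to edge from a post to an earlier post, labelled with a dialogue act."""
    source: int  # index of the replying post
    target: int  # index of the post being replied to
    da: str      # dialogue act label (hypothetical label set)


@dataclass
class Thread:
    posts: List[str]
    links: List[Link] = field(default_factory=list)

    def add_link(self, source: int, target: int, da: str) -> None:
        # Allowing replies only to earlier posts keeps the graph acyclic.
        if not (0 <= target < source < len(self.posts)):
            raise ValueError("a post can only reply to an earlier post")
        self.links.append(Link(source, target, da))

    def last_link(self) -> Optional[Link]:
        # The LINK leaving the latest post that has one, or None if there are no LINKs.
        return max(self.links, key=lambda link: link.source, default=None)


def solvedness_features(thread: Thread) -> Dict[str, object]:
    """Simple features built from the last LINK and its DA (assumed feature set)."""
    last = thread.last_link()
    return {
        "last_da": last.da if last else "NONE",
        "last_distance": (last.source - last.target) if last else 0,
        "num_posts": len(thread.posts),
    }


def retrieval_weights(thread: Thread, answer_boost: float = 2.0) -> List[float]:
    """Per-post weights; posts answering the question post get a boost.

    Assumes post 0 is the question post and that answer-type DA labels
    start with "Answer". Both are assumptions of this sketch.
    """
    weights = [1.0] * len(thread.posts)
    for link in thread.links:
        if link.target == 0 and link.da.startswith("Answer"):
            weights[link.source] = answer_boost
    return weights


# Usage with a made-up three-post thread.
thread = Thread(posts=["My wifi keeps dropping", "Try updating the driver", "That fixed it"])
thread.add_link(1, 0, "Answer")
thread.add_link(2, 1, "Resolution")
print(solvedness_features(thread))
print(retrieval_weights(thread))
```

In a real system the LINKs and DAs would come from a trained parser rather than being added by hand, and the weights would feed into an existing retrieval model rather than being used directly.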