Show simple item record

dc.contributor.author: Shen, Aili
dc.date.accessioned: 2021-01-11T05:15:16Z
dc.date.available: 2021-01-11T05:15:16Z
dc.date.issued: 2020
dc.identifier.uri: http://hdl.handle.net/11343/258655
dc.description: © 2020 Aili Shen
dc.description.abstract: Because of its complexity and subjectivity, document quality assessment requires considering information from multiple sources and aspects to capture quality indicators. Grammaticality, readability, stylistics, structure, correctness, and expertise depth reflect the quality of documents from different aspects, with varying importance across different domains. Automatic quality assessment has obvious benefits in terms of time saving and tractability in contexts where the volume of documents is large. It is particularly pertinent for dynamic documents (possibly with multiple authors), such as Wikipedia articles, as any edit potentially has implications for the quality label of that document. In this thesis, we focus on improving the performance of document quality assessment systems and on measuring the uncertainty of such systems. This thesis addresses four research questions: (1) How can we capture visual features not present in the document text, such as images and visual layout, to enhance representations learned from text content? (2) How can we make use of hand-crafted features widely adopted in traditional machine learning approaches in the context of neural networks, to generate a more accurate document quality assessment system? (3) How can we model the inherent subjectivity of quality assessment in evaluating the performance of quality assessment systems? and (4) Can a quality assessment system detect whether there are intruder sentences in documents and identify the span of any such intruder sentences, given that they interrupt the coherence of documents, thereby lowering their quality? To address the first research question, we propose to use Inception V3 (Szegedy et al., 2016), a widely used visual model in computer vision, to capture visual features from visual renderings of documents, based on the observation that visual renderings of documents can capture these visual features.
Inception V3 compares favourably to text-based models over the Wikipedia and academic paper reviewing datasets. We further propose a joint model to predict document quality by combining visual and textual features. We observe further improvements over both the Wikipedia and academic paper reviewing datasets, indicating complementarity between visual and textual features, and the general applicability of our proposed method. Next, we propose two methods to enhance the capacity of neural models in predicting the quality of documents by utilising hand-crafted features. In the first method, we propose to concatenate hand-crafted features with high-level representations learned by neural models, assuming that the features learned by neural models may not have captured all the information carried by these hand-crafted features. The second method, on the other hand, utilises hand-crafted features to guide neural model learning by explicitly attending to feature indicators when learning the relationship between the input and target variables, rather than simply concatenating hand-crafted features. Experimental results demonstrate the superiority of our proposed methods over baselines. To imitate people’s disagreement over the inherently subjective task of document quality assessment, we propose to measure the uncertainty in document quality predictions. We investigate two methods: Gaussian processes (GPs) (Rasmussen and Williams, 2006) and random forests (RFs) (Breiman, 2001), which provide not only a prediction of the document quality but also the uncertainty over their predictions. We also propose an asymmetric cost that takes the prediction uncertainty into account, and use it to measure the performance of the two methods in scenarios where decision-making processes based on model predictions can lead to different costs.
Lastly, we propose a new task of detecting whether there is an intruder sentence in a document, generated by replacing an original sentence with a similar sentence from a second document. Existing datasets in coherence detection are not suitable for our task, as they are either too small to train current data-hungry models or do not specify the span of incoherent text. To benchmark model performance over this task, we construct a large-scale dataset consisting of documents from English Wikipedia and CNN news articles. Experimental results show that pre-trained language models which incorporate larger document contexts in pretraining perform remarkably well in-domain, but experience a substantial drop cross-domain. In follow-up analysis based on human annotations, substantial divergences from human intuitions were observed, pointing to limitations in the models' ability to model document coherence. Further results over a linguistic probe dataset show that pre-trained models fail to identify some linguistic characteristics that affect document coherence, suggesting that there is room for improvement before they truly capture document coherence, and motivating the construction of a dataset with intruder text at the intra-sentential level.
dc.rights: Terms and Conditions: Copyright in works deposited in Minerva Access is retained by the copyright owner. The work may not be altered without permission from the copyright owner. Readers may only download, print and save electronic copies of whole works for their own personal non-commercial use. Any use that exceeds these limits requires permission from the copyright owner. Attribution is essential when quoting or paraphrasing from these works.
dc.subject: document quality assessment
dc.subject: document coherence measurement
dc.subject: natural language processing
dc.subject: deep learning
dc.subject: hand-crafted features
dc.subject: visual features
dc.title: A multi-faceted approach to document quality assessment
dc.type: PhD thesis
melbourne.affiliation.department: Computing and Information Systems
melbourne.affiliation.faculty: Engineering
melbourne.thesis.supervisorname: Jianzhong Qi
melbourne.contributor.author: Shen, Aili
melbourne.thesis.supervisorothername: Timothy Baldwin
melbourne.thesis.supervisorothername: Bahar Salehi
melbourne.tes.fieldofresearch1: 460208 Natural language processing
melbourne.tes.fieldofresearch2: 460308 Pattern recognition
melbourne.tes.fieldofresearch3: 461099 Library and information studies not elsewhere classified
melbourne.accessrights: Open Access

