Computing and Information Systems - Theses

Search Results

Now showing 1 - 10 of 15
  • Item
    Semi-supervised community detection and clustering
    Ganji, Mohadeseh (2017)
Data clustering and community detection in networks are two important tasks in machine learning which aim to group data into clusters of similar objects or into densely connected sub-graphs. However, choosing an appropriate similarity measure to obtain the highest accuracy is always a challenge. Furthermore, in some real-world applications, background knowledge and information exist about the true or desired assignments or properties of clusters and communities. This side-information may come from experiments, domain knowledge or user preferences, and constraints may also be imposed on the system by the natural complexity of the problem or by resource limitations. Community detection (clustering) in the presence of side-information represented as supervision constraints is called semi-supervised community detection (clustering). However, finding efficient approaches that take full advantage of this pre-existing information to improve the quality of the solutions is still a challenge. In this thesis, we study community detection and clustering problems with and without incorporating domain knowledge, for which we propose a similarity measure and exact and approximate optimization techniques to improve the accuracy of the results. We address limitations of a popular community detection measure called modularity and propose an improved measure, generalized modularity, which quantifies the similarity of network vertices more realistically and comprehensively by incorporating vertex similarity concepts. The proposed generalized modularity outperforms the state-of-the-art modularity optimization approach in community detection. In addition, to incorporate background knowledge and user preferences into the community detection process, two semi-supervised approaches are proposed in this thesis. First, we address the modelling flexibility issue in the semi-supervised community detection literature by simultaneously modelling instance-level and community-level constraints. We propose a generic framework using constraint programming technology which can incorporate a variety of instance-level, community-level and complex supervision constraints at the same time. The framework also allows local community definitions to be modelled as constraints within a global community detection scheme, and can incorporate a variety of similarity measures and community detection objective functions. Using a high-level modelling language gives the proposed semi-supervised community detection framework the flexibility to apply both complete and incomplete techniques, with the advantage of proving optimality of the solutions found when complete techniques are used. Second, a new algorithm for semi-supervised community detection is proposed based on discrete Lagrange multipliers which incorporates pairwise constraints. Unlike most existing semi-supervised community detection schemes, which modify the graph adjacency based on the supervision constraints, the proposed algorithm works with quality measures such as modularity or generalized modularity and guides the community detection process by systematically modifying the similarity matrix only for hard-to-satisfy constraints. The proposed algorithm commits to satisfying (almost) all of the constraints to take full advantage of the existing supervision, and it outperforms existing semi-supervised community detection algorithms in terms of satisfying the supervision constraints and resisting noise.
Another contribution of this thesis is incorporating instance-level supervision constraints into the clustering problem. In this regard, a k-means-type semi-supervised clustering algorithm is proposed which takes full advantage of the pre-existing information to achieve high-quality solutions satisfying the constraints. The proposed algorithm is based on continuous Lagrange multipliers and penalizes constraint violations in a systematic manner that guides the cluster centroids and cluster assignments towards satisfying all of the constraints. The achievements of this thesis are supported by several experiments and publications.
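The penalty idea behind the continuous-Lagrange-multiplier clustering described above can be illustrated with a small sketch: a k-means-style loop whose assignment step adds a penalty for violated must-link and cannot-link constraints, and whose penalty weights grow for constraints that remain violated. This is a minimal illustration under simplifying assumptions (the `constrained_kmeans` helper, its update rules and the multiplier schedule are not the thesis algorithm).

```python
import numpy as np

def constrained_kmeans(X, k, must_link, cannot_link, n_iter=50, lam_step=0.5, rng=None):
    """Penalty-based constrained k-means sketch (illustrative assumptions only)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n = len(X)
    centers = X[rng.choice(n, k, replace=False)].copy()
    lam_ml = {pair: 0.0 for pair in must_link}      # penalty weights, Lagrange-multiplier style
    lam_cl = {pair: 0.0 for pair in cannot_link}
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # assignment step: squared distance plus penalties for violated constraints
        for i in range(n):
            cost = np.linalg.norm(X[i] - centers, axis=1) ** 2
            for (a, b), w in lam_ml.items():
                other = b if a == i else a if b == i else None
                if other is not None:
                    cost += w * (np.arange(k) != labels[other])  # penalise splitting a must-link pair
            for (a, b), w in lam_cl.items():
                other = b if a == i else a if b == i else None
                if other is not None:
                    cost += w * (np.arange(k) == labels[other])  # penalise merging a cannot-link pair
            labels[i] = int(np.argmin(cost))
        # update step: recompute centroids
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
        # multiplier step: raise the penalty of constraints that are still violated
        for (a, b) in must_link:
            if labels[a] != labels[b]:
                lam_ml[(a, b)] += lam_step
        for (a, b) in cannot_link:
            if labels[a] == labels[b]:
                lam_cl[(a, b)] += lam_step
    return labels, centers
```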
  • Item
    Automatic optical coherence tomography imaging analysis for retinal disease screening
    Hussain, Md Akter (2017)
The retina and the choroid are two important structures of the eye on which the quality of eyesight depends. They comprise many tissue layers which are very important for monitoring the health of the eye and the progression of eye disease from an early stage. These layers can be visualised using Optical Coherence Tomography (OCT) imaging. Abnormalities in these layers are indications of several eye diseases that can lead to blindness, such as Diabetic Macular Edema (DME), Age-related Macular Degeneration (AMD) and Glaucoma. If the retina and the choroid are damaged there is little chance of recovering normal sight, and any damage to them will lead to blindness if treatment is late or absent. With eye diseases, early detection and treatment are more effective and cheaper. Biomarkers extracted from these tissue layers, such as changes in layer thickness and the presence of pathologies such as drusen and hyper-reflective intra-retinal spots, are very effective for early detection and for monitoring the progression of eye disease. Large-scale and reliable biomarker extraction by manual grading is infeasible, prone to error due to subjective bias, and cost-ineffective; automatic biomarker extraction is the best solution. However, OCT image analysis for extracting biomarkers is very challenging because of noisy images, low contrast, extremely thin retinal layers, the presence of pathologies and complex anatomical structures such as the optic disc and macula. In this thesis, a robust, efficient and accurate automated 3D segmentation algorithm for OCT images is proposed for the retinal tissue layers and the choroid, overcoming those challenges. By mapping the OCT image segmentation problem to a graph problem, we convert the detection of layer boundaries into finding shortest paths in the mapped graph. The proposed method exploits layer-oriented small regions of interest, uses edge pixels from Canny edge detection as nodes of the graph, and incorporates prior knowledge of the structures into the edge weights; the shortest path found with Dijkstra's algorithm is taken as the layer boundary. Using this segmentation scheme, we were able to segment all the retinal and choroid tissue layers very accurately and extract eight novel biomarkers such as attenuation of the retinal nerve fibre layer, relative intensity of the ellipsoid zone, thickness of the retinal layers, and volume of pathologies such as drusen. In addition, we demonstrated that these biomarkers support a very accurate (98%) classification model, built with a Random Forest classifier, for classifying patients as normal or as having DME or AMD. The proposed segmentation and classification methods have been evaluated on several datasets collected locally at the Centre for Eye Research Australia and from the public domain. In total, the dataset contains 56 patients for the evaluation of the segmentation algorithms and 72 patients for the classification model. The method developed in this study has shown high accuracy for all layers of the retina and the choroid compared with eight state-of-the-art methods. The root mean square error between manually delineated and automatically segmented boundaries is as low as 0.01 pixels. The quantification of biomarkers has also shown a low margin of error relative to the manually quantified values.
Furthermore, the classification model has shown more than 98% accuracy, outperforming four state-of-the-art methods, with an area under the receiver operating characteristic curve (AUC) of 0.99. The classification model can also be used for early detection of disease, enabling significant prevention of blindness as well as providing a score/index for the condition or prediction of eye diseases. In this thesis, we have also developed a fully automated prototype system, OCTInspector, for OCT image analysis using these proposed algorithms and methods.
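The graph formulation described above can be conveyed with a minimal sketch: a B-scan is treated as a grid graph, edge costs are low where the vertical intensity gradient is strong, and the cheapest left-to-right path is taken as one layer boundary. Node selection from Canny edge pixels, the layer-oriented regions of interest and the prior-knowledge weights of the actual method are omitted; everything here is an illustrative assumption.

```python
import heapq
import numpy as np

def layer_boundary(bscan):
    """Return a left-to-right shortest path of (row, col) points in a 2D OCT B-scan."""
    img = np.asarray(bscan, dtype=float)
    rows, cols = img.shape
    grad = np.gradient(img, axis=0)                                  # vertical intensity gradient
    node_cost = 1.0 - (grad - grad.min()) / (np.ptp(grad) + 1e-9)    # low cost on strong edges
    dist = np.full((rows, cols), np.inf)
    prev = {}
    # virtual source: every pixel in the first column is a start node
    heap = [(node_cost[r, 0], (r, 0)) for r in range(rows)]
    for c0, (r, c) in heap:
        dist[r, c] = c0
    heapq.heapify(heap)
    while heap:                                                      # Dijkstra over the grid graph
        d, (r, c) = heapq.heappop(heap)
        if d > dist[r, c]:
            continue
        if c == cols - 1:
            break                                                    # reached the right edge
        for dr in (-1, 0, 1):                                        # step one column right, +/- one row
            nr, nc = r + dr, c + 1
            if 0 <= nr < rows:
                nd = d + node_cost[nr, nc]
                if nd < dist[nr, nc]:
                    dist[nr, nc] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (nd, (nr, nc)))
    end = (int(np.argmin(dist[:, -1])), cols - 1)                    # backtrack the cheapest path
    path = [end]
    while path[-1] in prev:
        path.append(prev[path[-1]])
    return list(reversed(path))
```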
  • Item
    Machine learning for feedback in massive open online courses
    He, Jizheng (2016)
Massive Open Online Courses (MOOCs) have received widespread attention for their potential to scale higher education, with multiple platforms such as Coursera, edX and Udacity appearing in recent years. Online courses from elite universities around the world are offered for free, so that anyone with internet access can learn anywhere. Enormous enrolments and great diversity of students have been widely observed in MOOCs. Despite their popularity, MOOCs are limited in reaching their full potential by a number of issues, one of the major ones being notoriously low completion rates. A number of studies have focused on identifying the factors leading to this problem; one of these factors is the lack of interactivity and support. There is broad agreement in the literature that interaction and communication play an important role in improving student learning: interaction in MOOCs helps students ease their feelings of isolation and frustration, develop their own knowledge, and improve their learning experience. A natural way of improving interactivity is providing feedback to students on their progress and problems. MOOCs give rise to vast amounts of student engagement data, bringing opportunities to gain insights into student learning and provide feedback. This thesis focuses on applying and designing new machine learning algorithms to assist instructors in providing student feedback. In particular, we investigate three main themes: i) identifying at-risk students who may not complete courses, as a step towards timely intervention; ii) exploring the suitability of automatically discovered forum topics as instruments for modelling students' ability; iii) similarity search in heterogeneous information networks. The first theme can help instructors design interventions for at-risk students to improve retention. The second theme is inspired by recent research on the measurement of student learning in education research communities, where educators explore the suitability of latent, complex patterns of engagement, instead of traditional visible assessment tools (e.g. quizzes and assignments), for measuring a hypothesised distinctive and complex learning skill that promotes learning in MOOCs. This process is often human-intensive and time-consuming. Inspired by this research, and by the importance of MOOC discussion forums for understanding student learning and providing feedback, we investigate whether students' participation across forum discussion topics can indicate their academic ability. The third theme is a generic study of utilising the rich semantic information in heterogeneous information networks to help find similar objects. MOOCs contain diverse and complex student engagement data, a typical example of a heterogeneous information network, and so could benefit from this study. We make the following contributions towards solving the above problems. Firstly, we propose transfer learning algorithms based on regularised logistic regression to identify, on a weekly basis, students who are at risk of not completing courses. The predicted probabilities, which are well calibrated and smoothed, can be used not only for the identification of at-risk students but also for subsequent interventions. We envision an intervention that presents the probability of success/failure to borderline students, with the hypothesis that they can be motivated by being classified as "nearly there".
Secondly, we combine topic models with measurement models to discover topics from students' online forum postings. The topics are constrained to fit measurement models, providing statistical evidence for their use as instruments for measuring student ability. In particular, we focus on two measurement models, the Guttman scale and the Rasch model. To the best of our knowledge, this is the first study to explore the suitability of topics discovered from MOOC forum content as instruments for measuring student ability by combining topic models with psychometric measurement models in this way. Furthermore, the scaled topics imply a range of difficulty levels, which can be useful for monitoring the health of a course, refining curricula and student assessment, and providing personalised feedback based on student ability levels and topic difficulty levels. Thirdly, we extend an existing meta path-based similarity measure by incorporating transitive similarity and temporal dynamics in heterogeneous information networks, evaluated on the DBLP bibliographic network. The proposed similarity measure could be applied in MOOC settings to find similar students or threads, or for thread recommendation in MOOC forums, by modelling student interactions in MOOC forums as a heterogeneous information network.
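The first contribution's building block, weekly at-risk prediction with regularised logistic regression, can be sketched briefly. The sketch below fits an L2-regularised model on a completed offering's week-k engagement features and scores the current offering's students; the feature names, synthetic data and the `weekly_risk_scores` helper are illustrative assumptions, and the transfer-learning component that shares information across offerings is not reproduced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def weekly_risk_scores(X_past, y_past, X_current):
    """Fit on a past offering's week-k features and completion labels, score the current cohort."""
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    clf.fit(X_past, y_past)
    return clf.predict_proba(X_current)[:, 1]        # predicted probability of not completing

# Toy example: flag "borderline" students, the group the envisioned "nearly there"
# intervention would target.
rng = np.random.default_rng(0)
X_past = rng.poisson(2.0, size=(300, 5))             # e.g. counts of video views, forum posts, ...
y_past = rng.integers(0, 2, size=300)                # 1 = did not complete (synthetic labels)
X_now = rng.poisson(2.0, size=(200, 5))
risk = weekly_risk_scores(X_past, y_past, X_now)
borderline = np.where((risk > 0.4) & (risk < 0.6))[0]
```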
  • Item
    Cluster validation and discovery of multiple clusterings
    Lei, Yang (2016)
Cluster analysis is an important unsupervised learning process in data analysis. It aims to group data objects into clusters so that objects in the same group are more similar and objects in different groups are more dissimilar. There are many open challenges in this area; in this thesis we focus on two: discovery of multiple clusterings, and cluster validation. Many clustering methods focus on discovering a single ‘best’ solution from the data. However, data can be multi-faceted in nature. Particularly when datasets are large and complex, there may be several useful clusterings in the data. In addition, users may seek different perspectives on the same dataset, requiring multiple clustering solutions. Multiple clustering analysis has attracted considerable attention in recent years and aims to discover multiple reasonable and distinctive clustering solutions from the data. Many methods have been proposed on this topic; one popular technique is meta-clustering, which explores multiple reasonable and distinctive clusterings by analyzing a large set of base clusterings. However, there may exist poor-quality and redundant base clusterings, which affect the generation of high-quality and diverse clustering views. In addition, the generated clustering views may not all be relevant, and it is time-consuming for users to check all the returned solutions. To tackle these problems, we propose a filtering method and a ranking method to achieve higher-quality and more distinctive clustering solutions. Cluster validation refers to the procedure of evaluating the quality of clusterings, which is critical for clustering applications. Cluster validity indices (CVIs) are often used to quantify the quality of clusterings. They can be broadly classified into two categories, external measures and internal measures, distinguished by whether or not external information is used during the validation procedure. In this thesis, we focus on external cluster validity indices and on two open challenges: (a) CVIs for fuzzy clusterings, and (b) bias issues for CVIs. External CVIs are often used to quantify the quality of a clustering by comparing it against the ground truth. Most external CVIs are designed for crisp clusterings (where each data object belongs to exactly one cluster); how to evaluate the quality of soft clusterings (where a data object can belong to more than one cluster) is a challenging problem. One common way to achieve this is to harden a soft clustering to a crisp clustering and then evaluate it using a crisp CVI; however, hardening may cause information loss. To address this problem, we generalize a class of popular information-theoretic crisp external CVIs to directly evaluate the quality of soft clusterings, without the need for a hardening step. There is an implicit assumption when using external CVIs to evaluate the quality of a clustering: that they work correctly. If this assumption does not hold, misleading results may occur. Thus, identifying and understanding the bias behaviors of external CVIs is crucial. Along these lines, we identify novel bias behaviors of external CVIs and analyze the types of bias both theoretically and empirically.
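A hedged sketch of the idea of comparing soft clusterings without hardening: build a "soft contingency table" from two membership matrices and compute a normalised mutual information from it. This is one natural construction for illustration only; the generalised information-theoretic CVIs proposed in the thesis may be defined differently.

```python
import numpy as np

def soft_nmi(U, V, eps=1e-12):
    """U: (n, k) and V: (n, l) membership matrices whose rows sum to 1."""
    U, V = np.asarray(U, float), np.asarray(V, float)
    n = U.shape[0]
    P = U.T @ V / n                                   # soft joint distribution over cluster pairs
    pu, pv = P.sum(axis=1), P.sum(axis=0)             # marginals of the two clusterings
    mi = np.sum(P * (np.log(P + eps) - np.log(np.outer(pu, pv) + eps)))
    hu = -np.sum(pu * np.log(pu + eps))
    hv = -np.sum(pv * np.log(pv + eps))
    return mi / max(np.sqrt(hu * hv), eps)            # normalise by the geometric mean of entropies

# Hardening the fuzzy clustering first (argmax per row) would discard the partial memberships:
U = np.array([[0.6, 0.4], [0.5, 0.5], [0.1, 0.9]])    # a soft clustering of three objects
V = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])    # a crisp reference clustering
print(soft_nmi(U, V))
```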
  • Item
    Volatility homogenisation and machine learning for time series forecasting
    Kowalewski, Adam Waldemar (2016)
Volatility homogenisation is a technique of observing a process at regular points in space: we record the process only when it moves by a certain quantum. The intuition, supported by empirical evidence, is that smaller movements are just noise and can be ignored, while the larger movements carry the information from the underlying process. In this vein, we derive theoretical results showing that volatility homogenisation can be used to estimate the drift and volatility of theoretical processes, and we verify these results by simulation. This demonstrates the ability of a “homogenised” process to retain salient information about the underlying process. Volatility homogenisation is then coupled, as a preprocessing step, with various machine learning techniques, which yields greater forecasting accuracy than when the machine learning techniques are used without volatility homogenisation preprocessing. In addition, we develop volatility homogenisation kernels for kernel-based machine learning techniques such as support vector machines, relevance vector machines and Gaussian processes. The volatility homogenisation kernel causes a kernel-based machine learning technique to apply volatility homogenisation internally and thereby obtain better predictions of the direction of a financial time series. In order to create and use the volatility homogenisation kernel, we develop a solution to the problem of a kernel taking inputs of differing dimension while still maintaining a convex solution to the model, for techniques such as support vector machines, for a given set of parameters. Furthermore, we demonstrate the efficacy of volatility homogenisation as a way of investing successfully using a Kelly criterion strategy. The strategy uses the information inherent in a support vector machine model with a volatility homogenisation kernel to calculate the necessary parameters for the Kelly betting strategy. We also develop strategies that select additional features for the support vector machine through a nearest neighbour strategy using various measures of association. Overall, volatility homogenisation is a robust strategy for the decomposition of a process which allows various machine learning techniques to discern the main driving process in a financial time series, leading to better forecasts and investment strategies.
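A minimal sketch of the volatility homogenisation idea described above: keep an observation only when the series has moved by a fixed quantum since the last kept point, discarding the smaller movements treated as noise. The threshold choice and the coupling to downstream learners are assumptions for illustration only.

```python
import numpy as np

def homogenise(series, quantum):
    """Return indices and values of the series sampled only at moves of at least `quantum`."""
    series = np.asarray(series, dtype=float)
    kept_idx = [0]
    last = series[0]
    for i, x in enumerate(series[1:], start=1):
        if abs(x - last) >= quantum:          # the process has moved by at least one quantum
            kept_idx.append(i)
            last = x
    return np.array(kept_idx), series[kept_idx]

# Example: a noisy random walk sampled only at moves of +/- 0.5
rng = np.random.default_rng(1)
prices = np.cumsum(rng.normal(0, 0.1, size=1000)) + 100.0
idx, homogenised = homogenise(prices, quantum=0.5)
```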
  • Item
    Design and adjustment of dependency measures
    Romano, Simone (2015)
Dependency measures are fundamental to a number of important applications in data mining and machine learning. They are used ubiquitously: for feature selection, for clustering comparison and validation, as splitting criteria in random forests, and to infer biological networks, to list a few. More generally, there are three important applications of dependency measures: detection, quantification, and ranking of dependencies. Dependency measures are estimated on finite data sets, which makes these tasks challenging. This thesis proposes a series of contributions to improve performance on each of these three goals. When differentiating between strong and weak relationships using information-theoretic measures, the variance plays an important role: the higher the variance, the lower the chance of correctly ranking the relationships. In this thesis, we discuss the design of a dependency measure based on the normalized mutual information whose estimation uses many random discretization grids. This approach allows us to reduce its estimation variance. We show that a small estimation variance for the grid estimator of mutual information is beneficial for achieving higher power when the task is detection of dependencies between variables and when ranking different noisy dependencies. Dependency measure estimates can be high by chance when the sample size is small, e.g. because of missing values, or when the dependency is estimated between categorical variables with many categories. These biases cause problems when the dependency must have an interpretable quantification and when ranking dependencies for feature selection. In this thesis, we formalize a framework for adjusting dependency measures in order to correct for these biases. We apply our adjustments to existing dependency measures between variables and show how to achieve better interpretability in quantification. For example, when a dependency measure is used to quantify the amount of noise on functional dependencies between variables, we experimentally demonstrate that adjusted measures have a more interpretable range of variation. Moreover, we demonstrate that our approach is also effective for ranking attributes during the splitting procedure in random forests, where a dependency measure between categorical variables is employed. Finally, we apply our framework of adjustments to dependency measures between clusterings. In this scenario, we are able to compute our adjustments analytically. We propose a number of adjusted clustering comparison measures which reduce to well-known adjusted measures as special cases. This allows us to propose guidelines for the best applications of our measures as well as for existing ones for which guidelines are missing in the literature, e.g. the Adjusted Rand Index (ARI).
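The generic adjustment-for-chance pattern that such frameworks build on can be sketched as (index - E[index]) / (max[index] - E[index]). The sketch below estimates the expectation with a permutation null model and uses self-agreement as the upper bound; both are simplifying assumptions for illustration, whereas the thesis derives its adjustments analytically and with measure-specific maxima. The Adjusted Rand Index follows the same template with an analytical expectation.

```python
import numpy as np
from sklearn.metrics import mutual_info_score, adjusted_rand_score

def adjusted_measure(x, y, measure, n_perm=200, rng=None):
    """Adjust a dependency measure for chance: (raw - expected) / (max - expected)."""
    rng = np.random.default_rng(rng)
    raw = measure(x, y)
    null = [measure(x, rng.permutation(y)) for _ in range(n_perm)]   # E[measure] under a permutation null
    expected = float(np.mean(null))
    maximum = measure(x, x)                                          # upper bound via self-agreement (a simplification)
    return (raw - expected) / (maximum - expected + 1e-12)

a = np.array([0, 0, 1, 1, 2, 2])
b = np.array([0, 0, 1, 2, 2, 2])
print(adjusted_measure(a, b, mutual_info_score), adjusted_rand_score(a, b))
```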
  • Item
    Similarity analysis with advanced relationships on big data
    Huang, Jin (2015)
Similarity analytic techniques such as distance-based joins and regularized learning models are critical tools in numerous data mining and machine learning tasks. We focus on two such techniques in the context of large-scale data and distributed clusters. Advanced distance metrics such as the Earth Mover's Distance (EMD) are usually employed to capture the similarity between data dimensions. The high computational cost of the EMD calls for a distributed solution, yet it is difficult to achieve a balanced workload given the skewed distribution of the EMDs. We propose efficient bounding techniques and effective workload scheduling strategies on the Hadoop platform to design a scalable solution, named HEADS-Join. We investigate both range joins and top-k joins, and explore different computation paradigms including MapReduce, BSP, and Spark. We conduct comprehensive experiments and confirm that the proposed techniques achieve an order of magnitude speedup over the state-of-the-art MapReduce join algorithms. The hypergraph model has been demonstrated to be highly effective in a wide range of applications where high-order relationships are of interest. When processing a large-scale hypergraph, the straightforward approach is to convert it to a graph and reuse distributed graph frameworks. However, such an approach significantly increases the problem size, incurs excessive replicas due to partitioning, and makes it extremely difficult to achieve a balanced workload. We propose a novel scalable framework, named HyperX, which operates directly on a distributed hypergraph representation and minimizes the number of replicas while maintaining good workload balance across the distributed machines. We also closely investigate the optimization problem of partitioning a hypergraph for distributed computation. With extensive experiments, we confirm that HyperX achieves an order of magnitude improvement over the graph-conversion approach in terms of execution time, network communication, and memory consumption.
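An illustrative filter-and-refine range join in the spirit of the bounding idea above: a cheap lower bound on the EMD prunes pairs before the exact computation. The 1-D histograms, the centroid lower bound and the single-machine nested loop are simplifications for the sketch; the actual system distributes the work over Hadoop/Spark with workload-aware scheduling.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def emd_range_join(records, bins, threshold):
    """records: normalised histograms over the same 1-D bin positions; return pairs within threshold."""
    centroids = [float(np.dot(bins, h)) for h in records]     # weighted mean of each histogram
    results = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if abs(centroids[i] - centroids[j]) > threshold:  # centroid distance lower-bounds the 1-D EMD
                continue                                      # pruned without computing the exact EMD
            emd = wasserstein_distance(bins, bins, records[i], records[j])
            if emd <= threshold:
                results.append((i, j, emd))
    return results

bins = np.arange(8, dtype=float)
rng = np.random.default_rng(2)
hists = [h / h.sum() for h in rng.random((50, 8))]
pairs = emd_range_join(hists, bins, threshold=0.5)
```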
  • Item
    Recommendation systems for travel destination and departure time
    Xue, Yuan (2015)
People travel on a daily basis to various local destinations such as the office, home, restaurants, appointment venues, and sightseeing spots. A positive experience and high efficiency of daily travel are vital to most people. With this observation, my research strives to provide daily-travel-related recommendations by solving two optimisation problems: driving destination prediction and departure time recommendation for appointments. Our “SubSyn” destination prediction algorithm predicts potential destinations in real time for drivers on the road. Its applications include recommending sightseeing places, pushing targeted advertisements, and providing early warnings of road congestion. It employs the Bayesian inference framework and a second-order Markov model to compute a list of high-probability destinations. The key contributions include real-time processing and the ability to predict destinations with a very limited amount of training data. We also examine the problem of privacy protection against such prediction. The “iTIME” departure time recommendation system is a smart calendar that reminds users to depart in order to arrive at appointment venues on time. It also suggests the best transport mode based on users’ travel history and preferences. Currently, it is very inefficient for people to manually and repeatedly check departure times and compare all transport modes using, for instance, Google Maps. The functionalities of iTIME are realised by machine learning algorithms that learn users’ habits, analyse the importance of appointments and the optimal mode of transport, and estimate the start location and travel time. Our field study showed that users can save up to 40% of time by using iTIME. The system can also be extended easily to provide additional functionality such as detection of clashing appointments and appointment scheduling, both taking into account the predicted start location and travel time of future appointments. Both problems can be categorised as recommender systems (or recommendation systems) that provide insightful suggestions to improve the experience and efficiency of daily travel.
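The Bayesian-inference idea behind the destination prediction described above can be sketched as P(destination | observed route) ∝ P(route | destination) · P(destination), with route likelihoods from a Markov model over road-grid cells trained on historical trips. For brevity the sketch uses a first-order transition model per destination and hypothetical helper names; SubSyn's second-order model and its sub-trajectory synthesis are not reproduced here.

```python
from collections import Counter, defaultdict

def train(trips):
    """trips: list of cell-id sequences; the last cell of each trip is its destination."""
    prior = Counter()
    trans = defaultdict(Counter)                          # (destination, cell) -> next-cell counts
    for trip in trips:
        dest = trip[-1]
        prior[dest] += 1
        for a, b in zip(trip, trip[1:]):
            trans[(dest, a)][b] += 1
    return prior, trans

def predict(prior, trans, partial_route, top_k=3):
    """Rank candidate destinations given the cells driven so far."""
    scores = {}
    total = sum(prior.values())
    for dest, cnt in prior.items():
        p = cnt / total                                   # prior P(destination)
        for a, b in zip(partial_route, partial_route[1:]):
            row = trans[(dest, a)]
            p *= (row[b] + 1) / (sum(row.values()) + 1)   # smoothed P(next cell | cell, destination)
        scores[dest] = p
    norm = sum(scores.values()) or 1.0
    return sorted(((d, s / norm) for d, s in scores.items()), key=lambda x: -x[1])[:top_k]

trips = [["a", "b", "c", "office"], ["a", "b", "d", "gym"], ["a", "b", "c", "office"]]
prior, trans = train(trips)
print(predict(prior, trans, ["a", "b"]))
```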
  • Item
    Generalized language identification
    Lui, Marco (2014)
Language identification is the task of determining the natural language that a document, or part thereof, is written in. The central theme of this thesis is generalized language identification: eliminating the assumptions that limit the applicability of language identification techniques to specific settings that may not be representative of real-world use cases. Research to date has treated language identification as a supervised machine learning problem. In this thesis I argue that such a characterization is inadequate, showing how standard document representations fail to account for the variation in a language between different sources of text, and developing a representation that is robust to such variation. I also develop a method for language identification of multilingual documents, i.e. documents that contain text in more than one language. Finally, I investigate the robustness of existing off-the-shelf language identification methods in a novel and challenging domain.
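For context, the standard supervised formulation that the thesis starts from can be sketched as a multinomial Naive Bayes classifier over character n-gram counts. The thesis's actual contributions, cross-domain feature selection for robustness and inference over multilingual documents, are not reproduced here; the pipeline and toy data below are illustrative assumptions only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = ["the cat sat on the mat", "der Hund spielt im Garten", "le chat dort sur le lit"]
train_langs = ["en", "de", "fr"]

# character 1-3 grams as the document representation, Naive Bayes as the classifier
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3), lowercase=False),
    MultinomialNB(),
)
model.fit(train_docs, train_langs)
print(model.predict(["le chien dort", "the dog runs"]))   # toy data, so take the output loosely
```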
  • Item
    Computational substructure querying and topology prediction of the beta-sheet
    Ho, Hui Kian (2014)
Studying the three-dimensional structure of proteins is essential to understanding their function and, ultimately, the dysfunction that causes disease. The limitations of experimental protein structure determination present a need for computational approaches to protein structure prediction and analysis. The beta-sheet is a commonly occurring protein substructure that is important to many biological processes and is often implicated in neurological disorders. Targeted experimental studies of beta-sheets are especially difficult due to their general insolubility in isolation. This thesis presents a series of contributions to the computational analysis and prediction of beta-sheet structure, which are useful for knowledge discovery and for directing more detailed experimental work. Approaches for predicting the simplest type of beta-sheet, the beta-hairpin, are described first. Improvements over existing methods are obtained by using the most important beta-hairpin features identified through systematic feature selection; an examination of these features provides a physicochemical basis for their usefulness in beta-hairpin prediction. New methods for the more general problem of beta-sheet topology prediction are then described. Unlike recent methods, ours are independent of multiple sequence alignments (MSAs) and therefore do not rely on the coverage of reference sequence databases or on sequence homology. Our evaluations show that our methods do not exhibit the same reduction in performance as a state-of-the-art method on sequences with low-quality MSAs. A new method for the indexing and querying of beta-sheet substructures, called BetaSearch, is described. BetaSearch exploits the inherent planar constraints of beta-sheet structure to achieve significant speedups over existing graph indexing and conventional 3D structure search methods. Case studies demonstrate the potential of this method for the discovery of biologically interesting beta-sheet substructures. Finally, a purpose-built open-source toolkit for generating 2D protein maps is described, which is useful for the coarse-grained analysis and visualisation of 3D protein structures. It can be used in existing knowledge-discovery pipelines for automated structural analysis and prediction tasks, as a standalone application, or imported into existing experimental applications.
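A small sketch of the kind of 2D protein map such a toolkit produces: a residue-residue contact map computed from C-alpha coordinates with a distance cutoff. The toolkit's actual map types, parsers and rendering are not reproduced; the toy coordinates and the 8 ångström cutoff are illustrative assumptions. Long-range contacts between strand residues show up as the diagonal and anti-diagonal stripes characteristic of parallel and antiparallel beta-sheets.

```python
import numpy as np

def contact_map(ca_coords, cutoff=8.0):
    """ca_coords: (n_residues, 3) array of C-alpha coordinates in angstroms."""
    coords = np.asarray(ca_coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))          # pairwise residue distances
    return dist <= cutoff                             # boolean n x n map; True = contact

rng = np.random.default_rng(3)
toy = np.cumsum(rng.normal(0, 1.5, size=(30, 3)), axis=0)   # a toy 30-residue backbone walk
cmap = contact_map(toy)
```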