Semi-supervised community detection and clustering
AffiliationComputing and Information Systems
Document TypePhD thesis
Access StatusOpen Access
© 2017 Dr Mohadeseh Ganji
Data clustering and community detection in networks are two important tasks in machine learning which aim to group the data into similar objects or densely connected sub-graphs. However, applying an appropriate similarity measure to obtain the highest accuracy is always a challenge. Furthermore, in some real- world applications, some background knowledge and information exists about the true or desired assignments or properties of clusters and communities. The side-information could be obtained by experiments, domain knowledge or user preferences in different applications. Some constraints may also be imposed to the system due to natural complexity of the problem or resource limitations. Community detection (clustering) in the presence of side-information represented as supervision constraints is called semi-supervised community detection (clustering). However, finding efficient approaches to take the most advantage of this pre-existing information to improve quality of the solutions is still a challenge. In this thesis, we study community detection and clustering problems with and without incorporating domain knowledge for which we propose a similarity measure and exact and approximate optimization techniques to improve the accuracy of the results. In this thesis, we address limitations of a popular community detection measure called modularity and propose an improved measure called generalized modularity which quantifies similarity of network vertices more realistically and comprehensively by incorporating vertex similarity concepts. The pro- posed generalized modularity outperforms the state of the art modularity optimization approach in community detection. In addition, to incorporate background knowledge and user preferences into community detection process, two semi-supervised approaches are proposed in this thesis: First we address the modelling flexibility issue in the literature of semi- supervised community detection to simultaneously model instance level and community level constraints. We propose a generic framework using constraint programming technology which enables incorporating a variety of instance level, community level and complex supervision constraints at the same time. The framework also enables modelling local community definitions as constraints to a global community detection scheme and is able to incorporate a variety of similarity measures and community detection objective functions. Using a high level modelling language enables the proposed semi-supervised community detection framework to have the flexibility of applying both complete and incomplete techniques and has the advantage of proving optimality of the solutions found when using complete techniques. Second, a new algorithm for semi-supervised community detection is pro- posed based on discrete Lagrange multipliers which incorporates pairwise constraints. Unlike most of the existing semi-supervised community detection schemes that modify the graph adjacency based on the supervision constraints, the pro- posed algorithm works with quality measures such as modularity or generalized modularity and guides the community detection process by systematically modifying the similarity matrix only for hard-to satisfy constraints. The pro- posed algorithm commits to satisfy (almost) all of the constraints to take the most advantage of the existing supervision. It outperforms the existing semi- supervised community detection algorithms in terms of satisfying the supervision constraints and noise resistance. Another contribution of this thesis is to incorporate instance level supervision constraints into clustering problem. In this regard, a k-means type semi- supervised clustering algorithm is proposed which can take the most advantage of the pre-existing information to achieve high quality solutions satisfying the constraints. The proposed algorithm is based on continuous Lagrange multipliers and penalizes the constraint violations in a systematic manner which guides the cluster centroids and cluster assignments towards satisfying all of the constraints. The achievements of this thesis are supported by several experiments and publications.
Keywordsmachine learning; optimization; semi-supervised learning; clustering; community detection
- Click on "Export Reference in RIS Format" and choose "open with... Endnote".
- Click on "Export Reference in RIS Format". Login to Refworks, go to References => Import References