Electrical and Electronic Engineering - Theses

  • Item
    Big data cluster analysis and its applications
    Rathore, Punit (2018)
    The increasing prevalence of Internet of Things (IoT) technologies, smartphones, and social media services generates a huge amount of data, popularly known as 'big data'. Extracting useful information from big data is essential for many businesses and applications to provide better services and increase their profits. For example, smart city solutions aim to use this wealth of data to formulate effective policies that solve the problems faced by citizens. These voluminous data are usually unlabeled; therefore, scalable and efficient unsupervised algorithms are required to manage and extract actionable information from big data. Cluster analysis is a useful unsupervised approach for discovering the underlying groups and useful patterns in the data. Cluster analysis for any data consists of three problems: (P1) cluster assessment, which asks "Do the data have clusters? If yes, how many?"; (P2) clustering, i.e., partitioning the data into clusters; and (P3) cluster validity, which asks "Are the clusters found useful? Is there a better one we did not find?" Traditional cluster analysis algorithms are not suitable for big data owing to its volume, variety, and velocity. This thesis developed a suite of novel scalable algorithms to solve each of the three problems of cluster analysis, namely cluster assessment, clustering, and cluster validity, for big data that may be high-dimensional, anomalous, and streaming. For demonstration, a novel scalable framework for predicting large-scale taxi trajectories is presented as a real application of big data clustering. Our first contribution addresses the high-dimensionality and scalability issues for soft clustering methods. 
Specifically, we developed a simple and computationally efficient framework for high-dimensional data clustering, CAFCM, which employs fuzzy c-means clustering on an ensemble of random projections to obtain multiple fuzzy clustering partitions, and then cumulatively aggregates them based on their quality to get a final output partition. The CAFCM framework scales linearly in the number of samples in the data and does not require any prior knowledge of the number of clusters, which makes it an attractive clustering approach for big datasets. Our second contribution solves the cluster tendency assessment and clustering problem for voluminous, high-dimensional datasets. We developed a fast cluster tendency assessment and subsequent clustering algorithm, FensiVAT, which efficiently integrates an intelligent sampling scheme, called Maximin Random Sampling (MMRS), and a new random projection (RP)-based ensemble method with the visual assessment of cluster tendency (VAT) method. The reordered dissimilarity image (RDI), also known as a cluster heat map, obtained in FensiVAT suggests the number of clusters in the data. FensiVAT is more effective than existing big data clustering techniques, in terms of both CPU time and cluster quality. Our third contribution deals with the cluster validity problem for big data. Notably, we presented six novel approximation algorithms, including two incremental methods, to compute Dunn's cluster validity index for big data. Four methods used variations of MMRS sampling and two are based on unsupervised training of one-class support vector machines. All six methods for estimating Dunn's index (DI) are linear in the number of samples. Computing approximations to DI with MMRS methods is both tractable and accurate. After dealing with big static data, our next contribution focused on detecting evolving structure in high-velocity streaming data. 
Existing VAT-based algorithms for streaming data, inc-VAT/inc-iVAT and dec-VAT/dec-iVAT, are impractical for high-velocity data streams. We developed a novel algorithm, inc-siVAT, for incremental and time-efficient visualization of evolving cluster structures in high-velocity data streams. inc-siVAT extracts an initial smart (MMRS) sample and its RDI, then incrementally updates them on the fly to track changes in cluster structure after each chunk. The new algorithm is demonstrated for visualizing evolving cluster structures and detecting anomalies in dynamic streams of four big datasets, including a real IoT dataset. Finally, we demonstrate our big data clustering framework for a real-life smart city application. Based on a big data clustering method and Markov models, we developed a scalable framework for vehicle trajectory prediction which is suitable for a large number of overlapping trajectories in a dense road network, as is typical of major cities around the world. The short-term and long-term prediction performance of our framework on two real-life, large-scale taxi trajectory datasets from the Beijing and Singapore road networks is found to be better than that of two current methods, in terms of prediction accuracy and distance error.
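The abstract above names Maximin Random Sampling (MMRS) and Dunn's index (DI) without defining them. The following is a minimal Python sketch of both ideas under stated assumptions, not the thesis's algorithms: the function names (`maximin_indices`, `mmrs_sample`, `dunn_index`) are hypothetical, Euclidean distance is assumed, the proportional rounding in `mmrs_sample` is one plausible choice, and `dunn_index` is the exact quadratic-time definition (smallest between-cluster distance over largest cluster diameter) rather than any of the six linear-time approximations the abstract mentions.

```python
import math
import random


def maximin_indices(points, k, start=0):
    """Pick k indices so each new point is farthest from those already chosen."""
    chosen = [start]
    # nearest[i] = distance from point i to its closest chosen point so far
    nearest = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return chosen


def mmrs_sample(points, k, n_sample, seed=0):
    """Group points around maximin seeds, then draw randomly from each group."""
    rng = random.Random(seed)
    seeds = maximin_indices(points, k)
    groups = {s: [] for s in seeds}
    for i, p in enumerate(points):
        closest = min(seeds, key=lambda s: math.dist(p, points[s]))
        groups[closest].append(i)
    sample = []
    for members in groups.values():
        # Proportional allocation, rounded up so no group is dropped.
        quota = math.ceil(n_sample * len(members) / len(points))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sorted(sample)


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by largest cluster diameter."""
    min_sep, max_diam = math.inf, 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            if labels[i] == labels[j]:
                max_diam = max(max_diam, d)
            else:
                min_sep = min(min_sep, d)
    # All-singleton clusterings have zero diameter; treat DI as unbounded.
    return math.inf if max_diam == 0 else min_sep / max_diam
```

One way to use the two together is to evaluate `dunn_index` only on the points returned by `mmrs_sample`, which reduces the pairwise cost from the full dataset to the sample; whether that matches any of the thesis's estimators is not stated in the abstract.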
  • Item
    Big data clustering for smart city applications
    Kumar, Dheeraj (2016)
    The Internet of Things (IoT) infrastructure for the creation of smart cities consists of internet-connected sensors, devices, and citizens. This IoT infrastructure generates an enormous amount of data in the form of city-scale physical measurements and public opinions, constituting big data. Smart cities aim to use this wealth of data efficiently to manage and solve the problems faced by modern cities through better decision making. However, interpreting the massive amount of big data generated by smart cities to create actionable knowledge is a challenging task. Aggregation and summarization (data clustering) is a useful tool for creating knowledge from raw data from different sources. However, traditional data clustering algorithms are not suitable for unlabelled smart city data owing to its high volume and generation velocity and the limited prior knowledge about the generating phenomenon. This thesis presents a novel framework for clustering tendency assessment for big data, clusiVAT, which provides an aggregated view of the big data to create actionable knowledge. clusiVAT intelligently selects a small number of samples from the data such that the samples retain the approximate geometry of the big dataset. The reordered dissimilarity image of the samples, generated using a single linkage minimum spanning tree (MST), suggests the number of clusters in the data, which is required as an input by most popular clustering algorithms. The cluster labels are then extended to the non-sampled points using the nearest prototype rule. The clusiVAT framework was applied to two real-life smart city applications to understand the underlying patterns hidden in the huge volumes of data and to generate knowledge. The first application used clusiVAT for clustering and anomaly detection on the pedestrian and vehicle trajectories obtained from a video surveillance system. Experiments were performed on a real-life MIT trajectories dataset of vehicles and pedestrians from a parking lot scene. 
The trajectory clusters and anomalies thus obtained were helpful in the high-level interpretation of a scene (crowd behavior modeling), as feedback for a low-level (individual) tracking and activity prediction system, and as an alarm for a human supervisor. For the second application, clusiVAT was used to cluster large-scale (of the order of millions) vehicular trajectories obtained from the GPS traces of taxis in the cities of Beijing and Singapore, using a novel Dijkstra-based dynamic time warping distance measure. The results facilitated the understanding of spatial and temporal patterns in trajectories and were of great significance for decision-makers seeking to understand road traffic conditions and to propose metro bus corridors and light rail systems for better public transport. Another prominent type of data generated by smart city IoT infrastructure is high-velocity data streams. Automatic interpretation of these evolving big data is required for timely detection of unusual events. This thesis presents a computationally efficient 'hot' update approach for incremental visualization of evolving cluster structure in streaming data. The new algorithms were demonstrated for two applications: online anomaly detection and sliding-window-based clustering of time series data. Numerical experiments on weather monitoring data from the Great Barrier Reef and the city of Melbourne provided visual clues to the onset of new structure in streaming data.
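Two steps in the abstract above lend themselves to a short illustration: reordering a dissimilarity matrix so that clusters appear as dark diagonal blocks, and extending sample labels to the remaining points with the nearest prototype rule. The sketch below is a simplified, hedged rendering of the standard VAT reordering (a Prim-style traversal of the minimum spanning tree) and is not the clusiVAT implementation; the function names are hypothetical and Euclidean distance is assumed for the prototype rule.

```python
import math


def vat_order(dist):
    """Return a VAT-style ordering for a square dissimilarity matrix."""
    n = len(dist)
    # Start from one endpoint of the largest dissimilarity in the matrix.
    first = max(range(n), key=lambda i: max(dist[i]))
    order = [first]
    remaining = [j for j in range(n) if j != first]
    # closest[j] = distance from j to the nearest object already ordered
    closest = {j: dist[first][j] for j in remaining}
    while remaining:
        # Prim-style step: append the unordered object nearest to the ordered set.
        nxt = min(remaining, key=closest.__getitem__)
        order.append(nxt)
        remaining.remove(nxt)
        for j in remaining:
            closest[j] = min(closest[j], dist[nxt][j])
    return order


def reorder(dist, order):
    """Permute rows and columns; displayed as an image this is the RDI."""
    return [[dist[i][j] for j in order] for i in order]


def nearest_prototype_labels(points, prototypes, prototype_labels):
    """Give each point the label of its closest prototype (sampled point)."""
    labels = []
    for p in points:
        best = min(range(len(prototypes)),
                   key=lambda k: math.dist(p, prototypes[k]))
        labels.append(prototype_labels[best])
    return labels
```

In this sketch the sampled points play the role of prototypes: they would be clustered first, and `nearest_prototype_labels` would then assign every non-sampled point to the cluster of its nearest sampled point, which keeps the labelling step linear in the number of non-sampled points.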