Computing and Information Systems - Theses

Permanent URI for this collection

Search Results

Now showing 1 - 1 of 1
  • Item
    Thumbnail Image
    Design and adjustment of dependency measures
    Romano, Simone ( 2015)
    Dependency measures are fundamental for a number of important applications in data mining and machine learning. They are ubiquitously used: for feature selection, for clustering comparisons and validation, as splitting criteria in random forest, and to infer biological networks, to list a few. More generally, there are three important applications of dependency measures: detection, quantification, and ranking of dependencies. Dependency measures are estimated on finite data sets and because of this the tasks above become challenging. This thesis proposes a series of contributions to improve performances on each of these three goals. When differentiating between strong and weak relationships using information theoretic measures, the variance plays an important role: the higher the variance, the lower the chance to correctly rank the relationships. In this thesis, we discuss the design of a dependency measure based on the normalized mutual information whose estimation is based on many random discretization grids. This approach allows us to reduce its estimation variance. We show that a small estimation variance for the grid estimator of mutual information if beneficial to achieve higher power when the task is detection of dependencies between variables and when ranking different noisy dependencies. Dependency measure estimates can be high because of chance when the sample size is small, e.g. because of missing values, or when the dependency is estimated between categorical variables with many categories. These biases cause problems when the dependency must have an interpretable quantification and when ranking dependencies for feature selection. In this thesis, we formalize a framework to adjust dependency measures in order to correct for these biases. We apply our adjustments to existing dependency measures between variables and show how to achieve better interpretability in quantification. For example, when a dependency measure is used to quantify the amount of noise on functional dependencies between variables, we experimentally demonstrate that adjusted measures have more interpretable range of variation. Moreover, we demonstrate that our approach is also effective to rank attributes during the splitting procedure in random forests where a dependency measure between categorical variables is employed. Finally, we apply our framework of adjustments to dependency measures between clusterings. In this scenario, we are able to analytically compute our adjustments. We propose a number of adjusted clustering comparison measures which reduce to well known adjusted measures as special cases. This allows us to propose guidelines for the best applications of our measures as well as for existing ones for which guidelines are missing in literature, e.g. for the Adjusted Rand Index (ARI).