Learning Resolution Parameters for Graph Clustering

Finding clusters of well-connected nodes in a graph is an extensively studied problem in graph-based data analysis. Because of its many applications, a large number of distinct graph clustering objective functions and algorithms have already been proposed and analyzed. To aid practitioners in determining the best clustering approach to use in different applications, we present new techniques for automatically learning how to set clustering resolution parameters. These parameters control the size and structure of communities that are formed by optimizing a generalized objective function. We begin by formalizing the notion of a parameter fitness function, which measures how well a fixed input clustering approximately solves a generalized clustering objective for a specific resolution parameter value. Under reasonable assumptions, which suit two key graph clustering applications, such a parameter fitness function can be efficiently minimized using a bisection-like method, yielding a resolution parameter that fits well with the example clustering. We view our framework as a type of single-shot hyperparameter tuning, as we are able to learn a good resolution parameter with just a single example. Our general approach can be applied to learn resolution parameters for both local and global graph clustering objectives. We demonstrate its utility in several experiments on real-world data where it is helpful to learn resolution parameters from a given example clustering.


INTRODUCTION
Partitioning a collection of items into groups of similar items (that is, clustering) is a fundamental computational task. Because clustering is so commonly applied, there is a large and still-growing suite of objective functions, algorithms, and techniques for identifying good clusters. One powerful mathematical model for clustering is the graph, comprising nodes and (undirected) edges. For a broad overview of graph clustering, refer to any one of a number of surveys [11,13,28,31]. Nearly all clustering approaches favor clusters with high internal edge density and low external edge density. A related, but not identical, notion is that a good cluster is a set of nodes with a small cut (i.e., few edges leaving the set) and a nontrivial size (e.g., a large number of nodes, or many internal edges).
Although most clustering approaches follow these general principles, there are many different ways to formalize such goals mathematically. In practice, this array of objective functions yields a large variety of different output clusterings. Indeed, many existing (theoretical) approaches to graph clustering assume that the user knows a priori which objective function is appropriate for their context or job. The main design task, leading to a practical solution, is then to develop good algorithms that exactly, or approximately, optimize the objective. However, we propose that it is more natural to assume that the user starts with some a priori knowledge about the desired structure of clusters in a given application domain. More specifically, they can provide at least one example of what a good clustering should look like. The revised goal is to find an objective function whose optimization yields the desired type of output.
Our approach. In this article, we show how to bootstrap from this one quality clustering to learn the appropriate objective function, chosen from a parameterized family of objective functions. The efficiency of our technique relies on the clustering objective function being linear, with linear constraints, tuned by a single parameter $\beta$. Given this type of objective, we formalize the notion of a parameter fitness function, which relies on a fixed example clustering $C_x$ of a network, and takes a parameter $\beta$ as input. The fitness function computes the ratio between the objective score of $C_x$, when $\beta$ is chosen as the input parameter, and a lower bound on the optimal clustering objective for that $\beta$. Specifically, we use a concave piecewise linear lower bound over a wide family of what we refer to as relaxed clusterings. Minimizing the fitness function produces a parameter (and a corresponding objective function) that $C_x$ exactly or at least nearly optimizes. Thus the aim, which can be realized via a guided binary search, is to identify the parameter setting in which $C_x$ most stands out. In the remainder of this introduction, we flesh out the context for our technique.
Parameters. There are multiple families of objective functions whose members are specified by a tunable resolution parameter. This parameter controls the size and structure of clusters that are formed by optimizing the objective. Key examples of such generalized clustering objective functions include the Hamiltonian objective studied by Reichardt and Bornholdt [29], the stability objective of Delvenne et al. [10], and a multi-resolution variant of the map equation [32].
In this manuscript we focus on a related clustering framework that we developed in previous work [37], based on correlation clustering [5]. This framework is named LambdaCC, after its resolution parameter λ, which implicitly controls both the internal edge density and the cut sparsity of clusters formed by minimizing the objective. Furthermore, LambdaCC generalizes several well-studied objectives such as modularity clustering, sparsest cut, normalized cut, and cluster deletion. All of these objectives can be viewed as special cases of the LambdaCC objective for appropriate settings of λ.
Global and local. The above objectives are specifically designed for global clustering, in which the goal is to find a multi-cluster partitioning of an input graph. Local clustering objectives relying on resolution parameters also exist; these focus on finding a single cluster in a localized region of a graph. Flow-based methods such as FlowImprove [4], LocalImprove [26], and SimpleLocal [35] fit into this category. These methods repeatedly solve minimum s-t cut problems for different values of a parameter α, in order to minimize a ratio-style objective related to a cluster quality measure called conductance. This α can be viewed as a resolution parameter that balances a trade-off between forming a cluster with a small cut, and forming a cluster with a large overlap with a seed set in the graph.
Given the unifying power and versatility of generalized clustering objective functions, the challenge of finding the right clustering technique for a specific application can often be reduced to finding an appropriate resolution parameter. However, very little work has addressed how to set these parameters in practice, in particular to capture the specific clustering structure present in a certain application domain. In the past, solving generalized objective functions for a range of resolution parameters has been used to detect hierarchical clusterings in a network [29], or as a way to identify stable clusterings, which consistently optimize the objective over a range of parameter values [10,17,32]. While both are important applications of resolution-based clustering, the ability to detect a specified type of clustering structure is important regardless of a clustering's stability or the hierarchical structure of a network. Finally, while tuning hyperparameters is a standard procedure in the broader machine learning literature, most existing approaches are not specifically designed for tuning graph clustering resolution parameters. Furthermore hyperparameter tuning techniques typically rely on performing cross validation over a large number of training examples. We are concerned with learning good resolution parameters from a single example clustering that represents a meaningful partitioning in a certain application domain.
Our Contributions. In this paper we develop an approach for learning how to set resolution parameters for both local and global graph clustering problems. Our results for global graph clustering rely heavily on the LambdaCC framework we developed in past work [37]. We begin by formally defining a parameter fitness function for a given clustering. We then prove that under reasonable assumptions on the input clustering and clustering objective function used, we can find the minimizer of such a fitness function to within arbitrary precision using a simple bisection-like method. Our approach can be viewed as a type of single-shot hyperparameter tuning, as we are able to learn an appropriate setting of a resolution parameter when given a single example clustering. We display the utility of our approach in several local and global graph clustering experiments. Our approach allows us to obtain improved community detection results on synthetic and real-world networks. We also show how our method can be used to measure the correlation between metadata attributes and community structure in social networks.

GRAPH CLUSTERING BACKGROUND
This section reviews the global LambdaCC [37] clustering objective and a local clustering objective that is based on regionally biased minimum cut computations [4,26]. While there do exist many other objectives for local and global clustering, we focus on these two as they both rely crucially on resolution parameters.
Basic Notation. In this paper we consider unweighted and undirected graphs $G = (V, E)$, though many of the ideas can be extended to weighted graphs. Global graph clustering separates $G$ into disjoint sets of nodes so that every node belongs to exactly one cluster. For local clustering, one is additionally given a set of reference or seed nodes $R \subset V$ and the objective is to find a good cluster that shares a nontrivial overlap with $R$. The degree of a node $i \in V$ is the number of edges incident to it; we denote this by $d_i$. The volume of a set $S \subseteq V$ is given by $\mathrm{vol}(S) = \sum_{i \in S} d_i$, and $\mathrm{cut}(S)$ measures how many edges cross from $S$ to its complement set $\bar{S} = V \setminus S$. Further notation will be presented as needed in the paper.
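As a concrete illustration of this notation, the following minimal Python sketch computes degrees, volume, and cut for an unweighted, undirected graph stored as an edge list. The helper names (`degrees`, `vol`, `cut`) and the toy graph are illustrative assumptions, not part of the original work.

```python
def degrees(nodes, edges):
    """Degree d_i of every node i: the number of edges incident to it."""
    deg = {v: 0 for v in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def vol(S, deg):
    """Volume of S: the sum of the degrees of the nodes in S."""
    return sum(deg[i] for i in S)


def cut(S, edges):
    """Number of edges with exactly one endpoint in S."""
    S = set(S)
    return sum(1 for u, v in edges if (u in S) != (v in S))


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
nodes = list(range(6))
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
deg = degrees(nodes, edges)
S = {0, 1, 2}
print(vol(S, deg), cut(S, edges))
```

Here the set S is one of the two triangles, so its cut consists only of the bridging edge.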

Global Clustering with LambdaCC
The LambdaCC objective is a special case of correlation clustering (CC) [5], a framework for partitioning signed graphs. In correlation clustering, each pair of nodes $(u, v)$ in a signed graph is associated with either a positive edge or a negative edge, as well as a nonnegative edge weight $e_{uv}$ indicating the strength of the relationship between $u$ and $v$. Given this input, the goal is to produce a clustering which minimizes the weight of disagreements or mistakes, which occur when positive edges are placed between clusters or negative edges are placed inside clusters.
The LambdaCC framework takes an unsigned graph $G = (V, E)$, a resolution parameter $\lambda \in (0, 1)$, and node weights $w_u$ for each $u \in V$. It converts this input into a signed graph over which the correlation clustering objective can then be minimized. The signed graph $\tilde{G} = (V, E^+, E^-)$ is constructed as follows: for every $(u, v) \in E$, if $1 - \lambda w_u w_v \ge 0$, form a positive edge $(u, v)$ in $\tilde{G}$, otherwise form a negative edge. In either case, the weight of this edge is $e_{uv} = |1 - \lambda w_u w_v|$. For every non-edge in the original graph ($(u, v) \notin E$), form a negative edge $(u, v) \in E^-$ in $\tilde{G}$ with weight $e_{uv} = \lambda w_u w_v$. The LambdaCC objective function then corresponds to the correlation clustering objective applied to $\tilde{G}$:
$$\min \; \sum_{(u,v) \in E^+} e_{uv}\,(1 - \delta_{uv}) + \sum_{(u,v) \in E^-} e_{uv}\,\delta_{uv}, \tag{1}$$
where $\delta_{uv}$ is a zero-one indicator function which encodes whether a clustering has placed nodes $u, v$ together ($\delta_{uv} = 1$), or apart ($\delta_{uv} = 0$). There are two main choices for node weights. Setting $w_u = 1$ for all $u \in V$ gives the standard LambdaCC objective. For this simple case, we note that a node pair $(u, v)$ which defines an edge in $G$ will always correspond to a positive edge in $\tilde{G}$. In some applications it is useful to consider a degree-weighted version in which $w_u = d_u$. In this case, if $\lambda \le 1/d_{\max}^2$ then we can still guarantee that $E = E^+$.
However, for larger values of $\lambda$ it may be possible that an edge in $G$ gets mapped to a negative edge in $\tilde{G}$. As a generalization of standard unweighted correlation clustering, LambdaCC is NP-hard, though many approximation algorithms and heuristics for correlation clustering have been developed [2,5,8,9]. In our previous work [37], we showed that a 3-approximation for standard LambdaCC can be obtained for any $\lambda \ge 1/2$ by rounding the following LP relaxation of objective (1):
$$\min \; \sum_{(u,v) \in E^+} e_{uv}\, x_{uv} + \sum_{(u,v) \in E^-} e_{uv}\,(1 - x_{uv}) \quad \text{subject to} \quad x_{uv} \le x_{uw} + x_{vw} \;\; \text{for all } u, v, w, \quad 0 \le x_{uv} \le 1 \;\; \text{for all } u, v. \tag{2}$$
Here $x_{uv}$ is a distance between nodes $u$ and $v$, with $x_{uv} = 0$ corresponding to placing the two nodes together. Furthermore, even when a priori approximations are not guaranteed, solving the LP relaxation can be a very useful way to obtain empirical lower bounds for the objective in polynomial time. In follow-up work [15], we provided improved approximations for $\lambda < 1/2$ based on rounding the LP, but noted an $\Omega(\log n)$ integrality gap for some small value of $\lambda$.
Equivalence Results. LambdaCC generalizes and unifies a large number of other clustering approaches. When λ = 1/(2|E|), the degree-weighted version is equivalent to the popular maximum modularity clustering objective [23,24]. Standard LambdaCC interpolates between the sparsest cut objective for a graph-dependent small value of λ, and the cluster deletion problem when λ > |E|/(1+ |E|). Given its relationship to modularity, LambdaCC is known to also be related to the stochastic block model [25] and a multi-cluster normalized cut objective [42].
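To make the signed-graph construction concrete, the following Python sketch evaluates the LambdaCC disagreement weight of a fixed clustering directly from the construction rules described above (positive or negative sign and weight $|1 - \lambda w_u w_v|$ for edges, negative sign and weight $\lambda w_u w_v$ for non-edges). The function name `lambdacc_objective`, the `cluster_of` dictionary encoding, and the toy graph are illustrative assumptions; the sketch enumerates all node pairs and is only meant for small examples.

```python
from itertools import combinations


def lambdacc_objective(nodes, edges, cluster_of, lam, w=None):
    """Total weight of disagreements of a clustering in the signed graph.

    cluster_of maps each node to a cluster label; w maps nodes to weights
    (unit weights, i.e. standard LambdaCC, when omitted).
    """
    if w is None:
        w = {u: 1.0 for u in nodes}
    edge_set = {frozenset(e) for e in edges}
    total = 0.0
    for u, v in combinations(nodes, 2):
        together = cluster_of[u] == cluster_of[v]
        if frozenset((u, v)) in edge_set:
            s = 1.0 - lam * w[u] * w[v]
            positive, weight = s >= 0, abs(s)
        else:
            positive, weight = False, lam * w[u] * w[v]
        # A mistake is a separated positive pair or a co-clustered negative pair.
        if positive != together:
            total += weight
    return total


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
nodes = list(range(6))
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {u: "a" for u in nodes}

for lam in (0.5, 0.05):
    print(lam,
          lambdacc_objective(nodes, edges, two_clusters, lam),
          lambdacc_objective(nodes, edges, one_cluster, lam))
```

On this toy instance the two-triangle clustering scores better at the larger resolution value and the single cluster scores better at the smaller one, which illustrates how the resolution parameter shifts the preferred cluster size.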

Local Clustering Objectives
We next consider a class of clustering objectives that share some similarities with (1), but are designed for finding a single local cluster in a specific region of a large graph. With the input graph $G = (V, E)$, we additionally identify a set of seed or reference nodes $R$ around which we wish to form a good community. One common measure for the "goodness" of a cluster $S$ is the conductance objective:
$$\phi(S) = \frac{\mathrm{cut}(S)}{\min\{\mathrm{vol}(S), \mathrm{vol}(\bar{S})\}}, \tag{3}$$
which is small when $S$ is connected very well internally but shares few edges with $\bar{S}$. A number of graph clustering algorithms have been designed to minimize local variants of (3). These substitute the denominator of (3) with a measure of the overlap between an output cluster $S$ and the reference set $R$. One such objective is the following local conductance measure:
$$\phi_R(S) = \frac{\mathrm{cut}(S)}{\mathrm{vol}(R \cap S) - \varepsilon\, \mathrm{vol}(\bar{R} \cap S)}, \tag{4}$$
which is minimized over all sets $S$ such that the denominator of $\phi_R(S)$ is positive. This objective includes a locality parameter $\varepsilon$ that controls how much overlap there should be between the seed set and output cluster. For a general overview of this clustering paradigm and its relationship to spectral and random-walk based techniques, we refer the reader to the work of Fountoulakis et al. [14]. Specific algorithms which minimize variants of (4) include FlowImprove [4], which always uses parameter $\varepsilon = \mathrm{vol}(R)/\mathrm{vol}(\bar{R})$, and LocalImprove [26] and SimpleLocal [35], both of which choose larger values of $\varepsilon$ in order to keep computations more local. In the extreme case where we consider $\varepsilon = \infty$, the problem reduces to finding the minimum conductance subset of a reference set $R$, which can be accomplished by the Minimum Quotient Improvement (MQI) algorithm of Lang and Rao [20]. Objective (4) can be efficiently minimized by repeatedly solving a minimum s-t cut problem on an auxiliary graph constructed from $G$, which introduces a source node $s$ attached to nodes in $R$, and a sink node $t$ attached to nodes in $\bar{R} = V \setminus R$. Edges are weighted with respect to the locality parameter $\varepsilon$ and another parameter $\alpha$.
In order to detect whether there exists some set $S$ with $\phi_R(S) \le \alpha$, one can solve a local clustering objective corresponding to the minimum s-t cut objective on the auxiliary graph. We refer to this simply as the local flow clustering objective:
$$f_\alpha(S) = \mathrm{cut}(S) + \alpha\, \mathrm{vol}(R \cap \bar{S}) + \alpha \varepsilon\, \mathrm{vol}(\bar{R} \cap S). \tag{5}$$
If the set $S$ minimizing $f_\alpha$ satisfies $f_\alpha(S) < \alpha\, \mathrm{vol}(R)$, then rearranging terms one can show that $\phi_R(S) < \alpha$. Thus, by performing binary search over $\alpha$ or repeatedly solving (5) for smaller and smaller $\alpha$, one can minimize the local conductance measure (4). Previous research has largely treated $\alpha$ as a temporary parameter used in one step of a larger algorithm seeking to minimize (4). Algorithms which minimize (4) do so by finding the smallest $\alpha$ such that the minimum of (5) is $\alpha\, \mathrm{vol}(R)$. We depart from this approach by instead treating $\alpha$ as a tunable resolution parameter for balancing two conflicting goals: finding clusters with a small cut, and finding clusters that have a large overlap with the seed set $R$. In the case where $\varepsilon$ is treated as infinitely large and we are simply looking for subsets of a seed set $R$ satisfying $\mathrm{vol}(R) \le \mathrm{vol}(\bar{R})$, then in effect we are trying to solve the optimization problem:
$$\min_{S \subseteq R} \; \mathrm{cut}(S) - \alpha\, \mathrm{vol}(S). \tag{6}$$
This goal is related to, but ultimately should be contrasted with, the goal of minimizing the ratio $\mathrm{cut}(S)/\mathrm{vol}(S)$. The objectives are similar in that they both tend to prefer sets with small cut and large volume. We argue that treating $\alpha$ as a tunable parameter is in fact more versatile than simply minimizing the ratio score. In multiple applications it may be useful to find clusters with small cut and large volume, but different applications may put a different weight on each aspect of the objective. We observe that $\varepsilon$ also plays an important role in the size and structure of the output community when it is less than $\infty$. For simplicity, in this paper we treat $\varepsilon$ as a fixed constant, and in our experimental section we simply focus on objective (6).
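The rearrangement mentioned above can be written out step by step. The derivation below is a sketch that assumes objective (5) takes the form $f_\alpha(S) = \mathrm{cut}(S) + \alpha\,\mathrm{vol}(R \cap \bar{S}) + \alpha\varepsilon\,\mathrm{vol}(\bar{R} \cap S)$ and that the local conductance (4) is $\phi_R(S) = \mathrm{cut}(S)/(\mathrm{vol}(R \cap S) - \varepsilon\,\mathrm{vol}(\bar{R} \cap S))$, with $\alpha > 0$. It uses only the identity $\mathrm{vol}(R) = \mathrm{vol}(R \cap S) + \mathrm{vol}(R \cap \bar{S})$.

```latex
\begin{align*}
f_\alpha(S) < \alpha\,\mathrm{vol}(R)
 &\iff \mathrm{cut}(S) + \alpha\,\mathrm{vol}(R \cap \bar{S})
       + \alpha\varepsilon\,\mathrm{vol}(\bar{R} \cap S) < \alpha\,\mathrm{vol}(R) \\
 &\iff \mathrm{cut}(S) < \alpha\bigl(\mathrm{vol}(R) - \mathrm{vol}(R \cap \bar{S})
       - \varepsilon\,\mathrm{vol}(\bar{R} \cap S)\bigr) \\
 &\iff \mathrm{cut}(S) < \alpha\bigl(\mathrm{vol}(R \cap S)
       - \varepsilon\,\mathrm{vol}(\bar{R} \cap S)\bigr) \\
 &\iff \phi_R(S) < \alpha .
\end{align*}
```

The last step divides by the denominator of $\phi_R(S)$, which is valid because $\mathrm{cut}(S) \ge 0$ and $\alpha > 0$ force that denominator to be positive whenever the third inequality holds.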

Parametric Linear Programs
Before moving on we provide key background on parametric linear programming which will be important in our theoretical results. A standard linear program is a problem of the form
$$\min_x \; c^T x \quad \text{subject to} \quad Ax \le b, \tag{7}$$
where $c$, $b$ are vectors and $A$ is a constraint matrix. A parametric linear program is a related problem of the form
$$\min_x \; c^T x + \beta\, (\Delta c)^T x \quad \text{subject to} \quad Ax \le b, \tag{8}$$
where $\Delta c$ is another vector of the same length as $c$ and $\beta$ is a parameter controlling the difference between (7) and (8). We state a well-known result about the solutions of (8) for different $\beta$. This result is not new; it follows directly from Proposition 2.3b from [1].
Theorem 1. Let $L(\beta)$ be the minimum of (8) for a fixed $\beta$. If we are given bounds $a$ and $b$ such that $L(\beta) \in \mathbb{R}$ for all $\beta \in [a, b]$, then $L$ is a piecewise linear and concave function in $\beta$ over this interval.
Parametric LPs in Graph Clustering Applications. In our work it is significant to note that the linear programming relaxation of LambdaCC is a parametric linear program in λ. Furthermore, the local flow clustering objective can be cast as a parametric linear program in α, since this objective corresponds simply to a special case of the minimum s-t cut problem, which can be cast as an LP.
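The intuition behind Theorem 1 is that, for a bounded feasible region, the optimum of a linear program is attained at a vertex, so the optimal value as a function of the parameter is a pointwise minimum of finitely many linear functions. The Python sketch below illustrates this on a hypothetical toy instance whose vertices are listed by hand rather than obtained from an LP solver; the names `parametric_min`, `L`, and `is_midpoint_concave` are illustrative assumptions.

```python
def parametric_min(vertices, c, dc, beta):
    """Minimum over the listed vertices x of (c + beta * dc) . x."""
    return min(
        sum((ci + beta * di) * xi for ci, di, xi in zip(c, dc, x))
        for x in vertices
    )


# Hypothetical toy instance: the feasible region is the unit square,
# represented by its four vertices.
vertices = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
c = (1.0, -1.0)
dc = (-1.0, 1.0)


def L(beta):
    """Optimal value of the toy parametric program for a fixed beta."""
    return parametric_min(vertices, c, dc, beta)


def is_midpoint_concave(f, points, tol=1e-12):
    """Check f((p + q) / 2) >= (f(p) + f(q)) / 2 for all pairs of points."""
    return all(
        f((p + q) / 2) >= (f(p) + f(q)) / 2 - tol
        for p in points for q in points
    )


betas = [k / 4 for k in range(13)]  # 0.0, 0.25, ..., 3.0
print([L(b) for b in betas])
print(is_midpoint_concave(L, betas))
```

For this instance the optimal value works out to $-|1 - \beta|$, which is piecewise linear with a single breakpoint at $\beta = 1$ and concave, as the theorem predicts.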

Related Work
Our work builds on previous results that introduced generalized objective functions with resolution parameters, including the Hamiltonian objective [29], clustering stability [10], a multiscale variant of the map equation [32], and the LambdaCC framework [37]. Recently Jeub et al. [17] introduced a technique for sampling values of a resolution parameter and applying hierarchical consensus clustering techniques. Our work on learning clustering resolution parameters differs from theirs in that we do not aim to provide hierarchical clusterings of a network. Instead we assume that there is a known fixed clustering, for which we wish to learn a single specific resolution parameter.
There exist many techniques for localized community detection based on seed set expansion. Among numerous others, these include spectral and random-walk based methods [3,33], flow-based methods [4,20,26,35], and other approaches which perform diffusions from a set of seed nodes and round embeddings via a sweep cut procedure [18,39]. We build on these by interpreting hyperparameters associated with such methods as resolution parameters which can be learned to produce clusters of a certain type.

THEORETICAL RESULTS
The major theoretical contribution of our work is a new framework for learning clustering resolution parameters based on minimizing a parameter fitness function for a given example clustering. We present results for a generic clustering objective and fitness function, and later show how to apply our results to LambdaCC and local flow clustering.

Problem Formulation
Let $\mathcal{C}$ denote a set of valid clusterings for a graph $G = (V, E)$. We consider a generic clustering objective function $f_\beta : \mathcal{C} \to \mathbb{R}_{\ge 0}$ that depends on a resolution parameter $\beta$. The function takes as input a clustering $C \in \mathcal{C}$, and outputs a nonnegative clustering quality score for $C$. We assume that smaller values of $f_\beta$ are better.
We intentionally allow $f_\beta$ to be very general in order to develop broadly applicable theory. For intuition, one can think of $f_\beta$ as being the LambdaCC function (1) with $\beta = \lambda$. Alternatively, one can picture $f_\beta$ to be the local flow objective (5) with $\beta = \alpha$ and with $\mathcal{C}$ representing the set of bipartitions, i.e., for any $C \in \mathcal{C}$, $C = \{S, \bar{S}\}$ for some set $S \subset V$.
Given some objective function $f_\beta$, a standard clustering paradigm is to assume that an appropriate value of $\beta$ has already been chosen, and then the goal is to produce some clustering $C$ that exactly or approximately minimizes $f_\beta$. In our work, we address an inverse question: given an example clustering $C_x$, how do we determine a parameter $\beta$ such that $C_x$ approximately minimizes $f_\beta$? Ideally we would like to solve the following problem:
$$\text{Goal 1: Find } \beta \text{ such that } f_\beta(C_x) \le f_\beta(C) \text{ for all } C \in \mathcal{C}. \tag{9}$$
In practice, however, $C_x$ may not exactly minimize a generic clustering objective for any choice of resolution parameter. Thus we relax this to a more general and useful goal:
$$\text{Goal 2: Find the minimum } \Delta \ge 1 \text{ and } \beta \text{ such that } f_\beta(C_x) \le \Delta f_\beta(C) \text{ for all } C \in \mathcal{C}. \tag{10}$$
This second goal is motivated by the study of approximation algorithms for clustering. In effect this asks: if we are given a certain clustering $C_x$, is $C_x$ a good approximation to $f_\beta$ for any choice of $\beta$? Note that this generalizes (9): if $\beta$ can be chosen to satisfy Goal 1, then the same $\beta$ will satisfy Goal 2 with $\Delta = 1$. Furthermore, it has the added advantage that, if solved, Goal 2 will produce a value $\Delta$ which communicates how well clusterings like $C_x$ can be detected using variants of the objective function $f_\beta$. If $\Delta$ is near 1, it means that $f_\beta$ is able to produce similar clusterings for a correct choice of $\beta$, whereas if $\Delta$ is very large this indicates that $C_x$ will be difficult to find even for an optimal $\beta$, and thus a different approach will be necessary for detecting clusterings of this type.
Clustering Relaxations. While Goal 2 is a more reasonable target than Goal 1, it may still be a very challenging problem to solve when objective $f_\beta$ is hard to optimize, e.g., if it is NP-hard. We thus consider one final relaxation that is slightly weaker than (10), but will be more feasible to work with. Let $\hat{\mathcal{C}}$ denote a superset of $\mathcal{C}$ which includes not only clusterings for $G$, but also some notion of a relaxed clustering, and let $g_\beta : \hat{\mathcal{C}} \to \mathbb{R}_{\ge 0}$ be an objective that assigns a score for every $C \in \hat{\mathcal{C}}$. Furthermore, assume $g_\beta$ represents a lower bound function for $f_\beta$: $g_\beta(C) \le f_\beta(C)$ for all $\beta$ and all $C \in \mathcal{C}$. Our consideration of $g_\beta$ is motivated by the fact that many NP-hard clustering objectives permit convex relaxations, which can be optimized in polynomial time over a larger set of relaxed clusterings that contain all valid clusterings of $G$ as a subset. For example, the LambdaCC objective is NP-hard to optimize for every $\lambda \in (0, 1)$, but the linear programming relaxation for every $\lambda$ can be solved in polynomial time, and is defined over relaxed clusterings in which pairs of nodes are assigned distances between 0 and 1. These relaxations can be rounded to produce good approximations to the original NP-hard objective [8,9]. Since $g_\beta$ is easier to optimize than $f_\beta$, the following goal will be easier to approach but still provide strong guarantees for learning a good value of $\beta$:
$$\text{Goal 3: Find the minimum } \Delta \ge 1 \text{ and } \beta \text{ such that } f_\beta(C_x) \le \Delta\, g_\beta(C) \text{ for all } C \in \hat{\mathcal{C}}. \tag{11}$$
If we can solve (11), this still guarantees that $C_x$ is a $\Delta$-approximation to $f_\beta$ for an appropriately chosen $\beta$. For problems where $f_\beta$ is very challenging to optimize, but $g_\beta$ is not, this will be a much more feasible approach. In the next section we will focus on developing theory for addressing Goal 3, though we note that in applying this theory we can still choose $g_\beta = f_\beta$ and therefore instead address the stronger Goal 2 whenever this is feasible. We will take this approach when applying our theory to the local flow objective.
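The approximation guarantee claimed for Goal 3 follows from a short chain of inequalities. As a sketch, assume Goal 3 yields $\Delta$ and $\beta$ with $f_\beta(C_x) \le \Delta\, g_\beta(C)$ for every relaxed clustering $C$. Restricting to valid clusterings and applying the lower-bound property of $g_\beta$ gives:

```latex
\[
f_\beta(C_x) \;\le\; \Delta\, g_\beta(C) \;\le\; \Delta\, f_\beta(C)
\qquad \text{for all } C \in \mathcal{C} \subseteq \hat{\mathcal{C}},
\qquad \text{hence} \qquad
f_\beta(C_x) \;\le\; \Delta \min_{C \in \mathcal{C}} f_\beta(C).
\]
```

The price of the relaxation is that the $\Delta$ obtained this way may be larger than the best value achievable under Goal 2.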

Parameter Fitness Function
We now present a parameter fitness function whose minimization is equivalent to solving (11). Functions $f_\beta$ and $g_\beta$ take a clustering or relaxed clustering as input and output an objective score. However, we wish to view $\beta$ as an input parameter and we treat an example clustering $C_x$ as a fixed input. Thus for convenience we introduce new related functions:
$$F(\beta) = f_\beta(C_x), \tag{12}$$
$$G(\beta) = \min_{C \in \hat{\mathcal{C}}} g_\beta(C). \tag{13}$$
The ratio of these two functions defines the parameter fitness function that we seek to minimize:
$$P(\beta) = \frac{F(\beta)}{G(\beta)}. \tag{14}$$
Observe that this function is always greater than or equal to 1 since $G(\beta) \le F(\beta)$ for any $\beta$. The minimizer of $P$ is a resolution parameter $\beta$ that minimizes the ratio between the clustering score of a fixed $C_x$ and a lower bound on $f_\beta$. Thus, by minimizing (14) we achieve Goal 3 in (11) with $\Delta = \min_\beta P(\beta)$. In Section 2.3, we noted that the local flow clustering objective can be characterized as a parametric linear program, as can the LP relaxation of LambdaCC. Furthermore, for a fixed clustering, both objective functions can be viewed as a linear function in terms of their resolution parameter. Motivated by these facts, we present a theorem which characterizes the behavior of the parameter fitness function $P$ under certain reasonable conditions on the functions $F$ and $G$. In the subsequent section we will use this result to show that $P$ can be minimized to within arbitrary precision using an efficient bisection-like method.
Theorem 2. Assume $F(\beta) = a + b\beta$ for nonzero real numbers $a$ and $b$. Let $G$ be concave and piecewise linear in $\beta$, and assume $F(\beta) \ge G(\beta) \ge 0$ for all $\beta \in [\ell, r]$ where $\ell$ and $r$ are nonnegative lower and upper (i.e., left and right) bounds for $\beta$. Then $P$ satisfies the following two properties:
(a) If $\beta^- < \beta < \beta^+$, then $P(\beta)$ cannot be strictly greater than both $P(\beta^-)$ and $P(\beta^+)$.
(b) If $P(\beta^-) = P(\beta^+)$, then $P$ attains its minimum on $[\beta^-, \beta^+]$.
Using property (a), we know as $\beta$ increases from its lower to upper limit, $P$ cannot increase and then decrease. Thus, either $P$ attains its minimum on $[\beta^-, \beta^+]$, or else $P$ is a constant for all $\beta \in [\beta^-, \beta^+]$. If the latter is true, then for some $\beta \in [\beta^-, \beta^+]$ and some sufficiently small $\epsilon > 0$, $G$ must be linear in the range $(\beta - \epsilon, \beta + \epsilon)$, since we know that $G$ is piecewise linear. Therefore, $G(\beta) = c + d\beta$ for $\beta \in (\beta - \epsilon, \beta + \epsilon)$ and for some $c, d \in \mathbb{R}$. This ratio of linear functions can only be a constant if $a = c = 0$, or $b = d = 0$, or if $a = c$ and $b = d$. Since we assumed $a$ and $b$ were nonzero, the last case must hold, and thus $P(\beta) = 1$ for every $\beta \in [\beta^-, \beta^+]$, so the minimizer is obtained in this case, since $P(\beta) \ge 1$ for all $\beta$. □
In the next section we present a method for finding the minimizer of a function satisfying properties (a) and (b) in Theorem 2 to within arbitrary precision. Before doing so, we highlight the importance of ensuring that both properties hold. In Figure 1 we plot two toy functions, $P$ and $Q$. Although both satisfy property (a), only $P$ additionally satisfies (b). Assume we do not have explicit representations of either function, but we can query them at specific points to help find their minimizers. Consider Figure 1. If we query $P$ at points $\beta_1$ and $\beta_2$ to find that $P(\beta_1) = P(\beta_2)$, then choosing any third point $\beta_3 \in (\beta_1, \beta_2)$ will get us closer to the minimizer. However, if $Q(\beta_1) = Q(\beta_2)$ for some $\beta_1, \beta_2$, we cannot be sure these points are not part of a flat region of $Q$ somewhere far from the minimizer. It thus becomes unclear how to choose a third point $\beta_3$ at which to query $Q$. If we choose some $\beta_3 \in (\beta_1, \beta_2)$ and find that $Q(\beta_3) = Q(\beta_2) = Q(\beta_1)$, the minimizer may be within $[\beta_1, \beta_3]$, within $[\beta_3, \beta_2]$, or in a completely different region.
Thus it is important for the denominator of a parameter fitness function to be piecewise linear in addition to being concave, since this piecewise linear assumption guarantees property (b) will hold.
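To see what such a fitness function looks like, the following Python sketch builds a toy instance with a linear numerator and a concave piecewise linear denominator (a pointwise minimum of lines), and then checks property (a) numerically on a grid. The coefficients and the helper names `make_fitness` and `violates_property_a` are hypothetical choices for illustration; they do not come from any experiment in this paper.

```python
def make_fitness(a, b, pieces):
    """Return P(beta) = (a + b*beta) / min_i (c_i + d_i*beta).

    pieces is a list of (c_i, d_i) pairs; their pointwise minimum is a
    concave, piecewise linear denominator G.
    """
    def P(beta):
        G = min(c + d * beta for c, d in pieces)
        return (a + b * beta) / G
    return P


def violates_property_a(vals):
    """True if some interior value is strictly greater than both an
    earlier value and a later value in the sampled sequence."""
    for j in range(1, len(vals) - 1):
        if vals[j] > min(vals[:j]) and vals[j] > min(vals[j + 1:]):
            return True
    return False


# Hypothetical toy instance: F(beta) = 2 + beta, G(beta) = min(3*beta, 1.5 + 0.5*beta).
P = make_fitness(2.0, 1.0, [(0.0, 3.0), (1.5, 0.5)])
grid = [0.1 + 0.05 * k for k in range(79)]  # 0.1, 0.15, ..., 4.0
vals = [P(beta) for beta in grid]

best = grid[vals.index(min(vals))]
print(best, min(vals), violates_property_a(vals))
```

On this instance the sampled values decrease up to the breakpoint of the denominator and increase afterwards, so the grid minimum sits at the breakpoint.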

Minimizing P
We now outline an approach for finding a minimizer of $P$ to within arbitrary precision when Theorem 2 holds. Our approach is closely related to the standard bisection method for finding zeros of a continuous function $f$. Recall that standard bisection starts with $a$ and $b$ such that $\mathrm{sign}(f(a)) \ne \mathrm{sign}(f(b))$, and then computes $f(c)$ where $c = (a + b)/2$. Checking the sign of $f(c)$ allows one to determine whether the zero of $f$ is located within the interval $[a, c]$ or $[c, b]$. Thus each new query of the function $f$ halves the interval in which a zero must be located.
Assume $P$ satisfies properties (a) and (b) in Theorem 2 over an interval $[\ell, r]$. To satisfy Goal 3, given in (11) in Section 3.1, it suffices to find any minimizer of $P$, which we do by repeatedly halving the interval in which the minimizers of $P$ must lie. Our approach differs from standard bisection in that we are trying to find a minimizer instead of the zero of some function. The key algorithmic difference is that querying $P$ at a single point between two bounds will not always be sufficient to cut the search space in half. Consider Figure 2. Our method starts in a one-branch phase in which we know a minimizer lies between $\ell$ and $r$. If we compute $m = (\ell + r)/2$ and find that $P(m)$ is between $P(\ell)$ and $P(r)$, this does in fact automatically cut our search space in half, as this implies that $P$ is monotonic on either $[\ell, m]$ or $[m, r]$. However, if $P(m) < \min\{P(\ell), P(r)\}$, then it is possible for the minimizer to reside within either the left branch $[\ell, m]$ or the right branch $[m, r]$. In this case, the method enters a two-branch phase in which it takes the midpoint of each branch ($\ell_{\mathrm{mid}} = (\ell + m)/2$ and $r_{\mathrm{mid}} = (m + r)/2$) and evaluates $P(\ell_{\mathrm{mid}})$ and $P(r_{\mathrm{mid}})$. If $P$ returns the same value for two of the inputs (e.g., $P(\ell_{\mathrm{mid}}) = P(m)$), then by property (b) we have found a new interval containing the minimizer(s) of $P$ that is at most half the length of $[\ell, r]$. Otherwise, we can use property (a) to deduce that the minimizer will be located within $[\ell, m]$, $[m, r]$, or $[\ell_{\mathrm{mid}}, r_{\mathrm{mid}}]$, and we recurse on the two-branch phase.
Algorithms 1 and 2 handle the one- and two-branch phases of the method respectively. The guarantees of our method are summarized in Theorem 3. We omit the full proof, since it follows directly from considering different simple cases and applying properties of $P$ to halve the search space as outlined above.
Theorem 3. Consider a fixed clustering $C_x$ and a corresponding parameter fitness function $P_{C_x}$ satisfying the assumptions of Theorem 2. Running Algorithm 1 with input $\ell$, $r$ and a tolerance $\epsilon$ will produce a resolution parameter $\beta$ that is within $\epsilon$ of the minimizer of $P_{C_x}$ over the interval $[\ell, r]$, in at most $\log_2((r - \ell)/\epsilon)$ recursive calls.
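The two phases described above can be sketched in Python as follows. The one-branch function follows the case analysis of Algorithm 1; the two-branch function is a reconstruction from the prose description only (Algorithm 2 itself is not reproduced here), so its exact case order is an assumption. The function names and the toy fitness functions are likewise hypothetical. The sketch compares floating-point values exactly and re-queries the fitness function at points it has already visited, both of which a practical implementation would likely avoid by using a tolerance and by caching, since each query may require solving an LP or a minimum cut problem.

```python
def check_one_branch(P, lo, hi, eps):
    """One-branch phase: a minimizer of P is known to lie in [lo, hi]."""
    if hi - lo < eps:
        return lo
    m = (lo + hi) / 2
    p_lo, p_m, p_hi = P(lo), P(m), P(hi)
    if p_lo == p_m == p_hi:
        return m
    if p_lo <= p_m < p_hi:
        return check_one_branch(P, lo, m, eps)
    if p_lo > p_m >= p_hi:
        return check_one_branch(P, m, hi, eps)
    if p_m < p_lo and p_m < p_hi:
        return check_two_branches(P, lo, m, hi, eps)
    raise ValueError("P does not satisfy properties (a) and (b) on this interval")


def check_two_branches(P, lo, m, hi, eps):
    """Two-branch phase, assuming P(m) < min(P(lo), P(hi)) and m = (lo + hi) / 2."""
    if hi - lo < eps:
        return m
    l_mid, r_mid = (lo + m) / 2, (m + hi) / 2
    p_m, p_l, p_r = P(m), P(l_mid), P(r_mid)
    if p_l < p_m:   # a strictly better point on the left: minimizer in [lo, m]
        return check_two_branches(P, lo, l_mid, m, eps)
    if p_r < p_m:   # a strictly better point on the right: minimizer in [m, hi]
        return check_two_branches(P, m, r_mid, hi, eps)
    if p_l == p_m:  # equal values: property (b) gives a minimizer in [l_mid, m]
        return check_one_branch(P, l_mid, m, eps)
    if p_r == p_m:  # equal values: property (b) gives a minimizer in [m, r_mid]
        return check_one_branch(P, m, r_mid, eps)
    # m is strictly better than both midpoints: minimizer in [l_mid, r_mid]
    return check_two_branches(P, l_mid, m, r_mid, eps)


def toy_fitness(beta):
    """Hypothetical fitness: (2 + beta) / min(3*beta, 1.5 + 0.5*beta)."""
    return (2.0 + beta) / min(3.0 * beta, 1.5 + 0.5 * beta)


def toy_increasing(beta):
    """Hypothetical fitness that is increasing on [1, 4]."""
    return (2.0 + beta) / (1.5 + 0.5 * beta)


beta_hat = check_one_branch(toy_fitness, 0.1, 4.0, 1e-6)
beta_left = check_one_branch(toy_increasing, 1.0, 4.0, 1e-6)
print(beta_hat, beta_left)
```

In every branch the new search interval is at most half as long as the previous one, which matches the logarithmic bound on the number of recursive calls stated in Theorem 3.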

APPLICATION TO SPECIFIC OBJECTIVES
Theorem 2 and our approach for minimizing P can be immediately applied to learn resolution parameters for the LambdaCC global clustering objective and the local flow clustering objective.

Local Clustering
For local clustering we consider the objective function $f_\alpha$ given in (5) and note that the set of valid clusterings $\mathcal{C}$ is the set of bipartitions. The example clustering we are given at the outset of the problem is $C_x = \{X, \bar{X}\}$ where $X \subset V$ is some nontrivial set of nodes representing a "good" cluster for a given application. We assume we

Algorithm 1 CheckOneBranch(ℓ, r, ϵ)
    Base case:
        if r − ℓ < ϵ then return ℓ
    Recursive call:
        Midpoint: m = (ℓ + r)/2
        switch ℓ, m, r do
            case P(ℓ) = P(m) = P(r): return m
            case P(ℓ) ≤ P(m) < P(r): return CheckOneBranch(ℓ, m, ϵ)
            case P(ℓ) > P(m) ≥ P(r): return CheckOneBranch(m, r, ϵ)
            case P(ℓ) > P(m) < P(r): return CheckTwoBranches(ℓ, m, r, ϵ)
If we focus on finding clusters that are subsets of R, using objective (6), we have a simplified fitness function: .
When we apply Algorithm 1 to minimize (16) or (17), we can query $P_X$ in the time it takes to evaluate a linear function and the time it takes to solve the s-t cut problem (5). This can be done extremely quickly using localized min-cut computations [20,26,35,38]. Functions (16) and (17) should be minimized over $\alpha \in [\alpha^*, \mathrm{cut}(R)]$, where $\alpha^*$ is either the minimum of (4) if we are minimizing (16), or is the minimum conductance for a subset of $R$ if we are minimizing (17). One can show that for any $\alpha$ outside this range, objectives (5) and (6) will be trivially minimized by $S = R$, so it is not meaningful to optimize these objectives for these $\alpha$. In practice one can additionally set stricter upper and lower bounds if desired.

Global Clustering Approach
We separately consider the standard and degree-weighted versions of LambdaCC when applying Theorem 2 to global graph clustering.
Standard LambdaCC. For the standard objective, it is useful to consider the scaled version of LambdaCC obtained by dividing (1) by 1 − λ and substituting a new resolution parameter γ = λ/(1 − λ). Then the objective is
minimize Σ_{(u,v) ∈ E} x_uv + γ Σ_{(u,v) ∉ E} (1 − x_uv), (18)
where x_uv = 1 if nodes u and v are separated and x_uv = 0 if they are clustered together. The denominator of the parameter fitness function for this scaled LambdaCC problem would be
G(γ) = min_{x ∈ X} [ Σ_{(u,v) ∈ E} x_uv + γ Σ_{(u,v) ∉ E} (1 − x_uv) ],
where X represents the set of linear constraints for the linear program (2). Note that G(γ) will be finite for every γ ≥ 0, so Theorem 1 holds. Thus G is concave and piecewise linear as required by Theorem 2.
Next, for a fixed clustering C x, let P x be the number of positive mistakes (pairs of nodes that are separated despite sharing an edge) and N x be the number of negative mistakes (pairs of nodes that are clustered together but share no edge). Then objective (18) for this clustering is P x + γ N x, and we see that this fits the linear form given in Theorem 2 as long as the example clustering satisfies P x > 0 and N x > 0, which will be the case for nearly any nontrivial clustering one might consider.
Finally, note that the parameter fitness function for (18) would be exactly the same as the parameter fitness function for the standard LambdaCC objective, since scaling by (1 − λ) makes no difference if we are going to minimize the ratio between the clustering objective and its LP relaxation. The parameter fitness function for standard LambdaCC is therefore
P C x (λ) = (P x + γ N x) / G(γ), with γ = λ/(1 − λ),
and it satisfies the assumptions of Theorem 2 as long as P x > 0, N x > 0, and we optimize over λ ∈ (0, 1).
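As a small illustration of the quantities P x and N x defined above, the following Python sketch counts positive and negative mistakes for an example clustering and evaluates the linear numerator P x + γ N x. The function names and the input format (an edge list with each undirected edge listed once, and a list of cluster labels indexed by node) are assumptions made for this example.

```python
from collections import Counter
from math import comb


def lambdacc_mistakes(edges, labels):
    """Return (P_x, N_x) for a simple undirected graph.

    edges:  list of (u, v) pairs, each undirected edge listed once
    labels: labels[u] is the cluster of node u
    """
    positive = 0          # edges whose endpoints are separated
    internal_edges = 0    # edges inside a cluster
    for u, v in edges:
        if labels[u] == labels[v]:
            internal_edges += 1
        else:
            positive += 1
    # Pairs placed together, minus those that actually share an edge.
    same_cluster_pairs = sum(comb(size, 2) for size in Counter(labels).values())
    negative = same_cluster_pairs - internal_edges
    return positive, negative


def scaled_numerator(p_x, n_x, lam):
    """Numerator P_x + gamma * N_x with gamma = lam / (1 - lam), 0 < lam < 1."""
    gamma = lam / (1 - lam)
    return p_x + gamma * n_x
```

On the path 0-1-2-3 with labels `[0, 0, 0, 1]`, the separated edge (2, 3) is the only positive mistake and the non-adjacent pair (0, 2) is the only negative mistake.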
Degree-weighted LambdaCC. Showing how Theorem 2 applies to degree-weighted LambdaCC requires slightly more work, though the same basic principles hold. The LP relaxation of the objective is still a parametric linear program, and thus is still concave and piecewise linear in λ over the interval (0, 1). The denominator of the parameter fitness function in this case is the optimal value of this LP relaxation, where e uv is defined in the degree-weighted fashion (see Section 2.1). For a fixed example clustering C x encoded by a function δ x = (δ uv), we can rearrange the objective value into the form a + λb, where a and b are sums over node pairs (u, v) that depend on the graph and on δ uv. These values are simple to compute, and as long as they are both nonzero, the results of Theorem 2 apply. In some extreme cases it is possible that a = 0 or b = 0, but we expect this to be rare. Furthermore, our general approach may still work even when a = 0 or b = 0; Theorem 2 simply does not analyze this case. We leave it as future work to develop more refined sufficient and necessary conditions such that Algorithm 1 is guaranteed to minimize P.
Figure caption (right panel): One of the 5 LFR test graphs for µ = 0.3. Modularity (λ = 1/(2|E|)) makes mistakes by putting distinct ground truth clusters together (highlighted). For this example our approach perfectly detects the ground truth.

EXPERIMENTS
We consider several local and global clustering experiments in which significant benefit can be gained from learning resolution parameters rather than using previous off-the-shelf algorithms and objective functions. We implement Algorithms 1 and 2 in the Julia programming language for both local and global parameter fitness functions. Computing the LambdaCC linear programming relaxation can be challenging due to the size of the constraint set. For our smaller graphs we apply Gurobi optimization software, and for larger problems we use recently developed memory-efficient projection methods [30,36]. For the local-flow objective we use a fast Julia implementation we developed in recent work [38]. Our experiments were run on a machine with two Intel Xeon E5-2690 v4 processors. Code for our experiments and algorithms is available at https://github/nveldt/LearnResParams.

Learning Parameters for Synthetic Datasets
Although modularity is a widely applied objective function for community detection, Fortunato and Barthélemy [12] demonstrated that it is unable to accurately detect communities below a certain size threshold in a graph. In our first experiment we demonstrate that learning resolution parameters for LambdaCC allows us to overcome the resolution limit of modularity, and better detect community structure in synthetic networks. We generate a large number of synthetic LFR benchmark graphs [19], in a parameter regime that is chosen to be difficult for modularity. All graphs contain 200 nodes, average degree 10, max degree 20, and community sizes between 5 and 20 nodes. We test a range of values for the mixing parameter µ, which controls the fraction of edges that connect nodes in different communities (µ = 0 means all edges are inside the communities).
For each µ from 0.2 to 0.5, in increments of 0.05, we generate six LFR networks, one for training and five for testing. On the training graph, we minimize the degree-weighted LambdaCC parameter fitness function to learn a resolution parameter λ best. This takes from roughly half an hour (for µ = 0.2) to just over three hours (for µ = 0.5), solving the underlying LambdaCC LP with Gurobi software. We then cluster the five test LFR examples using a generalized version of the Louvain method [7], as implemented by Jeub et al. [16]. We separately run the method with two resolution parameters: λ = 1/(2|E|), the standard setting for modularity, and λ = λ best. Learning λ best significantly improves adjusted Rand index (ARI) scores for detecting the ground truth (see Figure 3).
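For readers unfamiliar with the evaluation metric, the adjusted Rand index can be computed from the contingency counts of two labelings. The sketch below uses the standard pair-counting formula; it is an illustration only and is not taken from the authors' code. It assumes at least two nodes and returns 1.0 in the degenerate case where the expected and maximum index coincide.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, pred):
    """Adjusted Rand index between two labelings of the same nodes."""
    n = len(truth)
    # Pairs that are together in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    # Pairs that are together in each labeling separately.
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0   # degenerate case, e.g. both labelings are trivial
    return (sum_cells - expected) / (max_index - expected)
```

Identical labelings score 1, while labelings that agree no better than chance score near 0 and can be negative.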

Local Community Detection
Next we demonstrate that a small amount of semi-supervised information about target communities in real-world networks can allow us to learn good resolution parameters, leading to more robust community identification. Additionally, minimizing the parameter fitness function provides a way to measure the extent to which functional communities in a network correspond to topological notions of community structure in networks.
Data. We consider four undirected networks, DBLP, Amazon, Orkut, LiveJournal, which are all available on the SNAP repository [22], and come with sets of nodes that can be identified as "functional communities" (see Yang and Leskovec [40]). For example, members of the social network Orkut may explicitly identify as being part of a user-formed group. Such user groups can be viewed simply as metadata about the network, though these still correspond to some notion of community organization that may be desirable to detect. Following an approach taken in previous work [38], we specifically consider the ten largest communities from the 5000 best functional communities as identified by Yang and Leskovec [40].
Experimental Setup and Results. We treat each functional community as an example cluster X. We build a superset of nodes R by growing X with a breadth-first search until we have a superset of size 5|X|, breaking ties arbitrarily. The size of R is chosen so that it comprises a localized region of a large graph, but is still significantly larger than the target cluster X hidden inside of it. We compare two approaches for detecting X within R. As a baseline approach we extract the best conductance subset of R. Then as our new approach we assume we are given cut(X) and vol(X) as additional semi-supervised information. This allows us to minimize the parameter fitness function (17), without knowing what X is. This outputs a resolution parameter α X, and we then minimize cut(S) − α X vol(S) + α X vol(R) over S ⊆ R to output a set S X. Table 1 reports conductance, set size, runtimes, and F1 scores for both approaches, averaged over the ten communities in each network. Learning resolution parameters leads to significantly better F1 scores on every dataset. Additionally, learning resolution parameters for local clustering can be done much more quickly than learning λ for LambdaCC.
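The construction of R and the F1 evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the graph is an adjacency dictionary `adj` mapping each node to a list of neighbors, the search is a multi-source breadth-first search started from all of X, and ties are broken by queue order. The function names are hypothetical.

```python
from collections import deque


def grow_superset(adj, X, factor=5):
    """Grow X by breadth-first search until |R| reaches factor * |X|."""
    target = factor * len(X)
    R = set(X)
    queue = deque(sorted(X))       # multi-source BFS from every node of X
    while queue and len(R) < target:
        u = queue.popleft()
        for v in adj[u]:
            if v not in R:
                R.add(v)
                queue.append(v)
                if len(R) >= target:
                    break          # stop as soon as the target size is hit
    return R


def f1_score(found, target):
    """F1 score between a detected node set and a target node set."""
    found, target = set(found), set(target)
    overlap = len(found & target)
    if overlap == 0:
        return 0.0
    precision = overlap / len(found)
    recall = overlap / len(target)
    return 2 * precision * recall / (precision + recall)
```

If the connected component containing X has fewer than `factor * |X|` nodes, the search simply returns the whole component.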
New Insights. In addition to improving semi-supervised community detection, minimizing P X allows us to measure how well a functional community matches the topological notion of a cluster. Figure 4 shows a scatter plot of F1 community recovery scores against the minimum of P X for each experiment from Table 1. We note a downward sloping trend: small values of P X near 1 tend to indicate that a cluster is highly "detectable," whereas a higher value of P X gives some indication that the functional community may not in fact correspond to a good structural community. We also plot the F1 recovery scores for finding the minimum conductance subset of R against the conductance of functional communities. In this case we do not see any clear pattern, and we learn very little about the relationship between structural and functional communities.

Meta-Data and Global Clustering
Next we use our techniques to measure how strongly metadata attributes in a network are associated with actual community structure. In general, sets of nodes sharing metadata attributes should not be viewed as "ground truth" clusters [27], although they may still shed light on the underlying clustering structure of a network.
Email Network. We first consider the largest connected component of an email network [21,41]. Each of the 986 nodes in the graph represents a faculty member at a European university, and edges represent email correspondence between members. We remove edge weights and directions, and consider an example clustering C x formed by assigning faculty in the same academic department to the same cluster. We use our bisection method to approximately minimize the global parameter fitness function for the degree-weighted LambdaCC objective. We run our method until we find the best resolution parameter to within a tolerance of 10^−8, yielding a resolution parameter λ x = 6.5 × 10^−5 and a fitness score of P C x (λ x) = 1.34.
To assess how good or bad a score of 1.34 is for this particular application, we construct a new fake metadata attribute by performing a random permutation of the department labels, which gives a clustering C fake. Approximately minimizing P C fake yields a resolution parameter λ fake = 3.25 × 10^−5 and a score P C fake (λ fake) = 2.16. The gap between the minima of P C fake and P C x indicates that although the true metadata partitioning does not perfectly map to clustering structure in the network, it nevertheless shares some meaningful correlation with the network's connectivity patterns. To further demonstrate this, we run the generalized Louvain algorithm [7,16], using the resolution parameters λ x and λ fake. Running the clustering heuristic with λ x outputs a clustering that has a normalized mutual information (NMI) score of 0.71 and an adjusted Rand index (ARI) score of 0.55 with C x. Using λ fake, we get NMI and ARI scores of only 0.05 and 0.003 respectively when comparing with C fake.
Table 1: Output statistics on SNAP datasets for finding the minimum conductance (mc) subset, and learning a resolution parameter (lr). Runtime is given in seconds. We also provide average size (|T|) and conductance (ϕ(T)) for target communities.
Social Networks. We repeat the above experiment on the smallest social network in the Facebook 100 datasets [34], Caltech36. This network is a subset of Facebook with n = 769 nodes, defined by users at the California Institute of Technology at a certain point in September 2005. Every node in the network comes with anonymized metadata attributes reporting student/faculty status, gender, major, second major, residence, graduation year, and high school. We treat each metadata attribute as an example clustering C x. Any node with a value of 0 for an attribute we treat as its own cluster, as this indicates the node has no label for the given attribute.
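The two preprocessing steps used in these metadata experiments, turning an attribute into an example clustering and building a permuted "fake" attribute, can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes attribute values are nonnegative integers with 0 meaning "no label", so that negative integers are free to mark singleton clusters.

```python
import random


def clustering_from_attribute(values):
    """Map attribute values to cluster labels.

    Nodes with value 0 (no label) each become their own cluster,
    marked here with distinct negative labels.
    """
    labels = []
    next_singleton = -1
    for value in values:
        if value == 0:
            labels.append(next_singleton)
            next_singleton -= 1
        else:
            labels.append(value)
    return labels


def permuted_labels(labels, seed=0):
    """Randomly permute labels across nodes, keeping cluster sizes fixed."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled
```

Because the permutation preserves the number and sizes of clusters, any gap between the fitness scores of the real and permuted clusterings reflects how the labels are placed on the graph rather than how many labels there are.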
We do not run Algorithm 1 for each individual C x , since this would involve redundant computations of the LambdaCC LP relaxations for many of the same values of λ. Instead, we evaluate the denominator of P, which is the same for all example clusterings, at 20 equally spaced λ values between 1/(8|E|) and 2/(|E|). We set values of λ to be inversely proportional to the number of edges, since we expect the effect of a resolution parameter to depend on a network's size. We note for example that the resolution parameter corresponding to modularity is λ = 1/(2|E|), which is also inversely proportional to |E|. Computing all of the LP bounds is the bottleneck in our computations, and takes just under 2.5 hours using a recently developed parallel solver for the correlation clustering relaxation [30].
Having evaluated the denominator of P at these values, we can quickly find the minimizer of P for each metadata attribute and a permuted fake metadata attribute to within an error of less than 10^−5. The smallest values of the parameter fitness function P for both real and permuted (fake) metadata attributes are given below: We note that the smallest values of P, as well as the largest gap between P for true and fake metadata clusterings, are obtained for the student/faculty status, residence, and graduation year attributes. This indicates that these attributes share the strongest correlation with the community structure at this university, which is consistent with independent results on the Facebook 100 datasets [34,37].
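The idea of reusing one set of LP bounds across many example clusterings can be illustrated with a short Python sketch. It builds the grid of 20 equally spaced λ values between 1/(8|E|) and 2/|E| and, given precomputed denominators, picks the best grid point for a clustering whose numerator has the linear form a + λb. The names `lambda_grid`, `best_on_grid`, and `lp_bounds` are hypothetical, and the sketch covers only the grid step, not the refinement to within 10^−5.

```python
def lambda_grid(num_edges, count=20):
    """Equally spaced values between 1/(8|E|) and 2/|E|."""
    low = 1 / (8 * num_edges)
    high = 2 / num_edges
    step = (high - low) / (count - 1)
    return [low + i * step for i in range(count)]


def best_on_grid(grid, lp_bounds, a, b):
    """Best grid point for one example clustering.

    lp_bounds[i] is the (precomputed, shared) LP lower bound at grid[i];
    the clustering's objective value at lam is assumed to be a + lam * b.
    """
    scores = [(a + lam * b) / bound for lam, bound in zip(grid, lp_bounds)]
    best = min(range(len(grid)), key=scores.__getitem__)
    return grid[best], scores[best]
```

Only `a` and `b` change from one metadata attribute to the next, so each additional example clustering costs a single pass over the grid rather than a new round of LP solves.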

Local Clustering in Social Networks
In our final experiment we continue exploring the relationship between metadata and community structure in Facebook 100 datasets. We find that minimizing a local parameter fitness function P can be a much better way to measure the community structure of a set of nodes than simply considering the set's conductance.

Data.
We perform experiments on all Facebook 100 networks, focusing on the student/faculty status, gender, residence, and graduation year metadata attributes. For the Caltech dataset in the last experiment, these attained the lowest scores for a global parameter fitness function, and furthermore these are the only attributes with a significant number of sets with nontrivial conductance. For the graduation year attribute, we focus on classes from 2006 to 2009, since these correspond to the four primary classes on each campus when the networks were crawled in September of 2005 [34].
Experimental Setup. We return to an approach similar to our first experiment. For each network and metadata attribute, we consider sets of nodes identified by the same metadata label, e.g., X may represent all students in the class of 2008 at the University of Chicago. We will refer to these simply as metadata sets. A label of zero indicates no attribute is known, so we ignore these sets. We also discard sets that are larger than half the graph, or smaller than 20 nodes. We restrict to considering metadata sets with conductance at most 0.7, since conductance scores too close to 1 indicate that a set has little to no meaningful connectivity pattern. For each remaining metadata set X , we grow a superset R around X using a breadth first search, and stop growing when R contains half the nodes in the graph or is three times the size of X . We then minimize P X as given by (17) to learn a resolution parameter α X . This allows us to find S X = argmin S ⊆R cut(S ) −α X vol(S ), and we then compute the F1 score between S X and X . Our goal here is not to develop a new method for community detection. Rather, computing the F1 score and the minimum of P X provide ways to measure how well a metadata set conforms to a topological notion of community structure, and how detectable the set is from an algorithmic perspective.
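The filtering of metadata sets described above can be sketched in Python. The sketch assumes an unweighted graph stored as an adjacency dictionary and uses the standard definition of conductance, cut(S) divided by the smaller of vol(S) and the volume of the complement; the function names and the default thresholds as keyword arguments are choices made for this example.

```python
def cut_and_volume(adj, S):
    """Return (cut(S), vol(S)) for a node set S in an unweighted graph."""
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol = sum(len(adj[u]) for u in S)
    return cut, vol


def conductance(adj, S):
    """cut(S) / min(vol(S), vol(complement of S))."""
    cut, vol = cut_and_volume(adj, S)
    total = sum(len(nbrs) for nbrs in adj.values())
    denom = min(vol, total - vol)
    return cut / denom if denom > 0 else 1.0


def keep_metadata_set(adj, S, min_size=20, max_conductance=0.7):
    """Apply the size and conductance filters to a metadata set S."""
    n = len(adj)
    if len(S) < min_size or len(S) > n / 2:
        return False
    return conductance(adj, S) <= max_conductance
```

Sets whose label is zero are assumed to have been dropped before this filter is applied.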
Results. While computing conductance scores provides a good first-order measure of a node set's community structure, we find that minimizing P provides more refined information about the detectability of clusters. In Figure 5 we show scatter plots of F1 detection scores against both min P X and ϕ(X) for each metadata set X. We see that especially for the gender and residence metadata sets across all networks, there is a much clearer relationship between F1 scores and min P X. Values of P X very close to 1 map to F1 scores near 1, and as P X increases we see a downward sloping trend in F1 scores. In the conductance plot we do not see the same trend. Figures 5c and 5d show results for metadata sets associated with the 2006-2009 graduation years. For this attribute there appears to be a relationship between F1 scores and both conductance and the min P scores. Furthermore, in both plots we see a separation of the points roughly into two clusters. A deeper exploration of these trends reveals that the 2009 graduation class accounts for the majority of one of these two clusters, and there appears to be an especially clear trend between F1 detection scores and both ϕ(X) and P X for this class.
New Insights. Figure 6 shows violin plots for ϕ(X), cut(X), and vol(X) for metadata sets associated with graduation years from 2006 to 2009. Overall, conductance decreases as graduation year increases. We notice that class sizes for the 2009 graduation year are much smaller on average. When these datasets were generated, Facebook users needed a .edu email address to register an account. Thus, in September 2005, the graduation class of 2009 was made up primarily of new freshmen who had just started college, many of whom had not registered a Facebook account yet. Interestingly, we see a slight decrease in the median cut score from 2007 to 2008, and a significant decrease from 2008 to 2009 (Figure 6c). This suggests that although there were fewer freshmen on Facebook at the time, on average they had a greater tendency to establish connections on Facebook among peers in their same graduation year.
Figure 6 suggests that in the early years of Facebook, with each new year, students in the same graduating class tended to form tighter Facebook circles with members in their own class. To further explore this hypothesis, for each of the 100 Facebook datasets we consider each node from a graduating class between 2006 and 2009. In each network we compute the average in-class connection ratio, i.e., the number of Facebook friends each person has inside the same graduating class, divided by the total number of Facebook connections that the person has across the entire university. In 97 out of 100 datasets (all networks except Caltech36, Hamilton46, and Santa74), this ratio strictly increases as graduation year increases. For Hamilton46 and Santa74, the ratio is still significantly higher for the 2009 graduation class than any other class. If we average this ratio across all networks, as the graduation year increases from 2006 to 2009, the ratios strictly increase: 0.39 for 2006, 0.45 for 2007, 0.57 for 2008, and 0.75 for the class of 2009.
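The in-class connection ratio defined above is straightforward to compute. The Python sketch below averages, for each graduation year, the fraction of a person's friends who share that year. The adjacency-dictionary input, the `year` mapping, and the choice to skip nodes with no friends or no recorded year are assumptions made for this illustration.

```python
from collections import defaultdict


def average_in_class_ratio(adj, year):
    """Average in-class connection ratio for each graduation year.

    adj:  dict mapping each node to a list of its friends
    year: dict mapping a node to its graduation year (nodes without a
          known year are simply absent)
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for u, nbrs in adj.items():
        if u not in year or not nbrs:
            continue                      # no class label or no friends
        same = sum(1 for v in nbrs if year.get(v) == year[u])
        totals[year[u]] += same / len(nbrs)
        counts[year[u]] += 1
    return {y: totals[y] / counts[y] for y in counts}
```

Restricting `year` to the classes from 2006 to 2009 before calling the function gives one average per class for a single network.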
Previous work has already highlighted the influence of the graduation year attribute on the connectivity structure of Facebook 100 datasets on the whole [34]. Our exploration of these datasets has further allowed us to uncover interesting new connectivity patterns that exist between different individual graduation years.

DISCUSSION AND FUTURE WORK
We have introduced a new framework and theory for learning resolution parameters based on minimizing a fitness function associated with a single example clustering of interest. There are several open questions for improving our specific approach. Our bisection-like algorithm is designed to be general enough to minimize a large class of functions to within arbitrary precision. However, by making additional assumptions on either specific clustering objectives or the fixed example clustering, one may be able to develop improved algorithms for minimizing the parameter fitness function in practice. Another open question is to study which other graph clustering objectives can fit into our framework, beyond just the LambdaCC global objective and the local flow clustering objective, and whether, for example, the approach can be applied to clustering in directed graphs.
Our work can be viewed as one approach to the more general goal of learning objective functions for graph clustering applications. This general goal could involve more techniques than simply learning resolution parameters. For example, in future work we wish to explore how to learn small motif subgraph patterns [6] in an example clustering that may be indicative of a desirable type of clustering structure in an application of interest.