This page offers additional information about our experiments and assists in reproducing the results of our ConSub algorithm. We provide this material in addition to our publication at the IEEE ICDM 2013 conference.
Synthetic Data
For our synthetic data we generate graphs as follows. In general, node degrees follow a power-law distribution in our synthetic graphs. Two of our generator parameters control the generation of congruent subspaces: the ratio of relevant attributes (ra) selects the percentage of attributes that are congruent with the graph structure in the one-dimensional space, and the probability for a subspace (ps) determines the probability that one-dimensional subspaces take part in a higher-dimensional subspace that combines relevant attributes.
For irrelevant attributes, node values are drawn from a uniform distribution without considering the graph structure. For relevant attributes, node values are drawn from a Gaussian distribution. We implement the relevant attribute values using Gaussian distributions in order to be able to evaluate CODA, which is based on this assumption. Please note that ConSub is not restricted to such Gaussian distributions. In order to ensure non-deviating values in the relevant attributes, we cut the tails of each Gaussian distribution by a hyper-ellipsoid. Such a Gaussian distribution is defined for each community and subspace.
The outlier ratio determines the number of outlier nodes whose attribute values in the congruent subspaces are manipulated. Outliers get random values outside the range defined by the hyper-ellipsoids of the communities they belong to. 25% of all outliers are hub outliers, which are only detected by considering combinations of at least two relevant attributes. As in the example given in our publication (cf. Fig. 1 (a)), hub outliers belong to different communities in lower-dimensional projections, but show deviating values in combinations of attributes.
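The attribute assignment described above can be sketched in a few lines. This is an illustrative simplification, not our actual generator: the value ranges, the fixed standard deviation, and the per-attribute 2σ tail cut (a stand-in for the hyper-ellipsoid) are assumptions made for this sketch.

```python
import random

def generate_attributes(communities, num_attrs, ra=0.5, seed=0):
    """Illustrative sketch of the attribute assignment (not the actual generator).

    A fraction `ra` of the attributes is relevant: values come from a
    per-community Gaussian with its tails cut. Irrelevant attributes are
    drawn uniformly, ignoring the graph structure.
    """
    rng = random.Random(seed)
    relevant = set(range(int(ra * num_attrs)))
    # one Gaussian (mean, std) per (community, relevant attribute)
    gauss = {(c, a): (rng.uniform(0.0, 100.0), 5.0)
             for c in range(len(communities)) for a in relevant}
    values = {}
    for c, nodes in enumerate(communities):
        for v in nodes:
            vec = []
            for a in range(num_attrs):
                if a in relevant:
                    mu, sd = gauss[(c, a)]
                    x = rng.gauss(mu, sd)
                    while abs(x - mu) > 2.0 * sd:  # cut the tails (stand-in
                        x = rng.gauss(mu, sd)      # for the hyper-ellipsoid)
                else:
                    x = rng.uniform(0.0, 100.0)    # ignores graph structure
                vec.append(x)
            values[v] = vec
    return values
```

An outlier would then be created by replacing such a node's values with random values outside the cut region of its community, in at least two relevant attributes at once for hub outliers.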
Each synthetic dataset contains three files with the following graph information:
 .graphml file: contains the graph with its node attributes.
 .true file: contains the tuple (node identifier;outlier) for each node, used for the quality evaluation of results.
  1: outlier
  0: regular node
 .stat file: contains the following graph information:
  Number of nodes, edges, and dimensions
  Number of communities
  Number of inserted community and hub outliers
  A list of the irrelevant and the relevant subspaces
  A list of community and hub outliers
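For the quality evaluation, the .true labels can be read with a short helper. A minimal sketch, assuming one (node identifier;outlier) tuple per line exactly as listed above:

```python
def read_true_file(path):
    """Map node identifier -> 1 (outlier) or 0 (regular node).

    Assumes one '(id;flag)' tuple per line, as described above.
    """
    labels = {}
    with open(path) as f:
        for line in f:
            line = line.strip().strip("()")
            if not line:
                continue
            node, flag = line.split(";")
            labels[node.strip()] = int(flag)
    return labels
```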
The following figures are visualizations of one of our synthetic graphs (red nodes represent outliers):
(a) example graph showing a relevant subspace
(b) example graph showing an irrelevant subspace
Parameter Configurations
In order to measure the quality of the competitors, we evaluated several algorithm configurations on the synthetic and real-world datasets. The algorithm configurations were:
 CODA [1]
  k in {5, 8, 10, 15, 20}
  percentage of outliers in {0.0004, 0.001, 0.01, 0.05, 0.1, 0.15}
  λ = 0.01
  10 different initializations
 LUFS [2]
  c (pseudo-class labels) in {3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25}
  α (controlling social dimension) in {0.1, 0.2, 0.3}
  β (controlling norm regularization) in {0.1, 0.2, 0.3}
  λ in {1, 10, 100, 10000}
  K (social dimensions): we use the recently proposed modularity graph clustering approach [3] to obtain the social dimensions for the framework
  k (number of features) in {10%, 25%, 50% of the total number of features}
 ConSub
  Significance level in {0.1%, 0.5%, 1%, 5%, 10%, 15%}
  Number of intervals in {1, 5, 10, 20, 30, 40}
  Monte Carlo iterations in {1, 25, 50, 100, 150, 200}
Please note that we use the same parameter setting of CODA for each of the preprocessing steps (full-space, LUFS, and ConSub) in order to ensure comparability of results.
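As a sanity check on the size of this sweep, the CODA grid above can be enumerated programmatically. The dictionary key names below are our own; only the value sets come from the list above:

```python
from itertools import product

# The CODA grid listed above, enumerated as plain dictionaries.
ks = [5, 8, 10, 15, 20]
outlier_pcts = [0.0004, 0.001, 0.01, 0.05, 0.1, 0.15]
configs = [{"k": k, "outlier_pct": p, "lambda": 0.01, "init": i}
           for k, p, i in product(ks, outlier_pcts, range(10))]
print(len(configs))  # 5 * 6 * 10 = 300 CODA runs per dataset
```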
Varying the dimensionality of the datasets
For the evaluation of quality and runtime w.r.t. increasing dimensionality, we use the following synthetic datasets. We generated 3 different datasets for each dimensionality in {2, 10, 20, 40, 60, 80}.
Parameter                        Value
Number of Nodes                  1000
Outlier Ratio                    10%
Relevant Attributes (ra)         50%
Probability for a subspace (ps)  20%
Varying the number of nodes and edges
For the runtime evaluation w.r.t. the number of nodes and edges, we use these synthetic datasets:
Parameter                        Value
Dimensions                       10
Outlier Ratio                    10%
Relevant Attributes (ra)         50%
Probability for a subspace (ps)  20%
Parameter Variation
For the parameter evaluation of ConSub, we consider the three synthetic
graphs with 20 dimensions from the quality experiments.
Real Data
We analyze our approach on four graphs from two real-world networks, Amazon (co-purchase network) [4] and Enron (communication network), where we have a given ground truth for quality assessment.
Each benchmark contains three files:
 .graphml file: the network and its node attributes
 .csv file: the mapping of each real-world identifier (ASIN for Amazon products, email address for Enron) to our numerical identifier
 .true file: the ground truth, in the same format as in the synthetic datasets
Here are the details about the aggregated attributes of Amazon and Enron. The real-world datasets we used can be found here:
Disney Graph
Details about the Disney Benchmark can be found on this website.
Amazon Fail Graph
Both tag information and attribute values were extracted in March 2012. We tagged those products that have the tag amazonfail. This tag was used by at least 20 users to tag the product as an amazonfail (the average number of persons that tagged one product is 23).
Nodes            1418
Edges            3695
Attributes       28
Outlier Ratio    0.02
Enron Graph
From the Enron dataset, we generated a graph where email addresses represent nodes. Those addresses that have sent spam messages (spam dataset) were tagged as outliers.
Nodes            13533
Edges            176967
Attributes       20
Outlier Ratio    0.0004
Large Amazon Graph
To discuss novel knowledge derived from subspace analysis on attributed graphs, we use the biggest component of the Amazon network from [4]:
Nodes            314824
Edges            882930
Attributes       28
Supplementary Discussion on the Instantiation of the Edge Count Estimator
Let us give some additional information on the proposed instantiation of the edge count estimator (cf. Sec. 4.2). Here we discuss several possible instantiations of edge count estimators and compare them empirically. Preliminary experiments have shown that all three of these instantiations can be used. The final instantiation (used in our paper) is the best choice out of these preliminary evaluations.
In general, we focus on graphs that are undirected and do not have self-loops. We aim at both accurate and efficient computation of edge count estimates. However, this is an open challenge, since all valid topologies have to be considered for an optimal solution. The following example illustrates this problem: given the specific degree assignment to the node set, in this example only two valid topologies can occur. Thus, an accurate expected edge count for the node set {v_{1}, v_{2}} is 0, since there does not exist an edge between these nodes in any valid topology. In order to evaluate the quality of an estimator, the result has to be compared to the average edge count for the given node set over all valid topologies.
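This average over all valid topologies can be computed by brute force for tiny graphs. A sketch (exponential in the number of node pairs, so only viable for very small examples); the degree sequence (1, 1, 2, 2) is our own illustrative choice, since the figure with the original example is not reproduced here:

```python
from itertools import combinations

def exact_expected_edges(degrees, node_set):
    """Average number of edges inside `node_set`, taken over all simple
    graphs (undirected, no self-loops) that realize `degrees`."""
    n, m = len(degrees), sum(degrees) // 2
    pairs = list(combinations(range(n), 2))
    valid = inside = 0
    for edges in combinations(pairs, m):   # every candidate topology
        deg = [0] * n
        for u, v in edges:
            deg[u] += 1
            deg[v] += 1
        if deg == list(degrees):           # a valid realization
            valid += 1
            inside += sum(1 for u, v in edges
                          if u in node_set and v in node_set)
    return inside / valid

# With degrees (1, 1, 2, 2) exactly two valid topologies exist, and the
# two degree-1 nodes are never adjacent:
print(exact_expected_edges([1, 1, 2, 2], {0, 1}))  # 0.0
```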
First, we compare our estimator against the null model presented by Erdős and Rényi [5]. Erdős and Rényi assume that the edge density remains constant for all node sets. This yields the following estimator:

E_{Erdős–Rényi}(G_{C,S}) = E_{C\{Ij},S\{Aj}} · (V_{C,S} · (V_{C,S} − 1)) / (V_{C\{Ij},S\{Aj}} · (V_{C\{Ij},S\{Aj}} − 1))
Another estimator is proposed by Newman [6]. It considers the case that the edge density may change for different node sets. Thus, it preserves the degree distribution:

E_{Modularity}(G_{C,S}) = ∑_{o,p ∈ V_{C,S}} (deg_{S\{Aj}}(o) · deg_{S\{Aj}}(p)) / (2 · E_{C\{Ij},S\{Aj}})
In contrast, our estimator ignores self-loops and can be computed in linear time:

E_{ConSub}(G_{C,S}) = 1/2 · ∑_{o ∈ V_{C,S}} deg_{S\{Aj}}(o) · (∑_{p ∈ V_{C,S}\{o}} deg_{S\{Aj}}(p)) / (∑_{p ∈ V′\{o}} deg_{S\{Aj}}(p))
Since the number of valid topologies grows binomially with the number of vertices and the number of edges, we are only able to evaluate these estimators on very small graphs. For each relaxed subgraph size V_{C\{Ij},S\{Aj}}, we create 20 node sets whose degrees are drawn from a power-law distribution. We determine the exact expected number of edges between each node pair by considering each valid topology. We take random node sets with a size in [2; V_{C\{Ij},S\{Aj}} − 1] out of each relaxed subgraph 1000 times and compare the exact expected edge counts with the estimated ones using the absolute deviation. The following figure shows our preliminary results w.r.t. this comparison:
E_{ConSub}(G_{C,S}) shows the most accurate results. Thus, we use the instantiation of the estimator E_{ConSub}(G_{C,S}) for our framework.
[1] J. Gao, F. Liang, W. Fan, C. Wang, Y. Sun, and J. Han. On community outliers and their efficient detection in information networks. In ACM SIGKDD, pages 813-822, 2010.
[2] J. Tang and H. Liu. Unsupervised feature selection for linked social media data. In ACM SIGKDD, pages 904-912, 2012.
[3] R. Rotta and A. Noack. Multilevel local search algorithms for modularity clustering. Journal of Experimental Algorithmics (JEA), 2011.
[4] J. Leskovec, L. Adamic, and B. Huberman. The Dynamics of Viral Marketing. ACM Transactions on the Web (ACM TWEB), 1(1), 2007.
[5] P. Erdős and A. Rényi. On the evolution of random graphs. Magyar Tud. Akad. Mat. Kutató Int. Közl., 5:17-61, 1960.
[6] M. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577-8582, 2006.