This page offers additional information about our experiments and assists in reproducing the results of our ConSub algorithm. We provide this material in addition to our publication [download full text PDF] published at the IEEE ICDM 2013 conference.
Synthetic Data
For our synthetic data we generate graphs as follows. In general, node degrees follow a power law distribution in our synthetic graphs. Two of our generator parameters control the generation of congruent subspaces. The ratio of relevant attributes (ra) selects a percentage of attributes that are congruent with the graph structure in the one-dimensional space, and the parameter probability for a subspace (ps) determines the probability that one-dimensional subspaces take part in a higher-dimensional subspace that combines relevant attributes.
For irrelevant attributes, node values are assigned from a uniform distribution without considering the graph structure. For relevant attributes, node values are assigned from a Gaussian distribution. We implement the relevant attribute values using Gaussian distributions in order to be able to evaluate CODA, which is based on this assumption. Please note that ConSub is not restricted to such Gaussian distributions. In order to ensure non-deviating values in the relevant attributes, we cut the tails of each Gaussian distribution by a hyperellipsoid. For each community and subspace, such a Gaussian distribution is defined.
The outlier ratio determines the number of outlier nodes whose attribute values in the congruent subspaces are manipulated. Outliers get random values outside the range defined by the hyperellipsoids of the communities they belong to. 25% of all outliers are hub outliers, which are only detected by considering combinations of at least two relevant attributes. As in the example given in our publication (cf. Fig. 1 (a)), hub outliers belong to different communities in lower-dimensional projections, but show deviating values in combinations of attributes.
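To make the attribute assignment more concrete, here is a minimal Python sketch of this procedure. The function and parameter names are our own, the community labels are assumed to be given, and the 3-sigma clipping is only a simplified stand-in for the hyperellipsoid cut; this is not the exact generator used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def assign_attributes(communities, n_attr, ra=0.5, value_range=(0.0, 100.0)):
    """Assign node attribute values: relevant attributes follow a per-community
    Gaussian with cut tails, irrelevant attributes ignore the graph structure."""
    communities = np.asarray(communities)                # community label per node
    n_nodes = len(communities)
    n_relevant = int(ra * n_attr)                         # ratio of relevant attributes (ra)
    values = np.empty((n_nodes, n_attr))
    # irrelevant attributes: uniform values, independent of the graph structure
    values[:, n_relevant:] = rng.uniform(value_range[0], value_range[1],
                                         size=(n_nodes, n_attr - n_relevant))
    # relevant attributes: one Gaussian per community and attribute; clipping at
    # 3*sigma is a simplified stand-in for the hyperellipsoid cut
    for c in np.unique(communities):
        members = np.where(communities == c)[0]
        for a in range(n_relevant):
            mu = rng.uniform(value_range[0], value_range[1])
            sigma = rng.uniform(1.0, 5.0)
            x = rng.normal(mu, sigma, size=len(members))
            values[members, a] = np.clip(x, mu - 3 * sigma, mu + 3 * sigma)
    return values
```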
Each synthetic dataset contains three files with the following graph information (a minimal loading sketch follows the list):
- .graphml file contains the graph with its node attributes.
- .true file contains the following tuple for each node: (node identifier;outlier), used for the quality evaluation of results.
  - 1: outlier
  - 0: regular node
- .stat file contains the following graph information:
  - Number of nodes, edges, and dimensions
  - Number of communities
  - Number of inserted community and hub outliers
  - A list with the irrelevant and the relevant subspaces
  - A list of community and hub outliers
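For reference, a minimal loading sketch in Python, assuming the three files of a dataset share a common prefix and that each line of the .true file is a simple id;label pair (the helper name is ours):

```python
import networkx as nx

def load_dataset(prefix):
    """Load graph, ground truth, and statistics of one synthetic dataset."""
    graph = nx.read_graphml(prefix + ".graphml")     # graph with its node attributes
    truth = {}                                        # node identifier -> 1 (outlier) / 0 (regular node)
    with open(prefix + ".true") as f:
        for line in f:
            node_id, label = line.strip().strip("()").split(";")
            truth[node_id] = int(label)
    with open(prefix + ".stat") as f:
        stats = f.read()                              # free-text statistics as listed above
    return graph, truth, stats
```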
The following figures are visualizations of one of our synthetic graphs (red nodes represent outliers):
(a) example graph showing a relevant subspace
(b) example graph showing an irrelevant subspace
Parameter Configurations
In order to measure the quality of the competitors, we evaluated several algorithmic configurations on the synthetic and real world datasets. The algorithm configurations were:
- CODA [1]
  - k in {5, 8, 10, 15, 20}
  - percentage of outliers in {0.0004, 0.001, 0.01, 0.05, 0.1, 0.15}
  - λ = 0.01
  - 10 different initializations
- LUFS [2]
  - c (pseudo-class labels) in {3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25}
  - α (controlling social dimension) in {0.1, 0.2, 0.3}
  - β (controlling norm regularization) in {0.1, 0.2, 0.3}
  - λ in {1, 10, 100, 10000}
  - K (social dimensions): we use the recently proposed modularity graph clustering approach [3] to obtain the social dimensions for the framework
  - k (number of features) in {10%, 25%, 50% of the total number of features}
- ConSub
  - Significance level in {0.1%, 0.5%, 1%, 5%, 10%, 15%}
  - Number of intervals in {1, 5, 10, 20, 30, 40}
  - Monte Carlo iterations in {1, 25, 50, 100, 150, 200}
Please note that we use the same parameter setting of CODA for each of the pre-processing steps (fullspace, LUFS, and ConSub) in order to ensure comparability of results.
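To illustrate how such a configuration grid can be enumerated, here is a small sketch; the dictionary keys and the run_consub call are hypothetical names of ours, not part of any released code:

```python
from itertools import product

# ConSub parameter grid as listed above (significance level given as a fraction)
consub_grid = {
    "significance_level": [0.001, 0.005, 0.01, 0.05, 0.10, 0.15],
    "num_intervals":      [1, 5, 10, 20, 30, 40],
    "mc_iterations":      [1, 25, 50, 100, 150, 200],
}

for sig, intervals, iters in product(*consub_grid.values()):
    config = {"significance_level": sig,
              "num_intervals": intervals,
              "mc_iterations": iters}
    # run_consub(graph, **config)   # hypothetical call; CODA then runs with its fixed settings
```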
Varying the dimensionality of the datasets
For the evaluation of the quality and the runtime w.r.t. increasing dimensionality, we use the synthetic datasets. We generated 3 different datasets for each dimensionality in {2, 10, 20, 40, 60, 80}.
Parameter | Value
Number of Nodes | 1000
Outlier Ratio | 10%
Relevant Attributes (ra) | 50%
Probability for a subspace (ps) | 20%
Varying the number of nodes and edges
For the runtime evaluation w.r.t. the number of nodes and edges, we use these synthetic datasets.
Parameter | Value
Dimensions | 10
Outlier Ratio | 10%
Relevant Attributes (ra) | 50%
Probability for a subspace (ps) | 20%
Parameter Variation
For the parameter evaluation of ConSub, we consider the three synthetic
graphs with 20 dimensions from the quality experiments.
Real Data
We analyze our approach on four graphs from two real world networks: Amazon (co-purchase network) [4] and Enron (communication network), where we have a given ground truth for quality assessment.
Each benchmark contains three files:
- .graphml file with the network and its node attributes
- .csv file with the mapping of each real world identifier (ASIN for Amazon products and e-mail address for Enron) to our numerical identifier; a small reading sketch is given after this list
- .true file with the ground truth as in the synthetic datasets
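A small sketch for reading the identifier mapping; the two-column layout of the .csv file and the helper name are assumptions on our part:

```python
import csv

def load_id_mapping(path):
    """Map real world identifiers (ASIN / e-mail address) to numerical node identifiers."""
    mapping = {}
    with open(path, newline="") as f:
        for real_id, node_id in csv.reader(f):
            mapping[real_id] = node_id
    return mapping
```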
Here are the details about the aggregated attributes of Amazon and Enron. The real world datasets we used can be found here:
Disney Graph
Details about the Disney Benchmark can be found on this
website.
Amazon Fail Graph
Tag information and attribute values were both extracted in March 2012. We tagged those products that carry the tag amazonfail. This tag was used by at least 20 users to mark a product as an amazonfail (on average, 23 persons tagged one product).
Nodes | 1418
Edges | 3695
Attributes | 28
Outlier Ratio | 0.02
Enron Graph
We generated a graph from the Enron dataset in which e-mail addresses represent nodes. Those addresses that have sent spam messages (spam dataset) were tagged as outliers.
Nodes | 13533
Edges | 176967
Attributes | 20
Outlier Ratio | 0.0004
Large Amazon Graph
To discuss novel knowledge derived from subspace analysis on attributed graphs, we use the largest connected component of the Amazon network from [4]:
Nodes | 314824
Edges | 882930
Attributes | 28
Supplementary Discussion on Instantiation of Edge Count Estimator
Let us give some additional information on the proposed instantiation of the edge count estimator (cf. Sec. 4.2). Here we discuss several possible instantiations of edge count estimators and compare them empirically. Preliminary experiments have shown that all three of these instantiations can be used. The final instantiation (used in our paper) is the best choice according to these preliminary evaluations.
In general, we focus on graphs that are undirected and do not have any self-loops. We aim at both accurate and efficient computation of edge count estimates. However, this is an open challenge, since all valid topologies have to be considered for an optimal solution. The following example illustrates this problem:
Given the specific degree assignment to the node set, in this example only two valid topologies can occur. Thus, the accurate expected edge count for the node set {v1, v2} is 0, since there does not exist an edge between these nodes in any valid topology. In order to evaluate the quality of an estimator, the result has to be compared to the average edge count for the given node set over all valid topologies.
First, we compare our estimator against the null model presented by Erdős and Rényi [5]. Erdős and Rényi assume that the edge density remains constant for all node sets. Thus, they obtain the following estimator:
$$E_{\text{Erdős–Rényi}}(G_{C,S}) = |E_{C\setminus\{I_j\},\,S\setminus\{A_j\}}| \cdot \frac{|V_{C,S}| \cdot (|V_{C,S}| - 1)}{|V_{C\setminus\{I_j\},\,S\setminus\{A_j\}}| \cdot (|V_{C\setminus\{I_j\},\,S\setminus\{A_j\}}| - 1)}$$
Another estimator is proposed by Newman [6]. It considers the case that the edge density may change for different node sets. Thus, it preserves the degree distribution:
$$E_{\text{Modularity}}(G_{C,S}) = \sum_{o,p \in V_{C,S}} \frac{\deg_{S\setminus\{A_j\}}(o) \cdot \deg_{S\setminus\{A_j\}}(p)}{2 \cdot |E_{C\setminus\{I_j\},\,S\setminus\{A_j\}}|}$$
In contrast, our estimator ignores self-loops and can be computed in linear time:
$$E_{\text{ConSub}}(G_{C,S}) = \frac{1}{2} \sum_{o \in V_{C,S}} \deg_{S\setminus\{A_j\}}(o) \cdot \frac{\sum_{p \in V_{C,S}\setminus\{o\}} \deg_{S\setminus\{A_j\}}(p)}{\sum_{p \in V'\setminus\{o\}} \deg_{S\setminus\{A_j\}}(p)}$$
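The following Python sketch shows how the three estimators can be computed from degree information. Here deg_sub holds the degrees deg_{S\{Aj}} of the nodes in V_{C,S}, n_relax and m_relax are the node and edge counts of the relaxed subgraph, and deg_relax_sum is the total degree over V'; these names, and the reading of the modularity sum as running over unordered node pairs, are our own assumptions:

```python
import numpy as np

def e_erdos_renyi(n_sub, n_relax, m_relax):
    """Constant-density estimate: scale the relaxed subgraph's edge count by the
    ratio of possible node pairs."""
    return m_relax * n_sub * (n_sub - 1) / (n_relax * (n_relax - 1))

def e_modularity(deg_sub, m_relax):
    """Degree-preserving estimate in the spirit of Newman's null model,
    summing over unordered node pairs o != p."""
    deg = np.asarray(deg_sub, dtype=float)
    pair_sum = (deg.sum() ** 2 - (deg ** 2).sum()) / 2.0   # sum_{o<p} deg(o)*deg(p)
    return pair_sum / (2.0 * m_relax)

def e_consub(deg_sub, deg_relax_sum):
    """Linear-time estimate ignoring self-loops: node o distributes its degree
    according to the remaining degree mass inside the node set."""
    deg = np.asarray(deg_sub, dtype=float)
    s = deg.sum()
    return 0.5 * float(np.sum(deg * (s - deg) / (deg_relax_sum - deg)))
```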
Since the number of valid topologies depends binomially on the number of vertices and the number of edges, we are only able to evaluate these estimators on very small graphs. For each relaxed subgraph size $|V_{C\setminus\{I_j\},\,S\setminus\{A_j\}}|$, we create 20 node sets whose degrees are drawn from a power law distribution. We determine the accurate expected number of edges between each node pair by considering each valid topology. From each relaxed subgraph we take random node sets with a size in $[2;\ |V_{C\setminus\{I_j\},\,S\setminus\{A_j\}}| - 1]$ 1000 times and compare the exact expected edge counts with the estimated ones using the absolute deviation. The following figure shows our preliminary results w.r.t. this comparison:
$E_{\text{ConSub}}(G_{C,S})$ shows the most accurate results. Thus, we use the instantiation of the estimator $E_{\text{ConSub}}(G_{C,S})$ for our framework.
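For very small graphs, the exact expectation over all valid topologies can be obtained by brute force. The following sketch is our own illustrative helper (not the experiment code): it enumerates every simple undirected graph that realizes a given degree sequence and averages the edge count inside a node set.

```python
import itertools
import numpy as np

def exact_expected_edges(degrees, node_set):
    """Average number of edges inside node_set over all simple undirected
    topologies realizing the given degree sequence (tiny graphs only)."""
    n = len(degrees)
    pairs = list(itertools.combinations(range(n), 2))
    m = sum(degrees) // 2
    inner_counts = []
    for edges in itertools.combinations(pairs, m):            # candidate edge sets
        deg = [0] * n
        for u, v in edges:
            deg[u] += 1
            deg[v] += 1
        if deg == list(degrees):                               # a valid topology
            inner_counts.append(sum(1 for u, v in edges
                                    if u in node_set and v in node_set))
    return float(np.mean(inner_counts)), len(inner_counts)

# Example: four nodes with degrees (1, 1, 2, 2); both valid topologies are paths,
# and neither contains an edge between the two degree-1 nodes:
# exact_expected_edges([1, 1, 2, 2], {0, 1})  ->  (0.0, 2)
```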
[1] J. Gao, F. Liang, W. Fan, C. Wang, Y. Sun, and J. Han. On community outliers and their efficient detection in information networks. In ACM SIGKDD, pages 813-822, 2010.
[2] J. Tang and H. Liu. Unsupervised feature selection for linked social media data. In ACM SIGKDD, pages 904-912, 2012.
[3] R. Rotta and A. Noack. Multilevel local search algorithms for modularity clustering. Journal of Experimental Algorithmics (JEA), 2011.
[4] J. Leskovec, L. Adamic, and B. Huberman. The Dynamics of Viral Marketing. ACM Transactions on the Web (ACM TWEB), 1(1), 2007.
[5] P. Erdős and A. Rényi. On the evolution of random graphs. Magyar Tud. Akad. Mat. Kutató Int. Közl., 5:17-61, 1960.
[6] M. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577-8582, 2006.