This page offers additional information about our experiments and assists in reproducing the results of our ConSub algorithm. We provide this material in addition to our publication [download full text PDF] published at the IEEE ICDM 2013 conference.
Synthetic Data
For our synthetic data we generate graphs as follows. In general, node degrees follow a power law distribution in our synthetic graphs. Two of our generator parameters control the generation of congruent subspaces. The ratio of relevant attributes (ra) selects a percentage of attributes that are congruent with the graph structure in the one-dimensional space, and the parameter probability for a subspace (ps) determines the probability that one-dimensional subspaces take part in a higher-dimensional subspace that combines relevant attributes.
For irrelevant attributes, node values are assigned from a uniform distribution without considering the graph structure. For relevant attributes, node values are assigned from a Gaussian distribution. We implement the relevant attribute values using Gaussian distributions in order to be able to evaluate CODA, which is based on this assumption. Please note that ConSub is not restricted to such Gaussian distributions. In order to ensure non-deviating values in the relevant attributes, we cut the tails of each Gaussian distribution by a hyperellipsoid. For each community and subspace, such a Gaussian distribution is defined.
The outlier ratio determines the number of outlier nodes whose attribute values in the congruent subspaces are manipulated. Outliers get random values outside the range defined by the hyperellipsoids of the communities they belong to. 25% of all outliers are hub outliers, which are only detected by considering combinations of at least two relevant attributes. As in the example given in our publication (cf. Fig. 1 (a)), hub outliers belong to different communities in lower-dimensional projections, but show deviating values in combinations of attributes.
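To make the attribute assignment more concrete, here is a minimal Python sketch of this procedure. The function and parameter names are our own, the community labels are assumed to be given, and the 3-sigma clipping is only a simplified stand-in for the hyperellipsoid cut; this is not the exact generator used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def assign_attributes(communities, n_attr, ra=0.5, value_range=(0.0, 100.0)):
    """Assign node attribute values: relevant attributes follow a per-community
    Gaussian with cut tails, irrelevant attributes ignore the graph structure."""
    communities = np.asarray(communities)                # community label per node
    n_nodes = len(communities)
    n_relevant = int(ra * n_attr)                         # ratio of relevant attributes (ra)
    values = np.empty((n_nodes, n_attr))
    # irrelevant attributes: uniform values, independent of the graph structure
    values[:, n_relevant:] = rng.uniform(value_range[0], value_range[1],
                                         size=(n_nodes, n_attr - n_relevant))
    # relevant attributes: one Gaussian per community and attribute; clipping at
    # 3*sigma is a simplified stand-in for the hyperellipsoid cut
    for c in np.unique(communities):
        members = np.where(communities == c)[0]
        for a in range(n_relevant):
            mu = rng.uniform(value_range[0], value_range[1])
            sigma = rng.uniform(1.0, 5.0)
            x = rng.normal(mu, sigma, size=len(members))
            values[members, a] = np.clip(x, mu - 3 * sigma, mu + 3 * sigma)
    return values
```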
Each synthetic dataset contains three files with the following graph information (a minimal loading sketch follows the list):
- .graphml file contains the graph with its node attributes.
- .true file contains the following tuple for each node: (node identifier;outlier), used for the quality evaluation of results.
  - 1: outlier
  - 0: regular node
- .stat file contains the following graph information:
  - Number of nodes, edges, and dimensions
  - Number of communities
  - Number of inserted community and hub outliers
  - A list with the irrelevant and the relevant subspaces
  - A list of community and hub outliers
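For reference, a minimal loading sketch in Python, assuming the three files of a dataset share a common prefix and that each line of the .true file is a simple id;label pair (the helper name is ours):

```python
import networkx as nx

def load_dataset(prefix):
    """Load graph, ground truth, and statistics of one synthetic dataset."""
    graph = nx.read_graphml(prefix + ".graphml")     # graph with its node attributes
    truth = {}                                        # node identifier -> 1 (outlier) / 0 (regular node)
    with open(prefix + ".true") as f:
        for line in f:
            node_id, label = line.strip().strip("()").split(";")
            truth[node_id] = int(label)
    with open(prefix + ".stat") as f:
        stats = f.read()                              # free-text statistics as listed above
    return graph, truth, stats
```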
The following figures are visualizations of one of our synthetic graphs (red nodes represent outliers):
(a) example graph showing a relevant subspace
(b) example graph showing an irrelevant subspace
Parameter Configurations
In order to measure the quality of the competitors, we evaluated several algorithmic configurations on the synthetic and real world datasets. The algorithm configurations were:
- CODA [1]
  - k in {5, 8, 10, 15, 20}
  - percentage of outliers in {0.0004, 0.001, 0.01, 0.05, 0.1, 0.15}
  - λ = 0.01
  - 10 different initializations
- LUFS [2]
  - c (pseudo-class labels) in {3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25}
  - α (controlling social dimension) in {0.1, 0.2, 0.3}
  - β (controlling norm regularization) in {0.1, 0.2, 0.3}
  - λ in {1, 10, 100, 10000}
  - K (social dimensions): we use the recently proposed modularity graph clustering approach [3] to obtain the social dimensions for the framework
  - k (number of features) in {10%, 25%, 50% of the total number of features}
- ConSub
  - Significance level in {0.1%, 0.5%, 1%, 5%, 10%, 15%}
  - Number of intervals in {1, 5, 10, 20, 30, 40}
  - Monte Carlo iterations in {1, 25, 50, 100, 150, 200}
Please note that we use the same parameter setting of CODA for each of the pre-processing steps (fullspace, LUFS, and ConSub) in order to ensure comparability of results.
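To illustrate how such a configuration grid can be enumerated, here is a small sketch; the dictionary keys and the run_consub call are hypothetical names of ours, not part of any released code:

```python
from itertools import product

# ConSub parameter grid as listed above (significance level given as a fraction)
consub_grid = {
    "significance_level": [0.001, 0.005, 0.01, 0.05, 0.10, 0.15],
    "num_intervals":      [1, 5, 10, 20, 30, 40],
    "mc_iterations":      [1, 25, 50, 100, 150, 200],
}

for sig, intervals, iters in product(*consub_grid.values()):
    config = {"significance_level": sig,
              "num_intervals": intervals,
              "mc_iterations": iters}
    # run_consub(graph, **config)   # hypothetical call; CODA then runs with its fixed settings
```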
Varying the dimensionality of the datasets
For the evaluation of the quality and the runtime w.r.t. increasing dimensionality, we use the synthetic datasets. We generated 3 different datasets for each dimensionality in {2, 10, 20, 40, 60, 80}.
Parameter | Value
Number of Nodes | 1000
Outlier Ratio | 10%
Relevant Attributes (ra) | 50%
Probability for a subspace (ps) | 20%
Varying the number of nodes and edges
For the runtime evaluation w.r.t. the number of nodes and edges, we use these synthetic datasets.
Parameter | Value
Dimensions | 10
Outlier Ratio | 10%
Relevant Attributes (ra) | 50%
Probability for a subspace (ps) | 20%
Parameter Variation
For the parameter evaluation of ConSub, we consider the three synthetic
graphs with 20 dimensions from the quality experiments.
Real Data
We analyze our approach on four graphs from two real world networks: Amazon (co-purchase network) [4] and Enron (communication network), where we have a given ground truth for quality assessment.
Each benchmark contains three files:
- .graphml file with the network and its node attributes
- .csv file with the mapping of each real world identifier (ASIN for Amazon products and e-mail address for Enron) to our numerical identifier; a small reading sketch is given after this list
- .true file with the ground truth as in the synthetic datasets
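A small sketch for reading the identifier mapping; the two-column layout of the .csv file and the helper name are assumptions on our part:

```python
import csv

def load_id_mapping(path):
    """Map real world identifiers (ASIN / e-mail address) to numerical node identifiers."""
    mapping = {}
    with open(path, newline="") as f:
        for real_id, node_id in csv.reader(f):
            mapping[real_id] = node_id
    return mapping
```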
Here are the details about the aggregated attributes of Amazon and Enron. The real world datasets we used can be found here:
Disney Graph
Details about the Disney Benchmark can be found on this
website.
Amazon Fail Graph
Tag information and attribute values were both extracted in March 2012. We tagged those products that carry the tag amazonfail. This tag was used by at least 20 users to mark a product as an amazonfail (on average, 23 persons tagged one product).
Nodes | 1418
Edges | 3695
Attributes | 28
Outlier Ratio | 0.02
Enron Graph
We generated a graph from the Enron dataset in which e-mail addresses represent nodes. Those addresses that have sent spam messages (spam dataset) were tagged as outliers.
Nodes | 13533
Edges | 176967
Attributes | 20
Outlier Ratio | 0.0004
Large Amazon Graph
To discuss novel knowledge derived from subspace analysis on attributed graphs, we use the largest connected component of the Amazon network from [4]:
Nodes | 314824
Edges | 882930
Attributes | 28
Supplementary Discussion on Instantiation of Edge Count Estimator
Let us give some additional information on the proposed instantiation of the edge count estimator (cf. Sec. 4.2). Here we discuss several possible instantiations of edge count estimators and compare them empirically. Preliminary experiments have shown that all three of these instantiations can be used. The final instantiation (used in our paper) is the best choice according to these preliminary evaluations.
In general, we focus on graphs that are undirected and do not have any self-loops. We aim at both accurate and efficient computation of edge count estimates. However, this is an open challenge, since all valid topologies have to be considered for an optimal solution. The following example illustrates this problem:
Given the specific degree assignment to the node set, in this example only two valid topologies can occur. Thus, the accurate expected edge count for the node set {v1, v2} is 0, since there does not exist an edge between these nodes in any valid topology. In order to evaluate the quality of an estimator, the result has to be compared to the average edge count for the given node set over all valid topologies.
First, we compare our estimator against the null model presented by Erdős and Rényi [5]. Erdős and Rényi assume that the edge density remains constant for all node sets. Thus, they obtain the following estimator:
$$E_{\text{Erdős–Rényi}}(G_{C,S}) = |E_{C\setminus\{I_j\},\,S\setminus\{A_j\}}| \cdot \frac{|V_{C,S}| \cdot (|V_{C,S}| - 1)}{|V_{C\setminus\{I_j\},\,S\setminus\{A_j\}}| \cdot (|V_{C\setminus\{I_j\},\,S\setminus\{A_j\}}| - 1)}$$
Another estimator is proposed by Newman [6]. It considers the case that the edge density may change for different node sets. Thus, it preserves the degree distribution:
$$E_{\text{Modularity}}(G_{C,S}) = \sum_{o,p \in V_{C,S}} \frac{\deg_{S\setminus\{A_j\}}(o) \cdot \deg_{S\setminus\{A_j\}}(p)}{2 \cdot |E_{C\setminus\{I_j\},\,S\setminus\{A_j\}}|}$$
In contrast, our estimator ignores self-loops and can be computed in linear time:
$$E_{\text{ConSub}}(G_{C,S}) = \frac{1}{2} \sum_{o \in V_{C,S}} \deg_{S\setminus\{A_j\}}(o) \cdot \frac{\sum_{p \in V_{C,S}\setminus\{o\}} \deg_{S\setminus\{A_j\}}(p)}{\sum_{p \in V'\setminus\{o\}} \deg_{S\setminus\{A_j\}}(p)}$$
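The following Python sketch shows how the three estimators can be computed from degree information. Here deg_sub holds the degrees deg_{S\{Aj}} of the nodes in V_{C,S}, n_relax and m_relax are the node and edge counts of the relaxed subgraph, and deg_relax_sum is the total degree over V'; these names, and the reading of the modularity sum as running over unordered node pairs, are our own assumptions:

```python
import numpy as np

def e_erdos_renyi(n_sub, n_relax, m_relax):
    """Constant-density estimate: scale the relaxed subgraph's edge count by the
    ratio of possible node pairs."""
    return m_relax * n_sub * (n_sub - 1) / (n_relax * (n_relax - 1))

def e_modularity(deg_sub, m_relax):
    """Degree-preserving estimate in the spirit of Newman's null model,
    summing over unordered node pairs o != p."""
    deg = np.asarray(deg_sub, dtype=float)
    pair_sum = (deg.sum() ** 2 - (deg ** 2).sum()) / 2.0   # sum_{o<p} deg(o)*deg(p)
    return pair_sum / (2.0 * m_relax)

def e_consub(deg_sub, deg_relax_sum):
    """Linear-time estimate ignoring self-loops: node o distributes its degree
    according to the remaining degree mass inside the node set."""
    deg = np.asarray(deg_sub, dtype=float)
    s = deg.sum()
    return 0.5 * float(np.sum(deg * (s - deg) / (deg_relax_sum - deg)))
```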
Since the number of valid topologies depends binomially on the number of vertices and the number of edges, we are only able to evaluate these estimators on very small graphs. For each relaxed subgraph size $|V_{C\setminus\{I_j\},\,S\setminus\{A_j\}}|$, we create 20 node sets whose degrees are drawn from a power law distribution. We determine the accurate expected number of edges between each node pair by considering each valid topology. From each relaxed subgraph we take random node sets with a size in $[2;\ |V_{C\setminus\{I_j\},\,S\setminus\{A_j\}}| - 1]$ 1000 times and compare the exact expected edge counts with the estimated ones using the absolute deviation. The following figure shows our preliminary results w.r.t. this comparison:
$E_{\text{ConSub}}(G_{C,S})$ shows the most accurate results. Thus, we use the instantiation of the estimator $E_{\text{ConSub}}(G_{C,S})$ for our framework.
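For very small graphs, the exact expectation over all valid topologies can be obtained by brute force. The following sketch is our own illustrative helper (not the experiment code): it enumerates every simple undirected graph that realizes a given degree sequence and averages the edge count inside a node set.

```python
import itertools
import numpy as np

def exact_expected_edges(degrees, node_set):
    """Average number of edges inside node_set over all simple undirected
    topologies realizing the given degree sequence (tiny graphs only)."""
    n = len(degrees)
    pairs = list(itertools.combinations(range(n), 2))
    m = sum(degrees) // 2
    inner_counts = []
    for edges in itertools.combinations(pairs, m):            # candidate edge sets
        deg = [0] * n
        for u, v in edges:
            deg[u] += 1
            deg[v] += 1
        if deg == list(degrees):                               # a valid topology
            inner_counts.append(sum(1 for u, v in edges
                                    if u in node_set and v in node_set))
    return float(np.mean(inner_counts)), len(inner_counts)

# Example: four nodes with degrees (1, 1, 2, 2); both valid topologies are paths,
# and neither contains an edge between the two degree-1 nodes:
# exact_expected_edges([1, 1, 2, 2], {0, 1})  ->  (0.0, 2)
```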
[1] J. Gao, F. Liang, W. Fan, C. Wang, Y. Sun, and J. Han. On community outliers and their efficient detection in information networks. In ACM SIGKDD, pages 813-822, 2010.
[2] J. Tang and H. Liu. Unsupervised feature selection for linked social media data. In ACM SIGKDD, pages 904-912, 2012.
[3] R. Rotta and A. Noack. Multilevel local search algorithms for modularity clustering. Journal of Experimental Algorithmics (JEA), 2011.
[4] J. Leskovec, L. Adamic, and B. Huberman. The Dynamics of Viral Marketing. ACM Transactions on the Web (ACM TWEB), 1(1), 2007.
[5] P. Erdős and A. Rényi. On the evolution of random graphs. Magyar Tud. Akad. Mat. Kutató Int. Közl., 5:17-61, 1960.
[6] M. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577-8582, 2006.