Ranking Outlier Nodes in Subspaces of Attributed Graphs

Emmanuel Müller, Patricia Iglesias Sánchez, Yvonne Mülle, and Klemens Böhm

in 4th International Workshop on Graph Data Management: Techniques and Applications (GDM 2013)
in Conjunction with IEEE International Conference on Data Engineering (ICDE 2013) 

Amazon Benchmark


This  dataset is a subgraph of the Amazon co-purchase Amazon Network [1] . We filtered products from the Disney category and we extend the product information with  product prices extracted from the Amazon website on March 2012. A list of the attributes and their description can be found here.

Number of nodes 124
Number of edges 334
Average clustering coefficient 0.437
Dimensions 30

User Experiment

Pre-Configuration:

  1. First, we clustered the Amazon subgraph with a modularity based technique  [2]. Thus, we have ensured that students do not label global outliers (e.g., product with the highest price of the database). but they label contextual outliers inside graph clusters. (The file with the obtained graph clusters can be found here)
  2. We choose those attributes which are available on the Amazon website (not aggregated attributes) , and we translate them into German in order to provide a product description to the students.

Experiment Sequence:

  1. For each graph cluster (in total 8 clusters) : 
            a. All products in a graph cluster were shown to the students in the browser.

                                                         Visualization of the product group

            b. Each group of two students had to choose 1 or 2 products that they considered unexpected, rare or suspicious in the  group.
            c. For each labeled product, the two students had to fill out a form with the reasons and the attributes they considered  rare.

                                                        Form

Results:  All results of the experiment with the German comments of the students can be found here.

Experimental Evaluation

For the evaluation, we selected those products, where at least 50% of the students considered the products as outliers inside the group (in total 6 outliers were selected).  

Evaluation with traditional techniques on relational data only

First, we evaluate the results with traditional outlier detection techniques provided by  the SOREX toolkit. We used the relational dataset, without the information of the graph structure. For the evaluation of the detection quality we use the true file as outlier labels. The configurations of  the  algorithms used can be found in the following table:


Algorithm Parameters
LOF weka.outlier.LocalOutlierFactor -L 20 -U 20
SOF weka.outlier.AggraWal -S 20 -P 10 
                  RPLOF                weka.outlier.RPLOF -L 20 -U 20 -X 1


Evaluation with other competitors considering the graph structure

Finally, we also considered the dataset with the graph structure  in order to evalute our approach and other competitors. The settings are shown in the following table:



Algorithm Parameter
SCAN µ= 2,ε=0.5
CODA k=8, r = 6/124 , λ=0.1


Evaluation considering different clusterings

We evaluate our scoring functions using the results provided by graph subspace clustering techniques. The implementation of all these algorithms can be found here. The properties fles for the execution of each of the algorithms are shown in the following table:

Algorithm Configuration File
GAMer Config  File
extension of Cocain Config File
                CoPaM                                   Config File





[1]  J. Leskovec, L. Adamic and B. Adamic. The Dynamics of Viral Marketing. ACM Transactions on the Web (ACM TWEB), 1(1), 2007.
[2]  Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre, Fast unfolding of communities in large networks, in Journal of Statistical Mechanics: Theory and Experiment 2008 (10), P1000




Access to this website

After publication of this work in April 2013, we encourage researchers in this area to use the proposed algorithm for their own publications as competitor. Our implementation will then be available for anyone to use. Thus, all algorithms, data sets and parameter setting will be available for the community.