Supplementary material concerning repeatability


Paper ID

432

Title

HiCS: High Contrast Subspaces for Density-Based Outlier Ranking

Authors

Fabian Keller, Emmanuel Müller, Klemens Böhm


Datasets

All datasets used in the following experiments can be downloaded here:

Experiments on Synthetic Data

General remarks:
Experiment configurations:
In summary, this yields a total of 21 * 42 = 882 experiments (about 5 days of processing time, mainly due to RIS).

Experiments on Real World Data

The following experiments were performed with the best configuration of each algorithm from the experiments on synthetic data. We applied a standardized preprocessing procedure to all datasets (rescaling all attributes and removing categorical attributes or attributes that show strong discretization effects). The ARFF files in the downloadable zip archive already contain the results of these preprocessing steps. We also stored our outlier definition in these files (always in the last attribute: 0 = no outlier, 1 = outlier). The following figures show the ROC plots of all experiments.

Ann-Thyroid:

roc_ann_thyroid.png

Arrhythmia:

roc_arrhythmia.png

Breast:

roc_breast.png

Breast (diagnostic):

roc_breast_diagnostic.png

Diabetes:

roc_breast_diagnostic.png

Glass:

roc_glass.png

Ionosphere:

roc_ionosphere.png

Pendigits:

roc_pendigits.png
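
Readers who want to recompute ROC curves of this kind from their own outlier rankings could start from a sketch like the one below. It is only an illustration, not the evaluation code used for the plots above. It assumes that each data row has already been read from an ARFF file into a list of numbers with the outlier label (0 = no outlier, 1 = outlier) as the last entry, and that a higher score means "more outlying". The function names split_label, roc_points, and auc are hypothetical and introduced here only for this example.

```python
def split_label(rows):
    """Separate attribute values from the 0/1 outlier label in the last column."""
    features = [row[:-1] for row in rows]
    labels = [int(row[-1]) for row in rows]
    return features, labels


def roc_points(scores, labels):
    """Return ROC points (false positive rate, true positive rate).

    Assumption: a higher score means a stronger outlier. Objects with equal
    scores are processed together, so ties produce a single diagonal segment.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one outlier and one non-outlier")

    # Rank objects from most to least outlying.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        j = i
        # Consume the whole group of objects that share the current score.
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if labels[order[j]] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Toy data: two attributes plus the outlier label in the last column.
    rows = [[0.1, 0.2, 1], [0.4, 0.4, 0], [0.9, 0.8, 1], [0.5, 0.5, 0]]
    _, labels = split_label(rows)
    # Made-up scores standing in for the output of any outlier ranking method.
    scores = [0.9, 0.8, 0.3, 0.1]
    print(auc(roc_points(scores, labels)))
```

The scores in the usage part are invented toy values. In practice they would come from whichever outlier ranking method is being evaluated, and the resulting points can be passed to any plotting tool.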


Resources

The reviewer is expected to agree to confidentiality requirements with respect to non-disclosure of data on this website, as the reviewer does for any paper under review. Usage is limited to repeating and exploring the experimental results of this paper. Until this work has been published, no other use is allowed, in particular not for other publications. This website documents the experimental setup used in the evaluation described in our manuscript. Additional experimental data, setup, and software will be made available when the manuscript is published.

Public access to this website

After publication of this work, we encourage researchers in this area to use the proposed algorithm as a competitor in their own publications. Our implementation will then be available for anyone to use. Thus, all algorithms, data sets, and parameter settings will be available to the community.