Published June 7, 2023 | Version 1.0.0
Software Open

SDOclust Evaluation Tests

  • TU Wien

Description

Evaluation tests conducted for the paper "SDOclust: Clustering with Sparse Data Observers".

Context and methodology

SDOclust is a clustering extension of the Sparse Data Observers (SDO) algorithm. SDOclust treats data observers as graph nodes and clusters them using connected components and local thresholding. The observers' labels are subsequently propagated to the data points.
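The three steps named above (observers as graph nodes, connected components under a local threshold, label propagation) can be illustrated with a toy sketch. This is not the code in "sdo.py". The function name, the random sampling of observers, the k-th-nearest-observer threshold, and the nearest-observer propagation rule are all assumptions made for illustration.

```python
import math
import random


def toy_observer_clustering(points, n_observers=20, k=3, seed=0):
    """Toy illustration of observer-based clustering (NOT the sdo.py code)."""
    rng = random.Random(seed)
    # Assumption: observers are a uniform random sample of the data.
    observers = rng.sample(points, min(n_observers, len(points)))
    m = len(observers)

    # Pairwise distances between observers (the graph nodes).
    dist = [[math.dist(a, b) for b in observers] for a in observers]

    # Assumption: the local threshold of an observer is the distance to its
    # k-th nearest fellow observer (index 0 is the observer itself).
    radius = [sorted(row)[min(k, m - 1)] for row in dist]

    # Union-find over observers to collect connected components.
    parent = list(range(m))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Link two observers when each lies within the other's local threshold.
    for i in range(m):
        for j in range(i + 1, m):
            if dist[i][j] <= min(radius[i], radius[j]):
                parent[find(i)] = find(j)

    # Each connected component becomes one cluster label 0, 1, 2, ...
    roots = sorted({find(i) for i in range(m)})
    label_of_root = {root: c for c, root in enumerate(roots)}
    observer_labels = [label_of_root[find(i)] for i in range(m)]

    # Assumption: a data point inherits the label of its nearest observer.
    labels = []
    for p in points:
        nearest = min(range(m), key=lambda i: math.dist(p, observers[i]))
        labels.append(observer_labels[nearest])
    return labels
```

Calling toy_observer_clustering(points) on a list of coordinate tuples returns one integer label per point. Because the thresholds are local, two groups of observers that are far apart relative to their own spacing end up in different components.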

In this repository, SDOclust is evaluated on 15 two-dimensional synthetic datasets, 138 multi-dimensional synthetic datasets, and 2 real-application datasets, and is compared with the HDBSCAN and k-means-- algorithms.

This repository belongs to research in the following domains: algorithm evaluation, clustering, unsupervised learning, machine learning, data mining, data analysis. The datasets and algorithms can be used for experiment replication and for further clustering evaluation and comparison.

Technical details

Experiments are conducted in Python 3. The file and folder structure is as follows:

  • [data2d] contains 15 two-dimensional datasets as CSV files (last column is the label).
  • [dataMd] contains 138 multi-dimensional datasets as CSV files (last column is the label).
  • [dataReal] contains 2 real-application datasets as CSV files (last column is the label).
  • [plots] contains plots (.png, .pdf) with results generated by test scripts.
  • [tables] contains tables (.csv, .tex) with results generated by test scripts.
  • [cddiag] contains scripts for generating critical difference diagrams with Wilcoxon signed rank tests and plots from conducted tests.
  • "dependencies.py" installs required Python packages.
  • "tests_2d.py" runs 2d experiments.
  • "tests_Md.py" runs multi-dimensional experiments.
  • "test_mawi.py" runs experiments with real network traffic data from MAWI captures.
  • "test_sirena.py" runs experiments with real electricity consumption data from the Sirena project.
  • "sdo.py" implements the SDOclust functions.
  • "pamse2d.py" runs sensitivity analysis on SDOclust parameters.
  • "update_test.py" shows an example of SDOclust working in update mode.
  • "gbc.py" contains functions for the graph-based clustering implementation (based on https://github.com/dayyass/graph-based-clustering).
  • "kmeansmm.py" is the k-means-- implementation (based on https://github.com/Strizzo/kmeans--).
  • "LICENSE" file.
  • "README.md" for further details, links to sources, and instructions for reproducibility.
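Since every dataset file stores the label in its last column, loading one for a new experiment can be sketched as below. The helper name load_labeled_csv is hypothetical, and the sketch assumes comma-separated files without a header row and with numeric labels, none of which is confirmed here. Check a file in [data2d] before relying on it.

```python
import csv


def load_labeled_csv(path):
    """Split a CSV file into feature rows and labels (last column)."""
    features, labels = [], []
    with open(path, newline="") as handle:
        # Assumption: comma-separated values and no header row.
        for row in csv.reader(handle):
            if not row:
                continue  # skip blank lines
            *values, label = row
            features.append([float(v) for v in values])
            # Assumption: labels are numeric (e.g. "1" or "1.0").
            labels.append(int(float(label)))
    return features, labels
```

The returned features (a list of float lists) and labels (a list of ints) can then be passed to any clustering routine and to an external validity measure of choice.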

License

The CC-BY license applies to all data generated with MDCgen. All distributed code is under the GNU GPL license.

 

Files

pysdoclust-main.zip (30.3 MiB)
md5:5a89f4c5d609d1bcaaf7543d836875b5

Additional details

Related works

Is derived from
Conference Paper: 10.1109/ICDMW.2018.00140 (DOI)
Is version of
Software: https://github.com/CN-TU/pysdoclust (URL)