# On Heterogeneous Ensembles for Anomaly Detection (artifacts)

Code, datasets, and experimental results for reusing and replicating the paper:

> Félix Iglesias, Tanja Zseby, Conrado Martínez, Arthur Zimek. *On Heterogeneous Ensembles for Anomaly Detection: Empirical Insights and Guidelines for the Design*. Expert Systems with Applications, 2026, 133468. [DOI: 10.1016/j.eswa.2026.133468](https://doi.org/10.1016/j.eswa.2026.133468)

**Jun, 2026**

---

## Requirements

To replicate the experiments, make sure the following tools are installed:

- [Docker](https://docs.docker.com/get-docker/)
- [Docker Compose](https://docs.docker.com/compose/)

To run the experiments on your own data, either:

- Modify the `run.sh` script accordingly, **or**
- Use the Python functions directly as needed.

We recommend placing your own datasets under the `data/` folder, e.g.:

```
data/mycollection/
```

Each dataset must be a **tabular CSV or NPZ file**. See the `load_dataset` function in `modules/utils.py` for details on the expected format.
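
Before pointing the pipeline at a new file, it can help to confirm that a CSV is actually tabular. The following is a minimal sketch using only the Python standard library. The helper name `check_tabular_csv` and the example path are hypothetical, and the check does not replicate `load_dataset`: the expected column layout (for instance, where labels are stored) still has to be looked up in `modules/utils.py`.

```python
import csv


def check_tabular_csv(path, delimiter=","):
    """Return (n_rows, n_cols) if every non-empty row has the same field count.

    Hypothetical helper, not part of the repository. It only verifies that
    the file is rectangular; it does not inspect headers, labels, or dtypes.
    """
    n_rows = 0
    n_cols = None
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        for row_no, row in enumerate(reader, start=1):
            if not row:  # csv.reader yields [] for completely empty lines
                continue
            if n_cols is None:
                n_cols = len(row)  # first non-empty row fixes the width
            elif len(row) != n_cols:
                raise ValueError(
                    f"row {row_no}: expected {n_cols} fields, found {len(row)}"
                )
            n_rows += 1
    if n_cols is None:
        raise ValueError("file contains no rows")
    return n_rows, n_cols


if __name__ == "__main__":
    # Hypothetical path; replace with a file in your own collection.
    print(check_tabular_csv("data/mycollection/example.csv"))
```

The returned row count includes a header line if the file has one.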

---

## Files and Folders

```
compiled_image/          # Precompiled Docker image
docker/                  # Files to build the image and container
README.md                # This file

HetEns_Apr2026.zip       # Source code and datasets

results_Apr2026.zip      # All performances, tables and images generated 
                         # and shown in the paper
```

## Instructions

### 0. Unzip the code and data in place

> First of all, unzip **HetEns_Apr2026.zip** in place to create the **HetEns_Apr2026/** folder.



---

> You need **sudo/root privileges** to run the following commands.

### 1. Load the precompiled image

```bash
$ docker load -i AD_ensembles_Apr2026.tar
```

### 2. Or build it yourself

```bash
/docker$ make build
```

### 3. Run all experiments

```bash
/docker$ make run     # Runs run.sh
```

> The complete run may take **a few days**, depending on your machine.

### 4. Run a quick test

```bash
/docker$ make test    # Runs run_test.sh
```

### 5. Access the container shell

```bash
/docker$ make shell
```

### 6. Stop the container

```bash
/docker$ make clean
```

### 7. Remove the image completely

```bash
/docker$ make nuke    # Force deletion
```

---

## Generated Data and Results

When running `run.sh`, the following folders are created or overwritten:

### `data/globloc/`

- Collection with 100 generated CSV datasets.

### `results/`

- Subfolders organized by dataset collection:
  - `results/globloc/`
  - `results/adbench/`
  - `results/cic2017bin/`  

Each subfolder contains:

| Folder             | Description                                                                                  |
| ------------------ | -------------------------------------------------------------------------------------------- |
| `ad_performances/` | Tables with anomaly detection performance per dataset, algorithm, and metric.                |
| `ad_scores/`       | Raw anomaly detection scores per algorithm and dataset.                                      |
| `summaries/`       | Tables, figures, measurements, statistical tests, CDD, etc., extracted from all experiments. |
| `metadata/`        | *(Only in globloc)* Meta-data from the data generation process.                              |
| `results/`         | Extracted statistics and performance related to all the different ensemble options tested.   |
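
After a run, a quick way to see whether a collection produced every folder listed in the table above is to check the directory tree. This is a small sketch under the assumption that the layout is exactly `results/<collection>/<folder>/`; the function name `missing_result_folders` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Folder names taken from the table above; `metadata/` exists only for globloc.
EXPECTED_SUBFOLDERS = ("ad_performances", "ad_scores", "summaries", "results")


def missing_result_folders(results_root, collection):
    """List expected subfolders that are absent under <results_root>/<collection>/."""
    base = Path(results_root) / collection
    expected = list(EXPECTED_SUBFOLDERS)
    if collection == "globloc":
        expected.append("metadata")
    return [name for name in expected if not (base / name).is_dir()]


if __name__ == "__main__":
    for name in ("globloc", "adbench", "cic2017bin"):
        print(name, missing_result_folders("results", name))
```

An empty list means all expected folders are present for that collection.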

---

## Details on the main scripts

- **run_ad.py** runs all anomaly detection experiments with individual algorithms in a dataset collection.
- **run_ensemble.py** uses the scores obtained with *run_ad.py* to run all studied ensemble combinations for anomaly detection in a dataset collection.
- **run_selected_en.py** uses the scores obtained with *run_ad.py* to extract the performance of the recommended ensemble [IForest, PCA, SDO] for anomaly detection in a dataset collection.
- **summary.py** extracts tables, figures and statistical tests from a complete collection of experiments.
- **mcc_sim.py** runs correlation tests based on the MCC metric.
- **formalHE.py** simulates the anomaly detection problem solved with individual algorithms and ensembles.
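
To make the idea behind the scripts above more concrete, the sketch below combines per-detector anomaly scores into one ensemble score and evaluates a thresholded result with the Matthews correlation coefficient (MCC). It is an illustration only: min-max normalization, plain averaging, and the 0.5 threshold are assumptions chosen for simplicity, not the combination rule implemented in `run_ensemble.py`, and all function names and toy values are hypothetical.

```python
import math


def min_max(scores):
    """Rescale a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def average_ensemble(score_lists):
    """Average min-max normalized scores across detectors, point by point."""
    normalized = [min_max(scores) for scores in score_lists]
    return [sum(column) / len(column) for column in zip(*normalized)]


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = anomaly)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: MCC is reported as 0 when the denominator vanishes.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Toy scores from three hypothetical detectors (higher = more anomalous).
    scores = [
        [0.1, 0.2, 0.9, 0.3],
        [10.0, 12.0, 30.0, 11.0],
        [1.0, 1.0, 5.0, 2.0],
    ]
    labels = [0, 0, 1, 0]
    combined = average_ensemble(scores)
    predicted = [1 if s >= 0.5 else 0 for s in combined]  # assumed threshold
    print(combined, mcc(labels, predicted))
```

Normalizing before averaging keeps a detector with a large raw score range from dominating the combined score.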
