Outlier Detection

When performing clustering, it is generally recommended to detect outliers first. Otherwise, the result may be heavily skewed clusters, or many small clusters alongside one cluster containing almost the whole dataset.

DSS detects outliers by performing a pre-clustering with a large number of clusters and considering the smallest "mini-clusters" as outliers, if:

Once outliers are detected, you can either:

Note that this may increase training time significantly.

Warning: this may slow down training significantly on an MLlib backend.
Should usually not be higher than 10%
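
To make the pre-clustering idea concrete, here is a minimal sketch in plain Python. It is not DSS's implementation: the small k-means routine is a stand-in for whatever pre-clustering DSS runs, and the names `pre_cluster`, `detect_outliers`, `n_mini_clusters`, `min_cluster_size` and `max_outlier_fraction` are hypothetical. The two size conditions are illustrative assumptions only, not the actual DSS criteria.

```python
from collections import Counter


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def pre_cluster(points, k, iterations=20):
    """Small k-means with deterministic farthest-first seeding; returns labels."""
    centers = [points[0]]
    while len(centers) < k:
        # Seed the next center at the point farthest from all current centers.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = []
    for _ in range(iterations):
        # Assign every point to its nearest center (ties go to the lowest index).
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def detect_outliers(points, n_mini_clusters=10, min_cluster_size=3,
                    max_outlier_fraction=0.1):
    """Return indices of points that fall in the smallest mini-clusters.

    Assumed (hypothetical) conditions: a mini-cluster is flagged while it has
    fewer than `min_cluster_size` points and the cumulated number of flagged
    points stays within `max_outlier_fraction` of the dataset.
    """
    labels = pre_cluster(points, n_mini_clusters)
    budget = max_outlier_fraction * len(points)
    flagged, total = set(), 0
    # Walk the mini-clusters from smallest to largest.
    for label, size in sorted(Counter(labels).items(), key=lambda item: item[1]):
        if size >= min_cluster_size or total + size > budget:
            break
        flagged.add(label)
        total += size
    return [i for i, lab in enumerate(labels) if lab in flagged]


# A 5 x 5 grid of regular points plus two far-away points (indices 25 and 26).
points = [(float(x), float(y)) for x in range(5) for y in range(5)]
points += [(100.0, 100.0), (-80.0, 50.0)]
outliers = detect_outliers(points, n_mini_clusters=6)
print(outliers)
```

On this toy dataset the two distant points each end up alone in a mini-cluster and are the only ones flagged. Farthest-first seeding is used here only to keep the example deterministic; a real pre-clustering would more likely rely on a library implementation.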