• {{desc.script.steps.length}} preparation steps are applied
  • {{desc.modeling.algorithm}} algorithm is used
  • {{desc.core.time.timeVariable}} is set as time variable
  • {{desc.core.weight.sampleWeightVariable}} is set as sample weight variable

Partitioned source version

You can choose which saved model version you wish to use to retrain partitions

Sampling

If your dataset does not fit in RAM, you may want to subsample the data on which splitting is performed

If necessary, test data is sub-sampled to at most 1M records.
Using a fixed random seed allows for reproducible results
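The idea behind seeded subsampling can be sketched as follows. This is an illustrative stdlib-Python sketch, not the product's implementation; the `subsample` helper and its parameters are hypothetical:

```python
import random

def subsample(rows, max_records, seed=1337):
    """Hypothetical helper: randomly subsample rows to at most
    max_records. A fixed seed makes the result reproducible."""
    if len(rows) <= max_records:
        return list(rows)
    rng = random.Random(seed)  # fixed seed -> deterministic sample
    return rng.sample(rows, max_records)

rows = list(range(10_000))
sample_a = subsample(rows, 1_000)
sample_b = subsample(rows, 1_000)
assert sample_a == sample_b  # same seed, same subsample
```

Re-running with the same seed yields the same subsample, which is what makes downstream results reproducible.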

Training operation mode

You can choose how the training recipe works

  • Split: Perform the train/test splitting normally. {{desc.splitParams.ssdTrainingRatio * 100|number:0}} % of the input data is used. Performance results are available.
  • Train on 100% and split for performance: First, the actual model is trained on 100% of the input data; then a second model is trained on the train/test split to compute performance results. This greatly increases training time, since two different models must be trained.
  • K-fold: Split the dataset into {{desc.splitParams.nFolds}} folds; each fold is used in turn as a separate testing set, with the remaining {{desc.splitParams.nFolds - 1 }} folds used as the training set. This gives error margins on metrics, but greatly increases training time.
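The k-fold mode above can be sketched with a minimal index generator. This is an illustrative stdlib-Python sketch under assumed names (`kfold_indices` is hypothetical, not the product's code):

```python
def kfold_indices(n_rows, n_folds):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Each fold serves once as the test set; the remaining folds form
    the training set. Evaluating the model k times yields a spread
    of scores, hence error margins on metrics."""
    # Distribute any remainder over the first folds.
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size
```

With 10 rows and 5 folds, each of the 5 splits holds out 2 rows for testing and trains on the other 8; every row is tested exactly once.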

Note that since probability calibration is enabled, a fraction of the input dataset is used to learn the calibration parameters, so the model is not trained on 100% of the data. For regular train/test splitting, the test set is also used for calibration.

Only regular train/test splitting is available for Deep learning models. {{desc.splitParams.ssdTrainingRatio * 100|number:0}} % of the input data is used. Performance results are available.

Splitting

The variable {{desc.core.time.timeVariable}} will be used to sort the data before splitting
Proportion of the sample that goes to the train set; the rest goes to the test set.
Number of folds to divide the dataset into
Using a fixed random seed allows for reproducible results
Preserve the target variable distribution within every split.
Rows with the same group column value are assigned to the same fold.
Column containing the k-fold groups
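The grouped-fold constraint above (all rows sharing a group value land in the same fold) can be sketched as a greedy assignment. An illustrative stdlib-Python sketch; the `group_folds` helper is hypothetical, not the product's implementation:

```python
from collections import defaultdict

def group_folds(group_values, n_folds):
    """Hypothetical helper: assign each row to a fold so that all
    rows sharing a group value land in the same fold, balancing
    fold sizes greedily (largest groups placed first)."""
    counts = defaultdict(int)
    for g in group_values:
        counts[g] += 1
    fold_sizes = [0] * n_folds
    fold_of_group = {}
    for g, c in sorted(counts.items(), key=lambda kv: -kv[1]):
        # Place the group in the currently smallest fold.
        smallest = min(range(n_folds), key=fold_sizes.__getitem__)
        fold_of_group[g] = smallest
        fold_sizes[smallest] += c
    return [fold_of_group[g] for g in group_values]

groups = ["a", "a", "b", "b", "b", "c", "d", "d"]
folds = group_folds(groups, 2)
assert folds[0] == folds[1]              # group "a" stays together
assert folds[2] == folds[3] == folds[4]  # group "b" stays together
```

Because a group is never split across folds, no information about a group seen in training can leak into its test fold.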

Train set: {{recipe.inputs.main.items[0].ref}}

Distinct
{{ desc.splitParams.eftdTrain.filter.distinct ? "Duplicate rows will be removed." : "Duplicate rows are allowed."}}
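The Distinct option above amounts to dropping duplicate rows before training. A minimal stdlib-Python sketch (the `distinct_rows` helper is hypothetical, not the product's code):

```python
def distinct_rows(rows):
    """Hypothetical helper: drop duplicate rows, keeping the first
    occurrence of each and preserving input order."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row)  # rows hashed by their full value tuple
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

assert distinct_rows([[1, 2], [3, 4], [1, 2]]) == [[1, 2], [3, 4]]
```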

Filter

Test set: {{recipe.inputs.test.items[0].ref}}

Distinct
{{ desc.splitParams.eftdTest.filter.distinct ? "Duplicate rows will be removed." : "Duplicate rows are allowed."}}

Filter

Train set

Distinct
{{ desc.splitParams.efsdTrain.filter.distinct ? "Duplicate rows will be removed." : "Duplicate rows are allowed."}}

Filter

Test set

Distinct
{{ desc.splitParams.efsdTest.filter.distinct ? "Duplicate rows will be removed." : "Duplicate rows are allowed."}}

Filter

GPU options

Container configuration

Hyperparameters search

Spark configuration

Metadata

Optional. Informative labels for the model. The model:algorithm, model:date, model:name, trainDataset:dataset-name, testDataset:dataset-name, evaluation:date and evaluationDataset:dataset-name labels are automatically added.
Optional. Set the base name of model versions built by this recipe.