## Regression model 
This solution uses an **Extra Random Trees** regression model to forecast the next horizons. Like random forests, extra trees are an ensemble model. In addition to sampling features at each split of a tree, they also sample a random threshold at which to make the split.

The additional randomness may improve the model's ability to generalize compared to a random forest, and may therefore yield better results.
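To make the difference in split selection concrete, here is a minimal sketch using only the Python standard library. It contrasts an exhaustive threshold search (random-forest style) with a single uniformly drawn threshold (extra-trees style) on one feature. The function names and the toy data are illustrative assumptions, not part of the solution. In practice, extra trees typically draw one random threshold per candidate feature and keep the best of those draws.

```python
import random


def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def split_cost(xs, ys, threshold):
    """Total SSE of the two children produced by splitting at `threshold`."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    return sse(left) + sse(right)


def best_threshold(xs, ys):
    """Random-forest style: try every midpoint and keep the cheapest one."""
    points = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    return min(candidates, key=lambda t: split_cost(xs, ys, t))


def random_threshold(xs, rng):
    """Extra-trees style: draw one threshold uniformly within the feature range."""
    return rng.uniform(min(xs), max(xs))


# Toy data (illustrative only): the target jumps between x = 3 and x = 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 0.9, 1.0, 5.0, 5.2, 4.9]

rng = random.Random(0)
exhaustive = best_threshold(xs, ys)
randomized = random_threshold(xs, rng)
print("exhaustive threshold:", exhaustive)
print("random threshold:", randomized)
```

A single random threshold is usually worse than the exhaustive one for a given tree, but averaging many such trees reduces the correlation between them, which is where the potential generalization gain comes from.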

## Features 
Four predictive variables are always part of the model regardless of the chosen parameters: **date**, **category**, **forecast** and **horizon**.

Time series forecasts and horizon values are computed beforehand using the simple forecast methodology and are included in the regression model as predictive variables. In the Dataiku application, the user can set the number of target-variable **lag values** and select **drivers** to include in the model.
 
**Feature handling**:
- Date values are encoded using a cyclical DateTime yearly encoding technique.
- Horizon and category variables are dummy encoded.
- Other variables are treated as numerical.
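The encodings listed above can be sketched as follows. This is a minimal illustration using only the standard library, and it assumes a sine/cosine encoding of the day of the year for the cyclical date encoding. The exact encoding applied by Dataiku may differ, and the helper names `encode_date_yearly` and `dummy_encode` are hypothetical.

```python
import calendar
import math
from datetime import date


def encode_date_yearly(d):
    """Map a date to (sin, cos) of its position within the year.

    Assumption: the yearly cycle is encoded with a sine/cosine pair so that
    31 December and 1 January end up close to each other.
    """
    days_in_year = 366 if calendar.isleap(d.year) else 365
    position = (d.timetuple().tm_yday - 1) / days_in_year
    angle = 2 * math.pi * position
    return math.sin(angle), math.cos(angle)


def dummy_encode(values, prefix):
    """One 0/1 indicator column per distinct value (dummy encoding)."""
    categories = sorted(set(values))
    return [{f"{prefix}_{c}": int(v == c) for c in categories} for v in values]


jan_first = encode_date_yearly(date(2023, 1, 1))
dec_last = encode_date_yearly(date(2023, 12, 31))
encoded_categories = dummy_encode(["a", "b", "a"], prefix="category")
print(jan_first, dec_last)
print(encoded_categories)
```

With this kind of encoding, dates at the end of one year and the start of the next are neighbors in feature space, which a plain numeric day index would not capture.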

## Model evaluation
To evaluate the performance of each forecast, we compare their results over the test set. The test set consists of the last data points of the historical data, and its length matches the number of horizons to forecast.

**Example**: 

As illustrated below, suppose we have a time series of 9 data points (historical data) and want to forecast the next three horizons (horizons 1, 2, and 3). The test set would include the last three data points of the historical data.

![Screenshot 2023-01-12 at 12.04.20.png](9uVQsDE3PzpS)
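The split in this example can be sketched in a few lines of Python. The helper name `split_train_test` and the placeholder values are assumptions for illustration; the solution itself builds the train and test sets as Dataiku datasets.

```python
def split_train_test(series, n_horizons):
    """Hold out the last `n_horizons` points of the history as the test set."""
    if not 0 < n_horizons < len(series):
        raise ValueError("n_horizons must be between 1 and len(series) - 1")
    return series[:-n_horizons], series[-n_horizons:]


# Nine historical data points (placeholder values) and three horizons.
history = [1, 2, 3, 4, 5, 6, 7, 8, 9]
train, test = split_train_test(history, n_horizons=3)
print("train:", train)
print("test:", test)
```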

In the Train / test set section of the visual analysis tool, the policy is set to "Explicit extracts from two datasets", which allows using extracts from two different datasets: one for training and one for testing.

## Hyperparameters

The hyperparameters of the **Extra Random Trees** regression model are:
- Number of trees: **20**
- Max trees depth: **8** (Maximum depth of each tree in the forest. Higher values generally increase the quality of the prediction, but can lead to overfitting. High values also increase the training and prediction time.)
- Min samples per leaf: **5** (Minimum number of samples required in each leaf of a tree. Lower values let the tree split more, which can increase the quality of the prediction, but can lead to overfitting and increased training and prediction time.)
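For readers who want to experiment outside Dataiku, the values above can be reproduced with scikit-learn's `ExtraTreesRegressor`. This is a sketch that assumes the Dataiku algorithm behaves like the scikit-learn estimator, and it uses synthetic data rather than the solution's datasets.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, size=200)

model = ExtraTreesRegressor(
    n_estimators=20,     # Number of trees
    max_depth=8,         # Max trees depth
    min_samples_leaf=5,  # Min samples per leaf
    random_state=0,      # fixed seed so the sketch is reproducible
)
model.fit(X, y)

# Check that no fitted tree grew deeper than the configured limit.
max_depth_seen = max(tree.get_depth() for tree in model.estimators_)
print("trees:", len(model.estimators_), "deepest tree:", max_depth_seen)
```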

The hyperparameters can be changed by updating the third step, named **Custom Python Regression model**, of the following scenario: [11. Regression model](scenario:11REGRESSIONMODEL)

![Screenshot 2023-01-12 at 14.08.47.png](l6Zee4fUPoa1)

The following lines of code (lines 130 to 135) need to be updated by changing the value assigned to each parameter:

Edit number of trees:
```settings_raw['modeling'][algorithm_name]['n_estimators']['values'][0] = 20```
Edit max trees depth:
```settings_raw['modeling'][algorithm_name]['max_tree_depth']['values'][0] = 8```
Edit min samples per leaf:
```settings_raw['modeling'][algorithm_name]['min_samples_leaf']['values'][0] = 5```
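The three assignments above can also be grouped into one small helper. The sketch below runs on a plain dictionary that mimics only the nesting used in those lines. The helper name `set_hyperparameters` and the `"extra_trees"` key are hypothetical, and in the scenario step `settings_raw` and `algorithm_name` come from the existing code rather than from a hand-built dictionary.

```python
def set_hyperparameters(settings_raw, algorithm_name,
                        n_estimators, max_tree_depth, min_samples_leaf):
    """Overwrite the first value of each hyperparameter in place."""
    params = settings_raw["modeling"][algorithm_name]
    params["n_estimators"]["values"][0] = n_estimators
    params["max_tree_depth"]["values"][0] = max_tree_depth
    params["min_samples_leaf"]["values"][0] = min_samples_leaf
    return settings_raw


# Stand-in for the real settings object (hypothetical key name and layout).
algorithm_name = "extra_trees"
settings_raw = {
    "modeling": {
        algorithm_name: {
            "n_estimators": {"values": [20]},
            "max_tree_depth": {"values": [8]},
            "min_samples_leaf": {"values": [5]},
        }
    }
}

# Example: try a larger, deeper forest.
set_hyperparameters(settings_raw, algorithm_name,
                    n_estimators=50, max_tree_depth=10, min_samples_leaf=3)
print(settings_raw["modeling"][algorithm_name])
```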



