# Overview 
In this example project we leverage Dataiku visual capabilities to build forecasting models using two different methods:
- the first one based on statistical and Deep Learning time series models that leverage the sequential nature of our dataset
- the second one using a more traditional Machine Learning approach

In practice we will implement the following steps:
 1. explore our dataset using Dataiku charts and statistics tools. In this context, we will also leverage the Time Series Preparation plugin to resample and extract insights on our time series. 
 2. use some visual recipes (Prepare, Join, Window) to clean and enrich our dataset. 
 3. perform the two different analyses. 
 4. compare our models and display a forecast. 


# Data 
The two input datasets are located in the [(1. Input data)](flow_zone:default) flow zone. 

The [sales_data](dataset:sales_data) contains aggregated sales data for 10 Walmart stores. It was built starting from [this famous Kaggle dataset](https://www.kaggle.com/competitions/walmart-recruiting-store-sales-forecasting/data) and contains Weekly Sales values between 2010-02-05 and 2012-10-26. The goal of the analysis is to predict a three-month horizon of future sales (12 data points corresponding to 12 weeks) for each of the 10 stores. Our data is in a [long format](https://doc.dataiku.com/dss/latest/time-series/data-formatting.html): a time series identifier (_Store_) identifies each series. Also note that the date column has already been [parsed](https://doc.dataiku.com/dss/latest/preparation/dates.html#parsing-dates). 
Finally some columns are filled for both historical and future dates (_Date_, _Store_, _IsHoliday_). In particular the _IsHoliday_ column is an external regressor that is known for future dates and can help the model make better predictions. The _Temperature_ and _Fuel_price_ columns are external regressors that are not known for future dates.  
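
To make the long format concrete, here is a minimal Python sketch that regroups long-format rows into one series per store. The column names follow this page, but the values are made up for illustration and the helper `split_by_identifier` is hypothetical, not part of Dataiku.

```python
from collections import defaultdict

def split_by_identifier(rows, identifier="Store"):
    """Group long-format rows into one chronological series per identifier."""
    series = defaultdict(list)
    # ISO-formatted date strings sort chronologically.
    for row in sorted(rows, key=lambda r: r["Date"]):
        series[row[identifier]].append((row["Date"], row["Weekly_Sales"]))
    return dict(series)

# Illustrative rows only (not taken from the real dataset).
long_rows = [
    {"Store": 2, "Date": "2010-02-05", "Weekly_Sales": 210.0},
    {"Store": 1, "Date": "2010-02-12", "Weekly_Sales": 110.0},
    {"Store": 1, "Date": "2010-02-05", "Weekly_Sales": 100.0},
    {"Store": 2, "Date": "2010-02-12", "Weekly_Sales": 205.0},
]
per_store = split_by_identifier(long_rows)
```

Each key of `per_store` is a store number and each value is that store's series, ordered by date.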
 
The [store_data](dataset:store_data) dataset contains information about the stores themselves (_Type_ and _Size_). 

  
# Walkthrough 

## Visual Time Series Forecasting

### Explore and prepare our data 
We first implement a traditional approach where we perform statistical analysis on the time series before applying statistical and Deep Learning models. This analysis is done in the [(2. Visual Time Series Forecasting)](flow_zone:Rv3R8gG) and [(3. Time Series Statistics and decomposition)](flow_zone:tgN0Pof) flow zones. 

We start by extracting the historical sales data on which we will train our models using a [Filter recipe](recipe:compute_sales_date_training).  

On the [historical dataset](dataset:historical_sales), we then leverage the Charts tab to plot different insights that we publish to the default project [dashboard](dashboard:8D0Iylg). 

At this stage we notice that some data points are missing, which is a problem for statistical models. We thus implement a [Resampling recipe](recipe:compute_resampled_historical_data). We choose a linear interpolation to compute values for missing dates located between two existing data points and we do not extrapolate. Note that this recipe is part of the Time Series Preparation plugin mentioned in the technical requirements of this example project (details in the Technical requirements section). 
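
The Resampling recipe needs no code. Purely to illustrate what linear interpolation without extrapolation means, here is a minimal Python sketch. It assumes the observed dates already sit on a weekly grid; the function name and the sample values are hypothetical, and this is not the plugin's implementation.

```python
from datetime import date, timedelta

def resample_weekly_linear(points):
    """Fill missing weekly dates between the first and last observation.

    `points` maps date -> value. Gaps are filled by linear interpolation;
    nothing is generated before the first or after the last observed date
    (no extrapolation).
    """
    known = sorted(points.items())
    step = timedelta(weeks=1)
    filled = {}
    for (d0, v0), (d1, v1) in zip(known, known[1:]):
        span = (d1 - d0).days
        current = d0
        while current < d1:
            frac = (current - d0).days / span
            filled[current] = v0 + frac * (v1 - v0)
            current += step
    last_date, last_value = known[-1]
    filled[last_date] = last_value
    return filled

# Hypothetical weekly sales where the 2010-02-19 data point is missing.
sales = {
    date(2010, 2, 5): 100.0,
    date(2010, 2, 12): 110.0,
    date(2010, 2, 26): 130.0,
}
resampled = resample_weekly_linear(sales)
```

Here the missing week receives the midpoint of its two neighbours, and no date outside the observed range is created.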

After resampling we perform a more exhaustive exploration leveraging the Statistics tab of the [resampled dataset](dataset:resampled_historical_data). In particular we add to the [dashboard](dashboard:8D0Iylg) some partial and regular autocorrelation charts. We notice that the time series displays a positive autocorrelation at several lags. This means that the Weekly Sales value for a given timestamp is strongly correlated with its recent past values.
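
For readers who want to see what an autocorrelation chart measures, the following Python sketch computes the standard sample autocorrelation at a given lag. It is an illustration on a toy series, not the computation used by the Statistics tab.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return covariance / variance

# Toy upward-trending series: consecutive values are positively correlated.
toy_series = [1, 2, 3, 4, 5, 6, 7, 8]
lag_one = autocorrelation(toy_series, 1)
```

A value close to 1 at a given lag indicates that the series strongly resembles itself shifted by that many time steps.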

We also leverage the [Time Series Decomposition recipe](recipe:compute_decomposed_time_series) (also part of the Time Series Preparation plugin) to decompose our time series into trend, seasonal and residual components, starting from the [resampled_historical_data](dataset:resampled_historical_data) dataset. This allows us to add an insightful representation of the series to our [dashboard](dashboard:8D0Iylg), highlighting the high seasonality of our data. 

### Train and deploy statistical and Deep Learning Models

From the [historical dataset](dataset:historical_sales), we go to the lab and select Time Series Forecasting. Here we want to create a forecasting model on _Weekly Sales_ through time _Date_. We also need to specify the identifier column (_Store_). 

In the Design tab we set up the training. In particular: 
- In the general settings, we change the day of week to Friday. We do not add a gap because we want to predict right after the historical data without skipping any time step. We also stay with the default quantiles. For statistical and Deep Learning models, these forecast quantiles will give us lower and upper bounds for prediction intervals. 
- In the Train/Test set section, we resample directly in the lab (keeping the default parameters) and we use a K-Fold cross test (three folds) in order to obtain error margins on metrics. 
- For the external features, we enable only the external regressor for which future dates are known: _IsHoliday_. We could find ways to use the _Temperature_ and _Fuel_Price_ features but it would require some additional preprocessing and is not the focus of this example project. 
- In the algorithms section we add Seasonal Naive, Auto-ARIMA, Transformer and MQ-CNN. 
- Finally we change the runtime environment to timeseries_36 (details in the Technical requirements section). 
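
To give an intuition for the simplest algorithm in the list above, here is a minimal Python sketch of a seasonal naive forecast: each future value repeats the value observed one season earlier. The function name, the season length and the toy data are assumptions for illustration; the lab's own implementation may differ.

```python
def seasonal_naive_forecast(history, horizon, season_length):
    """Repeat the last observed season to cover `horizon` future steps."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = history[-season_length:]
    return [last_season[i % season_length] for i in range(horizon)]

# Toy series with a season of 4 steps (hypothetical values).
toy_history = [10, 20, 30, 40, 11, 21, 31, 41]
toy_forecast = seasonal_naive_forecast(toy_history, horizon=6, season_length=4)
```

Such a baseline is useful as a reference point: a more complex model is only worth keeping if it beats it.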

We launch the training of the models and analyse the results. The best-performing model is an Auto-ARIMA which confirms the seasonal nature of the series we are studying. By clicking on the model name, we have access to interesting insights such as detailed metrics and training summary that we can add to the [dashboard](dashboard:8D0Iylg).

We can finally deploy our [model](saved_model:eXiylBaA) to the Flow.

### Forecast with statistical and Deep Learning Models

After deploying our model to the Flow we apply a [Scoring recipe](recipe:score_sales_data) to the full [dataset](dataset:sales_data) (including past and future dates). We use this dataset because statistical and Deep Learning models require historical data in order to predict future values. 

In the output dataset, some columns are added: the forecast and the 9 quantiles on the prediction. By going to the Charts tab we can draw useful insights such as the next forecasted values and the 4th and 6th quantiles, which we add to the [dashboard](dashboard:8D0Iylg). 

We are now done with the statistical and Deep Learning approach! 

## Visual Machine Learning approach
The classical approach we have implemented has some limitations. In particular, the use of other covariates is limited and there is no real cross-learning between time series. In certain situations a regular ML approach might do a better job. That's what we implement in the [(4. Visual Machine Learning)](flow_zone:lKkDXMK) flow zone.

### Enriching the dataset: Join, Prepare, Window recipes

In this new context, the store number is considered as a regular feature and no longer as a time series identifier. It thus makes sense to enrich the [sales_data](dataset:sales_data) with the [store_data](dataset:store_data) using a [Join recipe](recipe:compute_history_join_store_features). 
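
As a rough code equivalent of this enrichment, the sketch below performs a left join on the _Store_ column in plain Python. The join type is an assumption (the page does not state which one the recipe uses), and the helper name and sample values are hypothetical.

```python
def left_join(left_rows, right_rows, key):
    """Keep every left row and add the matching right-hand columns.

    Right-hand columns are set to None when no match exists.
    """
    lookup = {row[key]: row for row in right_rows}
    extra_columns = {c for row in right_rows for c in row if c != key}
    joined = []
    for row in left_rows:
        match = lookup.get(row[key], {})
        enriched = dict(row)
        for column in extra_columns:
            enriched[column] = match.get(column)
        joined.append(enriched)
    return joined

# Illustrative rows only (values are made up).
sales_rows = [
    {"Store": 1, "Date": "2010-02-05", "Weekly_Sales": 100.0},
    {"Store": 2, "Date": "2010-02-05", "Weekly_Sales": 210.0},
]
store_rows = [{"Store": 1, "Type": "A", "Size": 150000}]
joined_rows = left_join(sales_rows, store_rows, key="Store")
```

Every sales row is preserved, and store attributes are simply attached as additional feature columns.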

On the [output dataset](dataset:history_join_store_features), we then extract static attributes from the parsed date (year, month and week number) in order to enrich our data with additional features. We do this using a [Prepare recipe](recipe:compute_history_join_store_features_prepared). 
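
The same kind of date features can be sketched in Python as follows. The output names `Date_month` and `Date_week_of_year` match the columns used later on this page, while `Date_year` is assumed by analogy. The week number here is the ISO week, which may not match the exact convention used by the Prepare recipe.

```python
from datetime import date

def date_features(d):
    """Extract year, month and ISO week number from a date."""
    _, iso_week, _ = d.isocalendar()
    return {
        "Date_year": d.year,
        "Date_month": d.month,
        "Date_week_of_year": iso_week,
    }

# First date of the sales history mentioned in the Data section.
first_week = date_features(date(2010, 2, 5))
```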

Finally we augment the dataset with lagged values using a [Window recipe](recipe:compute_training_with_lagged_features). Lagged values correspond to past values of certain features in a similar set-up (for instance last year sales value in the same store, for the same week). 

We define two windows: one partitioned on the columns _"Store"_ and _"Date_week_of_year"_ to retrieve information for the same store and week, and one partitioned on the columns _"Store"_ and _"Date_month"_ to retrieve information for the same store and month. 

To avoid data leakage (using a lagged value that will not be available at prediction time) we add the _"Date"_ column as the order column, and use a window upper bound of minus three months. This ensures that no aggregation takes into account the previous 12 weeks. 

Finally in the aggregations section, we retrieve all column values and add _"avg"_ and _"last"_ for the _Weekly_Sales_ column, as well as _"avg"_ for both _"Temperature"_ and _"Fuel price"_. 
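
The leakage-safe window described above can be sketched in Python as follows. This is a simplified illustration, not the Window recipe itself: it approximates the three-month bound with a 12-week gap, only computes the average, and uses hypothetical values and a hypothetical output column name.

```python
from datetime import date, timedelta

def lagged_average(rows, partition_keys, value_key, gap=timedelta(weeks=12)):
    """Add `<value_key>_avg`: the mean of past values in the same partition.

    Only rows dated at least `gap` before the current row are used, so the
    feature never relies on values unavailable at prediction time.
    """
    enriched = []
    for row in rows:
        cutoff = row["Date"] - gap
        past = [
            other[value_key]
            for other in rows
            if all(other[k] == row[k] for k in partition_keys)
            and other["Date"] <= cutoff
            and other[value_key] is not None
        ]
        average = sum(past) / len(past) if past else None
        enriched.append({**row, f"{value_key}_avg": average})
    return enriched

# Same store, same week of year, three consecutive years (made-up values).
window_rows = [
    {"Store": 1, "Date": date(2010, 2, 5), "Date_week_of_year": 5, "Weekly_Sales": 100.0},
    {"Store": 1, "Date": date(2011, 2, 4), "Date_week_of_year": 5, "Weekly_Sales": 120.0},
    {"Store": 1, "Date": date(2012, 2, 3), "Date_week_of_year": 5, "Weekly_Sales": None},
]
with_lags = lagged_average(
    window_rows, ["Store", "Date_week_of_year"], "Weekly_Sales"
)
```

The first row has no usable history and gets a missing value, which is why rows without lag information are dropped before training.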

### Train and deploy a Machine Learning Model
Before training our model we split our dataset between the labelled and unlabelled rows using a [Split recipe](recipe:split_training_with_lagged_features_prepared). 

We then create an [AutoML analysis](analysis:VsLR8E6C) on the [train_set](dataset:train_set) on _"Weekly Sales"_. In the Design Tab we adjust some training parameters. In particular: 
- In the Train/Test Set tab, we enable time-ordering to prevent the model from being trained on timestamps that are later than the dates it is evaluated on. 
- In the metrics section, we choose Mean Absolute Percentage Error. 
- In the features handling tab, we remove redundant columns (_Date_) and make sure categorical columns are treated as such (in particular the year, month and week of year, as well as _Holidays_). For _same_week_store_Weekly Sales_avg_, we choose to drop rows when the value is missing. This ensures that we remove rows for which we do not have lag information. 
- For algorithms, we stick to the default ones and add an XGBoost model.
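
To illustrate why time-ordering matters, here is a minimal Python sketch of a time-ordered train/test split: rows are sorted by date and the most recent ones are held out. The 80/20 ratio and the helper name are assumptions for illustration, not settings read from the project.

```python
def time_ordered_split(rows, time_key, train_ratio=0.8):
    """Train on the earliest rows and test on the most recent ones."""
    ordered = sorted(rows, key=lambda r: r[time_key])
    cut = int(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]

# Ten made-up weekly rows identified by an ISO date string.
toy_rows = [{"Date": f"2012-01-{day:02d}"} for day in range(1, 11)]
train_rows, test_rows = time_ordered_split(toy_rows, "Date")
```

With this split, every training row precedes every test row, which mimics the real forecasting situation.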

After training, the Random Forest is the best-performing model and reaches a MAPE that is comparable to the one obtained with statistical models. The lagged features have a very high weight in the final prediction. This confirms that this time series is very seasonal and stable year-on-year. 
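
As a reminder of what the metric measures, MAPE is the mean of the absolute errors expressed relative to the actual values. A minimal Python sketch, assuming no actual value equals zero and using made-up numbers:

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (actual values must be non-zero)."""
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)

# Both predictions are off by 10% of the actual value.
example_mape = mape([100.0, 200.0], [110.0, 180.0])
```

Because errors are relative, MAPE makes stores with very different sales volumes comparable.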

### Forecast
Once deployed to the Flow, we select the [model](saved_model:AOsAYuQ0) and the [future dataset](dataset:to_score_set) and use a [Scoring recipe](recipe:score_to_score_set) to obtain a prediction for future dates. Using the Charts tab on the [output dataset](dataset:to_score_set), we plot a line chart with predictions for the upcoming 12 weeks split by store and add it to the last slide of our [dashboard](dashboard:8D0Iylg). 

We are now done with our two different approaches. Feel free to go deeper by trying out other parameters, features and pre-processing! 

# Next: Implement your own time series forecasting

## Technical requirements
This project:
- leverages features available starting from Dataiku 11.
- uses the [Time Series Preparation Plugin](https://www.dataiku.com/product/plugins/timeseries-preparation)
- requires creating a dedicated code environment. When creating the code environment, call it _timeseries_36_ and choose Python 3.6 or a higher version. In the _packages to install_ section, select the _Visual Time Series Forecasting_ set of packages. It is possible to use another name for the code environment. If you do so, you will simply need to remap the old name to the new name when importing the project.

## How to reuse this project

The project can be downloaded [here](https://downloads.dataiku.com/public/dss-samples/EX_TIMESERIES/).

Once you have imported the project, you will simply have to build the whole Flow (Flow actions > build all > build required dependencies) to be able to explore the project in detail. 
All the datasets are stored on the filesystem, so no remapping will be needed. However, you have the option to [change the connection type](https://knowledge.dataiku.com/latest/courses/flow-views-and-actions/connection-changes-concept-summary.html#connection-changes) if you want to rely on a specific data storage type. 

If you want to leverage certain parts of the Flow directly, keep in mind that Dataiku allows you to reuse elements at different levels. In particular it is possible to: 
- [duplicate a whole project](https://knowledge.dataiku.com/latest/kb/governance/How-to-duplicate-a-DSS-project.html)
- [copy and paste entire subflows](https://knowledge.dataiku.com/latest/courses/flow-views-and-actions/connection-changes-concept-summary.html#copy-subflow) 
- [copy and paste recipes](https://knowledge.dataiku.com/latest/kb/collaboration/How-to-copy-a-recipe-in-your-Flow.html) and remap input/output datasets 
- [copy and paste preparation steps](https://doc.dataiku.com/dss/latest/preparation/copy-steps.html) within a Prepare recipe 

It is also possible to directly integrate your datasets within the Flow by uploading them to the project and then changing the input/output of existing recipes. However, in this case, you will need to make sure that you propagate the schema (your own column names and storage types) properly (you will find an example [here](https://knowledge.dataiku.com/latest/courses/use-cases/classification-oil-and-gas/schema-propagation.html)) and that you respect some constraints (in particular your dates should be in a parsed format). 


# Related Resources
- [Time Series Dataiku official Documentation](https://doc.dataiku.com/dss/latest/time-series/index.html)
- [Dataiku Academy: Time Series Basics](https://academy.dataiku.com/time-series-basics-1)
- [Dataiku Academy: Time Series Preparation](https://academy.dataiku.com/path/ml-practitioner/time-series-preparation-1)


