# Collaborative filtering scores

With the [collaborative filtering flowzone](flow_zone:default), we use the components of the plugin to calculate scores between users and items, and to generate negative samples if needed. Once our features are ready, we train the prediction model that learns to distinguish which items are the most relevant for each user. The output of this flowzone is a trained model ready to be used for scoring and, optionally, a dataset of similarity scores between users.

## Input samples

The recommendation workflow starts with a dataset of dated user-item interactions fetched by the user in the selected SQL connection.

## Time-based split

First we split the [input_prepared](dataset:input_prepared) dataset of user-item interactions based on the timestamp to get 2 datasets of old and recent interactions:

- [samples_for_cf_scores](dataset:samples_for_cf_scores): old interactions used to compute scores between users and items
- [samples_to_train_ml](dataset:samples_to_train_ml): recent interactions used to get positive samples to train an ML model on the affinity scores

It's important to train the ML model with more recent interactions than the ones used to compute the affinity scores to prevent data leakage. In production, all interactions are used to compute affinity scores and new samples are scored by the model.
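To make the split concrete, here is a minimal sketch in plain Python. The column names (`user`, `item`, `timestamp`), the cutoff date and the helper `time_based_split` are assumptions made for illustration; in the flow itself the split is done by a recipe on [input_prepared](dataset:input_prepared).

```python
from datetime import datetime

def time_based_split(interactions, cutoff):
    """Split interactions into (old, recent) around a cutoff timestamp.

    Hypothetical helper: rows strictly before the cutoff feed the
    collaborative filtering scores, later rows provide positive samples.
    """
    old = [row for row in interactions if row["timestamp"] < cutoff]
    recent = [row for row in interactions if row["timestamp"] >= cutoff]
    return old, recent

# Toy data with assumed column names.
interactions = [
    {"user": "u1", "item": "i1", "timestamp": datetime(2023, 1, 5)},
    {"user": "u1", "item": "i2", "timestamp": datetime(2023, 3, 1)},
    {"user": "u2", "item": "i1", "timestamp": datetime(2023, 1, 20)},
    {"user": "u2", "item": "i3", "timestamp": datetime(2023, 3, 15)},
]

samples_for_cf_scores, samples_to_train_ml = time_based_split(
    interactions, cutoff=datetime(2023, 2, 1)
)
```

Because every row in `samples_to_train_ml` is strictly more recent than every row in `samples_for_cf_scores`, the scores used as features never see the interactions the model is asked to predict.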

## Collaborative filtering

Then we compute multiple affinity scores using the [samples_for_cf_scores](dataset:samples_for_cf_scores) dataset of interactions and the collaborative filtering recipes.
In this flow, both [user-based](recipe:compute_cf_scores_1) and [item-based](recipe:compute_cf_scores_2) collaborative filtering scores are computed.

We could also provide our own users (or items) similarity datasets as input of the Custom collaborative filtering recipe to compute more affinity scores.
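As a rough illustration of what a user-based affinity score can look like, the sketch below builds a Jaccard similarity between users and sums the similarities of the neighbours who interacted with each item. This is an assumption for illustration only: the plugin's recipes may use a different similarity measure, normalization and neighbour selection. The function names are hypothetical. The `similarity` dictionary plays the role of a similarity dataset and could equally come from your own source.

```python
from collections import defaultdict
from itertools import combinations

def jaccard_user_similarity(interactions):
    """Return (items_by_user, similarity) from (user, item) pairs."""
    items_by_user = defaultdict(set)
    for user, item in interactions:
        items_by_user[user].add(item)

    similarity = {}
    for a, b in combinations(sorted(items_by_user), 2):
        common = items_by_user[a] & items_by_user[b]
        if common:  # keep only users sharing at least one item
            sim = len(common) / len(items_by_user[a] | items_by_user[b])
            similarity[(a, b)] = similarity[(b, a)] = sim
    return items_by_user, similarity

def user_based_scores(items_by_user, similarity):
    """Score (user, item) as the summed similarity of the neighbours
    who interacted with the item. Items the user already interacted
    with are skipped, which is a simplification of this sketch."""
    scores = defaultdict(float)
    for (user, neighbour), sim in similarity.items():
        for item in items_by_user[neighbour] - items_by_user[user]:
            scores[(user, item)] += sim
    return dict(scores)

interactions = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u2", "i3"), ("u3", "i2")]
items_by_user, similarity = jaccard_user_similarity(interactions)
scores = user_based_scores(items_by_user, similarity)
```

An item-based variant follows the same pattern with the roles of users and items swapped.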

The multiple scores are joined together into the [all_scores_samples](dataset:all_scores_samples) dataset: first a [stack recipe](recipe:compute_user_based_scores_samples_l1_stacked_unique) with distinct rows retrieves all user-item pairs, then a [full join recipe](recipe:compute_all_scores_samples) attaches each of the scores to those pairs.
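The stack-then-join logic can be sketched as follows. The function `join_scores` and the column names `score_user_based` and `score_item_based` are hypothetical; the actual column names depend on the recipes' configuration. A pair scored by only one method keeps an empty value (`None`) for the other.

```python
def join_scores(user_based, item_based):
    """Combine two {(user, item): score} mappings into one row per pair."""
    # Stack with distinct rows: the union of all scored user-item pairs.
    all_pairs = sorted(set(user_based) | set(item_based))
    # Full join: each pair gets every score, or None when it is missing.
    return [
        {
            "user": user,
            "item": item,
            "score_user_based": user_based.get((user, item)),
            "score_item_based": item_based.get((user, item)),
        }
        for user, item in all_pairs
    ]

user_based = {("u1", "i3"): 0.33, ("u2", "i2"): 0.33}
item_based = {("u1", "i3"): 0.5, ("u3", "i1"): 0.4}
all_scores_samples = join_scores(user_based, item_based)
```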

## Sampling

We have computed affinity scores for user-item pairs. Some of these pairs are interactions present in the [samples_to_train_ml](dataset:samples_to_train_ml) dataset: they are positive samples. The others are pairs that never interacted (present in neither the [samples_to_train_ml](dataset:samples_to_train_ml) nor the [samples_for_cf_scores](dataset:samples_for_cf_scores) dataset): they are negative samples.

The [Sampling](recipe:compute_samples_with_target) recipe takes as inputs the [samples_for_cf_scores](dataset:samples_for_cf_scores), [samples_to_train_ml](dataset:samples_to_train_ml) and [all_scores_samples](dataset:all_scores_samples) datasets and outputs the scored pairs with a target column indicating whether they are positive or negative samples (the ratio of negative to positive samples per user can be set with a recipe parameter).
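The labelling rule can be sketched as below. This is not the recipe's implementation: the function name, the `negatives_per_positive` parameter and the random down-sampling of negatives are assumptions made for illustration.

```python
import random

def build_samples_with_target(all_scores, cf_pairs, train_pairs,
                              negatives_per_positive=2, seed=0):
    """Label scored pairs and cap the negatives kept for each user."""
    rng = random.Random(seed)
    positives, negatives = {}, {}
    for row in all_scores:
        pair = (row["user"], row["item"])
        if pair in train_pairs:
            # Recent interaction: positive sample.
            positives.setdefault(row["user"], []).append(row)
        elif pair not in cf_pairs:
            # Never interacted in either dataset: negative candidate.
            negatives.setdefault(row["user"], []).append(row)
        # Pairs seen only in the old interactions are dropped.

    samples = []
    for user in sorted(positives):
        pos = positives[user]
        neg = negatives.get(user, [])
        n_neg = min(len(neg), negatives_per_positive * len(pos))
        samples += [{**row, "target": 1} for row in pos]
        samples += [{**row, "target": 0} for row in rng.sample(neg, n_neg)]
    return samples

all_scores = [{"user": "u1", "item": item, "score": 0.1 * k}
              for k, item in enumerate(["i1", "i2", "i3", "i4", "i5"], start=1)]
samples_with_target = build_samples_with_target(
    all_scores,
    cf_pairs={("u1", "i1")},
    train_pairs={("u1", "i2")},
    negatives_per_positive=2,
)
```

In this toy example, `u1` ends up with one positive sample (`i2`) and two of the three negative candidates, while the pair already seen in the old interactions (`i1`) is excluded.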

## Model training

The [samples_with_target](dataset:samples_with_target) output dataset can finally be used to train a Machine Learning model to predict the target column using the score columns as features.


# Duplicate flow for scoring

Once the ML model is trained, it can be used in production to score samples whose affinity scores are computed from all past interactions (that is, without the time-based split used to train the model). The duplicated flow can be found in [Duplicate flow for scoring](flow_zone:mA1AoA0).

To compute the affinity scores used to train a model, only a subset of the interactions was used. Some interactions were left aside to have positive samples in the training.

In production, all past interactions are used to compute the samples' affinity scores. The trained model then scores these samples and the predictions are used to make recommendations.

To compute affinity scores using all past interactions, we need to duplicate the collaborative filtering recipes (with the same parameters), make them use the [input_prepared](dataset:input_prepared) dataset as input and again join all the computed scores to get the [all_scores_samples_duplicate](dataset:all_scores_samples_duplicate) dataset.

Finally, we can predict all samples that have affinity scores by scoring the [all_scores_samples_duplicate](dataset:all_scores_samples_duplicate) dataset with the trained model.

# Scoring

With the model trained and the dataset containing all the samples prepared, we can launch a scoring task to predict which items should be recommended to users. The output of this recipe indicates, for each user-item pair, whether it is a match. See the [Scoring](flow_zone:YJ3cyEF) flowzone for more details.

# Recommend to users list (Optional)

This final flowzone only appears if you filled in the last section of the Dataiku App, that is, if you scored new users with the model you previously trained. Given a new list of users, the flowzone outputs the probability of each item being recommended to each user. You can then use a visual recipe to select the top N items you'd like to recommend to each user.
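The top N selection can also be expressed in a few lines of code. The sketch below assumes the predictions are available as `(user, item, probability)` tuples; the function `top_n_items` is hypothetical and stands in for the visual recipe.

```python
import heapq
from collections import defaultdict

def top_n_items(predictions, n):
    """Return {user: [items]} with the n highest-probability items per user."""
    scored_by_user = defaultdict(list)
    for user, item, proba in predictions:
        scored_by_user[user].append((proba, item))
    return {
        user: [item for _, item in heapq.nlargest(n, scored)]
        for user, scored in scored_by_user.items()
    }

predictions = [
    ("u1", "i1", 0.9), ("u1", "i2", 0.2), ("u1", "i3", 0.7),
    ("u2", "i1", 0.4), ("u2", "i3", 0.6),
]
recommendations = top_n_items(predictions, n=2)
```

Here `u1` is recommended `i1` then `i3`, and `u2` is recommended `i3` then `i1`.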