# Overview

This example project illustrates how to perform **text classification** in Dataiku. The goal is to assign a text passage to one of several predefined categories (or classes). This is useful in many use cases: email or support ticket routing, sentiment analysis, spam filtering, toxicity detection...

This project covers the following aspects:
- **Exploratory data analysis**: visualization of examples and identification of the most common words per class
- **Standard text classification** with several approaches:
  - Visual mode with a sparse representation (TF-IDF);
  - Visual mode with a dense representation (sentence embeddings);
  - Fine-tuning of a pre-trained language model;
- **Few-shot text classification** (just a few training examples at inference):
  - SetFit (training of a shallow classifier over a fine-tuned dense representation);
  - In-context learning with ChatGPT and a few examples;
- **Zero-shot text classification** (no training example at inference):
  - Natural language inference;
  - In-context learning with ChatGPT;
- Visualization and interpretation of predictions through **interactive scoring**, and **visualization of misclassified test examples**.

This project can be [downloaded](https://downloads.dataiku.com/public/dss-samples/EX_TEXT_CLASSIFICATION/) and instructions to reuse it with your own datasets are provided in the [last section](#next-classify-your-own-documents-1).

# Data

We use a small sample of the [Amazon Reviews dataset](https://nijianmo.github.io/amazon/index.html) with **1400 reviews evenly distributed in 7 product categories** ("Cell Phones and Accessories", "Digital Music", "Electronics", "Industrial and Scientific", "Office Products", "Software", "Toys and Games"). We only selected reviews whose length is between 100 and 400 characters to avoid excessively short texts (like "great product!") which would not be relevant for text classification.

For example, a review categorized as "Cell Phones and Accessories" is:
> Can't say enough good things.  Fits snugly and provides great all round protection.  Good grip but slides easily into pocket.  Really adds to the quality feel of the Moto G 2014.  I had the red and it looks sharp.  Highly recommended.

# Walkthrough

## Data preparation

In the [1. Data preparation and exploration](flow_zone:iFd8lg0) Flow zone, we [split](recipe:compute_train) the data into a [training set](dataset:train) and a [test set](dataset:test), each of which includes 100 examples of each class. We also prepare a [small training set](dataset:train_small) (8 examples of each class) which will be used later in the few-shot setting.
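
For readers who prefer code, a class-balanced split of this kind can be sketched in plain Python. This is only an illustration under assumptions: the `stratified_split` helper, the toy records and the random seed are hypothetical, and the project itself performs the split in the linked recipe. The `text` and `label_text` column names are the ones used by the project's dataset.

```python
import random
from collections import defaultdict


def stratified_split(records, n_train, n_test, seed=0):
    """Draw n_train training and n_test test examples per class, without overlap."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record["label_text"]].append(record)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        examples = by_label[label][:]
        if len(examples) < n_train + n_test:
            raise ValueError(f"Not enough examples for class {label!r}")
        rng.shuffle(examples)
        train.extend(examples[:n_train])
        test.extend(examples[n_train:n_train + n_test])
    return train, test


# Toy data: 2 classes with 6 reviews each (hypothetical content).
records = [
    {"text": f"{label} review {i}", "label_text": label}
    for label in ("Software", "Toys and Games")
    for i in range(6)
]
train, test = stratified_split(records, n_train=3, n_test=2)
```

With the project's data, the same idea would be applied with 100 examples per class for each of the two sets.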

## Data exploration

In the same Flow zone, we [create](recipe:compute_nau8CvPH) a [word cloud](managed_folder:nau8CvPH) for each class to identify their most common words.
![wordcloud_label_text_digital music.png](i8HrUA6AaYCr)

We also use a [web app](web_app:gp5xsjB) to see some training examples of each class. The examples can be filtered by their class and/or by keywords (exact search only or semantic search). This requires [extracting](recipe:compute_GuB6FLVF) an embedding of each training example with a semantic similarity model.
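
The semantic search idea can be illustrated with a minimal sketch: once every example has an embedding, the examples closest to the query embedding (here by cosine similarity) are returned first. The 3-dimensional vectors below are made-up stand-ins for real embeddings, and `semantic_search` is a hypothetical helper, not the web app's actual code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_search(query_vec, corpus, top_k=2):
    """Return the top_k (text, score) pairs ranked by cosine similarity."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in corpus]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy corpus: (text, embedding) pairs with invented 3-dimensional embeddings.
corpus = [
    ("Great phone case, fits snugly", [0.9, 0.1, 0.0]),
    ("I love this song at Christmas", [0.0, 0.2, 0.9]),
    ("The charger cable broke quickly", [0.7, 0.6, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # invented embedding of the user's query
results = semantic_search(query_vec, corpus)
```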

![interactive_search.png](acVQyj8QIyUW)

## Standard text classification

### Visual mode with a sparse representation (TF-IDF)

A simple approach for text classification is to **convert text passages into vectors and then use standard machine learning algorithms** such as logistic regression, random forest or gradient boosting. The key question then becomes: how to transform a text passage into a vector?

[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (or term frequency - inverse document frequency) is one way to achieve this vectorization. It returns a vector with one dimension for each word in a given vocabulary (i.e., a set of words). Each component of this vector reflects the frequency of the corresponding word in the input text compared to the entire corpus.
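
To make the computation concrete, here is a minimal sketch of one common TF-IDF variant (relative term frequency multiplied by `log(N / document frequency)`). It is an assumption for illustration only: Dataiku and other libraries typically apply their own tokenization, smoothing and normalization, so the exact values will differ.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([
            (counts[word] / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word in vocabulary
        ])
    return vocabulary, vectors


docs = ["great phone case", "great song", "great printer ink"]
vocabulary, vectors = tfidf_vectors(docs)
```

In this toy corpus, "great" occurs in every document, so its weight is zero everywhere, whereas a word such as "phone" that occurs in a single document gets a positive weight.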

In Dataiku, we can **use a TF-IDF vectorization and train machine learning models without code**. In the "Features handling" tab of an "AutoML Prediction" [visual analysis](analysis:gDQvZnQj), we can just specify that our text variable should be handled with "TF/IDF vectorization". All the other steps of [training](recipe:train_Predict_label_text__multiclass__-_TF-IDF) and [evaluating](recipe:evaluate_on_test_1) our machine learning models are then identical to those for tabular data, as illustrated in the [2. TF-IDF and classical ML model](flow_zone:default) Flow zone. With TF-IDF and the default parameters of the "Quick modeling" feature of Dataiku, we get an **F1 score** of **0.62** and an **ROC AUC** of **0.88**.

![TF-IDF](tWH1JDyQuMhP)

### Visual mode with a dense representation (sentence embeddings)

TF-IDF has several drawbacks. It does not take into consideration the order of the words in the text and it ignores the semantic similarity between words. It also makes no distinction between the various meanings of a polysemous word (e.g. "sound" as in "a loud sound", "they sound correct" or "a sound proposal").

A more effective approach, in particular if the training dataset is relatively small, is to take advantage of the vector representations (or *sentence embeddings*) obtained with a large pre-trained deep learning model such as BERT. 

In Dataiku, **sentence embeddings** can be used as simply as TF-IDF. We just need to specify "sentence embeddings" in the "Features handling" tab after [having downloaded the targeted embeddings model in a code environment](https://doc.dataiku.com/dss/latest/machine-learning/features-handling/text.html#sentence-embedding) (cf. the [3. Sentence embeddings and classical ML model](flow_zone:nCODhDO) Flow zone). This raises the **F1 score** to **0.76** and the **ROC AUC** to **0.95**.

![Sentence embeddings](kFY6wYEQX1Is)

### Fine-tuning of a pre-trained deep learning model

The previous approach leverages a frozen deep learning model pre-trained on a very large corpus of texts. Further performance gains may be achieved by **fine-tuning** this model on the specific task at hand. This is what we do in this [Python recipe](recipe:compute_UHpAVvLA) in the [4. Fine-tuning a DL model](flow_zone:pQLgfvY). 

Even if we create our model with code, we want to:
1. keep track of the different hyperparameter combinations that we tried and their associated performance metric; 
2. benefit from the visual tools associated with no-code models in Dataiku such as performance visualization and easy versioning, scoring and evaluation. 
 
To achieve this, we leverage [Experiment Tracking](https://doc.dataiku.com/dss/latest/mlops/experiment-tracking/index.html) and the Dataiku [integration with MLflow](https://doc.dataiku.com/dss/10.0/mlops/mlflow-models/using.html). 

The [recipe](recipe:compute_UHpAVvLA) is commented for more clarity but here is a summary of the key steps taken:
 1. Set up the MLflow extension, then define the experiment name and the associated [managed folder](managed_folder:UHpAVvLA) where artifacts will be stored.
 2. Define a hyperparameters grid and then for each combination of hyperparameters: load the pre-trained model, fine-tune it on our training dataset and log the model and the metrics.
 3. Find the run in the experiment that resulted in the best evaluation accuracy and automatically deploy the corresponding model as a [saved model](saved_model:qx3rUX8I) in the Flow. When clicking on the last active version, we have access to performance visualizations such as the confusion matrix, lift charts, etc.
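
The grid search and best-run selection described in the steps above can be sketched as follows. This is a simplified illustration, not the recipe's code: the hyperparameter names (`learning_rate`, `epochs`) are assumptions, the experiment-tracking calls are omitted, and `fake_train_and_evaluate` is a placeholder that returns an invented score so that the sketch runs without a GPU or a model download.

```python
from itertools import product


def run_grid(grid, train_and_evaluate):
    """Try every hyperparameter combination and return all runs plus the best one."""
    names = sorted(grid)
    runs = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        # In the real recipe, this is where the model is fine-tuned and logged.
        accuracy = train_and_evaluate(**params)
        runs.append({"params": params, "accuracy": accuracy})
    best = max(runs, key=lambda run: run["accuracy"])
    return runs, best


def fake_train_and_evaluate(learning_rate, epochs):
    """Placeholder for fine-tuning: returns an invented accuracy."""
    return round(0.5 + 0.05 * epochs - abs(learning_rate - 3e-5) * 1000, 4)


grid = {"learning_rate": [1e-5, 3e-5, 5e-5], "epochs": [1, 2]}
runs, best = run_grid(grid, fake_train_and_evaluate)
```

Keeping every run in a list mirrors the purpose of experiment tracking: all combinations remain inspectable, and the best one is selected only at the end.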
 
We can then [evaluate](recipe:evaluate_on_test) our fine-tuned model on fresh data where we obtain an **F1 score** of **0.75** and an **ROC AUC** of **0.95**.
 
Please note that the models saved for each experiment were deleted from the [experiments_finetuning managed folder](managed_folder:UHpAVvLA) to make this example project easier to download. **As a result, models cannot be directly deployed from the Experiment Tracking page**. This becomes possible again once the experiments are re-run.

## Zero-shot text classification

### Natural language inference

In the zero-shot setting, no training example is available at inference time but we can still take advantage of large models trained on massive and diverse training datasets. For example, one zero-shot approach leverages **natural language inference** models.

In a natural language inference task, we are given a **premise** and a **hypothesis** and we need to determine whether the premise entails, contradicts or is neutral with regard to the hypothesis. Let's assume that we have a model trained on natural language inference. How can we use it for our text classification task? We can simply take the text passage to classify as the premise and define one hypothesis per class: "This text is about Cell Phones and Accessories", "This text is about Digital Music"...

For all combinations of the premise and a hypothesis, we then just need to use a model fine-tuned on a natural language inference task to assess to what extent the premise entails the hypothesis and select the class with the highest score.
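
The selection logic can be sketched as below. The hypothesis wording follows the example given above, but the exact value of `HYPOTHESIS_TEMPLATE` in the project's recipe is an assumption, and `toy_entailment_score` is a deliberately naive word-overlap stand-in for a real natural language inference model, used only so that the sketch runs offline.

```python
HYPOTHESIS_TEMPLATE = "This text is about {}"  # assumed wording


def classify_zero_shot(text, labels, entailment_score):
    """Pick the label whose hypothesis is most entailed by the text (the premise)."""
    scores = {
        label: entailment_score(text, HYPOTHESIS_TEMPLATE.format(label))
        for label in labels
    }
    return max(scores, key=scores.get), scores


def toy_entailment_score(premise, hypothesis):
    """Naive stand-in for an NLI model: share of hypothesis words found in the premise."""
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().split())
    return len(premise_words & hypothesis_words) / len(hypothesis_words)


labels = ["Digital Music", "Software", "Toys and Games"]
predicted, scores = classify_zero_shot(
    "I love this music album and listen to it daily", labels, toy_entailment_score
)
```

In practice, `entailment_score` would wrap a model fine-tuned on natural language inference and return its entailment probability for the (premise, hypothesis) pair.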

This is [implemented](recipe:compute_test_scored_zero-shot_nli) in the [5. Zero-shot: natural language inference](flow_zone:Fdib2zv) Flow zone with decent results given the lack of training examples: **F1 score** of **0.63** and **ROC AUC** of **0.90**.

### In-context learning with ChatGPT

A completely different approach is to leverage a large autoregressive **language model** like GPT-3, GPT-4 or ChatGPT. Autoregressive language models are models estimating the probability of the next token (a word or part of a word) given all the preceding tokens. This allows them to generate credible texts and, if they have been trained at a sufficiently large scale, they can effectively perform natural language processing tasks even without training examples.

Here, we can create a "**prompt**" like below, replace the placeholder with the text to classify and ask an autoregressive language model to generate the next words. Hopefully, the next words will be exactly the name of one of the targeted classes.

> Classify product reviews in one of the following product categories: "Cell Phones and Accessories", "Digital Music", "Electronics", "Industrial and Scientific", "Office Products", "Software", "Toys and Games"

> Product review: {placeholder for the text to classify}
> Product category: 

In the [6. Zero-shot: ChatGPT](flow_zone:6bCqEu1) Flow zone, we [do](recipe:compute_test_scored_few-shot_chatgpt) exactly this with ChatGPT and the [OpenAI GPT Text Completion plugin](https://doc.dataiku.com/dss/latest/nlp/openai-gpt-text-completion.html). In the resulting [dataset](dataset:test_scored_zero-shot_chatgpt_raw), we see that the generated text is indeed one of the valid classes 98% of the time. The generated texts in the remaining 2% of cases are actually quite interesting:
- in 9 cases out of 14, ChatGPT was overzealous and added details in parentheses, after a valid label. For example: "Industrial and Scientific (filament for 3D printing)" instead of just "Industrial and Scientific"
- in 2 cases, the labels were made up: "Home decor (not listed)" or "Automotive Accessories"
- in 3 cases, ChatGPT declined to provide an answer. For example, it replied "Unknown (not enough information to classify)" for the following review: "Thanks for the product. It was exactly as described and exactly what I was looking for. Great quality and I look forward to ordering more in the future."

In practice, it is important to avoid such invalid labels. This can be done easily in a [Prepare recipe](recipe:compute_test_scored_zero-shot_chatgpt_prepared) with simple "if... then... else..." rules. The **F1 score** is **0.81** after this post-processing (the **ROC AUC** cannot be computed since the OpenAI API for ChatGPT does not provide enough details about the probabilities of potential tokens).
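
The project applies these rules visually in the Prepare recipe. For illustration, an equivalent set of rules could look like the following Python sketch. The `clean_label` helper and its `fallback` argument are hypothetical; how the project handles answers that match no valid class is not detailed here.

```python
import re

VALID_LABELS = [
    "Cell Phones and Accessories", "Digital Music", "Electronics",
    "Industrial and Scientific", "Office Products", "Software", "Toys and Games",
]


def clean_label(generated, fallback=None):
    """Map raw generated text to a valid label, or to `fallback` if none matches."""
    # Drop a trailing parenthesised comment, then surrounding quotes and spaces.
    candidate = re.sub(r"\s*\(.*\)\s*$", "", generated).strip().strip('"').strip()
    for label in VALID_LABELS:
        if candidate.lower() == label.lower():
            return label
    return fallback


cleaned = clean_label("Industrial and Scientific (filament for 3D printing)")
made_up = clean_label("Home decor (not listed)")
declined = clean_label("Unknown (not enough information to classify)", fallback="Electronics")
```

Here `cleaned` recovers the valid label hidden behind the extra details, while made-up or declined answers fall back to whatever default the caller chooses.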

## Few-shot text classification

### SetFit

In the **few-shot setting**, we assume that we only have a few training examples of each class. In that situation, there is a high risk of overfitting so we need to be careful, in particular with high-capacity models. [SetFit](https://huggingface.co/blog/setfit) is an approach proposed in 2022 for such a context. It is quite similar to the method described above, which consisted of extracting dense representations with a pre-trained NLP model and training a shallow classifier on these representations. The only difference is that the pre-trained NLP model is first fine-tuned on a contrastive semantic similarity task.
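
To give an intuition of the contrastive step, the sketch below builds sentence pairs from a handful of labeled examples: pairs from the same class get a target similarity of 1, pairs from different classes a target of 0. This is a simplified illustration and not the `setfit` library's code: it enumerates every pair exhaustively, which is an assumption made for brevity, and `build_contrastive_pairs` is a hypothetical helper.

```python
import random
from itertools import combinations


def build_contrastive_pairs(examples, seed=0):
    """Return (text_a, text_b, target) triples: 1.0 for same class, 0.0 otherwise."""
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(examples, 2):
        pairs.append((text_a, text_b, 1.0 if label_a == label_b else 0.0))
    random.Random(seed).shuffle(pairs)
    return pairs


# Toy few-shot set: 2 examples for each of 2 classes (invented texts).
examples = [
    ("Sturdy phone case", "Cell Phones and Accessories"),
    ("Screen protector was easy to apply", "Cell Phones and Accessories"),
    ("Catchy album, great vocals", "Digital Music"),
    ("I love this song", "Digital Music"),
]
pairs = build_contrastive_pairs(examples)
```

Even a few examples per class produce many pairs, which is part of why this kind of fine-tuning can work with very little labeled data.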

In the [7. Few-shot: SetFit](flow_zone:O5MwKr7) Flow zone, we [use](recipe:compute_Ey4Ge7PK) the `setfit` Python library to train a [SetFit model](managed_folder:Ey4Ge7PK). We can [score](recipe:compute_test_scored_few-shot_setfit) the test set with it and we get an **F1 score** of **0.70** and an **ROC AUC** of **0.91**.

### In-context learning with ChatGPT

Autoregressive language models can also take advantage of a few training examples (the academic paper introducing GPT-3 was actually titled *Language Models are Few-Shot Learners*). For this, we just need to include some examples directly in the prompt:

> Classify product reviews in one of the following product categories: "Cell Phones and Accessories", "Digital Music", "Electronics", "Industrial and Scientific", "Office Products", "Software", "Toys and Games"

> Product review: I've received so many compliment about this phone case.  I didn't realized it was see thru which made me love it even more.  Thank you!!!
> Product category: Cell Phones and Accessories
> [...]
> Product review: For whatever reason, I love this song at Christmas.  It's whimsical and fun.  Madonna's version not so much.
> Product category: Digital Music
> Product review: {placeholder for the text to classify}
> Product category: 

In Dataiku, the few-shot examples can be directly specified in the [OpenAI API Text Completion recipe](recipe:compute_test_scored_few-shot_chatgpt). With just 2 examples per class in the [8. Few-shot: ChatGPT (in-context learning)](flow_zone:DkDZuKr) Flow zone, we see the **F1 score** improving from 0.81 to **0.84** and the rate of invalid labels decreasing from 2% to 0.3%.
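
Outside of the visual recipe, the prompts shown above could be assembled with a few lines of Python. The sketch below is an assumption for illustration (the `build_prompt` helper is hypothetical and the plugin builds its own prompt from the recipe fields); passing no examples gives the zero-shot prompt, passing a few gives the few-shot one.

```python
LABELS = [
    "Cell Phones and Accessories", "Digital Music", "Electronics",
    "Industrial and Scientific", "Office Products", "Software", "Toys and Games",
]


def build_prompt(text, examples=()):
    """Assemble the classification prompt; `examples` holds (review, category) pairs."""
    categories = ", ".join(f'"{label}"' for label in LABELS)
    lines = [
        f"Classify product reviews in one of the following product categories: {categories}",
        "",
    ]
    for review, category in examples:
        lines.append(f"Product review: {review}")
        lines.append(f"Product category: {category}")
    # The model is expected to complete the text after the final "Product category:".
    lines.append(f"Product review: {text}")
    lines.append("Product category:")
    return "\n".join(lines)


few_shot_examples = [
    ("For whatever reason, I love this song at Christmas.", "Digital Music"),
    ("Sturdy case, fits my phone perfectly.", "Cell Phones and Accessories"),
]
zero_shot_prompt = build_prompt("The ink cartridge leaked everywhere.")
few_shot_prompt = build_prompt("The ink cartridge leaked everywhere.", few_shot_examples)
```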

## Comparing the models' performance

We can visualize the predictive performance metrics through a [model comparison](model_comparison:V9mguAGT). In particular, the F1 scores on the [test set](dataset:test) are:

| **Approach**                         | **F1 score** |
|--------------------------------------|--------------|
| Few-shot ChatGPT                     | 0.84         |
| Zero-shot ChatGPT                    | 0.81         |
| Sentence embeddings + classic ML     | 0.76         |
| Fine-tuning                          | 0.75         |
| Few-shot SetFit                      | 0.70         |
| Zero-shot natural language inference | 0.63         |
| TF-IDF + classic ML                  | 0.62         |

We did not make any particular effort to optimize the hyperparameters, and this is just one small dataset, so we should not draw strong general conclusions from these results. We can still note that:

- **ChatGPT and more generally few-shot and zero-shot approaches perform remarkably well** given the limited number of training examples. They certainly benefited from the fact that this text classification task does not require domain-specific knowledge.
- Sentence embeddings obtained from a frozen text encoder yield approximately the same predictive performance as a fine-tuned model. This is probably due to the limited hyperparameter optimization.
- Beyond predictive performance, **other criteria and constraints should be taken into account** to assess these approaches. For example, leveraging ChatGPT requires an internet connection and the use of a third party API whereas more expertise and computational resources are needed to fine-tune a self-hosted model.

## Visualizing predictions

Once a model is trained and quantitatively evaluated, it is also important to **qualitatively analyze the predictions** to identify problems, understand the limitations of the model and detect errors in our training or test datasets. The example project includes 2 web apps for this purpose.

The [Interactive Scoring web app](web_app:IUmobzw) enables users to classify the text passages they provide through an input field. The users can see the distribution of the predicted classes, as well as the Shapley values for each class. In the example below, we can see that the sentence "I love my chess board" is categorized in "Toys and Games" and that it's mainly the word "chess" that led to this prediction.

![Interactive_scoring](ik0tFStoItEv)

The [Error Analysis web app](web_app:aQEieka) shows misclassified test examples along with the incorrect predictions for all models.

![Error analysis](MMOQMzcHH2ST)

# Next: Classify your own documents

The project can be downloaded [here](https://downloads.dataiku.com/public/dss-samples/EX_TEXT_CLASSIFICATION/).

## Technical requirements
This project:
- leverages features available starting from **Dataiku 11**;
- requires a Python 3.6 code environment named `py_36_sample_text_classification` and a Python 3.8 code environment named `py_38_sample_text_classification` with the packages and [resources](https://doc.dataiku.com/dss/latest/code-envs/operations-python.html#managed-code-environment-resources-directory) specified in [Appendix: code environments](article:3);
- requires the [Text Visualization plugin](https://www.dataiku.com/product/plugins/nlp-visualization/) and the [OpenAI GPT Text Completion plugin](https://doc.dataiku.com/dss/latest/nlp/openai-gpt-text-completion.html).

## How to reuse this project

Once you have imported the project, you can directly navigate the Flow.

If you want to use your own data, you can just replace the [data](dataset:data) dataset with your own, with the texts to classify in a `text` column and the corresponding labels in a `label_text` column. The Natural Language Inference and ChatGPT approaches require you to respectively adjust the "hypothesis template" (cf. the `HYPOTHESIS_TEMPLATE` variable in the [Python recipe](recipe:compute_test_scored_zero-shot_nli)) and the "prompt" (cf. the "Task", "Input Description", "Output Description" and "Examples" fields in the ChatGPT recipes).

Please note that the models saved for each experiment were deleted from the [experiments_finetuning managed folder](managed_folder:UHpAVvLA) to make this example project easier to download. **As a result, models cannot be directly deployed from the Experiment Tracking page**. This becomes possible again once the experiments are re-run.

All the datasets are stored on the filesystem, so no remapping is needed. However, you have the option to [change the connection type](https://knowledge.dataiku.com/latest/courses/flow-views-and-actions/connection-changes-concept-summary.html#connection-changes) if you want to rely on a specific data storage type. 

If you want to leverage certain parts of the Flow directly, keep in mind that Dataiku allows you to reuse elements at different levels. In particular it is possible to:
- [duplicate a whole project](https://knowledge.dataiku.com/latest/kb/governance/How-to-duplicate-a-DSS-project.html)
- [copy and paste entire subflows](https://knowledge.dataiku.com/latest/courses/flow-views-and-actions/connection-changes-concept-summary.html#copy-subflow)
- [copy and paste recipes](https://knowledge.dataiku.com/latest/kb/collaboration/How-to-copy-a-recipe-in-your-Flow.html) and remap input/output datasets
- [copy and paste preparation steps](https://doc.dataiku.com/dss/latest/preparation/copy-steps.html) within a Prepare recipe  

# Related Resources
- [MLflow](https://doc.dataiku.com/dss/latest/mlops/mlflow-models/index.html) in Dataiku
- [Natural language inference pipeline](https://huggingface.co/tasks/zero-shot-classification) in Transformers
- SetFit: [paper](https://arxiv.org/abs/2209.11055), [GitHub repository](https://github.com/huggingface/setfit)
- [OpenAI GPT Text Completion plugin](https://doc.dataiku.com/dss/latest/nlp/openai-gpt-text-completion.html)
