# Overview

This example project illustrates how to perform **zero-shot and few-shot object detection and semantic segmentation**. It is designed to be **easily reusable**: you can [download](https://downloads.dataiku.com/public/dss-samples/EX_FEW_SHOT/) it and follow the instructions in the [last section](#next-use-your-own-images-1) of this wiki to apply it to your own images.

![tasks.png](59GSkfO1wzsE)
<div align=center style="margin-bottom: 20px; margin-top: -5px">Object detection (left) and segmentation (right) for 3 classes: "chair", "screen", "dog"</div>

For a given image, **object detection** (respectively **semantic segmentation**) consists of determining the bounding boxes (respectively the pixels) of objects of certain classes. These tasks usually rely on deep learning models trained on a large number of annotated images. However, recent vision-language models such as [CLIPSeg](https://arxiv.org/abs/2112.10003) and [OWLViT](https://arxiv.org/abs/2205.06230) make it possible to perform them in a:
- **zero-shot** setting: no example of the target classes is provided at test time;
- **few-shot** setting: just a few examples of the target classes are provided at test time.

Note: zero-shot and few-shot image classification are covered in another [example project](https://gallery.dataiku.com/projects/EX_CLIP/) on Dataiku's public gallery.

# Data

For **zero-shot** object detection and segmentation, we use the same [five images](managed_folder:Tmf77vDr) from Unsplash, showing everyday scenes and common objects (image credits: [1](https://unsplash.com/fr/photos/DJ7bWa-Gwks), [2](https://unsplash.com/fr/photos/J9cBJjlpYKU), [3](https://unsplash.com/fr/photos/q9q6XOe4Sy0), [4](https://unsplash.com/fr/photos/JmmXKlJ8MKQ), [5](https://unsplash.com/fr/photos/8_oFcxtXUSU)).

For **few-shot learning**, we use the [Microcontroller Object Detection dataset](https://www.kaggle.com/datasets/tannergi/microcontroller-detection), which includes [images](managed_folder:aKlMxTsk) with 4 models of microcontrollers ("8266 ESP", "Arduino Nano", "Heltec ESP32 Lora", "Raspberry Pi 3").

![dataset.png](DJMbujVkJlAU)
<div align=center style="margin-bottom: 20px; margin-top: -5px">5 images (top) and 4 examples of images (bottom) used respectively for zero-shot and few-shot learning</div>

# Walkthrough

## Zero-shot object detection

We leverage **[OWLViT](https://arxiv.org/abs/2205.06230)**, and more precisely [google/owlvit-base-patch32](https://huggingface.co/google/owlvit-base-patch32), for the object detection task. As a vision-language model, OWLViT can process both text and images. It is composed of a text encoder, an image encoder and two heads attached to the image encoder. These two heads describe each candidate bounding box with its coordinates and an embedding representing its semantic content. The image encoder and the text encoder were first contrastively pretrained on a huge dataset of captioned images (3.6 billion image-text pairs) in the same fashion as [CLIP](https://medium.com/data-from-the-trenches/leveraging-joint-text-image-models-to-search-and-classify-images-36c87091ff02). Afterwards, the whole model was fine-tuned on several large object detection datasets (2 million annotated images in total).

![owlvit.png](S6jpjPJGIKrN)
<div align=center style="margin-bottom: 20px; margin-top: -5px">Architecture of the OWLViT model. Illustration from the <a href="https://arxiv.org/abs/2205.06230">original paper</a></div>

Given an image and a prompt describing the targeted class, OWLViT returns a large list of candidate bounding boxes, each with a score. Setting a score threshold, a maximum number of boxes, or both removes the low-confidence results. Optionally, we can also discard any box that has a high intersection-over-union (IoU) ratio with a better-scored box, a step known as non-maximum suppression.
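
The filtering described above can be sketched in plain Python. This is only an illustration and not the code of the web app: the function names, the `(xmin, ymin, xmax, ymax)` box convention and the default values are assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_boxes(boxes, scores, score_threshold=0.1, max_results=None, iou_threshold=None):
    """Return the indices of the boxes to keep, best score first.

    score_threshold, max_results and iou_threshold are hypothetical
    parameter names chosen for this sketch.
    """
    # Drop low-confidence candidates and sort the rest by decreasing score
    candidates = [i for i, score in enumerate(scores) if score >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)

    kept = []
    for i in candidates:
        # Optional non-maximum suppression against better-scored kept boxes
        if iou_threshold is not None and any(
            iou(boxes[i], boxes[j]) > iou_threshold for j in kept
        ):
            continue
        kept.append(i)
        if max_results is not None and len(kept) >= max_results:
            break
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.6, 0.05]
print(filter_boxes(boxes, scores, score_threshold=0.1, iou_threshold=0.5))  # [0, 2]
```

In this toy example, the second box is discarded because it overlaps too much with the first one, and the last box is discarded because its score is below the threshold.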

This is implemented in a [web app](web_app:xKA7MHs) that lets you:
- select one of the [5 images](managed_folder:Tmf77vDr);
- define the target classes with comma-separated prompts (a label distinct from the prompt can optionally be specified with the following format: `{prompt}:{label}`);
- adjust the maximum number of results, the score threshold or both;
- visualize the predicted bounding boxes;
- manually add a new bounding box or manually edit or remove an existing bounding box;
- save the bounding boxes in the format used in Dataiku for object detection tasks (the bounding boxes are saved in this [folder](managed_folder:sr6n0lYE) then aggregated in this [dataset](dataset:annotated_images)).
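
As an illustration of the `{prompt}:{label}` convention mentioned in the list above, here is a minimal parsing sketch. The function name and the handling of edge cases (empty items, missing labels) are assumptions, not the web app's actual behavior.

```python
def parse_prompts(raw):
    """Split 'prompt[:label], prompt[:label], ...' into (prompt, label) pairs.

    When no label is given, the prompt itself is used as the label.
    """
    pairs = []
    for item in raw.split(","):
        prompt, _, label = item.strip().partition(":")
        prompt, label = prompt.strip(), label.strip()
        if not prompt:
            continue  # ignore empty items such as a trailing comma
        pairs.append((prompt, label or prompt))
    return pairs


print(parse_prompts("a photo of a chair:chair, screen, dog"))
# [('a photo of a chair', 'chair'), ('screen', 'screen'), ('dog', 'dog')]
```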

![0shot_detection.png](mSa52OnHBxWh)
<div align=center style="margin-bottom: 20px; margin-top: -5px">Screenshot of the zero-shot object detection web app</div>

## Zero-shot segmentation

For the segmentation task, we take advantage of [clipseg-rd64-refined](https://huggingface.co/CIDAS/clipseg-rd64-refined), one of the versions of **[CLIPSeg](https://arxiv.org/abs/2112.10003)**. Like OWLViT, CLIPSeg is composed of an image encoder and a text encoder pretrained contrastively and augmented with a lightweight task-specific module. In the case of CLIPSeg, the image encoder and the text encoder are simply those of a frozen CLIP model. The task-specific module is a transformer-based decoder trained on a large image segmentation dataset (more than 340,000 phrases with corresponding segmentation masks).

![clipseg.png](uVxL48zbx2Ex)
<div align=center style="margin-bottom: 20px; margin-top: -5px">Architecture of the CLIPSeg model. Illustration from the <a href="https://arxiv.org/abs/2112.10003">original paper</a></div>

Given an image and a prompt, CLIPSeg returns per-pixel scores that can be converted into a segmentation mask. A [web app](web_app:3Ndvlqt), similar to the one for zero-shot object detection, allows you to test this interactively. With this web app, the user can:
- select one of the [5 images](managed_folder:Tmf77vDr);
- define the target classes with comma-separated prompts (a label distinct from the prompt can optionally be specified with the following format: `{prompt}:{label}`);
- adjust the score threshold;
- visualize the predicted segmentation masks;
- save the segmentation masks in this [folder](managed_folder:1HRv3kle) (the segmentation masks can then be added to the original images with a [recipe](recipe:compute_btc9F5cm)).
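
The conversion from per-pixel scores to a binary mask can be sketched as follows. The sketch assumes that the scores are raw logits that need a sigmoid before thresholding, and it uses nested lists instead of tensors to stay dependency-free. The function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def scores_to_mask(logits, threshold=0.5):
    """Turn a 2D grid of logits into a binary mask (1 = pixel belongs to the class)."""
    return [[1 if sigmoid(value) >= threshold else 0 for value in row] for row in logits]


logits = [[-2.0, 0.0], [3.0, -0.1]]
print(scores_to_mask(logits, threshold=0.5))  # [[0, 1], [1, 0]]
```

Raising the threshold shrinks the mask and lowering it expands the mask, which is what the score threshold slider of the web app controls.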

![0shot_segmentation.png](izFcwBeXSCQY)
<div align=center style="margin-bottom: 20px; margin-top: -5px">Screenshot of the zero-shot segmentation web app</div>

## Few-shot object detection

In the few-shot setting for object detection, we do not know how to describe the targeted class through a natural language prompt (or the vocabulary used in the prompt is unknown to the text encoder). However, we have **a few training images with the ground truth bounding boxes for this class**.

Let's start with the simplest case: just one training image with a single ground truth bounding box. In this case, as suggested in the original paper, we can obtain the list of predicted bounding boxes and predicted embeddings and select the **predicted bounding box with a high overlap with the ground truth bounding box**. We can then substitute the query embedding that would have been obtained through the text encoder with the **predicted embedding of this bounding box**. If there are several predicted bounding boxes largely overlapping the ground truth bounding box, the original paper suggests a heuristic to select the one to use.
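
A minimal sketch of this selection step is shown below. It assumes that boxes are `(xmin, ymin, xmax, ymax)` tuples and that embeddings are plain lists of floats. For simplicity, ties are resolved by keeping the predicted box with the highest IoU, which is an assumption of this sketch and not necessarily the heuristic of the original paper. The `min_iou` parameter is hypothetical.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_query_embedding(pred_boxes, pred_embeddings, gt_box, min_iou=0.5):
    """Return the embedding of the predicted box that best overlaps gt_box.

    Returns None when no predicted box reaches min_iou.
    """
    best_index, best_iou = None, -1.0
    for index, box in enumerate(pred_boxes):
        overlap = iou(box, gt_box)
        if overlap >= min_iou and overlap > best_iou:
            best_index, best_iou = index, overlap
    return None if best_index is None else pred_embeddings[best_index]


pred_boxes = [(0, 0, 10, 10), (2, 2, 12, 12), (50, 50, 60, 60)]
pred_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(select_query_embedding(pred_boxes, pred_embeddings, gt_box=(2, 2, 11, 11)))  # [0.0, 1.0]
```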

The approach is the same with several images or several ground truth bounding boxes. We just need to average the embeddings obtained for each of these bounding boxes.
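
The averaging step can be illustrated with a few lines of plain Python. The cosine similarity helper is only an assumed stand-in to show how an averaged query embedding could be compared with the embedding of a candidate box. It is not the scoring head of the model.

```python
import math


def average_embeddings(embeddings):
    """Element-wise mean of a non-empty list of equally sized embeddings."""
    if not embeddings:
        raise ValueError("at least one embedding is required")
    count = len(embeddings)
    return [sum(values) / count for values in zip(*embeddings)]


def cosine_similarity(u, v):
    """Cosine similarity between two vectors (0.0 if one of them is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


# Two embeddings selected from two ground truth boxes of the same class
class_query = average_embeddings([[1.0, 0.0], [0.0, 1.0]])
print(class_query)  # [0.5, 0.5]
```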

We implement this approach on the [Microcontroller Object Detection dataset](https://www.kaggle.com/datasets/tannergi/microcontroller-detection) in the [3. Few-Shot Object Detection](flow_zone:zVuxytY) Flow zone. We [retrieve](recipe:compute_train) 35 training images with 10 bounding boxes for each class in total. We [compute](recipe:compute_U68shXET) the embeddings of each class and [score](recipe:compute_test_scored) the test images on this basis. We then generate [versions](managed_folder:LkxFClYj) of the test images with both the predicted and ground truth bounding boxes and compute the corresponding [Average Precision values](dataset:evaluation).

These results are broadly comparable to those obtained after fine-tuning a Faster R-CNN in our [example project for object detection](https://gallery.dataiku.com/projects/EX_OBJECT_DETECTION/), even though far fewer training images were used (35 instead of 120).

<table>
<thead>
  <tr>
    <th style="border: 0px"></th>
    <th colspan="3">Average Precision for all classes</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td style="border: 0px"></td>
    <td><b>IoU = 0.5</b><br></td>
    <td><b>IoU = 0.75</b><br></td>
    <td><b>All IoUs</b><br></td>
  </tr>
  <tr>
    <td><b>Fine-tuning (120 train images)</b></td>
    <td>0.87</td>
    <td>0.39</td>
    <td>0.43</td>
  </tr>
  <tr>
    <td><b>Few-shot learning (35 train images)</b></td>
    <td>0.71</td>
    <td>0.71</td>
    <td>0.53</td>
  </tr>
</tbody>
</table>
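
For readers unfamiliar with the metric, here is a simplified sketch of how Average Precision can be computed for one class at a given IoU threshold. It is not the evaluation code of the project: the data layout, the greedy matching rule and the all-point interpolation are assumptions of this sketch.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truth, iou_threshold=0.5):
    """detections: list of (image_id, score, box); ground_truth: {image_id: [box, ...]}."""
    total_gt = sum(len(boxes) for boxes in ground_truth.values())
    if total_gt == 0:
        return 0.0
    matched = {image_id: [False] * len(boxes) for image_id, boxes in ground_truth.items()}

    # Greedily match detections (best score first) to unmatched ground truth boxes
    is_true_positive = []
    for image_id, _, box in sorted(detections, key=lambda d: d[1], reverse=True):
        best_j, best_iou = None, -1.0
        for j, gt_box in enumerate(ground_truth.get(image_id, [])):
            overlap = iou(box, gt_box)
            if overlap >= iou_threshold and overlap > best_iou and not matched[image_id][j]:
                best_j, best_iou = j, overlap
        if best_j is not None:
            matched[image_id][best_j] = True
        is_true_positive.append(best_j is not None)

    # Precision and recall after each detection
    precisions, recalls, true_positives = [], [], 0
    for rank, flag in enumerate(is_true_positive, start=1):
        true_positives += flag
        precisions.append(true_positives / rank)
        recalls.append(true_positives / total_gt)

    # Make precision non-increasing, then integrate it over recall
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


ground_truth = {"a": [(0, 0, 10, 10)], "b": [(0, 0, 10, 10)]}
detections = [("a", 0.9, (0, 0, 10, 10)), ("a", 0.8, (50, 50, 60, 60)), ("b", 0.7, (1, 1, 11, 11))]
print(average_precision(detections, ground_truth, iou_threshold=0.5))
print(average_precision(detections, ground_truth, iou_threshold=0.75))
```

In this toy example, the third detection overlaps its ground truth box with an IoU of about 0.68, so it counts as a true positive at IoU = 0.5 but as a false positive at IoU = 0.75. This is why Average Precision typically decreases when the IoU threshold increases.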

## Few-shot segmentation

In the few-shot setting for segmentation, we just assume that we have **a few training images** for each targeted class, **without their ground truth segmentation masks**.

Remember that in the zero-shot setting, the frozen CLIP text encoder computes an embedding of a natural language prompt, which is then fed to a decoder. If we now need to replace this embedding in the few-shot setting, we can just take the **average embedding of the images of the corresponding class**, using the frozen CLIP image encoder. This is appropriate because, by design of the CLIP pre-training procedure, the embedding of an image is aligned with the embedding of its text description.

In our project, we take, for each class, [3 training images](managed_folder:PIBaZG9p) of only the target objects over a neutral background. We augment these images through 90°, 180° and 270° rotations and [compute](recipe:compute_3gKc7IXi) the average embeddings per class. We then get the corresponding [segmentation masks](managed_folder:umYWZZ9z) for the test images.
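
The augmentation and averaging logic can be sketched as follows. To stay self-contained, the sketch represents an image as a 2D list of pixel values and takes the image encoder as a hypothetical `embed` callable. In the project, this role is played by the frozen CLIP image encoder.

```python
def rotate90(image):
    """Rotate a 2D grid (list of rows) by 90 degrees counterclockwise."""
    return [list(row) for row in zip(*image)][::-1]


def augment_with_rotations(image):
    """Return the image and its 90, 180 and 270 degree rotations."""
    views = [image]
    for _ in range(3):
        views.append(rotate90(views[-1]))
    return views


def class_embedding(images, embed):
    """Average the embeddings of all rotated views of all images of a class.

    embed is a hypothetical callable mapping an image to a list of floats.
    """
    vectors = [embed(view) for image in images for view in augment_with_rotations(image)]
    return [sum(values) / len(vectors) for values in zip(*vectors)]


# Toy encoder for illustration only: first pixel value and a constant
def toy_embed(image):
    return [float(image[0][0]), 1.0]


print(class_embedding([[[1, 2], [3, 4]]], toy_embed))  # [2.5, 1.0]
```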

The results are less convincing than those for few-shot object detection. The predicted location of the objects is roughly correct but the discrimination among the 4 classes is poor. This can be visualized through a [web app](web_app:i38Z5kf). When the proper class is selected and the decision threshold is adjusted, we obtain a reasonable mask. However, when all classes are selected, the true class generally does not get the highest score. This may be because we use a single decision threshold instead of class-specific decision thresholds.

![fewshot_segmentation.png](xRKYJnwpMrbn)
<div align=center style="margin-bottom: 20px; margin-top: -5px">Example of good localization but improper categorization with few-shot segmentation and a single decision threshold</div>
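
To make the idea of class-specific decision thresholds more concrete, here is one possible sketch. It is not implemented in the project and it has not been evaluated: each pixel is assigned to the class whose score exceeds its own threshold by the largest margin. The function name, the data layout and the margin rule are all assumptions.

```python
def assign_classes(prob_maps, thresholds, background=None):
    """Assign each pixel to the class with the largest margin over its threshold.

    prob_maps: {label: 2D list of scores in [0, 1]}, all with the same shape
    thresholds: {label: decision threshold for this label}
    Pixels for which no class reaches its threshold get the background value.
    """
    labels = list(prob_maps)
    height = len(prob_maps[labels[0]])
    width = len(prob_maps[labels[0]][0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            best_label, best_margin = background, None
            for label in labels:
                margin = prob_maps[label][y][x] - thresholds[label]
                if margin >= 0 and (best_margin is None or margin > best_margin):
                    best_label, best_margin = label, margin
            row.append(best_label)
        result.append(row)
    return result


prob_maps = {"a": [[0.9, 0.2]], "b": [[0.6, 0.3]]}
print(assign_classes(prob_maps, {"a": 0.5, "b": 0.5}))    # [['a', None]]
print(assign_classes(prob_maps, {"a": 0.8, "b": 0.25}))   # [['b', 'b']]
```

The toy example shows that the predicted class of a pixel can change when the thresholds are set per class instead of globally.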

# Next: use your own images

## Technical requirements
This project:
- leverages features available starting from **Dataiku 11**;
- requires a Python 3.8 code environment named `py_38_sample_fewshot` with the following packages: 
``` 
Pillow==9.4.0
transformers==4.26.1
torch==1.13.1
torchvision==0.14.1
dash==2.8.1
dash-bootstrap-components==1.4.0
scikit-image==0.20.0
```

`py_38_sample_fewshot` should also include the following [initialization script](https://doc.dataiku.com/dss/latest/code-envs/operations-python.html#managed-code-environment-resources-directory) (cf. the "Resources" tab of the code environment):
```
## Base imports
import os

from dataiku.code_env_resources import clear_all_env_vars
from dataiku.code_env_resources import grant_permissions
from dataiku.code_env_resources import set_env_path
from dataiku.code_env_resources import set_env_var

# Clears all environment variables defined by previously run script
clear_all_env_vars()

## Hugging Face
# Set HuggingFace cache directory
set_env_path("HF_HOME", "huggingface")
hf_home_dir = os.getenv("HF_HOME")

# Download the OWLViT processor and model once so that they are cached in HF_HOME
import torch
from transformers import OwlViTProcessor, OwlViTForObjectDetection
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained(
    "google/owlvit-base-patch32",
    torch_dtype=torch.float16
)

# Download the CLIPSeg processor and model once so that they are cached in HF_HOME
from transformers import CLIPSegForImageSegmentation, CLIPSegProcessor
processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

# Grant everyone read access to pretrained models in the HF_HOME folder
# (by default, only readable by the owner)
grant_permissions(hf_home_dir)
```

## How to reuse this project
The project can be downloaded [here](https://downloads.dataiku.com/public/dss-samples/EX_FEW_SHOT/).

Once you have imported the project, you can directly navigate the Flow.

If you want to use your own data, put your images:
- in this [folder](managed_folder:Tmf77vDr) for **zero-shot** object detection or segmentation;
- in this [folder](managed_folder:aKlMxTsk) for **few-shot** object detection or segmentation.

For **few-shot object detection**, add your annotations in this [dataset](dataset:object_detection_data) and adjust the [recipe](recipe:compute_train) to create your training set and your test set. **Make sure that the annotations are in the right format** or use a Prepare recipe or a Python recipe to convert them. You may also include unlabeled images and annotate them with the Dataiku [managed labeling](https://doc.dataiku.com/dss/latest/machine-learning/labeling.html) feature.
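
As an example of such a conversion, the sketch below turns corner-style annotations into a JSON array of `bbox` and `category` objects. Both the input keys (`class`, `xmin`, `ymin`, `xmax`, `ymax`) and the target layout (`bbox` as `[xmin, ymin, width, height]`) are assumptions: check them against your own annotation file and the annotations already present in the project's dataset before reusing this code.

```python
import json


def to_detection_annotations(rows):
    """Convert corner-style boxes to a JSON array of bbox/category objects.

    rows: iterable of dicts with the (assumed) keys class, xmin, ymin, xmax, ymax,
    all describing objects of the same image.
    """
    annotations = []
    for row in rows:
        width = row["xmax"] - row["xmin"]
        height = row["ymax"] - row["ymin"]
        annotations.append(
            {"bbox": [row["xmin"], row["ymin"], width, height], "category": row["class"]}
        )
    return json.dumps(annotations)


rows = [{"class": "Arduino Nano", "xmin": 10, "ymin": 20, "xmax": 40, "ymax": 60}]
print(to_detection_annotations(rows))
```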

For **few-shot segmentation**, add your training images in this [folder](managed_folder:PIBaZG9p).

All the datasets are stored on the filesystem, so no remapping is needed. However, you have the option to [change the connection type](https://knowledge.dataiku.com/latest/courses/flow-views-and-actions/connection-changes-concept-summary.html#connection-changes) if you want to rely on a specific data storage type.

If you want to leverage certain parts of the Flow directly, keep in mind that Dataiku allows you to reuse elements at different levels. In particular it is possible to:
- [duplicate a whole project](https://knowledge.dataiku.com/latest/kb/governance/How-to-duplicate-a-DSS-project.html)
- [copy and paste entire subflows](https://knowledge.dataiku.com/latest/courses/flow-views-and-actions/connection-changes-concept-summary.html#copy-subflow)
- [copy and paste recipes](https://knowledge.dataiku.com/latest/kb/collaboration/How-to-copy-a-recipe-in-your-Flow.html) and remap input/output datasets
- [copy and paste preparation steps](https://doc.dataiku.com/dss/latest/preparation/copy-steps.html) within a Prepare recipe  

# Related Resources
- [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) (OWLViT paper)
- [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) (CLIPSeg paper)
- [Object detection example project](https://gallery.dataiku.com/projects/EX_OBJECT_DETECTION/)
- [CLIP example project](https://gallery.dataiku.com/projects/EX_CLIP/flow/) (for zero-shot and few-shot image classification)