# Overview

This example project shows how to implement a **Multimodal Retrieval-Augmented Generation** (RAG) pipeline. It illustrates some of the notions present in a Dataiku [blog post](https://medium.com/data-from-the-trenches/beyond-text-taking-advantage-of-rich-information-sources-with-multimodal-rag-0f98ff077308). You are invited to read this blog post before exploring this example project.

The project covers the following aspects:
- **Extracting individual elements** (text boxes, images and tables) from PDF documents;
- Identifying text boxes corresponding to captions of images or tables;
- **Embedding the individual elements** with either a multimodal embedding model or a text embedding model;
- **Retrieving** relevant elements for a given user question;
- **Generating the answer** to a given user question from the retrieved elements with a **multimodal LLM** (GPT-4o or IDEFICS2);
- Incorporating the question answering pipeline in a **simple Dash user interface**.

The project can be [downloaded](https://downloads.dataiku.com/public/dss-samples/EX_MULTIMODAL_RAG/) and the last section provides instructions to reuse it with your own datasets.

# Data

For this example project, we use [three academic articles](managed_folder:VyiwDJse) (in PDF format) which are part of the [PubMed Central Open Access Subset](https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/). These articles include text, tables, diagrams, charts and other images. The aim of the following data processing steps will be to answer [test questions](dataset:questions) based on these articles.

![documents.png](ADXn83fhAIIH)
<div align="center"><i>Figure 1: examples of pages of the three documents used in this project</i></div>

# Walkthrough

As explained in the blog post mentioned above and summarized in the diagram below, there are **various ways to implement a multimodal RAG pipeline**. This section first presents a pipeline made of steps 1, 2b, 3 and 4, then describes some alternatives.

![variants.png](r5JpSDH0etgs)
<div align="center"><i>Figure 2: variants of a multimodal RAG pipeline</i></div>

## Content extraction

In the [1. Content extraction](flow_zone:D74ZXfn) Flow zone, we [extract](recipe:compute_texts) individual elements – text chunks or images – from the three PDFs. For this, we take advantage of the open source `unstructured` library and get:
- for each text chunk:
  - an id;
  - a category (e.g. `NarrativeText`, `Title`, `Header`...);
  - the coordinates of the corresponding bounding box;
  - the page number;
  - the text content;
- for each image or table:
  - an id;
  - a category (`Image` or `Table`);
  - the coordinates of the corresponding bounding box;
  - an image file stored in a [managed folder](managed_folder:vOjkXoGz);
  - the page number;
  - the text content obtained through Optical Character Recognition (OCR).

Unfortunately, the way `unstructured` splits a document page into elements may separate a table or an image from its caption (cf. Figure 3a), in which case we risk losing important contextual information. It is therefore important to try to **establish this missing link between a table or image and its caption**. One way to achieve this is to leverage the coordinates of the extracted text boxes, images and tables. For example, if the x-coordinates of a text box lie within the x-coordinates of an image and the y-coordinates of this text box are close to the top or bottom border of the image, we can reasonably assume that the text box is a caption for this image (cf. Figure 3b). We implement this simple heuristic rule in the [same Python recipe](recipe:compute_texts) to assign a caption to images and tables whenever possible.
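The coordinate-based rule can be sketched as follows. This is a minimal illustration and not the code of the recipe: the `Box` class, the function names and the `max_gap` and `x_tolerance` thresholds are hypothetical, and the y-axis is assumed to grow downwards.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    """Axis-aligned bounding box. Assumption: y grows downwards, as in image coordinates."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def vertical_gap(text: Box, image: Box) -> float:
    """Distance between the text box and the nearest horizontal border of the image.

    Positive when the text box is entirely below or above the image,
    negative when the two boxes overlap vertically.
    """
    return max(text.y_min - image.y_max, image.y_min - text.y_max)


def is_caption_candidate(text: Box, image: Box,
                         max_gap: float = 30.0, x_tolerance: float = 5.0) -> bool:
    """Heuristic: horizontally within the image and vertically close to it.

    max_gap and x_tolerance are hypothetical thresholds to tune per document set.
    """
    within_x = (text.x_min >= image.x_min - x_tolerance
                and text.x_max <= image.x_max + x_tolerance)
    return within_x and 0 <= vertical_gap(text, image) <= max_gap


def find_caption(image: Box, texts: List[Box]) -> Optional[int]:
    """Return the index of the closest caption candidate, or None if there is none."""
    candidates = [i for i, t in enumerate(texts) if is_caption_candidate(t, image)]
    if not candidates:
        return None
    return min(candidates, key=lambda i: vertical_gap(texts[i], image))
```

When several text boxes qualify, this sketch keeps the one nearest to the image border. Other tie-breaking rules are equally plausible.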

Alternatively, as demonstrated in the [1.1. Content extraction with GPT4-V captioning](flow_zone:OH9P5oT) Flow zone, we can also [create](recipe:compute_figures2) a picture with the target image outlined in red and its surroundings, send this picture to a multimodal LLM and ask this multimodal LLM to generate a description of this image (cf. Figure 3c). Hopefully, the multimodal LLM will take advantage of the caption if there is one.

![captioning.png](ruMfQmL1xxyL)
<div align="center"><i>Figure 3, left: the link between an image and its caption may be implicit. Middle: a way to identify the caption of an image is to look for all text boxes whose x-coordinates are between the x-coordinates of the image and whose y-coordinates are close to the bottom or top of the image. Right: an alternative is to send a picture with the target image and its surroundings to a multimodal LLM and to ask for a description for this image.</i></div>

We perform additional image processing steps after having extracted the content from the documents:
- we [discard](recipe:compute_tables) images that are too small;
- since `unstructured` does not detect text orientation before extracting the images, we [check](recipe:compute_tables) whether the text has been rotated. If so, we rotate the image accordingly and perform OCR again on it;
- if necessary, we [resize](recipe:compute_tables) the images so that their dimensions are less than the [maximal dimensions allowed with GPT-4o](https://platform.openai.com/docs/guides/vision/managing-images);
- we [perform](recipe:compute_figures_annotated) [zero-shot image classification](https://medium.com/data-from-the-trenches/leveraging-joint-text-image-models-to-search-and-classify-images-36c87091ff02) to determine whether the extracted images are diagrams, charts, maps, photographs or illustrations. We then remove images classified as photographs or illustrations (because they are assumed to be included in the documents only for aesthetic purposes).
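The size-related steps of this list (discarding small images and resizing large ones) can be sketched as below. The function names are hypothetical and the thresholds are deliberately left as parameters: the actual limits should be taken from the model provider's documentation, not from this example.

```python
from typing import Tuple


def is_large_enough(width: int, height: int, min_side: int) -> bool:
    """Keep an image only if both of its sides reach min_side pixels."""
    return width >= min_side and height >= min_side


def target_size(width: int, height: int,
                max_long_side: int, max_short_side: int) -> Tuple[int, int]:
    """Compute dimensions that respect both limits while preserving the aspect ratio.

    The image is never upscaled: the scale factor is capped at 1.
    """
    long_side, short_side = max(width, height), min(width, height)
    scale = min(1.0, max_long_side / long_side, max_short_side / short_side)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

The resulting dimensions can then be passed to any image library to perform the actual resize.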

## Embedding

Through all the previous steps, we have created a [list of text chunks](dataset:texts), a [list of tables](dataset:tables) and a [list of images](dataset:figures_annotated). We derive from these lists a [dataset of texts to embed](dataset:to_embed) and a [dataset of metadata](dataset:metadata) for each text chunk, table or image, with a common identifier to relate the records of both datasets. The text to embed for each text chunk is simply its content. When it comes to the tables and images, the texts to embed are the text extracted through OCR and, if applicable, their caption.
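As an illustration, the split between texts to embed and metadata could look like the sketch below. The field names (`element_id`, `field`, `image_path`...) and the choice of one row per text are assumptions made for this example; the actual schemas are those of the datasets linked above.

```python
from typing import Dict, List, Optional, Tuple


def build_records(element: Dict[str, Optional[str]]
                  ) -> Tuple[List[Dict[str, str]], Dict[str, Optional[str]]]:
    """Derive the rows to embed and the metadata row for one extracted element.

    Text chunks contribute their content only; images and tables contribute
    their OCR content and, when one was found, their caption.
    """
    element_id = element["id"]
    texts = [("content", element.get("text"))]
    if element["category"] in ("Image", "Table"):
        texts.append(("caption", element.get("caption")))
    # Empty or missing texts are skipped: there is nothing to embed.
    to_embed = [
        {"element_id": element_id, "field": field, "text": text}
        for field, text in texts if text
    ]
    metadata = {
        "element_id": element_id,  # common identifier relating both datasets
        "category": element["category"],
        "page_number": element.get("page_number"),
        "image_path": element.get("image_path"),
    }
    return to_embed, metadata
```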

We then [vectorize](recipe:compute_kb) all the texts to embed and store the corresponding vectors in a [knowledge bank](retrievable_knowledge:zdqno9RF). Note that we do not use a multimodal embedding model here because we assume that the captions and the texts obtained through OCR are sufficiently explicit for effective retrieval at the next step. We present below an [alternative](#variant-1-use-of-a-multimodal-embedding-model-1) which leverages a multimodal embedding model.

## Answer generation

Thanks to the preprocessing steps above, we can now [answer](recipe:compute_answers) a user question. For this:
- we retrieve the texts included in the knowledge bank that are most similar to the question;
- we get the top `K` elements – text chunk, image or table – corresponding to these texts. Please remember that an image or table can be assigned a `caption` text and a `content` text and any of these texts can be split during the embedding step;
- we include a general question answering prompt, the question and all these elements in a series of messages sent to the multimodal LLM;
- we obtain the answer generated by the multimodal LLM.
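Since several retrieved texts may point to the same element, the second step above involves a deduplication. A minimal sketch, assuming the search results are available as `(element_id, score)` pairs sorted by decreasing similarity (a hypothetical format, not the one returned by the knowledge bank):

```python
from typing import List, Tuple


def top_k_elements(hits: List[Tuple[str, float]], k: int) -> List[str]:
    """Return the ids of the first k distinct elements, in order of first appearance.

    hits is assumed to be sorted by decreasing similarity and may contain
    several entries for the same element (caption, content, split segments).
    """
    if k <= 0:
        return []
    selected: List[str] = []
    for element_id, _score in hits:
        if element_id not in selected:
            selected.append(element_id)
        if len(selected) == k:
            break
    return selected
```

The selected ids can then be looked up in the metadata dataset to build the messages sent to the multimodal LLM.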

A simple [web app](web_app:SXFFvuL) illustrates this mechanism.

![webapp.png](3E105nySuLRe)
<div align="center"><i>Figure 4: simple question answering web app</i></div>

## Variant 1: use of a multimodal embedding model

The pipeline described above uses a simple text embedding model. This model is used to vectorize all text chunks as well as the captions and text content of images and tables. Another approach is to directly use a multimodal embedding model (here, [SIGLIP](https://huggingface.co/google/siglip-so400m-patch14-384)) on both the text chunks and images. This is illustrated in the [2.1. Embedding with a multimodal model](flow_zone:LWWbzAD) and [3.1. Retrieval + Answer generation with a multimodal LLM](flow_zone:vSfiFpf) Flow zones.

A point of caution is that the text encoder of SIGLIP only handles strings shorter than 64 tokens. We then use a *parent-children* approach: we [split](recipe:compute_6LZzt7AZ_1) text chunks in segments shorter than 64 tokens, embed these segments, but [include](recipe:compute_answers_siglip) the whole text chunks when any of their segments are retrieved during the semantic similarity search.
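The parent-children idea can be sketched as follows. For simplicity, this example counts whitespace-separated words as tokens, which is an assumption: the actual limit applies to the tokens produced by the model's own tokenizer, and the helper names are hypothetical.

```python
from typing import Dict, List, Tuple


def split_into_segments(text: str, max_tokens: int = 64) -> List[str]:
    """Split a text into consecutive segments of strictly fewer than max_tokens tokens.

    Tokens are approximated by whitespace-separated words in this sketch.
    """
    if max_tokens < 2:
        raise ValueError("max_tokens must be at least 2")
    tokens = text.split()
    size = max_tokens - 1
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def build_segment_index(chunks: Dict[str, str],
                        max_tokens: int = 64) -> List[Tuple[str, str]]:
    """List (parent_id, segment) pairs: the segments (children) are what gets embedded."""
    return [
        (parent_id, segment)
        for parent_id, text in chunks.items()
        for segment in split_into_segments(text, max_tokens)
    ]


def parents_of(hit_positions: List[int],
               segment_index: List[Tuple[str, str]],
               chunks: Dict[str, str]) -> List[str]:
    """Map retrieved segment positions back to their whole parent chunks, without duplicates."""
    parent_ids = dict.fromkeys(segment_index[pos][0] for pos in hit_positions)
    return [chunks[parent_id] for parent_id in parent_ids]
```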

Additionally, multimodal semantic similarity search with a text query tends to return text chunks more often than other modalities because of the *[multimodality gap](https://arxiv.org/abs/2203.02053)*. To address this in a simple way, we build two distinct indices, one for text chunks and one for images and tables. When performing the semantic similarity search, we then obtain `K/2` results from each index instead of `K` results from a single shared index.
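A minimal sketch of this balanced search, using plain Python lists as vectors and an exhaustive cosine similarity scan. The function names and the in-memory index format are assumptions made for this example, as is the choice to give the extra slot to the image index when `K` is odd.

```python
import math
from typing import List, Sequence, Tuple

Index = List[Tuple[str, Sequence[float]]]  # (element id, embedding) pairs


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def top_n(query: Sequence[float], index: Index, n: int) -> List[str]:
    """Return the ids of the n entries most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [element_id for element_id, _ in ranked[:n]]


def balanced_search(query: Sequence[float], text_index: Index,
                    image_index: Index, k: int) -> List[str]:
    """Take about half of the k results from each index instead of k from a shared one."""
    half = k // 2
    return top_n(query, text_index, half) + top_n(query, image_index, k - half)
```

With a single shared index, the text entries could fill all `K` slots whenever they score systematically higher than images and tables, which is what the two separate indices prevent.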

## Variant 2: use of an open-weight multimodal LLM

In the [3.2. Retrieval + Answer generation with a IDEFICS 2](flow_zone:dVKkIlW) Flow zone, we reproduce the retrieval of the relevant chunks and the generation of the answer performed in the [3. Retrieval + Answer generation with a multimodal LLM](flow_zone:default) Flow zone, but we replace the GPT-4o LLM Mesh connection with [IDEFICS2](https://huggingface.co/HuggingFaceM4/idefics2-8b), an open-weight multimodal LLM. In this Flow zone, we separate the [retrieval step](recipe:compute_answers_idefics2) from the [answer generation step](recipe:compute_idefics2_answers) because knowledge banks can only be leveraged with local execution at the moment, whereas performing inference with IDEFICS2 most likely requires a GPU container.

## Variant 3: use of an image-only retrieval approach with ColPali

In the [1.2. Full multimodal RAG pipeline with ColPali](flow_zone:UBRGrQ6) Flow zone, we leverage a recent approach named [ColPali](https://arxiv.org/abs/2407.01449).

During the pre-processing step, ColPali completely skips the content extraction step. A fine-tuned vision language model based on PaliGemma directly [transforms](recipe:compute_8jv9ZTHM) the pages of the documents (seen as images) into a collection of embeddings.

[At retrieval time](recipe:compute_answers_colpali) and for a given user question, a similarity score is computed from these embeddings with a late interaction matching mechanism (similar to [ColBERT](https://arxiv.org/abs/2004.12832)). This makes it possible to select the most relevant images, which correspond to whole pages, and send them to a multimodal LLM.
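The late interaction score used by ColBERT-style models sums, over the query token embeddings, the maximum dot product with any embedding of the page. The sketch below illustrates this computation with plain Python lists; the function names and the in-memory page format are assumptions, and the recipe itself relies on the ColPali model and its tooling.

```python
from typing import Dict, List, Sequence

Embeddings = List[Sequence[float]]  # one vector per query token or per page patch


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_embs: Embeddings, page_embs: Embeddings) -> float:
    """For each query vector, keep its best match among the page vectors, then sum."""
    return sum(max(dot(q, p) for p in page_embs) for q in query_embs)


def rank_pages(query_embs: Embeddings, pages: Dict[str, Embeddings],
               top_n: int) -> List[str]:
    """Return the ids of the top_n pages with the highest late interaction score."""
    ranked = sorted(
        pages,
        key=lambda page_id: late_interaction_score(query_embs, pages[page_id]),
        reverse=True,
    )
    return ranked[:top_n]
```

Scoring one page requires a dot product for every pair of query vector and page vector, whereas single-vector semantic search needs only one similarity computation per indexed item.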

Although it is much more straightforward, ColPali yields similar or better [results](dataset:answers_colpali). Please note that the late interaction matching mechanism in ColPali is more computationally expensive than traditional semantic search. Therefore, ColPali may be harder to operationalize for large collections of documents.

The project includes a [variant](web_app:3Qd71oF) of the previous web app, based on ColPali.

In addition, another [web app](web_app:ntDAJRD) focuses on ColPali interpretability. It allows users to visualize which part of an image is relevant for a textual query.

![interpretability.png](BZt9ZldY8opN)
<div align="center"><i>Figure 5: interpretability web app using ColPali</i></div>

# Next: Implement your own multimodal RAG pipeline!

The project can be downloaded [here](https://downloads.dataiku.com/public/dss-samples/EX_MULTIMODAL_RAG/).

## Technical requirements

This project requires:

- features available starting from **Dataiku 12.6**;
- a Python 3.10 code environment named `py_310_sample_multimodal_rag` with the packages and resources specified in [Appendix: code environment](article:2);
- a Python 3.10 code environment named `py_310_sample_colpali` with the packages and resources specified in [Appendix: code environment](article:2) and a **GPU** for the [1.2. Full multimodal RAG pipeline with ColPali](flow_zone:UBRGrQ6) Flow zone and the ColPali [web app](web_app:3Qd71oF);
- a Python 3.9 code environment named `py39_rag` as defined in the Initial setup section of this [page](https://doc.dataiku.com/dss/latest/generative-ai/rag.html#initial-setup);
- a multimodal LLM connection specified as a project variable and an embedding model connection. You can get a list of all available LLM connections and embedding model connections with the `list_llms` [method](https://developer.dataiku.com/latest/api-reference/python/projects.html#dataikuapi.dss.project.DSSProject.list_llms);
- `poppler-utils` and `tesseract-ocr` added as system dependencies.

## How to reuse this project

Once you have imported the project, you can directly navigate the Flow. You need to specify a GPT-4o (or equivalent) LLM connection in the `"LLM_id"` global variable. You also need to replace the documents in this [folder](managed_folder:VyiwDJse) and the questions in this [dataset](dataset:questions). If your documents are not PDF documents, you need to modify the [content extraction recipe](recipe:compute_texts) to replace `partition_pdf` with the `unstructured` function relevant for the types of your documents.

All the datasets are stored on the filesystem, so no remapping is needed. However, you have the option to [change the connection type](https://knowledge.dataiku.com/latest/data-sourcing/connections/concept-connection-changes.html#connection-changes) if you want to rely on a specific data storage type.

If you want to leverage certain parts of the Flow directly, keep in mind that Dataiku allows you to reuse elements at different levels. In particular it is possible to:

- [duplicate a whole project](https://knowledge.dataiku.com/latest/getting-started/dataiku-ui/how-to-duplicate-project.html)
- [copy and paste entire subflows](https://knowledge.dataiku.com/latest/collaboration/sharing-projects-assets/how-to-copy-flow-items.html)
- copy and paste recipes and remap input/output datasets
- [copy and paste preparation steps](https://doc.dataiku.com/dss/latest/preparation/copy-steps.html) within a Prepare recipe

# Related resources

- [LLM Mesh](https://doc.dataiku.com/dss/latest/generative-ai/index.html) in Dataiku
- Multimodal RAG [blog post](https://medium.com/data-from-the-trenches/beyond-text-taking-advantage-of-rich-information-sources-with-multimodal-rag-0f98ff077308)
- Advanced RAG ([blog post](https://medium.com/data-from-the-trenches/from-sketch-to-success-strategies-for-building-and-evaluating-an-advanced-rag-system-edd7bc46375d), [example project](https://gallery.dataiku.com/projects/EX_ADVANCED_RAG/))
- [Introduction to Large Language Models with Dataiku](https://content.dataiku.com/llms-dataiku/dataiku-llm-starter-kit)
- [Dataiku LLM Starter Kit](https://gallery.dataiku.com/projects/EX_LLM_STARTER_KIT/)