# Overview

This example project illustrates how to create a **question answering system** with **GPT-3** for a specific collection of (potentially private) documents, here Dataiku's technical documentation. You can download this [project](https://downloads.dataiku.com/public/dss-samples/EX_QUESTION_ANSWERING/) and reuse it for your own use case.

After explaining how GPT-3 can be leveraged to build a question answering system, we'll present the Flow of the project and provide instructions on reusing this example project.

**Please note that the question answering web app is not live on Dataiku's public project gallery but you can test it by downloading the project and providing an OpenAI API key.** Examples of actual answers can be visualized through this [web app](web_app:ehyJdp6).

Note: it is also possible to take advantage of Dataiku's [NLG Tasks plugin](https://www.dataiku.com/product/plugins/nlp-nlg-tasks/) to access the OpenAI API in a Dataiku Flow. We do not use this plugin in this project because the API calls happen within a web app.

![screenshot_question_answering.png](xSHNvRk9si43)

# GPT-3 as the engine of a question answering system

GPT-3 can effectively perform a variety of natural language processing tasks without further training: question answering, summarization, translation, classification, etc. The key to leveraging these capabilities is to provide the right prompt, i.e., the initial text that the model then completes one token at a time. The following diagram illustrates what the prompt and the completion by GPT-3 can look like for a question answering task.

![naive_pipeline.png](gVG540ty4qja)

However, this approach does not work for recent or non-public documents because, as of February 2023, the latest available GPT-3 model was trained on internet documents dated up to June 2021. Fortunately, GPT-3 can still be a useful assistant with a **retrieve-then-read pipeline**.

Let's assume that we want to get answers based on a specific collection of documents. As illustrated in the diagram below, the retrieve-then-read pipeline entails the following steps:
1. Receive the question from the user;
2. Retrieve from the collection of documents the passages most semantically similar to the question;
3. Create a prompt including the question and the retrieved passages;
4. Query GPT-3 with the prompt.

![retrieve_then_read_pipeline.png](wq7mp72IJyRN)
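
The four steps above can be sketched as a single function. This is only an illustration under stated assumptions: `retrieve` and `complete` are hypothetical callables standing in for the semantic search and for the call to the language model, and the prompt layout is a placeholder rather than the template used in the project.

```python
from typing import Callable, List


def retrieve_then_read(
    question: str,
    retrieve: Callable[[str], List[str]],
    complete: Callable[[str], str],
) -> str:
    """Answer a question from retrieved passages (steps 1 to 4)."""
    # Step 2: fetch the passages most similar to the question.
    passages = retrieve(question)
    # Step 3: build a prompt containing the passages and the question.
    context = "\n\n".join(passages)
    prompt = f"Extracts:\n{context}\n\nQuestion: {question}\nAnswer:"
    # Step 4: let the language model complete the prompt.
    return complete(prompt)


# Toy stand-ins (hypothetical) so that the sketch runs without any service.
def toy_retrieve(question: str) -> List[str]:
    return ["First passage.", "Second passage."]


def toy_complete(prompt: str) -> str:
    return f"(completion of a {len(prompt)}-character prompt)"


print(retrieve_then_read("What is a recipe?", toy_retrieve, toy_complete))
```

Keeping the retriever and the model behind plain callables makes it easy to swap either of them without touching the rest of the pipeline.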

In this way, we hopefully provide relevant facts for the language model to prepare an accurate answer. Of course, this relies on our ability to extract the proper passages from the collection of documents. In a nutshell, this semantic search can be done in the following way.

As pre-processing steps, once and for all until the documents are updated:
- Extract the raw text from the documents;
- Divide the raw text into small chunks (typically the size of one or several paragraphs; this keeps irrelevant information out of the prompt and helps stay within the prompt size limit);
- Compute the embedding of each chunk with a pre-trained semantic similarity model (a bag-of-words retrieval function like BM25 is a possible alternative);
- Optionally, index these embeddings to enable fast similarity search.

Then, for each question:
- Compute the embedding of the question;
- Find the K chunks whose embeddings are most similar to this embedding.
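
The per-question search can be sketched as a brute-force scan. This is a minimal illustration, assuming that embeddings are plain lists of floats and that cosine similarity is the score; the function names are illustrative, and with a numerical library such as NumPy the same computation would normally be vectorized.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if one of them is null)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_embedding: Sequence[float],
    chunk_embeddings: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) for the k most similar chunks."""
    scored = (
        (cosine_similarity(query_embedding, embedding), index)
        for index, embedding in enumerate(chunk_embeddings)
    )
    # nlargest keeps only k candidates in memory while scanning all chunks.
    return [(index, score) for score, index in heapq.nlargest(k, scored)]


chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.1], chunks, k=2))
```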

Please read our [blog post](https://medium.com/data-from-the-trenches/semantic-search-an-overlooked-nlp-superpower-b67c4b1b119a) on semantic search for further details.

# Data

The documentation available to Dataiku's users and administrators includes more than 2,000 public web pages. This is a testament to the richness and versatility of Dataiku. More prosaically, this also means that finding a specific answer may be challenging.

The content we use in this project includes all pages of Dataiku's [technical documentation](https://doc.dataiku.com/dss/latest/), [knowledge base](https://knowledge.dataiku.com/latest/), and [developer's guide](https://developer.dataiku.com/latest/), as well as the descriptions of the [Dataiku plugins](https://www.dataiku.com/product/plugins/). We scrape this content directly from the internet.

# Walkthrough

## Web scraping

In the [1. Web scraping](flow_zone:pvtveHM) Flow zone, we automatically download and parse all [pages](managed_folder:vElSoRUz) of Dataiku's documentation through two Python recipes ([one](recipe:compute_list_plugins) with `selenium` to get the list of Dataiku plugins and [another](recipe:compute_vElSoRUz) to get the HTML content of all the web pages). We convert the content into the Markdown format to retain some structure of the HTML pages (e.g., the section and subsection headings or the fact that some parts correspond to code samples or tables).

## Chunking

With a [Python recipe](recipe:compute_chunks) in the [2. Chunking](flow_zone:default) Flow zone, we first divide the text using the existing sections and subsections of the pages, and then cut each part into chunks of at most approximately 800 characters. We use some pre-processing and post-processing transformations to make sure that this does not break the Markdown format (in particular for code sections and tables). We systematically add the heading of the current section or subsection to each chunk so that it is more informative and self-contained. We also remove exact duplicates. This is important because the number of extracts added later to the prompt is limited, and duplicates would dilute or crowd out relevant facts. We end up with about 17,000 [chunks](dataset:chunks).
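
A simplified sketch of this chunking logic is given below. It is not the code of the recipe: the function names are hypothetical, paragraphs are used as the smallest unit (so a single paragraph longer than the limit is kept whole), and the protection of code sections and tables is left out, which means that a `#` comment inside a code block would wrongly be treated as a heading.

```python
import re
from typing import List, Tuple

HEADING = re.compile(r"^#{1,6} .*$", re.MULTILINE)


def split_sections(markdown: str) -> List[Tuple[str, str]]:
    """Return (heading, body) pairs; text before the first heading has no heading."""
    sections, heading, start = [], "", 0
    for match in HEADING.finditer(markdown):
        body = markdown[start:match.start()].strip()
        if body:
            sections.append((heading, body))
        heading, start = match.group(0), match.end()
    body = markdown[start:].strip()
    if body:
        sections.append((heading, body))
    return sections


def chunk_section(heading: str, body: str, max_chars: int = 800) -> List[str]:
    """Group paragraphs into chunks of at most max_chars, each prefixed by the heading."""
    chunks, current = [], ""
    for paragraph in re.split(r"\n\s*\n", body):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return [f"{heading}\n\n{chunk}" if heading else chunk for chunk in chunks]


def chunk_document(markdown: str, max_chars: int = 800) -> List[str]:
    """Chunk a Markdown page and drop exact duplicates, keeping the first occurrence."""
    seen, result = set(), []
    for heading, body in split_sections(markdown):
        for chunk in chunk_section(heading, body, max_chars):
            if chunk not in seen:
                seen.add(chunk)
                result.append(chunk)
    return result
```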

## Computing embeddings

In the [3. Embedding computation](flow_zone:DV2wGEN) Flow zone, we compute the [embeddings](managed_folder:bwli327B) of the chunks with a pre-trained semantic similarity DistilBERT model, as in the [Semantic Search example project](https://gallery.dataiku.com/projects/EX_SEMANTICSEARCH/). We do not use an indexing technique like FAISS given the limited number of vectors.
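
A transformer model outputs one vector per token, so a pooling step is needed to obtain a single embedding per chunk. The sketch below shows mean pooling followed by L2 normalization, which is the scheme described in the model card of `sentence-transformers/msmarco-distilbert-cos-v5`; treat it as an assumption about the recipe rather than a description of it. Plain Python lists are used for readability, whereas real code would operate on the tensors returned by the model.

```python
import math
from typing import List, Sequence


def mean_pool(
    token_embeddings: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Average the token vectors, ignoring padding tokens (mask value 0)."""
    total = [0.0] * len(token_embeddings[0])
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for position, value in enumerate(vector):
                total[position] += value
    if count == 0:
        raise ValueError("The attention mask does not select any token.")
    return [value / count for value in total]


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector] if norm else list(vector)


# Two real tokens and one padding token, in two dimensions.
tokens = [[3.0, 0.0], [0.0, 4.0], [100.0, 100.0]]
print(l2_normalize(mean_pool(tokens, [1, 1, 0])))
```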

## Question answering interface

Once the embeddings have been computed, it is possible to get answers through a Dash [web app](web_app:BClj8Fy). It includes a simple input bar to submit a question. In return, the user gets:
- An **answer**;
- The **possibility to flag useful or useless answers** with two icons;
- The **"sources"**. The sources are not provided by GPT-3. They are simply the paragraphs of the retrieved passages (cf. step 2 of the retrieve-then-read pipeline) most semantically similar to the answer.
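
The selection of the sources can be sketched as follows. This is an illustration only: `select_sources` is a hypothetical name, the similarity function is passed as a parameter, and the word-overlap score is a toy stand-in for the embedding-based similarity used in the project.

```python
import re
from typing import Callable, List


def select_sources(
    answer: str,
    retrieved_chunks: List[str],
    similarity: Callable[[str, str], float],
    n_sources: int = 2,
) -> List[str]:
    """Return the paragraphs of the retrieved chunks most similar to the answer."""
    paragraphs: List[str] = []
    for chunk in retrieved_chunks:
        for paragraph in re.split(r"\n\s*\n", chunk):
            paragraph = paragraph.strip()
            if paragraph and paragraph not in paragraphs:
                paragraphs.append(paragraph)
    ranked = sorted(paragraphs, key=lambda p: similarity(answer, p), reverse=True)
    return ranked[:n_sources]


def word_overlap(a: str, b: str) -> float:
    """Toy similarity: Jaccard index of the lowercased word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


chunks = ["First paragraph about recipes.\n\nSecond paragraph about charts."]
print(select_sources("a paragraph about charts", chunks, word_overlap, n_sources=1))
```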

Under the hood, the following happens for each question:
1. The embedding of the question is computed;
2. The top-K chunks whose embeddings are most similar to the query's embedding are retrieved;
3. The question and the chunks are added to the prompt template (cf. template below);
4. The prompt is sent to the OpenAI API;
5. An answer is received;
6. The paragraphs of the retrieved chunks most semantically similar to the answer are selected as "sources";
7. The answer and the sources are displayed to the user.

```python
prompt_template = """Use the following extracts of the Dataiku DSS documentation to answer the question at the end. If you don't know the answer, just say that you don't know.
 - - - - -
Dataiku DSS documentation:
{context}
 - - - - -
Question: I am a Dataiku DSS user. {question}
Answer: """
```
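
Filling this template can be sketched with plain string formatting. The sketch below is an assumption about how the pieces fit together, not the code of the web app: `build_prompt` and `max_context_chars` are hypothetical names, and the character budget is a crude stand-in for a limit expressed in tokens.

```python
from typing import List

# Same template as above, repeated so that the example is self-contained.
PROMPT_TEMPLATE = """Use the following extracts of the Dataiku DSS documentation to answer the question at the end. If you don't know the answer, just say that you don't know.
 - - - - -
Dataiku DSS documentation:
{context}
 - - - - -
Question: I am a Dataiku DSS user. {question}
Answer: """


def build_prompt(question: str, chunks: List[str], max_context_chars: int = 6000) -> str:
    """Insert the question and as many chunks as fit in the budget into the template.

    Chunks are assumed to be sorted from most to least similar to the question.
    """
    selected, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_context_chars:
            break
        selected.append(chunk)
        used += len(chunk)
    # Braces inside chunks are safe: only the template itself is parsed by format().
    return PROMPT_TEMPLATE.format(context="\n\n".join(selected), question=question)


print(build_prompt("How do I create a dataset?", ["Extract 1", "Extract 2"]))
```

Stopping at the first chunk that does not fit keeps the most similar extracts and preserves their ranking in the prompt.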

## Past answers and feedback

The interactions of the users with the question answering [web app](web_app:BClj8Fy) are logged in a [managed folder](managed_folder:jI8G2N4I). These logs are periodically pushed to a [dataset](dataset:past_answers) and removed from the folder.
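
The log-then-flush pattern described above can be sketched with a local directory standing in for the managed folder. All names and the record fields are hypothetical; in the project, the flushed records would be appended to the dataset.

```python
import json
import tempfile
import uuid
from pathlib import Path
from typing import List, Optional


def log_interaction(
    folder: Path, question: str, answer: str, feedback: Optional[str] = None
) -> Path:
    """Write one interaction as its own JSON file to avoid concurrent-write conflicts."""
    record = {"question": question, "answer": answer, "feedback": feedback}
    path = folder / f"{uuid.uuid4().hex}.json"
    path.write_text(json.dumps(record), encoding="utf-8")
    return path


def flush_logs(folder: Path) -> List[dict]:
    """Read all logged interactions, delete the files and return the records."""
    records = []
    for path in sorted(folder.glob("*.json")):
        records.append(json.loads(path.read_text(encoding="utf-8")))
        path.unlink()
    return records


log_folder = Path(tempfile.mkdtemp())
log_interaction(log_folder, "What is a recipe?", "A recipe is ...", feedback="positive")
print(flush_logs(log_folder))
```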

## Visualization of past answers

A Dash [web app](web_app:ehyJdp6) lets you visualize past answers. You can choose to display all past answers, only those with positive feedback, or only those with negative feedback.

![screenshot_past_answers.png](W20pfjiTq05f)

# Next: Use GPT-3 with your own documents

The project can be downloaded [here](https://downloads.dataiku.com/public/dss-samples/EX_QUESTION_ANSWERING/).

## Technical requirements
This project:
- leverages features available starting from **Dataiku 10**;
- requires an OpenAI API key stored as a [user secret](https://doc.dataiku.com/dss/latest/security/user-secrets.html). The key for this secret should be `openai_key`;
- requires a Python 3.8 code environment named `py_38_sample_question_answering` with the following packages:
```
langchain==0.0.75
openai==0.26.4
numpy==1.23.5
transformers==4.18.0
torch==1.13.1
dash==2.8.1
dash-bootstrap-components==1.3.1
beautifulsoup4==4.11.2
selenium==3.141.0
markdownify==0.11.6
```

(Depending on whether you need to scrape internet content or manipulate Markdown documents, `selenium`, `beautifulsoup4` and `markdownify` may not be necessary.)

`py_38_sample_question_answering` should also include the following [initialization script](https://doc.dataiku.com/dss/latest/code-envs/operations-python.html#managed-code-environment-resources-directory) (cf. the "Resources" tab of the code environment):

```python
## Base imports
from dataiku.code_env_resources import clear_all_env_vars
from dataiku.code_env_resources import set_env_path
from dataiku.code_env_resources import set_env_var

# Clears all environment variables defined by previously run script
clear_all_env_vars()

## Hugging Face
# Set HuggingFace cache directory
set_env_path("HF_HOME", "huggingface")

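# Download the models once so that they are cached in the code environment resources
from transformers import AutoModel, AutoTokenizer, GPT2TokenizerFast
# Semantic similarity model used to compute the embeddings
model = AutoModel.from_pretrained("sentence-transformers/msmarco-distilbert-cos-v5")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-cos-v5")
# GPT-2 tokenizer, cached as well (likely needed to count prompt tokens)
gpt2_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")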
```
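
The `openai_key` user secret listed in the requirements can be read from Python code through the Dataiku API client. The sketch below is not taken from the project's web app: `find_secret` and `load_openai_key` are hypothetical helpers, and the lookup is separated from the API call so that it can be tried outside Dataiku.

```python
from typing import Any, Dict


def find_secret(auth_info: Dict[str, Any], key: str) -> str:
    """Return the value of a user secret from the result of get_auth_info()."""
    for secret in auth_info.get("secrets", []):
        if secret["key"] == key:
            return secret["value"]
    raise KeyError(f"User secret '{key}' not found.")


def load_openai_key() -> str:
    """Fetch the OpenAI API key of the current user (only works inside Dataiku)."""
    import dataiku  # imported lazily because it is only available within Dataiku

    auth_info = dataiku.api_client().get_auth_info(with_secrets=True)
    return find_secret(auth_info, "openai_key")


# Outside Dataiku, the lookup can be checked with a hand-written dictionary.
example = {"secrets": [{"key": "openai_key", "value": "sk-placeholder"}]}
print(find_secret(example, "openai_key"))
```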

## Using your own collection of documents

The processing steps in the [1. Web scraping](flow_zone:pvtveHM) and [2. Chunking](flow_zone:default) Flow zones are specific to the documents used. They depend on the format (files, content scraped from the internet...) and structure of the documents. For example, with Dataiku's documentation, we take advantage of the structure of the web pages to identify sections and subsections. We also keep tables and code samples as such.

You need to replace the recipes of these two Flow zones so that the [chunks](dataset:chunks) dataset includes your content, divided into relatively small chunks, with a URL and a title for each chunk. Once this is done, the rest of the Flow and the two web apps can be used directly. Please note that some parameters in the [question answering web app](web_app:BClj8Fy) can be adjusted. For example, you can log all answers or only those with user feedback by setting `LOG_ALL_ANSWERS` to `True` or `False`, respectively.
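
Before running the rest of the Flow on your own content, a quick sanity check of the chunks can help. The sketch below is hypothetical: the column names `text`, `url` and `title` and the length limit are assumptions made for illustration, so adapt them to the actual schema of your dataset.

```python
from typing import Dict, Iterable, List

REQUIRED_COLUMNS = ("text", "url", "title")  # hypothetical column names


def find_invalid_rows(rows: Iterable[Dict[str, str]], max_chars: int = 1000) -> List[int]:
    """Return the indices of rows with an empty required column or an oversized text."""
    invalid = []
    for index, row in enumerate(rows):
        missing = any(not str(row.get(column) or "").strip() for column in REQUIRED_COLUMNS)
        too_long = len(str(row.get("text") or "")) > max_chars
        if missing or too_long:
            invalid.append(index)
    return invalid


rows = [
    {"text": "Some content", "url": "https://example.com/a", "title": "Page A"},
    {"text": "", "url": "https://example.com/b", "title": "Page B"},
]
print(find_invalid_rows(rows))
```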

## Reusing parts of this project

You first need to [download](https://downloads.dataiku.com/public/dss-samples/EX_QUESTION_ANSWERING/) the project and import it on your Dataiku instance.

All the datasets and files are stored on the filesystem, so no remapping is needed. However, you have the option to [change the connection type](https://knowledge.dataiku.com/latest/courses/flow-views-and-actions/connection-changes-concept-summary.html#connection-changes) if you want to rely on a specific data storage type.

If you want to leverage certain parts of the Flow directly, keep in mind that Dataiku allows you to reuse elements at different levels. In particular, it is possible to:
- [duplicate a whole project](https://knowledge.dataiku.com/latest/kb/governance/How-to-duplicate-a-DSS-project.html)
- [copy and paste entire subflows](https://knowledge.dataiku.com/latest/courses/flow-views-and-actions/connection-changes-concept-summary.html#copy-subflow)
- [copy and paste recipes](https://knowledge.dataiku.com/latest/kb/collaboration/How-to-copy-a-recipe-in-your-Flow.html) and remap input/output datasets
- [copy and paste preparation steps](https://doc.dataiku.com/dss/latest/preparation/copy-steps.html) within a Prepare recipe  

# Related Resources
- [OpenAI's tutorial on creating a question answering system](https://platform.openai.com/docs/tutorials/web-qa-embeddings)