# Overview

This example project shows how to answer open-ended questions with a Retrieval-Augmented Generation (RAG) pipeline based on web content. It accompanies a **[blog post](https://medium.com/data-from-the-trenches/standing-on-the-shoulders-of-a-giant-cefe2a50881a) that you are invited to read before exploring this project**.

More specifically, this example project illustrates 3 web-based RAG pipelines:

|  | **Context added in the prompt** | **Web search query** | **Static chain or agent?** |
|---|---|---|---|
| [Basic RAG](web_app:dbf7BKR) | Search results | User question | Static chain |
| [Cascade RAG](recipe:compute_answers_cascade_RAG) | Search results and, if needed, web pages listed in the search results | LLM-generated | Static chain |
| [RAG Agent](recipe:compute_answers_agent) | Search results | LLM-generated | Agent |

For these 3 pipelines, we use either the [Brave Search API](https://brave.com/search/api/) or the [YOU API](https://about.you.com/introducing-the-you-api-web-scale-search-for-llms/) as the web search API.

The project can be downloaded [here](https://downloads.dataiku.com/public/dss-samples/EX_WEB_RAG/) and reused: see the instructions in the [last section](#next-create-your-own-web-based-rag-pipeline-1) of this wiki.

# Data

We use a small [sample of questions/answers](dataset:questions) from [FreshQA](https://github.com/freshllms/freshqa), a "novel dynamic QA benchmark encompassing a diverse range of question and answer types, including questions that require fast-changing world knowledge". For example, FreshQA includes questions on recent events such as "How many tornadoes have been confirmed so far in the United States this year?" or "Who is the latest winner of the Formula 1 world championship?".

Please note that the answers date from 13 November 2023 and, by design of this dataset, some of them may no longer be valid soon after this date.

# Walkthrough

## Basic RAG

The first web-based RAG pipeline simply performs a web search based on the user question and adds the results to a question-answering prompt. We incorporate this pipeline in a simple [web app](web_app:dbf7BKR): the user asks a question and gets an answer with the corresponding sources and citations. Optionally, the user can specify a site or domain (e.g., `wikipedia.org`) to restrict the web search to it.

![webapp.png](GrVOL06qZkTH)
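
The prompt assembly described above can be sketched as follows. This is a minimal illustration and not the code of the web app: the function names, the prompt wording, and the shape of the search results (`title`, `url`, `snippet`) are assumptions, and the `site:` operator is one common way to restrict a search to a domain that may not match what the web app actually does.

```python
from typing import Dict, List, Optional


def build_search_query(question: str, site: Optional[str] = None) -> str:
    """Optionally restrict the query to a site or domain (assumed `site:` operator)."""
    return f"{question} site:{site}" if site else question


def build_prompt(question: str, results: List[Dict[str, str]]) -> str:
    """Number each search result so that the answer can cite its sources."""
    context = "\n".join(
        f"[{i}] {r['title']} ({r['url']}): {r['snippet']}"
        for i, r in enumerate(results, start=1)
    )
    return (
        "Answer the question using only the search results below. "
        "Cite the results you use with their number, e.g. [1].\n\n"
        f"Search results:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical search results, standing in for the response of a web search API.
results = [
    {"title": "Example page", "url": "https://example.com/a", "snippet": "Some relevant text."},
    {"title": "Another page", "url": "https://example.com/b", "snippet": "More relevant text."},
]
query = build_search_query("Who wrote this?", site="wikipedia.org")
prompt = build_prompt("Who wrote this?", results)
```

The resulting `prompt` string would then be sent to the LLM, and the numbered results let the answer point back to its sources.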

## Cascade RAG

The previous pipeline fails if the search results alone are insufficient to provide the answer. In this case, we can try to leverage the actual content of the web pages listed in the search results. To do so, and as suggested in this [LangChain blog post](https://blog.langchain.dev/automating-web-research/), we can process this content as in a standard RAG pipeline: we extract the text from the web pages, split it into chunks, vectorize and index these chunks, and retrieve the most relevant ones with a semantic search.
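
To make the chunk, vectorize, and retrieve steps concrete, here is a self-contained sketch. It deliberately swaps in a simple bag-of-words vector and cosine similarity for the embedding model and vector index that a real pipeline would use, so it only illustrates the mechanics. All function names and parameter values are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def split_into_chunks(text: str, chunk_size: int = 40, overlap: int = 10) -> List[str]:
    """Split a text into overlapping chunks of at most `chunk_size` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    starts = range(0, max(len(words) - overlap, 1), step)
    return [" ".join(words[i:i + chunk_size]) for i in starts]


def vectorize(text: str) -> Counter:
    """Bag-of-words vector (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the question."""
    query_vector = vectorize(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, vectorize(chunk)),
        reverse=True,
    )
    return ranked[:k]


# Made-up page text, standing in for the text extracted from a web page.
page_text = (
    "The city council met on Monday to discuss the budget. "
    "The new bridge over the river opened to traffic in March. "
    "Local schools will close for two weeks during the winter holidays."
)
chunks = split_into_chunks(page_text, chunk_size=12, overlap=2)
top_chunks = retrieve("When did the new bridge over the river open?", chunks, k=1)
```

The overlap between consecutive chunks reduces the risk that a relevant sentence is cut in two at a chunk boundary.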

For the second pipeline, we use a cascade of chains. First, we try to either directly answer the question or generate relevant web search queries. If the answer is still unknown after this first step, we use the web search queries to get web search results and we try again to answer, this time by adding the search results directly to the prompt. If this also fails, we try a final time by leveraging the content scraped from the web pages.

This approach is implemented in a [Python recipe](recipe:compute_answers_cascade_RAG).
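
The control flow of such a cascade can be sketched as below. This is an illustration of the three stages only, not the code of the recipe: the function `cascade_answer` and its callable parameters are hypothetical placeholders for the LLM chains and the web search API, and each stage is assumed to return `None` when it cannot answer.

```python
from typing import Callable, List, Optional, Tuple


def cascade_answer(
    question: str,
    answer_or_queries: Callable[[str], Tuple[Optional[str], List[str]]],
    search: Callable[[str], List[str]],
    answer_from_snippets: Callable[[str, List[str]], Optional[str]],
    answer_from_pages: Callable[[str, List[str]], Optional[str]],
) -> Tuple[Optional[str], str]:
    """Return the answer and the name of the stage that produced it."""
    # Stage 1: answer directly, or get web search queries instead.
    answer, queries = answer_or_queries(question)
    if answer is not None:
        return answer, "direct"

    # Stage 2: add the search results to the prompt.
    results = [result for query in queries for result in search(query)]
    answer = answer_from_snippets(question, results)
    if answer is not None:
        return answer, "search results"

    # Stage 3: fall back on the content of the web pages.
    answer = answer_from_pages(question, results)
    return answer, "web pages"


# Stubbed stages: the first two fail, so the cascade reaches the last one.
answer, stage = cascade_answer(
    "What is the capital of France?",
    answer_or_queries=lambda q: (None, ["capital of France"]),
    search=lambda query: [f"result for '{query}'"],
    answer_from_snippets=lambda q, results: None,
    answer_from_pages=lambda q, results: "Paris",
)
```

Ordering the stages from cheapest to most expensive means that the costly scraping and indexing step only runs when the earlier stages fail.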

## RAG Agent

In another [Python recipe](recipe:compute_answers_agent), we implement an LLM agent that can decide whether and when to use the web search engine. The agent is given one or several retrieval tools. It can call these tools by providing a search query and get relevant search results in return.

This can be necessary in the case of multi-hop questions, i.e., questions that require several sequential retrieval steps (e.g., “What is the date of birth of the main actor in the latest Martin Scorsese movie?”).
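
The loop below is a minimal sketch of how an agent can chain several retrieval steps. It is not the implementation used in the recipe: `run_agent`, the action format, and the tool name `web_search` are assumptions, and the LLM is replaced by a scripted function that returns placeholder values rather than real facts.

```python
from typing import Callable, Dict, List


def run_agent(
    question: str,
    decide: Callable[[str, List[str]], Dict[str, str]],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    """Let `decide` either call a tool or give a final answer, step by step."""
    observations: List[str] = []
    for _ in range(max_steps):
        action = decide(question, observations)
        if "answer" in action:
            return action["answer"]
        # Call the requested tool and keep its output for the next decision.
        observations.append(tools[action["tool"]](action["query"]))
    return "I don't know."


queries_sent: List[str] = []


def fake_search(query: str) -> str:
    """Stand-in for a web search tool, returning placeholders only."""
    queries_sent.append(query)
    return "ACTOR_NAME" if "main actor" in query else "DATE_OF_BIRTH"


def scripted_decide(question: str, observations: List[str]) -> Dict[str, str]:
    """Stand-in for the LLM: two sequential searches, then a final answer."""
    if len(observations) == 0:
        return {"tool": "web_search", "query": "latest Martin Scorsese movie main actor"}
    if len(observations) == 1:
        return {"tool": "web_search", "query": f"date of birth of {observations[0]}"}
    return {"answer": observations[1]}


final_answer = run_agent(
    "What is the date of birth of the main actor in the latest Martin Scorsese movie?",
    decide=scripted_decide,
    tools={"web_search": fake_search},
)
```

The second search query depends on the result of the first one, which is what a static chain with a single retrieval step cannot do.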

## Results

The results of the two Python recipes on 21 November 2023 can be found in this [dataset](dataset:answers_stacked). The correct answers were found for two thirds of the questions. On this limited sample of questions, the "Cascade RAG" pipeline works better than the "RAG Agent" pipeline. This is not surprising, since no multi-hop questions are included in this sample.

# Next: create your own web-based RAG pipeline

You can [download](https://downloads.dataiku.com/public/dss-samples/EX_WEB_RAG/) this example project and import it in your own Dataiku instance.

## Technical requirements

This example project requires:

- An OpenAI API key stored as a [user secret](https://doc.dataiku.com/dss/latest/security/user-secrets.html) if you want to use OpenAI models. The key for this secret should be: `openai_key`;
- A Brave Search API key and/or a You.com API key, stored as [user secrets](https://doc.dataiku.com/dss/latest/security/user-secrets.html). The keys for these secrets should be `BRAVE_API_KEY` and `YDC_API_KEY`, respectively;
- A Python environment named `py_39_sample_web_rag` with this [initialization script](article:2) and the following packages:
```
langchain==0.0.335
openai==1.2.3
faiss-cpu==1.7.4
tiktoken==0.5.1
nest-asyncio==1.5.8
html2text==2020.1.16
beautifulsoup4==4.12.2
unstructured==0.10.30
sentence-transformers==2.2.2
dash==2.14.1
dash-bootstrap-components==1.5.0
Markdown==3.5.1
wikipedia==1.4.0
mlflow==2.8.0
textstat==0.7.3
spacy==3.7.0
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.0/en_core_web_sm-3.7.0.tar.gz
```
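
As a sketch of how the API keys listed above could be read from user secrets in Python code, the helper below looks up a secret by key. The helper `get_secret` is hypothetical and the sample values are placeholders. The commented lines show how the list of secrets can be obtained inside Dataiku with the Python API client, assuming the code runs as the user who owns the secrets.

```python
from typing import Dict, List


def get_secret(secrets: List[Dict[str, str]], key: str) -> str:
    """Return the value of the user secret stored under `key`."""
    for secret in secrets:
        if secret["key"] == key:
            return secret["value"]
    raise KeyError(f"No user secret found with key '{key}'")


# Inside Dataiku, the list of secrets can be retrieved as follows:
# import dataiku
# secrets = dataiku.api_client().get_auth_info(with_secrets=True)["secrets"]

# Placeholder secrets with the same structure, for illustration only.
secrets = [
    {"key": "openai_key", "value": "<your OpenAI key>"},
    {"key": "BRAVE_API_KEY", "value": "<your Brave Search key>"},
]
brave_key = get_secret(secrets, "BRAVE_API_KEY")
```

Raising an error for a missing key makes a misconfigured secret visible immediately, rather than at the first failed API call.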

## Reusing parts of this project

You first need to [download](https://downloads.dataiku.com/public/dss-samples/EX_WEB_RAG/) the project and import it on your Dataiku instance.

All the datasets and files are stored on the filesystem, so no remapping is needed. However, you have the option to [change the connection type](https://knowledge.dataiku.com/latest/courses/flow-views-and-actions/connection-changes-concept-summary.html#connection-changes) if you want to rely on a specific data storage type.

If you want to leverage certain parts of the Flow directly, keep in mind that Dataiku allows you to reuse elements at different levels. In particular it is possible to:
- [Duplicate a whole project](https://knowledge.dataiku.com/latest/kb/governance/How-to-duplicate-a-DSS-project.html)
- [Copy and paste entire subflows](https://knowledge.dataiku.com/latest/courses/flow-views-and-actions/connection-changes-concept-summary.html#copy-subflow)
- [Copy and paste recipes](https://knowledge.dataiku.com/latest/kb/collaboration/How-to-copy-a-recipe-in-your-Flow.html) and remap input/output datasets
- [Copy and paste preparation steps](https://doc.dataiku.com/dss/latest/preparation/copy-steps.html) within a Prepare recipe

# Related resources

- [Blog post](https://medium.com/data-from-the-trenches/standing-on-the-shoulders-of-a-giant-cefe2a50881a)
- [Introduction to Large Language Models With Dataiku](https://content.dataiku.com/llms-dataiku/dataiku-llm-starter-kit)
- [Dataiku's LLM Starter Kit](https://gallery.dataiku.com/projects/EX_LLM_STARTER_KIT/)
- [FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation](https://arxiv.org/abs/2310.03214) (Google and University of Massachusetts Amherst paper)
- [Automating Web Research](https://blog.langchain.dev/automating-web-research/) (LangChain blog post)
- [WebResearchRetriever](https://python.langchain.com/docs/modules/data_connection/retrievers/web_research) (LangChain retriever)
- [YOU API](https://about.you.com/introducing-the-you-api-web-scale-search-for-llms/)
- [Brave Search API](https://brave.com/search/api/)