# Overview

This example project shows how to create a **semantic search** [user interface](web_app:WJhKnfG) in Dataiku.

Semantic search consists of **retrieving text passages whose meaning matches a search query**. For example, if your search query is "car", your results could include passages containing "car", "automobile", or "vehicle". In contrast, keyword search only returns text passages that contain the words of the search query: in the previous example, you would only get passages including the word "car".

Semantic search is relevant for the **following use cases**:
- **E-Commerce**: helping customers explore the product catalogue or suggesting similar products;
- **Customer care**: searching for customer feedback corresponding to certain themes;
- **Corporate IT**: identifying IT support tickets similar to a given ticket;
- **Competition monitoring**: analyzing patent applications, newsletters, press releases...;
- ...

More generally, semantic search is appealing whenever many documents need to be explored. It is **broadly applicable** because it only requires **raw documents, without additional labels** (as opposed to text classification).

The user interface presented in this example project includes the following **features**:

1. **Find documents based on a text query**, combining semantic search and keyword search;
2. **Find documents similar to a given document**;
3. **Filter** the retrieved documents on these documents' metadata (e.g. year);
4. **Display charts** summarizing some metadata (e.g. year) of the retrieved documents;
5. **Download results**.

The user interface and the associated Flow have been designed to be **easily reusable**. You can find instructions to reuse this example project in the [last section](#next-search-through-your-own-documents-1) of this wiki.

# Data

We use a [dataset](dataset:data) of **patent applications** sampled from [open data](https://bulkdata.uspto.gov/) made available by the [United States Patent and Trademark Office](https://www.uspto.gov/). Each patent application includes:
- a unique identifier;
- a title;
- an abstract;
- the name of the organization applying for the patent;
- the application date;
- thematic categories.

In particular, the **abstract** is a short paragraph outlining the content of the patent application. In this example project, we assume that we want to **search patent applications on the basis of these abstracts' content**.

As an example, here is the abstract for an "Oscillating Weeder":
> An agricultural tool including a powered motor that supplies vibration to a tool implement. The vibration allows a user to dig with decreased effort, and also reduces the likelihood of severing a root on a plant by replacing the linear force required to dig with transverse vibration. The tool includes a one-handed grip, allowing the user a free hand to assist with gardening activities.

In the following, since the example project is generic and can easily be transposed to other domains, we use a general terminology: we refer to the abstracts as **"documents"** and all the features of the patent applications as **"documents' metadata"**.

Besides the main dataset, a [table](dataset:categories) provides the mapping between the identifiers of the thematic categories for the patent applications and the labels of these categories.

# Walkthrough

## Computing an embedding for each document

The approach in this example project consists of using a **pre-trained semantic similarity model** to compute **an embedding** (or vector representation) **for each document**. Such a model is trained so that sentences with similar meanings are mapped to similar vectors.

More precisely, we use [msmarco-distilbert-cos-v5](https://huggingface.co/sentence-transformers/msmarco-distilbert-cos-v5), which is suited to asymmetric semantic search (i.e. when the text query is much shorter than typical documents) with documents in English.

The embeddings are computed through a [Python recipe](recipe:compute_P4SttKJS) in the [2. Embedding computation](flow_zone:default) Flow zone. This recipe writes two files in the [output Managed Folder](managed_folder:P4SttKJS): `embeddings.npy` and `ids.npy`. `embeddings.npy` contains the embeddings for the whole dataset and `ids.npy` corresponds to the mapping between the order of the embeddings in `embeddings.npy` and the ids of the documents.
 
If the Python recipe is re-run on a modified dataset, embeddings are computed only for the previously unseen rows (i.e. rows whose ids are not in `ids.npy`). The new `embeddings.npy` and `ids.npy` then cover exactly the documents in the new version of the dataset, and no others.
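
The incremental behavior described above can be sketched as follows. This is a minimal illustration and not the recipe's actual code: plain Python lists stand in for the `.npy` arrays, and `update_embeddings` and `fake_embed` are hypothetical names, with `fake_embed` replacing the real model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]


def update_embeddings(
    dataset: Dict[str, str],
    old_ids: Sequence[str],
    old_embeddings: Sequence[Vector],
    embed: Callable[[List[str]], List[Vector]],
) -> Tuple[List[str], List[Vector]]:
    """Return ids and embeddings covering exactly the documents in `dataset`.

    `embed` is only called on documents whose id is not in `old_ids`.
    """
    cache = dict(zip(old_ids, old_embeddings))
    # Only previously unseen documents need to go through the model.
    new_ids = [doc_id for doc_id in dataset if doc_id not in cache]
    if new_ids:
        new_vectors = embed([dataset[doc_id] for doc_id in new_ids])
        cache.update(zip(new_ids, new_vectors))
    # Follow the order of the current dataset and drop ids that disappeared.
    ids = list(dataset)
    return ids, [cache[doc_id] for doc_id in ids]


def fake_embed(texts: List[str]) -> List[Vector]:
    """Hypothetical stand-in for the real model: one 2-d vector per text."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


# First run: both documents are embedded.
ids, embeddings = update_embeddings({"a": "hello world", "b": "foo"}, [], [], fake_embed)
# Second run: "a" was removed, "c" is new, "b" is reused without recomputation.
ids, embeddings = update_embeddings(
    {"b": "foo", "c": "new doc here"}, ids, embeddings, fake_embed
)
print(ids)  # ['b', 'c']
```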

## Indexing the documents to enable keyword search

Pre-trained semantic similarity models have been trained on large datasets, so they should perform well with common words. Conversely, they should **perform poorly with words largely absent from these training sets**. This is why, if the documents processed include uncommon jargon, combining the semantic search with a keyword search can be useful.

Therefore we use a [Python recipe](recipe:compute_evJsZfu6) to create an index for this keyword search. It is simply a dictionary mapping a given word to all the ids of the documents including this word.
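
As an illustration, such an index could be built along the following lines. This is only a sketch: the tokenization rule (lowercased alphanumeric words) and the function names are assumptions, not necessarily what the recipe does.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_keyword_index(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each word to the set of ids of the documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return dict(index)


docs = {
    "d1": "An agricultural tool including a powered motor",
    "d2": "A motor vehicle with an electric battery",
}
index = build_keyword_index(docs)
print(sorted(index["motor"]))  # ['d1', 'd2']
print(sorted(index["battery"]))  # ['d2']
```

Looking up a query word in the dictionary then returns the candidate documents directly, without scanning the whole corpus.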

## Indexing the embeddings to facilitate vector similarity search

The naive approach to find the best matches for a semantic query is to:
1. compute the embedding of this query;
2. compute the dot-product of this (normalized) embedding with all the pre-computed (normalized) embeddings of the documents in the corpus;
3. select the documents corresponding to the highest dot-products.
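
The three steps above can be sketched as follows. The example uses only the standard library and made-up two-dimensional embeddings to stay self-contained; real embeddings would come from the model and would typically be handled as NumPy arrays. The name `naive_search` is illustrative.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def normalize(vector: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def naive_search(
    query_embedding: Vector,
    doc_embeddings: Dict[str, Vector],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Score every document by dot product and return the `top_n` best."""
    query = normalize(query_embedding)
    scores = {
        doc_id: sum(q * d for q, d in zip(query, normalize(embedding)))
        for doc_id, embedding in doc_embeddings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Toy embeddings, invented for the example.
toy_embeddings = {
    "car": [0.9, 0.1],
    "automobile": [0.8, 0.2],
    "weeder": [0.1, 0.9],
}
results = naive_search([1.0, 0.0], toy_embeddings, top_n=2)
print([doc_id for doc_id, _ in results])  # ['car', 'automobile']
```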

However, this approach does not scale well in terms of memory and latency, since every query is compared with every document embedding. If needed, an alternative is to use **efficient vector indexing techniques such as [Faiss](https://github.com/facebookresearch/faiss)**.

We index the embeddings of the documents in a Faiss index through a [Python recipe](recipe:compute_1G2AREMC). With this index, it will be possible to find approximate nearest neighbors with a much lower memory footprint.

## Using the search interface

Once the embeddings and the indices have been built, the search interface is ready to use. It is a Dash [web app](web_app:WJhKnfG) with four main sections:

1. The users can input a query in the **search bar** and submit it by clicking on the button or hitting "Enter";
2. The **search results** are then displayed below the search bar. Their presentation combines the documents and their metadata. For each result, a "similar results" link lets you retrieve similar documents;
3. At the bottom of the sidebar, **charts** summarize some aspects of the results;
4. At the top of the sidebar, various controls can be used to **filter** the results.

Additionally, the displayed results can be downloaded as a CSV file.

![output.gif](EYGyPA7b1axh)

Under the hood, when a user submits a query:
1. The embedding of the query is computed;
2. Three scores are computed for each document:
    1. A **semantic search score**, as the cosine similarity of the query embedding and the document embedding;
    2. A **keyword search score**, as the share of the words of the query present in the document;
    3. An aggregate score, as a weighted average of the semantic search score and the keyword search score;
3. The top-N results with the highest aggregate score are returned.
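
The scoring logic above can be sketched as follows. This is an illustrative sketch rather than the web app's code: the semantic scores are assumed to be pre-computed cosine similarities, and the `semantic_weight` parameter, its default value, and the tokenization rule are assumptions.

```python
import re
from typing import Dict, List, Tuple


def keyword_score(query: str, document: str) -> float:
    """Share of the distinct query words that appear in the document."""
    query_words = set(re.findall(r"[a-z0-9]+", query.lower()))
    doc_words = set(re.findall(r"[a-z0-9]+", document.lower()))
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)


def rank(
    query: str,
    documents: Dict[str, str],
    semantic_scores: Dict[str, float],
    semantic_weight: float = 0.5,
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    """Combine both scores with a weighted average and keep the `top_n` best."""
    aggregate = {
        doc_id: semantic_weight * semantic_scores[doc_id]
        + (1 - semantic_weight) * keyword_score(query, text)
        for doc_id, text in documents.items()
    }
    return sorted(aggregate.items(), key=lambda item: item[1], reverse=True)[:top_n]


docs = {"d1": "A powered garden tool", "d2": "An electric car battery"}
semantic = {"d1": 0.2, "d2": 0.6}  # made-up cosine similarities with the query
best = rank("car battery", docs, semantic, semantic_weight=0.5, top_n=1)
print(best[0][0])  # d2
```

Raising `semantic_weight` favors matches in meaning, while lowering it favors documents that contain the exact words of the query.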

## Bonus: using Elasticsearch as a backend

The web app described above relies only on Dataiku. Alternatively, for more advanced search features or for better scalability, it is possible to use a search engine solution as a backend. The [4. Bonus: Indexing with Elasticsearch](flow_zone:rAT8tLI) Flow zone illustrates how to index the documents (including their embeddings) in Elasticsearch. A version of the [web app](web_app:2ozpAfW) leveraging Elasticsearch as a backend is also provided.

Please note that this web app cannot be used live on the public Dataiku Gallery but you can try it if you download the project.

# Next: Search through your own documents

The project can be downloaded [here](https://downloads.dataiku.com/public/dss-samples/EX_SEMANTICSEARCH/).

## Technical requirements
This project:
- leverages features available starting from **Dataiku 10**;
- requires a Python 3.8 code environment named `py_38_sample_semanticsearch` with the following packages:
``` 
transformers==4.18.0
torch
dash
dash-bootstrap-components
faiss-cpu
elasticsearch
```

`py_38_sample_semanticsearch` should also include the following [initialization script](https://doc.dataiku.com/dss/latest/code-envs/operations-python.html#managed-code-environment-resources-directory) (cf. the "Resources" tab of the code environment):

```
## Base imports
from dataiku.code_env_resources import clear_all_env_vars
from dataiku.code_env_resources import set_env_path
from dataiku.code_env_resources import set_env_var

# Clears all environment variables defined by previously run script
clear_all_env_vars()

## Hugging Face
# Set HuggingFace cache directory
set_env_path("HF_HOME", "huggingface")

from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("sentence-transformers/msmarco-distilbert-cos-v5")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-cos-v5")
```

## Importing your own documents

You can replace the input [dataset](dataset:data) with your **own documents**. This new dataset should have an `id` column and a `text` column. The values of the `id` column are used as unique identifiers of your documents and the embeddings are computed on the texts in the `text` column. Alternatively, if your columns have different names, you can update the project's global variables `id_label` and `text_label`.

The pretrained model used in this example project works with **relatively short paragraphs** (a few sentences). For longer documents, you should either keep only an extract (this is relevant if your documents include an abstract or a conclusion that summarizes their content well) or slice your documents into short paragraphs.
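
One possible way to slice long documents is sketched below. The sentence-splitting rule, the `max_words` threshold, and the function name are assumptions to adapt to your corpus. Each resulting passage would also need its own unique `id`.

```python
import re
from typing import List


def split_into_passages(text: str, max_words: int = 60) -> List[str]:
    """Group consecutive sentences into passages of at most `max_words` words.

    A single sentence longer than `max_words` is kept whole in its own passage.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    passages: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new passage when adding the sentence would exceed the limit.
        if current and current_len + n_words > max_words:
            passages.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        passages.append(" ".join(current))
    return passages


text = "First sentence here. Second one follows. A third sentence closes the text."
print(split_into_passages(text, max_words=7))
# ['First sentence here. Second one follows.', 'A third sentence closes the text.']
```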

## Choosing a pretrained semantic similarity model

You can keep the same model, i.e. [msmarco-distilbert-cos-v5](https://huggingface.co/sentence-transformers/msmarco-distilbert-cos-v5), but you may want to change it for various reasons:

- **Languages**: `msmarco-distilbert-cos-v5` is a monolingual model for English. If your documents are in another language, you need to use another monolingual model; if they are in several languages, you need a multilingual model. You can find other models on [Hugging Face](https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads);
- **Domain**: ideally, the vocabulary of your documents is well covered in the training sets of the pretrained model. This may be a reason to switch to another model. Another option is to [fine-tune a pretrained model on your own corpus](https://www.sbert.net/examples/training/data_augmentation/README.html). In practice, we have found that pretrained models are effective in most settings, even without fine-tuning, in particular when semantic search is combined with keyword search;
- **Performance**: there is a trade-off between the size of a model and its accuracy. You may switch to another model to find a balance more appropriate for your use case. If accuracy is paramount, it is also possible to use a [cross-encoder](https://www.sbert.net/examples/applications/retrieve_rerank/README.html) on top of the semantic similarity model, at the cost of higher latency.

To use another model, change the `model` global variable of the project and adjust the initialization script mentioned [above](#technical-requirements-1).

## Adapting the web apps

The two Dash web apps are designed to be **easily adapted to your specific use case**. Your corpus of documents will have different metadata, so you will need to adjust the filters, the charts, or the way results are displayed. For this, you can edit the portions of the web apps' code delineated by the following comments:
```
#### Edit below...
```
and:
```
#### No need to edit the code below
```
Please follow the instructions provided as comments in the code.

You can also modify variables at the beginning of the code to decide:
- whether you want to use the Faiss indexing (only for the first web app);
- whether you want to combine exact search and semantic search;
- the maximum number of results displayed;
- the maximum number of rows in the bar charts.

## Using Elasticsearch

You can try the version of the [web app](web_app:2ozpAfW) leveraging Elasticsearch if you download the project and use it with your own Dataiku instance. If you use Elastic Cloud, you will need to store your Cloud ID, your user name and your password as [user secrets](https://doc.dataiku.com/dss/latest/security/user-secrets.html) with the following names: `cloud_id`, `elastic_user`, `elastic_password`. If you use a self-managed Elasticsearch cluster, you will need to slightly adapt the code of the web app for the [connection to the cluster](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html) (`es = elasticsearch.Elasticsearch(...)`).

## Reusing parts of this project

You first need to download the [project](https://downloads.dataiku.com/public/dss-samples/EX_SEMANTICSEARCH/) and import it in your Dataiku instance. All the datasets and files are stored on the filesystem, so no remapping is needed. However, you have the option to [change the connection type](https://knowledge.dataiku.com/latest/courses/flow-views-and-actions/connection-changes-concept-summary.html#connection-changes) if you want to rely on a specific data storage type.

If you want to leverage certain parts of the Flow directly, keep in mind that Dataiku allows you to reuse elements at different levels. In particular it is possible to:
- [duplicate a whole project](https://knowledge.dataiku.com/latest/kb/governance/How-to-duplicate-a-DSS-project.html)
- [copy and paste entire subflows](https://knowledge.dataiku.com/latest/courses/flow-views-and-actions/connection-changes-concept-summary.html#copy-subflow)
- [copy and paste recipes](https://knowledge.dataiku.com/latest/kb/collaboration/How-to-copy-a-recipe-in-your-Flow.html) and remap input/output datasets
- [copy and paste preparation steps](https://doc.dataiku.com/dss/latest/preparation/copy-steps.html) within a Prepare recipe  

# Related Resources
- [SentenceTransformers documentation](https://www.sbert.net/)