Load and re-use a SentenceTransformers word embedding model

Prerequisites

Introduction

Natural Language Processing (NLP) use cases typically involve converting text to word embeddings. Training your own word embeddings on large text corpora is costly, so downloading pre-trained word embedding models and re-training them as needed is a popular option. SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. The framework is based on PyTorch and Transformers and offers a large collection of pre-trained models.

In this tutorial, you will use Dataiku’s Code Environment resources feature to download and save pre-trained word embedding models from SentenceTransformers. You will then use one of those models to map a few sentences to embeddings.

Downloading the pre-trained word embedding model

The first step is to download the required assets for your pre-trained models. To do so, in the Resources screen of your Code Environment, enter the following initialization script, then click Update:

######################## Base imports #################################
import logging
import os
import shutil

from dataiku.code_env_resources import clear_all_env_vars
from dataiku.code_env_resources import grant_permissions
from dataiku.code_env_resources import set_env_path
from dataiku.code_env_resources import set_env_var
from dataiku.code_env_resources import update_models_meta

# Set-up logging
logging.basicConfig()
logger = logging.getLogger("code_env_resources")
logger.setLevel(logging.INFO)

# Clear all environment variables defined by a previously run script
clear_all_env_vars()

# Optionally restrict the GPUs this code environment can use (it can use all by default)
# set_env_var("CUDA_VISIBLE_DEVICES", "") # Hide all GPUs
# set_env_var("CUDA_VISIBLE_DEVICES", "0") # Allow only cuda:0
# set_env_var("CUDA_VISIBLE_DEVICES", "0,1") # Allow only cuda:0 & cuda:1

######################## Sentence Transformers #################################
# Set sentence_transformers cache directory
set_env_path("SENTENCE_TRANSFORMERS_HOME", "sentence_transformers")

import sentence_transformers

# Download pretrained models
MODELS_REPO_AND_REVISION = [
    ("DataikuNLP/average_word_embeddings_glove.6B.300d", "52d892b217016f53b6c717839bf62c746a658933"), 
    # Add other models you wish to download and make available as shown below (removing the # to uncomment):
    # ("bert-base-uncased", "b96743c503420c0858ad23fca994e670844c6c05"),
]

sentence_transformers_cache_dir = os.getenv("SENTENCE_TRANSFORMERS_HOME")
for (model_repo, revision) in MODELS_REPO_AND_REVISION:
    logger.info("Loading pretrained SentenceTransformer model: {}".format(model_repo))
    model_path = os.path.join(sentence_transformers_cache_dir, model_repo.replace("/", "_"))

    # Download only if the model folder does not exist yet.
    # Note: this also skips a model that was already downloaded with a different revision
    if not os.path.exists(model_path):
        model_path_tmp = sentence_transformers.util.snapshot_download(
            repo_id=model_repo,
            revision=revision,
            cache_dir=sentence_transformers_cache_dir,
            library_name="sentence-transformers",
            library_version=sentence_transformers.__version__,
            ignore_files=["flax_model.msgpack", "rust_model.ot", "tf_model.h5",],
        )
        os.rename(model_path_tmp, model_path)
    else:
        logger.info("Model already downloaded, skipping")

# Add sentence embedding models to the code-envs models meta-data
# (ensure that they are properly displayed in the feature handling)
update_models_meta()

# Grant everyone read access to pretrained models in sentence_transformers/ folder
# (by default, sentence transformers makes them only readable by the owner)
grant_permissions(sentence_transformers_cache_dir)

This script retrieves a pre-trained model from SentenceTransformers and stores it on the Dataiku instance. To download more models, add them to the list along with their revision, which is how the model repository versions these models.
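The initialization script names each model folder after its repository identifier, with "/" replaced by "_". As a minimal sketch of that naming rule, the snippet below uses a hypothetical helper, local_model_dir (not part of Dataiku or SentenceTransformers), and a placeholder cache path:

```python
import os


def local_model_dir(cache_dir, model_repo):
    """Return the folder where the initialization script stores a repository."""
    # Same rule as the initialization script: "/" in the repo id becomes "_"
    return os.path.join(cache_dir, model_repo.replace("/", "_"))


# "/path/to/sentence_transformers" is a placeholder, not a real Dataiku path
print(local_model_dir(
    "/path/to/sentence_transformers",
    "DataikuNLP/average_word_embeddings_glove.6B.300d",
))
```

The resulting folder name is the one to pass to SentenceTransformer when loading the model from a recipe or notebook.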

Note that the script will only need to run once. After that, all users allowed to use the Code Environment will be able to leverage the pre-trained models without having to re-download them.
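To confirm which models are available before loading one, you could list the folders under the cache directory. The sketch below assumes that each downloaded model sits in its own sub-folder of SENTENCE_TRANSFORMERS_HOME, as the initialization script arranges; list_downloaded_models is a hypothetical helper name, and the cache directory may also contain other folders created during the download.

```python
import os


def list_downloaded_models(cache_dir):
    """Return the sorted names of the sub-folders found in cache_dir."""
    # An unset variable or a missing folder means nothing is available yet
    if not cache_dir or not os.path.isdir(cache_dir):
        return []
    return sorted(
        name
        for name in os.listdir(cache_dir)
        if os.path.isdir(os.path.join(cache_dir, name))
    )


# Inside the code environment, this variable is set by the initialization script
print(list_downloaded_models(os.getenv("SENTENCE_TRANSFORMERS_HOME")))
```

An empty list suggests that the code environment resources have not been updated yet, or that the code is running outside that code environment.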

Converting sentences to embeddings using your pre-trained model

You can now use those pre-trained models in a Python recipe or notebook in your Dataiku project. Here is an example that uses the average_word_embeddings_glove.6B.300d model to map each sentence in a list to a 300-dimensional dense vector.

import os
from sentence_transformers import SentenceTransformer

# Load pre-trained model
sentence_transformer_home = os.getenv('SENTENCE_TRANSFORMERS_HOME')
model_path = os.path.join(sentence_transformer_home, 'DataikuNLP_average_word_embeddings_glove.6B.300d')
model = SentenceTransformer(model_path)

sentences = ["I really like Ice cream", "Brussels sprouts are okay too"]

# Compute one embedding per sentence
embeddings = model.encode(sentences)
print(embeddings.shape)

After running this code, embeddings is a NumPy array of shape (2, 300): one 300-dimensional vector of numerical values per sentence.
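One common way to use such vectors is to compare sentences by cosine similarity. This goes slightly beyond the tutorial, so treat it as an illustrative sketch: cosine_similarity is a hypothetical helper written in plain Python, and the vectors below are small stand-ins rather than real model output.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; report zero similarity in that case
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Stand-in vectors; with the tutorial code you would pass
# embeddings[0] and embeddings[1] instead
print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]))
```

Values close to 1 indicate vectors pointing in similar directions, while values near 0 indicate little overlap.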