# Load and re-use a SentenceTransformers word embedding model

Pre-requisites

* Dataiku DSS version >= 10.0.0.

* A Python>=3.6 Code Environment with the following package:

§ sentence-transformers==2.2.2

Natural Language Processing (NLP) use cases typically involve converting text to word embeddings. Training your own word embeddings on large text corpora is costly. As a result, downloading *pre-trained word embedding models* and re-training them as needed is a popular option.

SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. The framework is based on PyTorch and Transformers and offers a large collection of pre-trained models.

In this tutorial, you will use Dataiku’s Code Environment resources feature to download and save pre-trained word embedding models from SentenceTransformers. You will then use one of those models to map a few sentences to embeddings.

## Downloading the pre-trained word embedding model

The first step is to download the required assets for your pre-trained models. To do so, in the *Resources* screen of your Code Environment, input the following **initialization script** then click on *Update*:

§ ######################## Base imports #################################

§ import logging

§ import os

§ import shutil

§ from dataiku.code\_env\_resources import clear\_all\_env\_vars

§ from dataiku.code\_env\_resources import grant\_permissions

§ from dataiku.code\_env\_resources import set\_env\_path

§ from dataiku.code\_env\_resources import set\_env\_var

§ from dataiku.code\_env\_resources import update\_models\_meta

§ # Set-up logging

§ logging.basicConfig()

§ logger = logging.getLogger("code\_env\_resources")

§ logger.setLevel(logging.INFO)

§ # Clear all environment variables defined by a previously run script

§ clear\_all\_env\_vars()

§ # Optionally restrict the GPUs this code environment can use (it can use all by default)

§ # set\_env\_var("CUDA\_VISIBLE\_DEVICES", "") # Hide all GPUs

§ # set\_env\_var("CUDA\_VISIBLE\_DEVICES", "0") # Allow only cuda:0

§ # set\_env\_var("CUDA\_VISIBLE\_DEVICES", "0,1") # Allow only cuda:0 & cuda:1

§ ######################## Sentence Transformers #################################

§ # Set sentence\_transformers cache directory

§ set\_env\_path("SENTENCE\_TRANSFORMERS\_HOME", "sentence\_transformers")

§ import sentence\_transformers

§ # Download pretrained models

§ MODELS\_REPO\_AND\_REVISION = [

§ ("DataikuNLP/average\_word\_embeddings\_glove.6B.300d", "52d892b217016f53b6c717839bf62c746a658933"),

§ ("DataikuNLP/TinyBERT\_General\_4L\_312D", "33ec5b27fcd40369ff402c779baffe219f5360fe"),

§ ("DataikuNLP/paraphrase-multilingual-MiniLM-L12-v2", "4f806dbc260d6ce3d6aed0cbf875f668cc1b5480"),

§ # Add other models you wish to download and make available as shown below (removing the # to uncomment):

§ # ("bert-base-uncased", "b96743c503420c0858ad23fca994e670844c6c05"),

§ ]

§ sentence\_transformers\_cache\_dir = os.getenv("SENTENCE\_TRANSFORMERS\_HOME")

§ for (model\_repo, revision) in MODELS\_REPO\_AND\_REVISION:

§     logger.info("Loading pretrained SentenceTransformer model: {}".format(model\_repo))

§     model\_path = os.path.join(sentence\_transformers\_cache\_dir, model\_repo.replace("/", "\_"))

§     # Uncomment below to overwrite (force re-download of) all existing models

§     # if os.path.exists(model\_path):

§     #     logger.warning("Removing model: {}".format(model\_path))

§     #     shutil.rmtree(model\_path)

§     # This also skips same models with a different revision

§     if not os.path.exists(model\_path):

§         model\_path\_tmp = sentence\_transformers.util.snapshot\_download(

§             repo\_id=model\_repo,

§             revision=revision,

§             cache\_dir=sentence\_transformers\_cache\_dir,

§             library\_name="sentence-transformers",

§             library\_version=sentence\_transformers.\_\_version\_\_,

§             ignore\_files=["flax\_model.msgpack", "rust\_model.ot", "tf\_model.h5",],

§         )

§         os.rename(model\_path\_tmp, model\_path)

§     else:

§         logger.info("Model already downloaded, skipping")

§ # Add sentence embedding models to the code-envs models meta-data

§ # (ensure that they are properly displayed in the feature handling)

§ update\_models\_meta()

§ # Grant everyone read access to pretrained models in sentence\_transformers/ folder

§ # (by default, sentence transformers makes them only readable by the owner)

§ grant\_permissions(sentence\_transformers\_cache\_dir)

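This script retrieves three pre-trained SentenceTransformers models, each pinned to a specific revision, and stores them on the Dataiku instance in the directory referenced by the `SENTENCE_TRANSFORMERS_HOME` environment variable.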

Note that the script will only need to run once. After that, all users allowed to use the Code Environment will be able to leverage the pre-trained models without having to re-download them.
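The script stores each model in a folder whose name is the repository id with `/` replaced by `_`, and it skips any model whose folder already exists. The sketch below reproduces that naming rule in isolation so you can check which models are still missing from a cache directory. The helper names `local_model_dir` and `missing_models` are hypothetical (they are not part of Dataiku or SentenceTransformers), and a temporary directory stands in for `SENTENCE_TRANSFORMERS_HOME`.

```python
import os
import tempfile


def local_model_dir(cache_dir, model_repo):
    """Mirror the script's naming rule: '/' in the repo id becomes '_'.

    Hypothetical helper, written for illustration only.
    """
    return os.path.join(cache_dir, model_repo.replace("/", "_"))


def missing_models(cache_dir, model_repos):
    """Return the repo ids whose local folder does not exist yet.

    Like the initialization script, this checks only for the folder,
    so it cannot tell two revisions of the same model apart.
    """
    return [
        repo
        for repo in model_repos
        if not os.path.isdir(local_model_dir(cache_dir, repo))
    ]


# Demonstration in a throwaway directory (an assumption standing in for
# the real SENTENCE_TRANSFORMERS_HOME of your code environment).
with tempfile.TemporaryDirectory() as cache_dir:
    repos = [
        "DataikuNLP/TinyBERT_General_4L_312D",
        "DataikuNLP/average_word_embeddings_glove.6B.300d",
    ]
    # Pretend the first model has already been downloaded.
    os.makedirs(local_model_dir(cache_dir, repos[0]))
    demo_missing = missing_models(cache_dir, repos)
    print(demo_missing)
```

In a recipe or notebook running in the code environment, you could pass `os.getenv("SENTENCE_TRANSFORMERS_HOME")` as `cache_dir` instead of the temporary directory.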

## Converting sentences to embeddings using your pre-trained model

You can now use those pre-trained models in your Dataiku Project’s Python Recipe or notebook. Here is an example using the `average_word_embeddings_glove.6B.300d` model to map each sentence in a list to a 300-dimensional dense vector space.

§ import os

§ from sentence\_transformers import SentenceTransformer

§ # Load pre-trained model

§ sentence\_transformer\_home = os.getenv('SENTENCE\_TRANSFORMERS\_HOME')

§ model\_path = os.path.join(sentence\_transformer\_home, 'DataikuNLP\_average\_word\_embeddings\_glove.6B.300d')

§ model = SentenceTransformer(model\_path)

§ sentences = ["I really like Ice cream", "Brussels sprouts are okay too"]

§ # Get sentence embeddings

§ embeddings = model.encode(sentences)

§ embeddings

Running this code should output a NumPy array of shape (2, 300) containing numerical values: one 300-dimensional row per input sentence.
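A common next step with sentence embeddings is to compare two of them with cosine similarity. The following is a minimal, dependency-free sketch of that computation, not part of the original tutorial: the function `cosine_similarity` and the toy vectors are illustrative assumptions, and the toy vectors stand in for two rows of the embeddings array.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors (assumed values, not real model output).
toy_a = [1.0, 2.0, 3.0]
toy_b = [2.0, 4.0, 6.0]   # same direction as toy_a
toy_c = [3.0, 0.0, -1.0]  # orthogonal to toy_a

print(cosine_similarity(toy_a, toy_b))
print(cosine_similarity(toy_a, toy_c))
```

With the array produced above, the equivalent call would be `cosine_similarity(embeddings[0], embeddings[1])`. For larger batches, a vectorized routine such as `sentence_transformers.util.cos_sim` is likely a better fit than this pure-Python loop.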
