# Text variables

The **Text handling** and **Missing values** methods, and their related controls, specify how a text variable is handled.

## Text handling

* **Count vectorization**

* **TF/IDF vectorization**

* **Hashing trick** (producing sparse matrices)

* **Hashing trick + Truncated SVD** (producing smaller dense matrices for algorithms that do not support sparse matrices)

* **Sentence embedding** (Python training backend only)

For the specific case of deep learning, see text features in deep-learning models.
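To make the difference between the first three methods concrete, here is a minimal pure-Python sketch of count vectorization, TF/IDF weighting, and the hashing trick. It is an illustration only, not how DSS implements these options: the whitespace tokenizer, the unsmoothed `log(N / df)` IDF formula, the CRC32 hash, and all function names are assumptions chosen for brevity.

```python
import math
import zlib
from collections import Counter

docs = ["the cat sat", "the dog sat", "the cat ran fast"]


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    return text.lower().split()


def count_vectorize(documents):
    # One column per distinct token, each cell holds the token's count.
    vocab = sorted({tok for doc in documents for tok in tokenize(doc)})
    index = {tok: i for i, tok in enumerate(vocab)}
    rows = []
    for doc in documents:
        row = [0] * len(vocab)
        for tok, n in Counter(tokenize(doc)).items():
            row[index[tok]] = n
        rows.append(row)
    return vocab, rows


def tfidf_vectorize(documents):
    # Counts are reweighted so that tokens present in many documents weigh less.
    vocab, counts = count_vectorize(documents)
    n_docs = len(documents)
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / df) for df in doc_freq]  # unsmoothed IDF (assumption)
    return vocab, [[c * w for c, w in zip(row, idf)] for row in counts]


def hash_vectorize(documents, n_buckets=8):
    # No vocabulary is stored: each token is hashed straight to a column index.
    # Distinct tokens may collide in the same bucket.
    rows = []
    for doc in documents:
        row = [0] * n_buckets
        for tok in tokenize(doc):
            row[zlib.crc32(tok.encode("utf-8")) % n_buckets] += 1
        rows.append(row)
    return rows


vocab, counts = count_vectorize(docs)
_, tfidf = tfidf_vectorize(docs)
hashed = hash_vectorize(docs)
```

With count and TF/IDF vectorization the output width grows with the vocabulary, whereas the hashing trick fixes the width up front (`n_buckets` here) at the cost of possible collisions. The Truncated SVD variant would then project such fixed-width rows onto a smaller number of dense columns.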

### Sentence embedding

Sentence embedding creates semantically meaningful dense matrix representations of text. In DSS, this text handling method relies on transformer models loaded through the `transformers` and `sentence-transformers` libraries. Each text sample is passed through the selected transformer model, and the outputs are then pooled into an embedding whose fixed size depends on the model. The computations automatically use a GPU if one is available.
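The pooling step can be illustrated with a small sketch. The page does not say which pooling strategy is applied, and it generally depends on the selected model; the example below assumes mean pooling over non-padding tokens, a common choice, and uses hand-written vectors in place of real transformer outputs. The function name `mean_pool` is hypothetical.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    kept = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            kept += 1
            for j, x in enumerate(vec):
                total[j] += x
    if kept == 0:
        # Nothing to average: return the zero vector.
        return total
    return [x / kept for x in total]


# Three token vectors of size 3; the last one is padding and is masked out.
token_vectors = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [9.0, 9.0, 9.0],
]
attention_mask = [1, 1, 0]

embedding = mean_pool(token_vectors, attention_mask)
```

Whatever the number of tokens in the input, the result has the same length as a single token vector, which is why the embedding size is fixed per model.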

Using sentence embedding in Visual ML requires the `sentence-transformers` Python package. You can install all necessary packages by adding the “Visual Machine Learning with sentence embedding” package set in the code environment's “Packages to install” tab.

Sentence embedding also requires models to be downloaded. This can be done via the managed code environment resources directory. See below for an example code environment resources initialization script.

```python
######################## Base imports #################################
import logging
import os
import shutil

from dataiku.code_env_resources import clear_all_env_vars
from dataiku.code_env_resources import grant_permissions
from dataiku.code_env_resources import set_env_path
from dataiku.code_env_resources import set_env_var
from dataiku.code_env_resources import update_models_meta

# Set-up logging
logging.basicConfig()
logger = logging.getLogger("code_env_resources")
logger.setLevel(logging.INFO)

# Clear all environment variables defined by a previously run script
clear_all_env_vars()

######################## Sentence Transformers #################################
# Set sentence_transformers cache directory
set_env_path("SENTENCE_TRANSFORMERS_HOME", "sentence_transformers")

import sentence_transformers

# Download pretrained models
MODELS_REPO_AND_REVISION = [
    ("DataikuNLP/average_word_embeddings_glove.6B.300d", "52d892b217016f53b6c717839bf62c746a658933"),
    ("DataikuNLP/TinyBERT_General_4L_312D", "33ec5b27fcd40369ff402c779baffe219f5360fe"),
    ("DataikuNLP/paraphrase-multilingual-MiniLM-L12-v2", "4f806dbc260d6ce3d6aed0cbf875f668cc1b5480"),
    # Add other models you wish to download and make available as shown below (removing the # to uncomment):
    # ("bert-base-uncased", "b96743c503420c0858ad23fca994e670844c6c05"),
]

sentence_transformers_cache_dir = os.getenv("SENTENCE_TRANSFORMERS_HOME")
for (model_repo, revision) in MODELS_REPO_AND_REVISION:
    logger.info("Loading pretrained SentenceTransformer model: {}".format(model_repo))
    model_path = os.path.join(sentence_transformers_cache_dir, model_repo.replace("/", "_"))

    # Uncomment below to overwrite (force re-download of) all existing models
    # if os.path.exists(model_path):
    #     logger.warning("Removing model: {}".format(model_path))
    #     shutil.rmtree(model_path)

    # This also skips same models with a different revision
    if not os.path.exists(model_path):
        model_path_tmp = sentence_transformers.util.snapshot_download(
            repo_id=model_repo,
            revision=revision,
            cache_dir=sentence_transformers_cache_dir,
            library_name="sentence-transformers",
            library_version=sentence_transformers.__version__,
            ignore_files=["flax_model.msgpack", "rust_model.ot", "tf_model.h5"],
        )
        os.rename(model_path_tmp, model_path)
    else:
        logger.info("Model already downloaded, skipping")

# Add sentence embedding models to the code-envs models meta-data
# (ensure that they are properly displayed in the feature handling)
update_models_meta()
# Grant everyone read access to pretrained models in sentence_transformers/ folder
# (by default, sentence transformers makes them only readable by the owner)
grant_permissions(sentence_transformers_cache_dir)
```
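The script above decides whether to download a model by checking for a directory named after the repository, with `/` replaced by `_`. The sketch below isolates that naming and skip logic so it can be tried without DSS or network access. The helper names `cached_model_dir` and `missing_models` are hypothetical, and a temporary directory stands in for the real `SENTENCE_TRANSFORMERS_HOME`.

```python
import os
import tempfile


def cached_model_dir(cache_dir, model_repo):
    # Same naming rule as the script: "org/name" becomes "org_name".
    return os.path.join(cache_dir, model_repo.replace("/", "_"))


def missing_models(cache_dir, model_repos):
    # A model counts as present as soon as its directory exists,
    # whatever revision that directory actually contains.
    return [
        repo
        for repo in model_repos
        if not os.path.exists(cached_model_dir(cache_dir, repo))
    ]


with tempfile.TemporaryDirectory() as cache_dir:
    # Simulate one model that was downloaded by an earlier run.
    os.makedirs(cached_model_dir(cache_dir, "DataikuNLP/TinyBERT_General_4L_312D"))
    todo = missing_models(
        cache_dir,
        [
            "DataikuNLP/TinyBERT_General_4L_312D",
            "DataikuNLP/paraphrase-multilingual-MiniLM-L12-v2",
        ],
    )
```

Because the check is by directory name only, changing the revision of an already downloaded model has no effect unless the existing directory is removed first, which is what the commented-out `shutil.rmtree` lines in the script are for.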

## Missing values

For text features, DSS only supports treating missing values as empty strings.
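As a rough illustration of what this means for the data, the sketch below replaces missing entries with empty strings before any vectorization. It assumes that missing values show up as `None` or as a float `NaN` (as they would in a pandas column); the function name `fill_missing_text` is hypothetical and this is not DSS code.

```python
import math


def fill_missing_text(value):
    """Return "" for a missing value, and the value itself otherwise."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return value


column = ["great product", None, float("nan"), ""]
cleaned = [fill_missing_text(v) for v in column]
```

After this step a missing text is indistinguishable from an empty one, so both contribute no tokens to the vectorization methods.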
