# Experiment Tracking for NLP with Keras/TensorFlow

Pre-requisites

* Dataiku >= 11.0

* A Python code environment containing the following libraries (see supported versions here):

+ mlflow,

+ tensorflow

* Ability to download the Large Movie Review Dataset.

* Basic knowledge of TensorFlow/Keras.

## Introduction

In this tutorial, you will:

* Train multiple Keras text classifiers to predict whether a movie review is positive or negative.

* Log those models in the MLflow format so that they can be compared using the DSS Experiment Tracking interface.

This tutorial is an adaptation of this basic text classification tutorial. We recommend that you review it before starting, especially if you are not familiar with TensorFlow and Keras.

Although MLflow provides the `mlflow.keras.log_model` function to log models, you will rely on the more general `mlflow.pyfunc.PythonModel` class, both for greater flexibility and to work around a current limitation in the deployment of custom Keras pre-processing layers (more on this later). If needed, consult our `pyfunc` tutorial to get familiar with that class.

## Downloading the data

The first step to training text classifiers is to obtain text data.

You will programmatically download the Large Movie Review Dataset and decompress it into a **local** managed folder in DSS. A local managed folder is a folder that is hosted on the filesystem on the DSS machine, where your code runs.

To do so, create a Python recipe:

* Leave the input field empty.

* Set its output to a new **local** managed folder (name that folder `aclImdb`).

* Edit the recipe with the following code (do not forget to change the folder id to that of your output folder).

§ import dataiku

§ from io import BytesIO

§ from urllib.request import urlopen

§ import tarfile

§ folder = dataiku.Folder("YOUR\_FOLDER\_ID") # change to output folder id

§ folder\_path = folder.get\_path()

§ r = urlopen("https://ai.stanford.edu/~amaas/data/sentiment/aclImdb\_v1.tar.gz")

§ with tarfile.open(name=None, fileobj=BytesIO(r.read())) as t:

§     t.extractall(folder\_path)

Run the recipe.
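If you want to check the in-memory extraction pattern before downloading the full archive, the following is a minimal sketch that uses only the standard library. It builds a tiny synthetic `.tar.gz` (a hypothetical stand-in for `aclImdb_v1.tar.gz`) and extracts it into a temporary directory. The helper names `make_demo_archive` and `extract_archive` are illustrative, and the path check is a basic safeguard that is not part of the recipe above.

```python
import io
import os
import tarfile
import tempfile


def make_demo_archive() -> bytes:
    """Build a tiny .tar.gz in memory (hypothetical stand-in for the real archive)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        payload = b"A great movie!"
        info = tarfile.TarInfo(name="aclImdb/train/pos/0_9.txt")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def extract_archive(raw_bytes: bytes, target_dir: str) -> list:
    """Extract a tar archive held in memory into target_dir and list its members."""
    target_root = os.path.realpath(target_dir)
    with tarfile.open(fileobj=io.BytesIO(raw_bytes)) as archive:
        members = archive.getmembers()
        for member in members:
            # Refuse members that would be written outside target_dir.
            destination = os.path.realpath(os.path.join(target_root, member.name))
            if os.path.commonpath([target_root, destination]) != target_root:
                raise ValueError(f"Unsafe path in archive: {member.name}")
        archive.extractall(target_root)
    return [member.name for member in members]


demo_dir = tempfile.mkdtemp()
extracted = extract_archive(make_demo_archive(), demo_dir)
print(extracted)
```

In the recipe itself, `target_dir` corresponds to the path returned by `folder.get_path()`.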

## Preparing the experiment

After downloading and decompressing the movie review archive, set up experiment tracking:

* Create a second Python recipe.

* Set its input to the managed `aclImdb` folder that contains the data.

* Set its output to a new managed folder, which can be either local or non-local. Name the output folder `experiments`.

* Create the recipe and change its code environment to one that satisfies the pre-requisites laid out at the beginning of this tutorial.

The following code imports all libraries and defines the constants, handles, and functions needed to train and track Keras models. Copy and paste it, making sure to change the input folder id to your own.

For more information regarding experiment tracking in code, refer to our documentation.

§ import dataiku

§ import numpy as np

§ from datetime import datetime

§ import os

§ import shutil

§ import tensorflow as tf

§ import re

§ import string

§ from tensorflow.keras import layers, losses

§ from sklearn.model\_selection import ParameterGrid

§ # Replace these constants with your own values

§ PREDICTION\_TYPE = "BINARY\_CLASSIFICATION"

§ EXPERIMENT\_FOLDER\_ID = ""         # Replace with your output Managed Folder id (experiments)

§ EXPERIMENT\_NAME = ""              # Replace with your chosen experiment name

§ MLFLOW\_CODE\_ENV\_NAME = ""         # Replace with your code environment name

§ SAVED\_MODEL\_NAME = ""             # Replace with your chosen model name

§ # Some utils

§ def now\_str() -> str:

§     return datetime.now().strftime("%Y%m%d%H%M%S")

§ client = dataiku.api\_client()

§ project = client.get\_default\_project()

§ input\_folder = dataiku.Folder('YOUR\_FOLDER\_ID') # change to input folder id (aclImdb)

§ # Retrieve the path to the aclImdb folder.

§ input\_folder\_path = input\_folder.get\_path()

§ # Create a mlflow\_extension object to easily collect information for the promotion step

§ mlflow\_extension = project.get\_mlflow\_extension()

§ # Get a handle on a Managed Folder to store the experiments.

§ mf = project.get\_managed\_folder(EXPERIMENT\_FOLDER\_ID)

§ # dictionary with path to save intermediary model

§ artifacts = {

§ SAVED\_MODEL\_NAME: "./keras\_model\_cnn.pth"

§ }

In the rest of this tutorial, you will append more code snippets to that second recipe, starting with the creation of a train, a validation, and a test dataset. Only run the recipe at the end of the tutorial, after all snippets have been added.

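At this stage, your DSS flow should contain two Python recipes: the first one writes to the `aclImdb` folder, and the second one reads from that folder and writes to the `experiments` folder.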

## Converting raw text data to TensorFlow Datasets

Before you can start training and evaluating your Keras models, you have to convert your data into three different TensorFlow Datasets (train, validation, and test).

The `tf.keras.utils.text_dataset_from_directory()` function lets you create such datasets from the different subfolders in your newly created `aclImdb` input folder.

Use the Dataiku Folder API to retrieve the `aclImdb` folder path and pass it to the `tf.keras.utils.text_dataset_from_directory()` function.

§ dataset\_dir = os.path.join(input\_folder\_path, 'aclImdb')

§ train\_dir = os.path.join(dataset\_dir, 'train')

§ # The 'unsup' subfolder holds unlabeled reviews; remove it so only 'pos' and 'neg' remain

§ remove\_dir = os.path.join(train\_dir, 'unsup')

§ if os.path.exists(remove\_dir):

§     shutil.rmtree(remove\_dir)

§ batch\_size = 32

§ seed = 42

§ raw\_train\_ds = tf.keras.utils.text\_dataset\_from\_directory(

§ os.path.join(input\_folder\_path,'aclImdb/train'),

§ batch\_size=batch\_size,

§ validation\_split=0.2,

§ subset='training',

§ seed=seed)

§ raw\_val\_ds = tf.keras.utils.text\_dataset\_from\_directory(

§ os.path.join(input\_folder\_path, 'aclImdb/train'),

§ batch\_size=batch\_size,

§ validation\_split=0.2,

§ subset='validation',

§ seed=seed)

§ raw\_test\_ds = tf.keras.utils.text\_dataset\_from\_directory(

§ os.path.join(input\_folder\_path, 'aclImdb/test'),

§ batch\_size=batch\_size)
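`text_dataset_from_directory()` infers one class per subfolder, which is why the recipe deletes the `unsup` subfolder before building the datasets. The sketch below illustrates that convention with the standard library only. It assumes class names are sorted alphanumerically before labels are assigned; `infer_labels` and the miniature directory are hypothetical and only stand in for the real `aclImdb/train` folder.

```python
import os
import tempfile


def infer_labels(directory: str) -> dict:
    """Map each class subfolder to an integer label, in alphanumeric order."""
    class_names = sorted(
        name for name in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, name))
    )
    return {name: index for index, name in enumerate(class_names)}


# Hypothetical miniature of aclImdb/train with one review per class.
demo_train_dir = tempfile.mkdtemp()
for class_name, text in [("pos", "Loved it."), ("neg", "Dull plot.")]:
    class_dir = os.path.join(demo_train_dir, class_name)
    os.makedirs(class_dir)
    with open(os.path.join(class_dir, "0.txt"), "w", encoding="utf-8") as f:
        f.write(text)

label_map = infer_labels(demo_train_dir)
print(label_map)  # {'neg': 0, 'pos': 1}
```

Under this convention, negative reviews get label 0 and positive reviews get label 1, which matches the `classes=['0', '1']` declaration used later when logging the runs.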

## Preprocessing

The reviews were pulled from a website and contain HTML line-break tags (`<br />`). The following `custom_standardization()` function lowercases the reviews, replaces those tags with spaces, and removes all punctuation.

That function is then used as a preprocessing step in a vectorization layer.

Append the following code.

§ @tf.keras.utils.register\_keras\_serializable()

§ def custom\_standardization(input\_data):

§     lowercase = tf.strings.lower(input\_data)

§     stripped\_html = tf.strings.regex\_replace(lowercase, '<br />', ' ')

§     return tf.strings.regex\_replace(stripped\_html,

§                                     '[%s]' % re.escape(string.punctuation),

§                                     '')

§ max\_features = 10000

§ sequence\_length = 250

§ vectorize\_layer = layers.TextVectorization(

§ standardize=custom\_standardization,

§ max\_tokens=max\_features,

§ output\_mode='int',

§ output\_sequence\_length=sequence\_length)

§ # Make a text-only dataset (without labels), then call adapt

§ train\_text = raw\_train\_ds.map(lambda x, y: x)

§ vectorize\_layer.adapt(train\_text)

§ def vectorize\_text(text, label):

§     text = tf.expand\_dims(text, -1)

§     return vectorize\_layer(text), label

§ # vectorize

§ train\_ds = raw\_train\_ds.map(vectorize\_text)

§ val\_ds = raw\_val\_ds.map(vectorize\_text)

§ test\_ds = raw\_test\_ds.map(vectorize\_text)

§ AUTOTUNE = tf.data.AUTOTUNE

§ train\_ds = train\_ds.cache().prefetch(buffer\_size=AUTOTUNE)

§ val\_ds = val\_ds.cache().prefetch(buffer\_size=AUTOTUNE)

§ test\_ds = test\_ds.cache().prefetch(buffer\_size=AUTOTUNE)

The `@tf.keras.utils.register_keras_serializable()` decorator makes the custom function serializable, which is required to later save the preprocessing layer as part of an MLflow model.
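To get a feel for what the standardization and vectorization steps do to a review, here is a rough pure-Python sketch. It is not the `TextVectorization` implementation: it assumes whitespace tokenization, a vocabulary ordered by token frequency, index 0 reserved for padding, and index 1 reserved for out-of-vocabulary tokens. The helper names `standardize`, `build_vocabulary`, and `vectorize` are hypothetical.

```python
import re
import string
from collections import Counter


def standardize(text: str) -> str:
    """Lowercase, replace <br /> tags with a space, then drop punctuation."""
    lowered = text.lower()
    without_breaks = lowered.replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", without_breaks)


def build_vocabulary(texts, max_tokens):
    """Keep the most frequent tokens; indices 0 and 1 are reserved."""
    counts = Counter(token for text in texts for token in standardize(text).split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    # Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens.
    return {token: index + 2 for index, token in enumerate(most_common)}


def vectorize(text, vocabulary, sequence_length):
    """Map tokens to integer ids, then truncate or zero-pad to a fixed length."""
    ids = [vocabulary.get(token, 1) for token in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


reviews = ["Great movie!<br />Great cast.", "Boring movie, great score."]
vocab = build_vocabulary(reviews, max_tokens=10)
standardized = standardize(reviews[0])
vector = vectorize("Great unknown", vocab, sequence_length=4)
print(standardized)  # great movie great cast
print(vector)        # [2, 1, 0, 0]
```

In the recipe, `max_features` plays the role of `max_tokens`, and `sequence_length` is 250 instead of 4.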

### Model Training and hyperparameter grid

Now define a function, `create_model()`, that builds a Sequential model from two hyperparameters.

The `embedding_dim` hyperparameter determines the output dimension of the Embedding layer, while the `dropout` hyperparameter determines the rate at which the Dropout layers randomly set input units to 0 during training as a way of mitigating overfitting.

The function makes it easier to test different model architectures and to find the best hyperparameter combination in a scikit-learn hyperparameter grid. While simple, the function could be extended to allow for more flexibility in the architecture design.
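As a rough illustration of what the dropout rate controls, the sketch below applies dropout to a plain Python list. It assumes the usual "inverted dropout" behavior, in which the units that are kept are scaled by `1 / (1 - rate)` so that the expected sum stays unchanged. The `dropout` helper is hypothetical and is not the Keras layer; at inference time the real layer leaves its inputs untouched.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale survivors by 1 / (1 - rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else value * keep_scale for value in values]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
activations = [1.0] * 10
dropped = dropout(activations, rate=0.2, rng=rng)
print(dropped)
```

With `rate=0.2`, each unit has a 20% chance of being zeroed, and each surviving unit is multiplied by 1.25.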

Add the following code to the end of your Python recipe.

§ def create\_model(embedding\_dim, dropout):

§     model = tf.keras.Sequential([

§         layers.Embedding(max\_features + 1, embedding\_dim),

§         layers.Dropout(dropout),

§         layers.GlobalAveragePooling1D(),

§         layers.Dropout(dropout),

§         layers.Dense(1)

§     ])

§     # The last layer outputs logits, hence from\_logits=True and a threshold of 0.0

§     model.compile(loss=losses.BinaryCrossentropy(from\_logits=True),

§                   optimizer='adam',

§                   metrics=[tf.metrics.BinaryAccuracy(threshold=0.0)])

§     return model

§ param\_grid = {

§ 'embedding\_dim':[16],

§ 'dropout':[0.1,0.2]

§ }

§ grid = ParameterGrid(param\_grid)
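To make explicit what iterating over the grid yields, here is a small standard-library stand-in that expands a dictionary of value lists into every combination. It is only a sketch of the idea, not scikit-learn's implementation; `expand_grid` is a hypothetical helper, and sorting the parameter names is an assumption made to get a deterministic order.

```python
from itertools import product


def expand_grid(param_grid: dict) -> list:
    """Return every combination of the listed hyperparameter values."""
    names = sorted(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


demo_grid = {"embedding_dim": [16], "dropout": [0.1, 0.2]}
combinations = expand_grid(demo_grid)
for hparams in combinations:
    print(hparams)
# {'dropout': 0.1, 'embedding_dim': 16}
# {'dropout': 0.2, 'embedding_dim': 16}
```

Each dictionary can be unpacked directly into `create_model(**hparams)`, so the grid above leads to two training runs.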

This was the last step needed before you can run the experiment.

### Experiment runs

Finally, add the following piece of code to the recipe and run the recipe. After a successful run, you will be able to deploy the model either visually or programmatically.

The code can be broken down into the following steps:

* Create an `mlflow` context using the `setup_mlflow()` method from the Dataiku API.

* Create an experiment and add tags to it.

* Loop through the list of hyperparameter combinations so that for each combination, you start a run in which you:

+ Create and train a model.

+ Collect the training and validation metrics of the model, averaged over the epochs.

+ Create a `full_model` by prepending the preprocessing layer to the model and appending a sigmoid activation.

+ Serialize and save that `full_model`. This is a necessary intermediary step.

+ Wrap that `full_model` in a `KerasWrapper` so that the model can be logged as an MLflow Python function model.

+ Log the wrapper along with the collected metrics and other model metadata (hyperparameters, epochs, code environment…).

§ # 1 create the mlflow context

§ with project.setup\_mlflow(mf) as mlflow:

§     # 2 create experiment and add tags

§     experiment\_id = mlflow.create\_experiment(

§         f'{EXPERIMENT\_NAME}\_{now\_str()}')

§     mlflow.tracking.MlflowClient().set\_experiment\_tag(

§         experiment\_id, "library", "Keras")

§     mlflow.tracking.MlflowClient().set\_experiment\_tag(

§         experiment\_id, "predictionType", "BINARY\_CLASSIFICATION")

§     # 3 Loop through combination of hyperparameter in grid

§     for hparams in grid:

§         with mlflow.start\_run(experiment\_id=experiment\_id) as run:

§             # create model

§             print(f'Starting run {run.info.run\_id} ...\n{hparams}')

§             model = create\_model(\*\*hparams)

§             model.summary()

§             # train model

§             history = model.fit(

§                 train\_ds,

§                 validation\_data=val\_ds,

§                 epochs=10)

§             # collect metrics (mean over the epochs of each training/validation metric)

§             run\_metrics = {}

§             for k, v in history.history.items():

§                 run\_metrics[f'mean\_{k}'] = np.mean(v)

§             # Bundle the model with the preprocessing layer

§             full\_model = tf.keras.Sequential([

§                 vectorize\_layer,

§                 model,

§                 layers.Activation('sigmoid')])

§             full\_model.compile(

§                 loss=losses.BinaryCrossentropy(

§                     from\_logits=False), optimizer="adam", metrics=['accuracy'])

§             # Serialize and save full model

§             full\_model.save(artifacts.get(SAVED\_MODEL\_NAME))

§             # Wrap the full model using the pyfunc module

§             class KerasWrapper(mlflow.pyfunc.PythonModel):

§                 def load\_context(self, context):

§                     import re

§                     import string

§                     import tensorflow as tf

§                     # Redefine and register the custom function so it can be resolved at load time

§                     @tf.keras.utils.register\_keras\_serializable()

§                     def custom\_standardization(input\_data):

§                         lowercase = tf.strings.lower(input\_data)

§                         stripped\_html = tf.strings.regex\_replace(lowercase, '<br />', ' ')

§                         return tf.strings.regex\_replace(

§                             stripped\_html,

§                             '[%s]' % re.escape(string.punctuation),

§                             '')

§                     self.model = tf.keras.models.load\_model(

§                         context.artifacts.get(SAVED\_MODEL\_NAME))

§                 def predict(self, context, model\_input):

§                     model\_input = model\_input[['Review']]

§                     return self.model.predict(model\_input)

§             mlflow\_pyfunc\_model\_path = f"{type(full\_model).\_\_name\_\_}-{run.info.run\_id}"

§             # log the wrapper

§             mlflow.pyfunc.log\_model(

§                 artifact\_path=mlflow\_pyfunc\_model\_path, python\_model=KerasWrapper(),

§                 artifacts=artifacts

§             )

§             # log the metrics + model metadata

§             mlflow.log\_metrics(metrics=run\_metrics)

§             mlflow.log\_params(hparams)

§             mlflow.log\_param("epochs", 10)

§             mlflow\_extension.set\_run\_inference\_info(run\_id=run.info.run\_id,

§                                                      prediction\_type=PREDICTION\_TYPE,

§                                                      classes=['0', '1'],

§                                                      code\_env\_name=MLFLOW\_CODE\_ENV\_NAME)

§             print(f'Run {run.info.run\_id} done\n{"-"\*40}')
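The metric-collection step in the loop above averages each entry of `history.history` over the epochs. The sketch below reproduces that aggregation on a hypothetical history dictionary, using the standard library instead of NumPy. The metric names and values are invented for illustration. It also computes the last-epoch values, which you may prefer to log if you want the metrics to describe the final state of the model rather than the whole training trajectory.

```python
from statistics import mean

# Hypothetical per-epoch values, shaped like the `history.history` dictionary.
history_dict = {
    "loss": [0.66, 0.55, 0.47],
    "binary_accuracy": [0.69, 0.80, 0.84],
    "val_loss": [0.61, 0.52, 0.46],
    "val_binary_accuracy": [0.77, 0.82, 0.84],
}

# Same aggregation as in the recipe: one mean per tracked metric.
run_metrics = {f"mean_{name}": mean(values) for name, values in history_dict.items()}

# Alternative (not in the recipe): keep only the value of the last epoch.
final_metrics = {f"final_{name}": values[-1] for name, values in history_dict.items()}

print(run_metrics)
print(final_metrics)
```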

You’ll notice that the decorated `custom_standardization()` function is redefined in the `load_context()` method of the `KerasWrapper`. The reason is that the `TextVectorization` layer contains a custom step which, despite having been serialized and saved, cannot automatically be restored at load time in a different Python program. This limitation prevented us from using the `mlflow.keras.log_model` function to log the model.
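One way to picture this limitation is a toy registry in which saving a model stores only the name of the custom function, and loading looks that name up in the current process. The sketch below is a deliberately simplified model of that idea, not how Keras is actually implemented; `CUSTOM_OBJECTS`, `register`, `save_config`, and `load_config` are all hypothetical names.

```python
import json

CUSTOM_OBJECTS = {}  # hypothetical registry: name -> callable


def register(func):
    """Record a function under its name so it can be looked up when loading."""
    CUSTOM_OBJECTS[func.__name__] = func
    return func


def save_config(standardize_fn) -> str:
    """Serialize only the name of the custom function, not its code."""
    return json.dumps({"standardize": standardize_fn.__name__})


def load_config(serialized: str):
    """Resolve the stored name against the registry of the loading process."""
    name = json.loads(serialized)["standardize"]
    if name not in CUSTOM_OBJECTS:
        raise KeyError(f"Unknown custom object: {name}")
    return CUSTOM_OBJECTS[name]


@register
def custom_standardization(text: str) -> str:
    return text.lower()


saved = save_config(custom_standardization)

CUSTOM_OBJECTS.clear()  # simulate a fresh Python process with an empty registry
try:
    load_config(saved)
    load_failed = False
except KeyError:
    load_failed = True

register(custom_standardization)  # the role played by load_context() in the wrapper
restored = load_config(saved)
print(load_failed, restored("ABC"))  # True abc
```

In this picture, redefining the decorated function inside `load_context()` is what repopulates the registry before `load_model()` is called.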

### Deploying the model for batch scoring

You can now deploy your model either via the Experiment Tracking interface or through our Python API. In either case, you will need an evaluation dataset that contains a `Review` column with the reviews (free text) and a `Label` column for the associated binary sentiment target (1 being positive, 0 being negative).

You can generate this dataset from one batch of the `test` subdirectory located in your **aclImdb** folder:

* Create a Python recipe that takes that folder as input and a new dataset as output.

* Create the recipe and change its code environment to the one you used to log the experiment (so you have the `tensorflow` package available).

* Copy and paste the following code into your recipe, then run the recipe.

§ import dataiku

§ import pandas as pd

§ import os

§ import tensorflow as tf

§ # Read recipe inputs

§ aclImdb = dataiku.Folder("YOUR\_FOLDER\_ID") # change to aclImdb folder id

§ folder\_path = aclImdb.get\_path()

§ batch\_size = 300

§ raw\_test\_ds = tf.keras.utils.text\_dataset\_from\_directory(

§ os.path.join(folder\_path, 'aclImdb/test'),

§ batch\_size=batch\_size)

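§ # Take one batch of (review, label) pairs from the test set

§ np\_it = raw\_test\_ds.as\_numpy\_iterator()

§ reviews, labels = next(np\_it)

§ # TensorFlow strings come back as bytes, so decode them to text

§ records = [[review.decode('utf-8'), int(label)] for review, label in zip(reviews, labels)]

§ df = pd.DataFrame(records, columns=['Review', 'Label'])

§ # Write recipe outputs

§ output\_dataset = dataiku.Dataset("YOUR\_OUTPUT\_DATASET") # change to output dataset

§ output\_dataset.write\_with\_schema(df)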
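The shape of the evaluation dataset matters because the `predict()` method of the wrapper selects the `Review` column by name. As a quick check of the expected layout, the sketch below converts a hypothetical batch into rows and writes them as CSV with the standard library. The sample reviews are invented, and the batch is assumed to hold byte strings and integer labels, which is how TensorFlow string tensors are typically returned by a NumPy iterator.

```python
import csv
import io

# Hypothetical batch: byte-string reviews and integer labels.
reviews_batch = [b"Loved every minute.", b"A tedious mess."]
labels_batch = [1, 0]

# Decode the bytes so that the Review column holds plain text.
records = [
    {"Review": review.decode("utf-8"), "Label": int(label)}
    for review, label in zip(reviews_batch, labels_batch)
]

# Write the rows as CSV to inspect the column layout.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Review", "Label"])
writer.writeheader()
writer.writerows(records)
print(buffer.getvalue())
```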

### Deploying the model as an API endpoint

Once your model is deployed in the flow, you can follow the steps laid out in our reference documentation to deploy it as an API endpoint.

### Conclusion

In this tutorial, you saw how to train multiple Keras models using a custom text vectorization layer and log them in the MLflow format. You also saw that the `mlflow.pyfunc.PythonModel` class allows for more deployment flexibility.
