# Building a Model using Pre-Trained Word Embeddings

The fastText repository links to pre-trained word vectors, also called embeddings (P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information). To use them in our model through the fastText library, complete a few preliminary steps:

* Download the English bin+text word vectors and unzip the archive

* Create a folder in the project called *fastText\_embeddings* and add the *wiki.en.bin* file to it

* Add the fastText library to your deep learning code environment (or create a new deep learning code environment that includes the fastText library). You can add it with `git+https://github.com/facebookresearch/fastText.git` in the Requested Packages list, as shown in the following screenshot.
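Once these steps are done, it can be worth checking that the model file loads and returns vectors of the expected size before wiring it into the architecture. The following is a minimal sketch, not part of the original tutorial: `check_embedding_model` and `StubModel` are hypothetical names, and the stub stands in for the object returned by `fasttext.load_model` so the example runs without the large *wiki.en.bin* file. The size of 300 matches the `embedding_size` used later in this tutorial.

```python
def check_embedding_model(model, expected_dim=300, probe_word="hello"):
    """Return True when the model reports and returns vectors of the expected size.

    `model` is assumed to expose get_dimension() and get_word_vector(word),
    as the object returned by fasttext.load_model does.
    """
    dim = model.get_dimension()
    vector = model.get_word_vector(probe_word)
    return dim == expected_dim and len(vector) == expected_dim


class StubModel:
    """Hypothetical stand-in for a loaded fastText model, used only for this sketch."""

    def __init__(self, dim):
        self._dim = dim

    def get_dimension(self):
        return self._dim

    def get_word_vector(self, word):
        # A real model returns a learned vector; the stub returns zeros.
        return [0.0] * self._dim


# With the real file, the model would come from something like:
#   model = fasttext.load_model(os.path.join(folder_path, "wiki.en.bin"))
print(check_embedding_model(StubModel(300)))
```

A failed check at this stage usually points to a wrong file or a mismatched embedding size, which is cheaper to catch here than during training.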

## Features Handling

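In the Features Handling panel of the Design tab for our deep learning ML task, add the following lines to the custom processing code of the text input. They store the text processor in a shared variable so that the model architecture code can retrieve its tokenizer later.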

§ from dataiku.doctor.deep\_learning.shared\_variables import set\_variable

§ set\_variable("tokenizer\_processor", processor)
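The idea behind this step is a simple hand-off: the feature-handling code registers an object under a name, and the architecture code looks it up by the same name. The sketch below illustrates that pattern only. It is not Dataiku's implementation, and `set_shared`, `get_shared`, and `ToyProcessor` are hypothetical names invented for the example.

```python
# Illustrative stand-in for a shared-variable registry; NOT Dataiku's implementation.
_shared = {}


def set_shared(name, value):
    """Register a value under a name (hypothetical helper)."""
    _shared[name] = value


def get_shared(name):
    """Retrieve a previously registered value (hypothetical helper)."""
    return _shared[name]


class ToyProcessor:
    """Hypothetical processor that only carries a word index."""

    def __init__(self, word_index):
        self.word_index = word_index


# Step 1, feature handling: register the processor.
set_shared("tokenizer_processor", ToyProcessor({"the": 1, "movie": 2}))

# Step 2, model architecture: retrieve it under the same name.
processor = get_shared("tokenizer_processor")
print(processor.word_index)
```

The name used when storing the variable must match the name used when reading it, which is why the tutorial uses `"tokenizer_processor"` in both places.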

## Model Architecture

We need to make a few changes to the architecture code. First, add the following imports at the top of the code.

§ import dataiku

§ from dataiku.doctor.deep\_learning.shared\_variables import get\_variable

§ import os

§ import fasttext

§ import numpy as np

Within the `build_model()` function, add the code that loads the embeddings and builds the embedding matrix. This code must come before the line that defines `emb`.

§ folder = dataiku.Folder('fastText\_embeddings')

§ folder\_path = folder.get\_path()

§ embedding\_size = 300

§ embedding\_model\_path = os.path.join(folder\_path, 'wiki.en.bin')

§ embedding\_model = fasttext.load\_model(embedding\_model\_path)

§ processor = get\_variable("tokenizer\_processor")

§ sorted\_word\_index = sorted(processor.tokenizer.word\_index.items(),

§                            key=lambda item: item[1])[:vocabulary\_size-1]

§ embedding\_matrix = np.zeros((vocabulary\_size, embedding\_size))

§ for word, i in sorted\_word\_index:

§     embedding\_matrix[i] = embedding\_model.get\_word\_vector(word)
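To make the row alignment concrete, here is a small self-contained sketch of the same matrix-building logic on toy data. It assumes the tokenizer numbers words from 1 in order of decreasing frequency, as the Keras `Tokenizer` does, which leaves row 0 as zeros. The function `build_embedding_matrix`, the toy vocabulary, and the two-dimensional vectors are hypothetical and stand in for the tokenizer's word index and the fastText model.

```python
import numpy as np


def build_embedding_matrix(word_index, vector_for, vocabulary_size, embedding_size):
    """Build a (vocabulary_size, embedding_size) matrix whose row i holds the vector of word i."""
    # Keep the (vocabulary_size - 1) lowest-numbered words, so rows 1..vocabulary_size-1 are filled.
    top_words = sorted(word_index.items(), key=lambda item: item[1])[:vocabulary_size - 1]
    matrix = np.zeros((vocabulary_size, embedding_size))
    for word, i in top_words:
        matrix[i] = vector_for(word)
    return matrix


# Hypothetical toy data standing in for the tokenizer and the fastText model.
word_index = {"the": 1, "movie": 2, "great": 3, "boring": 4}
toy_vectors = {
    "the": [1.0, 0.0],
    "movie": [0.0, 1.0],
    "great": [1.0, 1.0],
    "boring": [-1.0, -1.0],
}

matrix = build_embedding_matrix(
    word_index, toy_vectors.__getitem__, vocabulary_size=4, embedding_size=2
)
print(matrix.shape)  # (4, 2)
print(matrix)
```

With `vocabulary_size=4`, only the three lowest-numbered words get a row; "boring" (index 4) is left out and row 0 stays at zero.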

Change the definition of the embedding layer as follows to use the fastText pre-trained word embeddings. Setting `trainable=False` keeps the pre-trained weights fixed during training.

§ emb = Embedding(vocabulary\_size,

§ embedding\_size,

§ input\_length=text\_length,

§ weights=[embedding\_matrix],

§ trainable=False)(text\_input)

Replace the second max-pooling layer with a `GlobalMaxPooling1D` layer.

§ x = GlobalMaxPooling1D()(x)

Finally, remove the `x = Flatten()(x)` line, since the global pooling layer already returns one flat vector per sample.
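The effect of global max pooling on tensor shapes can be reproduced with plain NumPy. This is an illustrative sketch, assuming the usual (batch, steps, channels) layout of 1D convolution outputs. The array values are made up and the code does not use Keras itself.

```python
import numpy as np

# Hypothetical activations from a convolutional layer:
# 2 documents, 5 time steps, 3 filters.
features = np.arange(30, dtype=float).reshape(2, 5, 3)

# Global max pooling keeps, for each filter, its largest value across all time steps.
pooled = features.max(axis=1)

print(features.shape)  # (2, 5, 3)
print(pooled.shape)    # (2, 3)
```

The pooled output is already two-dimensional, with one row per document, so a following dense layer can consume it directly.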

## Model results

Click **Train**. When training completes, redeploy the model to the Flow and re-evaluate it on the test data. In the resulting dataset, you can see that the model has an accuracy of about 87% and an AUC of about 0.94.
