Tutorial | Active learning for tabular data classification problems using Dataiku apps #
Prerequisites #
-
You should be familiar with the basics of machine learning in Dataiku.
Technical requirements #
-
Access to a Dataiku instance of version higher than 8.0 where the ML-assisted Labeling plugin is installed.
-
A Python 3.6 code environment called ml-assisted-labeling-visual-ml-python-36 should be created. It should have these packages installed:
scikit-learn>=0.20,<0.21
scipy>=1.1,<1.2
xgboost==0.81
statsmodels>=0.9,<0.10
jinja2>=2.10,<2.11
flask>=1.0,<1.1
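If you want to double-check a code environment against the version ranges above, the comparison can be sketched in a few lines of Python. This is only an illustrative helper: the function names and the `installed_versions` dictionary are assumptions, not part of Dataiku or the plugin, and the version parsing is deliberately simple (it only looks at the numeric parts of a version string).

```python
import re

# Version ranges copied from the package list above.
REQUIREMENTS = [
    "scikit-learn>=0.20,<0.21",
    "scipy>=1.1,<1.2",
    "xgboost==0.81",
    "statsmodels>=0.9,<0.10",
    "jinja2>=2.10,<2.11",
    "flask>=1.0,<1.1",
]


def as_tuple(version):
    """Turn '0.20.3' into (0, 20, 3) so versions compare numerically."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def satisfies(installed, spec):
    """Check one version string against clauses such as '>=0.20,<0.21'."""
    for clause in spec.split(","):
        op, bound = re.match(r"(==|>=|<=|<|>)(.+)", clause.strip()).groups()
        a, b = as_tuple(installed), as_tuple(bound)
        # Pad with zeros so that '0.81' and '0.81.0' compare as equal.
        width = max(len(a), len(b))
        a += (0,) * (width - len(a))
        b += (0,) * (width - len(b))
        ok = {"==": a == b, ">=": a >= b, "<=": a <= b, "<": a < b, ">": a > b}[op]
        if not ok:
            return False
    return True


def check_environment(installed_versions):
    """Return a list of human-readable problems (empty if all is fine).

    installed_versions is a hypothetical {package name: version} mapping,
    collected separately, for example from the output of `pip freeze`.
    """
    problems = []
    for requirement in REQUIREMENTS:
        split_at = re.search(r"[=<>]", requirement).start()
        name, spec = requirement[:split_at], requirement[split_at:]
        found = installed_versions.get(name)
        if found is None:
            problems.append(f"{name}: not installed")
        elif not satisfies(found, spec):
            problems.append(f"{name}: found {found}, expected {spec}")
    return problems
```

Calling `check_environment` with the versions found in your environment returns an empty list when every package falls inside its expected range.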
Setting up #
Suppose you need to classify article titles depending on whether they look like clickbait or not. We will download a table of unlabeled article titles and label them manually, with the help of active learning, in a dedicated webapp.
Supporting data #
We will use a dataset of article titles with a single column. It contains both clickbait and legit titles.
Create the project #
In this tutorial, we will use a Dataiku app to speed up the creation of the Flow.
-
Go to the application menu and select Tabular data classification - ML Assisted Labeling.
-
Click on Start using the application.
-
Give a name to your project, for example Clickbait.
Labeling setup #
You are now presented with the interface of the tabular data classification application. Two steps are required to get started:
-
Tabular input. Drag and drop your unlabeled CSV file onto this area to add the data.
-
Labeling categories. Enter two categories, clickbait and legit, into the key-value table.
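If your titles are not yet in a CSV file, the input for the first step can be prepared with a short script. The following is a minimal sketch using only the Python standard library. The function name, the `title` column name, and the presence of a header row are assumptions made for illustration; adapt them to your own data.

```python
import csv


def write_unlabeled_csv(titles, path, column="title"):
    """Write article titles to a one-column CSV file with a header row.

    The column name is a hypothetical choice, not a name required by the app.
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([column])
        for title in titles:
            cleaned = title.strip()
            if cleaned:  # skip empty entries
                writer.writerow([cleaned])


# The two categories entered in the key-value table.
CATEGORIES = ["clickbait", "legit"]
```

The `csv` module takes care of quoting, so titles that contain commas or quotation marks still end up in a single column.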

Label the data #
-
Start the labeling webapp by clicking on Run now next to Start / Restart the labeling webapp, then wait while the app starts.
You now have 17260 unlabeled rows. Before you can start training a classifier to distinguish clickbait from legit titles, you first have to label them.
-
Click on the Labeling app link next to Label tabular data.
-
To start labeling, click on one of the category buttons on the right.
Note
-
If you’re not sure, it’s possible to skip a sample.
-
You may also leave a comment related to a sample.
-
To change the category of an already labeled sample, you may navigate back using the arrow buttons.
-
To go even faster, you can use the hotkeys assigned to the labels.
-
Label a few samples. Make sure that you have several labels per category (see the grey progress bar under each category button).
Once you have enough labeled samples you can start training your model.
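The check that every category has received several labels can also be written down explicitly. The sketch below is purely illustrative: the function name, the list of labels, and the threshold of 5 are assumptions, since the tutorial only asks for "several" labels per category. Skipped samples are represented here as `None`.

```python
from collections import Counter


def categories_needing_labels(labels, categories, minimum=5):
    """Return {category: number of labels still missing} for categories
    that have fewer than `minimum` labeled samples."""
    counts = Counter(labels)
    return {
        category: minimum - counts[category]
        for category in categories
        if counts[category] < minimum
    }
```

An empty result means that every category has reached the chosen minimum.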

Generate queries #
Now that you have some labeled samples, you can train the first model to enable active learning.
-
Navigate back to the homepage of the Clickbait Dataiku App.
-
Click on Run now next to Re-generate queries.
After the queries are generated, the labeling app will restart and active learning will be enabled.
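This tutorial does not detail how the queries are chosen. A common approach in active learning is uncertainty sampling: the samples on which the current model is least certain are presented for labeling first. The following generic sketch illustrates that idea in plain Python. It is not the plugin's implementation, and the function name and the format of `probabilities` (one list of class probabilities per unlabeled sample) are assumptions.

```python
def rank_queries(probabilities, strategy="confidence"):
    """Return sample indices ordered from most to least uncertain.

    probabilities: one list of predicted class probabilities per sample.
    strategy: "confidence" ranks by 1 - highest probability,
              "margin" ranks by the gap between the two highest probabilities
              (a smaller gap means a more uncertain prediction).
    """
    def uncertainty(probs):
        ordered = sorted(probs, reverse=True)
        if strategy == "confidence":
            return 1.0 - ordered[0]
        if strategy == "margin":
            return -(ordered[0] - ordered[1])
        raise ValueError(f"unknown strategy: {strategy}")

    return sorted(
        range(len(probabilities)),
        key=lambda index: uncertainty(probabilities[index]),
        reverse=True,
    )


# Hypothetical model output for three titles: [P(clickbait), P(legit)].
predicted = [[0.95, 0.05], [0.55, 0.45], [0.70, 0.30]]
queue = rank_queries(predicted)
```

In this example the second title comes first in the queue, because the model is closest to undecided about it, and the first title comes last.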

What’s next? #
For more on active learning, see the following posts on Data From the Trenches: