# Tesseract – OCR

This plugin provides recipes to perform Optical Character Recognition (OCR) using the Tesseract engine.

## Plugin information

|  |  |
| --- | --- |
| Version | 1.0.2 |
| Author | Dataiku |
| Released | 2020-06 |
| Last updated | 2021-11 |
| License | Apache Software License |
| Source code | GitHub |
| Reporting issues | GitHub |

## How to set up

If you are a Dataiku admin user, follow the instructions in the README.md file on the plugin's GitHub page to install the required packages on the machine hosting the DSS instance.

If you are not an admin, you can forward this to your admin and/or scroll down to the **How to use** section.

**Warning: You must first install Tesseract on your machine!**
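As a quick sanity check before using the plugin, an admin could confirm that the Tesseract binary is reachable and see which language packs are installed. The sketch below is not part of the plugin; the function names are hypothetical, and it assumes the binary is called `tesseract` and is on the `PATH` of the user running DSS. It relies on Tesseract's `--list-langs` command-line option.

```python
import shutil
import subprocess
from typing import List, Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract binary, or None if it is not on PATH."""
    return shutil.which("tesseract")


def parse_langs(output: str) -> List[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.append(line)
    return langs


def installed_languages(binary: str) -> List[str]:
    """Run `tesseract --list-langs` and return the installed language codes."""
    proc = subprocess.run([binary, "--list-langs"], capture_output=True, text=True)
    # Fall back to stderr if stdout is empty, in case the build prints the list there.
    return parse_langs(proc.stdout or proc.stderr)
```

Calling `find_tesseract()` returns `None` when Tesseract is missing; otherwise its result can be passed to `installed_languages` to check that the language codes you plan to use are available.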

## How to use

This plugin has four components: an Image conversion recipe, an Image processing recipe, a Text extraction recipe, and a notebook template.

Let’s assume that you have a Dataiku DSS project with a folder containing both images and PDFs.

In order to extract text from images and PDFs, you must go through the following steps:

### Image conversion

Because the Text extraction recipe only works on greyscale JPG images, you may have to use the Image conversion recipe first.

The Image conversion recipe takes as input a folder of images (JPG/JPEG/PNG/TIFF) and PDFs and converts them into greyscale JPG images. If a PDF has multiple pages, the recipe creates a subfolder with one image per page.

You can also set some advanced parameters in the image conversion:

* DPI (Dots Per Inch): set the DPI of images extracted from PDFs. This setting does not affect other input images.

* Quality: set the JPEG quality of the output images, as defined by the `quality` parameter of the PIL package.
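To make the two parameters concrete, here is a minimal sketch of what such a conversion could look like outside DSS. It is an illustration, not the plugin's code: it assumes Pillow for image handling and the third-party `pdf2image` package (which needs Poppler) for rendering PDF pages, and the function names, the subfolder name, the 1-based five-digit page numbering, and the default values are all assumptions.

```python
from pathlib import Path
from typing import List


def jpg_target(src: Path, out_dir: Path) -> Path:
    """Return the output path for a converted image: same stem, .jpg extension."""
    return out_dir / (src.stem + ".jpg")


def convert_image(src: Path, out_dir: Path, quality: int = 75) -> Path:
    """Convert one image to a greyscale JPG."""
    from PIL import Image  # Pillow, imported lazily so jpg_target works without it

    dst = jpg_target(src, out_dir)
    with Image.open(src) as img:
        # Mode "L" is 8-bit greyscale; `quality` is Pillow's JPEG save option.
        img.convert("L").save(dst, "JPEG", quality=quality)
    return dst


def convert_pdf(src: Path, out_dir: Path, dpi: int = 200, quality: int = 75) -> List[Path]:
    """Render each PDF page at `dpi` and save it as a greyscale JPG in a subfolder."""
    from pdf2image import convert_from_path  # assumed dependency, requires Poppler

    subfolder = out_dir / src.stem  # assumed subfolder name
    subfolder.mkdir(parents=True, exist_ok=True)
    written = []
    for page_no, page in enumerate(convert_from_path(str(src), dpi=dpi), start=1):
        # Assumed numbering: 1-based, zero-padded to five digits.
        dst = subfolder / f"{src.stem}_pdf_page_{page_no:05d}.jpg"
        page.convert("L").save(dst, "JPEG", quality=quality)
        written.append(dst)
    return written
```

A higher DPI gives Tesseract more pixels per character at the cost of larger files, while a lower JPEG quality adds compression artifacts that can hurt recognition.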

### Notebook template

You may want to preprocess images before extracting text in order to get better results.

There is a notebook template where you can explore the effect of different image processing techniques.

### Image processing

This recipe processes each greyscale JPG image in the input folder using the functions that the user defines in the recipe's parameter form. Each function takes an image as a NumPy array and returns an image as a NumPy array.
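As an illustration of the array-in, array-out contract, the functions below are two common preprocessing steps written against NumPy only. They are hypothetical examples, not functions shipped with the plugin, and how functions must be declared in the recipe form is not shown here. The sketch assumes 8-bit greyscale input (`uint8`, values 0 to 255).

```python
import numpy as np


def binarize(img: np.ndarray, threshold: int = 127) -> np.ndarray:
    """Map pixels above `threshold` to white (255) and all others to black (0)."""
    return np.where(img > threshold, 255, 0).astype(np.uint8)


def stretch_contrast(img: np.ndarray) -> np.ndarray:
    """Linearly rescale pixel values so the darkest becomes 0 and the brightest 255."""
    lo, hi = int(img.min()), int(img.max())
    if hi == lo:
        # A flat image has no contrast to stretch; return it unchanged.
        return img.copy()
    scaled = (img.astype(np.float32) - lo) * (255.0 / (hi - lo))
    return scaled.round().astype(np.uint8)
```

Because each function returns an array of the same shape and dtype as its input, steps like these can be chained, for example `binarize(stretch_contrast(img))`.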

### Text extraction

Finally, the Text extraction recipe takes as input a folder of greyscale JPG images and outputs a dataset with two columns: the filename and the text extracted by Tesseract.

If some images in the input folder were extracted from the same multi-page PDF by the Image conversion recipe (meaning that they are in the same subfolder and follow the name pattern `<PDF_NAME>_pdf_page_XXXXX.jpg`), you can choose to concatenate their extracted text.
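The concatenation option can be pictured as grouping rows by the PDF name embedded in the filename and joining the page texts in page order. The sketch below is one way to do that and is not the plugin's implementation: the function name, the newline separator, and the choice to key the merged text by the PDF name are assumptions, and it treats `XXXXX` as a numeric page index.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Matches names such as "report_pdf_page_00002.jpg".
PAGE_PATTERN = re.compile(r"^(?P<pdf>.+)_pdf_page_(?P<page>\d+)\.jpg$")


def concatenate_pages(rows: List[Tuple[str, str]], sep: str = "\n") -> Dict[str, str]:
    """Merge (filename, text) rows that belong to the same PDF, in page order.

    Files that do not match the page pattern are kept as they are.
    """
    pages = defaultdict(list)
    result = {}
    for filename, text in rows:
        match = PAGE_PATTERN.match(filename)
        if match:
            pages[match.group("pdf")].append((int(match.group("page")), text))
        else:
            result[filename] = text
    for pdf_name, items in pages.items():
        # Sort by page number so the text reads in document order.
        ordered = sorted(items, key=lambda item: item[0])
        result[pdf_name] = sep.join(text for _, text in ordered)
    return result
```

For example, rows for `report_pdf_page_00002.jpg` and `report_pdf_page_00001.jpg` collapse into a single `report` entry whose text starts with page 1, while a standalone `photo.jpg` keeps its own entry.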

You can also specify the language Tesseract should use by entering its language code. The language must be installed beforehand; ask your admin if it is missing.
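For readers who want to reproduce the extraction step outside DSS, the sketch below builds the same kind of two-column result with the `pytesseract` wrapper. Whether the plugin itself uses `pytesseract` is an assumption, as are the function names and the `filename` and `text` keys. The OCR call is passed in as a parameter so the row-building logic can be tried without Tesseract installed.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def tesseract_ocr(path: Path, lang: str) -> str:
    """Run Tesseract on one image file using the given language code (e.g. "eng")."""
    import pytesseract  # assumed wrapper around the Tesseract binary
    from PIL import Image

    with Image.open(path) as img:
        return pytesseract.image_to_string(img, lang=lang)


def extract_rows(
    paths: Iterable[Path],
    lang: str = "eng",
    ocr: Callable[[Path, str], str] = tesseract_ocr,
) -> List[Dict[str, str]]:
    """Return one row per image with its filename and extracted text."""
    return [{"filename": p.name, "text": ocr(p, lang)} for p in sorted(paths)]
```

With Tesseract and the French language pack installed, `extract_rows(Path("images").glob("*.jpg"), lang="fra")` would return one row per image in the hypothetical `images` folder.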

#### Install In DSS

To install the plugin, open the Apps menu, click Plugins, and search for Tesseract - OCR.

Alternatively, you can download a zipped version here.

