# Training Tesseract on a new font

First, install Tesseract. On macOS with Homebrew:

brew install tesseract


Let’s create a new language called “newfra”.

To learn a new language, Tesseract needs two files: a TIFF image containing the characters to learn, and a box file giving the bounding box of each character in that image.

Begin by creating the character table as a TIFF image. Here is an example of a TIFF file:

The box file has one line per character in the image. Each line has the form char bl_x bl_y rt_x rt_y, where char is the character, (bl_x, bl_y) are the x and y coordinates of the bottom-left corner of its bounding box, and (rt_x, rt_y) are those of the top-right corner. Coordinates are expressed in a system where (0,0) is the bottom-left corner of the TIFF image.
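As an illustration of this format, the sketch below writes a tiny box file and sanity-checks it with awk. The file name sample.box, the coordinates, and the check_box helper are made up for the example. The check also accepts an optional sixth field, because box files written by Tesseract 3.x end each line with a page number.

```shell
# Write a small, made-up box file: char bl_x bl_y rt_x rt_y
cat > sample.box <<'EOF'
a 10 20 30 50
b 35 20 55 60
c 60 20 80 55
EOF

# check_box FILE: report lines that do not have 5 (or 6) fields, or whose
# top-right corner is not above and to the right of the bottom-left corner.
# Returns non-zero if any line looks wrong.
check_box() {
  awk '(NF != 5 && NF != 6) || $4 <= $2 || $5 <= $3 {
         printf "line %d looks wrong: %s\n", NR, $0
         bad = 1
       }
       END { exit bad }' "$1"
}

check_box sample.box && echo "sample.box looks well formed"
```

A check like this is handy after editing a box file by hand, since a swapped pair of coordinates is easy to miss by eye.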

To create the box file, you can run Tesseract’s recognition engine on the image to get a first draft, then edit it by hand: add the lines for characters that were not recognized and correct the lines that were recognized improperly.

To get a better first pass, you can download additional languages such as “fra” for French. Put the file fra.traineddata in /usr/local/share/tessdata/ so that Tesseract can use it. You can also use an online tool.
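A minimal sketch of the draft step, assuming Tesseract 3.x (where the makebox config writes a box file next to the image) and assuming fra.traineddata is already installed. The make_box helper and the file name in the usage note are hypothetical.

```shell
# make_box IMAGE [LANG]: draft a box file for IMAGE using an existing
# language model (default: fra). The result is written to <image base>.box.
make_box() {
  image=$1
  lang=${2:-fra}
  base=${image%.*}   # strip the extension, e.g. .tif
  tesseract "$image" "$base" -l "$lang" batch.nochop makebox
}
```

For example, `make_box newfra.myfont.exp0.tif fra` would produce newfra.myfont.exp0.box, which you then open in a text editor to fix the wrong or missing lines.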

Then, with the two files (.tiff and .box), here is the list of commands to create the new language newfra for Tesseract from this TIFF image:
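One possible sequence is sketched below. It assumes the Tesseract 3.x legacy training tools (unicharset_extractor, mftraining, cntraining, combine_tessdata) and input files named lang.font.exp0.tif and lang.font.exp0.box. The train_font helper, the font name, and the all-zero style flags in font_properties are placeholders to adapt, and the exact steps may differ between Tesseract versions.

```shell
# train_font LANG FONT: build LANG.traineddata from LANG.FONT.exp0.tif
# and LANG.FONT.exp0.box in the current directory.
train_font() {
  lang=$1
  font=$2
  base="$lang.$font.exp0"

  # 1. Extract character features from the image/box pair -> $base.tr
  tesseract "$base.tif" "$base" nobatch box.train || return 1

  # 2. Collect the set of characters used -> unicharset
  unicharset_extractor "$base.box" || return 1

  # 3. Declare the font style: <name> <italic> <bold> <fixed> <serif> <fraktur>
  echo "$font 0 0 0 0 0" > font_properties

  # 4. Build the classifier prototypes
  mftraining -F font_properties -U unicharset -O "$lang.unicharset" "$base.tr" || return 1
  cntraining "$base.tr" || return 1

  # 5. Prefix the outputs with the language name
  #    (shapetable may be absent, depending on the version)
  for f in inttemp normproto pffmtable shapetable; do
    if [ -f "$f" ]; then
      mv "$f" "$lang.$f" || return 1
    fi
  done

  # 6. Bundle everything into $lang.traineddata
  combine_tessdata "$lang."
}
```

After `train_font newfra myfont` succeeds, copying newfra.traineddata to /usr/local/share/tessdata/ makes the new language available to Tesseract.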

To recognize characters in a new image, simply type:

tesseract image.tif output -l newfra


