Training the Tesseract optical character recognition engine on a new font on macOS
Training Tesseract on a new font
First, install Tesseract:
brew install tesseract
Let’s create a new language called “newfra”.
To learn a new language, Tesseract needs two files: a TIFF image containing the characters to learn, and a box file giving the bounding box of each character in that image.
Begin by creating the character table as a TIFF image. Here is an example of such a TIFF file:
The box file contains one line per character in the image, each of the form char bl_x bl_y rt_x rt_y
where char is the character, bl_x and bl_y are the abscissa and ordinate of the bottom-left corner of its bounding box, and rt_x and rt_y are those of its top-right corner, in a coordinate system where (0,0) is at the bottom-left corner of the TIFF image.
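To make the box format concrete, here is a minimal sketch that writes a tiny box file and checks that each line is well-formed. The coordinates are invented for illustration, and check_box is a hypothetical helper, not part of Tesseract. It also accepts an optional sixth field, on the assumption that your Tesseract version appends a page number to each line.

```shell
# Sample box file: char bl_x bl_y rt_x rt_y (values invented for illustration).
cat > sample.box <<'EOF'
a 10 20 30 50
b 35 20 55 60
é 60 20 80 52
EOF

# check_box FILE (hypothetical helper): print malformed lines, return non-zero if any.
# A line is accepted when it has a character followed by 4 integers (plus an
# optional page number) with bl_x < rt_x and bl_y < rt_y.
check_box() {
  awk '
    {
      ok = (NF == 5 || NF == 6)
      for (i = 2; ok && i <= NF; i++) if ($i !~ /^[0-9]+$/) ok = 0
      if (ok && ($2 + 0 >= $4 + 0 || $3 + 0 >= $5 + 0)) ok = 0
      if (!ok) { printf "line %d: %s\n", NR, $0; bad = 1 }
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

check_box sample.box && echo "sample.box looks well-formed"
```

Running such a check after editing the box file by hand catches swapped corners and missing fields before training starts.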
To create the box file, you can run the Tesseract recognition engine on the image, then manually add the lines for characters that were not recognized and correct the lines that were recognized improperly.
To get a better first pass, you can download additional languages, such as “fra” for French. Put the file fra.traineddata in /usr/local/share/tessdata/ so that Tesseract can use it. You can also use an online tool to create the box file.
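The first pass described above might look like the following sketch. The file name newfra.font.exp0.tif is an assumption that follows the lang.fontname.expN naming convention from the Tesseract training documentation, and the makebox config is the one shipped with Tesseract for this purpose. The command is guarded so that nothing runs when Tesseract or the image is missing.

```shell
# Hypothetical base name following the lang.fontname.expN convention.
base="newfra.font.exp0"

if command -v tesseract >/dev/null 2>&1 && [ -f "$base.tif" ]; then
  # Recognize the image with the French model and write the result as
  # $base.box, one "char bl_x bl_y rt_x rt_y ..." line per character found.
  tesseract "$base.tif" "$base" -l fra batch.nochop makebox
else
  echo "tesseract or $base.tif not found; nothing to do" >&2
fi
```

The resulting box file is only a starting point and still needs the manual corrections mentioned earlier.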
Then, with the two files (.tiff and .box), here is the list of commands to create the new language newfra for Tesseract from this TIFF image:
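As a rough sketch only, the legacy (Tesseract 3.x style) training sequence usually looks like the commands below. This is an assumption about the intended workflow, not a copy of the author's command list. The font name "font", the base name newfra.font.exp0, and the train_language wrapper are hypothetical, and the training tools (unicharset_extractor, shapeclustering, mftraining, cntraining, combine_tessdata) are assumed to be installed alongside tesseract.

```shell
lang="newfra"
font="font"                               # hypothetical font name
base="$lang.$font.exp0"                   # expects $base.tif and $base.box
tessdata_dir="/usr/local/share/tessdata"  # path from the text; adjust to your install

train_language() {
  # 1. Extract character features from the image/box pair -> $base.tr
  tesseract "$base.tif" "$base" box.train || return 1
  # 2. Collect the characters used in the box file -> ./unicharset
  unicharset_extractor "$base.box" || return 1
  # 3. Describe the font: <name> <italic> <bold> <fixed> <serif> <fraktur>
  echo "$font 0 0 0 0 0" > font_properties
  # 4. Cluster the features into shape and character prototypes
  shapeclustering -F font_properties -U unicharset "$base.tr" || return 1
  mftraining -F font_properties -U unicharset -O "$lang.unicharset" "$base.tr" || return 1
  cntraining "$base.tr" || return 1
  # 5. Prefix the generated files with the language name
  for f in inttemp normproto pffmtable shapetable; do
    mv "$f" "$lang.$f" || return 1
  done
  # 6. Pack everything into $lang.traineddata
  combine_tessdata "$lang." || return 1
}

if command -v mftraining >/dev/null 2>&1 && [ -f "$base.tif" ] && [ -f "$base.box" ]; then
  train_language && cp "$lang.traineddata" "$tessdata_dir/"
else
  echo "training tools or input files missing; skipping" >&2
fi
```

Newer Tesseract releases train their LSTM engine with a different procedure, so treat this as applicable to the legacy engine only.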
To recognize characters in a new image, simply type:
tesseract image.tif output -l newfra
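Before running the recognition, it can help to confirm that Tesseract actually sees the new language. The sketch below assumes newfra.traineddata has been copied into the tessdata directory and that image.tif exists in the current directory. It relies on the --list-langs option and on Tesseract writing its result to output.txt.

```shell
lang="newfra"

if command -v tesseract >/dev/null 2>&1 && [ -f image.tif ]; then
  # --list-langs prints one installed language per line
  # (stdout or stderr depending on the version, hence 2>&1).
  if tesseract --list-langs 2>&1 | grep -qx "$lang"; then
    tesseract image.tif output -l "$lang"   # recognized text goes to output.txt
    cat output.txt
  else
    echo "$lang is not installed in the tessdata directory" >&2
  fi
else
  echo "tesseract or image.tif not found; nothing to do" >&2
fi
```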