Tesseract is tough … so tough, indeed, that even Chuck Norris would have to check the manual twice. Not kidding you. Okay, so this article aims at structuring what I needed to learn about tesseract to OCR-convert PDFs to text and how to train tesseract for new fonts. Let me dampen your expectations – you *will* have to read further texts (esp. the official documentation) to actually perform successful training! This text describes the usage of tesseract 3.03 RC on Ubuntu 14.04. Tesseract is also available for other Linuxes and Windows – the workflow will be mostly the same across OSes, though some of the commands I use are of course specific to Ubuntu. Also mind that tesseract 3.03 differs considerably from 3.02, which again differs from 3.01 – the changes are partially more fundamental than the version numbers might suggest.
Installation of tesseract
Installing tesseract so that the training tools are available requires a number of potentially tricky steps on Ubuntu 14.04 (in my case, though, it worked like a charm):
- Compilation of Leptonica 1.7+
- Installation of dependencies, download and compilation of tesseract 3.03 RC1
- Building of training tools
Figure out where the configuration and traineddata files are located. The best place is /usr/local/share/tessdata. If they live somewhere else, set $TESSDATA_PREFIX to the directory that contains your tessdata folder (for tesseract 3.x the variable points at the parent of tessdata, not at tessdata itself). Custom configuration files are supposed to be placed in the configs subfolder of tessdata.
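A minimal sketch of how that could look, assuming tesseract was built with the default /usr/local prefix and that you use bash:

# tesseract 3.x looks for <TESSDATA_PREFIX>/tessdata, so point the
# variable at the parent directory of the tessdata folder
export TESSDATA_PREFIX=/usr/local/share/
# make it permanent for future shells
echo 'export TESSDATA_PREFIX=/usr/local/share/' >> ~/.bashrc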
If you don’t intend to train tesseract but only to use it for OCR directly, installation on Ubuntu is no more and no less than
sudo apt-get install tesseract-ocr.
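The same goes for additional language data – e.g. the German data used in one of the examples below (package name as on Ubuntu 14.04):

sudo apt-get install tesseract-ocr-deu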
Conversion of a PDF to an Image
# conversion of PDF to PNGs (one per page)
convert -density 500 test.pdf -quality 100 test.png

# conversion of page 1 of PDF to PNG
convert -density 500 test.pdf[0] -quality 100 test.png

# conversion of PDF to multi-page TIFF
gs -dNOPAUSE -q -r500 -sDEVICE=tiffg4 -dBATCH -sOutputFile=test.tif test.pdf

# conversion of PDF pages 1 to 5 to multi-page TIFF
gs -dNOPAUSE -q -r500 -sDEVICE=tiffg4 -dBATCH -sOutputFile=test.tif -dFirstPage=1 -dLastPage=5 test.pdf
For a regular-sized font of about 11 pt, a good resolution is about 300 to 500 DPI.
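If you are unsure what resolution your converted image actually ended up with, ImageMagick's identify tool can tell you (the exact output format varies a bit between ImageMagick versions):

# print the horizontal and vertical resolution of the image
identify -format "%x x %y\n" test.png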
Application of tesseract to an Image
# OCR test.png using pretrained fonts for German (deutsch)
# language and write results to test.txt
tesseract test.png test -l deu

# OCR test.tif using pretrained fonts for English language
# using configuration specified in configfile and write
# results to test.txt
tesseract test.tif test -l eng configfile
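To give you an idea of what such a config file can contain, here is a sketch that creates a config named onlydigits (the name is my own choice) restricting recognition to numeric characters via the tessedit_char_whitelist parameter:

# create a minimal custom config in the configs subfolder
echo "tessedit_char_whitelist 0123456789" | sudo tee /usr/local/share/tessdata/configs/onlydigits
# and use it
tesseract test.tif test -l eng onlydigits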
The initial OCR result for …
might be …
Tdon't thinh about ycur errors orfailures; odterwise, you'll never do a thing. TEill rnAlurray
… Meh!
Let me tell you, though, that for standard (sane) fonts like Arial or Times New Roman the out-of-the-box performance yields an error rate of maybe 1% if your document is of good optical quality. That’s the good part about tesseract – most of the time you won’t have to worry about training it at all.
Create box file
# OCR new.test.exp0.tif using pretrained information for
# English language fonts and create a box file for it
tesseract new.test.exp0.tif new.test.exp0 -l eng batch.nochop makebox
Let’s assume the following training image …
The initial resulting box file might be …
A 235 3220 263 3260 0
e 278 3225 307 3260 0
E 323 3225 348 3260 0
b 358 3224 389 3260 0
E 405 3225 427 3260 0
T 437 3225 462 3260 0
g 477 3215 501 3260 0
...
E 438 2741 460 2776 0
i 471 2741 484 2774 0
n 494 2741 518 2765 0
s 528 2741 547 2765 0
t 555 2741 570 2772 0
e 581 2741 600 2765 0
i 609 2741 622 2774 0
Some letters are identified correctly – others not. By the way, the first four numbers are the coordinates of the box (left-x, bottom-y, right-x, top-y) with the origin at the bottom left. The fifth number is the page index, in case you use a multi-page TIFF. Whether to split two characters or to keep them in one box and assign it the correct value is a source of mystery and speculation. Common sense and putting yourself mentally into a machine learning algorithm’s shoes will help.
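If you want a rough sanity check before firing up an editor, sorting the boxes by height can help – merged or badly split boxes often show up as outliers. A little helper of my own, not part of tesseract:

# box format: <char> <left> <bottom> <right> <top> <page>
# print the five smallest and five largest box heights
awk '{print $5 - $3, $0}' new.test.exp0.box | sort -n | head -n 5
awk '{print $5 - $3, $0}' new.test.exp0.box | sort -n | tail -n 5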
Correcting the box file
I think I read in some article on The Intercept that the CIA was torturing potential terrorists in those black sites by having them correct tesseract box files for handwritten Sanskrit texts in case waterboarding didn’t work. If you indulge in correcting box files for longer than one hour, make sure you have tissues next to you, as your brain might melt and drip from your nostrils. Don’t blame me if you ruin your shirt!
Anyway – my advice is to segment training into multiple steps. The first training will be tedious because tesseract will make many mistakes and you will have to correct a lot of little boxes. But you can use what you trained so far for the next training step and its initial creation of the box files. So with every training step you can increase the complexity of your training data.
To make correction, adjustment, insertion, deletion, merging and splitting of boxes a bit easier I recommend using a box file editor. jTessBoxEditor does a good job. Download it, extract it and start it:
java -Xms4096m -Xmx4096m -jar jTessBoxEditor.jar
So the above box file might initially look like this:
In the above case you would have to correct the value for the marked character from “T” to “F”, you would have to split “N O P” into three separate boxes, etc. When you’re done, don’t forget to save the box file edits.
Training tesseract
Tesseract expects the files involved to adhere to the following naming scheme:
[language].[font name].exp[num]
The language might be eng2 (as “eng” already exists) and the font name is Lobster Two. So the training image and its box file might be named:
eng2.LobsterTwo.exp0.png
eng2.LobsterTwo.exp0.box
Now let’s get some training done – I recommend for now to just “accept” the steps taken – don’t question them, follow slavishly – as if it were a religion, or some new Apple product.
tesseract eng2.LobsterTwo.exp0.png eng2.LobsterTwo.exp0 nobatch box.train
unicharset_extractor eng2.LobsterTwo.exp0.box

# font name <italic> <bold> <fixed> <serif> <fraktur>
echo "LobsterTwo 0 0 0 0 0" > font_properties

shapeclustering -F font_properties -U unicharset eng2.LobsterTwo.exp0.tr
mftraining -F font_properties -U unicharset -O eng2.unicharset eng2.LobsterTwo.exp0.tr
cntraining eng2.LobsterTwo.exp0.tr

# prefix "relevant" files with our language code
mv inttemp eng2.inttemp
mv normproto eng2.normproto
mv pffmtable eng2.pffmtable
mv shapetable eng2.shapetable

combine_tessdata eng2.

# copy the created eng2.traineddata to the tessdata folder
# so tesseract is able to find it
sudo cp eng2.traineddata /usr/local/share/tessdata/
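A quick way to check that tesseract actually sees the new language data (the --list-langs option should be available from 3.02 on):

# eng2 should show up in the list of available languages
tesseract --list-langs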
Did it work?
Okay – chances are that it didn’t work yet – you’ll have to reread this text and draw inspiration from further blogs and even the official documentation. But let’s assume everything did work – so, if I now re-OCR the test image …
tesseract test.png test -l eng2
… what I will get is …
Don't Mk about your errors orfaiwres; otherwise, you'u never do a thing Bw Murray
… well – it’s a bit better :) Not much – but given the oddness of the font I fear we just have to put more effort into the training and provide much more data. It’s been suggested that there should be at least 10 samples per character; also, our training data uses a larger font spacing than the test image, which would have to be addressed as well.
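One way to generate a lot of additional training data without endless box file correction is the text2image tool that ships with the 3.03 training tools. A sketch, assuming the “Lobster Two” font is installed on your system and training_text.txt contains your sample text:

# renders the text in the given font and produces a matching
# .tif and .box file in one go
text2image --text=training_text.txt --outputbase=eng2.LobsterTwo.exp1 --font='Lobster Two' --fonts_dir=/usr/share/fonts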
Helpful Blog Posts with Further Details
- How to train Tesseract 3.01
- Adding New Fonts to Tesseract 3 OCR Engine
- Training with Tesseract
- Training Tesseract
At the End of the Day
There is a lot more stuff to learn about tesseract. And chances are that many things will change once 3.04 sees the light of day. But if you need to get OCR done, I think delving into tesseract is well worth it. It’s terribly documented and the community is not very active, but it’s a very powerful tool nonetheless. Good luck!