
A Guide on OCR with tesseract 3.03


Tesseract is tough … so tough indeed, even Chuck Norris would have to check the manual twice. Not kidding you. Okay, so this article aims at structuring what I needed to learn about tesseract to OCR-convert PDFs to text and to train tesseract for new fonts. Let me dampen your expectations – you *will* have to read further texts (esp. the official documentation) to actually perform successful training! This text describes the usage of tesseract 3.03 RC on Ubuntu 14.04. Tesseract is also available for other Linuxes and Windows – the workflow will be mostly the same across OSes, though some of the commands I use are specific to Ubuntu. Also mind that tesseract 3.03 is considerably different from 3.02, which in turn differs from 3.01 – the changes are partially more fundamental than what you might expect from the version numbers.

Installation of tesseract

Installation of tesseract, so you can use the training tools, will require a number of potentially difficult steps on Ubuntu 14.04 (in my case though it worked like a charm):

  1. Compilation of Leptonica 1.7+
  2. Install Dependencies and Download and Compile tesseract 3.03 RC1
  3. Building of training tools

Figure out where the configuration and traineddata files are located. The best place is /usr/local/share/tessdata. If they are somewhere else, set $TESSDATA_PREFIX to the parent directory of that tessdata folder (tesseract 3.x appends tessdata/ to the prefix itself). Custom configuration files are supposed to be placed in the configs subfolder.
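If you are not sure where your installation keeps its traineddata files, a small helper along these lines can check a few likely places. This is only a sketch: the function name find_tessdata is made up for illustration, and the two directories in the usage comment are guesses (a source build and the Ubuntu package), so pass whatever candidates make sense on your machine.

```shell
# Sketch: print the first candidate directory that contains *.traineddata files.
# find_tessdata is a hypothetical helper, not part of tesseract.
find_tessdata() {
    local dir f
    for dir in "$@"; do
        for f in "$dir"/*.traineddata; do
            if [ -e "$f" ]; then          # glob matched at least one real file
                printf '%s\n' "$dir"
                return 0
            fi
        done
    done
    return 1                              # nothing found in any candidate
}

# Usage (candidate paths are assumptions):
# find_tessdata /usr/local/share/tessdata /usr/share/tesseract-ocr/tessdata
```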

If you don’t intend to train tesseract but only to use it for OCR directly, installation on Ubuntu is no more and no less than:

sudo apt-get install tesseract-ocr

Conversion of a PDF to an Image

# conversion of PDF to PNGs (one per page)

convert -density 500 test.pdf -quality 100 test.png

# conversion of page 1 of PDF to PNG

convert -density 500 'test.pdf[0]' -quality 100 test.png

# conversion of PDF to multi-page TIFF

gs -dNOPAUSE -q -r500 -sDEVICE=tiffg4 -dBATCH \
    -sOutputFile=test.tif test.pdf

# conversion of PDF pages 1 to 5 to multi-page TIFF

gs -dNOPAUSE -q -r500 -sDEVICE=tiffg4 -dBATCH \
    -sOutputFile=test.tif -dFirstPage=1 -dLastPage=5 test.pdf

For a regular sized font of about 11pt a good resolution is about 300 to 500 DPI.
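To get a feeling for what those resolutions mean, you can convert the font size into pixels: one point is 1/72 inch, so the nominal size in pixels is pt * dpi / 72. The helper name font_px below is made up for this sketch, and keep in mind that the point size is the nominal (em) size, so the actual letters come out somewhat smaller.

```shell
# Sketch: nominal font size in pixels for a given point size and DPI.
# 1 pt = 1/72 inch, so pixels = pt * dpi / 72 (rounded to the nearest integer).
font_px() {
    awk -v pt="$1" -v dpi="$2" 'BEGIN { printf "%d\n", pt * dpi / 72 + 0.5 }'
}

font_px 11 300   # prints 46
font_px 11 500   # prints 76
```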

Applying tesseract to an Image

# OCR test.png using pretrained fonts for German (deu-tsch) 
# language and write results to test.txt

tesseract test.png test -l deu

# OCR test.tif using pretrained fonts for English language
# using configuration specified in configfile and write
# results to test.txt

tesseract test.tif test -l eng configfile
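If you converted the PDF into one PNG per page, ImageMagick usually names the files test-0.png, test-1.png and so on, and you will want to OCR all of them in one go. The loop below is a minimal sketch of that: ocr_pages and the OCR_CMD variable are names I made up (OCR_CMD is only there so you can dry-run the loop with echo), and each page ends up in its own text file.

```shell
# Sketch: OCR every page image named <prefix>-<n>.png into <prefix>-<n>.txt.
# OCR_CMD is a hypothetical override hook; by default plain tesseract is called.
OCR_CMD=${OCR_CMD:-tesseract}

ocr_pages() {
    local prefix=$1 lang=${2:-eng} img
    for img in "$prefix"-*.png; do
        [ -e "$img" ] || continue         # the glob matched nothing
        # tesseract appends .txt to the output base name itself
        "$OCR_CMD" "$img" "${img%.png}" -l "$lang"
    done
}

# Usage:
# ocr_pages test deu
# Dry run that only prints the commands' arguments:
# OCR_CMD=echo; ocr_pages test deu
```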

The initial OCR result for …

[Image: test]

might be …

Tdon't thinh about ycur errors orfailures; odterwise, you'll never do a thing.
TEill rnAlurray

… Meh!

Let me tell you though that for standard (sane) fonts like Arial or Times New Roman the out of the box performance yields an error rate of maybe 1% if your document is of good optical quality. That’s the good part about tesseract – most of the time you won’t have to worry about training tesseract.

Create box file

# OCR new.test.exp0.tif using pretrained information for
# English language fonts and create a box file for it

tesseract new.test.exp0.tif new.test.exp0 \
    -l eng batch.nochop makebox

Let’s assume the following training image …

[Image: training]

The initial resulting box file might be …

A 235 3220 263 3260 0
e 278 3225 307 3260 0
E 323 3225 348 3260 0
b 358 3224 389 3260 0
E 405 3225 427 3260 0
T 437 3225 462 3260 0
g 477 3215 501 3260 0
...
E 438 2741 460 2776 0
i 471 2741 484 2774 0
n 494 2741 518 2765 0
s 528 2741 547 2765 0
t 555 2741 570 2772 0
e 581 2741 600 2765 0
i 609 2741 622 2774 0

Some letters are identified correctly – others not. By the way, the first four numbers are the coordinates of the box (left-x, bottom-y, right-x, top-y) with the origin at the bottom left. The fifth number is the page index in case you use a multi-page TIFF. Whether to split two characters or to keep them in one box and assign it the correct value is a source of mystery and speculation. Common sense and putting yourself mentally into a machine learning algorithm’s shoes will help.
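Because a box file is just whitespace-separated text, awk is handy for a quick sanity check before you start clicking around in an editor. The sketch below prints width and height per box and lists boxes that look suspiciously wide, which are candidates for two characters stuck together. The function names and the width/height ratio of 1.5 are my own arbitrary choices for illustration, not anything tesseract prescribes.

```shell
# Sketch: inspect box files. Fields per line: glyph left bottom right top page.
# box_sizes and wide_boxes are hypothetical helper names.
box_sizes() {
    awk '{ printf "%s w=%d h=%d page=%d\n", $1, $4 - $2, $5 - $3, $6 }' "$@"
}

# Print boxes whose width exceeds WIDE_RATIO times their height
# (default 1.5, an arbitrary guess; tune it for your font).
wide_boxes() {
    awk -v r="${WIDE_RATIO:-1.5}" '($4 - $2) > r * ($5 - $3)' "$@"
}

printf 'A 235 3220 263 3260 0\n' | box_sizes   # prints: A w=28 h=40 page=0
```

Both functions read the files given as arguments, or standard input if there are none, e.g. wide_boxes new.test.exp0.box.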

Correcting the box file

I think I read in some article on The Intercept that the CIA was torturing potential terrorists in those black sites by having them correct tesseract box files for texts of handwritten Sanskrit in case waterboarding didn’t work. If you indulge in correcting box files for longer than one hour – make sure you have tissues next to you as your brain might melt and drip from your nostrils. Don’t blame me if you ruin your shirt!

Anyway – my advice is to segment training into multiple steps. The first training will be tedious b/c tesseract will make many mistakes and you will have to correct a lot of little boxes. But you can use what tesseract learned in one step for the initial creation of the box files in the next step. So with every training step you increase the complexity of your training data.

To make correction, adjustment, insertion, deletion, merging and splitting of boxes a bit easier I recommend using a box file editor. jTessBoxEditor does a good job. Download, extract and then start it:

java -Xms4096m -Xmx4096m -jar jTessBoxEditor.jar

So the box file above might initially look like this:

[Screenshot: the box file opened in jTessBoxEditor]

In the above case you would have to correct the value for the marked character from “T” to “F”, you would have to split “N O P” into three separate boxes etc. When you’re done don’t forget to save the box file edits.

Training tesseract

Tesseract expects the files involved to adhere to this naming scheme:

[language].[font name].exp[num]

The language might be eng2 (as “eng” already exists). The font name is Lobster Two. So the name of the training picture and its box file might be:

  • eng2.LobsterTwo.exp0.png
  • eng2.LobsterTwo.exp0.box
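Since the same base name shows up in every training command, it can help to build it once and reuse it. Here is a minimal sketch; train_base is a hypothetical helper name, and stripping the spaces from the font name simply mirrors the "Lobster Two" to "LobsterTwo" convention used above.

```shell
# Sketch: build the base name [language].[font name].exp[num].
# train_base is a made-up helper; spaces in the font name are dropped.
train_base() {
    local lang=$1 font num=${3:-0}
    font=$(printf '%s' "$2" | tr -d ' ')      # "Lobster Two" -> "LobsterTwo"
    printf '%s.%s.exp%s\n' "$lang" "$font" "$num"
}

train_base eng2 "Lobster Two" 0   # prints eng2.LobsterTwo.exp0
```

With base=$(train_base eng2 "Lobster Two" 0) the training image is "$base.png" and its box file is "$base.box".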

Now let’s get some training done – I recommend for now to just “accept” the steps taken – don’t question, follow slavishly – as if it was a religion – or some new Apple product.

tesseract eng2.LobsterTwo.exp0.png eng2.LobsterTwo.exp0 nobatch box.train

unicharset_extractor eng2.LobsterTwo.exp0.box

# font name <italic> <bold> <fixed> <serif> <fraktur>
echo "LobsterTwo 0 0 0 0 0" > font_properties

shapeclustering -F font_properties -U unicharset eng2.LobsterTwo.exp0.tr

mftraining -F font_properties -U unicharset -O eng2.unicharset \
    eng2.LobsterTwo.exp0.tr

cntraining eng2.LobsterTwo.exp0.tr


# prefix "relevant" files with our language code
mv inttemp eng2.inttemp
mv normproto eng2.normproto
mv pffmtable eng2.pffmtable
mv shapetable eng2.shapetable
combine_tessdata eng2.

# copy the created eng2.traineddata to the tessdata folder
# so tesseract is able to find it
sudo cp eng2.traineddata /usr/local/share/tessdata/
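A quick way to see whether tesseract actually picks up the new language is to look for it in the list of installed languages. The sketch below assumes your build supports the --list-langs option; has_lang is a made-up helper that just searches that list for an exact line. The 2>&1 is there because, as far as I can tell, some versions print the list to stderr.

```shell
# Sketch: succeed if the language code given as $1 appears as a whole line
# on standard input. has_lang is a hypothetical helper name.
has_lang() {
    grep -qxF -- "$1"
}

# Usage (assumes tesseract supports --list-langs):
# tesseract --list-langs 2>&1 | has_lang eng2 && echo "eng2 is installed"
```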

Did it work?

Okay – chances are that it didn’t work yet – you’ll have to reread this text and draw inspiration from further blogs and even the official documentation. But let’s assume everything did work – so, if I now re-OCR the test image …

tesseract test.png test -l eng2

… what I will get is …

Don't Mk about your errors orfaiwres; otherwise, you'u never do a thing
Bw Murray

… well – it’s a bit better :) Not much – but given the oddness of the font I fear we just have to put more effort into the training and provide much more data. It’s been suggested that there should be at least 10 samples per character and also our training data set assumes a larger font spacing. This would have to be addressed as well.
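To see how far the training data is from that rule of thumb, you can count how often each character occurs in the box files. This is a minimal sketch: rare_glyphs is a name I made up, and the threshold is whatever you pass in, with 10 only being the suggestion mentioned above.

```shell
# Sketch: list glyphs that occur fewer than MIN times in the given box files,
# one "glyph count" pair per line. rare_glyphs is a hypothetical helper name.
rare_glyphs() {
    local min=$1
    shift
    awk -v min="$min" '
        { count[$1]++ }                               # first field is the glyph
        END { for (g in count) if (count[g] < min) print g, count[g] }
    ' "$@" | sort
}

# Usage:
# rare_glyphs 10 eng2.LobsterTwo.exp0.box
```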

Helpful Blog Posts with Further Details

At the End of the Day

There is a lot more stuff to learn about tesseract. And chances are that many things will change if 3.04 sees the light of day. But if you need to get OCR done I think delving into tesseract is well worth it. It’s terribly documented and the community is not very active, but it’s a very powerful tool nonetheless. Good luck!

