OCR image preprocessing with ImageMagick

milosbre
Posts: 5
Joined: 2019-01-09T10:20:15-07:00
Authentication code: 1152

Post by milosbre » 2019-01-09T10:26:10-07:00

I am trying to find the best way to clean up an image with ImageMagick before I send it to tesseract.

So far, the best result has come from this command:

Code:

convert test.tif -fill black -fuzz 30% +opaque "#FFFFFF" result.tif
But the results from tesseract aren't so good.

How would you guys do it?

Example image:
Image


Here are the images
https://www.dropbox.com/sh/jyrd58nbrava ... RRQRa?dl=0

fmw42
Posts: 25108
Joined: 2007-07-02T17:14:51-07:00
Authentication code: 1152
Location: Sunnyvale, California, USA

Post by fmw42 » 2019-01-09T11:15:44-07:00

What is different about your various tests? Does test6.tif not work for OCR? If not, have you tried making the background white? You have not posted your original TIFF file; the file shown above has been converted to JPG.
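A minimal sketch of the white-background idea, assuming the cleaned image comes out as light text on a black background (the source images are not reproduced here, so that is a guess). It reuses the command from the first post and appends -negate; the function name clean_white_bg is made up for illustration.

```shell
# clean_white_bg SRC DST  (hypothetical helper name)
# Same cleanup as the command in the first post: every pixel that is not
# within 30% of white is painted black. The trailing -negate then inverts
# the result, so light text on black becomes dark text on white.
clean_white_bg() {
    convert "$1" -fill black -fuzz 30% +opaque "#FFFFFF" -negate "$2"
}
```

Usage would be `clean_white_bg test.tif result.tif`. On ImageMagick 7 the `convert` command is normally spelled `magick`.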

Post by milosbre » 2019-01-09T11:26:43-07:00

Yes, test6.tif works.
The original image is in that Dropbox link (test4.tif).

With the code I provided I get some results, but, for example, '5' is often mistaken for 'S', and ':' is mistaken for 'i', '?', etc.
I'm very new to image preprocessing, so I assumed I was doing something wrong: I can clean the image just fine, but tesseract still misses the obvious characters.

Post by milosbre » 2019-01-09T11:33:16-07:00

I have done a lot of searching and testing. As far as the numbers are concerned, I get perfect results, but I completely lose the . and :

Code:

convert test.tif -brightness-contrast -40x10 -units pixelsperinch -density 300 -negate -noise 10 -threshold 70% result.tif
converts this
Image
into this:
Image

And I get perfect number detection, but as you can see "," and ":" are lost.
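One possible cause, offered as a guess rather than a diagnosis: in ImageMagick 6, -noise with a radius is a noise-reduction filter, and a radius of 10 may be large enough to erase small isolated marks such as dots, commas and colons. The sketch below keeps the other options from the command above and makes the radius a parameter, so the filtered and unfiltered variants can be compared in tesseract. The function name threshold_clean is hypothetical.

```shell
# threshold_clean SRC DST [RADIUS]  (hypothetical helper name)
# Same options as the command above. The -noise step is applied only when
# RADIUS is greater than 0; the default of 0 skips it entirely.
threshold_clean() {
    src=$1
    dst=$2
    radius=${3:-0}
    if [ "$radius" -gt 0 ]; then
        convert "$src" -brightness-contrast -40x10 -units pixelsperinch \
            -density 300 -negate -noise "$radius" -threshold 70% "$dst"
    else
        convert "$src" -brightness-contrast -40x10 -units pixelsperinch \
            -density 300 -negate -threshold 70% "$dst"
    fi
}
```

For example, `threshold_clean test.tif no_filter.tif` and `threshold_clean test.tif small_filter.tif 2` produce two candidates to feed to tesseract. Whether the punctuation survives depends on the actual image.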

Post by fmw42 » 2019-01-09T12:25:42-07:00

Looks like the results at https://stackoverflow.com/questions/541 ... magemagick are better than you show here.

Post by milosbre » 2019-01-09T16:30:02-07:00

fmw42 wrote:
2019-01-09T12:25:42-07:00
Looks like the results at https://stackoverflow.com/questions/541 ... magemagick are better than you show here.
To the eye it looks better, but it actually isn't.
With that first command the numbers are often wrong.

Sorry for double posting, but I'm just trying to find someone experienced in this to point me in the right direction.
Why doesn't tesseract read it correctly when it looks almost perfect to the eye?

The second command produces bad-looking results, but it does its magic with the numbers.

Post by milosbre » 2019-01-09T16:51:08-07:00

fmw42 wrote:
2019-01-09T12:25:42-07:00
Looks like the results at https://stackoverflow.com/questions/541 ... magemagick are better than you show here.
If you have any advice on which switches I should use, let me know.

For now I have solved the issue by using the tesseract 4.0 alpha. It uses deep learning, and so far it works perfectly with my first command:

Code:

convert test.tif -fill black -fuzz 30% +opaque "#FFFFFF" result.tif
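The post does not show how tesseract itself was invoked. As a sketch only: Tesseract 4 selects its recognition engine with --oem, and mode 1 requests the neural-network (LSTM) engine alone. The function name ocr_lstm and the output base name are assumptions for illustration.

```shell
# ocr_lstm IMAGE OUTPUT_BASE  (hypothetical helper name)
# Runs Tesseract 4 with the LSTM engine only (--oem 1).
# The recognized text is written to OUTPUT_BASE.txt.
ocr_lstm() {
    tesseract "$1" "$2" --oem 1
}
```

With the cleaned file from the command above, `ocr_lstm result.tif result` would write `result.txt`.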

anthony
Posts: 8883
Joined: 2004-05-31T19:27:03-07:00
Authentication code: 8675308
Location: Brisbane, Australia

Post by anthony » 2019-01-14T22:27:40-07:00

If the image is from a screen dump... Make the image BIGGER....

tesseract is designed for scanned documents at about 600dpi, but displays are typically only 90 to 100dpi so scaling by 600% often works wonders.
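A minimal sketch of that enlargement step, assuming a screen capture as input. The 600% default follows the figure above; the function name upscale_for_ocr and the optional third argument are made up for illustration.

```shell
# upscale_for_ocr SRC DST [PERCENT]  (hypothetical helper name)
# Enlarges the image before OCR. PERCENT defaults to 600.
upscale_for_ocr() {
    convert "$1" -resize "${3:-600}%" "$2"
}
```

For example, `upscale_for_ocr screen.png screen_big.png` enlarges to 600%, and `upscale_for_ocr screen.png screen_big.png 400` tries a smaller factor. The best factor is likely to vary with the font and capture, so it is worth trying a few.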


I also find some fonts make go bad. For example a 'serif f' will often be thought of as a P
Then there is confusion about Il1 or QO0 which can be solve by limiting the character set tesseract is using.

PS: the previous paragraph, when screen-captured from my web browser display, produced...
I also find some fonts make go bad. For example a 'serif f' will often be thought of as a P
Then there is confusion about III or Q00 which can be solve by limiting the character set tesseract is using.
PS: note that all the bars came out as the letter 'I' and the letter O's as the digit zero '0'. In another run the 'I's came out as '1'.
The 'f' had no problem, as my web browser is not using a serif font.
Basically, tesseract can need some luck to work well. I am certainly no expert in its use.
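Limiting the character set, as mentioned above, can be sketched with tesseract's tessedit_char_whitelist variable, set through the -c option. The character list here (digits plus '.', ',' and ':') is an assumption based on the characters discussed earlier in the thread, and the function name ocr_digits is hypothetical.

```shell
# ocr_digits IMAGE OUTPUT_BASE  (hypothetical helper name)
# Restricts recognition to digits and a few punctuation marks, so that
# look-alikes such as 'S' for '5' or 'i' for ':' cannot be produced.
# The recognized text is written to OUTPUT_BASE.txt.
ocr_digits() {
    tesseract "$1" "$2" -c tessedit_char_whitelist='0123456789.,:'
}
```

Usage would be `ocr_digits result.tif result`. One caveat, stated from memory rather than verified: some Tesseract 4.0 builds reportedly ignore the whitelist when the LSTM engine is used, so check the output before relying on it.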

Some links I have
http://community.aiim.org/blogs/richard ... d-indexing
https://mathieularose.com/decoding-captchas/
Anthony Thyssen -- Webmaster for ImageMagick Example Pages
https://imagemagick.org/Usage/
