OCR of typewriter copy

rleir
Posts: 8
Joined: 2014-09-02T07:22:10-07:00

Post by rleir »

This is my first post here, so hi everyone.

I have been using ImageMagick to pre-process images in preparation for OCR. The input is typewriter copy, which is a problem because some characters were typed with less force and so are considerably lighter than others. The other problem is that the white background varies in brightness across the image (I have no control over the scanning process). The OCR output was crummy.

Then I explored a bit, and found this useful line:
$ convert 1345.jpg -colorspace gray \( +clone -blur 0x20 \) -compose Divide_Src -composite 1345photocopy1.jpg
# ref http://www.imagemagick.org/Usage/compose/#divide
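
For anyone reading later: the net effect of that line is to divide the grayscale image by a heavily blurred copy of itself, which evens out the background. Below is a minimal sketch for trying other blur sizes. The helper name flatten_cmd and the output file names are made up for illustration, and the function only prints the commands so they can be checked before anything is run. It assumes file names without spaces.

```shell
# Hypothetical helper (not part of ImageMagick): print the divide-by-blur
# command for one input file and one blur sigma.
flatten_cmd() {
  src=$1
  sigma=${2:-20}                        # 20 is the value used in the line above
  out="${src%.*}_divide_s${sigma}.png"  # output name is an arbitrary choice
  printf '%s\n' "convert $src -colorspace gray \( +clone -blur 0x$sigma \) -compose Divide_Src -composite $out"
}

# Print one command per candidate sigma; append "| sh" to actually run them.
for s in 10 20 40; do
  flatten_cmd 1345.jpg "$s"
done
```

The blur is only a rough estimate of the paper brightness, so the sigma probably needs to be clearly larger than the character strokes; which value reads best for a given scan is something to find by trial.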

After this, the OCR output is quite good, thanks! But it is not perfect. Are there other things I could do in ImageMagick to improve this?
fmw42
Posts: 25562
Joined: 2007-07-02T17:14:51-07:00
Location: Sunnyvale, California, USA

Re: OCR of typewriter copy

Post by fmw42 »

If you are on Linux, Mac OS X, or Windows with Cygwin, you can try my script textcleaner at the link below.

textcleaner -g -e normalize -f 15 -o 10 -t 35 snippetFrom1345.png snippetFrom1345_g_norm_f15_o10_35.png

Otherwise, use the ImageMagick option -lat (local adaptive threshold):

convert snippetFrom1345.png -negate -lat 15x15+10% -negate snippetFrom1345_lat_15x15_10.png

Adjust the 15x15 window size as desired.
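
One way to do that adjustment is to generate the same -lat command for several window sizes and compare the results side by side. This is only a sketch: the helper name lat_sweep_cmds and the list of sizes are assumptions, the +10% offset is kept fixed, and the function prints the commands instead of running them. It assumes file names without spaces.

```shell
# Hypothetical helper: print one -lat command per window size.
# $1 = input image, remaining arguments = window sizes to try.
lat_sweep_cmds() {
  src=$1
  shift
  for win in "$@"; do
    out="${src%.*}_lat_${win}x${win}_10.png"   # same naming pattern as above
    printf '%s\n' "convert $src -negate -lat ${win}x${win}+10% -negate $out"
  done
}

# Print the commands for a few sizes; append "| sh" to actually run them.
lat_sweep_cmds snippetFrom1345.png 9 15 25 35
```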
rleir

Re: OCR of typewriter copy

Post by rleir »

Thanks, the -lat command line gives better results than the Divide_Src line. Both work better than the old way, where I just brightened the input by 50%. I have not tried the textcleaner line yet but would like to.

But now I am using more CPU time for 'convert' than for 'tesseract'. Yikes!
rleir

Re: OCR of typewriter copy

Post by rleir »

Oops, sorry, hold that last post. The -lat method gives more words in the OCR output, but they include incorrect and broken-up words. The Divide_Src method gives almost correct text. I should try other values for 15x15.
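
Since a raw word count turned out to be misleading, a rough alternative is to count how many OCR words also appear in a word list. The sketch below is an assumption-laden illustration, not a real accuracy measure: score_ocr is a made-up helper, and the word list (one word per line) has to be supplied by the user. It prints the number of matched words followed by the total number of words.

```shell
# Hypothetical helper: print "<matched> <total>" for an OCR text file.
# $1 = OCR output text file
# $2 = word list, one word per line (assumed to be supplied by the user)
score_ocr() {
  # Split the text into alphabetic words, one per line, dropping empty lines.
  words=$(tr -cs '[:alpha:]' '\n' < "$1" | grep -v '^$' || true)
  # Total number of words found in the OCR output.
  total=$(printf '%s\n' "$words" | grep -c . || true)
  # Words that match a whole word-list entry, ignoring case.
  matched=$(printf '%s\n' "$words" | grep -Fxic -f "$2" || true)
  echo "$matched $total"
}

# Example use, assuming tesseract wrote out_lat.txt and out_divide.txt
# and that words.txt is the user's word list:
#   score_ocr out_lat.txt words.txt
#   score_ocr out_divide.txt words.txt
```

A higher matched-to-total ratio suggests fewer broken words, though names, numbers and unusual terms will count as misses.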