OCR of typewriter copy

Posted: 2014-09-02T08:51:29-07:00
by rleir
This is my first post here, so hi everyone.

I have been using ImageMagick to pre-process images in preparation for OCR. The input is typewriter copy, which is a problem because some characters were typed with less force and so are considerably lighter than others. The other problem is that the white background varies in brightness across the image (I have no control over the scanning process). The OCR output was crummy.

Then I explored a bit, and found this useful line:
$ convert 1345.jpg -colorspace gray \( +clone -blur 0x20 \) -compose Divide_Src -composite 1345photocopy1.jpg
# ref http://www.imagemagick.org/Usage/compose/#divide
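
For readers wondering why this helps: as far as I understand the Divide_Src compose method here, each gray pixel of the original is divided by the matching pixel of the heavily blurred clone, which acts as an estimate of the local paper brightness. The sketch below only illustrates that arithmetic on a 0..255 scale. The helper name divide_pixel and the pixel values are made up, and ImageMagick's internal scaling may differ.

```shell
#!/bin/sh
# Illustration only: per-pixel arithmetic of "original / blurred background"
# on a 0..255 gray scale. divide_pixel is a hypothetical helper, not an
# ImageMagick command, and the pixel values below are invented.
divide_pixel() {
    # $1 = original gray value, $2 = blurred (background estimate) value
    awk -v p="$1" -v b="$2" 'BEGIN {
        if (b < 1) b = 1          # guard against division by zero
        v = 255 * p / b
        if (v > 255) v = 255      # clip to white
        printf "%d\n", v
    }'
}

divide_pixel 240 240   # bright paper -> 255
divide_pixel 180 180   # dim paper    -> 255
divide_pixel 60 200    # typed stroke -> 76
```

Bright and dim paper both come out white, while a typed stroke stays dark relative to its surroundings, which is presumably why the uneven background stops confusing the OCR.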

After this, the OCR output is quite good, thanks! But it is not perfect. Are there other things I could do in ImageMagick to improve this?

Re: OCR of typewriter copy

Posted: 2014-09-02T10:00:09-07:00
by fmw42
If you are on Linux, Mac OS X, or Windows with Cygwin, you can try my script textcleaner at the link below.

textcleaner -g -e normalize -f 15 -o 10 -t 35 snippetFrom1345.png snippetFrom1345_g_norm_f15_o10_35.png

Otherwise, use the ImageMagick -lat (local adaptive threshold) operator:

convert snippetFrom1345.png -negate -lat 15x15+10% -negate snippetFrom1345_lat_15x15_10.png

Adjust the 15x15 window size as desired.
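
One way to try several window sizes is a small loop around the same command. This is only a sketch: the function name lat_sweep, the DRY_RUN switch, and the sizes 11, 15, and 25 are assumptions chosen for illustration, and the output naming simply follows the pattern used above. By default it prints the commands instead of running them.

```shell
#!/bin/sh
# Sweep the -lat window size. lat_sweep and DRY_RUN are hypothetical names.
# DRY_RUN=1 (the default) only prints each command; DRY_RUN=0 runs convert.
lat_sweep() {
    input=$1
    offset=$2
    shift 2
    base=${input%.*}                      # strip the file extension
    for size in "$@"; do
        out="${base}_lat_${size}x${size}_${offset}.png"
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "convert $input -negate -lat ${size}x${size}+${offset}% -negate $out"
        else
            convert "$input" -negate -lat "${size}x${size}+${offset}%" -negate "$out"
        fi
    done
}

# Example: offset 10%, window sizes 11, 15 and 25 (arbitrary choices).
lat_sweep snippetFrom1345.png 10 11 15 25
```

Each result can then be fed to the OCR engine to see which window size works best for a given scan.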

Re: OCR of typewriter copy

Posted: 2014-09-03T08:26:34-07:00
by rleir
Thanks, the -lat command line gives better results than the divide line. Both work better than the old way, where I just brightened the input by 50%. I have not tried the textcleaner line yet but would like to.

But now I am using more CPU time for 'convert' than for 'tesseract'. Yikes!

Re: OCR of typewriter copy

Posted: 2014-09-03T09:00:03-07:00
by rleir
Oops, sorry, hold that last post. The -lat method gives more words in the OCR output, but some of them are incorrect or broken up. The divide method gives almost correct text. I should try other values for 15x15.
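
A raw word count can be misleading when words get broken up, so a rough sanity check is to look at the mean word length as well. The helper below is a made-up heuristic, not an established OCR quality metric: word_stats is a hypothetical name, and a drop in mean word length is only a hint that words are being split.

```shell
#!/bin/sh
# word_stats (hypothetical helper): read text on stdin and print
# "<word count> <mean word length>". Shorter mean length at a similar or
# higher word count may hint at broken-up words; this is only a heuristic.
word_stats() {
    awk '{ for (i = 1; i <= NF; i++) { n++; len += length($i) } }
         END { if (n) printf "%d %.2f\n", n, len / n; else print "0 0.00" }'
}

printf 'the quick brown fox\n' | word_stats   # -> 4 4.00
printf 'th e qu ick\n' | word_stats           # -> 4 2.00

# Assumed usage with real OCR output (requires tesseract to be installed):
#   tesseract 1345photocopy1.jpg stdout | word_stats
```

Comparing these two numbers for the divide output and for each -lat variant gives a quick first impression before reading the text by eye.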