Help Improving Text Of Scanned Image 4 OCR

lindylex
Posts: 23
Joined: 2014-02-27T16:36:22-07:00
Authentication code: 6789

Post by lindylex »

I have a PDF page that contains a scanned image. I would like to convert the page to an image and extract the text.

This is the PDF page: http://mo-de.net/d/out.pdf

This is what I have tried.

Convert the pdf page to an image.

Code: Select all

convert -density 200 -antialias -sharpen 0x3.0 -colorspace GRAY out.pdf t5.png
I use the following to replace the gray at the bottom with solid white.

Code: Select all

convert -fuzz 30% -fill "#ffffff" -opaque "#f2f2f2" t5.png t6.png
Convert the image to text.

Code: Select all

tesseract t6.png o5 -l eng
My second question is: how can I pipe the two convert commands together?
My FAILED attempt:

Code: Select all

convert -density 200 -antialias -sharpen 0x3.0 -colorspace GRAY out.pdf - | convert -fuzz 30% -fill "#ffffff" -opaque "#f2f2f2"  - - t5.png
fmw42
Posts: 25562
Joined: 2007-07-02T17:14:51-07:00
Authentication code: 1152
Location: Sunnyvale, California, USA

Re: Help Improving Text Of Scanned Image 4 OCR

Post by fmw42 »

If you are on Linux/Mac OS or Windows with Cygwin, see my script textcleaner at the link below. Otherwise, see the IM function -lat.
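
For readers without the script, here is a minimal sketch of what a -lat (local adaptive threshold) cleanup could look like. The 25x25 window, the 10% offset, and the file names are illustrative assumptions, not values from this thread, so tune them for your scan. The IM variable and the lat_clean function are names introduced here for convenience (set IM=magick on ImageMagick 7).

```shell
#!/bin/sh
# ImageMagick binary; an assumption of this sketch, override with IM=magick.
IM=${IM:-convert}

# Clean a scanned page with a local adaptive threshold.
# $1 = input image, $2 = output image,
# $3 = window size in pixels (default 25), $4 = offset percent (default 10).
lat_clean() {
    size=${3:-25}
    offset=${4:-10}
    # Negate so text is bright, threshold each pixel against the mean of
    # its surrounding window plus the offset, then negate back to dark text.
    "$IM" "$1" -colorspace Gray -negate \
        -lat "${size}x${size}+${offset}%" -negate "$2"
}

# Run only when the hypothetical input file exists.
if [ -f t5.png ]; then
    lat_clean t5.png t5_lat.png
fi
```

A larger window tolerates more uneven background shading; a larger offset removes more faint noise but can also thin the text.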

Post by lindylex »

It is on Debian.

Post by fmw42 »

That is Unix-based, so my script can work on that OS.

Post by fmw42 »

Before using the script, convert your PDF to a high-resolution raster image such as PNG. Use -density to get high resolution:

Code: Select all

convert -density XX image.pdf image.png
where XX is greater than 72, such as 288 (which is 4x the default of 72). If the resulting image is too big, then do

Code: Select all

convert -density XX image.pdf -resize YY image.png
where YY is 25% or larger when XX=288.

Alternatively, resize after running textcleaner.
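
The relationship between XX and YY is plain arithmetic: a density of 288 is 4x the 72 dpi default, so 25% brings the page back to its default pixel size. A small sketch of that calculation follows; the resize_percent helper and the BASE_DPI variable are hypothetical names, and image.pdf is the placeholder file name from the commands above.

```shell
#!/bin/sh
# Resolution ImageMagick assumes for a PDF when -density is not given.
BASE_DPI=72

# Print the -resize percentage that scales a page rendered at density $1
# back to the pixel size it would have at BASE_DPI (integer division).
resize_percent() {
    echo $(( 100 * BASE_DPI / $1 ))
}

density=288
pct=$(resize_percent "$density")
echo "density ${density} -> resize ${pct}%"

# Run the conversion only when ImageMagick and the placeholder file exist.
if command -v convert >/dev/null 2>&1 && [ -f image.pdf ]; then
    convert -density "$density" image.pdf -resize "${pct}%" image.png
fi
```

Treat the computed value as a lower bound: anything between it and 100% keeps more detail for OCR at the cost of a larger file.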

Post by lindylex »

fmw42, thanks for sharing this. I appreciate your hard work.
I tried three of the commands on your site. This is the best one I have got so far. Any suggestions from looking at the PDF?

Code: Select all

./textcleaner -g -e stretch -f 25 -o 5 -s 1 out.pdf t9.png

Post by fmw42 »

You did not use my suggestion of converting the PDF to PNG with -density before processing. Try this:

Code: Select all

convert -density 288 out.pdf out1.png
textcleaner -g -e stretch -f 50 -o 10 -s 1 out1.png out1_f50_o10.png
convert out1_f50_o10.png -resize 25% out1_f50_o10_r25.png
or

Code: Select all

convert -density 288 out.pdf miff:- |\
   textcleaner -g -e stretch -f 50 -o 10 -s 1 - miff:- |\
   convert - -resize 25% out1_f50_o10_r25.png

Adjust the 25% as desired for the final size. The density 288 makes the out1.png about 4 times larger (higher quality). Add any other arguments you want to the textcleaner.
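
The same miff:- piping also covers the earlier question about chaining two convert commands. The failed attempt in the first post names - twice in the second command, so that command tries to read standard input twice. The sketch below is one way it could be written, assuming the file names from the first post; the function names and the IM variable are made up for illustration. Operators are placed after the input file they act on.

```shell
#!/bin/sh
# ImageMagick binary; an assumption of this sketch, override with IM=magick.
IM=${IM:-convert}

# Render the PDF, then whiten the light gray background, passing the
# intermediate image through a pipe in MIFF format.
# $1 = input PDF, $2 = output image.
pdf_to_clean_png() {
    "$IM" -density 200 "$1" -colorspace Gray -sharpen 0x3.0 miff:- |
        "$IM" miff:- -fuzz 30% -fill "#ffffff" -opaque "#f2f2f2" "$2"
}

# The same work as a single command, with no pipe at all.
pdf_to_clean_png_single() {
    "$IM" -density 200 "$1" -colorspace Gray -sharpen 0x3.0 \
        -fuzz 30% -fill "#ffffff" -opaque "#f2f2f2" "$2"
}

# Run only when the PDF from the first post is present.
if [ -f out.pdf ]; then
    pdf_to_clean_png out.pdf t6.png
fi
```

The single-command form is usually preferable because nothing has to be encoded and decoded between the two steps; the piped form is mainly useful when another program, such as textcleaner, sits in the middle.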

If you separate the two pages, you can use -deskew 40 to unrotate them so the lines are more even. If the pages are split before textcleaner, then use -u in textcleaner to do the unrotate.
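
A sketch of the splitting step, assuming the two pages sit side by side with the gutter at the horizontal center of out1.png. The function name, the output prefix, and the IM variable are hypothetical. -deskew takes a threshold, written here as 40%, the value the ImageMagick documentation suggests for most images.

```shell
#!/bin/sh
# ImageMagick binary; an assumption of this sketch, override with IM=magick.
IM=${IM:-convert}

# Split a two-page spread into left and right halves, then deskew each half.
# $1 = input raster image, $2 = output prefix (writes PREFIX_0.png, PREFIX_1.png).
split_and_deskew() {
    # -crop 50%x100% with no offset cuts the image into two tiles;
    # +repage drops the leftover canvas offsets before deskewing.
    "$IM" "$1" -crop 50%x100% +repage -deskew 40% "${2}_%d.png"
}

# Run only when the high-resolution PNG from the earlier step exists.
if [ -f out1.png ]; then
    split_and_deskew out1.png page
fi
```

If the gutter is not centered, replace the percentage crop with explicit geometry for each page.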