Help Improving Text Of Scanned Image 4 OCR

Posted: 2014-05-28T10:28:03-07:00
by lindylex
I have an image-only PDF page. I would like to convert the page to an image and extract the text.

This is the pdf page. http://mo-de.net/d/out.pdf

This is what I have tried.

Convert the pdf page to an image.

Code: Select all

convert -density 200 -antialias -sharpen 0x3.0 -colorspace GRAY out.pdf t5.png
I use the following to replace the gray at the bottom with solid white.

Code: Select all

convert -fuzz 30% -fill "#ffffff" -opaque "#f2f2f2" t5.png t6.png
Convert the image to text.

Code: Select all

tesseract t6.png o5 -l eng
My second question: how can I pipe the two convert commands together?
My failed attempt:

Code: Select all

convert -density 200 -antialias -sharpen 0x3.0 -colorspace GRAY out.pdf - | convert -fuzz 30% -fill "#ffffff" -opaque "#f2f2f2"  - - t5.png

Re: Help Improving Text Of Scanned Image 4 OCR

Posted: 2014-05-28T11:04:28-07:00
by fmw42
If you are on Linux/Mac OS or Windows with Cygwin, see my script textcleaner at the link below. Otherwise, see the IM function -lat.
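
As a rough illustration of the -lat route (local adaptive thresholding), the sketch below wraps one common form of the command in a shell function. The function name lat_clean, the file names, the 25x25 window and the 10% offset are assumptions for illustration, not values from this thread; they would need tuning for a given scan.

```shell
#!/bin/sh
# Sketch only: binarize a scanned page with ImageMagick's -lat
# (local adaptive threshold). The window size (25x25) and offset (+10%)
# are assumed starting values, not settings taken from this thread.
lat_clean() {
    # $1 = input raster image, $2 = output image
    # The image is negated before and after -lat so that dark text on a
    # light page ends up as black text on a white background.
    convert "$1" -colorspace Gray -negate -lat 25x25+10% -negate "$2"
}

# Example call with hypothetical file names; runs only if the input exists.
if [ -f t5.png ]; then
    lat_clean t5.png t5_lat.png
fi
```

A larger window follows slow changes in background shading, and a larger offset drops more faint background noise at the risk of thinning the text.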

Re: Help Improving Text Of Scanned Image 4 OCR

Posted: 2014-05-28T11:05:14-07:00
by lindylex
It is on Debian.

Re: Help Improving Text Of Scanned Image 4 OCR

Posted: 2014-05-28T11:06:37-07:00
by fmw42
That is Unix, correct? Then my script can work on that OS.

Re: Help Improving Text Of Scanned Image 4 OCR

Posted: 2014-05-28T11:10:30-07:00
by fmw42
Before using the script, convert your PDF to a high-resolution raster image such as PNG. Use -density to get high resolution:

Code: Select all

convert -density XX image.pdf image.png
where XX is greater than 72, such as 288 (which is 4x the default of 72). If the resulting image is too big, then do

Code: Select all

convert -density XX image.pdf -resize YY image.png
where YY is 25% or larger when XX is 288.

Alternatively, resize after running textcleaner.
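
To make the density and resize arithmetic concrete, here is a small sketch that estimates the raster size from a page size in points. The function name pixels_at_density and the US Letter page size (612 x 792 points) are assumptions for illustration; the actual page size of out.pdf is not stated in this thread, and ImageMagick's own rounding may differ by a pixel.

```shell
#!/bin/sh
# Sketch only: estimate raster size from a page size in points
# (1 point = 1/72 inch) and a -density value in pixels per inch.
pixels_at_density() {
    # $1 = length in points, $2 = density; integer arithmetic only
    echo $(( $1 * $2 / 72 ))
}

# Assumed example page: US Letter, 612 x 792 points.
w=$(pixels_at_density 612 288)
h=$(pixels_at_density 792 288)
echo "at -density 288: ${w}x${h}"
echo "after -resize 25%: $(( w / 4 ))x$(( h / 4 ))"
```

For that assumed page, density 288 gives 2448x3168 pixels, and a 25% resize brings it back to 612x792, the same pixel size as rendering at density 72 but downsampled from the higher-resolution render.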

Re: Help Improving Text Of Scanned Image 4 OCR

Posted: 2014-05-28T11:19:02-07:00
by lindylex
fmw42, thanks for sharing this. I appreciate your hard work.
I tried three of the commands on your site. This is the best one I have so far. Any suggestions from looking at the PDF?

Code: Select all

./textcleaner -g -e stretch -f 25 -o 5 -s 1 out.pdf t9.png

Re: Help Improving Text Of Scanned Image 4 OCR

Posted: 2014-05-28T14:50:25-07:00
by fmw42
You did not follow my suggestion to convert the PDF to PNG with -density before processing. Try this:

Code: Select all

convert -density 288 out.pdf out1.png
textcleaner -g -e stretch -f 50 -o 10 -s 1 out1.png out1_f50_o10.png
convert out1_f50_o10.png -resize 25% out1_f50_o10_r25.png
or

Code: Select all

convert -density 288 out.pdf miff:- |\
   textcleaner -g -e stretch -f 50 -o 10 -s 1 - miff:- |\
   convert - -resize 25% out1_f50_o10_r25.png

Adjust the 25% as desired for the final size. The density of 288 makes out1.png about 4 times larger in each dimension (higher quality). Add any other arguments you want to textcleaner.
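
The same miff:- technique can be applied to the two convert commands from the first post. This is a minimal sketch, not a tested recipe for this particular PDF: the function name pdf_to_clean_png is hypothetical, the settings are copied from the first post, and the image operators are placed after the input file name, which is the order ImageMagick expects.

```shell
#!/bin/sh
# Sketch only: chain two convert calls, passing the image as MIFF
# on stdout/stdin. Settings are those from the first post.
pdf_to_clean_png() {
    # $1 = input PDF, $2 = output PNG
    # The first convert rasterizes the PDF and writes MIFF to stdout (miff:-).
    # The second convert reads it from stdin (-) and names exactly one
    # output file.
    convert -density 200 -antialias "$1" -sharpen 0x3.0 -colorspace Gray miff:- |
        convert - -fuzz 30% -fill "#ffffff" -opaque "#f2f2f2" "$2"
}

# Example call with the file names used in this thread; runs only if
# the PDF is present.
if [ -f out.pdf ]; then
    pdf_to_clean_png out.pdf t6.png
fi
```

Since convert applies operators in sequence, both steps could also go into one convert command with no pipe at all; the pipe form is mainly useful when a separate program sits between the two steps.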

If you separate the two pages, you can use -deskew 40 to unrotate them so the lines are more even. If the pages are split before textcleaner, then use -u in textcleaner to do the unrotate.
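
A rough sketch of the split-then-deskew idea follows. It assumes the scan holds two pages side by side with the gutter at the horizontal midpoint, which may not hold for this PDF; the function name split_and_deskew and the output prefix are hypothetical. The threshold is written as 40%, the form used in the ImageMagick documentation.

```shell
#!/bin/sh
# Sketch only: split a two-page scan down the middle, then deskew each half.
# Assumes the two pages sit side by side and meet at the midpoint.
split_and_deskew() {
    # $1 = input raster image, $2 = output name prefix
    # -crop 50%x100% cuts the image into left and right halves;
    # +repage discards the virtual canvas offsets left over from the crop;
    # -deskew 40% straightens each half;
    # %d in the output name numbers the resulting files from 0.
    convert "$1" -crop 50%x100% +repage -deskew 40% "${2}_%d.png"
}

# Example call with hypothetical file names; runs only if the input exists.
if [ -f out1.png ]; then
    split_and_deskew out1.png page
fi
```

Each resulting half can then be passed to textcleaner separately, or the split can be done first and the unrotate left to textcleaner's -u option as described above.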