Image processing for better OCR results

Post by **imagemagicfan** » 2014-10-14T08:26:54-07:00

Hello dear members,

I am very new to image processing and have a question about cleaning up images before OCR. I am working with scanned mortgage documents, mostly TIFF and PDF images. Tesseract OCR is not giving the expected output because most of the images have noise, punch holes, broken letters, etc.

I am developing a module that will run in a Windows environment only. I have gone through Fred's TextCleaner script, but I don't want to use Cygwin to run it.

I need to follow these steps:
1. Find out whether an image really needs cleanup. Passing every image through preprocessing is not a good option, because it adds processing time and can distort the letters.
2. If the image really needs processing, clean it up.
3. Pass the image to OCR.
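
A minimal sketch of how step 1 might be approached, assuming the page has already been binarized into rows of 0/1 values (1 = dark pixel). The function names, the isolated-pixel heuristic, and the 5% threshold are all illustrative assumptions and not an established test; the idea is only that a page where many dark pixels have no dark neighbour is probably speckled and worth cleaning up.

```python
from typing import Sequence


def speckle_ratio(pixels: Sequence[Sequence[int]]) -> float:
    """Return the fraction of dark pixels (value 1) with no dark 8-neighbour.

    Hypothetical noise estimate: isolated dark pixels are treated as speckle.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    dark = 0
    isolated = 0
    for y in range(height):
        for x in range(width):
            if not pixels[y][x]:
                continue
            dark += 1
            has_neighbour = False
            # Look at the 8 surrounding pixels, staying inside the image.
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width and pixels[ny][nx]:
                        has_neighbour = True
            if not has_neighbour:
                isolated += 1
    return isolated / dark if dark else 0.0


def needs_cleanup(pixels: Sequence[Sequence[int]], threshold: float = 0.05) -> bool:
    """Decide whether to preprocess; the threshold is an assumed starting value."""
    return speckle_ratio(pixels) > threshold
```

Images for which `needs_cleanup` returns False could go straight to OCR, so only the noisy ones pay the preprocessing cost. The threshold would need tuning against a sample of real scans, and this check says nothing about skew, punch holes, or orientation.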

The following basic functions are required for image cleanup:
1. Image scaling
2. Image cropping at the text region
3. Image clipping
4. Image rotation
5. Line straightening
6. Noise removal
7. Enhance local contrast
8. Autodetection of page orientation (90, 180, and 270 degrees)
9. Automated image de-skewing
10. Image despeckling
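
Several of these operations correspond to standard ImageMagick options (`-resize`, `-rotate`, `-deskew`, `-despeckle`, `-trim`), so one possible approach on Windows, without Cygwin, is to build the command line in a small script and call the ImageMagick executable directly. The sketch below is an assumption-laden starting point, not a tuned recipe: the function name, defaults, and operation order are illustrative, the executable name `magick` assumes ImageMagick 7 (older installs use `convert`, which can clash with the Windows system `convert.exe`), and `-normalize` is only a global contrast stretch, not the local enhancement of item 7. Orientation detection (item 8) is not covered.

```python
import subprocess
from typing import List


def build_cleanup_command(src: str, dst: str, *, scale_percent: int = 200,
                          deskew: bool = True, despeckle: bool = True,
                          trim: bool = True, rotate_degrees: int = 0,
                          executable: str = "magick") -> List[str]:
    """Build an ImageMagick argument list for basic pre-OCR cleanup.

    All defaults are assumed values to be tuned on real scans.
    """
    args = [executable, src, "-colorspace", "Gray"]
    if rotate_degrees:
        args += ["-rotate", str(rotate_degrees)]   # fixed rotation (90/180/270)
    if deskew:
        args += ["-deskew", "40%"]                 # automatic de-skew
    if despeckle:
        args += ["-despeckle"]                     # reduce speckle noise
    args += ["-normalize"]                         # global contrast stretch
    if trim:
        args += ["-fuzz", "10%", "-trim", "+repage"]  # crop uniform borders
    if scale_percent != 100:
        args += ["-resize", f"{scale_percent}%"]   # upscale small text
    args.append(dst)
    return args


def run_cleanup(src: str, dst: str, **options) -> None:
    """Run the command; requires ImageMagick to be installed and on PATH."""
    subprocess.run(build_cleanup_command(src, dst, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems on Windows, for example the `%` characters that need escaping inside batch files. A call such as `run_cleanup("scan.tif", "clean.tif", rotate_degrees=90)` would then process one image; the file names here are placeholders.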

I have gone through threads in the ImageMagick forum as well as other links found through Google, but I could not find a proper answer that gives me command lines I can run as a basic operation on all images, i.e. detect, then clean up. Please help me with this. Thanks in advance.