Removing noise from scanned text document

Posted: 2014-11-20T20:57:08-07:00
by nilambara
Hello,

I have a scan that ended up with quite a bit of noise. I've looked through what I could and I haven't yet been able to clean it up enough to OCR well. Maybe someone here has some good ideas?

http://i.imgur.com/Im9luhE.png

Thanks.

Re: Removing noise from scanned text document

Posted: 2014-11-20T21:05:30-07:00
by fmw42
Your noise spots are too thick: they are nearly as thick as the strokes of your characters, so the two will be hard to tell apart. Do you have a grayscale scan from before the conversion (thresholding) to black and white? That may be easier to clean.

Re: Removing noise from scanned text document

Posted: 2014-11-20T21:09:41-07:00
by nilambara
A good reminder. :) I had run the scan through Scan Tailor and forgot that the original is indeed quite different. Maybe there's hope with this:

http://i.imgur.com/jx8MPc9.png

Re: Removing noise from scanned text document

Posted: 2014-11-20T21:46:14-07:00
by snibgo
This looks like a grayscale image that has been dithered to black and white. It is better to start from the grayscale image.

Re: Removing noise from scanned text document

Posted: 2014-11-20T22:09:40-07:00
by fmw42
I have tried everything I can think of to no avail, including: -morphology close, -kuwahara, -enhance, -despeckle and my script, isonoise. Sorry, you really need to scan it better if you can. Can you take the page out to lay it flat so you don't get the dark shadow on the left edge?

Perhaps someone else might have some better ideas.

Re: Removing noise from scanned text document

Posted: 2014-11-21T07:03:46-07:00
by jaffamuffin
http://imgur.com/lcMhTp8

I ran it through the 'halftone filter' in the ISIS standard image processing toolkit. You really need better scans, though, or a greyscale image that could be thresholded using an adaptive thresholding technique.

Re: Removing noise from scanned text document

Posted: 2014-11-21T16:32:15-07:00
by fmw42
Here is an approach that mostly seems to work. But your noise is a bit large. This is just an outline and would work better on an image without such large noise spots. It is based upon the use of the new -connected-components function on a binary image.

Input:
Image


First I run the edge preserving smoothing filter, kuwahara, to smooth out the black area on the left but leave the text sharp.

Code:

convert jx8MPc9.png -kuwahara 2 jx8MPc9_kuw2.png
Image


Next, I use -lat to remove the smoothed dark area on the left. It needs the -negate, because it only works for white on a black background.

Code:

convert jx8MPc9_kuw2.png -negate -lat 10x10+10% -negate jx8MPc9_kuw2_lat10.png
Image


Next, I use a thresholded -connected-components to get a baseline image.

Code:

convert jx8MPc9_kuw2_lat10.png -connected-components 4 -threshold 0 -negate jx8MPc9_kuw2_lat10_cc.png
Image

Then I repeat but filter out regions that are below 30 pixels in area.

Code:

convert jx8MPc9_kuw2_lat10.png -define connected-components:area-threshold=30 \
-connected-components 4 -threshold 0 -negate jx8MPc9_kuw2_lat10_cc30.png
Image
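
One way to pick the area threshold is to look at the region areas before filtering. The following is only a sketch, assuming an ImageMagick build that accepts -define connected-components:verbose=true, which lists one line per region with its id, bounding box, centroid, area and mean colour. The helper name small_regions and the assumed layout of the listing are illustrative, not something from the steps above.

```shell
# Hypothetical helper: read the verbose connected-components listing on stdin
# and count the regions whose area is below a cutoff (default 30).
# Assumed line layout: "  12: 5x4+101+37 103.2,38.6 17 srgb(0,0,0)"
# i.e. id, bounding box, centroid, area, mean colour.
small_regions() {
    cutoff="${1:-30}"
    awk -v cutoff="$cutoff" '
        # Only lines that start with "<id>:" and carry a numeric area field.
        $1 ~ /^[0-9]+:$/ && $4 ~ /^[0-9]+$/ {
            total++
            if ($4 + 0 < cutoff) small++
        }
        END { printf "%d of %d regions below %d pixels\n", small, total, cutoff }
    '
}

# Usage (needs ImageMagick; shown as a comment so the block has no side effects):
#   convert jx8MPc9_kuw2_lat10.png \
#       -define connected-components:verbose=true \
#       -connected-components 4 null: | small_regions 30
```

If many regions fall just under the cutoff, they may well be punctuation or dots on letters rather than noise, so it is worth checking a few bounding boxes by eye before raising the value.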

Then I get the difference (minus) between the two, which shows only the pixels that have been removed and have gone from black to white.

Code:

convert jx8MPc9_kuw2_lat10_cc.png jx8MPc9_kuw2_lat10_cc30.png -compose minus -composite \
jx8MPc9_kuw2_lat10_cc_cc30_diff.png
Image

Finally I use the difference image as a mask with a white image, to remove those changed pixels from the lat result.

Code:

convert jx8MPc9_kuw2_lat10.png \
\( -clone 0 -fill white -colorize 100% \) \
jx8MPc9_kuw2_lat10_cc_cc30_diff.png \
-compose over -composite jx8MPc9_kuw2_lat10_cc_diff_composite.png
Image


There are still some large noise spots left and a few characters have lost parts. So I suspect it would have worked better if the noise were not so large in area and were significantly smaller than any text, so that the area threshold could be reduced.
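
For anyone who wants to try different area thresholds, the steps above can be chained into one shell function. This is only a sketch under a few assumptions: the function name clean_scan, the intermediate file naming and the optional third argument are made up for illustration, and it presumes an ImageMagick version that has both -kuwahara and -connected-components. Setting IM="echo convert" gives a dry run that prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical wrapper around the six steps described above.
# Usage: clean_scan INPUT OUTPUT [AREA]   (AREA defaults to 30 pixels)
clean_scan() {
    in="$1"; out="$2"; area="${3:-30}"
    base="${out%.*}"        # intermediate files share the output's base name
    im="${IM:-convert}"     # left unquoted below so "echo convert" splits into words

    # 1. Edge-preserving smoothing.
    $im "$in" -kuwahara 2 "${base}_kuw.png" || return 1
    # 2. Local adaptive threshold (negated so the text is white on black for -lat).
    $im "${base}_kuw.png" -negate -lat 10x10+10% -negate "${base}_lat.png" || return 1
    # 3. Baseline connected-components image, no area filtering.
    $im "${base}_lat.png" -connected-components 4 -threshold 0 -negate \
        "${base}_cc.png" || return 1
    # 4. Same again, with regions below AREA pixels filtered out.
    $im "${base}_lat.png" -define connected-components:area-threshold="$area" \
        -connected-components 4 -threshold 0 -negate "${base}_cc${area}.png" || return 1
    # 5. Difference of the two: only the pixels that were removed.
    $im "${base}_cc.png" "${base}_cc${area}.png" -compose minus -composite \
        "${base}_diff.png" || return 1
    # 6. Use the difference as a mask to paint those pixels white in the -lat result.
    $im "${base}_lat.png" \( -clone 0 -fill white -colorize 100% \) \
        "${base}_diff.png" -compose over -composite "$out"
}

# Example (dry run):  IM="echo convert" clean_scan jx8MPc9.png cleaned.png 30
```

The intermediate files are kept on purpose so that each stage can be inspected, as in the walkthrough.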


Re: Removing noise from scanned text document

Posted: 2014-11-22T15:47:59-07:00
by nilambara
Thank you so very much everyone. Indeed, a better scan would be ideal though it is not mine and the chances of getting another are slim.

It's less important how it looks than how it OCRs. Surprisingly, the first image I posted gives the best results, though they are far from perfect. Recently, I discovered other noise removal filters like Neat Image, though the results are still not terrific. If anyone has any other ideas, they'd be greatly appreciated. I would so much like to get the document into a format that is usable.

Re: Removing noise from scanned text document

Posted: 2017-09-12T20:45:52-07:00
by ozbigben
Reviving this old one only because it was the basis for my solution. In my case I'm starting with a colour scan, so the bitonal dithering is not an issue. My test images are from an old manuscript (http://cat.lib.unimelb.edu.au/record=b2651611), chosen only to test out the image cleanup. The items to be scanned are much cleaner documents (for the most part).

I'd already gotten to a reasonably clean BW image using:

Code:

convert INPUT.TIF  -colorspace gray -lat 60x60-15%% -depth 2 -compress Group4 OUTPUT.TIF
Negate is not required for -lat if you use a negative offset.

Image

This just leaves some of the smaller noise specks to deal with. I had tried -despeckle and -enhance from other posts but these didn't help in this case. -lat really takes care of most of it although the width selection can be a bit tricky if you want to retain large areas of black. I had a play with connected components thanks to Fred's suggestions here and looked up the other options for it. The process can be simplified using "-define connected-components:mean-color=true" to retain the original colour of the image so my final command line was:

Code:

convert INPUT.TIF  -colorspace gray -lat 60x60-15%% -define connected-components:mean-color=true \
-define connected-components:area-threshold=12  -connected-components 4 \
-depth 2 -compress Group4 OUTPUT.TIF
NB I'm only removing very small noise. Don't want to un-dot "i"s.
Image

This pretty much does exactly what I wanted and is relatively easy to tweak. I found with -lat that when you try larger widths, IM really hogs the CPU, so for safety I throw in a "-limit thread 4" to be able to keep using the computer. For smaller widths, though, it's nice and fast.
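
As a rough sketch of how the final command and the thread limit might fit together in a Unix shell: the function name despeckle_scan and its optional width and area arguments are hypothetical, and placing -limit at the start of the command is just a common convention so that it is in effect before the image is read. In a Unix shell the offset is written with a single %; the doubled %% in the commands above is presumably Windows batch-file escaping.

```shell
# Hypothetical wrapper: despeckle_scan INPUT OUTPUT [WIDTH] [AREA]
# WIDTH is the -lat window size (default 60), AREA the smallest region kept (default 12).
# Set IM="echo convert" for a dry run that only prints the command.
despeckle_scan() {
    in="$1"; out="$2"; width="${3:-60}"; area="${4:-12}"
    im="${IM:-convert}"     # unquoted below so "echo convert" splits into words

    $im -limit thread 4 "$in" -colorspace gray \
        -lat "${width}x${width}-15%" \
        -define connected-components:mean-color=true \
        -define connected-components:area-threshold="$area" \
        -connected-components 4 \
        -depth 2 -compress Group4 "$out"
}

# Example over a folder of scans (needs ImageMagick, so shown as a comment):
#   for f in scans/*.tif; do
#       despeckle_scan "$f" "cleaned/$(basename "$f")"
#   done
```

Keeping the width and area as arguments makes it easy to rerun a difficult page with a smaller window or a more cautious area cutoff without touching the rest of the command.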

Re: Removing noise from scanned text document

Posted: 2017-09-12T23:11:02-07:00
by ozbigben
nilambara wrote: 2014-11-22T15:47:59-07:00 ... Surprisingly, the first image I posted gives the best results though they are far from perfect...
Actually it's not that surprising. The OCR process uses the shape of the characters so the noise is mostly irrelevant unless it connects to the characters or is sufficiently large to be mistaken for an accent, grave etc... The other issue with noise is that the OCR process may recognise additional small letters in the image. The first image has a lot of noise but the filtering has put a white outline around the edges of the letters so the shapes are still OK. The amount of filtering required to remove the noise in this case will inevitably impact on the shape of the letters and thus affect the OCR accuracy.