Remove Background Noise For OCR

inter
Posts: 2
Joined: 2011-05-19T08:43:15-07:00

Remove Background Noise For OCR

Post by inter »

Hello all,

I've been trying to perform OCR on a number of images like the one below, but I'm running into problems because of the noise present throughout the image. The two main issues are the slice of yellow around the number in the upper left corner and the film reel in the background. Both of these greatly reduce the accuracy of my OCR.

I tried my hand at removing the background but had no luck with that. More recently I've been converting the images to grayscale, darkening them, and then turning anything that isn't black to white. I've experimented with a number of different fuzz levels, but none seem to produce particularly good results either.

Unfortunately my experience with ImageMagick is pretty minimal, so I figured I'd post here to see if anyone had any other ideas as to how I could better process these images.

Any help provided is greatly appreciated.

Image
fmw42
Posts: 25562
Joined: 2007-07-02T17:14:51-07:00
Location: Sunnyvale, California, USA

Re: Remove Background Noise For OCR

Post by fmw42 »

You might experiment with looking at each channel of different colorspaces to see whether any one of them makes it easier to isolate the text.

convert image -colorspace ??? -separate image_%d.png
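
A minimal sketch of that suggestion, trying a handful of colorspaces in a loop. The function name separate_channels, the file names, and the particular list of colorspaces are assumptions for illustration; the thread leaves the colorspace open (the ??? above).

```shell
#!/bin/sh
# Write one grayscale image per channel for several colorspaces, so the
# channels can be compared by eye for the best text/background separation.
# separate_channels and the colorspace list are illustrative choices.
separate_channels() {
    in="$1"       # input image
    outdir="$2"   # directory that receives the channel images
    mkdir -p "$outdir"
    for cs in HSL Lab YCbCr CMYK; do
        # -separate splits the image into its channels; %d in the output
        # name is replaced by the channel index (0, 1, 2, ...).
        convert "$in" -colorspace "$cs" -separate "$outdir/${cs}_%d.png"
    done
}

# Usage (hypothetical file names):
#   separate_channels input.png channels
```

Each output file is named after its colorspace and channel index, for example channels/HSL_2.png for the third HSL channel.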
anthony
Posts: 8883
Joined: 2004-05-31T19:27:03-07:00
Location: Brisbane, Australia

Re: Remove Background Noise For OCR

Post by anthony »

The problem here is low-level pixel noise and a lack of contrast.

For the former you can try any of the noise-reducing methods, such as -median or even -morphology smooth.

The contrast is basically a matter of color adjustment. Most OCR software seems to like contrast increased to the point of extreme thresholding, so that each pixel is either black or white.

In either case you will probably need to first crop out the individual areas of text you are interested in. OCR software would probably have a lot of trouble with such a disordered collection of text as it stands.


Basically... Simplify Simplify Simplify
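
The steps above (crop, reduce noise, push the contrast to pure black and white) can be sketched roughly as follows. This is only one possible combination: the function name clean_region, the crop geometry, the 3x3 median window, and the 50% threshold are all assumptions that would need tuning per image. -statistic Median is used here as one way to apply a median filter.

```shell
#!/bin/sh
# Crop one text region, smooth out pixel noise, and threshold to black/white.
# clean_region and all numeric values are illustrative, not from the thread.
clean_region() {
    in="$1"       # input image
    region="$2"   # crop geometry, WIDTHxHEIGHT+X+Y
    out="$3"      # output image
    convert "$in" \
        -crop "$region" +repage \
        -colorspace Gray \
        -statistic Median 3x3 \
        -normalize \
        -threshold 50% \
        "$out"
}

# Usage (hypothetical file name and region):
#   clean_region input.png 200x40+10+10 title.png
```

Cropping first keeps the later steps from being skewed by unrelated parts of the image, and -normalize stretches the remaining contrast before the hard threshold is applied.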
Anthony Thyssen -- Webmaster for ImageMagick Example Pages
https://imagemagick.org/Usage/
anthony
Posts: 8883
Joined: 2004-05-31T19:27:03-07:00
Location: Brisbane, Australia

Re: Remove Background Noise For OCR

Post by anthony »

Another alternative is to avoid the images entirely.

If this is a 'live' feed that you are working with, then the "Movies on Demand" website may have the information in plain text, or perhaps in HTML that needs only minimal text processing to extract the desired information!

OCR is hard. Text from web sites is easy!
inter
Posts: 2
Joined: 2011-05-19T08:43:15-07:00

Re: Remove Background Noise For OCR

Post by inter »

anthony wrote:
"Another alternative is to avoid the images entirely.

If this is a 'live' feed that you are working with, then the "Movies on Demand" website may have the information in plain text, or perhaps in HTML that needs only minimal text processing to extract the desired information!

OCR is hard. Text from web sites is easy!"

This was the first route I tried (and desperately hoped) to take, but it unfortunately didn't produce the same results.

Thank you both for your suggestions. I'll play around with them and report back with my results. :)
HugoRune
Posts: 90
Joined: 2009-03-11T02:45:12-07:00

Re: Remove Background Noise For OCR

Post by HugoRune »

You could generate two images: one specifically for recognizing white-on-dark text and one for recognizing dark-on-white text.

convert Cb3R5.png -normalize ( -clone 0 -blur 5 ) -compose minus -composite -normalize cbblack.png
convert Cb3R5.png -normalize ( -clone 0 -blur 5 ) +swap -compose minus -composite -normalize cbwhite.png

Image
Image

The text still looks very hard to recognize without errors, though.

Alternatively:

If the images you want to process all have the same background, without moving parts, then you can extract that background and remove it fairly easily: http://www.imagemagick.org/Usage/masking/#known_bgnd

This would also work if there are a few possible backgrounds, or if the background is an animation with several frames. You just have to repeat the process for each possible background.
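
As a rough illustration of the known-background idea, here is a simpler difference-and-threshold variant. It is not the exact method from the linked page, and it assumes you already have a clean copy of the background at the same size as the frame. The function name remove_known_background, the file names, and the 10% default threshold are assumptions.

```shell
#!/bin/sh
# Keep only the pixels that differ from a known background image.
# remove_known_background and the default threshold are illustrative.
remove_known_background() {
    frame="$1"            # image containing text over the background
    background="$2"       # clean copy of the background, same size
    out="$3"              # output: changed pixels black, the rest white
    thresh="${4:-10%}"    # how different a pixel must be to count as changed
    convert "$frame" "$background" \
        -compose Difference -composite \
        -colorspace Gray \
        -threshold "$thresh" \
        -negate \
        "$out"
}

# Usage (hypothetical file names):
#   remove_known_background frame.png background.png text_only.png 15%
```

With several possible backgrounds, the same function could be run once per background and the cleanest result kept.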
anthony
Posts: 8883
Joined: 2004-05-31T19:27:03-07:00
Location: Brisbane, Australia

Re: Remove Background Noise For OCR

Post by anthony »

At the bottom of the same page on masking is a more advanced form of background removal that also recovers the anti-aliasing.