OCR a scan of a faded microfiche

Posted: 2016-04-02T23:01:27-07:00
by miguellint
Hello...

Would anyone have any suggestions for bringing an image up to OCR level?

The original image is a scan of an extremely faded microfiche.

Here is a snippet of the image. It's a Dropbox link so just X the pop-up asking you to join.

http://tinyurl.com/zdy5p4r

I had to do a colour capture as the microfiche was so faded.

My thinking was that I could use -black-threshold and maybe a bit of blur. Although the result is good enough for the human eye, the OCR wasn't happy: barely a 30% strike rate. In total there are about 500 pages to OCR, so I'd like a better strike rate than that.
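
One way to explore the threshold-plus-blur idea is to write out a few candidate images at different thresholds and see which one the OCR engine likes best. This is only a minimal sketch, assuming ImageMagick's convert is on the PATH; the function name threshold_sweep, the output naming, and the default percentages are illustrative assumptions, not values from this thread.

```shell
#!/usr/bin/env bash
# Sketch only: write one candidate image per -black-threshold value.
# threshold_sweep and the default percentages are hypothetical choices.
threshold_sweep() {
  local src="$1"; shift
  local pct
  # Use the caller's percentages, or fall back to a small default range.
  [ "$#" -gt 0 ] || set -- 30 40 50 60
  for pct in "$@"; do
    # Blur slightly, then force every pixel darker than pct% to black.
    convert "$src" -blur 0x1 -black-threshold "${pct}%" \
      "${src%.*}_bt${pct}.png" || return 1
  done
}
```

Calling `threshold_sweep page.png 40 50` would write page_bt40.png and page_bt50.png next to the original, ready to be fed to the OCR software one by one.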

I've also tried some of Fred's noise removal scripts, but I couldn't confine them to the noise; they ate into the letters as well.

Also looked at Snibgo's monochrome scripts but could not make head nor tail of them :-)

Any suggestions appreciated.

Many thanks
Miguel

Re: OCR a scan of a faded microfiche

Posted: 2016-04-02T23:49:04-07:00
by snibgo
I suggest you manually clean the image, using Gimp or similar, until your OCR software can read it. Then show us that image, and we might be able to suggest how to do the cleaning automatically.

Re: OCR a scan of a faded microfiche

Posted: 2016-04-03T00:30:33-07:00
by miguellint
Hello Snibgo...

Here's one I made earlier :-)

This is what a scanned fiche with a 100% OCR strike rate looks like...

http://tinyurl.com/zc6culq

The original scanned image is of such a reasonable quality that all I need do is deskew and crop the image then make it a bit more "solid" using the following command supplied by Fred...

Code:

convert infile -negate -lat 10x10+2% -negate outfile.png

If needed I can do a despeckle with the following...

Code:

convert infile -morphology close diamond:1 outfile.png

I have the deskew/crop/negate commands in a bash script which will quite happily work away in the background and tidy up 500 images in an hour or two. The OCR strike rate is usually 90%-ish.
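
For readers who want a starting point, a deskew/crop/solidify script of that kind might look roughly like the sketch below. The actual script is not shown in this thread, so everything except the -negate -lat 10x10+2% -negate step is an assumption: the function names tidy_page and tidy_dir, the -deskew 40% setting, and the use of -fuzz 10% -trim for cropping are all illustrative.

```shell
#!/usr/bin/env bash
# Sketch only: deskew, crop, and "solidify" one scan.
# tidy_page is a hypothetical name; -deskew 40% and the -fuzz/-trim crop
# are assumed settings, not the ones used in this thread.
tidy_page() {
  local src="$1"
  local out="${2:-${src%.*}_tidy.png}"
  convert "$src" \
    -deskew 40% \
    -fuzz 10% -trim +repage \
    -negate -lat 10x10+2% -negate \
    "$out"
}

# Apply tidy_page to every PNG in a directory, skipping earlier outputs.
tidy_dir() {
  local dir="${1:-.}"
  local f
  for f in "$dir"/*.png; do
    [ -e "$f" ] || continue              # the glob matched nothing
    case $f in *_tidy.png) continue ;; esac
    tidy_page "$f" || return 1
  done
}
```

Running `tidy_dir scans &` would leave it working in the background; scans is a placeholder directory name.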

The annoying thing is that the fiche I'm currently scanning are really quite faded. Using the scanner's "colour capture" option is the only way to bring out any definition.

Any advice appreciated.

Many thanks
Miguel

Re: OCR a scan of a faded microfiche

Posted: 2016-04-04T19:57:00-07:00
by miguellint
Here's a bash step-by-step that improves the OCR strike rate considerably.

(And here's a Dropbox link to a Before/After image so just X any popups asking you to register.)

Before and After
http://tinyurl.com/hl4qx5k

Code:

# Each pass reads the previous pass's output and appends its own suffix.
# Filenames are quoted so names containing spaces survive.

# Despeckle
for f in *.png
  do
    convert "$f" -morphology close diamond:1 "${f%.*}_dspk.png"
  done

# Floodfill with white, starting from the top-left corner
coordsNW="0,0"
for f in *dspk.png
  do
    convert "$f" -fuzz 20% -fill white -draw "color $coordsNW floodfill" "${f%.*}_ff.png"
  done

# Slight blur
for f in *ff.png
  do
    convert "$f" -blur 0x1 "${f%.*}_blur.png"
  done

# Convert to grayscale
for f in *blur.png
  do
    convert "$f" -type Grayscale "${f%.*}_gray.png"
  done

# Fred's ImageMagick Textcleaner script - BEST SCRIPT EVER :-)
# Use everywhere even when not needed
for f in *gray.png
  do
    textcleaner "$f" "${f%.*}_tc.png"
  done

# Darken/even out text
for f in *tc.png
  do
    convert "$f" -negate -lat 10x10+2% -negate "${f%.*}_dark.png"
  done

# Slight blur again
for f in *dark.png
  do
    convert "$f" -blur 0x1 "${f%.*}_cleaned.png"
  done

# Shorten the final name (Perl-style rename), then delete the intermediates
rename 's/_dspk_ff_blur_gray_tc_dark_cleaned/_cleaned/' *
rm *dspk*.png
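
The same steps can probably be folded into fewer commands per page, which avoids the chain of intermediate files and the final rename. The sketch below is an untested condensation, not the script that produced the Before/After image: clean_page is a hypothetical name, textcleaner is assumed to be on the PATH as in the loops above, and chaining the operators in one convert call is assumed to give roughly the same result as writing each pass to disk.

```shell
#!/usr/bin/env bash
# Sketch only: the seven passes above folded into three commands per page.
# clean_page is a hypothetical name; textcleaner is Fred's script.
clean_page() {
  local src="$1"
  local out="${src%.*}_cleaned.png"
  local tmp
  tmp=$(mktemp -d) || return 1

  # Despeckle, floodfill the background from the top-left corner,
  # blur slightly, and convert to grayscale.
  convert "$src" \
    -morphology close diamond:1 \
    -fuzz 20% -fill white -draw "color 0,0 floodfill" \
    -blur 0x1 -type Grayscale \
    "$tmp/gray.png" &&
  # Fred's textcleaner with its default settings.
  textcleaner "$tmp/gray.png" "$tmp/tc.png" &&
  # Darken/even out the text, then blur slightly again.
  convert "$tmp/tc.png" -negate -lat 10x10+2% -negate -blur 0x1 "$out"
  local status=$?

  rm -rf "$tmp"          # intermediates live only in the temp directory
  return "$status"
}
```

A loop such as `for f in *.png; do clean_page "$f"; done` would then write one _cleaned.png per scan. Run it before any _cleaned.png files exist, or the glob will pick those up too.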

Re: OCR a scan of a faded microfiche

Posted: 2016-05-21T04:26:43-07:00
by atariZen
Please don't use tinyurl. I cannot follow any of your links because tinyurl blocks Tor.

Anyway, without being able to see your links, I'll blindly suggest a tool called "unpaper". When I have to OCR a document, I use imagemagick in combination with unpaper.
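
As a rough illustration of combining the two tools, the sketch below converts a PNG to PGM with ImageMagick, runs unpaper on it with default settings, and converts the result back. This is an assumption about how such a combination could look, not the workflow actually used here: unpaper_png is a hypothetical name, and the PGM round trip is there because PNM formats are the ones unpaper is known to read.

```shell
#!/usr/bin/env bash
# Sketch only: run unpaper on a PNG by round-tripping through PGM.
# unpaper_png is a hypothetical name; unpaper runs with default settings.
unpaper_png() {
  local src="$1"
  local out="${2:-${src%.*}_unpaper.png}"
  local tmp
  tmp=$(mktemp -d) || return 1

  # Grayscale PGM in, cleaned PGM out, then back to PNG.
  convert "$src" -colorspace Gray "$tmp/in.pgm" &&
  unpaper "$tmp/in.pgm" "$tmp/out.pgm" &&
  convert "$tmp/out.pgm" "$out"
  local status=$?

  rm -rf "$tmp"
  return "$status"
}
```

For example, `unpaper_png page.png` would write page_unpaper.png alongside the original.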