Scanned book: removing spots

Post by **Erik** » 2016-04-30T01:37:54-07:00

Version: ImageMagick 6.9.3-7 Q8 x86_64 2016-04-29 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2016 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Features: Cipher DPC Modules OpenMP
Delegates (built-in): bzlib freetype jng jp2 jpeg lcms ltdl lzma png webp wmf xml zlib

Hello,

Following this forum and the manual, I have been able to process a scanned book and get nice BW images.

However, spots remain near the border. These have to be removed before the greedy trimming, because they make the trimming stop too early. I tried my hand at morphology but didn't get it right. (The morphology used in the script below closes little white gaps in the black characters.)

To remove these spots, the algorithm would need to:

1. Find grayscale clusters
2. Delete a grayscale cluster if there is no other grayscale pixel close to it.
3. Here "close" means within a distance of 20 px. This would leave meaningful dots (above the character i, etc.) untouched.

Please refer to the script below. Any help would be appreciated.

Code:

#!/bin/sh

# DEPENDS ON:
# brew install imagemagick --with-jp2 --with-openmp --with-quantum-depth-8
# brew install parallel
# http://www.fmwconcepts.com/imagemagick/autotrim/

mkdir -p w1/

FORMAT=png # output format
BWFUZZ=60 # higher values result in more black and less white area
TRIMFUZZ=80 # higher values result in more greedy trimming
LIMIT="-limit memory 300MB -limit map 600MB"

# To grayscale

find in/ -name "*.jp2" -exec basename {} .jp2 \; | parallel --bar -j 4 convert in/{}.jp2 $LIMIT -strip -flatten -alpha off -colorspace gray -fuzz $BWFUZZ% -fill white +opaque black +repage -morphology Open diamond -format $FORMAT w1/{}.$FORMAT

# Greedy trim

find w1/ -name "*.png" -exec basename {} \; | parallel --bar -j 4 autotrim -t -5 -b 5 -l -5 -r 5 -f $TRIMFUZZ w1/{} w1/{} > /dev/null

# To black-and-white

find w1/ -name "*.png" -exec basename {} \; | parallel --bar -j 4 convert w1/{} $LIMIT -quantize gray +dither -colors 2 -depth 2 +repage w1/{}

Input (JP2 image, may trigger a download in your browser)

http://drive.google.com/uc?export=view& ... DZOYXhWSmM

To Grayscale

http://drive.google.com/uc?export=view& ... 2ZmNHNSTUE

After this step, the dots near the border need to be removed.

To Black-White

http://drive.google.com/uc?export=view& ... lNoYXFSOEU

Post by **fmw42** » 2016-04-30T10:28:37-07:00

Average the image down to one row (or column). Check for non-white near the ends and a large gap of white then dark near the middle. If you find that, then crop to the dark area in the middle. Or just crop to the central dark region.
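
A minimal sketch of this 1D approach in POSIX shell, assuming a grayscale page on a white background. The helper names `column_profile` and `central_band`, the white level of 250, and the minimum gap of 20 columns are illustrative assumptions, not values given in this thread. `column_profile` averages the image down to one row and prints one gray value per column; `central_band` then looks for the widest band of non-white columns, treating any white gap of at least the given width as a separator.

```sh
#!/bin/sh
# Sketch only: function names, the white level (250) and the minimum gap (20)
# are assumptions for illustration, not values given in this thread.

# Print the mean gray value (0-255) of every column, one value per line.
column_profile() {
  w=$(identify -format '%w' "$1") || return 1
  convert "$1" -colorspace Gray -scale "${w}x1!" -depth 8 gray:- |
    od -An -v -tu1 |
    awk '{ for (i = 1; i <= NF; i++) print $i }'
}

# Read one gray value per line and print "start width" of the widest band of
# non-white columns. Gaps narrower than $2 columns do not split a band.
central_band() {
  awk -v white="${1:-250}" -v gap="${2:-20}" '
    function close_band() {
      if (last - start + 1 > bestw) { bestw = last - start + 1; beststart = start }
      inband = 0; blank = 0
    }
    /^[0-9]+$/ {
      col = n++                              # zero-based column index
      if ($1 + 0 < white) {                  # non-white column
        if (!inband) { start = col; inband = 1 }
        last = col; blank = 0
      } else if (inband && ++blank >= gap) { # white gap wide enough to end the band
        close_band()
      }
    }
    END {
      if (inband) close_band()
      if (bestw > 0) print beststart, bestw
    }'
}

# Usage (crop a page to its central band; a crop height of 0 keeps the full height):
#   set -- $(column_profile page.png | central_band 250 20)
#   convert page.png -crop "${2}x0+${1}+0" +repage cropped.png
```

The same idea applies to rows by scaling to a single column instead. Border spots show up as short non-white bands near the ends of the profile, so picking the widest band leaves them outside the crop.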

Post by **snibgo** » 2016-04-30T10:52:13-07:00

As Fred says.

However, black spots might appear in the middle of pages, separated from text. Here is a method to deal with those.

Given black text and spots on white background, one method of removing small isolated spots is to paint white over anything that certainly isn't text.

The first pair of erodes joins characters together, making large black blobs. Instead of a single erode with a two-dimensional shape such as a disk, I use two thin orthogonal rectangles. This is much faster.

Then we dilate in the x-dimension only, to remove blobs that are small in that dimension, and erode by the same amount so the result again entirely covers the blobs that remain.

Windows BAT syntax. Adjust for bash.

Code:

convert ^
  -virtual-pixel White ^
  0317_grayscale.png ^
  -threshold 50%% ^
  +write x0.png ^
  -morphology Erode Rectangle:60x1 ^
  -morphology Erode Rectangle:1x60 ^
  +write x1.png ^
  -morphology Dilate Rectangle:100x1 ^
  +write x2.png ^
  -morphology Erode Rectangle:100x1 ^
  g3.png

Where there was text, g3 is black. Where there were spots, g3 is white. Where there was neither text nor spots, g3 is black or white.

Each intermediate "+write x?.png" is provided so you can see what is happening. They can be removed.

Code:

convert ^
  0317_grayscale.png ^
  g3.png ^
  -compose Lighten -composite ^
  g4.png

The result, g4.png, has any black marks that are not text over-painted with white. It takes about 6 seconds.
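
For bash or another POSIX shell, the two commands above can be adapted roughly as follows: the BAT line continuation `^` becomes `\`, and the doubled `%%` becomes a single `%`. This is a sketch, not a tested replacement for the original. The wrapper function `remove_spots` and its temporary mask file are assumptions added for illustration; the kernel sizes are the ones from the post, and the `+write` debugging steps are left out.

```sh
#!/bin/sh
# Sketch: POSIX-shell adaptation of the two BAT commands above.
# remove_spots and the temporary mask file are illustrative, not from the post.
remove_spots() {
  in=$1
  out=$2
  mask=$(mktemp "${TMPDIR:-/tmp}/mask.XXXXXX") || return 1

  # Step 1: build the mask (g3.png in the post): black over text, white over spots.
  convert \
    -virtual-pixel White \
    "$in" \
    -threshold 50% \
    -morphology Erode Rectangle:60x1 \
    -morphology Erode Rectangle:1x60 \
    -morphology Dilate Rectangle:100x1 \
    -morphology Erode Rectangle:100x1 \
    "png:$mask" || { rm -f "$mask"; return 1; }

  # Step 2: keep the lighter of page and mask, which paints the spots white.
  convert "$in" "png:$mask" -compose Lighten -composite "$out"
  status=$?
  rm -f "$mask"
  return "$status"
}

# Usage:
#   remove_spots 0317_grayscale.png g4.png
```

The 60 px and 100 px kernel sizes depend on scan resolution and text size, so they will probably need tuning for other books.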

Post by **fmw42** » 2016-04-30T12:45:28-07:00

Snibgo,

Terrific method!

Much more direct than looking for gaps in 1D images (via txt:)

Post by **snibgo** » 2016-04-30T13:09:45-07:00

It's a cute method. It can be done with a single "convert" and no scripting. It won't remove horizontal black marks at top and bottom of scans because they resemble black text, so the 1D gap-method is still useful for that.
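
One possible way to fold both steps into a single command is to build the mask on a clone of the page inside parentheses and composite it straight away. This is a sketch in POSIX shell syntax and not necessarily the exact form intended here; the function name `remove_spots_single` is an assumption made for illustration.

```sh
#!/bin/sh
# Sketch: build the mask on a clone and composite it in one convert call.
# remove_spots_single is an illustrative name, not from the thread.
remove_spots_single() {
  convert \
    -virtual-pixel White \
    "$1" \
    \( +clone \
       -threshold 50% \
       -morphology Erode Rectangle:60x1 \
       -morphology Erode Rectangle:1x60 \
       -morphology Dilate Rectangle:100x1 \
       -morphology Erode Rectangle:100x1 \
    \) \
    -compose Lighten -composite \
    "$2"
}

# Usage:
#   remove_spots_single 0317_grayscale.png g4.png
```

The parentheses confine the threshold and morphology operations to the clone, so the original grayscale page reaches the composite step unchanged.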

Post by **Erik** » 2016-05-01T04:14:35-07:00

That works well, thank you both for your help

Legacy ImageMagick Discussions Archive
