Scanned book: removing spots

Erik
Posts: 12
Joined: 2016-04-30T00:18:15-07:00
Authentication code: 1151

Post by Erik »

Version: ImageMagick 6.9.3-7 Q8 x86_64 2016-04-29 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2016 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Features: Cipher DPC Modules OpenMP
Delegates (built-in): bzlib freetype jng jp2 jpeg lcms ltdl lzma png webp wmf xml zlib
Hello,

following this forum and the manual, I have been able to process a scanned book and get nice black-and-white images.

However, spots remain near the border. They have to be removed before the greedy trimming, because they cause the trimming to stop too early. I tried my hand at morphology but didn't get it right. (The morphology used in the script below closes little white gaps in the black characters.)

To remove these spots, the algorithm would need to:

1. Find grayscale clusters.
2. Delete a grayscale cluster if there is no other grayscale pixel close to it.
3. "Close" here means within a numeric distance, say 20 px. This would leave meaningful dots (such as the one above the character i) untouched.

Please refer to the script below. Any help would be appreciated.

Code:

#!/bin/sh

# DEPENDS ON:
# brew install imagemagick --with-jp2 --with-openmp --with-quantum-depth-8
# brew install parallel
# http://www.fmwconcepts.com/imagemagick/autotrim/

mkdir -p w1/

FORMAT=png # output format
BWFUZZ=60 # higher values result in more black and less white area
TRIMFUZZ=80 # higher values result in more greedy trimming
LIMIT="-limit memory 300MB -limit map 600MB"

# To grayscale
# ($LIMIT is given before the input file so the limits already apply while it is read)

find in/ -name "*.jp2" -exec basename {} .jp2 \; | parallel --bar -j 4 convert $LIMIT in/{}.jp2 -strip -flatten -alpha off -colorspace gray -fuzz $BWFUZZ% -fill white +opaque black +repage -morphology Open diamond -format $FORMAT w1/{}.$FORMAT

# Greedy trim

find w1/ -name "*.$FORMAT" -exec basename {} \; | parallel --bar -j 4 autotrim -t -5 -b 5 -l -5 -r 5 -f $TRIMFUZZ w1/{} w1/{} > /dev/null

# To black-and-white

find w1/ -name "*.$FORMAT" -exec basename {} \; | parallel --bar -j 4 convert $LIMIT w1/{} -quantize gray +dither -colors 2 -depth 2 +repage w1/{}

Input (JP2 image, may trigger a download in your browser)

http://drive.google.com/uc?export=view& ... DZOYXhWSmM

To Grayscale

http://drive.google.com/uc?export=view& ... 2ZmNHNSTUE

After this step, the dots near the border need to be removed.

To Black-White

http://drive.google.com/uc?export=view& ... lNoYXFSOEU

fmw42
Posts: 26383
Joined: 2007-07-02T17:14:51-07:00
Authentication code: 1152
Location: Sunnyvale, California, USA

Re: Scanned book: removing spots

Post by fmw42 »

Average the image down to one row (or column). Look for non-white values near the ends, followed by a large gap of white and then dark values near the middle. If you find that pattern, crop to the dark area in the middle. Or simply crop to the central dark region.
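
A minimal sketch of the gap search, assuming the averaged row has already been dumped as one gray level per line (0 = black, 255 = white). The function name widest_dark_run, the threshold, and the gap tolerance are illustrative assumptions; they are not part of ImageMagick or of the autotrim script.

```sh
#!/bin/sh
# widest_dark_run THRESHOLD MAX_GAP   (hypothetical helper, not an ImageMagick tool)
# Reads one gray level per line on stdin: one value per column of the
# averaged one-row image, 0 = black, 255 = white.
# Samples below THRESHOLD count as dark. White gaps of at most MAX_GAP
# samples inside a dark run are bridged, so narrow gaps in the text
# block do not split it, while the wide white margin between a border
# spot and the text does.
# Prints "OFFSET LENGTH" (0-based) of the widest dark run, or nothing.
widest_dark_run() {
  awk -v thr="$1" -v maxgap="$2" '
    function close_run() {
      len = last - start + 1
      if (len > best) { best = len; beststart = start }
      inrun = 0
    }
    {
      pos = NR - 1
      if ($1 + 0 < thr) {                  # dark sample
        if (!inrun) { start = pos; inrun = 1 }
        last = pos                         # most recent dark sample
      } else if (inrun && pos - last > maxgap) {
        close_run()                        # white gap too wide: run ends
      }
    }
    END {
      if (inrun) close_run()
      if (best > 0) print beststart, best
    }
  '
}
```

The two numbers could then serve as the x offset and width of a -crop geometry. The profile itself would come from scaling the page to a height of one pixel and dumping it via txt:, but the exact layout of the txt: output depends on the ImageMagick version, so the parsing step is left out here.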

snibgo
Posts: 13034
Joined: 2010-01-23T23:01:33-07:00
Authentication code: 1151
Location: England, UK

Re: Scanned book: removing spots

Post by snibgo »

As Fred says.

However, black spots might appear in the middle of pages, separated from text. Here is a method to deal with those.

Given black text and spots on white background, one method of removing small isolated spots is to paint white over anything that certainly isn't text.

The first pair of erodes joins characters together, making large black blobs. Instead of a single erode with a two-dimensional shape such as a disk, I use two thin orthogonal rectangles. This is much faster.

Then we dilate in the x-dimension only, which removes blobs that are small in that dimension, and erode by the same amount so the result entirely covers the surviving blobs.

Windows BAT syntax. Adjust for bash.

Code:

convert ^
  -virtual-pixel White ^
  0317_grayscale.png ^
  -threshold 50%% ^
  +write x0.png ^
  -morphology Erode Rectangle:60x1 ^
  -morphology Erode Rectangle:1x60 ^
  +write x1.png ^
  -morphology Dilate Rectangle:100x1 ^
  +write x2.png ^
  -morphology Erode Rectangle:100x1 ^
  g3.png

Where there was text, g3 is black. Where there were spots, g3 is white. Where there was neither text nor spots, g3 may be either black or white.

Each intermediate "+write x?.png" is provided so you can see what is happening. They can be removed.

Code:

convert ^
  0317_grayscale.png ^
  g3.png ^
  -compose Lighten -composite ^
  g4.png

The result, g4.png, has any black marks that are not text over-painted with white. It takes about 6 seconds.
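
For bash or sh, the two commands above might be folded into one, roughly as sketched below. The changes from the BAT version are: the line continuation ^ becomes \, the doubled %% becomes a single %, and the parentheses need escaping. The wrapper function remove_spots and the IM variable are hypothetical conveniences added for this sketch (IM only allows a dry run); neither is part of ImageMagick. The debugging +write steps are left out.

```sh
#!/bin/sh
# remove_spots INPUT OUTPUT   (hypothetical wrapper, not an ImageMagick tool)
# IM optionally overrides the binary to run (default: convert); it is an
# assumption of this sketch, useful for printing the command with IM=echo.
remove_spots() {
  "${IM:-convert}" \
    -virtual-pixel White \
    "$1" \
    \( +clone \
       -threshold 50% \
       -morphology Erode Rectangle:60x1 \
       -morphology Erode Rectangle:1x60 \
       -morphology Dilate Rectangle:100x1 \
       -morphology Erode Rectangle:100x1 \
    \) \
    -compose Lighten -composite \
    "$2"
}
```

The clone inside the parentheses plays the role of g3.png: it is thresholded and run through the morphology steps, then composited with Lighten over the untouched input. A call such as remove_spots 0317_grayscale.png g4.png would correspond to the two-step version.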
snibgo's IM pages: im.snibgo.com

fmw42
Posts: 26383
Joined: 2007-07-02T17:14:51-07:00
Authentication code: 1152
Location: Sunnyvale, California, USA

Re: Scanned book: removing spots

Post by fmw42 »

Snibgo,

Terrific method!

Much more direct than looking for gaps in 1D images (via txt:)

snibgo
Posts: 13034
Joined: 2010-01-23T23:01:33-07:00
Authentication code: 1151
Location: England, UK

Re: Scanned book: removing spots

Post by snibgo »

It's a cute method. It can be done with a single "convert" and no scripting. It won't remove horizontal black marks at top and bottom of scans because they resemble black text, so the 1D gap-method is still useful for that.

Erik
Posts: 12
Joined: 2016-04-30T00:18:15-07:00
Authentication code: 1151

Re: Scanned book: removing spots

Post by Erik »

That works well, thank you both for your help :-)
