Pre-processing for OCR (outlined font)

Post by Wolfgang Woehl »

I'm trying to improve OCR/tesseract legibility of text rendered in an outline font. Text spacing is dense enough that a bunch of letter pairs will touch: http://minus.com/lyPtV8gTWD0SP. Feeding this original through tesseract will not output anything useful (completely garbled).

My best shot at it so far is trying to pick out the glyphs' meat by merging the black outline pixels with the background via floodfill:

convert original.tif -fill black -draw 'color 5,5 floodfill' -negate output.tif
which results in http://minus.com/lbup7GGmpnwy6I. This improves the OCR/tesseract output dramatically, but the text is still garbled wherever the floodfill cannot reach (where outlines touch) and wherever it leaves "insets" behind. E.g. "wollte" turns into "wtalltue" because of the artefacts in "o" and between "t" and "e".
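
Why the insets survive can be seen on a toy grid. The following is a minimal Python sketch, not part of the original post: `flood_fill` and `LETTER_O` are hypothetical names, and 4-connected filling is an assumption of the sketch. It fills the background from a corner, as the command above does from pixel 5,5, and shows that the enclosed centre of an "o" is never reached.

```python
from collections import deque


def flood_fill(grid, start, fill):
    """Return a copy of grid with the 4-connected region at start set to fill."""
    rows = [list(r) for r in grid]
    h, w = len(rows), len(rows[0])
    x0, y0 = start
    target = rows[y0][x0]
    if target == fill:
        return ["".join(r) for r in rows]
    rows[y0][x0] = fill
    queue = deque([(x0, y0)])
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < w and 0 <= ny < h and rows[ny][nx] == target:
                rows[ny][nx] = fill
                queue.append((nx, ny))
    return ["".join(r) for r in rows]


# Toy outlined "o": '#' is an outline pixel, '.' is white
# (background, glyph "meat" and the inset all start out white).
LETTER_O = [
    ".........",
    ".#######.",
    ".#.....#.",
    ".#.###.#.",
    ".#.#.#.#.",
    ".#.###.#.",
    ".#.....#.",
    ".#######.",
    ".........",
]

# Merge the background with the outline colour, starting from a corner.
filled = flood_fill(LETTER_O, (0, 0), "#")
for row in filled:
    print(row)
```

In the printed grid the ring of glyph "meat" and the single inset pixel in the middle are both still `.`, so after negating they look the same to the OCR engine.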

Improvements or, rather, a better idea much appreciated. Thanks in advance.
Version: ImageMagick 6.7.8-10 2012-10-07 Q16 http://www.imagemagick.org (Linux)

Post by snibgo »

Morphology might be useful here. For example:

"%IMG%convert" wollte.jpg ^
  -fuzz 50%% ^
  -fill Black ^
  -floodfill 0x0 White ^
  w1.png

"%IMG%convert" w1.png ^
  -morphology Hit-and-Miss "1x8:1,0,1,1,0,0,0,0" ^
  w2.png
This identifies areas where the pixels are, reading downwards: white, black, white, white, and 4 blacks. These are most of the areas that are incorrectly left white.
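
To make the kernel's behaviour concrete, here is a rough Python sketch of a one-column hit-and-miss match on a binary image (1 = white, 0 = black). It is an illustration, not ImageMagick's implementation: `hit_and_miss_column` is a hypothetical name, the origin at index 3 is an assumption about where ImageMagick places the origin of an 8-row kernel, and positions where the kernel would overlap the image edge are simply skipped.

```python
KERNEL = (1, 0, 1, 1, 0, 0, 0, 0)  # read downwards: white, black, white, white, 4 blacks


def hit_and_miss_column(image, kernel=KERNEL, origin=3):
    """Return (x, y) positions where the vertical kernel matches exactly.

    image is a list of rows of 0/1 values. The reported y is the row
    aligned with the kernel origin (assumed to be index 3 here).
    """
    h, w = len(image), len(image[0])
    hits = []
    for y in range(h):
        top = y - origin
        if top < 0 or top + len(kernel) > h:
            continue  # simplification: no virtual pixels beyond the edge
        for x in range(w):
            if all(image[top + k][x] == want for k, want in enumerate(kernel)):
                hits.append((x, y))
    return hits


# Two columns, eight rows: column 0 contains the pattern, column 1 is all black.
sample = [[1, 0], [0, 0], [1, 0], [1, 0], [0, 0], [0, 0], [0, 0], [0, 0]]
print(hit_and_miss_column(sample))
```

The match is reported at the origin row, so the white pixel at the top of the pattern sits three rows above the reported position.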

Post by Wolfgang Woehl »

snibgo, very interesting idea indeed. Thanks for the suggestion. Going to experiment with it tomorrow. Hardcoded pixel matches though, yes? That's probably going to be a problem for variable-sized input.

Post by snibgo »

I should have said: the script is Windows Bat; adjust as required for other languages. It works for any size of input.

Post by snibgo »

(Well, any size at least 1x8 pixels.)

Post by Wolfgang Woehl »

Right, but the kernels are fixed-size. Thus whatever it will match is fixed-size, from what I understand?

I found a related topic on area opening and closing: "Morphology, area open and close". This is about selecting contiguous areas bigger (or smaller) than a specific amount of pixels. In conjunction with some neighbourhood checking this might be feasible (if it were a feature in the first place), right?

Post by snibgo »

Ah, I see what you mean. Yes, different font sizes would need different kernels.

My fragment above is a building block to remove the unfilled pixels near the bottom of characters. The kernel can be inverted for unfilled pixels near the top of characters. That leaves a few isolated pixels, which a third pass can remove.

A complete Windows Bat script that gives perfect results for your sample file is below. It isn't fast, because of the repeated sub-image search for white pixels. Performance would be greatly improved by dumping w2.tiff to a text file and looping through it, floodfilling w1.tiff for each white-ish pixel in w2.tiff.

If your files have different font sizes, other morphology methods may be better. See http://www.imagemagick.org/Usage/morphology/ .

"%IMG%convert" wollte.jpg ^
  -fuzz 50%% ^
  -fill Black ^
  -floodfill 0x0 White ^
  -alpha off ^
  -threshold 50%% ^
  -depth 8 ^
  w1.tiff



rem Find unfilled pixels near the bottom of characters.

"%IMG%convert" w1.tiff ^
  -morphology Hit-and-Miss "1x8:1,0,1,1,0,0,0,0" ^
  w2.tiff

:Loop1
rem Find a white pixel

"%IMG%compare" ^
  -metric pae -dissimilarity-threshold 1 ^
  w2.tiff ^
  -size 1x1 xc:white ^
  -subimage-search ^
  null: 2>wollteWhite.lis

type wollteWhite.lis

for /f "tokens=2,3,4 delims=()@, " %%a ^
in (wollteWhite.lis) ^
do (
  set score=%%a
  set foundX=%%b
  set foundY=%%c
)

if /I "%score%" gtr "0.1" goto noMore1

set /A imgY=%foundY%-3

"%IMG%convert" w1.tiff -fuzz 25%% -fill Black -draw ^"color %foundX%,%imgY% floodfill^" w1.tiff
"%IMG%convert" w2.tiff -fuzz 25%% -fill Black -draw ^"color %foundX%,%foundY% floodfill^" w2.tiff

goto Loop1

:noMore1



rem Find unfilled pixels near the top of characters.

"%IMG%convert" w1.tiff ^
  -threshold 50%% ^
  -morphology Hit-and-Miss "1x8:0,0,0,0,1,1,0,1" ^
  -threshold 50%% ^
  -depth 8 ^
  w2.tiff

:Loop2
rem Find a white pixel

"%IMG%compare" ^
  -metric pae -dissimilarity-threshold 1 ^
  w2.tiff ^
  -size 1x1 xc:white ^
  -subimage-search ^
  null: 2>wollteWhite.lis

type wollteWhite.lis

for /f "tokens=2,3,4 delims=()@, " %%a ^
in (wollteWhite.lis) ^
do (
  set score=%%a
  set foundX=%%b
  set foundY=%%c
)

if /I "%score%" gtr "0.1" goto noMore2

set /A imgY=%foundY%+4

"%IMG%convert" w1.tiff ^
  -fuzz 50%% -fill Black -draw ^"color %foundX%,%imgY% floodfill^" ^
  -threshold 50%% ^
  -depth 8 ^
  w1.tiff

"%IMG%convert" w2.tiff ^
  -fuzz 50%% -fill Black -draw ^"color %foundX%,%foundY% floodfill^" ^
  -threshold 50%% ^
  -depth 8 ^
  w2.tiff

goto Loop2

:noMore2


rem Eliminate single white pixels

"%IMG%convert" w1.tiff ^
  -threshold 50%% ^
  -morphology Hit-and-Miss "3x3:-,0,-,0,1,0,-,0,-" ^
  -threshold 50%% ^
  -depth 8 ^
  w2.tiff


:Loop3
rem Find a white pixel

"%IMG%compare" ^
  -metric pae -dissimilarity-threshold 1 ^
  w2.tiff ^
  -size 1x1 xc:white ^
  -subimage-search ^
  null: 2>wollteWhite.lis

type wollteWhite.lis

for /f "tokens=2,3,4 delims=()@, " %%a ^
in (wollteWhite.lis) ^
do (
  set score=%%a
  set foundX=%%b
  set foundY=%%c
)

if /I "%score%" gtr "0.1" goto noMore3

set /A imgY=%foundY%

"%IMG%convert" w1.tiff ^
  -fuzz 50%% -fill Black -draw ^"color %foundX%,%imgY% floodfill^" ^
  -threshold 50%% ^
  -depth 8 ^
  w1.tiff

"%IMG%convert" w2.tiff ^
  -fuzz 50%% -fill Black -draw ^"color %foundX%,%foundY% floodfill^" ^
  -threshold 50%% ^
  -depth 8 ^
  w2.tiff

goto Loop3

:noMore3


rem Finished. w1.tiff contains the result.
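
The faster variant mentioned before the script (dumping w2.tiff to a text file and looping through it) might start from something like the Python sketch below. It assumes the dump was produced with ImageMagick's `txt:` output format and that each pixel line looks like `x,y: (values)  #hex  name` after a header line giving width, height and maximum value; that layout is an assumption and can differ between versions. `white_pixels` and `SAMPLE_DUMP` are hypothetical names.

```python
import re

# Matches the start of a pixel line such as "1,0: (255,255,255)  #FFFFFF  white".
PIXEL_LINE = re.compile(r"^(\d+),(\d+):\s*\(([^)]*)\)")


def white_pixels(txt_dump, threshold=0.5):
    """Return (x, y) for every pixel whose first channel is above threshold."""
    max_value = 255.0  # fallback if the header cannot be read
    found = []
    for line in txt_dump.splitlines():
        if line.startswith("#"):
            # Assumed header: "# ImageMagick pixel enumeration: W,H,MAX,colorspace"
            fields = line.split(":", 1)[-1].split(",")
            if len(fields) >= 3:
                max_value = float(fields[2])
            continue
        match = PIXEL_LINE.match(line)
        if match and float(match.group(3).split(",")[0]) > threshold * max_value:
            found.append((int(match.group(1)), int(match.group(2))))
    return found


SAMPLE_DUMP = """# ImageMagick pixel enumeration: 3,1,255,gray
0,0: (0,0,0)  #000000  black
1,0: (255,255,255)  #FFFFFF  white
2,0: (0,0,0)  #000000  black
"""

print(white_pixels(SAMPLE_DUMP))
```

Each returned coordinate would then take the place of one `compare -subimage-search` call, with the same Y offset applied before floodfilling w1.tiff.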

Post by Wolfgang Woehl »

OK, that's one way to do it, if somewhat cumbersome and, indeed, horrifyingly slow :) Thanks for the effort, snibgo! The problem with this approach, I think, is that it deals with the artefacts of an initial operation (background floodfill with the outline colour) that is not suitable in the first place. That idea was really only my first baby step towards a better understanding of the problem.

I'm experimenting with another observation: Insets in outlined fonts are surrounded, at least partially, by glyph "meat". Assuming a specific search direction (left-to-right or top-to-bottom), once you encounter an outline pixel (black in this case) with neighbouring background pixels (white in this case) the following pixel should be inside glyph "meat". Floodfill with a marker color there. The next outline pixel will either lead to background or to an "inset". The check for neighbouring background pixels would fail there because the surrounding outline is filled already with marker. Mixed results so far. With the densely packed outlines I have here some locations will fail with left-to-right search.
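
The core of this observation, and the way it breaks where outlines touch, can be shown on a single scanline. The Python sketch below is deliberately reduced: it replaces the neighbour check described above with a plain crossing-parity rule (an odd number of outline runs crossed means glyph "meat"), which is an assumption of the sketch and not the implementation being experimented with. `classify_row` is a hypothetical name.

```python
def classify_row(row, outline="#"):
    """Label non-outline pixels of one scanline: 'M' for meat, '.' otherwise.

    Every run of consecutive outline pixels counts as one crossing.
    After an odd number of crossings the scan is assumed to be inside
    glyph "meat"; after an even number it is in background or an inset.
    """
    labels = []
    crossings = 0
    in_outline = False
    for px in row:
        if px == outline:
            if not in_outline:
                crossings += 1
                in_outline = True
            labels.append(outline)
        else:
            in_outline = False
            labels.append("M" if crossings % 2 else ".")
    return "".join(labels)


# A row through the middle of an "o": background, meat, inset, meat, background.
print(classify_row("..#..#..#..#.."))

# Two glyphs whose outlines touch: the merged run "##" counts as one crossing.
print(classify_row("..#..##..#.."))
```

The first row comes out as `..#MM#..#MM#..`, which is correct. In the second row the merged outline is counted once, so everything to the right of it is labelled the wrong way round; this is the kind of failure a left-to-right search runs into with densely packed outlines.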

Post by snibgo »

The speed of my script can be increased by a couple of orders of magnitude, so that needn't be a major concern (depending on the volume of work, of course).

I looked at a few ways of getting to the characters without also getting the centre of the "o", the gap between "t" and "e", between "w" and "a", and so on. I couldn't quickly find a method that was simpler than my script. (Which doesn't mean that no simpler solution exists, of course.)

The real problems start if you need a general solution for different font sizes or even different fonts. For example, if the font is constant but the size isn't, you might search for every "a", then every "b", and so on.

Post by snibgo »

Wolfgang Woehl wrote: This is about selecting contiguous areas bigger (or smaller) than a specific amount of pixels. In conjunction with some neighbourhood checking this might be feasible (if it were a feature in the first place), right?

Morphology can currently find areas bigger or smaller than various dimensions. (Check out "distance".) While this might be useful, it doesn't offer an immediate solution, as the hole in "o" is larger than the dot in "i", for example.

Post by Wolfgang Woehl »

Yes, it's an attractively hard problem for any kind of non-OCR approach. From an OCR-centric point of view, though, it might be close to trivial: shape recognition plus some intelligence about the concept of outlines.
