extracting text area from image

Post by aciobanu »

Hi, all!

I am trying to use ImageMagick to extract strictly
the text area from a photograph of a book page.

If you look at the image attached, I am interested in the
green area and, if possible, the red area.

The problem is that it has to be automated and work
for books of various sizes.

Do you guys have any idea how I could achieve this?

My idea so far is:
apply a really aggressive filter that would transform the
green area into a big uniform blob, so that I can
then extract its coordinates, and then use those
on the original image.

Image -> http://picasaweb.google.com/capsunel/Im ... 3571163346

Alex

Post by aciobanu »

If anybody is interested, there are some solutions proposed to this problem
on the mailing list. Here is the thread:
http://studio.imagemagick.org/pipermail ... html#19694

Alex
anthony
Posts: 8883
Joined: 2004-05-31T19:27:03-07:00
Location: Brisbane, Australia

Post by anthony »

The thing is, you know that the image (after the -lat) is a border area, followed by a white box area, and then an internal area.

One idea is to create a temporary image that you 'blur', then color reduce it
to 2 colors. This will hopefully reduce the image to three boxes that you can study to find the center of the margin of the page. With that you can then crop the image
to the margins, and trim it to just the text itself.

I call this technique a fuzzy or blurred trim...
http://www.imagemagick.org/Usage/crop/#trim_blur

It is a starting point at least.
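
For anyone who wants to see the idea outside ImageMagick, here is a minimal pure-Python sketch of a blurred trim. It is not the code behind the usage page linked above. The image is assumed to be a list of rows with 1 for ink and 0 for paper, and the names `box_blur` and `blurred_bbox`, the box-shaped blur, and the `radius` and `threshold` values are all illustrative assumptions. Blurring turns dense text into a blob while isolated specks stay faint, so a bounding box taken after thresholding the blurred copy ignores the specks.

```python
def box_blur(img, radius):
    """Mean of each pixel's square neighbourhood, clipped at the image edges."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        ys = range(max(0, y - radius), min(h, y + radius + 1))
        for x in range(w):
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            total = sum(img[j][i] for j in ys for i in xs)
            out[y][x] = total / (len(ys) * len(xs))
    return out


def blurred_bbox(img, radius=1, threshold=0.3):
    """Bounding box (xmin, ymin, xmax, ymax) of the blurred ink blob.

    Assumption: img[y][x] is 1 for ink and 0 for paper.
    Returns None when no blurred pixel reaches the threshold.
    """
    blurred = box_blur(img, radius)
    hits = [(x, y)
            for y, row in enumerate(blurred)
            for x, value in enumerate(row)
            if value >= threshold]
    if not hits:
        return None
    xs = [x for x, _ in hits]
    ys = [y for _, y in hits]
    return min(xs), min(ys), max(xs), max(ys)
```

The box found on the blurred copy would then be used to crop the original image. On a real page the radius probably needs to be about the size of the gaps between lines, and the threshold tuned by eye.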
Anthony Thyssen -- Webmaster for ImageMagick Example Pages
https://imagemagick.org/Usage/

Post by anthony »

Another alternative is to assume the outside areas are dark. So again in a temporary image, run it through a -median 10x10 filter to remove the text, to leave just a 'blank page'.
That page should be a lot easier to determine the bounds of than a page filled with text!

A smaller -median filter will also help remove any 'junk' pixels left by the -lat operator.
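
As a rough illustration of why a median filter helps, here is a small pure-Python sketch. It is a stand-in for the idea, not a description of how ImageMagick implements -median. The image is assumed to be a list of rows of grey values (0 dark, 255 bright), and `median_filter`, `page_bounds`, `radius` and `paper_level` are hypothetical names and parameters chosen for the example. Thin dark strokes on bright paper are outvoted by the surrounding paper, so the text disappears and the bright region that remains is the page.

```python
from statistics import median


def median_filter(img, radius):
    """Replace each pixel by the median of its square neighbourhood."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = median(window)
    return out


def page_bounds(img, radius=1, paper_level=128):
    """Bounding box (xmin, ymin, xmax, ymax) of the bright page area.

    Assumption: the surroundings are dark and the paper is bright.
    Returns None when nothing is at least as bright as paper_level.
    """
    blank = median_filter(img, radius)
    hits = [(x, y)
            for y, row in enumerate(blank)
            for x, value in enumerate(row)
            if value >= paper_level]
    if not hits:
        return None
    xs = [x for x, _ in hits]
    ys = [y for _, y in hits]
    return min(xs), min(ys), max(xs), max(ys)
```

The radius has to be larger than the stroke width of the text, otherwise the strokes survive the filter.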

Post by aciobanu »

I've tried blurring the image (after -lat) and it gives pretty
good results. I get a cloudy uniform blob where the text is.

Still, I should not expect to solve the whole thing from
the command line.

One strategy I consider is the following:
1. place yourself in the center of the image
2. start "moving" in 4 directions simultaneously (North, South, East, West)
3. count how many times you find black pixels, (aka text, image, etc)
4. when you start getting only white pixels, you stop

The places where you have stopped at North, South, East and West
will give you the Ymin, Ymax, Xmax, Xmin of the interesting area.
(0,0 coord is the NW corner)

I've seen something similar done in the unpaper tool. It should be
easy to do with IM.
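
The four-direction walk described above could be sketched in pure Python roughly like this. It is one possible reading of the steps, not the unpaper algorithm: instead of following a single ray of pixels it checks whole rows and columns for ink, and it only stops after `gap` blank rows or columns in a row so that the white space between text lines does not end the walk early. The image is assumed to be already binarized (1 for black, 0 for white), and the names `walk`, `text_bounds_from_center` and `gap` are made up for the example.

```python
def walk(profile, start, step, gap):
    """Walk over a list of booleans (True = contains ink) from `start`.

    Returns the index of the last inked entry seen before `gap`
    consecutive blank entries, or before the edge of the image.
    If no ink is found at all, `start` itself is returned.
    """
    last_ink = start
    blanks = 0
    i = start
    while 0 <= i < len(profile):
        if profile[i]:
            last_ink = i
            blanks = 0
        else:
            blanks += 1
            if blanks >= gap:
                break
        i += step
    return last_ink


def text_bounds_from_center(img, gap=2):
    """Return (xmin, ymin, xmax, ymax), with (0, 0) at the NW corner."""
    h, w = len(img), len(img[0])
    rows = [any(row) for row in img]
    cols = [any(img[y][x] for y in range(h)) for x in range(w)]
    cy, cx = h // 2, w // 2
    ymin = walk(rows, cy, -1, gap)  # North
    ymax = walk(rows, cy, +1, gap)  # South
    xmin = walk(cols, cx, -1, gap)  # West
    xmax = walk(cols, cx, +1, gap)  # East
    return xmin, ymin, xmax, ymax
```

The `gap` value must be larger than the line spacing and smaller than the page margin, so it would have to scale with the size of the book.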

Alex

Post by anthony »

Use -median to remove the text, then look for where the paper ends ;)
jumpjack
Posts: 69
Joined: 2010-12-10T05:29:16-07:00

Post by jumpjack »

anthony wrote: Use -median to remove the text, then look for where the paper ends ;)
I need to do the same thing as the original poster, but I can't understand this reply.
Any help?

Post by anthony »

Please start a new thread with an example of YOUR image problem.
fmw42
Posts: 25562
Joined: 2007-07-02T17:14:51-07:00
Location: Sunnyvale, California, USA

Post by fmw42 »

I don't know if this will work on your image, but you can try my script, textcleaner, at the link below.

Post by jumpjack »

fmw42 wrote: I don't know if this will work on your image, but you can try my script, textcleaner, at the link below.
Thanks, but I do not need to clean the background. I want to extract the text areas from a scanned page, and I need the script to find the areas by itself.

Post by jumpjack »

anthony wrote: Please start a new thread with an example of YOUR image problem.
Already opened:
viewtopic.php?f=1&t=10377

I was looking for some hints here too.

Post by jumpjack »

So?
No clues about how to determine the areas containing words?