Find the white blocks between text and pictures

rmagick
Posts: 245
Joined: 2006-03-16T17:30:48-07:00
Location: Durham, NC, USA

Find the white blocks between text and pictures

Post by rmagick »

An RMagick user asks:
I have a number of black-and-white scanned pages. To prepare them for OCR,
I have to split them in columns and rows. Additionally, somewhere in between, there
are pictures, which also need to be separated.

So, in a page that might look like this:

Text1 Text4 Text6

Text2 Pict1 Text7

Text3 Text5 Pict2

I'd like to find the largest blocks of white which separate the texts and pictures, both horizontally
and vertically.
Is there a way to do this with ImageMagick? I can easily convert the commands and options into RMagick code for him.
anthony
Posts: 8883
Joined: 2004-05-31T19:27:03-07:00
Location: Brisbane, Australia

Re: Find the white blocks between text and pictures

Post by anthony »

Not directly in IM, but IM can be used to separate the blocks.

I wrote a script as a proposed horizontal and vertical 'block' segmentation algorithm.

See the script divide_vert, which looks for rows of pixels that are all the same color (exactly the same at this time, no fuzz :-( ) and outputs either just the interesting blocks, or those blocks together with the blank 'spacing' images it found between them.
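
As a rough illustration of that row test (not the divide_vert script itself), here is a minimal Python sketch. It assumes the page has already been read into a list of pixel rows; `split_rows` and the toy `page` are hypothetical names made up for this example. A row counts as a gap only when every pixel in it is exactly the same value, matching the "no fuzz" behaviour described above.

```python
from itertools import groupby

def split_rows(image):
    """Split a 2-D image (list of pixel rows) into consecutive runs of rows.

    A row is a 'gap' when all of its pixels have exactly the same value
    (no fuzz); any other row belongs to a 'block'.
    Returns (kind, start, stop) tuples with stop exclusive.
    """
    def is_blank(row):
        return len(set(row)) <= 1

    runs = []
    y = 0
    for blank, group in groupby(image, key=is_blank):
        n = len(list(group))
        runs.append(("gap" if blank else "block", y, y + n))
        y += n
    return runs

# Toy page: 0 = white, 1 = black (assumed encoding for this sketch).
page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]

runs = split_rows(page)
blocks = [(start, stop) for kind, start, stop in runs if kind == "block"]
widest_gap = max((r for r in runs if r[0] == "gap"), key=lambda r: r[2] - r[1])

print(blocks)       # [(1, 3), (5, 6)]
print(widest_gap)   # ('gap', 3, 5)
```

Running the same function on the transposed image (`list(zip(*page))`) would give the column gaps instead.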

This script was created in response to someone else's problem, and it should be possible to extend it to do what you request.

If you can give me a 'shrunk' example image, I may be able to work on it a bit more to make it do what you want, and make a utility that would be much more useful in general.


NOTE: most good OCR software has options to select the areas of the image that contain the text to be converted.

ASIDE: The DjVu image format is designed to do this horizontal / vertical separation of blocks right down to character level so as to find and delete duplicate images (characters) and thus shrink scanned images to the MAX, regardless of the font and style of the book being scanned!
This naturally lends itself perfectly to OCR conversion of the smaller images.
Anthony Thyssen -- Webmaster for ImageMagick Example Pages
https://imagemagick.org/Usage/
rmagick
Posts: 245
Joined: 2006-03-16T17:30:48-07:00
Location: Durham, NC, USA

Re: Find the white blocks between text and pictures

Post by rmagick »

Thanks for the tip, Anthony! I've passed it along to the RMagick user and volunteered to help convert it to Ruby and RMagick. I have to say though, after looking at your "heroic" script, I'm glad I get to use Ruby for my IM-ing :D
anthony
Posts: 8883
Joined: 2004-05-31T19:27:03-07:00
Location: Brisbane, Australia

Re: Find the white blocks between text and pictures

Post by anthony »

You are quite welcome. The script was designed as a proof of concept rather than any specifically useful application, which is why it only does vertical segmentation.

Segmentation of an image into separate parts is an area in which IM is sorely lacking, even when those parts are already well defined, as in black and white scans of documents.

I would like to see vertical and horizontal segmentation, and the black and white mask separation techniques (see the segment_image script), programmed into the IM core for speed, as well as the addition of color and texture area division methods.
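
To make the combined vertical and horizontal idea concrete, here is a hedged Python sketch of one common approach, sometimes called a recursive X-Y cut: trim the white margins, split the region at its widest all-white band (row or column), and repeat on each half. This is an assumption about how such a feature might work, not code from any IM script; `blank_runs`, `xy_cut`, `min_gap` and the toy page are hypothetical names for this example.

```python
def blank_runs(flags):
    """Return (start, stop) index runs where flags are True, stop exclusive."""
    runs, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(flags)))
    return runs

def xy_cut(image, x0, y0, x1, y1, white=0, min_gap=1):
    """Recursively split the region [x0, x1) x [y0, y1) at its widest white gap.

    Returns a list of (left, top, right, bottom) boxes, right/bottom exclusive.
    """
    row_blank = [all(image[y][x] == white for x in range(x0, x1)) for y in range(y0, y1)]
    col_blank = [all(image[y][x] == white for y in range(y0, y1)) for x in range(x0, x1)]
    if all(row_blank):
        return []  # nothing but white in this region

    # Trim the white margins so that every remaining blank run is interior.
    top = y0 + row_blank.index(False)
    bottom = y1 - row_blank[::-1].index(False)
    left = x0 + col_blank.index(False)
    right = x1 - col_blank[::-1].index(False)

    h_gaps = [(a + top, b + top) for a, b in blank_runs(row_blank[top - y0:bottom - y0])]
    v_gaps = [(a + left, b + left) for a, b in blank_runs(col_blank[left - x0:right - x0])]
    candidates = ([(b - a, "h", a, b) for a, b in h_gaps]
                  + [(b - a, "v", a, b) for a, b in v_gaps])
    if not candidates or max(candidates)[0] < min_gap:
        return [(left, top, right, bottom)]  # no usable gap: this is one block

    _, axis, a, b = max(candidates)  # widest gap on either axis
    if axis == "h":
        return (xy_cut(image, left, top, right, a, white, min_gap)
                + xy_cut(image, left, b, right, bottom, white, min_gap))
    return (xy_cut(image, left, top, a, bottom, white, min_gap)
            + xy_cut(image, b, top, right, bottom, white, min_gap))

# Toy page: '#' is ink (1), '.' is white (0).
page_rows = [
    "##..##",
    "##..##",
    "......",
    "##..#.",
]
page = [[1 if c == "#" else 0 for c in row] for row in page_rows]
boxes = xy_cut(page, 0, 0, 6, 4)
print(boxes)  # [(0, 0, 2, 2), (0, 3, 2, 4), (4, 0, 6, 2), (4, 3, 5, 4)]
```

On a real scan, `min_gap` would probably need to be larger than the normal line and word spacing, otherwise every line of text ends up as its own block.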

Please keep us (especially me) informed as to your RMagick progress in this matter.
fmw42
Posts: 25562
Joined: 2007-07-02T17:14:51-07:00
Location: Sunnyvale, California, USA

Re: Find the white blocks between text and pictures

Post by fmw42 »

Has anyone considered building upon the blob counting method in the following post (for which el_supremo has provided some code):

viewtopic.php?f=1&t=10889

Has any of this been built into an IM function, or could it be?
anthony
Posts: 8883
Joined: 2004-05-31T19:27:03-07:00
Location: Brisbane, Australia

Re: Find the white blocks between text and pictures

Post by anthony »

el_supremo's method was designed only to locate and output the largest blob, and did so with his own flood fill method.

My script generates a separate image for each and every blob, regardless of colors.
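
For readers unfamiliar with the two approaches, the following Python sketch shows the general flood-fill idea behind both: collect every connected blob, then either keep them all or pick the largest. It is neither el_supremo's code nor my script; `find_blobs`, the 4-connectivity rule and the background value 0 are assumptions chosen for this example.

```python
from collections import deque

def find_blobs(image, background=0):
    """Return every 4-connected blob as a list of (x, y) pixels.

    A blob is a set of neighbouring non-background pixels that all share
    the same value, found with a breadth-first flood fill.
    """
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    blobs = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or image[y][x] == background:
                continue
            colour = image[y][x]
            queue = deque([(x, y)])
            seen[y][x] = True
            pixels = []
            while queue:
                cx, cy = queue.popleft()
                pixels.append((cx, cy))
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx] and image[ny][nx] == colour):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            blobs.append(pixels)
    return blobs

page = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 1, 0, 1, 1],
]
blobs = find_blobs(page)          # one entry per blob
largest = max(blobs, key=len)     # the single largest blob
print(len(blobs), len(largest))   # 3 6
```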

This type of segmentation can also make good use of image 'morphology' operations, as shown by Fred Weinhaus's scripts, to expand and merge 'near' segments. This also has yet to be added into the IM core.
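
A small Python sketch of the 'expand and merge' idea, using plain binary dilation with a square neighbourhood. It is only an illustration of the technique, not taken from Fred's scripts; `dilate` and `radius` are hypothetical names, and 1 is assumed to mark segment pixels.

```python
def dilate(image, radius=1):
    """Binary dilation with a (2 * radius + 1) square structuring element.

    Every set pixel is grown by `radius` in all directions, so segments
    separated by a gap of at most 2 * radius pixels merge into one.
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x]:
                for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                        out[ny][nx] = 1
    return out

# Three segments on one row: gaps of 2 and 4 pixels between them.
row = [[1, 1, 0, 0, 1, 0, 0, 0, 0, 1]]
merged = dilate(row, radius=1)
print(merged[0])  # [1, 1, 1, 1, 1, 1, 0, 0, 1, 1]
```

The first two segments fuse because their gap is only 2 pixels wide, while the third stays separate. The dilated result is best used only as a mask for deciding which parts belong together, since it no longer matches the original pixel shapes.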

Finally, when you have a layered sequence of segment masks, you can use them with my -layers composite operator to extract each masked segment from the original unmodified image.
This is something I have not started a section on in IM Examples, and probably should.
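
Until such a section exists, here is a rough Python sketch of the masking step on its own. It does not reproduce what -layers composite does internally; it only shows the idea of copying the original pixels under each mask and cropping to the mask's bounding box. `extract_segment`, the background value 255 and the toy data are assumptions for this example.

```python
def extract_segment(image, mask, background=255):
    """Copy the pixels of `image` selected by `mask`, cropped to the mask's bounds.

    Pixels inside the bounding box but outside the mask are set to `background`.
    Returns an empty list when the mask selects nothing.
    """
    coords = [(x, y) for y, row in enumerate(mask) for x, m in enumerate(row) if m]
    if not coords:
        return []
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    left, right = min(xs), max(xs) + 1
    top, bottom = min(ys), max(ys) + 1
    return [[image[y][x] if mask[y][x] else background for x in range(left, right)]
            for y in range(top, bottom)]

# Original (unmodified) grey-scale image and one mask per segment.
image = [
    [10, 20, 255, 30],
    [40, 50, 255, 60],
    [255, 255, 255, 70],
]
masks = [
    [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0]],
    [[0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1]],
]
segments = [extract_segment(image, mask) for mask in masks]
print(segments[0])  # [[10, 20], [40, 50]]
print(segments[1])  # [[30], [60], [70]]
```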