Split images by white space

Bonzo
Posts: 2971
Joined: 2006-05-20T08:08:19-07:00
Location: Cambridge, England

Re: Split images by white space

Post by Bonzo »

Anthony has some install notes here: http://www.imagemagick.org/Usage/api/#building

I installed it on a CentOS 5.2 (I think) server using:

Code:

# uninstall old ImageMagick
yum remove ImageMagick

# get new ImageMagick sources
wget ftp://ftp.imagemagick.org/pub/ImageMagick/ImageMagick.tar.gz
# or, as the default version did not work for me, a specific release:
wget ftp://ftp.imagemagick.org/pub/ImageMagick/ImageMagick-6.6.0-0.tar.gz

# untar
tar -zxvf ImageMagick*.tar.gz
cd ImageMagick*

# Extra steps recommended by snibgo. I think I managed to install OK before without them,
# but I was starting to get a "shared libraries: libMagickCore.so.3" error.

export LDFLAGS="-L/usr/local/lib -Wl,-rpath,/usr/local/lib"
export LD_LIBRARY_PATH="/usr/local/lib"

ldd /usr/local/bin/convert
# The ldd line above failed on one server but worked on another.

# End of extra steps

# configure and make
./configure
make

# install
make install
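
If the libMagickCore.so.3 error mentioned in the comments above shows up after installing, a small check along these lines might help narrow it down. This is only a sketch: `missing_libs` is a made-up helper name, and all it does is filter `ldd` output for entries marked "not found".

```shell
# Print the names of the libraries that ldd reports as "not found".
# Reads ldd output on stdin, so it can be tried on saved output as well.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Possible usage after "make install" (path taken from the steps above):
#   ldd /usr/local/bin/convert | missing_libs
```

If libMagickCore.so.3 turns up in that list, running `ldconfig /usr/local/lib` as root is a commonly suggested remedy. That is an assumption on my part about the cause, not something confirmed for the servers described here.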

hm2k

Re: Split images by white space

Post by hm2k »

I manually upgraded.

Code:

[user@blade ~]# convert -version
Version: ImageMagick 6.6.1-4 2010-04-21 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2010 ImageMagick Studio LLC
Features:

That seemed to do the trick:

Code:

[user@blade artwork]# ./multicrop p29cE.jpg p29cE_out.jpg

Processing Image 0
  Size: 398x136
  Page Geometry: 443x540+17+44
Processing Image 1
  Size: 404x145
  Page Geometry: 443x540+17+222
Processing Image 2
  Size: 404x127
  Page Geometry: 443x540+20+393

[user@blade artwork]# ls p29cE*
p29cE.jpg  p29cE_out-0.jpg  p29cE_out-1.jpg  p29cE_out-2.jpg

:)
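
For more than a handful of files, the same call could be wrapped in a loop. The sketch below assumes the multicrop script sits in the current directory, as in the run above, and that the inputs are .jpg files. `out_name` and `batch_multicrop` are hypothetical helper names, and the output naming simply copies the p29cE.jpg to p29cE_out.jpg pattern.

```shell
# Build the output name passed to multicrop for a given input scan,
# following the p29cE.jpg -> p29cE_out.jpg pattern.
out_name() {
    printf '%s_out.jpg\n' "${1%.jpg}"
}

# Run multicrop on every .jpg in the current directory.
batch_multicrop() {
    for f in *.jpg; do
        [ -e "$f" ] || continue                    # no .jpg files: the glob stays literal
        case $f in *_out*.jpg) continue ;; esac    # skip results of earlier runs
        ./multicrop "$f" "$(out_name "$f")"
    done
}
```

Calling `batch_multicrop` from the folder holding the scans should then leave files such as p29cE_out-0.jpg next to each input.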
hm2k

Re: Split images by white space

Post by hm2k »

I just tried this in production on 10 files and it worked perfectly.

Thanks very much for your assistance.

Keep up the good work.
fmw42
Posts: 25562
Joined: 2007-07-02T17:14:51-07:00
Location: Sunnyvale, California, USA

Re: Split images by white space

Post by fmw42 »

You are welcome. Glad it was of help.
johnbent
Posts: 14
Joined: 2014-12-16T10:08:07-07:00

Re: Split images by white space

Post by johnbent »

Is anyone still monitoring this really old thread? I have 350+ images that I'd love to split along "large" regions of whitespace. Can multicrop handle this? I couldn't figure out which arguments to use. Basically, I have 350+ scanned pages of a dictionary that I'd like to convert to text (I have permission from the copyright holder). It's too much work for me to do myself, so I want to use Mechanical Turk. I'd like to create a task for each individual word in the dictionary. So is there a way to use multicrop to separate out each word entry in this picture:

[Image: scanned dictionary page]
snibgo
Posts: 12159
Joined: 2010-01-23T23:01:33-07:00
Location: England, UK

Re: Split images by white space

Post by snibgo »

It is generally best to start a new thread for new questions. By all means, refer back to previous threads.

I would tackle it like this:

1. Deskew each page.

2. Chop off head and tail of each page.

3. Divide each page into two columns, both trimmed left and right.

4. Divide each column into lines (but with no further trimming).
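
As a rough starting point, the four steps might be sketched with standard convert options (-deskew, -chop, -crop, -trim, -scale, -threshold). Everything named here is an assumption for illustration: the function names, the TOP and BOTTOM pixel heights, and the fuzz and threshold percentages are guesses that would need tuning on real scans. Note also that -trim removes blank space on all four sides of a column, not only left and right.

```shell
# Steps 1-3: deskew, chop head and tail, split into two trimmed columns.
# TOP and BOTTOM are hypothetical heights in pixels; measure them on a scan.
prepare_page() {
    page=$1
    top=${TOP:-100}
    bottom=${BOTTOM:-100}
    convert "$page" -background white -deskew 40% +repage \
        -gravity North -chop "0x${top}" \
        -gravity South -chop "0x${bottom}" +gravity +repage \
        -crop 2x1@ +repage \
        -fuzz 10% -trim +repage \
        "${page%.*}_col%d.png"
}

# Print one flag per pixel row of a column: 1 = row has ink, 0 = blank.
# The column is squeezed to one pixel wide, thresholded, and read as plain PGM.
row_profile() {
    h=$(identify -format '%h' "$1")
    convert "$1" -colorspace Gray -scale "1x${h}!" -threshold 98% \
        -depth 8 -compress none pgm:- |
    awk '{ for (i = 1; i <= NF; i++) { t++; if (t > 4) print ($i == 0 ? 1 : 0) } }'
}

# Turn row flags on stdin into crop geometries WIDTHxHEIGHT+0+Y,
# one per run of ink rows. $1 is the full column width (no trimming).
row_runs() {
    awk -v w="$1" '
        $1 == 1 && !inrun { start = NR - 1; inrun = 1 }
        $1 != 1 && inrun  { printf "%dx%d+0+%d\n", w, NR - 1 - start, start; inrun = 0 }
        END { if (inrun) printf "%dx%d+0+%d\n", w, NR - start, start }
    '
}

# Step 4: cut a column into one image per text line.
split_lines() {
    col=$1
    w=$(identify -format '%w' "$col")
    n=0
    row_profile "$col" | row_runs "$w" | while read -r geom; do
        convert "$col" -crop "$geom" +repage "${col%.*}_line$(printf '%03d' "$n").png"
        n=$((n + 1))
    done
}
```

For all 350+ pages, `prepare_page` could be called in a loop over the scans and `split_lines` in a second loop over the resulting column files. Lines whose descenders touch the ascenders of the next line would come out as a single run, so the results need a visual check.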

Now, you have one image per line in the dictionary. Each image that has a character at the far left is the start of a definition. Each image with white space at the left is a continuation.

So you then join up all the lines for each definition, and send that to the OCR.
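
The joining step could be sketched as below, assuming there is already one image per line, named so that the files sort in reading order. The helper names and the 5-pixel limit for "far left" are hypothetical. The left offset is read from the trim bounding box that convert reports for the `%@` format escape, and `-append` stacks the line images of one definition vertically.

```shell
# Print the X offset of the ink in a line image, taken from the
# trim bounding box WxH+X+Y reported by the "%@" escape.
left_offset() {
    convert "$1" -fuzz 10% -format '%@' info: | awk -F'[x+]' '{ print $3 }'
}

# Read "file offset" pairs on stdin, in reading order, and print one
# definition per output line as a space-separated list of files.
# A line whose offset is at most $1 (default 5) starts a new definition.
group_lines() {
    awk -v max="${1:-5}" '
        $2 <= max && group != "" { print group; group = "" }
        { group = (group == "" ? $1 : group " " $1) }
        END { if (group != "") print group }
    '
}

# Read the groups on stdin and stack each one into a single image.
join_definitions() {
    n=0
    while read -r files; do
        # Word splitting of $files is intended; file names must not contain spaces.
        convert $files -append "def_$(printf '%04d' "$n").png"
        n=$((n + 1))
    done
}

# Possible usage over the line images of one column:
#   for f in *_line*.png; do echo "$f $(left_offset "$f")"; done |
#       group_lines 5 | join_definitions
```

If a column begins with a continuation carried over from the previous column, that line ends up in a group of its own and would have to be merged by hand.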
snibgo's IM pages: im.snibgo.com
johnbent
Posts: 14
Joined: 2014-12-16T10:08:07-07:00

Re: Split images by white space

Post by johnbent »

That's a great suggestion! Thanks very much. I'm a total newbie to ImageMagick, however. I'm willing to work out how to do all of the above, but if you know any of the command lines that would perform each of those steps automatically for each of the 350+ pages, that would be a much appreciated head start. I think I'll also follow your suggestion and start a new thread for this.

PS: I haven't had good luck with automated OCR on this, since I believe most OCR engines rely on language context and none is available for Palauan. So my OCR plan is Mechanical Turk (human workers on Amazon).