Office Document Imaging compatibility

Posted: 2009-11-05T12:40:03-07:00
by Einar
Hello,

After some messing around, I've figured out how to use ImageMagick to convert pdf files into tiff files that are compatible with Microsoft Office Document Imaging (because I'm cheap and I don't want to figure out how to use Tesseract). Since it took me entirely too long, I'm writing this post in the hope that future Internet-searchers have an easier time.

Weirdly enough, when you convert directly from pdf to tiff, you get a file that's not compatible with MODI. However, if you go to a jpeg in between, the resultant file is compatible. Hopefully someone with better knowledge of image formats than me can look at the two different outputs and figure out what options you need to toggle to do it in one step.

Here's what I did, as typed at the Windows command prompt (inside a .bat file, every % has to be doubled, e.g. %%a, out%%d.jpg and 25%%):
  1. Code:

    convert -quality 100 -density 400 -resize 25% in.pdf out%d.jpg
  2. Code:

    FOR /F %a IN ('dir /b *.jpg') DO convert -colorspace RGB +compress -type TrueColor -resize 300% %a "%a"-new.tiff
  3. Code:

    convert -adjoin out*.jpg-new.tiff tiff:everything.tiff
In step 1, the -density 400 and -resize 25% are based on magick's advice to supersample in this post. In step 2, it turns out that just naively converting from jpg to tiff creates an image that's too small for MODI, so I scale it up to 300% of its size (that is, blow it up by 200%). I'm not really sure why MODI chokes on the smaller image; it seems like it's just written badly. Finally, in step 3, I combine all of those separate .tiff files into one big .tiff file. I'm sure I could have done that as part of a different step, but my Magick-Fu is not that great.
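
For anyone who would rather script this than type it, here is a rough Python sketch of the same three steps. The convert arguments are copied from the commands above; everything else is my own assumption for illustration: the helper names (page_number, split_command, tiff_command, join_command, run_pipeline) are made up, it assumes ImageMagick's convert is the one found on your PATH, and it sorts the pages by number before joining, because I suspect a plain out*.jpg-new.tiff wildcard may put out10 ahead of out2 on longer PDFs. I have not run this against MODI, so treat it as a starting point.

```python
import re
import subprocess
from pathlib import Path


def page_number(path):
    """Return the first number in the file name, e.g. 12 for out12.jpg (hypothetical helper)."""
    match = re.search(r"(\d+)", Path(path).stem)
    return int(match.group(1)) if match else -1


def split_command(pdf, prefix="out"):
    # Step 1: render the PDF at 400 dpi and shrink to 25%, one JPEG per page.
    return ["convert", "-quality", "100", "-density", "400",
            "-resize", "25%", pdf, f"{prefix}%d.jpg"]


def tiff_command(jpg):
    # Step 2: uncompressed TrueColor RGB TIFF, scaled to 300% so MODI accepts it.
    return ["convert", "-colorspace", "RGB", "+compress", "-type", "TrueColor",
            "-resize", "300%", jpg, f"{jpg}-new.tiff"]


def join_command(tiffs, output="everything.tiff"):
    # Step 3: combine the per-page TIFFs in numeric page order.
    ordered = sorted(tiffs, key=page_number)
    return ["convert", "-adjoin", *ordered, f"tiff:{output}"]


def run_pipeline(pdf):
    """Run all three steps in the current directory (assumes convert is ImageMagick's)."""
    subprocess.run(split_command(pdf), check=True)
    jpgs = sorted((str(p) for p in Path(".").glob("out*.jpg")), key=page_number)
    for jpg in jpgs:
        subprocess.run(tiff_command(jpg), check=True)
    subprocess.run(join_command([f"{jpg}-new.tiff" for jpg in jpgs]), check=True)
```

Calling run_pipeline("in.pdf") from the folder that holds in.pdf should leave the same out*.jpg, out*.jpg-new.tiff and everything.tiff files as the manual commands. Because the commands are built as argument lists rather than typed into cmd, the percent signs don't need any doubling here.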

And then tadah! everything.tiff has your entire PDF, in a format that MODI can read and OCR pretty well. Of course, I've run into problems converting the OCR'd tiff back into a PDF - maybe I'll reply to this once I find out how to fix that.

Hopefully this helps someone else.