Convert PDF to TIFF large file size

dtran
Posts: 2
Joined: 2018-09-18T08:17:32-07:00

Post by dtran »

My company is moving away from an application called SimpleIndex, which could OCR scanned images. I'm testing ImageMagick with Tesseract OCR as a replacement (hopefully driven from PHP). We start with a PDF that combines several scanned images. I then use this ImageMagick command line to convert the PDF file to a TIF:

magick.exe convert -strip -alpha off -density 300 100492.PDF -depth 2 -quality 100 -compress zip 100492.TIF
  • The original PDF size is 2,573 KB.
  • After ImageMagick, the TIF is 4,219 KB.
Is there anything else I can do to reduce the TIF file size while keeping the preferred density of 300 and without reducing the resolution for tesseract?

For more info, next, I use tesseract to OCR the TIF file and output it as a PDF.
  • The end result is a 7,208 KB PDF.
  • This is more than double the size of the SimpleIndex file which is at 3,589 KB.
NOTE: Oddly enough, I tested another TIF file (same original PDF, but with the depth changed from 2 to 8 and the quality from 100 to the default 92 in ImageMagick), which came out at 6,466 KB. Running tesseract on it produced a PDF of exactly the same size, 7,208 KB.
fmw42
Posts: 25562
Joined: 2007-07-02T17:14:51-07:00
Location: Sunnyvale, California, USA

Re: Convert PDF to TIFF large file size

Post by fmw42 »

From https://www.imagemagick.org/script/comm ... hp#quality where it says:

"For the MIFF and TIFF image formats, quality/10 is the Zip/BZip compression level, which is 0 (worst but fastest compression) to 9 (best but slowest). It has no effect on the image appearance, since the compression is always lossless."

So the maximum useful -quality is probably 90, which is converted to level 9 for ZIP. I am not sure what a -quality of 100 will produce.
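
As a rough illustration of the quality-to-level mapping quoted above, here is a minimal Python sketch that assembles the conversion command with -quality capped at 90. The function name, the cap, and the option order are assumptions made for illustration; the sketch only builds an argument list and says nothing about how ImageMagick itself treats values above 90.

```python
def build_tiff_command(pdf_path, tiff_path, density=300, quality=90):
    """Assemble a magick argument list for PDF -> ZIP-compressed TIFF.

    Hypothetical helper: it only builds the list and does not run anything.
    """
    # Per the quoted documentation, quality/10 is the Zip level (0-9).
    # Capping at 90 is an assumption, since the effect of 100 is unclear.
    quality = max(0, min(int(quality), 90))
    return [
        "magick",
        "-density", str(density),  # placed before the PDF so it is rasterized at this resolution
        pdf_path,
        "-alpha", "off",
        "-strip",
        "-compress", "zip",
        "-quality", str(quality),
        tiff_path,
    ]


if __name__ == "__main__":
    # File names taken from the original post; pass the list to
    # subprocess.run(...) to execute it (requires ImageMagick 7 on PATH).
    print(" ".join(build_tiff_command("100492.PDF", "100492.TIF", quality=100)))
```

Other settings from the original command, such as -depth, can be added to the list before the output file name.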

But the file size of the TIFF, whether compressed or not, is not important, because tesseract will need to decompress it before doing its work. I think the issue is how tesseract creates your PDF and what point size it uses for your text, which may depend on the -density you used to read the original PDF. I am not a Tesseract expert and will defer to others who know more about it than I do.
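
To put rough numbers on the decompressed size, here is a back-of-the-envelope Python sketch. It assumes a US Letter page (8.5 x 11 inches) and a single grayscale channel, neither of which is stated in the thread, and it pads each pixel row to a whole byte. The function name is hypothetical.

```python
def uncompressed_bytes(width_in, height_in, dpi=300, bits_per_pixel=8):
    """Estimate the raw raster size of one single-channel page, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Each row of pixels is padded up to a whole number of bytes.
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


if __name__ == "__main__":
    # Assumed page size: US Letter, 8.5 x 11 inches, at the 300 dpi from the post.
    for depth in (1, 2, 8):
        size = uncompressed_bytes(8.5, 11, dpi=300, bits_per_pixel=depth)
        print(f"{depth}-bit: {size:,} bytes per page")
```

Under those assumptions a page is 2550 x 3300 pixels, so an 8-bit page is about 8.4 MB of raw pixels however small the compressed TIFF on disk is.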
snibgo
Posts: 12159
Joined: 2010-01-23T23:01:33-07:00
Location: England, UK

Re: Convert PDF to TIFF large file size

Post by snibgo »

If the source PDF contains text as editable text, there is no need to rasterize then OCR it. Tools like pdftotext do the job directly.

If it contains raster images, I would use pdfimages to extract the images, then OCR those.
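
A minimal Python sketch of the extraction step, assuming the Poppler build of pdfimages (which accepts -tiff and -png to choose the output format). The helper name and the output directory are hypothetical, and the sketch only builds the argument list.

```python
from pathlib import Path


def build_pdfimages_command(pdf_path, out_dir, fmt="tiff"):
    """Assemble a pdfimages argument list that writes the embedded images to out_dir.

    Hypothetical helper: it only builds the list and does not run anything.
    """
    if fmt not in ("tiff", "png"):
        raise ValueError("fmt must be 'tiff' or 'png'")
    # pdfimages takes an "image root" and appends a counter and extension,
    # so the PDF's base name is reused here as the root.
    image_root = str(Path(out_dir) / Path(pdf_path).stem)
    return ["pdfimages", f"-{fmt}", str(pdf_path), image_root]


if __name__ == "__main__":
    # File name taken from the original post; "pages" is an assumed directory.
    # Pass the list to subprocess.run(...) to execute it.
    print(" ".join(build_pdfimages_command("100492.PDF", "pages")))
```

The extracted images keep the resolution at which they were scanned, so no -density choice is involved in this route.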
snibgo's IM pages: im.snibgo.com
dtran
Posts: 2
Joined: 2018-09-18T08:17:32-07:00

Re: Convert PDF to TIFF large file size

Post by dtran »

snibgo wrote: 2018-09-18T08:58:33-07:00 If the source PDF contains text as editable text, there is no need to rasterize then OCR it. Tools like pdftotext do the job directly.

If it contains raster images, I would use pdfimages to extract the images, then OCR those.
I just checked out pdftotext, but from my quick overview, it only extracts the plain text. I actually need to turn these original PDFs into searchable PDFs.
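
For reference, the searchable-PDF step can be sketched in Python as below, assuming the standard tesseract command line, where a trailing `pdf` config asks for PDF output with a text layer. The helper name, the output base name, and the `eng` language are assumptions for illustration.

```python
def build_tesseract_pdf_command(image_path, output_base, lang="eng"):
    """Assemble a tesseract argument list that writes <output_base>.pdf.

    Hypothetical helper: it only builds the list and does not run anything.
    """
    # Usage pattern: tesseract <image> <outputbase> [-l lang] [configfile]
    # The "pdf" config selects searchable PDF output instead of plain text.
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


if __name__ == "__main__":
    # Input name taken from the original post; the output base is an assumption.
    # Pass the list to subprocess.run(...) to execute it.
    print(" ".join(build_tesseract_pdf_command("100492.TIF", "100492_ocr")))
```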