
Convert PDF to TIFF large file size

Posted: 2018-09-18T08:28:40-07:00
by dtran
My company is moving away from SimpleIndex, an application that can OCR scanned images. I'm testing ImageMagick with Tesseract OCR as a replacement (hopefully driven from PHP). We start with a PDF made up of several scanned images combined together. I then use this ImageMagick command line to convert the PDF to a TIF:

magick.exe convert -strip -alpha off -density 300 100492.PDF -depth 2 -quality 100 -compress zip 100492.TIF
  • The original PDF is 2,573 KB.
  • After ImageMagick, the TIF is 4,219 KB.
Is there anything else I can do to reduce the TIF file size while keeping the preferred density of 300 and without reducing the resolution that tesseract works with?

For more context: next, I use tesseract to OCR the TIF file and output the result as a PDF.
  • The end result is a 7,208 KB PDF.
  • This is more than double the size of the SimpleIndex file which is at 3,589 KB.
NOTE: Oddly enough, I tested another TIF made from the same original PDF, changing -depth from 2 to 8 and -quality from 100 to the default 92 in ImageMagick, which produced a 6,466 KB TIF. Running tesseract on it produced a PDF of exactly the same size, 7,208 KB.

Re: Convert PDF to TIFF large file size

Posted: 2018-09-18T08:58:02-07:00
by fmw42
From https://www.imagemagick.org/script/comm ... hp#quality where it says:

"For the MIFF and TIFF image formats, quality/10 is the Zip/BZip compression level, which is 0 (worst but fastest compression) to 9 (best but slowest). It has no effect on the image appearance, since the compression is always lossless."

So the maximum -quality may be 90 which is converted to 9 for ZIP. I am not sure what -quality of 100 will produce.
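
As a worked example of that mapping, here is a minimal shell sketch. It assumes the documented quality/10 rule quoted above holds as written; the helper name zip_quality_for_level is hypothetical, and the file names in the commented-out command are the ones from the first post.

```shell
# Hypothetical helper: turn a desired Zip level (0-9) into a -quality value,
# following the documented rule that quality/10 is the compression level.
zip_quality_for_level() {
  local level="$1"
  case "$level" in
    [0-9]) echo $(( level * 10 )) ;;
    *) echo "level must be 0-9" >&2; return 1 ;;
  esac
}

# Example use (commented out; needs ImageMagick 7 and the input PDF):
# magick -density 300 100492.PDF -compress zip -quality "$(zip_quality_for_level 9)" 100492.TIF
```

Since the compression is lossless at every level, changing the level should only trade speed against file size, not image content.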

But the file size of the TIFF, whether compressed or not, is not important, because tesseract will need to decompress it before doing its work. I think the issue is how tesseract creates your PDF and what point size it uses for your text, which may depend on the -density you used to read the original PDF. I am not a Tesseract expert and will defer to others who know more about that than I.

Re: Convert PDF to TIFF large file size

Posted: 2018-09-18T08:58:33-07:00
by snibgo
If the source PDF contains text as editable text, there is no need to rasterize then OCR it. Tools like pdftotext do the job directly.

If it contains raster images, I would use pdfimages to extract the images, then OCR those.
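
A rough sketch of that second route, assuming poppler's pdfimages and tesseract are both on the PATH. The function name ocr_embedded_images, the "page" prefix, and the expectation that pdfimages names its TIFF output page-NNN.tif are assumptions for illustration, not something tested against this particular PDF.

```shell
# Sketch: extract the embedded images from a PDF without resampling them,
# then OCR each one. Nothing runs until the function is called.
ocr_embedded_images() {
  local pdf="$1" outdir="$2"
  mkdir -p "$outdir" || return 1

  # -tiff asks pdfimages to write each embedded image as a TIFF
  # named <prefix>-NNN.tif (assumed naming).
  pdfimages -tiff "$pdf" "$outdir/page" || return 1

  local img
  for img in "$outdir"/page-*.tif; do
    [ -e "$img" ] || continue        # glob matched nothing: no images found
    # The trailing "pdf" selects tesseract's PDF output, written to <base>.pdf
    tesseract "$img" "${img%.*}" pdf || return 1
  done
}
```

Calling ocr_embedded_images 100492.PDF out would leave one PDF per extracted image in out/. Because the images are pulled out at their stored resolution, there is no -density setting to choose in this route.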

Re: Convert PDF to TIFF large file size

Posted: 2018-09-18T09:13:14-07:00
by dtran
snibgo wrote: 2018-09-18T08:58:33-07:00 If the source PDF contains text as editable text, there is no need to rasterize then OCR it. Tools like pdftotext do the job directly.

If it contains raster images, I would use pdfimages to extract the images, then OCR those.
I just checked out pdftotext, but from my quick look at it, it seems to extract plain text only. I actually need to turn these original PDFs into searchable PDFs.