Best Settings for Scanned Text Images

jedidiah
Posts: 3
Joined: 2019-11-07T04:06:40-07:00
Authentication code: 1152

Best Settings for Scanned Text Images

Post by jedidiah » 2019-11-07T04:10:51-07:00

Hi everyone,
I have a PDF of a scanned book that is 500 MB. I have been experimenting for several days with different conversion options to make it smaller. I'm aware there is a trade-off between size and quality, but I wondered whether anyone has analysed which settings work best for text scans. I've found a lot of useful articles on here about compressing images in general, but nothing specific to text.
So far the best I have is "convert -density 192 in.pdf -threshold 50% -type bilevel -compress fax out.pdf", which reduces the file size to about 90 MB, but the quality isn't quite good enough. Using a density of 300 gives a considerably larger file.
Are there any better settings anyone might suggest?
Thanks a lot for any help!
Tom

snibgo
Posts: 12299
Joined: 2010-01-23T23:01:33-07:00
Authentication code: 1151
Location: England, UK

Re: Best Settings for Scanned Text Images

Post by snibgo » 2019-11-07T05:24:16-07:00

As your input is a PDF file, I suggest you use pdfimages to extract the images. When IM reads PDF (via Ghostscript), it rasterizes each page, which will resample the embedded raster images, which complicates matters.

So, use pdfimages to extract, then IM to process and reassemble into the new PDF.
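
A minimal sketch of that extract-then-reassemble route, assuming the Poppler build of pdfimages (its -list and -tiff options are not in every pdfimages variant) and that the embedded scans are already 1-bit. The file names, the "page" prefix, the DRY_RUN switch and the run/rebuild_pdf helpers are placeholders made up for illustration:

```shell
#!/bin/sh
# Sketch only: extract the embedded scans, then rebuild the PDF from them.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

rebuild_pdf() {
    in_pdf=$1
    out_pdf=$2
    workdir=$3

    run mkdir -p "$workdir"

    # List each embedded image (dimensions, colour, encoding) before deciding anything.
    run pdfimages -list "$in_pdf"

    # Write TIFF files instead of the default PBM/PPM output.
    run pdfimages -tiff "$in_pdf" "$workdir/page"

    # Reassemble. No -density here: the pixels are used as extracted.
    # Assumes 1-bit input; otherwise add -threshold 50% -type bilevel before -compress.
    # On ImageMagick 7 the command is "magick" rather than "convert".
    run convert "$workdir"/page-*.tif -compress group4 "$out_pdf"
}
```

Running it with DRY_RUN=1 first, e.g. rebuild_pdf in.pdf out.pdf ./pages, shows the commands without touching any files.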

In my limited experience, JPEG is a good compression format for compressed scans. Converting to black/white, with no grayscale, saves space but reduces legibility.
snibgo's IM pages: im.snibgo.com


Re: Best Settings for Scanned Text Images

Post by jedidiah » 2019-11-07T05:37:03-07:00

Thanks, I'll try that first. The images are already in black and white, but yes, this will be an easier way of doing it.


Re: Best Settings for Scanned Text Images

Post by jedidiah » 2019-11-07T13:21:12-07:00

So converting the PDF directly with
convert -density 300 input.pdf -threshold 50% -type bilevel -compress group4 out.pdf
or
convert -density 300 input.pdf -threshold 50% -type bilevel -compress fax out.pdf
produced the same file size of 143 MB (so I'm not sure whether fax and group4 are the same thing here).
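
For scale, a rough back-of-the-envelope check on the density change: the pixel count per page grows with the square of the density, so going from 192 to 300 multiplies it by about 2.44, while the files above grew from about 90 MB to 143 MB, roughly 1.6 times. The pixel_ratio helper is a made-up name for illustration:

```shell
# pixel_ratio OLD_DPI NEW_DPI
# Prints the factor by which the pixel count per page changes.
pixel_ratio() {
    awk -v old_dpi="$1" -v new_dpi="$2" \
        'BEGIN { printf "%.2f\n", (new_dpi / old_dpi) ^ 2 }'
}

pixel_ratio 192 300
```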

Extracting the images with pdfimages first produced 10 GB of images. Then using
convert *.p?m -threshold 50% -type bilevel -compress fax out.pdf
also produced a PDF of 143 MB, though there were very slight differences in how the text appeared compared to the first method.

If anyone has improvements to this I'd be grateful to hear them, but otherwise I'm happy with the 143 MB result.
