ccitt files from pdfimages

Post by muccigrosso »

I frequently use pdfimages to extract the images from PDFs. I use the "-all" switch so that the images are extracted in their original format. For CCITT images I often get pairs of files, one containing the data and the other the parameters, like "test-001.ccitt" and "test-001.params". This is described in the pdfimages man page.

My question is how to deal with these files. Can IM read them? Is there another way to convert them?

Post by fmw42 »

ImageMagick will use Ghostscript to rasterize the PDF. You are better off using pdfimages.

You can list the formats that your ImageMagick installation supports with

Code:

convert -list format

A generic list is at

https://imagemagick.org/script/formats.php

CCITT seems to be a TIFF fax compression format. So perhaps your image is a binary compressed TIFF file. See https://www.leadtools.com/help/sdk/v20/ ... rmats.html

Have you tried opening those files with ImageMagick? Best to simply try.

Post by muccigrosso »

Thanks for the reply.

Indeed, I'm not looking to convert the PDF with IM, just to work with the ccitt files that pdfimages produces. And yes, I've tried to open the .ccitt file directly with IM, and I get the following error:

Code:

magick: no decode delegate for this image format `CCITT' @ error/constitute.c/ReadImage/562.

An example params file reads, in its entirety:

Code:

-4 -P -X 3450 -B -M

According to the man page, this means a Group 4 encoded image, 3,450 px wide, using 0 for black and 1 for white, with bits filled from most to least significant, and with the beginning of each line not aligned on a byte boundary. And yes, the ccitt file looks like binary data.
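
A minimal sketch for pulling the width out of such a params file, in case it is needed by another tool. The function name width_from_params is made up for illustration, and the sketch assumes the file holds a single line of space-separated options with the width following -X, as in the example above.

```shell
# Print the value that follows -X in a params file.
# width_from_params is a hypothetical helper, not part of pdfimages or IM.
width_from_params() {
  # Split the file contents into positional parameters.
  # The "--" keeps leading options such as -4 from being read by "set".
  set -- $(cat "$1")
  while [ $# -gt 0 ]; do
    if [ "$1" = "-X" ]; then
      echo "$2"
      return 0
    fi
    shift
  done
  return 1   # no -X option found
}
```

Usage would be something like: width=$(width_from_params test-001.params)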

I'm not sure what to do with the page you linked.

I do create CCITT-compressed TIFFs with IM all the time.

Version: ImageMagick 7.0.10-10 Q16 x86_64 2020-05-01 https://imagemagick.org on macOS 10.13

Post by fmw42 »

It looks like pdfimages separated the binary image data from its header and made two files instead of one. Sorry, I do not know how to recombine them. But perhaps pdfimages has flags to set the output format, say to TIFF, and perhaps that will keep them together. This is more of a question for the pdfimages developers than for ImageMagick.

Post by snibgo »

@muccigrosso: Can you link to a sample PDF with an embedded CCITT image?

Perhaps the binary ccitt file can be read with IM's raw facility, if you supply the image "-size" and "-depth".

Post by muccigrosso »


Post by snibgo »

It has 11841 bytes, or 94728 bits. If the data were uncompressed at one bit per pixel and the width is 3450 pixels, the height would be only about 27 pixels, which I suppose is wrong. I conclude that alex-038.ccitt is compressed, and IM's raw reader can't read it.
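
The arithmetic behind that estimate, as a quick shell check. It uses only the byte count and width quoted in this thread and assumes one bit per pixel for a raw bilevel image.

```shell
# Figures from the post: file size in bytes, and the width from the params file.
bytes=11841
width=3450

bits=$((bytes * 8))       # one bit per pixel if the data were raw bilevel
rows=$((bits / width))    # integer division: whole rows only

echo "$bits bits / $width px per row = about $rows rows"
# prints: 94728 bits / 3450 px per row = about 27 rows
```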

Post by muccigrosso »

A little hunting on StackOverflow turned up a question that provided a solution: fax2tiff

You can feed the ccitt file to fax2tiff, using the contents of the params file as the options for the command (I'm doing this on the command line) and adding -8 so that the output TIFF is Group 4 compressed. Something like this:

Code:

fax2tiff $(cat extracted_image.params) -8 -o output.tiff extracted_image.ccitt
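
For a whole directory of extracted pairs, the same command can be wrapped in a loop. This is only a sketch: the function name convert_ccitt_dir and the FAX2TIFF override variable are made up for illustration (the override lets you dry-run with "echo"), and it assumes each NAME.ccitt has a matching NAME.params next to it.

```shell
# Convert every .ccitt/.params pair in a directory to a Group 4 TIFF.
# convert_ccitt_dir and FAX2TIFF are hypothetical names for this sketch.
convert_ccitt_dir() {
  dir=${1:-.}
  for ccitt in "$dir"/*.ccitt; do
    [ -e "$ccitt" ] || continue          # no matches: the glob stays literal
    params="${ccitt%.ccitt}.params"
    if [ ! -f "$params" ]; then
      echo "skipping $ccitt: no params file" >&2
      continue
    fi
    # $(cat ...) is unquoted on purpose: the params file holds several
    # options (e.g. -4 -P -X 3450 -B -M) that must be split into words.
    ${FAX2TIFF:-fax2tiff} $(cat "$params") -8 -o "${ccitt%.ccitt}.tiff" "$ccitt"
  done
}
```

Running it as FAX2TIFF=echo convert_ccitt_dir some_dir prints the commands instead of executing them.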

Post by muccigrosso »

A little follow-up.

If I make sure that fax2tiff creates an output file with the same parameters as the input, which in this case means Group 4 compression and forcing the output data to have bits filled from most significant bit (MSB) to least significant bit (LSB), the TIFF is nearly identical to the input ccitt. Only a little data at the start and at the end is different. xxd shows this for the start of one TIFF:

Code:

00000000: 4949 2a00 4a2e 0000 ffff ffff ffff ffff  II*.J...........
00000010: ffff ffff ffff fffe 5811 99c6 192e 642e  ........X.....d.

vs. this for the ccitt:

Code:

00000000: ffff ffff ffff ffff ffff ffff ffff fffe  ................
00000010: 5811 99c6 192e 642e 9603 0cb4 4329 0fd7  X.....d.....C)..

So only the first eight octets are prefixed. Limited testing (on seven files from the same PDF) suggests that only the second half of those differs from file to file. The start is always

Code:

4949 2a00

Here are the first lines from those seven files:

Code:

00000000: 4949 2a00 2c45 0400  II*.,E..
00000000: 4949 2a00 864a 0400  II*..J..
00000000: 4949 2a00 341f 0400  II*.4...
00000000: 4949 2a00 76b6 0000  II*.v...
00000000: 4949 2a00 f298 0000  II*.....
00000000: 4949 2a00 6898 0000  II*.h...
00000000: 4949 2a00 1ac2 0000  II*.....

Appended to the end of the TIFF is more data, which also replaces a small amount from the ccitt. This varies by file (except for the "fax2tiff" at the very end). The appendix starts at that "f0" octet:

Code:

00002e38: ffff ffff ffff ffff ffff ffff fff0 0100  ................
00002e48: 1000 1200 0001 0300 0100 0000 7a0d 0000  ............z...
00002e58: 0101 0300 0100 0000 5114 0000 0201 0300  ........Q.......
00002e68: 0100 0000 0100 0000 0301 0300 0100 0000  ................
00002e78: 0400 0000 0601 0300 0100 0000 0000 0000  ................
00002e88: 0a01 0300 0100 0000 0100 0000 1101 0400  ................
00002e98: 0100 0000 0800 0000 1201 0300 0100 0000  ................
00002ea8: 0100 0000 1501 0300 0100 0000 0100 0000  ................
00002eb8: 1601 0400 0100 0000 ffff ffff 1701 0400  ................
00002ec8: 0100 0000 412e 0000 1a01 0500 0100 0000  ....A...........
00002ed8: 282f 0000 1b01 0500 0100 0000 302f 0000  (/..........0/..
00002ee8: 1c01 0300 0100 0000 0100 0000 2501 0400  ............%...
00002ef8: 0100 0000 0000 0000 2801 0300 0100 0000  ........(.......
00002f08: 0200 0000 2901 0300 0200 0000 0000 0100  ....)...........
00002f18: 3101 0200 0900 0000 382f 0000 0000 0000  1.......8/......
00002f28: cc00 0000 0100 0000 c400 0000 0100 0000  ................
00002f38: 6661 7832 7469 6666 00                   fax2tiff.
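
These bytes are consistent with the standard TIFF layout: "II" marks little-endian byte order, 2a00 is the magic number 42, and the next four bytes are the offset of the first image file directory (IFD). In the first dump above that offset is 0x2e4a, which is where the trailing dump shows an entry count of 0x0012 followed by tag 0x0100 (ImageWidth) with value 0x0d7a = 3450 and tag 0x0101 (ImageLength) with value 0x1451 = 5201. A minimal sketch for reading that offset with od; the function name tiff_ifd_offset is made up, and it assumes a little-endian file without checking the first two bytes.

```shell
# Print the offset of the first IFD of a little-endian ("II") TIFF file.
# tiff_ifd_offset is a hypothetical helper; it does not verify the header.
tiff_ifd_offset() {
  # Bytes 4-7 hold the offset as a 32-bit little-endian integer;
  # od prints them as four unsigned decimal values.
  set -- $(od -An -tu1 -j4 -N4 "$1")
  echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}
```

For the first TIFF above (bytes 4a 2e 00 00) this gives 11850, i.e. 0x2e4a.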

Post by magick »

Try this command:

Code:

convert -size 3450x3450 g4:alex-039.ccitt alex-039.png

Post by muccigrosso »

magick wrote:
2020-05-13T04:04:07-07:00
Try this command:

Code:

convert -size 3450x3450 g4:alex-039.ccitt alex-039.png

This creates a viable PNG, except that the image height is wrong, so the bottom of the image is missing. In this particular case the height is 5201, according to the output TIFF from fax2tiff. IM actually gives an error if I use that as the height:

Code:

magick: Premature EOL at line 5200 of strip 0 (got 0, expected 3450). `Fax4Decode' @ warning/tiff.c/TIFFWarnings/1037.

This is interesting, because if I put fax2tiff in verbose mode it reports a similar warning:

Code:

Fax4Decode: Warning, Premature EOL at line 5200 of strip 4294967295 (got 0, expected 3450).
alex-039.ccitt:
5201 rows in input
0 total bad rows
0 max consecutive bad rows

Post by muccigrosso »

magick wrote:
2020-05-13T04:04:07-07:00
Try this command:

Code:

convert -size 3450x3450 g4:alex-039.ccitt alex-039.png

I've been trying to work with this further so that I can handle everything with IM alone, partly because fax2tiff errs by creating a file with an extra row in it (in this case, 5201 rows instead of 5200).

If I give "convert" an absurdly large number for the image height, it reports a bunch of errors as it tries to read beyond the end of the file (I guess), eventually reaches its limit (128), and outputs a file with the dimensions I gave it. Here are the errors:

Code:

convert: Premature EOL at line 5200 of strip 0 (got 0, expected 3450). `Fax4Decode' @ warning/tiff.c/TIFFWarnings/896.
convert: Premature EOF at line 5200 of strip 0 (x 0). `Fax4Decode' @ warning/tiff.c/TIFFWarnings/896.

If I use "magick" instead, it outputs just one error message but produces the same file:

Code:

magick: Premature EOL at line 5200 of strip 0 (got 0, expected 3450). `Fax4Decode' @ warning/tiff.c/TIFFWarnings/1037

Is there a way to make either command stop the first time it hits the error, so that it does not produce an image of the requested height?

Alternatively, I suppose I could process the error output from something like the following, where I give an absurdly large height (100x the width):

Code:

magick -size 3450x345000 g4:alex-039.ccitt info:

which starts with

Code:

magick: Premature EOL at line 5200

giving me the correct height.
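
One way that idea could be scripted is sketched below. The function name height_from_warning is made up, and the sketch assumes the warning text keeps the "Premature EOL at line N" wording shown above and that the width is already known from the params file.

```shell
# Read warning text on stdin and print the row number from the first
# "Premature EOL/EOF at line N" message found.
# height_from_warning is a hypothetical helper for this sketch.
height_from_warning() {
  sed -n 's/.*Premature EO[LF] at line \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Usage sketch (not run here; warnings go to stderr, hence 2>&1):
#   height=$(magick -size 3450x345000 g4:alex-039.ccitt info: 2>&1 | height_from_warning)
#   magick -size "3450x${height}" g4:alex-039.ccitt alex-039.png
```

Feeding the first pass to info: avoids writing the oversized image just to learn the height.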
