scanned paper pdfs and original image extraction.

cyclondude
Posts: 6
Joined: 2013-03-14T12:14:25-07:00

Post by cyclondude »

Pixel wizards of open source. I summon thee!

Although my mortal quarrels are surely nothing to your dances with the four elements, I seek you to lift my troubles.

I have used IM to convert scanned paper document PDFs to PNGs with good results, but I am curious whether there is a way to extract each image at its original resolution and pixel values. I'm including an example below in which I believe the document has one image embedded per page. My main concern is to preserve the original quality of the images embedded in the PDF as much as possible, without resizing or otherwise altering each page. The ultimate goal of this project is to preprocess the original images for OCR. Thanks.

[pdf] http://www.muni.org/Departments/finance ... ection.pdf

What color of sorcery is protecting these legendary manuscripts?
snibgo
Posts: 12159
Joined: 2010-01-23T23:01:33-07:00
Location: England, UK

Re: scanned paper pdfs and original image extraction.

Post by snibgo »

The required incantation for this file is "-density 300".

convert -density 300 "2011 CAFR Introductory Section.pdf" ma.png
snibgo's IM pages: im.snibgo.com
whugemann
Posts: 289
Joined: 2011-03-28T07:11:31-07:00
Location: Münster, Germany 52°N,7.6°E

Re: scanned paper pdfs and original image extraction.

Post by whugemann »

There is special software that extracts the single pages from a PDF scan. Under Windows, this would be xpdf, more specifically its pdfimages tool, which extracts the single pages from your document as .pbm or .ppm files. The DOS command would be:
pdfimages your.pdf trunc

and would spit out
trunc-001.pbm
trunc-002.pbm
...
...
trunc-013.pbm

These in turn could be bulk-converted into Group 4-coded TIFFs via IM.
The x- and y-resolutions of your scans are, however, not the same. Thus snibgo's suggestion might be better.
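
The bulk conversion mentioned above could look roughly like the sketch below. It assumes a POSIX shell, that ImageMagick's convert is on the PATH, and that the pages are bilevel .pbm files (Group 4 is a bilevel compression, so colour .ppm pages would be better handled separately). The function name pbm_to_g4_tiff is made up for illustration.

```shell
#!/bin/sh
# Convert every bilevel PREFIX-NNN.pbm page in the current directory into a
# Group 4 compressed TIFF next to it. "pbm_to_g4_tiff" is a hypothetical
# helper name, not part of ImageMagick or xpdf.
pbm_to_g4_tiff() {
    prefix=$1
    for page in "$prefix"-*.pbm; do
        # Skip the unexpanded pattern when no file matches.
        [ -e "$page" ] || continue
        convert "$page" -compress Group4 "${page%.pbm}.tif"
    done
}

# Example: pbm_to_g4_tiff trunc
```

Calling it with the prefix passed to pdfimages (trunc in the example above) would leave trunc-001.tif, trunc-002.tif and so on beside the originals.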
Wolfgang Hugemann
anthony
Posts: 8883
Joined: 2004-05-31T19:27:03-07:00
Location: Brisbane, Australia

Re: scanned paper pdfs and original image extraction.

Post by anthony »

All of which uses 'ghostscript' to generate the pages... including IM.
Anthony Thyssen -- Webmaster for ImageMagick Example Pages
https://imagemagick.org/Usage/
cyclondude
Posts: 6
Joined: 2013-03-14T12:14:25-07:00

Re: scanned paper pdfs and original image extraction.

Post by cyclondude »

Thank you humble warlocks. I pay homage to your wisdom.

I used pdfimages to convert from PDF to .ppm and .pbm files, and then converted these files to .tif. They look very clean and are working for what I was hoping for with OCR, but if you don't mind, I have a couple more questions to complete my understanding.

1. The images I extracted are coming out in a compressed kind of aspect ratio that I would not expect from a scanned paper document. Why is this? Also, I notice that they are rotated 90 degrees. Are these just likely artifacts of the scanner software?

2. Why does
> pdfimages myfile.pdf output
produce .ppm files for the first four pages and .pbm files for the rest, for the file I linked to above? Both work fine for my use, but why is this happening?

3. In your opinion is ghostscript a practical thing to learn or are there equivalent wrapper functions in imagemagick?

Thanks again seriously.
snibgo
Posts: 12159
Joined: 2010-01-23T23:01:33-07:00
Location: England, UK

Re: scanned paper pdfs and original image extraction.

Post by snibgo »

I know nothing about pdfimages.

In my limited use of PDF or PS documents, using IM as a wrapper to gs is sufficient for my needs. If it wasn't, I would learn gs.
snibgo's IM pages: im.snibgo.com
cyclondude
Posts: 6
Joined: 2013-03-14T12:14:25-07:00

Re: scanned paper pdfs and original image extraction.

Post by cyclondude »

What can I do to this image to improve tesseract-ocr transcription quality?

http://s21.postimg.org/4wwardkk7/title.png

There are many images like it that have 1-5 words on it with the same pixel dimensions. Do they need to be larger? It is only 43 pixels tall.

Thanks.
fmw42
Posts: 25562
Joined: 2007-07-02T17:14:51-07:00
Location: Sunnyvale, California, USA

Re: scanned paper pdfs and original image extraction.

Post by fmw42 »

cyclondude wrote:What can I do to this image to improve tesseract-ocr transcription quality?

http://s21.postimg.org/4wwardkk7/title.png

There are many images like it that have 1-5 words on it with the same pixel dimensions. Do they need to be larger? It is only 43 pixels tall.

Thanks.

What command did you use to get this image? Did it come from converting a pdf? If so, then you need to use a higher density to read in the pdf.

convert -density 288 image.pdf -resize 25% image.png

This is supersampling so that the output quality is better, but the size remains the same. If you want a larger output image, then leave off the -resize 25% and just use -density to some value that looks better for you.
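
To make the size arithmetic concrete, here is a small sketch. It assumes ImageMagick's default density of 72 dpi and an 8.5 inch wide page (an illustrative value, not measured from any file in this thread). The helper name rendered_px is made up, and the real rasteriser may round slightly differently.

```shell
#!/bin/sh
# rendered_px LENGTH_INCHES DENSITY_DPI RESIZE_PERCENT
# Approximate pixel length of a page side after rasterising at DENSITY_DPI
# and then resizing by RESIZE_PERCENT. Hypothetical helper, arithmetic only.
rendered_px() {
    awk -v in_len="$1" -v dpi="$2" -v pct="$3" \
        'BEGIN { printf "%.0f\n", in_len * dpi * pct / 100 }'
}

rendered_px 8.5 72 100    # default density, no resize
rendered_px 8.5 288 25    # 4x supersampling, then shrink to a quarter
rendered_px 8.5 600 50    # 2x supersampling relative to 300 dpi
rendered_px 8.5 300 100   # rendering at 300 dpi directly
```

The first two calls print the same width, as do the last two: supersampling changes how each output pixel is computed, not how many pixels there are.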
cyclondude
Posts: 6
Joined: 2013-03-14T12:14:25-07:00

Re: scanned paper pdfs and original image extraction.

Post by cyclondude »

I used:

convert -density 300 mypdf.pdf output.png

After trying "-density 600 mypdf.pdf -resize 50%" it seems to work better. These two shouldn't give exactly the same image, right? The -density 600 with -resize 50% is resampling at a higher quality. Is that correct? Thanks.
snibgo
Posts: 12159
Joined: 2010-01-23T23:01:33-07:00
Authentication code: 1151
Location: England, UK

Re: scanned paper pdfs and original image extraction.

Post by snibgo »

If the PDF came from a scanner, I reckon the optimum "-density" setting is the same as the scanner resolution, which is often a multiple of 150 dpi. Then there is no need to resize, because that discards data that might be useful.
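
If the scanner resolution is not known, it can be estimated from the embedded image's pixel size and the page size. The sketch below assumes you already have the pixel dimensions (for example from running identify on a file extracted by pdfimages) and the page size in PostScript points, 72 per inch (for example as reported by a tool such as pdfinfo). The numbers are illustrative US Letter values, not taken from the PDF in this thread, and effective_dpi is a made-up helper name.

```shell
#!/bin/sh
# effective_dpi PIXELS POINTS
# Estimate the scan resolution along one axis: pixels divided by the page
# length in inches, where one inch is 72 PostScript points.
effective_dpi() {
    awk -v px="$1" -v pt="$2" 'BEGIN { printf "%.0f\n", px * 72 / pt }'
}

effective_dpi 2550 612    # horizontal: image width in px, page width in pt
effective_dpi 3300 792    # vertical: image height in px, page height in pt
```

If the two results differ, the scan has different x and y resolutions, and a single -density value cannot reproduce the original pixel grid on both axes.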
snibgo's IM pages: im.snibgo.com
DominiqueMichel
Posts: 4
Joined: 2013-04-29T03:24:47-07:00

Re: scanned paper pdfs and original image extraction.

Post by DominiqueMichel »

cyclondude wrote:1. The images I extracted are coming out in a compressed kind of aspect ratio that I would not expect from a scanned paper document. Why is this? Also, I notice that they are rotated 90 degrees. Are these just likely artifacts of the scanner software?
More likely from the PDF format. You can even have PDF files where the images are broken into multiple parts, and to get the images you have to extract the whole pages and process them later with GIMP or similar software.