scanned paper pdfs and original image extraction.

Post by **cyclondude** » 2013-04-08T09:25:27-07:00

Pixel wizards of open source. I summon thee!

Although my mortal quarrels are surely nothing to your dances with the four elements, I seek you to lift my troubles.

I have used IM to convert scanned paper document PDFs to PNGs with good results, but I am curious whether there is a method to extract each image at its original resolution and with its original pixel values. I'm including an example below where I believe the document has one image embedded for each page. My big concern is to maintain the original quality of the images embedded in the PDF as much as possible, without resizing or otherwise changing each page. The ultimate aim of this project is to preprocess the original images for OCR. Thanks.

[pdf] http://www.muni.org/Departments/finance ... ection.pdf

What color of sorcery is protecting these legendary manuscripts?

Post by **snibgo** » 2013-04-08T10:00:52-07:00

The required incantation for this file is "-density 300".

convert -density 300 "2011 CAFR Introductory Section.pdf" ma.png

Post by **whugemann** » 2013-04-08T10:47:48-07:00

There is special software that extracts the individual page images from a PDF scan. Under Windows, this would be xpdf, more specifically pdfimages, which extracts the images from your document as .ppm or .pbm files. The DOS command would be:
pdfimages your.pdf trunc

and would spit out
trunc-001.pbm
trunc-002.pbm
...
...
trunc-013.pbm

These could in turn be bulk-converted into Group 4 encoded TIFFs via IM.
However, the x- and y-resolutions of your scans are not the same, so snibgo's suggestion might be better.
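
The bulk conversion to Group 4 TIFFs mentioned above might look something like the sketch below. It assumes a POSIX shell (on Windows, something like Cygwin), that ImageMagick's `convert` is on the path, and that the extracted files are 1-bit `.pbm` images. `pbm_to_group4` is just an illustrative name, not part of IM or xpdf.

```shell
# Hypothetical helper: convert each 1-bit PBM given as an argument into a
# Group 4 (CCITT fax) compressed TIFF next to it,
# e.g. trunc-001.pbm -> trunc-001.tif.
pbm_to_group4() {
    for f in "$@"; do
        # Skip patterns that matched nothing (the shell passes them through literally).
        [ -e "$f" ] || continue
        convert "$f" -compress Group4 "${f%.pbm}.tif"
    done
}

# Usage, in the directory where pdfimages wrote its output:
#   pbm_to_group4 trunc-*.pbm
```

Group 4 is a bilevel-only encoding, so any colour .ppm pages would need a different `-compress` setting (LZW, for instance) if their colour is to be kept.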

Post by **anthony** » 2013-04-08T16:30:24-07:00

All of which uses 'ghostscript' to generate the pages... including IM.

Post by **cyclondude** » 2013-04-09T15:47:04-07:00

Thank you humble warlocks. I pay homage to your wisdom.

I used pdfimages to convert from PDF to .ppm and .pbm files, and then converted these files to .tif. They look very clean and are working for what I was hoping for with OCR, but if you don't mind, I have a couple more questions to complete my understanding.

1. The images I extracted are coming out in a compressed kind of aspect ratio that I would not expect from a scanned paper document. Why is this? Also, I notice that they are rotated 90 degrees. Are these just likely artifacts of the scanner software?

2. Why does
> pdfimages myfile.pdf output
produce .ppm files for the first four pages and .pbm files for the rest, for the file I linked to above? Both are working fine for my use, but why is this happening?

3. In your opinion, is Ghostscript a practical thing to learn, or are there equivalent wrapper functions in ImageMagick?

Thanks again seriously.

Post by **snibgo** » 2013-04-09T16:58:02-07:00

I know nothing about pdfimages.

In my limited use of PDF or PS documents, using IM as a wrapper to gs is sufficient for my needs. If it wasn't, I would learn gs.

Post by **cyclondude** » 2013-04-23T10:31:07-07:00

What can I do to this image to improve tesseract-ocr transcription quality?

http://s21.postimg.org/4wwardkk7/title.png

There are many images like it that have 1-5 words on them, all with the same pixel dimensions. Do they need to be larger? It is only 43 pixels tall.

Thanks.

Post by **fmw42** » 2013-04-23T10:35:40-07:00

cyclondude wrote:
> What can I do to this image to improve tesseract-ocr transcription quality?

What command did you use to get this image? Did it come from converting a pdf? If so, then you need to use a higher density to read in the pdf.

convert -density 288 image.pdf -resize 25% image.png

This is supersampling: the PDF is rendered at four times IM's default density of 72 dpi and then shrunk back down, so the output quality is better but the pixel dimensions stay the same as a default render. If you want a larger output image, leave off the -resize 25% and just set -density to a value that looks better to you.
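
The arithmetic behind this can be sketched in plain shell. The function names are made up for illustration, and the 612 pt page width (US Letter, 8.5 in) is an assumed example rather than a measurement of any particular PDF.

```shell
# Pixels produced when a page dimension given in points (1/72 in)
# is rasterised at a given -density (dpi).
rendered_pixels() {    # usage: rendered_pixels POINTS DENSITY
    echo $(( $1 * $2 / 72 ))
}

# Density that a "-density D -resize P%" pair is equivalent to,
# in terms of output size only.
effective_density() {  # usage: effective_density DENSITY PERCENT
    echo $(( $1 * $2 / 100 ))
}

rendered_pixels 612 288     # page width rendered at -density 288: prints 2448
effective_density 288 25    # after -resize 25%: prints 72
effective_density 600 50    # prints 300
```

So a `-density 600 -resize 50%` run ends up with the same dimensions as a plain `-density 300` run, but not the same pixel values, because each output pixel is computed from several pixels of a finer rendering.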

Post by **cyclondude** » 2013-04-23T11:49:36-07:00

I used:

convert -density 300 mypdf.pdf output.png

After trying -density 600 mypdf.pdf -resize 50%, it seems to work better. These shouldn't give the exact same image, right? The -density 600 and -resize 50% are resampling at a greater quality. Is that correct? Thanks.

Post by **snibgo** » 2013-04-23T13:05:42-07:00

If the PDF came from a scanner, I reckon the optimum "-density" setting is the same as the scanner resolution, which is often a multiple of 150 dpi. Then there is no need to resize, because that discards data that might be useful.
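
One way to estimate that scanner resolution is to divide the pixel size of an embedded page image by the page size. The sketch below assumes you have read the pixel dimensions with `identify` on an extracted image and the page size in points from `pdfinfo`. `scan_dpi` is an illustrative name, and the numbers are invented examples, not taken from the PDF in this thread.

```shell
# Estimate scan resolution (dpi) along one axis, rounded to the nearest integer.
# PIXELS: image size in pixels; POINTS: page size in points (1/72 in).
scan_dpi() {   # usage: scan_dpi PIXELS POINTS
    echo $(( ($1 * 72 + $2 / 2) / $2 ))
}

# Invented example: a 2550x3300 pixel image filling a 612x792 pt (US Letter) page.
scan_dpi 2550 612   # horizontal: prints 300
scan_dpi 3300 792   # vertical:   prints 300
```

If the two axes come out different, a single number cannot match both; `-density` also accepts a WIDTHxHEIGHT form, which may be worth trying in that case.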

Post by **DominiqueMichel** » 2013-04-29T04:05:33-07:00

cyclondude wrote:
> 1. The images I extracted are coming out in a compressed kind of aspect ratio that I would not expect from a scanned paper document. Why is this? Also, I notice that they are rotated 90 degrees. Are these just likely artifacts of the scanner software?

More likely from the PDF format. You can even have PDF files where the images are broken into multiple parts, and to get the images you have to extract the whole pages and process them later with GIMP or similar software.

Legacy ImageMagick Discussions Archive
