Basic settings for Tesseract

Vico
Posts: 1
Joined: 2017-07-18T01:06:49-07:00

Post by Vico »

Hello guys,

I'm trying to extract the text from a scanned PDF. I can't share the full PDF since it's a legal document, but here is a sample.
[Image: sample from the scanned PDF]

My goal here is to improve the Tesseract results, because currently I'm getting the following text:
PRIVACY NOTE: Section 31B of the Rea! PropertyAct 1900 (RF Act} authorises the Registrar General to coilectthe informétion required
by this form for the estabiishment and maintenanée of the Real Property Act Register. Section 968 RP Act requires that

the Register is made available to any person for search upon payment of a fee, if any.
I've been trying to improve the PDF quality using ImageMagick. I'm doing it manually for now, but I'm looking for general settings that can be applied to all PDFs. Since this is part of a software product, I won't be able to adjust the settings each time a PDF is uploaded.

One thing I've tried is using convert with the -lat option to remove small imperfections, like this:


 convert -density 300 -monochrome -lat 15x15+10% in.pdf out.tif
[Image: result of the command above]

The imperfections are removed, but Tesseract doesn't detect anything now. I thought the cleaned image would be easier for it to read, but it isn't.

I've seen the great textcleaner tool, which has a lot of options, but as I said before I can't really afford to change the settings for each PDF.

We can assume all PDFs will have the same issues, so is there any "automatic" tool that will try to fix a PDF without being told exactly what to do?

Thanks in advance


Edit :
As requested here is my IM version:
ImageMagick 7.0.5-4 Q16 x86_64 2017-03-25

The platform I use for my tests is macOS Sierra 10.12.4.
Last edited by Vico on 2017-07-18T09:24:08-07:00, edited 1 time in total.
fmw42
Posts: 25562
Joined: 2007-07-02T17:14:51-07:00
Location: Sunnyvale, California, USA

Re: Basic settings for Tesseract

Post by fmw42 »

Please always provide your IM version and platform when asking questions, since syntax may vary.

Try it without -monochrome, and read the input right after setting the density. You might also try larger densities; if you do, increase the 15x15 window arguments to -lat as well.


convert -density 300 in.pdf -negate -lat 15x15+10% -negate out.tif
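
If you do raise the density, one way to keep the -lat window in step is to scale it with the density. The following is only a rough sketch, not a tested recipe: the helper names lat_window and build_cmd are made up for illustration, and the linear scaling from 15x15 at 300 dpi is an assumption you would need to tune against your own scans. The script only prints the commands, so you can inspect them before running anything.

```shell
#!/bin/sh
# Hypothetical helper: scale the -lat window with the density.
# Assumption: linear scaling from the 15x15 window used at 300 dpi.
lat_window() {
    density=$1
    size=$(( density * 15 / 300 ))
    echo "${size}x${size}+10%"
}

# Hypothetical helper: print (do not run) the preprocessing command.
build_cmd() {
    density=$1
    in=$2
    out=$3
    echo "convert -density $density $in -negate -lat $(lat_window "$density") -negate $out"
}

build_cmd 300 in.pdf out.tif
build_cmd 600 in.pdf out.tif
```

To actually run one of the printed commands you could pipe it to sh, for example `build_cmd 600 in.pdf out.tif | sh`, and then hand the result to Tesseract with something like `tesseract out.tif out`, which writes the recognized text to out.txt. Whether 600 dpi with a 30x30 window reads better than 300 dpi with 15x15 is something to check on a few representative PDFs.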