
decrease text boldness while not losing accuracy in convert command to improve tesseract ocr

Posted: 2020-03-03T08:50:01-07:00
by isfando
I need to OCR PDFs of financial statements. Right now I am using a convert command, but in some cases I am not satisfied with the output. I want to decrease the boldness of the text as far as possible without losing accuracy. Also, the whole picture looks a bit blurry after conversion.

*******************************CURRENT APPROACH****************************************

1) Input image (for the sake of this question I am applying the code to a snapshot of the PDF)
https://drive.google.com/open?id=1bGwzz ... C-qNjM4DlI

2)
(The following code is also used to remove long horizontal lines, but not minus signs, before feeding the image to Tesseract OCR)

Code:

convert -density 300 before_convert.jpg -depth 8 -strip -background white -alpha off -threshold 70%% ^
  ( +clone ^
    -threshold 50%% ^
    -write mpr:ORG ^
    +delete ^
  ^) ^
  ( mpr:ORG ^
    -negate ^
    -morphology Erode rectangle:200x1 ^
    -mask mpr:ORG -morphology Dilate rectangle:200x1 ^
    +mask ^
    -morphology Dilate Disk:3 ^
  ^) ^
  -compose Lighten -composite ^
  ( +clone ^
    -morphology HMT "1x4:1,0,0,1" ^
  ^) ^
  -compose Lighten -composite ^
  ( +clone ^
    -morphology HMT "1x3:1,0,1" ^
  ^) ^
  -compose Lighten -composite ^
  ( +clone ^
    -morphology HMT "3x1:1,0,1" ^
  ^) ^
  -compose Lighten -composite ^
  -blur 0x0.5 after_convert.png

3) Result image
https://drive.google.com/open?id=1S_ItC ... SVkzfn_qyG

Re: decrease text boldness while not losing accuracy in convert command

Posted: 2020-03-03T10:51:43-07:00
by fmw42
Try this:

Code:

convert before_convert.JPG -colorspace gray -blur 0x1 -level 0x50% -threshold 50% result.png

Re: decrease text boldness while not losing accuracy in convert command to improve tesseract ocr

Posted: 2020-03-03T12:01:06-07:00
by isfando
A thousand thanks, Fred.

I have updated the code in the question to show the original code, which removes horizontal lines but not minus signs.


I have a few questions, because my knowledge of ImageMagick is very superficial. If you have time to answer, I would be thankful.

1) Why are you not using the parameters below, which I used in my code? (The scanned financial statement PDFs that I work with often contain noise.)
-density 300 -strip -background white -alpha off

2) Is it a good idea to merge your script with mine, or should I write two scripts and check which one gives better results? I can check the quality of the results by comparing the sum of all the numbers in a column with the total printed in the financial statement.

3) What will your command do if it encounters text that is already thin? If you don't mind, could you explain your command so that I can tune it in the future? (I will also look up the meaning of the options you used myself.)

4) Processing power and time consumption are not important, but the accuracy of the numbers is, because if they do not add up to the column total the approach will be discarded. Is your approach the best we can do, in your opinion?
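The column-sum check described in questions 2 and 4 could be sketched roughly as follows. This is only an illustration, not part of either convert command: the helper names `parse_amount` and `column_adds_up` are hypothetical, and the parsing rules (commas as thousands separators, parentheses for negative figures) are assumptions about how the statements are formatted.

```python
from decimal import Decimal


def parse_amount(text):
    """Parse an OCR'd figure such as '1,234.50', '(200)' or '-75' into a Decimal.

    Assumes commas are thousands separators and that parentheses mark a
    negative amount; adjust to the layout of the actual statements.
    """
    s = text.strip().replace(",", "")
    negative = s.startswith("(") and s.endswith(")")
    if negative:
        s = s[1:-1]
    value = Decimal(s)  # raises decimal.InvalidOperation on a misread cell
    return -value if negative else value


def column_adds_up(cells, total_cell):
    """Return True when the OCR'd cells sum exactly to the OCR'd total."""
    return sum(map(parse_amount, cells), Decimal(0)) == parse_amount(total_cell)


# Example: three line items and the total printed below them.
print(column_adds_up(["1,200", "(200)", "-75.50"], "924.50"))
```

Running both convert variants over the same pages and counting how many columns pass this check gives one way to compare them. Decimal is used instead of float so that the equality test is exact.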

Re: decrease text boldness while not losing accuracy in convert command to improve tesseract ocr

Posted: 2020-03-03T12:35:38-07:00
by fmw42
1) You did not post a PDF so I just tried to process what you provided. Your command does not deal with any noise.

2) Up to you. I cannot say for sure that my method will work for all your images. You will need to test. If you are using PDF files, then you will still need your own processing to get a good starting quality. The blur in my command will need adjusting if your images are larger (for example, if you use -density 300 and what you posted was not already rendered at that density).

3) My command will still thin any image you provide. All it does is blur the image, then use -level to remove some of that blur so that the strokes come out thinner, and then use -threshold to turn the remaining blur (gray values that are not already white or black) into binary.
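The arithmetic behind the thinning can be illustrated with a small pure-Python sketch on a single row of pixels. It is a simplified model under stated assumptions, not ImageMagick's implementation: `box_blur` is a crude stand-in for the Gaussian -blur, and `level` and `threshold` only mimic `-level 0x50%` and `-threshold 50%` on values scaled to the range 0 to 1.

```python
def level(v, black=0.0, white=0.5):
    """Linear stretch like -level: black maps to 0, white maps to 1, clamped."""
    return min(1.0, max(0.0, (v - black) / (white - black)))


def threshold(v, t=0.5):
    """Like -threshold: values above t become white (1), the rest black (0)."""
    return 1 if v > t else 0


def box_blur(row, radius=1):
    """Crude stand-in for -blur: average each pixel with its neighbours."""
    out = []
    for i in range(len(row)):
        window = row[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


# One row across a stroke: 1.0 = white paper, 0.0 = black ink (5 px wide).
row = [1.0] * 4 + [0.0] * 5 + [1.0] * 4

blurred = box_blur(row)
thinned = [threshold(level(v)) for v in blurred]

print("stroke width before:", row.count(0.0))
print("stroke width after: ", thinned.count(0))
```

In this model the 5 px stroke comes out 3 px wide. Stretching with a white point of 50% and then thresholding at 50% keeps a pixel black only if its blurred value is at most 25%, so the gray edge pixels fall on the white side. Lowering the white point in -level should thin more, and raising it should thin less.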

Re: decrease text boldness while not losing accuracy in convert command to improve tesseract ocr

Posted: 2020-03-04T08:28:46-07:00
by isfando
I have adapted your approach to a PDF.

*******************************CURRENT APPROACH****************************************

1) Input PDF
https://drive.google.com/open?id=1nFpEQ ... nO9b6S4UOT

2)

Code:

convert -density 300 pdfinputsample.pdf -strip -background white -alpha off -colorspace gray -blur 0x0.2 -level 0x50% -threshold 50% result.png

3) Result image
https://drive.google.com/open?id=1gv5XL ... pyZa0kS0Zx




****************************My Questions******************************************
1) Are these parameters the best choice for thinning the numbers while staying safe on Tesseract OCR accuracy? The text in other PDFs may be less bold. I don't need a definite answer, just your best guess based on your experience.

Results of the above convert command for different kinds of PDFs:

sample2
input https://drive.google.com/open?id=12U4a6 ... uJ9YGorwxu
output https://drive.google.com/open?id=1POOwl ... WUq7X-fsXA

sample3
input https://drive.google.com/open?id=1BFjok ... NguFsrvHAi
output https://drive.google.com/open?id=1ctsdq ... pSSaEOGuah

sample4
input https://drive.google.com/open?id=1-Lda_ ... 7r7p1DstYM
output https://drive.google.com/open?id=1zY8-j ... IMBaDvw6Nh

sample5
input https://drive.google.com/open?id=1xFIkF ... yCg0tChECO
output https://drive.google.com/open?id=1DFqU2 ... KR9Yj0zFem



2) How to do the reverse of this whole approach by bolding thin text BUT making sure the numbers don't touch each other.

Re: decrease text boldness while not losing accuracy in convert command to improve tesseract ocr

Posted: 2020-03-04T10:42:39-07:00
by fmw42
2) How to do the reverse of this whole approach by bolding thin text BUT making sure the numbers don't touch each other.
You can use morphology erode to thicken or do the blur and -threshold 0. But I know of no way to prevent the letters from touching.
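Why thickening makes characters touch can be seen in a small pure-Python sketch on a single row of pixels. It only models the effect and is not ImageMagick code: `thicken` is a hypothetical stand-in for what Erode does to dark text on a white background, and `black_runs` simply counts separate dark runs as a rough proxy for separate glyphs.

```python
def thicken(row, radius=1):
    """Grow black (0) runs by `radius` pixels on each side.

    Models Erode applied to dark text on white: a pixel turns black if any
    pixel within `radius` of it is black.
    """
    return [
        0 if 0 in row[max(0, i - radius): i + radius + 1] else 1
        for i in range(len(row))
    ]


def black_runs(row):
    """Count separate runs of black pixels (a stand-in for separate glyphs)."""
    return sum(
        1 for i, v in enumerate(row) if v == 0 and (i == 0 or row[i - 1] == 1)
    )


# Two 2 px strokes separated by a 2 px gap, then by a 3 px gap (1 = white).
gap_2px = [1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
gap_3px = [1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1]

print(black_runs(gap_2px), "->", black_runs(thicken(gap_2px)))
print(black_runs(gap_3px), "->", black_runs(thicken(gap_3px)))
```

Each stroke grows by `radius` pixels per side, so in this model two strokes merge once the gap between them is no wider than twice the radius. Counting runs (or connected components on a full image) before and after thickening can at least detect that characters have merged, even though it does not prevent it.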