Remove horizontal summation lines but keep a minus

isfando
Posts: 15
Joined: 2018-09-21T02:54:27-07:00
Authentication code: 1152

Post by isfando » 2018-09-21T03:30:35-07:00

Hi

I need to OCR PDFs of financial statements that have horizontal lines before the summation. The lines reduce the accuracy of the digits I am parsing, so I need to remove the lines but keep the minus sign before a digit. The digits are used for calculation after parsing, so accuracy is crucial. For example, a PDF might contain:

220
-30
________
190
________


I want the resulting image from the PDF to look like this:

220
-30

190

snibgo
Posts: 10708
Joined: 2010-01-23T23:01:33-07:00
Authentication code: 1151
Location: England, UK

Post by snibgo » 2018-09-21T04:07:53-07:00

See viewtopic.php?f=1&t=22338&p=129166#p129154 where I show a command that turns white all black lines that are at least 50 pixels wide.
snibgo's IM pages: im.snibgo.com

Post by isfando » 2018-09-21T05:05:19-07:00

I applied your command, but it doesn't turn the lines white. It makes them grey.

Before Applying your command
https://drive.google.com/open?id=1EN3Zj ... 4lPwESQT8K

After Applying your command
https://drive.google.com/open?id=1vXmjC ... hPmfveDuTG

I am using convert on Windows. The version information is:

Version: ImageMagick 7.0.7-4 Q16 x64 2017-09-23 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2015 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Visual C++: 180040629
Features: Cipher DPC Modules OpenMP
Delegates (built-in): bzlib cairo flif freetype jng jp2 jpeg lcms lqr openexr pangocairo png ps rsvg tiff webp xml zlib

Post by snibgo » 2018-09-21T05:37:13-07:00

My command turns black lines into white, but your image doesn't have black lines. You could process it to make the lines black, then remove them, and use the pixels that have changed to paint white over your input image.

However, that image is low quality with small characters. I doubt that you will get reliable OCR from it.
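
The idea of making the lines black, removing them, and painting white over the input wherever pixels changed can be sketched in a few lines of Python. This is only an illustration of the logic, not the ImageMagick command: nested lists stand in for a greyscale image, a simple run-length scan replaces the morphology, and the names `MIN_RUN`, `threshold`, `remove_long_runs` and `paint_changed_white` are made up for the example.

```python
# Sketch only: nested lists stand in for a greyscale image (0 = black, 255 = white).
# MIN_RUN and the threshold level of 128 are illustrative assumptions.

MIN_RUN = 5  # a dark horizontal run at least this wide is treated as a line


def threshold(img, level=128):
    """Return a pure black/white copy: pixels below `level` become 0, others 255."""
    return [[0 if p < level else 255 for p in row] for row in img]


def remove_long_runs(bw, min_run=MIN_RUN):
    """Turn white every horizontal run of black pixels at least `min_run` wide."""
    out = [row[:] for row in bw]
    for row in out:
        x = 0
        while x < len(row):
            if row[x] == 0:
                start = x
                while x < len(row) and row[x] == 0:
                    x += 1
                if x - start >= min_run:
                    row[start:x] = [255] * (x - start)
            else:
                x += 1
    return out


def paint_changed_white(original, before, after):
    """Paint white on `original` wherever `before` and `after` differ."""
    return [
        [255 if b != a else o for o, b, a in zip(orow, brow, arow)]
        for orow, brow, arow in zip(original, before, after)
    ]


grey = [
    [255, 100, 100, 255, 255, 255, 255],  # short dark run: a minus sign
    [255, 255, 255, 255, 255, 255, 255],
    [90, 90, 90, 90, 90, 90, 255],        # long grey run: a summation line
]
bw = threshold(grey)              # make the grey lines black
cleaned = remove_long_runs(bw)    # remove the long black lines
result = paint_changed_white(grey, bw, cleaned)
```

The minus sign keeps its original grey values because only the pixels that changed between `bw` and `cleaned` are painted over.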

Post by isfando » 2018-09-21T06:19:31-07:00

Sorry, the earlier pictures were taken at a poor zoom level, and they were part of a bigger picture which I can't post for confidentiality reasons.
I changed my lines to black, and your command works for the most part, but it leaves jagged edges behind.

I am using the following command to convert from PDF to PNG:
convert -density 300 ./sam.pdf -depth 8 -strip -background white -alpha off -threshold 70% sam.png

I get this png as a result after running the above command
https://drive.google.com/open?id=1Fv3RI ... tZiIDDCVqc

After Applying your command
https://drive.google.com/open?id=1Q9xJX ... IBq0lrxU-L

The only problem is that some noise remains from the removed lines, and tesseract could parse it as a minus sign.

Post by snibgo » 2018-09-21T06:57:08-07:00

A bit more pre-processing solves the problem:

Code: Select all

convert ^
  Before1.png ^
  -strip ^
  ( +clone ^
    -threshold 50%% ^
    -write mpr:ORG ^
    +delete ^
  ) ^
  ( mpr:ORG ^
    -negate ^
    -morphology Erode rectangle:50x1 ^
    -mask mpr:ORG -morphology Dilate rectangle:50x1 ^
    +mask ^
    -morphology Dilate Disk:2 ^
  ) ^
  -compose Lighten -composite ^
  ( +clone ^
    -morphology HMT "1x4:1,0,0,1" ^
  ) ^
  -compose Lighten -composite ^
  ( +clone ^
    -morphology HMT "1x3:1,0,1" ^
  ) ^
  -compose Lighten -composite ^
  ( +clone ^
    -morphology HMT "3x1:1,0,1" ^
  ) ^
  -compose Lighten -composite ^
  out.png

Post by isfando » 2018-09-24T03:41:55-07:00

Snibgo, it worked. Thanks a lot. If you could explain the script, it would help me think independently and make changes in the future. Currently I am not at a level to understand the sequence of events in your code.

Post by snibgo » 2018-09-24T05:12:59-07:00

When trying to understand a long command, it's helpful to sprinkle "+write x0.png", "+write x1.png" etc after every step. We can then see the effects.

The goal is to remove the long black horizontal lines, i.e. to make them white. To do that, we make an image that is white where the lines are and black everywhere else. As the input isn't just black and white, we slightly enlarge the lines in that image.
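
For readers who find the idea easier to follow in code, here is a rough pure-Python sketch of the same sequence: erode with a wide, flat rectangle, dilate again but only where the original had ink, grow the result slightly, then take the lighter pixel. It is an illustration on a tiny 0/1 grid (1 = white), not a reimplementation of ImageMagick. The function names, the odd width of 5 and the radius of 1 are assumptions chosen to fit the toy image.

```python
def negate(img):
    """Swap black (0) and white (1)."""
    return [[1 - p for p in row] for row in img]


def erode_h(img, width):
    """Keep a pixel white only if its whole horizontal window is white."""
    half, w = width // 2, len(img[0])
    return [
        [min(row[max(0, x - half):min(w, x + half + 1)]) for x in range(w)]
        for row in img
    ]


def dilate_h(img, width, allowed):
    """Grow white horizontally, writing only where `allowed` is 1."""
    half, w = width // 2, len(img[0])
    out = []
    for y, row in enumerate(img):
        new_row = []
        for x in range(w):
            grown = max(row[max(0, x - half):min(w, x + half + 1)])
            # Pixels outside the allowed area keep their old value.
            new_row.append(grown if allowed[y][x] else row[x])
        out.append(new_row)
    return out


def dilate_disk(img, radius):
    """Grow white in all directions by roughly `radius` pixels."""
    h, w = len(img), len(img[0])
    offsets = [
        (dy, dx)
        for dy in range(-radius, radius + 1)
        for dx in range(-radius, radius + 1)
        if dy * dy + dx * dx <= radius * radius
    ]
    return [
        [
            max(
                img[y + dy][x + dx]
                for dy, dx in offsets
                if 0 <= y + dy < h and 0 <= x + dx < w
            )
            for x in range(w)
        ]
        for y in range(h)
    ]


def lighten(a, b):
    """Per pixel, keep the lighter of the two images."""
    return [[max(p, q) for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


page = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 1, 1, 1, 1, 1],  # a two-pixel minus sign
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 0, 0, 1],  # a seven-pixel summation line
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
]
ink = negate(page)                       # white ink on black
lines = erode_h(ink, 5)                  # only long runs survive, trimmed
lines = dilate_h(lines, 5, allowed=ink)  # restore full length, inside the ink only
lines = dilate_disk(lines, 1)            # make the lines slightly fatter
result = lighten(page, lines)            # paint white over the lines
```

The minus sign is shorter than the erosion window, so it vanishes from `lines` and is left untouched in `result`.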

convert ^ Run v6 convert
Before1.png ^ Read the image
-strip ^ Remove any superfluous metadata.
( +clone ^ Start a new image list; add to it a clone of the last image in the outer list.
-threshold 50%% ^ Threshold all the images in the current list (there is only one) so they are black and white only.
-write mpr:ORG ^ Write it to a memory location.
+delete ^ Remove it from the current list.
) ^ Close the current list. This would add any images from the nested list to the outer list, but there aren't any as we deleted it.
( mpr:ORG ^ Start a new list, reading the image we saved.
-negate ^ Invert black and white. Now we have white numbers and lines on a black background.
-morphology Erode rectangle:50x1 ^ Erode (remove) small horizontal lines. Now we have just the long horizontal lines, but slightly trimmed.
-mask mpr:ORG -morphology Dilate rectangle:50x1 ^ Dilate (make larger) the horizontal lines, using ORG as a mask so we get the full width of the lines.
+mask ^ Stop using the mask.
-morphology Dilate Disk:2 ^ Make the lines slightly taller (and wider).
) ^ Close the current list, copying the result from the inner list to the outer list. Now the list has two images: Before1.png, and thick white lines on a black background.
-compose Lighten -composite ^ Make each pixel the lighter of the two images. This paints white over the long horizontal lines in Before1.png.
( +clone ^
-morphology HMT "1x4:1,0,0,1" ^
) ^
-compose Lighten -composite ^
( +clone ^
-morphology HMT "1x3:1,0,1" ^
) ^
-compose Lighten -composite ^
( +clone ^
-morphology HMT "3x1:1,0,1" ^
) ^
-compose Lighten -composite ^
out.png

The final steps simply clean any noise from the image. These aren't needed for your example.
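
As a rough illustration of what those uncommented HMT (hit-and-miss) steps do, here is a pure-Python sketch on a 0/1 grid (1 = white). It rests on assumptions: a kernel value of 1 must sit on white and 0 on black, the kernel origin is taken as the centre with integer division, and windows that fall off the image never match, which simplifies ImageMagick's edge handling. The names `hit_and_miss`, `lighten` and `clean` are made up for the example.

```python
def hit_and_miss(img, kernel):
    """Return an image that is white (1) only where the kernel matches exactly."""
    kh, kw = len(kernel), len(kernel[0])
    oy, ox = (kh - 1) // 2, (kw - 1) // 2  # assumed origin: centre of the kernel
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            match = True
            for ky in range(kh):
                for kx in range(kw):
                    yy, xx = y + ky - oy, x + kx - ox
                    inside = 0 <= yy < h and 0 <= xx < w
                    if not inside or img[yy][xx] != kernel[ky][kx]:
                        match = False
            out[y][x] = 1 if match else 0
    return out


def lighten(a, b):
    """Per pixel, keep the lighter of the two images."""
    return [[max(p, q) for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def clean(img):
    """Apply the three kernels in the same order as the command."""
    kernels = [
        [[1], [0], [0], [1]],  # "1x4:1,0,0,1": black two pixels tall, white above and below
        [[1], [0], [1]],       # "1x3:1,0,1": black one pixel tall, white above and below
        [[1, 0, 1]],           # "3x1:1,0,1": black one pixel wide, white left and right
    ]
    for kernel in kernels:
        img = lighten(img, hit_and_miss(img, kernel))
    return img


noisy = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 1, 1, 1],  # leftover speck, two pixels tall
    [1, 0, 1, 1, 0, 0, 1],  # part of a digit, three pixels tall
    [1, 1, 1, 1, 0, 0, 1],
    [1, 1, 1, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
cleaned = clean(noisy)
```

In this sketch the first kernel whitens one pixel of the two-pixel speck and the second kernel whitens the other, while the thicker digit stroke matches none of the kernels and survives.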

Post by isfando » 2018-09-24T05:59:14-07:00

Thanks a lot for the explanation; now I have an idea of the pipeline. I noticed that the code still leaves a fragment of a line at the bottom right corner of the image. Also, any addition that smooths the numbers would be very helpful.

Before
https://drive.google.com/open?id=135BfW ... LvCm7B0476

After
https://drive.google.com/open?id=1SXEzI ... PVIzMnf34I

Post by snibgo » 2018-09-24T07:03:47-07:00

To improve the bottom right, in "-morphology Dilate Disk:2 ^", change 2 to 3.

My command doesn't change the numbers. You can add a slight blur if you want, eg "-blur 0x0.5" just before the output filename.

Post by isfando » 2018-09-25T04:23:53-07:00

I made the suggested changes and the results are pretty good. Thanks a lot. One last question in this regard: how can I feed a multi-page PDF to your code so that it is applied to each page, giving me one PNG image per page of the PDF?

Post by snibgo » 2018-09-25T05:12:32-07:00

PDF documents are often too large for raster images of all the pages to be in memory simultaneously. Besides, adapting a complex convert command for multiple input images isn't trivial.

The easy solution is to use a shell loop to process one page at a time. For example, put the above command in a BAT file I'll call DoOnePage.bat that takes %1 as the input and %2 as the output. Then create another BAT file I'll call DoManyPages.bat like this (untested):

Code: Select all

set INPDF=mypdf.pdf

for /F "usebackq" %%L in (`exiftool -args -PageCount %INPDF%`) do set %%L

set /A LASTPAGE=%-PageCount%-1

for /L %%I in (0,1,%LASTPAGE%) do call DoOnePage %INPDF%[%%I] out_%%I.png
This uses exiftool to quickly count the pages.

However, that creates files like out_9.png and out_10.png, so they don't sort cleanly. I add leading zeros like this:

Code: Select all

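setlocal EnableDelayedExpansion

for /L %%I in (0,1,%LASTPAGE%) do (
  REM Pad the page number with leading zeros, then keep the last six characters.
  set LZ=000000%%I
  set LZ=!LZ:~-6!
  call DoOnePage %INPDF%[%%I] out_!LZ!.png
)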
This gives filenames like out_000009.png and out_000010.png.

[I haven't tested the above. Beware of my faulty memory.]
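
If BAT string handling is awkward, the same page-to-filename mapping can be sketched in Python. This only builds the pairs of page selector and zero-padded output name; it does not run ImageMagick. The function name `page_jobs` and its `digits` parameter are hypothetical, and `mypdf.pdf` is the placeholder name from the BAT file above.

```python
def page_jobs(pdf, page_count, digits=6):
    """Return (input, output) pairs, one per page, with zero-padded output names."""
    return [
        (f"{pdf}[{page}]", f"out_{page:0{digits}d}.png")
        for page in range(page_count)
    ]


jobs = page_jobs("mypdf.pdf", 11)
for src, dst in jobs:
    print(src, "->", dst)
```

Because every name has the same number of digits, sorting the output names alphabetically gives the pages in order.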

Post by isfando » 2018-09-25T06:37:59-07:00

Thanks for your suggestion. I will try to make use of it. But in my case I will run the script on a server where memory is not a problem: it has 128 GB of RAM. If the original convert script could be changed to handle a PDF file as a whole, it would make my work much easier. I want the algorithm applied to every page, so feeding a PDF to the original convert script would suit my needs well.

Post by snibgo » 2018-09-25T06:52:39-07:00

The command could be adapted for multiple inputs, but I don't know how the masked morphology would be done. Each image needs its own mask, but IM's syntax needs an explicit name for the mask.

I've shown you a loop in a BAT script, which calls another BAT script that runs the command. Of course, you could do it in a single BAT script instead.

If you have loads of processors as well as memory, you could split the job into one page per processor, so it should be quick for the overall PDF document. That's what I do for video frames (although I have only 8 logical cores, and 12 GB memory).
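
One way to split the job into one page per processor is a small Python driver that starts several copies of the per-page script at once. This is an untested sketch under assumptions: `DoOnePage.bat` is the per-page script from this thread and must exist in the current directory, and `run_one_page` and `process_pages` are names invented for the example. Threads are enough here because the real work happens in the child processes.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_one_page(job):
    """Run the per-page BAT script for one (input, output) pair."""
    src, dst = job
    # Assumption: DoOnePage.bat takes the input as %1 and the output as %2.
    subprocess.run(["cmd", "/c", "DoOnePage.bat", src, dst], check=True)
    return dst


def process_pages(jobs, worker=run_one_page, max_workers=None):
    """Process all jobs concurrently and return the results in job order."""
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(worker, jobs))


if __name__ == "__main__":
    # Example: a three-page PDF called sam.pdf (placeholder name).
    pages = [(f"sam.pdf[{i}]", f"out_{i:06d}.png") for i in range(3)]
    print(process_pages(pages))
```

The `worker` parameter exists so the driver can be tried with a dummy function before pointing it at real files.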

Post by isfando » 2018-09-25T07:59:21-07:00

OK, I got the point. Your guidance is indeed very helpful. I was able to set up your script on my machine so that it goes through the PDF and runs your convert command page by page. But now the resulting PNG images are not sharp. Below are the steps for two approaches: the result from the earlier approach is pretty crisp, while the result from the current approach is dull and shows marks left over from the removed lines. As an example I am using a PDF named sam.pdf that contains only one page. My main question is: how can I get results from the current approach that are as crisp as those from the earlier approach?
(I am also presenting a tweaked approach at the end. I don't think it is very efficient, but its result is crisp.)
********************EARLIER APPROACH******************************
1)

Code: Select all

 convert -density 300 ./sam.pdf -depth 8 -strip -background white -alpha off -threshold 70%  sam.png
The output image sam.png from this step is pretty crisp, so the result in step 3 is also crisp.
https://drive.google.com/open?id=1fBFFo ... HG-8w-6zGI
2)

Code: Select all

convert ^
  sam.png ^
  -strip ^
  ( +clone ^
    -threshold 50%% ^
    -write mpr:ORG ^
    +delete ^
  ) ^
  ( mpr:ORG ^
    -negate ^
    -morphology Erode rectangle:200x1 ^
    -mask mpr:ORG -morphology Dilate rectangle:200x1 ^
    +mask ^
    -morphology Dilate Disk:3 ^
  ) ^
  -compose Lighten -composite ^
  ( +clone ^
    -morphology HMT "1x4:1,0,0,1" ^
  ) ^
  -compose Lighten -composite ^
  ( +clone ^
    -morphology HMT "1x3:1,0,1" ^
  ) ^
  -compose Lighten -composite ^
  ( +clone ^
    -morphology HMT "3x1:1,0,1" ^
  ) ^
  -compose Lighten -composite ^
  -blur 0x0.5 out.png
3) Result image
https://drive.google.com/open?id=1obtnH ... hHFV0VsOCL


*********************CURRENT APPROACH*********************************
1) doonepage.bat

Code: Select all

convert ^
  -density 300  ^
  %1 ^
  -depth 8 ^
  -strip ^
  ( +clone ^
    -threshold 50%% ^
    -write mpr:ORG ^
    +delete ^
  ) ^
  ( mpr:ORG ^
    -negate ^
    -morphology Erode rectangle:200x1 ^
    -mask mpr:ORG -morphology Dilate rectangle:200x1 ^
    +mask ^
    -morphology Dilate Disk:3 ^
  ) ^
  -compose Lighten -composite ^
  ( +clone ^
    -morphology HMT "1x4:1,0,0,1" ^
  ) ^
  -compose Lighten -composite ^
  ( +clone ^
    -morphology HMT "1x3:1,0,1" ^
  ) ^
  -compose Lighten -composite ^
  ( +clone ^
    -morphology HMT "3x1:1,0,1" ^
  ) ^
  -compose Lighten -composite ^
  -blur 0x0.5 %2

2) domanypages.bat

Code: Select all

set INPDF=sam.pdf

for /F "usebackq" %%L in (`exiftool -args -PageCount %INPDF%`) do set %%L

set /A LASTPAGE=%-PageCount%-1

for /L %%I in (0,1,%LASTPAGE%) do call DoOnePage %INPDF%[%%I] out_%%I.png

3)Result
https://drive.google.com/open?id=1toqjB ... 5pItrdMeRi




*********************TWEAKED APPROACH*********************************
1) doonepagepre.bat

Code: Select all

convert -density 300 %1 -depth 8 -strip -background white -alpha off -threshold 70%%  %2


2)doonepage.bat

Code: Select all

convert ^
  %1 ^
  -strip ^
  ( +clone ^
    -threshold 50%% ^
    -write mpr:ORG ^
    +delete ^
  ) ^
  ( mpr:ORG ^
    -negate ^
    -morphology Erode rectangle:200x1 ^
    -mask mpr:ORG -morphology Dilate rectangle:200x1 ^
    +mask ^
    -morphology Dilate Disk:3 ^
  ) ^
  -compose Lighten -composite ^
  ( +clone ^
    -morphology HMT "1x4:1,0,0,1" ^
  ) ^
  -compose Lighten -composite ^
  ( +clone ^
    -morphology HMT "1x3:1,0,1" ^
  ) ^
  -compose Lighten -composite ^
  ( +clone ^
    -morphology HMT "3x1:1,0,1" ^
  ) ^
  -compose Lighten -composite ^
  -blur 0x0.5 %2


3)domanypages.bat

Code: Select all

set INPDF=sam.pdf

for /F "usebackq" %%L in (`exiftool -args -PageCount %INPDF%`) do set %%L

set /A LASTPAGE=%-PageCount%-1

for /L %%I in (0,1,%LASTPAGE%) do (
	call DoOnePagePre %INPDF%[%%I] out_%%I.png
	call DoOnePage out_%%I.png out_%%I.png
)

4) Result
The results are as crisp as with the earlier approach.
