mark empty row in a column to avoid mixup in tesseract output

Posted: 2020-03-09T07:32:15-07:00
by isfando
I need to OCR PDFs of financial statements, and I am using Tesseract to do it. For the sake of this question, assume there are three columns.

Column 1 | Column 2 | Column 3
<keywords> | <number> | <number>


Column 1 contains financial keywords, while columns 2 and 3 contain numbers representing money. If a row has no value in column 2 but has one in column 3, Tesseract moves the column 3 value into column 2. This makes reading the PDF statement fail, because the sum of all rows no longer equals the final sum in the financial statement.


In the example below I have given a sample input PDF file. I want to use ImageMagick to mark each empty cell in a column with a letter, say 'X'. This will help me assign the values to the correct column. I have also given a manually made output showing what I want from ImageMagick.


*******************************Example Scenario****************************************

Input Pdf
https://drive.google.com/open?id=1zmKWb ... qXWpdMCFBL



Required Output
https://drive.google.com/open?id=1MTxLL ... f46yrOW1DN

Re: mark empty row in a column to avoid mixup in tesseract output

Posted: 2020-03-09T10:29:40-07:00
by fmw42
I think you need some other tool than ImageMagick, unless you want to add text by manually drawing it where you want the "x" to show.

Re: mark empty row in a column to avoid mixup in tesseract output

Posted: 2020-03-26T07:46:34-07:00
by furushito
fmw42 wrote:
2020-03-09T10:29:40-07:00
I think you need some other tool than ImageMagick, unless you want to add text by manually drawing it where you want the "x" to show.
Can you recommend some, please?

Re: mark empty row in a column to avoid mixup in tesseract output

Posted: 2020-03-26T08:18:44-07:00
by snibgo
In theory, you could identify the three columns by blurring vertically and finding the white gaps. Then find each row of text by the same method. This gives you all the cells. Each cell is empty or contains text. Replace empty cells with a same-size image containing "X". Then rebuild the image by appending the cells.
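
The cell-finding idea above can be sketched roughly as follows. This is not ImageMagick code: it swaps the vertical blur for a plain projection (a column or row counts as "white" if it contains no dark pixel at all), which plays the same role on a clean, deskewed scan. The function names, the `min_gap` parameter, and the boolean `ink` array (True where a pixel is dark) are all assumptions made for illustration, and the cross is drawn pixel by pixel instead of pasting a rendered "X" image.

```python
import numpy as np

def ink_runs(profile, min_gap=1):
    """Return (start, stop) index pairs for runs of True in a 1-D profile.

    Runs separated by fewer than min_gap blank entries are merged, so small
    gaps (for example between words in the keyword column) do not split a band.
    """
    runs, start = [], None
    for i, has_ink in enumerate(profile):
        if has_ink and start is None:
            start = i
        elif not has_ink and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))

    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged

def empty_cells(ink, min_col_gap=1, min_row_gap=1):
    """Split a page into row and column bands and list the cells with no ink.

    ink is a 2-D boolean array (hypothetical input: True where a pixel is dark).
    Returns (rows, cols, empties), where empties holds (row_index, col_index).
    """
    cols = ink_runs(ink.any(axis=0), min_col_gap)  # collapse vertically
    rows = ink_runs(ink.any(axis=1), min_row_gap)  # collapse horizontally
    empties = [
        (r, c)
        for r, (y0, y1) in enumerate(rows)
        for c, (x0, x1) in enumerate(cols)
        if not ink[y0:y1, x0:x1].any()
    ]
    return rows, cols, empties

def mark_cross(ink, y0, y1, x0, x1):
    """Draw a small diagonal cross in the top-left square of a cell, in place."""
    n = min(y1 - y0, x1 - x0)
    for k in range(n):
        ink[y0 + k, x0 + k] = True
        ink[y0 + k, x0 + n - 1 - k] = True

def mark_empty_cells(ink, **gaps):
    """Mark every empty cell with a cross and return the marked (row, col) list."""
    rows, cols, empties = empty_cells(ink, **gaps)
    for r, c in empties:
        (y0, y1), (x0, x1) = rows[r], cols[c]
        mark_cross(ink, y0, y1, x0, x1)
    return empties
```

On a real scan the `ink` array would have to come from thresholding the rendered page, and `min_col_gap` would need tuning so that word spacing inside column 1 is not mistaken for a column boundary. Noise, skew, or ruled lines would break the simple "no dark pixel" test.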

Re: mark empty row in a column to avoid mixup in tesseract output

Posted: 2020-03-30T08:50:31-07:00
by allexx
Here are some other approaches to try:
-- See if Tabula helps with the PDF (https://tabula.technology/)
-- Try Microsoft's OCR; it's online and is called "Cognitive Search", I think; it produces impressive results
Good luck.