Remove Tables from Scanned Document

Posted: 2018-07-12T01:10:09-07:00
by shahriarb
Hi,

I'm trying to remove tables and lines from my scanned image but keep the text inside the tables. I have tried to use morphology, but I don't know the correct parameters. The reason I need to do this is that my Tesseract OCR is adding extra characters because of these tables and lines.

Image

image without highlights for testing

Image

Re: Remove Tables from Scanned Document

Posted: 2018-07-17T10:30:32-07:00
by fmw42
There are techniques to do that, such as using -connected-components or -morphology. But your text connects to the lines (the bottom of the g, for example), so -connected-components would also remove any text that touches a line. Also, your page is rotated, so the lines are not perfectly horizontal, and -morphology cannot use simple long lines for its kernel. You could try -deskew to unrotate the image so that the lines are straight, then use -morphology with a long horizontal kernel (longer than any horizontal stroke in the text characters) to remove the horizontal lines, and similarly with the vertical lines. However, the latter will be harder, since the vertical lines are not much taller than some parts of your characters.
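
To make the long-kernel idea concrete, here is a minimal sketch in plain Python. It is not ImageMagick code. It assumes the page is already deskewed and binarized into a list of rows, with 1 for ink and 0 for background. The function names and the kernel length `k` are made up for illustration. Opening with a 1 x k horizontal line keeps only horizontal runs of ink at least k pixels long, so subtracting that result from the page removes the long lines and leaves shorter strokes alone.

```python
def open_horizontal(img, k):
    """Morphological opening of a binary image with a 1 x k horizontal line.

    For a line-shaped kernel this is the same as keeping only horizontal
    runs of ink (1) pixels that are at least k pixels long.
    """
    out = [[0] * len(row) for row in img]
    for y, row in enumerate(img):
        x = 0
        while x < len(row):
            if row[x] == 1:
                start = x
                # Walk to the end of this run of ink pixels.
                while x < len(row) and row[x] == 1:
                    x += 1
                # Keep the run only if the kernel fits inside it.
                if x - start >= k:
                    for i in range(start, x):
                        out[y][i] = 1
            else:
                x += 1
    return out


def remove_horizontal_lines(img, k):
    """Subtract the detected long horizontal runs from the image."""
    lines = open_horizontal(img, k)
    return [[p & (1 - q) for p, q in zip(row, line_row)]
            for row, line_row in zip(img, lines)]


def transpose(img):
    return [list(col) for col in zip(*img)]


def remove_vertical_lines(img, k):
    """Same operation on columns, done by transposing the image."""
    return transpose(remove_horizontal_lines(transpose(img), k))


if __name__ == "__main__":
    # Hypothetical 4 x 8 page: short text strokes above one full-width line.
    page = [
        [0, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 1, 0, 1, 0, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 0, 0, 0],
    ]
    for row in remove_horizontal_lines(page, 6):
        print(row)
```

Two caveats carry over from the discussion above. Any ink pixel that sits on a removed line goes with it, so a descender that crosses a line loses the pixels at the crossing. And the vertical pass needs a `k` larger than the tallest character stroke, which may be hard to pick when the table lines are not much taller than the text.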