Remove Tables from Scanned Document

shahriarb
Posts: 1
Joined: 2018-07-12T00:27:46-07:00

Remove Tables from Scanned Document

Post by shahriarb »

Hi,

I'm trying to remove the tables and lines from my scanned image while keeping the text inside the tables. I have tried to use morphology, but I don't know the correct parameters. The reason I need to do this is that my Tesseract OCR is adding extra characters because of these tables and lines.

Image

image without highlights for testing

Image
fmw42
Posts: 25562
Joined: 2007-07-02T17:14:51-07:00
Location: Sunnyvale, California, USA

Re: Remove Tables from Scanned Document

Post by fmw42 »

There are techniques to do that, such as using -connected-components or -morphology. But your text connects to the lines (the bottom of the g, for example). Therefore -connected-components would also remove any text that touches a line. Also, your page is rotated, so the lines are not perfectly horizontal. Thus -morphology cannot use simple long lines for its kernel. You could try -deskew to unrotate the image so that the lines are straight, then use -morphology with a long horizontal kernel (longer than any horizontal stroke in the text characters) to remove the horizontal lines, and similarly with a vertical kernel for the vertical lines. However, the latter will be harder, since the vertical lines are not much taller than some of the parts of your characters.
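
A minimal sketch of the deskew-then-morphology idea, not a tuned solution. The function name remove_table_lines, its arguments, the 50% threshold and the default kernel length of 80 pixels are all assumptions for illustration; the length has to be longer than any stroke in your text and shorter than the table lines. It straightens the page, makes the ink white on black, uses -morphology Open with a long thin Rectangle kernel to keep only the long horizontal and vertical runs, and then takes those runs back out of the page with a Difference composite.

```shell
#!/bin/sh
# Sketch only: remove long horizontal and vertical rules from a scanned page.
# remove_table_lines, its arguments and the 80 px default are hypothetical;
# tune the length (and the threshold) to your own scan.
remove_table_lines() {
  in=$1; out=$2; len=${3:-80}

  # Use ImageMagick 7's "magick" if present, otherwise ImageMagick 6's "convert".
  im=$(command -v magick || command -v convert) || return 1

  # 1. -deskew straightens the page so the rules are axis-aligned.
  # 2. -threshold and -negate give white ink on a black background.
  # 3. Each Open keeps only runs at least $len pixels long in one direction.
  # 4. Lighten merges the two line masks into one.
  # 5. Difference removes the mask from the page; the final -negate
  #    restores black text on white.
  "$im" "$in" -background white -deskew 40% +repage \
    -colorspace Gray -threshold 50% -negate \
    \( -clone 0 -morphology Open "Rectangle:${len}x1" \) \
    \( -clone 0 -morphology Open "Rectangle:1x${len}" \) \
    \( -clone 1,2 -compose Lighten -composite \) \
    -delete 1,2 \
    -compose Difference -composite -negate "$out"
}

# Example call (file names are placeholders):
#   remove_table_lines scan.png scan_nolines.png 80
```

Where a character touches a line, such as the tail of a g, this will usually leave a small gap or nick in the character. If the vertical pass starts eating tall letters, use a longer length for the 1x kernel than for the horizontal one, or skip the vertical pass.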