
Can we extract as an image blocks of text?

Posted: 2018-03-18T14:24:50-07:00
by FrereTuck
Hi,

My final goal is to cut the words out of a scanned text as separate images, have Tesseract recognize the text in each one, and rename each image file after the word it contains.
Image
For the first line of this image, there would be one image containing FRUIT, another containing VINES, and so on, each with a random name at first. If that is possible, I would then pass each image to Tesseract and rename the file after its content: the first one would be called FRUIT.png, the second VINES.png, and so on.
I would then be able to rearrange the word images to form groups of words again (FRUIT VINES) as images.

Do you think the first step could be done with ImageMagick?

Thanks a lot.

Re: Can we extract as an image blocks of text?

Posted: 2018-03-18T19:53:15-07:00
by muccigrosso
Isn't this what Tesseract does? That is, it finds the text in images. It will output box coordinates, too. Look at the man page, and especially the hOCR output.
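
To make the hOCR suggestion concrete, here is a minimal Python sketch of how the word boxes might be turned into ImageMagick crop commands. It assumes the hOCR file was produced with something like `tesseract scan.png out hocr`, and that each word appears as a `span` with class `ocrx_word` whose `title` attribute starts with `bbox x0 y0 x1 y1`. The file name `scan.png`, the sample coordinates, and the helper names are made up for illustration, and the commands are only printed, not run.

```python
import re
import shlex
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if m:
                self._bbox = tuple(int(v) for v in m.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def parse_words(hocr):
    """Return the recognized words and their boxes from an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    return parser.words


def crop_command(image, text, bbox):
    """Build an ImageMagick command that crops one word into <text>.png."""
    x0, y0, x1, y1 = bbox
    geometry = f"{x1 - x0}x{y1 - y0}+{x0}+{y0}"  # WIDTHxHEIGHT+X+Y
    safe = re.sub(r"[^A-Za-z0-9_-]", "_", text)  # keep file names tame
    # "convert" is the ImageMagick 6 name; ImageMagick 7 uses "magick".
    return ["convert", image, "-crop", geometry, "+repage", f"{safe}.png"]


# Hypothetical hOCR fragment; real output has many more attributes.
SAMPLE = """
<span class='ocr_line' title='bbox 36 92 580 122'>
 <span class='ocrx_word' title='bbox 36 92 140 122; x_wconf 95'>FRUIT</span>
 <span class='ocrx_word' title='bbox 160 92 270 122; x_wconf 93'>VINES</span>
</span>
"""

words = parse_words(SAMPLE)
commands = [crop_command("scan.png", text, bbox) for text, bbox in words]
for cmd in commands:
    print(" ".join(shlex.quote(part) for part in cmd))
```

With this approach the files are named after their content from the start, so the separate renaming step is not needed. Two words with the same text would overwrite each other, so a real script would probably add a counter to the file name.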

Re: Can we extract as an image blocks of text?

Posted: 2018-03-19T08:59:25-07:00
by FrereTuck
I will have a look at it, thanks!