I am trying to convert a PDF file to a TIFF file without losing quality. However, the conversion does lose quality: when I OCR the TIFF file with Tesseract, words are misread. I am using the Magick.NET-Q16-AnyCPU DLL, version 7.0.0.0, in my C# application.
Here is my code for creating a TIFF file from PDF bytes:
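public void ConvertToTIFF(byte[] bytes)
{
    // Render the PDF at 300 dpi, in black and white only, with LZW compression.
    ImageMagick.MagickReadSettings settings = new ImageMagick.MagickReadSettings();
    settings.Density = new ImageMagick.Density(300, 300);
    settings.UseMonochrome = true;
    settings.CompressionMethod = CompressionMethod.LZW;
    using (MagickImageCollection images = new MagickImageCollection())
    {
        // Each PDF page becomes one image in the collection.
        images.Read(bytes, settings);
        // targetFile is not declared in this snippet; presumably a field of the class.
        images.Write(targetFile);
    }
}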
I even tried increasing the DPI from 300 to 400 and 500, but I don't see much difference in quality. I am looking for suggestions on how to retain quality during the TIFF conversion.
What do you mean by quality? Please show examples. You can upload them somewhere like dropbox.com and paste the URLs here.
What does "UseMonochrome" do? If it converts the image to black and white only, that is a major drop in quality, and generally makes OCR more difficult. Stretching the contrast so that the paper is white and the letters are black, with antialiasing between them, is better.
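A minimal sketch of that suggestion, for illustration only: render the pages without UseMonochrome, convert them to grayscale, and stretch the levels so the paper goes to white and the ink to black while the antialiased edges stay gray. The method name ConvertToGrayTiff, the targetFile parameter, and the 10%/90% level points are assumptions, not values from this thread, and would need tuning on real pages. It is written against a recent Magick.NET 7 package; the 7.0.0.0 build mentioned above may expose compression under a different property name.

```csharp
using ImageMagick;

public static class PdfToTiff
{
    // Sketch only: method name, parameters and level points are illustrative assumptions.
    public static void ConvertToGrayTiff(byte[] pdfBytes, string targetFile)
    {
        var settings = new MagickReadSettings
        {
            // Render the PDF at 300 dpi. UseMonochrome is deliberately left unset
            // so the antialiased edges of the letters are kept.
            Density = new Density(300, 300)
        };

        using (var images = new MagickImageCollection())
        {
            images.Read(pdfBytes, settings);

            foreach (var page in images)
            {
                // Grayscale instead of pure black and white.
                page.ColorType = ColorType.Grayscale;

                // Map values below 10% to black and above 90% to white,
                // stretching everything in between.
                page.Level(new Percentage(10), new Percentage(90));

                // LZW is lossless, so it does not affect quality.
                page.Settings.Compression = CompressionMethod.LZW;
            }

            // Writes all pages into one multi-page TIFF.
            images.Write(targetFile);
        }
    }
}
```

Whether this helps Tesseract depends on the source pages, so comparing OCR output from this file and from the monochrome one on the same page is the only reliable check.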
Thanks snibgo and fmw42 for your replies. I am using the Magick.NET-Q16-AnyCPU DLL, version 7.0.0.0, which I installed from NuGet. I haven't installed GS, but I am using the files "gsdll64.dll" and "gswin64c.exe" and referencing them via MagickNET.SetGhostscriptDirectory(@"~/somepath").
I understand that providing a sample file would help you more, but sorry, I cannot upload either the PDF or the TIFF files, as they are confidential.
As suggested, I tried setting "Antialias = true", but that did not improve OCR accuracy. I also tried the following methods: MagickImage.Enhance(), MagickImage.Sharpen(), MagickImage.Magnify(), and MagickImage.Normalize(). They gave a slight improvement in OCR accuracy, but they noticeably increase the time needed to create the TIFF file.
Can you suggest any ImageMagick methods that would help generate a TIFF file of the same or better quality as the source PDF (basically, a precise and sharp TIFF even at a high zoom level, say 800%), and thus improve OCR accuracy?
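On the Ghostscript setup described in this post: as far as I know, a leading "~" is not expanded by the .NET file APIs, so if the real path looks like the "~/somepath" placeholder it may be worth passing an absolute directory and checking that both files are actually there. A rough sketch, assuming a hypothetical "ghostscript" folder next to the application binaries (the folder name and the class name are made up for illustration):

```csharp
using System;
using System.IO;
using ImageMagick;

public static class GhostscriptSetup
{
    public static void Configure()
    {
        // "ghostscript" is a hypothetical folder name, not taken from the post.
        string dir = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "ghostscript");

        // The two files named in the post must both be present in that folder.
        foreach (string name in new[] { "gsdll64.dll", "gswin64c.exe" })
        {
            string file = Path.Combine(dir, name);
            if (!File.Exists(file))
            {
                throw new FileNotFoundException("Ghostscript file not found", file);
            }
        }

        // Point Magick.NET at the folder before reading any PDF.
        MagickNET.SetGhostscriptDirectory(dir);
    }
}
```

Calling Configure() once at application start, before the first images.Read call, should be enough.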
If you could show us a crop of a single word, that might help us understand what you mean by "quality". Otherwise we can only guess. Perhaps the letters are too small. Perhaps they are smudged. Perhaps you need a higher density. Perhaps ...
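To make the "letters are too small" and "higher density" guesses concrete, here is a small arithmetic sketch. One point is 1/72 inch, so text of a given point size rendered at a given density has a nominal height of points / 72 * dpi pixels. The 8 pt font size is a made-up example, not a value from the confidential PDF, and the class and method names are hypothetical.

```csharp
using System;

public static class OcrSizing
{
    // One point is 1/72 inch, so nominal height in pixels = points / 72 * dpi.
    public static double PointsToPixels(double points, double dpi)
    {
        return points / 72.0 * dpi;
    }

    public static void Main()
    {
        // 8 pt is a hypothetical body-text size used only for illustration.
        foreach (double dpi in new[] { 300.0, 400.0, 500.0 })
        {
            Console.WriteLine("{0} dpi: 8 pt text is about {1:F1} px tall",
                dpi, PointsToPixels(8, dpi));
        }
    }
}
```

This is the nominal font height; the visible letters are somewhat smaller. If the text is already a few dozen pixels tall at 300 dpi, raising the density further would not be expected to change much, which would be consistent with the 400/500 dpi attempts showing little difference.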