How do you detect duplicates? And how does IM Fingerprint work?

chani
Posts: 8
Joined: 2018-10-21T12:38:33-07:00

Post by chani »

Hi there,

I tried searching for copy/duplicate/detect duplicate, but I didn't find anything here. If I overlooked something (that is, if this has already been answered somewhere on the board), please let me know. I also read https://www.imagemagick.org/Usage/compare/. I am looking for a way to automate this.

Long story short: about a year ago I lost my photos as well as my backups, and I ended up with a folder containing most of them along with modified (denoised, gamma-corrected, sharpened, scaled) duplicates of the originals. Now I need to get rid of the duplicates. First of all, I really just want to detect duplicates; choosing which of the duplicates to keep isn't that important right now.

So I tried the following:

1. Simple IM fingerprint: store the fingerprint of every photo in an array and, while iterating over all my photos, check whether anything matches. That seems to work quite well.
2. Downscale to 64x64 (I also tested 32x32), convert to grayscale, create three additional versions rotated in 90-degree steps, and take the fingerprints of those to check for duplicates.

I could use a helping hand or an idea for approach 2. To downscale:

- First I used sample. That is pretty fast, but no copies are detected.
- Then I used scale. That is a little slower, and still no copies are detected.
- Then I used resize with POINT and BOX. That is a little slower again, and still no copies are detected.
- Then I used resize with GAUSSIAN and HERMITE. GAUSSIAN is the slowest(!); HERMITE is a bit slower than the variants above. These do detect some duplicates (so yes, it works; it's just a little too slow).

Using sample or scale followed by a Gaussian blur is still faster than using resize with GAUSSIAN, but it does not detect duplicates. So I'm curious: why does resize with GAUSSIAN or HERMITE work when sample/scale plus a Gaussian blur does not?

By the way, the fingerprint I am using is the one that PHP's \Imagick::getImageSignature() returns. Is that perhaps the wrong thing to use for what I want to do? I'm not limited to PHP; Bash would be fine as well. How do you do that?
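
A note on the signature question: as far as I know, the ImageMagick signature is a SHA-256 digest computed over the pixel data, so it can only say whether two images are pixel-for-pixel identical. The following is a minimal sketch in plain PHP (no Imagick involved) of why a single slightly different pixel is enough to break the match. The function name exactFingerprint() and the pixel values are made up for illustration, and the digest here is only a stand-in for the real signature.

```php
<?php
// Hypothetical 4x4 grayscale thumbnails as flat arrays of 0-255 values.
$original = [12, 40, 200, 180, 15, 42, 198, 182, 11, 39, 201, 179, 14, 41, 199, 181];
$modified = $original;
$modified[5] = 43; // one pixel is one gray level brighter

// Stand-in for an exact-match fingerprint: SHA-256 over the raw pixel bytes.
function exactFingerprint(array $pixels): string
{
    return hash('sha256', pack('C*', ...$pixels));
}

$a = exactFingerprint($original);
$b = exactFingerprint($modified);

var_dump($a === exactFingerprint($original)); // identical pixels: digests match
var_dump($a === $b);                          // one changed pixel: digests differ
```

This would also be consistent with the observation that only some downscaling filters produce matches: a match requires every pixel of the two thumbnails to come out exactly equal.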

I noticed that auto-levels does not change the fingerprint. I'm looking for a way to still detect color-distorted or gamma-corrected photos as copies, which is why I do the grayscale conversion. I also tried creating an edge mask and using that; however, creating the mask takes far too long.

Thanks in advance,
Jean

Re: How do you detect duplicates? And how does IM Fingerprint work?

Post by chani »

Okay, I wrote something that seems to work, based on what I read about aHash. Here's the PHP code:

Code: Select all

        
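        // aHash-style fingerprint: shrink, convert to grayscale, threshold at the mean.
        $im = new \Imagick($file);
        $im->sampleImage(16, 16);
        $im->transformImageColorspace(\Imagick::COLORSPACE_GRAY);
        // After the grayscale conversion all channels are equal, so the red channel is representative.
        $data = $im->getImageChannelMean(\Imagick::CHANNEL_RED);
        $mean = $data['mean'];
        // Pixels above the mean become white, the rest black.
        $im->thresholdImage($mean);
        // Signature of the 16x16 black-and-white image; it matches only if all 256 pixels agree.
        $hash = $im->getImageSignature();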
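
Taking the signature of the thresholded image only flags photos whose 256 black-and-white pixels all agree. A common refinement of aHash is to keep the bits themselves and count how many of them differ (the Hamming distance), so that near matches count as well. Below is a minimal sketch of that idea in plain PHP. The function names, the sample pixel values, and the tolerance of 2 are illustrative assumptions; with Imagick, the gray values of the 16x16 image would have to be read out first (for example with Imagick::exportImagePixels()).

```php
<?php
// Build an aHash-style bit string: '1' where a pixel is above the mean, else '0'.
function averageHashBits(array $pixels): string
{
    $mean = array_sum($pixels) / count($pixels);
    $bits = '';
    foreach ($pixels as $p) {
        $bits .= ($p > $mean) ? '1' : '0';
    }
    return $bits;
}

// Number of positions at which two equal-length bit strings differ.
function hammingDistance(string $a, string $b): int
{
    if (strlen($a) !== strlen($b)) {
        throw new InvalidArgumentException('Hashes must have the same length.');
    }
    $distance = 0;
    for ($i = 0, $n = strlen($a); $i < $n; $i++) {
        if ($a[$i] !== $b[$i]) {
            $distance++;
        }
    }
    return $distance;
}

// Hypothetical 4x4 grayscale thumbnails (0-255); the second is a brightened copy.
$photo    = [10, 20, 200, 210, 15, 25, 205, 215, 12, 22, 202, 212, 18, 28, 208, 218];
$brighter = array_map(fn(int $v): int => min(255, $v + 30), $photo);

$maxDistance = 2; // assumed tolerance; tune it on real photos
$distance = hammingDistance(averageHashBits($photo), averageHashBits($brighter));
echo $distance <= $maxDistance ? "likely duplicate\n" : "different\n";
```

Because each bit is relative to the image's own mean, a uniform brightness shift like the one above leaves the bits unchanged, and small local edits flip only a few of them.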
fmw42
Posts: 25562
Joined: 2007-07-02T17:14:51-07:00
Location: Sunnyvale, California, USA

Re: How do you detect duplicates? And how does IM Fingerprint work?

Post by fmw42 »

You could use perceptual hash techniques. ImageMagick has a color phash. See https://imagemagick.org/discourse-serve ... =4&t=24906

I have built some other perceptual hash scripts at http://www.fmwconcepts.com/imagemagick/ ... /index.php.