How do you detect duplicates? And how does IM Fingerprint work?

Posted: 2018-11-18T03:19:49-07:00
by chani
I tried searching for copy/duplicate/detect duplicate, but I didn't find anything here. If I overlooked something (as in, this has already been answered somewhere on the board), please let me know. I also read: I am looking for a way to automate this.

Long story short: About a year ago I lost my photos as well as my backups, and ended up with a folder containing most of them along with modified (denoised, gamma-corrected, sharpened, scaled) duplicates of the originals. Now I need to get rid of the duplicates. For now I really just want to detect duplicates - choosing which of the duplicates to keep isn't that important yet.

1. Simple IM Fingerprint (storing every photo's fingerprint in an array and, while iterating over all my photos, checking whether one matches) - that seems to work quite well.
2. Downscale to 64x64 (I also tested 32x32), convert to grayscale, create three versions rotated in 90-degree steps, and take the fingerprints of those to check for duplicates.
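The two approaches in the list can be sketched without ImageMagick. This is only an illustration under stated assumptions: the images are small grayscale grids (lists of rows with values 0-255) rather than real files, SHA-256 over those values stands in for the IM fingerprint, and all function names are hypothetical. Instead of storing four fingerprints per photo, it keeps the smallest of the four rotation fingerprints as a single key.

```python
import hashlib
from collections import defaultdict


def rotate90(grid):
    """Rotate a 2-D list of pixel values 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def fingerprint(grid):
    """Exact-match fingerprint: SHA-256 over the dimensions and pixel values."""
    h = hashlib.sha256()
    h.update(f"{len(grid)}x{len(grid[0])}:".encode())
    for row in grid:
        h.update(bytes(row))  # values must be in 0..255
    return h.hexdigest()


def rotation_invariant_fingerprint(grid):
    """Smallest fingerprint among the four 90-degree rotations of the grid."""
    prints = []
    for _ in range(4):
        prints.append(fingerprint(grid))
        grid = rotate90(grid)
    return min(prints)


def group_duplicates(images):
    """Group image names by fingerprint; return only groups with 2+ names."""
    groups = defaultdict(list)
    for name, grid in images.items():
        groups[rotation_invariant_fingerprint(grid)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


if __name__ == "__main__":
    original = [[10, 20], [30, 40]]
    images = {
        "a.jpg": original,
        "a_rotated.jpg": rotate90(original),
        "b.jpg": [[10, 20], [30, 41]],  # one pixel differs by one level
    }
    print(group_duplicates(images))
```

Note that "b.jpg" is not grouped with the others even though it differs by a single level in one pixel: an exact fingerprint only matches pixel-identical data.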

I could use a helping hand or an idea about 2. To downscale:

- first I used sample. That is pretty fast, but no copies are detected.
- then I used scale. That is a little bit slower, but still no copies are detected.
- then I used resize with POINT and BOX. A little bit slower again - still no copies.
- then I used resize with GAUSSIAN and HERMITE - GAUSSIAN is the slowest(!), HERMITE is a bit slower than the variants above. THIS one detects some duplicates (so.. yes, it does work. It's just a little bit too slow).

Using sample/scale followed by a Gaussian blur is still faster than using resize with GAUSSIAN - but it does not detect duplicates. So I'm curious: why do GAUSSIAN_RESIZE and HERMITE_RESIZE work while SAMPLE/SCALE + GAUSSIAN/BLUR does not?

By the way, the fingerprint I am using is the one PHP's \Imagick::getImageSignature() returns. Is that perhaps the wrong thing to use for what I want to do? I'm not limited to PHP, Bash would be fine as well. How do you do that?

I noticed that auto-levels does not change the fingerprint. I'm looking for a way that color-distorted or gamma-corrected photos would still be detected as copies. That is why I do the grayscale conversion. I also thought about and tried creating an edge mask to use instead - however, creating that mask takes way too long.

Posted: 2018-11-18T10:38:57-07:00
by chani
Okay, I wrote something which seems to work, based on what I read about aHash. Here's the PHP code:

Code:

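        // Load the photo and shrink it to 16x16 by point sampling.
        $im = new \Imagick($file);
        $im->sampleImage(16, 16);
        // Mean of the red channel of the 16x16 image.
        $data = $im->getImageChannelMean(\Imagick::CHANNEL_RED);
        $mean = $data['mean'];
        // Signature (SHA-256 digest) of the pixel data, used as the hash.
        $hash = $im->getImageSignature();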
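The snippet computes the mean but does not show what is done with it. aHash is usually described as thresholding each pixel of the small grayscale image against that mean and packing the results into bits. Here is a minimal pure-Python sketch of that step, assuming the image has already been reduced to a small grayscale grid; it does not call Imagick, and `average_hash` and `brighten` are hypothetical helper names.

```python
def average_hash(gray):
    """aHash sketch: one bit per pixel, set when the pixel is above the mean.

    `gray` is a 2-D list of grayscale values (an assumed stand-in for the
    downscaled image). Bits are packed row by row, first pixel highest.
    """
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def brighten(gray, offset):
    """Add a constant offset to every pixel, clipping at 255."""
    return [[min(255, p + offset) for p in row] for row in gray]


if __name__ == "__main__":
    gray = [
        [12, 200, 180, 15],
        [30, 190, 170, 25],
        [20, 60, 70, 10],
        [5, 40, 50, 8],
    ]
    print(f"{average_hash(gray):016b}")
    print(f"{average_hash(brighten(gray, 20)):016b}")
```

Because every pixel is compared with the mean of the same image, a uniform brightness shift that does not clip leaves the bits unchanged, which an exact signature of the pixel data would not tolerate.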

Posted: 2018-11-18T12:09:31-07:00
by fmw42
You could use perceptual hash techniques. ImageMagick has a color phash. See ... =4&t=24906

I have built some other perceptual hash scripts at ... /index.php.
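To illustrate why a perceptual hash helps where an exact signature does not: Imagick documents getImageSignature() as a SHA-256 digest of the pixel stream, so two images match only if every pixel is identical, whereas perceptual hashes are compared by how many bits differ. The sketch below is not the ImageMagick phash; it only shows the comparison idea, and the function names, the example hash values, and the 5-bit threshold are assumptions chosen for illustration.

```python
import hashlib


def hamming_distance(a, b):
    """Number of bit positions in which two integer hashes differ."""
    return bin(a ^ b).count("1")


def is_probable_duplicate(hash_a, hash_b, max_bits=5):
    """Treat two hashes as duplicates if at most `max_bits` bits differ.

    The default threshold is an arbitrary illustrative value; it would need
    tuning against real photos.
    """
    return hamming_distance(hash_a, hash_b) <= max_bits


if __name__ == "__main__":
    # Exact digests: a one-level change in one pixel gives an unrelated digest.
    pixels_a = bytes([10, 200, 30, 220])
    pixels_b = bytes([10, 200, 31, 220])
    print(hashlib.sha256(pixels_a).hexdigest() == hashlib.sha256(pixels_b).hexdigest())

    # Perceptual-style hashes (made-up 64-bit values): near matches stay close.
    hash_a = 0xF0F0F0F0F0F0F0F0
    hash_b = hash_a ^ 0b101  # same hash with two bits flipped
    print(hamming_distance(hash_a, hash_b), is_probable_duplicate(hash_a, hash_b))
```

With this kind of comparison, the lookup is no longer a plain array match: each new hash has to be compared against the stored ones by distance.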