Tuesday, May 06, 2014

dupfinder

My family and I own about seven devices that takes photos. Most of these devices are phones. Funny enough, we only have one camera. With some many photos floating about, managing them is a headache.

The first step in managing these many photos is to weed out the duplicates. One sure way of generating duplicates on our computer is not to take care when transferring them from the phone to the computer. The following is the typical situation.


  • plug the phone to the computer.
  • transfer the pictures to the computer.
  • I don't delete the photos from my phone as I want to see them as I carry my phone about.
  • A few days later, I would repeat the download cycle again, except that I would copy all the photos to different directory. I can't remember where I put the photos a few days ago and I am too scared that I might delete photos.
To overcome this problem, I need a python script.

The algorithm is simple. In 3 simple steps,
  1. For each file generate a md5 digest of it.
  2. group all the files that have the md5 digest together.
  3. copy a file from each group to a new directory
The underlying assumption is that collisions are not possible, but as we all know (http://en.wikipedia.org/wiki/MD5#Collision_vulnerabilities) md5 are not collision free.

The script seems to be relatively fast. It through its way through 2500 photos in about 10mins. This is on an Mac Mini with an 2.7Ghz i7 processor. 

The first version is not fancy and not loaded up with options. There are a number of options i would like to add, including a paranoid and a statistical mode

If you are interested, the source code is available at my Github (https://github.com/tyc/dupfinder

No comments: