ClearBottle: python

Tuesday, May 06, 2014

dupfinder

My family and I own about seven devices that take photos. Most of these devices are phones. Funnily enough, we only have one camera. With so many photos floating about, managing them is a headache.

The first step in managing these many photos is to weed out the duplicates. One sure way of generating duplicates on our computer is to be careless when transferring photos from the phone to the computer. The following is the typical situation.

I plug the phone into the computer.
I transfer the pictures to the computer.
I don't delete the photos from my phone, as I want to see them as I carry my phone about.
A few days later, I repeat the download cycle, except that I copy all the photos to a different directory. I can't remember where I put the photos a few days ago, and I am too scared that I might delete photos.

To overcome this problem, I need a Python script.

The algorithm is simple. In 3 steps:

For each file, generate an MD5 digest of its contents.
Group together all the files that have the same MD5 digest.
Copy one file from each group to a new directory.
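
The three steps above can be sketched roughly as follows. This is a minimal illustration written for Python 3, not the dupfinder source: the function names, the chunk size, and the choice to prefix each copied file with its digest are all assumptions made for the example.

```python
import hashlib
import os
import shutil
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_digest(root):
    """Steps 1 and 2: hash every file under root and group the paths by digest."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            groups[md5_of_file(path)].append(path)
    return groups


def copy_one_per_group(groups, dest):
    """Step 3: copy one file from each group into dest."""
    os.makedirs(dest, exist_ok=True)
    for digest, paths in groups.items():
        source = sorted(paths)[0]
        # Prefix with the digest (an assumption of this sketch) so that
        # different photos sharing a file name do not overwrite each other.
        target = os.path.join(dest, digest + "_" + os.path.basename(source))
        shutil.copy2(source, target)
```

The destination directory should sit outside the tree being scanned, otherwise a second run would hash the copies as well.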

The underlying assumption is that collisions do not happen, but as we all know (http://en.wikipedia.org/wiki/MD5#Collision_vulnerabilities), MD5 is not collision free. Two different files could, in principle, produce the same digest and be treated as duplicates.
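
One possible guard against a collision is to confirm each group with a byte-for-byte comparison before treating its members as duplicates. The sketch below uses filecmp from the Python 3 standard library; the function name confirm_duplicates is made up for this example, and nothing here is taken from the dupfinder source.

```python
import filecmp


def confirm_duplicates(paths):
    """Split one digest group into lists of files that are truly byte-identical."""
    confirmed = []
    for path in paths:
        for same in confirmed:
            # shallow=False compares file contents rather than trusting
            # the os.stat() signatures alone.
            if filecmp.cmp(same[0], path, shallow=False):
                same.append(path)
                break
        else:
            # No existing list matched, so this file starts a new one.
            confirmed.append([path])
    return confirmed
```

For a group with no collision this returns a single list holding every path, at the cost of reading each file a second time.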

The script seems to be relatively fast. It worked its way through 2500 photos in about 10 minutes. This is on a Mac Mini with a 2.7 GHz i7 processor.

The first version is not fancy and not loaded up with options. There are a number of options I would like to add, including a paranoid mode and a statistical mode.

If you are interested, the source code is available on my GitHub (https://github.com/tyc/dupfinder).

Tuesday, July 10, 2012

csv.DictReader in python

I am playing around with getting Python to parse simple CSV files. In the past, this would get quite difficult no matter which language I chose. Perl does some useful parsing, but Python is dead simple. It becomes even simpler when using csv.DictReader. The following snippet of code illustrates the point.

#!/usr/bin/env python3
from optparse import OptionParser
import csv

# Parse the command line for the name of the CSV file.
parser = OptionParser()
parser.add_option("-f", "--filename", dest="filename", help="the filename of the csv file")

(options, args) = parser.parse_args()

if options.filename is not None:
    csv_filename = str(options.filename)
else:
    csv_filename = "foobar.csv"
print("csv file is " + csv_filename)

# newline="" lets the csv module handle line endings itself.
with open(csv_filename, newline="") as inputfile:
    csvReader = csv.DictReader(inputfile, fieldnames=['create_ts','title','url','id','hn_discussion'], delimiter=',', quotechar='"')

    # Each row is a dict keyed by the fieldnames given above.
    for row in csvReader:
        print(row['id'] + " " + row['url'])

The first part just parses the command line for the filename of the CSV file.

The second part is the interesting part, where it parses the file according to the given fieldnames. The fieldnames are passed in the call to csv.DictReader.
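
To see what the rows look like without needing a file on disk, here is a small self-contained sketch for Python 3. The two sample lines are made up for illustration; only the fieldnames come from the snippet above. Because fieldnames is passed explicitly, csv.DictReader treats the first line as data rather than as a header.

```python
import csv
import io

# Made-up sample data in the column order used above; there is no header row.
sample = io.StringIO(
    '2012-07-10,"Hello, world",http://example.com/a,1,http://example.com/hn/1\n'
    '2012-07-11,Second post,http://example.com/b,2,http://example.com/hn/2\n'
)

fieldnames = ['create_ts', 'title', 'url', 'id', 'hn_discussion']
rows = list(csv.DictReader(sample, fieldnames=fieldnames, delimiter=',', quotechar='"'))

for row in rows:
    print(row['id'] + " " + row['url'])
# prints:
# 1 http://example.com/a
# 2 http://example.com/b
```

Note that the quoted title in the first line contains a comma, and the quotechar setting keeps it together as one field.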

By the way, this snippet of code is more prototype than production. Most of the error checking and exception handling is missing, so beware.
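
As a rough idea of what some of that missing checking could look like, here is a hedged sketch for Python 3. The names read_rows and main, the decision to skip malformed lines, and the messages are all assumptions for illustration, not part of the original snippet.

```python
import csv
import sys

FIELDNAMES = ['create_ts', 'title', 'url', 'id', 'hn_discussion']


def read_rows(csv_filename):
    """Yield well-formed rows; report malformed ones on stderr instead of crashing."""
    with open(csv_filename, newline="") as inputfile:
        reader = csv.DictReader(inputfile, fieldnames=FIELDNAMES)
        for row in reader:
            # DictReader fills missing columns with None and collects
            # surplus columns under the key None.
            if None in row or any(value is None for value in row.values()):
                print("skipping malformed line %d" % reader.line_num, file=sys.stderr)
                continue
            yield row


def main(csv_filename):
    """Print id and url for each good row; return a shell-style exit code."""
    try:
        for row in read_rows(csv_filename):
            print(row['id'] + " " + row['url'])
    except OSError as err:
        print("cannot read %s: %s" % (csv_filename, err), file=sys.stderr)
        return 1
    except csv.Error as err:
        print("bad CSV in %s: %s" % (csv_filename, err), file=sys.stderr)
        return 1
    return 0
```

Whether a short or overlong line should be skipped, repaired, or treated as fatal depends on the data, so the skip-and-report behaviour here is only one option.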