Wednesday, May 14, 2014

Gmail Logging

A story posted on Hacker News about Google logging every single email you ever wrote is an interesting discussion piece. Of course, that assumes someone in the communication chain is using Google's services.

At each point in the chain, there is an opportunity to store the data. Let's look at the chain. But let me preface this by saying that I am not a network engineer, so there will be gaps in my analysis.
  1. Joe sends an email from his computer to his brother Peter. A copy is saved by his email client, typically in his sent folder.
  2. The sending mail server receives the email to dispatch it to the recipient. This mail server could save a copy of the email.
  3. The receiving mail server accepts the email and stores it until Peter is ready to collect it. This mail server stores a copy too.
  4. Peter connects to the mail server and downloads the email. It is stored in Peter's inbox.

Even in my simple communication chain, each link provides an opportunity to store the email.

Once you throw in a smartphone and cloud email clients that are accessible from multiple computers, the opportunities increase quite quickly.

Tuesday, May 06, 2014

dupfinder

My family and I own about seven devices that take photos. Most of these devices are phones. Funnily enough, we only have one camera. With so many photos floating about, managing them is a headache.

The first step in managing this many photos is to weed out the duplicates. One sure way of generating duplicates on our computer is to be careless when transferring them from the phone to the computer. The following is the typical situation.

  • Plug the phone into the computer.
  • Transfer the pictures to the computer.
  • Leave the photos on the phone, as I want to see them as I carry my phone about.
  • A few days later, repeat the download cycle, except that I copy all the photos to a different directory. I can't remember where I put the photos a few days ago, and I am too scared that I might delete photos.
To overcome this problem, I need a Python script.

The algorithm is simple, with just 3 steps:
  1. For each file, generate an md5 digest of it.
  2. Group all the files that have the same md5 digest together.
  3. Copy one file from each group to a new directory.
The underlying assumption is that collisions do not occur, but as we all know (http://en.wikipedia.org/wiki/MD5#Collision_vulnerabilities), md5 is not collision free.
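
The three steps can be sketched roughly as follows. This is only a minimal illustration written for this post and not the code in the dupfinder repository; the function names (md5_of_file, group_by_digest, copy_one_per_group) and the choice to prefix each copied file with its digest are my own assumptions. It also assumes the target directory sits outside the source directory, otherwise the copies would be scanned on the next run.

```python
import hashlib
import shutil
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Step 1: return the hex md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_digest(source_dir):
    """Step 2: map each digest to the list of files that share it."""
    groups = defaultdict(list)
    for path in sorted(Path(source_dir).rglob("*")):
        if path.is_file():
            groups[md5_of_file(path)].append(path)
    return dict(groups)


def copy_one_per_group(groups, target_dir):
    """Step 3: copy the first file of each group to target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for digest, paths in groups.items():
        first = paths[0]
        # Hypothetical naming scheme: prefix with the digest so that two
        # different photos with the same file name do not overwrite each other.
        destination = target / f"{digest}_{first.name}"
        shutil.copy2(first, destination)
        copied.append(destination)
    return copied
```

Calling copy_one_per_group(group_by_digest("photos"), "unique_photos") would leave one copy of each distinct photo in unique_photos, where both directory names are placeholders. Any group with more than one path is a set of duplicates, at least as far as md5 can tell.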

The script seems to be relatively fast. It worked its way through 2500 photos in about 10 minutes. This is on a Mac Mini with a 2.7 GHz i7 processor.

The first version is not fancy and is not loaded up with options. There are a number of options I would like to add, including a paranoid mode and a statistical mode.

If you are interested, the source code is available at my Github (https://github.com/tyc/dupfinder).