MD5s

Using an MD5 digest (or md5 sum) is a neat way of building a predictable key for data. Obviously there is the issue of MD5 collisions (where two completely different pieces of source data both produce the same MD5 digest), but unless you’re building medical or safety equipment, the chance of an accidental collision in general text manipulation is pretty negligible.

However, MD5s can be represented in several ways. Let’s discount the binary encoding of the 128 bits (16 bytes) of data, as that’s rather cumbersome, and if you’re storing this in a database such as MySQL, there isn’t a 16-byte numeric data type; BIGINT is 8 bytes, so you’d have to use two BIGINTs and do lots of horrible stuff.
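To give a feel for the "horrible stuff", here is a minimal sketch of splitting a digest into two 64-bit integers. The helper name md5_as_two_uint64 is made up for illustration, and the sketch assumes a perl built with 64-bit integer support (needed for the Q pack template) and that the two halves would go into unsigned BIGINT columns.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper (not part of Digest::MD5): split the 16 raw digest
# bytes into two unsigned 64-bit integers, big-endian, high half first.
# Assumes a perl with 64-bit integer support.
sub md5_as_two_uint64 {
    my ($data) = @_;
    return unpack('Q>2', Digest::MD5::md5($data));
}

my ($high, $low) = md5_as_two_uint64("foobarbasbifffoobarbasbiff");
printf "high: %u\nlow:  %u\n", $high, $low;

# Getting back to a readable form means re-packing both halves in order.
printf "hex:  %s\n", unpack('H*', pack('Q>2', $high, $low));
```

Every lookup would then have to compare two columns instead of one, which is a fair part of why this route is being discounted.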

That brings us to the base encodings. Base 16, or hexadecimal, requires a text data type to store the result, as the Base16 encoding contains the digits 0-9 and the letters A-F (or a-f; case is irrelevant in Base 16). It is 32 characters long. We can stuff that in a column with no trouble (char(32)).

We can also use a Base 64 encoding, which uses upper- and lower-case letters and a couple of symbols (+ and /) as well as the numerals 0-9. This comes to 22 characters (you’ll sometimes see == appended to a Base64 digest to pad it to 24 characters). Using 22 chars as a key instead of 32 is 31.25% less data. That makes your indexes that much more compact, as well as the column data.
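As a quick check of those numbers, the sketch below prints both encodings of the same digest and works out the saving. It uses only md5_hex and md5_base64 from Digest::MD5; the sample string is arbitrary.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;

my $data = "foobarbasbifffoobarbasbiff";

my $hex = Digest::MD5::md5_hex($data);       # 32 chars, lower-case 0-9a-f
my $b64 = Digest::MD5::md5_base64($data);    # 22 chars, no trailing "=="

printf "hex:    %s (%d chars)\n", $hex, length($hex);
printf "base64: %s (%d chars)\n", $b64, length($b64);

# (32 - 22) / 32 = 0.3125, i.e. 31.25% fewer characters per key
my $saving = (length($hex) - length($b64)) / length($hex) * 100;
printf "saving: %.2f%%\n", $saving;
```

One thing to watch: unlike hex, Base64 is case-sensitive, so this assumes the key column compares case-sensitively (for example with a binary collation).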

It may not be a perfect primary key, but it’s possibly reasonable. But then comes the question of converting between Base16 and Base64. Here’s one way:

#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;
use MIME::Base64;

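my $data = "foobarbasbifffoobarbasbiff";

# md5_base64() returns the 22-character digest without "==" padding.
my $md5_base64 = Digest::MD5::md5_base64($data);

# Decode the Base64 back to the 16 raw bytes, then unpack them as hex.
my $md5_hex = unpack('H*', MIME::Base64::decode_base64($md5_base64));
printf "%s in base64 as hex: %s\n", $md5_base64, $md5_hex;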
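
Going the other way, from hex to Base64, can be sketched along the same lines. The helper name hex_to_base64 is made up for illustration; it packs the hex back into raw bytes, encodes them, and strips the padding so the result matches what md5_base64 produces.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;
use MIME::Base64;

# Hypothetical helper: convert a 32-character hex digest to unpadded Base64.
sub hex_to_base64 {
    my ($hex) = @_;
    # pack('H*') turns the hex string back into the 16 raw bytes;
    # the empty second argument stops encode_base64 appending a newline.
    my $b64 = MIME::Base64::encode_base64(pack('H*', $hex), '');
    $b64 =~ s/=+\z//;    # drop the "==" padding to match md5_base64()
    return $b64;
}

my $data = "foobarbasbifffoobarbasbiff";
my $hex  = Digest::MD5::md5_hex($data);
my $b64  = hex_to_base64($hex);
printf "%s in hex as base64: %s\n", $hex, $b64;
```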