CODE: Similarity Hashing

Perusing
git://svcs.cs.pdx.edu/storage/git/simhash.git

(the link points to a web interface for those less git-enabled, but you want to get git-enabled anyhow) will reveal a UNIX utility that I wrote about 1.5 years ago and finished cleaning up a bit tonight. This is code for "similarity hashing": building small "fingerprints" of large files so that similarity comparisons between the files run faster. simhash solves one piece of the file-analysis puzzle; clustering and phylogenetic tree building can be layered on top.
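For readers who want a feel for the general idea before cloning the repository, here is a minimal Python sketch of one common fingerprinting approach. It is an illustration only and not necessarily what simhash itself does: the function names, the 8-byte shingle length, the 128-hash fingerprint size, and the choice of CRC32 are all assumptions made for this example. Each file is reduced to the smallest hashes of its overlapping byte windows, and two fingerprints are compared by how many of the smallest hashes in their union they share.

```python
import zlib


def fingerprint(data: bytes, shingle_len: int = 8, size: int = 128) -> frozenset:
    """Return up to `size` of the smallest CRC32 hashes of the byte shingles in `data`.

    Hypothetical helper for illustration; parameters are arbitrary choices.
    """
    if len(data) < shingle_len:
        # Input shorter than one shingle: hash the whole thing.
        hashes = {zlib.crc32(data)}
    else:
        # Hash every overlapping window of `shingle_len` bytes.
        hashes = {
            zlib.crc32(data[i:i + shingle_len])
            for i in range(len(data) - shingle_len + 1)
        }
    # Keep only the smallest hashes as the fingerprint.
    return frozenset(sorted(hashes)[:size])


def similarity(a: frozenset, b: frozenset, size: int = 128) -> float:
    """Estimate similarity (0.0 to 1.0) from two fingerprints.

    Takes the `size` smallest hashes in the union and returns the
    fraction of them that appear in both fingerprints.
    """
    smallest = sorted(a | b)[:size]
    shared = sum(1 for h in smallest if h in a and h in b)
    return shared / len(smallest)


if __name__ == "__main__":
    original = b"the quick brown fox jumps over the lazy dog " * 50
    edited = original.replace(b"lazy", b"sleepy", 5)
    print(similarity(fingerprint(original), fingerprint(edited)))
```

The point of the design is that only the fingerprints need to be stored and compared, so the cost of a comparison depends on the fingerprint size and not on the size of the files.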

Try it out and let me know what you think. (B)