(the link points to a web interface for those less git-enabled, but you want to get git-enabled anyhow) will reveal a UNIX utility that I wrote about 1.5 years ago and finished cleaning up a bit tonight. This is code for "similarity hashing", in which we build small "fingerprints" of large files to speed up similarity comparisons between them. simhash solves one piece of the file analysis puzzle; clustering and phylogenetic tree building can be layered on top.
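For a rough feel of the fingerprint idea, here is a minimal Python sketch. It is not the simhash code itself, and the scheme is an assumption chosen for illustration: hash every overlapping run of bytes (a "shingle") with CRC32, keep only the smallest few hashes as the fingerprint, and score two files by how much their fingerprints overlap. The names `fingerprint`, `similarity`, `shingle_len`, and `keep` are all made up for this example.

```python
import zlib


def fingerprint(data: bytes, shingle_len: int = 8, keep: int = 128) -> frozenset:
    """Return a small fingerprint: the `keep` smallest distinct shingle hashes.

    Illustrative scheme only; parameter names and defaults are assumptions.
    """
    if len(data) < shingle_len:
        # Too short to shingle: hash the whole input as a single shingle.
        hashes = {zlib.crc32(data)}
    else:
        # Hash every overlapping window of shingle_len bytes.
        hashes = {
            zlib.crc32(data[i:i + shingle_len])
            for i in range(len(data) - shingle_len + 1)
        }
    # Keeping the smallest hashes gives every file the same selection rule,
    # so similar files tend to retain many of the same hashes.
    return frozenset(sorted(hashes)[:keep])


def similarity(fp_a: frozenset, fp_b: frozenset) -> float:
    """Rough similarity score in [0, 1]: shared hashes over all hashes."""
    union = fp_a | fp_b
    if not union:
        return 1.0
    return len(fp_a & fp_b) / len(union)
```

Comparing two files then means comparing two small sets of integers rather than the files themselves, for example `similarity(fingerprint(open("a", "rb").read()), fingerprint(open("b", "rb").read()))`, and a clustering step could work from those pairwise scores.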
Try it out and let me know what you think. (B)