hitsaru

HashChecker – a tool for duplicate file detection

Filed under Development

If you are anywhere near as ‘techie’ as I am, you have tons of media: thumb drives, SD cards, hard drives. I even have CDs. I won’t confess to whether I have floppy disks. Guess. As the year closed out, I finally started the massive project of consolidating all the media I’ve had lying around forever. So far I’ve made it through the SD cards and thumb drives. Most of them, anyway.

The problem with all of this is that I have redundant data. Who doesn’t? Isn’t that the point of backup media? But when you don’t know what’s a backup and what’s not, that gets hard to sort out. When you’re conglomerating dozens of drives and terabytes of data, you end up with duplicates and bloat.

I love Weekend Warrior style projects, and no, copying files is not exactly my idea of that. I really love it when they involve coding. I love having an idea and knowing there’s a way to use a computer to solve the problem. I love having a random idea and just coding from the hip. It’s hacklife style. The downside is that I tend to make people with formal programming education cringe when they look at my messy code (my wife included), but the most important thing is: did you get the results you wanted?

In this case I needed to know how many duplicates I had and where they were. I coded up something in Python (https://github.com/girbotphone/girgit/tree/master/HASHCHECKER) that does just that. Using os.walk, it combs a directory recursively to build a file tree; using hashlib, it generates a SHA-256 hash of every file in every directory; and it stores all of this in a SQLite database. The compare script then iterates through that database and reports how many identical files exist across the mapped drive or directory. The same database could also serve as a file integrity table, though I haven’t yet written a checker to verify integrity.
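The walk-hash-store approach can be sketched in a few lines. This is a minimal illustration of the idea, not the actual HashChecker code; the table schema and function names here are my own assumptions:

```python
# Sketch: recursively hash a directory tree with SHA-256 and store
# (path, hash) pairs in SQLite, then find hashes that appear twice.
# Schema and names are illustrative, not taken from the repo.
import hashlib
import os
import sqlite3

def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so huge files never load fully into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def index_tree(root, db_path="hashes.db"):
    """Walk root recursively and record every file's hash in SQLite."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, hash TEXT)"
    )
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                con.execute("INSERT OR REPLACE INTO files VALUES (?, ?)",
                            (full, sha256_of(full)))
            except OSError:
                pass  # unreadable file; skip and keep walking
    con.commit()
    return con

def duplicates(con):
    """Return (hash, count) rows for every hash seen more than once."""
    return con.execute(
        "SELECT hash, COUNT(*) AS n FROM files "
        "GROUP BY hash HAVING n > 1 ORDER BY n DESC"
    ).fetchall()
```

Chunked reading matters here: media collections tend to include multi-gigabyte files, and `f.read()` without a size would pull each one entirely into RAM.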

In my case, I ran it against my backups-in-progress and found nearly 80k identical files. Ouch. Now to sort those out and decide what to delete…
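For the sorting step, one approach is to group file paths by hash so each set of identical files can be reviewed together before anything gets deleted. This sketch assumes a SQLite table `files(path TEXT, hash TEXT)` like the one the hasher would produce; the schema is my guess, not confirmed from the repo:

```python
# Sketch: list each duplicate group (one hash, many paths) from an
# assumed files(path, hash) SQLite table, for manual review.
import sqlite3

def duplicate_groups(db_path="hashes.db"):
    """Yield (hash, [paths]) for every hash shared by 2+ files."""
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT hash, GROUP_CONCAT(path, '|') FROM files "
        "GROUP BY hash HAVING COUNT(*) > 1"
    )
    # '|' as a separator assumes it never appears in paths; pick
    # another delimiter if your filenames are stranger than mine.
    for h, joined in rows:
        yield h, joined.split("|")
```

A sensible policy is to keep the first path in each group and treat the rest as deletion candidates, reviewing the list before actually removing anything.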

Keep Hacking!

Stay Curious, Folks.