Coredump
Work, play, and everything in-between

Find duplicate files

Update: As Pádraig Brady, fslint maintainer, pointed out: fslint/findup *is* a shell script.

My 500-GB Seagate FreeAgent Desktop is almost filled to the brim (there’s *only* ~70GB free space left) so I need to find all duplicate files for clean-up.

Fortunately, there are tools to do just this. I tried fslint, which is also available in the Fedora repository. I also found several nifty scripts on the web.

I settled on a Perl script from PerlMonks, which I modified a bit: it uses digest() instead of hexdigest(), and I removed the calculation of duplicate file size.

#!/usr/bin/perl -w
 
use strict;
use File::Find;
use Digest::MD5;
 
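# Maps a file size to the list of files with that size.
my %files;

find(\&check_file, $ARGV[0] || ".");

# Separate the file names in each printed group with ", ".
local $" = ", ";
# Largest sizes first; only sizes shared by several files get hashed.
foreach my $size (sort {$b <=> $a} keys %files) {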
  next unless @{$files{$size}} > 1;
  my %md5;
  foreach my $file (@{$files{$size}}) {
    # Skip files that cannot be opened (e.g. permission denied).
    open(my $fh, '<', $file) or next;
    binmode($fh);
    push @{$md5{Digest::MD5->new->addfile($fh)->digest}}, $file;
    close($fh);
  }
  foreach my $hash (keys %md5) {
    next unless @{$md5{$hash}} > 1;
    print "$size @{$md5{$hash}}\n";
  }
}
 
# Called by File::Find for every entry; records regular files by size.
sub check_file {
  -f && push @{$files{(stat(_))[7]}}, $File::Find::name;
}

I’m a shell-script junkie, so I whipped up something in Bash. It’s not as fast as the Perl implementation or fslint, but it does the job.

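# Hash every file, keep only lines whose first 32 characters (the MD5 sum)
# repeat, then join each group of duplicates onto one line.
find "$@" -type f -exec md5sum {} \; |
  sort | uniq -w 32 -D |
  awk 'NF { h = substr($0, 1, 32)
            a[h] = (h in a) ? a[h] FS substr($0, 35) : $0 }
       END { for (i in a) print a[i] }'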

(Awk is pretty cool, isn’t it?)
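
A variation on the pipeline above, as a rough sketch only: it wraps the steps in a function (find_dups is a made-up name), lets find hand many files to each md5sum call via -exec ... {} +, and takes the file name from a fixed column so names containing spaces stay intact. It assumes GNU md5sum and uniq, and file names without newlines or backslashes. It was not part of the timings below.

```shell
#!/usr/bin/env bash
# Sketch: print each group of files with identical MD5 sums on one line,
# as "<md5>\t<file>\t<file>...". find_dups is a hypothetical helper name.
find_dups() {
  # -exec ... {} + passes many files to each md5sum call instead of one per file.
  find "$@" -type f -exec md5sum {} + |
    sort |
    uniq -w 32 -D |
    awk '{
      hash = substr($0, 1, 32)   # MD5 hex digest
      name = substr($0, 35)      # file name follows the digest and two separator characters
      groups[hash] = (hash in groups) ? groups[hash] "\t" name : hash "\t" name
    }
    END { for (h in groups) print groups[h] }'
}

# Example call (the path is a placeholder):
#   find_dups /path/to/directory
```

The order of the printed groups is whatever awk's for-in loop yields, so pipe the result through sort if a stable order matters.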

Of course, I tested all three on a directory with about 300 duplicate files. Here are the results:

fslint/findup:

real    0m3.093s
user    0m1.812s
sys     0m0.368s

Perl:

real    0m4.668s
user    0m0.644s
sys     0m0.188s

Shell:

real    0m30.475s
user    0m1.842s
sys     0m1.692s

Okay, so the shell script’s performance was abysmal, but hey, it’s always reassuring to know that there is more than one way to do it. (Errr… that’s a Perl motto.)
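
The Perl script only hashes files whose size matches at least one other file's, while the shell pipeline hashes everything. A rough shell sketch of the same size-first idea follows. find_dups_by_size is a made-up name, and the sketch assumes GNU find (-printf), GNU xargs (-r), GNU uniq (-w, -D) and file names without newlines or backslashes. It prints the md5sum lines of the duplicates rather than one line per group, and it was not included in the timings above.

```shell
#!/usr/bin/env bash
# Sketch: hash only files that share their size with at least one other file.
# find_dups_by_size is a hypothetical helper name.
find_dups_by_size() {
  # Emit "<size>\t<path>" for every regular file.
  find "$@" -type f -printf '%s\t%p\n' |
    awk -F'\t' '
      { size[NR] = $1; line[NR] = $0; seen[$1]++ }
      END {
        # Print only the paths whose size occurs more than once.
        for (i = 1; i <= NR; i++)
          if (seen[size[i]] > 1)
            print substr(line[i], length(size[i]) + 2)
      }' |
    tr '\n' '\0' |
    xargs -0 -r md5sum |
    sort |
    uniq -w 32 -D
}

# Example call (the path is a placeholder):
#   find_dups_by_size /path/to/directory
```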

2 Comments

Pádraig Brady says:

fslint/findup is shell script :)
http://code.google.com/p/fslint/source/browse/trunk/fslint/findup

Posted on 6 January 2010, 9:38 PM UTC

Ian Dexter says:

WOW! Didn’t realize that. Looks like I don’t have to reinvent the wheel, then.

Thanks for pointing this out. :)

Posted on 7 January 2010, 1:59 PM UTC
