Find duplicate files
Update: As Pádraig Brady, fslint maintainer, pointed out: fslint/findup *is* a shell script.
My 500-GB Seagate FreeAgent Desktop is almost filled to the brim (there’s *only* ~70GB free space left) so I need to find all duplicate files for clean-up.
Fortunately, there are tools to do just this. I tried fslint, which is also available in the Fedora repository. I also found several nifty scripts on the web.
I settled on a Perl script from PerlMonks, which I modified a bit: it uses digest() instead of hexdigest(), and I removed the calculation of the total duplicate file size.
#!/usr/bin/perl -w
use strict;
use File::Find;
use Digest::MD5;

my %files;    # file size => list of paths with that size
find(\&check_file, $ARGV[0] || ".");

local $" = ", ";
foreach my $size (sort {$b <=> $a} keys %files) {
    # Only sizes shared by more than one file can contain duplicates.
    next unless @{$files{$size}} > 1;
    my %md5;    # MD5 digest => list of paths with that digest
    foreach my $file (@{$files{$size}}) {
        open(FILE, $file) or next;
        binmode(FILE);
        push @{$md5{Digest::MD5->new->addfile(*FILE)->digest}}, $file;
    }
    foreach my $hash (keys %md5) {
        next unless @{$md5{$hash}} > 1;
        print "$size @{$md5{$hash}}\n";
    }
}

# Called by find() for every entry; records regular files by size.
sub check_file {
    -f && push @{$files{(stat(_))[7]}}, $File::Find::name;
}
I’m a shell-script junkie, so I whipped up something in Bash. It’s not as fast as the Perl implementation or fslint, but it does the job.
# Note: $2 is only the first whitespace-separated word of the file name,
# so names containing spaces are cut short.
find "$@" -type f -exec md5sum {} \; |
    sort -k 1,32 |
    uniq -w 32 -d -D |
    awk 'NF { h = substr($0, 1, 32); a[h] = (a[h]) ? a[h] FS $2 : $0 }
         END { for (i in a) print a[i] }'
(Awk is pretty cool, isn’t it?)
Of course, I tested all three on a directory with about 300 duplicate files. Here are the results.
fslint/findup:
real 0m3.093s
user 0m1.812s
sys  0m0.368s

Perl:
real 0m4.668s
user 0m0.644s
sys  0m0.188s

Shell:
real 0m30.475s
user 0m1.842s
sys  0m1.692s
Okay, so the shell script’s performance was abysmal, but hey, it’s always reassuring to know that there is more than one way to do it. (Errr… that’s a Perl motto.)
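One difference that may explain part of the gap: the Perl script only hashes files that share their size with at least one other file, while the Bash pipeline runs md5sum on every file it finds, one process per file. Here is a rough sketch of the same size-first idea in shell. The function name find_dupes is made up for illustration, and the code assumes GNU find, xargs, md5sum, sort and awk. It has not been timed against the three runs above.

```shell
#!/usr/bin/env bash
# Sketch only: find_dupes is a hypothetical name. Assumes GNU find, xargs,
# md5sum, sort and awk, and file names without newlines or backslashes.
find_dupes() {
    # 1. Print "size<TAB>path" for every regular file.
    find "$@" -type f -printf '%s\t%p\n' |
    # 2. Keep only the paths whose size occurs more than once.
    awk '{
        size = $1
        path = substr($0, index($0, "\t") + 1)
        count[size]++
        paths[size] = paths[size] path "\n"
    }
    END {
        for (s in count)
            if (count[s] > 1) printf "%s", paths[s]
    }' |
    # 3. Hash the survivors, passing many files to each md5sum process.
    xargs -r -d '\n' md5sum -- |
    sort |
    # 4. Print each group of identical hashes on one tab-separated line.
    awk '{
        hash = substr($0, 1, 32)   # MD5 hex digest
        name = substr($0, 35)      # file name, spaces preserved
        seen[hash]++
        group[hash] = (seen[hash] > 1) ? group[hash] "\t" name : name
    }
    END {
        for (h in seen)
            if (seen[h] > 1) print group[h]
    }'
}
```

Calling it as find_dupes ~/some/dir prints one line per set of duplicates, with the paths separated by tabs so that names containing spaces stay intact.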