[Novalug] Poor Man's Dedup

James Ewing Cottrell 3rd JECottrell3@Comcast.NET
Tue Oct 20 20:44:10 EDT 2009


Bryan J. Smith wrote:
> Interesting approach. 

Yeah, like I said, the File System is a Poor Man's Database. SQL selects 
become finds (or ls) and greps, maybe with a sort thrown in.

Now, for the Poor Man's Dedup. Suppose you rsync or dump or otherwise 
capture a filesystem somewhere, and organize your dumps by date:

/dump/20091019 = yesterday
/dump/20091020 = today
/dump/20091021 = tomorrow

/repo/####/#### is your repository. After the day's dump is completed, 
hash every file in it (I am assuming an 8 character hash is returned, 
but I just made that up). Let's say a file hashes to 13572468. Move it 
to /repo/1357/2468 (unless a file is already there, in which case the 
two should compare equal and the new copy can go away) and symlink it 
back to its original location.
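
The steps above can be sketched in Python. This is only a rough sketch 
under assumptions that are not in the original post: the hash is the 
first 8 hex digits of SHA-256, and the names file_hash, dedup_file and 
dedup_tree are made up for illustration.

```python
import filecmp
import hashlib
import os
import shutil

def file_hash(path, length=8):
    """First `length` hex digits of SHA-256 (an assumed choice of hash)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()[:length]

def dedup_file(path, repo):
    """Move `path` into repo/####/#### and leave a symlink in its place."""
    digest = file_hash(path)
    target = os.path.join(repo, digest[:4], digest[4:])
    if os.path.exists(target):
        # Same hash already stored: contents should compare equal.
        if not filecmp.cmp(path, target, shallow=False):
            raise RuntimeError(f"hash collision: {path} vs {target}")
        os.remove(path)
    else:
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.move(path, target)
    os.symlink(os.path.abspath(target), path)
    return target

def dedup_tree(dump_dir, repo):
    """Deduplicate every regular file under one day's dump."""
    for root, _dirs, files in os.walk(dump_dir):
        for name in files:
            path = os.path.join(root, name)
            # Skip symlinks, e.g. files already moved into the repository.
            if os.path.isfile(path) and not os.path.islink(path):
                dedup_file(path, repo)
```

Something like dedup_tree("/dump/20091020", "/repo") would then be run 
once the day's dump has finished. An 8 character hash will collide 
sooner or later on a big filesystem, which is why the sketch refuses to 
continue when two files with the same hash differ.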

What I didn't address was modes, owners and groups, but they could be 
part of the filename.

Alternatively, you could use hardlinks, in which case, tar/cpio/rsync 
could pull each snapshot out easily.
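
A hardlink variant might look like the sketch below. As before, the 
truncated SHA-256 hash and the function names are assumptions made up 
for illustration, and hardlinks only work when /dump and /repo live on 
the same filesystem.

```python
import filecmp
import hashlib
import os

def file_hash(path, length=8):
    """First `length` hex digits of SHA-256 (an assumed choice of hash)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()[:length]

def hardlink_dedup(path, repo):
    """Make `path` and repo/####/#### two names for the same inode."""
    digest = file_hash(path)
    target = os.path.join(repo, digest[:4], digest[4:])
    if not os.path.exists(target):
        # First time this content is seen: give it a name in the repo.
        os.makedirs(os.path.dirname(target), exist_ok=True)
        os.link(path, target)
        return target
    if os.path.samefile(path, target):
        return target  # already linked on an earlier run
    if not filecmp.cmp(path, target, shallow=False):
        raise RuntimeError(f"hash collision: {path} vs {target}")
    # Link under a temporary name, then rename over the original, so the
    # path never disappears from the snapshot.
    tmp = path + ".dedup-tmp"
    os.link(target, tmp)
    os.replace(tmp, path)
    return target
```

Since all the links share one inode, they also share one mode, owner 
and group, so two files with identical contents but different 
permissions cannot both be represented this way.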

In the symlink case, you could modify tar/cpio/rsync to follow only 
certain symlinks, like the ones that start with /repo.

Another possibility is to make each file a directory, and store the 
versions as hashes, so that you have /etc/passwd/{13572468,97538642} 
and you can limit the number of hash computations you have to perform, 
as diffs (but only when size/mtime match) probably go quicker.
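
One way to read that, sketched in Python: before hashing, look for a 
stored version with the same size and mtime and just compare against 
it. The function name store_version, the truncated SHA-256 hash, and 
the choice to copy rather than move are all assumptions for the sake of 
the example.

```python
import filecmp
import hashlib
import os
import shutil

def store_version(src, versions_dir):
    """Store `src` in `versions_dir` under its hash; return the stored path."""
    os.makedirs(versions_dir, exist_ok=True)
    st = os.stat(src)
    # Cheap check first: only a stored version with the same size and
    # mtime is compared byte for byte; a match means no hash is computed.
    for name in os.listdir(versions_dir):
        candidate = os.path.join(versions_dir, name)
        cst = os.stat(candidate)
        if (cst.st_size, int(cst.st_mtime)) == (st.st_size, int(st.st_mtime)):
            if filecmp.cmp(src, candidate, shallow=False):
                return candidate
    # No match: fall back to hashing (reads the whole file into memory).
    with open(src, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()[:8]
    dest = os.path.join(versions_dir, digest)
    if not os.path.exists(dest):
        shutil.copy2(src, dest)  # copy2 keeps the mtime for later checks
    return dest
```

Whether the compare really beats the hash depends on the files; the 
size/mtime test at least keeps it from being tried on versions that 
obviously differ.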

Doing a "chattr +i" to prevent meddling is also a possibility.


