[Novalug] Poor Man's Dedup
James Ewing Cottrell 3rd
JECottrell3@Comcast.NET
Tue Oct 20 20:44:10 EDT 2009
Bryan J. Smith wrote:
> Interesting approach.
Yeah, like I said, the File System is a Poor Man's Database. SQL selects
become finds (or ls) and greps, maybe with a sort thrown in.
Now, for the Poor Man's Dedup. Suppose you rsync or dump or otherwise
capture a filesystem somewhere, and organize your dumps by date:
/dump/20091019 = yesterday
/dump/20091020 = today
/dump/20091021 = tomorrow
/repo/####/#### is your repository. After the day's dump is completed,
hash every file in it (I am assuming an 8 character hash is returned,
but I just made that up). Let's say a file hashes to 13572468. Move
the file to /repo/1357/2468 (unless a file is already there, in which
case the two should compare equal) and put a symlink to it at the
file's original path in the dump.
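
A minimal sketch of the steps above in Python, under some assumptions
that are mine and not part of the scheme: the 8 character hash is taken
as the first 8 hex digits of SHA-256, the function names (file_hash,
dedup_file, dedup_tree) are made up for illustration, and a mismatch on
an existing repository entry is simply reported as an error. With only
8 hex digits, collisions between different files are a real possibility
on a large filesystem, which is why the byte-for-byte comparison is kept.

```python
import filecmp
import hashlib
import os
import shutil


def file_hash(path, length=8):
    """Return the first `length` hex digits of the file's SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()[:length]


def dedup_file(path, repo):
    """Move `path` into the repository and leave a symlink in its place."""
    name = file_hash(path)
    # 13572468 becomes <repo>/1357/2468
    target = os.path.abspath(os.path.join(repo, name[:4], name[4:]))
    if os.path.exists(target):
        # Same hash already stored: the contents should compare equal.
        if not filecmp.cmp(path, target, shallow=False):
            raise RuntimeError(f"hash collision: {path} vs {target}")
        os.remove(path)
    else:
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.move(path, target)
    os.symlink(target, path)
    return target


def dedup_tree(dump_dir, repo):
    """Deduplicate every regular file under one day's dump directory."""
    for root, _dirs, files in os.walk(dump_dir):
        for filename in files:
            path = os.path.join(root, filename)
            # Skip files that were already replaced by symlinks.
            if os.path.isfile(path) and not os.path.islink(path):
                dedup_file(path, repo)
```

Something like dedup_tree("/dump/20091020", "/repo") would then be run
once the day's dump has finished.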
What I didn't address was modes, owners and groups, but they could be
part of the filename.
Alternatively, you could use hardlinks, in which case, tar/cpio/rsync
could pull each snapshot out easily.
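
The hardlink variant could look roughly like the sketch below. The
function name hardlink_dedup and the truncated SHA-256 hash are my
assumptions for illustration. Two things to keep in mind: hardlinks only
work when /dump and /repo are on the same filesystem, and all links to a
file share one inode, so they also share mode, owner and group.

```python
import filecmp
import hashlib
import os


def hardlink_dedup(path, repo, length=8):
    """Make `path` a hardlink to the repository copy of its contents."""
    with open(path, "rb") as handle:
        name = hashlib.sha256(handle.read()).hexdigest()[:length]
    target = os.path.join(repo, name[:4], name[4:])
    if not os.path.exists(target):
        # First time this content is seen: the repo entry is a new link.
        os.makedirs(os.path.dirname(target), exist_ok=True)
        os.link(path, target)
        return target
    if os.path.samefile(path, target):
        return target  # already deduplicated
    if not filecmp.cmp(path, target, shallow=False):
        raise RuntimeError(f"hash collision: {path} vs {target}")
    # Link under a temporary name, then rename over the original so the
    # path never disappears from the snapshot.
    temp = path + ".dedup-tmp"
    os.link(target, temp)
    os.replace(temp, path)
    return target
```

Because every snapshot then contains ordinary files, an unmodified
tar/cpio/rsync reads them without knowing about the repository.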
In the symlink case, you could modify tar/cpio/rsync to follow only
certain symlinks, like the ones that start with /repo.
Another possibility is to make each file a directory and store the
versions as hashes, so that you have /etc/passwd/{13572468,97538642}.
That way you can limit the number of hash computations you have to
perform, as diffs (but only when size/mtime match) probably go quicker.
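
One possible reading of the per-file directory idea, sketched in Python.
The assumptions here are mine: store_version is a hypothetical name,
stored versions keep the mtime of the file they were copied from, and
the hash is a truncated SHA-256. A new file is first compared against
stored versions whose size and mtime match, and is hashed only if none
of them is identical.

```python
import filecmp
import hashlib
import os
import shutil


def store_version(path, version_dir, length=8):
    """Store `path` as a version in `version_dir`.

    Returns (stored_path, hashed), where `hashed` says whether a hash
    had to be computed.
    """
    os.makedirs(version_dir, exist_ok=True)
    new = os.stat(path)
    for name in sorted(os.listdir(version_dir)):
        candidate = os.path.join(version_dir, name)
        old = os.stat(candidate)
        same_stat = (old.st_size, int(old.st_mtime)) == (
            new.st_size,
            int(new.st_mtime),
        )
        # Only diff when size and mtime match; skip the hash on a hit.
        if same_stat and filecmp.cmp(path, candidate, shallow=False):
            return candidate, False
    with open(path, "rb") as handle:
        name = hashlib.sha256(handle.read()).hexdigest()[:length]
    target = os.path.join(version_dir, name)
    if not os.path.exists(target):
        shutil.copy2(path, target)  # copy2 keeps the original mtime
    return target, True
```

This sketch does not check for hash collisions when a version with the
same name already exists; a fuller version would compare contents there
as well.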
Doing a "chattr +i" to prevent meddling is also a possibility.