[Prophesy] Versioning filesystem

Mon Mar 24 09:39:44 EST 2003

Chances are I'm talking to myself, after not doing anything here for nine 
months or so.  That doesn't mean things haven't been happening.  
Specifically, I've been introspecting.

The main subject of introspection has been how I'd go about implementing a 
versioning filesystem, and even whether that seems like a good thing to do.  I 
basically beat my head against a wrong approach for most of the nine months, 
pursuing the idea of hooking out file_operations from a vfs path_walk.  I 
thought that would be the most efficient way to hook the thing up, because 
writes could be specially handled, whereas reads would just follow the normal 
path.  This never worked out cleanly.  The vfs just isn't set up that way, 
and would have required major surgery.  Besides that, I gradually realized 
that I did not always want to pass reads straight through.  This would limit 
my design options by not allowing me to generate the read data on the fly.  Then 
I saw the light, by realizing that Martin Poole already had the right idea 
with his newuserfs.

Newuserfs is a forward port of Jeremy Fitzhardinge's userfs, which works by 
passing vfs operations through a pipe to user space.

After thinking about this a short time, I realized that I could start with 
ramfs, which implements full posix semantics, and just bolt that onto a 
usermode daemon with a socket.  There are a number of right things about 
this approach, not least of which is the fact that the stack never gets very 
deep for either the task calling for file operations or the server 
implementing them.  This is because the kernel does a task switch to the 
server each time a complex low-level file operation needs to be done, and the 
stack-hungry things happen in user space.  There's no recursive calling into 
the kernel.

Another right thing is the way caching works with this approach, specifically 
the page cache and dcache.  For both, the vfs only needs help from the 
usermode daemon when some name or file data isn't in its cache.  So the 
usermode implementation can be quite slow and the cache will cover that up.  
Not that I want to make the usermode part slow, but in theory it could be, 
especially if there is database access and application of a chain of file 
differences going on.
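
To make the caching point concrete, here is a minimal sketch in Python.  It 
is not Stuf code: SlowBackend, CachedFront and PAGE_SIZE are made-up names, 
and a plain dictionary stands in for the page cache.  It only shows that a 
slow server is consulted once per page, and repeat reads never reach it.

```python
import time

PAGE_SIZE = 4096  # illustrative page size, not taken from Stuf


class SlowBackend:
    """Hypothetical stand-in for a slow usermode server."""

    def __init__(self, files):
        self.files = files  # name -> bytes
        self.calls = 0      # how many times the "server" was asked

    def read_page(self, name, index):
        self.calls += 1
        time.sleep(0.01)    # pretend to query a database and apply diffs
        data = self.files[name]
        return data[index * PAGE_SIZE:(index + 1) * PAGE_SIZE]


class CachedFront:
    """Hypothetical stand-in for the kernel side with its page cache."""

    def __init__(self, backend):
        self.backend = backend
        self.pages = {}     # (name, page index) -> bytes

    def read_page(self, name, index):
        key = (name, index)
        if key not in self.pages:
            # Cache miss: this is the only time the server is involved.
            self.pages[key] = self.backend.read_page(name, index)
        return self.pages[key]


if __name__ == "__main__":
    backend = SlowBackend({"a": b"x" * 5000})
    front = CachedFront(backend)
    front.read_page("a", 0)
    front.read_page("a", 0)
    print("server calls:", backend.calls)
```

A real page cache also has to deal with invalidation and writeback, which 
this sketch leaves out entirely.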

So I started implementing this about 10 days ago and have been occupied with 
it since.  Things are going pretty well, to the point I could think about a 
code release in a week or two.  The project has a name:

   Stuf - STackable Usermode Filesystem

which is actually not specific to versioning filesystems.  A particular 
filesystem is implemented by a usermode server daemon that implements Stuf's 
socket protocol (which I call "beads").  The server I'm working on now is 
called "simple" and just passes filesystem operations through to the 
underlying filesystem.  After that is working reasonably well, to the point 
that you can, say, compile a kernel on the stacked filesystem, I'll move on 
to a versioning server.
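
As a rough illustration of what a pass-through server looks like in 
principle, here is a sketch in Python.  The wire format is NOT the beads 
protocol: newline-delimited JSON with "op", "name" and "data" fields is an 
assumption made purely for the example, as are the serve, call and demo 
names.  Requests arrive on a socket and are answered by doing the same 
operation on an underlying directory.

```python
import json
import os
import socket
import tempfile
import threading


def serve(sock, root):
    """Answer requests until the peer closes (hypothetical message format)."""
    with sock, sock.makefile("rb") as rfile, sock.makefile("wb") as wfile:
        for line in rfile:
            req = json.loads(line)
            # No path sanitizing here; a real server would need it.
            path = os.path.join(root, req["name"])
            try:
                if req["op"] == "read":
                    with open(path, "rb") as f:
                        reply = {"ok": True, "data": f.read().decode()}
                elif req["op"] == "write":
                    with open(path, "wb") as f:
                        f.write(req["data"].encode())
                    reply = {"ok": True}
                else:
                    reply = {"ok": False, "error": "unknown op"}
            except OSError as e:
                reply = {"ok": False, "error": e.strerror}
            wfile.write(json.dumps(reply).encode() + b"\n")
            wfile.flush()


def call(rfile, wfile, **req):
    """Send one request and wait for its reply."""
    wfile.write(json.dumps(req).encode() + b"\n")
    wfile.flush()
    return json.loads(rfile.readline())


def demo():
    client, server = socket.socketpair()
    with tempfile.TemporaryDirectory() as root:
        worker = threading.Thread(target=serve, args=(server, root))
        worker.start()
        with client, client.makefile("rb") as r, client.makefile("wb") as w:
            wrote = call(r, w, op="write", name="hello.txt", data="hi")
            read = call(r, w, op="read", name="hello.txt")
            missing = call(r, w, op="read", name="nope.txt")
        worker.join()  # the server loop ends once the client side is closed
    return wrote, read, missing


if __name__ == "__main__":
    print(demo())
```

A versioning server would keep the same request loop and change only what 
happens inside each operation.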

At this point I can mount a filesystem with the "stuff" command (Stuf 
Frontend), fork the server, connect the pipe, generate and pass FDs for both 
the mounted and underlying filesystem through the pipe.  The server can ioctl 
the virtual filesystem to take care of special needs that can't be satisfied 
by (or would be too slow and racy with) posix operations.  I can now pass 
open(2) requests through the pipe, and am currently busy implementing 
a new system call that can open a file, given a directory fd and a name.
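
The two mechanisms in this paragraph, passing an FD to the server and 
opening a file from a directory fd plus a name, can be sketched from user 
space.  This is only an illustration under assumptions: it assumes the 
"pipe" is a Unix-domain socket (which is what FD passing via SCM_RIGHTS 
needs), it uses Python 3.9+ for socket.send_fds/recv_fds, and it uses the 
dir_fd argument of os.open, which maps to openat(2) where available, in 
place of the new system call, whose interface isn't shown here.  The helper 
names are made up.

```python
import os
import socket
import tempfile


def send_dir_fd(sock, path):
    """Open a directory and pass its fd to the peer (SCM_RIGHTS)."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        socket.send_fds(sock, [b"dir"], [fd])
    finally:
        os.close(fd)  # the receiver gets its own reference to the directory


def recv_dir_fd(sock):
    """Receive one fd sent by send_dir_fd."""
    _msg, fds, _flags, _addr = socket.recv_fds(sock, 1024, 1)
    return fds[0]


def open_in_dir(dir_fd, name, flags=os.O_RDONLY, mode=0o644):
    """Open `name` relative to an already-open directory fd."""
    return os.open(name, flags, mode, dir_fd=dir_fd)


def demo():
    frontend, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with frontend, server, tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "note.txt"), "w") as f:
            f.write("hello")
        send_dir_fd(frontend, root)    # frontend side
        dir_fd = recv_dir_fd(server)   # server side
        try:
            fd = open_in_dir(dir_fd, "note.txt")
            with os.fdopen(fd, "rb") as f:
                return f.read()
        finally:
            os.close(dir_fd)


if __name__ == "__main__":
    print(demo())
```

Opening relative to a directory fd avoids re-walking a path that may have 
been renamed in the meantime, which is one way to read the "slow and racy" 
remark above.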

A lot of work on high-level SCM design considerations has been done on 
the Arch mailing list, including what needs to be done to satisfy the 
requirements of kernel developers.  It seems to me that much of what has 
been discussed is suitable for implementation as a versioning filesystem, and 
so I have set out to do that.

Regards,

Daniel