[Prophesy] Re: Improved string transformation

Mon Jun 3 17:44:01 EST 2002

On Monday 03 June 2002 08:58, Rasmus Andersen wrote:
> On Sun, Jun 02, 2002 at 01:58:38AM +0200, Daniel Phillips wrote:
> > > And I think
> > > that this is one of the cardinal weak points in CVS, and thusly
> > > one where we should aim for being strong.
> > > 
> > > But I have no good ideas on how to handle this and still get
> > > transparency.
> > 
> > Ah, I see what you're getting at.  OK, we are not going to rely *only* on the 
> > transparent editing interface, but on other means of feeding the scm as well, 
> > to supply the other needed information, which will be kept with the deltas in 
> > the database.  I envision a graphical user interface running while the scm is 
> > running, which gives a view of the source tree and lets you walk over it to 
> > add comments, tags etc.  Of course we will provide a command shell way of 
> > doing these things as well.
> 
> I think my point is that at the lowest level of this, at the FS level,
> can we attach meaning (and comments) to (FS) operations? Some people
> obsessively save their buffers after each little edit and it would
> seem that a revision history reflecting this would not be very helpful.
> 
> On the other hand, if the SCM merely uses the FS operations to gather
> knowledge about changed objects[1], then the user would still have to
> do an explicit 'commit' to make a delta(?) and attach comments. Which
> isn't that far from what you would do anyway without the magic FS.
> 
> Or am I missing something?

No, good point, and it's one I've thought about; I just neglected to say 
anything about it.  My thinking is that the scm will normally save a 
transform against a file every time the file is written to for whatever 
reason, but when you commit, those transforms are collapsed into a single 
transform.  So until you do the commit, you have file-level undo, if you want 
it.  It's just easy to provide this, so why not?  We can also provide an 
option to leave those transforms un-composed in the database, which will eat 
a lot of space and probably be useless, but it might be interesting to 
somebody, and it's easy to do, so again, why not?
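
To make the collapse-on-commit idea concrete, here is a minimal Python sketch.
It is only an illustration under stated assumptions: the copy/insert
representation and the names make_transform, apply_transform and collapse are
hypothetical, not part of any agreed design. Note that collapse() does not
compose transforms algebraically; it takes the simpler route of replaying them
and re-diffing the base against the final text.

```python
from difflib import SequenceMatcher

def make_transform(old, new):
    """Describe `new` as copies from `old` plus literal inserted text."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif j2 > j1:
            # replace or insert: keep the new text literally
            ops.append(("insert", new[j1:j2]))
        # a pure delete needs no op: the old text is simply not copied
    return ops

def apply_transform(old, ops):
    """Rebuild the new string from `old` and a list of copy/insert ops."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.append(old[op[1]:op[2]])
        else:
            out.append(op[1])
    return "".join(out)

def collapse(base, transforms):
    """Replay the per-write transforms, then re-diff base against the result."""
    current = base
    for t in transforms:
        current = apply_transform(current, t)
    return make_transform(base, current)

# Three saves of the same file before a commit (hypothetical contents).
base = "int x;\n"
saves = ["int x = 0;\n", "int x = 0;\nint y;\n", "int x = 1;\nint y;\n"]

pending, prev = [], base
for text in saves:
    pending.append(make_transform(prev, text))  # one transform per write
    prev = text

# File-level undo before the commit: replay all but the last write.
undo_one = base
for t in pending[:-1]:
    undo_one = apply_transform(undo_one, t)

# On commit, the pending transforms become a single transform.
committed = collapse(base, pending)
```

Keeping the un-composed history would then just mean storing `pending`
alongside `committed` instead of discarding it.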

> [1] We haven't discussed the basic object in the SCM. Is it a file?
> A function? A line (of code)?

No, we haven't talked about it much, and it's getting to where we need to do that.

> I could see some nice things coming
> from having smaller granularity than the file one, but since we
> are aiming at having 'loose' dependencies in the SCM I think we
> will get those anyway.

The basic data object will be a transform, according to my current thinking, 
though other database entities will no doubt emerge as we go.  A transform 
expresses the difference between two strings, and we have not said yet whether 
the strings are whole files or something else.  Clearly, a single transform 
cannot be larger than a file, but is it useful for it to be smaller?  From a 
pure data storage point of view, no, that doesn't gain a lot, because if we 
want that, we can still express it with a single transform, and then have a 
list of regions in the transform that are of special interest, rather than 
having separate transforms.  However, in the process of doing some of the 
kinds of calculus that I expect we will want to do, I think we will want to 
generate transforms, on the fly, that are smaller than files, i.e., partition 
transforms into regions that reflect, say, the boundaries of a patch that we 
are trying to merge.

I'm personally not a great fan of line boundaries, as I believe they reduce 
generality.  However, we need to deal with them at times, especially when 
interfacing to diff.  They're likely to figure in our merge algorithms as 
well, since they tend to be conceptually significant from the user's point 
of view.  But as far as letting them invade the data design - there's no 
need, and by being strict about that, the end result will be much more useful 
for handling binary files as well.
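
One way to picture keeping line boundaries out of the data design: compute
differences purely over bytes, and derive line numbers only when something
line-oriented needs them. The Python sketch below is an assumption-laden
illustration; changed_ranges, line_starts and touched_lines are hypothetical
names, and difflib stands in for whatever differencing the scm ends up using.

```python
from bisect import bisect_right
from difflib import SequenceMatcher

def changed_ranges(old: bytes, new: bytes):
    """Byte ranges of `old` that differ from `new`; no notion of lines here."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return [(i1, i2)
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

def line_starts(data: bytes):
    """Offsets at which each line of `data` begins."""
    starts = [0]
    for pos, byte in enumerate(data):
        if byte == 0x0A and pos + 1 < len(data):
            starts.append(pos + 1)
    return starts

def touched_lines(old: bytes, new: bytes):
    """Map byte-level changes onto 1-based line numbers of `old`, for display."""
    starts = line_starts(old)
    lines = set()
    for lo, hi in changed_ranges(old, new):
        first = bisect_right(starts, lo) - 1
        last = bisect_right(starts, max(lo, hi - 1)) - 1
        lines.update(range(first, last + 1))
    return sorted(n + 1 for n in lines)

# The same byte-level routine serves text and binary data alike.
text_lines = touched_lines(b"alpha\nbeta\ngamma\n", b"alpha\nbeta!\ngamma\n")
binary_ranges = changed_ranges(b"\x00\x01\x02\x03", b"\x00\x01\xff\x03")
```

Here the line view is computed after the fact from the byte ranges, so
nothing stored would have to know about newlines.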

A practical question is whether we're going to version directories.  I 
mentioned the idea that each file object would have an id (which is 
universally unique) and the name of the file would be metadata associated 
with the object (i.e., an attribute of the object).  However, we will need to 
look up files rapidly by name, for example, when a file is changed and a 
transform needs to be recorded against it in the database.  This can of 
course be handled efficiently by appropriate use of database indexing.
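
As a rough sketch of the id-plus-name-attribute idea, the Python below keeps
the name as ordinary metadata and maintains a secondary index for lookup by
name. FileStore and its methods are hypothetical, and plain dictionaries stand
in for real database tables and indexes.

```python
import uuid

class FileStore:
    def __init__(self):
        self.attrs = {}       # file id -> attributes (the name is just one)
        self.by_name = {}     # secondary index: current name -> file id
        self.transforms = {}  # file id -> transforms recorded against it

    def create(self, name):
        fid = uuid.uuid4().hex  # stand-in for a universally unique id
        self.attrs[fid] = {"name": name}
        self.by_name[name] = fid
        self.transforms[fid] = []
        return fid

    def rename(self, old_name, new_name):
        # Only metadata and the index change; the object and its history stay put.
        fid = self.by_name.pop(old_name)
        self.attrs[fid]["name"] = new_name
        self.by_name[new_name] = fid

    def record(self, name, transform):
        # The write path knows only the name, so look the id up through the index.
        fid = self.by_name[name]
        self.transforms[fid].append(transform)
        return fid

store = FileStore()
fid = store.create("src/main.c")
store.record("src/main.c", "transform-1")
store.rename("src/main.c", "src/core.c")
same_fid = store.record("src/core.c", "transform-2")
```

Because history hangs off the id rather than the name, a rename does not
split the file's record in two.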

We may sometimes want to traverse the database in directory order, perhaps 
when producing a diff between two tree versions.  Does this mean we want to 
record directories as objects?  I don't know yet.  It may be enough just to 
compute the directories on the fly.
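
Computing directories on the fly could look something like the following
Python sketch, which derives a directory listing purely from file names. The
function name directory_listing and the use of '/'-separated names are
assumptions made for the example.

```python
from collections import defaultdict
import posixpath

def directory_listing(names):
    """Group entries under their parent directories, derived only from names."""
    tree = defaultdict(set)
    for name in names:
        path = name
        while True:
            parent = posixpath.dirname(path)
            tree[parent].add(posixpath.basename(path))
            if not parent:        # reached the top of the tree
                break
            path = parent         # register the directory in its own parent
    return {d: sorted(entries) for d, entries in sorted(tree.items())}

names = ["src/fs/inode.c", "src/main.c", "README"]
listing = directory_listing(names)
# listing[""] holds the top-level entries, listing["src"] those under src/, etc.
```

Walking `listing` in key order gives a directory-order traversal without any
directory objects being stored.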

Drifting further in that direction, the question arises of how much 
filesystem structure we want to support in the scm.  Do we want to support 
symlinks?  I think we do.  Hard links?  Good question.  Device nodes?  Hmm.
If we support all of the above, then what we have is more general than a 
source code versioning system, it's actually a versioning filesystem.  That's 
something to think about.  However, right now I'll be satisfied aiming at 
something with more modest goals.

--
Daniel