[Prophesy] Re: Improved string transformation
Daniel Phillips
phillips at bonn-fries.net
Mon Jun 3 17:44:01 EST 2002
On Monday 03 June 2002 08:58, Rasmus Andersen wrote:
> On Sun, Jun 02, 2002 at 01:58:38AM +0200, Daniel Phillips wrote:
> > > And I think
> > > that this is one of the cardinal weak points in CVS, and thusly
> > > one where we should aim for being strong.
> > >
> > > But I have no good ideas on how to handle this and still get
> > > transparency.
> >
> > Ah, I see what you're getting at. OK, we are not going to rely *only* on the
> > transparent editing interface, but on other means of feeding the scm as well,
> > to supply the other needed information, which will be kept with the deltas in
> > the database. I envision a graphical user interface running while the scm is
> > running, which gives a view of the source tree and lets you walk over it to
> > add comments, tags etc. Of course we will provide a command shell way of
> > doing these things as well.
>
> I think my point is that at the lowest level of this, at the FS level,
> can we attach meaning (and comments) to (FS) operations? Some people
> obsessively save their buffers after each little edit and it would
> seem that a revision history reflecting this would not be very helpful.
>
> On the other hand, if the SCM merely uses the FS operations to gather
> knowledge about changed objects[1], then the user would still have to
> do an explicit 'commit' to make a delta(?) and attach comments. Which
> isn't that far from what you would do anyway without the magic FS.
>
> Or am I missing something?

No, good point, and it's one I've thought about; I just neglected to say
anything about it. My thinking is that the scm will normally save a
transform against a file every time the file is written to, for whatever
reason, but when you commit, those transforms are collapsed into a single
transform. So until you do the commit, you have file-level undo, if you want
it. It's just easy to provide this, so why not? We can also provide an
option to leave those transforms un-composed in the database, which will eat
a lot of space and probably be useless, but it might be interesting to
somebody, and it's easy to do, so again, why not?
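
To make the save-on-write, collapse-on-commit idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the op format, the names make_transform, apply_transform and collapse, and the use of Python's difflib to compute differences. In particular, collapse() takes the simple route of replaying the pending transforms and re-diffing against the base, rather than composing the transforms algebraically.

```python
from difflib import SequenceMatcher


def make_transform(old, new):
    """Return a list of ops that rebuilds `new` from `old` (hypothetical format)."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse old[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("insert", new[j1:j2]))  # literal new text
        # "delete" needs no op: the old span is simply never copied
    return ops


def apply_transform(old, ops):
    """Rebuild the new string from `old` and a transform."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(old[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


def collapse(base, transforms):
    """Collapse a chain of per-write transforms into a single transform.

    Assumption: replay the chain, then diff the base against the result.
    """
    current = base
    for transform in transforms:
        current = apply_transform(current, transform)
    return make_transform(base, current)


# Two saves before a commit: each write records one transform.
base = "int x;\n"
after_first_save = "int x = 1;\n"
after_second_save = "int x = 1;\nint y;\n"
pending = [
    make_transform(base, after_first_save),
    make_transform(after_first_save, after_second_save),
]

# File-level undo before the commit: replay only the first transform.
assert apply_transform(base, pending[0]) == after_first_save

# At commit time the pending transforms become one.
committed = collapse(base, pending)
assert apply_transform(base, committed) == after_second_save
```

Keeping the pending list around instead of discarding it after collapse() would correspond to the "leave them un-composed" option.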

> [1] We haven't discussed the basic object in the SCM. Is it a file?
> A function? A line (of code)?

No, we haven't talked about it much, and it's getting to where we need to do that.

> I could see some nice things coming
> from having smaller granularity than the file one, but since we
> are aiming at having 'loose' dependencies in the SCM I think we
> will get those anyway.

The basic data object will be a transform, according to my current thinking,
though other database entities will no doubt emerge as we go. A transform
expresses the difference between two strings, and we have not said yet whether
the strings are whole files or something else. Clearly, a single transform
cannot be larger than a file, but is it useful for it to be smaller? From a
pure data storage point of view, no, that doesn't gain a lot, because if we
want that, we can still express it with a single transform, and then have a
list of regions in the transform that are of special interest, rather than
having separate transforms. However, in the process of doing some of the
kinds of calculus that I expect we will want to do, I think we will want to
generate transforms, on the fly, that are smaller than files, i.e., partition
transforms into regions that reflect, say, the boundaries of a patch that we
are trying to merge.
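
As a rough illustration of partitioning a whole-file transform on the fly, the Python sketch below stores a transform as a sorted list of (start, end, replacement) edits against the old bytes and splits it at an arbitrary byte offset. The edit format and the names make_edits, apply_edits and split_at are assumptions made up for this example, not a settled design. It works on bytes rather than lines, so the boundary can fall anywhere.

```python
from difflib import SequenceMatcher


def make_edits(old, new):
    """Transform as sorted, non-overlapping (start, end, replacement) edits
    against `old`. Hypothetical format, computed here with difflib."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return [
        (i1, i2, new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_edits(old, edits):
    """Apply edits to `old`, copying the untouched bytes between them."""
    out, pos = [], 0
    for start, end, replacement in edits:
        out.append(old[pos:start])
        out.append(replacement)
        pos = end
    out.append(old[pos:])
    return b"".join(out)


def split_at(edits, k):
    """Partition a transform at old-offset k.

    Returns (left, right): `left` applies to old[:k], `right` applies to
    old[k:] with its offsets rebased to start at zero.
    """
    left, right = [], []
    for start, end, replacement in edits:
        if end <= k:
            left.append((start, end, replacement))
        elif start >= k:
            right.append((start - k, end - k, replacement))
        else:
            # The edit straddles the boundary: the left part keeps the
            # replacement, the right part just deletes the remainder.
            left.append((start, k, replacement))
            right.append((0, end - k, b""))
    return left, right


old = b"\x00\x01header\xffbody-one\xffbody-two"
new = b"\x00\x02header\xffbody-1\xffbody-two\xfftail"
edits = make_edits(old, new)

# Split at an arbitrary region boundary; the two halves still add up.
k = 9
left, right = split_at(edits, k)
assert apply_edits(old[:k], left) + apply_edits(old[k:], right) == new
```

The same split could be repeated at several offsets to carve out just the region a patch touches.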

I'm personally not a great fan of line boundaries, as I believe they reduce
generality. However, we need to deal with them at times, especially when
interfacing to diff. They're likely to figure in our merge algorithms as
well, since they tend to be conceptually significant from the user's point
of view. But as far as letting them invade the data design - there's no
need, and by being strict about that, the end result will be much more useful
for handling binary files as well.

A practical question is whether we're going to version directories. I
mentioned the idea that each file object would have an id (which is
universally unique) and the name of the file would be metadata associated
with the object (i.e., an attribute of the object). However, we will need to
look up files rapidly by name, for example, when a file is changed and a
transform needs to be recorded against it in the database. This can of
course be handled efficiently by appropriate use of database indexing.
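
One way the id-plus-name-attribute scheme with an index could look, sketched in Python with an in-memory SQLite database. The schema, the table and column names, and the helpers add_file, file_id_for, record_transform, rename and history are all assumptions for illustration; nothing here has been decided for the real database.

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.executescript("""
    -- The id is the file's identity; the name is just an attribute.
    CREATE TABLE files (id TEXT PRIMARY KEY, name TEXT NOT NULL);
    -- Index for fast lookup by name when a write comes in.
    CREATE UNIQUE INDEX files_by_name ON files (name);
    CREATE TABLE transforms (
        file_id TEXT NOT NULL REFERENCES files (id),
        seq     INTEGER NOT NULL,
        data    BLOB NOT NULL
    );
""")


def add_file(name):
    """Create a file object with a fresh universally unique id."""
    file_id = str(uuid.uuid4())
    db.execute("INSERT INTO files (id, name) VALUES (?, ?)", (file_id, name))
    return file_id


def file_id_for(name):
    """Look up a file object by its current name (uses the index)."""
    row = db.execute("SELECT id FROM files WHERE name = ?", (name,)).fetchone()
    if row is None:
        raise KeyError(name)
    return row[0]


def record_transform(name, data):
    """Record a transform against whichever object currently has `name`."""
    file_id = file_id_for(name)
    seq = db.execute(
        "SELECT COUNT(*) FROM transforms WHERE file_id = ?", (file_id,)
    ).fetchone()[0]
    db.execute(
        "INSERT INTO transforms (file_id, seq, data) VALUES (?, ?, ?)",
        (file_id, seq, data),
    )


def rename(old_name, new_name):
    """A rename only changes the name attribute; the id stays put."""
    db.execute("UPDATE files SET name = ? WHERE name = ?", (new_name, old_name))


def history(name):
    """All transforms recorded against the object, oldest first."""
    rows = db.execute(
        "SELECT data FROM transforms WHERE file_id = ? ORDER BY seq",
        (file_id_for(name),),
    )
    return [row[0] for row in rows]
```

Because transforms hang off the id rather than the name, a file's history follows it across a rename without any copying.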

We may sometimes want to traverse the database in directory order, perhaps
when producing a diff between two tree versions. Does this mean we want to
record directories as objects? I don't know yet. It may be enough just to
compute the directories on the fly.
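
For a sense of what computing directories on the fly might involve, here is a small Python sketch that derives directory order and directory listings purely from a flat set of file names, with no directory objects stored anywhere. The function names in_directory_order and list_directory, and the convention of marking subdirectories with a trailing slash, are assumptions for this example.

```python
from pathlib import PurePosixPath


def in_directory_order(names):
    """Sort flat path names so files in the same directory stay together."""
    return sorted(names, key=lambda name: PurePosixPath(name).parts)


def list_directory(names, directory=""):
    """Immediate children of `directory`, derived from the flat names.

    Subdirectories are reported once, with a trailing slash.
    """
    prefix = PurePosixPath(directory).parts  # () for the tree root
    children = set()
    for name in names:
        parts = PurePosixPath(name).parts
        if len(parts) > len(prefix) and parts[:len(prefix)] == prefix:
            child = parts[len(prefix)]
            is_directory = len(parts) > len(prefix) + 1
            children.add(child + "/" if is_directory else child)
    return sorted(children)


names = ["src/main.c", "README", "src/lib/util.c", "src/a.c"]
for name in in_directory_order(names):
    print(name)
print(list_directory(names, "src"))
```

This scans every name on each call, which is fine for a sketch; whether that is fast enough for large trees is exactly the open question about recording directories as objects.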

Drifting further in that direction, the question arises of how much
filesystem structure we want to support in the scm. Do we want to support
symlinks? I think we do. Hard links? Good question. Device nodes? Hmm.
If we support all of the above, then what we have is more general than a
source code versioning system, it's actually a versioning filesystem. That's
something to think about. However, right now I'll be satisfied aiming at
something with more modest goals.
--
Daniel