[Prophesy] Re: Improved string transformation

Fri May 31 20:05:26 EST 2002

On Friday 31 May 2002 10:28, Rasmus Andersen wrote:
> On Fri, May 31, 2002 at 10:00:02AM +0200, Daniel Phillips wrote:
> > Today I added support for the 'move' operation to the string transforma, 
> > roughly doubling the size of the state network, leading me to reflect on how 
> > much a slight irregularity in an encoding scheme can bloat up an 
> > implementation.  Oh well, it's still reasonably tight and efficient, and it 
> > is not going to grow more any time in the near future, except to add more 
> > error checking in the transinfo function.
> 
> Hi Daniel.
> 
> Trying to reply to you is like hitting a moving target :) This is a
> shoret one; I'm otherwise occupied. BTW: I'm on the prophesy list;
> no need to cc me seperately.
> 
> It seems to me that we (you) are attacking some of the lower parts of
> an SCM before looking at the higher level ones.

Oh yes, very much so.  I like to do that, just to help get my mind wrapped 
around the problem.  It's especially nice when you can see a little part of 
the problem that breaks out and doesn't depend a lot on the high level 
design.  It's like memcpy, you don't have to think too much about the details 
of the applications are going to use it, just make it go as fast as possible 
and have as simple a form as possible.

> I dont think that
> would lead you current efforts to be wasted but sometimes I feel
> comfortable having thought roughly about things before doing them.

You gotta speak up ;-)

> Of course, that often also leads me to sit on my hands all day.

Judging by your previous work, I'd say that's a slight exaggeration.

> Anyway, some higher level concers I can list off the top of my
> head would be:
> 
> o branches
> o merges
> o distribution
> o providing usable change overviews and groupings based on dnotify 
>   recorded changes
> 
> This is a terse list and, as I tried to imply, it is probably
> independent of what you are doing now.

Indeed.  I'm mainly thinking about one are that's very important to me, 
personally, and isn't on your list, and that is: editing.  I have this idea 
that the fact you're using a SCM should be nearly totally transparent.  You 
just edit your files and the SCM takes can of making sure that nothing is 
every forgotten.  There are still a few more pieces of that puzzle to put in 
place, but as soon as it gets there, we *already have something useful*.

That said, let me do a little musing on the points you mentioned, which are 
also very important.  I'd like to try to see all of the points you mentioned 
as high level database problems and get a some primitives in place to help us 
think about what we can do at the high level.  So - soon we will have nice 
fast transforms, and we already have the idea that the transforms are applied 
backwards, starting from the current version on disk.  Since the transforms 
are fast, we can get lazy and apply an awful lot of them to do certain 
things, i.e., to make old versions of the code materialize quickly, for 
further editing, or for comparison against other versions.

I'm just going to throw some of my random thoughts on the table.  Don't 
assume any of the following is correct, it's just a starting point for 
discussion.

> o branches

First we should think about tree nodes.  A tree node is any place that we 
have set a checkpoint, that is, a tree state that we can restore.  Between 
any two nodes - and that includes nodes on different branches - we have or 
can compute a delta.  In general, we will use the structure of the tree to 
compute the delta.  A branch is something defined by the user.  It's 
simply a name that gets carried from node to node, as the user sets 
checkpoints, along with an incrementing generation number.  Sometimes the 
tree will fork, starting a new branch (duh).

At one of the tree nodes we will find the current copy of the source code, 
that is, the copy on disk.  It does not have to be at the end of a branch, 
it can be anywhere.  That's why it's important to be able to invert 
transformations quickly.  (Hey, when is somebody going to rise to my 
challenge of stating the algorithm for generating an inverse transform?)  For 
safety's sake we probably want to leave 'cached' copies of tips of branches 
somewhere on disk or in the database, so that we don't have to completely 
rely on the transform machinery to get us from node to node and back again.  
A related idea is that we sometimes want to have two nodes of the tree 
expressed on disk at the same time.

There are, of course, many different ways of expressing the delta between two 
nodes in terms of particular transformations.  This fact is what makes all of 
this interesting.

> o merges

A merge is the process of applying a subset of transformations that express 
the delta between a pair of nodes to some other node.  We need to apply some 
extra constraints to the transformations involved, for example, instead of 
just skipping text, we normally will want to ensure that the text skipped in 
the original node of the delta is the same as the text skipped in the target 
node.  In order to accomplish that, we may need to alter the transformations 
in various ways.  As with good old patch, we will sometimes need to refer to 
context to decide how to change the transformations.

Merging is related to branching in much the same way that integration is 
related to differentiation.  We may want to borrow some of the same 
techniques that are used for automatic symbolic integration.

And there is a very short - too short - treatment of merging.

> o distribution

I'm having a few glimmers of ideas about what to do there.  I think I 
mentioned one already - each node will be partitioned into regions, each of 
which will have an id which is unique in the universe - more or less 
(incorporating email addresses in the ids makes this come true for practical 
purposes).  Anyway, essentially, we want to do a merge between two branches 
in two separate repositories, so I think what we want to do is first create, 
in the destination repository, clones of the two nodes in the source 
repository that generated the set of transformations we have decided we want 
to send.  Then a normal merge is done in the destination repository.  Easy?  
Original for sure.

Now, the thing about those cross-repository clones is that we don't want to 
actually send the whole tree.   We want only to send the objects that the 
destination repository doesn't already have, and this is where the universal 
object ids come in.  We are not going to necessarily rely on common parentage 
to establish the equivalence of two objects - we will sometimes compare the 
objects, and decide that they are actually the same object.  To speed this up 
over a remote link, we can just compare hashes of objects.  The result of 
such objects deemed to be equivalent is that we will set up a mapping between 
to the two repositories that expresses the equivalence of objects, and we 
will also allow either of the repositories to rename any particular object so 
that it is exactly equivalent.  We can call this process 'melding'.  (Hey, 
time to start writing patents.  Well anyway, let me extract a promise right 
now, that by staying on this list you are making a promise to me to respect 
the confidence of this work until we release it publicly.  We do not want 
certain purveyors of close source software going off and writing patents on 
the work we're doing.  Given all that has happened recently, very little 
would surprise me any more.)

> o providing usable change overviews and groupings based on dnotify 
>   recorded changes

I'm now thinking that dnotify is the wrong model, and what we really want to 
to mount a magic filesystem over the mount point of the directory we want to 
manage.  The magic filesystem will trap all the file changes and call the the 
scm, then call the real filesystem.  Very simple.

As far as change overviews go, I think I'm a long way from even thinking 
about that.  A lot more of the basic ideas have to be in place first.  Having 
a full database around that we can do arbitrary queries on should help quite 
a lot.

If you have some specific ideas, don't be shy...

> If nothing else comes of this mail, take it as a reassurance
> that somebody out here is actually reading your mails :)

Oh, I know you are.  Everybody gets busy.  I wasn't even sure I was going to 
go through with this, but now I think I am - I can see something that already 
works, and in a cool and unique way - not too far off.

-- 
Daniel