[Prophesy] Re: Improved string transformation
Daniel Phillips
phillips at bonn-fries.net
Fri May 31 20:05:26 EST 2002
On Friday 31 May 2002 10:28, Rasmus Andersen wrote:
> On Fri, May 31, 2002 at 10:00:02AM +0200, Daniel Phillips wrote:
> > Today I added support for the 'move' operation to the string transforma,
> > roughly doubling the size of the state network, leading me to reflect on how
> > much a slight irregularity in an encoding scheme can bloat up an
> > implementation. Oh well, it's still reasonably tight and efficient, and it
> > is not going to grow more any time in the near future, except to add more
> > error checking in the transinfo function.
>
> Hi Daniel.
>
> Trying to reply to you is like hitting a moving target :) This is a
> shoret one; I'm otherwise occupied. BTW: I'm on the prophesy list;
> no need to cc me seperately.
>
> It seems to me that we (you) are attacking some of the lower parts of
> an SCM before looking at the higher level ones.
Oh yes, very much so. I like to do that, just to help get my mind wrapped
around the problem. It's especially nice when you can see a little part of
the problem that breaks out and doesn't depend a lot on the high level
design. It's like memcpy, you don't have to think too much about the details
of the applications are going to use it, just make it go as fast as possible
and have as simple a form as possible.
> I dont think that
> would lead you current efforts to be wasted but sometimes I feel
> comfortable having thought roughly about things before doing them.
You gotta speak up ;-)
> Of course, that often also leads me to sit on my hands all day.
Judging by your previous work, I'd say that's a slight exaggeration.
> Anyway, some higher level concers I can list off the top of my
> head would be:
>
> o branches
> o merges
> o distribution
> o providing usable change overviews and groupings based on dnotify
> recorded changes
>
> This is a terse list and, as I tried to imply, it is probably
> independent of what you are doing now.
Indeed. I'm mainly thinking about one are that's very important to me,
personally, and isn't on your list, and that is: editing. I have this idea
that the fact you're using a SCM should be nearly totally transparent. You
just edit your files and the SCM takes can of making sure that nothing is
every forgotten. There are still a few more pieces of that puzzle to put in
place, but as soon as it gets there, we *already have something useful*.
That said, let me do a little musing on the points you mentioned, which are
also very important. I'd like to try to see all of the points you mentioned
as high level database problems and get a some primitives in place to help us
think about what we can do at the high level. So - soon we will have nice
fast transforms, and we already have the idea that the transforms are applied
backwards, starting from the current version on disk. Since the transforms
are fast, we can get lazy and apply an awful lot of them to do certain
things, i.e., to make old versions of the code materialize quickly, for
further editing, or for comparison against other versions.
I'm just going to throw some of my random thoughts on the table. Don't
assume any of the following is correct, it's just a starting point for
discussion.
> o branches
First we should think about tree nodes. A tree node is any place that we
have set a checkpoint, that is, a tree state that we can restore. Between
any two nodes - and that includes nodes on different branches - we have or
can compute a delta. In general, we will use the structure of the tree to
compute the delta. A branch is something defined by the user. It's
simply a name that gets carried from node to node, as the user sets
checkpoints, along with an incrementing generation number. Sometimes the
tree will fork, starting a new branch (duh).
At one of the tree nodes we will find the current copy of the source code,
that is, the copy on disk. It does not have to be at the end of a branch,
it can be anywhere. That's why it's important to be able to invert
transformations quickly. (Hey, when is somebody going to rise to my
challenge of stating the algorithm for generating an inverse transform?) For
safety's sake we probably want to leave 'cached' copies of tips of branches
somewhere on disk or in the database, so that we don't have to completely
rely on the transform machinery to get us from node to node and back again.
A related idea is that we sometimes want to have two nodes of the tree
expressed on disk at the same time.
There are, of course, many different ways of expressing the delta between two
nodes in terms of particular transformations. This fact is what makes all of
this interesting.
> o merges
A merge is the process of applying a subset of transformations that express
the delta between a pair of nodes to some other node. We need to apply some
extra constraints to the transformations involved, for example, instead of
just skipping text, we normally will want to ensure that the text skipped in
the original node of the delta is the same as the text skipped in the target
node. In order to accomplish that, we may need to alter the transformations
in various ways. As with good old patch, we will sometimes need to refer to
context to decide how to change the transformations.
Merging is related to branching in much the same way that integration is
related to differentiation. We may want to borrow some of the same
techniques that are used for automatic symbolic integration.
And there is a very short - too short - treatment of merging.
> o distribution
I'm having a few glimmers of ideas about what to do there. I think I
mentioned one already - each node will be partitioned into regions, each of
which will have an id which is unique in the universe - more or less
(incorporating email addresses in the ids makes this come true for practical
purposes). Anyway, essentially, we want to do a merge between two branches
in two separate repositories, so I think what we want to do is first create,
in the destination repository, clones of the two nodes in the source
repository that generated the set of transformations we have decided we want
to send. Then a normal merge is done in the destination repository. Easy?
Original for sure.
Now, the thing about those cross-repository clones is that we don't want to
actually send the whole tree. We want only to send the objects that the
destination repository doesn't already have, and this is where the universal
object ids come in. We are not going to necessarily rely on common parentage
to establish the equivalence of two objects - we will sometimes compare the
objects, and decide that they are actually the same object. To speed this up
over a remote link, we can just compare hashes of objects. The result of
such objects deemed to be equivalent is that we will set up a mapping between
to the two repositories that expresses the equivalence of objects, and we
will also allow either of the repositories to rename any particular object so
that it is exactly equivalent. We can call this process 'melding'. (Hey,
time to start writing patents. Well anyway, let me extract a promise right
now, that by staying on this list you are making a promise to me to respect
the confidence of this work until we release it publicly. We do not want
certain purveyors of close source software going off and writing patents on
the work we're doing. Given all that has happened recently, very little
would surprise me any more.)
> o providing usable change overviews and groupings based on dnotify
> recorded changes
I'm now thinking that dnotify is the wrong model, and what we really want to
to mount a magic filesystem over the mount point of the directory we want to
manage. The magic filesystem will trap all the file changes and call the the
scm, then call the real filesystem. Very simple.
As far as change overviews go, I think I'm a long way from even thinking
about that. A lot more of the basic ideas have to be in place first. Having
a full database around that we can do arbitrary queries on should help quite
a lot.
If you have some specific ideas, don't be shy...
> If nothing else comes of this mail, take it as a reassurance
> that somebody out here is actually reading your mails :)
Oh, I know you are. Everybody gets busy. I wasn't even sure I was going to
go through with this, but now I think I am - I can see something that already
works, and in a cool and unique way - not too far off.
--
Daniel
More information about the Prophesy
mailing list