From phillips at bonn-fries.net Sun Jun 2 09:58:38 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Sun, 2 Jun 2002 01:58:38 +0200
Subject: [Prophesy] Re: Improved string transformation
In-Reply-To: <20020531135941.D4082@jaquet.dk>
References: <20020531135941.D4082@jaquet.dk>
Message-ID: 

On Friday 31 May 2002 13:59, Rasmus Andersen wrote:
> Like with dnotify, I think that the grouping and manageability of changes coming through a magic FS is going to suffer.

Sorry, I must have missed your reasoning about this, could you please elaborate?

> And I think that this is one of the cardinal weak points in CVS, and thus one where we should aim for being strong.
>
> But I have no good ideas on how to handle this and still get transparency.

Ah, I see what you're getting at. OK, we are not going to rely *only* on the transparent editing interface, but on other means of feeding the scm as well, to supply the other needed information, which will be kept with the deltas in the database. I envision a graphical user interface running while the scm is running, which gives a view of the source tree and lets you walk over it to add comments, tags etc. Of course we will provide a command shell way of doing these things as well.

While we're on that topic, we want to make the SCM an embeddable object, so that both the gui and the command interface simply invoke the scm methods. I guess we can rely on Python to handle that aspect for us, and so not get stuck in some sticky tarpit like Corba, or COM, or building our own object embedding protocol.

Speaking of gui, I think Glade should be the tool, the only realistic alternative being QT/KDE, and while I do like the latter a lot in terms of sheer usability, I also like the faster startup and lower resource usage of GTK. Furthermore I'm familiar with Glade, and I like the idea of being able to separate out the interface definition into an XML object.

So, now I'm going to take a quick look at how Glade and Python play together.

-- 
Daniel

p.s., I prefer being cc'd on replies to the list; that way a copy shows up in my inbox, which is more convenient than checking all the mailing lists I'm subscribed to.

From phillips at bonn-fries.net Sun Jun 2 16:23:53 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Sun, 2 Jun 2002 08:23:53 +0200
Subject: [Prophesy] Thoughts on lineage and derivation
Message-ID: 

I've got a few thoughts on database structure that I've been meaning to jot down, so here goes.

First, I'd like to start using the term 'version' where previously I've been saying 'node'. It's a lot more descriptive of what we really have in the database. I'll just say 'node' when I'm talking about graph theory in general.

Speaking of graph theory, I realized that what we have in the database isn't a tree of versions at all, it's an arbitrary connected graph. It's quite a stretch to find strict trees in the real world of code development - cross-pollination makes short work of that misapprehension. The only thing that makes it look somewhat like a tree is genealogy, and see the above remark on cross-pollination. We could say at least that it's a non-cyclic graph because time only goes in one direction, but even that gets confused sometimes. Just try importing some old code and see if time always goes forward or not. So let's design everything based on no presumption of strict graph structure.

One thing that does impose a little order on the situation is that there is only one order in which changes are applied to the database. That's a simple matter of incrementing a change number every time a change is applied. We won't rely on that for much more than auditing and reporting though, since it's too restrictive. Just for fun, we'll allow changes to be applied to any version in the tree, and yes, that can create various sorts of inconsistencies, but instead of denying that such things can happen, we'll just record the fact that those inconsistencies exist in the database, and somebody can attempt to clean them up later. We do not necessarily have to forget about the good old consistent version at the affected point in the database, and arguably we should never forget a version that's an 'interior' version anyway. (An 'interior' version is one from which at least one later version was derived.)

For that matter, it's a mistake to think of derivation along a single line, or even a single tree. In fact, there are many objects that make up each version, and any of them can show lineage and be derived from, not even necessarily in the same version. So lineage and inheritance are a lot more complex than they seem at first glance. What's going to save us from getting confused are the object ids. For any given object, typically a single source file, we will be able to trace exact lineage and derivations from it, and those will form a strict tree. (Um, unless we allow objects to be made up of other objects, which I think we do.)

Notice how using an object id as a handle for a file object neatly answers the question of how to handle renames. The name (complete with path) is just an attribute of the file object, and can change from version to version, just as the file text can.
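To make that concrete, a per-object record might hold something like the sketch below. The field names are invented purely for illustration - this is not a settled schema.

	/* Hypothetical record for one file object in one version. */
	struct file_object {
		unsigned long long oid;  /* universally unique object id */
		unsigned version;        /* the version this instance belongs to */
		unsigned parent;         /* the version this instance derives from */
		const char *name;        /* full path: just an attribute, so a rename
		                            is an ordinary attribute change */
	};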
-- 
Daniel

From phillips at bonn-fries.net Sun Jun 2 17:50:02 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Sun, 2 Jun 2002 09:50:02 +0200
Subject: [Prophesy] Diff to transform conversion
Message-ID: 

Here's the algorithm for diff to transform conversion. We will not try to generate the 'move' operation for the time being.

For convenience, the 'emit' operation described below will accumulate and merge sequences of the same operation into a single operation. This just requires remembering what the last operation was, and how long it is (the start is just a negative offset from the current input position). When the operation being emitted is not the same as the last operation emitted, the appropriate operation is appended to the operation string. Finally, after processing the entire diff, a last copy is emitted. Zero length operations are discarded.

The basic idea is to process the input text sequentially. We will keep track of both the current input line number and byte position. We don't have to look at the output text at all, except that we may wish to actually apply the patch to ensure that the result of applying the patch or the transform is the same.

Below, when we copy or skip a line, or emit a line of text, we also account for the trailing end-of-line. Strange things happen if there is no end-of-line at the end of the input, output or diff file. Worry about that later.

The algorithm proper:

  Find the beginning of a patch. The pattern is:

    "---"
    "+++"

  While the next text is "@@" (beginning of chunk):

    Get the input line number and count, and the output line number and count from the chunk header line. Ensure the line numbers are monotonically increasing. (The output line and count are not used in the algorithm below, but could be used for error checking.)

    Emit a copy from the current input position to the chunk's input line number, and advance the input position to the chunk's input line number.

    For each line of the chunk, until the current input line equals the chunk's input line number plus the chunk's input line count:

      If the line begins with '-', emit a skip as long as the line
      If the line begins with '+', emit a text as long as the line
      If the line begins with ' ', emit a copy as long as the line

  Finally, emit a copy from the current input position to the end of the input text, and flush it to the operation string.
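Reduced to code, one chunk of that loop might look roughly like the following self-contained toy. The opcode names are invented, and the 'operation string' is simply printed as op/length pairs rather than packed into a real binary encoding:

	#include <stdio.h>
	#include <string.h>

	enum { NONE, COPY, SKIP, TEXT };
	static const char *opname[] = { "none", "copy", "skip", "text" };
	static int last = NONE;
	static unsigned run;

	/* Accumulate runs of the same operation; flush when the op changes. */
	static void emit(int op, unsigned len)
	{
		if (op != last) {
			if (run) /* zero length operations are discarded */
				printf("%s %u\n", opname[last], run);
			last = op;
			run = 0;
		}
		run += len;
	}

	int main(void)
	{
		const char *input[] = { "line one", "line two", "unchanged line",
			"old line", "another unchanged line" };
		const char *diff[] = { /* one chunk, header first */
			"@@ -3,3 +3,3 @@",
			" unchanged line",
			"-old line",
			"+new line",
			" another unchanged line" };
		unsigned in_line, in_count, out_line, out_count, line, bytes = 0;

		sscanf(diff[0], "@@ -%u,%u +%u,%u @@",
		       &in_line, &in_count, &out_line, &out_count);
		/* copy from the current input position to the chunk's input line */
		for (line = 1; line < in_line; line++)
			bytes += strlen(input[line - 1]) + 1; /* +1 for end-of-line */
		emit(COPY, bytes);
		for (int i = 1; i <= 4; i++) {
			/* strlen counts the tag character, which happens to stand
			   in exactly for the trailing end-of-line we account for */
			unsigned len = strlen(diff[i]);
			switch (diff[i][0]) {
			case ' ': emit(COPY, len); break;
			case '-': emit(SKIP, len); break;
			case '+': emit(TEXT, len); break;
			}
		}
		/* the final copy to the end of the input is zero bytes here, so
		   it is dropped; emitting a different op flushes the pending run */
		emit(NONE, 0);
		return 0;
	}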
I think I'll try coding this in C, with the help of the regex library, though I know it would be easier in Python, and Rasmus has already written some nice regexes in Python for handling diffs. However, the transform applying code is already in C, so the diff parsing code might as well be too. Of course it means that another job coming up very soon is: figuring out how to interface Python to C functions.

-- 
Daniel

From phillips at bonn-fries.net Sun Jun 2 21:14:43 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Sun, 2 Jun 2002 13:14:43 +0200
Subject: [Prophesy] Diff to transform conversion
In-Reply-To: 
References: 
Message-ID: 

On Sunday 02 June 2002 09:50, I wrote:
> I think I'll try coding this in C, with the help of the regex library...

The regex library was a big fat disappointment:

  - Cannot apply a regex across more than one line.
  - Only matches zero-terminated strings.

The latter restriction means it's no good for matching against part of a string. Come on guys, I thought Unix was designed by computer scientists, not schoolchildren. OK, next step is to just hand code it. If you want a job done properly...

-- 
Daniel

From rasmus at jaquet.dk Mon Jun 3 16:58:20 2002
From: rasmus at jaquet.dk (Rasmus Andersen)
Date: Mon, 3 Jun 2002 08:58:20 +0200
Subject: [Prophesy] Re: Improved string transformation
In-Reply-To: ; from phillips@bonn-fries.net on Sun, Jun 02, 2002 at 01:58:38AM +0200
References: <20020531135941.D4082@jaquet.dk>
Message-ID: <20020603085819.A22496@jaquet.dk>

On Sun, Jun 02, 2002 at 01:58:38AM +0200, Daniel Phillips wrote:
> > And I think that this is one of the cardinal weak points in CVS, and thus one where we should aim for being strong.
> >
> > But I have no good ideas on how to handle this and still get transparency.
>
> Ah, I see what you're getting at. OK, we are not going to rely *only* on the transparent editing interface, but on other means of feeding the scm as well, to supply the other needed information, which will be kept with the deltas in the database. I envision a graphical user interface running while the scm is running, which gives a view of the source tree and lets you walk over it to add comments, tags etc. Of course we will provide a command shell way of doing these things as well.

I think my point is that at the lowest level of this, at the FS level, can we attach meaning (and comments) to (FS) operations? Some people obsessively save their buffers after each little edit and it would seem that a revision history reflecting this would not be very helpful.

On the other hand, if the SCM merely uses the FS operations to gather knowledge about changed objects[1], then the user would still have to do an explicit 'commit' to make a delta(?) and attach comments. Which isn't that far from what you would do anyway without the magic FS.

Or am I missing something?
>
> While we're on that topic, we want to make the SCM an embeddable object, so that both the gui and the command interface simply invoke the scm methods. I guess we can rely on Python to handle that aspect for us, and so not get stuck in some sticky tarpit like Corba, or COM, or building our own object embedding protocol.

I agree strongly on the embeddable part. And Python plays nicely with embedded C. And the other way around (being embedded) too.

>
> Speaking of gui, I think Glade should be the tool, the only realistic alternative being QT/KDE, and while I do like the latter a lot in terms of sheer usability, I also like the faster startup and lower resource usage of GTK. Furthermore I'm familiar with Glade, and I like the idea of being able to separate out the interface definition into an XML object.
>
> So, now I'm going to take a quick look at how Glade and Python play together.

I am a GUI newbie so if you have experience, you lead the way.

[1] We haven't discussed the basic object in the SCM. Is it a file? A function? A line (of code)? I could see some nice things coming from having smaller granularity than the file one, but since we are aiming at having 'loose' dependencies in the SCM I think we will get those anyway.

Rasmus

From phillips at bonn-fries.net Mon Jun 3 17:44:01 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Mon, 3 Jun 2002 09:44:01 +0200
Subject: [Prophesy] Re: Improved string transformation
In-Reply-To: <20020603085819.A22496@jaquet.dk>
References: <20020603085819.A22496@jaquet.dk>
Message-ID: 

On Monday 03 June 2002 08:58, Rasmus Andersen wrote:
> On Sun, Jun 02, 2002 at 01:58:38AM +0200, Daniel Phillips wrote:
> > > And I think that this is one of the cardinal weak points in CVS, and thus one where we should aim for being strong.
> > >
> > > But I have no good ideas on how to handle this and still get transparency.
> >
> > Ah, I see what you're getting at. OK, we are not going to rely *only* on the transparent editing interface, but on other means of feeding the scm as well, to supply the other needed information, which will be kept with the deltas in the database. I envision a graphical user interface running while the scm is running, which gives a view of the source tree and lets you walk over it to add comments, tags etc. Of course we will provide a command shell way of doing these things as well.
>
> I think my point is that at the lowest level of this, at the FS level, can we attach meaning (and comments) to (FS) operations? Some people obsessively save their buffers after each little edit and it would seem that a revision history reflecting this would not be very helpful.
>
> On the other hand, if the SCM merely uses the FS operations to gather knowledge about changed objects[1], then the user would still have to do an explicit 'commit' to make a delta(?) and attach comments. Which isn't that far from what you would do anyway without the magic FS.
>
> Or am I missing something?

No, good point, and it's one I've thought about, I just neglected to say anything about it. My thinking is that the scm will normally save a transform against a file every time the file is written to for whatever reason, but when you commit, those transforms are collapsed into a single transform. So until you do the commit, you have file-level undo, if you want it. It's just easy to provide this, so why not?
We can also provide an option to leave those transforms un-composed in the database, which will eat a lot of space and probably be useless, but it might be interesting to somebody, and it's easy to do, so again, why not?

> [1] We haven't discussed the basic object in the SCM. Is it a file? A function? A line (of code)?

No, we haven't talked about it much, and it's getting to where we need to do that.

> I could see some nice things coming from having smaller granularity than the file one, but since we are aiming at having 'loose' dependencies in the SCM I think we will get those anyway.

The basic data object will be a transform, according to my current thinking, though other database entities will no doubt emerge as we go. A transform expresses the difference between two strings, and we have not said yet whether the strings are whole files or something else. Clearly, a single transform cannot be larger than a file, but is it useful for it to be smaller? From a pure data storage point of view, no, that doesn't gain a lot, because if we want that, we can still express it with a single transform, and then have a list of regions in the transform that are of special interest, rather than having separate transforms. However, in the process of doing some of the kinds of calculus that I expect we will want to do, I think we will want to generate transforms, on the fly, that are smaller than files, i.e., partition transforms into regions that reflect, say, the boundaries of a patch that we are trying to merge.

I'm personally not a great fan of line boundaries, as I believe they reduce generality. However, we need to deal with them at times, especially when interfacing to diff. They're likely to figure in our merge algorithms as well, since they tend to be conceptually significant from the user's point of view. But as far as letting them invade the data design - there's no need, and by being strict about that, the end result will be much more useful for handling binary files as well.

A practical question is whether we're going to version directories. I mentioned the idea that each file object would have an id (which is universally unique) and the name of the file would be metadata associated with the object (i.e., an attribute of the object). However, we will need to look up files rapidly by name, for example, when a file is changed and a transform needs to be recorded against it in the database. This can of course be handled efficiently by appropriate use of database indexing.

We may sometimes want to traverse the database in directory order, perhaps when producing a diff between two tree versions. Does this mean we want to record directories as objects? I don't know yet. It may be enough just to compute the directories on the fly.

Drifting further in that direction, the question arises of how much filesystem structure we want to support in the scm. Do we want to support symlinks? I think we do. Hard links? Good question. Device nodes? Hmm. If we support all of the above, then what we have is more general than a source code versioning system, it's actually a versioning filesystem. That's something to think about. However, right now I'll be satisfied aiming at something with more modest goals.
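Coming back to transforms for a moment, here is a toy model of the collapsing mentioned earlier in this thread - composing two transforms into one with the same effect. To be clear, this is an illustration, not the project's code: the real transform.c encodes operations as a byte string, while this sketch uses an array of copy/skip/text ops with invented names, and it assumes the second transform's input length equals the first one's output length.

	#include <assert.h>
	#include <stdio.h>
	#include <string.h>

	enum optype { COPY, SKIP, TEXT };
	struct op { enum optype type; unsigned len; const char *text; };

	/* Apply a transform to 'in', writing to 'out'; returns output length. */
	static unsigned apply(const struct op *t, int n, const char *in, char *out)
	{
		unsigned o = 0;
		for (int i = 0; i < n; i++, t++)
			switch (t->type) {
			case COPY: memcpy(out + o, in, t->len); o += t->len; in += t->len; break;
			case SKIP: in += t->len; break;
			case TEXT: memcpy(out + o, t->text, t->len); o += t->len; break;
			}
		return o;
	}

	/* Build c so that applying c equals applying a, then b.  b consumes the
	   output of a, so walk b's ops, slicing a's ops to match.  Adjacent ops
	   of the same type are not merged here, for brevity. */
	static int compose(const struct op *a, int na, const struct op *b, int nb,
			   struct op *out)
	{
		int n = 0, ai = 0;
		unsigned used = 0;            /* bytes of a[ai] already consumed */

		for (int bi = 0; bi < nb; bi++) {
			if (b[bi].type == TEXT) { /* insertions pass straight through */
				out[n++] = b[bi];
				continue;
			}
			unsigned need = b[bi].len;
			while (need) {
				while (a[ai].type == SKIP) /* no output; keep them */
					out[n++] = a[ai++];
				unsigned take = a[ai].len - used;
				if (take > need)
					take = need;
				if (b[bi].type == COPY)    /* keep bytes: slice a's op */
					out[n++] = (struct op){ a[ai].type, take,
						a[ai].type == TEXT ? a[ai].text + used : NULL };
				else if (a[ai].type == COPY)
					out[n++] = (struct op){ SKIP, take, NULL };
				/* else b skips bytes that a inserted: they just vanish */
				used += take;
				need -= take;
				if (used == a[ai].len) {
					ai++;
					used = 0;
				}
			}
		}
		while (ai < na && a[ai].type == SKIP)  /* trailing skips */
			out[n++] = a[ai++];
		return n;
	}

	int main(void)
	{
		const char base[] = "the quick brown fox";
		/* a: "the quick brown fox" -> "the quick red fox" */
		struct op a[] = {{ COPY, 10, 0 }, { SKIP, 6, 0 }, { TEXT, 4, "red " }, { COPY, 3, 0 }};
		/* b: "the quick red fox" -> "a quick red fox" */
		struct op b[] = {{ TEXT, 1, "a" }, { SKIP, 3, 0 }, { COPY, 14, 0 }};
		struct op c[16];
		char t1[64], t2[64], t3[64];

		t1[apply(a, 4, base, t1)] = 0;
		t2[apply(b, 3, t1, t2)] = 0;
		t3[apply(c, compose(a, 4, b, 3, c), base, t3)] = 0;
		printf("%s\n%s\n", t2, t3);   /* both: "a quick red fox" */
		assert(!strcmp(t2, t3));
		return 0;
	}

Applying a and then b gives the same text as applying the composed transform directly to the base string, which is exactly the property commit-time collapse needs.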
-- 
Daniel

From rasmus at jaquet.dk Mon Jun 3 18:06:38 2002
From: rasmus at jaquet.dk (Rasmus Andersen)
Date: Mon, 3 Jun 2002 10:06:38 +0200
Subject: [Prophesy] Re: Improved string transformation
In-Reply-To: ; from phillips@bonn-fries.net on Mon, Jun 03, 2002 at 09:44:01AM +0200
References: <20020603085819.A22496@jaquet.dk>
Message-ID: <20020603100638.A22744@jaquet.dk>

On Mon, Jun 03, 2002 at 09:44:01AM +0200, Daniel Phillips wrote:
> > On the other hand, if the SCM merely uses the FS operations to gather knowledge about changed objects[1], then the user would still have to do an explicit 'commit' to make a delta(?) and attach comments. Which isn't that far from what you would do anyway without the magic FS.
> >
> > Or am I missing something?
>
> No, good point, and it's one I've thought about, I just neglected to say anything about it. My thinking is that the scm will normally save a transform against a file every time the file is written to for whatever reason, but when you commit, those transforms are collapsed into a single transform. So until you do the commit, you have file-level undo, if you want it. It's just easy to provide this, so why not? We can also provide an option to leave those transforms un-composed in the database, which will eat a lot of space and probably be useless, but it might be interesting to somebody, and it's easy to do, so again, why not?

OK then. This would seem to be a reasonable middle ground. If it wasn't for you having some FS experience already, I would probably think the magic FS way too complex for what it is buying us/the user.

Can we do this without a kernel patch? A kernel patch may be a bit too much for many who just want to dip their toes.

> > I could see some nice things coming from having smaller granularity than the file one, but since we are aiming at having 'loose' dependencies in the SCM I think we will get those anyway.
>
> [snip things about having files as basic versioning object]
>
> A practical question is whether we're going to version directories. I mentioned the idea that each file object would have an id (which is universally unique) and the name of the file would be metadata associated with the object (i.e., an attribute of the object). However, we will need to look up files rapidly by name, for example, when a file is changed and a transform needs to be recorded against it in the database. This can of course be handled efficiently by appropriate use of database indexing.
>
> We may sometimes want to traverse the database in directory order, perhaps when producing a diff between two tree versions. Does this mean we want to record directories as objects? I don't know yet. It may be enough just to compute the directories on the fly.

Another related thing is, how do we group changes to achieve logically connected changes, aka changesets in BK terminology? I guess that would be by explicit operations in the GUI/command line thingie operating on deltas?

> Drifting further in that direction, the question arises of how much filesystem structure we want to support in the scm. Do we want to support symlinks? I think we do. Hard links? Good question. Device nodes? Hmm. If we support all of the above, then what we have is more general than a source code versioning system, it's actually a versioning filesystem. That's something to think about. However, right now I'll be satisfied aiming at something with more modest goals.
Rik van Riel and Larry had some thoughts about using magic FSes to do the job a while back... Here we go. It's kinda sketchy but some stuff can be had:

http://search.luky.org/linux-kernel.2001/msg25061.html

Rasmus

From phillips at bonn-fries.net Mon Jun 3 18:55:56 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Mon, 3 Jun 2002 10:55:56 +0200
Subject: [Prophesy] Re: Improved string transformation
In-Reply-To: <20020603100638.A22744@jaquet.dk>
References: <20020603100638.A22744@jaquet.dk>
Message-ID: 

On Monday 03 June 2002 10:06, Rasmus Andersen wrote:
> On Mon, Jun 03, 2002 at 09:44:01AM +0200, Daniel Phillips wrote:
> > > On the other hand, if the SCM merely uses the FS operations to gather knowledge about changed objects[1], then the user would still have to do an explicit 'commit' to make a delta(?) and attach comments. Which isn't that far from what you would do anyway without the magic FS.
> > >
> > > Or am I missing something?
> >
> > No, good point, and it's one I've thought about, I just neglected to say anything about it. My thinking is that the scm will normally save a transform against a file every time the file is written to for whatever reason, but when you commit, those transforms are collapsed into a single transform. So until you do the commit, you have file-level undo, if you want it. It's just easy to provide this, so why not? We can also provide an option to leave those transforms un-composed in the database, which will eat a lot of space and probably be useless, but it might be interesting to somebody, and it's easy to do, so again, why not?
>
> OK then. This would seem to be a reasonable middle ground. If it wasn't for you having some FS experience already, I would probably think the magic FS way too complex for what it is buying us/the user.
>
> Can we do this without a kernel patch? A kernel patch may be a bit too much for many who just want to dip their toes.

By rights, a generic method for accomplishing such a thing should already have been merged, but sadly that isn't the case - or perhaps fortunately, if the official interface would have been less than ideal. In any event, I'm not shy about constructing such a thing if it's needed, and I can assure you it will be elegant and efficient. As far as a kernel patch goes, I think it will only be a module, and that module will be small, since most of the work will be done in user land.

No, we absolutely can't do this without involving the kernel, and no standard mechanism exists in Linux at the moment for doing this. Plan9 has 9P, a network protocol, precisely for such a purpose, however I'd rather bypass the network and do a tight little local interface. If I decide the network interface is really the right way to do it, or just want to be lazy, the uservfs project already exists, and is being maintained, I believe. It isn't in kernel though, and it depends on Coda, which is another whole big piece, so I'm not that enthusiastic about it. I'd rather just define a nice interface that exports the vfs securely and racelessly to user space via the various nice methods we have available. It doesn't have to be particularly general either, to get us going. I consider this a fairly easy project and a chance to get some experience with some of the ipc mechanisms I haven't done a lot with to date, such as signals.
There is another, simpler method, and the one I propose to use for initial testing: simply issue all edit commands and other file manipulations, such as rename, patch etc., from a python shell, which will take care of the needed preserving of data and calls to the scm. This gives us a quick start so we don't have to get bogged down in the details of filesystem exporting, and others who just want to take a test drive might find this method useful as well. There's no question in my mind, though, that the magic filesystem is the best interface.

> > > I could see some nice things coming from having smaller granularity than the file one, but since we are aiming at having 'loose' dependencies in the SCM I think we will get those anyway.
> >
> > [snip things about having files as basic versioning object]
> >
> > A practical question is whether we're going to version directories. I mentioned the idea that each file object would have an id (which is universally unique) and the name of the file would be metadata associated with the object (i.e., an attribute of the object). However, we will need to look up files rapidly by name, for example, when a file is changed and a transform needs to be recorded against it in the database. This can of course be handled efficiently by appropriate use of database indexing.
> >
> > We may sometimes want to traverse the database in directory order, perhaps when producing a diff between two tree versions. Does this mean we want to record directories as objects? I don't know yet. It may be enough just to compute the directories on the fly.
>
> Another related thing is, how do we group changes to achieve logically connected changes, aka changesets in BK terminology? I guess that would be by explicit operations in the GUI/command line thingie operating on deltas?

Right, and I'd like to expose the full power of SQL for this purpose, while also supporting other methods of course, such as remembering the regions affected by imported patch sets, or indeed, remembering enough information to reconstruct each patch set exactly. Let's call that information 'scope', and we want to carry scope information in a precise way in the database. In general, the scopes of changes should not overlap, but when they do, we need to record exactly how. Overlapping scope results either in ordering dependencies, or conflicts. In either case, we need to record just what those dependencies or conflicts are.

> > Drifting further in that direction, the question arises of how much filesystem structure we want to support in the scm. Do we want to support symlinks? I think we do. Hard links? Good question. Device nodes? Hmm. If we support all of the above, then what we have is more general than a source code versioning system, it's actually a versioning filesystem. That's something to think about. However, right now I'll be satisfied aiming at something with more modest goals.
>
> Rik van Riel and Larry had some thoughts about using magic FSes to do the job a while back... Here we go. It's kinda sketchy but some stuff can be had:
>
> http://search.luky.org/linux-kernel.2001/msg25061.html

Yes, there you go. 'Obviously right'. Except I don't want to involve the network, that just doesn't make any sense to me.
-- 
Daniel

From phillips at bonn-fries.net Mon Jun 3 22:09:14 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Mon, 3 Jun 2002 14:09:14 +0200
Subject: [Prophesy] Re: Improved string transformation
In-Reply-To: <20020603085819.A22496@jaquet.dk>
References: <20020603085819.A22496@jaquet.dk>
Message-ID: 

On Monday 03 June 2002 08:58, Rasmus Andersen wrote:
> > Speaking of gui, I think Glade should be the tool, the only realistic alternative being QT/KDE, and while I do like the latter a lot in terms of sheer usability, I also like the faster startup and lower resource usage of GTK. Furthermore I'm familiar with Glade, and I like the idea of being able to separate out the interface definition into an XML object.
> >
> > So, now I'm going to take a quick look at how Glade and Python play together.
>
> I am a GUI newbie so if you have experience, you lead the way.

Here's a tutorial:

http://www.ics.uci.edu/~xge/python-glade/python-glade.html

-- 
Daniel

From phillips at bonn-fries.net Tue Jun 4 00:56:55 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Mon, 3 Jun 2002 16:56:55 +0200
Subject: [Prophesy] Re: Improved string transformation
In-Reply-To: 
References: <20020603085819.A22496@jaquet.dk>
Message-ID: 

On Monday 03 June 2002 14:09, Daniel Phillips wrote:
> On Monday 03 June 2002 08:58, Rasmus Andersen wrote:
> > > Speaking of gui, I think Glade should be the tool, the only realistic alternative being QT/KDE, and while I do like the latter a lot in terms of sheer usability, I also like the faster startup and lower resource usage of GTK. Furthermore I'm familiar with Glade, and I like the idea of being able to separate out the interface definition into an XML object.
> > >
> > > So, now I'm going to take a quick look at how Glade and Python play together.
> >
> > I am a GUI newbie so if you have experience, you lead the way.
>
> Here's a tutorial:
>
> http://www.ics.uci.edu/~xge/python-glade/python-glade.html

But it was a little terse, and oriented towards python 1.5 (I'm using 2.1, for which the postgres database interface of choice is written). Here's a much nicer one:

http://www.icon.co.za/~zapr/Project1.html

In fact, all that's required is 'google python glade'.

-- 
Daniel

From rasmus at jaquet.dk Tue Jun 4 21:53:40 2002
From: rasmus at jaquet.dk (Rasmus Andersen)
Date: Tue, 4 Jun 2002 13:53:40 +0200
Subject: [Prophesy] Re: Improved string transformation
In-Reply-To: ; from phillips@bonn-fries.net on Fri, May 31, 2002 at 12:05:26PM +0200
References: <20020531102806.B3135@jaquet.dk>
Message-ID: <20020604135340.C29724@jaquet.dk>

On Fri, May 31, 2002 at 12:05:26PM +0200, Daniel Phillips wrote:
> As far as change overviews go, I think I'm a long way from even thinking about that. A lot more of the basic ideas have to be in place first. Having a full database around that we can do arbitrary queries on should help quite a lot.

Just a random thought I stumbled across: Since you want to store the transforms in the DB, what are you doing queries on here? Comments? AFAICS, it would not be feasible to do SQL queries on the stored transforms.
Rasmus

From phillips at bonn-fries.net Tue Jun 4 23:57:38 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Tue, 4 Jun 2002 15:57:38 +0200
Subject: [Prophesy] Re: Improved string transformation
In-Reply-To: <20020604135340.C29724@jaquet.dk>
References: <20020604135340.C29724@jaquet.dk>
Message-ID: 

On Tuesday 04 June 2002 13:53, Rasmus Andersen wrote:
> On Fri, May 31, 2002 at 12:05:26PM +0200, Daniel Phillips wrote:
> > As far as change overviews go, I think I'm a long way from even thinking about that. A lot more of the basic ideas have to be in place first. Having a full database around that we can do arbitrary queries on should help quite a lot.
>
> Just a random thought I stumbled across: Since you want to store the transforms in the DB, what are you doing queries on here? Comments? AFAICS, it would not be feasible to do SQL queries on the stored transforms.

The transform is one field of a record whose primary index is, most likely, object id, assuming the object is an entire file. There may be other as yet undetermined fields; for instance, we may want to group related transforms together into changes, and comments would be attached to changes. As far as what we can query, sure, it doesn't make sense to do an SQL query on a transform itself, but we can quickly generate strings from source+transforms and do queries on those. Perhaps the result of the query would be used to go back and reorganize the transforms, or perhaps the query will generate a list of regions of interest in the fully expressed source of some version.

-- 
Daniel

From phillips at bonn-fries.net Sat Jun 8 00:17:33 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Fri, 7 Jun 2002 16:17:33 +0200
Subject: [Prophesy] Diff to transform converter
Message-ID: 

Here is a first cut at a unified-diff to transform converter. I considered using a parsing or lexing tool to do this, but in the end I decided to roll my own parser, as the diff syntax is quite simple. I took a look at the Posix regex package but was disappointed to learn that it can only deal with null-terminated strings, which severely limits its usefulness. The scanf library function, on the other hand, turned out to integrate quite well with my little parser. I suppose that is because the diff syntax was originally constructed to be friendly to scanf. Scanf is used to parse and convert the chunk line numbers.

The diff syntax proved to be context-free, that is, it's not necessary to decode any of the line numbers in order to complete the parse. Furthermore, only the input line number of each chunk is needed to generate the transform output. This fact certainly wasn't obvious when I started.

The parser itself is a state/transition machine, which is what all the gotos are about. To hide the parsed text behind a stream abstraction and make the parser concise and readable, three helper macros are used:

  next_ch() - get the next character to parse, returning -1 if at end
  next_is(c) - return true if the next character is c
  skip_to(c) - skip ahead to c or end and return true if found

These macros assume the variables 'string' and 'limit' are within scope, defining the limits of the text to parse. The macros make use of a pair of inlines, _next_is and _skip_to, which properly parameterize all the required state, so that no static variables are used in the parser itself and the end result is thread-safe. (Note: here we see C at its weakest. If it were possible to define the helper functions within the scope of the parser, no macros would be needed and no state would have to be passed.)
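In sketch form, the macros and their inline helpers might look like the following, assuming 'string' is a char pointer and 'limit' points just past the end of the text (the attached transform.c is authoritative; this merely mirrors the description above):

	static inline int _next_is(const char *string, const char *limit, char c)
	{
		return string < limit && *string == c;
	}

	static inline int _skip_to(const char **string, const char *limit, char c)
	{
		while (*string < limit)
			if (*(*string)++ == c)
				return 1; /* stopped just past c */
		return 0;             /* hit the end instead */
	}

	#define next_ch() (string < limit ? (unsigned char)*string++ : -1)
	#define next_is(c) _next_is(string, limit, (c))
	#define skip_to(c) _skip_to(&string, limit, (c))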
In the end, the parser itself, complete with a handful of helper functions, was the smaller part of the project - more than half the code is devoted to generating the transform codes. Code generation is slightly complicated by a mechanism for merging successive operations of the same type together. This in turn requires the literal text for text_op operations to be identified in one place, and used in another, which introduces a new array to check for overflow and expand as necessary, etc. Details like this tend to bulk up code quickly.

While the parser itself is thread-safe, i.e., it uses no static variables, the code generator isn't. It uses a number of static variables and two arrays, mostly concerned with implementing the opcode merging. This needs to be cleaned up at some point by encapsulating all state in a single struct to be shared by the parser and code generator. The code generator itself will not embed nicely inside the parser because it is called from two places, one to output opcodes and the other at the end of the parse to flush out the final, pending opcode.

With the helper functions not inlined and gcc optimization level 2, the parser and code generator come in at less than 2K, a result that warms the heart of an old code miser like me. Inlining only adds another 100 bytes or so, so of course we will do it, to get the performance.

It's pretty much impossible to debug a parser and code generator like this without tracing output, and I have used my usual technique for that. There are three macros that can be used to wrap tracing output statements: trace_on and trace_off, and a further macro, trace, which is defined as one or the other, depending on whether you want tracing output or not. It would be foolish to assume that no further work needs to be done on the parser, so all the tracing code has been left in for now.
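The convention, roughly (the exact definitions in transform.c may differ):

	#define trace_on(cmd) cmd  /* emit the wrapped statement as written */
	#define trace_off(cmd)     /* compile the wrapped statement away */
	#define trace trace_on     /* flip to trace_off to silence tracing */

	/* usage: trace(printf("state %u at %i\n", state, (int)(string - input));) */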
There needs to be more error checking. As it stands, this code should perform its job correctly, on the assumption that the diff text is always correct. It would of course be foolish to assume this. Some of the redundant information in the diff can be used for a crosscheck:

  - The number of copied and skipped lines in each chunk should match the chunk's specified input line count
  - The number of copied and added lines in each chunk should match the chunk's specified output line count
  - The current output line should be tracked and checked against the chunk's output line number
  - Copied and skipped text in the diff should be checked to ensure it matches the corresponding input text
  - Added text could possibly be checked against the original target text, but since the target text is not required for any other purpose, it makes more sense just to test-apply the generated transform to the input text, then ensure it matches the original target text

As suggested above, whenever we generate a transform we want to test-apply it to ensure it does in fact have the same result as applying the diff does, that is, it correctly generates the diff target given the input text and the transform.

Array overflow checking needs to be added, complete with automatic expansion of the arrays as needed.

The current code does not take advantage of the move operation, which, as I noted earlier, is there so that text that is merely moved (or copied) from place to place in a string does not have to be encoded literally - it can always be taken from the input string when a transform needs to be applied. This is merely an optimization, and a difficult one at that. There are other tasks of more immediate importance.

In fact, the whole process of converting a diff to a transform is just a shortcut so that we can start loading the database with string differences without getting bogged down in the details of identifying which sections of text have changed and which have not. Eventually, we do want to go to the effort of implementing a custom algorithm to do this, for a number of reasons:

  - It will be faster and (probably) more reliable than diff
  - It will work at a resolution of less than a line, so that in the common case where only a single name has changed in a line, the redundant, invariant context will not be recorded
  - It will handle binary files just as well as ascii text

Now I should also address the question: why not just use diff, and avoid all the trouble of implementing these transform things? Well:

  - Transforms are considerably more compact than diff files. For example, context and skipped lines from the diff are not encoded (or needed) in the transform.
  - Applying a transform will be significantly faster than applying a diff via patch. This is an operation we will make heavy use of as the repository operations become more sophisticated.
  - The transform code is far less complex than patch and other diff utilities, and hence correspondingly more trustworthy.
  - We will probably want to handle more than one kind of diff syntax, and transforms provide a common storage format.
  - Transforms, because of their simpler structure, are much more suitable for calculus-type operations, such as composition.

A diff can do one thing that a transform cannot do: a diff can be applied in a fuzzy way, that is, the patched text does not have to be exactly the same as the original target text from which the diff was generated. However, we don't require this property just now, since we only want to represent exact differences in the database. Anything else is an error. When we do get to the point of handling fuzzy problems like merging, we will need to build some more tools for the purpose. We will not make the mistake of attempting to use a single tool such as diff for two different purposes, neither of which it is ideally suited to.

Now that I have broken the back of this biggish chunk of work, it's time to contemplate what else needs to be done to get to the point of having some minimally functional repository manager to play with:

  - Finish up this work by wrapping it with some test code to create diffs and cross-check against the generated transforms
  - Wrap the C code as a Python library: http://www.python.org/doc/current/api/api.html
  - Think more about the details of the magic filesystem. This won't have to be a general purpose stackable filesystem, it just has to interface to userland in such a way that file text can be saved before being overwritten, and compared to the changed result when a file is closed.
  - Flesh out some more database format structural detail, so that filenames and directory structure can be tracked and simple metadata such as comments and version names can be recorded
  - Do a little work to improve the python database interface class for record writing, taking advantage of Postgres's "copy file" table loading command (which can be easily emulated with less efficient operations for other databases that don't have it)

As I mentioned previously, the magic filesystem interface doesn't necessarily have to exist before the system can be used: a simple workaround is to run the editing commands from a Python shell, with wrappers to run the required database operations.

The attached code demonstrates the conversion of a simple diff text into a transform. It's all set to compile and run, with tracing on. In the trace output, a character followed by ',' was read by skip_to and a character followed by '?' was tested by next_is. Parse states are printed out at the beginning of a line, and generated operations at the end. Finally, the generated transform is printed byte by byte in hex.

-- 
Daniel

-------------- next part --------------
A non-text attachment was scrubbed...
Name: transform.c
Type: text/x-c
Size: 6867 bytes
Desc: not available
URL: 

From phillips at bonn-fries.net Sat Jun 8 13:03:00 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Sat, 8 Jun 2002 05:03:00 +0200
Subject: [Prophesy] Diff to transform converter
In-Reply-To: 
References: 
Message-ID: 

This is a minor update, correcting the parser to reject diff strings where the '---' sequence does not occur at the beginning of a line.

-- 
Daniel

-------------- next part --------------
A non-text attachment was scrubbed...
Name: transform.c
Type: text/x-c
Size: 6944 bytes
Desc: not available
URL: 

From phillips at bonn-fries.net Sat Jun 8 23:18:57 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Sat, 8 Jun 2002 15:18:57 +0200
Subject: [Prophesy] Diff to transform converter
In-Reply-To: 
References: 
Message-ID: 

Today's update includes detection of overflow in, and automatic expansion of, the two arrays of unpredictable size. To regularize this somewhat messy operation a little, the array handling for one of the two was rewritten as pointer arithmetic, so that both cases fit the model of output into an array with variable base, limit and current position. A common 'expand' function is thus able to handle both cases. This code is almost ready to be pressed into service.

-- 
Daniel

-------------- next part --------------
A non-text attachment was scrubbed...
Name: transform.c
Type: text/x-c
Size: 7721 bytes
Desc: not available
URL: 

From phillips at bonn-fries.net Sun Jun 9 07:06:12 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Sat, 8 Jun 2002 23:06:12 +0200
Subject: [Prophesy] C from Python Example
Message-ID: 

It's about time to try using some C code from Python, to see how well that works out. If it does work out, then I suppose we are on our way. I'm guessing that the work will go several times faster in Python than in C, because of fewer worries about memory allocation and such things.

While the Python C API is well documented, the requirements for compiling and installing Python loadable modules written in C aren't. After a little guesswork and fiddling around I came up with the following simple example, which defines a module with one method that simply returns a copy of the string passed to it.
File foo.c:

	#include "Python.h"

	static PyObject *foobar(PyObject *self, PyObject *args)
	{
		char *string;
		return PyArg_Parse(args, "s", &string)? PyString_FromString(string): NULL;
	}

	static PyMethodDef foo_methods[] = {
		{"bar", foobar},
		{NULL, NULL}
	};

	void initfoo()
	{
		Py_InitModule("foo", foo_methods);
	}

Apparently PyArg_Parse is deprecated, but it works. The new, improved way isn't a lot different, but I haven't tried it yet (a sketch appears at the end of this message). I must say, it's a little frightening that Python actually parses a string on every call to a C function. Surely there is a better way of doing this, for example, parse the strings at module load time. The moral of the story is: don't expect Python to be fast, not with this kind of implementation. Oh well, it still should work out well for this project, since the heavy lifting will be done in tightly coded C.

The example can be compiled, installed as a module, and executed using the script lines:

	cc -shared -I/usr/include/python2.2 python2.c -o foo.so && \
	sudo cp foo.so /usr/lib/python2.2/lib-dynload/ && \
	python2.2 -c "import foo; print foo.bar('test')"

I verified that this example works in python 1.5, 2.1 and 2.2. The python2.2-dev package has to be installed to get the Python.h header file.

I'd like to import the module without copying it to the lib-dynload directory, which is not a good way to develop because it requires root privilege, and anyway, it's an annoying extra step. I'm sure there's a way to do it, but I haven't found it yet.

There is also a fancy system called 'distutils' that builds and installs extension modules. I don't really see why anything fancier than what I've shown here is needed for development.

Reference material is available here:

	http://python.org/doc/current/ext/ext.html "Extending and Embedding the Python Interpreter"
	http://python.org/doc/current/api/api.html "Python/C API Reference Manual"
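For reference, the non-deprecated style mentioned above would look about like this (untested here, as noted): flag the method METH_VARARGS and parse the argument tuple with PyArg_ParseTuple.

	#include "Python.h"

	static PyObject *foobar(PyObject *self, PyObject *args)
	{
		char *string;
		if (!PyArg_ParseTuple(args, "s", &string))
			return NULL;
		return PyString_FromString(string);
	}

	static PyMethodDef foo_methods[] = {
		{"bar", foobar, METH_VARARGS},
		{NULL, NULL, 0}
	};

	void initfoo(void)
	{
		Py_InitModule("foo", foo_methods);
	}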
-- 
Daniel

From phillips at bonn-fries.net Sun Jun 9 18:50:00 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Sun, 9 Jun 2002 10:50:00 +0200
Subject: [Prophesy] Re: C from Python Example
Message-ID: 

On Saturday 08 June 2002 23:06, you wrote:
> The example can be compiled, installed as a module, and executed using the script lines:
>
> 	cc -shared -I/usr/include/python2.2 python2.c -o foo.so && \
> 	                                    ^^^^^^^^^
> 	sudo cp foo.so /usr/lib/python2.2/lib-dynload/ && \
> 	python2.2 -c "import foo; print foo.bar('test')"

Correction:

	cc -shared -I/usr/include/python2.2 foo.c -o foo.so && \
	sudo cp foo.so /usr/lib/python2.2/lib-dynload/ && \
	python2.2 -c "import foo; print foo.bar('test')"

-- 
Daniel

From phillips at bonn-fries.net Sun Jun 9 23:07:17 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Sun, 9 Jun 2002 15:07:17 +0200
Subject: [Prophesy] Diff to transform converter
In-Reply-To: 
References: 
Message-ID: 

On Friday 07 June 2002 16:17, Daniel Phillips wrote:
> There needs to be more error checking. As it stands, this code should perform its job correctly, on the assumption that the diff text is always correct. It would of course be foolish to assume this. Some of the redundant information in the diff can be used for a crosscheck:
>
> - The number of copied and skipped lines in each chunk should match the chunk's specified input line count
> - The number of copied and added lines in each chunk should match the chunk's specified output line count
> - The current output line should be tracked and checked against the chunk's output line number
> - Copied and skipped text in the diff should be checked to ensure it matches the corresponding input text
> - Added text could possibly be checked against the original target text, but since the target text is not required for any other purpose, it makes more sense just to test-apply the generated transform to the input text, then ensure it matches the original target text

A couple of items to round out the list:

  - The input line number of each chunk should be monotonically increasing, and the input chunks should not overlap
  - The output line number of each chunk should be monotonically increasing, and the output chunks should not overlap

Under the category of 'further work', special attention needs to be paid to the possibility that the final line may not be terminated by an end-of-line character, in any combination of:

  - the input file
  - the output file
  - the diff file

Diff uses some bizarre syntax for indicating the absence of end-of-line in certain circumstances. It doesn't seem to be documented (the unified diff format itself is only loosely documented, in a bsd man page) and I have not taken the time to reverse engineer it. It seems to have something to do with a \ character beginning the line just after a +++ line, with a comment to the effect that an end-of-line is missing in one of the files. Yuck.

If an end-of-line is missing in a diff file, it's probably fair to treat it as a syntax error. If missing in the input or output file, then we have to watch out for the (crude) diff syntax that indicates this and process it to produce the correct transform. I believe this only affects the final operation, and then only with certain of the three basic operations.

-- 
Daniel

From erica at raqfaq.net Mon Jun 10 05:14:39 2002
From: erica at raqfaq.net (Erica Douglass)
Date: Sun, 9 Jun 2002 12:14:39 -0700
Subject: [Prophesy] Diff to transform converter
In-Reply-To: 
Message-ID: <000201c20fe9$e7434fa0$ad7ba8c0@corkyserver>

Sending with the correct From: address this time...

> -----Original Message-----
> From: prophesy-admin at auug.org.au [mailto:prophesy-admin at auug.org.au] On Behalf Of Daniel Phillips
> Sent: Sunday, June 09, 2002 6:07 AM
> To: prophesy at auug.org.au
> Subject: Re: [Prophesy] Diff to transform converter
>
> On Friday 07 June 2002 16:17, Daniel Phillips wrote:
>
> If an end-of-line is missing in a diff file, it's probably fair to treat it as a syntax error. If missing in the input or output file, then we have to watch out for the (crude) diff syntax that indicates this and process it to produce the correct transform. I believe this only affects the final operation, and then only with certain of the three basic operations.
>
> --
> Daniel

Sometimes it really shows that you are a UNIX person. :P

You're forgetting that you're going to have to translate between \n, \r\n (Windows), and \r (Macintosh) if you want full cross-platform compatibility. Here's what I used to translate in PHP. It's based on the browser detected and assumes that the file has been pulled into a string called $content.

	// get os for carriage returns :P
	if (strstr(getenv('HTTP_USER_AGENT'), 'Win')) {
		$content = eregi_replace("\r", "", $content);
	};

This brings up a whole lot of questions, like:

-- What is your interface going to be? If it's web-based, it's easy to detect the browser and make assumptions. Cross-platform GUI... well, it's not as easy.
If you want to force people to use Linux, you can make a Linux-only binary and a web-based client for people who aren't using Linux, but then you might have some pissed-off customers.

-- What DO your customers want? At what stage do you want to start pulling in user feedback? So far this list has mostly been "Daniel is cool because he can do a diff transform, and look, here's this nifty Python thing..." I usually start by asking the customer(s) what they want and designing from that spec. I think that is pretty much the norm in customer-centered development, which is definitely required if you want this project to actually succeed rather than to be a PET (penis enlargement tool).

I'm not trying to bash you, Daniel. I'm just questioning where this project is going. I would like to see a nice marketing-style spec with bullet points and customer needs analyses.

The question that everyone on this list should be thinking about is, "Is this a serious project that I am willing to invest my time in?" If so, we need a spec, not just C code.

Erica

From phillips at bonn-fries.net Mon Jun 10 09:32:29 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Mon, 10 Jun 2002 01:32:29 +0200
Subject: [Prophesy] Diff to transform converter
In-Reply-To: <000001c20fe9$04884030$ad7ba8c0@corkyserver>
References: <000001c20fe9$04884030$ad7ba8c0@corkyserver>
Message-ID: 

On Sunday 09 June 2002 21:08, Erica Douglass wrote:
> You're forgetting that you're going to have to translate between \n, \r\n (Windows), and \r (Macintosh) if you want full cross-platform compatibility.

Cross-platform compatibility isn't a goal, except that the end result should work on all versions of Linux, not just Redhat or Debian. If somebody wants to port the code to other platforms, that's fine. Somebody, sometime, will no doubt want to pull some source files in crlf format into the repository, and for that all we need is a general-purpose filter:

	cat /c/sourcefile.c | crlf2unix >mytree/myfile.c

This follows from the overall design philosophy, which perhaps I haven't expressed clearly enough: a tree under management will look and act just like a normal directory tree, except that any time you create, change, move or delete a file in it, a daemon will intercept the event and update a database accordingly. Among other things, this allows the use of a general-purpose filter for such a purpose as described above.
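A minimal sketch of such a filter - crlf2unix is of course just a made-up name here, and this version only drops carriage returns, so bare Mac-style \r line endings would need mapping to \n instead:

	#include <stdio.h>

	int main(void)
	{
		int c;
		while ((c = getchar()) != EOF)
			if (c != '\r') /* pass everything except carriage returns */
				putchar(c);
		return 0;
	}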
> If it's web-based, it's easy to detect the browser and make assumptions.

A web interface isn't central. When there is one, it's going to be for reporting or download. It would also be nice - very nice - to have an lxr-style view of the code tree, with an incrementally updating index. I don't see using the browser to edit or manage source files; the current state of browser technology just isn't up to it.

> Cross-platform GUI... well, it's not as easy. If you want to force people to use Linux, you can make a Linux-only binary and a web-based client for people who aren't using Linux, but then you might have some pissed-off customers.

It's a stretch to call users of a free system 'customers'. Anyone who wants to use this system on some other platform than Linux can get the source and do the port, or find a friend to do it.

> -- What DO your customers want? At what stage do you want to start pulling in user feedback? So far this list has mostly been "Daniel is cool because he can do a diff transform, and look, here's this nifty Python thing..." I usually start by asking the customer(s) what they want and designing from that spec. I think that is pretty much the norm in customer-centered development, which is definitely required if you want this project to actually succeed rather than to be a PET (penis enlargement tool).

Customer number one is me, and my main motivation is to come up with something that saves time and does a more accurate job on certain common tasks that have proved to be big time wasters for me. One such task is preparation of patch sets, where each patch in the set has to result in a system which builds and operates correctly. To date, I've handled that by maintaining multiple source trees and diffing between them, but that is tedious and error prone, not to mention consuming huge amounts of disk space.

Which brings me to another big concern: saving disk space. I see that on my laptop I currently have xx gig in my src directory, and I have considerably more on my server. That's just too much. Most of these source trees are just minor variations on each other - experiments or incremental versions. I want to be able to go:

	goto linux-2.4.16; make install
	goto linux-2.4.19; make install

all in the same source tree. (The new kbuild system, once it gets into the tree, will help with this, as it - optionally - does not pollute the source tree with build files.)

> I'm not trying to bash you, Daniel. I'm just questioning where this project is going. I would like to see a nice marketing-style spec with bullet points and customer needs analyses.
>
> The question that everyone on this list should be thinking about is, "Is this a serious project that I am willing to invest my time in?" If so, we need a spec, not just C code.

The vast majority of successful open source projects start with some working code that does something useful, so creating said working code has to be the main focus at this point. Besides, it's been fun and interesting so far, and I do think it is going somewhere.
--
Daniel

From erica at simpli.biz Mon Jun 10 05:08:19 2002
From: erica at simpli.biz (Erica Douglass)
Date: Sun, 9 Jun 2002 12:08:19 -0700
Subject: [Prophesy] Diff to transform converter
In-Reply-To: 
Message-ID: <000001c20fe9$04884030$ad7ba8c0@corkyserver>

> -----Original Message-----
> From: prophesy-admin at auug.org.au [mailto:prophesy-admin at auug.org.au] On
> Behalf Of Daniel Phillips
> Sent: Sunday, June 09, 2002 6:07 AM
> To: prophesy at auug.org.au
> Subject: Re: [Prophesy] Diff to transform converter
>
> On Friday 07 June 2002 16:17, Daniel Phillips wrote:
>
> If an end-of-line is missing in a diff file, it's probably fair to treat it
> as a syntax error. If missing in the input or output file then we have to
> watch out for the (crude) diff syntax that indicates this and process it to
> produce the correct transform. I believe this only affects the final
> operation, and then only with certain of the three basic operations.
>
> --
> Daniel

Sometimes it really shows that you are a UNIX person. :P You're forgetting that you're going to have to translate between \n, \r\n (Windows), and \r (Macintosh) if you want full cross-platform compatibility.

Here's what I used to translate in PHP. It's based on the browser detected and assumes that the file has been pulled into a string called $content.

    // get os for carriage returns :P
    if (strstr(getenv('HTTP_USER_AGENT'), 'Win')) {
        $content = eregi_replace("\r", "", $content);
    }

This brings up a whole lot of questions, like:

-- What is your interface going to be? If it's web-based, it's easy to detect the browser and make assumptions. Cross-platform GUI... well, it's not as easy. If you want to force people to use Linux, you can make a Linux-only binary and a web-based client for people who aren't using Linux, but then you might have some pissed-off customers.

-- What DO your customers want? At what stage do you want to start pulling in user feedback? So far this list has mostly been "Daniel is cool because he can do a diff transform, and look, here's this nifty Python thing..." I usually start by asking the customer(s) what they want and designing from that spec. I think that is pretty much the norm in customer-centered development, which is definitely required if you want this project to actually succeed rather than to be a PET (penis enlargement tool).

I'm not trying to bash you, Daniel. I'm just questioning where this project is going. I would like to see a nice marketing-style spec with bullet points and customer needs analyses.

The question that everyone on this list should be thinking about is, "Is this a serious project that I am willing to invest my time in?" If so, we need a spec, not just C code.

Erica

From rasmus at jaquet.dk Tue Jun 11 07:33:18 2002
From: rasmus at jaquet.dk (Rasmus Andersen)
Date: Mon, 10 Jun 2002 23:33:18 +0200
Subject: [Prophesy] C from Python Example
In-Reply-To: 
References: 
Message-ID: <20020610213318.GA2395@jaquet.dk>

On Sat, Jun 08, 2002 at 11:06:12PM +0200, Daniel Phillips wrote:
> The example can be compiled, installed as a module, and executed using the
> script line:
>
>     cc -shared -I/usr/include/python2.2 python2.c -o foo.so && \
>     sudo cp foo.so /usr/lib/python2.2/lib-dynload/ && \
>     python2.2 -c "import foo; print foo.bar('test')"

Doing this nets me 'ImportError: dynamic module does not define init function (initfoo)'. (OK, I cheated and skipped the second line (see below)).

I have used SWIG (www.swig.org) successfully before as a C/Python integrator.
The v1.3 (newest, I think) I had on my system choked on some of your C constructs; using the attached patch helped.

Also, I had to use an interface file to get SWIG to grok other stuff. Also attached.

I am not pushing this in a serious way since you seem to be doing without, but I wanted to show what I needed to do.

>
> I verified that this example works in python 1.5, 2.1 and 2.2. The
> python2.2-dev package has to be installed to get the Python.h header file.
>
> I'd like to import the module without copying it to the lib-dynload
> directory, which is not a good way to develop because it requires root
> privilege, and anyway, it's an annoying extra step. I'm sure there's a way
> to do it, but I haven't found it yet.
>

This problem I did not have on python2.1 and python1.5 (solaris) and python2.2 (linux).

Rasmus

From rasmus at jaquet.dk Tue Jun 11 15:48:20 2002
From: rasmus at jaquet.dk (Rasmus Andersen)
Date: Tue, 11 Jun 2002 07:48:20 +0200
Subject: [Prophesy] C from Python Example
In-Reply-To: <20020610213318.GA2395@jaquet.dk>
References: <20020610213318.GA2395@jaquet.dk>
Message-ID: <20020611054820.GA1630@jaquet.dk>

On Mon, Jun 10, 2002 at 11:33:18PM +0200, Rasmus Andersen wrote:
> I have used SWIG (www.swig.org) successfully before as a C/Python integrator.
> The v1.3 (newest, I think) I had on my system choked on some of your
> C constructs; using the attached patch helped.
>
> Also, I had to use an interface file to get SWIG to grok other stuff.
> Also attached.

Sigh.

Rasmus

-------------- next part --------------
--- transform.c.org	Mon Jun 10 22:25:39 2002
+++ transform.c	Mon Jun 10 23:29:21 2002
@@ -67,7 +67,8 @@
 // should not allow leading 0 for high
 // should trap all invalid opcodes
 // should take ops string limit and check against it
-struct transinfo {int in; int out;} transcheck(uchar *ops)
+struct transinfo {int in; int out;};
+struct transinfo transcheck(uchar *ops)
 {
 	unsigned c, state = 0, count = 0, ilen = 0, olen = 0;
 
@@ -153,7 +154,8 @@
 unsigned emit_line, this_line, hold_lines, hold_length, emit_op;
 uchar *emit_text, *this_text, *end_text, *outmem, *output, *outlim;
 
-struct holdline {char *source; unsigned length;} *holdmem, *hold, *holdlim;
+struct holdline {char *source; unsigned length;};
+struct holdline *holdmem, *hold, *holdlim;
 
 int emit(unsigned op)
 {
-------------- next part --------------
%module foo
%{
#define max(a, b) (a > b? a: b)
#define trace trace_on
#define trace_on(cmd) cmd
#define trace_off(cmd)
#define text_op 0
#define copy_op 1
#define skip_op 2
#define high_op 3
#define text(n) (n | (text_op << 6))
#define copy(n) (n | (copy_op << 6))
#define skip(n) (n | (skip_op << 6))
#define move(n, s) copy(0), copy(s), copy(n)
struct holdline {char *source; unsigned length;};
struct transinfo {int in_org; int out;};
%}
extern int transform(unsigned char *ops, unsigned char *in, unsigned char *out);
extern struct transinfo transcheck(unsigned char *ops);
extern int _next_is(unsigned char c, unsigned char **stringv, unsigned char *limit);
extern int _skip_to(unsigned char c, unsigned char **stringv, unsigned char *limit);
extern void expand(void **pbase, void **plim, void **pcur, unsigned more);
extern int emit(unsigned op);
extern int diff2transform(unsigned char *input, unsigned inlen, unsigned char *string, unsigned length);
extern unsigned emit_line, this_line, hold_lines, hold_length, emit_op;
extern unsigned char *emit_text, *this_text, *end_text, *outmem, *output, *outlim;
extern struct holdline *holdmem, *hold, *holdlim;

From rasmus at jaquet.dk Tue Jun 11 17:31:06 2002
From: rasmus at jaquet.dk (Rasmus Andersen)
Date: Tue, 11 Jun 2002 09:31:06 +0200
Subject: [Prophesy] C from Python Example
In-Reply-To: <20020611054820.GA1630@jaquet.dk>; from rasmus@jaquet.dk on Tue, Jun 11, 2002 at 07:48:20AM +0200
References: <20020610213318.GA2395@jaquet.dk> <20020611054820.GA1630@jaquet.dk>
Message-ID: <20020611093106.A4144@jaquet.dk>

On Tue, Jun 11, 2002 at 07:48:20AM +0200, Rasmus Andersen wrote:
> On Mon, Jun 10, 2002 at 11:33:18PM +0200, Rasmus Andersen wrote:
> > I have used SWIG (www.swig.org) successfully before as a C/Python integrator.
> > The v1.3 (newest, I think) I had on my system choked on some of your
> > C constructs; using the attached patch helped.
> >
> > Also, I had to use an interface file to get SWIG to grok other stuff.
> > Also attached.
>
> Sigh.
> [snip attachments]

While I am at it, I might as well give the incantations:

    % swig -python transform.i
    % cc -I/usr/include/python2.2 -shared -o foo.so transform.c transform_wrap.c

The tutorial at the www.swig.org site is fairly short and concise.

Rasmus

From phillips at bonn-fries.net Wed Jun 12 01:17:08 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Tue, 11 Jun 2002 17:17:08 +0200
Subject: [Prophesy] C from Python Example
In-Reply-To: <20020611093106.A4144@jaquet.dk>
References: <20020611054820.GA1630@jaquet.dk> <20020611093106.A4144@jaquet.dk>
Message-ID: 

On Tuesday 11 June 2002 09:31, Rasmus Andersen wrote:
> While I am at it, I might as well give the incantations:
>
>     % swig -python transform.i
>     % cc -I/usr/include/python2.2 -shared -o foo.so transform.c transform_wrap.c
>
> The tutorial at the www.swig.org site is fairly short and concise.

Yes, it is. There's one incantation missing: how do you import your foo.so module into python? So far I don't know how to go about loading a module that's in my current working directory.
--
Daniel

From phillips at bonn-fries.net Wed Jun 12 01:31:55 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Tue, 11 Jun 2002 17:31:55 +0200
Subject: [Prophesy] C from Python Example
In-Reply-To: <20020611054820.GA1630@jaquet.dk>
References: <20020610213318.GA2395@jaquet.dk> <20020611054820.GA1630@jaquet.dk>
Message-ID: 

On Tuesday 11 June 2002 07:48, Rasmus Andersen wrote:
> On Mon, Jun 10, 2002 at 11:33:18PM +0200, Rasmus Andersen wrote:
> > I have used SWIG (www.swig.org) successfully before as a C/Python integrator.
> > The v1.3 (newest, I think) I had on my system choked on some of your
> > C constructs; using the attached patch helped.

Yes, it seems swig's parser has fallen a little bit behind and I suppose a bug report to the project would be in order.

/me puts it on his list of things to do sometime

In the meantime your fix is fine.

> > Also, I had to use an interface file to get SWIG to grok other stuff.
> > Also attached.

Swig is very clearly a quick way to get started on constructing an interface like this and I would have used it if I'd known about it. (I did see swig mentioned a few times as I searched for documentation, but mainly in the context of interfacing Python to C++, so that put me off the scent.) There's a lot of knowledge encoded in the swig interface generators that could be time consuming to acquire by other means.

For this project I'd tend towards treating swig as more of a kind of tutorial than an essential build tool, since the Python/C interface is quite straightforward once you work out the basics, like where to find the documentation and what's required to compile and link. I do intend to run swig from time to time to compare what it thinks is essential for an interface, versus what I come up with from reading the docs.

OK... I just generated a swig python wrapper from your .i file... Woohoo! Over a thousand lines of wrapper, more than 3 times the size of the project so far, and the generated code is 8 times the size. Well, I guess that's the problem with program-writing programs in general. By studying the wrapper I'm sure there are useful things to learn, but I think it's easy enough to generate the Python wrappers by hand, as needed. Of course, that means being attentive and worrying about things like object ref counts and locking, but these are good to know about anyway.

Swig-friendly transform.c attached.

--
Daniel

-------------- next part --------------
A non-text attachment was scrubbed...
Name: transform.c
Type: text/x-c
Size: 7761 bytes
Desc: not available
URL: 

From rasmus at jaquet.dk Wed Jun 12 02:47:34 2002
From: rasmus at jaquet.dk (Rasmus Andersen)
Date: Tue, 11 Jun 2002 18:47:34 +0200
Subject: [Prophesy] C from Python Example
In-Reply-To: ; from phillips@bonn-fries.net on Tue, Jun 11, 2002 at 05:17:08PM +0200
References: <20020611054820.GA1630@jaquet.dk> <20020611093106.A4144@jaquet.dk>
Message-ID: <20020611184734.A6465@jaquet.dk>

On Tue, Jun 11, 2002 at 05:17:08PM +0200, Daniel Phillips wrote:
> On Tuesday 11 June 2002 09:31, Rasmus Andersen wrote:
> > While I am at it, I might as well give the incantations:
> >
> >     % swig -python transform.i
> >     % cc -I/usr/include/python2.2 -shared -o foo.so transform.c transform_wrap.c
> >
> > The tutorial at the www.swig.org site is fairly short and concise.
>
> Yes, it is. There's one incantation missing: how do you import your foo.so
> module into python? So far I don't know how to go about loading a module
> that's in my current working directory.
As I tried to say, that is one problem I haven't had: On a variety of platforms I have been able to load from the working directory.

Rasmus

From phillips at bonn-fries.net Wed Jun 12 03:22:25 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Tue, 11 Jun 2002 19:22:25 +0200
Subject: [Prophesy] C from Python Example
In-Reply-To: <20020611184734.A6465@jaquet.dk>
References: <20020611184734.A6465@jaquet.dk>
Message-ID: 

On Tuesday 11 June 2002 18:47, Rasmus Andersen wrote:
> On Tue, Jun 11, 2002 at 05:17:08PM +0200, Daniel Phillips wrote:
> > On Tuesday 11 June 2002 09:31, Rasmus Andersen wrote:
> > > While I am at it, I might as well give the incantations:
> > >
> > >     % swig -python transform.i
> > >     % cc -I/usr/include/python2.2 -shared -o foo.so transform.c transform_wrap.c
> > >
> > > The tutorial at the www.swig.org site is fairly short and concise.
> >
> > Yes, it is. There's one incantation missing: how do you import your foo.so
> > module into python? So far I don't know how to go about loading a module
> > that's in my current working directory.
>
> As I tried to say, that is one problem I haven't had: On a variety
> of platforms I have been able to load from the working directory.

Is Debian one of those platforms? If not, then I suppose it's just a configuration issue, specifically, the initialization of sys.path:

    http://www.python.org/doc/current/ref/import.html

Could you please do:

    import sys
    sys.path

and see if "." is one of the entries? The first entry I see here is '', which seems a little odd.

--
Daniel

From rasmus at jaquet.dk Wed Jun 12 03:35:48 2002
From: rasmus at jaquet.dk (Rasmus Andersen)
Date: Tue, 11 Jun 2002 19:35:48 +0200
Subject: [Prophesy] C from Python Example
In-Reply-To: 
References: <20020611184734.A6465@jaquet.dk>
Message-ID: <20020611173548.GA1761@jaquet.dk>

On Tue, Jun 11, 2002 at 07:22:25PM +0200, Daniel Phillips wrote:
> Is Debian one of those platforms? If not, then I suppose it's just
> a configuration issue, specifically, the initialization of sys.path:
>
>     http://www.python.org/doc/current/ref/import.html
>
> Could you please do:
>
>     import sys
>     sys.path
>
> and see if "." is one of the entries? The first entry I see here is '',
> which seems a little odd.

No debian: Solaris and Mandrake. But there is no '.' in my sys.path (on Mandrake at least)? I get the '' as well.

Rasmus

From phillips at bonn-fries.net Wed Jun 12 03:39:05 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Tue, 11 Jun 2002 19:39:05 +0200
Subject: [Prophesy] C from Python Example
In-Reply-To: <20020611173548.GA1761@jaquet.dk>
References: <20020611173548.GA1761@jaquet.dk>
Message-ID: 

On Tuesday 11 June 2002 19:35, Rasmus Andersen wrote:
> On Tue, Jun 11, 2002 at 07:22:25PM +0200, Daniel Phillips wrote:
> > Is Debian one of those platforms? If not, then I suppose it's just
> > a configuration issue, specifically, the initialization of sys.path:
> >
> >     http://www.python.org/doc/current/ref/import.html
> >
> > Could you please do:
> >
> >     import sys
> >     sys.path
> >
> > and see if "." is one of the entries? The first entry I see here is '',
> > which seems a little odd.
>
> No debian: Solaris and Mandrake. But there is no '.' in my sys.path
> (on Mandrake at least)? I get the '' as well.

OK, well I'll just put that one aside as a minor mystery to be investigated in due course, and on with the show.
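In the meantime there's an explicit workaround for development, which is to prepend the working directory to sys.path by hand before importing. A sketch (I'm assuming nothing more than this is needed):

    # Put the current directory at the front of the module search path,
    # then import the freshly built extension module from right here.
    import sys, os
    sys.path.insert(0, os.getcwd())
    import foo
    print foo.bar('test')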
--
Daniel

From phillips at bonn-fries.net Thu Jun 13 01:43:16 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Wed, 12 Jun 2002 17:43:16 +0200
Subject: [Prophesy] Database Structure and the Transport System
Message-ID: 

As I mentioned earlier, I am working towards the goal of implementing a basic source tree transport system, which implements the following two operations:

    tag - give a name to the current version of the source tree
    goto - transform the current source tree to a previously tagged version

I have a theory that if I do this very accurately and efficiently, first it will be immediately useful for certain purposes such as preparing patch sets and storing multiple source tree versions in a single tree, and second, it can be extended naturally to become an elegant and useful distributed source code manager.

The immediate need is to define a database structure well suited to capturing incremental changes to a source tree and supporting the above transport system operations. To help understand my thinking, it's useful to consider the following points:

1) Not all the source is recorded in the database proper. Some, or most of the text exists only as normal files in the source tree. Only enough information is recorded in the database to support the transport mechanism. Though this does add a little complexity to the system, it means that the database can be considerably smaller in common situations, and putting a source tree under management is a much faster operation than if all the source text had to be compressed and loaded into it. (As an extension, the on-disk source could be redundantly encoded in the database in order to provide protection against the possibility that the on-disk source could be changed while Prophesy isn't watching.)

2) Transforms are unidirectional. This is simply to save space - we don't have to record the text that a transform deletes from the on-disk version, only added text. However, we can easily compute the inverse of a transform given the input text, or portions of it. At the time the transport system transforms the on-disk text into some other version, those portions of the input text needed to compute the inverse transformation - needed so that the transport system can restore the on-disk text to its original form - can be stored in the database.

Prophesy Database Structure
---------------------------

So far, I've identified the following essential database entities:

    - file
    - directory
    - version
    - transform

where 'directory' is a kind of file. Each of these objects will have an internally-generated, permanent id.

The object id, especially for files and directories, is tantalizingly similar to a file inode, and it's tempting to use the underlying filesystem's actual inode number for the id, except that we are allowed to "cp -a" the whole source tree structure, and the inode numbers would change in the process. Drat.

This means that Prophesy and the underlying filesystem are going to be doing a lot of lookups in parallel: the filesystem looks up a file by path and name, yielding an inode, then advises Prophesy that the file is to be altered. Prophesy then has to look up the file by path and name, yielding an object id. Oh well, we will ensure the latter operation is efficient.

The main (or perhaps only) function of directory objects is to support lookup of objects by name. Each file or directory object has a name and a directory id, this pair being unique in a version.
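Just to pin down what that lookup amounts to, here is a toy model in Python, with a dictionary standing in for the database table (all names invented for illustration):

    # Toy model of Prophesy's name lookup: the database maps each
    # (directory id, name) pair to an object id, so resolving a full
    # path is a walk down from the root directory object.
    names = {}                          # (directory id, name) -> object id

    def lookup(path, root_id=0):
        oid = root_id
        for part in path.strip('/').split('/'):
            oid = names[(oid, part)]
        return oid

The real thing is a database query with a hash table cache in front of it, but the shape of the operation is the same.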
A file object can be known by more than one name/directory pair, that is, it can be hardlinked.

Tree Structure of Versions
--------------------------

The ability to return to some previous version of the source text and modify it means that the version structure is a tree, that is, any version can be forked. The version table represents this tree in the form of a flat, relational table.

version table:

    version id, parent version, tag

That is, each version knows its parent. If we want to know all the children of some version then we can query the database for all versions with the given parent. We can cache the result if we want, to support repeated queries of this form efficiently.

Primary Data Representation
---------------------------

All primary data in Prophesy is represented by the combination of the current on-disk source text and changes relative to the current source text. These changes take the form of three tables, as described below, and a journal table, to which additions to any of the three change tables are logged.

journal table:

    journal id, comment, author, timestamp

The sole purpose of the journal table is tracking; the transport system does not make reference to it. While the journal could in theory be used to wind the whole database back to any historical state, the transport mechanism provides a more powerful and efficient way of doing that.

Each change to file text between two versions results in an addition to the text change table.

text change table:

    object id, input version, output version, journal id,
    transform, untransform

The forward transform is not stored until needed, since it can be generated from the current text and the untransform, and so would contain only redundant text. The forward transform must be generated the first time the current version moves downstream of the output version, that is, towards the root. Optionally, the forward transform can be stored redundantly, to protect against the possibility that the on-disk tree could be changed without knowledge of Prophesy, or to allow an entire repository to be copied by copying just a single database file.

Each file or directory object in the Prophesy database has a unique object id, which is used to track the object as it evolves from version to version.

object create table:

    object id, input version, output version, journal id, name

An object delete is exactly an object create with the input and output versions reversed. In other words, a create going from version A to version B implies a delete going from version B to version A. For any delete, the object text must be stored; however, for a create that would be redundant, since the current text is on disk. In fact, the same considerations apply as for text changes, and they are handled the same way. That is, when a non-empty file is deleted, Prophesy enters both a reverse create and a text change consisting of a single text remove (reverse add) operation into the database.

When an object is deleted, its object id is not reused (possibly excepting cases where an object is created and deleted within the same version, or an entire version is discarded) since it continues to exist in other versions within the same repository. For the time being, a 32 bit object id should be sufficient.

Handling hard links correctly is expected to be problematic; however, representing them is not a problem. A hard link is simply a create for an object that already exists.
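For concreteness, the tables above might look something like this as SQL issued from Python - a rough sketch only, the column types are guesses, and the move table described just below follows the same pattern:

    # Rough DDL sketch of the tables described above (types are guesses;
    # a 'name' is stored as an (atom, directory) pair, per the atom
    # discussion further down).
    schema = """
        create table version    (version integer primary key, parent integer, tag text);
        create table journal    (journal integer primary key, comment text,
                                 author text, stamp timestamp);
        create table textchange (object integer, inversion integer, outversion integer,
                                 journal integer, transform bytea, untransform bytea);
        create table objcreate  (object integer, inversion integer, outversion integer,
                                 journal integer, atom integer, dir integer);
    """

Nothing here is final; the point is just that the whole structure comes down to a handful of small, fixed-format tables.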
From version to version, file and directory objects may be moved from any place in the source tree to any other. Each such move results in an entry in the object move table. As with filesystems, object rename is treated as a move.

object move table:

    object id, input version, output version, journal id,
    input name, output name

Here, and everywhere else names are used, a 'name' is a pair: atom, directory. Using atom ids rather than literal text in the move and create records means that these records consist only of fixed-size fields, which is friendly to database optimization. Atoms also provide a measure of compression, since the atom table is shared by all versions.

Current State Cache
-------------------

Other than the version table, all primary objects in the database represent differences rather than current state. A current state for any version can always be constructed by applying all transforms, create/deletes and moves encountered on the path from an old current version to a new current version. However, filename lookups need to be efficient, and so a hash table mapping all current names (atom, dir pair) to object ids is maintained incrementally.

The list of all current objects is easily and efficiently generated by taking the union of all object creates on the path from the root to the current version, less all object deletes. This is rarely needed, so it is not maintained incrementally.

Name Lookup
-----------

Name lookup by full path is needed each time Prophesy intercepts and processes a change to a file, and that could add up to a lot of lookups. For example, a global edit might be performed, or a whole set of files untarred into a subdirectory, or a directory deleted. It is desirable that typical file operations not be slowed noticeably by putting a source tree under management.

Therefore, to optimize directory lookups, an additional hash table is maintained, which maps hashes of full directory paths to directory objects. This avoids the need to iterate through each section of a directory path to perform a lookup. When a directory name is changed, hashes of subdirectories need to be invalidated, and this is the only case where Prophesy needs to know the subdirectory tree of a given directory. To optimize this, a directory table for the current version is maintained incrementally:

directory table:

    directory id, parent directory id

which forms a tree, since multiple directory ids can have the same parent directory id.

Epilogue
--------

On the 'well begun is half done' principle, this post constitutes my last major effort before turning to preparations for the Ottawa Linux Symposium and kernel summit. In other words, I won't be implementing any of this for about a month. This should provide adequate time for the ideas to mature. Of course I'll respond to any critical comment, or elaborate on any points I glossed over too quickly.

--
Daniel

From phillips at bonn-fries.net Thu Jun 13 08:33:00 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Thu, 13 Jun 2002 00:33:00 +0200
Subject: [Prophesy] Background material - xdelta
Message-ID: 

Having roughed out the design of a storage engine for Prophesy, I thought I'd do a little research and I found this:

    http://telia.dl.sourceforge.net/sourceforge/xdelta/xdfs.pdf
    (Josh MacDonald's paper on delta compression)

Recommended reading.
And see:

    http://prcs.sourceforge.net/
    (PRCS revision control project, home page)

    http://telia.dl.sourceforge.net/sourceforge/prcs/prcs_doc.html
    (PRCS documentation)

Though much of what is written here seems similar to what I've mapped out, in the end, the implementation comes out very different.

--
Daniel

From phillips at bonn-fries.net Thu Jun 13 22:13:07 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Thu, 13 Jun 2002 14:13:07 +0200
Subject: [Prophesy] Background material - Subversion
Message-ID: 

Subversion is a well-established and active project whose design is similar in many ways to what I've put forth:

    http://subversion.tigris.org/

The subversion code is hosted online in a Subversion repository. This directory of design notes makes interesting reading:

    http://svn.collab.net/repos/svn/branches/0.11.1/notes/

Subversion uses a database, currently Berkeley DB, with plans to switch to an SQL database at some point in the future. Hmm. I wonder, why not start there?

A repository is made available for distributed access via an Apache module, just as I'd planned. The use of DAV gives a simple form of web browsing interface for free.

The Subversion engine is modeled on a filesystem, and seems headed in the direction of becoming a versioning filesystem, although the technical details of how to make it a mountable filesystem have not been addressed. Instead, filesystem-like access is provided by way of a C api modeled on the Posix file functions.

Much functionality appears to be available already; however, the file formats have not been frozen and database design issues seem to be in flux.

--
Daniel

From phillips at bonn-fries.net Thu Jun 13 22:30:12 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Thu, 13 Jun 2002 14:30:12 +0200
Subject: [Prophesy] Database Structure and the Transport System
In-Reply-To: 
References: 
Message-ID: 

On Wednesday 12 June 2002 17:43, I wrote:
> text change table:
>
>     object id, input version, output version, journal id,
>     transform, untransform
>
> object create table:
>
>     object id, input version, output version, journal id, name
>
> object move table:
>
>     object id, input version, output version, journal id,
>     input name, output name

After reflecting a little, I realized that where I used 'input version, output version' representing a change between two version nodes, I should have used the arc between versions, giving a representation that is more compact and easier to search:

text change table:

    object id, version arc, journal id, transform, untransform

object create table:

    object id, version arc, journal id, name

object move table:

    object id, version arc, journal id, input name, output name

which adds a new database entity, 'version arc', the directed arc from a version's parent to itself. The primary version arc definitions can be folded into the version table:

version table:

    version id, parent version, version arc id, tag

--
Daniel

From mbp at samba.org Fri Jun 14 10:28:54 2002
From: mbp at samba.org (Martin Pool)
Date: Fri, 14 Jun 2002 10:28:54 +1000
Subject: [Prophesy] comments so far
In-Reply-To: 
References: 
Message-ID: <20020614002851.GD6330@toey.sourcefrog.net>

These are mostly just ideas I've had in my mind about SCM; some of them disagree with (what I've heard of) prophesy. Of course, you can do whatever you want. So take them or leave them.

I think the hard thing about defining a SCM system is defining just what SCM *means*.
As far as I can tell, you seem to be implementing a versioning filesystem, which lets you tag and revisit points in history. That's very nice, but I don't think that is really the heart of the problem.

I believe that SCM systems, like programming languages, are primarily tools for communication between programmers -- the pragmatics of controlling the machine are secondary. (Included is the case of a programmer communicating with themselves over time.)

Hooking at the filesystem level is good for capturing all changes, but I think they are very fine-grained and not meaningful. I think it's a bad idea -- although I of course respect you for trying it -- because I think the benefits compared to regular commands don't justify the added complexity and risk.

There's a hierarchy:

release notes for a new version -- many end-users will read these; they'll include references to bugs fixed

list of patches accepted -- every developer probably wants to read this

list of small changes within a patch -- many programmers probably want to read this

diff for an actual patch -- probably don't need to read it unless I'm actually working in the area

Perhaps there are some other levels, but you get the idea. I think the recursive nature is very important. The key job of the SCM system is to help programmers manage the history of development of the project.

Just keeping a GNU-style ChangeLog can be pretty useful even without SCM.

Autogenerating a NEWS file by pulling out top-level comments would be great, because it's one of the most useful tools to a user or satellite developer.

Offline operation is crucial. Most projects don't have everybody on a LAN. Open source is inherently distributed. Time costs here will drastically outweigh anything you can do with a database, etc, on the server.

Arch makes every download of the product a potential working directory. I don't think it's necessary to keep the entire history in every tarball, but it is perhaps good to keep references that tie the files to their place in history.

It would, by extension, be nice to allow all downloads to happen over http/ftp, and all submissions to happen by mail to a maintainer. The program should not require any intelligence in the protocol.

People shouldn't need permission to start hacking on a project, and to keep versions locally. They just need permission to commit to the master site.

diffs have this nice property of being intelligible to humans and programs. Keep them. Make minimal changes to handle chmod, mv, etc.

All other things being equal, files should be directly human-readable. Use diffs. Perhaps make ChangeLogs, or something similar, part of the metadata. (On the other hand, being readable might encourage editing by hand, which would be bad.)

Writing new filesystems, diff formats, network protocols, etc is just screwing around. The heart of the problem is to get a good model for *how to do SCM*. You can implement (v1) using existing tools; optimize later if it turns out that your model is correct.

Similarly, don't waste time writing GUIs; use emacs, xxdiff, dirdiff, etc. Write one later if it proves correct.

If I was starting from scratch, I would consider a typical open source project:

- email is key

- people mail around patches; perhaps they get revised; eventually they get applied

- the NEWS file says "applied patch for foofeature from jhacker at dot.com"

Projects sometimes split off files or subdirectories into other projects; perhaps they diverge slightly. It would be nice to handle this.
For rsync and other projects, I keep patches that I have not yet really accepted but that look good in CVS in patches/. A SCM system that managed this would be nice. I think it's a promising model, not a hack.

Disk is cheap. Keep everything.

Networks are getting broader, but latency is not going to go away.

Do it in <4000 lines. Lions-book Unix was 10kloc, and look how many good ideas they had in there.

--
Martin

From phillips at bonn-fries.net Sat Jun 15 00:09:55 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Fri, 14 Jun 2002 16:09:55 +0200
Subject: [Prophesy] comments so far
In-Reply-To: <20020614002851.GD6330@toey.sourcefrog.net>
References: <20020614002851.GD6330@toey.sourcefrog.net>
Message-ID: 

On Friday 14 June 2002 02:28, Martin Pool wrote:
> These are mostly just ideas I've had in my mind about SCM; some of
> them disagree with (what I've heard of) prophesy. Of course, you can
> do whatever you want. So take them or leave them.
>
> I think the hard thing about defining a SCM system is defining just
> what SCM *means*.
>
> As far as I can tell, you seem to be implementing a versioning
> filesystem, which lets you tag and revisit points in history. That's
> very nice, but I don't think that is really the heart of the problem.

It's the heart of a tool that (hopefully) lets you get at the heart of the problem.

> I believe that SCM systems, like programming languages, are primarily
> tools for communication between programmers -- the pragmatics of
> controlling the machine are secondary. (Included is the case of a
> programmer communicating with themselves over time.)

I believe you're right, so long as SCM systems stay as clumsy as they are. If the archive system was actually easy and transparent to use, then programmers would use it as a tool for themselves, as a means of tracking multiple projects they're involved in, and trying out experiments. In much the same way as we now rely on the undo chain in an editor - I do that, don't you? That is, I rely on the editor's undo chain to back me out of failed experiments. It gets to the point where I'm reluctant to shut down the machine because of all the state saved in the editor's undo chains. Now, that's a system that works, but it's got glaring imperfections, beyond the fact that the state disappears when the editor shuts down. The editors also don't know about each other, and they are incapable of maintaining undo chains across different files, let alone projects.

Granted, the SCM is also a tool for communication, but much good work has already been done there. I think the distributed side of things is well known and under control, but today's crop of scm's still suck as development tools. So that's where I'm concentrating.

> Hooking at the filesystem level is good for capturing all changes, but
> I think they are very fine-grained and not meaningful.

This was addressed in an earlier post. In the current version, every change to each file is recorded (and in order, giving you global undo, including undeletes) but when you close the version, the stacked changes are collapsed into a single layer of changes for the version. To put it another way, the system journals individual changes, but (unless you tell it otherwise) only for the current version.

> I think it's a
> bad idea -- although I of course respect you for trying it -- because
> I think the benefits compared to regular commands don't justify the
> added complexity and risk.
Somebody from Apple said it well: "you should never have to tell the computer something it already knows". Check-in and check-out are things the computer can figure out for itself.

Risk... I don't see it. If anything, the risk of a programmer forgetting or misapplying a command is greater. I know, I did it myself once :-)

As for complexity, I don't really see that. Difficult, yes, because so far nobody has provided a suitable framework on Linux for stacking local filesystems. Anyway, I don't intend to tackle the problem of exporting the vfs to user space in its full generality, but rather, just enough to provide the functionality I want. If that provides a good base to work from towards a fully general system, then that's a bonus.

Finally, I don't have to depend on the magic filesystem effort being successful, since the fallback is just to go to the traditional way of doing things, with explicit commands (a file checkout has the immediate effect of loading the current contents of the file into the database). However, that's way too dull for me and would fall well short of what I'd expect from a 21st century design.

I've only thought in general terms about how to implement the magic filesystem so far, however, now is the time to get down to specifics. As a design rule, I'll try to work within existing kernel mechanisms, but if those mechanisms prove inadequate, I won't be shy about changing them. In the end, if somebody comes up with a better way of doing the same thing, that's great, but right now the main concern is functionality and reliability. Other essential design parameters are:

  - Overhead imposed by the magic filesystem is insignificant
  - No performance impact at all outside the scope of the magic filesystem
  - No security compromise
  - No new DoS vulnerabilities
  - No new races

When the magic filesystem is mounted, it gets a new superblock and knows about the superblock of the underlying system. We want to pass most vfs events straight through to the underlying filesystem, except for open, write, mmap and close (note that the vfs only passes the final file close event to the filesystem, and this isn't good enough). A pass-through write would work as follows:

  - inodes of the magic filesystem are exactly the inodes of the underlying filesystem, except for having an i_sb that points at a magic_superblock in place of the underlying filesystem's native superblock (does this work??)

  - vfs calls magic_file->f_dentry->d_inode->i_fop->write(magic_file, ...)

  - this magic_file_write keeps the native superblock in a private field of the magic superblock: magic_file->f_dentry->d_inode->i_sb->private.real_sb

  - magic_file_write allocates a temporary buffer, invokes the native filesystem's ->read to read the to-be-overwritten data into it, writes that data into the userspace daemon's pipe, and releases the temporary buffer (there has to be a more direct way of doing this!)

  - magic_file_write then calls the underlying filesystem's ->write, with its native... (inode??, no, it points at magic_sb, recursion!!) could we temporarily reset the sb?? yikes. Too bad generic_file_write takes a file instead of an inode.

Other considerations:

  - Modify dnotify to allow events on files, not just directories

  - For every file open, register on

  - File open is overridden to attach notify events to file open and file close, if the file was opened r/w.
    These events are directed at the user space daemon.

  - File write is overridden in magic_file_operations->write, to read the current contents of the file in the overwritten region into a pipe. If the pipe is full the writing process blocks until the userspace daemon empties it.

> There's a hierarchy:
>
> release notes for a new version -- many end-users will read these;
> they'll include references to bugs fixed
>
> list of patches accepted -- every developer probably wants to read
> this
>
> list of small changes within a patch -- many programmers probably
> want to read this
>
> diff for an actual patch -- probably don't need to read it unless
> I'm actually working in the area
>
> Perhaps there are some other levels, but you get the idea. I think
> the recursive nature is very important. The key job of the SCM system
> is to help programmers manage the history of development of the
> project.
>
> Just keeping a GNU-style ChangeLog can be pretty useful even without
> SCM.
>
> Autogenerating a NEWS file by pulling out top-level comments would be
> great, because it's one of the most useful tools to a user or
> satellite developer.
>
> Offline operation is crucial. Most projects don't have everybody on a
> LAN. Open source is inherently distributed. Time costs here will
> drastically outweigh anything you can do with a database, etc, on the
> server.
>
> Arch makes every download of the product a potential working
> directory. I don't think it's necessary to keep the entire history in
> every tarball, but it is perhaps good to keep references that tie the
> files to their place in history.
>
> It would, by extension, be nice to allow all downloads to happen over
> http/ftp, and all submissions to happen by mail to a maintainer. The
> program should not require any intelligence in the protocol.
>
> People shouldn't need permission to start hacking on a project, and to
> keep versions locally. They just need permission to commit to the
> master site.
>
> diffs have this nice property of being intelligible to humans and
> programs. Keep them. Make minimal changes to handle chmod, mv, etc.
>
> All other things being equal, files should be directly human-readable.
> Use diffs. Perhaps make ChangeLogs, or something similar, part of the
> metadata. (On the other hand, being readable might encourage editing
> by hand, which would be bad.)
>
> Writing new filesystems, diff formats, network protocols, etc is just
> screwing around. The heart of the problem is to get a good model for
> *how to do SCM*. You can implement (v1) using existing tools;
> optimize later if it turns out that your model is correct.
>
> Similarly, don't waste time writing GUIs; use emacs, xxdiff, dirdiff,
> etc. Write one later if it proves correct.
>
> If I was starting from scratch, I would consider a typical open source
> project:
>
> - email is key
>
> - people mail around patches; perhaps they get revised; eventually
> they get applied
>
> - the NEWS file says "applied patch for foofeature from
> jhacker at dot.com"
>
> Projects sometimes split off files or subdirectories into other
> projects; perhaps they diverge slightly. It would be nice to handle
> this.
>
> For rsync and other projects, I keep patches that I have not yet
> really accepted but that look good in CVS in patches/. A SCM system
> that managed this would be nice. I think it's a promising model, not
> a hack.
>
> Disk is cheap. Keep everything.
>
> Networks are getting broader, but latency is not going to go away.
>
> Do it in <4000 lines. Lions-book Unix was 10kloc, and look how many
> good ideas they had in there.
>
> --
> Martin

--
Daniel

From mbp at samba.org Sat Jun 15 04:06:03 2002
From: mbp at samba.org (Martin Pool)
Date: Sat, 15 Jun 2002 04:06:03 +1000
Subject: [Prophesy] comments so far
In-Reply-To: 
References: <20020614002851.GD6330@toey.sourcefrog.net>
Message-ID: <20020614180558.GA10553@toey.sourcefrog.net>

I agree with you about the usefulness of editor undo chains. Under emacs, I have kept-new-versions set to about 10, and I regularly use C-u C-x C-s to do "keep backup version" and diff-backup. All very nice and useful.

A filesystem that kept all versions would allow you to do this in a program-neutral way, although I think that's not so important now that almost all the GNU tools understand foo.c.~1~ backups.

However, it has the same problem that the results are largely lacking semantics. For example, looking back through the history of all modifications to a directory, it seems impossible to tell which versions of the source will actually compile correctly, and which were intermediate versions that don't work. If a programmer commits early-and-often to CVS (say), but at least runs the test suite first, then you have in general some guarantee about the internal consistency of any committed version. (It would be even better if CVS versions were module-wide, like in Subversion.)

A magic filesystem is "mere mechanism". I don't think you should be spending so much time on it until you have a good design for the version-control system built on top.

If it turns out that the design "on top" is no better than CVS, then nobody will bother -- people who want neat features will use Bk (or a free clone), and more conservative people will use CVS.

You've said that you need to be able to cope without the filesystem -- why not first implement the version without it, and then put it in as a nicety later?

The same functions can be adequately (perhaps not quite as well) achieved using editor undo, editor backups, or tux2fs.

If the design can sensibly handle many small revisions then it would be easy to have a program called by the editor on save that commits to it. If the design can't handle a huge number of revisions in a sensible way, then it doesn't matter how they get generated.

> I believe you're right, so long as SCM systems stay as clumsy as they are.
> If the archive system was actually easy and transparent to use, then
> programmers would use it as a tool for themselves, as a means of tracking
> multiple projects they're involved in, and trying out experiments. In much
> the same way as we now rely on the undo chain in an editor - I do that, don't
> you? That is, I rely on the editor's undo chain to back me out of failed
> experiments. It gets to the point where I'm reluctant to shut down the
> machine because of all the state saved in the editor's undo chains. Now,
> that's a system that works, but it's got glaring imperfections, beyond the
> fact that the state disappears when the editor shuts down. The editors also
> don't know about each other, and they are incapable of maintaining undo
> chains across different files, let alone projects.

This is the perfect example of why semantic information is necessary. Pressing C-_ repeatedly until it looks about right is error-prone and labour intensive -- more than anything else, this limits the usefulness of editor undo.
For fixing small mistakes it's good, but for backing out of hour-long experiments it seems useless to me. I don't want to say "undo edit" a hundred times; I want to say "back up to before I started working on this feature".

Ideally, I can have several trees around. (Disk is cheap.) Instead of rolling back, just toss that directory tree on the floor so I can find it later if I want to see what it was that I tried.

> Granted, the SCM is also a tool for communication, but much good work has
> already been done there. I think the distributed side of things is well
> known and under control,

I think current SCMs are not nearly as good as they should be. Bk is the only decent distributed one, which is why it's doing so well.

> but today's crop of scm's still suck as development tools. So
> that's where I'm concentrating.

Do you mean they're not very helpful for the individual developer? What kind of thing?

> This was addressed in an earlier post. In the current version, every
> change to each file is recorded (and in order, giving you global undo,
> including undeletes) but when you close the version, the stacked changes are
> collapsed into a single layer of changes for the version. To put it another
> way, the system journals individual changes, but (unless you tell it
> otherwise) only for the current version.

I disagree with this too :-) SCM shouldn't ever throw away information; it should only selectively roll it up for display. Once you've captured a diff it should be kept forever. Seeing the order in which edits within a version were made might possibly be helpful in the future.

For example, consider the case in which a version consists of me taking a patch from somebody, and then fiddling things a bit to make it merge properly. From one point of view, those changes have to go together, since both are necessary to make the program compile again. On the other hand, it would be nice to be able to see the original diff separately.

The more I think about it, the more I think some kind of recursive nesting of versions makes sense. Bk has this, but it enforces a two-level model of changesets, which consist of deltas (which are more or less diffs). But I can imagine a higher-level changeset containing several others, particularly if they're ported or accepted from somebody else.

> > I think it's a
> > bad idea -- although I of course respect you for trying it -- because
> > I think the benefits compared to regular commands don't justify the
> > added complexity and risk.
>
> Somebody from Apple said it well: "you should never have to tell the computer
> something it already knows".

Right, but you shouldn't be afraid to tell the computer things that are pragmatically necessary.

Somewhat off-topic comparison: directory and file names are not really necessary, because you can always search by content. But in practice, with some exceptions, systems that do that have often turned out to be hard to use.

> Check-in and check-out are things the computer can figure out for
> itself.

How? How is the computer meant to know what I was thinking when I made a change? That's what future readers of the code really want to know. It might even be *more* important than the change itself -- this is why ChangeLogs can work in the absence of any other SCM.

I find it's actually good discipline for the programmer too -- it helps them concentrate on doing only one thing at a time.

> Risk... I don't see it. If anything, the risk of a programmer forgetting or
> misapplying a command is greater. I know, I did it myself once :-)
Kernel crashes, down filesystems, etc. If ClearCase is down, you can't do *anything*. If your CVS server is down, you can at least edit and compile locally, and diff against old versions.

> As for complexity, I don't really see that. Difficult, yes, because so far
> nobody has provided a suitable framework on Linux for stacking local
> filesystems.

I agree that would be useful. I just think you have a filesystem-hacker hammer and are trying to apply it to a SCM thumb.

--
Martin

From phillips at bonn-fries.net Sat Jun 15 04:18:54 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Fri, 14 Jun 2002 20:18:54 +0200
Subject: [Prophesy] comments so far
In-Reply-To: 
References: <20020614002851.GD6330@toey.sourcefrog.net>
Message-ID: 

On Friday 14 June 2002 16:09, Daniel Phillips wrote:
> On Friday 14 June 2002 02:28, Martin Pool wrote:

Well, I didn't intend to send the previous post until after having worked out more of the magic filesystem issues, however... the implication is that files under management of the magic filesystem have to have two inodes, one belonging to the magic filesystem and one belonging to the native filesystem. I'm putting down much of this awkwardness to what I'm increasingly seeing as misdesign of the vfs, but cleaning that up is not the immediate project. I'll return to the question of the magic filesystem later.

OK, now the first thing I should say is that I agree with all the features you list below, and what I'm going to do now is speculate about how the current design can support each of them, or what needs to be done to support them.

> > There's a hierarchy:
> >
> > release notes for a new version -- many end-users will read these;
> > they'll include references to bugs fixed

So the database needs to know what's a release note. This is version metadata, since a release is always a version. The question is, do we want to define metadata structure at the database table level, or do we want to just put all version metadata together in a single 'version metadata' record per version and parse it out with xml or some such?

> > list of patches accepted -- every developer probably wants to read
> > this

Meaning the system has to know what the patch is, when accepted, into what version, and so on. What I'd like to do if possible is to carry forward patches as objects from version to version, so that the scm user can apply a patch to version 2.4.16 and remove it, perhaps after it's mutated a little, from version 2.4.19. For now, the most practical way to do this is just keep the patch verbatim in the database (along with the who/when/etc information) and let the user figure out what has to be done to revert it later. Hmm, yes, that's easy, and it's what you want I strongly suspect.

The list of patches applied to a particular version is actually very important. Without it, you don't know what to revert. I've often felt the lack of this kind of information. Anyway, this feature is what BitKeeper would call 'import patch', except that Prophesy is going to remember more about the imported patch than BitKeeper does, will keep the patch in its database, and will let you revert it without having to find the original copy on disk.

> > list of small changes within a patch -- many programmers probably
> > want to read this

Right, so when Prophesy parses out the patch (we don't need to use patch to do this any more, because of the parser I wrote) it will save the patch header as metadata, assuming it's a description.
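At the top level that parsing is simple; a sketch (split_patch is an invented helper name, and I'm assuming the usual unified diff markers):

    import re

    def split_patch(mail_text):
        # Everything before the first unified diff header is treated as
        # the description to be saved as metadata; the rest is the diff
        # itself, archived verbatim as received.
        m = re.search(r'^--- ', mail_text, re.M)
        if not m:
            return mail_text, ''        # no diff found at all
        return mail_text[:m.start()], mail_text[m.start():]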
The Prophesy user can edit this saved description and mark it up so that it can generate a nice-looking listing of patch details (realistically, nobody ever edits these details, but it's nice to know you could).

> > diff for an actual patch -- probably don't need to read it unless
> > I'm actually working in the area

Right, since the actual diff is compressed into the database, the web interface could pull it up for you.

> > Perhaps there are some other levels, but you get the idea. I think
> > the recursive nature is very important. The key job of the SCM system
> > is to help programmers manage the history of development of the
> > project.
> >
> > Just keeping a GNU-style ChangeLog can be pretty useful even without
> > SCM.
> >
> > Autogenerating a NEWS file by pulling out top-level comments would be
> > great, because it's one of the most useful tools to a user or
> > satellite developer.

Yes, here you'd have to convince your submitters to mark up their patches, or you'd have to do it yourself. Taking the email subject line by default would be a good start.

> > Offline operation is crucial. Most projects don't have everybody on a
> > LAN. Open source is inherently distributed. Time costs here will
> > drastically outweigh anything you can do with a database, etc, on the
> > server.

The database is installed and runs locally. Operation is offline by default.

> > Arch makes every download of the product a potential working
> > directory. I don't think it's necessary to keep the entire history in
> > every tarball, but it is perhaps good to keep references that tie the
> > files to their place in history.

That's right, for every repository there's a working directory. The repository database lives in the root of the working directory. By the way, Prophesy is not so rude as to force an additional top level directory on top of the normal top directory as BitKeeper and other systems do.

> > It would, by extension, be nice to allow all downloads to happen over
> > http/ftp,

As with Subversion, distributed access will be provided in the form of an Apache module. Providing an ftp view as well would be very nice.

> > and all submissions to happen by mail to a maintainer. The
> > program should not require any intelligence in the protocol.

Right. We want to integrate Rasmus's patchbot work.

> > People shouldn't need permission to start hacking on a project, and to
> > keep versions locally. They just need permission to commit to the
> > master site.

True, and permission to transmit to the remote site is an entirely different thing, and should be easier to get than permission to commit to the remote site. By the way, there will not be any 'master' site, only remote sites, i.e., Prophesy is peer-to-peer.

> > diffs have this nice property of being intelligible to humans and
> > programs. Keep them. Make minimal changes to handle chmod, mv, etc.

Right, keep the ability to parse them and generate them, but don't use them internally; they're inappropriate for that. Except that Prophesy will archive the diff in its original form, as received. I suppose that for symmetry we should allow diffs to be sent to be archived as well, complete with descriptive comments etc.

> > All other things being equal, files should be directly human-readable.
> > Use diffs. Perhaps make ChangeLogs, or something similar, part of the
> > metadata. (On the other hand, being readable might encourage editing
> > by hand, which would be bad.)

Using diffs internally in the database is out of the question.
They're just not an appropriate currency for the kinds of manipulations Prophesy has to do.

> > Writing new filesystems, diff formats, network protocols, etc is just
> > screwing around.

I agree about the network protocols, but not about the filesystem magic and the internal storage format. Particularly in regards to the latter, look at the research that's been done. There's a reason for it: archive size and efficiency of common operations are very real problems. Not to mention accuracy and power. These things depend very much on the solidity of the foundation on which the superstructure stands.

> > The heart of the problem is to get a good model for
> > *how to do SCM*. You can implement (v1) using existing tools;
> > optimize later if it turns out that your model is correct.

Well actually, by parsing diffs to get the transforms, that's exactly what I'm doing. (And it turns out that doing a proper binary diff isn't that hard.) Python, postgresql, glade, etc., are all 'existing tools'. What other existing tools would you suggest? Not patch. It's much easier and faster to apply database deltas with the already-implemented transform mechanism. Later, when we get to merging, patch or a patch-like thing will be needed, and then we'll probably start with patch and move to something faster/more powerful/more reliable later.

> > Similarly, don't waste time writing GUIs; use emacs, xxdiff, dirdiff,
> > etc. Write one later if it proves correct.

Agreed there. However, once the basic transport mechanism is in place, a gui will follow very shortly afterwards, to show the version tree.

> > If I was starting from scratch, I would consider a typical open source
> > project:
> >
> > - email is key
> >
> > - people mail around patches; perhaps they get revised; eventually
> > they get applied
> >
> > - the NEWS file says "applied patch for foofeature from
> > jhacker at dot.com"

Yes indeed, we can and will automate that.

> > Projects sometimes split off files or subdirectories into other
> > projects; perhaps they diverge slightly. It would be nice to handle
> > this.

Yes, a source tree should be able to inherit files from another project, and Prophesy should treat these files as descending from the same object. Each file object can have its own evolutionary tree, and these trees are not the same or restricted at all by the version tree or project boundaries. Furthermore, we should be able to recognize that one object is identical to another in a remote tree, or had a common ancestor. This touches on the subject of universal object ids, which I mentioned earlier in the archives, and I have not forgotten about it. First things first, though.

> > For rsync and other projects, I keep patches that I have not yet
> > really accepted but that look good in CVS in patches/. A SCM system
> > that managed this would be nice. I think it's a promising model, not
> > a hack.
> >
> > Disk is cheap. Keep everything. But keep it as compactly as you can.

It's not that cheap. I have 7 gig of source on my laptop and several times that on my server. Most of that consists of kernel trees, all slightly different versions, or different projects in them. That's just silly.

> > Networks are getting broader, but latency is not going to go away.
> >
> > Do it in <4000lines. Lions-book Unix was 10kloc, and look how many
> > good ideas they had in there.

I suppose the first useful version will be about that size (4K lines).
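Since I keep waving my hands about the transform mechanism, here is a stripped-down sketch of what applying one amounts to. The (op, arg) pair encoding is invented for this example -- the real thing is a packed operation string -- but the idea is the same:

    # Minimal illustration of applying a transform to an input text.
    # The (op, arg) list encoding here is invented for the example.
    def apply_transform(text, ops):
        out = []
        pos = 0
        for op, arg in ops:
            if op == 'copy':        # take the next arg bytes of the input
                out.append(text[pos:pos+arg])
                pos = pos + arg
            elif op == 'skip':      # drop the next arg bytes of the input
                pos = pos + arg
            elif op == 'emit':      # insert literal replacement text
                out.append(arg)
        return ''.join(out)

    # For example:
    #   apply_transform('hello world', [('copy', 6), ('emit', 'there')])
    # gives 'hello there'.

Applying a stored delta is one linear pass like that, which is why it beats running patch.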
--
Daniel

From phillips at bonn-fries.net Sat Jun 15 05:03:00 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Fri, 14 Jun 2002 21:03:00 +0200
Subject: [Prophesy] comments so far
In-Reply-To: <20020614180558.GA10553@toey.sourcefrog.net>
References: <20020614180558.GA10553@toey.sourcefrog.net>
Message-ID:

On Friday 14 June 2002 20:06, Martin Pool wrote:
> I agree with you about the usefulness of editor undo chains. Under
> emacs, I have kept-new-versions set to about 10, and I regularly use
> C-u C-x C-s to do "keep backup version" and diff-backup. All very
> nice and useful.
>
> A filesystem that kept all versions would allow you to do this in a
> program-neutral way, although I think that's not so important now that
> almost all the GNU tools understand foo.c.~1~ backups.
>
> However, it has the same problem that the results are largely lacking
> semantics. For example, looking back through the history of all
> modifications to a directory, it seems impossible to tell which
> versions of the source will actually compile correctly, and which were
> intermediate versions that don't work.

That we can solve by integrating with the build tool a little. Every successful build marks a milestone in the Prophesy journal (not the same as a version).

> If a programmer commits
> early-and-often to CVS (say), but at least runs the test suite first,
> then you have in general some guarantee about the internal consistency
> of any committed version. (It would be even better if CVS versions
> were module-wide, like in Subversion.)

Could you elaborate on this module-wide property? I must have missed it while examining Subversion.

> A magic filesystem is "mere mechanism". I don't think you should be
> spending so much time on it until you have a good design for the
> version-control system built on top.

I totally disagree. I don't think you can build a tower on a bed of jello. The infrastructure is mere mechanism in the same sense that the operating system is mere mechanism: it defines what you can and can't do with the machine.

> If it turns out that the design "on top" is no better than CVS, then
> nobody will bother -- people who want neat features will use Bk (or a
> free clone), and more conservative people will use CVS.
>
> You've said that you need to be able to cope without the filesystem --
> why not first implement the version without it, and then put it in as
> a nicety later?

Oh absolutely, I've stated that already, earlier in the archives.

> The same functions can be adequately (perhaps not quite as well)
> achieved using editor undo, editor backups, or tux2fs.

Now wait, let's not confuse these things. The magic filesystem only does one thing: sends overwritten text to a userspace daemon to be added to the change database. Well, it notifies creates, deletes and truncates as well, but that's it.

> If the design can sensibly handle many small revisions then it would
> be easy to have a program called by the editor on save that commits to
> it. If the design can't handle a huge number of revisions in a
> sensible way, then it doesn't matter how they get generated.

The current plan is to call out to the editor from Python, which will save the file contents beforehand. This is just for testing.
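Something about as dumb as this, in other words (a test-harness sketch; record_change is a hypothetical hook into the scm, not anything that exists yet):

    import os, shutil

    # Test harness sketch: snapshot the file, run the user's editor on it,
    # then hand both versions to the scm so the change gets journalled.
    def edit(path, record_change):
        snapshot = path + '.orig'           # pre-edit copy of the file
        shutil.copyfile(path, snapshot)
        editor = os.environ.get('EDITOR', 'vi')
        os.system('%s %s' % (editor, path))
        record_change(snapshot, path)       # old version vs. new version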
> > I believe you're right, so long as SCM systems stay as clumsy as they are.
> > If the archive system was actually easy and transparent to use, then
> > programmers would use it as a tool for themselves, as a means of tracking
> > multiple projects they're involved in, and trying out experiments. In much
> > the same way as we now rely on the undo chain in an editor - I do that, don't
> > you? That is, I rely on the editor's undo chain to back me out of failed
> > experiments. It gets to the point where I'm reluctant to shut down the
> > machine because of all the state saved in the editor's undo chains. Now,
> > that's a system that works, but it's got glaring imperfections, beyond the
> > fact that the state disappears when the editor shuts down. The editors also
> > don't know about each other, and they are incapable of maintaining undo
> > chains across different files, let alone projects.
>
> This is the perfect example of why semantic information is necessary.
> Pressing C-_ repeatedly until it looks about right is error-prone and
> labour intensive -- more than anything else, this limits the
> usefulness of editor undo. For fixing small mistakes it's good, but
> for backing out of hour-long experiments it seems useless to me. I
> don't want to say "undo edit" a hundred times; I want to say "back up
> to before I started working on this feature".

Right, unless you forgot to put down any kind of marker before you started the session. We can put down various kinds of markers in the journal to help you be lazy here, including timestamps. Furthermore, we can maintain global undo/redo not as a single chain, but as a tree, like a version tree which only gets pruned when you are absolutely sure you don't want to undo any more.

> Ideally, I can have several trees around. (Disk is cheap.) Instead of
> rolling back, just toss that directory tree on the floor so I can find
> it later if I want to see what it was that I tried.

I don't know about you, but I often end up with trees sitting around and I haven't got a clue what's in them and why they're there. I always keep a clean version of the tree around just for this reason: so I can diff the mysterious tree and find out what's in it. Prophesy should automate this, and in addition, should hold some helpful metadata such as nicely chosen version tags.

> > Granted, the SCM is also a tool for communication, but much good work has
> > already been done there. I think the distributed side of things is well
> > known and under control,
>
> I think current SCMs are not nearly as good as they should be. Bk is
> the only decent distributed one, which is why it's doing so well.

BitKeeper is very strong on the maintainer side, not so strong on the submitter side. This makes sense, as it was pitched to maintainers, and in fact, that's where the big bottlenecks were. I'm interested in doing a better job on the developer side, which seems like virgin territory to me. I mean, how often do you hear the word 'usability' in connection with source code management?

> > but today's crop of scm's still suck as development tools. So
> > that's where I'm concentrating.
>
> Do you mean they're not very helpful for the individual developer?
> What kind of thing?

There is too much fiddling with commands. Every time you want to edit a file you have to remember to check it out, and if you happen to be thinking about an actual problem you were trying to solve at the time the need arose, chances are your thought will vanish as you go through the mechanics of checking out the needed file.
There are other rough spots too, such as BitKeeper's insistence on adding an additional level to the top of your tree. I also find all those SCCS files peppered through my source tree an ugly blemish. Putting a tree under management is an unnecessarily complex project, and you have to submit to a strip search. CVS I won't even get into, nobody uses it locally and you know why.

> > This was addressed earlier in an earlier post. In the current version, every
> > change to each file is recorded (and in order, giving you global undo,
> > including undeletes) but when you close the version, the stacked changes are
> > collapsed into a single layer of changes for the version. To put it another
> > way, the system journals individual changes, but (unless you tell it
> > otherwise) only for the current version.
>
> I disagree with this too :-)
>
> SCM shouldn't ever throw away information; it should only selectively
> roll it up for display. Once you've captured a diff it should be kept
> forever. Seeing the order in which edits within a version were made
> might possibly be helpful in the future.

Sure, your edits can all be written to the journal, and that could even be the default. The journal is not the same as the version tree; in the version tree we want to record only fully collapsed diffs between versions.

> For example, consider the case in which a version consists of me
> taking a patch from somebody, and then fiddling things a bit to make
> it merge properly. From one point of view, those changes have to go
> together, since both are necessary to make the program compile again.
> On the other hand, it would be nice to be able to see the original
> diff separately.

I think what we're going to do is actually compress the diff and store it when you receive it, then make a journal entry when you apply it. Your fiddles are the difference between the version with the diff, and your fiddled version. It's not necessary to record all your detailed edits to find the fiddles, though yes, it would be nice to be able to fall back to that in murky situations.

> The more I think about it, the more I think some kind of recursive
> nesting of versions makes sense. Bk has this, but it enforces a
> two-level model of changesets, which consist of deltas (which are more
> or less diffs.) But I can imagine a higher-level changeset containing
> several others, particularly if they're ported or accepted from
> somebody else.

I've talked previously about 'regions', which are distinct parts that together make up a larger diff. It would make sense to nest such things, and it might be possible to track regions as they evolve through versions. On the other hand, I don't see any obvious way to nest versions themselves.

> > Check-in and check-out are things the computer can figure out for
> > itself.
>
> How?

Prophesy knows you checked out a file, because you edited it. Prophesy knows you checked it in because you closed a version.
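In daemon terms that's nothing deeper than this sketch (the event and journal interfaces here are made up for illustration; all the magic filesystem really supplies is the notifications):

    # Sketch of the userspace daemon's view: 'checkout' is implicit in the
    # first write notification for a file in the open version. The event
    # and journal interfaces are hypothetical.
    def handle_event(journal, event):
        if event.kind == 'write':
            if not journal.is_open(event.file):
                journal.mark_checked_out(event.file)    # implicit checkout
            journal.record(event.file, event.old_text)  # save overwritten text
        elif event.kind in ('create', 'delete', 'truncate'):
            journal.record_namespace_change(event)

Closing the version is then the only explicit step, and that's your check-in.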
> How is the computer meant to know what I was thinking when I made a
> change? That's what future readers of the code really want to know.
> It might even be *more* important than the change itself -- this is
> why ChangeLogs can work in the absence of any other SCM. I find it's
> actually good discipline for the programmer too -- it helps them
> concentrate on doing only one thing at a time.
>
> > Risk... I don't see it. If anything, the risk of a programmer forgetting or
> > misapplying a command is greater. I know, I did it myself once :-)
>
> Kernel crashes, down filesystems, etc.

Journalling filesystem...

> If ClearCase is down, you can't do *anything*. If your CVS server is
> down, you can at least edit and compile locally, and diff against old
> versions.

I suppose you missed the part where all repositories are local, and your source tree is just a normal source tree with a database of diffs hidden in the root.

> > As for complexity, I don't really see that. Difficult, yes, because so far
> > nobody has provided a suitable framework on Linux for stacking local
> > filesystems.
>
> I agree that would be useful. I just think you have a filesystem-hacker
> hammer and are trying to apply it to a SCM thumb.

I think when you see where I'm going with it you will say 'aha'.

--
Daniel

From sfr at canb.auug.org.au Sun Jun 16 12:26:44 2002
From: sfr at canb.auug.org.au (sfr at canb.auug.org.au)
Date: Sun, 16 Jun 2002 12:26:44 +1000 (EST)
Subject: [Prophesy] New user request
Message-ID: <200206160226.g5G2QiOF028821@supreme.pcug.org.au>

Hi,

Do you all know a person whose email address is luckas at musoft.de? Should I let them on the list?

Cheers,
Stephen Rothwell

From mbp at samba.org Wed Jun 19 12:53:41 2002
From: mbp at samba.org (Martin Pool)
Date: Wed, 19 Jun 2002 12:53:41 +1000
Subject: [Prophesy] comments so far
In-Reply-To: <20020614180558.GA10553@toey.sourcefrog.net>; from mbp@samba.org on Sat, Jun 15, 2002 at 04:06:01AM +1000
References: <20020614002851.GD6330@toey.sourcefrog.net> <20020614180558.GA10553@toey.sourcefrog.net>
Message-ID: <20020619125341.G32710@va.samba.org>

If you want to design a userspace filesystem hook, that's fine; if you want to design a SCM system that's fine too (and more interesting to me personally.) If you think that a SCM system ought to be built on top of kernel dnotify hooks then I really have to take issue with you.

In summary:

[1] this turns out to be a real weak point in the biggest known implementation of the design, ClearCase

[2] on general principle, things shouldn't be in the kernel unless they need to be

[3] you're not tackling the real problem

[1] I was looking at a ClearCase installation at a large company earlier on today. Everybody's views (~= working directories) are kept on this machine under /view. Fine.

    cd /view/
    ls -l

Hangs. Foo. strace ls -l shows it looping indefinitely on getdirent() (something like that?). Pressing TAB in bash produces the same effect -- sometimes you have to kill bash and log in again. Very amusing. You don't realize how often you use this until you work on a machine without bash, or on a machine where pressing tab is likely to hang your shell.

Anyhow, so I get a view name from somebody else, type it in carefully, and can see things inside. It is noticeably slower than it ought to be, considering the machine it's stored on (modern PIII or something) -- listing a directory takes a fair fraction of a second.

Of course ClearCase is famous for having enormous hardware requirements, exceeding the cost of a developer's desktop hardware. This is no accident, but rather an essential implication of the design: every file IO, even just creating a short-lived temporary file, has to go to userspace, potentially across the network, into a daemon, and potentially into a database. A large fraction of IO on a working directory will have nothing to do with SCM: it will be, e.g., compilation to a test copy. It's dumb to impose the cost on operations when there will be no benefit.

But it's basically all there, and seems to work well. It seems like ClearCase has some nice features.
One popular one is that there are good X11 and W32 GUIs for all operations. It would be good if free systems had that, but it's really more or less independent of the underlying architecture.

Later on we noticed that one of the build scripts was having trouble removing a temporary directory. Eventually it turned out that a file in a /tmp subdirectory was causing unlink() to return ENOENT, even though the file could be listed, stat'd, and even moved. I suspect ClearCase had somehow corrupted the machine's dcache or something to cause this behaviour. The machine was in other respects pretty standard. Presumably rebooting will "fix" it.

So at this point I say:

- "bloody proprietary kernel modules"
- "bloody unnecessary kernel modules"

(Insert epithet of choice in locales other than en_AU)

Now, of course, all software has bugs, and I guess Rational will either eventually fix this, or explain how it's misconfigured on this machine, or at any rate be interested to see the report which will be passed to them. I don't expect software not to have bugs, but I do think if there are simple design decisions that you can make early on that will reduce the likelihood or severity of bugs, you should do so unless there is a strong counterargument. You can make an argument about open source being less buggy (or not) or Rational being dumb (or not), but I don't think any of them is clearly true. At any rate, ClearCase is more mature than Prophesy is likely to be any time soon.

I've seen bugs in BK; typically they can be resolved by using one of BK's commands to preen a repository or remove leftover locks. It hasn't ever caused random other bad things to happen on unrelated parts of my machine and I wouldn't expect it to.

[2] I think the weight of OS design experience is behind me in saying that things should not be in the kernel unless there is some security, performance, or functionality reason why they have to be there. I realize you only want to put hooks into the kernel, not the whole thing, but ClearCase does that too, and the issues still apply. I don't see anything about SCM that can't be adequately done purely in userspace.

In as much as Daniel is designing a system he wants other people to work on and use, I think the obligation is on him to demonstrate that a kernel dependency is necessary. This is particularly so given [1], that putting it in the kernel has turned out to be a problem in the past. I don't think that justification is impossible, but I'm a long way from being convinced.

I can see a few possible justifications, but I don't think any of them stand up:

"it's transparent"

That's bogus; a CVS working directory and a ClearCase view are both trivially transparent in that you can read and edit files using normal tools, but you need to know magic commands or syntax to actually do anything.

"it avoids having nasty CVS dirs lying around"

It's slightly tidier, but it turns out not to be a real problem. If it bugged you, you could have just one in the top level, or make it a dot file.

"you can auto-detect rename/add/delete"

Handling renames is important, but automatically doing it is somewhat less so. There are several other systems possibly as good:

- magic tokens embedded in the file (arch)
- detecting similar file text (bk)
- explicit notification (pre or post)
- ...

These don't happen often enough that it needs to be completely transparent. "bk mv foo bar" is not significantly harder; learning to type it is trivial by comparison to learning the overall system.
"you can keep intermediate changes" Well, that's nice. But given that you're going to throw them away anyhow, I don't see how it's any better than editor backups or a filesystem with history. I guess I don't see it as essentially part of SCM -- it's related but not the same. Given a tiny command that's run on each save or build you can do this from userspace anyhow. People have tried keeping source in databases before (Zope, VisualAge, various Smalltalks), but in general programmers seem to prefer relatively little magic in their source directories. Even MSVC++ keeps plain files on disk. Having plain files opens up opportunities; magic databases close them off. [3] SCM is a hard problem to define; SCM software more or less maps 1:1 with the author's view of how software development is done or ought to be done. The challenge is to think about SCM differently, or more clearly, than has been done before. Svn have already thought about this more than me. My overall impression is that they want to be a "good enough" replacement for CVS's more gaping holes, which is a good goal. If you're going to write a new system rather than hack on (say) Subversion, then it seems to me that you ought to aim to be better than any existing design on at least one important point. I know people here are talking about that, but I think it needs a lot more work before writing code. I think it's far more important than worrying about kernel hooks. Problems that you ought to be thinking about, in my not-very-humble opinion: * Do you want to support disconnected operation? That sounds like a good idea, even when the systems are not really "disconnected" but just on a modem in another continent. It definitely makes your job harder and more interesting: trivially, when you commit, the version number you generate must be local and not universally authoritative. (cf bk's "keys") There are several levels, from merely being able to edit while disconnected (cvs) to making patches but not sending (diff and mail) to basically everything (bk). * Can you have "threads" of development, where several changes are aimed at fixing the same thing, but they're not committed to a separate branch? * Is this meant for people working in an open source / internet way, or in a small-office way? Or do you aim to handle both? They seem pretty different: at one extreme, people just mail around patches; at the other, people just all work in the same directory. A lot of the literature about "Configuration Management" (capital C, M) is written from a military or enormous-project point of view, which is pretty different from that of open source hackers, and not necessarily better for all problems. * It seems obvious that you want some way of building logical changes that span multiple files. Really? Does it make sense to have two distinct changes to the same file inside this? * Can changesets be nested? * How do you represent accepting a patch from somebody, without losing that patch's internal structure? * If you make a mistake in a commit message, can you go back and change it? In many systems you can't, because that would be "rewriting history". It seems useful though, in some cases, and you can solve it by introducing a meta-history concept. * How do you make all this comprehensible? Can you explain it in a single page to a novice user, and leave the complicated stuff til later? Will they get bitten if they try to work with just a simple understanding? 
* Subdirectories often spin off as child projects (tdb from samba), or they might merge in (experimental architectures joining Linux). Can that be supported in some way better than just copying a snapshot of the files across? Do you want to?

* What does it mean to support the "reviewer" role?

* How do you handle repeated bidirectional merges between parallel streams of development?

* Do you want to tackle the "star-merge" problem handled by arch, where you work out the order of applying multiple patches that is least likely to cause conflicts?

* Does the system need to do anything to help with merging beyond just running something equivalent to diff3 and letting you resolve conflicts by hand?

* Some object files are really hard/slow to produce and so it kind of makes sense to keep them in vc, although they don't really belong. (e.g. files requiring a special toolchain; autoconf output) Can you keep them as second-class citizens to avoid conflicts, etc.?

* Sometimes people want to e.g. check in binaries of released versions, so that they can be exactly restored even if the compiler changes later. What do you think of that?

* Can the SCM play a role in communicating at appropriate levels of detail to various audiences? (Users, potential users, managers, developers, core team, satellite developers, distribution maintainers, release engineers, ...)

* What happens when you're in the middle of changing something and you notice a little bug? You want to fix the bug, but also keep that fix separate from your main commit. Under CVS, you might get a second checkout, fix it there, and merge, but that's slow and a lot of trouble, so people mostly don't bother. It would be nice if they could.

* What about developers who are trusted to commit to one branch, but not to HEAD?

* Lots more questions. This is long enough already, you get the idea.

--
Martin

From mbp at sourcefrog.net Wed Jun 26 15:20:06 2002
From: mbp at sourcefrog.net (Martin Pool)
Date: Wed, 26 Jun 2002 15:20:06 +1000
Subject: [Prophesy] Microsoft's SourceDepot system
Message-ID: <20020626052003.GE11907@toey.sourcefrog.net>

In the spirit of "See Figure One", Microsoft have two source code control systems: one they give to their customers, Visual SourceSafe (which sucks), and one they use themselves, SourceDepot, which is quite interesting. Here are some slides about it:

http://216.239.39.100/search?q=cache:4Y_wlCjY5gAC:www.usenix.org/events/usenix-win2000/invitedtalks/lucovsky.ppt+%22sourcedepot%22&hl=en&ie=UTF-8

The details are apparently quite hard to discover.

--
Martin
http://www.things.org/~jym/fun/see-figure-1.html