[Prophesy] comments so far

Sat Jun 15 04:18:54 EST 2002

On Friday 14 June 2002 16:09, Daniel Phillips wrote:
> On Friday 14 June 2002 02:28, Martin Pool wrote:

Well, I didn't intend to send the previous post until after having worked out 
more of the magic filesystem issues, however... the implication is that files 
under management of the magic filesystem have to have two inodes, one 
belonging to the magic filesystem and one belonging to the native filesystem.

I'm putting down much of this awkwardness to what I'm increasingly seeing as 
misdesign of the vfs, but cleaning that up is not the immediate project.  
I'll return to the question of the magic filesystem later.

OK, now the first thing I should say is that I agree with all the features 
you list below, and what I'm going to do now is speculate about how the 
current design can support each of them, or what needs to be done to support 
them.

> > There's a hierarchy:
> > 
> >   release notes for a new version -- many end-users will read these;
> >     they'll include references to bugs fixed

So the database needs to know what's a release note.  This is version 
metadata, since a release is always a version.  The question is, do we want 
to define metadata structure at the database table level, or do we want to 
just put all version metadata together in a single 'version metadata' record 
per version and parse it out with xml or some such?
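
A minimal sketch of the second option, assuming a single XML record per version parsed with Python's standard xml.etree.ElementTree. The element and attribute names (version, release-note, bug, ref) and the sample text are hypothetical; nothing about the record layout has been decided.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout for one 'version metadata' record. The element and
# attribute names are illustrative only, not a settled format.
RECORD = """\
<version id="2.4.19">
  <release-note>Example release note text.</release-note>
  <bug ref="1"/>
  <bug ref="2"/>
</version>
"""

def parse_version_metadata(text):
    """Return (version id, release note text, list of bug references)."""
    root = ET.fromstring(text)
    note = root.findtext("release-note", default="")
    bugs = [bug.get("ref") for bug in root.findall("bug")]
    return root.get("id"), note.strip(), bugs

version, note, bugs = parse_version_metadata(RECORD)
```

The tradeoff is the usual one: a single parsed record is easy to extend, but the database cannot query individual fields without parsing every record.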

> >   list of patches accepted -- every developer probably wants to read
> >     this

Meaning the system has to know what the patch is, when accepted, into what 
version, and so on.  What I'd like to do if possible is to carry forward 
patches as objects from version to version, so that the scm user can apply a 
patch to version 2.4.16 and remove it, perhaps after it's mutated a little, 
from version 2.4.19.  For now, the most practical way to do this is just keep 
the patch verbatim in the database (along with the who/when/etc information) 
and let the user figure out what has to be done to revert it later.  Hmm, 
yes, that's easy, and it's what you want I strongly suspect.

The list of patches applied to a particular version is actually very 
important.  Without it, you don't know what to revert.  I've often felt the 
lack of this kind of information.
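
A rough sketch of keeping the patch verbatim alongside the who/when information, and of listing the patches applied to a version. The schema and function names are hypothetical, and sqlite3 is used here only as a self-contained stand-in for the real database.

```python
import sqlite3

# Hypothetical schema; sqlite3 stands in for the real database so the
# sketch runs on its own.
SCHEMA = """
CREATE TABLE patch (
    id          INTEGER PRIMARY KEY,
    author      TEXT NOT NULL,   -- who
    received    TEXT NOT NULL,   -- when, as an ISO 8601 string
    applied_to  TEXT NOT NULL,   -- version the patch was applied to
    body        BLOB NOT NULL    -- the patch exactly as received
)
"""

def open_store():
    """Create an in-memory store with the patch table."""
    db = sqlite3.connect(":memory:")
    db.execute(SCHEMA)
    return db

def import_patch(db, author, received, applied_to, body):
    """Store one patch verbatim and return its id."""
    cur = db.execute(
        "INSERT INTO patch (author, received, applied_to, body) "
        "VALUES (?, ?, ?, ?)",
        (author, received, applied_to, body),
    )
    return cur.lastrowid

def patches_for_version(db, version):
    """List (id, author, body) for every patch applied to a version."""
    rows = db.execute(
        "SELECT id, author, body FROM patch WHERE applied_to = ? ORDER BY id",
        (version,),
    )
    return rows.fetchall()
```

Because the body is stored untouched, the user can pull the original patch back out when working out how to revert it.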

Anyway, this feature is what BitKeeper would call 'import patch', except that 
Prophesy is going to remember more about the imported patch than BitKeeper 
does, will keep the patch in its database, and will let you revert it without 
having to find the original copy on disk.

> >   list of small changes within a patch -- many programmers probably
> >     want to read this

Right, so when Prophesy parses out the patch (we don't need to use patch to 
do this any more, because of the parser I wrote) it will save the patch 
header as metadata, assuming it's a description.  The Prophesy user can edit 
this and mark it up so that it can generate a nice-looking listing of patch 
details (realistically, nobody ever edits these details, but it's nice to 
know you could).
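
A minimal sketch of separating the patch header from the diff, assuming that everything before the first line that looks like the start of a diff is the description. This is not the parser mentioned above; the function name and the list of prefixes are assumptions.

```python
def split_patch(text):
    """Split a patch into (description, diff).

    Everything before the first line that starts a diff is treated as
    the description. Which prefixes start a diff is an assumption, and a
    description line beginning with one of them would be misread.
    """
    starts = ("diff ", "--- ", "Index: ")
    lines = text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if line.startswith(starts):
            return "".join(lines[:i]).strip(), "".join(lines[i:])
    # No diff found: the whole text is description.
    return text.strip(), ""
```

The description half is what would be saved as editable metadata; the diff half is left byte-for-byte as it arrived.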

> >   diff for an actual patch -- probably don't need to read it unless 
> >     I'm actually working in the area

Right, since the actual diff is compressed into the database, the web 
interface could pull it up for you.

> > Perhaps there are some other levels, but you get the idea.  I think
> > the recursive nature is very important.  The key job of the SCM system
> > is to help programmers manage the history of development of the
> > project.
> > 
> > Just keeping a GNU-style ChangeLog can be pretty useful even without
> > SCM.
> > 
> > Autogenerating a NEWS file by pulling out top-level comments would be
> > great, because it's one of the most useful tools to a user or
> > satellite developer.

Yes, here you'd have to convince your submitters to mark up their patches, or 
you'd have to do it yourself.  Taking the email subject line by default would 
be a good start.
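
Taking the subject line as the default could look roughly like this, using Python's standard email package. Stripping leading bracketed tags such as "[PATCH]" is an assumption about how submitters label their mail, and the function name is hypothetical.

```python
import re
from email import message_from_string

# Leading bracketed tags such as "[PATCH]" or "[PATCH 2/3]" are dropped.
# The tag convention is an assumption, not something the tool defines.
_TAGS = re.compile(r"^\s*(\[[^\]]*\]\s*)+")

def default_summary(raw_mail):
    """Return the mail's subject line, minus leading tags, as a summary."""
    msg = message_from_string(raw_mail)
    subject = msg.get("Subject", "")
    subject = " ".join(subject.split())  # collapse folded whitespace
    return _TAGS.sub("", subject)
```

A maintainer could then overwrite the default wherever the subject line turns out to be uninformative.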

> > Offline operation is crucial.  Most projects don't have everybody on a
> > LAN.  Open source is inherently distributed.  Time costs here will
> > drastically outweigh anything you can do with a database, etc, on the
> > server.

The database is installed and runs locally.  Operation is offline by default.

> > Arch makes every download of the product a potential working
> > directory.  I don't think it's necessary to keep the entire history in
> > every tarball, but it is perhaps good to keep references that tie the
> > files to their place in history.

That's right, for every repository there's a working directory.  The 
repository database lives in the root of the working directory.  By the way, 
Prophesy is not so rude as to force an additional top level directory on top 
of the normal top directory as BitKeeper and other systems do.

> > It would, by extension, be nice to allow all downloads to happen over
> > http/ftp,

As with Subversion, distributed access will be provided in the form of an 
Apache module.  Providing an ftp view as well would be very nice.

> > and all submissions to happen by mail to a maintainer.  The
> > program should not require any intelligence in the protocol.

Right.  We want to integrate Rasmus's patchbot work.

> > People shouldn't need permission to start hacking on a project, and to
> > keep versions locally.  They just need permission to commit to the
> > master site.

True, and permission to transmit to the remote site is an entirely different 
thing, and should be easier to get than permission to commit to the remote 
site.

By the way, there will not be any 'master' site, only remote sites, i.e., 
Prophesy is peer-to-peer.

> > diffs have this nice property of being intelligible to humans and
> > programs.  Keep them.  Make minimal changes to handle chmod, mv, etc.

Right, keep the ability to parse them and generate them, but don't use them 
internally; they're inappropriate for that.  Except that Prophesy will 
archive the diff in its original form, as received.  I suppose that for 
symmetry we should allow diffs to be sent to be archived as well, complete 
with descriptive comments etc.

> > All other things being equal, files should be directly human-readable.
> > Use diffs.  Perhaps make ChangeLogs, or something similar, part of the
> > metadata.  (On the other hand, being readable might encourage editing
> > by hand, which would be bad.)

Using diffs internally in the database is out of the question.  They're just 
not an appropriate currency for the kinds of manipulations Prophesy has to do.

> > Writing new filesystems, diff formats, network protocols, etc is just
> > screwing around.

I agree about the network protocols, but not about the filesystem magic and 
the internal storage format.  Particularly with regard to the latter, look at 
the research that's been done.  There's a reason for it: archive size and 
efficiency of common operations are very real problems.  Not to mention 
accuracy and power.  These things depend very much on the solidity of the 
foundation on which the superstructure stands.

> > The heart of the problem is to get a good model for
> > *how to do SCM*.  You can implement (v1) using existing tools;
> > optimize later if it turns out that your model is correct.

Well actually, by parsing diffs to get the transforms that's exactly what I'm 
doing.  (And it turns out that doing a proper binary diff isn't that hard.)  
Python, PostgreSQL, Glade, etc., are all 'existing tools'.  What other 
existing tools would you suggest?  Not patch.  It's much easier and faster to 
apply database deltas with the already-implemented transform mechanism.  
Later, when we get to merging, patch or a patch-like thing will be needed, 
and then we'll probably start with patch and move to something faster/more 
powerful/more reliable later.

> > Similarly, don't waste time writing GUIs; use emacs, xxdiff, dirdiff,
> > etc.  Write one later if it proves correct.

Agreed there.  However, once the basic transport mechanism is in place, a 
GUI will follow very shortly afterwards, to show the version tree.

> > If I was starting from scratch, I would consider a typical open source
> > project:
> > 
> >  - email is key
> > 
> >  - people mail around patches; perhaps they get revised; eventually
> >    they get applied
> > 
> >  - the NEWS file says "applied patch for foofeature from
> >    jhacker at dot.com"

Yes indeed, we can and will automate that.

> > Projects sometimes split off files or subdirectories into other
> > projects; perhaps they diverge slightly.  It would be nice to handle
> > this.

Yes, a source tree should be able to inherit files from another project, and 
Prophesy should treat these files as descending from the same object.  Each 
file object can have its own evolutionary tree, and these trees are not the 
same or restricted at all by the version tree or project boundaries.  
Furthermore, we should be able to recognize that one object is identical to 
another in a remote tree, or had a common ancestor.  This touches on the 
subject of universal object ids, which I mentioned earlier in the archives, 
and I have not forgotten about it.  First things first, though.
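
One possible way to recognize that an object is identical to one in a remote tree is to derive an id from its contents. The sketch below assumes a plain SHA-256 hash; it is not the universal object id scheme referred to above, and it says nothing about the common-ancestor case, which needs history rather than content.

```python
import hashlib

def content_id(data):
    """Return a hex id derived only from the bytes of an object."""
    return hashlib.sha256(data).hexdigest()

def same_object(local, remote):
    """True if two objects have identical contents, judged by their ids."""
    return content_id(local) == content_id(remote)
```

Only the ids need to cross the network for the comparison, which suits the peer-to-peer arrangement.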

> > For rsync and other projects, I keep patches that I have not yet
> > really accepted but that look good in CVS in patches/.  A SCM system
> > that managed this would be nice.  I think it's a promising model, not
> > a hack.
> > 
> > Disk is cheap.  Keep everything.

But keep it as compactly as you can.  It's not that cheap.  I have 7 gig of 
source on my laptop and several times that on my server.  Most of that 
consists of kernel trees, all slightly different versions, or different 
projects in them.  That's just silly.

> > Networks are getting broader, but latency is not going to go away.
> > 
> > Do it in <4000lines.  Lions-book Unix was 10kloc, and look how many
> > good ideas they had in there.

I suppose the first useful version will be about that size (4K lines).

-- 
Daniel