From phillips at bonn-fries.net Sun Jun 2 09:58:38 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Sun, 2 Jun 2002 01:58:38 +0200
Subject: [Prophesy] Re: Improved string transformation
In-Reply-To: <20020531135941.D4082@jaquet.dk>
References: <20020531135941.D4082@jaquet.dk>
Message-ID: 

On Friday 31 May 2002 13:59, Rasmus Andersen wrote:
> Like with dnotify, I think that the grouping and manageability of changes coming through a magic FS is going to suffer.

Sorry, I must have missed your reasoning about this, could you please elaborate?

> And I think that this is one of the cardinal weak points in CVS, and thus one where we should aim for being strong.
>
> But I have no good ideas on how to handle this and still get transparency.

Ah, I see what you're getting at. OK, we are not going to rely *only* on the transparent editing interface, but on other means of feeding the scm as well, to supply the other needed information, which will be kept with the deltas in the database. I envision a graphical user interface running while the scm is running, which gives a view of the source tree and lets you walk over it to add comments, tags etc. Of course we will provide a command shell way of doing these things as well.

While we're on that topic, we want to make the SCM an embeddable object, so that both the gui and the command interface simply invoke the scm methods. I guess we can rely on Python to handle that aspect for us, and so not get stuck in some sticky tarpit like Corba, or COM, or building our own object embedding protocol.

Speaking of gui, I think Glade should be the tool, the only realistic alternative being QT/KDE, and while I do like the latter a lot in terms of sheer usability, I also like the faster startup and lower resource usage of GTK. Furthermore I'm familiar with Glade, and I like the idea of being able to separate out the interface definition into an XML object.

So, now I'm going to take a quick look at how Glade and Python play together.

-- 
Daniel

p.s., I prefer being cc'd on replies to the list; that way a copy shows up in my inbox, which is more convenient than checking all the mailing lists I'm subscribed to.

From phillips at bonn-fries.net Sun Jun 2 16:23:53 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Sun, 2 Jun 2002 08:23:53 +0200
Subject: [Prophesy] Thoughts on lineage and derivation
Message-ID: 

I've got a few thoughts on database structure that I've been meaning to jot down, so here goes.

First, I'd like to start using the term 'version' where previously I've been saying 'node'. It's a lot more descriptive of what we really have in the database. I'll just say 'node' when I'm talking about graph theory in general.

Speaking of graph theory, I realized that what we have in the database isn't a tree of versions at all, it's an arbitrary connected graph. It's quite a stretch to find strict trees in the real world of code development - cross-pollination makes short work of that misapprehension. The only thing that makes it look somewhat like a tree is genealogy, and see the above remark on cross-pollination. We could say at least that it's a non-cyclic graph because time only goes in one direction, but even that gets confused sometimes. Just try importing some old code and see if time always goes forward or not. So let's design everything based on no presumption of strict graph structure.

One thing that does impose a little order on the situation is that there is only one order in which changes are applied to the database. That's a simple matter of incrementing a change number every time a change is applied. We won't rely on that for much more than auditing and reporting though, since it's too restrictive. Just for fun, we'll allow changes to be applied to any version in the tree, and yes, that can create various sorts of inconsistencies, but instead of denying that such things can happen, we'll just record the fact that those inconsistencies exist in the database, and somebody can attempt to clean them up later. We do not necessarily have to forget about the good old consistent version at the affected point in the database, and arguably we should never forget a version that's an 'interior' version anyway. (An 'interior' version is one from which at least one later version was derived.)

For that matter, it's a mistake to think of derivation along a single line, or even a single tree. In fact, there are many objects that make up each version, and any of them can show lineage and be derived from, not even necessarily in the same version. So lineage and inheritance are a lot more complex than they seem at first glance. What's going to save us from getting confused are the object ids. For any given object, typically a single source file, we will be able to trace exact lineage and derivations from it, and those will form a strict tree. (Um, unless we allow objects to be made up of other objects, which I think we do.)

Notice how using an object id as a handle for a file object neatly answers the question of how to handle renames. The name (complete with path) is just an attribute of the file object, and can change from version to version, just as the file text can.
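To make that concrete, a per-object record might hold something like the sketch below. The field names are invented purely for illustration - this is not a settled schema.

	/* Hypothetical record for one file object in one version. */
	struct file_object {
		unsigned long long oid;  /* universally unique object id */
		unsigned version;        /* the version this instance belongs to */
		unsigned parent;         /* the version this instance derives from */
		const char *name;        /* full path: just an attribute, so a rename
		                            is an ordinary attribute change */
	};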
-- 
Daniel

From phillips at bonn-fries.net Sun Jun 2 17:50:02 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Sun, 2 Jun 2002 09:50:02 +0200
Subject: [Prophesy] Diff to transform conversion
Message-ID: 

Here's the algorithm for diff to transform conversion. We will not try to generate the 'move' operation for the time being.

For convenience, the 'emit' operation described below will accumulate and merge sequences of the same operation into a single operation. This just requires remembering what the last operation was, and how long it is (the start is just a negative offset from the current input position). When the operation being emitted is not the same as the last operation emitted, the appropriate operation is appended to the operation string. Finally, after processing the entire diff, a last copy is emitted. Zero length operations are discarded.

The basic idea is to process the input text sequentially. We will keep track of both the current input line number and byte position. We don't have to look at the output text at all, except that we may wish to actually apply the patch to ensure that the result of applying the patch or the transform is the same.

Below, when we copy or skip a line, or emit a line of text, we also account for the trailing end-of-line. Strange things happen if there is no end-of-line at the end of the input, output or diff file. Worry about that later.

The algorithm proper:

  Find the beginning of a patch. The pattern is:

    "---"
    "+++"

  While the next text is "@@" (beginning of chunk):

    Get the input line number and count, and the output line number and count from the chunk header line. Ensure the line numbers are monotonically increasing. (The output line and count are not used in the algorithm below, but could be used for error checking.)

    Emit a copy from the current input position to the chunk's input line number, and advance the input position to the chunk's input line number.

    For each line of the chunk, until the current input line equals the chunk's input line number plus the chunk's input line count:

      If the line begins with '-', emit a skip as long as the line
      If the line begins with '+', emit a text as long as the line
      If the line begins with ' ', emit a copy as long as the line

  Finally, emit a copy from the current input position to the end of the input text, and flush it to the operation string.
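Reduced to code, one chunk of that loop might look roughly like the following self-contained toy. The opcode names are invented, and the 'operation string' is simply printed as op/length pairs rather than packed into a real binary encoding:

	#include <stdio.h>
	#include <string.h>

	enum { NONE, COPY, SKIP, TEXT };
	static const char *opname[] = { "none", "copy", "skip", "text" };
	static int last = NONE;
	static unsigned run;

	/* Accumulate runs of the same operation; flush when the op changes. */
	static void emit(int op, unsigned len)
	{
		if (op != last) {
			if (run) /* zero length operations are discarded */
				printf("%s %u\n", opname[last], run);
			last = op;
			run = 0;
		}
		run += len;
	}

	int main(void)
	{
		const char *input[] = { "line one", "line two", "unchanged line",
			"old line", "another unchanged line" };
		const char *diff[] = { /* one chunk, header first */
			"@@ -3,3 +3,3 @@",
			" unchanged line",
			"-old line",
			"+new line",
			" another unchanged line" };
		unsigned in_line, in_count, out_line, out_count, line, bytes = 0;

		sscanf(diff[0], "@@ -%u,%u +%u,%u @@",
		       &in_line, &in_count, &out_line, &out_count);
		/* copy from the current input position to the chunk's input line */
		for (line = 1; line < in_line; line++)
			bytes += strlen(input[line - 1]) + 1; /* +1 for end-of-line */
		emit(COPY, bytes);
		for (int i = 1; i <= 4; i++) {
			/* strlen counts the tag character, which happens to stand
			   in exactly for the trailing end-of-line we account for */
			unsigned len = strlen(diff[i]);
			switch (diff[i][0]) {
			case ' ': emit(COPY, len); break;
			case '-': emit(SKIP, len); break;
			case '+': emit(TEXT, len); break;
			}
		}
		/* the final copy to the end of the input is zero bytes here, so
		   it is dropped; emitting a different op flushes the pending run */
		emit(NONE, 0);
		return 0;
	}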
I think I'll try coding this in C, with the help of the regex library, though I know it would be easier in Python, and Rasmus has already written some nice regexes in Python for handling diffs. However, the transform applying code is already in C, so the diff parsing code might as well be too. Of course it means that another job coming up very soon is: figuring out how to interface Python to C functions.

-- 
Daniel

From phillips at bonn-fries.net Sun Jun 2 21:14:43 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Sun, 2 Jun 2002 13:14:43 +0200
Subject: [Prophesy] Diff to transform conversion
In-Reply-To: 
References: 
Message-ID: 

On Sunday 02 June 2002 09:50, I wrote:
> I think I'll try coding this in C, with the help of the regex library...

The regex library was a big fat disappointment:

  - Cannot apply a regex across more than one line.
  - Only matches zero-terminated strings.

The latter restriction means it's no good for matching against part of a string. Come on guys, I thought Unix was designed by computer scientists, not schoolchildren. OK, next step is to just hand code it. If you want a job done properly...

-- 
Daniel

From rasmus at jaquet.dk Mon Jun 3 16:58:20 2002
From: rasmus at jaquet.dk (Rasmus Andersen)
Date: Mon, 3 Jun 2002 08:58:20 +0200
Subject: [Prophesy] Re: Improved string transformation
In-Reply-To: ; from phillips@bonn-fries.net on Sun, Jun 02, 2002 at 01:58:38AM +0200
References: <20020531135941.D4082@jaquet.dk>
Message-ID: <20020603085819.A22496@jaquet.dk>

On Sun, Jun 02, 2002 at 01:58:38AM +0200, Daniel Phillips wrote:
> > And I think that this is one of the cardinal weak points in CVS, and thus one where we should aim for being strong.
> >
> > But I have no good ideas on how to handle this and still get transparency.
>
> Ah, I see what you're getting at. OK, we are not going to rely *only* on the transparent editing interface, but on other means of feeding the scm as well, to supply the other needed information, which will be kept with the deltas in the database. I envision a graphical user interface running while the scm is running, which gives a view of the source tree and lets you walk over it to add comments, tags etc. Of course we will provide a command shell way of doing these things as well.

I think my point is that at the lowest level of this, at the FS level, can we attach meaning (and comments) to (FS) operations? Some people obsessively save their buffers after each little edit and it would seem that a revision history reflecting this would not be very helpful.

On the other hand, if the SCM merely uses the FS operations to gather knowledge about changed objects[1], then the user would still have to do an explicit 'commit' to make a delta(?) and attach comments. Which isn't that far from what you would do anyway without the magic FS.

Or am I missing something?
>
> While we're on that topic, we want to make the SCM an embeddable object, so that both the gui and the command interface simply invoke the scm methods. I guess we can rely on Python to handle that aspect for us, and so not get stuck in some sticky tarpit like Corba, or COM, or building our own object embedding protocol.

I agree strongly on the embeddable part. And Python plays nicely with embedded C. And the other way around (being embedded) too.

>
> Speaking of gui, I think Glade should be the tool, the only realistic alternative being QT/KDE, and while I do like the latter a lot in terms of sheer usability, I also like the faster startup and lower resource usage of GTK. Furthermore I'm familiar with Glade, and I like the idea of being able to separate out the interface definition into an XML object.
>
> So, now I'm going to take a quick look at how Glade and Python play together.

I am a GUI newbie so if you have experience, you lead the way.

[1] We haven't discussed the basic object in the SCM. Is it a file? A function? A line (of code)? I could see some nice things coming from having smaller granularity than the file one, but since we are aiming at having 'loose' dependencies in the SCM I think we will get those anyway.

Rasmus

From phillips at bonn-fries.net Mon Jun 3 17:44:01 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Mon, 3 Jun 2002 09:44:01 +0200
Subject: [Prophesy] Re: Improved string transformation
In-Reply-To: <20020603085819.A22496@jaquet.dk>
References: <20020603085819.A22496@jaquet.dk>
Message-ID: 

On Monday 03 June 2002 08:58, Rasmus Andersen wrote:
> On Sun, Jun 02, 2002 at 01:58:38AM +0200, Daniel Phillips wrote:
> > > And I think that this is one of the cardinal weak points in CVS, and thus one where we should aim for being strong.
> > >
> > > But I have no good ideas on how to handle this and still get transparency.
> >
> > Ah, I see what you're getting at. OK, we are not going to rely *only* on the transparent editing interface, but on other means of feeding the scm as well, to supply the other needed information, which will be kept with the deltas in the database. I envision a graphical user interface running while the scm is running, which gives a view of the source tree and lets you walk over it to add comments, tags etc. Of course we will provide a command shell way of doing these things as well.
>
> I think my point is that at the lowest level of this, at the FS level, can we attach meaning (and comments) to (FS) operations? Some people obsessively save their buffers after each little edit and it would seem that a revision history reflecting this would not be very helpful.
>
> On the other hand, if the SCM merely uses the FS operations to gather knowledge about changed objects[1], then the user would still have to do an explicit 'commit' to make a delta(?) and attach comments. Which isn't that far from what you would do anyway without the magic FS.
>
> Or am I missing something?

No, good point, and it's one I've thought about, I just neglected to say anything about it. My thinking is that the scm will normally save a transform against a file every time the file is written to for whatever reason, but when you commit, those transforms are collapsed into a single transform. So until you do the commit, you have file-level undo, if you want it. It's just easy to provide this, so why not?
We can also provide an option to leave those transforms un-composed in the database, which will eat a lot of space and probably be useless, but it might be interesting to somebody, and it's easy to do, so again, why not?

> [1] We haven't discussed the basic object in the SCM. Is it a file? A function? A line (of code)?

No, we haven't talked about it much, and it's getting to where we need to do that.

> I could see some nice things coming from having smaller granularity than the file one, but since we are aiming at having 'loose' dependencies in the SCM I think we will get those anyway.

The basic data object will be a transform, according to my current thinking, though other database entities will no doubt emerge as we go. A transform expresses the difference between two strings, and we have not said yet whether the strings are whole files or something else. Clearly, a single transform cannot be larger than a file, but is it useful for it to be smaller? From a pure data storage point of view, no, that doesn't gain a lot, because if we want that, we can still express it with a single transform, and then have a list of regions in the transform that are of special interest, rather than having separate transforms. However, in the process of doing some of the kinds of calculus that I expect we will want to do, I think we will want to generate transforms, on the fly, that are smaller than files, i.e., partition transforms into regions that reflect, say, the boundaries of a patch that we are trying to merge.

I'm personally not a great fan of line boundaries, as I believe they reduce generality. However, we need to deal with them at times, especially when interfacing to diff. They're likely to figure in our merge algorithms as well, since they tend to be conceptually significant from the user's point of view. But as far as letting them invade the data design - there's no need, and by being strict about that, the end result will be much more useful for handling binary files as well.

A practical question is whether we're going to version directories. I mentioned the idea that each file object would have an id (which is universally unique) and the name of the file would be metadata associated with the object (i.e., an attribute of the object). However, we will need to look up files rapidly by name, for example, when a file is changed and a transform needs to be recorded against it in the database. This can of course be handled efficiently by appropriate use of database indexing.

We may sometimes want to traverse the database in directory order, perhaps when producing a diff between two tree versions. Does this mean we want to record directories as objects? I don't know yet. It may be enough just to compute the directories on the fly.

Drifting further in that direction, the question arises of how much filesystem structure we want to support in the scm. Do we want to support symlinks? I think we do. Hard links? Good question. Device nodes? Hmm. If we support all of the above, then what we have is more general than a source code versioning system, it's actually a versioning filesystem. That's something to think about. However, right now I'll be satisfied aiming at something with more modest goals.
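Coming back to transforms for a moment, here is a toy model of the collapsing mentioned earlier in this thread - composing two transforms into one with the same effect. To be clear, this is an illustration, not the project's code: the real transform.c encodes operations as a byte string, while this sketch uses an array of copy/skip/text ops with invented names, and it assumes the second transform's input length equals the first one's output length.

	#include <assert.h>
	#include <stdio.h>
	#include <string.h>

	enum optype { COPY, SKIP, TEXT };
	struct op { enum optype type; unsigned len; const char *text; };

	/* Apply a transform to 'in', writing to 'out'; returns output length. */
	static unsigned apply(const struct op *t, int n, const char *in, char *out)
	{
		unsigned o = 0;
		for (int i = 0; i < n; i++, t++)
			switch (t->type) {
			case COPY: memcpy(out + o, in, t->len); o += t->len; in += t->len; break;
			case SKIP: in += t->len; break;
			case TEXT: memcpy(out + o, t->text, t->len); o += t->len; break;
			}
		return o;
	}

	/* Build c so that applying c equals applying a, then b.  b consumes the
	   output of a, so walk b's ops, slicing a's ops to match.  Adjacent ops
	   of the same type are not merged here, for brevity. */
	static int compose(const struct op *a, int na, const struct op *b, int nb,
			   struct op *out)
	{
		int n = 0, ai = 0;
		unsigned used = 0;            /* bytes of a[ai] already consumed */

		for (int bi = 0; bi < nb; bi++) {
			if (b[bi].type == TEXT) { /* insertions pass straight through */
				out[n++] = b[bi];
				continue;
			}
			unsigned need = b[bi].len;
			while (need) {
				while (a[ai].type == SKIP) /* no output; keep them */
					out[n++] = a[ai++];
				unsigned take = a[ai].len - used;
				if (take > need)
					take = need;
				if (b[bi].type == COPY)    /* keep bytes: slice a's op */
					out[n++] = (struct op){ a[ai].type, take,
						a[ai].type == TEXT ? a[ai].text + used : NULL };
				else if (a[ai].type == COPY)
					out[n++] = (struct op){ SKIP, take, NULL };
				/* else b skips bytes that a inserted: they just vanish */
				used += take;
				need -= take;
				if (used == a[ai].len) {
					ai++;
					used = 0;
				}
			}
		}
		while (ai < na && a[ai].type == SKIP)  /* trailing skips */
			out[n++] = a[ai++];
		return n;
	}

	int main(void)
	{
		const char base[] = "the quick brown fox";
		/* a: "the quick brown fox" -> "the quick red fox" */
		struct op a[] = {{ COPY, 10, 0 }, { SKIP, 6, 0 }, { TEXT, 4, "red " }, { COPY, 3, 0 }};
		/* b: "the quick red fox" -> "a quick red fox" */
		struct op b[] = {{ TEXT, 1, "a" }, { SKIP, 3, 0 }, { COPY, 14, 0 }};
		struct op c[16];
		char t1[64], t2[64], t3[64];

		t1[apply(a, 4, base, t1)] = 0;
		t2[apply(b, 3, t1, t2)] = 0;
		t3[apply(c, compose(a, 4, b, 3, c), base, t3)] = 0;
		printf("%s\n%s\n", t2, t3);   /* both: "a quick red fox" */
		assert(!strcmp(t2, t3));
		return 0;
	}

Applying a and then b gives the same text as applying the composed transform directly to the base string, which is exactly the property commit-time collapse needs.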
-- 
Daniel

From rasmus at jaquet.dk Mon Jun 3 18:06:38 2002
From: rasmus at jaquet.dk (Rasmus Andersen)
Date: Mon, 3 Jun 2002 10:06:38 +0200
Subject: [Prophesy] Re: Improved string transformation
In-Reply-To: ; from phillips@bonn-fries.net on Mon, Jun 03, 2002 at 09:44:01AM +0200
References: <20020603085819.A22496@jaquet.dk>
Message-ID: <20020603100638.A22744@jaquet.dk>

On Mon, Jun 03, 2002 at 09:44:01AM +0200, Daniel Phillips wrote:
> > On the other hand, if the SCM merely uses the FS operations to gather knowledge about changed objects[1], then the user would still have to do an explicit 'commit' to make a delta(?) and attach comments. Which isn't that far from what you would do anyway without the magic FS.
> >
> > Or am I missing something?
>
> No, good point, and it's one I've thought about, I just neglected to say anything about it. My thinking is that the scm will normally save a transform against a file every time the file is written to for whatever reason, but when you commit, those transforms are collapsed into a single transform. So until you do the commit, you have file-level undo, if you want it. It's just easy to provide this, so why not? We can also provide an option to leave those transforms un-composed in the database, which will eat a lot of space and probably be useless, but it might be interesting to somebody, and it's easy to do, so again, why not?

OK then. This would seem to be a reasonable middle ground. If it wasn't for you having some FS experience already, I would probably think the magic FS way too complex for what it is buying us/the user.

Can we do this without a kernel patch? A kernel patch may be a bit too much for many who just want to dip their toes.

> > I could see some nice things coming from having smaller granularity than the file one, but since we are aiming at having 'loose' dependencies in the SCM I think we will get those anyway.
>
> [snip things about having files as basic versioning object]
>
> A practical question is whether we're going to version directories. I mentioned the idea that each file object would have an id (which is universally unique) and the name of the file would be metadata associated with the object (i.e., an attribute of the object). However, we will need to look up files rapidly by name, for example, when a file is changed and a transform needs to be recorded against it in the database. This can of course be handled efficiently by appropriate use of database indexing.
>
> We may sometimes want to traverse the database in directory order, perhaps when producing a diff between two tree versions. Does this mean we want to record directories as objects? I don't know yet. It may be enough just to compute the directories on the fly.

Another related thing is, how do we group changes to achieve logically connected changes, aka changesets in BK terminology? I guess that would be by explicit operations in the GUI/command line thingie operating on deltas?

> Drifting further in that direction, the question arises of how much filesystem structure we want to support in the scm. Do we want to support symlinks? I think we do. Hard links? Good question. Device nodes? Hmm. If we support all of the above, then what we have is more general than a source code versioning system, it's actually a versioning filesystem. That's something to think about. However, right now I'll be satisfied aiming at something with more modest goals.
Rik van Riel and Larry had some thoughts about using magic FSes to do the job a while back... Here we go. It's kinda sketchy but some stuff can be had:

http://search.luky.org/linux-kernel.2001/msg25061.html

Rasmus

From phillips at bonn-fries.net Mon Jun 3 18:55:56 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Mon, 3 Jun 2002 10:55:56 +0200
Subject: [Prophesy] Re: Improved string transformation
In-Reply-To: <20020603100638.A22744@jaquet.dk>
References: <20020603100638.A22744@jaquet.dk>
Message-ID: 

On Monday 03 June 2002 10:06, Rasmus Andersen wrote:
> On Mon, Jun 03, 2002 at 09:44:01AM +0200, Daniel Phillips wrote:
> > > On the other hand, if the SCM merely uses the FS operations to gather knowledge about changed objects[1], then the user would still have to do an explicit 'commit' to make a delta(?) and attach comments. Which isn't that far from what you would do anyway without the magic FS.
> > >
> > > Or am I missing something?
> >
> > No, good point, and it's one I've thought about, I just neglected to say anything about it. My thinking is that the scm will normally save a transform against a file every time the file is written to for whatever reason, but when you commit, those transforms are collapsed into a single transform. So until you do the commit, you have file-level undo, if you want it. It's just easy to provide this, so why not? We can also provide an option to leave those transforms un-composed in the database, which will eat a lot of space and probably be useless, but it might be interesting to somebody, and it's easy to do, so again, why not?
>
> OK then. This would seem to be a reasonable middle ground. If it wasn't for you having some FS experience already, I would probably think the magic FS way too complex for what it is buying us/the user.
>
> Can we do this without a kernel patch? A kernel patch may be a bit too much for many who just want to dip their toes.

By rights, a generic method for accomplishing such a thing should already have been merged, but sadly that isn't the case - or perhaps fortunately, if the official interface would have been less than ideal. In any event, I'm not shy about constructing such a thing if it's needed, and I can assure you it will be elegant and efficient. As far as a kernel patch goes, I think it will only be a module, and that module will be small, since most of the work will be done in user land.

No, we absolutely can't do this without involving the kernel, and no standard mechanism exists in Linux at the moment for doing this. Plan9 has 9P, a network protocol, precisely for such a purpose, however I'd rather bypass the network and do a tight little local interface. If I decide the network interface is really the right way to do it, or just want to be lazy, the uservfs project already exists, and is being maintained, I believe. It isn't in kernel though, and it depends on Coda, which is another whole big piece, so I'm not that enthusiastic about it. I'd rather just define a nice interface that exports the vfs securely and racelessly to user space via the various nice methods we have available. It doesn't have to be particularly general either, to get us going. I consider this a fairly easy project and a chance to get some experience with some of the ipc mechanisms I haven't done a lot with to date, such as signals.
There is another, simpler method, and the one I propose to use for initial testing: simply issue all edit commands and other file manipulations, such as rename, patch etc., from a python shell, which will take care of the needed preserving of data and calls to the scm. This gives us a quick start so we don't have to get bogged down in the details of filesystem exporting, and others who just want to take a test drive might find this method useful as well. There's no question in my mind, though, that the magic filesystem is the best interface.

> > > I could see some nice things coming from having smaller granularity than the file one, but since we are aiming at having 'loose' dependencies in the SCM I think we will get those anyway.
> >
> > [snip things about having files as basic versioning object]
> >
> > A practical question is whether we're going to version directories. I mentioned the idea that each file object would have an id (which is universally unique) and the name of the file would be metadata associated with the object (i.e., an attribute of the object). However, we will need to look up files rapidly by name, for example, when a file is changed and a transform needs to be recorded against it in the database. This can of course be handled efficiently by appropriate use of database indexing.
> >
> > We may sometimes want to traverse the database in directory order, perhaps when producing a diff between two tree versions. Does this mean we want to record directories as objects? I don't know yet. It may be enough just to compute the directories on the fly.
>
> Another related thing is, how do we group changes to achieve logically connected changes, aka changesets in BK terminology? I guess that would be by explicit operations in the GUI/command line thingie operating on deltas?

Right, and I'd like to expose the full power of SQL for this purpose, while also supporting other methods of course, such as remembering the regions affected by imported patch sets, or indeed, remembering enough information to reconstruct each patch set exactly. Let's call that information 'scope', and we want to carry scope information in a precise way in the database. In general, the scopes of changes should not overlap, but when they do, we need to record exactly how. Overlapping scope results either in ordering dependencies, or conflicts. In either case, we need to record just what those dependencies or conflicts are.

> > Drifting further in that direction, the question arises of how much filesystem structure we want to support in the scm. Do we want to support symlinks? I think we do. Hard links? Good question. Device nodes? Hmm. If we support all of the above, then what we have is more general than a source code versioning system, it's actually a versioning filesystem. That's something to think about. However, right now I'll be satisfied aiming at something with more modest goals.
>
> Rik van Riel and Larry had some thoughts about using magic FSes to do the job a while back... Here we go. It's kinda sketchy but some stuff can be had:
>
> http://search.luky.org/linux-kernel.2001/msg25061.html

Yes, there you go. 'Obviously right'. Except I don't want to involve the network, that just doesn't make any sense to me.
-- 
Daniel

From phillips at bonn-fries.net Mon Jun 3 22:09:14 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Mon, 3 Jun 2002 14:09:14 +0200
Subject: [Prophesy] Re: Improved string transformation
In-Reply-To: <20020603085819.A22496@jaquet.dk>
References: <20020603085819.A22496@jaquet.dk>
Message-ID: 

On Monday 03 June 2002 08:58, Rasmus Andersen wrote:
> > Speaking of gui, I think Glade should be the tool, the only realistic alternative being QT/KDE, and while I do like the latter a lot in terms of sheer usability, I also like the faster startup and lower resource usage of GTK. Furthermore I'm familiar with Glade, and I like the idea of being able to separate out the interface definition into an XML object.
> >
> > So, now I'm going to take a quick look at how Glade and Python play together.
>
> I am a GUI newbie so if you have experience, you lead the way.

Here's a tutorial:

http://www.ics.uci.edu/~xge/python-glade/python-glade.html

-- 
Daniel

From phillips at bonn-fries.net Tue Jun 4 00:56:55 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Mon, 3 Jun 2002 16:56:55 +0200
Subject: [Prophesy] Re: Improved string transformation
In-Reply-To: 
References: <20020603085819.A22496@jaquet.dk>
Message-ID: 

On Monday 03 June 2002 14:09, Daniel Phillips wrote:
> On Monday 03 June 2002 08:58, Rasmus Andersen wrote:
> > > Speaking of gui, I think Glade should be the tool, the only realistic alternative being QT/KDE, and while I do like the latter a lot in terms of sheer usability, I also like the faster startup and lower resource usage of GTK. Furthermore I'm familiar with Glade, and I like the idea of being able to separate out the interface definition into an XML object.
> > >
> > > So, now I'm going to take a quick look at how Glade and Python play together.
> >
> > I am a GUI newbie so if you have experience, you lead the way.
>
> Here's a tutorial:
>
> http://www.ics.uci.edu/~xge/python-glade/python-glade.html

But it was a little terse, and oriented towards python 1.5 (I'm using 2.1, for which the postgres database interface of choice is written). Here's a much nicer one:

http://www.icon.co.za/~zapr/Project1.html

In fact, all that's required is 'google python glade'.

-- 
Daniel

From rasmus at jaquet.dk Tue Jun 4 21:53:40 2002
From: rasmus at jaquet.dk (Rasmus Andersen)
Date: Tue, 4 Jun 2002 13:53:40 +0200
Subject: [Prophesy] Re: Improved string transformation
In-Reply-To: ; from phillips@bonn-fries.net on Fri, May 31, 2002 at 12:05:26PM +0200
References: <20020531102806.B3135@jaquet.dk>
Message-ID: <20020604135340.C29724@jaquet.dk>

On Fri, May 31, 2002 at 12:05:26PM +0200, Daniel Phillips wrote:
> As far as change overviews go, I think I'm a long way from even thinking about that. A lot more of the basic ideas have to be in place first. Having a full database around that we can do arbitrary queries on should help quite a lot.

Just a random thought I stumbled across: Since you want to store the transforms in the DB, what are you doing queries on here? Comments? AFAICS, it would not be feasible to do SQL queries on the stored transforms.
Rasmus

From phillips at bonn-fries.net Tue Jun 4 23:57:38 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Tue, 4 Jun 2002 15:57:38 +0200
Subject: [Prophesy] Re: Improved string transformation
In-Reply-To: <20020604135340.C29724@jaquet.dk>
References: <20020604135340.C29724@jaquet.dk>
Message-ID: 

On Tuesday 04 June 2002 13:53, Rasmus Andersen wrote:
> On Fri, May 31, 2002 at 12:05:26PM +0200, Daniel Phillips wrote:
> > As far as change overviews go, I think I'm a long way from even thinking about that. A lot more of the basic ideas have to be in place first. Having a full database around that we can do arbitrary queries on should help quite a lot.
>
> Just a random thought I stumbled across: Since you want to store the transforms in the DB, what are you doing queries on here? Comments? AFAICS, it would not be feasible to do SQL queries on the stored transforms.

The transform is one field of a record whose primary index is, most likely, object id, assuming the object is an entire file. There may be other as yet undetermined fields; for instance, we may want to group related transforms together into changes, and comments would be attached to changes. As far as what we can query, sure, it doesn't make sense to do an SQL query on a transform itself, but we can quickly generate strings from source+transforms and do queries on those. Perhaps the result of the query would be used to go back and reorganize the transforms, or perhaps the query will generate a list of regions of interest in the fully expressed source of some version.

-- 
Daniel

From phillips at bonn-fries.net Sat Jun 8 00:17:33 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Fri, 7 Jun 2002 16:17:33 +0200
Subject: [Prophesy] Diff to transform converter
Message-ID: 

Here is a first cut at a unified-diff to transform converter. I considered using a parsing or lexing tool to do this, but in the end I decided to roll my own parser, as the diff syntax is quite simple. I took a look at the Posix regex package but was disappointed to learn that it can only deal with null-terminated strings, which severely limits its usefulness. The scanf library function, on the other hand, turned out to integrate quite well with my little parser. I suppose that is because the diff syntax was originally constructed to be friendly to scanf. Scanf is used to parse and convert the chunk line numbers.

The diff syntax proved to be context-free, that is, it's not necessary to decode any of the line numbers in order to complete the parse. Furthermore, only the input line number of each chunk is needed to generate the transform output. This fact certainly wasn't obvious when I started.

The parser itself is a state/transition machine, which is what all the gotos are about. To hide the parsed text behind a stream abstraction and make the parser concise and readable, three helper macros are used:

  next_ch() - get the next character to parse, returning -1 if at end
  next_is(c) - return true if the next character is c
  skip_to(c) - skip ahead to c or end and return true if found

These macros assume the variables 'string' and 'limit' are within scope, defining the limits of the text to parse. The macros make use of a pair of inlines, _next_is and _skip_to, which properly parameterize all the required state, so that no static variables are used in the parser itself and the end result is thread-safe. (Note: here we see C at its weakest. If it were possible to define the helper functions within the scope of the parser, no macros would be needed and no state would have to be passed.)
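In sketch form, the macros and their inline helpers might look like the following, assuming 'string' is a char pointer and 'limit' points just past the end of the text (the attached transform.c is authoritative; this merely mirrors the description above):

	static inline int _next_is(const char *string, const char *limit, char c)
	{
		return string < limit && *string == c;
	}

	static inline int _skip_to(const char **string, const char *limit, char c)
	{
		while (*string < limit)
			if (*(*string)++ == c)
				return 1; /* stopped just past c */
		return 0;             /* hit the end instead */
	}

	#define next_ch() (string < limit ? (unsigned char)*string++ : -1)
	#define next_is(c) _next_is(string, limit, (c))
	#define skip_to(c) _skip_to(&string, limit, (c))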
In the end, the parser itself, complete with a handful of helper functions, was the smaller part of the project - more than half the code is devoted to generating the transform codes. Code generation is slightly complicated by a mechanism for merging successive operations of the same type together. This in turn requires the literal text for text_op operations to be identified in one place, and used in another, which introduces a new array to check for overflow and expand as necessary, etc. Details like this tend to bulk up code quickly.

While the parser itself is thread-safe, i.e., it uses no static variables, the code generator isn't. It uses a number of static variables and two arrays, mostly concerned with implementing the opcode merging. This needs to be cleaned up at some point by encapsulating all state in a single struct to be shared by the parser and code generator. The code generator itself will not embed nicely inside the parser because it is called from two places, one to output opcodes and the other at the end of the parse to flush out the final, pending opcode.

With the helper functions not inlined and gcc optimization level 2, the parser and code generator come in at less than 2K, a result that warms the heart of an old code miser like me. Inlining only adds another 100 bytes or so, so of course we will do it, to get the performance.

It's pretty much impossible to debug a parser and code generator like this without tracing output, and I have used my usual technique for that. There are three macros that can be used to wrap tracing output statements: trace_on and trace_off, and a further macro, trace, which is defined as one or the other, depending on whether you want tracing output or not. It would be foolish to assume that no further work needs to be done on the parser, so all the tracing code has been left in for now.
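The convention, roughly (the exact definitions in transform.c may differ):

	#define trace_on(cmd) cmd  /* emit the wrapped statement as written */
	#define trace_off(cmd)     /* compile the wrapped statement away */
	#define trace trace_on     /* flip to trace_off to silence tracing */

	/* usage: trace(printf("state %u at %i\n", state, (int)(string - input));) */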
There needs to be more error checking. As it stands, this code should perform its job correctly, on the assumption that the diff text is always correct. It would of course be foolish to assume this. Some of the redundant information in the diff can be used for a crosscheck:

  - The number of copied and skipped lines in each chunk should match the chunk's specified input line count
  - The number of copied and added lines in each chunk should match the chunk's specified output line count
  - The current output line should be tracked and checked against the chunk's output line number
  - Copied and skipped text in the diff should be checked to ensure it matches the corresponding input text
  - Added text could possibly be checked against the original target text, but since the target text is not required for any other purpose, it makes more sense just to test-apply the generated transform to the input text, then ensure it matches the original target text

As suggested above, whenever we generate a transform we want to test-apply it to ensure it does in fact have the same result as applying the diff does, that is, it correctly generates the diff target given the input text and the transform.

Array overflow checking needs to be added, complete with automatic expansion of the arrays as needed.

The current code does not take advantage of the move operation, which, as I noted earlier, is there so that text that is merely moved (or copied) from place to place in a string does not have to be encoded literally - it can always be taken from the input string when a transform needs to be applied. This is merely an optimization, and a difficult one at that. There are other tasks of more immediate importance.

In fact, the whole process of converting a diff to a transform is just a shortcut so that we can start loading the database with string differences without getting bogged down in the details of identifying which sections of text have changed and which have not. Eventually, we do want to go to the effort of implementing a custom algorithm to do this, for a number of reasons:

  - It will be faster and (probably) more reliable than diff
  - It will work at a resolution of less than a line, so that in the common case where only a single name has changed in a line, the redundant, invariant context will not be recorded
  - It will handle binary files just as well as ascii text

Now I should also address the question: why not just use diff, and avoid all the trouble of implementing these transform things? Well:

  - Transforms are considerably more compact than diff files. For example, context and skipped lines from the diff are not encoded (or needed) in the transform.
  - Applying a transform will be significantly faster than applying a diff via patch. This is an operation we will make heavy use of as the repository operations become more sophisticated.
  - The transform code is far less complex than patch and other diff utilities, and hence correspondingly more trustworthy.
  - We will probably want to handle more than one kind of diff syntax, and transforms provide a common storage format.
  - Transforms, because of their simpler structure, are much more suitable for calculus-type operations, such as composition.

A diff can do one thing that a transform cannot do: a diff can be applied in a fuzzy way, that is, the patched text does not have to be exactly the same as the original target text from which the diff was generated. However, we don't require this property just now, since we only want to represent exact differences in the database. Anything else is an error. When we do get to the point of handling fuzzy problems like merging, we will need to build some more tools for the purpose. We will not make the mistake of attempting to use a single tool such as diff for two different purposes, neither of which it is ideally suited to.

Now that I have broken the back of this biggish chunk of work, it's time to contemplate what else needs to be done to get to the point of having some minimally functional repository manager to play with:

  - Finish up this work by wrapping it with some test code to create diffs and cross-check against the generated transforms
  - Wrap the C code as a Python library: http://www.python.org/doc/current/api/api.html
  - Think more about the details of the magic filesystem. This won't have to be a general purpose stackable filesystem, it just has to interface to userland in such a way that file text can be saved before being overwritten, and compared to the changed result when a file is closed.
  - Flesh out some more database format structural detail, so that filenames and directory structure can be tracked and simple metadata such as comments and version names can be recorded
  - Do a little work to improve the python database interface class for record writing, taking advantage of Postgres's "copy file" table loading command (which can be easily emulated with less efficient operations for other databases that don't have it)

As I mentioned previously, the magic filesystem interface doesn't necessarily have to exist before the system can be used: a simple workaround is to run the editing commands from a Python shell, with wrappers to run the required database operations.

The attached code demonstrates the conversion of a simple diff text into a transform. It's all set to compile and run, with tracing on. In the trace output, a character followed by ',' was read by skip_to and a character followed by '?' was tested by next_is. Parse states are printed out at the beginning of a line, and generated operations at the end. Finally, the generated transform is printed byte by byte in hex.

-- 
Daniel

-------------- next part --------------
A non-text attachment was scrubbed...
Name: transform.c
Type: text/x-c
Size: 6867 bytes
Desc: not available
URL: 

From phillips at bonn-fries.net Sat Jun 8 13:03:00 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Sat, 8 Jun 2002 05:03:00 +0200
Subject: [Prophesy] Diff to transform converter
In-Reply-To: 
References: 
Message-ID: 

This is a minor update, correcting the parser to reject diff strings where the '---' sequence does not occur at the beginning of a line.

-- 
Daniel

-------------- next part --------------
A non-text attachment was scrubbed...
Name: transform.c
Type: text/x-c
Size: 6944 bytes
Desc: not available
URL: 

From phillips at bonn-fries.net Sat Jun 8 23:18:57 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Sat, 8 Jun 2002 15:18:57 +0200
Subject: [Prophesy] Diff to transform converter
In-Reply-To: 
References: 
Message-ID: 

Today's update includes detection of overflow in, and automatic expansion of, the two arrays of unpredictable size. To regularize this somewhat messy operation a little, the array handling for one of the two was rewritten as pointer arithmetic, so that both cases fit the model of output into an array with variable base, limit and current position. A common 'expand' function is thus able to handle both cases. This code is almost ready to be pressed into service.

-- 
Daniel

-------------- next part --------------
A non-text attachment was scrubbed...
Name: transform.c
Type: text/x-c
Size: 7721 bytes
Desc: not available
URL: 

From phillips at bonn-fries.net Sun Jun 9 07:06:12 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Sat, 8 Jun 2002 23:06:12 +0200
Subject: [Prophesy] C from Python Example
Message-ID: 

It's about time to try using some C code from Python, to see how well that works out. If it does work out, then I suppose we are on our way. I'm guessing that the work will go several times faster in Python than in C, because of fewer worries about memory allocation and such things.

While the Python C API is well documented, the requirements for compiling and installing Python loadable modules written in C aren't. After a little guesswork and fiddling around I came up with the following simple example, which defines a module with one method that simply returns a copy of the string passed to it.
File foo.c:

	#include "Python.h"

	static PyObject *foobar(PyObject *self, PyObject *args)
	{
		char *string;
		return PyArg_Parse(args, "s", &string)? PyString_FromString(string): NULL;
	}

	static PyMethodDef foo_methods[] = {
		{"bar", foobar},
		{NULL, NULL}
	};

	void initfoo()
	{
		Py_InitModule("foo", foo_methods);
	}

Apparently PyArg_Parse is deprecated, but it works. The new, improved way isn't a lot different, but I haven't tried it yet (a sketch appears at the end of this message). I must say, it's a little frightening that Python actually parses a string on every call to a C function. Surely there is a better way of doing this, for example, parse the strings at module load time. The moral of the story is: don't expect Python to be fast, not with this kind of implementation. Oh well, it still should work out well for this project, since the heavy lifting will be done in tightly coded C.

The example can be compiled, installed as a module, and executed using the script lines:

	cc -shared -I/usr/include/python2.2 python2.c -o foo.so && \
	sudo cp foo.so /usr/lib/python2.2/lib-dynload/ && \
	python2.2 -c "import foo; print foo.bar('test')"

I verified that this example works in python 1.5, 2.1 and 2.2. The python2.2-dev package has to be installed to get the Python.h header file.

I'd like to import the module without copying it to the lib-dynload directory, which is not a good way to develop because it requires root privilege, and anyway, it's an annoying extra step. I'm sure there's a way to do it, but I haven't found it yet.

There is also a fancy system called 'distutils' that builds and installs extension modules. I don't really see why anything fancier than what I've shown here is needed for development.

Reference material is available here:

	http://python.org/doc/current/ext/ext.html "Extending and Embedding the Python Interpreter"
	http://python.org/doc/current/api/api.html "Python/C API Reference Manual"
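For reference, the non-deprecated style mentioned above would look about like this (untested here, as noted): flag the method METH_VARARGS and parse the argument tuple with PyArg_ParseTuple.

	#include "Python.h"

	static PyObject *foobar(PyObject *self, PyObject *args)
	{
		char *string;
		if (!PyArg_ParseTuple(args, "s", &string))
			return NULL;
		return PyString_FromString(string);
	}

	static PyMethodDef foo_methods[] = {
		{"bar", foobar, METH_VARARGS},
		{NULL, NULL, 0}
	};

	void initfoo(void)
	{
		Py_InitModule("foo", foo_methods);
	}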
-- 
Daniel

From phillips at bonn-fries.net Sun Jun 9 18:50:00 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Sun, 9 Jun 2002 10:50:00 +0200
Subject: [Prophesy] Re: C from Python Example
Message-ID: 

On Saturday 08 June 2002 23:06, you wrote:
> The example can be compiled, installed as a module, and executed using the script lines:
>
> 	cc -shared -I/usr/include/python2.2 python2.c -o foo.so && \
> 	                                    ^^^^^^^^^
> 	sudo cp foo.so /usr/lib/python2.2/lib-dynload/ && \
> 	python2.2 -c "import foo; print foo.bar('test')"

Correction:

	cc -shared -I/usr/include/python2.2 foo.c -o foo.so && \
	sudo cp foo.so /usr/lib/python2.2/lib-dynload/ && \
	python2.2 -c "import foo; print foo.bar('test')"

-- 
Daniel

From phillips at bonn-fries.net Sun Jun 9 23:07:17 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Sun, 9 Jun 2002 15:07:17 +0200
Subject: [Prophesy] Diff to transform converter
In-Reply-To: 
References: 
Message-ID: 

On Friday 07 June 2002 16:17, Daniel Phillips wrote:
> There needs to be more error checking. As it stands, this code should perform its job correctly, on the assumption that the diff text is always correct. It would of course be foolish to assume this. Some of the redundant information in the diff can be used for a crosscheck:
>
> - The number of copied and skipped lines in each chunk should match the chunk's specified input line count
> - The number of copied and added lines in each chunk should match the chunk's specified output line count
> - The current output line should be tracked and checked against the chunk's output line number
> - Copied and skipped text in the diff should be checked to ensure it matches the corresponding input text
> - Added text could possibly be checked against the original target text, but since the target text is not required for any other purpose, it makes more sense just to test-apply the generated transform to the input text, then ensure it matches the original target text

A couple of items to round out the list:

  - The input line number of each chunk should be monotonically increasing, and the input chunks should not overlap
  - The output line number of each chunk should be monotonically increasing, and the output chunks should not overlap

Under the category of 'further work', special attention needs to be paid to the possibility that the final line may not be terminated by an end-of-line character, in any combination of:

  - the input file
  - the output file
  - the diff file

Diff uses some bizarre syntax for indicating the absence of end-of-line in certain circumstances. It doesn't seem to be documented (the unified diff format itself is only loosely documented, in a bsd man page) and I have not taken the time to reverse engineer it. It seems to have something to do with a \ character beginning the line just after a +++ line, with a comment to the effect that an end-of-line is missing in one of the files. Yuck.

If an end-of-line is missing in a diff file, it's probably fair to treat it as a syntax error. If missing in the input or output file, then we have to watch out for the (crude) diff syntax that indicates this and process it to produce the correct transform. I believe this only affects the final operation, and then only with certain of the three basic operations.

-- 
Daniel

From erica at raqfaq.net Mon Jun 10 05:14:39 2002
From: erica at raqfaq.net (Erica Douglass)
Date: Sun, 9 Jun 2002 12:14:39 -0700
Subject: [Prophesy] Diff to transform converter
In-Reply-To: 
Message-ID: <000201c20fe9$e7434fa0$ad7ba8c0@corkyserver>

Sending with the correct From: address this time...

> -----Original Message-----
> From: prophesy-admin at auug.org.au [mailto:prophesy-admin at auug.org.au] On Behalf Of Daniel Phillips
> Sent: Sunday, June 09, 2002 6:07 AM
> To: prophesy at auug.org.au
> Subject: Re: [Prophesy] Diff to transform converter
>
> On Friday 07 June 2002 16:17, Daniel Phillips wrote:
>
> If an end-of-line is missing in a diff file, it's probably fair to treat it as a syntax error. If missing in the input or output file, then we have to watch out for the (crude) diff syntax that indicates this and process it to produce the correct transform. I believe this only affects the final operation, and then only with certain of the three basic operations.
>
> --
> Daniel

Sometimes it really shows that you are a UNIX person. :P

You're forgetting that you're going to have to translate between \n, \r\n (Windows), and \r (Macintosh) if you want full cross-platform compatibility. Here's what I used to translate in PHP. It's based on the browser detected and assumes that the file has been pulled into a string called $content.

	// get os for carriage returns :P
	if (strstr(getenv('HTTP_USER_AGENT'), 'Win')) {
		$content = eregi_replace("\r", "", $content);
	};

This brings up a whole lot of questions, like:

-- What is your interface going to be? If it's web-based, it's easy to detect the browser and make assumptions. Cross-platform GUI... well, it's not as easy.
If you want to force people to use Linux, you can make a Linux-only binary and a web-based client for people who aren't using Linux, but then you might have some pissed-off customers.

-- What DO your customers want? At what stage do you want to start pulling in user feedback? So far this list has mostly been "Daniel is cool because he can do a diff transform, and look, here's this nifty Python thing..." I usually start by asking the customer(s) what they want and designing from that spec. I think that is pretty much the norm in customer-centered development, which is definitely required if you want this project to actually succeed rather than to be a PET (penis enlargement tool).

I'm not trying to bash you, Daniel. I'm just questioning where this project is going. I would like to see a nice marketing-style spec with bullet points and customer needs analyses.

The question that everyone on this list should be thinking about is, "Is this a serious project that I am willing to invest my time in?" If so, we need a spec, not just C code.

Erica

From phillips at bonn-fries.net Mon Jun 10 09:32:29 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Mon, 10 Jun 2002 01:32:29 +0200
Subject: [Prophesy] Diff to transform converter
In-Reply-To: <000001c20fe9$04884030$ad7ba8c0@corkyserver>
References: <000001c20fe9$04884030$ad7ba8c0@corkyserver>
Message-ID: 

On Sunday 09 June 2002 21:08, Erica Douglass wrote:
> You're forgetting that you're going to have to translate between \n, \r\n (Windows), and \r (Macintosh) if you want full cross-platform compatibility.

Cross-platform compatibility isn't a goal, except that the end result should work on all versions of Linux, not just Redhat or Debian. If somebody wants to port the code to other platforms, that's fine. Somebody, sometime, will no doubt want to pull some source files in crlf format into the repository, and for that all we need is a general-purpose filter:

	cat /c/sourcefile.c | crlf2unix >mytree/myfile.c

This follows from the overall design philosophy, which perhaps I haven't expressed clearly enough: a tree under management will look and act just like a normal directory tree, except that any time you create, change, move or delete a file in it, a daemon will intercept the event and update a database accordingly. Among other things, this allows the use of a general-purpose filter for such a purpose as described above.
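A minimal sketch of such a filter - crlf2unix is of course just a made-up name here, and this version only drops carriage returns, so bare Mac-style \r line endings would need mapping to \n instead:

	#include <stdio.h>

	int main(void)
	{
		int c;
		while ((c = getchar()) != EOF)
			if (c != '\r') /* pass everything except carriage returns */
				putchar(c);
		return 0;
	}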
> If it's web-based, it's easy to detect the browser and make assumptions.

A web interface isn't central. When there is one, it's going to be for reporting or download. It would also be nice - very nice - to have an lxr-style view of the code tree, with an incrementally updating index. I don't see using the browser to edit or manage source files; the current state of browser technology just isn't up to it.

> Cross-platform GUI... well, it's not as easy. If you want to force people to use Linux, you can make a Linux-only binary and a web-based client for people who aren't using Linux, but then you might have some pissed-off customers.

It's a stretch to call users of a free system 'customers'. Anyone who wants to use this system on some other platform than Linux can get the source and do the port, or find a friend to do it.

> -- What DO your customers want? At what stage do you want to start pulling in user feedback? So far this list has mostly been "Daniel is cool because he can do a diff transform, and look, here's this nifty Python thing..." I usually start by asking the customer(s) what they want and designing from that spec. I think that is pretty much the norm in customer-centered development, which is definitely required if you want this project to actually succeed rather than to be a PET (penis enlargement tool).

Customer number one is me, and my main motivation is to come up with something that saves time and does a more accurate job on certain common tasks that have proved to be big time wasters for me. One such task is preparation of patch sets, where each patch in the set has to result in a system which builds and operates correctly. To date, I've handled that by maintaining multiple source trees and diffing between them, but that is tedious and error prone, not to mention consuming huge amounts of disk space.

Which brings me to another big concern: saving disk space. I see that on my laptop I currently have xx gig in my src directory, and I have considerably more on my server. That's just too much. Most of these source trees are just minor variations on each other - experiments or incremental versions. I want to be able to go:

	goto linux-2.4.16; make install
	goto linux-2.4.19; make install

all in the same source tree. (The new kbuild system, once it gets into the tree, will help with this, as it - optionally - does not pollute the source tree with build files.)

> I'm not trying to bash you, Daniel. I'm just questioning where this project is going. I would like to see a nice marketing-style spec with bullet points and customer needs analyses.
>
> The question that everyone on this list should be thinking about is, "Is this a serious project that I am willing to invest my time in?" If so, we need a spec, not just C code.

The vast majority of successful open source projects start with some working code that does something useful, so creating said working code has to be the main focus at this point. Besides, it's been fun and interesting so far, and I do think it is going somewhere.
--
Daniel

From erica at simpli.biz Mon Jun 10 05:08:19 2002
From: erica at simpli.biz (Erica Douglass)
Date: Sun, 9 Jun 2002 12:08:19 -0700
Subject: [Prophesy] Diff to transform converter
In-Reply-To: 
Message-ID: <000001c20fe9$04884030$ad7ba8c0@corkyserver>

> -----Original Message-----
> From: prophesy-admin at auug.org.au [mailto:prophesy-admin at auug.org.au] On
> Behalf Of Daniel Phillips
> Sent: Sunday, June 09, 2002 6:07 AM
> To: prophesy at auug.org.au
> Subject: Re: [Prophesy] Diff to transform converter
>
> On Friday 07 June 2002 16:17, Daniel Phillips wrote:
>
> If an end-of-line is missing in a diff file, it's probably fair to treat it
> as a syntax error. If missing in the input or output file then we have to
> watch out for the (crude) diff syntax that indicates this and process it to
> produce the correct transform. I believe this only affects the final
> operation, and then only with certain of the three basic operations.
>
> --
> Daniel

Sometimes it really shows that you are a UNIX person. :P You're forgetting that you're going to have to translate between \n, \r\n (Windows), and \r (Macintosh) if you want full cross-platform compatibility.

Here's what I used to translate in PHP. It's based on the browser detected and assumes that the file has been pulled into a string called $content.

    // get os for carriage returns :P
    if (strstr(getenv('HTTP_USER_AGENT'), 'Win')) {
        $content = eregi_replace("\r", "", $content);
    }

This brings up a whole lot of questions, like:

-- What is your interface going to be? If it's web-based, it's easy to detect the browser and make assumptions. Cross-platform GUI... well, it's not as easy. If you want to force people to use Linux, you can make a Linux-only binary and a web-based client for people who aren't using Linux, but then you might have some pissed-off customers.

-- What DO your customers want? At what stage do you want to start pulling in user feedback? So far this list has mostly been "Daniel is cool because he can do a diff transform, and look, here's this nifty Python thing..." I usually start by asking the customer(s) what they want and designing from that spec. I think that is pretty much the norm in customer-centered development, which is definitely required if you want this project to actually succeed rather than to be a PET (penis enlargement tool).

I'm not trying to bash you, Daniel. I'm just questioning where this project is going. I would like to see a nice marketing-style spec with bullet points and customer needs analyses.

The question that everyone on this list should be thinking about is, "Is this a serious project that I am willing to invest my time in?" If so, we need a spec, not just C code.

Erica

From rasmus at jaquet.dk Tue Jun 11 07:33:18 2002
From: rasmus at jaquet.dk (Rasmus Andersen)
Date: Mon, 10 Jun 2002 23:33:18 +0200
Subject: [Prophesy] C from Python Example
In-Reply-To: 
References: 
Message-ID: <20020610213318.GA2395@jaquet.dk>

On Sat, Jun 08, 2002 at 11:06:12PM +0200, Daniel Phillips wrote:
> The example can be compiled, installed as a module, and executed using the
> script line:
>
>     cc -shared -I/usr/include/python2.2 python2.c -o foo.so && \
>     sudo cp foo.so /usr/lib/python2.2/lib-dynload/ && \
>     python2.2 -c "import foo; print foo.bar('test')"

Doing this nets me 'ImportError: dynamic module does not define init function (initfoo)'. (OK, I cheated and skipped the second line (see below)).

I have used SWIG (www.swig.org) successfully before as a C/Python integrator.
The v1.3 (newest, I think) I had on my system choked on some of your C constructs; using the attached patch helped.

Also, I had to use an interface file to get SWIG to grok other stuff. Also attached.

I am not pushing this in a serious way since you seem to be doing without, but I wanted to show what I needed to do.

>
> I verified that this example works in python 1.5, 2.1 and 2.2. The
> python2.2-dev package has to be installed to get the Python.h header file.
>
> I'd like to import the module without copying it to the lib-dynload
> directory, which is not a good way to develop because it requires root
> privilege, and anyway, it's an annoying extra step. I'm sure there's a way
> to do it, but I haven't found it yet.
>

This problem I did not have on python2.1 and python1.5 (solaris) and python2.2 (linux).

Rasmus

From rasmus at jaquet.dk Tue Jun 11 15:48:20 2002
From: rasmus at jaquet.dk (Rasmus Andersen)
Date: Tue, 11 Jun 2002 07:48:20 +0200
Subject: [Prophesy] C from Python Example
In-Reply-To: <20020610213318.GA2395@jaquet.dk>
References: <20020610213318.GA2395@jaquet.dk>
Message-ID: <20020611054820.GA1630@jaquet.dk>

On Mon, Jun 10, 2002 at 11:33:18PM +0200, Rasmus Andersen wrote:
> I have used SWIG (www.swig.org) successfully before as a C/Python integrator.
> The v1.3 (newest, I think) I had on my system choked on some of your
> C constructs; using the attached patch helped.
>
> Also, I had to use an interface file to get SWIG to grok other stuff.
> Also attached.

Sigh.

Rasmus

-------------- next part --------------
--- transform.c.org	Mon Jun 10 22:25:39 2002
+++ transform.c	Mon Jun 10 23:29:21 2002
@@ -67,7 +67,8 @@
 // should not allow leading 0 for high
 // should trap all invalid opcodes
 // should take ops string limit and check against it
-struct transinfo {int in; int out;} transcheck(uchar *ops)
+struct transinfo {int in; int out;};
+struct transinfo transcheck(uchar *ops)
 {
 	unsigned c, state = 0, count = 0, ilen = 0, olen = 0;
 
@@ -153,7 +154,8 @@
 unsigned emit_line, this_line, hold_lines, hold_length, emit_op;
 uchar *emit_text, *this_text, *end_text, *outmem, *output, *outlim;
 
-struct holdline {char *source; unsigned length;} *holdmem, *hold, *holdlim;
+struct holdline {char *source; unsigned length;};
+struct holdline *holdmem, *hold, *holdlim;
 
 int emit(unsigned op)
 {
-------------- next part --------------
%module foo
%{
#define max(a, b) (a > b? a: b)
#define trace trace_on
#define trace_on(cmd) cmd
#define trace_off(cmd)
#define text_op 0
#define copy_op 1
#define skip_op 2
#define high_op 3
#define text(n) (n | (text_op << 6))
#define copy(n) (n | (copy_op << 6))
#define skip(n) (n | (skip_op << 6))
#define move(n, s) copy(0), copy(s), copy(n)
struct holdline {char *source; unsigned length;};
struct transinfo {int in_org; int out;};
%}
extern int transform(unsigned char *ops, unsigned char *in, unsigned char *out);
extern struct transinfo transcheck(unsigned char *ops);
extern int _next_is(unsigned char c, unsigned char **stringv, unsigned char *limit);
extern int _skip_to(unsigned char c, unsigned char **stringv, unsigned char *limit);
extern void expand(void **pbase, void **plim, void **pcur, unsigned more);
extern int emit(unsigned op);
extern int diff2transform(unsigned char *input, unsigned inlen, unsigned char *string, unsigned length);
extern unsigned emit_line, this_line, hold_lines, hold_length, emit_op;
extern unsigned char *emit_text, *this_text, *end_text, *outmem, *output, *outlim;
extern struct holdline *holdmem, *hold, *holdlim;

From rasmus at jaquet.dk Tue Jun 11 17:31:06 2002
From: rasmus at jaquet.dk (Rasmus Andersen)
Date: Tue, 11 Jun 2002 09:31:06 +0200
Subject: [Prophesy] C from Python Example
In-Reply-To: <20020611054820.GA1630@jaquet.dk>; from rasmus@jaquet.dk on Tue, Jun 11, 2002 at 07:48:20AM +0200
References: <20020610213318.GA2395@jaquet.dk> <20020611054820.GA1630@jaquet.dk>
Message-ID: <20020611093106.A4144@jaquet.dk>

On Tue, Jun 11, 2002 at 07:48:20AM +0200, Rasmus Andersen wrote:
> On Mon, Jun 10, 2002 at 11:33:18PM +0200, Rasmus Andersen wrote:
> > I have used SWIG (www.swig.org) successfully before as a C/Python integrator.
> > The v1.3 (newest, I think) I had on my system choked on some of your
> > C constructs; using the attached patch helped.
> >
> > Also, I had to use an interface file to get SWIG to grok other stuff.
> > Also attached.
>
> Sigh.
> [snip attachments]

While I am at it, I might as well give the incantations:

    % swig -python transform.i
    % cc -I/usr/include/python2.2 -shared -o foo.so transform.c transform_wrap.c

The tutorial at the www.swig.org site is fairly short and concise.

Rasmus

From phillips at bonn-fries.net Wed Jun 12 01:17:08 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Tue, 11 Jun 2002 17:17:08 +0200
Subject: [Prophesy] C from Python Example
In-Reply-To: <20020611093106.A4144@jaquet.dk>
References: <20020611054820.GA1630@jaquet.dk> <20020611093106.A4144@jaquet.dk>
Message-ID: 

On Tuesday 11 June 2002 09:31, Rasmus Andersen wrote:
> While I am at it, I might as well give the incantations:
>
>     % swig -python transform.i
>     % cc -I/usr/include/python2.2 -shared -o foo.so transform.c transform_wrap.c
>
> The tutorial at the www.swig.org site is fairly short and concise.

Yes, it is. There's one incantation missing: how do you import your foo.so module into python? So far I don't know how to go about loading a module that's in my current working directory.
--
Daniel

From phillips at bonn-fries.net Wed Jun 12 01:31:55 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Tue, 11 Jun 2002 17:31:55 +0200
Subject: [Prophesy] C from Python Example
In-Reply-To: <20020611054820.GA1630@jaquet.dk>
References: <20020610213318.GA2395@jaquet.dk> <20020611054820.GA1630@jaquet.dk>
Message-ID: 

On Tuesday 11 June 2002 07:48, Rasmus Andersen wrote:
> On Mon, Jun 10, 2002 at 11:33:18PM +0200, Rasmus Andersen wrote:
> > I have used SWIG (www.swig.org) successfully before as a C/Python integrator.
> > The v1.3 (newest, I think) I had on my system choked on some of your
> > C constructs; using the attached patch helped.

Yes, it seems swig's parser has fallen a little bit behind and I suppose a bug report to the project would be in order.

/me puts it on his list of things to do sometime

In the meantime your fix is fine.

> > Also, I had to use an interface file to get SWIG to grok other stuff.
> > Also attached.

Swig is very clearly a quick way to get started on constructing an interface like this and I would have used it if I'd known about it. (I did see swig mentioned a few times as I searched for documentation, but mainly in the context of interfacing Python to C++, so that put me off the scent.) There's a lot of knowledge encoded in the swig interface generators that could be time consuming to acquire by other means.

For this project I'd tend towards treating swig as more of a kind of tutorial than an essential build tool, since the Python/C interface is quite straightforward once you work out the basics, like where to find the documentation and what's required to compile and link. I do intend to run swig from time to time to compare what it thinks is essential for an interface, versus what I come up with from reading the docs.

OK... I just generated a swig python wrapper from your .i file... Woohoo! Over a thousand lines of wrapper, more than 3 times the size of the project so far, and the generated code is 8 times the size. Well, I guess that's the problem with program-writing programs in general. By studying the wrapper I'm sure there are useful things to learn, but I think it's easy enough to generate the Python wrappers by hand, as needed. Of course, that means being attentive and worrying about things like object ref counts and locking, but these are good to know about anyway.

Swig-friendly transform.c attached.

--
Daniel

-------------- next part --------------
A non-text attachment was scrubbed...
Name: transform.c
Type: text/x-c
Size: 7761 bytes
Desc: not available
URL: 

From rasmus at jaquet.dk Wed Jun 12 02:47:34 2002
From: rasmus at jaquet.dk (Rasmus Andersen)
Date: Tue, 11 Jun 2002 18:47:34 +0200
Subject: [Prophesy] C from Python Example
In-Reply-To: ; from phillips@bonn-fries.net on Tue, Jun 11, 2002 at 05:17:08PM +0200
References: <20020611054820.GA1630@jaquet.dk> <20020611093106.A4144@jaquet.dk>
Message-ID: <20020611184734.A6465@jaquet.dk>

On Tue, Jun 11, 2002 at 05:17:08PM +0200, Daniel Phillips wrote:
> On Tuesday 11 June 2002 09:31, Rasmus Andersen wrote:
> > While I am at it, I might as well give the incantations:
> >
> >     % swig -python transform.i
> >     % cc -I/usr/include/python2.2 -shared -o foo.so transform.c transform_wrap.c
> >
> > The tutorial at the www.swig.org site is fairly short and concise.
>
> Yes, it is. There's one incantation missing: how do you import your foo.so
> module into python? So far I don't know how to go about loading a module
> that's in my current working directory.
As I tried to say, that is one problem I haven't had: On a variety of platforms I have been able to load from the working directory.

Rasmus

From phillips at bonn-fries.net Wed Jun 12 03:22:25 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Tue, 11 Jun 2002 19:22:25 +0200
Subject: [Prophesy] C from Python Example
In-Reply-To: <20020611184734.A6465@jaquet.dk>
References: <20020611184734.A6465@jaquet.dk>
Message-ID: 

On Tuesday 11 June 2002 18:47, Rasmus Andersen wrote:
> On Tue, Jun 11, 2002 at 05:17:08PM +0200, Daniel Phillips wrote:
> > On Tuesday 11 June 2002 09:31, Rasmus Andersen wrote:
> > > While I am at it, I might as well give the incantations:
> > >
> > >     % swig -python transform.i
> > >     % cc -I/usr/include/python2.2 -shared -o foo.so transform.c transform_wrap.c
> > >
> > > The tutorial at the www.swig.org site is fairly short and concise.
> >
> > Yes, it is. There's one incantation missing: how do you import your foo.so
> > module into python? So far I don't know how to go about loading a module
> > that's in my current working directory.
>
> As I tried to say, that is one problem I haven't had: On a variety
> of platforms I have been able to load from the working directory.

Is Debian one of those platforms? If not, then I suppose it's just a configuration issue, specifically, the initialization of sys.path:

    http://www.python.org/doc/current/ref/import.html

Could you please do:

    import sys
    sys.path

and see if "." is one of the entries? The first entry I see here is '', which seems a little odd.

--
Daniel

From rasmus at jaquet.dk Wed Jun 12 03:35:48 2002
From: rasmus at jaquet.dk (Rasmus Andersen)
Date: Tue, 11 Jun 2002 19:35:48 +0200
Subject: [Prophesy] C from Python Example
In-Reply-To: 
References: <20020611184734.A6465@jaquet.dk>
Message-ID: <20020611173548.GA1761@jaquet.dk>

On Tue, Jun 11, 2002 at 07:22:25PM +0200, Daniel Phillips wrote:
> Is Debian one of those platforms? If not, then I suppose it's just
> a configuration issue, specifically, the initialization of sys.path:
>
>     http://www.python.org/doc/current/ref/import.html
>
> Could you please do:
>
>     import sys
>     sys.path
>
> and see if "." is one of the entries? The first entry I see here is '',
> which seems a little odd.

No debian: Solaris and Mandrake. But there is no '.' in my sys.path (on Mandrake at least)? I get the '' as well.

Rasmus

From phillips at bonn-fries.net Wed Jun 12 03:39:05 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Tue, 11 Jun 2002 19:39:05 +0200
Subject: [Prophesy] C from Python Example
In-Reply-To: <20020611173548.GA1761@jaquet.dk>
References: <20020611173548.GA1761@jaquet.dk>
Message-ID: 

On Tuesday 11 June 2002 19:35, Rasmus Andersen wrote:
> On Tue, Jun 11, 2002 at 07:22:25PM +0200, Daniel Phillips wrote:
> > Is Debian one of those platforms? If not, then I suppose it's just
> > a configuration issue, specifically, the initialization of sys.path:
> >
> >     http://www.python.org/doc/current/ref/import.html
> >
> > Could you please do:
> >
> >     import sys
> >     sys.path
> >
> > and see if "." is one of the entries? The first entry I see here is '',
> > which seems a little odd.
>
> No debian: Solaris and Mandrake. But there is no '.' in my sys.path
> (on Mandrake at least)? I get the '' as well.

OK, well I'll just put that one aside as a minor mystery to be investigated in due course, and on with the show.
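In the meantime there's an explicit workaround for development, which is to prepend the working directory to sys.path by hand before importing. A sketch (I'm assuming nothing more than this is needed):

    # Put the current directory at the front of the module search path,
    # then import the freshly built extension module from right here.
    import sys, os
    sys.path.insert(0, os.getcwd())
    import foo
    print foo.bar('test')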
--
Daniel

From phillips at bonn-fries.net Thu Jun 13 01:43:16 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Wed, 12 Jun 2002 17:43:16 +0200
Subject: [Prophesy] Database Structure and the Transport System
Message-ID: 

As I mentioned earlier, I am working towards the goal of implementing a basic source tree transport system, which implements the following two operations:

    tag - give a name to the current version of the source tree
    goto - transform the current source tree to a previously tagged version

I have a theory that if I do this very accurately and efficiently, first it will be immediately useful for certain purposes such as preparing patch sets and storing multiple source tree versions in a single tree, and second, it can be extended naturally to become an elegant and useful distributed source code manager.

The immediate need is to define a database structure well suited to capturing incremental changes to a source tree and supporting the above transport system operations. To help understand my thinking, it's useful to consider the following points:

1) Not all the source is recorded in the database proper. Some, or most of the text exists only as normal files in the source tree. Only enough information is recorded in the database to support the transport mechanism. Though this does add a little complexity to the system, it means that the database can be considerably smaller in common situations, and putting a source tree under management is a much faster operation than if all the source text had to be compressed and loaded into it. (As an extension, the on-disk source could be redundantly encoded in the database in order to provide protection against the possibility that the on-disk source could be changed while Prophesy isn't watching.)

2) Transforms are unidirectional. This is simply to save space - we don't have to record the text that a transform deletes from the on-disk version, only added text. However, we can easily compute the inverse of a transform given the input text, or portions of it. At the time the transport system transforms the on-disk text into some other version, those portions of the input text needed to compute the inverse transformation - needed so that the transport system can restore the on-disk text to its original form - can be stored in the database.

Prophesy Database Structure
---------------------------

So far, I've identified the following essential database entities:

    - file
    - directory
    - version
    - transform

where 'directory' is a kind of file. Each of these objects will have an internally-generated, permanent id.

The object id, especially for files and directories, is tantalizingly similar to a file inode, and it's tempting to use the underlying filesystem's actual inode number for the id, except that we are allowed to "cp -a" the whole source tree structure, and the inode numbers would change in the process. Drat.

This means that Prophesy and the underlying filesystem are going to be doing a lot of lookups in parallel: the filesystem looks up a file by path and name, yielding an inode, then advises Prophesy that the file is to be altered. Prophesy then has to look up the file by path and name, yielding an object id. Oh well, we will ensure the latter operation is efficient.

The main (or perhaps only) function of directory objects is to support lookup of objects by name. Each file or directory object has a name and a directory id, this pair being unique in a version.
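Just to pin down what that lookup amounts to, here is a toy model in Python, with a dictionary standing in for the database table (all names invented for illustration):

    # Toy model of Prophesy's name lookup: the database maps each
    # (directory id, name) pair to an object id, so resolving a full
    # path is a walk down from the root directory object.
    names = {}                          # (directory id, name) -> object id

    def lookup(path, root_id=0):
        oid = root_id
        for part in path.strip('/').split('/'):
            oid = names[(oid, part)]
        return oid

The real thing is a database query with a hash table cache in front of it, but the shape of the operation is the same.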
A file object can be known by more than one name/directory pair, that is, it can be hardlinked.

Tree Structure of Versions
--------------------------

The ability to return to some previous version of the source text and modify it means that the version structure is a tree, that is, any version can be forked. The version table represents this tree in the form of a flat, relational table.

version table:

    version id, parent version, tag

That is, each version knows its parent. If we want to know all the children of some version then we can query the database for all versions with the given parent. We can cache the result if we want, to support repeated queries of this form efficiently.

Primary Data Representation
---------------------------

All primary data in Prophesy is represented by the combination of the current on-disk source text and changes relative to the current source text. These changes take the form of three tables, as described below, and a journal table, to which additions to any of the three change tables are logged.

journal table:

    journal id, comment, author, timestamp

The sole purpose of the journal table is tracking; the transport system does not make reference to it. While the journal could in theory be used to wind the whole database back to any historical state, the transport mechanism provides a more powerful and efficient way of doing that.

Each change to file text between two versions results in an addition to the text change table.

text change table:

    object id, input version, output version, journal id,
    transform, untransform

The forward transform is not stored until needed, since it can be generated from the current text and the untransform, and so would contain only redundant text. The forward transform must be generated the first time the current version moves downstream of the output version, that is, towards the root. Optionally, the forward transform can be stored redundantly, to protect against the possibility that the on-disk tree could be changed without knowledge of Prophesy, or to allow an entire repository to be copied by copying just a single database file.

Each file or directory object in the Prophesy database has a unique object id, which is used to track the object as it evolves from version to version.

object create table:

    object id, input version, output version, journal id, name

An object delete is exactly an object create with the input and output versions reversed. In other words, a create going from version A to version B implies a delete going from version B to version A. For any delete, the object text must be stored; however, for a create that would be redundant, since the current text is on disk. In fact, the same considerations apply as for text changes, and they are handled the same way. That is, when a non-empty file is deleted, Prophesy enters both a reverse create and a text change consisting of a single text remove (reverse add) operation into the database.

When an object is deleted, its object id is not reused (possibly excepting cases where an object is created and deleted within the same version, or an entire version is discarded) since it continues to exist in other versions within the same repository. For the time being, a 32 bit object id should be sufficient.

Handling hard links correctly is expected to be problematic; however, representing them is not a problem. A hard link is simply a create for an object that already exists.
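For concreteness, the tables above might look something like this as SQL issued from Python - a rough sketch only, the column types are guesses, and the move table described just below follows the same pattern:

    # Rough DDL sketch of the tables described above (types are guesses;
    # a 'name' is stored as an (atom, directory) pair, per the atom
    # discussion further down).
    schema = """
        create table version    (version integer primary key, parent integer, tag text);
        create table journal    (journal integer primary key, comment text,
                                 author text, stamp timestamp);
        create table textchange (object integer, inversion integer, outversion integer,
                                 journal integer, transform bytea, untransform bytea);
        create table objcreate  (object integer, inversion integer, outversion integer,
                                 journal integer, atom integer, dir integer);
    """

Nothing here is final; the point is just that the whole structure comes down to a handful of small, fixed-format tables.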
From version to version, file and directory objects may be moved from any place in the source tree to any other. Each such move results in an entry in the object move table. As with filesystems, object rename is treated as a move.

object move table:

    object id, input version, output version, journal id,
    input name, output name

Here, and everywhere else names are used, a 'name' is a pair: atom, directory. Using atom ids rather than literal text in the move and create records means that these records consist only of fixed-size fields, which is friendly to database optimization. Atoms also provide a measure of compression, since the atom table is shared by all versions.

Current State Cache
-------------------

Other than the version table, all primary objects in the database represent differences rather than current state. A current state for any version can always be constructed by applying all transforms, create/deletes and moves encountered on the path from an old current version to a new current version. However, filename lookups need to be efficient, and so a hash table mapping all current names (atom, dir pair) to object ids is maintained incrementally.

The list of all current objects is easily and efficiently generated by taking the union of all object creates on the path from the root to the current version, less all object deletes. This is rarely needed, so it is not maintained incrementally.

Name Lookup
-----------

Name lookup by full path is needed each time Prophesy intercepts and processes a change to a file, and that could add up to a lot of lookups. For example, a global edit might be performed, or a whole set of files untarred into a subdirectory, or a directory deleted. It is desirable that typical file operations not be slowed noticeably by putting a source tree under management.

Therefore, to optimize directory lookups, an additional hash table is maintained, which maps hashes of full directory paths to directory objects. This avoids the need to iterate through each section of a directory path to perform a lookup. When a directory name is changed, hashes of subdirectories need to be invalidated, and this is the only case where Prophesy needs to know the subdirectory tree of a given directory. To optimize this, a directory table for the current version is maintained incrementally:

directory table:

    directory id, parent directory id

which forms a tree, since multiple directory ids can have the same parent directory id.

Epilogue
--------

On the 'well begun is half done' principle, this post constitutes my last major effort before turning to preparations for the Ottawa Linux Symposium and kernel summit. In other words, I won't be implementing any of this for about a month. This should provide adequate time for the ideas to mature. Of course I'll respond to any critical comment, or elaborate on any points I glossed over too quickly.

--
Daniel

From phillips at bonn-fries.net Thu Jun 13 08:33:00 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Thu, 13 Jun 2002 00:33:00 +0200
Subject: [Prophesy] Background material - xdelta
Message-ID: 

Having roughed out the design of a storage engine for Prophesy, I thought I'd do a little research and I found this:

    http://telia.dl.sourceforge.net/sourceforge/xdelta/xdfs.pdf
    (Josh MacDonald's paper on delta compression)

Recommended reading.
And see:

    http://prcs.sourceforge.net/
    (PRCS revision control project, home page)

    http://telia.dl.sourceforge.net/sourceforge/prcs/prcs_doc.html
    (PRCS documentation)

Though much of what is written here seems similar to what I've mapped out, in the end, the implementation comes out very different.

--
Daniel

From phillips at bonn-fries.net Thu Jun 13 22:13:07 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Thu, 13 Jun 2002 14:13:07 +0200
Subject: [Prophesy] Background material - Subversion
Message-ID: 

Subversion is a well-established and active project whose design is similar in many ways to what I've put forth:

    http://subversion.tigris.org/

The subversion code is hosted online in a Subversion repository. This directory of design notes makes interesting reading:

    http://svn.collab.net/repos/svn/branches/0.11.1/notes/

Subversion uses a database, currently Berkeley DB, with plans to switch to an SQL database at some point in the future. Hmm. I wonder, why not start there?

A repository is made available for distributed access via an Apache module, just as I'd planned. The use of DAV gives a simple form of web browsing interface for free.

The Subversion engine is modeled on a filesystem, and seems headed in the direction of becoming a versioning filesystem, although the technical details of how to make it a mountable filesystem have not been addressed. Instead, filesystem-like access is provided by way of a C api modeled on the Posix file functions.

Much functionality appears to be available already; however, the file formats have not been frozen and database design issues seem to be in flux.

--
Daniel

From phillips at bonn-fries.net Thu Jun 13 22:30:12 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Thu, 13 Jun 2002 14:30:12 +0200
Subject: [Prophesy] Database Structure and the Transport System
In-Reply-To: 
References: 
Message-ID: 

On Wednesday 12 June 2002 17:43, I wrote:
> text change table:
>
>     object id, input version, output version, journal id,
>     transform, untransform
>
> object create table:
>
>     object id, input version, output version, journal id, name
>
> object move table:
>
>     object id, input version, output version, journal id,
>     input name, output name

After reflecting a little, I realized that where I used 'input version, output version' representing a change between two version nodes, I should have used the arc between versions, giving a representation that is more compact and easier to search:

text change table:

    object id, version arc, journal id, transform, untransform

object create table:

    object id, version arc, journal id, name

object move table:

    object id, version arc, journal id, input name, output name

which adds a new database entity, 'version arc', the directed arc from a version's parent to itself. The primary version arc definitions can be folded into the version table:

version table:

    version id, parent version, version arc id, tag

--
Daniel

From mbp at samba.org Fri Jun 14 10:28:54 2002
From: mbp at samba.org (Martin Pool)
Date: Fri, 14 Jun 2002 10:28:54 +1000
Subject: [Prophesy] comments so far
In-Reply-To: 
References: 
Message-ID: <20020614002851.GD6330@toey.sourcefrog.net>

These are mostly just ideas I've had in my mind about SCM; some of them disagree with (what I've heard of) prophesy. Of course, you can do whatever you want. So take them or leave them.

I think the hard thing about defining a SCM system is defining just what SCM *means*.
As far as I can tell, you seem to be implementing a versioning filesystem, which lets you tag and revisit points in history. That's very nice, but I don't think that is really the heart of the problem.

I believe that SCM systems, like programming languages, are primarily tools for communication between programmers -- the pragmatics of controlling the machine are secondary. (Included is the case of a programmer communicating with themselves over time.)

Hooking at the filesystem level is good for capturing all changes, but I think they are very fine-grained and not meaningful. I think it's a bad idea -- although I of course respect you for trying it -- because I think the benefits compared to regular commands don't justify the added complexity and risk.

There's a hierarchy:

release notes for a new version -- many end-users will read these; they'll include references to bugs fixed

list of patches accepted -- every developer probably wants to read this

list of small changes within a patch -- many programmers probably want to read this

diff for an actual patch -- probably don't need to read it unless I'm actually working in the area

Perhaps there are some other levels, but you get the idea. I think the recursive nature is very important. The key job of the SCM system is to help programmers manage the history of development of the project.

Just keeping a GNU-style ChangeLog can be pretty useful even without SCM.

Autogenerating a NEWS file by pulling out top-level comments would be great, because it's one of the most useful tools to a user or satellite developer.

Offline operation is crucial. Most projects don't have everybody on a LAN. Open source is inherently distributed. Time costs here will drastically outweigh anything you can do with a database, etc, on the server.

Arch makes every download of the product a potential working directory. I don't think it's necessary to keep the entire history in every tarball, but it is perhaps good to keep references that tie the files to their place in history.

It would, by extension, be nice to allow all downloads to happen over http/ftp, and all submissions to happen by mail to a maintainer. The program should not require any intelligence in the protocol.

People shouldn't need permission to start hacking on a project, and to keep versions locally. They just need permission to commit to the master site.

diffs have this nice property of being intelligible to humans and programs. Keep them. Make minimal changes to handle chmod, mv, etc.

All other things being equal, files should be directly human-readable. Use diffs. Perhaps make ChangeLogs, or something similar, part of the metadata. (On the other hand, being readable might encourage editing by hand, which would be bad.)

Writing new filesystems, diff formats, network protocols, etc is just screwing around. The heart of the problem is to get a good model for *how to do SCM*. You can implement (v1) using existing tools; optimize later if it turns out that your model is correct.

Similarly, don't waste time writing GUIs; use emacs, xxdiff, dirdiff, etc. Write one later if it proves correct.

If I was starting from scratch, I would consider a typical open source project:

- email is key

- people mail around patches; perhaps they get revised; eventually they get applied

- the NEWS file says "applied patch for foofeature from jhacker at dot.com"

Projects sometimes split off files or subdirectories into other projects; perhaps they diverge slightly. It would be nice to handle this.
For rsync and other projects, I keep patches that I have not yet really accepted but that look good in CVS in patches/. A SCM system that managed this would be nice. I think it's a promising model, not a hack.

Disk is cheap. Keep everything.

Networks are getting broader, but latency is not going to go away.

Do it in <4000 lines. Lions-book Unix was 10kloc, and look how many good ideas they had in there.

--
Martin

From phillips at bonn-fries.net Sat Jun 15 00:09:55 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Fri, 14 Jun 2002 16:09:55 +0200
Subject: [Prophesy] comments so far
In-Reply-To: <20020614002851.GD6330@toey.sourcefrog.net>
References: <20020614002851.GD6330@toey.sourcefrog.net>
Message-ID: 

On Friday 14 June 2002 02:28, Martin Pool wrote:
> These are mostly just ideas I've had in my mind about SCM; some of
> them disagree with (what I've heard of) prophesy. Of course, you can
> do whatever you want. So take them or leave them.
>
> I think the hard thing about defining a SCM system is defining just
> what SCM *means*.
>
> As far as I can tell, you seem to be implementing a versioning
> filesystem, which lets you tag and revisit points in history. That's
> very nice, but I don't think that is really the heart of the problem.

It's the heart of a tool that (hopefully) lets you get at the heart of the problem.

> I believe that SCM systems, like programming languages, are primarily
> tools for communication between programmers -- the pragmatics of
> controlling the machine are secondary. (Included is the case of a
> programmer communicating with themselves over time.)

I believe you're right, so long as SCM systems stay as clumsy as they are. If the archive system was actually easy and transparent to use, then programmers would use it as a tool for themselves, as a means of tracking multiple projects they're involved in, and trying out experiments. In much the same way as we now rely on the undo chain in an editor - I do that, don't you? That is, I rely on the editor's undo chain to back me out of failed experiments. It gets to the point where I'm reluctant to shut down the machine because of all the state saved in the editor's undo chains. Now, that's a system that works, but it's got glaring imperfections, beyond the fact that the state disappears when the editor shuts down. The editors also don't know about each other, and they are incapable of maintaining undo chains across different files, let alone projects.

Granted, the SCM is also a tool for communication, but much good work has already been done there. I think the distributed side of things is well known and under control, but today's crop of scm's still suck as development tools. So that's where I'm concentrating.

> Hooking at the filesystem level is good for capturing all changes, but
> I think they are very fine-grained and not meaningful.

This was addressed in an earlier post. In the current version, every change to each file is recorded (and in order, giving you global undo, including undeletes) but when you close the version, the stacked changes are collapsed into a single layer of changes for the version. To put it another way, the system journals individual changes, but (unless you tell it otherwise) only for the current version.

> I think it's a
> bad idea -- although I of course respect you for trying it -- because
> I think the benefits compared to regular commands don't justify the
> added complexity and risk.
Somebody from Apple said it well: "you should never have to tell the computer something it already knows". Check-in and check-out are things the computer can figure out for itself.

Risk... I don't see it. If anything, the risk of a programmer forgetting or misapplying a command is greater. I know, I did it myself once :-)

As for complexity, I don't really see that. Difficult, yes, because so far nobody has provided a suitable framework on Linux for stacking local filesystems. Anyway, I don't intend to tackle the problem of exporting the vfs to user space in its full generality, but rather, just enough to provide the functionality I want. If that provides a good base to work from towards a fully general system, then that's a bonus.

Finally, I don't have to depend on the magic filesystem effort being successful, since the fallback is just to go to the traditional way of doing things, with explicit commands (a file checkout has the immediate effect of loading the current contents of the file into the database). However, that's way too dull for me and would fall well short of what I'd expect from a 21st century design.

I've only thought in general terms about how to implement the magic filesystem so far, however, now is the time to get down to specifics. As a design rule, I'll try to work within existing kernel mechanisms, but if those mechanisms prove inadequate, I won't be shy about changing them. In the end, if somebody comes up with a better way of doing the same thing, that's great, but right now the main concern is functionality and reliability. Other essential design parameters are:

  - Overhead imposed by the magic filesystem is insignificant
  - No performance impact at all outside the scope of the magic filesystem
  - No security compromise
  - No new DoS vulnerabilities
  - No new races

When the magic filesystem is mounted, it gets a new superblock and knows about the superblock of the underlying system. We want to pass most vfs events straight through to the underlying filesystem, except for open, write, mmap and close (note that the vfs only passes the final file close event to the filesystem, and this isn't good enough). A pass-through write would work as follows:

  - inodes of the magic filesystem are exactly the inodes of the underlying filesystem, except for having an i_sb that points at a magic_superblock in place of the underlying filesystem's native superblock (does this work??)

  - vfs calls magic_file->f_dentry->d_inode->i_fop->write(magic_file, ...)

  - this magic_file_write keeps the native superblock in a private field of the magic superblock: magic_file->f_dentry->d_inode->i_sb->private.real_sb

  - magic_file_write allocates a temporary buffer, invokes the native filesystem's ->read to read the to-be-overwritten data into it, writes that data into the userspace daemon's pipe, and releases the temporary buffer (there has to be a more direct way of doing this!)

  - magic_file_write then calls the underlying filesystem's ->write, with its native... (inode??, no, it points at magic_sb, recursion!!) could we temporarily reset the sb?? yikes. Too bad generic_file_write takes a file instead of an inode.

Other considerations:

  - Modify dnotify to allow events on files, not just directories

  - For every file open, register on

  - File open is overridden to attach notify events to file open and file close, if the file was opened r/w.
    These events are directed at the user space daemon.

  - File write is overridden in magic_file_operations->write, to read the current contents of the file in the overwritten region into a pipe. If the pipe is full the writing process blocks until the userspace daemon empties it.

> There's a hierarchy:
>
> release notes for a new version -- many end-users will read these;
> they'll include references to bugs fixed
>
> list of patches accepted -- every developer probably wants to read
> this
>
> list of small changes within a patch -- many programmers probably
> want to read this
>
> diff for an actual patch -- probably don't need to read it unless
> I'm actually working in the area
>
> Perhaps there are some other levels, but you get the idea. I think
> the recursive nature is very important. The key job of the SCM system
> is to help programmers manage the history of development of the
> project.
>
> Just keeping a GNU-style ChangeLog can be pretty useful even without
> SCM.
>
> Autogenerating a NEWS file by pulling out top-level comments would be
> great, because it's one of the most useful tools to a user or
> satellite developer.
>
> Offline operation is crucial. Most projects don't have everybody on a
> LAN. Open source is inherently distributed. Time costs here will
> drastically outweigh anything you can do with a database, etc, on the
> server.
>
> Arch makes every download of the product a potential working
> directory. I don't think it's necessary to keep the entire history in
> every tarball, but it is perhaps good to keep references that tie the
> files to their place in history.
>
> It would, by extension, be nice to allow all downloads to happen over
> http/ftp, and all submissions to happen by mail to a maintainer. The
> program should not require any intelligence in the protocol.
>
> People shouldn't need permission to start hacking on a project, and to
> keep versions locally. They just need permission to commit to the
> master site.
>
> diffs have this nice property of being intelligible to humans and
> programs. Keep them. Make minimal changes to handle chmod, mv, etc.
>
> All other things being equal, files should be directly human-readable.
> Use diffs. Perhaps make ChangeLogs, or something similar, part of the
> metadata. (On the other hand, being readable might encourage editing
> by hand, which would be bad.)
>
> Writing new filesystems, diff formats, network protocols, etc is just
> screwing around. The heart of the problem is to get a good model for
> *how to do SCM*. You can implement (v1) using existing tools;
> optimize later if it turns out that your model is correct.
>
> Similarly, don't waste time writing GUIs; use emacs, xxdiff, dirdiff,
> etc. Write one later if it proves correct.
>
> If I was starting from scratch, I would consider a typical open source
> project:
>
> - email is key
>
> - people mail around patches; perhaps they get revised; eventually
> they get applied
>
> - the NEWS file says "applied patch for foofeature from
> jhacker at dot.com"
>
> Projects sometimes split off files or subdirectories into other
> projects; perhaps they diverge slightly. It would be nice to handle
> this.
>
> For rsync and other projects, I keep patches that I have not yet
> really accepted but that look good in CVS in patches/. A SCM system
> that managed this would be nice. I think it's a promising model, not
> a hack.
>
> Disk is cheap. Keep everything.
>
> Networks are getting broader, but latency is not going to go away.
>
> Do it in <4000 lines. Lions-book Unix was 10kloc, and look how many
> good ideas they had in there.
>
> --
> Martin

--
Daniel

From mbp at samba.org Sat Jun 15 04:06:03 2002
From: mbp at samba.org (Martin Pool)
Date: Sat, 15 Jun 2002 04:06:03 +1000
Subject: [Prophesy] comments so far
In-Reply-To: 
References: <20020614002851.GD6330@toey.sourcefrog.net>
Message-ID: <20020614180558.GA10553@toey.sourcefrog.net>

I agree with you about the usefulness of editor undo chains. Under emacs, I have kept-new-versions set to about 10, and I regularly use C-u C-x C-s to do "keep backup version" and diff-backup. All very nice and useful.

A filesystem that kept all versions would allow you to do this in a program-neutral way, although I think that's not so important now that almost all the GNU tools understand foo.c.~1~ backups.

However, it has the same problem that the results are largely lacking semantics. For example, looking back through the history of all modifications to a directory, it seems impossible to tell which versions of the source will actually compile correctly, and which were intermediate versions that don't work. If a programmer commits early-and-often to CVS (say), but at least runs the test suite first, then you have in general some guarantee about the internal consistency of any committed version. (It would be even better if CVS versions were module-wide, like in Subversion.)

A magic filesystem is "mere mechanism". I don't think you should be spending so much time on it until you have a good design for the version-control system built on top.

If it turns out that the design "on top" is no better than CVS, then nobody will bother -- people who want neat features will use Bk (or a free clone), and more conservative people will use CVS.

You've said that you need to be able to cope without the filesystem -- why not first implement the version without it, and then put it in as a nicety later?

The same functions can be adequately (perhaps not quite as well) achieved using editor undo, editor backups, or tux2fs.

If the design can sensibly handle many small revisions then it would be easy to have a program called by the editor on save that commits to it. If the design can't handle a huge number of revisions in a sensible way, then it doesn't matter how they get generated.

> I believe you're right, so long as SCM systems stay as clumsy as they are.
> If the archive system was actually easy and transparent to use, then
> programmers would use it as a tool for themselves, as a means of tracking
> multiple projects they're involved in, and trying out experiments. In much
> the same way as we now rely on the undo chain in an editor - I do that, don't
> you? That is, I rely on the editor's undo chain to back me out of failed
> experiments. It gets to the point where I'm reluctant to shut down the
> machine because of all the state saved in the editor's undo chains. Now,
> that's a system that works, but it's got glaring imperfections, beyond the
> fact that the state disappears when the editor shuts down. The editors also
> don't know about each other, and they are incapable of maintaining undo
> chains across different files, let alone projects.

This is the perfect example of why semantic information is necessary. Pressing C-_ repeatedly until it looks about right is error-prone and labour intensive -- more than anything else, this limits the usefulness of editor undo.
For fixing small mistakes it's good, but for backing out of hour-long experiments it seems useless to me. I don't want to say "undo edit" a hundred times; I want to say "back up to before I started working on this feature".

Ideally, I can have several trees around. (Disk is cheap.) Instead of rolling back, just toss that directory tree on the floor so I can find it later if I want to see what it was that I tried.

> Granted, the SCM is also a tool for communication, but much good work has
> already been done there. I think the distributed side of things is well
> known and under control,

I think current SCMs are not nearly as good as they should be. Bk is the only decent distributed one, which is why it's doing so well.

> but today's crop of scm's still suck as development tools. So
> that's where I'm concentrating.

Do you mean they're not very helpful for the individual developer? What kind of thing?

> This was addressed in an earlier post. In the current version, every
> change to each file is recorded (and in order, giving you global undo,
> including undeletes) but when you close the version, the stacked changes are
> collapsed into a single layer of changes for the version. To put it another
> way, the system journals individual changes, but (unless you tell it
> otherwise) only for the current version.

I disagree with this too :-) SCM shouldn't ever throw away information; it should only selectively roll it up for display. Once you've captured a diff it should be kept forever. Seeing the order in which edits within a version were made might possibly be helpful in the future.

For example, consider the case in which a version consists of me taking a patch from somebody, and then fiddling things a bit to make it merge properly. From one point of view, those changes have to go together, since both are necessary to make the program compile again. On the other hand, it would be nice to be able to see the original diff separately.

The more I think about it, the more I think some kind of recursive nesting of versions makes sense. Bk has this, but it enforces a two-level model of changesets, which consist of deltas (which are more or less diffs). But I can imagine a higher-level changeset containing several others, particularly if they're ported or accepted from somebody else.

> > I think it's a
> > bad idea -- although I of course respect you for trying it -- because
> > I think the benefits compared to regular commands don't justify the
> > added complexity and risk.
>
> Somebody from Apple said it well: "you should never have to tell the computer
> something it already knows".

Right, but you shouldn't be afraid to tell the computer things that are pragmatically necessary.

Somewhat off-topic comparison: directory and file names are not really necessary, because you can always search by content. But in practice, with some exceptions, systems that do that have often turned out to be hard to use.

> Check-in and check-out are things the computer can figure out for
> itself.

How? How is the computer meant to know what I was thinking when I made a change? That's what future readers of the code really want to know. It might even be *more* important than the change itself -- this is why ChangeLogs can work in the absence of any other SCM.

I find it's actually good discipline for the programmer too -- it helps them concentrate on doing only one thing at a time.

> Risk... I don't see it. If anything, the risk of a programmer forgetting or
> misapplying a command is greater. I know, I did it myself once :-)
Kernel crashes, down filesystems, etc. If ClearCase is down, you can't do *anything*. If your CVS server is down, you can at least edit and compile locally, and diff against old versions.

> As for complexity, I don't really see that. Difficult, yes, because so far
> nobody has provided a suitable framework on Linux for stacking local
> filesystems.

I agree that would be useful. I just think you have a filesystem-hacker hammer and are trying to apply it to a SCM thumb.

--
Martin

From phillips at bonn-fries.net Sat Jun 15 04:18:54 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Fri, 14 Jun 2002 20:18:54 +0200
Subject: [Prophesy] comments so far
In-Reply-To: 
References: <20020614002851.GD6330@toey.sourcefrog.net>
Message-ID: 

On Friday 14 June 2002 16:09, Daniel Phillips wrote:
> On Friday 14 June 2002 02:28, Martin Pool wrote:

Well, I didn't intend to send the previous post until after having worked out more of the magic filesystem issues, however... the implication is that files under management of the magic filesystem have to have two inodes, one belonging to the magic filesystem and one belonging to the native filesystem. I'm putting down much of this awkwardness to what I'm increasingly seeing as misdesign of the vfs, but cleaning that up is not the immediate project. I'll return to the question of the magic filesystem later.

OK, now the first thing I should say is that I agree with all the features you list below, and what I'm going to do now is speculate about how the current design can support each of them, or what needs to be done to support them.

> > There's a hierarchy:
> >
> > release notes for a new version -- many end-users will read these;
> > they'll include references to bugs fixed

So the database needs to know what's a release note. This is version metadata, since a release is always a version. The question is, do we want to define metadata structure at the database table level, or do we want to just put all version metadata together in a single 'version metadata' record per version and parse it out with xml or some such?

> > list of patches accepted -- every developer probably wants to read
> > this

Meaning the system has to know what the patch is, when accepted, into what version, and so on. What I'd like to do if possible is to carry forward patches as objects from version to version, so that the scm user can apply a patch to version 2.4.16 and remove it, perhaps after it's mutated a little, from version 2.4.19. For now, the most practical way to do this is just keep the patch verbatim in the database (along with the who/when/etc information) and let the user figure out what has to be done to revert it later. Hmm, yes, that's easy, and it's what you want I strongly suspect.

The list of patches applied to a particular version is actually very important. Without it, you don't know what to revert. I've often felt the lack of this kind of information. Anyway, this feature is what BitKeeper would call 'import patch', except that Prophesy is going to remember more about the imported patch than BitKeeper does, will keep the patch in its database, and will let you revert it without having to find the original copy on disk.

> > list of small changes within a patch -- many programmers probably
> > want to read this

Right, so when Prophesy parses out the patch (we don't need to use patch to do this any more, because of the parser I wrote) it will save the patch header as metadata, assuming it's a description.
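At the top level that parsing is simple; a sketch (split_patch is an invented helper name, and I'm assuming the usual unified diff markers):

    import re

    def split_patch(mail_text):
        # Everything before the first unified diff header is treated as
        # the description to be saved as metadata; the rest is the diff
        # itself, archived verbatim as received.
        m = re.search(r'^--- ', mail_text, re.M)
        if not m:
            return mail_text, ''        # no diff found at all
        return mail_text[:m.start()], mail_text[m.start():]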
The Prophesy user can edit this saved description and mark it up so that it can generate a nice-looking listing of patch details (realistically, nobody ever edits these details, but it's nice to know you could).

> > diff for an actual patch -- probably don't need to read it unless
> > I'm actually working in the area

Right, since the actual diff is compressed into the database, the web interface could pull it up for you.

> > Perhaps there are some other levels, but you get the idea. I think
> > the recursive nature is very important. The key job of the SCM system
> > is to help programmers manage the history of development of the
> > project.
> >
> > Just keeping a GNU-style ChangeLog can be pretty useful even without
> > SCM.
> >
> > Autogenerating a NEWS file by pulling out top-level comments would be
> > great, because it's one of the most useful tools to a user or
> > satellite developer.

Yes, here you'd have to convince your submitters to mark up their patches, or you'd have to do it yourself. Taking the email subject line by default would be a good start.

> > Offline operation is crucial. Most projects don't have everybody on a
> > LAN. Open source is inherently distributed. Time costs here will
> > drastically outweigh anything you can do with a database, etc, on the
> > server.

The database is installed and runs locally. Operation is offline by default.

> > Arch makes every download of the product a potential working
> > directory. I don't think it's necessary to keep the entire history in
> > every tarball, but it is perhaps good to keep references that tie the
> > files to their place in history.

That's right, for every repository there's a working directory. The repository database lives in the root of the working directory. By the way, Prophesy is not so rude as to force an additional top level directory on top of the normal top directory as BitKeeper and other systems do.

> > It would, by extension, be nice to allow all downloads to happen over
> > http/ftp,

As with Subversion, distributed access will be provided in the form of an Apache module. Providing an ftp view as well would be very nice.

> > and all submissions to happen by mail to a maintainer. The
> > program should not require any intelligence in the protocol.

Right. We want to integrate Rasmus's patchbot work.

> > People shouldn't need permission to start hacking on a project, and to
> > keep versions locally. They just need permission to commit to the
> > master site.

True, and permission to transmit to the remote site is an entirely different thing, and should be easier to get than permission to commit to the remote site. By the way, there will not be any 'master' site, only remote sites, i.e., Prophesy is peer-to-peer.

> > diffs have this nice property of being intelligible to humans and
> > programs. Keep them. Make minimal changes to handle chmod, mv, etc.

Right, keep the ability to parse them and generate them, but don't use them internally; they're inappropriate for that. Except that Prophesy will archive the diff in its original form, as received. I suppose that for symmetry we should allow diffs to be sent to be archived as well, complete with descriptive comments etc.

> > All other things being equal, files should be directly human-readable.
> > Use diffs. Perhaps make ChangeLogs, or something similar, part of the
> > metadata. (On the other hand, being readable might encourage editing
> > by hand, which would be bad.)

Using diffs internally in the database is out of the question.
They're just not an appropriate currency for the kinds of manipulations Prophesy has to do.

> > Writing new filesystems, diff formats, network protocols, etc is just
> > screwing around.

I agree about the network protocols, but not about the filesystem magic and the internal storage format. Particularly in regards to the latter, look at the research that's been done. There's a reason for it: archive size and efficiency of common operations are very real problems. Not to mention accuracy and power. These things depend very much on the solidity of the foundation on which the superstructure stands.

> > The heart of the problem is to get a good model for
> > *how to do SCM*. You can implement (v1) using existing tools;
> > optimize later if it turns out that your model is correct.

Well actually, by parsing diffs to get the transforms, that's exactly what I'm doing. (And it turns out that doing a proper binary diff isn't that hard.) Python, postgresql, glade, etc., are all 'existing tools'. What other existing tools would you suggest? Not patch. It's much easier and faster to apply database deltas with the already-implemented transform mechanism. Later, when we get to merging, patch or a patch-like thing will be needed, and then we'll probably start with patch and move to something faster/more powerful/more reliable later.

> > Similarly, don't waste time writing GUIs; use emacs, xxdiff, dirdiff,
> > etc. Write one later if it proves correct.

Agreed there. However, once the basic transport mechanism is in place, a gui will follow very shortly afterwards, to show the version tree.

> > If I was starting from scratch, I would consider a typical open source
> > project:
> >
> > - email is key
> >
> > - people mail around patches; perhaps they get revised; eventually
> > they get applied
> >
> > - the NEWS file says "applied patch for foofeature from
> > jhacker at dot.com"

Yes indeed, we can and will automate that.

> > Projects sometimes split off files or subdirectories into other
> > projects; perhaps they diverge slightly. It would be nice to handle
> > this.

Yes, a source tree should be able to inherit files from another project, and Prophesy should treat these files as descending from the same object. Each file object can have its own evolutionary tree, and these trees are not the same or restricted at all by the version tree or project boundaries. Furthermore, we should be able to recognize that one object is identical to another in a remote tree, or had a common ancestor. This touches on the subject of universal object ids, which I mentioned earlier in the archives, and I have not forgotten about it. First things first, though.

> > For rsync and other projects, I keep patches that I have not yet
> > really accepted but that look good in CVS in patches/. A SCM system
> > that managed this would be nice. I think it's a promising model, not
> > a hack.
> >
> > Disk is cheap. Keep everything. But keep it as compactly as you can.

It's not that cheap. I have 7 gig of source on my laptop and several times that on my server. Most of that consists of kernel trees, all slightly different versions, or different projects in them. That's just silly.

> > Networks are getting broader, but latency is not going to go away.
> >
> > Do it in <4000lines. Lions-book Unix was 10kloc, and look how many
> > good ideas they had in there.

I suppose the first useful version will be about that size (4K lines).
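Since I keep waving my hands about the transform mechanism, here is a stripped-down sketch of what applying one amounts to. The (op, arg) pair encoding is invented for this example -- the real thing is a packed operation string -- but the idea is the same:

    # Minimal illustration of applying a transform to an input text.
    # The (op, arg) list encoding here is invented for the example.
    def apply_transform(text, ops):
        out = []
        pos = 0
        for op, arg in ops:
            if op == 'copy':        # take the next arg bytes of the input
                out.append(text[pos:pos+arg])
                pos = pos + arg
            elif op == 'skip':      # drop the next arg bytes of the input
                pos = pos + arg
            elif op == 'emit':      # insert literal replacement text
                out.append(arg)
        return ''.join(out)

    # For example:
    #   apply_transform('hello world', [('copy', 6), ('emit', 'there')])
    # gives 'hello there'.

Applying a stored delta is one linear pass like that, which is why it beats running patch.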
--
Daniel

From phillips at bonn-fries.net Sat Jun 15 05:03:00 2002
From: phillips at bonn-fries.net (Daniel Phillips)
Date: Fri, 14 Jun 2002 21:03:00 +0200
Subject: [Prophesy] comments so far
In-Reply-To: <20020614180558.GA10553@toey.sourcefrog.net>
References: <20020614180558.GA10553@toey.sourcefrog.net>
Message-ID:

On Friday 14 June 2002 20:06, Martin Pool wrote:
> I agree with you about the usefulness of editor undo chains. Under
> emacs, I have kept-new-versions set to about 10, and I regularly use
> C-u C-x C-s to do "keep backup version" and diff-backup. All very
> nice and useful.
>
> A filesystem that kept all versions would allow you to do this in a
> program-neutral way, although I think that's not so important now that
> almost all the GNU tools understand foo.c.~1~ backups.
>
> However, it has the same problem that the results are largely lacking
> semantics. For example, looking back through the history of all
> modifications to a directory, it seems impossible to tell which
> versions of the source will actually compile correctly, and which were
> intermediate versions that don't work.

That we can solve by integrating with the build tool a little. Every successful build marks a milestone in the Prophesy journal (not the same as a version).

> If a programmer commits
> early-and-often to CVS (say), but at least runs the test suite first,
> then you have in general some guarantee about the internal consistency
> of any committed version. (It would be even better if CVS versions
> were module-wide, like in Subversion.)

Could you elaborate on this module-wide property? I must have missed it while examining Subversion.

> A magic filesystem is "mere mechanism". I don't think you should be
> spending so much time on it until you have a good design for the
> version-control system built on top.

I totally disagree. I don't think you can build a tower on a bed of jello. The infrastructure is mere mechanism in the same sense that the operating system is mere mechanism: it defines what you can and can't do with the machine.

> If it turns out that the design "on top" is no better than CVS, then
> nobody will bother -- people who want neat features will use Bk (or a
> free clone), and more conservative people will use CVS.
>
> You've said that you need to be able to cope without the filesystem --
> why not first implement the version without it, and then put it in as
> a nicety later?

Oh absolutely, I've stated that already, earlier in the archives.

> The same functions can be adequately (perhaps not quite as well)
> achieved using editor undo, editor backups, or tux2fs.

Now wait, let's not confuse these things. The magic filesystem only does one thing: sends overwritten text to a userspace daemon to be added to the change database. Well, it notifies creates, deletes and truncates as well, but that's it.

> If the design can sensibly handle many small revisions then it would
> be easy to have a program called by the editor on save that commits to
> it. If the design can't handle a huge number of revisions in a
> sensible way, then it doesn't matter how they get generated.

The current plan is to call out to the editor from Python, which will save the file contents beforehand. This is just for testing.
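Something about as dumb as this, in other words (a test-harness sketch; record_change is a hypothetical hook into the scm, not anything that exists yet):

    import os, shutil

    # Test harness sketch: snapshot the file, run the user's editor on it,
    # then hand both versions to the scm so the change gets journalled.
    def edit(path, record_change):
        snapshot = path + '.orig'           # pre-edit copy of the file
        shutil.copyfile(path, snapshot)
        editor = os.environ.get('EDITOR', 'vi')
        os.system('%s %s' % (editor, path))
        record_change(snapshot, path)       # old version vs. new version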
> > I believe you're right, so long as SCM systems stay as clumsy as they are.
> > If the archive system was actually easy and transparent to use, then
> > programmers would use it as a tool for themselves, as a means of tracking
> > multiple projects they're involved in, and trying out experiments. In much
> > the same way as we now rely on the undo chain in an editor - I do that, don't
> > you? That is, I rely on the editor's undo chain to back me out of failed
> > experiments. It gets to the point where I'm reluctant to shut down the
> > machine because of all the state saved in the editor's undo chains. Now,
> > that's a system that works, but it's got glaring imperfections, beyond the
> > fact that the state disappears when the editor shuts down. The editors also
> > don't know about each other, and they are incapable of maintaining undo
> > chains across different files, let alone projects.
>
> This is the perfect example of why semantic information is necessary.
> Pressing C-_ repeatedly until it looks about right is error-prone and
> labour intensive -- more than anything else, this limits the
> usefulness of editor undo. For fixing small mistakes it's good, but
> for backing out of hour-long experiments it seems useless to me. I
> don't want to say "undo edit" a hundred times; I want to say "back up
> to before I started working on this feature".

Right, unless you forgot to put down any kind of marker before you started the session. We can put down various kinds of markers in the journal to help you be lazy here, including timestamps. Furthermore, we can maintain global undo/redo not as a single chain, but as a tree, like a version tree which only gets pruned when you are absolutely sure you don't want to undo any more.

> Ideally, I can have several trees around. (Disk is cheap.) Instead of
> rolling back, just toss that directory tree on the floor so I can find
> it later if I want to see what it was that I tried.

I don't know about you, but I often end up with trees sitting around and I haven't got a clue what's in them and why they're there. I always keep a clean version of the tree around just for this reason: so I can diff the mysterious tree and find out what's in it. Prophesy should automate this, and in addition, should hold some helpful metadata such as nicely chosen version tags.

> > Granted, the SCM is also a tool for communication, but much good work has
> > already been done there. I think the distributed side of things is well
> > known and under control,
>
> I think current SCMs are not nearly as good as they should be. Bk is
> the only decent distributed one, which is why it's doing so well.

BitKeeper is very strong on the maintainer side, not so strong on the submitter side. This makes sense, as it was pitched to maintainers, and in fact, that's where the big bottlenecks were. I'm interested in doing a better job on the developer side, which seems like virgin territory to me. I mean, how often do you hear the word 'usability' in connection with source code management?

> > but today's crop of scm's still suck as development tools. So
> > that's where I'm concentrating.
>
> Do you mean they're not very helpful for the individual developer?
> What kind of thing?

There is too much fiddling with commands. Every time you want to edit a file you have to remember to check it out, and if you happen to be thinking about an actual problem you were trying to solve at the time the need arose, chances are your thought will vanish as you go through the mechanics of checking out the needed file.
There are other rough spots too, such as BitKeeper's insistence on adding an additional level to the top of your tree. I also find all those SCCS files peppered through my source tree an ugly blemish. Putting a tree under management is an unnecessarily complex project, and you have to submit to a strip search. CVS I won't even get into, nobody uses it locally and you know why.

> > This was addressed earlier in an earlier post. In the current version, every
> > change to each file is recorded (and in order, giving you global undo,
> > including undeletes) but when you close the version, the stacked changes are
> > collapsed into a single layer of changes for the version. To put it another
> > way, the system journals individual changes, but (unless you tell it
> > otherwise) only for the current version.
>
> I disagree with this too :-)
>
> SCM shouldn't ever throw away information; it should only selectively
> roll it up for display. Once you've captured a diff it should be kept
> forever. Seeing the order in which edits within a version were made
> might possibly be helpful in the future.

Sure, your edits can all be written to the journal, and that could even be the default. The journal is not the same as the version tree; in the version tree we want to record only fully collapsed diffs between versions.

> For example, consider the case in which a version consists of me
> taking a patch from somebody, and then fiddling things a bit to make
> it merge properly. From one point of view, those changes have to go
> together, since both are necessary to make the program compile again.
> On the other hand, it would be nice to be able to see the original
> diff separately.

I think what we're going to do is actually compress the diff and store it when you receive it, then make a journal entry when you apply it. Your fiddles are the difference between the version with the diff, and your fiddled version. It's not necessary to record all your detailed edits to find the fiddles, though yes, it would be nice to be able to fall back to that in murky situations.

> The more I think about it, the more I think some kind of recursive
> nesting of versions makes sense. Bk has this, but it enforces a
> two-level model of changesets, which consist of deltas (which are more
> or less diffs.) But I can imagine a higher-level changeset containing
> several others, particularly if they're ported or accepted from
> somebody else.

I've talked previously about 'regions', which are distinct parts that together make up a larger diff. It would make sense to nest such things, and it might be possible to track regions as they evolve through versions. On the other hand, I don't see any obvious way to nest versions themselves.

> > Check-in and check-out are things the computer can figure out for
> > itself.
>
> How?

Prophesy knows you checked out a file, because you edited it. Prophesy knows you checked it in because you closed a version.
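In daemon terms that's nothing deeper than this sketch (the event and journal interfaces here are made up for illustration; all the magic filesystem really supplies is the notifications):

    # Sketch of the userspace daemon's view: 'checkout' is implicit in the
    # first write notification for a file in the open version. The event
    # and journal interfaces are hypothetical.
    def handle_event(journal, event):
        if event.kind == 'write':
            if not journal.is_open(event.file):
                journal.mark_checked_out(event.file)    # implicit checkout
            journal.record(event.file, event.old_text)  # save overwritten text
        elif event.kind in ('create', 'delete', 'truncate'):
            journal.record_namespace_change(event)

Closing the version is then the only explicit step, and that's your check-in.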
> How is the computer meant to know what I was thinking when I made a
> change? That's what future readers of the code really want to know.
> It might even be *more* important than the change itself -- this is
> why ChangeLogs can work in the absence of any other SCM. I find it's
> actually good discipline for the programmer too -- it helps them
> concentrate on doing only one thing at a time.
>
> > Risk... I don't see it. If anything, the risk of a programmer forgetting or
> > misapplying a command is greater. I know, I did it myself once :-)
>
> Kernel crashes, down filesystems, etc.

Journalling filesystem...

> If ClearCase is down, you can't do *anything*. If your CVS server is
> down, you can at least edit and compile locally, and diff against old
> versions.

I suppose you missed the part where all repositories are local, and your source tree is just a normal source tree with a database of diffs hidden in the root.

> > As for complexity, I don't really see that. Difficult, yes, because so far
> > nobody has provided a suitable framework on Linux for stacking local
> > filesystems.
>
> I agree that would be useful. I just think you have a filesystem-hacker
> hammer and are trying to apply it to a SCM thumb.

I think when you see where I'm going with it you will say 'aha'.

--
Daniel

From sfr at canb.auug.org.au Sun Jun 16 12:26:44 2002
From: sfr at canb.auug.org.au (sfr at canb.auug.org.au)
Date: Sun, 16 Jun 2002 12:26:44 +1000 (EST)
Subject: [Prophesy] New user request
Message-ID: <200206160226.g5G2QiOF028821@supreme.pcug.org.au>

Hi,

Do you all know a person whose email address is luckas at musoft.de? Should I let them on the list?

Cheers,
Stephen Rothwell

From mbp at samba.org Wed Jun 19 12:53:41 2002
From: mbp at samba.org (Martin Pool)
Date: Wed, 19 Jun 2002 12:53:41 +1000
Subject: [Prophesy] comments so far
In-Reply-To: <20020614180558.GA10553@toey.sourcefrog.net>; from mbp@samba.org on Sat, Jun 15, 2002 at 04:06:01AM +1000
References: <20020614002851.GD6330@toey.sourcefrog.net> <20020614180558.GA10553@toey.sourcefrog.net>
Message-ID: <20020619125341.G32710@va.samba.org>

If you want to design a userspace filesystem hook, that's fine; if you want to design a SCM system that's fine too (and more interesting to me personally.) If you think that a SCM system ought to be built on top of kernel dnotify hooks then I really have to take issue with you.

In summary:

[1] this turns out to be a real weak point in the biggest known implementation of the design, ClearCase

[2] on general principle, things shouldn't be in the kernel unless they need to be

[3] you're not tackling the real problem

[1] I was looking at a ClearCase installation at a large company earlier on today. Everybody's views (~= working directories) are kept on this machine under /view. Fine.

    cd /view/
    ls -l

Hangs. Foo. strace ls -l shows it looping indefinitely on getdirent() (something like that?). Pressing TAB in bash produces the same effect -- sometimes you have to kill bash and log in again. Very amusing. You don't realize how often you use this until you work on a machine without bash, or on a machine where pressing tab is likely to hang your shell.

Anyhow, so I get a view name from somebody else, type it in carefully, and can see things inside. It is noticeably slower than it ought to be, considering the machine it's stored on (modern PIII or something) -- listing a directory takes a fair fraction of a second.

Of course ClearCase is famous for having enormous hardware requirements, exceeding the cost of a developer's desktop hardware. This is no accident, but rather an essential implication of the design: every file IO, even just creating a short-lived temporary file, has to go to userspace, potentially across the network, into a daemon, and potentially into a database. A large fraction of IO on a working directory will have nothing to do with SCM: it will be, e.g., compilation to a test copy. It's dumb to impose the cost on operations when there will be no benefit.

But it's basically all there, and seems to work well. It seems like ClearCase has some nice features.
One popular one is that there are good X11 and W32 GUIs for all operations. It would be good if free systems had that, but it's really more or less independent of the underlying architecture.

Later on we noticed that one of the build scripts was having trouble removing a temporary directory. Eventually it turned out that a file in a /tmp subdirectory was causing unlink() to return ENOENT, even though the file could be listed, stat'd, and even moved. I suspect ClearCase had somehow corrupted the machine's dcache or something to cause this behaviour. The machine was in other respects pretty standard. Presumably rebooting will "fix" it.

So at this point I say:

- "bloody proprietary kernel modules"
- "bloody unnecessary kernel modules"

(Insert epithet of choice in locales other than en_AU)

Now, of course, all software has bugs, and I guess Rational will either eventually fix this, or explain how it's misconfigured on this machine, or at any rate be interested to see the report which will be passed to them. I don't expect software not to have bugs, but I do think if there are simple design decisions that you can make early on that will reduce the likelihood or severity of bugs, you should do so unless there is a strong counterargument. You can make an argument about open source being less buggy (or not) or Rational being dumb (or not), but I don't think any of them is clearly true. At any rate, ClearCase is more mature than Prophesy is likely to be any time soon.

I've seen bugs in BK; typically they can be resolved by using one of BK's commands to preen a repository or remove leftover locks. It hasn't ever caused random other bad things to happen on unrelated parts of my machine and I wouldn't expect it to.

[2] I think the weight of OS design experience is behind me in saying that things should not be in the kernel unless there is some security, performance, or functionality reason why they have to be there. I realize you only want to put hooks into the kernel, not the whole thing, but ClearCase does that too, and the issues still apply. I don't see anything about SCM that can't be adequately done purely in userspace.

In as much as Daniel is designing a system he wants other people to work on and use, I think the obligation is on him to demonstrate that a kernel dependency is necessary. This is particularly so given [1], that putting it in the kernel has turned out to be a problem in the past. I don't think that justification is impossible, but I'm a long way from being convinced.

I can see a few possible justifications, but I don't think any of them stand up:

"it's transparent"

That's bogus; a CVS working directory and a ClearCase view are both trivially transparent in that you can read and edit files using normal tools, but you need to know magic commands or syntax to actually do anything.

"it avoids having nasty CVS dirs lying around"

It's slightly tidier, but it turns out not to be a real problem. If it bugged you, you could have just one in the top level, or make it a dot file.

"you can auto-detect rename/add/delete"

Handling renames is important, but automatically doing it is somewhat less so. There are several other systems possibly as good:

- magic tokens embedded in the file (arch)
- detecting similar file text (bk)
- explicit notification (pre or post)
- ...

These don't happen often enough that it needs to be completely transparent. "bk mv foo bar" is not significantly harder; learning to type it is trivial by comparison to learning the overall system.
"you can keep intermediate changes" Well, that's nice. But given that you're going to throw them away anyhow, I don't see how it's any better than editor backups or a filesystem with history. I guess I don't see it as essentially part of SCM -- it's related but not the same. Given a tiny command that's run on each save or build you can do this from userspace anyhow. People have tried keeping source in databases before (Zope, VisualAge, various Smalltalks), but in general programmers seem to prefer relatively little magic in their source directories. Even MSVC++ keeps plain files on disk. Having plain files opens up opportunities; magic databases close them off. [3] SCM is a hard problem to define; SCM software more or less maps 1:1 with the author's view of how software development is done or ought to be done. The challenge is to think about SCM differently, or more clearly, than has been done before. Svn have already thought about this more than me. My overall impression is that they want to be a "good enough" replacement for CVS's more gaping holes, which is a good goal. If you're going to write a new system rather than hack on (say) Subversion, then it seems to me that you ought to aim to be better than any existing design on at least one important point. I know people here are talking about that, but I think it needs a lot more work before writing code. I think it's far more important than worrying about kernel hooks. Problems that you ought to be thinking about, in my not-very-humble opinion: * Do you want to support disconnected operation? That sounds like a good idea, even when the systems are not really "disconnected" but just on a modem in another continent. It definitely makes your job harder and more interesting: trivially, when you commit, the version number you generate must be local and not universally authoritative. (cf bk's "keys") There are several levels, from merely being able to edit while disconnected (cvs) to making patches but not sending (diff and mail) to basically everything (bk). * Can you have "threads" of development, where several changes are aimed at fixing the same thing, but they're not committed to a separate branch? * Is this meant for people working in an open source / internet way, or in a small-office way? Or do you aim to handle both? They seem pretty different: at one extreme, people just mail around patches; at the other, people just all work in the same directory. A lot of the literature about "Configuration Management" (capital C, M) is written from a military or enormous-project point of view, which is pretty different from that of open source hackers, and not necessarily better for all problems. * It seems obvious that you want some way of building logical changes that span multiple files. Really? Does it make sense to have two distinct changes to the same file inside this? * Can changesets be nested? * How do you represent accepting a patch from somebody, without losing that patch's internal structure? * If you make a mistake in a commit message, can you go back and change it? In many systems you can't, because that would be "rewriting history". It seems useful though, in some cases, and you can solve it by introducing a meta-history concept. * How do you make all this comprehensible? Can you explain it in a single page to a novice user, and leave the complicated stuff til later? Will they get bitten if they try to work with just a simple understanding? 
* Subdirectories often spin off as child projects (tdb from samba), or they might merge in (experimental architectures joining Linux). Can that be supported in some way better than just copying a snapshot of the files across? Do you want to?

* What does it mean to support the "reviewer" role?

* How do you handle repeated bidirectional merges between parallel streams of development?

* Do you want to tackle the "star-merge" problem handled by arch, where you work out the order of applying multiple patches that is least likely to cause conflicts?

* Does the system need to do anything to help with merging beyond just running something equivalent to diff3 and letting you resolve conflicts by hand?

* Some object files are really hard/slow to produce and so it kind of makes sense to keep them in vc, although they don't really belong. (e.g. files requiring a special toolchain; autoconf output) Can you keep them as second-class citizens to avoid conflicts, etc.?

* Sometimes people want to e.g. check in binaries of released versions, so that they can be exactly restored even if the compiler changes later. What do you think of that?

* Can the SCM play a role in communicating at appropriate levels of detail to various audiences? (Users, potential users, managers, developers, core team, satellite developers, distribution maintainers, release engineers, ...)

* What happens when you're in the middle of changing something and you notice a little bug? You want to fix the bug, but also keep that fix separate from your main commit. Under CVS, you might get a second checkout, fix it there, and merge, but that's slow and a lot of trouble, so people mostly don't bother. It would be nice if they could.

* What about developers who are trusted to commit to one branch, but not to HEAD?

* Lots more questions. This is long enough already, you get the idea.

--
Martin

From mbp at sourcefrog.net Wed Jun 26 15:20:06 2002
From: mbp at sourcefrog.net (Martin Pool)
Date: Wed, 26 Jun 2002 15:20:06 +1000
Subject: [Prophesy] Microsoft's SourceDepot system
Message-ID: <20020626052003.GE11907@toey.sourcefrog.net>

In the spirit of "See Figure One", Microsoft have two source code control systems: one they give to their customers, Visual SourceSafe (which sucks), and one they use themselves, SourceDepot, which is quite interesting. Here are some slides about it:

http://216.239.39.100/search?q=cache:4Y_wlCjY5gAC:www.usenix.org/events/usenix-win2000/invitedtalks/lucovsky.ppt+%22sourcedepot%22&hl=en&ie=UTF-8

The details are apparently quite hard to discover.

--
Martin
http://www.things.org/~jym/fun/see-figure-1.html