From phillips at istop.com Mon May 16 13:52:10 2005 From: phillips at istop.com (Daniel Phillips) Date: Sun, 15 May 2005 23:52:10 -0400 Subject: [Prophesy] Part II: Rip van Winkel Wakes up Message-ID: <200505152352.10334.phillips@istop.com> Hi, anybody who happens to still be subscribed to this mailing list! I have promised that there would be more activity here at some point, and now there is. As everybody knows, the big news is that Bitkeeper has been exorcised from the Linux kernel development toolchain. That is the good news; the bad news is: this happened on very short notice, so we only have a month or so more to get a BitKeeper replacement in place. Well the fact is, several new tools - Git, bzr-ng, Mercurial - are nearly in place already. In the style of true troupers, these tools are already being rushed into use without being 100% ready. The standard process of fixing everything up on the fly to more-or-less work is progressing well. The part that Prophesy can play in all this is, there is all that valuable versioning history locked up in now-dead Bitkeeper repositories. We would like to liberate that, and use it. Tridge kindly provided an ascii database dumper for BitKeeper repositories, and now it seems that Prophesy just may be able to provide a useful link between Tridge's dump format and the native formats of other open source SCM systems. See the next message for a major update of the Prophesy code in the form of a parser for Tridge's BK dumps, coupled to a basic versioned namespace manager. Regards, Daniel From phillips at istop.com Mon May 16 17:32:28 2005 From: phillips at istop.com (Daniel Phillips) Date: Mon, 16 May 2005 03:32:28 -0400 Subject: [Prophesy] New Prophesy code: undump.c Message-ID: <200505160332.28978.phillips@istop.com> Hi all, Here we go, another 1,500 lines of prophesy code... after two years of quietness. That's how many lines a day? Hmm, I won't think about that. This code builds on the earlier diff parsing and transformation code posted here, and extends it to parse Tridge's bk dump files. There is a rather nice system for feeding a large disk file into a memory-based parser, while providing a guarantee that each parsed item will be available in memory in its entirety, which is very helpful for generating the binary diffs from unidiffs. The parse window expands automatically to accommodate parse items of unknown size (e.g., a giant diff for a newly created file) and relocates memory pointers as necessary if the buffer has to be realloced to a new location. The parse builds a version table and a delta table, and translates each unidiff from the dump file into a binary delta. The binary delta format was described in earlier posts here. The code to generate the binary deltas is the same code as posted here earlier. The higher level dump file parsing code is new. The text deltas are not used for anything yet and are simply discarded. They have not yet been checked for correctness. The parse also builds a versioned namespace consisting of all possible directory trees. For any repository version, a full directory tree can be created in roughly linear time and file lookups proceed in roughly constant time. The versioned tree is also created in linear time, as are all the other indexing structures generated as the dump file is processed. The undump command line accepts a list of version numbers after the dump file name. For each version number, a full directory tree will be listed from the repository. Negative version numbers means "count backwards from the final version". So: undump linux-2.5.dump -1 will parse the source-puller dump of Linus's Linux BK repository, then list a directory/file tree for the final version (containing somewhat more than 20,000 files). The dump file parse is designed to run at nearly disk platter speed. However some planned optimizations of the namespace constructing code are not implemented yet, so the parse in fact takes about 50% longer than just reading the whole file without doing anything. There is some experimental code that optimizes caching of huge files like Tridge's linux-2.5 dump (3.8 GBytes) and it does work, but it doesn't make up for some of the lazy n^2 operations that will soon be eliminated. You only have to parse one of these bk dumps once, ever, in theory. In practice, since undump does not save the repository information to a working format (only loads it) the bk dump file is going to be read quite a few more times during the course of fleshing out the prophesy code. It takes about 3 minutes to parse the big Linux dump on an unimpressively specced machine, which is actually pretty fast. That might eventually be improved to about 2 minutes. Anyway, for smaller bk dumps, it is blindingly fast. The versioned namespace for all 64K+ versions in the Linux repository takes 17.5 MBytes, which seems a bargain considering the power it can bring to, e.g., an interactive version walking interface. Versioned text items are likely to be much more memory hungry, so we will need to see what has to be done there. Needless to say, I will not fritter away any bytes of main memory that are not necessary for near-linear text version generation. There will be a lot more features and functionality coming over the next few weeks. This is a weekend project, so please do not expect to see anything new until next weekend. Some items upcoming are: * Apply the deltas and see if we get the correct files back * Save the repository data and version namespace structures to "bizarre" format, so the bk dump doesn't have to be re-read * Provide a variant that parses the (much smaller) bizarre log format * Implement skip deltas to accelerate text version construction * Provide a spiffy user interface to the versioned directories * Handle other repository attributes besides just text and namespaces * Turn the braindead BitKeeper deletes into real deletes Regards, Daniel -------------- next part -------------- A non-text attachment was scrubbed... Name: prophesy.tgz Type: application/x-tgz Size: 14085 bytes Desc: not available URL: