[Prophesy] New Prophesy code: undump.c

Mon May 16 17:32:28 EST 2005

Hi all,

Here we go, another 1,500 lines of prophesy code... after two years of 
quietness.  That's how many lines a day?  Hmm, I won't think about that.

This code builds on the earlier diff parsing and transformation code posted 
here, and extends it to parse Tridge's bk dump files.  There is a rather nice 
system for feeding a large disk file into a memory-based parser, while 
providing a guarantee that each parsed item will be available in memory in 
its entirety, which is very helpful for generating the binary diffs from 
unidiffs.  The parse window expands automatically to accommodate parse items 
of unknown size (e.g., a giant diff for a newly created file) and relocates 
memory pointers as necessary if the buffer has to be realloced to a new 
location.

The parse builds a version table and a delta table, and translates each 
unidiff from the dump file into a binary delta.  The binary delta format was 
described in earlier posts here.  The code to generate the binary deltas is 
the same code as posted here earlier.  The higher level dump file parsing 
code is new.  The text deltas are not used for anything yet and are simply 
discarded.  They have not yet been checked for correctness.

The parse also builds a versioned namespace consisting of all possible 
directory trees.  For any repository version, a full directory tree can be 
created in roughly linear time and file lookups proceed in roughly constant 
time.  The versioned tree is also created in linear time, as are all the 
other indexing structures generated as the dump file is processed.

The undump command line accepts a list of version numbers after the dump file 
name.  For each version number, a full directory tree will be listed from the 
repository.  Negative version numbers means "count backwards from the final 
version".  So:

  undump linux-2.5.dump -1

will parse the source-puller dump of Linus's Linux BK repository, then list a 
directory/file tree for the final version (containing somewhat more than 
20,000 files).

The dump file parse is designed to run at nearly disk platter speed.  However 
some planned optimizations of the namespace constructing code are not 
implemented yet, so the parse in fact takes about 50% longer than just 
reading the whole file without doing anything.  There is some experimental 
code that optimizes caching of huge files like Tridge's linux-2.5 dump (3.8 
GBytes) and it does work, but it doesn't make up for some of the lazy n^2 
operations that will soon be eliminated.

You only have to parse one of these bk dumps once, ever, in theory.  In 
practice, since undump does not save the repository information to a working 
format (only loads it) the bk dump file is going to be read quite a few more 
times during the course of fleshing out the prophesy code.  It takes about 3 
minutes to parse the big Linux dump on an unimpressively specced machine, 
which is actually pretty fast.  That might eventually be improved to about 2 
minutes.  Anyway, for smaller bk dumps, it is blindingly fast.

The versioned namespace for all 64K+ versions in the Linux repository takes 
17.5 MBytes, which seems a bargain considering the power it can bring to, 
e.g., an interactive version walking interface.  Versioned text items are 
likely to be much more memory hungry, so we will need to see what has to be 
done there.  Needless to say, I will not fritter away any bytes of main 
memory that are not necessary for near-linear text version generation.

There will be a lot more features and functionality coming over the next few 
weeks.  This is a weekend project, so please do not expect to see anything 
new until next weekend.  Some items upcoming are:

  * Apply the deltas and see if we get the correct files back

  * Save the repository data and version namespace structures to "bizarre"
    format, so the bk dump doesn't have to be re-read

  * Provide a variant that parses the (much smaller) bizarre log format

  * Implement skip deltas to accelerate text version construction

  * Provide a spiffy user interface to the versioned directories

  * Handle other repository attributes besides just text and namespaces

  * Turn the braindead BitKeeper deletes into real deletes

Regards,

Daniel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: prophesy.tgz
Type: application/x-tgz
Size: 14085 bytes
Desc: not available
URL: <http://lists.ozlabs.org/pipermail/prophesy/attachments/20050516/f476caab/attachment.bin>