[Prophesy] New Prophesy code: undump.c
Daniel Phillips
phillips at istop.com
Mon May 16 17:32:28 EST 2005
Hi all,
Here we go, another 1,500 lines of prophesy code... after two years of
quietness. That's how many lines a day? Hmm, I won't think about that.
This code builds on the earlier diff parsing and transformation code posted
here, and extends it to parse Tridge's bk dump files. There is a rather nice
system for feeding a large disk file into a memory-based parser, while
providing a guarantee that each parsed item will be available in memory in
its entirety, which is very helpful for generating the binary diffs from
unidiffs. The parse window expands automatically to accommodate parse items
of unknown size (e.g., a giant diff for a newly created file) and relocates
memory pointers as necessary if the buffer has to be realloced to a new
location.
The parse builds a version table and a delta table, and translates each
unidiff from the dump file into a binary delta. The binary delta format was
described in earlier posts here. The code to generate the binary deltas is
the same code as posted here earlier. The higher level dump file parsing
code is new. The text deltas are not used for anything yet and are simply
discarded. They have not yet been checked for correctness.
The parse also builds a versioned namespace consisting of all possible
directory trees. For any repository version, a full directory tree can be
created in roughly linear time and file lookups proceed in roughly constant
time. The versioned tree is also created in linear time, as are all the
other indexing structures generated as the dump file is processed.
The undump command line accepts a list of version numbers after the dump file
name. For each version number, a full directory tree will be listed from the
repository. Negative version numbers means "count backwards from the final
version". So:
undump linux-2.5.dump -1
will parse the source-puller dump of Linus's Linux BK repository, then list a
directory/file tree for the final version (containing somewhat more than
20,000 files).
The dump file parse is designed to run at nearly disk platter speed. However
some planned optimizations of the namespace constructing code are not
implemented yet, so the parse in fact takes about 50% longer than just
reading the whole file without doing anything. There is some experimental
code that optimizes caching of huge files like Tridge's linux-2.5 dump (3.8
GBytes) and it does work, but it doesn't make up for some of the lazy n^2
operations that will soon be eliminated.
You only have to parse one of these bk dumps once, ever, in theory. In
practice, since undump does not save the repository information to a working
format (only loads it) the bk dump file is going to be read quite a few more
times during the course of fleshing out the prophesy code. It takes about 3
minutes to parse the big Linux dump on an unimpressively specced machine,
which is actually pretty fast. That might eventually be improved to about 2
minutes. Anyway, for smaller bk dumps, it is blindingly fast.
The versioned namespace for all 64K+ versions in the Linux repository takes
17.5 MBytes, which seems a bargain considering the power it can bring to,
e.g., an interactive version walking interface. Versioned text items are
likely to be much more memory hungry, so we will need to see what has to be
done there. Needless to say, I will not fritter away any bytes of main
memory that are not necessary for near-linear text version generation.
There will be a lot more features and functionality coming over the next few
weeks. This is a weekend project, so please do not expect to see anything
new until next weekend. Some items upcoming are:
* Apply the deltas and see if we get the correct files back
* Save the repository data and version namespace structures to "bizarre"
format, so the bk dump doesn't have to be re-read
* Provide a variant that parses the (much smaller) bizarre log format
* Implement skip deltas to accelerate text version construction
* Provide a spiffy user interface to the versioned directories
* Handle other repository attributes besides just text and namespaces
* Turn the braindead BitKeeper deletes into real deletes
Regards,
Daniel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: prophesy.tgz
Type: application/x-tgz
Size: 14085 bytes
Desc: not available
URL: <http://lists.ozlabs.org/pipermail/prophesy/attachments/20050516/f476caab/attachment.bin>
More information about the Prophesy
mailing list