[Prophesy] Repository format

Sat Nov 19 08:25:15 EST 2005

Here is a cut and paste from the (updated) source text, now called untridge.c.
The repository format is evolving, but it would seem, in a monotonic forward
direction.  After mulling over how to add versioned text to the existing
versioned namespace, I basically reinvented the concept of weave, but with
twists, see below.  This is in process of implementation.

The notion of multiple births appeared recently, mainly as a way to avoid
allowing multiple parents in the version graph, which would have a fairly
horrible impact on version generation efficiency.

Regards,

Daniel

/*
  * My terminology:
  *   node -> a unique object in the repository store.  The low order bit of a
  *     node id specifies dict or text and the high order bits are the node
  *     number.
  *
  *   dict node -> a versioned directory object (a weave-like thingy)
  *   text node -> a versioned file object (also a weave-like thingy)
  *
  * A repository is a tree of dict nodes with text nodes at the leaves.  Each
  * node (dict or text) carries its own, complete history.  In other words,
  * each node is a versioned object.
  *
  * The repository is much like a traditional filesystem.  The correspondence is:
  *   text node -> file
  *   dict node -> directory
  *   node number -> inode number
  *
  * Dict node format:
  *   A dict node is an unordered collection of dict entries.  Each entry has a
  *   node id, a version in which the entry was created and N versions in which it
  *   died.  Note the similarity to the text node format below.  An entry can die
  *   by being unlinked (delete or move) or by being renamed.
  *
  * Text node format:
  *   A text node is a sequence of text fragments.  Each text fragment points
  *   to a region of the base text, which is initially the exact file text.  As
  *   new fragments are inserted, new fragment text is appended to the base
  *   text.  Each fragment gives the version at which the fragment was first
  *   inserted and the N (0 or more) versions where the fragment was deleted.
  *   In effect, the birth and N deaths specify a subset of the repository
  *   version graph.  We generate the plaintext for a given version by
  *   determining, for each fragment, whether the text fragment is live in the
  *   version we are generating.  If so, we append the fragment text to the
  *   output text and go on to the next fragment.  To see if a given fragment
  *   is live (included in) a given version, we check that the fragment birth
  *   and none of its deaths are ancestors of the current version.
  *
  * Improved text fragment format:
  *   In effect, each text fragment carries its own, complete history.  But if we
  *   don't do something special, then when we merge we will have to create a new
  *   fragment at the merge point because the merge source is not an ancestor of
  *   the merge result.  It then becomes messy to match up the old and new,
  *   identical fragments to learn that a given fragment actually came from a
  *   merge source, not the merge point.  In traditional weave format this
  *   problem is solved by allowing multiple version parents, implicitly
  *   including all changes from the forked version(s) in the merge result.
  *   Unwanted text in the resulting "mash-up" is then deleted.  I don't want to
  *   use this approach because I do not know any easy way to write an O(1)
  *   ancestry test if multiple parents are allowed.  Furthermore, there could
  *   be many parents of a merge point, or the merge may be a cherry-pick,
  *   either of which could require many deletions.  Instead I allow a fragment
  *   to have multiple births.  When we merge, we add a new birth at the merge
  *   point.  The first birth is the real birth of the fragment, other births
  *   are there to compute fragment liveness efficiently.  The liveness test is
  *   now: any of the fragment births and none of its deaths are ancestors of
  *   the current version.
  *
  * Future things:
  *  - It would be easy to generalize the versioned tree to multiple roots, i.e.,
  *    multiple directory trees possibly sharing files via hardlink or symlink.
  *  - versioned node attributes
  *     - for either dict or text nodes
  *     - fixed size and variable sized are separate cases
  *     - standard attributes (e.g., permissions, text) are separate case
  *       from user defined
  *  - Symlinks
  *     - just specially flagged text nodes
  */