[Prophesy] String transformations

Thu May 30 08:44:20 EST 2002

I'm still not sure that it's a good idea to build a whole SCM from scratch, 
however, *if* I was going to do that I'd start with some good basic 
operations, combined with some notion of how they fit in with the grand 
scheme.

As far as the grand scheme goes, the idea is that we will maintain a database 
of differences between predecessor versions of files.  We will also maintain 
information about how the files were created, deleted and moved around in the 
source tree, but that's not the subject of today's post.  Instead, I'll focus 
on how the differences between files are to be maintained.  And I'm going to 
get way more detailed than perhaps is justified at this stage, just because I 
feel that way today.  In other words, watch out: bit bashing alert.

I'm looking at each file as a binary string.  To get a precessor version of a 
file, we retrieve some kind of transformation information from the database 
and apply it to the string consisting of the current contents of the file, 
yielding the immediate precessor of the file.  Of course, every SCM does this 
in some way, nonetheless, I feel compelled to invent yet another way of doing 
it.

My philosophy is that these string transformations are exact, and so there 
exist far more efficient and general means of encoding and applying the 
needed transformations than, for example, diff and patch.  Not to mention 
simpler.  So let me jump straight into my detailed design.

We can see the process of transforming of any binary string into any other as 
a sequence of simple operations carried out on an input and output string.  
Working from left to right in the input string, the following three 
operations are sufficient:

   text - append literal text to the destination string
   copy - append bytes from the source to destination string
   skip - skip over bytes in the source string

Additionally, we may wish to optimize the case where a transformation simply 
moves text from one place to another:

   move - append text from an arbitrary location in the input string

To see why the move operation is desirable, consider that without it, moving 
a block of text from one place to another in a file results in the block of 
text needing to be recorded in the database, whereas using the move 
operation, that same block of text can be obtained from the current version 
of the file.  This represents a significant space savings.  Presumeably, the 
transformation strings are to be compressed when stored in the database, but 
since the original version of the file is not encoded in the database, it is 
not possible to rely on compression to avoid encoding the moved block of text 
literally.  My goal, therefore, is to design the transformation encoding in a 
way that works well together with compression.

Each of the above operations takes a count parameter, specifying how many 
bytes to append to the output string.  Additionally, the 'text' operation 
takes 'count' bytes of literal text, and the 'move' operation takes an 
parameter specifying where the text to be copied is located in the input 
string.

Now I need to encode these operations in a way that is hopefully compact and 
easy to evaluate.  I observed that in each case I have a count and an 
operation, so I would like to encode that in a single number.  I also have to 
worry about the range of that number: I'd like to encode most of the 
command/count values in a single byte.  To accomplish this, I introduce an 
additional primitive operation, which supplies some high order bits for the 
count of a following operation:

   high - supply additional high order count bits for a following operation

Since I am a miser with memory, I decided to allocate only two bits for the 
command encoding:

  0 = text
  1 = copy
  2 = skip
  3 = high

Expressed as macros, we have:

   text(n)
   copy(n)
   skip(n)
   high(n)

which generates an operation code by shifting the operation number into the 
high order two bits of parameter 'n', which is thus limited to six bits.  The 
high operation can appear several times in a row, each time an additional six 
bits, with the most significant bits appearing first.

The move operation cannot be accomodated in this scheme, and needs some 
different encoding.  That's ok, the move operation is different anyway in 
that it needs two numeric parameters.  Fortunately, there are a number of 
possible operations that are no-ops when the numeric parameter is zero, and 
these are thus available for use as escape codes.  My plan is to encode the 
move operation - which is only an optimization - as a triple:

  copy(0), copy(position), copy(count)

That's enough on 'move' for now.  I did not implement it, but I did 
accomodate it in the design.

Additionally, text(0) indicates the end of the transformation sequence.  That 
leaves two escape codes, skip(0) and high(0), for future expansion, the 
latter being available only when it appears in leading position, since 
high(0) may well appear in the less significant bytes of a large numeric 
parameter.

Leaving out 'move', we end up with a simple implementation:

void transform(uchar *ops, uchar *src, uchar *dst)
{
	unsigned c, count = 0;

	while ((c = *ops++)) {
		count = (count << 6) | (c & 0x3f);
		switch (c >> 6) {
		case textop:
			memcpy(dst, ops, count);
			dst +=count;
			ops += count;
			count = 0;
			break;
		case copyop:
			memcpy(dst, src, count);
			dst +=count;
		case skipop:
			src +=count;
			count = 0;
			break;
		}
	}
}

The 'ops' string controls the transformation of 'src' into 'dst'.  For 
example, the ops string:

   copy(2), skip(4), text(6), "foobar", copy(5), 0

Transforms the input string:

   "I love lucy"

into the output string

   "I foobar lucy"

(Note that we could easily express the above in terms of stream IO 
operations, since all operations are sequential.  However, it's doubtful 
whether there is any need to do that on modern machines, and in any event, 
the move operation would present something of a problem.)

Given an operation string, we can compute the length of both the input and 
output strings, as follows:

struct transinfo {int in; int out;} transcheck(uchar *ops)
{
	unsigned c, count = 0, ilen = 0, olen = 0;

	while ((c = *ops++)) {
		count = (count << 6) | (c & 0x3f);
		switch (c >> 6) {
		case textop:
			olen +=count;
			ops += count;
			count = 0;
			break;
		case copyop:
			olen +=count;
		case skipop:
			ilen +=count;
			count = 0;
			break;
		}
	}
	return (struct transinfo) {ilen, olen};
}

This function ought to take the length of the operation string as a parameter 
as well, and ensure that the termination of the sequence occurs at exactly 
that length.

Given an input string and a transformation string, we can compute the inverse 
transformation string that converts the resulting output string back to the 
input string.  This interesting exercise is left to the reader ;-)

Demonstration code attached.

-- 
Daniel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: transform.c
Type: text/x-c
Size: 1384 bytes
Desc: not available
URL: <http://lists.ozlabs.org/pipermail/prophesy/attachments/20020530/a437a35b/attachment.bin>