[Prophesy] String transformations
Daniel Phillips
phillips at bonn-fries.net
Thu May 30 08:44:20 EST 2002
I'm still not sure that it's a good idea to build a whole SCM from scratch,
however, *if* I was going to do that I'd start with some good basic
operations, combined with some notion of how they fit in with the grand
scheme.
As far as the grand scheme goes, the idea is that we will maintain a database
of differences between predecessor versions of files. We will also maintain
information about how the files were created, deleted and moved around in the
source tree, but that's not the subject of today's post. Instead, I'll focus
on how the differences between files are to be maintained. And I'm going to
get way more detailed than perhaps is justified at this stage, just because I
feel that way today. In other words, watch out: bit bashing alert.
I'm looking at each file as a binary string. To get a precessor version of a
file, we retrieve some kind of transformation information from the database
and apply it to the string consisting of the current contents of the file,
yielding the immediate precessor of the file. Of course, every SCM does this
in some way, nonetheless, I feel compelled to invent yet another way of doing
it.
My philosophy is that these string transformations are exact, and so there
exist far more efficient and general means of encoding and applying the
needed transformations than, for example, diff and patch. Not to mention
simpler. So let me jump straight into my detailed design.
We can see the process of transforming of any binary string into any other as
a sequence of simple operations carried out on an input and output string.
Working from left to right in the input string, the following three
operations are sufficient:
text - append literal text to the destination string
copy - append bytes from the source to destination string
skip - skip over bytes in the source string
Additionally, we may wish to optimize the case where a transformation simply
moves text from one place to another:
move - append text from an arbitrary location in the input string
To see why the move operation is desirable, consider that without it, moving
a block of text from one place to another in a file results in the block of
text needing to be recorded in the database, whereas using the move
operation, that same block of text can be obtained from the current version
of the file. This represents a significant space savings. Presumeably, the
transformation strings are to be compressed when stored in the database, but
since the original version of the file is not encoded in the database, it is
not possible to rely on compression to avoid encoding the moved block of text
literally. My goal, therefore, is to design the transformation encoding in a
way that works well together with compression.
Each of the above operations takes a count parameter, specifying how many
bytes to append to the output string. Additionally, the 'text' operation
takes 'count' bytes of literal text, and the 'move' operation takes an
parameter specifying where the text to be copied is located in the input
string.
Now I need to encode these operations in a way that is hopefully compact and
easy to evaluate. I observed that in each case I have a count and an
operation, so I would like to encode that in a single number. I also have to
worry about the range of that number: I'd like to encode most of the
command/count values in a single byte. To accomplish this, I introduce an
additional primitive operation, which supplies some high order bits for the
count of a following operation:
high - supply additional high order count bits for a following operation
Since I am a miser with memory, I decided to allocate only two bits for the
command encoding:
0 = text
1 = copy
2 = skip
3 = high
Expressed as macros, we have:
text(n)
copy(n)
skip(n)
high(n)
which generates an operation code by shifting the operation number into the
high order two bits of parameter 'n', which is thus limited to six bits. The
high operation can appear several times in a row, each time an additional six
bits, with the most significant bits appearing first.
The move operation cannot be accomodated in this scheme, and needs some
different encoding. That's ok, the move operation is different anyway in
that it needs two numeric parameters. Fortunately, there are a number of
possible operations that are no-ops when the numeric parameter is zero, and
these are thus available for use as escape codes. My plan is to encode the
move operation - which is only an optimization - as a triple:
copy(0), copy(position), copy(count)
That's enough on 'move' for now. I did not implement it, but I did
accomodate it in the design.
Additionally, text(0) indicates the end of the transformation sequence. That
leaves two escape codes, skip(0) and high(0), for future expansion, the
latter being available only when it appears in leading position, since
high(0) may well appear in the less significant bytes of a large numeric
parameter.
Leaving out 'move', we end up with a simple implementation:
void transform(uchar *ops, uchar *src, uchar *dst)
{
unsigned c, count = 0;
while ((c = *ops++)) {
count = (count << 6) | (c & 0x3f);
switch (c >> 6) {
case textop:
memcpy(dst, ops, count);
dst +=count;
ops += count;
count = 0;
break;
case copyop:
memcpy(dst, src, count);
dst +=count;
case skipop:
src +=count;
count = 0;
break;
}
}
}
The 'ops' string controls the transformation of 'src' into 'dst'. For
example, the ops string:
copy(2), skip(4), text(6), "foobar", copy(5), 0
Transforms the input string:
"I love lucy"
into the output string
"I foobar lucy"
(Note that we could easily express the above in terms of stream IO
operations, since all operations are sequential. However, it's doubtful
whether there is any need to do that on modern machines, and in any event,
the move operation would present something of a problem.)
Given an operation string, we can compute the length of both the input and
output strings, as follows:
struct transinfo {int in; int out;} transcheck(uchar *ops)
{
unsigned c, count = 0, ilen = 0, olen = 0;
while ((c = *ops++)) {
count = (count << 6) | (c & 0x3f);
switch (c >> 6) {
case textop:
olen +=count;
ops += count;
count = 0;
break;
case copyop:
olen +=count;
case skipop:
ilen +=count;
count = 0;
break;
}
}
return (struct transinfo) {ilen, olen};
}
This function ought to take the length of the operation string as a parameter
as well, and ensure that the termination of the sequence occurs at exactly
that length.
Given an input string and a transformation string, we can compute the inverse
transformation string that converts the resulting output string back to the
input string. This interesting exercise is left to the reader ;-)
Demonstration code attached.
--
Daniel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: transform.c
Type: text/x-c
Size: 1384 bytes
Desc: not available
URL: <http://lists.ozlabs.org/pipermail/prophesy/attachments/20020530/a437a35b/attachment.bin>
More information about the Prophesy
mailing list