[Prophesy] Diff to transform converter

Sat Jun 8 00:17:33 EST 2002

Here is a first cut at a uninified-diff to transform converter.  I considered 
using a parsing or lexing tool to do this, but in the end I decided to roll 
my own parser, as the diff syntax is quite simple.  I took a look at the 
Posix regex package but was disappointed to learn that it can only deal with 
null-terminated strings, which severely limits its usefulness.  The scanf 
library function, on the other hand, turned out to integrate quite well with 
my little parser.  I suppose that is because the diff syntax was originally 
constructed to be friendly to scanf.  Scanf is used to parse and convert the 
chunk line numbers.

The diff syntax proved to be context-free, that is, it's not necessary to 
decode any of the line numbers in order to complete the parse.  Furthermore, 
only the input line number of each chunk is needed to generate the transform 
output.  This fact certainly wasn't obvious when I started.

The parser itself is a state/transition machine, which is what all the gotos 
are about.  To hide the parsed text behind a stream abstraction and make the 
parser concise and readable, three helper macros are used:

  next_ch() - get the next character to parse, returning -1 if at end
  next_is(c) - return true if the next character is c
  skip_to(c) - skip ahead to c or end and return true if found

These macros assume the variables 'string' and 'limit' are within scope, 
defining the limits of the text to parse.  The macros make use of a pair of 
inlines, _next_is and _skip_to, which properly parameterize all the required 
state, so that no static variables are used in the parser itself, so that the 
end result is thread-safe.  (Note: here we see C at its weakest.  If it were 
possible to define the helper functions within the scope of the parser, no 
macros would be needed and no state would have to be passed.)

In the end, the parser itself, complete with a handle of helper functions, 
was the smaller part of the project - more than half the code is devoted to 
generating the transform codes.  

Code generation is slightly complicated by a mechanism for merging successive 
operation of the same type together.  This in turn requires the literal text 
for text_op operations to be identified in one place, and used in another, 
which introduces a new array to check for overflow and expand as necessary, 
etc.  Details like this tend to bulk up code quickly.

While the parser itself is thread-safe, i.e., it uses no static variables, 
the code generator isn't.  It uses a number of static variables and two 
arrays, mostly concerned with implementing the opcode merging.  This needs to 
be cleaned up at some point by encapsulating all state in a single struct to 
be shared by the parser and code generator.  The code generator itself will 
not embed nicely inside the parser because it is called from two places, one 
to output opcodes and the other at the end of the parse to flush out the 
final, pending opcode.

With the helper functions not inlined and gcc optimization level 2, the 
parser and code generator come in at less than 2K, a result that warms the 
heart of an old code mizer like me.  Inlining only adds another 100 bytes or 
so, so of course we will do it, to get the performance.

It's pretty much impossible to debug a parser and code generator like this 
without tracing output, and I have used my usual technique for that.  There 
are three macros that can be used to wrap tracing output statements: trace_on 
and trace_off, and a further macro, trace, which is defined as one or the 
other, depending on whether you want tracing output or not.  It would be 
foolish to assume that no further work needs to be done on the parser, so all 
the tracing code has been left in for now.

There needs to be more error checking.  As it stands, this code should 
perform its job correctly, on the assumption that the diff text is always 
correct.  It would of course be foolish to assume this.  Some of the 
redundant information in the diff can be used for a crosscheck:

  - The number of copied and skipped lines in each chunk should match
    the chunk's specified input line count

  - The number of copied and added lines in each chunk should match the
    specified chunk's output line count.

  - The current output line should be tracked and checked against the
    chunk's output line number

  - Copied and skipped text in the diff should be checked to ensure it
    matches the corresponding input text

  - Added text could possibly be checked against the original target
    text, but since the target text is not required for any other
    purpose, it makes more sense just to test-apply the generated
    transform to the input text, then ensure it matches the original
    target text

As suggested above, whenever we generate a transform we want to test-apply it 
to ensure it does in fact have the same result as applying the diff does, 
that is, it correctly generates the diff target given the input text and the 
transform.

Array overflow checking needs to be added, complete with automatic expansion 
of the arrays as needed.

The currect code does not take advantage of the move operation, which I noted 
earlier, is there so that text that is merely moved (or copied) from place to 
place in a string does not have to be encoded literally - it can always be 
taken from the input string when a transform needs to be applied.  This is 
merely an optimization, and a difficult one at that.  There are other tasks 
of more immediate importance.

In fact, the whole process of converting a diff to a transform is just a 
shortcut so that we can start loading the database with string differences 
without getting bogged down in the details of identifying which sections of 
text have changed and which have not.  Eventually, we do want to go to the 
effort of implementing a custom algorithm to do this, for a number of reasons:

  - It will be faster and (probably) more reliable than diff

  - It will work at a resolution of less than a line, so that in the
    common case where only a single name has changed in a line the
    redundant, invariant context will not be recorded.

  - It will handle binary files just as well as ascii text

Now I should also address the question: why not just use diff, and avoid all 
the trouble of implementing these transform things?  Well:

  - Transforms are considerably more compact than diff files.  For
    example, context and skipped lines from the diff are not encoded
    (or needed) in the transform.

  - Applying a transform will be significantly faster than applying a
    diff via patch.  This is an operation we will make heavy use of as
    the repository operations become more sophisticated.

  - The transform code is far less complex than patch and other diff
    utilities, and hence, correspondingly more trustworthy

  - We will probably want to handle more than one kind of diff syntax,
    and transforms provide a common storage format.

  - Transforms, because of their simpler structure, are much more
    suitable for calculus-type operations, such as composition.

A diff can do one thing that a transform cannot do: a diff can be applied in 
a fuzzy way, that is, the patched text does not have to be exactly the same 
as the original target text from which the diff was generated.  However, we 
don't require this property just now, since we only want to represent exact 
differences in the database.  Anything else is an error.  When we do get to 
the point of handling fuzzy problems like merging, we will need to build some 
more tools for the purpose.  We will not make the mistake of attempting to 
use a single tool such as diff for two different purposes, neither of which 
it is ideally suited to.

Now that I have broken the back of this biggish chunk of work, it's time to 
contemplate what else needs to be done to get to the point of having some 
minimally functional repository manager to play with:

  - Finish up this work by wrapping it with some test code to create
    diffs and cross check against the generated transforms

  - Wrap the C code as a Python library:

      http://www.python.org/doc/current/api/api.html

  - Think more about the details of the magic filesystem.  This won't
    have to be a general purpose stackable filesystem, it just has to 
    interface to userland in such a way that file text can be saved
    before being overwritten, and compared to the changed result when
    a file is closed.

  - Flesh out some more database format structural detail, so that
    filenames and directory structure can be tracked and simple
    metadata such as comments version names can be recorded

  - Do a little work to improve the python database interface class
    for record writing, taking advantage of Postres's "copy file" table
    loading command (which can be easily emulated with less efficient
    operations for other databases that don't have it).

As I mentioned previously, the magic filesystem interface doesn't necessarily 
have to exist before the system can be used: a simple workaround is to run 
the editing commands from a Python shell, with wrappers to run the required 
database operations.

The attached code demonstrates the conversion of a simple diff text into a 
transform.  It's all set to compile and run, with tracing on.  In the trace 
output, a character followed by ',' was read by skip_to and a character 
followed by '?' was tested by next_is.  Parse states are printed out at the 
beginning of a line, and generated operations at the end.  Finally, the 
generated transform is printed byte by byte in hex.

-- 
Daniel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: transform.c
Type: text/x-c
Size: 6867 bytes
Desc: not available
URL: <http://lists.ozlabs.org/pipermail/prophesy/attachments/20020607/c371fe1a/attachment.bin>