[Prophesy] Diff to transform converter
Daniel Phillips
phillips at bonn-fries.net
Sat Jun 8 00:17:33 EST 2002
Here is a first cut at a uninified-diff to transform converter. I considered
using a parsing or lexing tool to do this, but in the end I decided to roll
my own parser, as the diff syntax is quite simple. I took a look at the
Posix regex package but was disappointed to learn that it can only deal with
null-terminated strings, which severely limits its usefulness. The scanf
library function, on the other hand, turned out to integrate quite well with
my little parser. I suppose that is because the diff syntax was originally
constructed to be friendly to scanf. Scanf is used to parse and convert the
chunk line numbers.
The diff syntax proved to be context-free, that is, it's not necessary to
decode any of the line numbers in order to complete the parse. Furthermore,
only the input line number of each chunk is needed to generate the transform
output. This fact certainly wasn't obvious when I started.
The parser itself is a state/transition machine, which is what all the gotos
are about. To hide the parsed text behind a stream abstraction and make the
parser concise and readable, three helper macros are used:
next_ch() - get the next character to parse, returning -1 if at end
next_is(c) - return true if the next character is c
skip_to(c) - skip ahead to c or end and return true if found
These macros assume the variables 'string' and 'limit' are within scope,
defining the limits of the text to parse. The macros make use of a pair of
inlines, _next_is and _skip_to, which properly parameterize all the required
state, so that no static variables are used in the parser itself, so that the
end result is thread-safe. (Note: here we see C at its weakest. If it were
possible to define the helper functions within the scope of the parser, no
macros would be needed and no state would have to be passed.)
In the end, the parser itself, complete with a handle of helper functions,
was the smaller part of the project - more than half the code is devoted to
generating the transform codes.
Code generation is slightly complicated by a mechanism for merging successive
operation of the same type together. This in turn requires the literal text
for text_op operations to be identified in one place, and used in another,
which introduces a new array to check for overflow and expand as necessary,
etc. Details like this tend to bulk up code quickly.
While the parser itself is thread-safe, i.e., it uses no static variables,
the code generator isn't. It uses a number of static variables and two
arrays, mostly concerned with implementing the opcode merging. This needs to
be cleaned up at some point by encapsulating all state in a single struct to
be shared by the parser and code generator. The code generator itself will
not embed nicely inside the parser because it is called from two places, one
to output opcodes and the other at the end of the parse to flush out the
final, pending opcode.
With the helper functions not inlined and gcc optimization level 2, the
parser and code generator come in at less than 2K, a result that warms the
heart of an old code mizer like me. Inlining only adds another 100 bytes or
so, so of course we will do it, to get the performance.
It's pretty much impossible to debug a parser and code generator like this
without tracing output, and I have used my usual technique for that. There
are three macros that can be used to wrap tracing output statements: trace_on
and trace_off, and a further macro, trace, which is defined as one or the
other, depending on whether you want tracing output or not. It would be
foolish to assume that no further work needs to be done on the parser, so all
the tracing code has been left in for now.
There needs to be more error checking. As it stands, this code should
perform its job correctly, on the assumption that the diff text is always
correct. It would of course be foolish to assume this. Some of the
redundant information in the diff can be used for a crosscheck:
- The number of copied and skipped lines in each chunk should match
the chunk's specified input line count
- The number of copied and added lines in each chunk should match the
specified chunk's output line count.
- The current output line should be tracked and checked against the
chunk's output line number
- Copied and skipped text in the diff should be checked to ensure it
matches the corresponding input text
- Added text could possibly be checked against the original target
text, but since the target text is not required for any other
purpose, it makes more sense just to test-apply the generated
transform to the input text, then ensure it matches the original
target text
As suggested above, whenever we generate a transform we want to test-apply it
to ensure it does in fact have the same result as applying the diff does,
that is, it correctly generates the diff target given the input text and the
transform.
Array overflow checking needs to be added, complete with automatic expansion
of the arrays as needed.
The currect code does not take advantage of the move operation, which I noted
earlier, is there so that text that is merely moved (or copied) from place to
place in a string does not have to be encoded literally - it can always be
taken from the input string when a transform needs to be applied. This is
merely an optimization, and a difficult one at that. There are other tasks
of more immediate importance.
In fact, the whole process of converting a diff to a transform is just a
shortcut so that we can start loading the database with string differences
without getting bogged down in the details of identifying which sections of
text have changed and which have not. Eventually, we do want to go to the
effort of implementing a custom algorithm to do this, for a number of reasons:
- It will be faster and (probably) more reliable than diff
- It will work at a resolution of less than a line, so that in the
common case where only a single name has changed in a line the
redundant, invariant context will not be recorded.
- It will handle binary files just as well as ascii text
Now I should also address the question: why not just use diff, and avoid all
the trouble of implementing these transform things? Well:
- Transforms are considerably more compact than diff files. For
example, context and skipped lines from the diff are not encoded
(or needed) in the transform.
- Applying a transform will be significantly faster than applying a
diff via patch. This is an operation we will make heavy use of as
the repository operations become more sophisticated.
- The transform code is far less complex than patch and other diff
utilities, and hence, correspondingly more trustworthy
- We will probably want to handle more than one kind of diff syntax,
and transforms provide a common storage format.
- Transforms, because of their simpler structure, are much more
suitable for calculus-type operations, such as composition.
A diff can do one thing that a transform cannot do: a diff can be applied in
a fuzzy way, that is, the patched text does not have to be exactly the same
as the original target text from which the diff was generated. However, we
don't require this property just now, since we only want to represent exact
differences in the database. Anything else is an error. When we do get to
the point of handling fuzzy problems like merging, we will need to build some
more tools for the purpose. We will not make the mistake of attempting to
use a single tool such as diff for two different purposes, neither of which
it is ideally suited to.
Now that I have broken the back of this biggish chunk of work, it's time to
contemplate what else needs to be done to get to the point of having some
minimally functional repository manager to play with:
- Finish up this work by wrapping it with some test code to create
diffs and cross check against the generated transforms
- Wrap the C code as a Python library:
http://www.python.org/doc/current/api/api.html
- Think more about the details of the magic filesystem. This won't
have to be a general purpose stackable filesystem, it just has to
interface to userland in such a way that file text can be saved
before being overwritten, and compared to the changed result when
a file is closed.
- Flesh out some more database format structural detail, so that
filenames and directory structure can be tracked and simple
metadata such as comments version names can be recorded
- Do a little work to improve the python database interface class
for record writing, taking advantage of Postres's "copy file" table
loading command (which can be easily emulated with less efficient
operations for other databases that don't have it).
As I mentioned previously, the magic filesystem interface doesn't necessarily
have to exist before the system can be used: a simple workaround is to run
the editing commands from a Python shell, with wrappers to run the required
database operations.
The attached code demonstrates the conversion of a simple diff text into a
transform. It's all set to compile and run, with tracing on. In the trace
output, a character followed by ',' was read by skip_to and a character
followed by '?' was tested by next_is. Parse states are printed out at the
beginning of a line, and generated operations at the end. Finally, the
generated transform is printed byte by byte in hex.
--
Daniel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: transform.c
Type: text/x-c
Size: 6867 bytes
Desc: not available
URL: <http://lists.ozlabs.org/pipermail/prophesy/attachments/20020607/c371fe1a/attachment.bin>
More information about the Prophesy
mailing list