[ccan] New module: bdelta

Sun Aug 21 21:12:31 EST 2011

On Fri, 19 Aug 2011 10:41:33 -0400, Joey Adams <joeyadams3.14159 at gmail.com> wrote:
> On Fri, Aug 19, 2011 at 2:17 AM, Rusty Russell <rusty at rustcorp.com.au> wrote:
> > I like it!  It's a perfect example of a ccan module: it does one thing,
> > and does it well.
> 
> Thanks!
> 
> > A few minor nitpicks from reading the patch (I haven't applied it yet!)
> >
> > 1) I'm always tempted to reuse errno rather than invent my
> > own error codes (ENOMEM, EINVAL and EIO?), but otherwise I would do an
> > enum for the error codes; it just makes the API more explicit.
> 
> My error strings are more descriptive.  "Patch applied to wrong
> string" is far more descriptive than "Invalid argument".

Sure, but that's partially a documentation question.

> Besides, I
> prefer to be portable in a library, and ENOMEM/EINVAL/EIO are defined
> by POSIX, not by the C standards.

Well, we don't avoid write() and read() :)

> I was going to use an enum, but what would I call it?  Remember that
> C++ would treat enum bdelta {...} as a type definition (rather than
> simply an enum tag), meaning "bdelta" could no longer be used as a
> variable name.

I don't really care about C++, but enum bdelta seems wrong anyway:
bdelta_err maybe?

> > 2) Maybe use the term 'char arrays' rather than 'byte strings' as
> >   string has that 0-terminated sense for C coders.
> 
> I wasn't sure what to call them.  PostgreSQL calls them "byte arrays
> (BYTEA)".  Haskell calls it ByteString.  char almost implies plain
> text, while "byte" or "octet" suggests the library can be used with
> binary data.  A "string" in many other programming languages can have
> null bytes.

"byte arrays"?  I just did a double-take at the term 'strings'...

> Another naming consideration: I'm not sure whether to call it a
> "patch", "diff", or "delta".  I called the library bdelta rather than
> bdiff mainly because bdiff_diff looks silly.  I should probably use
> all of those terms so people using search engines can find what
> they're looking for.
> 
> Note that I shunned the word "binary" in my documentation because the
> diff algorithm shortens at the byte level, not the bit level.  An
> insertion of one bit would change all the bytes that follow.

Yeah, bdelta = byte delta?

> > 3) I wouldn't implement bdelta_perror personally, since I prefer
> >   err/errx/warn/warnx.
> 
> I'm following the lead of GnuTLS here, which has a function called
> gnutls_perror.  It's really convenient to have, if you want to follow
> the good practice of handling all error codes but haven't become
> interested in the errors themselves yet.

Your implementation though, is *worse* than someone implementing it
themselves.  I'm not arguing against bdelta_strerror(), just
bdelta_perror().  It's stupid.

If you're a library, you shouldn't be printing to stderr unless you're
about to die (cf. err()).  Otherwise, you often want a format string
rather than a fixed string.  I've found that generally when coders try
to use perror it makes their code worse.

> err/errx/warn/warnx aren't defined by C or POSIX, but are BSD extensions.

Absolutely!  And when we find a platform which doesn't implement them,
I'll write ccan/err as a wrapper.

> > 4) We really need a good bitstring library: it'd be great for this!
> 
> You just said "string"!  :-)

Yeah, I suck :)

> > I think testing could be improved a bit.  For example, I'd test the
> > diff results directly, before testing patching.  eg. that "aaaa" vs
> > "aaaa" produced a non-diff, then "aaaa" vs "aaaab", etc.  Then once the
> > diffs pass, test those patches.
> 
> Indeed, bdelta could use more trivial tests.  Someone on freenode/#c
> recently asked if anyone had any tests for longest common subsequence,
> and I pointed to bdelta.  Having more tests would have been nice.
> 
> Also, it would be nice to have some tests that test the backend
> algorithm itself (diff_myers), checking the result against expected,
> hard-coded patch strings.
> 
> > As far as I can tell, if your diff never generated anything but whole
> > new strings, you'd pass?
> 
> Indeed.  Otherwise, it would be a performance test, and I'm not sure
> CCAN supports those yet :-)

Well, hence manual testing.

I have an idea for performance stuff, but it's a while away...

> > And I don't think your random test actually tests patching?
> 
> Actually, they do.  bdelta_diff verifies the patch before returning
> it.  I was originally going to make this an assertion check, but then
> users (myself included) would have to manually check the result
> anyway.  One bad diff can corrupt hundreds of datums.

Ah, I missed that.  I would have been tempted to put it under !NDEBUG or
CCAN_BDELTA_DEBUG; once you're confident in the algorithm the checking
seems questionable.

> Thanks for the review!

No worries, now I have to actually *understand* the code to give
feedback on the implementation :)

Thanks!
Rusty.