[ccan] [PATCH] base64: implements rfc4648, the base64 encoding

Wed Feb 25 13:30:37 AEDT 2015

On Tue, Feb 24, 2015 at 02:16:28PM +0200, Ran Benita wrote:
> On Tue, Feb 24, 2015 at 09:08:16PM +1100, Peter Barker wrote:
> > On Mon, 23 Feb 2015, Ran Benita wrote:
> > 
> > ... lots of feedback!  Thanks!
> > 
> > >On Thu, Jan 22, 2015 at 11:53:26AM +1100, Peter Barker wrote:
> > >>Encode buffers into base64 according to rfc4648.
> > >>Decode base64-encoded buffers according to the same standard.
> > >
> > >Some comments below. However I took a look at RFC4648 and it actually
> > >points to a C99 implementation, which is also GPL2. So maybe a laxer
> > >license could help to differentiate a bit (of course that's up to you).
> > 
> > I had meant to use LGPLv2+.  I've linked to BSD-MIT instead.
> 
> Great, now I'll be able to use it :)
> 
> > Looking at this actually showed up a problem with ccanlint; base64.c and
> > base64.h both referenced LGPLv2+ - but the _info file specified the GPL. The
> > check in ccanlint (license_comment.c) looks like this:
> > 
> > if (strstr(lines[i],licenses[m->license].shortname))
> > 	found_flavor = true;
> > 
> > ... so it didn't pick this discrepancy up.  I'll see if I can get a patch
> > for that together.
> > 
> > >>+ *      	// print the base64-encoded form of the program arguments
> > >>+ *      	for(i=1;i<argc;i++) {
> > >>+ *      		size_t space_required = base64_encoded_length(strlen(argv[i]));
> > >
> > >I know it's just an example but maybe save the strlen instead of calling
> > >it twice?
> > 
> > Done.
> > 
> > >>+ *      		base64_encoded_string = malloc(space_required);
> > >>+ *      		base64_encode(base64_encoded_string,sizeof(space_required),argv[i],strlen(argv[i]));
> > >
> > >If I got it right, the `sizeof` here should be removed.
> > 
> > Quite right.  I think that must have evolved (badly) from using a static
> > buffer.  At least I hope so :-)
> > 
> > >>+	ret = alphabet->decode_map[(unsigned char)b64letter];
> > >>+	if (ret == '~') {
> > >
> > >This assumes that `~` >= 64 right? That's valid but `0xff` would feel
> > >less arbitrary.
> > 
> > Yes.  Something has gone awry at my end somehow; these have already been
> > changed to 0xff.  ~ was chosen originally to make the statically-defined
> > decode maps more readable, but David also suggested something less
> > arbitrary.
> > 
> > >>+int base64_char_in_alphabet(const base64_alphabet_t *alphabet,
> > >>+			    const char b64char) {
> > >>+	return (alphabet->decode_map[(unsigned char)b64char] != '~');
> > >
> > >This ought to give a warning about casting out a const (I'd just remove
> > >it).
> > 
> > The cast prevents a "using signed char as an index" warning.  One cast here
> > or many in the callers - here was better, I think.
> 
> The `char` -> `unsigned char` is OK - I was thinking about the `const`
> -> no-`const` cast. I remember gcc ferociously warning about this, but
> it doesn't warn here. So I suppose it only warns about pointers (where
> such a cast can actually be dangerous), not integral types.

That strikes me as a rather stupid warning, because dangerous as it
is, there would be no way to suppress it.  There's no way of saying "I
know what I'm doing" that's stronger than a cast.  So if it exists,
maybe it's disabled in the ccan makefiles?  (Fwiw, sparse does have
such a warning, and it's own special way to suppress it).

TBH, I'm also completely baffled at the "signed char as an index"
warning.  I can't see why that would be bad, and particularly not why
it would be any worse than an unsigned char.

In this instance I'd suggest just using an (int) or (unsigned) cast.
The width doesn't matter once it's put into the index.

> > >>+void base64_encode_tail_using_alphabet(const base64_alphabet_t *alphabet,
> > >>+				       char dest[4],
> > >>+				       const char *src, const size_t srclen)
> > >>+{
> > >>+	char longsrc[3];
> > >>+
> > >>+	memcpy(longsrc, src, srclen);
> > >>+	memset(longsrc+srclen, '\0', 3-srclen);
> > >
> > >`3 - srclen` this looks dangerous; maybe add a warning about `srclen`
> > >max value in the doc comments. In this case thought you can init longsrc
> > >to be zerod, i.e. `char longsrc[3] = { 0 };`.
> > 
> > Cool, done.
> > 
> > I've dropped an assert in as well as adjusting the comments in the header
> > file.
> > 
> > >>+	while (srclen - src_offset >= 3) {
> > >
> > >It's easier to see there's no underflow if it's written as
> > >   src_offset + 3 <= srclen
> > 
> > Doesn't that introduce an integer overflow?  I know you're going to have to
> > work really hard to hit that, but.....
> 
> Yes, I just ignored that :) If you want to protect against overflow, and
> you can "prove" that there's no underflow (which is usually more likely
> with unsigned's), than OK.

I have to disagree with you here, I think the original version is
clearer code.  I think it's sufficiently clear that there isn't an
underflow (since src_offset is never advanced past srclen), and it
lacks the integer overflow that your version introduces.

[snip]
> > >>+  "\x22\x23\x24\x25\x26" /* 105 */					\
> > >>+  "\x27\x28\x29\x2a\x2b" /* 110 */					\
> > >>+  "\x2c\x2d\x2e\x2f\x30" /* 115 */					\
> > >>+  "\x31\x32\x33~~" /* 120 */						\
> > >>+  "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" /* 125 */	\
> > >>+  "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" /* 175 */	\
> > >>+  "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" /* 225 */
> > >>+
> > >>+static const base64_alphabet_t alphabet_rfc4648_pregen = {
> > >
> > >Why not export this one?
> > 
> > I was trying to keep the interface as small as possible.  I couldn't see any
> > use case for it outside the library.  Having said that, I can see a caller
> > switching between alphabets and wanting to use the rfc4648 one (rather than
> > switching between functions to call).

I think that last is a good enough reason to export it.  It also means
that the rfc4648 specific wrappers can become inlines in the header.

Actually, reading a bit further on, I'd export *just* this, not the
1-way alphabet that is exported - you can get to that through the
encode map field if you need it.

> > This actually also points out the whole "alphabet" problem.  This library is
> > currently confused as to what "alphabet" is.  The next version of this patch
> > might be have base64_init_maps(const char *), base64_maps_t,
> > base64_decode_tail_using_maps(base64_maps_t x, ...) etc.  A better name I've
> > not come up with.
> 
> "alphabet" made sense to me. I thought you didn't export this pregen'ed
> alphabet since you can get the same with the `base64_alphabet_rfc4648`
> array and `base64_init_alphabet()`, but then I thought it'd be more
> helpful to just export the entire thing.

I think I'd be happy enough with either "alphabet" or "maps".

[snip]
> > >>+  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/",
> > >>+  base64_decode_map_rfc4648
> > >>+};
> > >>+
> > >>+#undef base64_decode_map_rfc4648
> > >>+
> > >
> > >Perhaps the following functions should be `static inline` in the header?
> > >Otherwise in some cases the function call can't be inlined away. It also
> > >makes it clear these are just simple wrappers.
> > 
> > Done.  Those functions have jumped between base64.c and base64.h a lot :-)
> > 
> > >>@@ -0,0 +1,200 @@
> > >>+/* Licensed under LGPLv2+ - see LICENSE file for details */
> > >>+#ifndef CCAN_BASE64_H>> +#define CCAN_BASE64_H
> > >>+
> > >>+#include <stdio.h> /* For size_t */
> > >
> > >If you only need `size_t`, you can include stddef.h instead.
> > 
> > Changed.
> > 
> > >>+/**
> > >>+ * base64_alphabet_rfc4648 - the base64 alphabet as defined in rfc4648
> > >>+ */
> > >>+static const char base64_alphabet_rfc4648[] =
> > >
> > >I'd put an explicit length here, though there's no reason to...
> > 
> > Hmmm.  I think other people might object to it being unnecessary verbiage.
> > It doesn't fuss me either way - but I think it does add a bit of clarity.
> > Added the length.
> 
> Yes, this one is completely personal preference. I does have an
> advantage/disadvantage (depending on how you look at it) if you
> accidentally miss some characters.

I'm not sure, but would an explicit length mean you get a warning if a
character was accidentally inserted or deleted into the string?  If
so, that would be a reasont to include it.

> > >>+/**
> > >>+ * base64_encode - Encode a buffer into base64 according to rfc4648
> > >>+ * @param dest Buffer to encode into
> > >>+ * @param destlen Length of the destination buffer
> > >>+ * @param src Buffer to encode
> > >>+ * @param srclen Length of the data to encode
> > >>+ * @return Number of encoded bytes set in dest. -1 on error (and errno set)
> > >
> > >Maybe return `-ERRORCODE` on error, instead of -1 (errno is annoying to
> > >use). What errors can happen here? (Also for other functions).
> > 
> > OK, now this is interesting.  There are a few errors the existing interface
> > might return - EDOM for passing bad encoded data in, EOVERFLOW for a
> > too-small buffer for encode/decode, EINVAL if the input is short.
> 
> Please mention them where they can occur. I and probably others usually
> skip the documentation blocks when the function signature itself is
> clear enough (as in this case), but then look back for the error
> handling.
> 
> > 
> > There's no reason why the existing entry points couldn't return -ERRORCODE.
> > 
> > I'm curious to know *why*, 'though.  If the API is ever extended such that
> > something returns a pointer - errno is going to have to be used anyway, and
> > consistency would be lost.
> > 
> > Stylistically, I'm not sure how returning the error code is advantageous:
> > 
> > if ((size_t err = base64_encode(...)) < 0) {
> > 	perror(err)
> 
> Here it would be `-err` :)
> 
> > 	abort();
> > }
> > 
> > vs
> > 
> > if (base64_encode(...)) < 0) {
> 
> Ah, I missed this one - this branch is never taken since the type of the
> return value is `size_t` (unsigned). It would probably work with some
> implicit casting if you write an explicit `== -1` but what `read(2)`,
> `write(2)` do instead is return `ssize_t`. So I'd use it here, if you're
> willing to pay the extra bit in the range of possible buffer lengths.
> 
> > 	perror(errno);
> > 	abort();
> > }
> > 
> > Obviously using errno irks at least one programmer.  I'd like to know why
> > you find it annoying to use :-)
> > 
> 
> I personally don't like errno because you always have to remember to
> zero it before using checking it, and if you want to propagate it you
> have to use `saved_errno` variables, etc. Whereas with the error-code
> return everything happens in the program flow.
> 
> Of course errno is well-established so if you prefer it you should use
> it!

Yeah, I'm torn here.  I'm not a fan of errno, but I'm also not a fan
of deviating from established interfaces without a really good reason.

> > >>+size_t base64_encode(char *dest, const size_t destlen, const char *src, const size_t srclen);
> > >
> > >`const size_t foo` parameter is a bit funny-looking.
> > 
> > Hmmm.  David mentioned the same thing.  I defended it then by pointing out
> > it (a) ensures you don't accidentally fiddle with something you weren't
> > expecting to and (b) might give hints to the compiler.  This is obviously
> > not a globally accepted argument :-)
> 
> That's cool, as long as it's done consciously :)

I'm still uncomfortable with it.  And I think the reason is that it
exposes in the function signature what's really an internal detail.
For a simple pass-by-value integer, the caller doesn't care if the
fuction mangles its copy.  But now, if you change implementation so
that you do want to change that variable from the original parameter
value, you need to either change the signature, or do a copy into a
local.

The object code will all end up identical whichever way.

> > >Have you thought about adding a convenience function which only takes
> > >src+srclen and returns a malloc'd buffer? This seems to be a pretty
> > >common use case, e.g. in the example in `_info`. Also for decode.
> > 
> > If you look back through the ccan mail archives I posted a ridiculously
> > complicated API which would encode/decode your strings in place, resizing
> > buffers appropriately (assuming it could take() them).  David smacked me
> > around the head to keep this as simple as possible to start off with.  So to
> > answer your question, "yes, absolutely" :-)  Didn't do this for two reasons:
> > (a) aforementioned simplicity and (b) base64_encode returns size_t and 0 is
> > a valid return value; there's no out-of-band return value. A mallocing
> > encode signature (and I had this at one stage, I believe) might look like
> > this:
> > 
> > if (base64_mallocing_encode("fred",4, &encoded_length) < 0) { ... }
> 
> This should be `encoded_buffer` right?
> 
> > 
> > Which is not as "simple" :-)
> 
> Oh sorry, I missed the previous discussion. If it turned out to
> complicate the API too much then OK!

The previous discussion was off list, sorry.

So, it's not so much that the allocating interface is "too
complicated".  It's just that for flexibility you also want the
non-allocating interface.  Given that, it seems better to thrash out
the non-allocating interface first.  Conveience wrappers, like
auto-allocation, can come in a later patch, since they'll be
implemented in terms of the non-allocating interface.

Admittedly, my view is coloured by the fact that I've done enough work
in firmware and other restricted environments that I dislike
interfaces that unnecessarily assume you'll always have a working
malloc() interface.

> > >>+size_t base64_decoded_length(size_t srclen);
> > >>+
> > >>+/**
> > >>+ * base64_decode - decode An rfc4648 base64-encoded string
> > >>+ * @param dest Buffer to decode into
> > >>+ * @param destlen Length of the destination buffer
> > >>+ * @param src Buffer to decode
> > >>+ * @param srclen Length of the data to decode
> > >>+ * @return Number of decoded bytes set in dest. -1 on error (and errno set)
> > >>+ * @note dest will be nul-padded to destlen
> > >
> > >Since base-64 pads with `=` in the last bits this can give the wrong
> > >impression!
> > 
> > Hmmm.  A little bit of rewording:
> >  * @note dest will be nul-padded to destlen (past any required padding)
> > 
> > which is still not great, but I'm trying to avoid epic comment blocks in the
> > header.
> > 
> > >In any case why should `dest` be nul-padded? (I always thought strncpy
> > >was bad for doing that unless you write out directory entries in a
> > >filesystem or something).
> > 
> > It was, in fact, to make it look like strncpy.  The belief is that the vast
> > majority of the time you will be printing the base64 strings out.
> > 
> > You can't always-nul-terminate (i.e. have encoded_size return +1 and
> > explicitly null things); you might be encoding into a larger buffer (and the
> > caller scribbling down the length you return alongside it).
> > 
> > You could not-nul-terminate - but why would somone pass in a buffer size
> > larger than what's strictly required if they didn't want you to play with
> > it? :-)
> > 
> > Callers would look like this:
> > src = "fred";
> > len = base64_encoded_length(strlen(src));
> > dest = xmalloc(len+1);
> > base64_encode(dest,len,src);
> > dest[len] = '\0';
> 
> Is `base64_encoded_length` exact? Otherwise I'd use `base_encode()`
> retval here.

It is exact; easily calculated output size is a nice property of base64.

> > printf("%s\n",dest);
> > 
> > Or this:
> > src = "fred";
> > len = base64_encoded_length(strlen(src)) + 1; /* for null-termination */
> > dest = xmalloc(len);
> > base64_encode(dest,len,src);
> > printf("%s\n",dest);
> > 
> > I really prefer the latter.  I'm guessing you don't :-)
> 
> If you prefer it then OK.
> 
> I dislike it because the expectation is that runtime ~ (length of input)
> not (length of output buffer). And I also remember avoiding `strncpy` a
> few times because it has this behavior.

I sort of agree in theory, but not really in practice.  The O(length
of output) portion has a low constant (memset() is pretty fast) and it
only comes into play if you over-allocate your output buffer.

If a caller routinely uses output buffers large enough for the
memset() to be significant, and doesn't dynamically size using
base64_encoded_length(), I think they deserve what they get.

> Maybe you can say that the buffer is nul-terminated if possible? (i.e.
> writes only one '\0' if there is space). Although TBH even that I'd leave
> out.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 819 bytes
Desc: not available
URL: <http://lists.ozlabs.org/pipermail/ccan/attachments/20150225/53abb56b/attachment.sig>