[ccan] [PATCH] base64: implements rfc4648, the base64 encoding

Tue Feb 24 23:16:28 AEDT 2015

On Tue, Feb 24, 2015 at 09:08:16PM +1100, Peter Barker wrote:
> On Mon, 23 Feb 2015, Ran Benita wrote:
> 
> ... lots of feedback!  Thanks!
> 
> >On Thu, Jan 22, 2015 at 11:53:26AM +1100, Peter Barker wrote:
> >>Encode buffers into base64 according to rfc4648.
> >>Decode base64-encoded buffers according to the same standard.
> >
> >Some comments below. However I took a look at RFC4648 and it actually
> >points to a C99 implementation, which is also GPL2. So maybe a laxer
> >license could help to differentiate a bit (of course that's up to you).
> 
> I had meant to use LGPLv2+.  I've linked to BSD-MIT instead.

Great, now I'll be able to use it :)

> Looking at this actually showed up a problem with ccanlint; base64.c and
> base64.h both referenced LGPLv2+ - but the _info file specified the GPL. The
> check in ccanlint (license_comment.c) looks like this:
> 
> if (strstr(lines[i],licenses[m->license].shortname))
> 	found_flavor = true;
> 
> ... so it didn't pick this discrepancy up.  I'll see if I can get a patch
> for that together.
> 
> >>+ *      	// print the base64-encoded form of the program arguments
> >>+ *      	for(i=1;i<argc;i++) {
> >>+ *      		size_t space_required = base64_encoded_length(strlen(argv[i]));
> >
> >I know it's just an example but maybe save the strlen instead of calling
> >it twice?
> 
> Done.
> 
> >>+ *      		base64_encoded_string = malloc(space_required);
> >>+ *      		base64_encode(base64_encoded_string,sizeof(space_required),argv[i],strlen(argv[i]));
> >
> >If I got it right, the `sizeof` here should be removed.
> 
> Quite right.  I think that must have evolved (badly) from using a static
> buffer.  At least I hope so :-)
> 
> >>+	ret = alphabet->decode_map[(unsigned char)b64letter];
> >>+	if (ret == '~') {
> >
> >This assumes that `~` >= 64 right? That's valid but `0xff` would feel
> >less arbitrary.
> 
> Yes.  Something has gone awry at my end somehow; these have already been
> changed to 0xff.  ~ was chosen originally to make the statically-defined
> decode maps more readable, but David also suggested something less
> arbitrary.
> 
> >>+int base64_char_in_alphabet(const base64_alphabet_t *alphabet,
> >>+			    const char b64char) {
> >>+	return (alphabet->decode_map[(unsigned char)b64char] != '~');
> >
> >This ought to give a warning about casting out a const (I'd just remove
> >it).
> 
> The cast prevents a "using signed char as an index" warning.  One cast here
> or many in the callers - here was better, I think.

The `char` -> `unsigned char` is OK - I was thinking about the `const`
-> no-`const` cast. I remember gcc ferociously warning about this, but
it doesn't warn here. So I suppose it only warns about pointers (where
such a cast can actually be dangerous), not integral types.

> >>+void base64_encode_tail_using_alphabet(const base64_alphabet_t *alphabet,
> >>+				       char dest[4],
> >>+				       const char *src, const size_t srclen)
> >>+{
> >>+	char longsrc[3];
> >>+
> >>+	memcpy(longsrc, src, srclen);
> >>+	memset(longsrc+srclen, '\0', 3-srclen);
> >
> >`3 - srclen` this looks dangerous; maybe add a warning about `srclen`
> >max value in the doc comments. In this case thought you can init longsrc
> >to be zerod, i.e. `char longsrc[3] = { 0 };`.
> 
> Cool, done.
> 
> I've dropped an assert in as well as adjusting the comments in the header
> file.
> 
> >>+	while (srclen - src_offset >= 3) {
> >
> >It's easier to see there's no underflow if it's written as
> >   src_offset + 3 <= srclen
> 
> Doesn't that introduce an integer overflow?  I know you're going to have to
> work really hard to hit that, but.....

Yes, I just ignored that :) If you want to protect against overflow, and
you can "prove" that there's no underflow (which is usually more likely
with unsigned's), than OK.

> >>+		base64_encode_triplet_using_alphabet(alphabet, &dest[dest_offset], &src[src_offset]);
> >>+		src_offset += 3;
> >>+		dest_offset += 4;
> >>+	}
> >>+
> >>+	if (srclen - src_offset) {
> >
> >Here too: `srclen != src_offset` or better `src_offset < srclen`.
> 
> Changed.  I was assigning into a local variable there but took it out.
> 
> >>+	char longsrc[4];
> >>+	int quartet_result;
> >>+	size_t insize = srclen;
> >>+
> >>+	if (insize == 0) {
> >>+		return 0;
> >>+	}
> >>+	while (src[insize-1] == '=') { /* throw away padding symbols */
> >
> >If I give srclen=1, src="=" there's an underflow here (don't think
> >that's valid but might be malicious).
> 
> Fixed.
> 
> >>+	if (insize == 1) {
> .
> .
> >>+
> >>+	return insize - 1;
> >
> >Here too if I follow correctly `insize` can be 0. Not sure which int
> >conversions happen here but it'd be better not to need that...
> 
> Fixed.
> 
> >>+	for(i=0; srclen - i > 4; i+=4) {
> >
> >`i + 4 < srclen` would be nicer re. underflow (though I looks fine
> >here).
> 
> Again, written this way to avoid integer overflow - just in case i == size_t
> - 3, for example.
> 
> >>+  "\x22\x23\x24\x25\x26" /* 105 */					\
> >>+  "\x27\x28\x29\x2a\x2b" /* 110 */					\
> >>+  "\x2c\x2d\x2e\x2f\x30" /* 115 */					\
> >>+  "\x31\x32\x33~~" /* 120 */						\
> >>+  "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" /* 125 */	\
> >>+  "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" /* 175 */	\
> >>+  "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" /* 225 */
> >>+
> >>+static const base64_alphabet_t alphabet_rfc4648_pregen = {
> >
> >Why not export this one?
> 
> I was trying to keep the interface as small as possible.  I couldn't see any
> use case for it outside the library.  Having said that, I can see a caller
> switching between alphabets and wanting to use the rfc4648 one (rather than
> switching between functions to call).
> 
> This actually also points out the whole "alphabet" problem.  This library is
> currently confused as to what "alphabet" is.  The next version of this patch
> might be have base64_init_maps(const char *), base64_maps_t,
> base64_decode_tail_using_maps(base64_maps_t x, ...) etc.  A better name I've
> not come up with.

"alphabet" made sense to me. I thought you didn't export this pregen'ed
alphabet since you can get the same with the `base64_alphabet_rfc4648`
array and `base64_init_alphabet()`, but then I thought it'd be more
helpful to just export the entire thing.

> >>+  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/",
> >>+  base64_decode_map_rfc4648
> >>+};
> >>+
> >>+#undef base64_decode_map_rfc4648
> >>+
> >
> >Perhaps the following functions should be `static inline` in the header?
> >Otherwise in some cases the function call can't be inlined away. It also
> >makes it clear these are just simple wrappers.
> 
> Done.  Those functions have jumped between base64.c and base64.h a lot :-)
> 
> >>@@ -0,0 +1,200 @@
> >>+/* Licensed under LGPLv2+ - see LICENSE file for details */
> >>+#ifndef CCAN_BASE64_H>> +#define CCAN_BASE64_H
> >>+
> >>+#include <stdio.h> /* For size_t */
> >
> >If you only need `size_t`, you can include stddef.h instead.
> 
> Changed.
> 
> >>+/**
> >>+ * base64_alphabet_rfc4648 - the base64 alphabet as defined in rfc4648
> >>+ */
> >>+static const char base64_alphabet_rfc4648[] =
> >
> >I'd put an explicit length here, though there's no reason to...
> 
> Hmmm.  I think other people might object to it being unnecessary verbiage.
> It doesn't fuss me either way - but I think it does add a bit of clarity.
> Added the length.

Yes, this one is completely personal preference. I does have an
advantage/disadvantage (depending on how you look at it) if you
accidentally miss some characters.

> >>+/**
> >>+ * base64_encode - Encode a buffer into base64 according to rfc4648
> >>+ * @param dest Buffer to encode into
> >>+ * @param destlen Length of the destination buffer
> >>+ * @param src Buffer to encode
> >>+ * @param srclen Length of the data to encode
> >>+ * @return Number of encoded bytes set in dest. -1 on error (and errno set)
> >
> >Maybe return `-ERRORCODE` on error, instead of -1 (errno is annoying to
> >use). What errors can happen here? (Also for other functions).
> 
> OK, now this is interesting.  There are a few errors the existing interface
> might return - EDOM for passing bad encoded data in, EOVERFLOW for a
> too-small buffer for encode/decode, EINVAL if the input is short.

Please mention them where they can occur. I and probably others usually
skip the documentation blocks when the function signature itself is
clear enough (as in this case), but then look back for the error
handling.

> 
> There's no reason why the existing entry points couldn't return -ERRORCODE.
> 
> I'm curious to know *why*, 'though.  If the API is ever extended such that
> something returns a pointer - errno is going to have to be used anyway, and
> consistency would be lost.
> 
> Stylistically, I'm not sure how returning the error code is advantageous:
> 
> if ((size_t err = base64_encode(...)) < 0) {
> 	perror(err)

Here it would be `-err` :)

> 	abort();
> }
> 
> vs
> 
> if (base64_encode(...)) < 0) {

Ah, I missed this one - this branch is never taken since the type of the
return value is `size_t` (unsigned). It would probably work with some
implicit casting if you write an explicit `== -1` but what `read(2)`,
`write(2)` do instead is return `ssize_t`. So I'd use it here, if you're
willing to pay the extra bit in the range of possible buffer lengths.

> 	perror(errno);
> 	abort();
> }
> 
> Obviously using errno irks at least one programmer.  I'd like to know why
> you find it annoying to use :-)
> 

I personally don't like errno because you always have to remember to
zero it before using checking it, and if you want to propagate it you
have to use `saved_errno` variables, etc. Whereas with the error-code
return everything happens in the program flow.

Of course errno is well-established so if you prefer it you should use
it!

> >>+size_t base64_encode(char *dest, const size_t destlen, const char *src, const size_t srclen);
> >
> >`const size_t foo` parameter is a bit funny-looking.
> 
> Hmmm.  David mentioned the same thing.  I defended it then by pointing out
> it (a) ensures you don't accidentally fiddle with something you weren't
> expecting to and (b) might give hints to the compiler.  This is obviously
> not a globally accepted argument :-)

That's cool, as long as it's done consciously :)

> >Have you thought about adding a convenience function which only takes
> >src+srclen and returns a malloc'd buffer? This seems to be a pretty
> >common use case, e.g. in the example in `_info`. Also for decode.
> 
> If you look back through the ccan mail archives I posted a ridiculously
> complicated API which would encode/decode your strings in place, resizing
> buffers appropriately (assuming it could take() them).  David smacked me
> around the head to keep this as simple as possible to start off with.  So to
> answer your question, "yes, absolutely" :-)  Didn't do this for two reasons:
> (a) aforementioned simplicity and (b) base64_encode returns size_t and 0 is
> a valid return value; there's no out-of-band return value. A mallocing
> encode signature (and I had this at one stage, I believe) might look like
> this:
> 
> if (base64_mallocing_encode("fred",4, &encoded_length) < 0) { ... }

This should be `encoded_buffer` right?

> 
> Which is not as "simple" :-)

Oh sorry, I missed the previous discussion. If it turned out to
complicate the API too much then OK!

> >>+size_t base64_decoded_length(size_t srclen);
> >>+
> >>+/**
> >>+ * base64_decode - decode An rfc4648 base64-encoded string
> >>+ * @param dest Buffer to decode into
> >>+ * @param destlen Length of the destination buffer
> >>+ * @param src Buffer to decode
> >>+ * @param srclen Length of the data to decode
> >>+ * @return Number of decoded bytes set in dest. -1 on error (and errno set)
> >>+ * @note dest will be nul-padded to destlen
> >
> >Since base-64 pads with `=` in the last bits this can give the wrong
> >impression!
> 
> Hmmm.  A little bit of rewording:
>  * @note dest will be nul-padded to destlen (past any required padding)
> 
> which is still not great, but I'm trying to avoid epic comment blocks in the
> header.
> 
> >In any case why should `dest` be nul-padded? (I always thought strncpy
> >was bad for doing that unless you write out directory entries in a
> >filesystem or something).
> 
> It was, in fact, to make it look like strncpy.  The belief is that the vast
> majority of the time you will be printing the base64 strings out.
> 
> You can't always-nul-terminate (i.e. have encoded_size return +1 and
> explicitly null things); you might be encoding into a larger buffer (and the
> caller scribbling down the length you return alongside it).
> 
> You could not-nul-terminate - but why would somone pass in a buffer size
> larger than what's strictly required if they didn't want you to play with
> it? :-)
> 
> Callers would look like this:
> src = "fred";
> len = base64_encoded_length(strlen(src));
> dest = xmalloc(len+1);
> base64_encode(dest,len,src);
> dest[len] = '\0';

Is `base64_encoded_length` exact? Otherwise I'd use `base_encode()`
retval here.

> printf("%s\n",dest);
> 
> Or this:
> src = "fred";
> len = base64_encoded_length(strlen(src)) + 1; /* for null-termination */
> dest = xmalloc(len);
> base64_encode(dest,len,src);
> printf("%s\n",dest);
> 
> I really prefer the latter.  I'm guessing you don't :-)

If you prefer it then OK.

I dislike it because the expectation is that runtime ~ (length of input)
not (length of output buffer). And I also remember avoiding `strncpy` a
few times because it has this behavior.

Maybe you can say that the buffer is nul-terminated if possible? (i.e.
writes only one '\0' if there is space). Although TBH even that I'd leave
out.

> 
> >>+size_t base64_decode(char *dest, const size_t destlen,
> >>+		     const char *src, const size_t srclen);
> >>+
> >>+
> >>+/**
> >>+ * base64_encode_triplet - encode 3 bytes into base64
> >>+ * @param dest Buffer containing at least 4 bytes
> >
> >I'd remove the "at least" here, to match the type `char dest[4]`.
> 
> OK, I'd chosen my words carefully there - we don't care if the buffer is
> larger :-)  However, it obviously caused confusion the way it was, and "at
> least" could be vaguely inferred.
> 
> >>+ * @param src Buffer containing 3 bytes
> >>+ */
> >>+void base64_encode_triplet(char dest[4], const char src[3]);
> >
> >These sets of functions with tail, triplet, quartet seem like
> >implementation related and shouldn't be exposed as API (unless they're
> >useful for something I can't think of).
> 
> Oh, yes.  They actually make the library useful IMO.  Simple, dodgy
> streaming base64 (see here the confusion about what an "alphabet" is):
> 
> #include <stdio.h>
> #include <base64.h>
> #include <errno.h>
> #include <stdlib.h>
> 
> int
> main(int argc, char * argv[]) {
>   char buffer[1];
>   char todecode[4];
>   int todecode_offset = 0;
>   ssize_t ret = 0;
> 
>   base64_alphabet_t alphabet;
>   base64_init_alphabet(&alphabet,base64_alphabet_rfc4648);
> 
>   while (ret = read(0,buffer,1)) {
>     if (ret == -1) {
>       perror("read");
>       exit(1);
>     }
>     if (base64_char_in_alphabet(&alphabet,buffer[0])) {
>       todecode[todecode_offset++] = buffer[0];
>     }
>     if (todecode_offset > 3) {
>       char decoded[3];
>       base64_decode_quartet_using_alphabet(&alphabet,decoded,todecode);
>       todecode_offset = 0;
>       printf("%c%c%c",decoded[0],decoded[1],decoded[2]);
>     }
>   }
>   if (todecode_offset) {
>       char decoded[3];
> 
>  base64_decode_tail_using_alphabet(&alphabet,decoded,todecode,todecode_offset);
>       todecode_offset = 0;
>       printf("%c%c%c",decoded[0],decoded[1],decoded[2]);
>   }
> }

Yea that's nice, haven't thought about this usecase.

> I could drop a somewhat less unpretty copy of that in an examples directory?

Why not!

> >>+int base64_char_in_alphabet(const base64_alphabet_t *alphabet,
> >>+			    const char b64char);
> >
> >I'm assuming this is for checking if an encoded string is valid? If so,
> >wouldn't it be more straightforward to have a `validate` function (i.e.
> >just checks if the string is OK without actual decoding).
> 
> You can see that in use above :-)  Useful for skipping e.g. carriage returns
> in the input.
> 
> >Ran
> 
> Many, many thanks for taking the time to look through this.  I'm willing to
> be convinved on things like errno and const-in-parameter-lists, so please
> convince me :-)

Thanks for writing & publishing it!

Ran

> I'll wait another day or two to see if anyone else weighs in.  Otherwise
> I'll repost the patch.
> 
> Thanks,
> -- 
> Peter Barker                          |   Programmer,Sysadmin,Geek.
> pbarker at barker.dropbear.id.au	      |   You need a bigger hammer.
> :: It's a hack! Expect underscores! - Nigel Williams