[ccan] [PATCH] base64: implements rfc4648, the base64 encoding

Tue Feb 24 21:08:16 AEDT 2015

On Mon, 23 Feb 2015, Ran Benita wrote:

... lots of feedback!  Thanks!

> On Thu, Jan 22, 2015 at 11:53:26AM +1100, Peter Barker wrote:
>> Encode buffers into base64 according to rfc4648.
>> Decode base64-encoded buffers according to the same standard.
>
> Some comments below. However I took a look at RFC4648 and it actually
> points to a C99 implementation, which is also GPL2. So maybe a laxer
> license could help to differentiate a bit (of course that's up to you).

I had meant to use LGPLv2+.  I've linked to BSD-MIT instead.

Looking at this actually showed up a problem with ccanlint; base64.c and 
base64.h both referenced LGPLv2+ - but the _info file specified the GPL. 
The check in ccanlint (license_comment.c) looks like this:

if (strstr(lines[i],licenses[m->license].shortname))
 	found_flavor = true;

... so it didn't pick this discrepancy up.  I'll see if I can get a patch 
for that together.

>> + *      	// print the base64-encoded form of the program arguments
>> + *      	for(i=1;i<argc;i++) {
>> + *      		size_t space_required = base64_encoded_length(strlen(argv[i]));
>
> I know it's just an example but maybe save the strlen instead of calling
> it twice?

Done.

>> + *      		base64_encoded_string = malloc(space_required);
>> + *      		base64_encode(base64_encoded_string,sizeof(space_required),argv[i],strlen(argv[i]));
>
> If I got it right, the `sizeof` here should be removed.

Quite right.  I think that must have evolved (badly) from using a static 
buffer.  At least I hope so :-)

>> +	ret = alphabet->decode_map[(unsigned char)b64letter];
>> +	if (ret == '~') {
>
> This assumes that `~` >= 64 right? That's valid but `0xff` would feel
> less arbitrary.

Yes.  Something has gone awry at my end somehow; these have already been 
changed to 0xff.  ~ was chosen originally to make the statically-defined 
decode maps more readable, but David also suggested something less 
arbitrary.

>> +int base64_char_in_alphabet(const base64_alphabet_t *alphabet,
>> +			    const char b64char) {
>> +	return (alphabet->decode_map[(unsigned char)b64char] != '~');
>
> This ought to give a warning about casting out a const (I'd just remove
> it).

The cast prevents a "using signed char as an index" warning.  One cast 
here or many in the callers - here was better, I think.

>> +void base64_encode_tail_using_alphabet(const base64_alphabet_t *alphabet,
>> +				       char dest[4],
>> +				       const char *src, const size_t srclen)
>> +{
>> +	char longsrc[3];
>> +
>> +	memcpy(longsrc, src, srclen);
>> +	memset(longsrc+srclen, '\0', 3-srclen);
>
> `3 - srclen` this looks dangerous; maybe add a warning about `srclen`
> max value in the doc comments. In this case thought you can init longsrc
> to be zerod, i.e. `char longsrc[3] = { 0 };`.

Cool, done.

I've dropped an assert in as well as adjusting the comments in the header 
file.

>> +	while (srclen - src_offset >= 3) {
>
> It's easier to see there's no underflow if it's written as
>    src_offset + 3 <= srclen

Doesn't that introduce an integer overflow?  I know you're going to have 
to work really hard to hit that, but.....

>> +		base64_encode_triplet_using_alphabet(alphabet, &dest[dest_offset], &src[src_offset]);
>> +		src_offset += 3;
>> +		dest_offset += 4;
>> +	}
>> +
>> +	if (srclen - src_offset) {
>
> Here too: `srclen != src_offset` or better `src_offset < srclen`.

Changed.  I was assigning into a local variable there but took it out.

>> +	char longsrc[4];
>> +	int quartet_result;
>> +	size_t insize = srclen;
>> +
>> +	if (insize == 0) {
>> +		return 0;
>> +	}
>> +	while (src[insize-1] == '=') { /* throw away padding symbols */
>
> If I give srclen=1, src="=" there's an underflow here (don't think
> that's valid but might be malicious).

Fixed.

>> +	if (insize == 1) {
.
.
>> +
>> +	return insize - 1;
>
> Here too if I follow correctly `insize` can be 0. Not sure which int
> conversions happen here but it'd be better not to need that...

Fixed.

>> +	for(i=0; srclen - i > 4; i+=4) {
>
> `i + 4 < srclen` would be nicer re. underflow (though I looks fine
> here).

Again, written this way to avoid integer overflow - just in case i == 
size_t - 3, for example.

>> +  "\x22\x23\x24\x25\x26" /* 105 */					\
>> +  "\x27\x28\x29\x2a\x2b" /* 110 */					\
>> +  "\x2c\x2d\x2e\x2f\x30" /* 115 */					\
>> +  "\x31\x32\x33~~" /* 120 */						\
>> +  "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" /* 125 */	\
>> +  "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" /* 175 */	\
>> +  "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" /* 225 */
>> +
>> +static const base64_alphabet_t alphabet_rfc4648_pregen = {
>
> Why not export this one?

I was trying to keep the interface as small as possible.  I couldn't see 
any use case for it outside the library.  Having said that, I can see a 
caller switching between alphabets and wanting to use the rfc4648 one 
(rather than switching between functions to call).

This actually also points out the whole "alphabet" problem.  This library 
is currently confused as to what "alphabet" is.  The next version of this 
patch might be have base64_init_maps(const char *), base64_maps_t, 
base64_decode_tail_using_maps(base64_maps_t x, ...) etc.  A better name 
I've not come up with.

>> +  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/",
>> +  base64_decode_map_rfc4648
>> +};
>> +
>> +#undef base64_decode_map_rfc4648
>> +
>
> Perhaps the following functions should be `static inline` in the header?
> Otherwise in some cases the function call can't be inlined away. It also
> makes it clear these are just simple wrappers.

Done.  Those functions have jumped between base64.c and base64.h a lot :-)

>> @@ -0,0 +1,200 @@
>> +/* Licensed under LGPLv2+ - see LICENSE file for details */
>> +#ifndef CCAN_BASE64_H>> +#define CCAN_BASE64_H
>> +
>> +#include <stdio.h> /* For size_t */
>
> If you only need `size_t`, you can include stddef.h instead.

Changed.

>> +/**
>> + * base64_alphabet_rfc4648 - the base64 alphabet as defined in rfc4648
>> + */
>> +static const char base64_alphabet_rfc4648[] =
>
> I'd put an explicit length here, though there's no reason to...

Hmmm.  I think other people might object to it being unnecessary verbiage. 
It doesn't fuss me either way - but I think it does add a bit of clarity. 
Added the length.

>> +/**
>> + * base64_encode - Encode a buffer into base64 according to rfc4648
>> + * @param dest Buffer to encode into
>> + * @param destlen Length of the destination buffer
>> + * @param src Buffer to encode
>> + * @param srclen Length of the data to encode
>> + * @return Number of encoded bytes set in dest. -1 on error (and errno set)
>
> Maybe return `-ERRORCODE` on error, instead of -1 (errno is annoying to
> use). What errors can happen here? (Also for other functions).

OK, now this is interesting.  There are a few errors the existing 
interface might return - EDOM for passing bad encoded data in, EOVERFLOW 
for a too-small buffer for encode/decode, EINVAL if the input is short.

There's no reason why the existing entry points couldn't return 
-ERRORCODE.

I'm curious to know *why*, 'though.  If the API is ever extended such that 
something returns a pointer - errno is going to have to be used anyway, 
and consistency would be lost.

Stylistically, I'm not sure how returning the error code is advantageous:

if ((size_t err = base64_encode(...)) < 0) {
 	perror(err)
 	abort();
}

vs

if (base64_encode(...)) < 0) {
 	perror(errno);
 	abort();
}

Obviously using errno irks at least one programmer.  I'd like to know why 
you find it annoying to use :-)

>> +size_t base64_encode(char *dest, const size_t destlen, const char *src, const size_t srclen);
>
> `const size_t foo` parameter is a bit funny-looking.

Hmmm.  David mentioned the same thing.  I defended it then by pointing out 
it (a) ensures you don't accidentally fiddle with something you weren't 
expecting to and (b) might give hints to the compiler.  This is obviously 
not a globally accepted argument :-)

> Have you thought about adding a convenience function which only takes
> src+srclen and returns a malloc'd buffer? This seems to be a pretty
> common use case, e.g. in the example in `_info`. Also for decode.

If you look back through the ccan mail archives I posted a ridiculously 
complicated API which would encode/decode your strings in place, resizing 
buffers appropriately (assuming it could take() them).  David smacked me 
around the head to keep this as simple as possible to start off with.  So 
to answer your question, "yes, absolutely" :-)  Didn't do this for two 
reasons: (a) aforementioned simplicity and (b) base64_encode returns 
size_t and 0 is a valid return value; there's no out-of-band return value. 
A mallocing encode signature (and I had this at one stage, I believe) 
might look like this:

if (base64_mallocing_encode("fred",4, &encoded_length) < 0) { ... }

Which is not as "simple" :-)

>> +size_t base64_decoded_length(size_t srclen);
>> +
>> +/**
>> + * base64_decode - decode An rfc4648 base64-encoded string
>> + * @param dest Buffer to decode into
>> + * @param destlen Length of the destination buffer
>> + * @param src Buffer to decode
>> + * @param srclen Length of the data to decode
>> + * @return Number of decoded bytes set in dest. -1 on error (and errno set)
>> + * @note dest will be nul-padded to destlen
>
> Since base-64 pads with `=` in the last bits this can give the wrong
> impression!

Hmmm.  A little bit of rewording:
  * @note dest will be nul-padded to destlen (past any required padding)

which is still not great, but I'm trying to avoid epic comment blocks in 
the header.

> In any case why should `dest` be nul-padded? (I always thought strncpy
> was bad for doing that unless you write out directory entries in a
> filesystem or something).

It was, in fact, to make it look like strncpy.  The belief is that the 
vast majority of the time you will be printing the base64 strings out.

You can't always-nul-terminate (i.e. have encoded_size return +1 and 
explicitly null things); you might be encoding into a larger buffer (and 
the caller scribbling down the length you return alongside it).

You could not-nul-terminate - but why would somone pass in a buffer size 
larger than what's strictly required if they didn't want you to play with 
it? :-)

Callers would look like this:
src = "fred";
len = base64_encoded_length(strlen(src));
dest = xmalloc(len+1);
base64_encode(dest,len,src);
dest[len] = '\0';
printf("%s\n",dest);

Or this:
src = "fred";
len = base64_encoded_length(strlen(src)) + 1; /* for null-termination */
dest = xmalloc(len);
base64_encode(dest,len,src);
printf("%s\n",dest);

I really prefer the latter.  I'm guessing you don't :-)

>> +size_t base64_decode(char *dest, const size_t destlen,
>> +		     const char *src, const size_t srclen);
>> +
>> +
>> +/**
>> + * base64_encode_triplet - encode 3 bytes into base64
>> + * @param dest Buffer containing at least 4 bytes
>
> I'd remove the "at least" here, to match the type `char dest[4]`.

OK, I'd chosen my words carefully there - we don't care if the buffer is 
larger :-)  However, it obviously caused confusion the way it was, and "at 
least" could be vaguely inferred.

>> + * @param src Buffer containing 3 bytes
>> + */
>> +void base64_encode_triplet(char dest[4], const char src[3]);
>
> These sets of functions with tail, triplet, quartet seem like
> implementation related and shouldn't be exposed as API (unless they're
> useful for something I can't think of).

Oh, yes.  They actually make the library useful IMO.  Simple, dodgy 
streaming base64 (see here the confusion about what an "alphabet" is):

#include <stdio.h>
#include <base64.h>
#include <errno.h>
#include <stdlib.h>

int
main(int argc, char * argv[]) {
   char buffer[1];
   char todecode[4];
   int todecode_offset = 0;
   ssize_t ret = 0;

   base64_alphabet_t alphabet;
   base64_init_alphabet(&alphabet,base64_alphabet_rfc4648);

   while (ret = read(0,buffer,1)) {
     if (ret == -1) {
       perror("read");
       exit(1);
     }
     if (base64_char_in_alphabet(&alphabet,buffer[0])) {
       todecode[todecode_offset++] = buffer[0];
     }
     if (todecode_offset > 3) {
       char decoded[3];
       base64_decode_quartet_using_alphabet(&alphabet,decoded,todecode);
       todecode_offset = 0;
       printf("%c%c%c",decoded[0],decoded[1],decoded[2]);
     }
   }
   if (todecode_offset) {
       char decoded[3];

  base64_decode_tail_using_alphabet(&alphabet,decoded,todecode,todecode_offset);
       todecode_offset = 0;
       printf("%c%c%c",decoded[0],decoded[1],decoded[2]);
   }
}

I could drop a somewhat less unpretty copy of that in an examples 
directory?

>> +int base64_char_in_alphabet(const base64_alphabet_t *alphabet,
>> +			    const char b64char);
>
> I'm assuming this is for checking if an encoded string is valid? If so,
> wouldn't it be more straightforward to have a `validate` function (i.e.
> just checks if the string is OK without actual decoding).

You can see that in use above :-)  Useful for skipping e.g. carriage 
returns in the input.

> Ran

Many, many thanks for taking the time to look through this.  I'm willing 
to be convinved on things like errno and const-in-parameter-lists, so 
please convince me :-)

I'll wait another day or two to see if anyone else weighs in.  Otherwise 
I'll repost the patch.

Thanks,
-- 
Peter Barker                          |   Programmer,Sysadmin,Geek.
pbarker at barker.dropbear.id.au	      |   You need a bigger hammer.
:: It's a hack! Expect underscores! - Nigel Williams