[ccan] [PATCH] base64: implements rfc4648, the base64 encoding
Ran Benita
ran234 at gmail.com
Tue Feb 24 23:16:28 AEDT 2015
On Tue, Feb 24, 2015 at 09:08:16PM +1100, Peter Barker wrote:
> On Mon, 23 Feb 2015, Ran Benita wrote:
>
> ... lots of feedback! Thanks!
>
> >On Thu, Jan 22, 2015 at 11:53:26AM +1100, Peter Barker wrote:
> >>Encode buffers into base64 according to rfc4648.
> >>Decode base64-encoded buffers according to the same standard.
> >
> >Some comments below. However I took a look at RFC4648 and it actually
> >points to a C99 implementation, which is also GPL2. So maybe a laxer
> >license could help to differentiate a bit (of course that's up to you).
>
> I had meant to use LGPLv2+. I've linked to BSD-MIT instead.
Great, now I'll be able to use it :)
> Looking at this actually showed up a problem with ccanlint; base64.c and
> base64.h both referenced LGPLv2+ - but the _info file specified the GPL. The
> check in ccanlint (license_comment.c) looks like this:
>
> if (strstr(lines[i],licenses[m->license].shortname))
> found_flavor = true;
>
> ... so it didn't pick this discrepancy up. I'll see if I can get a patch
> for that together.
>
> >>+ * // print the base64-encoded form of the program arguments
> >>+ * for(i=1;i<argc;i++) {
> >>+ * size_t space_required = base64_encoded_length(strlen(argv[i]));
> >
> >I know it's just an example but maybe save the strlen instead of calling
> >it twice?
>
> Done.
>
> >>+ * base64_encoded_string = malloc(space_required);
> >>+ * base64_encode(base64_encoded_string,sizeof(space_required),argv[i],strlen(argv[i]));
> >
> >If I got it right, the `sizeof` here should be removed.
>
> Quite right. I think that must have evolved (badly) from using a static
> buffer. At least I hope so :-)
>
> >>+ ret = alphabet->decode_map[(unsigned char)b64letter];
> >>+ if (ret == '~') {
> >
> >This assumes that `~` >= 64 right? That's valid but `0xff` would feel
> >less arbitrary.
>
> Yes. Something has gone awry at my end somehow; these have already been
> changed to 0xff. ~ was chosen originally to make the statically-defined
> decode maps more readable, but David also suggested something less
> arbitrary.
>
> >>+int base64_char_in_alphabet(const base64_alphabet_t *alphabet,
> >>+ const char b64char) {
> >>+ return (alphabet->decode_map[(unsigned char)b64char] != '~');
> >
> >This ought to give a warning about casting out a const (I'd just remove
> >it).
>
> The cast prevents a "using signed char as an index" warning. One cast here
> or many in the callers - here was better, I think.
The `char` -> `unsigned char` is OK - I was thinking about the `const`
-> no-`const` cast. I remember gcc ferociously warning about this, but
it doesn't warn here. So I suppose it only warns about pointers (where
such a cast can actually be dangerous), not integral types.
> >>+void base64_encode_tail_using_alphabet(const base64_alphabet_t *alphabet,
> >>+ char dest[4],
> >>+ const char *src, const size_t srclen)
> >>+{
> >>+ char longsrc[3];
> >>+
> >>+ memcpy(longsrc, src, srclen);
> >>+ memset(longsrc+srclen, '\0', 3-srclen);
> >
> >`3 - srclen` this looks dangerous; maybe add a warning about `srclen`
> >max value in the doc comments. In this case thought you can init longsrc
> >to be zerod, i.e. `char longsrc[3] = { 0 };`.
>
> Cool, done.
>
> I've dropped an assert in as well as adjusting the comments in the header
> file.
>
> >>+ while (srclen - src_offset >= 3) {
> >
> >It's easier to see there's no underflow if it's written as
> > src_offset + 3 <= srclen
>
> Doesn't that introduce an integer overflow? I know you're going to have to
> work really hard to hit that, but.....
Yes, I just ignored that :) If you want to protect against overflow, and
you can "prove" that there's no underflow (which is usually more likely
with unsigned's), than OK.
> >>+ base64_encode_triplet_using_alphabet(alphabet, &dest[dest_offset], &src[src_offset]);
> >>+ src_offset += 3;
> >>+ dest_offset += 4;
> >>+ }
> >>+
> >>+ if (srclen - src_offset) {
> >
> >Here too: `srclen != src_offset` or better `src_offset < srclen`.
>
> Changed. I was assigning into a local variable there but took it out.
>
> >>+ char longsrc[4];
> >>+ int quartet_result;
> >>+ size_t insize = srclen;
> >>+
> >>+ if (insize == 0) {
> >>+ return 0;
> >>+ }
> >>+ while (src[insize-1] == '=') { /* throw away padding symbols */
> >
> >If I give srclen=1, src="=" there's an underflow here (don't think
> >that's valid but might be malicious).
>
> Fixed.
>
> >>+ if (insize == 1) {
> .
> .
> >>+
> >>+ return insize - 1;
> >
> >Here too if I follow correctly `insize` can be 0. Not sure which int
> >conversions happen here but it'd be better not to need that...
>
> Fixed.
>
> >>+ for(i=0; srclen - i > 4; i+=4) {
> >
> >`i + 4 < srclen` would be nicer re. underflow (though I looks fine
> >here).
>
> Again, written this way to avoid integer overflow - just in case i == size_t
> - 3, for example.
>
> >>+ "\x22\x23\x24\x25\x26" /* 105 */ \
> >>+ "\x27\x28\x29\x2a\x2b" /* 110 */ \
> >>+ "\x2c\x2d\x2e\x2f\x30" /* 115 */ \
> >>+ "\x31\x32\x33~~" /* 120 */ \
> >>+ "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" /* 125 */ \
> >>+ "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" /* 175 */ \
> >>+ "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" /* 225 */
> >>+
> >>+static const base64_alphabet_t alphabet_rfc4648_pregen = {
> >
> >Why not export this one?
>
> I was trying to keep the interface as small as possible. I couldn't see any
> use case for it outside the library. Having said that, I can see a caller
> switching between alphabets and wanting to use the rfc4648 one (rather than
> switching between functions to call).
>
> This actually also points out the whole "alphabet" problem. This library is
> currently confused as to what "alphabet" is. The next version of this patch
> might be have base64_init_maps(const char *), base64_maps_t,
> base64_decode_tail_using_maps(base64_maps_t x, ...) etc. A better name I've
> not come up with.
"alphabet" made sense to me. I thought you didn't export this pregen'ed
alphabet since you can get the same with the `base64_alphabet_rfc4648`
array and `base64_init_alphabet()`, but then I thought it'd be more
helpful to just export the entire thing.
> >>+ "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/",
> >>+ base64_decode_map_rfc4648
> >>+};
> >>+
> >>+#undef base64_decode_map_rfc4648
> >>+
> >
> >Perhaps the following functions should be `static inline` in the header?
> >Otherwise in some cases the function call can't be inlined away. It also
> >makes it clear these are just simple wrappers.
>
> Done. Those functions have jumped between base64.c and base64.h a lot :-)
>
> >>@@ -0,0 +1,200 @@
> >>+/* Licensed under LGPLv2+ - see LICENSE file for details */
> >>+#ifndef CCAN_BASE64_H>> +#define CCAN_BASE64_H
> >>+
> >>+#include <stdio.h> /* For size_t */
> >
> >If you only need `size_t`, you can include stddef.h instead.
>
> Changed.
>
> >>+/**
> >>+ * base64_alphabet_rfc4648 - the base64 alphabet as defined in rfc4648
> >>+ */
> >>+static const char base64_alphabet_rfc4648[] =
> >
> >I'd put an explicit length here, though there's no reason to...
>
> Hmmm. I think other people might object to it being unnecessary verbiage.
> It doesn't fuss me either way - but I think it does add a bit of clarity.
> Added the length.
Yes, this one is completely personal preference. I does have an
advantage/disadvantage (depending on how you look at it) if you
accidentally miss some characters.
> >>+/**
> >>+ * base64_encode - Encode a buffer into base64 according to rfc4648
> >>+ * @param dest Buffer to encode into
> >>+ * @param destlen Length of the destination buffer
> >>+ * @param src Buffer to encode
> >>+ * @param srclen Length of the data to encode
> >>+ * @return Number of encoded bytes set in dest. -1 on error (and errno set)
> >
> >Maybe return `-ERRORCODE` on error, instead of -1 (errno is annoying to
> >use). What errors can happen here? (Also for other functions).
>
> OK, now this is interesting. There are a few errors the existing interface
> might return - EDOM for passing bad encoded data in, EOVERFLOW for a
> too-small buffer for encode/decode, EINVAL if the input is short.
Please mention them where they can occur. I and probably others usually
skip the documentation blocks when the function signature itself is
clear enough (as in this case), but then look back for the error
handling.
>
> There's no reason why the existing entry points couldn't return -ERRORCODE.
>
> I'm curious to know *why*, 'though. If the API is ever extended such that
> something returns a pointer - errno is going to have to be used anyway, and
> consistency would be lost.
>
> Stylistically, I'm not sure how returning the error code is advantageous:
>
> if ((size_t err = base64_encode(...)) < 0) {
> perror(err)
Here it would be `-err` :)
> abort();
> }
>
> vs
>
> if (base64_encode(...)) < 0) {
Ah, I missed this one - this branch is never taken since the type of the
return value is `size_t` (unsigned). It would probably work with some
implicit casting if you write an explicit `== -1` but what `read(2)`,
`write(2)` do instead is return `ssize_t`. So I'd use it here, if you're
willing to pay the extra bit in the range of possible buffer lengths.
> perror(errno);
> abort();
> }
>
> Obviously using errno irks at least one programmer. I'd like to know why
> you find it annoying to use :-)
>
I personally don't like errno because you always have to remember to
zero it before using checking it, and if you want to propagate it you
have to use `saved_errno` variables, etc. Whereas with the error-code
return everything happens in the program flow.
Of course errno is well-established so if you prefer it you should use
it!
> >>+size_t base64_encode(char *dest, const size_t destlen, const char *src, const size_t srclen);
> >
> >`const size_t foo` parameter is a bit funny-looking.
>
> Hmmm. David mentioned the same thing. I defended it then by pointing out
> it (a) ensures you don't accidentally fiddle with something you weren't
> expecting to and (b) might give hints to the compiler. This is obviously
> not a globally accepted argument :-)
That's cool, as long as it's done consciously :)
> >Have you thought about adding a convenience function which only takes
> >src+srclen and returns a malloc'd buffer? This seems to be a pretty
> >common use case, e.g. in the example in `_info`. Also for decode.
>
> If you look back through the ccan mail archives I posted a ridiculously
> complicated API which would encode/decode your strings in place, resizing
> buffers appropriately (assuming it could take() them). David smacked me
> around the head to keep this as simple as possible to start off with. So to
> answer your question, "yes, absolutely" :-) Didn't do this for two reasons:
> (a) aforementioned simplicity and (b) base64_encode returns size_t and 0 is
> a valid return value; there's no out-of-band return value. A mallocing
> encode signature (and I had this at one stage, I believe) might look like
> this:
>
> if (base64_mallocing_encode("fred",4, &encoded_length) < 0) { ... }
This should be `encoded_buffer` right?
>
> Which is not as "simple" :-)
Oh sorry, I missed the previous discussion. If it turned out to
complicate the API too much then OK!
> >>+size_t base64_decoded_length(size_t srclen);
> >>+
> >>+/**
> >>+ * base64_decode - decode An rfc4648 base64-encoded string
> >>+ * @param dest Buffer to decode into
> >>+ * @param destlen Length of the destination buffer
> >>+ * @param src Buffer to decode
> >>+ * @param srclen Length of the data to decode
> >>+ * @return Number of decoded bytes set in dest. -1 on error (and errno set)
> >>+ * @note dest will be nul-padded to destlen
> >
> >Since base-64 pads with `=` in the last bits this can give the wrong
> >impression!
>
> Hmmm. A little bit of rewording:
> * @note dest will be nul-padded to destlen (past any required padding)
>
> which is still not great, but I'm trying to avoid epic comment blocks in the
> header.
>
> >In any case why should `dest` be nul-padded? (I always thought strncpy
> >was bad for doing that unless you write out directory entries in a
> >filesystem or something).
>
> It was, in fact, to make it look like strncpy. The belief is that the vast
> majority of the time you will be printing the base64 strings out.
>
> You can't always-nul-terminate (i.e. have encoded_size return +1 and
> explicitly null things); you might be encoding into a larger buffer (and the
> caller scribbling down the length you return alongside it).
>
> You could not-nul-terminate - but why would somone pass in a buffer size
> larger than what's strictly required if they didn't want you to play with
> it? :-)
>
> Callers would look like this:
> src = "fred";
> len = base64_encoded_length(strlen(src));
> dest = xmalloc(len+1);
> base64_encode(dest,len,src);
> dest[len] = '\0';
Is `base64_encoded_length` exact? Otherwise I'd use `base_encode()`
retval here.
> printf("%s\n",dest);
>
> Or this:
> src = "fred";
> len = base64_encoded_length(strlen(src)) + 1; /* for null-termination */
> dest = xmalloc(len);
> base64_encode(dest,len,src);
> printf("%s\n",dest);
>
> I really prefer the latter. I'm guessing you don't :-)
If you prefer it then OK.
I dislike it because the expectation is that runtime ~ (length of input)
not (length of output buffer). And I also remember avoiding `strncpy` a
few times because it has this behavior.
Maybe you can say that the buffer is nul-terminated if possible? (i.e.
writes only one '\0' if there is space). Although TBH even that I'd leave
out.
>
> >>+size_t base64_decode(char *dest, const size_t destlen,
> >>+ const char *src, const size_t srclen);
> >>+
> >>+
> >>+/**
> >>+ * base64_encode_triplet - encode 3 bytes into base64
> >>+ * @param dest Buffer containing at least 4 bytes
> >
> >I'd remove the "at least" here, to match the type `char dest[4]`.
>
> OK, I'd chosen my words carefully there - we don't care if the buffer is
> larger :-) However, it obviously caused confusion the way it was, and "at
> least" could be vaguely inferred.
>
> >>+ * @param src Buffer containing 3 bytes
> >>+ */
> >>+void base64_encode_triplet(char dest[4], const char src[3]);
> >
> >These sets of functions with tail, triplet, quartet seem like
> >implementation related and shouldn't be exposed as API (unless they're
> >useful for something I can't think of).
>
> Oh, yes. They actually make the library useful IMO. Simple, dodgy
> streaming base64 (see here the confusion about what an "alphabet" is):
>
> #include <stdio.h>
> #include <base64.h>
> #include <errno.h>
> #include <stdlib.h>
>
> int
> main(int argc, char * argv[]) {
> char buffer[1];
> char todecode[4];
> int todecode_offset = 0;
> ssize_t ret = 0;
>
> base64_alphabet_t alphabet;
> base64_init_alphabet(&alphabet,base64_alphabet_rfc4648);
>
> while (ret = read(0,buffer,1)) {
> if (ret == -1) {
> perror("read");
> exit(1);
> }
> if (base64_char_in_alphabet(&alphabet,buffer[0])) {
> todecode[todecode_offset++] = buffer[0];
> }
> if (todecode_offset > 3) {
> char decoded[3];
> base64_decode_quartet_using_alphabet(&alphabet,decoded,todecode);
> todecode_offset = 0;
> printf("%c%c%c",decoded[0],decoded[1],decoded[2]);
> }
> }
> if (todecode_offset) {
> char decoded[3];
>
> base64_decode_tail_using_alphabet(&alphabet,decoded,todecode,todecode_offset);
> todecode_offset = 0;
> printf("%c%c%c",decoded[0],decoded[1],decoded[2]);
> }
> }
Yea that's nice, haven't thought about this usecase.
> I could drop a somewhat less unpretty copy of that in an examples directory?
Why not!
> >>+int base64_char_in_alphabet(const base64_alphabet_t *alphabet,
> >>+ const char b64char);
> >
> >I'm assuming this is for checking if an encoded string is valid? If so,
> >wouldn't it be more straightforward to have a `validate` function (i.e.
> >just checks if the string is OK without actual decoding).
>
> You can see that in use above :-) Useful for skipping e.g. carriage returns
> in the input.
>
> >Ran
>
> Many, many thanks for taking the time to look through this. I'm willing to
> be convinved on things like errno and const-in-parameter-lists, so please
> convince me :-)
Thanks for writing & publishing it!
Ran
> I'll wait another day or two to see if anyone else weighs in. Otherwise
> I'll repost the patch.
>
> Thanks,
> --
> Peter Barker | Programmer,Sysadmin,Geek.
> pbarker at barker.dropbear.id.au | You need a bigger hammer.
> :: It's a hack! Expect underscores! - Nigel Williams
More information about the ccan
mailing list