[PATCH 3/6] parser: parse headers containing invalid characters
Daniel Axtens
dja at axtens.net
Mon Sep 26 08:20:16 AEST 2016
Hi Stephen,
> Excellent work: this works well and the tests are much appreciated :)
> I spent some time reviewing this and I have one question: should we
> extend this to other header parsing, such as 'Subject'? I ask because I
> noticed that we have a 'clean_header' function which is already used to
> handle some unicode headers and it should probably handle things like
> invalid characters too. I did some hacking to see if it could slot in
> place of the above changes:
>
>  def find_headers(mail):
> +    return '\n'.join(['%s: %s' % (key, clean_header(value)) for key, value
> +                      in mail.items()])
>      # We have some Py2/Py3 issues here.
Oooh, good catch. We might need to try to integrate make_header because
it deals with wrapped (folded) header lines correctly, but that should be
pretty easy.
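
To illustrate the wrapping point, here is a minimal sketch using only the
standard library's email.header module on Python 3. The sample subject and
the helper name unfold_header are made up for illustration and are not part
of Patchwork:

```python
from email.header import decode_header, make_header


def unfold_header(raw):
    """Hypothetical helper: decode a (possibly folded) header to text."""
    # decode_header() splits the value into (bytes_or_str, charset) pairs;
    # make_header() reassembles those pairs into a single Header object,
    # and str() renders that object as unicode text.
    return str(make_header(decode_header(raw)))


# An RFC 2047 encoded subject folded across two lines (illustrative data).
folded = '=?utf-8?q?caf=C3=A9?=\n =?utf-8?q?_au_lait?='
print(unfold_header(folded))
```

The folding whitespace between the two encoded words is dropped during
decoding, so the caller gets one continuous string back.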
> On running this, it failed with a LookupError on Python 3 and a
> UnicodeDecodeError on Python 2, so I tried to handle these:
>
>  def clean_header(header):
>      """Decode (possibly non-ascii) headers."""
>      def decode(fragment):
> -        (frag_str, frag_encoding) = fragment
> +        frag_str, frag_encoding = fragment
>          if frag_encoding:
> -            return frag_str.decode(frag_encoding)
> +            if frag_encoding != 'unknown-8bit':
> +                return frag_str.decode(frag_encoding)
> +            else:
> +                return frag_str.decode('ascii', errors='replace')
>          elif isinstance(frag_str, six.binary_type):  # python 2
> -            return frag_str.decode()
> +            try:
> +                return frag_str.decode()
> +            except UnicodeDecodeError:
> +                return frag_str.decode('ascii', errors='replace')
> +
>          return frag_str
>
>      fragments = [decode(x) for x in decode_header(header)]
>
> When I make this change, all tests pass. Perhaps we should go about
> integrating your changes in this function and adding tests for things
> like invalid subject lines to make sure this does what it says on the
> tin?
Yep. So conformant email clients should produce email headers that are
7-bit ASCII, with UTF-8 and other charsets carried as RFC 2047
encoded-words (quoted-printable or base64). But there will certainly be
clients that don't, as we've discovered. I'll add some tests for a variety
of headers with invalid characters and integrate your revision of
clean_header for v2.
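
For reference, a rough Python 3-only sketch of the fallback you describe,
exercised against a raw 8-bit subject. The names decode_fragment and
clean_header_sketch and the sample message are invented for illustration;
this is an approximation of the idea, not the code that will land in v2:

```python
from email import message_from_bytes
from email.header import decode_header


def decode_fragment(fragment):
    """Hypothetical stand-in for the inner decode() helper."""
    frag, charset = fragment
    if isinstance(frag, str):
        # Plain ASCII headers come back from decode_header() as str.
        return frag
    if charset and charset != 'unknown-8bit':
        return frag.decode(charset, errors='replace')
    # 'unknown-8bit' is not a real codec, so decoding with it raises
    # LookupError; fall back to ASCII and replace the invalid bytes.
    return frag.decode('ascii', errors='replace')


def clean_header_sketch(header):
    """Decode every fragment of a header and join the results."""
    return ''.join(decode_fragment(f) for f in decode_header(header))


# A non-conformant message: a raw 0xe9 byte in the Subject header.
msg = message_from_bytes(b'Subject: caf\xe9 patch\n\nbody\n')
print(clean_header_sketch(msg['Subject']))
print(clean_header_sketch('=?utf-8?q?caf=C3=A9?='))
```

With the default (compat32) policy the parser hands the bad subject back
tagged as 'unknown-8bit', which is where the LookupError came from; the
sketch substitutes U+FFFD for the offending byte instead of failing.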
Regards,
Daniel
> Stephen