[PATCH 3/6] parser: parse headers containing invalid characters
Daniel Axtens
dja at axtens.net
Wed Sep 28 13:27:20 AEST 2016
Hi Stephen,
>
> Excellent work: this works well and the tests are much appreciated :)
> I spent some time reviewing this and I have one question: should we
> extend this to other header parsing, such as 'Subject'? I ask because I
> noticed that we have a 'clean_header' function which is already used to
> handle some unicode headers and it should probably handle things like
> invalid characters too. I did some hacking to see if it could slot in
> place of the above changes:
>
> def find_headers(mail):
> + return '\n'.join(['%s: %s' % (key, clean_header(value)) for key, value
> + in mail.items()])
> # We have some Py2/Py3 issues here.
>
> On running this, this failed with a LookupError on Python 3 and the
> UnicodeDecodeError on Python 2, so I tried to handle these:
>
> def clean_header(header):
> """Decode (possibly non-ascii) headers."""
> def decode(fragment):
> - (frag_str, frag_encoding) = fragment
> + frag_str, frag_encoding = fragment
> if frag_encoding:
> - return frag_str.decode(frag_encoding)
> + if frag_encoding != 'unknown-8bit':
> + return frag_str.decode(frag_encoding)
> + else:
> + return frag_str.decode('ascii', errors='replace')
> elif isinstance(frag_str, six.binary_type): # python 2
> - return frag_str.decode()
> + try:
> + return frag_str.decode()
> + except UnicodeDecodeError:
> + return frag_str.decode('ascii', errors='replace')
> +
> return frag_str
>
> fragments = [decode(x) for x in decode_header(header)]
>
> When I make this change, all tests pass. Perhaps we should go about
> integrating your changes in this function and adding tests for things
> like invalid subjects lines to make sure this does what it says on the
> tin?
So it turns out that:
- we need the change to clean_header() you identified to properly handle
invalid characters in fields that are parsed by clean_header()
- clean_header() returns a decoded (8-bit) string, without
wrapping. find_headers() on the other hand, needs to return encoded
(7-bit) wrapped header lines.
I'll probably add a helper function that deals with the decoding
properly, then I'll pipe that into both clean_header() and
find_header().
Regards,
Daniel
More information about the Patchwork
mailing list