[PATCH 3/6] parser: parse headers containing invalid characters

Wed Sep 28 13:27:20 AEST 2016

Hi Stephen,

>
> Excellent work: this works well and the tests are much appreciated :)
> I spent some time reviewing this and I have one question: should we
> extend this to other header parsing, such as 'Subject'? I ask because I
> noticed that we have a 'clean_header' function which is already used to
> handle some unicode headers and it should probably handle things like
> invalid characters too. I did some hacking to see if it could slot in
> place of the above changes:
>
>      def find_headers(mail):
>     +    return '\n'.join(['%s: %s' % (key, clean_header(value)) for key, value
>     +                      in mail.items()])
>          # We have some Py2/Py3 issues here.
>
> On running this, this failed with a LookupError on Python 3 and the
> UnicodeDecodeError on Python 2, so I tried to handle these:
>
>      def clean_header(header):
>          """Decode (possibly non-ascii) headers."""
>          def decode(fragment):
>     -        (frag_str, frag_encoding) = fragment
>     +        frag_str, frag_encoding = fragment
>              if frag_encoding:
>     -            return frag_str.decode(frag_encoding)
>     +            if frag_encoding != 'unknown-8bit':
>     +                return frag_str.decode(frag_encoding)
>     +            else:
>     +                return frag_str.decode('ascii', errors='replace')
>              elif isinstance(frag_str, six.binary_type):  # python 2
>     -            return frag_str.decode()
>     +            try:
>     +                return frag_str.decode()
>     +            except UnicodeDecodeError:
>     +                return frag_str.decode('ascii', errors='replace')
>     +
>              return frag_str
>
>          fragments = [decode(x) for x in decode_header(header)]
>
> When I make this change, all tests pass. Perhaps we should go about
> integrating your changes in this function and adding tests for things
> like invalid subjects lines to make sure this does what it says on the
> tin?

So it turns out that:

- we need the change to clean_header() you identified to properly handle
  invalid characters in fields that are parsed by clean_header()

- clean_header() returns a decoded (8-bit) string, without
  wrapping. find_headers() on the other hand, needs to return encoded
  (7-bit) wrapped header lines.

I'll probably add a helper function that deals with the decoding
properly, then I'll pipe that into both clean_header() and
find_header().

Regards,
Daniel