[PATCH 3/6] parser: parse headers containing invalid characters

Daniel Axtens dja at axtens.net
Mon Sep 26 08:20:16 AEST 2016


Hi Stephen,

> Excellent work: this works well and the tests are much appreciated :)
> I spent some time reviewing this and I have one question: should we
> extend this to other header parsing, such as 'Subject'? I ask because I
> noticed that we have a 'clean_header' function which is already used to
> handle some unicode headers and it should probably handle things like
> invalid characters too. I did some hacking to see if it could slot in
> place of the above changes:
>
>      def find_headers(mail):
>     +    return '\n'.join(['%s: %s' % (key, clean_header(value)) for key, value
>     +                      in mail.items()])
>          # We have some Py2/Py3 issues here.

Oooh, good catch. We might need to try to integrate make_header because
it deals with wrapping lines correctly, but that should be pretty easy.
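
For reference, a minimal sketch of how make_header could be combined
with decode_header, assuming Python 3 only. The helper name
normalise_header and the maxlinelen values are hypothetical and are not
taken from Patchwork; only decode_header and make_header come from the
standard email.header module.

```python
from email.header import decode_header, make_header


def normalise_header(name, value, maxlinelen=78):
    """Hypothetical helper: decode RFC 2047 encoded-words, then re-fold.

    Returns a (text, folded) pair: the decoded unicode value and the
    re-encoded, line-wrapped form suitable for writing back out.
    """
    header = make_header(decode_header(value),
                         maxlinelen=maxlinelen,
                         header_name=name)
    return str(header), header.encode()


# An encoded-word followed by plain text.
text, folded = normalise_header('Subject', '=?utf-8?q?caf=C3=A9?= patch')

# A long plain ASCII value is folded at whitespace once it exceeds
# maxlinelen (the header name counts towards the first line).
long_value = ' '.join(['word'] * 30)
long_text, long_folded = normalise_header('Subject', long_value,
                                          maxlinelen=40)
```

Note that this sketch does nothing about invalid bytes or unknown
charsets; it only shows the decoding and folding step.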

> On running this, it failed with a LookupError on Python 3 and a
> UnicodeDecodeError on Python 2, so I tried to handle these:
>
>      def clean_header(header):
>          """Decode (possibly non-ascii) headers."""
>          def decode(fragment):
>     -        (frag_str, frag_encoding) = fragment
>     +        frag_str, frag_encoding = fragment
>              if frag_encoding:
>     -            return frag_str.decode(frag_encoding)
>     +            if frag_encoding != 'unknown-8bit':
>     +                return frag_str.decode(frag_encoding)
>     +            else:
>     +                return frag_str.decode('ascii', errors='replace')
>              elif isinstance(frag_str, six.binary_type):  # python 2
>     -            return frag_str.decode()
>     +            try:
>     +                return frag_str.decode()
>     +            except UnicodeDecodeError:
>     +                return frag_str.decode('ascii', errors='replace')
>     +
>              return frag_str
>
>          fragments = [decode(x) for x in decode_header(header)]
>
> When I make this change, all tests pass. Perhaps we should go about
> integrating your changes in this function and adding tests for things
> like invalid subject lines to make sure this does what it says on the
> tin?

Yep. Conformant email clients should produce headers that are 7-bit
ASCII, with UTF-8 and other charsets carried as RFC 2047 encoded-words
(quoted-printable or base64). But there will certainly be clients that
don't, as we've discovered. I'll add some tests for a variety of headers
with invalid characters and integrate your revision of this for v2.
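
As a rough illustration of the cases such tests would need to cover,
here is a sketch of a tolerant fragment decoder, assuming Python 3
only. It is a variant of the idea in the quoted diff, not the patch
itself: the names decode_fragment and clean are hypothetical, and
catching LookupError generally (rather than special-casing
'unknown-8bit') is my own assumption.

```python
from email.header import decode_header


def decode_fragment(frag, charset):
    """Decode one (value, charset) pair from decode_header without raising.

    Hypothetical helper. Falls back to ASCII with replacement characters
    when the charset is unknown (LookupError, e.g. 'unknown-8bit') or the
    bytes are invalid for the declared charset (UnicodeDecodeError).
    """
    if isinstance(frag, str):
        # decode_header returns str when there are no encoded-words.
        return frag
    try:
        return frag.decode(charset or 'ascii')
    except (LookupError, UnicodeDecodeError):
        return frag.decode('ascii', errors='replace')


def clean(value):
    """Hypothetical helper: decode every fragment of a header value."""
    return ''.join(decode_fragment(f, c) for f, c in decode_header(value))


valid = decode_fragment(b'caf\xc3\xa9', 'utf-8')
unknown_charset = decode_fragment(b'caf\xe9', 'unknown-8bit')
invalid_bytes = decode_fragment(b'bad\xff', 'utf-8')
plain = decode_fragment('plain', None)
from_header = clean('=?utf-8?q?caf=C3=A9?=')
```

Each of the four fragment cases above (valid, unknown charset, invalid
bytes, already-decoded text) would map naturally onto one test.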

Regards,
Daniel

> Stephen

