[PATCH 3/6] parser: parse headers containing invalid characters

Mon Sep 26 01:09:27 AEST 2016

On 23 Sep 10:06, Daniel Axtens wrote:
> If there is a non-ascii character in a header, parsing fails,
> even on Py27.
> 
> This has huge Py2/Py3 complexities. The Py3 email package has tools
> to handle this - we just need to use them. Py2, on the other hand,
> needs a lot of hand-holding, as explained in the comments.
> 
> This is handy for mails with malformed headers containing weird
> bytes.
> 
> Reported-by: Thomas Monjalon <thomas.monjalon at 6wind.com>
> Suggested-by: Stephen Finucane <stephenfinucane at hotmail.com>
> Signed-off-by: Daniel Axtens <dja at axtens.net>
> 
> ---
> 
> Many thanks to Thomas for his help debugging this, and to Stephen
> for a much better patch.
> 
> This should probably go to a stable branch too. We'll need to start
> some discussion about how to handle bug fixes for people not running
> git mainline (like ozlabs.org and kernel.org).
> ---
>  patchwork/parser.py | 65 ++++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 60 insertions(+), 5 deletions(-)
> 
> diff --git a/patchwork/parser.py b/patchwork/parser.py
> index 3389e96c4f3e..1b4cab1eb1a8 100644
> --- a/patchwork/parser.py
> +++ b/patchwork/parser.py
> @@ -21,7 +21,7 @@
>  
>  import codecs
>  import datetime
> -from email.header import Header, decode_header
> +from email.header import decode_header, make_header
>  from email.utils import parsedate_tz, mktime_tz
>  from fnmatch import fnmatch
>  from functools import reduce
> @@ -155,10 +155,65 @@ def find_date(mail):
>  
>  
>  def find_headers(mail):
> -    return reduce(operator.__concat__,
> -                  ['%s: %s\n' % (k, Header(v, header_name=k,
> -                                           continuation_ws='\t').encode())
> -                   for (k, v) in list(mail.items())])
> +    # We have some Py2/Py3 issues here.
> +    #
> +    # Firstly, the email parser (before we get here)
> +    # Python 3: headers with weird chars are email.header.Header
> +    #           class, others as str
> +    # Python 2: every header is an str
> +    #
> +    # Secondly, the behaviour of decode_header:
> +    # Python 3: weird headers are labelled as unknown-8bit
> +    # Python 2: weird headers are not labelled differently, causing an
> +    #           encoding issue when Py2 goes to then encode it in
> +    #           make_header.
> +    #
> +    # Lastly, making matters worse, in Python2, unknown-8bit doesn't
> +    # seem to be supported as an input to make_header, so not only do
> +    # we have to detect dodgy headers, we have to fix them ourselves.
> +    #
> +    # We solve this by catching any Unicode errors, and then manually
> +    # handling any interesting headers. This will only happen under
> +    # Py2.
> +
> +    headers = {key: decode_header(value) for key, value in
> +               mail.items()}
> +
> +    strings = []
> +    for key, value in headers.items():
> +        try:
> +            header = make_header(value,
> +                                 header_name=key,
> +                                 continuation_ws='\t')
> +        except UnicodeDecodeError:
> +            # We should only get here under Python 2.
> +            # At least one of the parts cannot be encoded as ascii.
> +            # Find out which one and fix it somehow.
> +            new_value=[]
> +            for (part, coding) in value:
> +                if (coding is not None or
> +                    all([ord(x) in range(128) for x in part])):
> +
> +                    # we either have a coding hint or it's all ascii
> +                    # let it through unchanged.
> +                    # TODO: handle invalid data with a coding hint
> +                    new_value += [(part, coding)]
> +                else:
> +                    # We have random bytes that aren't properly coded.
> +                    # We should do the unknown-8bit coding ourselves.
> +                    # For now, we're just going to replace any dubious
> +                    # chars with ?. TODO: replace it with a proper QP
> +                    # unknown-8bit codec.
> +                    new_value += [(part.decode('ascii', errors='replace')
> +                                   .encode('ascii', errors='replace'),
> +                                   None)]
> +            header = make_header(new_value,
> +                                 header_name=key,
> +                                 continuation_ws='\t')
> +        finally:
> +            strings += ['%s: %s' % (key, header.encode())]
> +
> +    return '\n'.join(strings)

Excellent work: this works well and the tests are much appreciated :)
I spent some time reviewing this and I have one question: should we
extend this to other header parsing, such as 'Subject'? I ask because I
noticed that we have a 'clean_header' function which is already used to
handle some unicode headers and it should probably handle things like
invalid characters too. I did some hacking to see if it could slot in
place of the above changes:

     def find_headers(mail):
    +    return '\n'.join(['%s: %s' % (key, clean_header(value)) for key, value
    +                      in mail.items()])
         # We have some Py2/Py3 issues here.

On running this, this failed with a LookupError on Python 3 and the
UnicodeDecodeError on Python 2, so I tried to handle these:

     def clean_header(header):
         """Decode (possibly non-ascii) headers."""
         def decode(fragment):
    -        (frag_str, frag_encoding) = fragment
    +        frag_str, frag_encoding = fragment
             if frag_encoding:
    -            return frag_str.decode(frag_encoding)
    +            if frag_encoding != 'unknown-8bit':
    +                return frag_str.decode(frag_encoding)
    +            else:
    +                return frag_str.decode('ascii', errors='replace')
             elif isinstance(frag_str, six.binary_type):  # python 2
    -            return frag_str.decode()
    +            try:
    +                return frag_str.decode()
    +            except UnicodeDecodeError:
    +                return frag_str.decode('ascii', errors='replace')
    +
             return frag_str

         fragments = [decode(x) for x in decode_header(header)]

When I make this change, all tests pass. Perhaps we should go about
integrating your changes in this function and adding tests for things
like invalid subjects lines to make sure this does what it says on the
tin?

Stephen