[PATCH v2 4/7] parser: parse headers containing invalid characters or codings

Sat Oct 8 23:15:46 AEDT 2016

On 29 Sep 09:01, Daniel Axtens wrote:
> >> +def sanitise_header(header_contents, header_name=None):
> >> +    """Given a header with header_contents, optionally labelled
> >> +    header_name, decode it with decode_header, sanitise it to make
> >> +    sure it decodes correctly and contains no invalid characters.
> >> +    Then encode the result with make_header()
> >> +    """
> >> +
> >> +    # We have some Py2/Py3 issues here.
> >> +    #
> >> +    # Firstly, the email parser (before we get here)
> >> +    # Python 3: headers with weird chars are email.header.Header
> >> +    #           class, others as str
> >> +    # Python 2: every header is an str
> >> +    #
> >> +    # Secondly, the behaviour of decode_header:
> >> +    # Python 3: weird headers are labelled as unknown-8bit
> >> +    # Python 2: weird headers are not labelled differently
> >> +    #
> >> +    # Lastly, aking matters worse, in Python2, unknown-8bit doesn't
> >> +    # seem to be supported as an input to make_header, so not only do
> >> +    # we have to detect dodgy headers, we have to fix them ourselves.
> >> +    #
> >> +    # We solve this by catching any Unicode errors, and then manually
> >> +    # handling any interesting headers.
> >
> > I'm going to move all the above into the docstring, if that's OK by
> > you.
> 
> I don't mind, but my understanding was that the docstring described the
> function, rather than the implementation details, hence why I had it in
> a comment originally.

You're right. Let's keep this as is.

> >> +                # python2 - no support in make_header for unknown-8bit
> >> +                # We should do unknown-8bit coding ourselves.
> >> +                # For now, we're just going to replace any dubious
> >> +                # chars with ?.
> >> +                #
> >> +                # TODO: replace it with a proper QP unknown-8bit codec.
> >
> > How could we resolve this TODO in the future?
> >
> 
> I think we use email.charset. Possibly we can copy some of the Py3 code
> that implements unknown-8bit. It all looks quite finnicky. Also, any
> time we get a message with an invalid header and we're relying on the
> patchwork header display to figure it out is an edge case on top of
> another edge case, so I didn't think it worth much more time.

Yeah, we can solve this if it comes up. Just good to see that there is
a potential resolution if needed.

> >> +    # on Py2, we want to do unicode(), on Py3, str().
> >> +    # That gets us the decoded, un-wrapped header.
> >> +    if six.PY2:
> >> +        header_str = unicode(sane_header)
> >> +    else:
> >> +        header_str = str(sane_header)
> >
> > nit: this looks like something we could use the 'six.u()' function for?
> >
> > https://pythonhosted.org/six/#six.u
> >
> I thought this too. Then I looked at the code for django.utils.six:
> https://github.com/django/django/blob/master/django/utils/six.py#L646
> 
> 
>     def u(s):
>         return unicode(s.replace(r'\\', r'\\\\'), "unicode_escape")
> 
> That won't work because we're passing in a header class to u().  The
> header class, as far as I know, doesn't provide .replace(), and even if
> it did it's not clear we want to call it.

Good spot. Thanks for checking this out.

> >> +    strings = [('%s: %s' % (key, header.encode()))
> >
> > How come '.encode' is suitable here, yet we need to call 'unicode'/'str'
> > in the 'clean_header' function?
> 
> Because the header class overrides .encode(), __unicode__ and __str__,
> and they do different things.
> 
> For headers:
> 
>  - str (py3) and unicode (py2) provides a non-line-wrapped version of
>    the header, in unicode - that is, not in quoted-printable or base64
>    7-bit form.
> 
>  - encode provides a line-wrapped, 7-bit encoded form of the header.
> 
> For find_headers, we're creating the headers that are displayed when you
> click on the 'show' link on the patch display, under Details, and just
> beneath the message id and state. These are supposed to be fully encoded
> and line wrapped: they're supposed to look like mail headers.
> 
> For clean_header, we're looking for the human readable form: decoded
> From quoted printable or base64, converted from whatever charset they
> were in, and not wrapped.
> 
> Hence the code. Admittedly this is not super clear, but I blame the API
> of the email.Header module.

Ah, I understand you now. I'll stick some of those comments in there.

This looks good now so, hence:

Reviewed-by: Stephen Finucane <stephen at that.guru>

and applied.