[PATCH v2 4/7] parser: parse headers containing invalid characters or codings
Stephen Finucane
stephen at that.guru
Sat Oct 8 23:15:46 AEDT 2016
On 29 Sep 09:01, Daniel Axtens wrote:
> >> +def sanitise_header(header_contents, header_name=None):
> >> + """Given a header with header_contents, optionally labelled
> >> + header_name, decode it with decode_header, sanitise it to make
> >> + sure it decodes correctly and contains no invalid characters.
> >> + Then encode the result with make_header()
> >> + """
> >> +
> >> + # We have some Py2/Py3 issues here.
> >> + #
> >> + # Firstly, the email parser (before we get here)
> >> + # Python 3: headers with weird chars are email.header.Header
> >> + # class, others as str
> >> + # Python 2: every header is an str
> >> + #
> >> + # Secondly, the behaviour of decode_header:
> >> + # Python 3: weird headers are labelled as unknown-8bit
> >> + # Python 2: weird headers are not labelled differently
> >> + #
> >> + # Lastly, aking matters worse, in Python2, unknown-8bit doesn't
> >> + # seem to be supported as an input to make_header, so not only do
> >> + # we have to detect dodgy headers, we have to fix them ourselves.
> >> + #
> >> + # We solve this by catching any Unicode errors, and then manually
> >> + # handling any interesting headers.
> >
> > I'm going to move all the above into the docstring, if that's OK by
> > you.
>
> I don't mind, but my understanding was that the docstring described the
> function, rather than the implementation details, hence why I had it in
> a comment originally.
You're right. Let's keep this as is.
> >> + # python2 - no support in make_header for unknown-8bit
> >> + # We should do unknown-8bit coding ourselves.
> >> + # For now, we're just going to replace any dubious
> >> + # chars with ?.
> >> + #
> >> + # TODO: replace it with a proper QP unknown-8bit codec.
> >
> > How could we resolve this TODO in the future?
> >
>
> I think we use email.charset. Possibly we can copy some of the Py3 code
> that implements unknown-8bit. It all looks quite finnicky. Also, any
> time we get a message with an invalid header and we're relying on the
> patchwork header display to figure it out is an edge case on top of
> another edge case, so I didn't think it worth much more time.
Yeah, we can solve this if it comes up. Just good to see that there is
a potential resolution if needed.
> >> + # on Py2, we want to do unicode(), on Py3, str().
> >> + # That gets us the decoded, un-wrapped header.
> >> + if six.PY2:
> >> + header_str = unicode(sane_header)
> >> + else:
> >> + header_str = str(sane_header)
> >
> > nit: this looks like something we could use the 'six.u()' function for?
> >
> > https://pythonhosted.org/six/#six.u
> >
> I thought this too. Then I looked at the code for django.utils.six:
> https://github.com/django/django/blob/master/django/utils/six.py#L646
>
>
> def u(s):
> return unicode(s.replace(r'\\', r'\\\\'), "unicode_escape")
>
> That won't work because we're passing in a header class to u(). The
> header class, as far as I know, doesn't provide .replace(), and even if
> it did it's not clear we want to call it.
Good spot. Thanks for checking this out.
> >> + strings = [('%s: %s' % (key, header.encode()))
> >
> > How come '.encode' is suitable here, yet we need to call 'unicode'/'str'
> > in the 'clean_header' function?
>
> Because the header class overrides .encode(), __unicode__ and __str__,
> and they do different things.
>
> For headers:
>
> - str (py3) and unicode (py2) provides a non-line-wrapped version of
> the header, in unicode - that is, not in quoted-printable or base64
> 7-bit form.
>
> - encode provides a line-wrapped, 7-bit encoded form of the header.
>
> For find_headers, we're creating the headers that are displayed when you
> click on the 'show' link on the patch display, under Details, and just
> beneath the message id and state. These are supposed to be fully encoded
> and line wrapped: they're supposed to look like mail headers.
>
> For clean_header, we're looking for the human readable form: decoded
> From quoted printable or base64, converted from whatever charset they
> were in, and not wrapped.
>
> Hence the code. Admittedly this is not super clear, but I blame the API
> of the email.Header module.
Ah, I understand you now. I'll stick some of those comments in there.
This looks good now so, hence:
Reviewed-by: Stephen Finucane <stephen at that.guru>
and applied.
More information about the Patchwork
mailing list