[PATCH v2 4/7] parser: parse headers containing invalid characters or codings
dja at axtens.net
Thu Sep 29 09:01:24 AEST 2016
>> +def sanitise_header(header_contents, header_name=None):
>> + """Given a header with header_contents, optionally labelled
>> + header_name, decode it with decode_header, sanitise it to make
>> + sure it decodes correctly and contains no invalid characters.
>> + Then encode the result with make_header()
>> + """
>> + # We have some Py2/Py3 issues here.
>> + #
>> + # Firstly, the email parser (before we get here)
>> + # Python 3: headers with weird chars are email.header.Header
>> + # class, others as str
>> + # Python 2: every header is an str
>> + #
>> + # Secondly, the behaviour of decode_header:
>> + # Python 3: weird headers are labelled as unknown-8bit
>> + # Python 2: weird headers are not labelled differently
>> + #
>> + # Lastly, making matters worse, in Python2, unknown-8bit doesn't
>> + # seem to be supported as an input to make_header, so not only do
>> + # we have to detect dodgy headers, we have to fix them ourselves.
>> + #
>> + # We solve this by catching any Unicode errors, and then manually
>> + # handling any interesting headers.
> I'm going to move all the above into the docstring, if that's OK by you.
I don't mind, but my understanding was that the docstring described the
function, rather than the implementation details, hence why I had it in
a comment originally.
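
For what it's worth, the Python 3 half of that comment can be reproduced with a short sketch like the one below. It assumes Python 3 and the default (compat32) policy; the sample message and the variable names are made up for illustration and are not part of the patch.

```python
import email
from email.header import Header, decode_header

# Hypothetical message: the Subject carries a raw Latin-1 byte (0xE9)
# with no RFC 2047 encoding; the From header is plain ASCII.
raw = b'Subject: caf\xe9\nFrom: a@example.com\n\nbody\n'
msg = email.message_from_bytes(raw)

weird = msg['Subject']   # comes back as email.header.Header on Python 3
plain = msg['From']      # comes back as an ordinary str

# decode_header() labels the undecodable part as unknown-8bit.
parts = decode_header(weird)
```

On Python 2 both headers would come back as plain str, which is why the function can't rely on the type alone to spot a dodgy header.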
>> + # python2 - no support in make_header for unknown-8bit
>> + # We should do unknown-8bit coding ourselves.
>> + # For now, we're just going to replace any dubious
>> + # chars with ?.
>> + #
>> + # TODO: replace it with a proper QP unknown-8bit codec.
> How could we resolve this TODO in the future?
I think we would use email.charset. Possibly we can copy some of the Py3
code that implements unknown-8bit. It all looks quite finicky. Also, a
message with an invalid header, where we're relying on the patchwork
header display to figure it out, is an edge case on top of another edge
case, so I didn't think it worth much more time.
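
The interim approach, replacing anything dubious with ?, amounts to a round trip through ASCII with errors='replace'. A minimal sketch, written against Python 3 bytes; replace_dubious is a hypothetical name used here for illustration, not a function from the patch:

```python
def replace_dubious(raw):
    """Return raw with every non-ASCII byte replaced by b'?'."""
    # Decoding turns each bad byte into U+FFFD; encoding back to ASCII
    # then turns each U+FFFD into a literal question mark.
    text = raw.decode('ascii', errors='replace')
    return text.encode('ascii', errors='replace')


cleaned = replace_dubious(b'caf\xc3\xa9')
```

This is lossy: the original bytes cannot be recovered, which is what a proper unknown-8bit codec would avoid.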
>> + # on Py2, we want to do unicode(), on Py3, str().
>> + # That gets us the decoded, un-wrapped header.
>> + if six.PY2:
>> +     header_str = unicode(sane_header)
>> + else:
>> +     header_str = str(sane_header)
> nit: this looks like something we could use the 'six.u()' function for?
I thought this too. Then I looked at the code for u() in django.utils.six:
    return unicode(s.replace(r'\\', r'\\\\'), "unicode_escape")
That won't work because we're passing a Header object to u(). The
Header class, as far as I know, doesn't provide .replace(), and even if
it did it's not clear we want to call it.
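
If the if/else ever needs tidying, one option is to pick the text type once and call it, which is roughly what six.text_type gives you. A sketch under that assumption, without the six dependency; text_type here is a local, hypothetical alias and the sample header is made up:

```python
import sys
from email.header import Header

# Hypothetical alias, equivalent in spirit to six.text_type.
if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # noqa: F821 (only evaluated on Python 2)

sane_header = Header(u'Caf\u00e9', charset='utf-8')

# Calls __str__ on Python 3 and __unicode__ on Python 2.
header_str = text_type(sane_header)
```

Unlike u(), this never touches .replace(), so it works on a Header object.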
>> + strings = [('%s: %s' % (key, header.encode()))
> How come '.encode' is suitable here, yet we need to call 'unicode'/'str'
> in the 'clean_header' function?
Because the header class overrides .encode(), __unicode__ and __str__,
and they do different things.
- str (py3) and unicode (py2) provide a non-line-wrapped version of
  the header, in unicode - that is, not in quoted-printable or base64
- encode provides a line-wrapped, 7-bit encoded form of the header.
For find_headers, we're creating the headers that are displayed when you
click on the 'show' link on the patch display, under Details, and just
beneath the message id and state. These are supposed to be fully encoded
and line wrapped: they're supposed to look like mail headers.
For clean_header, we're looking for the human readable form: decoded
from quoted-printable or base64, converted from whatever charset they
were in, and not wrapped.
Hence the code. Admittedly this is not super clear, but I blame the API
of the email.Header module.
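
To make the difference concrete, here is a small sketch on Python 3. The subject text is made up, and on Python 2 the str() call would be unicode() instead:

```python
from email.header import Header

# Hypothetical long subject with a non-ASCII character in every word.
header = Header(u'Caf\u00e9 ' * 30, charset='utf-8', header_name='Subject')

# Human-readable form: decoded and not wrapped (what clean_header wants).
readable = str(header)

# Wire form: 7-bit encoded words, folded across lines (what
# find_headers wants, so the output looks like real mail headers).
wire = header.encode()
```

The first form is what you want to show a person or store as text; the second is what belongs in something that is meant to look like a raw header block.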