[PATCH] parser: leniently parse headers as UTF-8

Wed Sep 21 22:06:45 AEST 2016

Hi,

Stephen's suggested patch is a bit better here, so drop this for
now. v2, tests, etc. to come.

Regards,
Daniel

> If there is a non-ASCII character in a header, parsing fails,
> even on Py27.
>
> Try to decode headers as UTF-8, but if that fails, replace the
> offending bytes with a character marking that decoding failed.
> See:
> https://docs.python.org/3/howto/unicode.html#python-s-unicode-support
>
> This is handy for mails with malformed headers containing weird
> bytes.
>
> Reported-by: Thomas Monjalon <thomas.monjalon at 6wind.com>
> Signed-off-by: Daniel Axtens <dja at axtens.net>
>
> ---
>
> Many thanks to Thomas for his help debugging this.
>
> Happy to bikeshed whether we want 'replace' or perhaps
> 'backslashreplace'. Not keen on 'ignore'; it has an interesting
> security history - but willing to entertain convincing arguments.
>
> This should probably go to a stable branch too. We'll need to start
> some discussion about how to handle bug fixes for people not running
> git mainline (like ozlabs.org and kernel.org).
>
> Tests to prevent this from recurring are to come, as are Python 3
> patches.
> ---
>  patchwork/parser.py | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/patchwork/parser.py b/patchwork/parser.py
> index 1805df8cda7f..d3f55634f530 100644
> --- a/patchwork/parser.py
> +++ b/patchwork/parser.py
> @@ -157,6 +157,7 @@ def find_date(mail):
>  def find_headers(mail):
>      return reduce(operator.__concat__,
>                    ['%s: %s\n' % (k, Header(v, header_name=k,
> +                                           charset='utf-8', errors='replace',
>                                             continuation_ws='\t').encode())
>                     for (k, v) in list(mail.items())])
>  
> -- 
> 2.7.4
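
For reference, here is a minimal sketch of how the error handlers
discussed in the quoted patch ('replace', 'backslashreplace', 'ignore')
differ when decoding bytes that are not valid UTF-8. The sample header
value and the helper name decode_header_bytes are made up for
illustration. It uses plain bytes.decode() rather than the Header()
call from the patch, and assumes Python 3.5 or later, where
'backslashreplace' is also available for decoding.

```python
# Hypothetical header value: Latin-1 encoded bytes, which are not valid UTF-8.
raw = b'Caf\xe9 <cafe@example.com>'


def decode_header_bytes(data, errors):
    """Decode bytes as UTF-8 using the given error handler (illustrative helper)."""
    return data.decode('utf-8', errors=errors)


# 'replace' substitutes U+FFFD (the replacement character) for the bad byte.
replaced = decode_header_bytes(raw, 'replace')

# 'backslashreplace' keeps the bad byte visible as an escape sequence.
escaped = decode_header_bytes(raw, 'backslashreplace')

# 'ignore' silently drops the bad byte, so the damage is no longer visible.
ignored = decode_header_bytes(raw, 'ignore')

# The default 'strict' handler raises instead, which is the failure mode
# the patch is trying to avoid.
try:
    decode_header_bytes(raw, 'strict')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

With 'replace' and 'backslashreplace' a reader can still see that
something in the header was malformed; with 'ignore' the bad byte
disappears without a trace.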