[PATCH] parser: leniently parse headers as UTF-8

Daniel Axtens dja at axtens.net
Tue Sep 20 01:08:32 AEST 2016


If a header contains a non-ASCII character, parsing fails,
even on Python 2.7.

Try to decode headers as UTF-8, but if that fails, replace the
offending bytes with a character marking that decoding failed.
See:
https://docs.python.org/3/howto/unicode.html#python-s-unicode-support

This is handy for mails with malformed headers containing weird
bytes.

Reported-by: Thomas Monjalon <thomas.monjalon at 6wind.com>
Signed-off-by: Daniel Axtens <dja at axtens.net>

---

Many thanks to Thomas for his help debugging this.

Happy to bikeshed whether we want 'replace' or perhaps
'backslashreplace'. Not keen on 'ignore'; it has an interesting
security history - but willing to entertain convincing arguments.
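
To make the trade-off concrete, here is a rough sketch of how the
candidate error handlers treat one bad byte. The sample value is
made up rather than taken from a real mail, and the snippet assumes
Python 3.5 or later, since 'backslashreplace' cannot be used for
decoding before that.

```python
# Illustrative only: `raw` is an invented header value.
raw = b'caf\xe9 patch'  # 0xe9 is Latin-1 e-acute; invalid as UTF-8 here


def try_decode(data, errors):
    """Decode bytes as UTF-8 using the named error handler."""
    return data.decode('utf-8', errors)


replaced = try_decode(raw, 'replace')          # bad byte becomes U+FFFD
ignored = try_decode(raw, 'ignore')            # bad byte silently vanishes
escaped = try_decode(raw, 'backslashreplace')  # bad byte kept as \xe9 text

# The default handler, 'strict', raises instead of returning anything.
try:
    try_decode(raw, 'strict')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

With 'ignore' the result gives no hint that anything was dropped,
which is why I would rather not use it.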

This should probably go to a stable branch too. We'll need to start
some discussion about how to handle bug fixes for people not running
git mainline (like ozlabs.org and kernel.org).

Tests to prevent this from recurring will follow, as will
Python 3 patches.
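
For reference, a minimal standalone sketch of what the extra
arguments to Header do. It assumes Python 3 and passes a made-up
bytes value directly, so it is not how parser.py is actually fed
and only approximates the Python 2.7 behaviour this patch targets.

```python
from email.header import Header, decode_header, make_header

# Hypothetical malformed value: 0xe9 alone is not valid UTF-8.
bad_value = b'caf\xe9'

# Same keyword arguments as in find_headers(); errors='replace' lets
# the constructor decode the bytes instead of raising.
header = Header(bad_value, header_name='Subject',
                charset='utf-8', errors='replace',
                continuation_ws='\t')
encoded = header.encode()  # RFC 2047 encoded-word, pure ASCII

# Round-trip to see what text the encoded header now carries.
decoded = str(make_header(decode_header(encoded)))
```

After the round trip, `decoded` ends in U+FFFD where the bad byte
was, so the header survives and the damage stays visible.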
---
 patchwork/parser.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/patchwork/parser.py b/patchwork/parser.py
index 1805df8cda7f..d3f55634f530 100644
--- a/patchwork/parser.py
+++ b/patchwork/parser.py
@@ -157,6 +157,7 @@ def find_date(mail):
 def find_headers(mail):
     return reduce(operator.__concat__,
                   ['%s: %s\n' % (k, Header(v, header_name=k,
+                                           charset='utf-8', errors='replace',
                                            continuation_ws='\t').encode())
                    for (k, v) in list(mail.items())])
 
-- 
2.7.4


