[PATCH] Fallback to common charsets when charset is None or x-unknown

Siddhesh Poyarekar siddhesh at redhat.com
Thu Jun 12 05:34:46 EST 2014


Resending after subscribing to the mailing list. The patch is slightly
modified from my first submission, which may be held in moderation or
may have been lost:

On Wed, Jun 11, 2014 at 04:09:16PM +0530, Siddhesh Poyarekar wrote:
> Hi,
> 
> We recently hit a case in our glibc patchwork instance on
> sourceware where a patch was dropped because its charset was declared
> as x-unknown.  I used the following patch to fix this in our instance.
> It falls back on a set of encodings (instead of just utf-8) when the
> charset is missing or is set to x-unknown.
> 
> I hope this is useful.  I'd love to know if you all think there is a
> better way to fix this so that I can implement that in our instance
> instead of my hack.
> 
> Cheers,
> Siddhesh
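
The failure described above can be reproduced in isolation. The snippet
below is only a sketch, written for Python 3 rather than the Python 2 that
parsemail.py targets, and the sample message is made up for illustration.
It shows that the declared charset is handed back verbatim by the email
package and that no codec exists under that name, which is presumably why
the original decode step failed.

```python
from email import message_from_bytes

# Hypothetical message whose Content-Type declares charset=x-unknown.
raw = (b"Content-Type: text/plain; charset=x-unknown\n"
       b"Content-Transfer-Encoding: 8bit\n"
       b"\n"
       b"caf\xc3\xa9\n")

msg = message_from_bytes(raw)
charset = msg.get_content_charset()      # the declared name, lowercased
payload = msg.get_payload(decode=True)   # raw bytes of the body

try:
    payload.decode(charset)
    error = None
except LookupError as exc:
    # Python has no codec registered under the name "x-unknown".
    error = exc
```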

--- a/apps/patchwork/bin/parsemail.py	2014-06-11 15:53:12.685666812 +0530
+++ b/apps/patchwork/bin/parsemail.py	2014-06-11 15:53:03.991667186 +0530
@@ -147,6 +147,13 @@
         return match.group(1)
     return None
 
+def try_decode(payload, charset):
+    try:
+        payload = unicode(payload, charset)
+    except (UnicodeDecodeError, LookupError):
+        return None
+    return payload
+
 def find_content(project, mail):
     patchbuf = None
     commentbuf = ''
@@ -157,15 +164,27 @@
             continue
 
         payload = part.get_payload(decode=True)
-        charset = part.get_content_charset()
         subtype = part.get_content_subtype()
 
-        # if we don't have a charset, assume utf-8
-        if charset is None:
-            charset = 'utf-8'
-
         if not isinstance(payload, unicode):
-            payload = unicode(payload, charset)
+            charset = part.get_content_charset()
+
+            # If there is no charset or if it is unknown, then try some common
+            # charsets before we fail.
+            if charset is None or charset == 'x-unknown':
+                try_charsets = ['utf-8', 'windows-1252', 'ascii', 'iso-8859-1']
+            else:
+                try_charsets = [charset]
+
+            for cset in try_charsets:
+                decoded_payload = try_decode(payload, cset)
+                if decoded_payload is not None:
+                    break
+            payload = decoded_payload
+
+            # Could not find a valid decoded payload.  Fail.
+            if payload is None:
+                return (None, None)
 
         if subtype in ['x-patch', 'x-diff']:
             patchbuf = payload
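
For readers who want to try the fallback logic outside of patchwork, here
is a minimal standalone sketch. It is a Python 3 adaptation, not the code
from the patch: the name decode_with_fallback and the FALLBACK_CHARSETS
constant are hypothetical, and the sketch also treats an unrecognized
codec name (LookupError) as a failed attempt.

```python
# Candidate encodings, in the same order as the list used in the patch.
FALLBACK_CHARSETS = ['utf-8', 'windows-1252', 'ascii', 'iso-8859-1']


def try_decode(payload, charset):
    """Return payload decoded with charset, or None if decoding fails."""
    try:
        return payload.decode(charset)
    except (UnicodeDecodeError, LookupError):
        return None


def decode_with_fallback(payload, charset):
    """Hypothetical helper: return (text, charset_used) or (None, None)."""
    if charset is None or charset == 'x-unknown':
        candidates = FALLBACK_CHARSETS
    else:
        # A declared charset is trusted; no fallback is attempted.
        candidates = [charset]

    for cset in candidates:
        decoded = try_decode(payload, cset)
        if decoded is not None:
            return decoded, cset
    return None, None
```

Because iso-8859-1 maps every byte value to a character, the fallback list
always yields some result when the charset is missing or x-unknown. A None
result can therefore only occur when an explicitly declared charset fails
to decode the payload.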