[PATCH] pwclient: fix handling of UTF-8 char in submitter name

Mon Dec 13 23:24:47 EST 2010

On Mon, Dec 13, 2010 at 03:17:24PM +0800, Jeremy Kerr wrote:
> > > The reason that I don't do this currently is that patchwork would now be
> > > altering your patches to something that the author didn't write. If you
> > > were to apply the resulting patch, you would be introducing the U+FFFD
> > > character to your source tree.
> > > 
> > > However, dropping patches isn't a great solution either, so other
> > > alternatives welcome :)
> > 
> > Would it be possible to handle the error at decode with "try"? If so, maybe
> > you could add some logic there to try to decode first with the email
> > charset. Then, try utf-8. If both fail, try to decode with some other
> > encodings, like iso8859-11. This will likely catch 99% of the issues. If
> > everything fails, it is preferable to use the replacement character than to
> > lose the patch.
> > 
> > I would also add a meta-tag to indicate the cases where patchwork is
> > guessing a type (or using a replacement character). This way, the
> > maintainer may manually take care of the fixes.
> 
> That sounds pretty reasonable. For cases like these, I'd like to add 
> 'warnings' to the patch; either a 'had to guess the charset' or 'invalid 
> encoding', depending on what we had to do to get a successful parse. The
> warnings would then appear in the web UI, or on stderr when running pwclient.
> 
> *adds to the TODO list*

Why not just handle and store the patch as an array of bytes (Python
'str' type) instead of a unicode string?

The restriction that every patch should be valid unicode makes it
impossible to patch existing source files that already have non-utf8
data inside them (I suppose this includes source trees where files are
encoded as iso-8859-1, as the unicode diff won't be encoded back to the
original encoding when exporting the patches from Patchwork, will it?).
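
To illustrate the concern, here is a minimal sketch (not Patchwork code; the
sample diff bytes are made up, and it uses Python 3 syntax, where the byte
type is 'bytes' rather than 'str'). It shows that a diff containing an
iso-8859-1 byte does not survive a decode-with-replacement round trip, while
the same data kept as bytes is unchanged:

```python
# A made-up diff fragment containing a Latin-1 encoded "e acute" (byte 0xE9),
# which is not valid UTF-8 on its own.
raw = b"-caf\xe9\n+cafe\n"

# Decoding as UTF-8 with replacement substitutes U+FFFD for the bad byte.
text = raw.decode("utf-8", errors="replace")

# Encoding back to UTF-8 does not restore the original byte: U+FFFD is
# written out as EF BF BD, so the patch no longer matches the source file.
reencoded = text.encode("utf-8")
lossy = reencoded != raw

# Keeping the patch as bytes end to end leaves it untouched.
stored = bytes(raw)
lossless = stored == raw
```

A patch that went through the first path would no longer apply to a file
that really contains the 0xE9 byte.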

This would require changing the database model and xmlrpc API to use
binary data (I hope Django supports it) instead of a unicode string, but
it sounds better than piling up unicode encoding/decoding hacks.
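
For comparison, the decode-with-fallback approach discussed in the quoted
text might look roughly like the sketch below. This is only an illustration
under assumptions: decode_with_fallback() is a hypothetical helper, not an
existing Patchwork function, and the two warning strings are the ones
suggested in the quoted message. More candidate charsets could be appended
to the list before the replacement step.

```python
def decode_with_fallback(raw, declared_charset=None):
    """Hypothetical helper: return (text, warning) for a byte string.

    warning is None when the declared charset decoded cleanly.
    """
    candidates = []
    if declared_charset:
        candidates.append(declared_charset)
    candidates.append("utf-8")

    for index, charset in enumerate(candidates):
        try:
            text = raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # Bad bytes for this charset, or a charset name Python
            # does not know: move on to the next candidate.
            continue
        used_declared = bool(declared_charset) and index == 0
        return text, None if used_declared else "had to guess the charset"

    # Nothing decoded cleanly: fall back to U+FFFD replacement.
    return raw.decode("utf-8", errors="replace"), "invalid encoding"
```

Even in this small form, the last branch still alters the patch content,
which is the case that storing raw bytes would avoid.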

-- 
Eduardo