[PATCH] pwclient: fix handling of UTF-8 char in submitter name

Tue Dec 14 01:34:26 EST 2010

On Mon, Dec 13, 2010 at 09:55:18PM +0800, Jeremy Kerr wrote:
> Hi Eduardo,
> 
> > Why not just handle and store the patch as an array of bytes (Python
> > 'str' type) instead of a unicode string?
> 
> Basically, because we need to process the patch itself; either extracting it 
> from the email message, or for finding the hash. Both of these require looking 
> into the content of the patch, which means we need to be able to decode it.

I don't get it. You can process and look into a byte array as easily as
you can process a unicode string.
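
As a rough illustration, here is a minimal sketch of hashing a patch without ever decoding it. This is not Patchwork's actual hashing code: the function name `hash_patch_bytes`, the line-selection rule, and the sample diff are all made up for the example. It is written for Python 3, where the byte type is `bytes` (the Python 2 'str' mentioned above).

```python
import hashlib


def hash_patch_bytes(patch: bytes) -> str:
    """Hash the diff lines of a patch, working purely on bytes.

    Hypothetical helper for illustration; the selection rule below is an
    assumption, not the rule Patchwork uses.
    """
    sha = hashlib.sha1()
    for line in patch.splitlines(keepends=True):
        # Keep file headers, hunk headers, and added/removed/context lines.
        if line.startswith((b"--- ", b"+++ ", b"@@", b"+", b"-", b" ")):
            sha.update(line)
    return sha.hexdigest()


# Sample diff whose content is ISO-8859-1 (0xE9), i.e. not valid UTF-8.
patch = (
    b"--- a/menu.txt\n"
    b"+++ b/menu.txt\n"
    b"@@ -1 +1 @@\n"
    b"-caf\xe9\n"
    b"+caf\xe9 au lait\n"
)
digest = hash_patch_bytes(patch)
```

Nothing here needs to know the encoding: matching line prefixes and feeding a hash both work on raw bytes.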

patch(1) operates at the byte level; it doesn't care about unicode or
character encoding. It just gets a description of byte-level changes to
source files. So we don't need to pretend that every diff is going to be
valid unicode.
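
A small sketch of the problem, assuming a diff taken from an ISO-8859-1 source file (the sample bytes are invented for the example, and this is Python 3 syntax): strict UTF-8 decoding fails outright, and lossy decoding changes the bytes, so the result no longer describes the original file.

```python
# Diff lines containing the ISO-8859-1 byte 0xE9, which is not valid UTF-8 here.
original = b"-caf\xe9\n+caf\xe9 au lait\n"

# Strict decoding rejects the data.
try:
    original.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# Lossy decoding substitutes U+FFFD, so re-encoding yields different bytes
# and the diff no longer matches the file it was made against.
lossy = original.decode("utf-8", errors="replace").encode("utf-8")
bytes_changed = lossy != original
```

Kept as bytes, the same diff passes through untouched.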

I understand it is hard to change this on Patchwork today, though. It
would affect the database models and the xmlrpc interface.

> 
> > The restriction that every patch should be valid unicode makes it
> > impossible to patch existing source files that already have non-utf8
> > data inside them (I suppose this includes source trees where files are
> > encoded as iso-8859-1, as the unicode diff won't be encoded back to the
> > original encoding when exporting the patches from Patchwork, will it?).
> > 
> > This would require changing the database model and xmlrpc API to use
> > binary data (I hope Django supports it)
> 
> no, django doesn't support it out of the box, I believe this is a django 
> design decision.

Ouch. I understand that discouraging storing binary data is a good
thing, but I didn't expect Django to simply not allow it.

-- 
Eduardo