[PATCH v2 09/10] parsemail: Convert to a management command
dja at axtens.net
Sun Aug 28 17:06:05 AEST 2016
> +    def handle(self, *args, **options):
> +        # Attempt to parse the path if provided, and fallback to stdin if not
> +        if args:
> +            logger.info('Parsing mail loaded by filename')
> +            with open(args) as file_:
> +                mail = message_from_file(file_)
> +        else:
> +            logger.info('Parsing mail loaded from stdin')
> +            mail = message_from_file(sys.stdin)

So, I have found an interesting case here, not strictly related to this
patch but related to parsing messages from files.
I have been testing with some messages sent to this list earlier this
month. One of them includes the following byte sequence (hexdump):
000018f0 69 65 73 20 76 69 65 77 29 20 3f c2 a0 20 48 6f |ies view) ?.. Ho|
Note the sequence "c2 a0". Both of these bytes are above 127 and
therefore not part of 7-bit ASCII.
Apparently this is the UTF-8 encoding of a non-breaking space (U+00A0).
email.message_from_file does not handle this well: it boils down to
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 6395: ordinal not in range(128)
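
To illustrate the failure in isolation, here is a minimal sketch that decodes just the bytes from the hexdump line above. The variable names are mine, and the sample is only that one fragment rather than the whole message, so the reported position differs from the 6395 in the real traceback.

```python
# Bytes copied from the hexdump fragment: "ies view) ?" + c2 a0 + " Ho"
sample = b"ies view) ?\xc2\xa0 Ho"

# Decoding as UTF-8 succeeds and yields a single NO-BREAK SPACE (U+00A0)
# where the two bytes c2 a0 were.
as_utf8 = sample.decode("utf-8")
nbsp_index = as_utf8.index(u"\u00a0")

# Decoding as ASCII fails on the first byte above 127, which is 0xc2.
try:
    sample.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

print(repr(as_utf8))
print(ascii_error)
```

This is the same class of error as the one quoted above: the ASCII codec stops at the first byte outside the 7-bit range.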
I imagine this hasn't hit us in production because most (all?)
production users use Python2, which doesn't have the bytes/string
distinction that Python3 has.
Anyway, the only way I've found to work around this is to do something
like:

    with open(args, 'rb') as file_:
        decoded_mail = file_.read().decode('utf-8')
    mail = email.message_from_string(decoded_mail)

This is super ugly, but works in Py3. Ironically it doesn't work in Py2,
but it's a start. Could you include something like this in this patch
set? I think parsearchive will require something similar too.
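
One possible variation on the workaround above, as a rough sketch rather than a tested patch: read the raw bytes and hand them to the email package without decoding them first. On Python 3 that means email.message_from_bytes (available since 3.2); on Python 2 a str is already a byte string, so email.message_from_string can take it directly. The helper name load_mail and its path argument are hypothetical and not part of the patch.

```python
import email
import sys


def load_mail(path=None):
    """Hypothetical helper: parse a mail from a file path, or stdin if none.

    The input is always read as bytes so that non-ASCII content such as
    the c2 a0 sequence never passes through the ASCII codec.
    """
    if path:
        with open(path, 'rb') as file_:
            raw = file_.read()
    else:
        # Python 3 exposes the underlying byte stream as sys.stdin.buffer;
        # on Python 2 sys.stdin already yields byte strings.
        raw = getattr(sys.stdin, 'buffer', sys.stdin).read()

    if sys.version_info[0] >= 3:
        return email.message_from_bytes(raw)
    # Python 2: str is bytes, so no decoding step is needed.
    return email.message_from_string(raw)
```

This avoids assuming that every message is valid UTF-8, which the .decode('utf-8') approach does. I have not checked how the rest of the parser copes with the resulting message objects, so treat it as a starting point only.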
I'm going to start collecting these "interesting" emails to make a test suite.