Reliably parsing mail on Py2 and Py3

Daniel Axtens dja at axtens.net
Thu Sep 22 12:32:49 AEST 2016


Hi all,

Parsing is hard.
So, let's see what happens if we try parsing mail.
All the code discussed is on GitHub [0], and links are provided
through-out.

Firstly, let's look at how we currently parse individual mails.

parsemail.py does the hard yards here. However, it doesn't start with
modern Django and python 3, so let's fix that first [1]. parsemail.py
can take input in either a file name or in stdin.

parsemail.sh then invokes parsemail.py. However, it's designed for
python2 and production settings. I'd like to:

 - be able to test with either python version using $PW_PYTHON.
 - Only set the settings module if it isn't already set the
   environment.  (In docker this is preset to the dev settings.)
 - pass arguments to parsemail.py, like --list-id

This is all done in [2].

Finally, parsemail-batch.sh can be used to invoke parsemail.sh in
batch mode. Importantly, it feeds input to parsemail.sh through stdin,
not filenames. It needs a small fixup to pass arguments through to
parsemail.sh [3].

OK, so now we're in the position to start testing stuff.

My test corpus is:
 - The email that Thomas found caused a bug.
 - 310 messages from the patchwork list
 
I've stuck this on github too, in a corpus directory [4].

So, let's see how this parses.

Firstly, let's try parsemail-batch.sh. This feeds mail in through
stdin. I reset the database each time.  So:

PW_PYTHON=python2 patchwork/bin/parsemail-batch.sh corpus/ --list-id=patchwork.ozlabs.org
PW_PYTHON=python3 patchwork/bin/parsemail-batch.sh corpus/ --list-id=patchwork.ozlabs.org

With both python2 and python2, we get only one failure:
mail-not-parsed.mbox
Traceback (most recent call last):
  File "/home/patchwork/patchwork/patchwork/bin/parsemail.py", line 115, in <module>
    sys.exit(main(sys.argv))
  File "/home/patchwork/patchwork/patchwork/bin/parsemail.py", line 103, in main
    result = parse_mail(mail, args['list_id'])
  File "/home/patchwork/patchwork/patchwork/parser.py", line 627, in parse_mail
    headers = find_headers(mail)
  File "/home/patchwork/patchwork/patchwork/parser.py", line 161, in find_headers
    for (k, v) in list(mail.items())])
  File "/home/patchwork/patchwork/patchwork/parser.py", line 161, in <listcomp>
    for (k, v) in list(mail.items())])
  File "/usr/lib/python3.5/email/header.py", line 217, in __init__
    self.append(s, charset, errors)
  File "/usr/lib/python3.5/email/header.py", line 295, in append
    s = s.decode(input_charset, errors)
AttributeError: 'Header' object has no attribute 'decode'

Let's fix that with [5], which is based on what Stephen and I posted
to the list, and many, many hours of [wat]. This stuff is _hard_.

So, now lets add a test to stop that happening again. That's [6]. (We
also add a related test for when a valid, but unescaped, UTF-8 char is
in a header.)  Much to our dismay, that test errors out on
Python3. Wat?

Traceback (most recent call last):
  File "/home/patchwork/patchwork/patchwork/tests/test_parser.py", line 541, in test_invalid_header_char
    '0012-invalid-header-char.mbox')
  File "/home/patchwork/patchwork/patchwork/tests/test_parser.py", line 88, in _find_content
    mail = read_mail(mbox_filename, project=self.project)
  File "/home/patchwork/patchwork/patchwork/tests/test_parser.py", line 54, in read_mail
    mail = message_from_file(open(file_path))
  File "/usr/lib/python3.5/email/__init__.py", line 54, in message_from_file
    return Parser(*args, **kws).parse(fp)
  File "/usr/lib/python3.5/email/parser.py", line 54, in parse
    data = fp.read(8192)
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9c in position 1019: ordinal not in range(128)

Here we're getting skewered by the way we're reading the file. So,
lets do things the way they're supposed to be done in Python3 and use
message_from_binary_file. This now works: [7].

Tox verifies this works with py27 as well.

So, now we can parse mail with parsemail-batch. Remember though, this
is not the only way we can ingest mail. parsemail.sh can take input
*either* on stdin, *or* as a file name argument. parsemail-batch
provides it through stdin. What happens if we try to parse our corpus
mail by mail using the file name argument?

for x in corpus/*; do
  echo $x
  PW_PYTHON=python2 patchwork/bin/parsemail.sh --list-id=patchwork.ozlabs.org $x
done

Now, we succeed in Python 2, but in Python 3, we get errors like this:

corpus/1470109248.8244_5.possimpible,U=6567:2,RS
Traceback (most recent call last):
  File "/home/patchwork/patchwork/patchwork/bin/parsemail.py", line 115, in <module>
    sys.exit(main(sys.argv))
  File "/home/patchwork/patchwork/patchwork/bin/parsemail.py", line 101, in main
    mail = message_from_file(args['infile'])
  File "/usr/lib/python3.5/email/__init__.py", line 54, in message_from_file
    return Parser(*args, **kws).parse(fp)
  File "/usr/lib/python3.5/email/parser.py", line 54, in parse
    data = fp.read(8192)
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2036: ordinal not in range(128)

So, why does it work on stdin, but not in a file open? The python
documentation suggests that stdin is automatically decoded: [8], and
this can be controlled by PYTHONIOENCODING environment variable:
[9]. The default encoding is based on the locale.

In the docker container, the IO encoding defaults to US-ASCII (8-bit
ascii rather than 7-bit). Sure enough, if we force it to 7-bit ASCII
with PYTHONIOENCODING=ascii:strict, things break again:

for x in corpus/*; do
    echo $x;
    PW_PYTHON=python3 PYTHONIOENCODING=ascii:strict patchwork/bin/parsemail.sh --list-id=patchwork.ozlabs.org < $x;
done

corpus/1470109248.8244_5.possimpible,U=6567:2,RS
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2036: ordinal not in range(128)

So, there are now 2 issues on Python 3:
 - file open doesn't work
 - stdin behaviour is dependent on your locale.

So, let's use the binary file mode of stdin, and binary files on open
in Py3. This is [10].

Yay, so now parsing of mail works, for the old-style parse setup, for
the samples we know about. And it only took us 8 patches!

Now we have Stephen's series to contend with. I've only been looking
at parsemail, so I've only pulled that in. After fixing up some merge
conflicts, it's [11].  This has of course undone [10] - the use of
message_from_binary_file, and obsoleted [1] - which just fixed an
error in the old parsemail.py.  This means we get some failures on
py3. They're fixed in [12] in much the same way.

So now we can use Stephen's new managment command to import mail in
Py2 and Py3 with both stdin and filenames, and it passes the corpus.

However, in so doing we've broken the tests in 2 inter-related
ways. Firstly, the stdin tests pass files in instead of file names. I
just broke that by changing the behaviour of the argument, so that's
not surprising. However, that revealed that the tests marked stdin
don't reopen stdin, so they don't actually test stdin behaviour. Fix
that up [13].

Finally, lets add tests for of a UTF-8 issue when invoked through the
management command: [14]. We reuse the UTF-8 email we created for
[6]. (Ideally we'd like a patch with UTF-8 in the body of the mail, but
I can't find one yet.) This tests that the managment command deals
correctly with files and stdin containing non-ASCII content. This also
gives us a great scaffold to test other interesting email that we come
across.

So now things work in the 2x2 matrix of [py2, py3] x [stdin,
filename]. It only took 12 patches! Yay!

I've cleaned up the patches into a new series: [15], which weighs in
at a much more manageable 6 patches. I'll submit that soon. Given that
it contains mbox files, bits of it will get lost on the list, so I'll
probably do a git pull request with a signed tag as well.

Regards,
Daniel

[0]: https://github.com/daxtens/patchwork/commits/parse-demo
[1]: https://github.com/daxtens/patchwork/commit/daa09b806427d7d207e3da503cc59ad8655041eb
[2]: https://github.com/daxtens/patchwork/commit/018d3f92c03ed54023a90622c1be3a1f52e39e2f
[3]: https://github.com/daxtens/patchwork/commit/8863a2d7ad9aa771df45650dc523aa5087c8fe56
[4]: https://github.com/daxtens/patchwork/commit/5e9271231d7b01cf0f9a97028a2630a3a148303e
[5]: https://github.com/daxtens/patchwork/commit/88af79565e9bc85f587c076ed6eddb692f48b02e
[6]: https://github.com/daxtens/patchwork/commit/3a3886f083398f904ae7bf318fa0c910c900bde5
[wat]: https://www.destroyallsoftware.com/talks/wat
[7]: https://github.com/daxtens/patchwork/commit/09d9c91e3acfb1151b89f6f380f30248e2adcb62
[8]: https://docs.python.org/3/library/sys.html#sys.stdin
[9]: https://docs.python.org/3/using/cmdline.html#envvar-PYTHONIOENCODING
[10]: https://github.com/daxtens/patchwork/commit/a791cf66ccad4017a8c4ba5cb10bea01faf40a5f
[11]: https://github.com/daxtens/patchwork/commit/b9e7dc75bb3008ed4eb8ef14271d5cd175f1b575
[12]: https://github.com/daxtens/patchwork/commit/3614100bd1b03f287262f386fe0d193b22f83381
[13]: https://github.com/daxtens/patchwork/commit/01a21af2ed500f6efc753ef347f0196a13882041
[14]: https://github.com/daxtens/patchwork/commit/7c465c49112a4627f19fd5a45615516520997686
[15]: https://github.com/daxtens/patchwork/commits/parse-clean
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 859 bytes
Desc: not available
URL: <http://lists.ozlabs.org/pipermail/patchwork/attachments/20160922/6d87d06b/attachment.sig>


More information about the Patchwork mailing list