[PATCH 1/3] parse(mail|archive): handle early fail within email module

Mon Jul 3 15:31:21 AEST 2017

On 01/07/17 14:28, Daniel Axtens wrote:
> Certain really messed up email messages can cause a failure within
> the email module (at least on py3). Catch this.
>
> Signed-off-by: Daniel Axtens <dja at axtens.net>
> ---
>  patchwork/management/commands/parsearchive.py | 15 +++++++++++++
>  patchwork/management/commands/parsemail.py    | 31 ++++++++++++++++-----------
>  2 files changed, 33 insertions(+), 13 deletions(-)
>
> diff --git a/patchwork/management/commands/parsearchive.py b/patchwork/management/commands/parsearchive.py
> index 4e102a988e76..a9abbbff0d0d 100644
> --- a/patchwork/management/commands/parsearchive.py
> +++ b/patchwork/management/commands/parsearchive.py
> @@ -77,6 +77,21 @@ class Command(BaseCommand):
>
>          count = len(mbox)
>
> +        # Iterate through the mbox. This will pick up exceptions that
> +        # are only thrown when a broken email is found part way
> +        # through. Without this block, we'd get the exception thrown
> +        # in enumerate(mbox) below, which is harder to catch.
> +        #
> +        # The alternative is converting the mbox to a list of
> +        # messages, but that requires holding the entire thing in
> +        # memory, which is wateful.
> +        try:
> +            for m in mbox:
> +                pass
> +        except AttributeError:
> +            logger.warning('Broken mbox/Maildir, aborting')
> +            return
> +

So after further discussion with Daniel and taking a look at the test 
case, we've figured out what this is, and why merely iterating through 
this triggers an AttributeError of all things.

This is a bug in the Python email module, triggered when a multipart 
message has a Content-Transfer-Encoding header with a value which can't 
be validly decoded.

A minimal case:

	From xyz
	Content-Type: multipart/mixed; boundary="1"
	Content-Transfer-Encoding: \200 [where \200 can be any byte that's not 
valid ASCII]

Save that to test.mbox:

	import mailbox
	mbox = mailbox.mbox('test.mbox')
	mbox[0]

and you'll hit this.

mailbox's __getitem__() eventually calls down into email.parser.parse() 
-> email.feedparser.close() -> email.feedparser._call_parse() -> 
email.feedparser._parsegen().

In there, we see:

         # Make sure a valid content type was specified per RFC 2045:6.4.
         if (self._cur.get('content-transfer-encoding', '8bit').lower()
                 not in ('7bit', '8bit', 'binary')):
             defect = errors.InvalidMultipartContentTransferEncodingDefect()
             self.policy.handle_defect(self._cur, defect)

So it looks like self._cur.get() is returning a Header rather than a str 
because it can't be validly decoded, but here we expect it to be a str 
and to have .lower(). It's not, so therefore AttributeError!

Intuitively the correct behaviour would be to handle this case the same 
as any other invalid encoding value and handle the 
InvalidMultipartContentTransferEncodingDefect.

So we should report this bug upstream, but we should also merge this fix 
to work around it.

dja - I'm happy to write this up and file it if you haven't already?

Andrew

-- 
Andrew Donnellan              OzLabs, ADL Canberra
andrew.donnellan at au1.ibm.com  IBM Australia Limited