[PATCH] Revert "parser: Ensure whitespace is stripped for long headers"
Daniel Axtens
dja at axtens.net
Wed May 15 02:23:51 AEST 2019
Stephen Finucane <stephen at that.guru> writes:
> On Tue, 2019-05-14 at 16:11 +1000, Daniel Axtens wrote:
>> This reverts commit 841f966b8d54b2f51ab1c498eed6e5391f2546a9.
>>
>> In July 2018, we received a report of OzLabs patchwork mangling
>> emails that have subjects containing words with internal commas,
>> like "Insert DT binding for foo,bar" (#197).
>>
>> Stephen took a look and came up with the comment this reverts. Quoting
>> the commit message:
>>
>> RFC2822 states that long headers can be wrapped using CRLF followed by
>> WSP [1]. For example:
>>
>> Subject: Foo bar,
>> baz
>>
>> Should be parsed as:
>>
>> Foo bar,baz
>>
>> As it turns out, this is not the case. Journey with me to
>> section 2.2.3 of RFC 2822:
>>
>> 2.2.3. Long Header Fields
>>
>> Each header field is logically a single line of characters comprising
>> the field name, the colon, and the field body. For convenience
>> however, and to deal with the 998/78 character limitations per line,
>> the field body portion of a header field can be split into a multiple
>> line representation; this is called "folding". The general rule is
>> that wherever this standard allows for folding white space (not
>> simply WSP characters), a CRLF may be inserted before any WSP. For
>> example, the header field:
>>
>> Subject: This is a test
>>
>> can be represented as:
>>
>> Subject: This
>> is a test
>>
>> So the issue with the example in the reverted commit is that
>> there is no folding white space in "bar,baz", so it's not valid
>> to split it.
>>
>> These are valid:
>> Subject: Foo bar,baz
>> Subject: Foo
>> bar,baz
>>
>> but splitting "bar,baz" into "bar,\n baz" is not valid.
>>
>> What then is correct unfolding behaviour? Quoting the RFC again:
>>
>> The process of moving from this folded multiple-line representation
>> of a header field to its single line representation is called
>> "unfolding". Unfolding is accomplished by simply removing any CRLF
>> that is immediately followed by WSP. Each header field should be
>> treated in its unfolded form for further syntactic and semantic
>> evaluation.
>>
>> In other words, the unfolding rule requires you to strip the CRLF,
>> but it does not permit you to strip the WSP. Indeed, if "bar,\n baz"
>> is received, the correct unfolding is "bar, baz".
>>
>> If you do strip the WSP, you end up mashing words together, such
>> as in https://patchwork.ozlabs.org/patch/1097852/
>>
>> So revert the commit, restoring original behaviour, but keep a
>> corrected version of the test.
>>
>> This presents a big question though: how did Rob's email up with
>> a mangled subject line?
>>
>> To answer this question, you end up having to learn about OzLabs
>> Patchwork and how it differs from Patchwork the project.
>>
>> OzLabs Patchwork (patchwork.ozlabs.org) is an installation of
>> Patchwork. Part of what makes it so useful for so many projects is
>> a little intervening layer that can massage some mail to make it
>> end up in the right project. Email that lands in the device tree
>> project is an example of email that goes through this process.
>> I only learned about this today and I haven't looked in any detail
>> at precisely what is done to the mail. The script is not part of the
>> Patchwork project.
>>
>> This intervening filter is a Python script that runs - and this
>> is an important detail - in Python 2.7.
>>
>> Ignoring all the details, the filter basically operates in a pipe
>> between the mail program and patchwork's parsemail, like
>>
>> (mail from system) | filter.py | parsemail
>>
>> At it's very simplest, filter.py acts as follows:
>>
>> import email
>> import sys
>> mail = email.parse_from_file(sys.stdin)
>> sys.stdout.write(mail.as_string())
>>
>> Fascinatingly, if you take Rob's email from #197 and put it through
>> this process, you can see that it is getting mangled:
>>
>> Before:
>> Subject: [PATCH v2 3/4] dt-bindings: sound: wm8994: document wlf,csnaddr-pd property
>>
>> After:
>> Subject: [PATCH v2 3/4] dt-bindings: sound: wm8994: document wlf,
>> csnaddr-pd property
>>
>> You can see that python27 has incorrectly wrapped the header, breaking
>> where there is not a foldable space. Python3 does not have this issue.
>>
>> To summarise:
>> - part of the magic of OzLabs PW is a filter to make sure mail gets
>> to the right place. This isn't part of the Patchwork project and
>> so is usually invisible to patchwork developers.
>>
>> - the filter is written in python27. The email module in py27 has
>> a bug that incorrectly breaks subjects around commas within words.
>>
>> - patchwork correctly unfolds those broken subjects with a space
>> after the comma.
>>
>> - thr extra space was interpreted as a bug in patchwork, leading to
>> a misinterpretation of the spec to strip out the whitespace that
>> was believed to be in error.
>>
>> - that broke other wrapped subjects.
>>
>> To solve this, revert the commit and I'll work with jk to get the
>> filter script into py3 compatibility. (Given that py27 sunsets in
>> ~7mo, trying to fix it is not worth it.)
>>
>> Closes: #197
>> CC: stable
>
> Thanks for the detailed overview. I dropped the changes to the release
> note and simply added a new one because this revert won't appear in
> 2.1.2, for example, so the release note for that shouldn't be changed.
> Otherwise all good though so I've applied this to both master and
> stable/2.1.
>
Thanks - release notes are still something of a mystery to me so thanks
for fixing that up.
I'll talk to the OzLabs crew about whether they want a 2.1.3 with this
patch or if they're happy to just carry it for a while.
Regards,
Daniel
> Stephen
>
>> ---
>> patchwork/parser.py | 1 -
>> patchwork/tests/test_parser.py | 2 +-
>> releasenotes/notes/issue-197-4f7594db1e4c9887.yaml | 9 +++++----
>> 3 files changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/patchwork/parser.py b/patchwork/parser.py
>> index 712780a498c4..91e9920c8782 100644
>> --- a/patchwork/parser.py
>> +++ b/patchwork/parser.py
>> @@ -47,7 +47,6 @@ class DuplicateMailError(Exception):
>>
>>
>> def normalise_space(value):
>> - value = ''.join(re.split(r'\n\s+', value))
>> whitespace_re = re.compile(r'\s+')
>> return whitespace_re.sub(' ', value).strip()
>>
>> diff --git a/patchwork/tests/test_parser.py b/patchwork/tests/test_parser.py
>> index ddbcf5b15a19..f18220298078 100644
>> --- a/patchwork/tests/test_parser.py
>> +++ b/patchwork/tests/test_parser.py
>> @@ -838,7 +838,7 @@ class SubjectTest(TestCase):
>> self.assertEqual(clean_subject("[PATCH] meep \n meep"),
>> ('meep meep', []))
>> self.assertEqual(clean_subject("[PATCH] meep,\n meep"),
>> - ('meep,meep', []))
>> + ('meep, meep', []))
>> self.assertEqual(clean_subject('[PATCH RFC] meep'),
>> ('[RFC] meep', ['RFC']))
>> self.assertEqual(clean_subject('[PATCH,RFC] meep'),
>> diff --git a/releasenotes/notes/issue-197-4f7594db1e4c9887.yaml b/releasenotes/notes/issue-197-4f7594db1e4c9887.yaml
>> index 2777fbc2f85b..41b86c064b8a 100644
>> --- a/releasenotes/notes/issue-197-4f7594db1e4c9887.yaml
>> +++ b/releasenotes/notes/issue-197-4f7594db1e4c9887.yaml
>> @@ -1,7 +1,8 @@
>> ---
>> fixes:
>> - |
>> - Long headers can be wrapped using CRLF followed by WSP (whitespace). This
>> - whitespace was not being stripped, resulting in errant whitespace being
>> - saved for the patch subject. This is resolved though existing patches and
>> - cover letters will need to be updated manually.
>> + Long headers can be wrapped using CRLF followed by WSP (whitespace). There
>> + was an incorrect fix that would lead to whitespace being stripped where it
>> + shouldn't be, resulting in words being stuck together (likethis). This is
>> + resolved, though existing patches and cover letters will need to be
>> + updated manually.
More information about the Patchwork
mailing list