[PATCH] Revert "parser: Ensure whitespace is stripped for long headers"

Wed May 15 02:23:51 AEST 2019

Stephen Finucane <stephen at that.guru> writes:

> On Tue, 2019-05-14 at 16:11 +1000, Daniel Axtens wrote:
>> This reverts commit 841f966b8d54b2f51ab1c498eed6e5391f2546a9.
>> 
>> In July 2018, we received a report of OzLabs patchwork mangling
>> emails that have subjects containing words with internal commas,
>> like "Insert DT binding for foo,bar" (#197).
>> 
>> Stephen took a look and came up with the comment this reverts. Quoting
>> the commit message:
>> 
>>     RFC2822 states that long headers can be wrapped using CRLF followed by
>>     WSP [1]. For example:
>> 
>>         Subject: Foo bar,
>>          baz
>> 
>>     Should be parsed as:
>> 
>>         Foo bar,baz
>> 
>> As it turns out, this is not the case. Journey with me to
>> section 2.2.3 of RFC 2822:
>> 
>> 2.2.3. Long Header Fields
>> 
>>    Each header field is logically a single line of characters comprising
>>    the field name, the colon, and the field body.  For convenience
>>    however, and to deal with the 998/78 character limitations per line,
>>    the field body portion of a header field can be split into a multiple
>>    line representation; this is called "folding".  The general rule is
>>    that wherever this standard allows for folding white space (not
>>    simply WSP characters), a CRLF may be inserted before any WSP.  For
>>    example, the header field:
>> 
>>            Subject: This is a test
>> 
>>    can be represented as:
>> 
>>            Subject: This
>>             is a test
>> 
>> So the issue with the example in the reverted commit is that
>> there is no folding white space in "bar,baz", so it's not valid
>> to split it.
>> 
>> These are valid:
>>         Subject: Foo bar,baz
>>         Subject: Foo
>> 	 bar,baz
>> 
>> but splitting "bar,baz" into "bar,\n baz" is not valid.
>> 
>> What then is correct unfolding behaviour? Quoting the RFC again:
>> 
>>    The process of moving from this folded multiple-line representation
>>    of a header field to its single line representation is called
>>    "unfolding". Unfolding is accomplished by simply removing any CRLF
>>    that is immediately followed by WSP.  Each header field should be
>>    treated in its unfolded form for further syntactic and semantic
>>    evaluation.
>> 
>> In other words, the unfolding rule requires you to strip the CRLF,
>> but it does not permit you to strip the WSP. Indeed, if "bar,\n baz"
>> is received, the correct unfolding is "bar, baz".
>> 
>> If you do strip the WSP, you end up mashing words together, such
>> as in https://patchwork.ozlabs.org/patch/1097852/
>> 
>> So revert the commit, restoring original behaviour, but keep a
>> corrected version of the test.
>> 
>> This presents a big question though: how did Rob's email up with
>> a mangled subject line?
>> 
>> To answer this question, you end up having to learn about OzLabs
>> Patchwork and how it differs from Patchwork the project.
>> 
>> OzLabs Patchwork (patchwork.ozlabs.org) is an installation of
>> Patchwork. Part of what makes it so useful for so many projects is
>> a little intervening layer that can massage some mail to make it
>> end up in the right project. Email that lands in the device tree
>> project is an example of email that goes through this process.
>> I only learned about this today and I haven't looked in any detail
>> at precisely what is done to the mail. The script is not part of the
>> Patchwork project.
>> 
>> This intervening filter is a Python script that runs - and this
>> is an important detail - in Python 2.7.
>> 
>> Ignoring all the details, the filter basically operates in a pipe
>> between the mail program and patchwork's parsemail, like
>> 
>>   (mail from system) | filter.py | parsemail
>> 
>> At it's very simplest, filter.py acts as follows:
>> 
>>  import email
>>  import sys
>>  mail = email.parse_from_file(sys.stdin)
>>  sys.stdout.write(mail.as_string())
>> 
>> Fascinatingly, if you take Rob's email from #197 and put it through
>> this process, you can see that it is getting mangled:
>> 
>> Before:
>> Subject: [PATCH v2 3/4] dt-bindings: sound: wm8994: document wlf,csnaddr-pd property
>> 
>> After:
>> Subject: [PATCH v2 3/4] dt-bindings: sound: wm8994: document wlf,
>>  csnaddr-pd property
>> 
>> You can see that python27 has incorrectly wrapped the header, breaking
>> where there is not a foldable space. Python3 does not have this issue.
>> 
>> To summarise:
>>  - part of the magic of OzLabs PW is a filter to make sure mail gets
>>    to the right place. This isn't part of the Patchwork project and
>>    so is usually invisible to patchwork developers.
>> 
>>  - the filter is written in python27. The email module in py27 has
>>    a bug that incorrectly breaks subjects around commas within words.
>> 
>>  - patchwork correctly unfolds those broken subjects with a space
>>    after the comma.
>> 
>>  - thr extra space was interpreted as a bug in patchwork, leading to
>>    a misinterpretation of the spec to strip out the whitespace that
>>    was believed to be in error.
>> 
>>  - that broke other wrapped subjects.
>> 
>> To solve this, revert the commit and I'll work with jk to get the
>> filter script into py3 compatibility. (Given that py27 sunsets in
>> ~7mo, trying to fix it is not worth it.)
>> 
>> Closes: #197
>> CC: stable
>
> Thanks for the detailed overview. I dropped the changes to the release
> note and simply added a new one because this revert won't appear in
> 2.1.2, for example, so the release note for that shouldn't be changed.
> Otherwise all good though so I've applied this to both master and
> stable/2.1.
>

Thanks - release notes are still something of a mystery to me so thanks
for fixing that up.

I'll talk to the OzLabs crew about whether they want a 2.1.3 with this
patch or if they're happy to just carry it for a while.

Regards,
Daniel

> Stephen
>
>> ---
>>  patchwork/parser.py                                | 1 -
>>  patchwork/tests/test_parser.py                     | 2 +-
>>  releasenotes/notes/issue-197-4f7594db1e4c9887.yaml | 9 +++++----
>>  3 files changed, 6 insertions(+), 6 deletions(-)
>> 
>> diff --git a/patchwork/parser.py b/patchwork/parser.py
>> index 712780a498c4..91e9920c8782 100644
>> --- a/patchwork/parser.py
>> +++ b/patchwork/parser.py
>> @@ -47,7 +47,6 @@ class DuplicateMailError(Exception):
>>  
>>  
>>  def normalise_space(value):
>> -    value = ''.join(re.split(r'\n\s+', value))
>>      whitespace_re = re.compile(r'\s+')
>>      return whitespace_re.sub(' ', value).strip()
>>  
>> diff --git a/patchwork/tests/test_parser.py b/patchwork/tests/test_parser.py
>> index ddbcf5b15a19..f18220298078 100644
>> --- a/patchwork/tests/test_parser.py
>> +++ b/patchwork/tests/test_parser.py
>> @@ -838,7 +838,7 @@ class SubjectTest(TestCase):
>>          self.assertEqual(clean_subject("[PATCH] meep \n meep"),
>>                           ('meep meep', []))
>>          self.assertEqual(clean_subject("[PATCH] meep,\n meep"),
>> -                         ('meep,meep', []))
>> +                         ('meep, meep', []))
>>          self.assertEqual(clean_subject('[PATCH RFC] meep'),
>>                           ('[RFC] meep', ['RFC']))
>>          self.assertEqual(clean_subject('[PATCH,RFC] meep'),
>> diff --git a/releasenotes/notes/issue-197-4f7594db1e4c9887.yaml b/releasenotes/notes/issue-197-4f7594db1e4c9887.yaml
>> index 2777fbc2f85b..41b86c064b8a 100644
>> --- a/releasenotes/notes/issue-197-4f7594db1e4c9887.yaml
>> +++ b/releasenotes/notes/issue-197-4f7594db1e4c9887.yaml
>> @@ -1,7 +1,8 @@
>>  ---
>>  fixes:
>>    - |
>> -    Long headers can be wrapped using CRLF followed by WSP (whitespace). This
>> -    whitespace was not being stripped, resulting in errant whitespace being
>> -    saved for the patch subject. This is resolved though existing patches and
>> -    cover letters will need to be updated manually.
>> +    Long headers can be wrapped using CRLF followed by WSP (whitespace). There
>> +    was an incorrect fix that would lead to whitespace being stripped where it
>> +    shouldn't be, resulting in words being stuck together (likethis). This is
>> +    resolved, though existing patches and cover letters will need to be
>> +    updated manually.