[PATCH] parser: remove in-reply-to/references comments

Emily Shaffer emilyshaffer at google.com
Fri Jun 25 04:52:45 AEST 2021


On Wed, Jun 23, 2021 at 12:51:46PM -0400, Raxel Gutierrez wrote:
> 
> When contributors using Gnus reply to a patch, the In-Reply-To &
> References header fields include comments in the form described in
> remove_rfc2822_comments spec. These comments can lead to threading
> issues where users' replies don't get associated to the patch they
> reply to because the added comment will be part of the message-id when
> parsed. Following RFC2822, a comment is not supposed to affect threading
> in any way, so we remove it. To use regex, we assume that comments are
> not nested; further changes to more thoroughly match the RFC2822 grammar
> remains for anyone interested.

Hm, does the nesting level of the comments really matter? Or is the
issue that they may be multiline?

That is - it's pretty trivial to write a regex to match
(foo(bar)baz((quot)meh)), as long as you don't actually care about the
semantics of the nesting.

> 
> Signed-off-by: Raxel Gutierrez <raxel at google.com>
> Closes: #399
> ---
>  patchwork/parser.py                           | 25 +++++++++++++++++--
>  .../notes/issue-399-584c5be5b71dcf63.yaml     |  7 ++++++
>  2 files changed, 30 insertions(+), 2 deletions(-)
>  create mode 100644 releasenotes/notes/issue-399-584c5be5b71dcf63.yaml
> 
> diff --git a/patchwork/parser.py b/patchwork/parser.py
> index 61a8124..683ff55 100644
> --- a/patchwork/parser.py
> +++ b/patchwork/parser.py
> @@ -70,6 +70,27 @@ def normalise_space(value):
>      return whitespace_re.sub(' ', value).strip()
> 
> 
> +def remove_rfc2822_comments(header_contents):
> +    """Removes RFC2822 comments from header fields.
> +
> +    Gnus create reply emails with commments like In-Reply-To/References:
> +    <msg-id> (User's message of Sun, 01 Jan 2012 12:34:56 +0700) [comment].
> +    Patchwork parses the values of the In-Reply-To & References header fields
> +    with the comment included as part of their value. A side effect of the
> +    comment not being removed is that message-ids are mismatched. These
> +    comments do not provide useful information for processing patches
> +    because they are ignored for threading and not rendered by mail readers.
> +    """
> +
> +    # Captures comments in header fields.

Firstly, I'd like to point out for other reviewers that Raxel commented
the expression this way because I told him to - if you hate it, blame
me, not him ;)

> +    comment_pattern = re.compile(r"""
> +                                \(      # The opening parenthesis of comment
> +                                [^()]*  # The contents of the comment
I *think* this is the bit that's making it not support nesting.
"Match anything besides another open- or close-paren".

https://docs.python.org/3/library/re.html tells me that Python treats
'*' as greedy by default, so wouldn't "\(.*\)" handle nested comments?
Or is there an issue that you can have more that one, e.g.

  In-Reply-To: (danica's mail) abcd1-40-8d at mail.google.com (from gnus)

in which case greedy-matching would also obliterate the actual
message-id?

This actually brings to mind that I'd like to see an example of one such
problematic line in the commit message, if you've got one handy.

> +                                \)      # The closing parenthesis of comment
> +                                """, re.X)
> +    return re.sub(comment_pattern, '', header_contents)
> +
> +
>  def sanitise_header(header_contents, header_name=None):
>      """Clean and individual mail header.
> 
> @@ -483,13 +504,13 @@ def find_references(mail):
> 
>      if 'In-Reply-To' in mail:
>          for in_reply_to in mail.get_all('In-Reply-To'):
> -            r = clean_header(in_reply_to)
> +            r = remove_rfc2822_comments(clean_header(in_reply_to))
>              if r:
>                  refs.append(r)
> 
>      if 'References' in mail:
>          for references_header in mail.get_all('References'):
> -            h = clean_header(references_header)
> +            h = remove_rfc2822_comments(clean_header(references_header))
>              if not h:
>                  continue
>              references = h.split()
> diff --git a/releasenotes/notes/issue-399-584c5be5b71dcf63.yaml
> b/releasenotes/notes/issue-399-584c5be5b71dcf63.yaml
> new file mode 100644
> index 0000000..a3e2ec1
> --- /dev/null
> +++ b/releasenotes/notes/issue-399-584c5be5b71dcf63.yaml
> @@ -0,0 +1,7 @@
> +---
> +fixes:
> +  - |
> +    Gnus users' replies would not be associated to the patch they reply to
> +    because their In-Reply-To & References field generated explanatory
> +    comments that were added to the value of the message-id parsed.
> +    (`#399 <https://github.com/getpatchwork/patchwork/issues/399>`__)

Hum. Is there a test suite we can add a regression test to for this
specific kind of line format?


Sorry not to have left any lines on the Python style - I'm the last
person you should ask. ;) Please don't consider my lack of comment to be
an approval, there.

 - Emily


More information about the Patchwork mailing list