[PATCH] Implement list filtering

Tue Feb 6 06:00:40 AEDT 2018

----- Original Message -----
> From: "Daniel Axtens" <dja at axtens.net>
> To: vkabatov at redhat.com, patchwork at lists.ozlabs.org
> Sent: Monday, February 5, 2018 4:24:21 PM
> Subject: Re: [PATCH] Implement list filtering
> 
> Hi Veronika,
> 
> So, having been convinced of the merits of this approach, here's my code
> review.
> 

Hi, thanks for the comments!

> > Sometimes, multiple projects reside at the same mailing list. So far,
> > Patchwork only allowed a single project per mailing list, which made it
> > impossible for these projects to use Patchwork (unless they did some
> > dirty hacks).
> >
> > Add a new property `subject_match` to projects and implement filtering
> > on (list_id, subject_match) match instead of solely list_id. Instance
> > admin can specify a regex on a per-project basis when the project is
> > created.
> 
> Ok, so let me see if I have the logic of this patch right.
> 
> A message comes in.
> 
> A set of possible projects is found based on the list-id.
> 
> Then, starting from the first one returned by the db (with no intrinsic
> ordering), the subject_match field is examined:
>  - if subject_match is empty, this project is returned as a match
>  - if subject_match is non-empty, the project is returned as a match if
>    the subject matches
>  
> If this understanding is incorrect, bail out here :)
> 
> Consider a list like netdev. It has at least a couple of userspace
> projects (iproute2, ethtool) that could conceivably be spun off into
> their own lists, and which could use subject matches, but most patches
> are for the linux kernel and do not have usable subject matches.
> 
> As I understand it, you would need to have 3 projects:
>  - for iproute2, subject-match \[[^\]]*iproute[^\]]*\]
>  - for ethtool, subject-match \[[^\]]*ethtool[^\]]*\]
> Then how would you match the rest of the list? You could have an empty
> subject-match, but then if that is loaded first out of the database, it
> would match first and patches would never make it to the iproute2 or
> ethtool projects.
> 

My original plan was that if you need both matching projects and a default
one on top of that you can just do a negative lookahead.

> I think you need some sort of default or fall-back or last-resort - the
> name is not important: just that it will match *only* if no other
> project matches.
> 
> Does that sound right?
> 

Using a default project is doable and makes sense. Would it be acceptable
if we first checked for subject matches and in case no match was found,
checked for a project with same list_id and empty subject_match or do you
have something different in mind?

> I have some other minor comments which I have included throughout:
> 
> > diff --git a/docs/api.yaml b/docs/api.yaml
> > index 3e79f0b..3373226 100644
> > --- a/docs/api.yaml
> > +++ b/docs/api.yaml
> > @@ -374,6 +374,9 @@ definitions:
> >        list_id:
> >          type: string
> >          description: Mailing list identifier for project.
> > +      subject_match:
> > +        type: string
> > +        description: Regex used for email filtering.
> This is good.
> 
> > diff --git a/docs/deployment/management.rst
> > b/docs/deployment/management.rst
> > index c50b7b6..870d7ee 100644
> > --- a/docs/deployment/management.rst
> > +++ b/docs/deployment/management.rst
> > @@ -63,7 +63,7 @@ due to, for example, an outage.
> >  .. option:: --list-id <list-id>
> >  
> >     mailing list ID. If not supplied, this will be extracted from the mail
> > -   headers.
> > +   headers. Don't use this option if you require filtering based on
> > subjects.
> >  
> >  .. option:: infile
> 
> I don't understand why this would interfere with subject
> filtering. I notice below that you've added a new method that matches by
> list-id and subject and kept the old list-id matching one - does the
> command-line option use the old method?
> 
> I think I would prefer that there only be one system of mapping mails to
> projects and for this restriction to go away, but if there's a good
> reason for it to stay I am always open to persuasion.
> 

Yes, the command line option uses the old function. The rationale is that
if there are more projects with same list_id which differ in subject match
fields, how do you know which one to choose if you check for valid project
based only on list_id?

If you check the parse_mail() function, it chooses either the
find_project_by_id() or find_project_by_header() functions based on whether
list_id option was provided, so I kept this behavior and only changed the
helper function for find_project_by_header().

It's not hard to change find_project_by_id() to respect subject matches
too if you want. I just felt that it defeats the purpose of having the
option to skip header matching at altogether.

> > diff --git a/patchwork/api/project.py b/patchwork/api/project.py
> > index 446c473..597f605 100644
> > --- a/patchwork/api/project.py
> > +++ b/patchwork/api/project.py
> > @@ -39,8 +39,9 @@ class ProjectSerializer(HyperlinkedModelSerializer):
> >      class Meta:
> >          model = Project
> >          fields = ('id', 'url', 'name', 'link_name', 'list_id',
> >          'list_email',
> > -                  'web_url', 'scm_url', 'webscm_url', 'maintainers')
> > -        read_only_fields = ('name', 'maintainers')
> > +                  'web_url', 'scm_url', 'webscm_url', 'maintainers',
> > +                  'subject_match')
> > +        read_only_fields = ('name', 'maintainers', 'subject_match')
> This looks good - albeit I am not an API expert.
> 
> > diff --git a/patchwork/models.py b/patchwork/models.py
> > index 11886f1..907707f 100644
> > --- a/patchwork/models.py
> > +++ b/patchwork/models.py
> > @@ -71,8 +71,14 @@ class Project(models.Model):
> >  
> >      linkname = models.CharField(max_length=255, unique=True)
> >      name = models.CharField(max_length=255, unique=True)
> > -    listid = models.CharField(max_length=255, unique=True)
> > +    listid = models.CharField(max_length=255)
> >      listemail = models.CharField(max_length=200)
> > +    subject_match = models.CharField(
> > +        max_length=64, blank=True, default='', help_text='Regex to match
> > the '
> > +        'subject against if only part of emails sent to the list belongs
> > to '
> > +        'this project. Will be used with IGNORECASE and MULTILINE flags.
> > If '
> > +        'rules for more projects match the first one returned from DB is '
> > +        'chosen.')
> 
> I don't want to bikeshed this too much, but it definitely needs to say
> that a blank string matches everything.

Sure, will add.

> 
> > diff --git a/patchwork/parser.py b/patchwork/parser.py
> > index ac7dc5f..015f709 100644
> > --- a/patchwork/parser.py
> > +++ b/patchwork/parser.py
> > @@ -157,9 +157,25 @@ def find_project_by_id(list_id):
> >          project = Project.objects.get(listid=list_id)
> >      except Project.DoesNotExist:
> >          logger.debug("'%s' if not a valid project list-id", list_id)
> > +    except Project.MultipleObjectsReturned:
> > +        logger.debug("Multiple projects with list-id '%s' found", list_id)
> >      return project
> >  
> >  
> > +def find_project_by_id_and_subject(list_id, subject):
> > +    """Find a `project` object based on `list_id` and subject prefix
> > match."""
> > +    projects = Project.objects.filter(listid=list_id)
> > +    for project in projects:
> > +        if (not project.subject_match or
> > +                re.search(project.subject_match, subject,
> > +                          re.MULTILINE | re.IGNORECASE)):
> > +            return project
> > +
> > +    logger.debug("No project to match email with list-id '%s', subject
> > '%s' "
> > +                 "found", list_id, subject)
> > +    return None
> > +
> > +
> >  def find_project_by_header(mail):
> >      project = None
> >      listid_res = [re.compile(r'.*<([^>]+)>.*', re.S),
> > @@ -181,7 +197,8 @@ def find_project_by_header(mail):
> >  
> >              listid = match.group(1)
> >  
> > -            project = find_project_by_id(listid)
> > +            project = find_project_by_id_and_subject(listid,
> > +                                                     mail.get('Subject'))
> >              if project:
> >                  break
> Again here there is the 2 ways to find a project. It does look like
> subject matching only happens if you don't specify --list-id and I think
> that is probably not the approach I want to take.
> 
> Patchwork also extracts the bits of the subject between square brackets
> - would matching on that be more appropriate? (See subject_prefixes in
> parser.py.) That would mean the regex will automatically select for
> things in square brackets without having to encode that in the regex,
> which might avoid some user confusion and misdirected mail.
> 

I though about this but didn't find an easy way to make it work for us.
For example, we accept all of

RHEL7.4
RHEL 7.4
RHEL74
RHEL-7.4
RHEL-74

which means finding 'RHEL' and '7.4' as elements in subject_prefixes is
not enough and we'd need to do something more elaborate, taking into
account possible separators between keywords etc.

Besides that, if you only need to match against single keyword in the
whole subject line (like the 'ethtool' project you mentioned previously)
it's enough to use that keyword since re.search() matches it and you can
avoid all of the .* regex stuff. Maybe I should mention it in the option
description as well...

> Lastly, please could you also add a simple test case for this: doesn't need
> to
> be big, just enough to show that the feature works.
> 

Okay, I'll add that.

> Thanks again for your contribution to Patchwork!
> 
> Regards,
> Daniel
> 

Thanks,
Veronika