[PATCH] Implement list filtering

Tue Feb 13 22:21:19 AEDT 2018

Hi Veronika,

Apologies for the delay in getting back to you, and thanks for your responses.

>> Consider a list like netdev. It has at least a couple of userspace
>> projects (iproute2, ethtool) that could conceivably be spun off into
>> their own lists, and which could use subject matches, but most patches
>> are for the linux kernel and do not have usable subject matches.
>> 
>> As I understand it, you would need to have 3 projects:
>>  - for iproute2, subject-match \[[^\]]*iproute[^\]]*\]
>>  - for ethtool, subject-match \[[^\]]*ethtool[^\]]*\]
>> Then how would you match the rest of the list? You could have an empty
>> subject-match, but then if that is loaded first out of the database, it
>> would match first and patches would never make it to the iproute2 or
>> ethtool projects.
>> 
>
> My original plan was that if you need both matching projects and a default
> one on top of that you can just do a negative lookahead.
>
>> I think you need some sort of default or fall-back or last-resort - the
>> name is not important: just that it will match *only* if no other
>> project matches.
>> 
>> Does that sound right?
>> 
>
> Using a default project is doable and makes sense. Would it be acceptable
> if we first checked for subject matches and in case no match was found,
> checked for a project with same list_id and empty subject_match or do you
> have something different in mind?

That sounds like a good strategy. I certainly prefer it to negative
lookahead.

>
>> I have some other minor comments which I have included throughout:
>> 
>> > diff --git a/docs/api.yaml b/docs/api.yaml
>> > index 3e79f0b..3373226 100644
>> > --- a/docs/api.yaml
>> > +++ b/docs/api.yaml
>> > @@ -374,6 +374,9 @@ definitions:
>> >        list_id:
>> >          type: string
>> >          description: Mailing list identifier for project.
>> > +      subject_match:
>> > +        type: string
>> > +        description: Regex used for email filtering.
>> This is good.
>> 
>> > diff --git a/docs/deployment/management.rst
>> > b/docs/deployment/management.rst
>> > index c50b7b6..870d7ee 100644
>> > --- a/docs/deployment/management.rst
>> > +++ b/docs/deployment/management.rst
>> > @@ -63,7 +63,7 @@ due to, for example, an outage.
>> >  .. option:: --list-id <list-id>
>> >  
>> >     mailing list ID. If not supplied, this will be extracted from the mail
>> > -   headers.
>> > +   headers. Don't use this option if you require filtering based on
>> > subjects.
>> >  
>> >  .. option:: infile
>> 
>> I don't understand why this would interfere with subject
>> filtering. I notice below that you've added a new method that matches by
>> list-id and subject and kept the old list-id matching one - does the
>> command-line option use the old method?
>> 
>> I think I would prefer that there only be one system of mapping mails to
>> projects and for this restriction to go away, but if there's a good
>> reason for it to stay I am always open to persuasion.
>> 
>
> Yes, the command line option uses the old function. The rationale is that
> if there are more projects with same list_id which differ in subject match
> fields, how do you know which one to choose if you check for valid project
> based only on list_id?
>
> If you check the parse_mail() function, it chooses either the
> find_project_by_id() or find_project_by_header() functions based on whether
> list_id option was provided, so I kept this behavior and only changed the
> helper function for find_project_by_header().
>
> It's not hard to change find_project_by_id() to respect subject matches
> too if you want. I just felt that it defeats the purpose of having the
> option to skip header matching at altogether.

Oh I see and I think I understand.

>From a conceptual point of view, I see the --list-id option as just
overriding the list id; I don't think it overrides any other headers
(X-Patchwork-* for example).

However, I realised that my motivation is largely that I'd like to take
my existing archives of email, set up a new project with the same
list-id as the default project but a different subject match, and then
use --list-id=patchwork.ozlabs.org to injest my mail archive from a
different list. I realise now that this is just laziness on my part: I
should just change the list ids!

I wonder if anyone else uses the --list-id parameter (especially live)
and how best to serve their needs. I lean towards enabling subject
matching even if it is used, just in case someone feeds a non-list or
multiple lists into the one patchwork project. If anyone has thoughts
here I'd appreciate them.

(If your v2 addresses my other comments then I'll merge it either way,
we can always change it later.)

>
>> > diff --git a/patchwork/api/project.py b/patchwork/api/project.py
>> > index 446c473..597f605 100644
>> > --- a/patchwork/api/project.py
>> > +++ b/patchwork/api/project.py
>> > @@ -39,8 +39,9 @@ class ProjectSerializer(HyperlinkedModelSerializer):
>> >      class Meta:
>> >          model = Project
>> >          fields = ('id', 'url', 'name', 'link_name', 'list_id',
>> >          'list_email',
>> > -                  'web_url', 'scm_url', 'webscm_url', 'maintainers')
>> > -        read_only_fields = ('name', 'maintainers')
>> > +                  'web_url', 'scm_url', 'webscm_url', 'maintainers',
>> > +                  'subject_match')
>> > +        read_only_fields = ('name', 'maintainers', 'subject_match')
>> This looks good - albeit I am not an API expert.
>> 
>> > diff --git a/patchwork/models.py b/patchwork/models.py
>> > index 11886f1..907707f 100644
>> > --- a/patchwork/models.py
>> > +++ b/patchwork/models.py
>> > @@ -71,8 +71,14 @@ class Project(models.Model):
>> >  
>> >      linkname = models.CharField(max_length=255, unique=True)
>> >      name = models.CharField(max_length=255, unique=True)
>> > -    listid = models.CharField(max_length=255, unique=True)
>> > +    listid = models.CharField(max_length=255)
>> >      listemail = models.CharField(max_length=200)
>> > +    subject_match = models.CharField(
>> > +        max_length=64, blank=True, default='', help_text='Regex to match
>> > the '
>> > +        'subject against if only part of emails sent to the list belongs
>> > to '
>> > +        'this project. Will be used with IGNORECASE and MULTILINE flags.
>> > If '
>> > +        'rules for more projects match the first one returned from DB is '
>> > +        'chosen.')
>> 
>> I don't want to bikeshed this too much, but it definitely needs to say
>> that a blank string matches everything.
>
> Sure, will add.
(This is obvious, but please ensure your updated behaviour for default
projects is reflected here and in the docs.)

>> >  def find_project_by_header(mail):
>> >      project = None
>> >      listid_res = [re.compile(r'.*<([^>]+)>.*', re.S),
>> > @@ -181,7 +197,8 @@ def find_project_by_header(mail):
>> >  
>> >              listid = match.group(1)
>> >  
>> > -            project = find_project_by_id(listid)
>> > +            project = find_project_by_id_and_subject(listid,
>> > +                                                     mail.get('Subject'))
>> >              if project:
>> >                  break
>> Again here there is the 2 ways to find a project. It does look like
>> subject matching only happens if you don't specify --list-id and I think
>> that is probably not the approach I want to take.
>> 
>> Patchwork also extracts the bits of the subject between square brackets
>> - would matching on that be more appropriate? (See subject_prefixes in
>> parser.py.) That would mean the regex will automatically select for
>> things in square brackets without having to encode that in the regex,
>> which might avoid some user confusion and misdirected mail.
>> 
>
> I though about this but didn't find an easy way to make it work for us.
> For example, we accept all of
>
> RHEL7.4
> RHEL 7.4
> RHEL74
> RHEL-7.4
> RHEL-74
>
> which means finding 'RHEL' and '7.4' as elements in subject_prefixes is
> not enough and we'd need to do something more elaborate, taking into
> account possible separators between keywords etc.

I see - keep matching by the entire subject then.

> Besides that, if you only need to match against single keyword in the
> whole subject line (like the 'ethtool' project you mentioned previously)
> it's enough to use that keyword since re.search() matches it and you can
> avoid all of the .* regex stuff. Maybe I should mention it in the option
> description as well...

See, I want to match (and these are real examples from netdev)
[PATCH ethtool v2] ethtool: Support for FEC encoding control
but not
[PATCH RFC 07/11] dpaa_eth: add ethtool functionality
So I do want to keep the \[ and \] in there somehow.

I think we should assume people know a little about re.match() and not
put too much of that in the description lest it become super long.

Thanks again.

Regards,
Daniel

>> Lastly, please could you also add a simple test case for this: doesn't need
>> to
>> be big, just enough to show that the feature works.
>> 
>
> Okay, I'll add that.
>
>> Thanks again for your contribution to Patchwork!
>> 
>> Regards,
>> Daniel
>> 
>
> Thanks,
> Veronika