[PATCH] Implement list filtering

Daniel Axtens dja at axtens.net
Tue Feb 6 02:24:21 AEDT 2018


Hi Veronika,

So, having been convinced of the merits of this approach, here's my code
review.

> Sometimes, multiple projects reside at the same mailing list. So far,
> Patchwork only allowed a single project per mailing list, which made it
> impossible for these projects to use Patchwork (unless they did some
> dirty hacks).
>
> Add a new property `subject_match` to projects and implement filtering
> on (list_id, subject_match) match instead of solely list_id. Instance
> admin can specify a regex on a per-project basis when the project is
> created.

Ok, so let me see if I have the logic of this patch right.

A message comes in.

A set of possible projects is found based on the list-id.

Then, starting from the first one returned by the db (with no intrinsic
ordering), the subject_match field is examined:
 - if subject_match is empty, this project is returned as a match
 - if subject_match is non-empty, the project is returned as a match if
   the subject matches
 
If this understanding is incorrect, bail out here :)

Consider a list like netdev. It has at least a couple of userspace
projects (iproute2, ethtool) that could conceivably be spun off into
their own lists, and which could use subject matches, but most patches
are for the linux kernel and do not have usable subject matches.

As I understand it, you would need to have 3 projects:
 - for iproute2, subject-match \[[^\]]*iproute[^\]]*\]
 - for ethtool, subject-match \[[^\]]*ethtool[^\]]*\]
Then how would you match the rest of the list? You could have an empty
subject-match, but then if that is loaded first out of the database, it
would match first and patches would never make it to the iproute2 or
ethtool projects.

I think you need some sort of default or fall-back or last-resort - the
name is not important: just that it will match *only* if no other
project matches.

Does that sound right?

I have some other minor comments which I have included throughout:

> diff --git a/docs/api.yaml b/docs/api.yaml
> index 3e79f0b..3373226 100644
> --- a/docs/api.yaml
> +++ b/docs/api.yaml
> @@ -374,6 +374,9 @@ definitions:
>        list_id:
>          type: string
>          description: Mailing list identifier for project.
> +      subject_match:
> +        type: string
> +        description: Regex used for email filtering.
This is good.

> diff --git a/docs/deployment/management.rst b/docs/deployment/management.rst
> index c50b7b6..870d7ee 100644
> --- a/docs/deployment/management.rst
> +++ b/docs/deployment/management.rst
> @@ -63,7 +63,7 @@ due to, for example, an outage.
>  .. option:: --list-id <list-id>
>  
>     mailing list ID. If not supplied, this will be extracted from the mail
> -   headers.
> +   headers. Don't use this option if you require filtering based on subjects.
>  
>  .. option:: infile

I don't understand why this would interfere with subject
filtering. I notice below that you've added a new method that matches by
list-id and subject and kept the old list-id matching one - does the
command-line option use the old method?

I think I would prefer that there only be one system of mapping mails to
projects and for this restriction to go away, but if there's a good
reason for it to stay I am always open to persuasion.

> diff --git a/patchwork/api/project.py b/patchwork/api/project.py
> index 446c473..597f605 100644
> --- a/patchwork/api/project.py
> +++ b/patchwork/api/project.py
> @@ -39,8 +39,9 @@ class ProjectSerializer(HyperlinkedModelSerializer):
>      class Meta:
>          model = Project
>          fields = ('id', 'url', 'name', 'link_name', 'list_id', 'list_email',
> -                  'web_url', 'scm_url', 'webscm_url', 'maintainers')
> -        read_only_fields = ('name', 'maintainers')
> +                  'web_url', 'scm_url', 'webscm_url', 'maintainers',
> +                  'subject_match')
> +        read_only_fields = ('name', 'maintainers', 'subject_match')
This looks good - albeit I am not an API expert.

> diff --git a/patchwork/models.py b/patchwork/models.py
> index 11886f1..907707f 100644
> --- a/patchwork/models.py
> +++ b/patchwork/models.py
> @@ -71,8 +71,14 @@ class Project(models.Model):
>  
>      linkname = models.CharField(max_length=255, unique=True)
>      name = models.CharField(max_length=255, unique=True)
> -    listid = models.CharField(max_length=255, unique=True)
> +    listid = models.CharField(max_length=255)
>      listemail = models.CharField(max_length=200)
> +    subject_match = models.CharField(
> +        max_length=64, blank=True, default='', help_text='Regex to match the '
> +        'subject against if only part of emails sent to the list belongs to '
> +        'this project. Will be used with IGNORECASE and MULTILINE flags. If '
> +        'rules for more projects match the first one returned from DB is '
> +        'chosen.')

I don't want to bikeshed this too much, but it definitely needs to say
that a blank string matches everything.

> diff --git a/patchwork/parser.py b/patchwork/parser.py
> index ac7dc5f..015f709 100644
> --- a/patchwork/parser.py
> +++ b/patchwork/parser.py
> @@ -157,9 +157,25 @@ def find_project_by_id(list_id):
>          project = Project.objects.get(listid=list_id)
>      except Project.DoesNotExist:
>          logger.debug("'%s' if not a valid project list-id", list_id)
> +    except Project.MultipleObjectsReturned:
> +        logger.debug("Multiple projects with list-id '%s' found", list_id)
>      return project
>  
>  
> +def find_project_by_id_and_subject(list_id, subject):
> +    """Find a `project` object based on `list_id` and subject prefix match."""
> +    projects = Project.objects.filter(listid=list_id)
> +    for project in projects:
> +        if (not project.subject_match or
> +                re.search(project.subject_match, subject,
> +                          re.MULTILINE | re.IGNORECASE)):
> +            return project
> +
> +    logger.debug("No project to match email with list-id '%s', subject '%s' "
> +                 "found", list_id, subject)
> +    return None
> +
> +
>  def find_project_by_header(mail):
>      project = None
>      listid_res = [re.compile(r'.*<([^>]+)>.*', re.S),
> @@ -181,7 +197,8 @@ def find_project_by_header(mail):
>  
>              listid = match.group(1)
>  
> -            project = find_project_by_id(listid)
> +            project = find_project_by_id_and_subject(listid,
> +                                                     mail.get('Subject'))
>              if project:
>                  break
Again here there is the 2 ways to find a project. It does look like
subject matching only happens if you don't specify --list-id and I think
that is probably not the approach I want to take.

Patchwork also extracts the bits of the subject between square brackets
- would matching on that be more appropriate? (See subject_prefixes in
parser.py.) That would mean the regex will automatically select for
things in square brackets without having to encode that in the regex,
which might avoid some user confusion and misdirected mail.

Lastly, please could you also add a simple test case for this: doesn't need to
be big, just enough to show that the feature works.

Thanks again for your contribution to Patchwork!

Regards,
Daniel


More information about the Patchwork mailing list