Structured feeds

Sat Nov 9 01:18:14 AEDT 2019

> I'm actually very interested in seeing patchwork switch from being fed 
> mail directly from postfix to using public-inbox repositories as its 
> source of patches. I know it's easy enough to accomplish as-is, by 
> piping things from public-inbox to parsemail.sh, but it would be even 
> more awesome if patchwork learned to work with these repos natively.
>
> The way I see it:
>
> - site administrator configures upstream public-inbox feeds
> - a backend process clones these repositories
>    - if it doesn't find a refs/heads/json, then it does its own parsing 
>      to generate a structured feed with patches/series/trailers/pull 
>      requests, cross-referencing them by series as necessary. Something 
>      like a subset of this, excluding patchwork-specific data:
>      https://patchwork.kernel.org/api/1.1/patches/11177661/
>    - if it does find an existing structured feed, it simply uses it (e.g.  
>      it was made available by another patchwork instance)
> - the same backend process updates the repositories from upstream using 
>    proper manifest files (e.g. see 
>    https://lore.kernel.org/workflows/manifest.js.gz)
>
> - patchwork projects then consume one (or more) of these structured 
>    feeds to generate the actionable list of patches that maintainers can 
>    use, perhaps with optional filtering by specific headers (list-id, 
>    from, cc), patch paths, keywords, etc.
>
> Basically, parsemail.sh is split into two, where one part does feed 
> cloning, pulling, and parsing into structured data (if not already 
> done), and another populates actual patchwork project with patches 
> matching requested parameters.

This is very confusing to me. Let me see if I have it correct.

You want to split out a chunk of parsemail that takes email messages,
either from regular email or from public-inbox, and spits out a
structured feed.

You then want patchwork to consume that structured feed.

I don't know how that would work architecturally - converting emails
into a structured feed requires a lot of the patchwork core.

It would be a lot simpler from the patchwork side to teach parsemail to
consume a public-inbox git feed, and to write an API consumer that takes
the structured data that Patchwork produces, strips out the bits you
don't care about, and feeds it into other projects.
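
As a rough sketch of the API-consumer half of that idea: the helper
below takes one patch object, shaped like what the Patchwork REST API
returns, and drops the keys that only mean something inside a single
patchwork instance. The name strip_patchwork_fields and the exact set of
keys in PATCHWORK_ONLY are assumptions made for illustration, not an
agreed definition of "patchwork-specific data".

```python
from typing import Any, Dict, FrozenSet

# Assumption: keys that only make sense inside one patchwork instance.
# The list is illustrative; adjust it to whatever the consumer needs.
PATCHWORK_ONLY: FrozenSet[str] = frozenset({
    "id", "url", "web_url", "state", "archived",
    "delegate", "check", "checks", "comments", "mbox",
})


def strip_patchwork_fields(
    patch: Dict[str, Any],
    drop: FrozenSet[str] = PATCHWORK_ONLY,
) -> Dict[str, Any]:
    """Return a copy of one API patch object without instance-specific keys."""
    return {key: value for key, value in patch.items() if key not in drop}
```

The input dict is left untouched, so the same API response can still be
used elsewhere after it has been stripped for the feed.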

>
> I see the following upsides to this:
>
> - we consume public-inbox feeds directly, no longer losing patches due 
>    to MTA problems, postfix burps, parse failures, etc

This much I am OK with as an additional option for sites. FWIW,
consuming a public-inbox feed doesn't protect you against most parse
failures - they are due to things like duplicate message-ids and bad
mail from the sender end. It should protect against issues due to
postfix invoking multiple parsemails in parallel, but that shouldn't be
losing patches, just getting series metadata wrong.

> - a project can have multiple sources for patches instead of being tied 
>    to a single mailing list

You can get around this pretty easily now with the --list-id parameter,
and I think the netdev patchwork might do this to grab bpf patches? I
think there's a little shim at OzLabs that does this.

I also don't see how a public-inbox feed helps. Currently patchwork
determines the list based on a header in the email, unless overridden.
public-inbox emails will also have that header, so either patchwork
looks at those headers or you tell patchwork explicitly that a
particular public-inbox feed corresponds to a particular list. Either
way I think this leaves you in the same situation you were in before,
unless I have misunderstood...
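
To make the two options concrete, here is a minimal sketch using only
the Python standard library. It assumes the header in question is
List-Id; list_id_of and project_key_for are hypothetical helpers, not
patchwork functions, and feed_list_id stands in for an explicit
per-feed override.

```python
import re
from email import message_from_bytes, policy
from typing import Optional


def list_id_of(raw: bytes) -> Optional[str]:
    """Extract the bare list identifier from a message's List-Id header."""
    msg = message_from_bytes(raw, policy=policy.default)
    header = msg.get("List-Id")
    if header is None:
        return None
    text = str(header)
    # A List-Id header usually looks like: Some description <list.example.org>
    match = re.search(r"<([^>]+)>", text)
    return (match.group(1) if match else text).strip().lower()


def project_key_for(raw: bytes, feed_list_id: Optional[str] = None) -> Optional[str]:
    """An explicit per-feed override wins; otherwise fall back to the header."""
    return feed_list_id or list_id_of(raw)
```

Either branch ends with exactly one list identifier per message, which
is the same position as with mail delivered by the MTA.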

> - downstream patchwork instances (the "local patchwork" tool I mentioned 
>    earlier) can benefit from structured feeds provided by 
>    patchwork.kernel.org

Do I understand correctly that this is basically a stripped-down version
of what the API provides, but in git form?

>>Patchwork does expose much of this as an API, for example for patches:
>>https://patchwork.ozlabs.org/api/patches/?order=-id so if you want to
>>build on that feel free. We can possibly add data to the API if that
>>would be helpful. (Patches are always welcome too, if you don't want to
>>wait an indeterminate amount of time.)
>
> As I said previously, I may be able to fund development of various 
> features, but I want to make sure that I properly work with upstream.  
> That requires getting consensus on features to make sure that we don't 
> spend funds and efforts on a feature that gets rejected. :)
>
> Would the above feature (using one or more public-inbox repositories as 
> sources for a patchwork project) be a welcome addition to upstream?

I think a lot about patchwork development in terms of good incremental
changes. This is largely because maintainers get quite cross with us if
we break things, and I don't like that.

What I would be happy with as a first step (not necessarily saying this
is _all_ I would accept, just that this is what I'd want to see _first_)
is:

 - code that efficiently reads a public-inbox git repository/folder of
   git repositories and feeds it into the existing parser. I have very
   inefficient code that converts public-inbox to an mbox and then
   parses that, but I'm sure you can do better with a git library.

 - careful thought about how to do this incrementally. It's obvious how
   to do email incrementally, but I think you need to keep an extra bit
   of state around to incrementally parse the git archive. I think.

 - careful thought about how to do this in a way that doesn't require
   sites that don't want to load public-inbox feeds to install lots of
   random git-parsing code.

Once you can do that, I'm happy to think more about your more ambitious
plans.
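
One possible shape for the first two items in the list above, as a
sketch under stated assumptions rather than a proposed implementation.
It assumes the public-inbox v2 layout, where each commit stores the raw
message in a file named "m", and it assumes the branch to read is
refs/heads/master. It shells out to the git command line instead of
using a git library, so it needs nothing beyond the standard library.
The extra bit of state is the id of the last commit already parsed.
make_git_runner and new_messages are hypothetical names.

```python
import subprocess
from typing import Callable, Iterator, List, Optional, Tuple

GitRunner = Callable[[List[str]], bytes]


def make_git_runner(repo_path: str) -> GitRunner:
    """Build a callable that runs git inside repo_path and returns stdout."""
    def run(args: List[str]) -> bytes:
        return subprocess.run(
            ["git", "-C", repo_path, *args],
            check=True,
            stdout=subprocess.PIPE,
        ).stdout
    return run


def new_messages(
    run_git: GitRunner,
    last_seen: Optional[str] = None,
    ref: str = "refs/heads/master",  # assumption: branch holding the archive
) -> Iterator[Tuple[str, bytes]]:
    """Yield (commit id, raw message) for commits after last_seen, oldest first."""
    rev_range = ref if last_seen is None else f"{last_seen}..{ref}"
    commits = run_git(["rev-list", "--reverse", rev_range]).decode().split()
    for commit in commits:
        try:
            # Assumption: the message lives in a file named "m" in each commit.
            raw = run_git(["show", f"{commit}:m"])
        except subprocess.CalledProcessError:
            # No "m" in this commit (for example a deletion); skip it.
            continue
        yield commit, raw
```

A caller would hand each raw message to the existing parser and store
the commit id only after that message has been parsed, so an interrupted
run resumes from the right place. Keeping the reader behind a plain
callable also means sites that never load public-inbox feeds never
import or install anything git-related.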

Regards,
Daniel

>
> -K