[rfc] Extending Patchwork as a GSoC project

Fri May 8 17:43:36 AEST 2020

Hi Ralf,

>> Just in terms of data model for patchwork, a uniqueness constraint for
>> patches is the pair (message id, project). So you can have one email
>> received and tracked by two separate projects - and with two different
>> sets of status in each - but you cannot have 2 mails with the same
>> message id in the same project. Patches also have a unique ID which is
>> used in the API.
>
> while working on PaStA, I had to learn that Message-IDs are not unique.
> A mail that is sent to multiple lists will have the same Message-IDs,
> but it also will have different bodies (e.g., different footers added by
> list servers). And users sometimes re-use Message-IDs. I observed all
> fancy corner cases.

Right, I wasn't sufficiently clear.

As far as patchwork is concerned, there can be 1 instance of a given
message-id per project. That can have different contents (e.g. list
footers) and status (accepted by 1 project, Not Applicable in another,
etc). Hopefully the diff and therefore the hash will be the same.

Once patchwork has ingested an email with a given message id for a
project, it will drop all subsequent emails with the same message-id for
that project. This doesn't tend to be a problem in practice.
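
To make that rule concrete, here is a toy sketch in Python. It is only an
in-memory illustration of the (message id, project) constraint described
above, not patchwork's actual code; the class and method names are made up
for the example.

```python
from itertools import count


class ToyPatchwork:
    """Toy model of the (message id, project) uniqueness rule (hypothetical)."""

    def __init__(self):
        self.patches = {}      # (msgid, project) -> patch id
        self._ids = count(1)   # patch IDs only ever increase

    def ingest(self, msgid, project):
        """Return the new patch ID, or None if the mail is dropped as a duplicate."""
        key = (msgid, project)
        if key in self.patches:
            # Same message id already seen for this project: drop it.
            return None
        patch_id = next(self._ids)
        self.patches[key] = patch_id
        return patch_id

    def lookup(self, msgid):
        """All (project, patch id) pairs for a message id, at most one per project."""
        return sorted(
            (project, pid)
            for (mid, project), pid in self.patches.items()
            if mid == msgid
        )
```

Ingesting the same message id for two projects gives two patch IDs, while a
second copy for the same project is dropped.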

> However, we use Message-IDs as unique identifier in PaStA, but it will
> map to a list of emails.

Likewise,
http://patchwork.ozlabs.org/api/patches/?msgid=20111116225839.GE26985@bloggs.ozlabs.ibm.com
returns a list of 2 patches, 1 for when the message was received on the
linuxppc project, and 1 for when the message was received on the kvm-ppc project.

>> Message IDs and patch IDs should also be stable/immutable. Message IDs,
>> being a property of _mails_, will be the same across different patchwork
>> instances that consume the same mail. Patch IDs, being a property of the
>> specific database that ingested the patch, will vary from patchwork
>> instance to patchwork instance.
>
> I didn't fully understand. Let's say you have a patchwork instance with
> two configured projects. Each project is assigned to one list. You send
> a patch to both lists. Both mails will have the same Message-ID, while
> list servers treat the mail a bit differently and modify it.
>
> Will those mails be assigned to two IDs in patchwork, or will the ID be
> the same in both projects?

You will get 2 patch IDs, one for each project. The message-id will be
the same in each case. To retrieve a unique email you need to either
specify the patch ID, or search by the (msgid, project) pair.
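
As a rough sketch, the two lookups could be built like this in Python. The
helper names are hypothetical; only the msgid and project query parameters
and the /api/patches/ paths are taken from the URLs in this thread. Note
that urlencode percent-encodes the '@' in the message id.

```python
from urllib.parse import urlencode


def patch_query_url(base, msgid, project=None):
    """URL that searches patches by message id, optionally narrowed to a project."""
    params = {"msgid": msgid}
    if project is not None:
        params["project"] = project
    return f"{base.rstrip('/')}/api/patches/?{urlencode(params)}"


def patch_detail_url(base, patch_id):
    """URL for a single patch, addressed by its instance-local patch ID."""
    return f"{base.rstrip('/')}/api/patches/{patch_id}/"
```

Without the project argument the search can return one entry per project
that received the mail; with it, at most one entry should come back.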

What I was trying to say was actually something different again. Say you
create a local patchwork instance and subscribe it to linuxppc-dev (or
feed it linuxppc-dev patches through mboxes or public-inbox or
whatever).

Then, a query to

http://patchwork.ozlabs.org/api/patches/?msgid=20200501031128.19584-2-srikar@linux.vnet.ibm.com&project=linuxppc-dev

and

http://localhost:8000/api/patches/?msgid=20200501031128.19584-2-srikar@linux.vnet.ibm.com&project=linuxppc-dev

will both* return a list containing just the patch
'[v3,1/3] powerpc/numa: Set numa_node for all possible cpus'

The diff/content will be the same. However, only the ozlabs one will
have the current metadata for the patch (state, delegate, checks, etc),
and they will both have different internal patch IDs for the patch.

I had a long spiel about how this might be useful but the more I think
about it the more I see edge cases. Happy to go on about it some more if
you want to explore it further though.

* there's a narrow edge case if you have duplicate message IDs very
  close in time and parse them in a different order.

> BTW, this is why we can't simply use mailing list data from public
> inboxes as our data source and provide results to a patchwork instance
> running on a different data source.

I think someone quipped on Twitter once that email is a massive
distributed fuzzer and I think they are right!

>>> Daniel, I have in mind that there is already some kind of infrastructure
>>> in patchwork for receiving raw patches... AFAIR, Mete implemented an
>>> export routine that eases the first initial import. Is there a
>>> possibility to reliably "receive all new patches since my last pull"?

>>  - In terms of 'catching up': I think you're asking if Patchwork will
>>    let you _export_ all patches since your last pull, rather than asking
>>    if patchwork will let you import patches? I think that makes the most
>>    sense in context. If that's the case, then the way I would do that
>>    is:
>> 
>>    a) observe the highest patch ID in the project you are tracking, as
>>       patch IDs are always increasing. Note that the same cannot be said
>>       about dates - patchwork instances, due to the quirks of email,
>>       often get mail out-of-order. You probably want something like:
>> 
>>       http://patchwork.ozlabs.org/api/patches/?order=-id&project=linuxppc-dev
>
> Excellent! So... If I got everything from [0..100] and JSON reports that
> the latest ID is 130, then [101..130] will _definitely_ exist and form
> the exact set of patches that I miss?

It's not _quite_ that simple! Both the set [0..100] and the set
[101..130] will likely contain patches that do not belong to your
project. I suspect you do not want to gather patches for every project!

But if you are following linuxppc, and you have gathered
{linuxppc patches with id <= 100},
and the latest ID for linuxppc is 130, then I believe
{linuxppc patches where 100 < id <= 130}
is the exact set of patches you've missed. We don't support sharding of
PK space for multi-master writes or anything else that might mess with
this.

Sadly, we also don't currently support a filter predicate that would
allow you to neatly express 'patches with IDs between 100 and 130' in a
query, but I'd be happy to consider such a patch.

(In the meantime, you can store which page of
http://patchwork.ozlabs.org/api/patches/?project=linuxppc-dev contained
patch 100 and read all subsequent pages until you hit patch 130. As I
alluded to but didn't state clearly, pagination when sorted by
increasing ID is stable*.)

* unless an admin goes in and deletes a patch or moves it to another
  project, or changes the hard-coded page length. But I think this is
  largely a theoretical concern only!
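
A minimal sketch of that client-side filtering, in Python. It assumes you
have already fetched the pages for your project in increasing ID order and
decoded them into lists of dicts, and that each patch object carries its
patch ID under an "id" key; the function name is made up for the example.

```python
def missed_patches(pages, last_seen_id, latest_id):
    """Yield patches with last_seen_id < id <= latest_id.

    `pages` is an iterable of pages, each a list of patch dicts, sorted by
    increasing patch ID for a single project.
    """
    for page in pages:
        for patch in page:
            pid = patch["id"]
            if pid <= last_seen_id:
                continue   # already gathered in an earlier pull
            if pid > latest_id:
                return     # arrived after we observed the latest ID
            yield patch
```

Because the IDs in one project are not contiguous, the result is whatever
subset of (last_seen_id, latest_id] actually belongs to that project.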

Regards,
Daniel

>> 
>>    b) Retrieve all email from your last pull to that patch ID. Bear in
>>       mind that it is likely that more email will arrive while you are
>>       doing this - hence why I suggest fetching the patch ID first! Be
>
> Ack.
>
>>       careful also of pagination as that can also change if new patches
>>       come in. One day we will fix this by adding cursor-based
>>       pagination as well but we haven't done it yet. As such you
>>       probably want to do this with a different query with the opposite
>>       ordering, something like:
>> 
>>       http://patchwork.ozlabs.org/api/patches/?since=2020-05-01T00%3A00%3A00&project=linuxppc-dev
>> 
>>       (order=id is implied but wouldn't hurt to specify it, and an API
>>       version, in your final code)
>> 
>> You can map message ids to patch IDs using the API, which might also
>> help, e.g.:
>> 
>>  https://patchwork.ozlabs.org/api/patches/?msgid=20200414062102.6798-3-dja@axtens.net&project=patchwork
>> 
>> will return a json array with an object containing the patch ID for that
>> msgid/project pair if it exists.
>> 
>> The inverse is even simpler:
>> 
>>   https://patchwork.ozlabs.org/api/patches/1270129/
>
> I see, thanks.
>
> Thanks
>   Ralf