[rfc] Extending Patchwork as a GSoC project

Thu May 7 16:54:35 AEST 2020

>> > Do I see it correctly, that the official Linux patchwork instances
>> > receive the ML data on their own? So they do not rely on, for example,
>> > public inboxes, right?
>> 
>> There are two patchwork instances that might reasonably be called
>> "official" - patchwork.kernel.org and patchwork.ozlabs.org. They cover
>> different subsystems and some other projects also. I don't run either of
>> them, so I am limited a bit in how much I know about them, but it is my
>> understanding that both of them ingest mail directly - see
>> https://patchwork.readthedocs.io/en/master/deployment/management/#parsemail
>> 
>> I think Konstantin, who manages the kernel.org one, would love to see an
>> evolution to move towards more use of things like public-inbox, but
>> there's no direct import support for it at this point.
>
> I guess at this point in time, going the public inbox way to ingest
> patches is a no go then. There will also be additional issues in this
> architecture that we might have to tackle. One is that mapping a patch in
> PaStA to its Patchwork ID will involve an API call. This might not
> scale to a large number of patches, whereas in the other case (pulling
> patches from Patchwork) we can obtain the Patchwork ID easily from the
> header. Further, if I understand correctly, in the case of public inboxes
> we need to manually pull patches to sync to the current state of the public
> inbox. It will be difficult for PaStA to know what state Patchwork is in.

If you need new or improved API calls to do things efficiently, I'm very
open to considering patches for that. Feel free to also ask specific
questions about 'what is the most efficient way to achieve X with the
API' on the list when you come to them.

>> > PaStA supports both: mboxes and public inboxes. PaStA also understands
>> > the X-Patchwork-ID header to uniquely identify mails. Public Inboxes are
>> > a great exchange format. We know exactly what was added since our last
>> > pull. But we need some alternative strategy in case you don't support
>> > it, and this might be tricky.
>> 
>> Just in terms of data model for patchwork, a uniqueness constraint for
>> patches is the pair (message id, project). So you can have one email
>> received and tracked by two separate projects - and with two different
>> sets of status in each - but you cannot have 2 mails with the same
>> message id in the same project. Patches also have a unique ID which is
>> used in the API.
> In this case, running an instance of PaStA per Patchwork project will make sense.
> Although we will have to think about how to identify dependencies across
> projects.

I don't know enough about PaStA to be able to give you guidance
there, but in terms of patchwork:

 - patchwork itself does not have any concept of dependencies between
   projects. There is also nothing in patchwork that knows or cares that
   multiple patchwork projects might operate over the same
   repository. Indeed, apart from some informational fields (commit_ref,
   commit_url_format, scm_url and friends) patchwork doesn't store
   anything at all about underlying VCSes.

 - In terms of practical patchwork usage, I can tell you a bit about the
   linuxppc-dev list where I do a lot of my day job. Patches that are
   sent to the powerpc list for discussion but which get merged through
   another tree are marked as "Not Applicable" by our subsystem
   maintainer, so the only patches that remain 'live' in the
   linuxppc-dev patchwork project are the ones that may eventually get
   merged into the powerpc kernel tree.

I don't know how PaStA will handle:

 - commits in the tree that are not traceable to a patch in patchwork

 - patches sent to multiple patchwork projects, where they are only
   merged via a single patchwork project, or potentially via a
   tree/list/process not tracked by patchwork (I'm thinking e.g. some
   KASAN work I did which I sent to a bunch of different lists including
   linuxppc but which was merged via linux-mm.)

I suspect your other GSoC mentors would be much more helpful on these
points!

>> Message IDs and patch IDs should also be stable/immutable. Message IDs,
>> being a property of _mails_, will be the same across different patchwork
>> instances that consume the same mail. Patch IDs, being a property of the
>> specific database that ingested the patch, will vary from patchwork
>> instance to patchwork instance.
>> 
>> > Daniel, I have in mind that there is already some kind of infrastructure
>> > in patchwork for receiving raw patches... AFAIR, Mete implemented an
>> > export routine that eases the first initial import. Is there a
>> > possibility to reliably "receive all new patches since my last pull"?
>> 
>> I struggle a little bit to follow who's importing and exporting from
>> whom, but:
>> 
>>  - There is now code to extract patches in one go from a patchwork
>>    instance. I'd caution you that there are gigabytes of patches in the
>>    databases of production instances going back over a decade, so you
>>    might find that a challenging data set to acquire and work with.
>> 
>>  - In terms of 'catching up': I think you're asking if Patchwork will
>>    let you _export_ all patches since your last pull, rather than asking
>>    if patchwork will let you import patches? I think that makes the most
>>    sense in context. If that's the case, then the way I would do that
>>    is:
>> 
>>    a) observe the highest patch ID in the project you are tracking, as
>>       patch IDs are always increasing. Note that the same cannot be said
>>       about dates - patchwork instances, due to the quirks of email,
>>       often get mail out-of-order. You probably want something like:
>> 
>>       http://patchwork.ozlabs.org/api/patches/?order=-id&project=linuxppc-dev
>> 
>>    b) Retrieve all email from your last pull to that patch ID. Bear in
>>       mind that it is likely that more email will arrive while you are
>>       doing this - hence why I suggest fetching the patch ID first! Be
>>       careful also of pagination as that can also change if new patches
>>       come in. One day we will fix this by adding cursor-based
>>       pagination as well but we haven't done it yet. As such you
>>       probably want to do this with a different query with the opposite
>>       ordering, something like:
>> 
>>       http://patchwork.ozlabs.org/api/patches/?since=2020-05-01T00%3A00%3A00&project=linuxppc-dev
>> 
>>       (order=id is implied but wouldn't hurt to specify it, and an API
>>       version, in your final code)
> I might be missing something, but why does it matter if more patches
> arrive while pulling? PaStA can pull all patches since its last pull as
> you mentioned.

I may be overcomplicating things! Have a go and see, I guess.
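
For what it's worth, here is a rough sketch of steps (a) and (b) from
above, in case it helps when experimenting. It is not taken from
patchwork or PaStA: catch_up(), patches_url() and the injected fetch
callable are hypothetical names, and I am assuming the list endpoint
accepts page and per_page alongside the order, project and since
parameters shown in the URLs above. Please check that against the API
docs for the version you run.

```python
from typing import Callable, Dict, List, Optional
from urllib.parse import urlencode

# Hypothetical transport: takes query parameters, returns one page of
# patch objects (dicts with at least an "id" key). A real version would
# GET patches_url(...) and decode the JSON response.
Fetch = Callable[[Dict[str, str]], List[dict]]


def patches_url(base: str, **params: str) -> str:
    """Build a patch-list URL in the same shape as the ones quoted above."""
    return f"{base.rstrip('/')}/patches/?{urlencode(params)}"


def catch_up(fetch: Fetch, project: str, last_seen_id: int,
             since: Optional[str] = None, per_page: int = 100) -> List[dict]:
    """Return patches with last_seen_id < id <= the ID observed up front."""
    # Step (a): observe the highest patch ID before pulling anything else.
    newest = fetch({"project": project, "order": "-id", "per_page": "1"})
    if not newest:
        return []
    high_water = newest[0]["id"]

    # Step (b): walk forward in ascending ID order up to that ID.
    collected: Dict[int, dict] = {}
    page = 1
    while True:
        params = {"project": project, "order": "id",
                  "per_page": str(per_page), "page": str(page)}
        if since is not None:
            params["since"] = since  # optional date filter, as in the URL above
        batch = fetch(params)
        if not batch:
            break
        for patch in batch:
            # Keying by ID drops duplicates if page boundaries shift.
            if last_seen_id < patch["id"] <= high_water:
                collected[patch["id"]] = patch
        if batch[-1]["id"] >= high_water:
            break
        page += 1
    return [collected[i] for i in sorted(collected)]
```

Anything that arrives after the first request has an ID above the
observed maximum, so it is simply left for the next run, which would
start from last_seen_id set to the highest ID returned here.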

>> You can map message ids to patch IDs using the API, which might also
>> help, e.g.:
>> 
>>  https://patchwork.ozlabs.org/api/patches/?msgid=20200414062102.6798-3-dja@axtens.net&project=patchwork
>> 
>> will return a json array with an object containing the patch ID for that
>> msgid/project pair if it exists.
>> 
>> The inverse is even simpler:
>> 
>>   https://patchwork.ozlabs.org/api/patches/1270129/
>> 
>> 
>> > Rohit, I guess the best thing you can do is to play with a local
>> > patchwork instance. Convert an existing public inbox back to a mbox and
>> > split it in the middle. Then, feed the first half to patchwork, and try
>> > to receive all patches via the API. Then, feed the second half and try
>> > to receive the rest of the patches. Compare the result of the API (e.g.,
>> > all Patchwork-IDs) with the database entries of Patchwork to ensure that
>> > we didn't miss a single mail.
>> 
>> The good news is that local patchwork instances are fairly easy to set
>> up with docker-compose, and the docs should be reasonable. (And if
>> they're not we're keen to fix them.)
> Agreed. It was very easy for me to set up a local instance. The docs are
> amazing!

I'm so pleased to hear :)

Regards,
Daniel