[rfc] Extending Patchwork as a GSoC project

Thu May 7 13:30:17 AEST 2020

On Thu, May 07, 2020 at 03:27:13AM +1000, Daniel Axtens wrote:
> Ralf Ramsauer <ralf.ramsauer at oth-regensburg.de> writes:
> 
> > Hi Rohit, hi all,
> >
> > On 06/05/2020 05:13, Rohit Sarkar wrote:
> >> It would be great to hear about the views of the Patchwork community
> >> regarding this project. This would help us in better defining the work
> >> items and making informed architectural decisions regarding the
> >> interaction between PaStA and Patchwork.
> >
> > Thanks for picking this up, and thanks for starting the discussion.
> 
> Welcome Rohit!
> 
> > Daniel, just to keep you in sync: We (Lukas, Rohit and I) already had a
> > video call yesterday, and we were already able to identify three
> > milestones of the project:
> >
> > 1. Get PaStA and Patchwork in sync. Both need to work on the same data
> >    sources.
> > 2. (Differentially) analyse new incoming data, such as new patches on
> >    lists, or new commits in the repo(s).
> > 3. Update Patchwork relations by using the existent API.
> >
> > But beforehand, we need to sort out some technical/architectural details
> > before Rohit can start coding.
> >
> > So let me start the discussion for 1.:
> >
> > Do I see it correctly, that the official Linux patchwork instances
> > receive the ML data on their own? So they do not rely on, for example,
> > public inboxes, right?
> 
> There are two patchwork instances that might reasonably be called
> "official" - patchwork.kernel.org and patchwork.ozlabs.org. They cover
> different subsystems and some other projects also. I don't run either of
> them, so I am limited a bit in how much I know about them, but it is my
> understanding that both of them ingest mail directly - see
> https://patchwork.readthedocs.io/en/master/deployment/management/#parsemail
> 
> I think Konstantin, who manages the kernel.org one, would love to see an
> evolution to move towards more use of things like public-inbox, but
> there's no direct import support for it at this point.

I guess ingesting patches via public inboxes is a no-go at this point,
then. That architecture would also raise additional issues that we would
have to tackle. One is that mapping a patch in PaStA to its Patchwork ID
would involve an API call per patch, which might not scale to a large
number of patches. In the other case (pulling patches from Patchwork) we
can read the Patchwork ID directly from the X-Patchwork-ID header.
Further, if I understand correctly, with public inboxes we have to pull
manually to sync with the current state of the inbox, so it will be
difficult for PaStA to know what state Patchwork is in.
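
To make the difference between the two cases concrete, here is a rough
sketch (not PaStA code). It assumes the mail carries an X-Patchwork-ID
header when it was pulled from Patchwork, and otherwise falls back to the
msgid lookup URL from Daniel's mail. The function names and the api_base
parameter are made up for illustration.

```python
from email import message_from_string
from typing import Optional
from urllib.parse import urlencode


def patchwork_id_from_mail(raw_mail: str) -> Optional[int]:
    """Return the Patchwork ID stored in the mail header, or None.

    Assumption: the ID is carried in an X-Patchwork-ID header.
    Header lookup in the email module is case-insensitive.
    """
    msg = message_from_string(raw_mail)
    value = msg.get("X-Patchwork-ID")
    if value is None:
        return None
    try:
        return int(value.strip())
    except ValueError:
        # Header present but not a number: treat it as missing.
        return None


def msgid_lookup_url(api_base: str, msgid: str, project: str) -> str:
    """Build the per-patch lookup URL needed when no header is available."""
    query = urlencode({"msgid": msgid.strip("<>"), "project": project})
    return f"{api_base.rstrip('/')}/patches/?{query}"
```

With the header, the mapping is a local parse. Without it, every patch
costs one request to the URL built by msgid_lookup_url(), which is the
scalability concern above.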

> > PaStA supports both: mboxes and public inboxes. PaStA also understands
> > the X-Patchwork-ID header to uniquely identify mails. Public Inboxes are
> > a great exchange format. We know exactly what was added since our last
> > pull. But we need some alternative strategy in case you don't support
> > it, and this might be tricky.
> 
> Just in terms of data model for patchwork, a uniqueness constraint for
> patches is the pair (message id, project). So you can have one email
> received and tracked by two separate projects - and with two different
> sets of status in each - but you cannot have 2 mails with the same
> message id in the same project. Patches also have a unique ID which is
> used in the API.
In that case, running one PaStA instance per Patchwork project makes
sense, although we will have to think about how to identify dependencies
across projects.
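
One possible starting point for the cross-project part, purely as a
sketch: keep an index keyed by the (message id, project) pair, and a
second view grouped by message id alone. The record layout and function
names are my own assumptions, not anything that exists in PaStA or
Patchwork.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, int]  # (message id, project, patch ID)


def index_patches(records: Iterable[Record]):
    """Index patches by (msgid, project) and group them by msgid."""
    by_key: Dict[Tuple[str, str], int] = {}
    by_msgid: Dict[str, Dict[str, int]] = defaultdict(dict)
    for msgid, project, patch_id in records:
        key = (msgid, project)
        if key in by_key:
            # Mirrors the uniqueness constraint described above.
            raise ValueError(f"duplicate (msgid, project) pair: {key}")
        by_key[key] = patch_id
        by_msgid[msgid][project] = patch_id
    return by_key, dict(by_msgid)


def shared_across_projects(by_msgid: Dict[str, Dict[str, int]]):
    """Return the mails that are tracked by more than one project."""
    return {m: p for m, p in by_msgid.items() if len(p) > 1}
```

Per-project PaStA instances would each only see their own slice of
by_key, and shared_across_projects() would give the mails where the
instances need to talk to each other.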

> Message IDs and patch IDs should also be stable/immutable. Message IDs,
> being a property of _mails_, will be the same across different patchwork
> instances that consume the same mail. Patch IDs, being a property of the
> specific database that ingested the patch, will vary from patchwork
> instance to patchwork instance.
> 
> > Daniel, I have in mind that there is already some kind of infrastructure
> > in patchwork for receiving raw patches... AFAIR, Mete implemented an
> > export routine that eases the first initial import. Is there a
> > possibility to reliably "receive all new patches since my last pull"?
> 
> I struggle a little bit to follow the who's importing and exporting from
> whom, but:
> 
>  - There is now code to extract patches in one go from a patchwork
>    instance. I'd caution you that there are gigabytes of patches in the
>    databases of production instances going back over a decade, so you
>    might find that a challenging data set to acquire and work with.
> 
>  - In terms of 'catching up': I think you're asking if Patchwork will
>    let you _export_ all patches since your last pull, rather than asking
>    if patchwork will let you import patches? I think that makes the most
>    sense in context. If that's the case, then the way I would do that
>    is:
> 
>    a) observe the highest patch ID in the project you are tracking, as
>       patch IDs are always increasing. Note that the same cannot be said
>       about dates - patchwork instances, due to the quirks of email,
>       often get mail out-of-order. You probably want something like:
> 
>       http://patchwork.ozlabs.org/api/patches/?order=-id&project=linuxppc-dev
> 
>    b) Retrieve all email from your last pull to that patch ID. Bear in
>       mind that it is likely that more email will arrive while you are
>       doing this - hence why I suggest fetching the patch ID first! Be
>       careful also of pagination as that can also change if new patches
>       come in. One day we will fix this by adding cursor-based
>       pagination as well but we haven't done it yet. As such you
>       probably want to do this with a different query with the opposite
>       ordering, something like:
> 
>       http://patchwork.ozlabs.org/api/patches/?since=2020-05-01T00%3A00%3A00&project=linuxppc-dev
> 
>       (order=id is implied but wouldn't hurt to specify it, and an API
>       version, in your final code)
I might be missing something, but why does it matter if more patches
arrive while pulling? PaStA can pull all patches since its last pull, as
you mentioned.
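
In case it helps the discussion, this is how I read the two-step scheme
above, as a rough sketch without any network code. It assumes each patch
object returned by the API has an integer "id" field, and that pages may
overlap if new patches arrive between requests. The function name and
its parameters are hypothetical.

```python
from typing import Any, Dict, Iterable, List

Patch = Dict[str, Any]


def collect_new_patches(
    pages: Iterable[List[Patch]],
    last_seen_id: int,
    high_water_id: int,
) -> List[Patch]:
    """Keep patches with last_seen_id < id <= high_water_id, once each.

    high_water_id is the highest patch ID observed *before* paging
    started (step a). Anything above it is left for the next pull.
    """
    seen: Dict[int, Patch] = {}
    for page in pages:
        for patch in page:
            pid = int(patch["id"])
            if last_seen_id < pid <= high_water_id:
                # A shifted page can repeat a patch; keep the first copy.
                seen.setdefault(pid, patch)
    return [seen[pid] for pid in sorted(seen)]
```

The next pull would then start from high_water_id, so patches that came
in during paging are picked up later instead of being half-seen now.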

> You can map message ids to patch IDs using the API, which might also
> help, e.g.:
> 
>  https://patchwork.ozlabs.org/api/patches/?msgid=20200414062102.6798-3-dja@axtens.net&project=patchwork
> 
> will return a json array with an object containing the patch ID for that
> msgid/project pair if it exists.
> 
> The inverse is even simpler:
> 
>   https://patchwork.ozlabs.org/api/patches/1270129/
> 
> 
> > Rohit, I guess the best thing you can do is to play with a local
> > patchwork instance. Convert an existing public inbox back to a mbox and
> > split it in the middle. Then, feed the first half to patchwork, and try
> > to receive all patches via the API. Then, feed the second half and try
> > to receive the rest of the patches. Compare the result of the API (e.g.,
> > all Patchwork-IDs) with the database entries of Patchwork to ensure that
> > we didn't miss a single mail.
> 
> The good news is that local patchwork instances are fairly easy to set
> up with docker-compose, and the docs should be reasonable. (And if
> they're not we're keen to fix them.)
Agreed. It was very easy for me to set up a local instance. The docs are
amazing!

> BTW - I'm not sure if this will be helpful or just a distraction, but
> just in case: if you want a reasonably big data set already in mbox
> format, I recommend the Canonical kernel-team mailing list archive. It's
> several hundred megabytes now, but be warned that it does contain a lot
> of broken mails (missing message ids and other assorted brokenness) in
> the early years of coverage.
> 
> One interesting feature of this dataset for your purposes is that it
> takes (from memory) hours to import with parsearchive on my SSD-equipped
> laptop. That means you could easily test how your scripts perform when
> messages are rapidly being added to the database, which will help shake
> out any bugs you might have.
> 
> You can find it here
> https://lists.ubuntu.com/archives/kernel-team/
> direct link - 817 MB:
> https://lists.ubuntu.com/archives/kernel-team.mbox/kernel-team.mbox
This helps for sure. Thanks a ton!

> FWIW, I would also be open to a patch that adds a management command or
> option to parsearchive for parsing public-inbox format, depending on the
> amount of dependencies it would need to pull in. (But please don't stall
> your project working on this - it's just a nice-to-have from my point of
> view as a patchwork maintainer!) Somewhere back in the mail archives I
> think I also pointed Mete to a dodgy shell script I wrote that converts
> public-inbox to mbox... but someone has probably written a better one by
> now.
> 
> 
> Hope this helps, feel free to ask more questions.
This was very informative, thanks!
> Regards,
> Daniel
> 
> >
> > Thanks
> >   Ralf
Thanks,
Rohit