[rfc] Extending Patchwork as a GSoC project

Thu May 7 03:27:13 AEST 2020

Ralf Ramsauer <ralf.ramsauer at oth-regensburg.de> writes:

> Hi Rohit, hi all,
>
> On 06/05/2020 05:13, Rohit Sarkar wrote:
>> It would be great to hear about the views of the Patchwork community
>> regarding this project. This would help us in better defining the work
>> items and making informed architectural decisions regarding the
>> interaction between PaStA and Patchwork.
>
> Thanks for picking this up, and thanks for starting the discussion.

Welcome Rohit!

> Daniel, just to keep you in sync: We (Lukas, Rohit and I) already had a
> video call yesterday, and we were already able to identify three
> milestones of the project:
>
> 1. Get PaStA and Patchwork in sync. Both need to work on the same data
>    sources.
> 2. (Differentially) analyse new incoming data, such as new patches on
>    lists, or new commits in the repo(s).
> 3. Update Patchwork relations by using the existing API.
>
> But we need to sort out some technical/architectural details before
> Rohit can start coding.
>
> So let me start the discussion for 1.:
>
> Do I see it correctly, that the official Linux patchwork instances
> receive the ML data on their own? So they do not rely on, for example,
> public inboxes, right?

There are two patchwork instances that might reasonably be called
"official" - patchwork.kernel.org and patchwork.ozlabs.org. They cover
different subsystems and some other projects also. I don't run either of
them, so I am limited a bit in how much I know about them, but it is my
understanding that both of them ingest mail directly - see
https://patchwork.readthedocs.io/en/master/deployment/management/#parsemail

I think Konstantin, who manages the kernel.org one, would love to see a
move towards more use of things like public-inbox, but there's no direct
import support for it at this point.

> PaStA supports both: mboxes and public inboxes. PaStA also understands
> the X-Patchwork-ID header to uniquely identify mails. Public Inboxes are
> a great exchange format. We know exactly what was added since our last
> pull. But we need some alternative strategy in case you don't support
> it, and this might be tricky.

Just in terms of the data model for patchwork, the uniqueness constraint
for patches is the pair (message id, project). So you can have one email
received and tracked by two separate projects - with a different set of
statuses in each - but you cannot have two mails with the same message
id in the same project. Patches also have a unique ID which is used in
the API.

Message IDs and patch IDs should also be stable/immutable. Message IDs,
being a property of _mails_, will be the same across different patchwork
instances that consume the same mail. Patch IDs, being a property of the
specific database that ingested the patch, will vary from patchwork
instance to patchwork instance.
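
To make the constraint concrete, here is a toy in-memory sketch in
Python. It is not Patchwork's actual model code: PatchStore,
DuplicatePatch, add and lookup are names invented for illustration. It
only mirrors the two properties described above, namely that
(message id, project) must be unique and that patch IDs are local to
one instance and only ever increase.

```python
class DuplicatePatch(Exception):
    """Raised when a (message id, project) pair is already tracked."""


class PatchStore:
    """Toy stand-in for the patch table of ONE patchwork instance."""

    def __init__(self):
        self._by_key = {}   # (msgid, project) -> patch ID
        self._next_id = 1   # patch IDs only ever increase

    def add(self, msgid, project):
        """Track a mail in a project and return its new patch ID."""
        key = (msgid, project)
        if key in self._by_key:
            # Same mail, same project: rejected by the uniqueness rule.
            raise DuplicatePatch(key)
        patch_id = self._next_id
        self._next_id += 1
        self._by_key[key] = patch_id
        return patch_id

    def lookup(self, msgid, project):
        """Return the patch ID for the pair, or None if not tracked."""
        return self._by_key.get((msgid, project))
```

Two separate PatchStore objects fed the same mail would share the
message id but may hand out different patch IDs, which is why a message
id is the safer key when comparing across instances.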

> Daniel, I have in mind that there is already some kind of infrastructure
> in patchwork for receiving raw patches... AFAIR, Mete implemented an
> export routine that eases the first initial import. Is there a
> possibility to reliably "receive all new patches since my last pull"?

I struggle a little to follow who's importing and exporting from whom,
but:

 - There is now code to extract patches in one go from a patchwork
   instance. I'd caution you that there are gigabytes of patches in the
   databases of production instances going back over a decade, so you
   might find that a challenging data set to acquire and work with.

 - In terms of 'catching up': I think you're asking if Patchwork will
   let you _export_ all patches since your last pull, rather than asking
   if patchwork will let you import patches? I think that makes the most
   sense in context. If that's the case, then the way I would do that
   is:

   a) observe the highest patch ID in the project you are tracking, as
      patch IDs are always increasing. Note that the same cannot be said
      about dates - patchwork instances, due to the quirks of email,
      often get mail out-of-order. You probably want something like:

      http://patchwork.ozlabs.org/api/patches/?order=-id&project=linuxppc-dev

   b) Retrieve all email from your last pull up to that patch ID. Bear
      in mind that more email will likely arrive while you are doing
      this - which is why I suggest fetching the patch ID first! Be
      careful with pagination too, as page contents can shift when new
      patches come in. One day we will fix this by adding cursor-based
      pagination, but we haven't done it yet. As such, you probably
      want a different query with the opposite (ascending) ordering,
      something like:

      http://patchwork.ozlabs.org/api/patches/?since=2020-05-01T00%3A00%3A00&project=linuxppc-dev

      (order=id is implied, but it wouldn't hurt to specify it, along
      with an API version, in your final code)
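
A rough Python sketch of steps (a) and (b), under a few assumptions: the
API accepts a per_page parameter and advertises the next page in a Link
header, and the helper names (next_link, http_get_json,
highest_patch_id, patches_between) are invented here rather than being
part of Patchwork. The since value is only a coarse filter, so pick a
timestamp comfortably before your last pull; the patch ID comparison
does the exact work.

```python
import json
import urllib.parse
import urllib.request

# Instance taken from the example URLs above; swap in your own.
API = "http://patchwork.ozlabs.org/api/patches/"


def next_link(link_header):
    """Return the rel="next" URL from a Link header, or None."""
    for part in link_header.split(","):
        if 'rel="next"' in part:
            return part.split(";")[0].strip().strip("<>")
    return None


def http_get_json(url):
    """Fetch one page; return (parsed JSON list, next page URL or None)."""
    with urllib.request.urlopen(url) as resp:
        # Assumption: the next page is advertised in a Link header.
        link = resp.headers.get("Link", "")
        return json.load(resp), next_link(link)


def highest_patch_id(project, get=http_get_json):
    """Step (a): newest patch ID in the project, or None if it is empty."""
    query = urllib.parse.urlencode(
        {"project": project, "order": "-id", "per_page": 1})
    page, _ = get(f"{API}?{query}")
    return page[0]["id"] if page else None


def patches_between(project, last_seen_id, upper_id, since,
                    get=http_get_json):
    """Step (b): yield patches with last_seen_id < id <= upper_id."""
    query = urllib.parse.urlencode(
        {"project": project, "order": "id", "since": since})
    url = f"{API}?{query}"
    seen = set()  # guards against repeats if page boundaries shift
    while url:
        page, url = get(url)
        for patch in page:
            pid = patch["id"]
            if pid > upper_id:
                return  # ascending order: everything after is newer still
            if pid <= last_seen_id or pid in seen:
                continue
            seen.add(pid)
            yield patch
```

The intended use is to call highest_patch_id() first, pass its result
as upper_id to patches_between(), and store upper_id as the new "last
seen" value only once the loop has finished. The get parameter exists
so the logic can be exercised against canned pages without touching a
live instance.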

You can map message ids to patch IDs using the API, which might also
help, e.g.:

 https://patchwork.ozlabs.org/api/patches/?msgid=20200414062102.6798-3-dja@axtens.net&project=patchwork

will return a json array with an object containing the patch ID for that
msgid/project pair if it exists.

The inverse is even simpler:

  https://patchwork.ozlabs.org/api/patches/1270129/
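
For scripting the msgid lookup, a minimal Python sketch might look like
the following. The function names are invented for illustration, and
stripping the angle brackets from the message id is an assumption based
on the example URL above, which has none.

```python
import json
import urllib.parse
import urllib.request

# Instance taken from the example URLs above; swap in your own.
BASE = "https://patchwork.ozlabs.org/api/patches/"


def msgid_query_url(msgid, project):
    """Build the list URL that filters on one msgid/project pair."""
    query = urllib.parse.urlencode(
        {"msgid": msgid.strip("<>"), "project": project})
    return f"{BASE}?{query}"


def patch_id_from_response(patches):
    """Pick the patch ID out of the JSON array, or None if it is empty."""
    return patches[0]["id"] if patches else None


def lookup_patch_id(msgid, project):
    """Ask the instance for the patch ID of a msgid/project pair."""
    with urllib.request.urlopen(msgid_query_url(msgid, project)) as resp:
        return patch_id_from_response(json.load(resp))
```

Note that urlencode percent-encodes the "@" in the message id, which is
equivalent to the literal form shown in the example URL.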

> Rohit, I guess the best thing you can do is to play with a local
> patchwork instance. Convert an existing public inbox back to a mbox and
> split it in the middle. Then, feed the first half to patchwork, and try
> to receive all patches via the API. Then, feed the second half and try
> to receive the rest of the patches. Compare the result of the API (e.g.,
> all Patchwork-IDs) with the database entries of Patchwork to ensure that
> we didn't miss a single mail.

The good news is that local patchwork instances are fairly easy to set
up with docker-compose, and the docs should be reasonable. (And if
they're not we're keen to fix them.)
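
Ralf's split-and-replay test needs an mbox cut in two. A minimal sketch
using Python's standard mailbox module follows; split_mbox and its
arguments are illustrative names, and it assumes the archive has already
been converted from public-inbox to a single mbox file.

```python
import mailbox


def split_mbox(src_path, first_path, second_path):
    """Copy the first half of src into first_path, the rest into second_path.

    Returns (mails in first half, mails in second half). If a target
    file already exists, mails are appended to it.
    """
    src = mailbox.mbox(src_path, create=False)  # fail if src is missing
    first = mailbox.mbox(first_path)
    second = mailbox.mbox(second_path)
    try:
        total = len(src)
        cut = total // 2
        for index, message in enumerate(src):
            target = first if index < cut else second
            target.add(message)
        return cut, total - cut
    finally:
        # close() flushes pending writes to disk
        first.close()
        second.close()
        src.close()
```

Each half can then be fed to the local instance in turn, querying the
API after each step and comparing what comes back with what went in.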

BTW - I'm not sure if this will be helpful or just a distraction, but
just in case: if you want a reasonably big data set already in mbox
format, I recommend the Canonical kernel-team mailing list archive. It's
several hundred megabytes now, but be warned that it does contain a lot
of broken mails (missing message ids and other assorted brokenness) in
the early years of coverage.

One interesting feature of this dataset for your purposes is that it
takes (from memory) hours to import with parsearchive on my SSD-equipped
laptop. That means you could easily test how your scripts perform when
messages are rapidly being added to the database, which will help shake
out any bugs you might have.

You can find it here:
https://lists.ubuntu.com/archives/kernel-team/
Direct link (817 MB):
https://lists.ubuntu.com/archives/kernel-team.mbox/kernel-team.mbox
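
Given the broken mails mentioned above, it may be worth scanning an
archive before importing it, so that any mismatch you see later can be
attributed to the data rather than to your scripts. A small sketch with
the standard mailbox module, where count_missing_message_ids is an
invented helper name:

```python
import mailbox


def count_missing_message_ids(mbox_path):
    """Return (total mails, mails with no usable Message-ID header)."""
    box = mailbox.mbox(mbox_path, create=False)  # fail if file is missing
    total = 0
    missing = 0
    try:
        for message in box:
            total += 1
            # str() also copes with headers that parse as Header objects
            if not str(message.get("Message-ID", "")).strip():
                missing += 1
    finally:
        box.close()
    return total, missing
```

This only checks for an absent or empty Message-ID; other kinds of
brokenness would need their own checks.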

FWIW, I would also be open to a patch that adds a management command or
option to parsearchive for parsing public-inbox format, depending on the
amount of dependencies it would need to pull in. (But please don't stall
your project working on this - it's just a nice-to-have from my point of
view as a patchwork maintainer!) Somewhere back in the mail archives I
think I also pointed Mete to a dodgy shell script I wrote that converts
public-inbox to mbox... but someone has probably written a better one by
now.

Hope this helps, feel free to ask more questions.

Regards,
Daniel

>
> Thanks
>   Ralf