[rfc] Extending Patchwork as a GSoC project

Ralf Ramsauer ralf.ramsauer at oth-regensburg.de
Fri May 8 00:04:56 AEST 2020


Hi Daniel,

[snip]

> Just in terms of data model for patchwork, a uniqueness constraint for
> patches is the pair (message id, project). So you can have one email
> received and tracked by two separate projects - and with two different
> sets of status in each - but you cannot have 2 mails with the same
> message id in the same project. Patches also have a unique ID which is
> used in the API.

While working on PaStA, I had to learn that Message-IDs are not unique.
A mail that is sent to multiple lists will have the same Message-ID on
every list, but the bodies will differ (e.g., different footers added by
list servers). And users sometimes re-use Message-IDs. I have observed
all kinds of fancy corner cases.

We still use the Message-ID as the identifier in PaStA, but each
Message-ID maps to a list of emails rather than to a single one.

In PaStA, we can load several mailing lists at once, but we can also do
per-list analyses.
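
To illustrate the idea (this is not PaStA's actual code, just a minimal
sketch): mails are grouped under their Message-ID, and every entry
remembers which list it came from. The function name, the list names and
the sample mails below are made up for the example.

```python
from collections import defaultdict
from email import message_from_string


def index_by_message_id(raw_mails):
    """Group (list_name, raw_mail) pairs by Message-ID.

    One Message-ID may map to several mails, e.g. one copy per list.
    """
    index = defaultdict(list)
    for list_name, raw in raw_mails:
        msg = message_from_string(raw)
        msg_id = msg.get("Message-ID", "").strip()
        index[msg_id].append((list_name, msg))
    return index


# Hypothetical data: the same patch sent to two lists; the second list
# server appended a footer, so the bodies differ.
mails = [
    ("list-a", "Message-ID: <1@example.org>\nSubject: [PATCH] foo\n\nbody\n"),
    ("list-b", "Message-ID: <1@example.org>\nSubject: [PATCH] foo\n\nbody\n-- \nlist-b footer\n"),
]
index = index_by_message_id(mails)

# Per-list analysis: keep only the copies that arrived via one list.
only_list_a = [m for name, m in index["<1@example.org>"] if name == "list-a"]
```

Keying on the Message-ID alone keeps the multi-list view simple, while
the stored list name still allows filtering for per-list analyses.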

> 
> Message IDs and patch IDs should also be stable/immutable. Message IDs,
> being a property of _mails_, will be the same across different patchwork
> instances that consume the same mail. Patch IDs, being a property of the
> specific database that ingested the patch, will vary from patchwork
> instance to patchwork instance.

I didn't fully understand. Let's say you have a patchwork instance with
two configured projects. Each project is assigned to one list. You send
a patch to both lists. Both mails will have the same Message-ID, while
the list servers treat the mail a bit differently and modify it.

Will those mails be assigned two different IDs in patchwork, or will the
ID be the same in both projects?

BTW, this is why we can't simply use mailing list data from public
inboxes as our data source and provide results to a patchwork instance
running on a different data source.

> 
>> Daniel, I have in mind that there is already some kind of infrastructure
>> in patchwork for receiving raw patches... AFAIR, Mete implemented an
>> export routine that eases the first initial import. Is there a
>> possibility to reliably "receive all new patches since my last pull"?
> 
> I struggle a little bit to follow the who's importing and exporting from
> whom, but:
> 
>  - There is now code to extract patches in one go from a patchwork
>    instance. I'd caution you that there are gigabytes of patches in the
>    databases of production instances going back over a decade, so you
>    might find that a challenging data set to acquire and work with.

Sounds like this could be very useful for an initial import.

> 
>  - In terms of 'catching up': I think you're asking if Patchwork will
>    let you _export_ all patches since your last pull, rather than asking
>    if patchwork will let you import patches? I think that makes the most
>    sense in context. If that's the case, then the way I would do that
>    is:
> 
>    a) observe the highest patch ID in the project you are tracking, as
>       patch IDs are always increasing. Note that the same cannot be said
>       about dates - patchwork instances, due to the quirks of email,
>       often get mail out-of-order. You probably want something like:
> 
>       http://patchwork.ozlabs.org/api/patches/?order=-id&project=linuxppc-dev

Excellent! So... If I have everything from [0..100] and the JSON reports
that the latest ID is 130, then [101..130] will _definitely_ exist and
form the exact set of patches that I am missing?

> 
>    b) Retrieve all email from your last pull to that patch ID. Bear in
>       mind that it is likely that more email will arrive while you are
>       doing this - hence why I suggest fetching the patch ID first! Be

Ack.

>       careful also of pagination as that can also change if new patches
>       come in. One day we will fix this by adding cursor-based
>       pagination as well but we haven't done it yet. As such you
>       probably want to do this with a different query with the opposite
>       ordering, something like:
> 
>       http://patchwork.ozlabs.org/api/patches/?since=2020-05-01T00%3A00%3A00&project=linuxppc-dev
> 
>       (order=id is implied but wouldn't hurt to specify it, and an API
>       version, in your final code)
> 
> You can map message ids to patch IDs using the API, which might also
> help, e.g.:
> 
>  https://patchwork.ozlabs.org/api/patches/?msgid=20200414062102.6798-3-dja@axtens.net&project=patchwork
> 
> will return a json array with an object containing the patch ID for that
> msgid/project pair if it exists.
> 
> The inverse is even simpler:
> 
>   https://patchwork.ozlabs.org/api/patches/1270129/

I see, thanks.
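
For my own notes, a minimal sketch of steps a) and b) as I understand
them. Assumptions on my side: every object in the returned JSON array
carries the patch ID in an "id" field, and `fetch_json` is a
hypothetical helper that takes a URL and returns the decoded list with
all pages already merged (pagination and the API version are left out
here). The sketch filters by ID instead of assuming that every ID in the
range exists.

```python
from urllib.parse import urlencode

API = "http://patchwork.ozlabs.org/api/patches/"


def patches_url(**params):
    """Build a query URL for the patches endpoint."""
    return API + "?" + urlencode(params)


def missing_patches(fetch_json, project, last_seen_id, since):
    """Return patches with last_seen_id < id <= highest ID seen in step a)."""
    # a) Pin the upper bound first, since more mail may arrive meanwhile.
    newest = fetch_json(patches_url(order="-id", project=project))
    if not newest:
        return []
    high = newest[0]["id"]

    # b) Fetch in ascending ID order and keep only the gap.
    candidates = fetch_json(patches_url(since=since, order="id", project=project))
    return [p for p in candidates if last_seen_id < p["id"] <= high]
```

With a real HTTP client behind `fetch_json`, a call would look like
`missing_patches(fetch_json, "linuxppc-dev", 100, "2020-05-01T00:00:00")`.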

Thanks
  Ralf

