[rfc] Extending Patchwork as a GSoC project

Tue May 12 11:07:47 AEST 2020

Rohit Sarkar <rohitsarkar5398 at gmail.com> writes:

> Hi Daniel,
>
> [snip]
>
>> 
>> >>  - In terms of 'catching up': I think you're asking if Patchwork will
>> >>    let you _export_ all patches since your last pull, rather than asking
>> >>    if patchwork will let you import patches? I think that makes the most
>> >>    sense in context. If that's the case, then the way I would do that
>> >>    is:
>> >> 
>> >>    a) observe the highest patch ID in the project you are tracking, as
>> >>       patch IDs are always increasing. Note that the same cannot be said
>> >>       about dates - patchwork instances, due to the quirks of email,
>> >>       often get mail out-of-order. You probably want something like:
>> >> 
>> >>       http://patchwork.ozlabs.org/api/patches/?order=-id&project=linuxppc-dev
>> >
>> > Excellent! So... If I got everything from [0..100] and JSON reports that
>> > the latest ID is 130, then [101..130] will _definitely_ exist and form
>> > the exact set of patches that I miss?
>> 
>> It's not _quite_ that simple! Both the set [0..100] and the set
>> [101..130] will likely contain patches that do not belong to your
>> project. I suspect you do not want to gather patches for every project!
>> 
>> But if you are following linuxppc, and you have gathered
>> {linuxppc patches with id <= 100},
>> and the latest ID for linuxppc is 130, then I believe
>> {linuxppc patches where 100 < id <= 130}
>> is the exact set of patches you've missed. We don't support sharding of
>> PK space for multi-master writes or anything else that might mess with
>> this.
> This is exactly how I am thinking of going about this, although with a
> slightly different approach mentioned below
>
>> Sadly, we also don't currently support a filter predicate that would
>> allow you to neatly express 'patches with IDs between 100 and 130' in a
>> query, but I'd be happy to consider such a patch.
> This would certainly be the most elegant solution.
>
>> (In the meantime, you can store what page of
>> http://patchwork.ozlabs.org/api/patches/?project=linuxppc-dev contained
>> patch 100 and read all subsequent pages until you hit patch 130. As I
>> alluded to but didn't state clearly, pagination when sorted by
>> increasing ID is stable*.)
> I am storing the highest patch id amongst all the patches that PaStA has
> received. Then I fetch all patches from Patchwork for a particular
> project reverse ordered by id. I read the patches until I reach the
> patch that has patch id same as the highest patch id in PaStA.
> In the worst case I see that I will be fetching an extra page of
> patches. (When the first patch in a page is one that PaStA already has)
>
> Is this an efficient way to go about things? In particular, is fetching
> all patches for a project efficient, considering the response is paged?

In general, the best way to explore what is being asked of the database
is to look at the SQL queries revealed by the django-debug-toolbar when
you query the API.

I think the queries you describe should use SQL LIMIT and OFFSET, so at
least they shouldn't be shipping too much data over the wire, but I
suspect a query generated by an ID filter predicate would be much better.
If you're doing these queries a lot and you want to do them against
public instances one day, you should probably test this out.
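
For reference, the approach you describe (read pages in descending ID
order and stop at the highest ID you already hold) could be sketched
roughly as below. This is only an illustration under assumptions, not
tested against a live instance: the function names are made up, the
`page` query parameter and the 404 for a page past the end are assumed
behaviour of the REST API, and the known-ID check is done client-side
because there is no ID filter yet. Patches are keyed by ID so that a
page shifting under you while new mail arrives doesn't produce
duplicates.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Any, Callable, Dict, List

Patch = Dict[str, Any]

API = "http://patchwork.ozlabs.org/api/patches/"


def fetch_page(page: int, project: str = "linuxppc-dev") -> List[Patch]:
    """Fetch one page of patches, highest ID first (does a network request).

    Assumption: pages are selected with a `page` query parameter, and a
    page past the end is reported as HTTP 404.
    """
    query = urllib.parse.urlencode(
        {"project": project, "order": "-id", "page": page}
    )
    try:
        with urllib.request.urlopen(f"{API}?{query}") as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return []
        raise


def new_patches_since(
    get_page: Callable[[int], List[Patch]], last_seen_id: int
) -> List[Patch]:
    """Return patches with id > last_seen_id, sorted by increasing ID.

    `get_page` is any callable returning one page of patches in
    descending ID order; reading stops at the first already-known ID.
    """
    found: Dict[int, Patch] = {}
    page = 1
    while True:
        patches = get_page(page)
        if not patches:
            break  # ran out of pages without meeting a known patch
        for patch in patches:
            if patch["id"] <= last_seen_id:
                # Everything from here on is already known.
                return [found[i] for i in sorted(found)]
            found[patch["id"]] = patch  # keyed by ID, so repeats collapse
        page += 1
    return [found[i] for i in sorted(found)]
```

Something like new_patches_since(fetch_page, 100) would then read pages
until it meets patch 100 or an older one, which matches your worst case
of one extra page.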

>> >>    b) Retrieve all email from your last pull to that patch ID. Bear in
>> >>       mind that it is likely that more email will arrive while you are
>> >>       doing this - hence why I suggest fetching the patch ID first! Be
>> >
>> > Ack.
>> >
>> >>       careful also of pagination as that can also change if new patches
>> >>       come in. One day we will fix this by adding cursor-based
>> >>       pagination as well but we haven't done it yet. As such you
>> >>       probably want to do this with a different query with the opposite
>> >>       ordering, something like:
>> >> 
>> >>       http://patchwork.ozlabs.org/api/patches/?since=2020-05-01T00%3A00%3A00&project=linuxppc-dev
>> >> 
>> >>       (order=id is implied but wouldn't hurt to specify it, and an API
>> >>       version, in your final code)
>> >> 
> In part b above: doesn't it suffer from the same issue, namely that there
> is no guarantee that patches arrive in the order given by their dates?
> E.g.:
>
> Last pull was at date (timestamp) x. A patch with date y, y < x, arrives
> after my last pull. On my next pull from Patchwork, when I fetch patches
> arriving since date x, I will lose the patch with date y.

Yes, you're quite right: you shouldn't use the date filter. I only
realised this a couple of days after sending it and didn't get to
fixing my mistake before you picked up on it!
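
To make the failure concrete, here is a tiny sketch with made-up IDs and
dates (none of them come from a real instance). Patch 101 carries a date
earlier than the last pull but was received after it, so a date filter
skips it while an ID comparison does not:

```python
from datetime import datetime

# Hypothetical data: IDs follow arrival order, dates come from the mail.
patches = [
    {"id": 100, "date": datetime(2020, 5, 1, 9, 0)},
    {"id": 101, "date": datetime(2020, 4, 30, 23, 0)},  # delayed mail
    {"id": 102, "date": datetime(2020, 5, 1, 12, 0)},
]

last_pull_time = datetime(2020, 5, 1, 10, 0)  # "x" in your example
last_seen_id = 100                            # highest ID held at that pull

# Filtering on the date misses the delayed patch...
by_date = [p["id"] for p in patches if p["date"] >= last_pull_time]

# ...while filtering on the ID picks up everything received since.
by_id = [p["id"] for p in patches if p["id"] > last_seen_id]

print(by_date)
print(by_id)
```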

Regards,
Daniel

>
> Thanks,
> Rohit