[RFC PATCH v2 0/1] add parserelations management command

Wed Jul 1 17:33:03 AEST 2020

Rohit Sarkar <rohitsarkar5398 at gmail.com> writes:

> On Tue, Jun 30, 2020 at 11:59:52AM +1000, Daniel Axtens wrote:
>> Rohit Sarkar <rohitsarkar5398 at gmail.com> writes:
>> 
>> > Hey,
>> > On Wed, Jun 17, 2020 at 07:55:37AM +1000, Daniel Axtens wrote:
>> >> Rohit Sarkar <rohitsarkar5398 at gmail.com> writes:
>> >> 
>> >> > Hey,
>> >> > On Fri, Jun 12, 2020 at 12:04:54AM +0530, Rohit Sarkar wrote:
>> >> >> The parserelations command will be used to bulk import patch relations into
>> >> >> Patchwork using a "patch groups" file.
>> >> >> The patch groups file is structured such that each line contains a
>> >> >> relation.
>> >> >> Eg patch groups file contents.
>> >> >> 1 3 4
>> >> >> 2
>> >> >> 5 9 10
>> >> >> 
>> >> >> In this case 2 relations will be ingested into Patchwork, (1,3,4) and
>> >> >> (5,9,10). Further group (5,9,10) also points to two upstream commit
>> >> >> references.
>> >> >> Note that before ingesting the relations the existing relations in the
>> >> >> DB are removed.
>> >> >> 
>> >> >> v1..v2
>> >> >> - remove commit references from the patch groups file.
>> >> >> - Update documentation
>> >> >> - rename some variables to remove the overloading.
>> >> >> - use filter and update instead of get and save to reduce the db calls.
>> >> >>   (Visible performance enhancement)
>> >> >> 
>> >> >> Leaving the copyright untouched till Ralf and Lukas comment on how to
>> >> >> proceed.
>> >> >> 
>> >> >> Rohit Sarkar (1):
>> >> >>   management: introduce parserelations command
>> >> >> 
>> >> >>  docs/deployment/management.rst                | 26 +++++++
>> >> >>  .../management/commands/parserelations.py     | 71 +++++++++++++++++++
>> >> >>  patchwork/tests/test_management.py            |  7 ++
>> >> >>  3 files changed, 104 insertions(+)
>> >> >>  create mode 100644 patchwork/management/commands/parserelations.py
>> >> >> 
>> >> >> -- 
>> >> >> 2.23.0.385.gbc12974a89
>> >> >> 
>> >> >
>> >> > Just wanted to follow up on this. Does this look good?
>> >> 
>> >> I was thinking about this as I washed the dishes last night.
>> >> 
>> >> Purging all relations in the database means that if any API-based user
>> >> of relations emerges before the big public servers adopt a release with
>> >> this, then they'll never be able to use it.
>> >> 
>> >> What's the speed impact of doing two passes through the data, with pass
>> >> 1 collecting all the projects that this touches and pass 2 actually
>> >> installing the relations?
>> >> 
>> >> Regards,
>> >> Daniel
>> > I wanted to reopen the discussion on this with the aim being to get
>> > everyone on the same page, and either go with this approach or decide on
>> > an alternate approach.
>> >
>> > The primary goal is to have a "parserelations" command that can parse a
>> > "patch-groups" file that contains a relation on each line. There are 2
>> > approaches:
>> >
>> > a. Refresh the relations table. The parserelations command will remove
>> > the existing relations before inserting the newly parsed relations. This
>> > avoids any conflicts between existing and new relations. However other
>> > users of the API might be affected.
>> >
>> > b. Insert relations in a non destructive way. We need a way to handle
>> > conflicts between existing and incoming relations. So the solution is
>> > not really clear in this case. What should have precedence in the case
>> > of conflicts? The parserelations script or the existing relations.
>> >
>> 
>> 
>> Right, sorry for dropping this, work got a bit crazy.
>> 
>> It occurs to me that I have an embedded assumption that may not be true
>> here. Are you trying to build a set of relations that includes
>> cross-project relations, or are you just trying to build a set of
>> relations entirely or mostly within a single patchwork project?
>> 
>> I've been assuming that your relations are entirely or predominantly
>> intra-project, but I'm starting to suspect that maybe my understanding
>> is not right.
> The relations can be inter project too. PaStA is configured with a list
> of Patchwork projects for which it will pull all the corresponding
> patches and perform it's analyses. PaStA can relate patches across
> projects too.
>
>> If you are expecting a non-trivial quantity of cross-project relations,
>> then my suggested approach earlier (caller of the script specifies the
>> projects that patches may belong to) doesn't make sense. It probably
>> makes most sense to go with approach (a) in that case - it's really
>> difficult to figure out how to do (b) in a way that preserves the
>> meaning that any other API user, or the PaStA tool, is trying to create.
>> 
>> If we do go down route (a), I want to make it really hard for users to
>> accidentally shoot themselves in the foot here and wipe out data. So
>> probably I would say that parserelations should only insert relations if
>> the table is empty when the script begins, and there should be a
>> separate command that with a suitably scary name that purges all
>> existing relations.
>
> I agree that we should make parserelations actions very explicit to
> prevent any loss of data. However, if we use parserelations to
> differentially import relations we will need to be able to insert
> relations with a non empty table. Let me explain the complete flow
> between PaStA and Patchwork so that things are a bit clear.
> We plan to run a script, maybe as a cron job at regular time intervals
> which does the following:
> 1. Exports patches from Patchwork into PaStA
> 2. Runs PaStA's analyses
> 3. Runs the parserelations script to parse the patch-groups file
> generated by PaStA and inserts the relations into Patchwork. In this
> step we refresh the db and insert all the relations afresh. A non
> destructive approach would involve thinking of a way to resolve
> conflicts between existing and new relations.

OK.

I am happy for you to stick with a destructive approach for the
managment command, so long as it's clear to the user that data is
potentially about to be destroyed. I'm happy for that to be:

 - two commands: a command that purges relations, and then a parse
   command that checks if the relations table is empty and refuses to
   run if it is not empty.

 - a single command that requires you to pass a command line option like
   '--really-delete-old-relations'

 - a single command with a name that captures the idea that data will be
   destroyed. I don't know what that would be, maybe 'replace...'.

I'm not fussy which option you go with, I just don't want a command
named 'parse...' to delete data without the user being properly warned.

(You will at some point need to consider how to do incremental updates
over the API for a publicly usable version, but that can definitely wait
and might be out of scope for your GSoC project.)

Regards,
Daniel

>
>> 
>> Have I correctly understood things here?
>> 
>> Regards,
>> Daniel
>> 
> Thanks,
> Rohit