[RFC 0/2 REBASE] Rework tagging infrastructure

Wed Apr 18 01:35:15 AEST 2018

----- Original Message -----
> From: "Stephen Finucane" <stephen at that.guru>
> To: "Veronika Kabatova" <vkabatov at redhat.com>
> Cc: patchwork at lists.ozlabs.org
> Sent: Tuesday, April 17, 2018 5:17:19 PM
> Subject: Re: [RFC 0/2 REBASE] Rework tagging infrastructure
> 
> On Tue, 2018-04-17 at 09:50 -0400, Veronika Kabatova wrote:
> > ----- Original Message -----
> > > From: "Veronika Kabatova" <vkabatov at redhat.com>
> > > To: "Stephen Finucane" <stephen at that.guru>
> > > Cc: patchwork at lists.ozlabs.org
> > > Sent: Monday, April 16, 2018 3:06:34 PM
> > > Subject: Re: [RFC 0/2 REBASE] Rework tagging infrastructure
> > > 
> > > ----- Original Message -----
> > > > From: "Stephen Finucane" <stephen at that.guru>
> > > > To: vkabatov at redhat.com, patchwork at lists.ozlabs.org
> > > > Sent: Monday, April 16, 2018 12:49:03 PM
> > > > Subject: Re: [RFC 0/2 REBASE] Rework tagging infrastructure
> > > > 
> > > > On Thu, 2018-04-12 at 17:24 +0200, vkabatov at redhat.com wrote:
> > > > > From: Veronika Kabatova <vkabatov at redhat.com>
> > > > > 
> > > > > TL;DR:
> > > > > Our goals:
> > > > > - Avoid tag reextraction with each added comment
> > > > > - Fix issues #57 and #113
> > > > > - Prepare tags addition to comments in the API
> > > > > - Add tags to patch (currently returns {}) and cover letter APIs
> > > > > 
> > > > > If you agree with the general idea, I'd like to get some help with DB
> > > > > query
> > > > > optimizations where determined needed. Tags are more distributed, so
> > > > > counting
> > > > > them in PatchQuerySet is harder. I wanted to use annotation with
> > > > > Django's
> > > > > conditional selection, but it doesn't seem to work very well and
> > > > > generated
> > > > > invalid SQL in my attempts (hence my solution with counting the tags
> > > > > on
> > > > > the
> > > > > fly
> > > > > only for the patches that are being displayed). That said, I'm not a
> > > > > DB
> > > > > expert
> > > > > so any more optimized solutions are more than welcome.
> > > > > 
> > > > > 
> > > > > Thoughts? Ideas?
> > > > 
> > > > Took at look at this. As things stand, I'm not sure how we could
> > > > improve the performance of this substantially. The problem is that
> > > > we're trying to treat these as both a generic thing (cover letters,
> > > > patches _and_ comments can all have associated tags) but also somewhat
> > > > interrelated (a patch need to know about the tags attached to its
> > > > series' cover letter and its comments). This leads to the kind of
> > > > convoluted queries which impact your performance.
> > > > 
> > > > Let's take a look at the things we need to solve the goals you
> > > > highlighted above.
> > > > 
> > > >  * Keep track of where we found of the tag (patch, follow-up comment,
> > > >    series/cover letter)
> > > >     * This would resolve our tag re-extraction issue and potentially
> > > >       offer a path to resolving #113 (A/R/T on cover letter should
> > > >       increment the counters of every patch)
> > > > 
> > > >  * Keep track of the value-side of the equation (i.e. an email for
> > > >    something like 'Reviewed-by:')
> > > >     * This would solve #57 (Tags should be stored on a per-Person
> > > >       basis)
> > > > 
> > > > As noted above, we have a clash between specificness and genericness of
> > > > this field. I think we should double down on the former, even if it
> > > > means some level of duplication. To this end, what about linking 'Tag'
> > > > to 'Patch' via a ManyToMany field and use a "through" table? If we do
> > > > this, we can track optional 'cover' and 'comment' attributes for
> > > > identifying the source of the tag, if it wasn't included in the patch
> > > > itself. We could also store a value here. This would result in some
> > > > duplication (e.g. for a 10 series patch, a reply to the cover letter
> > > > would generate 10 PatchTag entries with series set for each) but it
> > > > would be simple to query. Does this make sense to you?
> > > > 
> > > 
> > > Hi,
> > > 
> > > I need to take a proper look later but for the first look, this solution
> > > might work out. I'll reimplement it and see if it works well for us and
> > > either send v2 for the RFC or let you know if I find any caveats.
> > > 
> > 
> > So I finally started digging into this alternative solution and run into
> > an issue with tags extracted from cover letters (we - and maybe others do
> > that too - often put eg. links to issue trackers in the cover letter
> > instead of putting them into each patch).
> > 
> > Usually, the cover letter is received before the patches from series. This
> > means if we only had the link between Tag and Patch (with the additional
> > information in the intermediate model), there will be no place to save tags
> > from cover letter at the time of parsing.
> 
> Oh, very good point. I'd missed that.
> 
> > Unless I'm overlooking something, we'd need to have the link from Tag to
> > both Patch and CoverLetter. This should still have much better performance
> > than my original solution (and will get rid of the duplication of yours).
> > 
> > Does this proposal make sense, or am I missing something?
> 
> That mostly makes sense. My main concern is what happens when you want
> to show tags for a patch when those tags were created again the cover
> letter. If that's the case, are we going to have to query on
> 'patch.series.cover_letter.tags'? I imagine that's going to be slow
> (lots of JOINs). We could store it on the series instead, but I'm not
> sure how much that would improve things. Any ideas how to work around
> this?
> 

I was thinking about filtering on the SubmissionTag (or whatever the
intermediate model will be named) based on submission IDs of the patch
and cover (or comment IDs in case of comments API), instead of going
through the relations. That said, my database knowledge is very...
abstract... so I have no idea how much it helps with the underlying
queries.

If you (or whoever else) can offer any insight that would be great!
Veronika

> Stephen
>