[RFC 0/2 REBASE] Rework tagging infrastructure

Mon Apr 16 23:06:34 AEST 2018

----- Original Message -----
> From: "Stephen Finucane" <stephen at that.guru>
> To: vkabatov at redhat.com, patchwork at lists.ozlabs.org
> Sent: Monday, April 16, 2018 12:49:03 PM
> Subject: Re: [RFC 0/2 REBASE] Rework tagging infrastructure
> 
> On Thu, 2018-04-12 at 17:24 +0200, vkabatov at redhat.com wrote:
> > From: Veronika Kabatova <vkabatov at redhat.com>
> > 
> > (TL;DR at the end)
> > 
> > This RFC describes an approach to rework tagging. It attempts to solve
> > GitHub
> > issues #57 [1] and #113 [2] as well as some other things we encountered.
> > I'm
> > sending the incomplete version (eg I haven't fixed the tests) to discuss
> > the
> > approach first.
> > 
> > Right now, tags are extracted from all the comments on the patch and the
> > patch
> > itself, and they are reextracted from all the sources every time a comment
> > is
> > added or removed. It makes saving slower, and might contribute to races
> > with
> > writes to database when we are parsing multiple emails at the same time.
> > This
> > gets even more prevalent if we want to solve the issue #113 (tags on cover
> > letter should increment counters on every patch in series) -- for each
> > added
> > comment on the cover letter, we would reextract tags from all the other
> > sources,
> > for each patch in series; and for a change on comments related to the patch
> > directly, we would need to take the tags on the cover letter and it's
> > comments
> > into account as well (I implemented this solution in my fork but I really
> > don't like it).
> > 
> > The current approach has several other issues as well, some of which are
> > mentioned in the issue #57, eg duplicate tags are counted more times.
> > Taking
> > into account the tags on cover letters would also be easier if we could
> > store
> > tags against them and just query them on demand, instead of reextracting
> > everything all over again.
> > 
> > Solutions for some other things we found missing solve the issues mentioned
> > above too. If we want to determine if the tag is duplicate, we need to save
> > the
> > associated value. Having the value would help us to use arbitrary strings
> > as
> > tags (for example links to issue trackers, like `Bugzilla: <link>` if the
> > patch
> > solves a known bug). The key-values approach to storing tags is mentioned
> > [3],
> > this email additionally mentions a comments REST API (currently worked on).
> > For
> > the comments API we would also find it very useful to have the tags
> > extracted
> > from the comments available directly so we can query for them, which means
> > we
> > would either need to reextract the tags on every API call, or we could
> > store
> > the tags against the comments as they are extracted and only query them as
> > needed. Keeping tags related to comments where they originated from avoids
> > the
> > need to reextract tags with addition of new comments, as well as gets rid
> > of
> > the `patch_responses` property used when converting comments to mbox (we
> > finally get all the custom tags there instead of only the few hardcoded
> > ones).
> > 
> > TL;DR:
> > Our goals:
> > - Avoid tag reextraction with each added comment
> > - Fix issues #57 and #113
> > - Prepare tags addition to comments in the API
> > - Add tags to patch (currently returns {}) and cover letter APIs
> > 
> > If you agree with the general idea, I'd like to get some help with DB query
> > optimizations where determined needed. Tags are more distributed, so
> > counting
> > them in PatchQuerySet is harder. I wanted to use annotation with Django's
> > conditional selection, but it doesn't seem to work very well and generated
> > invalid SQL in my attempts (hence my solution with counting the tags on the
> > fly
> > only for the patches that are being displayed). That said, I'm not a DB
> > expert
> > so any more optimized solutions are more than welcome.
> > 
> > 
> > Thoughts? Ideas?
> 
> Took at look at this. As things stand, I'm not sure how we could
> improve the performance of this substantially. The problem is that
> we're trying to treat these as both a generic thing (cover letters,
> patches _and_ comments can all have associated tags) but also somewhat
> interrelated (a patch need to know about the tags attached to its
> series' cover letter and its comments). This leads to the kind of
> convoluted queries which impact your performance.
> 
> Let's take a look at the things we need to solve the goals you
> highlighted above.
> 
>  * Keep track of where we found of the tag (patch, follow-up comment,
>    series/cover letter)
>     * This would resolve our tag re-extraction issue and potentially
>       offer a path to resolving #113 (A/R/T on cover letter should
>       increment the counters of every patch)
> 
>  * Keep track of the value-side of the equation (i.e. an email for
>    something like 'Reviewed-by:')
>     * This would solve #57 (Tags should be stored on a per-Person
>       basis)
> 
> As noted above, we have a clash between specificness and genericness of
> this field. I think we should double down on the former, even if it
> means some level of duplication. To this end, what about linking 'Tag'
> to 'Patch' via a ManyToMany field and use a "through" table? If we do
> this, we can track optional 'cover' and 'comment' attributes for
> identifying the source of the tag, if it wasn't included in the patch
> itself. We could also store a value here. This would result in some
> duplication (e.g. for a 10 series patch, a reply to the cover letter
> would generate 10 PatchTag entries with series set for each) but it
> would be simple to query. Does this make sense to you?
> 

Hi,

I need to take a proper look later but for the first look, this solution
might work out. I'll reimplement it and see if it works well for us and
either send v2 for the RFC or let you know if I find any caveats.

Thanks,
Veronika

> Stephen
> 
> > 
> > [1] https://github.com/getpatchwork/patchwork/issues/57
> > [2] https://github.com/getpatchwork/patchwork/issues/113
> > [3] https://lists.ozlabs.org/pipermail/patchwork/2018-January/004741.html
> > 
> > Veronika Kabatova (2):
> >   Rework tagging infrastructure
> >   Add migration for tagging changes
>