PaReD: a patch relations detector for patchwork

Mon Sep 6 13:21:41 AEST 2021

Hi Lukas,

>> It currently detects mails with identical subjects (after prefixes are
>> removed) within a 180 day window. This is not a very sophisticated
>> matching system, but given that it's an API client and not in the core,
>> I'm much happier to experiment and build up sophistication as and when
>> it's needed.
>>
>
> Simplicity is certainly valuable.
>
> Ralf and I envisioned the much more sophisticated algorithm for
> similar patch detection (from pasta, https://github.com/lfd/PaStA)
> integrated into a workflow with patchwork.

Yes, PaStA was definitely something I had in mind when I wrote
this. The capitalisation in PaReD is a small homage to PaStA.

I have also been pointed at this abandoned attempt to get similar
functionality into Gerrit: https://gerrit-review.googlesource.com/c/gerrit/+/91253

I am very open to moving in that direction if it turns out that more
detection at that level of sophistication is required to get acceptable
accuracy.
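
For readers who want a concrete picture of the simple heuristic quoted
above (identical subjects after prefix removal, within a 180 day window),
here is a minimal sketch. It is not the code from pw-pared: the names
normalise_subject and find_related, the prefix regex, and the choice to
chain each patch to the previous one with the same subject are all
assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta

# The 180 day window comes from the description above.
WINDOW = timedelta(days=180)

# Assumed notion of "prefix": leading "Re:"/"Fwd:" markers and bracketed
# tags such as "[PATCH v2 1/3]" or "[RFC]".
PREFIX_RE = re.compile(r"^\s*((re|fwd?):\s*|\[[^\]]*\]\s*)+", re.IGNORECASE)


def normalise_subject(subject):
    """Strip leading prefixes so different revisions compare equal."""
    return PREFIX_RE.sub("", subject).strip()


def find_related(patches):
    """Group patches that share a normalised subject within WINDOW.

    patches is an iterable of (patch_id, subject, date) tuples
    (a hypothetical shape, not the Patchwork API representation).
    Returns a list of sets of patch ids, one set per proposed relation.
    """
    by_subject = {}
    for pid, subject, date in patches:
        key = normalise_subject(subject)
        by_subject.setdefault(key, []).append((date, pid))

    groups = []
    for entries in by_subject.values():
        entries.sort()
        current = [entries[0]]
        for entry in entries[1:]:
            # Assumption: a patch joins the group if it arrived within
            # WINDOW of the previous patch with the same subject.
            if entry[0] - current[-1][0] <= WINDOW:
                current.append(entry)
            else:
                if len(current) > 1:
                    groups.append({pid for _, pid in current})
                current = [entry]
        if len(current) > 1:
            groups.append({pid for _, pid in current})
    return groups
```

A real client would then submit each group to the patchwork instance
through its API; that step is left out here.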

> Daniel, you have seen the small steps we have taken:
>
> - Mete (an intern at BMW, my employer at the time) implemented the
> "related patches" feature for patchwork in 2019.
> - Rohit (a Google Summer of Code student in 2020, mentored by Ralf and
> me) implemented an "export, compute, import" toolchain between
> patchwork and pasta, some more details are described in
> https://github.com/lfd/PaStA/blob/master/documentation/pasta-patchwork.md.
>
> Unfortunately, IMHO, we hit two challenging implementation tasks with this work:
> 1. Performance issues when computing relations with pasta
> 2. The lack of being able to limit the computation to new incoming
> patches: pasta was designed as a run-once off-line analysis tool, not
> as a continuously running online analysis; changing that is possible,
> but touches on various internal aspects throughout the whole tool.
>
> Since that point, we have not continued the work and I personally
> believe that exploring simpler solutions than the complex pasta
> heuristics is worth a try (even if just to save power consumption of
> servers in the long run...).
>
> For completeness, I need to mention that Konstantin's b4 tool also
> detects the "latest patch series" when you ask it to pick a patch
> series from a kernel mailing list. I do not know how it determines
> that (and I hope that Konstantin can comment here), but it is probably
> also a simple heuristic searching for similar/same subject lines of
> the patch series cover letter. It would be nice if that functionality
> could be invoked as some kind of library function/separate client tool
> for patchwork as well.
>
> I hope that others can also come up with simple PaReD variants, such
> as parsing lore.kernel.org Links in the 'patch comment section' (so
> below the "---"), as once named the best way for developers to refer
> to previous versions in a ksummit-discuss email thread. I always hope
> that once a tool provides a significant benefit for tracking and
> managing previous versions, more developers pick up the needed
> conventions that patches would need to follow to benefit from such a
> tool.

I hope so too! I have been pleased by the proliferation of checks across
kernel.org's patchwork; I hope this will be the next thing to spread!
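
As an illustration of the link-parsing variant suggested above, the
following sketch pulls lore.kernel.org links out of the patch comment
section. It assumes the usual git format-patch layout, where a line
containing only "---" separates the commit message from the comments and
the diff starts at the first "diff --git" line. The function name and
regex are hypothetical, and this is not part of pw-pared.

```python
import re

# Assumed shape of a lore link; stops at whitespace or a closing bracket.
LORE_RE = re.compile(r"https?://lore\.kernel\.org/[^\s>)\]]+")


def lore_links_in_comment_section(mail_body):
    """Return lore.kernel.org links found between "---" and the diff."""
    lines = mail_body.splitlines()
    try:
        # The first line that is exactly "---" ends the commit message.
        start = lines.index("---") + 1
    except ValueError:
        return []  # no comment section in this mail

    links = []
    for line in lines[start:]:
        if line.startswith("diff --git "):
            break  # the comment section ends where the diff begins
        links.extend(LORE_RE.findall(line))
    return links
```

Each link found this way names an earlier posting, which a client could
then try to map back to a patch known to the patchwork instance.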

>> You can get the code at https://github.com/daxtens/pw-pared . I'm using
>> the same license as Patchwork, for a number of reasons, but in part
>> because we may one day want to migrate the functionality into the
>> patchwork core. Patches are welcome.
>>
>> You can see some examples of where PaReD has set up meaningful relations
>> at:
>>
>>  - https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210802073929.907431-2-kjain@linux.ibm.com/
>>  - https://patchwork.ozlabs.org/project/patchwork/patch/20210823182833.3976100-6-raxel@google.com/
>>
>> Some very obvious things that doing this has exposed:
>>
>>  - the relations display should show the status of each related patch
>>    (e.g. New, Superseded, Accepted)
>>
>>  - Series relations would make a lot of sense - probably even more sense
>>    from a human point of view - and we should probably build those at
>>    some point.
>>
>
> Agree. This is something Ralf, Mete, Rohit and I discussed as well.
>
> Extending a patch relation to a patch series relation is conceptually simple:
>
> If two patch series S1 and S2 with patches p1, ..., pn in series S1
> and patches r1, ..., rm in series S2 share a critical amount of
> related patches, i.e., for a large set of pairs of indices (i, j) in
> I: pi and rj are related to each other, then the series S1 and S2 are
> related to each other. Further, one could come up with a separate
> similarity relation among cover letters, and weigh that into the
> measure for related patch series. Fine-tune the weights and
> thresholds, evaluate it on a representative dataset and you are
> done... Conceptually clear, but this involves quite some work.

As with patch relations, I think we'd want to start with the
infrastructure and API --- although, having learned from the experience
with patch relations, I think we'd also want to release a tool that
performs basic detection of series relations at the same time!
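
To make the series rule quoted above a little more concrete, here is one
possible sketch of "share a critical amount of related patches". The
score (the fraction of patches in either series that have a related
patch in the other) and the default threshold of 0.5 are assumptions
picked for illustration; the cover letter similarity term is omitted.
patch_related stands for whatever patch-level relation is available.

```python
def series_related(s1, s2, patch_related, threshold=0.5):
    """Decide whether two series are related via their patches.

    s1, s2: lists of patch identifiers.
    patch_related: callable (p, r) -> bool, the patch-level relation.
    threshold: assumed cut-off; it would need tuning on real data.
    """
    if not s1 or not s2:
        return False

    # Patches in each series with at least one related patch in the other.
    matched1 = {p for p in s1 if any(patch_related(p, r) for r in s2)}
    matched2 = {r for r in s2 if any(patch_related(p, r) for p in s1)}

    score = (len(matched1) + len(matched2)) / (len(s1) + len(s2))
    return score >= threshold
```

Counting matches from both sides means a series that gained or dropped a
few patches between revisions can still be recognised as related.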

>>  - PaReD requires an API token for a maintainer account (much like for
>>    pushing checks) which is annoying and one day we should sort out
>>    fine-grained permissions.
>>
>> Ask your patchwork instance admin if a maintainer account for PaReD is
>> right for you!
>>
>
> I am looking forward to more implementations and more instances
> running and trying out this feature.
>
> Daniel, thanks for moving this feature yet a step further.
>

Thanks for the long-running effort you have coordinated to land patch
relations as a feature in Patchwork and extend their capabilities!

Kind regards,
Daniel