Patch stack analysis

Tue Jun 4 14:30:46 AEST 2019

Hi Daniel,

> 
> > we (Ralf Ramsauer, Lukas Bulwahn and me) are currently working on
> > extending the capabilities of Patchwork by combining it with a tool
> > called PaStA [1] (Patch Stack Analysis). PaStA is the outcome of a
> > research project [2] by the Technical University of Applied Sciences
> > Regensburg. It analyses and compares all mails in a mailing list to
> > find related ones (e.g. former versions of the patch, see [3]). Ralf
> > compared PaStA's results for the Linux kernel mailing list with a
> > manually created ground truth and achieved an accuracy of 91%. This
> > motivated us to integrate PaStA into Patchwork.
> 
> Cool, always interesting to see what people build on top of Patchwork!
> 

I hope we can nicely integrate that into what is already there.

> 
> One bit of relevant Patchwork history: there's a long-running fork run by
> the freedesktop.org people: patchwork.freedesktop.org,
> https://gitlab.freedesktop.org/patchwork-fdo/patchwork-fdo/ . They took a
> different approach to series than we did: we focused on patches as the key
> 'unit' of patchworking, they focused on series as the key unit. They already
> have some support for multiple revisions of a series. I don't know how
> they've implemented their feature for detecting multiple revisions, but I'm
> guessing it's not based on analysis of (commit message, diff) tuples. There's
> an example here:
> https://patchwork.freedesktop.org/series/49692/
> 

Yes, we will certainly have a look at what they implemented and consider incorporating the good ideas they had.

> > Showing related patches (beside ones in the current series) allows
> > developers to understand the patch's evolution better. We have
> > adjusted the patch details view and renamed the series patch links
> > from "related" to "series". Our new related row shows the patches
> > related to each other by PaStA [3][4]. The relations between the
> > patches in the screenshot were made manually and the next steps will
> > be to automate this procedure with PaStA.
> 
> I'm really wary about incorporating something with so many dependencies
> (and with presumably higher resource usage) into the core of patchwork.
> 

Agreed. That is also our main concern: we would like to set this up so that the use of PaStA is optional and has little impact on the main application, e.g., nothing beyond exporting some REST API. We also want Patchwork and PaStA to be able to run on two different machines, with a clear, loosely coupled interface between them. How to achieve this step by step is what we are currently discussing.

> I'd want to know a few things:
> 
>  - what is the accuracy of the FDO Patchwork approach (which I assume is
>    100% metadata based)? Does it require that patch submitters do
>    particular things (e.g. use the same cover letter title)? Sometimes
>    we can train users to be helpful in how they submit things to the
>    lists in order to have them work properly in more simple systems.
> 
>  - one key use case is the Linux kernel, where we have stable trees, and
>    patches getting picked up for those trees. Sometimes those patches
>    are identical and sometimes they need backporting. Some care would
>    need to be taken around this.
> 
>    An example would be:
>     - I send this patch to the mailing list:
> http://patchwork.ozlabs.org/patch/1099934/
>     - It is merged into mainline
>     - It is proposed for stable trees. This involves multiple threads of
>       over 100 emails each, including:
>       * https://lkml.org/lkml/2019/5/29/1655
>       * https://lkml.org/lkml/2019/5/30/361
>       * (plus 3 others)
> 
>    In this case, the original patch is related to the stable patches,
>    (despite being sent by someone different), and it is interesting and
>    useful to know what stable series a patch landed in. However, the
>    patch is not really related to the entire stable patch _series_, and
>    if you include all the hundreds of patches in your 'related' view in
>    [3], you will drown out all the potentially useful signal in a bunch
>    of noise.
> 
>    It does get more complicated than this too, for example when there is
>    a need to backport a patch for stable. (See
>    e.g. http://patchwork.ozlabs.org/patch/1109024/ and friends)
> 

I agree. We will need to identify stable patches.
We already have multiple good indicators that we will investigate:

- date of upstream inclusion, i.e., when the patch was finally included in the main repository (the date of Linus' merge commit for the patch)
- date of the stable patch email
- sender of the stable patch email, e.g., usually Greg KH or Sasha Levin in Linux kernel development
- is a particular email address CC-ed?
- does the commit message contain some specific string?

All these points are mostly project-specific, so we will need to work out how to make them configurable so that the approach fits all projects and reaches good precision and recall.
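
To make the idea a bit more concrete, here is a rough Python sketch of how the indicators listed above could be combined under a per-project configuration. Everything in it is hypothetical: the names StableConfig, Mail, stable_indicators and looks_like_stable_patch, the threshold of two matching indicators, and the example.org addresses are assumptions for illustration only, not anything PaStA or Patchwork provides today.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class StableConfig:
    """Per-project settings (hypothetical; every value is project-specific)."""
    stable_senders: frozenset = frozenset()  # addresses that usually send stable patches
    stable_cc: frozenset = frozenset()       # addresses whose presence in CC hints at stable
    marker_strings: tuple = ()               # strings expected in stable commit messages
    min_indicators: int = 2                  # assumed threshold, to be tuned per project


@dataclass(frozen=True)
class Mail:
    """Minimal view of a patch email (hypothetical structure)."""
    sender: str
    cc: List[str]
    date: datetime
    body: str


def stable_indicators(mail: Mail, upstream_date: Optional[datetime],
                      cfg: StableConfig) -> List[str]:
    """Return the names of all indicators that fire for this mail."""
    hits = []
    # A stable submission is sent after the patch was included upstream.
    if upstream_date is not None and mail.date > upstream_date:
        hits.append("sent_after_upstream_inclusion")
    if mail.sender.lower() in cfg.stable_senders:
        hits.append("known_stable_sender")
    if any(addr.lower() in cfg.stable_cc for addr in mail.cc):
        hits.append("stable_address_in_cc")
    if any(marker in mail.body for marker in cfg.marker_strings):
        hits.append("marker_string_in_message")
    return hits


def looks_like_stable_patch(mail: Mail, upstream_date: Optional[datetime],
                            cfg: StableConfig) -> bool:
    """Classify a mail as a stable patch if enough indicators agree."""
    return len(stable_indicators(mail, upstream_date, cfg)) >= cfg.min_indicators


if __name__ == "__main__":
    cfg = StableConfig(
        stable_senders=frozenset({"stable-maintainer@example.org"}),
        stable_cc=frozenset({"stable@example.org"}),
        marker_strings=("[ Upstream commit",),
    )
    mail = Mail(
        sender="stable-maintainer@example.org",
        cc=["stable@example.org"],
        date=datetime(2019, 5, 29),
        body="[ Upstream commit abc123 ]\n\nFix something.",
    )
    print(stable_indicators(mail, datetime(2019, 5, 20), cfg))
```

Returning the list of matching indicators, rather than only a yes/no answer, would let us measure precision and recall per indicator when we tune the configuration for a given project.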

>  - what's the resource usage, and how long does matching take?
>    kernel.org has a patchwork instance that is hooked up to LKML, so
>    this is a deeply practical concern for them!
>

As part of the research, we run analyses over all >200 Linux kernel mailing lists, including LKML, and we are well aware of the memory and CPU usage this requires. Our current analyses are infrequent one-time runs over months and years of patch submissions, and they take quite some computing power and memory. We hope that if the analysis is set up to run continuously, e.g., always triggered by Patchwork, we can analyse single messages within seconds. Figuring out the best solution here will still require quite some work on our side.

> I think a really good place to start would be to hook PaStA up as an API
> consumer like Snowpatch. It wouldn't be able to report the results back to
> patchwork just yet, but you'd be able to try it with live data and demonstrate
> its value.
> 

We will look into that interface and keep you in the loop on our design decisions and on where we see the need for dedicated interfaces to provide this patch-matching feature to Patchwork users.

I hope we can soon show a first prototype that hooks into Patchwork and has some first way to report data back to it, extending Patchwork on our local GitHub fork. Then we can discuss how to integrate this nicely back into Patchwork's main development.
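
As a starting point for the API-consumer approach suggested above, a read-only client could look roughly like the Python sketch below. It assumes the Patchwork REST API exposes a patch list at /api/patches/ that accepts "project" and "since" filters and returns JSON objects with "id", "name" and "mbox" fields; that is our current understanding and still needs to be checked against the deployed API version. The function names, the host patchwork.example.org and the project name are hypothetical.

```python
import json
from typing import Dict, List, Optional
from urllib.parse import urlencode
from urllib.request import urlopen


def patches_url(base_url: str, project: str, since: Optional[str] = None) -> str:
    """Build the patch-list URL (endpoint and filter names are assumptions)."""
    params = {"project": project}
    if since is not None:
        params["since"] = since  # ISO 8601 timestamp of the last poll
    return f"{base_url.rstrip('/')}/api/patches/?{urlencode(params)}"


def extract_patches(payload: str) -> List[Dict]:
    """Keep only the fields a matcher would need from the JSON patch list."""
    return [
        {"id": p["id"], "name": p["name"], "mbox": p["mbox"]}
        for p in json.loads(payload)
    ]


def fetch_new_patches(base_url: str, project: str,
                      since: Optional[str] = None) -> List[Dict]:
    """Poll Patchwork once and return the patches to hand over for analysis."""
    with urlopen(patches_url(base_url, project, since)) as response:
        return extract_patches(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Offline demonstration with a hand-written payload; no request is sent.
    sample = json.dumps([
        {"id": 1, "name": "[v2] example: fix something",
         "mbox": "https://patchwork.example.org/patch/1/mbox/", "state": "new"},
    ])
    print(patches_url("https://patchwork.example.org", "example-project",
                      since="2019-06-04T00:00:00"))
    print(extract_patches(sample))
```

A consumer like this only reads from Patchwork, so it could run on a separate machine and would not add any dependencies to Patchwork itself.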

Best regards,

Lukas