[PATCH v2 0/9] Tools and fixes for parallel parsing
Stephen Finucane
stephen at that.guru
Mon Feb 26 00:05:34 AEDT 2018
On Sun, 2018-02-25 at 01:50 +1100, Daniel Axtens wrote:
> Thomas Petazzoni reported that Patchwork would occasionally lose
> Buildroot email. Andrew - having talked to jk and sfr - suggested that
> this may be race-condition related.
>
> I investigated and found some bugs. I first had to develop some tools.
> Along the way I found other unrelated bugs too.
>
> Patches 1-4 are tooling - ways to do parallel parsing of messages and
> get and compare the output. (Patch 1 fixes an issue I found when
> running the tool from patch 2)
>
> Patch 5 is an unrelated fix that came up along the way and
> demonstrates that humans remain the best fuzzers, and that Python's
> email module is still adorably* quirky.
>
> Patch 6 is a bug that came up very quickly in testing but is unlikely
> to be the actual bug Buildroot is hitting, as it can only occur the
> first time an email address is seen.
>
> Patch 7 is a related tidy-up/optimisation.
>
> Patch 8 fixes up a MySQL-only bug, but also adds some robustness.
>
> I think patch 9 closes the most likely issue for Buildroot patches.
>
> Pending review, patches 5, 6, 8 and 9 should go to stable.
>
> V2: Address review comments from Andrew, see patches.
> Added the new python script to the automatic pep8 testing.
> Minor improvement to parallel parse script to allow different
> pythons to be used.
Well, this is all rather unfortunate :( When I was developing this, I
was testing using 'parsearchive' with a single archive, which naturally
meant everything was running in series. It seems the use of an MTA
means the 'parsemail' script, which 'parsearchive' harnesses, gets
called multiple times here and highlights all these races.
I've reviewed most of the patches (bar one, which I'm still working on)
at this point, and think most of these should be merged. However, I've
noted that some of the patches leave FIXMEs in place. I'm pretty sure
all of these occur where we apply a "lookup now and save later _if_
these preconditions are met" pattern, which we do for things like
series references, authors, etc. This isn't going to be easy to fix,
IMO.
That brings me to the real question here: what do we need to do to
ultimately fix this? Far as I can see, there are two options:
* Save stuff unconditionally. If we're able to find an email, save
that even if it's not a patch/cover letter/comment. If we find an
author that we haven't seen before, save that too even if they
haven't touched a patch, cover letter or comment. This greatly
simplifies things, but would result in a more congested database and
API, and would require some background work (saving non-patch/cover
letter emails).
* Use locks. I recall a patch from Damien Lespiau some years ago
around making this a parallel process [1]. Couldn't we just do this
on the critical sections of code (anything that involves DB
accesses, I guess)? This would require minimal changes but could
have scalability issues.
Thoughts?
[1] http://patchwork.ozlabs.org/patch/532930/
More information about the Patchwork
mailing list