[PATCH v4 1/3] models: Use database constraints to prevent split Series
Daniel Axtens
dja at axtens.net
Sun Dec 8 22:14:36 AEDT 2019
Daniel Axtens <dja at axtens.net> writes:
>> Currently, the 'SeriesReference' object has a unique constraint on the
>> two fields it has, 'series', which is a foreign key to 'Series', and
>> 'msgid'. This is the wrong constraint. What we actually want to enforce
>> is that a patch, cover letter or comment is referenced by a single
>> series, or rather a single series per project the submission appears on.
>> As such, we should be enforcing uniqueness on the msgid and the project
>> that the patch, cover letter or comment belongs to.
>
>
> OK, so I've been working on this and making good progress.
>
> My fundamental issue with the current design, and this extension of it,
> is that we're ignoring or working against the tools that databases
> provide for concurrency. This patch is an improvement, no doubt, but I'd
> really like to solve it properly.
>
> Because we have multiple processes operating on the database - processes
> without external syncronisation - we need to rely on the database to
> provide the synchronisation primitive. Currently we're doing things like
> asking the database to check if something exists and then creating it if
> it does not. This is embeds TOCTOU problems, similar to the ones I fixed
> for Person creation a while back. What we should be doing is
> get_or_create style - basically more aggressively attempting to create
> the thing we need and if that fails, then do the lookup for the existing
> object.
>
> This design pervades the series parser. It doesn't generally bite us
> because we generally, more or less, parse in series.
>
> I'm basically refactoring the parser to be
> 'try-creating-and-catch-failures' rather than
> 'lookup-and-create-if-missing'. It is a bit tricky but I think I'm at
> least 80% of the way there.
The more I think about this patch, the more I see it does actually move
us in the right direction. Let's run with it for now.
As for my rewrite, well, series parsing is really hard. I'm also working
from a data set that has a whole set of bizzare things including series
of patches in-reply-to a cover letter, a series of patches that's half
threaded and half unthreaded and a patch series that contains - by
design - multiple patches per "number" (i.e. 3 patches each numbered
1/N, 3 patches numbered 2/N...), so that means I'm forever fighting
weird edge-cases of how people used git a decade ago. So I'm going to
pass on trying to rewrite it for now.
I'll go over it with a finer-toothed comb and do a proper review, but I
withdraw my overall objection.
Regards,
Daniel
>
> I would be happy to leave my series parsing rework and Mete's work to a
> 2.3, or I'd be happy to try and grind through, get these done, and make
> 2.2 the final release in the 2.x series. What do you think?
>
> Regards,
> Daniel
>
> Stephen Finucane <stephen at that.guru> writes:
>
>>
>> This requires adding a new field to the object, 'project', since it's
>> not possible to do something like the following:
>>
>> unique_together = [('msgid', 'series__project')]
>>
>> This is detailed here [1]. In addition, the migration needs a precursor
>> migration step to merge any broken series.
>>
>> [1] https://stackoverflow.com/a/4440189/613428
>>
>> Signed-off-by: Stephen Finucane <stephen at that.guru>
>> Closes: #241
>> Cc: Daniel Axtens <dja at axtens.net>
>> Cc: Petr Vorel <petr.vorel at gmail.com>
>> ---
>> v4:
>> - Further tweaks to logging
>> v3:
>> - Improve logging
>> ---
>> .../0038_unique_series_references.py | 121 +++++++++++++++++
>> patchwork/models.py | 6 +-
>> patchwork/parser.py | 123 +++++++++---------
>> patchwork/tests/test_parser.py | 9 +-
>> patchwork/tests/utils.py | 6 +-
>> 5 files changed, 199 insertions(+), 66 deletions(-)
>> create mode 100644 patchwork/migrations/0038_unique_series_references.py
>>
>> diff --git a/patchwork/migrations/0038_unique_series_references.py b/patchwork/migrations/0038_unique_series_references.py
>> new file mode 100644
>> index 00000000..91799020
>> --- /dev/null
>> +++ b/patchwork/migrations/0038_unique_series_references.py
>> @@ -0,0 +1,121 @@
>> +from django.db import connection, migrations, models
>> +from django.db.models import Count
>> +import django.db.models.deletion
>> +
>> +
>> +def merge_duplicate_series(apps, schema_editor):
>> + SeriesReference = apps.get_model('patchwork', 'SeriesReference')
>> + Patch = apps.get_model('patchwork', 'Patch')
>> +
>> + msgid_seriesrefs = {}
>> +
>> + # find all SeriesReference that share a msgid but point to different series
>> + # and decide which of the series is going to be the authoritative one
>> + msgid_counts = (
>> + SeriesReference.objects.values('msgid')
>> + .annotate(count=Count('msgid'))
>> + .filter(count__gt=1)
>> + )
>> + for msgid_count in msgid_counts:
>> + msgid = msgid_count['msgid']
>> + chosen_ref = None
>> + for series_ref in SeriesReference.objects.filter(msgid=msgid):
>> + if series_ref.series.cover_letter:
>> + if chosen_ref:
>> + # I don't think this can happen, but explode if it does
>> + raise Exception(
>> + "Looks like you've got two or more series that share "
>> + "some patches but do not share a cover letter. Unable "
>> + "to auto-resolve."
>> + )
>> +
>> + # if a series has a cover letter, that's the one we'll group
>> + # everything under
>> + chosen_ref = series_ref
>> +
>> + if not chosen_ref:
>> + # if none of the series have cover letters, simply use the last
>> + # one (hint: this relies on Python's weird scoping for for loops
>> + # where 'series_ref' is still accessible outside the loop)
>> + chosen_ref = series_ref
>> +
>> + msgid_seriesrefs[msgid] = chosen_ref
>> +
>> + # reassign any patches referring to non-authoritative series to point to
>> + # the authoritative one, and delete the other series; we do this separately
>> + # to allow us a chance to raise the exception above if necessary
>> + for msgid, chosen_ref in msgid_seriesrefs.items():
>> + for series_ref in SeriesReference.objects.filter(msgid=msgid):
>> + if series_ref == chosen_ref:
>> + continue
>> +
>> + # update the patches to point to our chosen series instead, on the
>> + # assumption that all other metadata is correct
>> + for patch in Patch.objects.filter(series=series_ref.series):
>> + patch.series = chosen_ref.series
>> + patch.save()
>> +
>> + # delete the other series (which will delete the series ref)
>> + series_ref.series.delete()
>> +
>> +
>> +def copy_project_field(apps, schema_editor):
>> + if connection.vendor == 'postgresql':
>> + schema_editor.execute(
>> + """
>> + UPDATE patchwork_seriesreference
>> + SET project_id = patchwork_series.project_id
>> + FROM patchwork_series
>> + WHERE patchwork_seriesreference.series_id = patchwork_series.id
>> + """
>> + )
>> + elif connection.vendor == 'mysql':
>> + schema_editor.execute(
>> + """
>> + UPDATE patchwork_seriesreference, patchwork_series
>> + SET patchwork_seriesreference.project_id = patchwork_series.project_id
>> + WHERE patchwork_seriesreference.series_id = patchwork_series.id
>> + """
>> + ) # noqa
>> + else:
>> + SeriesReference = apps.get_model('patchwork', 'SeriesReference')
>> +
>> + for series_ref in SeriesReference.objects.all().select_related(
>> + 'series'
>> + ):
>> + series_ref.project = series_ref.series.project
>> + series_ref.save()
>> +
>> +
>> +class Migration(migrations.Migration):
>> +
>> + dependencies = [('patchwork', '0037_event_actor')]
>> +
>> + operations = [
>> + migrations.RunPython(
>> + merge_duplicate_series, migrations.RunPython.noop, atomic=False
>> + ),
>> + migrations.AddField(
>> + model_name='seriesreference',
>> + name='project',
>> + field=models.ForeignKey(
>> + null=True,
>> + on_delete=django.db.models.deletion.CASCADE,
>> + to='patchwork.Project',
>> + ),
>> + ),
>> + migrations.AlterUniqueTogether(
>> + name='seriesreference', unique_together={('project', 'msgid')}
>> + ),
>> + migrations.RunPython(
>> + copy_project_field, migrations.RunPython.noop, atomic=False
>> + ),
>> + migrations.AlterField(
>> + model_name='seriesreference',
>> + name='project',
>> + field=models.ForeignKey(
>> + on_delete=django.db.models.deletion.CASCADE,
>> + to='patchwork.Project',
>> + ),
>> + ),
>> + ]
>> diff --git a/patchwork/models.py b/patchwork/models.py
>> index 7f0efd48..dbaff024 100644
>> --- a/patchwork/models.py
>> +++ b/patchwork/models.py
>> @@ -79,7 +79,8 @@ class Project(models.Model):
>> webscm_url = models.CharField(max_length=2000, blank=True)
>> list_archive_url = models.CharField(max_length=2000, blank=True)
>> list_archive_url_format = models.CharField(
>> - max_length=2000, blank=True,
>> + max_length=2000,
>> + blank=True,
>> help_text="URL format for the list archive's Message-ID redirector. "
>> "{} will be replaced by the Message-ID.")
>> commit_url_format = models.CharField(
>> @@ -787,6 +788,7 @@ class SeriesReference(models.Model):
>> required to handle the case whereby one or more patches are
>> received before the cover letter.
>> """
>> + project = models.ForeignKey(Project, on_delete=models.CASCADE)
>> series = models.ForeignKey(Series, related_name='references',
>> related_query_name='reference',
>> on_delete=models.CASCADE)
>> @@ -796,7 +798,7 @@ class SeriesReference(models.Model):
>> return self.msgid
>>
>> class Meta:
>> - unique_together = [('series', 'msgid')]
>> + unique_together = [('project', 'msgid')]
>>
>>
>> class Bundle(models.Model):
>> diff --git a/patchwork/parser.py b/patchwork/parser.py
>> index c424729b..c0084cc0 100644
>> --- a/patchwork/parser.py
>> +++ b/patchwork/parser.py
>> @@ -16,6 +16,7 @@ import re
>>
>> from django.contrib.auth.models import User
>> from django.db.utils import IntegrityError
>> +from django.db import transaction
>> from django.utils import six
>>
>> from patchwork.models import Comment
>> @@ -256,16 +257,9 @@ def _find_series_by_references(project, mail):
>> for ref in refs:
>> try:
>> return SeriesReference.objects.get(
>> - msgid=ref[:255], series__project=project).series
>> + msgid=ref[:255], project=project).series
>> except SeriesReference.DoesNotExist:
>> continue
>> - except SeriesReference.MultipleObjectsReturned:
>> - # FIXME: Open bug: this can happen when we're processing
>> - # messages in parallel. Pick the first and log.
>> - logger.error("Multiple SeriesReferences for %s in project %s!" %
>> - (ref[:255], project.name))
>> - return SeriesReference.objects.filter(
>> - msgid=ref[:255], series__project=project).first().series
>>
>>
>> def _find_series_by_markers(project, mail, author):
>> @@ -1092,47 +1086,65 @@ def parse_mail(mail, list_id=None):
>> except IntegrityError:
>> raise DuplicateMailError(msgid=msgid)
>>
>> - # if we don't have a series marker, we will never have an existing
>> - # series to match against.
>> - series = None
>> - if n:
>> - series = find_series(project, mail, author)
>> + for attempt in range(1, 11): # arbitrary retry count
>> + try:
>> + with transaction.atomic():
>> + # if we don't have a series marker, we will never have an
>> + # existing series to match against.
>> + series = None
>> + if n:
>> + series = find_series(project, mail, author)
>> + else:
>> + x = n = 1
>> +
>> + # We will create a new series if:
>> + # - there is no existing series to assign this patch to, or
>> + # - there is an existing series, but it already has a patch
>> + # with this number in it
>> + if not series or Patch.objects.filter(
>> + series=series, number=x).count():
>> + series = Series.objects.create(
>> + project=project,
>> + date=date,
>> + submitter=author,
>> + version=version,
>> + total=n)
>> +
>> + # NOTE(stephenfin) We must save references for series.
>> + # We do this to handle the case where a later patch is
>> + # received first. Without storing references, it would
>> + # not be possible to identify the relationship between
>> + # patches as the earlier patch does not reference the
>> + # later one.
>> + for ref in refs + [msgid]:
>> + ref = ref[:255]
>> + # we don't want duplicates
>> + try:
>> + # we could have a ref to a previous series.
>> + # (For example, a series sent in reply to
>> + # another series.) That should not create a
>> + # series ref for this series, so check for the
>> + # msg-id only, not the msg-id/series pair.
>> + SeriesReference.objects.get(
>> + msgid=ref, project=project)
>> + except SeriesReference.DoesNotExist:
>> + SeriesReference.objects.create(
>> + msgid=ref, project=project, series=series)
>> + break
>> + except IntegrityError:
>> + # we lost the race so go again
>> + logger.warning('Conflict while saving series. This is '
>> + 'probably because multiple patches belonging '
>> + 'to the same series have been received at '
>> + 'once. Trying again (attempt %02d/10)',
>> + attempt)
>> else:
>> - x = n = 1
>> -
>> - # We will create a new series if:
>> - # - there is no existing series to assign this patch to, or
>> - # - there is an existing series, but it already has a patch with this
>> - # number in it
>> - if not series or Patch.objects.filter(series=series, number=x).count():
>> - series = Series.objects.create(
>> - project=project,
>> - date=date,
>> - submitter=author,
>> - version=version,
>> - total=n)
>> -
>> - # NOTE(stephenfin) We must save references for series. We
>> - # do this to handle the case where a later patch is
>> - # received first. Without storing references, it would not
>> - # be possible to identify the relationship between patches
>> - # as the earlier patch does not reference the later one.
>> - for ref in refs + [msgid]:
>> - ref = ref[:255]
>> - # we don't want duplicates
>> - try:
>> - # we could have a ref to a previous series. (For
>> - # example, a series sent in reply to another
>> - # series.) That should not create a series ref
>> - # for this series, so check for the msg-id only,
>> - # not the msg-id/series pair.
>> - SeriesReference.objects.get(msgid=ref,
>> - series__project=project)
>> - except SeriesReference.DoesNotExist:
>> - SeriesReference.objects.create(series=series, msgid=ref)
>> - except SeriesReference.MultipleObjectsReturned:
>> - logger.error("Multiple SeriesReferences for %s"
>> - " in project %s!" % (ref, project.name))
>> + # we failed to save the series so return the series-less patch
>> + logger.warning('Failed to save series. Your patch with message ID '
>> + '%s has been saved but this should not happen. '
>> + 'Please report this as a bug and include logs',
>> + msgid)
>> + return patch
>>
>> # add to a series if we have found one, and we have a numbered
>> # patch. Don't add unnumbered patches (for example diffs sent
>> @@ -1170,14 +1182,9 @@ def parse_mail(mail, list_id=None):
>> # message
>> try:
>> series = SeriesReference.objects.get(
>> - msgid=msgid, series__project=project).series
>> + msgid=msgid, project=project).series
>> except SeriesReference.DoesNotExist:
>> series = None
>> - except SeriesReference.MultipleObjectsReturned:
>> - logger.error("Multiple SeriesReferences for %s"
>> - " in project %s!" % (msgid, project.name))
>> - series = SeriesReference.objects.filter(
>> - msgid=msgid, series__project=project).first().series
>>
>> if not series:
>> series = Series.objects.create(
>> @@ -1190,12 +1197,8 @@ def parse_mail(mail, list_id=None):
>> # we don't save the in-reply-to or references fields
>> # for a cover letter, as they can't refer to the same
>> # series
>> - try:
>> - SeriesReference.objects.get_or_create(series=series,
>> - msgid=msgid)
>> - except SeriesReference.MultipleObjectsReturned:
>> - logger.error("Multiple SeriesReferences for %s"
>> - " in project %s!" % (msgid, project.name))
>> + SeriesReference.objects.create(
>> + msgid=msgid, project=project, series=series)
>>
>> try:
>> cover_letter = CoverLetter.objects.create(
>> diff --git a/patchwork/tests/test_parser.py b/patchwork/tests/test_parser.py
>> index 0bf71585..0edbd87a 100644
>> --- a/patchwork/tests/test_parser.py
>> +++ b/patchwork/tests/test_parser.py
>> @@ -421,17 +421,20 @@ class SeriesCorrelationTest(TestCase):
>> msgids = [make_msgid()]
>> project = create_project()
>> series_v1 = create_series(project=project)
>> - create_series_reference(msgid=msgids[0], series=series_v1)
>> + create_series_reference(msgid=msgids[0], series=series_v1,
>> + project=project)
>>
>> # ...and three patches
>> for i in range(3):
>> msgids.append(make_msgid())
>> - create_series_reference(msgid=msgids[-1], series=series_v1)
>> + create_series_reference(msgid=msgids[-1], series=series_v1,
>> + project=project)
>>
>> # now create a new series with "cover letter"
>> msgids.append(make_msgid())
>> series_v2 = create_series(project=project)
>> - ref_v2 = create_series_reference(msgid=msgids[-1], series=series_v2)
>> + ref_v2 = create_series_reference(msgid=msgids[-1], series=series_v2,
>> + project=project)
>>
>> # ...and the "first patch" of this new series
>> msgid = make_msgid()
>> diff --git a/patchwork/tests/utils.py b/patchwork/tests/utils.py
>> index 577183d0..4ead1781 100644
>> --- a/patchwork/tests/utils.py
>> +++ b/patchwork/tests/utils.py
>> @@ -282,8 +282,12 @@ def create_series(**kwargs):
>>
>> def create_series_reference(**kwargs):
>> """Create 'SeriesReference' object."""
>> + project = kwargs.pop('project', create_project())
>> + series = kwargs.pop('series', create_series(project=project))
>> +
>> values = {
>> - 'series': create_series() if 'series' not in kwargs else None,
>> + 'series': series,
>> + 'project': project,
>> 'msgid': make_msgid(),
>> }
>> values.update(**kwargs)
>> --
>> 2.23.0
More information about the Patchwork
mailing list