Postmortem upgrading patchwork.kernel.org 1.0->2.1
Konstantin Ryabitsev
konstantin at linuxfoundation.org
Wed Aug 1 00:34:53 AEST 2018
Hello, patchwork list!
About a week ago I performed the upgrade of patchwork.kernel.org from
version 1.0 to version 2.1. This is my war story, since it didn't go
nearly as smoothly as I was hoping.
First, general notes to describe the infra:
- patchwork.kernel.org consists of 74 projects and, prior to the
migration, held 1,755,019 patches (according to count(*) on the
patchwork_patch table).
- of these, 750,081 patches came from one single project (LKML).
- the database is MySQL, in a master-slave failover configuration,
accessed through haproxy.
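For reference, the counts above come from queries along these lines (a
sketch assuming the stock 1.0 schema, where patchwork_patch rows point
at their patchwork_project via project_id):

  -- total patches across all projects
  SELECT COUNT(*) FROM patchwork_patch;

  -- per-project breakdown, largest first
  SELECT p.linkname, COUNT(*) AS patches
    FROM patchwork_patch pp
    JOIN patchwork_project p ON p.id = pp.project_id
   GROUP BY p.linkname
   ORDER BY patches DESC;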
The LKML project was mostly dead weight, since nobody was actually using
it for tracking patches. We ended up creating a separate patchwork
instance just for LKML, located here: https://lore.kernel.org/patchwork.
The migration therefore involved two overall steps:
1. delete the LKML project from patchwork.kernel.org, dropping nearly
half of all db entries.
2. migrate the remainder to patchwork 2.1.
# Problems during the first stage
My attempt to delete the LKML project via the admin interface failed.
Clicking the "Delete" button on the project settings page consumed all
the RAM on the patchwork system and OOM-ed in the most horrible ways,
requiring a system reboot. To work around it, I manually deleted all
patches in patchwork_patch that belonged to the LKML project. This
allowed me to access the "Delete" page and delete the project, though
it also left me with a broken session, because my admin profile got
purged along the way. The uwsgi log showed this error:
Traceback (most recent call last):
  File "/opt/patchwork/venv/lib/python2.7/site-packages/django/core/handlers/base.py", line 132, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "./patchwork/views/patch.py", line 106, in list
    view_args = {'project_id': project.linkname})
  File "./patchwork/views/__init__.py", line 59, in generic_list
    if project.is_editable(user):
  File "./patchwork/models.py", line 69, in is_editable
    return self in user.profile.maintainer_projects.all()
  File "/opt/patchwork/venv/lib/python2.7/site-packages/django/utils/functional.py", line 226, in inner
    return func(self._wrapped, *args)
  File "/opt/patchwork/venv/lib/python2.7/site-packages/django/db/models/fields/related.py", line 483, in __get__
    self.related.get_accessor_name()
RelatedObjectDoesNotExist: User has no profile.
I was able to create another admin user and continue.
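For anyone hitting the same OOM, the manual cleanup amounted to
queries roughly along these lines (a sketch, not my literal commands;
the project id of 42 and the batch size are illustrative, and any
related tables holding per-patch data would need the same treatment
first):

  -- delete the LKML patches in bounded batches instead of letting
  -- the admin interface load everything into memory at once
  DELETE FROM patchwork_patch
   WHERE project_id = 42
   LIMIT 10000;
  -- repeat until 0 rows are affected; with the patches gone, the
  -- admin "Delete" page can handle the rest of the project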
# Problems during the second stage
At this stage, I started the migration process using manage.py migrate.
This immediately ran into a problem with haproxy inactivity timeouts.
As I mentioned, the database sits behind haproxy in a master-slave
configuration, and tcp connections are configured to time out after 5
minutes of inactivity. The migration performed some serious schema
changes on tables with a million or more rows, and individual
statements took MUCH longer than our 5-minute timeout setting, so
haproxy kept severing the connection mid-migration.
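If you are debugging something similar: the client side just sees the
connection die mid-statement, but on the server itself you can check
whether the statement is in fact still running, and for how long:

  -- the Time column shows how many seconds each statement has been
  -- executing; anything older than the proxy timeout is suspect
  SHOW FULL PROCESSLIST;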
After setting things up to connect directly to the master, bypassing
haproxy, I was able to proceed to the next step.

Unfortunately, I didn't get very far: the migration routines started
failing because they tried to lock millions of rows at once and ran
out of database resources. I could not easily fix this, since raising
the lock limits would have required restarting the database server (an
operation that affects multiple projects using it). Instead, I had to
dig into the django migration scripts and run the mysql queries
manually, adding WHERE clauses so that they would operate on subsets
of rows, limited by id ranges (see the sketch below). This took a few
hours -- to some degree because all operations had to be replicated to
the slave server. Some of the tables involved were tens of gigabytes
in size, so shipping all the replication logs to the slave also ate a
lot of resources and resulted in lots of network and disk IO.
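To give a flavour of the pattern: Django will print the raw SQL for a
migration with "manage.py sqlmigrate <app> <migration>", and I ran
variants of those statements constrained to id ranges. A made-up
example of the technique (the column is illustrative, not the actual
migration SQL):

  -- "new_flag" stands in for whatever column the real migration
  -- backfills; the point is the bounded id range, which keeps the
  -- number of locked rows limited
  UPDATE patchwork_patch
     SET new_flag = 0
   WHERE new_flag IS NULL
     AND id BETWEEN 1 AND 100000;
  -- next pass: ... AND id BETWEEN 100001 AND 200000, and so on,
  -- until the highest id in the table is covered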
In the end, it mostly worked out, despite the somewhat gruelling
process. One mysterious side-effect of deleting the LKML project is
that some people lost their maintainer status in other projects. I'm
not sure how this came to be, but at least it's an easy fix -- it's
probably the same reason my admin profile got purged, requiring me to
create a new one.
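If you want to check for the same fallout after your own migration,
maintainership can be audited straight from the database. A sketch,
assuming Django's default naming for the m2m table behind
maintainer_projects:

  -- list maintainers per project
  SELECT p.linkname, u.username
    FROM patchwork_userprofile_maintainer_projects m
    JOIN patchwork_userprofile up ON up.id = m.userprofile_id
    JOIN auth_user u ON u.id = up.user_id
    JOIN patchwork_project p ON p.id = m.project_id
   ORDER BY p.linkname, u.username;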
Needless to say, I hope future upgrades are a lot smoother. Did I
test the migration before starting it? Yes, but on a very small subset
of test data, which is why I didn't hit any of the above conditions --
all of which stemmed from a) deleting a project and b) copying,
merging and deleting huge datasets as the backend db format got
rejigged between versions 1.0 and 2.1.
My suggestion for anyone performing a similar migration from 1.0 to
2.1 -- if you have more than a few thousand patches -- is to perform
the migration on a standalone database and then dump and re-import the
result into production, instead of performing it live on multi-master
or master-slave setups. You will save a LOT of IO and avoid many
problems related to large table modifications and database resource
starvation.
If you have any questions or if you'd like me to help troubleshoot
something, I'll be quite happy to do that.
Best,
-K