Postmortem upgrading patchwork.kernel.org 1.0->2.1

Konstantin Ryabitsev konstantin at linuxfoundation.org
Wed Aug 1 00:34:53 AEST 2018


Hello, patchwork list!

About a week ago I performed the upgrade of patchwork.kernel.org from 
version 1.0 to version 2.1. This is my war story, since it didn't go 
nearly as smoothly as I was hoping.

First, general notes to describe the infra:

- patchwork.kernel.org consists of 74 projects and, prior to the 
  migration, 1,755,019 patches (according to count(*) in the 
  patchwork_patch table).
- of these, 750,081 patches are from one single project (LKML).
- the database is MySQL, in a master-slave failover configuration, 
  accessed using haproxy

The LKML project was mostly dead weight, since nobody was actually using 
it for tracking patches. We ended up creating a separate patchwork 
instance just for LKML, located here: https://lore.kernel.org/patchwork.  
The migration, therefore, included two overall steps:

1. delete the LKML project from patchwork.kernel.org, dropping nearly 
half of the db entries in the process.
2. migrate the remainder to patchwork 2.1

# Problems during the first stage

The attempt to delete the LKML project via the admin interface failed.  
Clicking the "Delete" button on the project settings page consumed all 
the RAM on the patchwork system and OOMed in the most horrible ways, 
requiring a system reboot. To work around this, I manually deleted all 
patches from patchwork_patch that belonged to the LKML project. This 
allowed me to access the "Delete" page and delete the project, though 
it also left me with a broken session, because my admin profile ended 
up getting purged along the way. The uwsgi log showed this error:

Traceback (most recent call last):
  File "/opt/patchwork/venv/lib/python2.7/site-packages/django/core/handlers/base.py", line 132, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "./patchwork/views/patch.py", line 106, in list
    view_args = {'project_id': project.linkname})
  File "./patchwork/views/__init__.py", line 59, in generic_list
    if project.is_editable(user):
  File "./patchwork/models.py", line 69, in is_editable
    return self in user.profile.maintainer_projects.all()
  File "/opt/patchwork/venv/lib/python2.7/site-packages/django/utils/functional.py", line 226, in inner
    return func(self._wrapped, *args)
  File "/opt/patchwork/venv/lib/python2.7/site-packages/django/db/models/fields/related.py", line 483, in __get__
    self.related.get_accessor_name()
RelatedObjectDoesNotExist: User has no profile.

I was able to create another admin user and continue.
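
If you need to do something similar, a chunked delete keeps Django from 
trying to collect millions of rows in memory at once, which appears to 
be what a single "Delete" click in the admin ends up doing. Below is a 
rough sketch, run from manage.py shell -- the 'lkml' linkname and the 
batch size are illustrative:

    # Delete one project's patches in bounded id ranges so that each
    # queryset delete only has to collect and cascade over a small
    # slice of rows at a time.
    from django.db.models import Min, Max
    from patchwork.models import Patch, Project

    project = Project.objects.get(linkname='lkml')
    bounds = Patch.objects.filter(project=project).aggregate(Min('id'),
                                                             Max('id'))
    step = 10000
    for start in range(bounds['id__min'], bounds['id__max'] + 1, step):
        Patch.objects.filter(project=project,
                             id__gte=start,
                             id__lt=start + step).delete()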

# Problems during the second stage

At this stage, I started the migration process using manage.py migrate.  
This immediately ran into a problem with haproxy inactivity timeouts. As 
I mentioned, our database runs in a master-slave configuration behind 
haproxy, and TCP connections are configured to time out after 5 minutes 
of inactivity. The migration was doing some serious database 
modifications on tables with a million or more rows, which took MUCH 
longer than our 5-minute timeout setting.
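
The fix was to talk to the master directly for the duration of the 
migration. For anyone in a similar situation, the Django side of it is 
roughly a settings override like the one below -- the host name and 
credentials are placeholders, not our actual configuration:

    # Temporary local settings override, used only while running the
    # migration. Host and credentials are placeholders; the point is
    # simply to talk to the MySQL master directly instead of going
    # through the haproxy frontend.
    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.mysql',
            'NAME': 'patchwork',
            'USER': 'patchwork',
            'PASSWORD': 'changeme',
            'HOST': 'db-master.example.org',  # master, not the proxy VIP
            'PORT': '3306',
        }
    }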

After setting things up to connect directly to the master, bypassing 
haproxy, I was able to proceed to the next step. Unfortunately, I didn't 
get very far, since at this point the migration routines were failing 
because they were trying to lock millions of rows and running out of 
database resources. I could not easily fix this, because raising the 
maximum number of locks would have required restarting the database 
server (an operation that affects multiple other projects using it). 
Instead, I had to look into the Django migration scripts and run the 
MySQL queries manually, adding WHERE clauses so that they would operate 
on subsets of rows (limiting by id ranges).

This took a few hours -- to some degree because all operations had to be 
replicated to the slave server. Some of the tables involved were tens of 
gigabytes in size, so shipping all these replication logs to the slave 
server also took a lot of resources and resulted in lots of network and 
disk IO.
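
To give a flavour of what "adding WHERE clauses" means in practice: you 
can get the raw SQL for a given migration with manage.py sqlmigrate and 
then run the expensive statements yourself in id-bounded slices. A rough 
sketch of that kind of loop is below; the SET clause is made up for 
illustration and is not the actual 2.1 migration SQL:

    import MySQLdb

    # Illustrative only: run a big data-shuffling statement from a
    # migration in id-bounded slices, so that each statement only locks
    # a limited number of rows and each chunk is committed (and
    # replicated) separately.
    conn = MySQLdb.connect(host='db-master.example.org', user='patchwork',
                           passwd='changeme', db='patchwork')
    cur = conn.cursor()

    cur.execute("SELECT MIN(id), MAX(id) FROM patchwork_patch")
    min_id, max_id = cur.fetchone()

    step = 50000
    for start in range(min_id, max_id + 1, step):
        cur.execute("UPDATE patchwork_patch "
                    "SET some_new_column = some_old_column "
                    "WHERE id >= %s AND id < %s",
                    (start, start + step))
        conn.commit()  # keep each transaction and its binlog entry small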

In the end, it mostly worked out, despite the somewhat gruelling 
process. I did hit one somewhat mysterious side-effect of deleting the 
LKML project: some people lost maintainer status in other projects. I'm 
not sure how this came to be -- probably the same reason my admin 
profile got purged, requiring me to create a new one -- but at least it 
is an easy fix.

Needless to say, I hope future upgrades are a lot smoother. Did I test 
the migration before starting it? Yes, but on a very small subset of 
test data, which is why I didn't hit any of the above conditions -- they 
all stemmed from a) deleting a project and b) copying, merging and 
deleting huge datasets as the backend db format got rejigged between 
versions 1.0 and 2.1.

My suggestion for anyone performing a similar migration from 1.0 to 2.1 
-- if you have more than a few thousand patches -- is to perform the 
migration on a standalone database and then dump and re-import the 
result into production, instead of running it live against a 
multi-master or master-slave setup. You will save a LOT of IO and avoid 
many of the problems related to large table modifications and database 
resource starvation.
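
In case it helps, the overall shape of that offline approach is 
something like the sketch below. Host names are placeholders, and you 
would obviously want credentials, a maintenance window and a sanity 
check of the result on top of this:

    import subprocess

    # Rough sketch: copy production data onto a scratch MySQL instance,
    # run the 1.0 -> 2.1 migration there at leisure, then dump the
    # migrated result and load it back into production.
    def run(cmd):
        print(cmd)
        subprocess.check_call(cmd, shell=True)

    # 1. Dump production and load it into the standalone scratch server.
    run("mysqldump -h db-master.example.org --single-transaction patchwork"
        " > /tmp/patchwork-prod.sql")
    run("mysql -h scratch-db.example.org patchwork"
        " < /tmp/patchwork-prod.sql")

    # 2. Run the migration with DATABASES pointed at the scratch server
    #    -- no haproxy in the way and no replication traffic to ship.
    run("python manage.py migrate")

    # 3. Dump the migrated database and re-import it into production.
    run("mysqldump -h scratch-db.example.org --single-transaction patchwork"
        " > /tmp/patchwork-migrated.sql")
    run("mysql -h db-master.example.org patchwork"
        " < /tmp/patchwork-migrated.sql")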

If you have any questions or if you'd like me to help troubleshoot 
something, I'll be quite happy to do that.

Best,
-K