Postmortem upgrading patchwork.kernel.org 1.0->2.1

Thu Aug 2 03:03:36 AEST 2018

Hi Konstantin,

> Hello, patchwork list!
>
> About a week ago I performed the upgrade of patchwork.kernel.org from 
> version 1.0 to version 2.1. This is my war story, since it didn't go 
> nearly as smoothly as I was hoping.

Apologies that your experience was sub-optimal. Thanks so much for
sending this - this info will be really helpful for us in making things
better in the future.

> First, general notes to describe the infra:
>
> - patchwork.kernel.org consists of 74 projects and, prior to the 
>   migration, 1,755,019 patches (according to count(*) in the 
>   patchwork_patch table).
> - of these, 750,081 patches are from one single project (LKML).
> - the database is MySQL, in a master-slave failover configuration, 
>   accessed using haproxy

That is large! I will increase the size of my test database, although
I'm never going to be at the scale to test master/slave failover.

> The LKML project was mostly dead weight, since nobody was actually using 
> it for tracking patches. We ended up creating a separate patchwork 
> instance just for LKML, located here: https://lore.kernel.org/patchwork.  
> The migration, therefore, included two overall steps:
>
> 1. delete the LKML project from patchwork.kernel.org, dumping nearly 
> half of db entries.
> 2. migrate the remainder to patchwork 2.1
>
> # Problems during the first stage
>
> Attempt to delete the LKML project via the admin interface failed.  
> Clicking the "Delete" button on the project settings page basically 
> consumed all the RAM on the patchwork system and OOM-ed in most horrible 
> ways, requiring a system reboot. In an attempt to solve it, I manually 
> deleted all patches from patchwork_patch that belonged to the LKML 
> project. This allowed me to access the "Delete" page and delete the 
> project, though this also resulted in a corrupted session, because my 
> admin profile ended up corrupted. The uwsgi log showed this error:
>
> Traceback (most recent call last):
>   File "/opt/patchwork/venv/lib/python2.7/site-packages/django/core/handlers/base.py", line 132, in get_response
>     response = wrapped_callback(request, *callback_args, **callback_kwargs)
>   File "./patchwork/views/patch.py", line 106, in list
>     view_args = {'project_id': project.linkname})
>   File "./patchwork/views/__init__.py", line 59, in generic_list
>     if project.is_editable(user):
>   File "./patchwork/models.py", line 69, in is_editable
>     return self in user.profile.maintainer_projects.all()
>   File "/opt/patchwork/venv/lib/python2.7/site-packages/django/utils/functional.py", line 226, in inner
>     return func(self._wrapped, *args)
>   File "/opt/patchwork/venv/lib/python2.7/site-packages/django/db/models/fields/related.py", line 483, in __get__
>     self.related.get_accessor_name()
> RelatedObjectDoesNotExist: User has no profile.
>
> I was able to create another admin user and continue.

Ouch. :(

I wonder what broke there. I'll see if I can hunt it down. You should be
able to delete projects without incident, although it's something we
probably test very rarely. I imagine there's some list comprehension or
dumb code of some sort that loads the whole thing into memory for some
reason.
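
As a rough illustration of the kind of fix this would call for (a
sketch against a toy SQLite table, not Patchwork's actual models or
delete path), the rows can be removed in bounded batches so that neither
the application nor the database has to hold the whole project at once.
The table name `patch`, the column `project_id`, and the helper
`delete_in_batches` are all assumptions made up for the example:

```python
import sqlite3


def delete_in_batches(conn, project_id, batch_size=1000):
    """Delete one project's rows a batch at a time; return rows deleted.

    Hypothetical helper: each batch is its own short transaction, so
    memory use and lock counts stay bounded by batch_size.
    """
    total = 0
    while True:
        # SQLite form. On MySQL the subquery-with-LIMIT form is rejected;
        # a plain "DELETE ... WHERE project_id = %s LIMIT %s" would be
        # the likely equivalent there (untested assumption).
        cur = conn.execute(
            "DELETE FROM patch WHERE id IN "
            "(SELECT id FROM patch WHERE project_id = ? LIMIT ?)",
            (project_id, batch_size),
        )
        conn.commit()
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total


# Toy data: 2500 rows for project 1, 500 rows for project 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patch (id INTEGER PRIMARY KEY, project_id INTEGER)")
conn.executemany(
    "INSERT INTO patch (project_id) VALUES (?)",
    [(1,)] * 2500 + [(2,)] * 500,
)
conn.commit()

deleted = delete_in_batches(conn, project_id=1)
remaining = conn.execute("SELECT COUNT(*) FROM patch").fetchone()[0]
print(deleted, remaining)
```

Committing after every batch trades atomicity for bounded resource use:
an interrupted run leaves the project partly deleted, but it can simply
be re-run.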

> # Problems during the second stage
>
> At this stage, I started the migration process using manage.py migrate.  
> Immediately, this resulted in a problem due to haproxy inactivity 
> timeouts. As I mentioned, our setup uses a master-slave setup, and tcp 
> connections are configured to time out after 5 minutes of inactivity.  
> The migration script was doing some serious database modifications 
> operating on tables with about a million or more rows, which took MUCH 
> longer than our 5-minute timeout setting.

Yeah, I'm sorry about the state of the 1.0 -> 2.0 migration
especially. We tried a different sort of schema, the migration to the
new schema is super expensive, and after all that it didn't end up
scaling well, so 2.0 -> 2.1 partially reverts it.

I would have loved to make the migrations 'smart' and able to detect
that you're skipping 2.0 and avoid doing useless work, but I don't think
Django can be sanely wrangled to do this. (If there are any Django
wizards on the list that do know how to do this, please let me know!)

> After setting things up to connect directly to the master, bypassing 
> haproxy, I was able to proceed to the next step. Unfortunately, I didn't 
> get very far, since at this point migration routines were failing 
> because they were trying to lock millions of rows and running out of 
> database resources. Unfortunately, I could not easily fix this because 
> raising maximum locks would have required restarting the database server 
> (an operation that affects multiple projects using it). I had to look in 
> the django migration scripts and run mysql queries manually, adding 
> WHERE clauses so that they would operate on subsets of rows (limiting by 
> id ranges). This took a few hours -- to some degree because all 
> operations had to be replicated to the slave server. Some of the tables 
> it operated on were tens of gigabytes in size, so shipping all these 
> replication logs to the slave server also took a lot of resources and 
> resulted in lots of network and disk IO.

OK, that's an at-scale problem that just didn't occur to me. I will
(human memory permitting) make sure that future migrations are a bit
more clever here and don't try to lock everything all at once.
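
The id-range approach you describe can be sketched roughly as below.
This is a toy SQLite example, not one of the real Patchwork migrations;
the table `patch`, the column `migrated`, and the helpers `id_ranges`
and `run_in_chunks` are hypothetical names used only for illustration:

```python
import sqlite3


def id_ranges(min_id, max_id, step):
    """Yield inclusive (lo, hi) id ranges covering min_id..max_id."""
    lo = min_id
    while lo <= max_id:
        hi = min(lo + step - 1, max_id)
        yield lo, hi
        lo = hi + 1


def run_in_chunks(conn, sql, table, step=10000):
    """Run `sql` once per id range, committing after each chunk.

    `sql` must be a statement without a WHERE clause, and `table` must be
    a trusted name (it is interpolated directly). Returns the chunk count.
    """
    min_id, max_id = conn.execute(
        f"SELECT MIN(id), MAX(id) FROM {table}"
    ).fetchone()
    if min_id is None:  # empty table, nothing to do
        return 0
    chunks = 0
    for lo, hi in id_ranges(min_id, max_id, step):
        # Only rows in [lo, hi] are touched (and locked) per transaction.
        conn.execute(sql + " WHERE id BETWEEN ? AND ?", (lo, hi))
        conn.commit()
        chunks += 1
    return chunks


# Toy data: 25 rows with ids 1..25, none migrated yet.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patch (id INTEGER PRIMARY KEY, migrated INTEGER DEFAULT 0)"
)
conn.executemany("INSERT INTO patch (migrated) VALUES (?)", [(0,)] * 25)
conn.commit()

chunks = run_in_chunks(conn, "UPDATE patch SET migrated = 1", "patch", step=10)
done = conn.execute("SELECT COUNT(*) FROM patch WHERE migrated = 1").fetchone()[0]
print(chunks, done)
```

Keeping each chunk in its own transaction bounds the number of rows
locked at any moment, and it also keeps each replicated transaction
small, which should ease the load on the slave.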

> In the end, it mostly worked out, despite the somewhat gruelling 
> process. I do have a somewhat mysterious side-effect of deleting the 
> LKML project in that some people lost maintainer status in other 
> projects. I'm not sure how this came to be, but at least it's an easy 
> fix -- probably the same reason my admin profile got purged requiring me 
> to create a new one.

Huh, I'm sorry to hear that. I wonder if there's a cascade rule that's
broken somewhere. I guess I'll write some more tests in this area.

> Needless to say, I hope future upgrades are a lot smoother. Did I 
> test the migration before starting it? Yes, but on a very small subset 
> of test data, which is why I didn't hit any of the above conditions, 
> which stemmed from a) deleting a project and b) copying, merging and 
> deleting huge datasets as the backend db format got rejigged between 
> version 1.0 and 2.1.

I'm really sorry it was such an ordeal for you.

I think OzLabs also suffered from the 1.0 -> 2.0 migration taking
multiple hours - it was a large, data-intensive migration. The good news
(at least for OzLabs, which is still on 2.0) is that the 2.0 -> 2.1
migration is much simpler than 1.0 -> 2.0.

I will try to keep future migrations to a manageable level also. There's
more work to do to unpick some of the knots we tied ourselves in, but it
will probably be spread out over a few minor versions.

> My suggestion for anyone performing similar migrations from 1.0 to 2.1 
> -- if you have more than a few thousand patches -- is to perform this 
> migration on a standalone database and then dump and re-import the 
> result into production instead of performing this live on multi-master 
> or master-slave setups. You will save a LOT of IO and avoid many 
> problems related to table modifications and database resource starving.
>
> If you have any questions or if you'd like me to help troubleshoot 
> something, I'll be quite happy to do that.

If you have any ongoing issues please do let us know - you and OzLabs
are our biggest public users and I'd really like to make sure you have a
good experience.

I'll take a look at the bugs you mentioned as soon as I get a minute
(which at the moment is not particularly compatible with the rest of my
life), but it sounded like none of them are ongoing?

Thanks again for letting us know how it went and I'll keep your feedback
in mind for future releases.

Regards,
Daniel

PS. I don't personally have much experience administering large systems,
so if there are other things you want us to add/remove/change/consider
so it can run well at kernel.org scale, please let us know!

>
> Best,
> -K
> _______________________________________________
> Patchwork mailing list
> Patchwork at lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/patchwork