Postmortem upgrading patchwork.kernel.org 1.0->2.1

Stephen Finucane stephen at that.guru
Wed Aug 29 23:08:14 AEST 2018


On Thu, 2018-08-02 at 03:03 +1000, Daniel Axtens wrote:
> Hi Konstantin,
> 
> > Hello, patchwork list!
> > 
> > About a week ago I performed the upgrade of patchwork.kernel.org from 
> > version 1.0 to version 2.1. This is my war story, since it didn't go 
> > nearly as smoothly as I was hoping.
> 
> Apologies that your experience was sub-optimal. Thanks so much for
> sending this - this info will be really helpful for us in making things
> better in the future.
> 
> > First, general notes to describe the infra:
> > 
> > - patchwork.kernel.org consists of 74 projects and, prior to the 
> >   migration, 1,755,019 patches (according to count(*) in the 
> >   patchwork_patch table).
> > - of these, 750,081 patches are from one single project (LKML).
> > - the database is MySQL, in a master-slave failover configuration, 
> >   accessed using haproxy
> 
> That is large! I will increase the size of my test database, although
> I'm never going to be at the scale to test master/slave failover.

Likewise. I don't think I've ever gone above 20,000 patches, which is
clearly nowhere near enough.

> > The LKML project was mostly dead weight, since nobody was actually using 
> > it for tracking patches. We ended up creating a separate patchwork 
> > instance just for LKML, located here: https://lore.kernel.org/patchwork.  
> > The migration, therefore, included two overall steps:
> > 
> > 1. delete the LKML project from patchwork.kernel.org, dumping nearly 
> > half of db entries.
> > 2. migrate the remainder to patchwork 2.1
> > 
> > # Problems during the first stage
> > 
> > The attempt to delete the LKML project via the admin interface failed.  
> > Clicking the "Delete" button on the project settings page basically 
> > consumed all the RAM on the patchwork system and OOM-ed in most horrible 
> > ways, requiring a system reboot. In an attempt to solve it, I manually 
> > deleted all patches from patchwork_patch that belonged to the LKML 
> > project. This allowed me to access the "Delete" page and delete the 
> > project, though this also resulted in a corrupted session, because my 
> > admin profile ended up corrupted. The uwsgi log showed this error:
> > 
> > Traceback (most recent call last):
> >   File "/opt/patchwork/venv/lib/python2.7/site-packages/django/core/handlers/base.py", line 132, in get_response
> >     response = wrapped_callback(request, *callback_args, **callback_kwargs)
> >   File "./patchwork/views/patch.py", line 106, in list
> >     view_args = {'project_id': project.linkname})
> >   File "./patchwork/views/__init__.py", line 59, in generic_list
> >     if project.is_editable(user):
> >   File "./patchwork/models.py", line 69, in is_editable
> >     return self in user.profile.maintainer_projects.all()
> >   File "/opt/patchwork/venv/lib/python2.7/site-packages/django/utils/functional.py", line 226, in inner
> >     return func(self._wrapped, *args)
> >   File "/opt/patchwork/venv/lib/python2.7/site-packages/django/db/models/fields/related.py", line 483, in __get__
> >     self.related.get_accessor_name()
> > RelatedObjectDoesNotExist: User has no profile.
> > 
> > I was able to create another admin user and continue.
> 
> Ouch. :(
> 
> I wonder what broke there. I'll see if I can hunt it down. You should be
> able to delete projects without incident, although it's something we
> probably test very rarely. I imagine there's some list comprehension or
> dumb code of some sort that loads the whole thing into memory for some
> reason.

I suspect these might be Django bugs. There's no reason the delete
operation should cause issues like this. I would suggest reporting this
upstream against Django but, iirc, patchwork.kernel.org had been using
Django 1.6 which is waaay past EOL.
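
For anyone who hits the same wall before upgrading, clearing the
project out in batches from a Django shell should avoid both the
memory blow-up and one enormous transaction. A rough sketch, not
tested at anything like this scale; the project id and batch size are
placeholders:

from patchwork.models import Patch

LKML_ID = 42        # placeholder: the real LKML project id
BATCH = 10000

while True:
    # grab a slice of ids so each delete stays small
    ids = list(Patch.objects.filter(project_id=LKML_ID)
                            .values_list('id', flat=True)[:BATCH])
    if not ids:
        break
    Patch.objects.filter(id__in=ids).delete()

Deleting in small chunks also keeps the replication stream to the
slave in digestible pieces rather than one multi-gigabyte transaction.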

> > # Problems during the second stage
> > 
> > At this stage, I started the migration process using manage.py migrate.  
> > Immediately, this resulted in a problem due to haproxy inactivity 
> > timeouts. As I mentioned, our setup uses a master-slave setup, and tcp 
> > connections are configured to time out after 5 minutes of inactivity.  
> > The migration script was doing some serious database modifications 
> > operating on tables with about a million or more rows, which took MUCH 
> > longer than our 5-minute timeout setting.
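
For anyone else staging this upgrade, pointing Django straight at the
master for the duration of the migration is probably the least painful
workaround. A temporary settings override along these lines should do
it (the hostnames and credentials here are made up):

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'HOST': 'db-master.example.org',  # bypass the haproxy frontend
        'PORT': '3306',
        'NAME': 'patchwork',
        'USER': 'patchwork',
        'PASSWORD': 'secret',             # placeholder
    }
}

You can switch back to the haproxy endpoint once the migration
finishes.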
> 
> Yeah, I'm sorry about the state of the 1.0 -> 2.0 migration
> especially. We tried a different sort of schema; the migration to the
> new schema is super expensive, and after all that it didn't end up
> scaling well, so 2.0 -> 2.1 partially reverts it.
> 
> I would have loved to make the migrations 'smart' and able to detect
> that you're skipping 2.0 and avoid doing useless work, but I don't think
> Django can be sanely wrangled to do this. (If there are any Django
> wizards on the list that do know how to do this, please let me know!)

This might help:

https://docs.djangoproject.com/en/2.1/topics/migrations/#migration-squashing

I haven't really looked into it though. I knew that migration was going
to be painful but, again, never thought about people operating at this
scale. If it's any consolation, this should be the last time you'll see
a migration quite so awful.
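
For reference, squashing is just a management command, so something
along these lines on a 2.1 checkout would collapse the chain (the
target migration name is a placeholder for whatever the newest file in
patchwork/migrations/ is):

from django.core.management import call_command

# equivalent to: python manage.py squashmigrations patchwork <target>
call_command('squashmigrations', 'patchwork', '0027_placeholder')

No promises the optimizer actually cancels out all of the
1.0 -> 2.0 -> 2.1 churn, though.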

> > After setting things up to connect directly to the master, bypassing 
> > haproxy, I was able to proceed to the next step. Unfortunately, I didn't 
> > get very far, since at this point migration routines were failing 
> > because they were trying to lock millions of rows and running out of 
> > database resources. Unfortunately, I could not easily fix this because 
> > raising maximum locks would have required restarting the database server 
> > (an operation that affects multiple projects using it). I had to look in 
> > the django migration scripts and run mysql queries manually, adding 
> > WHERE clauses so that they would operate on subsets of rows (limiting by 
> > id ranges). This took a few hours -- to some degree because all 
> > operations had to be replicated to the slave server. Some of the tables 
> > it operated on were tens of gigabytes in size, so shipping all these 
> > replication logs to the slave server also took a lot of resources and 
> > resulted in lots of network and disk IO.
> 
> OK, that's an at-scale problem that just didn't occur to me. I will
> (human memory permitting) make sure that future migrations are a bit
> more clever here and don't try to lock everything all at once.
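
For anyone else who ends up in the same spot mid-upgrade, the chunking
Konstantin describes can be scripted rather than done by hand. A rough
sketch from a Django shell; the UPDATE below is only a stand-in, the
real statements come from "manage.py sqlmigrate" for the migration in
question:

from django.db import connection

STEP = 100000  # tune to whatever your lock limits allow

with connection.cursor() as cursor:
    cursor.execute("SELECT MIN(id), MAX(id) FROM patchwork_patch")
    low, high = cursor.fetchone()
    for start in range(low, high + 1, STEP):
        # placeholder statement: substitute the SQL emitted by
        # "manage.py sqlmigrate patchwork <migration>" here
        cursor.execute(
            "UPDATE patchwork_patch SET archived = archived"
            " WHERE id BETWEEN %s AND %s",
            [start, start + STEP - 1])

Each statement then locks at most STEP rows, and the slave receives
the changes in similarly sized pieces.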
> 
> > In the end, it mostly worked out, despite the somewhat gruelling 
> > process. I do have a somewhat mysterious side-effect of deleting the 
> > LKML project in that some people lost maintainer status in other 
> > projects. I'm not sure how this came to be, but at least it's an easy 
> > fix -- probably the same reason my admin profile got purged requiring me 
> > to create a new one.
> 
> Huh, I'm sorry to hear that. I wonder if there's a cascade rule that's
> broken somewhere. I guess I'll write some more tests in this area.
> 
> > Needless to say, I hope future upgrades are a lot smoother. Did I 
> > test the migration before starting it? Yes, but on a very small subset 
> > of test data, which is why I didn't hit any of the above conditions, 
> > which stemmed from a) deleting a project and b) copying, merging and 
> > deleting huge datasets as the backend db format got rejigged between 
> > version 1.0 and 2.1.
> 
> I'm really sorry it was such an ordeal for you.
> 
> I think OzLabs also suffered from the 1.0 -> 2.0 migration taking
> multiple hours - it was a large, data-intensive migration. The good news
> (at least for OzLabs, which is still on 2.0) is that the 2.0 -> 2.1
> migration is much simpler than 1.0 -> 2.0.
> 
> I will try to keep future migrations to a manageable level also. There's
> more work to do to unpick some of the knots we tied ourselves in, but it
> will probably be spread out over a few minor versions.
> 
> > My suggestion for anyone performing similar migrations from 1.0 to 2.1 
> > -- if you have more than a few thousand patches -- is to perform this 
> > migration on a standalone database and then dump and re-import the 
> > result into production instead of performing this live on multi-master 
> > or master-slave setups. You will save a LOT of IO and avoid many 
> > problems related to table modifications and database resource starving.
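
For anyone following this advice, the dump-and-reload half is stock
mysqldump/mysql. A rough sketch, wrapped in Python only so it can be
dropped into a script; the hostnames, database names and missing
credentials are all placeholders:

import subprocess

# 1. dump the standalone copy that has already been migrated to 2.1
with open('patchwork-2.1.sql', 'wb') as out:
    subprocess.check_call(['mysqldump', '--single-transaction',
                           '-h', 'standalone-db.example.org',
                           'patchwork'], stdout=out)

# 2. load the result into the production master; the slave then
#    replicates the finished tables once, instead of every
#    intermediate row change made during the migration
with open('patchwork-2.1.sql', 'rb') as dump:
    subprocess.check_call(['mysql', '-h', 'db-master.example.org',
                           'patchwork'], stdin=dump)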
> > 
> > If you have any questions or if you'd like me to help troubleshoot 
> > something, I'll be quite happy to do that.
> 
> If you have any ongoing issues please do let us know - you and OzLabs
> are our biggest public users and I'd really like to make sure you have a
> good experience.
> 
> I'll take a look at the bugs you mentioned as soon as I get a minute
> (which at the moment is not particularly compatible with the rest of my
> life), but it sounded like none of them are ongoing?
> 
> Thanks again for letting us know how it went and I'll keep your feedback
> in mind for future releases.

Everything he said. As far as actionable items go, it seems there is
some weirdness going on as a result of deleting entries from a table
manually, which I think Daniel has mostly covered. I'm not sure how
much I can personally do here, but a greater focus on the impact of
migrations is definitely something I'll be considering going forward.
In any case, I very much appreciate this write-up. It's good to see
the kind of issues real-world users see at scale, if only to ensure we
do our best to minimize them where possible. I'm certain this will be
a big asset to anyone else looking at taking on this migration.

Cheers,
Stephen

> Regards,
> Daniel
> 
> PS. I don't personally have much experience administering large systems,
> so if there are other things you want us to add/remove/change/consider
> so it can run well at kernel.org scale, please let us know!
> 
> > 
> > Best,
> > -K


