[PATCH 01/11] Improve patch listing performance (~3x)

Stewart Smith stewart at linux.ibm.com
Mon Aug 13 12:06:43 AEST 2018


Daniel Axtens <dja at axtens.net> writes:
> Stewart Smith <stewart at linux.ibm.com> writes:
>
>> There are two main bits that are really expensive when composing the
>> list of patches for a project: the query getting the list, and the
>> query finding the series for each patch.
>>
>> If we look at the query getting the list, it fetches a lot of unnecessary
>> fields such as 'headers' and 'content', even though we tell Django not
>> to. It turns out that Django seems to ignore defer() on fields from the
>> Submission relationship, and I have no idea how to force it to skip them
>> that way, but if we use only() instead, it works okay.
>>
>> From my import of ~8000 messages for a few projects, my laptop query
>> time (MySQL, as set up by whatever the docker-compose things do) goes
>> from:
>>
>> http://localhost:8000/project/linux-kernel/list/
>> FROM:
>> 342ms SQL queries cold cache, 268ms warm cache
>> TO:
>> 118ms SQL queries cold cache, 88ms warm cache
>>
>> Which is... non-trivial, to say the least.
>>
>> The big jump is the patches.only change, and the removal of ordering
>> on the patchseries takes a further 10ms off. For some strange reason, it
>> seems rather hard to tell Django that you don't care what order the
>> results come back in for that query (if we ask for an order, the db
>> server has to do a sort rather than just return each row).
>
> Thanks Stewart! It's great to get some real DB experience - feel free to
> hang around! :)

I'll try to :)

Or at least pop up to make things faster/nicer to the database.

> So, further to our conversation with Konstantin, I tested this against
> Django 2.0. It still saves us some time - it means we no longer load the
> following fields:
>
> `patchwork_submission`.`id`, `patchwork_submission`.`msgid`,
> `patchwork_patch`.`commit_ref`, `patchwork_patch`.`pull_url`,
> `patchwork_patch`.`archived`, `patchwork_patch`.`hash`,
> `patchwork_patch`.`patch_project_id`, 
>
> This obviously saves the db some work and communication overhead.

I found that 'headers' and 'content' from the patch were also being
fetched, and this change saves fetching those (which is really quite
expensive, as blobs are typically stored on a separate page from the
row itself).

It's the blobs that were the big expense, as they'd likely generate 2
disk seeks and reads per row.
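
For reference, the shape of the change is roughly this (a minimal
sketch only; the helper and field list below are illustrative, not the
exact Patchwork view code):

    # Restricting the queryset to the columns the list view actually
    # renders stops Django pulling in 'headers' and 'content' from the
    # Submission table; defer() alone didn't manage that here.
    from patchwork.models import Patch

    def patches_for_list(project):
        return Patch.objects.filter(project=project).only(
            'id', 'name', 'date', 'state', 'submitter', 'delegate',
        )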

> I'm a little nervous that this will slightly complicate some of the
> further denormalisation but I think that's probably a price worth paying
> unless Stephen objects.

I don't think it should really be too much of a problem: as any
denormalisation occurs, folk are going to have to look at the queries
being produced and executed anyway, otherwise there's always a chance
of regressions.
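
For example, something like this in a Django shell (with DEBUG=True so
queries are recorded) shows what actually gets sent to the database;
the filter here is just an illustration:

    from django.db import connection, reset_queries
    from patchwork.models import Patch

    reset_queries()
    # Force evaluation so the query actually runs.
    list(Patch.objects.filter(project__linkname='linux-kernel')
                      .only('id', 'name')[:20])
    # connection.queries is only populated when settings.DEBUG is True.
    for q in connection.queries:
        print(q['time'], q['sql'])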

> I do still want to test the 'ordering' change a bit more before
> committing it though.

Yeah, that struck me as a weird one, but it certainly did have an effect
on some things.
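
If it helps with that testing: as far as I understand it, calling
order_by() with no arguments clears any default ordering inherited from
the model's Meta, so the change is of this shape (model and relation
names assumed for illustration, not the exact Patchwork code):

    # Series declares a default ordering in its Meta; for this lookup we
    # don't care about order, so clear it and let the database hand rows
    # back however it likes, avoiding a sort on the server.
    from patchwork.models import Series

    def series_for_patch(patch):
        return Series.objects.filter(patches=patch).order_by()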

-- 
Stewart Smith
OPAL Architect, IBM.
