[RFC PATCH] REST: Add new setting for maximum API page size
stewart at linux.vnet.ibm.com
Tue Aug 7 19:09:08 AEST 2018
Daniel Axtens <dja at axtens.net> writes:
>>>>> FWIW we now have this applied on patchwork.ozlabs.org and it appears to
>>>>> be working. Would like some more input as to what an appropriate default
>>>>> limit is.
>>>> My completely fact-free feeling/opinion is that if it takes more than
>>>> ~1sec to run on Ozlabs, it's probably too high. Apart from that, I'm not
>>> OzLabs.org is not a fast machine. 1 second round trip time == 100 per
>>> page. 500 per page == ~2.3 seconds round trip.
>>> A single 500 item load is still under half the time of doing 5 * 100
>>> item loads...
> I wonder if we could let authenticated users do slower queries. Maybe
> something for later. Let's split the difference at 250 then, I'd be
> happy to merge that.
>> I bet MySQL would execute the query quicker :)
> In general (and I have tested this) mysql 5.7 executes patchwork queries
> slower than postgres 9.6 - this is on my laptop on an SSD with a couple
> 100ks of patches across 2 or 3 projects.
Huh, somewhat surprising. I wonder what the query plan and cache layout
>> Anyway, that sounds really quite slow and we should look at why on earth
>> it's that slow - is there some kind of horrific hidden join or poor
>> indexing or query plan or something?
> So the 1.0 -> 2.0 transition was not good for performance because it
> turns out databases do not map well to object-oriented structures.* I
> didn't think to check for performance at the time, so we're
> progressively improving that by denormalising stuff. (In fact that was a
> major driver of the 2.1 release!) There is still further to go.
So, looking at the schema, if we were going for the state of a bunch of
patches in a project (which I imagine is fairly common), the current
patch table is:
CREATE TABLE `patchwork_patch` (
`submission_ptr_id` int(11) NOT NULL,
`commit_ref` varchar(255) DEFAULT NULL,
`pull_url` varchar(255) DEFAULT NULL,
`delegate_id` int(11) DEFAULT NULL,
`state_id` int(11) DEFAULT NULL,
`archived` tinyint(1) NOT NULL,
`hash` char(40) DEFAULT NULL,
`patch_project_id` int(11) NOT NULL,
PRIMARY KEY (`submission_ptr_id`),
KEY `patchwork_patch_delegate_id_8f7ce805_fk_auth_user_id` (`delegate_id`),
KEY `patchwork_patch_state_id_3d9fb500_fk_patchwork_state_id` (`state_id`),
KEY `patchwork_patch_patch_project_id_aff1ad51_fk_patchwork` (`patch_project_id`),
CONSTRAINT `patchwork_patch_delegate_id_8f7ce805_fk_auth_user_id` FOREIGN KEY (`delegate_id`) REFERENCES `auth_user` (`id`),
CONSTRAINT `patchwork_patch_patch_project_id_aff1ad51_fk_patchwork` FOREIGN KEY (`patch_project_id`) REFERENCES `patchwork_project` (`id`),
CONSTRAINT `patchwork_patch_state_id_3d9fb500_fk_patchwork_state_id` FOREIGN KEY (`state_id`) REFERENCES `patchwork_state` (`id`),
CONSTRAINT `patchwork_patch_submission_ptr_id_03216801_fk_patchwork` FOREIGN KEY (`submission_ptr_id`) REFERENCES `patchwork_submission` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
which puts the submission_ptr_id as the key for the clustered index
(i.e. on disk layout will group by this ID).
What I suspect is occuring is that we end up having teh on disk layout
of patchwork_patch (and patchwork_submission) be ordered by time the
patch comes in across *all* projects of the patchwork instance, which is
typically (never?) what is actually queried.
So that when I do queries over, say 'skiboot', I'm pulling into cache
all the submissions/patches that occured to any project around that
time, rather than only things related to skiboot.
If instead, the primary key was ('patch_project_id','submission_ptr_id')
and there was a UNIQUE KEY ('submission_ptr_id') then this would mean
that we'd be pulling in a database page full of the patches for the
project we're searching on, which I think would be more likely to be
queried in the near future.
The patch_project_id index colud be dropped, as all queries on it would
be satisfied by the primary key.
The same goes for patchwork_submission:
CREATE TABLE `patchwork_submission` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`msgid` varchar(255) NOT NULL,
`name` varchar(255) NOT NULL,
`date` datetime(6) NOT NULL,
`headers` longtext NOT NULL,
`project_id` int(11) NOT NULL,
`submitter_id` int(11) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `patchwork_patch_msgid_project_id_2e1fe709_uniq` (`msgid`,`project_id`),
KEY `patchwork_patch_project_id_162a7a7f_fk_patchwork_project_id` (`project_id`),
KEY `patchwork_patch_submitter_id_57c66142_fk_patchwork_person_id` (`submitter_id`),
CONSTRAINT `patchwork_patch_project_id_162a7a7f_fk_patchwork_project_id` FOREIGN KEY (`project_id`) REFERENCES `patchwork_project` (`id`),
CONSTRAINT `patchwork_patch_submitter_id_57c66142_fk_patchwork_person_id` FOREIGN KEY (`submitter_id`) REFERENCES `patchwork_person` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
simply getting a list of patches for a project here is going to be a
very expensive query, pulling in a lot of database pages containing data
for patches around that time rather than ones relevant to the query.
Again, I'd say go:
PRIMARY KEY ('project_id', 'id')
UNIQUE KEY ('id')
(the project_id index can then be dropped, as the primary key could be
Of course, things get better if you query everything looking up
project_id=X and id=Y rather than just id=Y.
For PostgreSQL it gets a bit interesting as I'm not as familiar with how
it does data layout. There *appears* to be a CLUSTER *command* that will
re-organise an existing table but *not* make any future inserts/updates
obey it (which is annoying).
This seems to be a bit of a theme throughout the schema too...
Actually... do we have a good test dataset I could load up locally and
> * Yes, I probably should have realised this at the time. something
> something unpaid volunteers something something patch review something
pretty much neglecting some piece of software that ends up being vital
open source infrastructure is a grand tradition, welcome!
OPAL Architect, IBM.
More information about the Patchwork