[RFC PATCH] REST: Add new setting for maximum API page size

Tue Aug 7 19:09:08 AEST 2018

Daniel Axtens <dja at axtens.net> writes:
>>>>> FWIW we now have this applied on patchwork.ozlabs.org and it appears to
>>>>> be working. Would like some more input as to what an appropriate default
>>>>> limit is.
>>>> 
>>>> My completely fact-free feeling/opinion is that if it takes more than
>>>> ~1sec to run on Ozlabs, it's probably too high. Apart from that, I'm not
>>>> fussed.
>>>
>>> OzLabs.org is not a fast machine. 1 second round trip time == 100 per 
>>> page. 500 per page == ~2.3 seconds round trip.
>>>
>>> A single 500 item load is still under half the time of doing 5 * 100 
>>> item loads...
>
> I wonder if we could let authenticated users do slower queries. Maybe
> something for later. Let's split the difference at 250 then, I'd be
> happy to merge that.
>
>> I bet MySQL would execute the query quicker :)
>
> In general (and I have tested this) mysql 5.7 executes patchwork queries
> slower than postgres 9.6 - this is on my laptop on an SSD with a couple
> 100ks of patches across 2 or 3 projects.

Huh, somewhat surprising. I wonder what the query plan and cache layout
looks like....

>> Anyway, that sounds really quite slow and we should look at why on earth
>> it's that slow - is there some kind of horrific hidden join or poor
>> indexing or query plan or something?
>
> Yes.
>
> So the 1.0 -> 2.0 transition was not good for performance because it
> turns out databases do not map well to object-oriented structures.* I
> didn't think to check for performance at the time, so we're
> progressively improving that by denormalising stuff. (In fact that was a
> major driver of the 2.1 release!) There is still further to go.

So, looking at the schema, if we were going for the state of a bunch of
patches in a project (which I imagine is  fairly common), the current
patch table is:

CREATE TABLE `patchwork_patch` (
  `submission_ptr_id` int(11) NOT NULL,
  `diff` longtext,
  `commit_ref` varchar(255) DEFAULT NULL,
  `pull_url` varchar(255) DEFAULT NULL,
  `delegate_id` int(11) DEFAULT NULL,
  `state_id` int(11) DEFAULT NULL,
  `archived` tinyint(1) NOT NULL,
  `hash` char(40) DEFAULT NULL,
  `patch_project_id` int(11) NOT NULL,
  PRIMARY KEY (`submission_ptr_id`),
  KEY `patchwork_patch_delegate_id_8f7ce805_fk_auth_user_id` (`delegate_id`),
  KEY `patchwork_patch_state_id_3d9fb500_fk_patchwork_state_id` (`state_id`),
  KEY `patchwork_patch_patch_project_id_aff1ad51_fk_patchwork` (`patch_project_id`),
  CONSTRAINT `patchwork_patch_delegate_id_8f7ce805_fk_auth_user_id` FOREIGN KEY (`delegate_id`) REFERENCES `auth_user` (`id`),
  CONSTRAINT `patchwork_patch_patch_project_id_aff1ad51_fk_patchwork` FOREIGN KEY (`patch_project_id`) REFERENCES `patchwork_project` (`id`),
  CONSTRAINT `patchwork_patch_state_id_3d9fb500_fk_patchwork_state_id` FOREIGN KEY (`state_id`) REFERENCES `patchwork_state` (`id`),
  CONSTRAINT `patchwork_patch_submission_ptr_id_03216801_fk_patchwork` FOREIGN KEY (`submission_ptr_id`) REFERENCES `patchwork_submission` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

which puts the submission_ptr_id as the key for the clustered index
(i.e. on disk layout will group by this ID).

What I suspect is occuring is that we end up having teh on disk layout
of patchwork_patch (and patchwork_submission) be ordered by time the
patch comes in across *all* projects of the patchwork instance, which is
typically (never?) what is actually queried.

So that when I do queries over, say 'skiboot', I'm pulling into cache
all the submissions/patches that occured to any project around that
time, rather than only things related to skiboot.

If instead, the primary key was ('patch_project_id','submission_ptr_id')
and there was a UNIQUE KEY ('submission_ptr_id') then this would mean
that we'd be pulling in a database page full of the patches for the
project we're searching on, which I think would be more likely to be
queried in the near future.

The patch_project_id index colud be dropped, as all queries on it would
be satisfied by the primary key.

The same goes for patchwork_submission:

CREATE TABLE `patchwork_submission` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `msgid` varchar(255) NOT NULL,
  `name` varchar(255) NOT NULL,
  `date` datetime(6) NOT NULL,
  `headers` longtext NOT NULL,
  `project_id` int(11) NOT NULL,
  `submitter_id` int(11) NOT NULL,
  `content` longtext,
  PRIMARY KEY (`id`),
  UNIQUE KEY `patchwork_patch_msgid_project_id_2e1fe709_uniq` (`msgid`,`project_id`),
  KEY `patchwork_patch_project_id_162a7a7f_fk_patchwork_project_id` (`project_id`),
  KEY `patchwork_patch_submitter_id_57c66142_fk_patchwork_person_id` (`submitter_id`),
  CONSTRAINT `patchwork_patch_project_id_162a7a7f_fk_patchwork_project_id` FOREIGN KEY (`project_id`) REFERENCES `patchwork_project` (`id`),
  CONSTRAINT `patchwork_patch_submitter_id_57c66142_fk_patchwork_person_id` FOREIGN KEY (`submitter_id`) REFERENCES `patchwork_person` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

simply getting a list of patches for a project here is going to be a
very expensive query, pulling in a lot of database pages containing data
for patches around that time rather than ones relevant to the query.

Again, I'd say go:
PRIMARY KEY ('project_id', 'id')
UNIQUE KEY ('id')

(the project_id index can then be dropped, as the primary key could be
used there).

Of course, things get better if you query everything looking up
project_id=X and id=Y rather than just id=Y.

For PostgreSQL it gets a bit interesting as I'm not as familiar with how
it does data layout. There *appears* to be a CLUSTER *command* that will
re-organise an existing table but *not* make any future inserts/updates
obey it (which is annoying).

This seems to be a bit of a theme throughout the schema too...

Actually... do we have a good test dataset I could load up locally and
experiment with?

> * Yes, I probably should have realised this at the time. something
>   something unpaid volunteers something something patch review something
>   something.

:)

pretty much neglecting some piece of software that ends up being vital
open source infrastructure is a grand tradition, welcome!

-- 
Stewart Smith
OPAL Architect, IBM.