Memory coherency issue with IO thread offloading?

Jens Axboe axboe at kernel.dk
Mon Mar 27 23:39:33 AEDT 2023


On 3/26/23 10:22 PM, Nicholas Piggin wrote:
> On Sat Mar 25, 2023 at 11:20 AM AEST, Jens Axboe wrote:
>> On 3/24/23 7:15 PM, Jens Axboe wrote:
>>>> Are there any CONFIG options I'd need to trip this?
>>>
>>> I don't think you need any special CONFIG options. I'll attach my config
>>> here, and I know the default distro one hits it too. But perhaps the
>>> mariadb version is not new enough? I think you need 10.6 or above, as
>>> will use io_uring by default. What version are you running?
>>
>> And here's the .config and the patch for using queue_work().
> 
> So if you *don't* apply this patch, the work gets queued up with an IO
> thread? In io-wq.c? Does that worker end up just doing an io_write()
> same as this one?

Right, without this patch, it gets added to the io-wq work pool. If a
worker thread is available to run it, it will; if one is not, a new one
is created. Either can happen.

That thread does the exact same io_write() again.
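To illustrate the dispatch behavior described above, here's a minimal
userspace sketch with pthreads: a work item is queued, an idle worker picks
it up if one exists, otherwise a new worker thread is created. This is only
an analogue for illustration; the real code in io_uring/io-wq.c is
considerably more involved (accounting, hashing, worker lifetime, etc.),
and all names here are made up:

```c
#include <pthread.h>
#include <stdlib.h>

struct work {
    void (*fn)(void *);       /* e.g. the same io_write() again */
    void *arg;
    struct work *next;
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static struct work *head;
static int idle_workers;

static void *worker(void *unused)
{
    (void)unused;
    pthread_mutex_lock(&lock);
    for (;;) {
        while (!head) {
            idle_workers++;
            pthread_cond_wait(&cond, &lock);
            idle_workers--;
        }
        struct work *w = head;
        head = w->next;
        pthread_mutex_unlock(&lock);
        w->fn(w->arg);        /* run the punted work */
        free(w);
        pthread_mutex_lock(&lock);
    }
    return NULL;
}

/* Queue work; create a worker only if none is idle, as io-wq does. */
static void enqueue(void (*fn)(void *), void *arg)
{
    struct work *w = malloc(sizeof(*w));
    w->fn = fn;
    w->arg = arg;
    w->next = NULL;
    pthread_mutex_lock(&lock);
    struct work **p = &head;
    while (*p)
        p = &(*p)->next;      /* append to tail of the work list */
    *p = w;
    if (!idle_workers) {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        pthread_detach(t);
    }
    pthread_mutex_unlock(&lock);
    pthread_cond_signal(&cond);
}
```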

> Can the  queueing cause the creation of an IO thread (if one does not
> exist, or all blocked?)

Yep

Since writing this email, I've gone through a lot of different tests.
Here's a rough listing of what I found:

- As with the hack patch, if I limit the number of io-wq workers to 1,
  the test seems to pass. At least it survives far longer than before,
  completing 1000 iterations.

- If I pin each IO worker to a single CPU, it also passes.

- If I liberally sprinkle smp_mb() on the io-wq side, the test still
  fails. I added one before queueing the work item and one after, plus
  one before the io-wq worker grabs a work item and one after. In other
  words, the full hammer approach. It still fails.

Puzzling... For the "pin each IO worker to a single CPU" case, I added
some basic code to try to ensure that a work item queued on CPU X would
be processed by a worker on CPU X, and to a large degree, this does
happen. But since the work list is a normal list, it's quite possible
that some other worker finishes its work on CPU Y just in time to grab
the one from CPU X. I checked, and this does happen in the test case,
yet it still passes. I may just have gotten lucky, but that seems
suspect with thousands of passes of the test case.
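The kernel-side pinning in that experiment has a simple userspace analogue
via pthread_setaffinity_np() (Linux-specific); a sketch, with the helper
name made up:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to `cpu`; returns 0 on success. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

After a successful call, sched_getcpu() reports the pinned CPU and the
scheduler can no longer migrate the thread, which removes cross-CPU
migration of a worker as a variable in the experiment.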

Another theory is that it's perhaps related to an io-wq worker being
rescheduled on a different CPU. Though again, I'm puzzled as to why the
smp_mb() sprinkling didn't fix that. I'm going to try running the test
case with JUST the io-wq worker pinning, without caring about where the
work is processed, to see if that changes anything.

> I'm wondering what the practical differences are between this patch and
> upstream.
> 
> kthread_use_mm() should be basically the same as context switching to an
> IO thread. There is maybe a difference in that kthread_switch_mm() has
> a 'sync' instruction *after* the MMU is switched to the new thread from
> the membarrier code, but a regular context switch might not. The MMU
> switch does have an isync() after it though, so loads *should* be
> prohibited from moving ahead of that.
> 
> Something like this adds a sync roughly where kthread_use_mm() has one.
> It's a pretty unlikely shot in the dark though. I'm more inclined to
> think the work submission to the IO thread might have a problem.

Didn't seem to change anything, fails pretty quickly:

[...]
encryption.innodb_encryption 'innodb,undo0' [ 38 pass ]   3083
encryption.innodb_encryption 'innodb,undo0' [ 39 pass ]   3135
encryption.innodb_encryption 'innodb,undo0' [ 40 fail ]
        Test ended at 2023-03-27 12:20:46

CURRENT_TEST: encryption.innodb_encryption
mysqltest: At line 11: query 'SET @start_global_value = @@global.innodb_encryption_threads' failed: ER_UNKNOWN_SYSTEM_VARIABLE (1193): Unknown system variable 'innodb_encryption_threads'

The result from queries just before the failure was:
SET @start_global_value = @@global.innodb_encryption_threads;

 - saving '/dev/shm/mysql/log/encryption.innodb_encryption-innodb,undo0/' to '/dev/shm/mysql/log/encryption.innodb_encryption-innodb,undo0/'
***Warnings generated in error logs during shutdown after running tests: encryption.innodb_encryption

2023-03-27 12:20:45 0 [Warning] Plugin 'example_key_management' is of maturity level experimental while the server is gamma
2023-03-27 12:20:45 0 [ERROR] InnoDB: Database page corruption on disk or a failed read of file './ibdata1' page [page id: space=0, page number=214]. You may have to recover from a backup.
2023-03-27 12:20:45 0 [ERROR] InnoDB: File './ibdata1' is corrupted
2023-03-27 12:20:45 0 [ERROR] InnoDB: Plugin initialization aborted with error Page read from tablespace is corrupted.
2023-03-27 12:20:45 0 [ERROR] Plugin 'InnoDB' init function returned error.
2023-03-27 12:20:45 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.

-- 
Jens Axboe



More information about the Linuxppc-dev mailing list