[PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal

Michael Ellerman mpe at ellerman.id.au
Fri Jul 31 23:16:44 AEST 2020


Nathan Lynch <nathanl at linux.ibm.com> writes:
> Michael Ellerman <mpe at ellerman.id.au> writes:
>> Nathan Lynch <nathanl at linux.ibm.com> writes:
>>> Laurent Dufour <ldufour at linux.ibm.com> writes:
>>>> On 28/07/2020 19:37, Nathan Lynch wrote:
>>>>> The drmem lmb list can have hundreds of thousands of entries, and
>>>>> unfortunately lookups take the form of linear searches. As long as
>>>>> this is the case, traversals have the potential to monopolize the CPU
>>>>> and provoke lockup reports, workqueue stalls, and the like unless
>>>>> they explicitly yield.
>>>>> 
>>>>> Rather than placing cond_resched() calls within various
>>>>> for_each_drmem_lmb() loop blocks in the code, put it in the iteration
>>>>> expression of the loop macro itself so users can't omit it.
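
The change being discussed is essentially of this shape (my paraphrase of
the description above, not the actual diff; the drmem_lmb_next() helper is
just for illustration, types and field names as in
arch/powerpc/include/asm/drmem.h):

	static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb)
	{
		/* Yield in the iteration expression so every user of the
		 * macro reschedules without having to remember to. */
		cond_resched();
		return ++lmb;
	}

	#define for_each_drmem_lmb_in_range(lmb, start, end)		\
		for ((lmb) = (start); (lmb) <= (end); (lmb) = drmem_lmb_next(lmb))

	#define for_each_drmem_lmb(lmb)					\
		for_each_drmem_lmb_in_range((lmb),			\
			&drmem_info->lmbs[0],				\
			&drmem_info->lmbs[drmem_info->n_lmbs - 1])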
>>>>
>>>> Isn't it too much to call cond_resched() on every LMB?
>>>>
>>>> Could that be done less frequently, say every 10 or 100? I don't really know.
>>>
>>> Everything done within for_each_drmem_lmb is relatively heavyweight
>>> already. E.g. calling dlpar_remove_lmb()/dlpar_add_lmb() can take dozens
>>> of milliseconds. I don't think cond_resched() is an expensive check in
>>> this context.
>>
>> Hmm, mostly.
>>
>> But there are quite a few cases like drmem_update_dt_v1():
>>
>> 	for_each_drmem_lmb(lmb) {
>> 		dr_cell->base_addr = cpu_to_be64(lmb->base_addr);
>> 		dr_cell->drc_index = cpu_to_be32(lmb->drc_index);
>> 		dr_cell->aa_index = cpu_to_be32(lmb->aa_index);
>> 		dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb));
>>
>> 		dr_cell++;
>> 	}
>>
>> Which will compile to a pretty tight loop at the moment.
>>
>> Or drmem_update_dt_v2() which has two loops over all lmbs.
>>
>> And although the actual TIF check is cheap, the function call to do it is
>> not free.
>>
>> So I worry this is going to make some of those long loops take even
>> longer.
>
> That's fair, and I was wrong - some of the loop bodies are relatively
> simple, not doing allocations or taking locks, etc.
>
> One way to deal is to keep for_each_drmem_lmb() as-is and add a new
> iterator that can reschedule, e.g. for_each_drmem_lmb_slow().

If we did that, how many call-sites would need converting?
Is it ~2 or ~20 or ~200?
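
Something like this split is what I imagine you mean (untested sketch;
for_each_drmem_lmb_slow() is just the name suggested above):

	/* Tight loops keep the plain iterator. */
	#define for_each_drmem_lmb(lmb)					\
		for_each_drmem_lmb_in_range((lmb),			\
			&drmem_info->lmbs[0],				\
			&drmem_info->lmbs[drmem_info->n_lmbs - 1])

	/* Hypothetical variant for heavyweight loop bodies that should yield. */
	#define for_each_drmem_lmb_slow(lmb)				\
		for ((lmb) = &drmem_info->lmbs[0];			\
		     (lmb) <= &drmem_info->lmbs[drmem_info->n_lmbs - 1];	\
		     cond_resched(), (lmb)++)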

> On the other hand... it's probably not too strong to say that the
> drmem/hotplug code is in crisis with respect to correctness and
> algorithmic complexity, so those are my overriding concerns right
> now. Yes, this change will pessimize loops that are reinitializing the
> entire drmem_lmb array on every DLPAR operation, but:
>
> 1. it doesn't make any user of for_each_drmem_lmb() less correct;
> 2. why is this code doing that in the first place, other than to
>    accommodate a poor data structure choice?
>
> The durations of the system calls where this code runs are measured in
> minutes or hours on large configurations because of all the behaviors
> that are at best O(n) with the amount of memory assigned to the
> partition. For simplicity's sake I'd rather defer lower-level
> performance considerations like this until the drmem data structures'
> awful lookup properties are fixed -- hopefully in the 5.10 timeframe.

Yeah fair enough.
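
(If the per-LMB call ever does show up once the bigger problems are fixed,
it could presumably be batched along the lines Laurent suggested, e.g. only
rescheduling every N lmbs. Sketch only, the interval is arbitrary:)

	static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb)
	{
		const unsigned int resched_interval = 20;

		/* Yield only once every resched_interval lmbs. */
		if ((lmb - drmem_info->lmbs) % resched_interval == 0)
			cond_resched();

		return ++lmb;
	}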

cheers

