[PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal

Nathan Lynch nathanl at linux.ibm.com
Fri Jul 31 23:52:55 AEST 2020


Michael Ellerman <mpe at ellerman.id.au> writes:

> Nathan Lynch <nathanl at linux.ibm.com> writes:
>> Michael Ellerman <mpe at ellerman.id.au> writes:
>>> Nathan Lynch <nathanl at linux.ibm.com> writes:
>>>> Laurent Dufour <ldufour at linux.ibm.com> writes:
>>>>> On 28/07/2020 at 19:37, Nathan Lynch wrote:
>>>>>> The drmem lmb list can have hundreds of thousands of entries, and
>>>>>> unfortunately lookups take the form of linear searches. As long as
>>>>>> this is the case, traversals have the potential to monopolize the CPU
>>>>>> and provoke lockup reports, workqueue stalls, and the like unless
>>>>>> they explicitly yield.
>>>>>> 
>>>>>> Rather than placing cond_resched() calls within various
>>>>>> for_each_drmem_lmb() loop blocks in the code, put it in the iteration
>>>>>> expression of the loop macro itself so users can't omit it.
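
To make the idea concrete, the change is roughly of this shape (a sketch
only, not the actual diff; the drmem_lmb_next() helper name and the exact
macro bounds here are illustrative):

    static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb)
    {
            /* cond_resched() is from <linux/sched.h>; yield if needed */
            cond_resched();
            return ++lmb;
    }

    #define for_each_drmem_lmb_in_range(lmb, start, end)                \
            for ((lmb) = (start); (lmb) < (end); (lmb) = drmem_lmb_next(lmb))

    #define for_each_drmem_lmb(lmb)                                     \
            for_each_drmem_lmb_in_range((lmb),                          \
                    &drmem_info->lmbs[0],                               \
                    &drmem_info->lmbs[drmem_info->n_lmbs])
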
>>>>>
>>>>> Isn't it too much to call cond_resched() on every LMB?
>>>>>
>>>>> Could that be done less frequently, say every 10 or 100? I don't
>>>>> really know.
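
If the per-entry call did turn out to be measurable, the drmem_lmb_next()
helper sketched above could throttle itself, e.g. (interval picked
arbitrarily, purely to illustrate):

    static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb)
    {
            const unsigned int resched_interval = 20;       /* arbitrary */

            /* lmbs live in one contiguous array, so the index is cheap */
            if ((lmb - drmem_info->lmbs) % resched_interval == 0)
                    cond_resched();

            return ++lmb;
    }
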
>>>>
>>>> Everything done within for_each_drmem_lmb is relatively heavyweight
>>>> already. E.g. calling dlpar_remove_lmb()/dlpar_add_lmb() can take dozens
>>>> of milliseconds. I don't think cond_resched() is an expensive check in
>>>> this context.
>>>
>>> Hmm, mostly.
>>>
>>> But there are quite a few cases like drmem_update_dt_v1():
>>>
>>> 	for_each_drmem_lmb(lmb) {
>>> 		dr_cell->base_addr = cpu_to_be64(lmb->base_addr);
>>> 		dr_cell->drc_index = cpu_to_be32(lmb->drc_index);
>>> 		dr_cell->aa_index = cpu_to_be32(lmb->aa_index);
>>> 		dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb));
>>>
>>> 		dr_cell++;
>>> 	}
>>>
>>> Which will compile to a pretty tight loop at the moment.
>>>
>>> Or drmem_update_dt_v2() which has two loops over all lmbs.
>>>
>>> And although the actual TIF check is cheap, the function call to do it
>>> is not free.
>>>
>>> So I worry this is going to make some of those long loops take even
>>> longer.
>>
>> That's fair, and I was wrong - some of the loop bodies are relatively
>> simple, not doing allocations or taking locks, etc.
>>
>> One way to deal is to keep for_each_drmem_lmb() as-is and add a new
>> iterator that can reschedule, e.g. for_each_drmem_lmb_slow().
>
> If we did that, how many call-sites would need converting?
> Is it ~2 or ~20 or ~200?

At a glance I would convert 15-20 out of the 24 users in the tree I'm
looking at. Let me know if I should do a v2 with that approach.
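
For concreteness, the v2 would leave the existing iterator untouched and
add something along these lines (again only a sketch; the exact bounds
would follow whatever drmem.h already does):

    /* existing tight loop, unchanged, for the simple callers */
    #define for_each_drmem_lmb(lmb)                                 \
            for_each_drmem_lmb_in_range((lmb),                      \
                    &drmem_info->lmbs[0],                           \
                    &drmem_info->lmbs[drmem_info->n_lmbs])

    /* variant for heavyweight loop bodies: yields on every iteration */
    #define for_each_drmem_lmb_slow(lmb)                            \
            for ((lmb) = &drmem_info->lmbs[0];                      \
                 (lmb) < &drmem_info->lmbs[drmem_info->n_lmbs];     \
                 cond_resched(), (lmb)++)

Callers with expensive bodies (the DLPAR add/remove paths, for instance)
would move to the _slow variant, while tight loops like the device tree
update functions would keep the plain one.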

