[PATCH 1/2] cpuidle : auto-promotion for cpuidle states

Thu Apr 4 21:21:36 AEDT 2019

Hi Abhishek,

Thanks for taking the time to test the different scenarios and give us
the numbers.

On 01/04/2019 07:11, Abhishek wrote:
> 
> 
> On 03/22/2019 06:56 PM, Daniel Lezcano wrote:
>> On 22/03/2019 10:45, Rafael J. Wysocki wrote:
>>> On Fri, Mar 22, 2019 at 8:31 AM Abhishek Goel
>>> <huntbag at linux.vnet.ibm.com> wrote:
>>>> Currently, the cpuidle governors (menu /ladder) determine what idle
>>>> state
>>>> an idling CPU should enter into based on heuristics that depend on the
>>>> idle history on that CPU. Given that no predictive heuristic is
>>>> perfect,
>>>> there are cases where the governor predicts a shallow idle state,
>>>> hoping
>>>> that the CPU will be busy soon. However, if no new workload is
>>>> scheduled
>>>> on that CPU in the near future, the CPU will end up in the shallow
>>>> state.
>>>>
>>>> In case of POWER, this is problematic, when the predicted state in the
>>>> aforementioned scenario is a lite stop state, as such lite states will
>>>> inhibit SMT folding, thereby depriving the other threads in the core
>>>> from
>>>> using the core resources.

I can understand that an idle state can prevent other threads from using
the core resources. But why does a deeper idle state not prevent this as
well?

>>>> To address this, such lite states need to be autopromoted. The cpuidle-
>>>> core can queue timer to correspond with the residency value of the next
>>>> available state. Thus leading to auto-promotion to a deeper idle
>>>> state as
>>>> soon as possible.
>>> Isn't the tick stopping avoidance sufficient for that?
>> I was about to ask the same :)
>>
>>
>>
>>
> Thanks for the review.
> I performed experiments for three scenarios to collect some data.
> 
> case 1 :
> Without this patch and without tick retained, i.e. in an upstream kernel,
> it would take more than a second to get out of stop0_lite.
> 
> case 2 : With tick retained(as suggested) -
> 
> Generally, we have a sched tick every 4ms (CONFIG_HZ = 250). Ideally I expected
> it to take 8 sched ticks to get out of stop0_lite. Experimentally, the
> observation was
> 
> ===================================
> min            max            99percentile
> 4ms            12ms          4ms
> ===================================
> *ms = milliseconds
> 
> It would take at least one sched tick to get out of stop0_lite.
> 
> case 3 : With this patch (not stopping tick, but explicitly queuing a
> timer)
> 
> ===================================
> min            max            99.5percentile
> 144us          192us          144us
> ===================================
> *us = microseconds
> 
> In this patch, we queue a timer just before entering into a stop0_lite
> state. The timer fires at (residency of next available state + exit
> latency of next available state * 2).

For context, we have a similar issue, but from the power management
point of view: a CPU can stay in a shallow state for a long period and
thus consume a lot of energy.

The window was reduced by not stopping the tick when a shallow state is
selected. Unfortunately, if the tick is already stopped and we exit and
re-enter idle and select a shallow state again, the situation is the same.

A timer-based solution, like the one in this patch, was proposed and
merged some years ago, but there were complaints about a bad performance
impact, so it was reverted.

> Let's say the next state (stop0) is available and has a residency of 20us; it
> should get out in as low as (20+2*2)*8 [Based on the formula (residency +
> 2xlatency)*history length] microseconds = 192us. Ideally we would expect 8
> iterations, it was observed to get out in 6-7 iterations.
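
To check that I read the numbers correctly, here is the arithmetic as I
understand it from the description above. This is only a sketch: the
function names are made up for illustration and are not from the patch,
and the 2us and 10us exit latencies are what I infer from the quoted
figures "(20+2*2)" and "(100+2*10)".

```python
# Sketch of the quoted formula: (residency + 2 * exit latency) * history length.
# All names here are hypothetical; none of this is code from the patch.

HISTORY_LENGTH = 8  # "history length" as stated in the quoted formula


def promotion_timeout_us(residency_us, exit_latency_us):
    """Timer value for one idle entry, in microseconds."""
    return residency_us + 2 * exit_latency_us


def time_to_promote_us(residency_us, exit_latency_us, iterations=HISTORY_LENGTH):
    """Total time spent in the shallow state after `iterations` timer expiries."""
    return promotion_timeout_us(residency_us, exit_latency_us) * iterations


# stop0 as next state: residency 20us, exit latency assumed to be 2us
print(time_to_promote_us(20, 2))                 # 192
# stop2 as next state: residency 100us, exit latency assumed to be 10us
print(time_to_promote_us(100, 10))               # 960
# leaving after 6 iterations instead of 8 gives the reported minimum
print(time_to_promote_us(20, 2, iterations=6))   # 144
```

With these assumptions the 192us maximum corresponds to 8 iterations and
the 144us minimum to 6 iterations, which matches the measurements.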

Can you explain the formula? I don't get the rationale. Why use the
exit latency, and why multiply it by 2?

Why is the timer not set to the next state's target residency value?

> Even if let's say stop2 is next available state(stop0 and stop1 both are
> unavailable), it would take (100+2*10)*8 = 960us to get into stop2.
> 
> So, We are able to get out of stop0_lite generally in 150us(with this
> patch) as
> compared to 4ms(with tick retained). As stated earlier, we do not want
> to get
> stuck into stop0_lite as it inhibits SMT folding for other sibling
> threads, depriving
> them of core resources. Current patch is using auto-promotion only for
> stop0_lite,
> as it gives performance benefit(primary reason) along with lowering down
> power
> consumption. We may extend this model for other states in future.
> 
> --Abhishek
> 

-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
