[PATCH v2 1/1] cpuidle-powernv : forced wakeup for stop states

Abhishek huntbag at linux.vnet.ibm.com
Wed Jun 26 19:09:26 AEST 2019


Hi Nick,


On 06/19/2019 03:39 PM, Nicholas Piggin wrote:
> Abhishek's on June 19, 2019 7:08 pm:
>> Hi Nick,
>>
>> Thanks for the review. Some replies below.
>>
>> On 06/19/2019 09:53 AM, Nicholas Piggin wrote:
>>> Abhishek Goel's on June 17, 2019 7:56 pm:
>>>> Currently, the cpuidle governors determine what idle state a idling CPU
>>>> should enter into based on heuristics that depend on the idle history on
>>>> that CPU. Given that no predictive heuristic is perfect, there are cases
>>>> where the governor predicts a shallow idle state, hoping that the CPU will
>>>> be busy soon. However, if no new workload is scheduled on that CPU in the
>>>> near future, the CPU may end up in the shallow state.
>>>>
>>>> This is problematic, when the predicted state in the aforementioned
>>>> scenario is a shallow stop state on a tickless system. As we might get
>>>> stuck into shallow states for hours, in absence of ticks or interrupts.
>>>>
>>>> To address this, We forcefully wakeup the cpu by setting the
>>>> decrementer. The decrementer is set to a value that corresponds with the
>>>> residency of the next available state. Thus firing up a timer that will
>>>> forcefully wakeup the cpu. Few such iterations will essentially train the
>>>> governor to select a deeper state for that cpu, as the timer here
>>>> corresponds to the next available cpuidle state residency. Thus, cpu will
>>>> eventually end up in the deepest possible state.
>>>>
>>>> Signed-off-by: Abhishek Goel <huntbag at linux.vnet.ibm.com>
>>>> ---
>>>>
>>>> Auto-promotion
>>>>    v1 : started as auto promotion logic for cpuidle states in generic
>>>> driver
>>>>    v2 : Removed timeout_needed and rebased the code to upstream kernel
>>>> Forced-wakeup
>>>>    v1 : New patch with name of forced wakeup started
>>>>    v2 : Extending the forced wakeup logic for all states. Setting the
>>>> decrementer instead of queuing up a hrtimer to implement the logic.
>>>>
>>>>    drivers/cpuidle/cpuidle-powernv.c | 38 +++++++++++++++++++++++++++++++
>>>>    1 file changed, 38 insertions(+)
>>>>
>>>> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
>>>> index 84b1ebe212b3..bc9ca18ae7e3 100644
>>>> --- a/drivers/cpuidle/cpuidle-powernv.c
>>>> +++ b/drivers/cpuidle/cpuidle-powernv.c
>>>> @@ -46,6 +46,26 @@ static struct stop_psscr_table stop_psscr_table[CPUIDLE_STATE_MAX] __read_mostly
>>>>    static u64 default_snooze_timeout __read_mostly;
>>>>    static bool snooze_timeout_en __read_mostly;
>>>>    
>>>> +static u64 forced_wakeup_timeout(struct cpuidle_device *dev,
>>>> +				 struct cpuidle_driver *drv,
>>>> +				 int index)
>>>> +{
>>>> +	int i;
>>>> +
>>>> +	for (i = index + 1; i < drv->state_count; i++) {
>>>> +		struct cpuidle_state *s = &drv->states[i];
>>>> +		struct cpuidle_state_usage *su = &dev->states_usage[i];
>>>> +
>>>> +		if (s->disabled || su->disable)
>>>> +			continue;
>>>> +
>>>> +		return (s->target_residency + 2 * s->exit_latency) *
>>>> +			tb_ticks_per_usec;
>>>> +	}
>>>> +
>>>> +	return 0;
>>>> +}
>>> It would be nice to not have this kind of loop iteration in the
>>> idle fast path. Can we add a flag or something to the idle state?
>> Currently, we do not have any callback notification or some feedback that
>> notifies the driver everytime some state is enabled/disabled. So we have
>> to parse everytime to get the next enabled state.
> Ahh, that's why you're doing that.
>
>> Are you suggesting to
>> add something like next_enabled_state in cpuidle state structure itself
>> which will be updated when a state is enabled or disabled?
> Hmm, I guess it normally should not iterate over more than one state
> unless some idle states are disabled.
>
> What would have been nice is each state just have its own timeout
> field with ticks already calculated, if that could be updated when
> a state is enabled or disabled. How hard is that to add to the
> cpuidle core?

I have implemented a prototype which does what you have asked for. Added
a  disable_callback which will update timeout whenever a state is 
enabled or
disabled. But It would mean adding some code to cpuidle.h and 
cpuidle/sysfs.c.
If that is not an issue, should I go ahead and post it?
>>>> +
>>>>    static u64 get_snooze_timeout(struct cpuidle_device *dev,
>>>>    			      struct cpuidle_driver *drv,
>>>>    			      int index)
>>>> @@ -144,8 +164,26 @@ static int stop_loop(struct cpuidle_device *dev,
>>>>    		     struct cpuidle_driver *drv,
>>>>    		     int index)
>>>>    {
>>>> +	u64 dec_expiry_tb, dec, timeout_tb, forced_wakeup;
>>>> +
>>>> +	dec = mfspr(SPRN_DEC);
>>>> +	timeout_tb = forced_wakeup_timeout(dev, drv, index);
>>>> +	forced_wakeup = 0;
>>>> +
>>>> +	if (timeout_tb && timeout_tb < dec) {
>>>> +		forced_wakeup = 1;
>>>> +		dec_expiry_tb = mftb() + dec;
>>>> +	}
>>> The compiler probably can't optimise away the SPR manipulations so try
>>> to avoid them if possible.
>> Are you suggesting something like set_dec_before_idle?(in line with
>> what you have suggested to do after idle, reset_dec_after_idle)
> I should have been clear, I meant don't mfspr(SPRN_DEC) until you
> have tested timeout_tb.
>
>>>> +
>>>> +	if (forced_wakeup)
>>>> +		mtspr(SPRN_DEC, timeout_tb);
>>> This should just be put in the above 'if'.
>> Fair point.
>>>> +
>>>>    	power9_idle_type(stop_psscr_table[index].val,
>>>>    			 stop_psscr_table[index].mask);
>>>> +
>>>> +	if (forced_wakeup)
>>>> +		mtspr(SPRN_DEC, dec_expiry_tb - mftb());
>>> This will sometimes go negative and result in another timer interrupt.
>>>
>>> It also breaks irq work (which can be set here by machine check I
>>> believe.
>>>
>>> May need to implement some timer code to do this for you.
>>>
>>> static void reset_dec_after_idle(void)
>>> {
>>> 	u64 now;
>>>           u64 *next_tb;
>>>
>>> 	if (test_irq_work_pending())
>>> 		return;
>>> 	now = mftb;
>>> 	next_tb = this_cpu_ptr(&decrementers_next_tb);
>>>
>>> 	if (now >= *next_tb)
>>> 		return;
>>> 	set_dec(*next_tb - now);
>>> 	if (test_irq_work_pending())
>>> 		set_dec(1);
>>> }
>>>
>>> Something vaguely like that. See timer_interrupt().
>> Ah, Okay. Will go through timer_interrupt().
> Thanks,
> Nick

Thanks,
Abhishek



More information about the Linuxppc-dev mailing list