[PATCH 1/2] tickless idle cpus: core patch - v2

Tue Apr 25 01:39:53 EST 2006

Paul,
	At the outset I am slightly confused on whether you are
proposing to allow boot cpus to skip ticks or not. A clarification on
this will help!

On Fri, Apr 21, 2006 at 08:49:48PM +1000, Paul Mackerras wrote:
> Is the nohz_cpu_mask accessed by other cpus?  I wonder if using a

Yes, nohz_cpu_mask can be accessed by other CPUs, as part of RCU batch
processing.

> 1-byte per-cpu variable for this, or even a bit in the thread_info
> struct, might be better because it would reduce the number of atomic
> bit set/clear operations we have to do.

Agreed that updating a global bitmask is not a scalable thing to do. But
unfortunately this is tied to RCU implementation ATM. I know that Dipankar has 
plans to get rid of nohz_cpu_mask in the RCU implementation. When that happens, 
maybe we can revist this.

> > +#define MAX_DEC_COUNT	UINT_MAX	/* Decrementer is 32-bit */
> 
> The decrementer value should be restricted to INT_MAX.  I think some
> implementations will take a decrementer interrupt if you set the
> decrementer to a negative value.

Ok. Thanks for pointing it out.

> > +static void account_ticks(struct pt_regs *regs)
> > +{
> > +	int next_dec;
> > +	int cpu = smp_processor_id();
> > +	unsigned long ticks;
> > +
> >  	while ((ticks = tb_ticks_since(per_cpu(last_jiffy, cpu)))
> >  	       >= tb_ticks_per_jiffy) {
> >  		/* Update last_jiffy */
> 
> This is just the loop body from timer_interrupt, right?  Why do we

Correct. Note that this loop body is common to both timer_interrupt and 
start_hz_timer currently.

> have to loop around N times after we skipped N ticks?  What we're
> doing inside the loop is:
> 
> - account_process_time, but we know we were in the idle task, so the
>   time is just idle time.  If we have the accurate accounting stuff
>   turned on the first call to account_process_time will account all
>   the idle time anyway.

Do you mean to say that "the first call to account_system_vtime will account 
all the idle time"? AFAICS account_process_time (->account_process_vtime) is
accounting just usertime.

I had a related question: if we put the idle CPU to some power-saving
state in the period that it is tickless, would its PURR value keep
getting incremented? If not, would that affect the idle time we
calculate after skipping N ticks (in case accurate acct is turned ON
that is)?

> - we're not skipping ticks on the boot cpu (yet), so we won't do the
>   do_timer and timer_recalc_offset calls.

Don't follow you here. I thought we wanted to let the boot cpu to skip
ticks too (as you seem to indicate down below - while talking of
updating xtime/NTP)? If we want to allow that, then we need to be able to
call do_timer with a regs argument?

> I think we could probably rearrange this code to eliminate the need
> for the regs argument - all it's used for is telling whether we were
> in user mode, and we know we weren't since we were in the idle task.
> That would mean that we maybe could call start_hz_timer from the idle
> task body instead of inside do_IRQ etc.  The only thing we'd have to
> watch out for is updating the decrementer if some interrupt handler
> called add_timer/mod_timer etc.

One problem in deferring the call to start_hz_timer like this is RCU. Ideally
we want to clear the tickless idle-CPU from nohz_cpu_mask -as soon- as it
starts processing an interrupt. Otherwise RCU will think that the CPU is in
tickless state (hence not accessing RCU-protected data structures) while the 
idle-CPU is processing interrupts/softirqs and will not include the CPU in its
grace-period processing. This may be a bad thing to allow. 

The other problem in deferring a call to start_hz_timer is we could cause them 
to be running with an incorrect jiffy value (can happen if all cpus were
skipping ticks for a period).

IMO we should let start_hz_timer be called as it is being called now, i.e
immediately when a tickless idle-CPU wakes up. This would also
unfortunately mean that we need to embed calls to start_hz_timer from all
the interrupt paths (do_IRQ, performance_monitor_exception etc) where we
come out of tickless state.

> Assuming we make the changes we have discussed to reduce the skewing
> of the decrementer interrupts quite small, and allow any cpu to call
> do_timer, then how where you thinking that xtime and the NTP stuff
> would get updated, in the situation where all cpus are skipping ticks?
> By doing N calls to do_timer in a row?  Or by adding N-1 to jiffies_64
> and then calling do_timer once?

I assume this implies we want to let boot-cpu to skip ticks?

Ideally it would be good if we add N-1 to jiffies_64 and call do_timer
once. But I wonder if it can be done w/o being racy. In order to calculate
N, we may calculate the difference, delta, between current timebase and
tb_last_stamp as:

	delta = tb_ticks_since(tb_last_stamp);	-> Step 1

We may then calculate N as:

	N = delta / tb_ticks_per_jiffy;		-> Step 2

However this will be racy, if we happen to sample timebase (Step 1) just
before a jiffy boundary. In which case, we would have missed to update
jiffy by 1? To avoid this, we may need to check for the corner case and
introduce a 'goto retry' loop?

BTW, I realized that my patch to let boot cpu to skip ticks also is
slightly buggy here:

                if (cpu != do_timer_cpu)
                       continue;

                write_seqlock(&xtime_lock);
		...
                do_timer(regs);
		...
                write_sequnlock(&xtime_lock);

There needs to a different while loop to call do_timer based on 
tb_ticks_since(tb_last_stamp). Will fix it next time around.

> 
> > +#ifdef CONFIG_NO_IDLE_HZ
> > +	max_skip = __USE_RTC() ? HZ : MAX_DEC_COUNT / tb_ticks_per_jiffy;
> > +#endif
> 
> If we allow the boot cpu to skip ticks, then we will have to make sure
> that at least one cpu wakes up in time to do the updating in
> recalc_timer_offset.

Good point. However is this critical, especially if all cpus had been
idle for quite a while? As I understand, if we don't wakeup in time to
do timer_recalc_offset, the only drawback is the first gettimeofday will
be slow (because it has make a system call)?

-- 
Regards,
vatsa