powerpc: mitigate impact of decrementer reset

Paul Clarke pc at us.ibm.com
Fri Nov 14 06:33:58 AEDT 2014


On 11/12/2014 08:39 PM, Michael Ellerman wrote:
> On Wed, 2014-11-05 at 11:06 -0600, Paul Clarke wrote:
>> On 10/07/2014 09:52 PM, Michael Ellerman wrote:
>>> On Tue, 2014-07-10 at 19:13:24 UTC, Paul Clarke wrote:
>>>> This patch short-circuits the reset of the decrementer, exiting after
>>>> the decrementer reset, but before the housekeeping tasks if the only
>>>> need for the interrupt is simply to reset it.  After this patch,
>>>> the latency spike was measured at about 150 nanoseconds.
>>
>>> Thanks for the excellent changelog. But this patch makes me a bit nervous :)
>>>
>>> Do you know where the latency is coming from? Is it primarily the irq work?
>>
>> Yes, it is all under irq_enter (measured at ~10us) and irq_exit (~12us).
>
> Hmm, OK. I actually meant irq_work_run().
>
> AIUI irq_enter/exit() are just state tracking, they shouldn't be actually
> running work.
>
> How are you measuring it?

ftrace function_graph tracer:
--
   127.425212 |                |  .irq_enter() {
   127.425213 |                |    .rcu_irq_enter() {
   127.425213 |  + 12.206 us   |      .rcu_eqs_exit_common.isra.41();
   127.425226 |  + 12.750 us   |    }
... RCU is a big hitter
   127.425226 |                |    .vtime_common_account_irq_enter() {
   127.425226 |                |      .vtime_account_user() {
   127.425226 |    0.032 us    |        ._raw_spin_lock();
   127.425227 |    0.034 us    |        .get_vtime_delta();
   127.425227 |                |        .account_user_time() {
   127.425228 |    0.030 us    |          .cpuacct_account_field();
   127.425228 |                |          .acct_account_cputime() {
   127.425228 |    0.082 us    |            .__acct_update_integrals();
   127.425229 |    0.562 us    |          }
   127.425229 |    1.500 us    |        }
   127.425229 |    2.954 us    |      }
   127.425230 |    3.434 us    |    }
... but even accounting is not insignificant
   127.425230 |  + 17.218 us   |  }
   127.425230 |                |  /* timer_interrupt_entry: [...] */
... nothing to see here, because there's nothing to do except reset the 
decrementer
   127.425230 |                |  /* timer_interrupt_exit: [...] */
... (less than 1 us spent doing the "required" work)
   127.425231 |                |  .irq_exit() {
   127.425231 |                |    .vtime_gen_account_irq_exit() {
   127.425231 |    0.036 us    |      ._raw_spin_lock();
   127.425232 |                |      .__vtime_account_system() {
   127.425232 |    0.030 us    |        .get_vtime_delta();
   127.425232 |                |        .account_system_time() {
   127.425233 |    0.030 us    |          .cpuacct_account_field();
   127.425233 |                |          .acct_account_cputime() {
   127.425233 |    0.072 us    |            .__acct_update_integrals();
   127.425234 |    0.564 us    |          }
   127.425234 |    1.546 us    |        }
   127.425234 |    2.528 us    |      }
   127.425235 |    3.700 us    |    }
... significant accounting time
   127.425235 |    0.032 us    |    .idle_cpu();
   127.425235 |                |    .tick_nohz_irq_exit() {
   127.425236 |                |      .can_stop_full_tick() {
   127.425236 |    0.022 us    |        .sched_can_stop_tick();
   127.425236 |    0.020 us    |        .posix_cpu_timers_can_stop_tick()
   127.425237 |    0.970 us    |      }
   127.425237 |    0.082 us    |      .ktime_get();
   127.425238 |                |      .tick_nohz_stop_sched_tick() {
   127.425238 |    0.032 us    |        .timekeeping_max_deferment();
   127.425238 |                |        .get_next_timer_interrupt() {
   127.425239 |    0.038 us    |          ._raw_spin_lock();
   127.425239 |                |          .hrtimer_get_next_event() {
   127.425239 |    0.030 us    |            ._raw_spin_lock_irqsave();
   127.425240 |    0.028 us    |            ._raw_spin_unlock_irqrestore
   127.425240 |    0.984 us    |          }
   127.425241 |    1.936 us    |        }
   127.425241 |    0.032 us    |        .scheduler_tick_max_deferment();
   127.425241 |    3.438 us    |      }
   127.425242 |    5.880 us    |    }
   127.425242 |                |    .rcu_irq_exit() {
   127.425242 |    0.102 us    |      .rcu_eqs_enter_common.isra.40();
   127.425243 |    0.576 us    |    }
   127.425243 |  + 12.156 us   |  }

This one was almost 30 us total (17.218 + 12.156 = 29.374 us), just to 
reset the decrementer.

>>> If so I'd prefer if we could move the short circuit into __timer_interrupt()
>>> itself. That way we'd still have the trace points usable, and it would
>>> hopefully result in less duplicated logic.
>>
>> But irq_enter and irq_exit are called in timer_interrupt, before
>> __timer_interrupt is called.  I don't see how that helps.  The time
>> spent in __timer_interrupt is minuscule by comparison.
>
> Right, it won't help if it's irq_enter() that is causing the delay. But I was
> assuming it was irq_work_run().
>
>> Are you suggesting that irq_enter/exit be moved into __timer_interrupt
>> as well?  (I'm not sure how that would impact the existing call to
>> __timer_interrupt from tick_broadcast_ipi_handler?  And if there is no
>> impact, what's the point of separating timer_interrupt and
>> __timer_interrupt?)
>
> The point is __timer_interrupt() is called from tick_broadcast_ipi_handler(),
> which is called from smp_ipi_demux(), from icp_hv_ipi_action(), from
> __do_irq(), which has already done irq_enter() (and will do irq_exit()).

If that's the only impact, maybe an "IRQ entered" flag would suffice to 
either prevent a 2nd call to irq_enter(), or allow irq_enter to be 
"reentrant" (where it just returns if it was called and the last call 
was not paired with an irq_exit yet?  Alternatively, a new parameter to 
__timer_interrupt() which indicates the same.

PC



More information about the Linuxppc-dev mailing list