<html>

  <head>

    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <p>Hi Balbir,</p>

    <p><br>

    </p>

    <br>

    <div class="moz-cite-prefix">On Friday 16 December 2016 08:16 PM,

      Balbir Singh wrote:<br>

    </div>

    <blockquote

      cite="mid:fd268944-5f8e-9815-6fef-e7d9d5191044@gmail.com"

      type="cite">

      <pre wrap="">

On 15/12/16 03:18, Anju T Sudhakar wrote:

</pre>

      <blockquote type="cite">

        <pre wrap="">This is the V2 patchset of the kprobes jump optimization

(a.k.a. OPTPROBES) for powerpc. Kprobe being an inevitable tool

for kernel developers, enhancing the performance of kprobe has

got much importance.

Currently kprobes inserts a trap instruction to probe a running kernel.

Jump optimization allows kprobes to replace the trap with a branch,

reducing the probe overhead drastically.

In this series, conditional branch instructions are not considered for

optimization as they have to be assessed carefully in SMP systems.

The kprobe placed on the kretprobe_trampoline during boot time, is also

optimized in this series. Patch 4/4 furnishes this.

The first two patches can go independently of the series. The helper 

functions in these patches are invoked in patch 3/4.

Performance:

============

An optimized kprobe in powerpc is 1.05 to 4.7 times faster than a kprobe.

Example:

Placed a probe at an offset 0x50 in _do_fork().

*Time Diff here is, difference in time before hitting the probe and

after the probed instruction. mftb() is employed in kernel/fork.c for

this purpose.

# echo 0 > /proc/sys/debug/kprobes-optimization

Kprobes globally unoptimized

 [  233.607120] Time Diff = 0x1f0

 [  233.608273] Time Diff = 0x1ee

 [  233.609228] Time Diff = 0x203

 [  233.610400] Time Diff = 0x1ec

 [  233.611335] Time Diff = 0x200

 [  233.612552] Time Diff = 0x1f0

 [  233.613386] Time Diff = 0x1ee

 [  233.614547] Time Diff = 0x212

 [  233.615570] Time Diff = 0x206

 [  233.616819] Time Diff = 0x1f3

 [  233.617773] Time Diff = 0x1ec

 [  233.618944] Time Diff = 0x1fb

 [  233.619879] Time Diff = 0x1f0

 [  233.621066] Time Diff = 0x1f9

 [  233.621999] Time Diff = 0x283

 [  233.623281] Time Diff = 0x24d

 [  233.624172] Time Diff = 0x1ea

 [  233.625381] Time Diff = 0x1f0

 [  233.626358] Time Diff = 0x200

 [  233.627572] Time Diff = 0x1ed

# echo 1 > /proc/sys/debug/kprobes-optimization

Kprobes globally optimized

 [   70.797075] Time Diff = 0x103

 [   70.799102] Time Diff = 0x181

 [   70.801861] Time Diff = 0x15e

 [   70.803466] Time Diff = 0xf0

 [   70.804348] Time Diff = 0xd0

 [   70.805653] Time Diff = 0xad

 [   70.806477] Time Diff = 0xe0

 [   70.807725] Time Diff = 0xbe

 [   70.808541] Time Diff = 0xc3

 [   70.810191] Time Diff = 0xc7

 [   70.811007] Time Diff = 0xc0

 [   70.812629] Time Diff = 0xc0

 [   70.813640] Time Diff = 0xda

 [   70.814915] Time Diff = 0xbb

 [   70.815726] Time Diff = 0xc4

 [   70.816955] Time Diff = 0xc0

 [   70.817778] Time Diff = 0xcd

 [   70.818999] Time Diff = 0xcd

 [   70.820099] Time Diff = 0xcb

 [   70.821333] Time Diff = 0xf0

Implementation:

===================

The trap instruction is replaced by a branch to a detour buffer. To address

the limitation of branch instruction in power architecture, detour buffer

slot is allocated from a reserved area. This will ensure that the branch

is within ± 32 MB range. The current kprobes insn caches allocate memory 

area for insn slots with module_alloc(). This will always be beyond 

± 32MB range.

</pre>

      </blockquote>

      <pre wrap="">

The paragraph is a little confusing. We need the detour buffer to be within

+-32 MB, but then you say we always get memory from module_alloc() beyond

32MB.

</pre>

    </blockquote>

    <br>

    The last two lines of the paragraph describe the<b> current </b>method

    that regular kprobes use<br>

    to allocate instruction slots. In our case we can't use

    module_alloc(), since there is no guarantee that the slot it allocates

    will be within the +/- 32MB range.<br>
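    <br>
    A rough user-space sketch of that range test, assuming the 24-bit LI
    field of the powerpc I-form branch (a signed 26-bit byte offset once
    shifted left by 2). The names branch_offset_in_range() and
    slot_reachable() are made up for illustration and are not the
    helpers from the patches:<br>
    <pre wrap="">
```c
/* Reach of a relative branch: signed 26-bit byte offset, word aligned. */
#define BRANCH_MIN_OFFSET (-0x2000000LL)  /* -32 MB */
#define BRANCH_MAX_OFFSET (0x1fffffcLL)   /* +32 MB minus one instruction */

/* Hypothetical helper: can this byte offset be encoded in one branch? */
static int branch_offset_in_range(long long offset)
{
    return offset >= BRANCH_MIN_OFFSET &&
           offset <= BRANCH_MAX_OFFSET &&
           (offset & 0x3) == 0;
}

/* Hypothetical helper: is the slot reachable from the probed address? */
static int slot_reachable(unsigned long long probe_addr,
                          unsigned long long slot_addr)
{
    return branch_offset_in_range((long long)(slot_addr - probe_addr));
}
```
</pre>
    With this, a slot 0x1fffffc bytes above the probe is reachable,
    while a slot 0x2000000 bytes above it is not.<br>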

    <blockquote

      cite="mid:fd268944-5f8e-9815-6fef-e7d9d5191044@gmail.com"

      type="cite">

      <pre wrap="">

</pre>

      <blockquote type="cite">

        <pre wrap="">The detour buffer contains a call to optimized_callback() which in turn

calls the pre_handler(). Once the pre-handler is run, the original

instruction is emulated from the detour buffer itself. Also the detour

buffer is equipped with a branch back to the normal work flow after the

probed instruction is emulated.

</pre>

      </blockquote>

      <pre wrap="">

Does the branch itself use registers that need to be saved? I presume

we are going to rely on the +-32MB, what are the guarantees of success

of such a mechanism?</pre>

    </blockquote>

    <br>

    To branch back to the next instruction after the kprobe's

    pre-handler has run,<br>

    we place the branch instruction in the detour buffer itself. Hence

    we don't have to clobber any registers<br>

    after restoring them.<br>

    Before optimizing the kprobe we make sure that both the 'branch to

    detour buffer' and the 'branch back from detour buffer' are within

    the +/- 32MB range. This ensures that the optimized kprobe works.<br>
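    <br>
    Roughly, the check described above could look like the following
    user-space sketch. It assumes the standard I-form encoding (primary
    opcode 18, AA = 0, LK = 0); encode_branch() and can_optimize() are
    hypothetical names, and the patches may structure this differently:<br>
    <pre wrap="">
```c
/* Unconditional relative branch: primary opcode 18, AA = 0, LK = 0. */
#define PPC_INST_BRANCH 0x48000000u
/* LI field: word-aligned byte offset, sign-extended by the CPU. */
#define PPC_LI_MASK     0x03fffffcu

/* Hypothetical helper: does the byte offset fit a relative branch? */
static int offset_fits(long long off)
{
    return off >= -0x2000000LL && off <= 0x1fffffcLL && (off & 0x3) == 0;
}

/* Hypothetical helper: encode "b to" placed at "from"; 0 means out of range. */
static unsigned int encode_branch(unsigned long long from,
                                  unsigned long long to)
{
    long long off = (long long)(to - from);

    if (!offset_fits(off))
        return 0;
    return PPC_INST_BRANCH | ((unsigned int)off & PPC_LI_MASK);
}

/*
 * Optimize only if both the branch into the detour buffer and the branch
 * back to the instruction after the probe can be encoded.
 */
static int can_optimize(unsigned long long probe_addr,
                        unsigned long long buf_entry,
                        unsigned long long buf_branch_back)
{
    return encode_branch(probe_addr, buf_entry) != 0 &&
           encode_branch(buf_branch_back, probe_addr + 4) != 0;
}
```
</pre>
    Since both branches are plain relative branches with the offset
    encoded in the instruction, no register is needed to hold the target.<br>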

    <br>

    <br>

    Thanks,<br>

    Anju <br>

    <br>

    <blockquote

      cite="mid:fd268944-5f8e-9815-6fef-e7d9d5191044@gmail.com"

      type="cite">

      <pre wrap="">

Balbir Singh.

</pre>

    </blockquote>

    <br>

  </body>

</html>