[PATCH 00/16] Remove hash page table slot tracking from linux PTE

Aneesh Kumar K.V aneesh.kumar at linux.vnet.ibm.com
Tue Oct 31 00:49:25 AEDT 2017


"Aneesh Kumar K.V" <aneesh.kumar at linux.vnet.ibm.com> writes:

> "Aneesh Kumar K.V" <aneesh.kumar at linux.vnet.ibm.com> writes:
>
>
>> I looked at the perf data; with the test, we are doing a larger number
>> of hash faults and then around 10k flush_hash_range calls. Could the
>> small improvement in the numbers be due to the fact that we no longer
>> store the slot number when doing an insert? Also, in the flush path we
>> no longer use real_pte_t.
>>
>
> With THP disabled I see the following.
>
> Without patch
>
>     35.62%  a.out    [kernel.vmlinux]            [k] clear_user_page
>      8.54%  a.out    [kernel.vmlinux]            [k] __lock_acquire
>      3.86%  a.out    [kernel.vmlinux]            [k] native_flush_hash_range
>      3.38%  a.out    [kernel.vmlinux]            [k] save_context_stack
>      2.98%  a.out    a.out                       [.] main
>      2.59%  a.out    [kernel.vmlinux]            [k] lock_acquire
>      2.29%  a.out    [kernel.vmlinux]            [k] mark_lock
>      2.23%  a.out    [kernel.vmlinux]            [k] native_hpte_insert
>      1.87%  a.out    [kernel.vmlinux]            [k] get_mem_cgroup_from_mm
>      1.71%  a.out    [kernel.vmlinux]            [k] rcu_lockdep_current_cpu_online
>      1.68%  a.out    [kernel.vmlinux]            [k] lock_release
>      1.47%  a.out    [kernel.vmlinux]            [k] __handle_mm_fault
>      1.41%  a.out    [kernel.vmlinux]            [k] validate_sp
>
>
> With patch
>     35.40%  a.out    [kernel.vmlinux]            [k] clear_user_page
>      8.82%  a.out    [kernel.vmlinux]            [k] __lock_acquire
>      3.66%  a.out    a.out                       [.] main
>      3.49%  a.out    [kernel.vmlinux]            [k] save_context_stack
>      2.77%  a.out    [kernel.vmlinux]            [k] lock_acquire
>      2.45%  a.out    [kernel.vmlinux]            [k] mark_lock
>      1.80%  a.out    [kernel.vmlinux]            [k] get_mem_cgroup_from_mm
>      1.80%  a.out    [kernel.vmlinux]            [k] native_hpte_insert
>      1.79%  a.out    [kernel.vmlinux]            [k] rcu_lockdep_current_cpu_online
>      1.78%  a.out    [kernel.vmlinux]            [k] lock_release
>      1.73%  a.out    [kernel.vmlinux]            [k] native_flush_hash_range
>      1.53%  a.out    [kernel.vmlinux]            [k] __handle_mm_fault
>
> That is, we are now spending less time in native_flush_hash_range
> (3.86% -> 1.73% of samples).
>
> -aneesh

One possible explanation is that, with slot tracking, we do:

	slot += hidx & _PTEIDX_GROUP_IX;
	hptep = htab_address + slot;
	want_v = hpte_encode_avpn(vpn, psize, ssize);
	/* lock taken unconditionally, before we know whether it matches */
	native_lock_hpte(hptep);
	hpte_v = be64_to_cpu(hptep->v);
	if (cpu_has_feature(CPU_FTR_ARCH_300))
		hpte_v = hpte_new_to_old_v(hpte_v,
				be64_to_cpu(hptep->r));
	/* on a mismatch we still paid for the lock */
	if (!HPTE_V_COMPARE(hpte_v, want_v) ||
	    !(hpte_v & HPTE_V_VALID))
		native_unlock_hpte(hptep);


and without slot tracking we do:

	for (i = 0; i < HPTES_PER_GROUP; i++, hptep++) {
		/* check locklessly first */
		hpte_v = be64_to_cpu(hptep->v);
		if (cpu_has_feature(CPU_FTR_ARCH_300))
			hpte_v = hpte_new_to_old_v(hpte_v, be64_to_cpu(hptep->r));
		if (!HPTE_V_COMPARE(hpte_v, want_v) || !(hpte_v & HPTE_V_VALID))
			continue;

		native_lock_hpte(hptep);
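
For context, "slot" in the first snippet has already been derived from the
hidx bits tracked alongside the Linux PTE, whereas the second snippet only
knows the hash group and has to search it. Roughly like this (paraphrased
from hash_native_64.c; an approximate sketch, not code from this series):

	/* with slot tracking: the hidx recorded at insert time pinpoints
	 * the slot, including whether it went to the secondary bucket */
	hash = hpt_hash(vpn, shift, ssize);
	hidx = __rpte_to_hidx(rpte, index);
	if (hidx & _PTEIDX_SECONDARY)
		hash = ~hash;
	slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
	slot += hidx & _PTEIDX_GROUP_IX;

	/* without slot tracking: only the group is known, so up to
	 * HPTES_PER_GROUP entries may have to be inspected */
	hash = hpt_hash(vpn, shift, ssize);
	slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
	hptep = htab_address + slot;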

That is, without the patch series we always take the hpte lock, even if
the hpte doesn't match. Hence, in perf annotate, we find the lock to be
highly contended without the patch series.

I will change that to compare the hpte without taking the lock first and
see if that has any impact; a rough sketch of what I mean is below.
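
Something along these lines (an untested sketch, reusing the lockless-check
pattern from the search loop above; helper names as in hash_native_64.c):

	hptep = htab_address + slot;
	want_v = hpte_encode_avpn(vpn, psize, ssize);

	/* check locklessly first, as the group-search loop already does */
	hpte_v = be64_to_cpu(hptep->v);
	if (cpu_has_feature(CPU_FTR_ARCH_300))
		hpte_v = hpte_new_to_old_v(hpte_v, be64_to_cpu(hptep->r));
	if (!HPTE_V_COMPARE(hpte_v, want_v) || !(hpte_v & HPTE_V_VALID))
		continue;	/* nothing to invalidate, no lock taken */

	/* entry looks like ours: take the lock and re-check under it */
	native_lock_hpte(hptep);
	hpte_v = be64_to_cpu(hptep->v);
	if (cpu_has_feature(CPU_FTR_ARCH_300))
		hpte_v = hpte_new_to_old_v(hpte_v, be64_to_cpu(hptep->r));
	if (!HPTE_V_COMPARE(hpte_v, want_v) || !(hpte_v & HPTE_V_VALID))
		native_unlock_hpte(hptep);
	else
		/* clearing ->v invalidates and drops the lock bit at once */
		hptep->v = 0;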

-aneesh


