tlb flushing on Power

Thu Mar 8 08:18:56 EST 2012

On 03/06/2012 11:28 PM, Michael Neuling wrote:
> Seth,
> 
>> Thanks for the help!  I was wondering if you could take a look at something
>> for me.
>>
>> I've been working on this staging driver (zsmalloc memory allocator)
>> that does virtual mapping of two pages.
>>
>> I have a github repo with the driver and the unsubmitted changes.  I'm
>> trying to get the pte/tlb stuff working in a portable way:
>>
>> git://github.com/spartacus06/linux.git (portable branch)
>>
>> The experimental commits are the top 5 and the branch is based on
>> Greg's staging-next + frontswap-v11 patches.
>>
>> Could you take a look at the zs_map_object() and zs_unmap_object()
>> in drivers/staging/zsmalloc/zsmalloc-main.c and see if they should
>> work for PPC64?
> 
> I suggest you post the code directly to the list in reviewable chunks.
> People are much more likely to review it if they don't have to download
> some random git tree, checkout some branch, find the changes, etc etc.

It's hard to summarize out of context, but I'll try.

zsmalloc is designed to store compressed memory pages.  In memory-restricted
environments, allocations from the buddy allocator greater than order 0 are
very likely to fail, so to do this efficiently we have to be able to store
compressed memory pages that span two non-contiguous pages.

zsmalloc uses virtual memory mapping to do this.  There is a per-cpu
struct vm_struct *vm.  This vm is initialized like this:
===
pte_t *vm_ptes[2];
vm = alloc_vm_area(2 * PAGE_SIZE, vm_ptes);
===
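
As a rough user-space analogy of the idea above (not the zsmalloc code, and
not kernel code), the sketch below reserves two contiguous virtual pages and
backs them with two non-adjacent pages of a temporary file, so that an object
written across the virtual page boundary ends up split between the two
backing pages.  The function name span_two_pages_demo() and the choice of a
tmpfile() as stand-in for physical memory are assumptions made purely for
illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * User-space analogy only: a temporary file stands in for physical memory,
 * and file pages 3 and 0 stand in for two non-contiguous physical pages.
 * Returns 0 if an 8-byte object written across the virtual page boundary
 * lands half in each backing page, -1 on any failure.
 */
int span_two_pages_demo(void)
{
    char head[4], tail[4];
    long ps = sysconf(_SC_PAGESIZE);
    FILE *backing;
    char *va;
    int fd, rc = -1;

    if (ps <= 0)
        return -1;
    backing = tmpfile();
    if (backing == NULL)
        return -1;
    fd = fileno(backing);
    if (ftruncate(fd, 4 * ps) != 0)
        goto out_close;

    /* Reserve two contiguous virtual pages (loosely like alloc_vm_area()). */
    va = mmap(NULL, 2 * ps, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (va == MAP_FAILED)
        goto out_close;

    /* "Map": first virtual page -> file page 3, second -> file page 0. */
    if (mmap(va, ps, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
             fd, 3 * ps) == MAP_FAILED ||
        mmap(va + ps, ps, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
             fd, 0) == MAP_FAILED)
        goto out_unmap;

    /* Write an object that straddles the virtual page boundary. */
    memcpy(va + ps - 4, "ABCDEFGH", 8);

    /* Read the backing pages directly to see where the bytes landed. */
    if (pread(fd, head, 4, 4 * ps - 4) != 4 || pread(fd, tail, 4, 0) != 4)
        goto out_unmap;
    if (memcmp(head, "ABCD", 4) == 0 && memcmp(tail, "EFGH", 4) == 0)
        rc = 0;

out_unmap:
    munmap(va, 2 * ps);
out_close:
    fclose(backing);
    return rc;
}
```

Calling span_two_pages_demo() from a small main() and checking for a return
value of 0 is enough to try it on a Linux system.  The user sees one
contiguous 8-byte object while the storage behind it is two unrelated pages,
which is the property the per-cpu vm area is meant to provide.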

When the allocation needs to be accessed, we map the memory like this
(with preemption disabled):
===
set_pte_at(&init_mm, 0, vm_ptes[0], mk_pte(page1, PAGE_KERNEL));
set_pte_at(&init_mm, 0, vm_ptes[1], mk_pte(page2, PAGE_KERNEL));
===

Preemption remains disabled while the user manipulates the allocation.

When the user is done, we unmap by doing:
===
ptep_get_and_clear(&init_mm, vm->addr, vm_ptes[0]);
ptep_get_and_clear(&init_mm, vm->addr + PAGE_SIZE, vm_ptes[1]);

local_flush_tlb_kernel_page(vm->addr);
local_flush_tlb_kernel_page(vm->addr + PAGE_SIZE);
===

In my patch, I've defined local_flush_tlb_kernel_page as:

#define local_flush_tlb_kernel_page(kaddr) local_flush_tlb_page(NULL, kaddr)

in arch/powerpc/include/asm/tlbflush.h

For PPC64 (the platform I'm testing on) local_flush_tlb_page() is a no-op
because that architecture uses a hashed page table; however, it isn't a no-op
on other archs and the whole point of this change is portability.

My understanding is that the tlb flush on PPC64 happens as a result of
the pte_update() eventually called from ptep_get_and_clear().

The problem is that this doesn't always seem to work.  I get corruption
in the mapped memory from time to time.  

I've isolated the issue to this mapping code.
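
To make the concern concrete, here is a toy user-space model (not kernel
code, and not a description of how the PPC64 MMU actually behaves) of how a
cached translation can outlive a cleared pte when no flush happens between
unmap and the next map.  All names (struct toy_mmu, toy_map(), toy_unmap(),
toy_translate()) are hypothetical, and the model is only a sketch of the
general failure mode, not a diagnosis of the corruption described above.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_SLOTS   2
#define TOY_NO_PAGE (-1)

/* Hypothetical model: pte[] is the page table, tlb[] caches lookups. */
struct toy_mmu {
    int  pte[TOY_SLOTS];        /* backing page per virtual slot */
    int  tlb[TOY_SLOTS];        /* cached translation per slot */
    bool tlb_valid[TOY_SLOTS];  /* whether the cached entry is live */
};

void toy_init(struct toy_mmu *m)
{
    for (int i = 0; i < TOY_SLOTS; i++) {
        m->pte[i] = TOY_NO_PAGE;
        m->tlb[i] = TOY_NO_PAGE;
        m->tlb_valid[i] = false;
    }
}

/* Install a translation in the page table only; the cache is untouched. */
void toy_map(struct toy_mmu *m, int slot, int page)
{
    assert(slot >= 0 && slot < TOY_SLOTS);
    m->pte[slot] = page;
}

/* Clear the pte; drop the cached entry only if the caller asks for a flush. */
void toy_unmap(struct toy_mmu *m, int slot, bool flush)
{
    assert(slot >= 0 && slot < TOY_SLOTS);
    m->pte[slot] = TOY_NO_PAGE;
    if (flush)
        m->tlb_valid[slot] = false;
}

/* Translate: a cached entry wins; on a miss, walk the pte and cache it. */
int toy_translate(struct toy_mmu *m, int slot)
{
    assert(slot >= 0 && slot < TOY_SLOTS);
    if (m->tlb_valid[slot])
        return m->tlb[slot];
    if (m->pte[slot] != TOY_NO_PAGE) {
        m->tlb[slot] = m->pte[slot];
        m->tlb_valid[slot] = true;
    }
    return m->pte[slot];
}
```

In this model, mapping slot 0 to page 7, touching it, unmapping without a
flush, and remapping the slot to page 9 makes toy_translate() keep returning
page 7; the same sequence with a flush returns page 9.  In the real code that
would mean accesses through vm->addr hitting the previously mapped page.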

Does anyone see anything glaringly wrong with this approach?

Should ptep_get_and_clear() also flush the tlb entry (asynchronously I think)?

For comparison, this code is stable on x86, where local_flush_tlb_kernel_page()
is defined as __flush_tlb_one().

> If it's work in progress, mark it as an RFC patch and note what issues
> you think still exist.  
> 
> Mikey