[PATCH V4 1/3] powerpc/mm: Add support for handling > 512TB address in SLB miss

Michael Ellerman mpe at ellerman.id.au
Fri Mar 16 17:35:24 AEDT 2018


Hi Aneesh,

Some comments ...

"Aneesh Kumar K.V" <aneesh.kumar at linux.vnet.ibm.com> writes:
> For address above 512TB we allocate additional mmu context. To make it all
> easy address above 512TB is handled with IR/DR=1 and with stack frame setup.
>
> We do the additional context allocation in SLB miss handler. If the context is
> not allocated, we enable interrupts and allocate the context and retry the
> access which will again result in a SLB miss.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar at linux.vnet.ibm.com>

This doesn't build for g5_defconfig:

  In file included from ../arch/powerpc/include/asm/mmu.h:314:0,
                   from ../arch/powerpc/include/asm/lppaca.h:36,
                   from ../arch/powerpc/include/asm/paca.h:21,
                   from ../arch/powerpc/include/asm/hw_irq.h:63,
                   from ../arch/powerpc/include/asm/irqflags.h:12,
                   from ../include/linux/irqflags.h:16,
                   from ../include/linux/spinlock.h:54,
                   from ../include/linux/mmzone.h:8,
                   from ../arch/powerpc/include/asm/pgtable.h:7,
                   from ../arch/powerpc/mm/slb.c:17:
  ../arch/powerpc/mm/slb.c: In function ‘handle_multi_context_slb_miss’:
  ../arch/powerpc/include/asm/book3s/64/mmu.h:208:25: error: array subscript is above array bounds [-Werror=array-bounds]
    return ctx->extended_id[index];
           ~~~~~~~~~~~~~~~~^~~~~~~
  ../arch/powerpc/mm/slb.c:417:30: error: array subscript is above array bounds [-Werror=array-bounds]
    if (!mm->context.extended_id[index])
         ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
  ../arch/powerpc/mm/slb.c:418:26: error: array subscript is above array bounds [-Werror=array-bounds]
     mm->context.extended_id[index] = context_id;
     ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~


> diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> index 67c5475311ee..af2ba9875f18 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> @@ -11,6 +11,12 @@
>  #define H_PUD_INDEX_SIZE  9
>  #define H_PGD_INDEX_SIZE  9
>  
> +/*
> + * No of address bits below which we use the default context

Can we use "Number"? "No" means no; to mean number you'd need "No.", but
that's ugly.

> + * for slb allocation. For 4k this is 64TB.
> + */
> +#define H_BITS_FIRST_CONTEXT	46

It's actually the number of address bits *per* context, right?

"Context" is also a bit of a overloaded term, eg. context !=
mm_context_t, maybe "context id" would be clearer?

So maybe H_BITS_PER_CONTEXT_ID?


This value is essentially VA_BITS - CONTEXT_BITS, right?

Except VA_BITS is actually 65 in this case, but that's not clear at all
from the code :/

It'd be really nice if this was calculated, or if the other values
(VA_BITS/CONTEXT_BITS) were calculated based on it.
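i.e. something like this (just a sketch; H_VA_BITS is a made-up name here,
and the arithmetic would need to line up with CONTEXT_BITS in mmu-hash.h):

/*
 * 4K:  65-bit VA, 19 context bits -> 46 bits, 64TB per context id
 * 64K: 68-bit VA, 19 context bits -> 49 bits, 512TB per context id
 */
#define H_BITS_PER_CONTEXT_ID	(H_VA_BITS - CONTEXT_BITS)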

> diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> index 3bcf269f8f55..0ee0fc1ad675 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> @@ -6,6 +6,11 @@
>  #define H_PMD_INDEX_SIZE  10
>  #define H_PUD_INDEX_SIZE  7
>  #define H_PGD_INDEX_SIZE  8

Can we get a blank line between the existing defines and the new code?

> +/*
> + * No of address bits below which we use the default context
> + * for slb allocation. For 64k this is 512TB.
> + */
> +#define H_BITS_FIRST_CONTEXT	49

This really is == VA_BITS - CONTEXT_BITS.

> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> index 50ed64fba4ae..8ee83f6e9c84 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> @@ -691,8 +691,8 @@ static inline int user_segment_size(unsigned long addr)
>  	return MMU_SEGSIZE_256M;
>  }
>  
> -static inline unsigned long get_vsid(unsigned long context, unsigned long ea,
> -				     int ssize)
> +static inline unsigned long __get_vsid(unsigned long context, unsigned long ea,
> +				       int ssize)

If you're going to realign it, the 'i' in int should line up with the 'u'
in unsigned.

Why did this become __get_vsid()? We don't have a get_vsid() that I can see?

> diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
> index 777778579305..a70adbb7ec56 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
> @@ -91,7 +91,15 @@ struct slice_mask {
>  };
>  
>  typedef struct {
> -	mm_context_id_t id;
> +	union {
> +		/*
> +		 * One context for each 512TB.
> +		 * First 512TB context is saved in id and is also used
> +		 * as PIDR.
> +		 */

Can you make it clearer how this is used on hash vs radix?

> +		mm_context_id_t id;
> +		mm_context_id_t extended_id[TASK_SIZE_USER64/TASK_CONTEXT_SIZE];

If my math is right this is typically 4PB / 512TB == 8 entries == 64 bytes.

Which is small enough not to worry about, but it would be good to mention
in the change log at least.
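Spelling that out (assuming TASK_SIZE_USER64 does end up at 4PB and
mm_context_id_t stays an 8-byte unsigned long):

  TASK_SIZE_USER64 / TASK_CONTEXT_SIZE = 2^52 / 2^49 = 8 entries
  8 entries * 8 bytes = 64 bytes per mm_context_t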

> +	};
>  	u16 user_psize;		/* page size index */
>  
>  	/* Number of bits in the mm_cpumask */
> @@ -193,5 +201,21 @@ extern void radix_init_pseries(void);
>  static inline void radix_init_pseries(void) { };
>  #endif
>  
> +static inline int get_esid_context(mm_context_t *ctx, unsigned long ea)
> +{
> +	int index = ea >> H_BITS_FIRST_CONTEXT;

Using "esid" in the name is a bit confusing, because the value is an EA,
not an ESID. So change the name or what's passed?

I think change the name; the callers both have an EA, so it makes sense
to pass that.

> +

I know we check ea elsewhere, but a:

  VM_BUG_ON(index >= ARRAY_SIZE(ctx->extended_id))

would make me feel better I think, and index should be unsigned.
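Something like this is what I have in mind (just a sketch, untested, and
get_ea_context() is only a suggested name):

static inline int get_ea_context(mm_context_t *ctx, unsigned long ea)
{
	unsigned int index = ea >> H_BITS_FIRST_CONTEXT;

	VM_BUG_ON(index >= ARRAY_SIZE(ctx->extended_id));
	/* extended_id[0] aliases ctx->id via the union, so index 0 is fine */
	return ctx->extended_id[index];
}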

> diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
> index 01299cdc9806..70d65b482504 100644
> --- a/arch/powerpc/include/asm/processor.h
> +++ b/arch/powerpc/include/asm/processor.h
> @@ -119,9 +119,16 @@ void release_thread(struct task_struct *);
>   */
>  #define TASK_SIZE_USER64		TASK_SIZE_512TB
>  #define DEFAULT_MAP_WINDOW_USER64	TASK_SIZE_128TB
> +#define TASK_CONTEXT_SIZE		TASK_SIZE_512TB
>  #else
>  #define TASK_SIZE_USER64		TASK_SIZE_64TB
>  #define DEFAULT_MAP_WINDOW_USER64	TASK_SIZE_64TB
> +/*
> + * We don't need allocate extended context id for 4K
> + * page size. We limit max address on this config to
> + * 64TB.
> + */

/*
 * We don't need to allocate extended context ids for 4K page size, because
 * we limit the max effective address on this config to 64TB.
 */

???

> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
> index 3ac87e53b3da..166b8c0f1830 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -620,8 +620,12 @@ END_MMU_FTR_SECTION_IFCLR(MMU_FTR_TYPE_RADIX)
>  	ld	r10,PACA_EXSLB+EX_LR(r13)
>  	lwz	r9,PACA_EXSLB+EX_CCR(r13)	/* get saved CR */
>  	mtlr	r10

Blank line would be nice.

> +	/*
> +	 * Large address, check whether we have to allocate new
> +	 * contexts.
> +	 */

That doesn't need to wrap.

	/* Large address, check whether we need to allocate new contexts. */

> +	beq-	8f
>  
> -	beq-	8f		/* if bad address, make full stack frame */

And we ended up with two blank lines here which is probably overkill.
 
> @@ -710,7 +714,7 @@ EXC_COMMON_BEGIN(bad_addr_slb)
>  	std	r10, _TRAP(r1)
>  2:	bl	save_nvgprs
>  	addi	r3, r1, STACK_FRAME_OVERHEAD
> -	bl	slb_miss_bad_addr

You removed the only call to this but didn't remove the function or its
prototype AFAICS.

> diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
> index 80acad52b006..ccc88fa7c35c 100644
> --- a/arch/powerpc/mm/mmu_context_book3s64.c
> +++ b/arch/powerpc/mm/mmu_context_book3s64.c
> @@ -178,6 +178,19 @@ void __destroy_context(int context_id)
>  }
>  EXPORT_SYMBOL_GPL(__destroy_context);
>  
> +static void destroy_contexts(mm_context_t *ctx)
> +{
> +	int index, context_id;
> +
> +	spin_lock(&mmu_context_lock);
> +	for (index = 0; index < (TASK_SIZE_USER64/TASK_CONTEXT_SIZE); index++) {

ARRAY_SIZE(ctx->extended_id) ?

> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
> index 13cfe413b40d..a93887ee9b22 100644
> --- a/arch/powerpc/mm/slb.c
> +++ b/arch/powerpc/mm/slb.c
> @@ -23,6 +23,7 @@
>  #include <asm/smp.h>
>  #include <linux/compiler.h>
>  #include <linux/mm_types.h>
> +#include <linux/context_tracking.h>

Our includes are generally not well sorted, but can you try to keep them
alphabetical. So after compiler.h.

> @@ -340,3 +341,156 @@ void slb_initialize(void)
>  
>  	asm volatile("isync":::"memory");
>  }
> +
> +/*
> + * Only handle insert of 1TB slb entries.

What if we don't have them (enabled)?

> + */
> +static void insert_slb_entry(unsigned long vsid, unsigned long ea,
> +			     int bpsize, int ssize)

Parameter alignment again.

> +{
> +	int slb_cache_index;
> +	unsigned long flags;
> +	enum slb_index index;
> +	unsigned long vsid_data, esid_data;

Can you clean those up:

	unsigned long vsid_data, esid_data, flags;
	enum slb_index index;
	int slb_cache_index;

> +
> +	/*
> +	 * We are irq disabled, hence should be safe
> +	 * to access PACA.

Doesn't need to wrap.

> +	 */
> +	index =  get_paca()->stab_rr;
                ^
Unneeded extra space there

Blank line please.

> +	/*
> +	 * simple round roubin replacement of slb.

"Simple round-robin"

Although it's not really; it's round-robin but it wraps back to
SLB_NUM_BOLTED rather than to zero.

> +	 */
> +	if (index < mmu_slb_size)
> +		index++;
> +	else
> +		index = SLB_NUM_BOLTED;

Blank line.

> +	get_paca()->stab_rr = index;
> +
> +	flags = SLB_VSID_USER | mmu_psize_defs[bpsize].sllp;
> +	vsid_data =  (vsid << SLB_VSID_SHIFT_1T) | flags |
> +		((unsigned long) ssize << SLB_VSID_SSIZE_SHIFT);

	vsid_data = (vsid << SLB_VSID_SHIFT_1T) | flags |
		    ((unsigned long) ssize << SLB_VSID_SSIZE_SHIFT);

> +	esid_data = mk_esid_data(ea, mmu_highuser_ssize, index);
> +
> +	asm volatile("slbmte %0, %1" : : "r" (vsid_data), "r" (esid_data)
> +		     : "memory");
Blank line.
> +	/*
> +	 * Now update slb cache entries
> +	 */
> +	slb_cache_index = get_paca()->slb_cache_ptr;
> +	if (slb_cache_index < SLB_CACHE_ENTRIES) {
> +		/*
> +		 * We have space in slb cache for optimized switch_slb().
> +		 * Top 36 bits from esid_data as per ISA
> +		 */
> +		get_paca()->slb_cache[slb_cache_index++] = esid_data >> 28;
> +	}
Blank line.
> +	/*
> +	 * if we are full, just increment and return.
> +	 */
> +	get_paca()->slb_cache_ptr++;
> +}
> +
> +static void alloc_extended_context(struct mm_struct *mm, unsigned long ea)
> +{
> +	int context_id;
> +
> +	int index = ea >> H_BITS_FIRST_CONTEXT;
> +
> +	/*
> +	 * we need to do locking only here. If this value was not set before
> +	 * we will have taken an SLB miss and will reach here. The value will
> +	 * be either 0 or a valid extended context. We need to make sure two
> +	 * parallel SLB miss don't end up allocating extended_context for the
> +	 * same range. The locking below ensures that. For now we take the
> +	 * heavy mmap_sem. But can be changed to per mm_context_t custom lock
> +	 * if needed.
> +	 */
> +	down_read(&mm->mmap_sem);

> +	context_id = hash__alloc_context_id();
> +	if (context_id < 0) {
> +		up_read(&mm->mmap_sem);
> +		pagefault_out_of_memory();
> +		return;
> +	}

> +	/* Check for parallel allocation after holding lock */
> +	if (!mm->context.extended_id[index])
> +		mm->context.extended_id[index] = context_id;
> +	else
> +		__destroy_context(context_id);

> +	up_read(&mm->mmap_sem);
> +}
> +
> +static void __handle_multi_context_slb_miss(struct pt_regs *regs,
> +					    unsigned long ea)
> +{
> +	int context, bpsize;
> +	unsigned long vsid;
> +	struct mm_struct *mm = current->mm;
> +
> +	context = get_esid_context(&mm->context, ea);
> +	if (!context) {
> +		/*
> +		 * haven't allocated context yet for this range.
> +		 * Enable irq and allo context and return. We will
> +		 * take an slb miss on this again and come here with
> +		 * allocated context.
> +		 */
> +		/* We restore the interrupt state now */
> +		if (!arch_irq_disabled_regs(regs))
> +			local_irq_enable();
> +		return alloc_extended_context(mm, ea);

I guess I thought we'd do this somewhere in the slice code, when an
address above the limit is first used?

Doing it this way means a stray access to >= 512TB will get a lot
further, i.e. we'll come in here, allocate a context etc., install an SLB
entry, and only then realise there's nothing mapped when we look in the
hash. I'd prefer we caught it earlier if possible.
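Very rough sketch of what I had in mind instead (hypothetical helper name,
untested; it leans on get_unmapped_area() being called with mmap_sem held
for write, so there's no parallel-allocation race to handle here):

static int slice_alloc_extended_context(struct mm_struct *mm,
					unsigned long addr)
{
	int index = addr >> H_BITS_FIRST_CONTEXT;
	int context_id;

	/* Already allocated for this 512TB range */
	if (mm->context.extended_id[index])
		return 0;

	context_id = hash__alloc_context_id();
	if (context_id < 0)
		return context_id;

	mm->context.extended_id[index] = context_id;
	return 0;
}

That way the SLB miss path only ever sees already-allocated context ids,
and a stray access above 512TB can be rejected straight away.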

> +	}
> +	/*
> +	 * We are always above 1TB, hence use high user segment size.
> +	 */
> +	vsid = __get_vsid(context, ea, mmu_highuser_ssize);
> +	bpsize = get_slice_psize(mm, ea);
> +
> +	insert_slb_entry(vsid, ea, bpsize, mmu_highuser_ssize);
> +}
> +
> +/*
> + * exception_enter() handling? FIXME!!

???

> + */
> +void handle_multi_context_slb_miss(struct pt_regs *regs)

I don't love this name, maybe slb_miss_large_addr()?

That would also free up the non-underscore name for the actual core of
the handler above.

> +{
> +	enum ctx_state prev_state = exception_enter();
> +	unsigned long ea = regs->dar;
> +
> +	/*
> +	 * Kernel always runs with single context. Hence
> +	 * anything that request for multi context is
> +	 * considered bad slb request.
> +	 */
> +	if (!user_mode(regs))
> +		return bad_page_fault(regs, ea, SIGSEGV);

Can't copy_to/from_user() hit this case?

> +	if (REGION_ID(ea) != USER_REGION_ID)
> +		goto slb_bad_addr;

Blank line please :)

> +	/*
> +	 * Are we beyound what the page table layout support ?

"supports" ?

> +	 */
> +	if ((ea & ~REGION_MASK) >= H_PGTABLE_RANGE)
> +		goto slb_bad_addr;
> +
> +#ifdef CONFIG_PPC_MM_SLICES

That can only ever be true in this file, right?

> +	/*
> +	 * consider this as bad slb if we take an slb miss
> +	 * on an address above addr limit.

It's a bad "access", not a "bad slb".

We should probably spell slb as SLB in the comments too.

> +	 */
> +	if (ea >= current->mm->context.slb_addr_limit)
> +		goto slb_bad_addr;
> +#endif

Blank ..

> +	/* Lower address should be handled by asm code */

"should have been handled"

> +	if (ea < (1UL << H_BITS_FIRST_CONTEXT))
> +		goto slb_bad_addr;
> +
> +	__handle_multi_context_slb_miss(regs, ea);
> +	exception_exit(prev_state);
> +	return;
> +
> +slb_bad_addr:
> +	_exception(SIGSEGV, regs, SEGV_BNDERR, ea);
> +	exception_exit(prev_state);
> +}

I think this would work better if we had:

void slb_bad_addr(regs, ea)
{
	_exception(SIGSEGV, regs, SEGV_BNDERR, ea);
}

...
void handle_multi_context_slb_miss(struct pt_regs *regs)
{
...
	if (ea < (1UL << H_BITS_FIRST_CONTEXT)) {
		slb_bad_addr(regs, ea);
		goto out;
	}
...
	__handle_multi_context_slb_miss(regs, ea);
out:
	exception_exit(prev_state);
}

> diff --git a/arch/powerpc/mm/slb_low.S b/arch/powerpc/mm/slb_low.S
> index 2c7c717fd2ea..c66cb06e73a1 100644
> --- a/arch/powerpc/mm/slb_low.S
> +++ b/arch/powerpc/mm/slb_low.S
> @@ -75,10 +75,12 @@ ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_68_BIT_VA)
>   */
>  _GLOBAL(slb_allocate)
>  	/*
> -	 * check for bad kernel/user address
> +	 * Check for address range for which we need to handle multi context. For
> +         * the default context we allocate the slb via the fast path. For large
> +         * address we branch out to C-code and look at additional context allocated.

Leading white space is fubar.


cheers

