[PATCH] powerpc: mm: Limit rma_size to 1TB when running without HV mode

Satheesh Rajendran sathnaga at linux.vnet.ibm.com
Wed Jul 10 17:32:07 AEST 2019


On Wed, Jul 10, 2019 at 03:20:18PM +1000, Suraj Jitindar Singh wrote:
> The virtual real mode addressing (VRMA) mechanism is used when a
> partition is using HPT (Hash Page Table) translation and performs
> real mode accesses (MSR[IR|DR] = 0) in non-hypervisor mode. In this
> mode effective address bits 0:23 are treated as zero (i.e. the access
> is aliased to 0) and the access is performed using an implicit 1TB SLB
> entry.
> 
> The size of the RMA (Real Memory Area) is communicated to the guest as
> the size of the first memory region in the device tree. And because of
> the mechanism described above can be expected to not exceed 1TB. In the
> event that the host erroneously represents the RMA as being larger than
> 1TB, guest accesses in real mode to memory addresses above 1TB will be
> aliased down to below 1TB. This means that a memory access performed in
> real mode may differ to one performed in virtual mode for the same memory
> address, which would likely have unintended consequences.
> 
> To avoid this outcome have the guest explicitly limit the size of the
> RMA to the current maximum, which is 1TB. This means that even if the
> first memory block is larger than 1TB, only the first 1TB should be
> accessed in real mode.
> 
> Signed-off-by: Suraj Jitindar Singh <sjitindarsingh at gmail.com>
> ---
>  arch/powerpc/mm/book3s64/hash_utils.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
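
For readers following along, the aliasing described in the commit message
can be illustrated with a small userspace sketch in C. This is not kernel
code: the names VRMA_SEGMENT_SHIFT, vrma_alias() and vrma_is_aliased() are
made up for this example. The only assumption is the one stated above,
that effective address bits 0:23 (IBM numbering, bit 0 is the MSB) are
treated as zero, which leaves the low 40 bits, i.e. addr % 1TB.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for this sketch; 40 matches the kernel's SID_SHIFT_1T. */
#define VRMA_SEGMENT_SHIFT 40
#define VRMA_SEGMENT_SIZE  (UINT64_C(1) << VRMA_SEGMENT_SHIFT)

static_assert(VRMA_SEGMENT_SIZE == UINT64_C(0x10000000000),
	      "1TB is 2^40 bytes");

/*
 * Model of a real mode access under VRMA: the top 24 bits of the
 * 64-bit effective address are ignored, so only the low 40 bits
 * select the byte inside the implicit 1TB segment.
 */
static inline uint64_t vrma_alias(uint64_t ea)
{
	return ea & (VRMA_SEGMENT_SIZE - 1);
}

/* Non-zero when a real mode access would land somewhere other than ea. */
static inline int vrma_is_aliased(uint64_t ea)
{
	return vrma_alias(ea) != ea;
}
```

With this model, any address at or above 1TB maps back below 1TB, so a
real mode access and a virtual mode access to the same address would touch
different memory.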

Hi,

Tested this patch: a POWER8 compat guest with more than 1024G of memory
now boots fine on a POWER9 host.

Tested-by: Satheesh Rajendran <sathnaga at linux.vnet.ibm.com>

Host: P9; kernel: 5.2.0-00915-g5ad18b2e60b7

Before this patch:
The guest crashes early in boot:
[0.000000] BUG: Kernel NULL pointer dereference at 0x00000028
[0.000000] Faulting instruction address: 0xc00000000102caa0
[0.000000] Oops: Kernel access of bad area, sig: 11 [#1]
[0.000000] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[0.000000] Modules linked in:
[0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 5.2.0-03135-ge9a83bd23220 #24
[0.000000] NIP:  c00000000102caa0 LR: c00000000102ca84 CTR: 0000000000000000
[0.000000] REGS: c000000001603ba0 TRAP: 0380   Not tainted  (5.2.0-03135-ge9a83bd23220)
[0.000000] MSR:  8000000000001033 <SF,ME,IR,DR,RI,LE>  CR: 24000428  XER: 20000000
[0.000000] CFAR: c00000000102c1d8 IRQMASK: 1 
[0.000000] GPR00: c00000000102ca84 c000000001603e30 c000000001605300 0000010000000000 
[0.000000] GPR04: 0000000000000000 0000000000000000 c00000ffffff8000 c000000001863dc8 
[0.000000] GPR08: 0000000000002028 0000000000000000 c00000ffffff8000 0000000000000009 
[0.000000] GPR12: 0000000000000000 c0000000018f0000 000000007dc5fef0 00000000012e1220 
[0.000000] GPR16: 00000000012e10a0 fffffffffffffffd 000000007dc5fef0 000000000130fcc0 
[0.000000] GPR20: 0000000000000014 0000000001a80000 000000002fff0000 fffffffffffffffd 
[0.000000] GPR24: 0000000001d0000c c000000000000000 c000000001641ed8 c000000001641b78 
[0.000000] GPR28: 0000000000000000 0000000000000000 0000010000000000 0000000000000000 
[0.000000] NIP [c00000000102caa0] emergency_stack_init+0xb8/0x118
[0.000000] LR [c00000000102ca84] emergency_stack_init+0x9c/0x118
[0.000000] Call Trace:
[0.000000] [c000000001603e30] [c00000000102ca84] emergency_stack_init+0x9c/0x118 (unreliable)
[0.000000] [c000000001603e80] [c00000000102bd54] setup_arch+0x2fc/0x388
[0.000000] [c000000001603ef0] [c000000001023ccc] start_kernel+0xa4/0x660
[0.000000] [c000000001603f90] [c00000000000b774] start_here_common+0x1c/0x528
[0.000000] Instruction dump:
[0.000000] 7ffc07b4 7fc3f378 7bfd1f24 7f84e378 4bfff6e9 3f620004 3b7bc878 7f84e378 
[0.000000] 39434000 7fc3f378 e93b0000 7d29e82a <f9490028> 4bfff6c5 e93b0000 7f84e378 
[0.000000] random: get_random_bytes called from print_oops_end_marker+0x6c/0xa0 with crng_init=0
[0.000000] ---[ end trace 0000000000000000 ]---
[0.000000] 
[0.000000] Kernel panic - not syncing: Attempted to kill the idle task!

-------------------------
With this patch:
# virsh start --console p8
Domain p8 started
Connected to domain p8
..
..
Fedora 27 (Twenty Seven)
Kernel 5.2.0-03136-gf709b0494ad9 on an ppc64le (hvc0)

atest-guest login: 
# free -g
              total        used        free      shared  buff/cache   available
Mem:           1028           0        1027           0           0        1025
Swap:             0           0

Regards,
-Satheesh.

> 
> diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
> index 28ced26f2a00..4d0e2cce9cd5 100644
> --- a/arch/powerpc/mm/book3s64/hash_utils.c
> +++ b/arch/powerpc/mm/book3s64/hash_utils.c
> @@ -1901,11 +1901,19 @@ void hash__setup_initial_memory_limit(phys_addr_t first_memblock_base,
>  	 *
>  	 * For guests on platforms before POWER9, we clamp the it limit to 1G
>  	 * to avoid some funky things such as RTAS bugs etc...
> +	 * On POWER9 we limit to 1TB in case the host erroneously told us that
> +	 * the RMA was >1TB. Effective address bits 0:23 are treated as zero
> +	 * (meaning the access is aliased to zero i.e. addr = addr % 1TB)
> +	 * for virtual real mode addressing and so it doesn't make sense to
> +	 * have an area larger than 1TB as it can't be addressed.
>  	 */
>  	if (!early_cpu_has_feature(CPU_FTR_HVMODE)) {
>  		ppc64_rma_size = first_memblock_size;
>  		if (!early_cpu_has_feature(CPU_FTR_ARCH_300))
>  			ppc64_rma_size = min_t(u64, ppc64_rma_size, 0x40000000);
> +		else
> +			ppc64_rma_size = min_t(u64, ppc64_rma_size,
> +					       1UL << SID_SHIFT_1T);
> 
>  		/* Finally limit subsequent allocations */
>  		memblock_set_current_limit(ppc64_rma_size);
> -- 
> 2.13.6
> 
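
As a rough illustration of what the hunk above does in the non-HV case,
here is a standalone C sketch of the clamp. It is not the kernel function:
guest_rma_size() and the two limit macros are hypothetical names, and the
is_arch_300 flag stands in for early_cpu_has_feature(CPU_FTR_ARCH_300).
The HV-mode path is deliberately not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the values are the ones used in the patch. */
#define RMA_LIMIT_PRE_ARCH_300 UINT64_C(0x40000000)   /* 1GB */
#define RMA_LIMIT_1T           (UINT64_C(1) << 40)    /* 1UL << SID_SHIFT_1T */

static_assert(RMA_LIMIT_PRE_ARCH_300 < RMA_LIMIT_1T,
	      "the pre-ARCH_300 limit is the stricter one");

static inline uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * Guest (non-HV) case only: start from the size of the first memory
 * block, then clamp to 1GB before ARCH_300 and to 1TB otherwise.
 */
static inline uint64_t guest_rma_size(uint64_t first_memblock_size,
				      int is_arch_300)
{
	if (!is_arch_300)
		return min_u64(first_memblock_size, RMA_LIMIT_PRE_ARCH_300);

	return min_u64(first_memblock_size, RMA_LIMIT_1T);
}
```

Under these assumptions, a first memory block of 1028GB reported to an
ARCH_300 guest yields an RMA size of exactly 1TB, while a block smaller
than the limit is left unchanged.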


