[PATCH v15 7/9] powerpc: Set ARCH_HAS_STRICT_MODULE_RWX

Sat Aug 14 08:58:59 AEST 2021

Laurent Vivier <lvivier at redhat.com> writes:

>
> since this patch is merged my VM is experiencing a crash at boot (20% of the time):
>
> [    8.496850] kernel tried to execute exec-protected page (c008000004073278) - exploit
> attempt? (uid: 0)
> [    8.496921] BUG: Unable to handle kernel instruction fetch
> [    8.496954] Faulting instruction address: 0xc008000004073278
> [    8.496994] Oops: Kernel access of bad area, sig: 11 [#1]
> [    8.497028] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> [    8.497071] Modules linked in: drm virtio_console fuse drm_panel_orientation_quirks xfs
> libcrc32c virtio_net net_failover virtio_blk vmx_crypto failover dm_mirror dm_region_hash
> dm_log dm_mod
> [    8.497186] CPU: 3 PID: 44 Comm: kworker/3:1 Not tainted 5.14.0-rc4+ #12
> [    8.497228] Workqueue: events control_work_handler [virtio_console]
> [    8.497272] NIP:  c008000004073278 LR: c008000004073278 CTR: c0000000001e9de0
> [    8.497320] REGS: c00000002e4ef7e0 TRAP: 0400   Not tainted  (5.14.0-rc4+)
> [    8.497361] MSR:  800000004280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24002822
> XER: 200400cf
> [    8.497426] CFAR: c0000000001e9e44 IRQMASK: 1
> [    8.497426] GPR00: c008000004073278 c00000002e4efa80 c000000002a26b00 c000000042c39520
> [    8.497426] GPR04: 0000000000000001 0000000000000000 0000000000000000 00000000000000ff
> [    8.497426] GPR08: 0000000000000001 c000000042c39520 0000000000000001 c008000004076008
> [    8.497426] GPR12: c0000000001e9de0 c0000001fffccb00 c00000000018ba88 c00000002c91d400
> [    8.497426] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [    8.497426] GPR20: 0000000000000000 0000000000000000 0000000000000000 c008000004080340
> [    8.497426] GPR24: c0080000040a01e8 0000000000000000 0000000000000000 c00000002e0975c0
> [    8.497426] GPR28: c00000002ce72940 c000000042c39520 0000000000000048 0000000000000038
> [    8.497891] NIP [c008000004073278] fill_queue+0xf0/0x210 [virtio_console]
> [    8.497934] LR [c008000004073278] fill_queue+0xf0/0x210 [virtio_console]
> [    8.497976] Call Trace:
> [    8.497993] [c00000002e4efa80] [c00800000407323c] fill_queue+0xb4/0x210
> [virtio_console] (unreliable)
> [    8.498052] [c00000002e4efae0] [c008000004073a90] add_port+0x1a8/0x470 [virtio_console]
> [    8.498102] [c00000002e4efbb0] [c0080000040750f4] control_work_handler+0xbc/0x1e8
> [virtio_console]
> [    8.498160] [c00000002e4efc60] [c00000000017f4f0] process_one_work+0x290/0x590
> [    8.498212] [c00000002e4efd00] [c00000000017f878] worker_thread+0x88/0x620
> [    8.498256] [c00000002e4efda0] [c00000000018bc14] kthread+0x194/0x1a0
> [    8.498299] [c00000002e4efe10] [c00000000000cf54] ret_from_kernel_thread+0x5c/0x64
> [    8.498349] Instruction dump:
> [    8.498374] 7da96b78 a14d0c8a 419c00b0 2f8a0000 419eff88 b32d0c8a 7c0004ac 4bffff7c
> [    8.498430] 60000000 60000000 7fa3eb78 48002d95 <e8410018> 38600000 480025e1 e8410018
> [    8.498485] ---[ end trace 16ee10903290b647 ]---
> [    8.501433]
> [    9.502601] Kernel panic - not syncing: Fatal exception
>
> add_port+0x1a8/0x470 :
>
>   1420	
>   1421		/* We can safely ignore ENOSPC because it means
>   1422		 * the queue already has buffers. Buffers are removed
>   1423		 * only by virtcons_remove(), not by unplug_port()
>   1424		 */
> ->1425		err = fill_queue(port->in_vq, &port->inbuf_lock);
>   1426		if (err < 0 && err != -ENOSPC) {
>   1427			dev_err(port->dev, "Error allocating inbufs\n");
>   1428			goto free_device;
>   1429		}
>
> fill_queue+0x90/0x210 :
>
>   1326	static int fill_queue(struct virtqueue *vq, spinlock_t *lock)
>   1327	{
>   1328		struct port_buffer *buf;
>   1329		int nr_added_bufs;
>   1330		int ret;
>   1331	
>   1332		nr_added_bufs = 0;
>   1333		do {
>   1334			buf = alloc_buf(vq->vdev, PAGE_SIZE, 0);
>   1335			if (!buf)
>   1336				return -ENOMEM;
>   1337	
> ->1338			spin_lock_irq(lock);
>
> I'm using an upstream kernel (5.14-rc4, 251a1524293d) in the VM.
>
> My host is a RHEL 8.5/POWER9: qemu-kvm-6.0.0-21 and kernel-4.18.0-325
>
> My qemu command line is:
>
> /usr/libexec/qemu-kvm \
> -M pseries,accel=kvm \
> -nographic -nodefaults \
> -device virtio-serial-pci \
> -device virtconsole \
> -device virtio-net-pci,mac=9a:2b:2c:2d:2e:2f,netdev=hostnet0  \
> -blockdev
> node-name=disk1,file.driver=file,driver=qcow2,file.driver=file,file.filename=disk.qcow2 \
> -netdev bridge,id=hostnet0,br=virbr0,helper=/usr/libexec/qemu-bridge-helper \
> -device virtio-blk-pci,id=image1,drive=disk1 \
> -m 8192  \
> -smp 4 \
> -serial mon:stdio
>
>
> Do we need something in qemu/kvm to support STRICT_MODULE_RWX ?
>
> Thanks,
> Laurent

This patch survived 300 consecutive runs without the issue:

diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c
index 0876216ceee6..7aeb2b62e5dc 100644
--- a/arch/powerpc/mm/pageattr.c
+++ b/arch/powerpc/mm/pageattr.c
@@ -35,10 +35,7 @@ static int change_page_attr(pte_t *ptep, unsigned long addr, void *data)
        pte_t pte;
 
        spin_lock(&init_mm.page_table_lock);
-
-       /* invalidate the PTE so it's safe to modify */
-       pte = ptep_get_and_clear(&init_mm, addr, ptep);
-       flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+       pte = *ptep;
 
        /* modify the PTE bits as desired, then apply */
        switch (action) {
@@ -60,10 +57,7 @@ static int change_page_attr(pte_t *ptep, unsigned long addr, void *data)
        }
 
        set_pte_at(&init_mm, addr, ptep, pte);
-
-       /* See ptesync comment in radix__set_pte_at() */
-       if (radix_enabled())
-               asm volatile("ptesync": : :"memory");
+       flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
        spin_unlock(&init_mm.page_table_lock);
 
        return 0;
---

What I think is happening is that the virtio_console code is running
at the same time we are doing the `module_enable_ro` at the end of
`do_init_module` due to the async nature of the work handler. There is
a window after the TLB flush when the PTE has its permission bits
cleared, so any translations of the module code page attempted during
that window will fault.

I'm ignorant of strict rwx in general so I don't see why we need to
clear the bits before setting them to their final value, but as I
understand it, the set_pte_at + flush_tlb_kernel_range satisfy the ISA
requirement of [ptesync; tlbie; eieio; tlbsync; ptesync;] so it seems
like the patch should work.

Now, I cannot explain why the crash always happens around the code
that does the module's symbols relocation (the NIP in Laurent's trace
is the TOC reload from module_64.c:restore_r2). Maybe because
instructions are already in icache until the first branch into the
stub?

Anyway, this is what Murilo and I found out over our debugging session
in the past couple of days. I hope it helps. =)