[PATCH v15 7/9] powerpc: Set ARCH_HAS_STRICT_MODULE_RWX
Jordan Niethe
jniethe5 at gmail.com
Sat Aug 14 11:23:47 AEST 2021
On Sat, Aug 14, 2021 at 8:59 AM Fabiano Rosas <farosas at linux.ibm.com> wrote:
>
> Laurent Vivier <lvivier at redhat.com> writes:
>
> >
> > since this patch is merged my VM is experiencing a crash at boot (20% of the time):
> >
> > [ 8.496850] kernel tried to execute exec-protected page (c008000004073278) - exploit
> > attempt? (uid: 0)
> > [ 8.496921] BUG: Unable to handle kernel instruction fetch
> > [ 8.496954] Faulting instruction address: 0xc008000004073278
> > [ 8.496994] Oops: Kernel access of bad area, sig: 11 [#1]
> > [ 8.497028] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> > [ 8.497071] Modules linked in: drm virtio_console fuse drm_panel_orientation_quirks xfs
> > libcrc32c virtio_net net_failover virtio_blk vmx_crypto failover dm_mirror dm_region_hash
> > dm_log dm_mod
> > [ 8.497186] CPU: 3 PID: 44 Comm: kworker/3:1 Not tainted 5.14.0-rc4+ #12
> > [ 8.497228] Workqueue: events control_work_handler [virtio_console]
> > [ 8.497272] NIP: c008000004073278 LR: c008000004073278 CTR: c0000000001e9de0
> > [ 8.497320] REGS: c00000002e4ef7e0 TRAP: 0400 Not tainted (5.14.0-rc4+)
> > [ 8.497361] MSR: 800000004280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24002822
> > XER: 200400cf
> > [ 8.497426] CFAR: c0000000001e9e44 IRQMASK: 1
> > [ 8.497426] GPR00: c008000004073278 c00000002e4efa80 c000000002a26b00 c000000042c39520
> > [ 8.497426] GPR04: 0000000000000001 0000000000000000 0000000000000000 00000000000000ff
> > [ 8.497426] GPR08: 0000000000000001 c000000042c39520 0000000000000001 c008000004076008
> > [ 8.497426] GPR12: c0000000001e9de0 c0000001fffccb00 c00000000018ba88 c00000002c91d400
> > [ 8.497426] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > [ 8.497426] GPR20: 0000000000000000 0000000000000000 0000000000000000 c008000004080340
> > [ 8.497426] GPR24: c0080000040a01e8 0000000000000000 0000000000000000 c00000002e0975c0
> > [ 8.497426] GPR28: c00000002ce72940 c000000042c39520 0000000000000048 0000000000000038
> > [ 8.497891] NIP [c008000004073278] fill_queue+0xf0/0x210 [virtio_console]
> > [ 8.497934] LR [c008000004073278] fill_queue+0xf0/0x210 [virtio_console]
> > [ 8.497976] Call Trace:
> > [ 8.497993] [c00000002e4efa80] [c00800000407323c] fill_queue+0xb4/0x210
> > [virtio_console] (unreliable)
> > [ 8.498052] [c00000002e4efae0] [c008000004073a90] add_port+0x1a8/0x470 [virtio_console]
> > [ 8.498102] [c00000002e4efbb0] [c0080000040750f4] control_work_handler+0xbc/0x1e8
> > [virtio_console]
> > [ 8.498160] [c00000002e4efc60] [c00000000017f4f0] process_one_work+0x290/0x590
> > [ 8.498212] [c00000002e4efd00] [c00000000017f878] worker_thread+0x88/0x620
> > [ 8.498256] [c00000002e4efda0] [c00000000018bc14] kthread+0x194/0x1a0
> > [ 8.498299] [c00000002e4efe10] [c00000000000cf54] ret_from_kernel_thread+0x5c/0x64
> > [ 8.498349] Instruction dump:
> > [ 8.498374] 7da96b78 a14d0c8a 419c00b0 2f8a0000 419eff88 b32d0c8a 7c0004ac 4bffff7c
> > [ 8.498430] 60000000 60000000 7fa3eb78 48002d95 <e8410018> 38600000 480025e1 e8410018
> > [ 8.498485] ---[ end trace 16ee10903290b647 ]---
> > [ 8.501433]
> > [ 9.502601] Kernel panic - not syncing: Fatal exception
> >
> > add_port+0x1a8/0x470 :
> >
> > 1420
> > 1421 /* We can safely ignore ENOSPC because it means
> > 1422 * the queue already has buffers. Buffers are removed
> > 1423 * only by virtcons_remove(), not by unplug_port()
> > 1424 */
> > ->1425 err = fill_queue(port->in_vq, &port->inbuf_lock);
> > 1426 if (err < 0 && err != -ENOSPC) {
> > 1427 dev_err(port->dev, "Error allocating inbufs\n");
> > 1428 goto free_device;
> > 1429 }
> >
> > fill_queue+0x90/0x210 :
> >
> > 1326 static int fill_queue(struct virtqueue *vq, spinlock_t *lock)
> > 1327 {
> > 1328 struct port_buffer *buf;
> > 1329 int nr_added_bufs;
> > 1330 int ret;
> > 1331
> > 1332 nr_added_bufs = 0;
> > 1333 do {
> > 1334 buf = alloc_buf(vq->vdev, PAGE_SIZE, 0);
> > 1335 if (!buf)
> > 1336 return -ENOMEM;
> > 1337
> > ->1338 spin_lock_irq(lock);
> >
> > I'm using an upstream kernel (5.14-rc4, 251a1524293d) in the VM.
> >
> > My host is a RHEL 8.5/POWER9: qemu-kvm-6.0.0-21 and kernel-4.18.0-325
> >
> > My qemu command line is:
> >
> > /usr/libexec/qemu-kvm \
> > -M pseries,accel=kvm \
> > -nographic -nodefaults \
> > -device virtio-serial-pci \
> > -device virtconsole \
> > -device virtio-net-pci,mac=9a:2b:2c:2d:2e:2f,netdev=hostnet0 \
> > -blockdev
> > node-name=disk1,file.driver=file,driver=qcow2,file.driver=file,file.filename=disk.qcow2 \
> > -netdev bridge,id=hostnet0,br=virbr0,helper=/usr/libexec/qemu-bridge-helper \
> > -device virtio-blk-pci,id=image1,drive=disk1 \
> > -m 8192 \
> > -smp 4 \
> > -serial mon:stdio
> >
> >
> > Do we need something in qemu/kvm to support STRICT_MODULE_RWX ?
> >
> > Thanks,
> > Laurent
>
> This patch survived 300 consecutive runs without the issue:
>
> diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c
> index 0876216ceee6..7aeb2b62e5dc 100644
> --- a/arch/powerpc/mm/pageattr.c
> +++ b/arch/powerpc/mm/pageattr.c
> @@ -35,10 +35,7 @@ static int change_page_attr(pte_t *ptep, unsigned long addr, void *data)
> pte_t pte;
>
> spin_lock(&init_mm.page_table_lock);
> -
> - /* invalidate the PTE so it's safe to modify */
> - pte = ptep_get_and_clear(&init_mm, addr, ptep);
> - flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> + pte = *ptep;
>
> /* modify the PTE bits as desired, then apply */
> switch (action) {
> @@ -60,10 +57,7 @@ static int change_page_attr(pte_t *ptep, unsigned long addr, void *data)
> }
>
> set_pte_at(&init_mm, addr, ptep, pte);
> -
> - /* See ptesync comment in radix__set_pte_at() */
> - if (radix_enabled())
> - asm volatile("ptesync": : :"memory");
> + flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> spin_unlock(&init_mm.page_table_lock);
>
> return 0;
> ---
>
> What I think is happening is that the virtio_console code is running
> at the same time we are doing the `module_enable_ro` at the end of
> `do_init_module` due to the async nature of the work handler. There is
> a window after the TLB flush when the PTE has its permission bits
> cleared, so any translations of the module code page attempted during
> that window will fault.
Thanks for looking at this. I agree that this is the problem.
This avoids the crash (not a proper solution):
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -2011,13 +2011,15 @@ static void module_enable_ro(const struct
module *mod, bool after_init)
if (!rodata_enabled)
return;
- set_vm_flush_reset_perms(mod->core_layout.base);
- set_vm_flush_reset_perms(mod->init_layout.base);
- frob_text(&mod->core_layout, set_memory_ro);
-
- frob_rodata(&mod->core_layout, set_memory_ro);
- frob_text(&mod->init_layout, set_memory_ro);
- frob_rodata(&mod->init_layout, set_memory_ro);
+ if (!after_init) {
+ set_vm_flush_reset_perms(mod->core_layout.base);
+ set_vm_flush_reset_perms(mod->init_layout.base);
+ frob_text(&mod->core_layout, set_memory_ro);
+
+ frob_rodata(&mod->core_layout, set_memory_ro);
+ frob_text(&mod->init_layout, set_memory_ro);
+ frob_rodata(&mod->init_layout, set_memory_ro);
+ }
>
> I'm ignorant of strict rwx in general so I don't see why we need to
> clear the bits before setting them to their final value, but as I
> understand it, the set_pte_at + flush_tlb_kernel_range satisfy the ISA
> requirement of [ptesync; tlbie; eieio; tlbsync; ptesync;] so it seems
> like the patch should work.
I believe it's done like that because ISA 6.10.1.2 Modifying a
Translation Table Entry -
"The sequence is equivalent to deleting the PTE and then adding a new one".
In that sequence, I think it is [set pte to 0; ptesync; tlbie; eieio;
tlbsync; ptesync;]
But it does seem like we need to do something like your patch for
change_page_attr() to work as expected.
>
> Now, I cannot explain why the crash always happens around the code
> that does the module's symbols relocation (the NIP in Laurent's trace
> is the TOC reload from module_64.c:restore_r2). Maybe because
> instructions are already in icache until the first branch into the
> stub?
Yeah, mpe thought it might be some cache effect of a branch instruction.
>From time to time I did see Oopses like this (not "kernel tried to
execute exec-protected page")
[ 124.986964][ T52] Oops: Kernel access of bad area, sig: 11 [#1]
[ 124.987043][ T52] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[ 124.987095][ T52] Modules linked in: virtio_console binfmt_misc
virtiofs fuse virtio_net virtio_blk virtio_pci virtio_pci_modern_dev
virtio_ring virtio crc32c_vpmsum [last unloaded: virtio_console]
[ 124.987209][ T52] CPU: 2 PID: 52 Comm: kworker/2:1 Not tainted
5.14.0-rc4 #4
[ 124.987259][ T52] Workqueue: events control_work_handler [virtio_console]
[ 124.987307][ T52] NIP: c008000001a86044 LR: c008000001a82cf8
CTR: c00000000042a6c0
[ 124.987358][ T52] REGS: c00000000b273770 TRAP: 0300 Not tainted
(5.14.0-rc4)
[ 124.987406][ T52] MSR: 800000000280b033
<SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 82002882 XER: 00000004
[ 124.987475][ T52] CFAR: c008000001a82cf4 DAR: c008000001a86058
DSISR: 40000000 IRQMASK: 0
[ 124.987475][ T52] GPR00: c008000001a82ce8 c00000000b273a10
c008000001ab8000 c000000010731320
[ 124.987475][ T52] GPR04: 0000000000000001 0000000000000000
0000000000000000 0000000000000000
[ 124.987475][ T52] GPR08: 0000000000000000 0000000000000000
0000000000000000 c008000001a86038
[ 124.987475][ T52] GPR12: 0000000022002884 c0000001ffffdf00
c000000000169ae8 c0000000037700c0
[ 124.987475][ T52] GPR16: 0000000000000000 0000000000000000
c000000021af0000 c000000003313900
[ 124.987475][ T52] GPR20: 0000000000000001 0000000000000000
0000000000000000 c000000003316780
[ 124.987475][ T52] GPR24: c008000001a90288 0000000000000000
0000000000000000 c000000010731320
[ 124.987475][ T52] GPR28: c000000003316900 0000000000000018
0000000000000068 c00000000379c5a0
[ 124.988028][ T52] NIP [c008000001a86044] fini+0x824/0xa7e0 [virtio_console]
[ 124.988085][ T52] LR [c008000001a82cf8] fill_queue+0xb0/0x230
[virtio_console]
[ 124.988141][ T52] Call Trace:
[ 124.988171][ T52] [c00000000b273a10] [c008000001a82ce8]
fill_queue+0xa0/0x230 [virtio_console] (unreliable)
[ 124.988246][ T52] [c00000000b273aa0] [c008000001a83208]
add_port.isra.0+0x1a0/0x4b0 [virtio_console]
[ 124.988312][ T52] [c00000000b273b70] [c008000001a85474]
control_work_handler+0x46c/0x654 [virtio_console]
[ 124.988403][ T52] [c00000000b273c70] [c00000000015ce60]
process_one_work+0x2a0/0x570
[ 124.988586][ T52] [c00000000b273d10] [c00000000015d1d8]
worker_thread+0xa8/0x660
[ 124.988684][ T52] [c00000000b273da0] [c000000000169c5c] kthread+0x17c/0x190
[ 124.988762][ T52] [c00000000b273e10] [c00000000000cf54]
ret_from_kernel_thread+0x5c/0x64
[ 124.988916][ T52] Instruction dump:
[ 124.988965][ T52] 396be010 f8410018 e98b0020 7d8903a6 4e800420
00000000 73747562 003a0300
[ 124.989152][ T52] c0000000 3d62fffd 396be038 f8410018 <e98b0020>
7d8903a6 4e800420 00000000
[ 124.989298][ T52] ---[ end trace ab9046c024eb3154 ]---
I guess it depends on the timing of which pte is invalidated.
>
> Anyway, this is what Murilo and I found out over our debugging session
> in the past couple of days. I hope it helps. =)
Yes it does, thanks heaps.
>
More information about the Linuxppc-dev
mailing list