ppc64el kernel bug?

Kip Warner kip at thevertigo.com
Tue Jan 4 13:00:11 AEDT 2022


Dear List,

I was directed to this list by a colleague of mine at IBM (cc'd).

After recently spending all day debugging late into the evening, I
think I've found a problem. I noticed some really weird behaviour with
one of my application's database i/o threads exiting without any
exceptions being thrown and without anything in my code apparently
telling it to exit. This happened only on ppc64el.

I checked the postgresql log and this is what I see after my
application's connection thread bails:

   2021-12-25 06:52:52.836 UTC [144517] LOG:  server process (PID 144937) was terminated by signal 11: Segmentation fault
   2021-12-25 06:52:52.836 UTC [144517] DETAIL:  Failed process was running: SELECT * FROM mytable;
   2021-12-25 06:52:52.836 UTC [144517] LOG:  terminating any other active server processes
   2021-12-25 06:52:52.836 UTC [144941] WARNING:  terminating connection because of crash of another server process
   2021-12-25 06:52:52.836 UTC [144941] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
   2021-12-25 06:52:52.836 UTC [144941] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
   2021-12-25 06:52:52.836 UTC [144940] WARNING:  terminating connection because of crash of another server process
   2021-12-25 06:52:52.836 UTC [144940] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
   2021-12-25 06:52:52.836 UTC [144940] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
   2021-12-25 06:52:52.836 UTC [144938] WARNING:  terminating connection because of crash of another server process
   2021-12-25 06:52:52.836 UTC [144938] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
   2021-12-25 06:52:52.836 UTC [144938] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
   2021-12-25 06:52:52.836 UTC [144939] WARNING:  terminating connection because of crash of another server process
   2021-12-25 06:52:52.836 UTC [144939] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
   2021-12-25 06:52:52.836 UTC [144939] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
   2021-12-25 06:52:52.836 UTC [144833] WARNING:  terminating connection because of crash of another server process
   2021-12-25 06:52:52.836 UTC [144833] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
   2021-12-25 06:52:52.836 UTC [144833] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
   2021-12-25 06:52:52.838 UTC [144517] LOG:  all server processes terminated; reinitializing
   2021-12-25 06:52:52.842 UTC [144942] LOG:  database system was interrupted; last known up at 2021-12-25 06:46:54 UTC
   2021-12-25 06:52:52.842 UTC [144942] LOG:  database system was not properly shut down; automatic recovery in progress
   2021-12-25 06:52:52.843 UTC [144942] LOG:  redo starts at 0/192A690
   2021-12-25 06:52:52.843 UTC [144942] LOG:  invalid record length at 0/19378B8: wanted 24, got 0
   2021-12-25 06:52:52.843 UTC [144942] LOG:  redo done at 0/1937430
   2021-12-25 06:52:52.850 UTC [144517] LOG:  database system is ready to accept connections

That looks like PostgreSQL itself crashed. So I started trying to find
a core dump, but couldn't find it. I checked the kernel log and this is
where it gets interesting:

   Dec 25 06:52:52 romulus-server kernel: [28835.277591] BUG: Unable to handle kernel data access on write at 0x132b47d38499fd58
   Dec 25 06:52:52 romulus-server kernel: [28835.277624] Faulting instruction address: 0xc0000000004d0434
   Dec 25 06:52:52 romulus-server kernel: [28835.277636] Oops: Kernel access of bad area, sig: 11 [#150]
   Dec 25 06:52:52 romulus-server kernel: [28835.277656] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
   Dec 25 06:52:52 romulus-server kernel: [28835.277669] Modules linked in: veth nft_masq zfs(PO) zunicode(PO) zzstd(O) zlua(O) zcommon(PO) znvpair(PO) zavl(PO) icp(PO) spl(O) vhost_vsock vmw_vsock_virtio_transport_common vhost vhost_iotlb vsock xt_CHECKSUM nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter xt_tcpudp nft_compat bridge stp llc nf_tables nfnetlink binfmt_misc dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua joydev input_leds ipmi_powernv mac_hid ipmi_devintf ipmi_msghandler ofpart cmdlinepart at24 powernv_flash mtd uio_pdrv_genirq opal_prd uio ibmpowernv vmx_crypto sch_fq_codel jc42 ip_tables x_tables autofs4 xfs btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor hid_generic usbhid hid raid6_pq libcrc32c raid1 raid0 multipath linear nouveau ses enclosure scsi_transport_sas ast drm_vram_helper i2c_algo_bit drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm crct10dif_vpmsum
   Dec 25 06:52:52 romulus-server kernel: [28835.277776]  crc32c_vpmsum xhci_pci tg3 aacraid xhci_pci_renesas drm_panel_orientation_quirks
   Dec 25 06:52:52 romulus-server kernel: [28835.277918] CPU: 26 PID: 144937 Comm: postgres Tainted: P      D    O      5.11.0-41-generic #45-Ubuntu
   Dec 25 06:52:52 romulus-server kernel: [28835.277943] NIP: c0000000004d0434 LR: c0000000004d032c CTR: c0000000010a90e0
   Dec 25 06:52:52 romulus-server kernel: [28835.277975] REGS: c000000056b9f6b0 TRAP: 0380   Tainted: P      D    O       (5.11.0-41-generic)
   Dec 25 06:52:52 romulus-server kernel: [28835.278008] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 88002281  XER: 0000008c
   Dec 25 06:52:52 romulus-server kernel: [28835.278050] CFAR: c0000000004d041c IRQMASK: 0 
   Dec 25 06:52:52 romulus-server kernel: [28835.278050] GPR00: c0000000004d032c c000000056b9f950 c000000002409a00 0000000000000000 
   Dec 25 06:52:52 romulus-server kernel: [28835.278050] GPR04: 0000000000400cc0 0000000000000097 ffffffffffffffff c000000ffda9d0d0 
   Dec 25 06:52:52 romulus-server kernel: [28835.278050] GPR08: 0000000ffbd90000 132b47d38499fce8 0000000000000070 d4ff277338704e25 
   Dec 25 06:52:52 romulus-server kernel: [28835.278050] GPR12: 0000000000002000 c000000ffffd2c00 0000000000000000 c000000116c512d0 
   Dec 25 06:52:52 romulus-server kernel: [28835.278050] GPR16: 0000000000000154 c000000116c51570 c000000056b9fc88 0000000000000154 
   Dec 25 06:52:52 romulus-server kernel: [28835.278050] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
   Dec 25 06:52:52 romulus-server kernel: [28835.278050] GPR24: c000000000ecccc0 0000000000000001 c0000000024588fc c000000000ec9954 
   Dec 25 06:52:52 romulus-server kernel: [28835.278050] GPR28: ffffffffffffffff c00000001d597e40 0000000000400cc0 c000000003018880 
   Dec 25 06:52:52 romulus-server kernel: [28835.278213] NIP [c0000000004d0434] kmem_cache_alloc_node+0x1d4/0x490
   Dec 25 06:52:52 romulus-server kernel: [28835.278237] LR [c0000000004d032c] kmem_cache_alloc_node+0xcc/0x490
   Dec 25 06:52:52 romulus-server kernel: [28835.278268] Call Trace:
   Dec 25 06:52:52 romulus-server kernel: [28835.278283] [c000000056b9f950] [c0000000004d032c] kmem_cache_alloc_node+0xcc/0x490 (unreliable)
   Dec 25 06:52:52 romulus-server kernel: [28835.278328] [c000000056b9f9c0] [c000000000ec9954] __alloc_skb+0x74/0x2d0
   Dec 25 06:52:52 romulus-server kernel: [28835.278369] [c000000056b9fa20] [c000000000ecccc0] alloc_skb_with_frags+0x70/0x2e0
   Dec 25 06:52:52 romulus-server kernel: [28835.278403] [c000000056b9faa0] [c000000000ec0f38] sock_alloc_send_pskb+0x1d8/0x200
   Dec 25 06:52:52 romulus-server kernel: [28835.278436] [c000000056b9fb10] [c0000000010a93a8] unix_stream_sendmsg+0x2c8/0x710
   Dec 25 06:52:52 romulus-server kernel: [28835.278471] [c000000056b9fc10] [c000000000eb64e0] sock_sendmsg+0x80/0xb0
   Dec 25 06:52:52 romulus-server kernel: [28835.278494] [c000000056b9fc40] [c000000000ebab88] __sys_sendto+0xf8/0x1a0
   Dec 25 06:52:52 romulus-server kernel: [28835.278526] [c000000056b9fd90] [c000000000ebaca0] sys_send+0x30/0x40
   Dec 25 06:52:52 romulus-server kernel: [28835.278558] [c000000056b9fdb0] [c000000000036ffc] system_call_exception+0x10c/0x230
   Dec 25 06:52:52 romulus-server kernel: [28835.278601] [c000000056b9fe10] [c00000000000d374] system_call_vectored_common+0xf4/0x26c
   Dec 25 06:52:52 romulus-server kernel: [28835.278634] --- interrupt: 3000 at 0x7ec638a194f4
   Dec 25 06:52:52 romulus-server kernel: [28835.278654] NIP: 00007ec638a194f4 LR: 0000000000000000 CTR: 0000000000000000
   Dec 25 06:52:52 romulus-server kernel: [28835.278685] REGS: c000000056b9fe80 TRAP: 3000   Tainted: P      D    O       (5.11.0-41-generic)
   Dec 25 06:52:52 romulus-server kernel: [28835.278719] MSR: 900000000280f033 <SF,HV,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 48008281 XER: 00000000
   Dec 25 06:52:52 romulus-server kernel: [28835.278766] IRQMASK: 0 
   Dec 25 06:52:52 romulus-server kernel: [28835.278766] GPR00: 000000000000014e 00007fffe99c1800 00007ec638a47f00 0000000000000009 
   Dec 25 06:52:52 romulus-server kernel: [28835.278766] GPR04: 00000043809d1148 0000000000000154 0000000000000000 0000000000001ae8 
   Dec 25 06:52:52 romulus-server kernel: [28835.278766] GPR08: 0000004362347d00 0000000000000000 0000000000000000 0000000000000000 
   Dec 25 06:52:52 romulus-server kernel: [28835.278766] GPR12: 0000000000000000 00007ec6348e0890 0000000000000000 ffffffffffffffff 
   Dec 25 06:52:52 romulus-server kernel: [28835.278766] GPR16: 0000000000000000 000000436233f7a0 0000000000000001 0000000000000000 
   Dec 25 06:52:52 romulus-server kernel: [28835.278766] GPR20: 00007fffe99c18ac 0000004362344f48 0000000000000004 00007fffe99c18b0 
   Dec 25 06:52:52 romulus-server kernel: [28835.278766] GPR24: 0000000006000001 0000000000000000 0000000000000154 00000043809d1148 
   Dec 25 06:52:52 romulus-server kernel: [28835.278766] GPR28: 0000000000000000 00007ec6348d9938 00000043809ceb00 000000000000000b 
   Dec 25 06:52:52 romulus-server kernel: [28835.278992] NIP [00007ec638a194f4] 0x7ec638a194f4
   Dec 25 06:52:52 romulus-server kernel: [28835.279020] LR [0000000000000000] 0x0
   Dec 25 06:52:52 romulus-server kernel: [28835.279038] --- interrupt: 3000
   Dec 25 06:52:52 romulus-server kernel: [28835.279054] Instruction dump:
   Dec 25 06:52:52 romulus-server kernel: [28835.279072] f9210020 41820098 2e1cffff 3b200001 2c2a0000 41820088 41920010 894a0007 
   Dec 25 06:52:52 romulus-server kernel: [28835.279110] 7c1c5000 40820078 815f0028 e97f00b8 <7ce9502a> 7c095214 886d0988 9b2d0988 
   Dec 25 06:52:52 romulus-server kernel: [28835.279141] ---[ end trace fe7ee98d0b7beb6a ]---
   Dec 25 06:52:52 romulus-server kernel: [28835.855026] 

I'm only experiencing this on ppc64el, but not amd64 or risc64. The
closest thing I could find to it was this:

   https://github.com/dynup/kpatch/issues/1228

I'm using kernel 5.11.0-41-generic and g++ 11.1.0.

Any thoughts?

-- 
Kip Warner -- Senior Software Engineer
OpenPGP signed/encrypted mail preferred
https://www.thevertigo.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part
URL: <http://lists.ozlabs.org/pipermail/linuxppc-dev/attachments/20220103/66c82394/attachment.sig>


More information about the Linuxppc-dev mailing list