[5.17.0-rc6][DLPAR][SRIOV/mlx5]EEH errors and WARNING: CPU: 7 PID: 30505 at include/rdma/ib_verbs.h:3688 mlx5_ib_dev_res_cleanup

Abdul Haleem abdhalee at linux.vnet.ibm.com
Tue Mar 8 01:55:46 AEDT 2022


Greetings,

HMC DLPAR hotplug of an SRIOV logical device backed by an Everglade Mellanox adapter results in EEH error messages followed by WARNINGs on my PowerPC P10 LPAR running the latest 5.17-rc6 kernel.


From the HMC, DLPAR remove and then add the SRIOV device:
$ chhwres -r sriov -m ltcden11 --rsubtype logport -o r --id 9 -a  adapter_id=1,logical_port_id=2700400f
$ chhwres -r sriov -m ltcden11 --rsubtype logport -o a --id 9 -a phys_port_id=0,adapter_id=1,logical_port_id=2700400f,logical_port_type=eth

The above commands completed, but the console is filled with EEH errors and warnings.
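The remove-then-add sequence can be scripted for repeated runs. Below is a minimal sketch (dry-run by default; the managed-system name `ltcden11`, LPAR id 9, adapter id 1, and logical port id `2700400f` are from my setup and will differ elsewhere):

```shell
#!/bin/sh
# Sketch of the DLPAR remove-then-add sequence, run from the HMC shell.
# These identifiers match my setup and will differ on other systems.
MSYS=ltcden11
LPAR_ID=9
ADAPTER=1
LPORT=2700400f

# Print each command; only execute it when APPLY is set in the environment.
run() { echo "+ $*"; [ -z "$APPLY" ] || "$@"; }

# Remove the SRIOV logical port, then add it back on physical port 0.
run chhwres -r sriov -m "$MSYS" --rsubtype logport -o r --id "$LPAR_ID" \
    -a adapter_id=$ADAPTER,logical_port_id=$LPORT
run chhwres -r sriov -m "$MSYS" --rsubtype logport -o a --id "$LPAR_ID" \
    -a phys_port_id=0,adapter_id=$ADAPTER,logical_port_id=$LPORT,logical_port_type=eth
```

Running it with `APPLY=1` performs the actual hotplug; without it, the script only echoes the two `chhwres` invocations.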

Console messages:
PC: Registered rdma backchannel transport module.
mlx5_core 400f:01:00.0 eth1: Link up
IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
mlx5_core 8005:01:00.0 eth2: Link up
IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
mlx5_core 400f:01:00.0: poll_health:800:(pid 0): Fatal error 1 detected
EEH: Recovering PHB#400f-PE#10000
EEH: PE location: N/A, PHB location: N/A
mlx5_core 400f:01:00.0: print_health_info:424:(pid 0): PCI slot is unavailable
mlx5_core 400f:01:00.0: mlx5_trigger_health_work:756:(pid 0): new health works are not permitted at this stage
EEH: Frozen PHB#400f-PE#10000 detected
EEH: Call Trace:
EEH: [c000000000054d10] __eeh_send_failure_event+0x70/0x150
EEH: [c00000000004df98] eeh_dev_check_failure+0x2e8/0x6c0
EEH: [c00000000004e438] eeh_check_failure+0xc8/0x100
EEH: [c0000000006a04b4] ioread32be+0x114/0x180
EEH: [c008000000d42bc0] mlx5_health_check_fatal_sensors+0x28/0x180 [mlx5_core]
EEH: [c008000000d43448] poll_health+0x50/0x260 [mlx5_core]
EEH: [c00000000021fed0] call_timer_fn+0x50/0x200
EEH: [c000000000220e90] run_timer_softirq+0x340/0x7c0
EEH: [c000000000c9e85c] __do_softirq+0x15c/0x3d0
EEH: [c00000000014f068] irq_exit+0x168/0x1b0
EEH: [c000000000026f84] timer_interrupt+0x1a4/0x3e0
EEH: [c000000000009a08] decrementer_common_virt+0x208/0x210
EEH: [c00000000367bdc0] 0xc00000000367bdc0
EEH: [c0000000009bf764] dedicated_cede_loop+0x94/0x1a0
EEH: [c0000000009bc094] cpuidle_enter_state+0x2d4/0x4e0
EEH: [c0000000009bc338] cpuidle_enter+0x48/0x70
EEH: [c00000000019ded4] call_cpuidle+0x44/0x80
EEH: [c00000000019e4b0] do_idle+0x340/0x390
EEH: [c00000000019e730] cpu_startup_entry+0x30/0x40
EEH: [c0000000000605a0] start_secondary+0x290/0x2b0
EEH: [c00000000000d154] start_secondary_prolog+0x10/0x14
EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
EEH: Notify device drivers to shutdown
EEH: Beginning: 'error_detected(IO frozen)'
mlx5_core 400f:01:00.0: wait_func_handle_exec_timeout:1108:(pid 30505): cmd[0]: DESTROY_RMP(0x90e) No done completion
mlx5_core 400f:01:00.0: wait_func:1136:(pid 30505): DESTROY_RMP(0x90e) timeout. Will cause a leak of a command resource
------------[ cut here ]------------
Destroy of kernel SRQ shouldn't fail
WARNING: CPU: 7 PID: 30505 at include/rdma/ib_verbs.h:3688 mlx5_ib_dev_res_cleanup+0x104/0x1a0 [mlx5_ib]
Modules linked in: sit tunnel4 ip_tunnel rpadlpar_io rpaphp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag bonding rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser ib_umad rdma_cm ib_ipoib iw_cm ib_cm libiscsi scsi_transport_iscsi mlx5_ib ib_uverbs ib_core xts pseries_rng vmx_crypto gf128mul sch_fq_codel binfmt_misc ip_tables ext4 mbcache jbd2 dm_service_time mlx5_core sd_mod t10_pi sg ibmvfc scsi_transport_fc ibmveth mlxfw ptp pps_core dm_multipath dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 7 PID: 30505 Comm: drmgr Not tainted 5.17.0-rc6-autotest-g669b258a793d #1
NIP:  c0080000023cf20c LR: c0080000023cf208 CTR: c000000000702790
REGS: c0000000111b7420 TRAP: 0700   Not tainted  (5.17.0-rc6-autotest-g669b258a793d)
MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 48088224  XER: 00000005
CFAR: c000000000143c90 IRQMASK: 0
GPR00: c0080000023cf208 c0000000111b76c0 c008000002438000 0000000000000024
GPR04: 00000000ffff7fff c0000000111b7390 c0000000111b7388 0000000000000027
GPR08: c0000018fd067e00 0000000000000001 0000000000000027 c0000000027a68f0
GPR12: 0000000000008000 c0000018ff984e80 0000000000000000 0000000119d902a0
GPR16: 00007fffd673e838 0000000119d90ed0 0000000119da3070 0000000106ad1e38
GPR20: 0000000106acf330 0000000106acf3d8 0000000106acd838 0000000119da3208
GPR24: 0000000000000007 0000000000000000 c008000000e78320 c000000002818eb8
GPR28: c00000000fd210d0 c0080000024328a8 c000000017808000 c000000017808000
NIP [c0080000023cf20c] mlx5_ib_dev_res_cleanup+0x104/0x1a0 [mlx5_ib]
LR [c0080000023cf208] mlx5_ib_dev_res_cleanup+0x100/0x1a0 [mlx5_ib]
Call Trace:
[c0000000111b76c0] [c0080000023cf208] mlx5_ib_dev_res_cleanup+0x100/0x1a0 [mlx5_ib] (unreliable)
[c0000000111b7730] [c0080000023d4c00] __mlx5_ib_remove+0x78/0xc0 [mlx5_ib]
[c0000000111b7770] [c00000000082479c] auxiliary_bus_remove+0x3c/0x70
[c0000000111b77a0] [c000000000814278] device_release_driver_internal+0x168/0x2d0
[c0000000111b77e0] [c000000000811748] bus_remove_device+0x118/0x210
[c0000000111b7860] [c000000000809a18] device_del+0x1d8/0x4e0
[c0000000111b7920] [c008000000d601b0] mlx5_rescan_drivers_locked.part.9+0xf8/0x250 [mlx5_core]
[c0000000111b79d0] [c008000000d60870] mlx5_unregister_device+0x48/0x80 [mlx5_core]
[c0000000111b7a00] [c008000000d32930] mlx5_uninit_one+0x38/0x100 [mlx5_core]
[c0000000111b7a70] [c008000000d33330] remove_one+0x58/0xa0 [mlx5_core]
[c0000000111b7aa0] [c000000000736d0c] pci_device_remove+0x5c/0x100
[c0000000111b7ae0] [c000000000814278] device_release_driver_internal+0x168/0x2d0
[c0000000111b7b20] [c000000000728a98] pci_stop_bus_device+0xa8/0x100
[c0000000111b7b60] [c000000000728cdc] pci_stop_and_remove_bus_device_locked+0x2c/0x50
[c0000000111b7b90] [c000000000739d20] remove_store+0xc0/0xe0
[c0000000111b7be0] [c000000000806870] dev_attr_store+0x30/0x50
[c0000000111b7c00] [c0000000005767c0] sysfs_kf_write+0x60/0x80
[c0000000111b7c20] [c000000000574e50] kernfs_fop_write_iter+0x1a0/0x2a0
[c0000000111b7c70] [c00000000045e3ec] new_sync_write+0x14c/0x1d0
[c0000000111b7d10] [c000000000461904] vfs_write+0x234/0x340
[c0000000111b7d60] [c000000000461bc4] ksys_write+0x74/0x130
[c0000000111b7db0] [c00000000002f608] system_call_exception+0x178/0x380
[c0000000111b7e10] [c00000000000c64c] system_call_common+0xec/0x250
--- interrupt: c00 at 0x20000026bd74
NIP:  000020000026bd74 LR: 00002000001e34c4 CTR: 0000000000000000
REGS: c0000000111b7e80 TRAP: 0c00   Not tainted  (5.17.0-rc6-autotest-g669b258a793d)
MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 24004222  XER: 00000000
IRQMASK: 0
GPR00: 0000000000000004 00007fffd673e650 0000200000367100 0000000000000007
GPR04: 0000000119da3ea0 0000000000000001 fffffffffbad2c84 0000000119d902a0
GPR08: 0000000000000001 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 000020000005b520 0000000000000000 0000000119d902a0
GPR16: 00007fffd673e838 0000000119d90ed0 0000000119da3070 0000000106ad1e38
GPR20: 0000000106acf330 0000000106acf3d8 0000000106acd838 0000000119da3208
GPR24: 0000000119da3219 00007fffd673e878 0000000000000001 0000000119da3ea0
GPR28: 0000000000000001 0000000119d902a0 0000000119da3ea0 0000000000000001
NIP [000020000026bd74] 0x20000026bd74
LR [00002000001e34c4] 0x2000001e34c4
--- interrupt: c00
Instruction dump:
60000000 3d420000 e94a84c8 892a0000 2f890000 409eff64 3c620000 e86384d0
39200001 992a0000 48032a1d e8410018 <0fe00000> 3d420000 e94a84c8 892a0000
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
WARNING: CPU: 7 PID: 30505 at drivers/infiniband/core/verbs.c:347 ib_dealloc_pd_user+0x68/0xd0 [ib_core]
Modules linked in: sit tunnel4 ip_tunnel rpadlpar_io rpaphp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag bonding rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser ib_umad rdma_cm ib_ipoib iw_cm ib_cm li

-- 
Regards,

Abdul Haleem
IBM Linux Technology Center
