[Linuxppc-users] DMA fails to reserved memory

Brian King brking at linux.vnet.ibm.com
Sat Jun 2 05:50:45 AEST 2018


Any code you can share? On an LC922 system running bare metal, as long as your
adapter is capable of 64 bits of DMA address space (and not all adapters are),
you would not be using the IOMMU. However, that does not mean that the
physical address the host sees equals the address the adapter uses on the PCIe
link. You need to make sure you are using the DMA API, as defined in the
kernel documentation, to allocate DMA-able memory to give to the adapter.
This API gives you a virtual address you can use to access the memory in the
kernel, as well as a dma_addr_t, which is the token you give to the adapter as
the DMA address. In cases where an IOMMU is in use, this sets up the
translation control entry (TCE) in the IOMMU. In your case, where you are not
using an IOMMU, it does a simple translation to an address that can be used
on the PCIe link.
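
As a rough, untested sketch of the DMA API usage I am describing: the mydev
structure, the MYDEV_* names, the buffer size and the register offsets are all
made up for illustration, and I am assuming BAR0 is already mapped and that
your adapter takes the buffer address as two 32-bit register writes. It only
builds as part of a kernel module.

```c
/*
 * Sketch only: builds inside a kernel module, not as a userspace program.
 * struct mydev and every MYDEV_* name below are hypothetical.
 */
#include <linux/dma-mapping.h>
#include <linux/io.h>
#include <linux/kernel.h>
#include <linux/pci.h>

#define MYDEV_BUF_SIZE    (64 * 1024)  /* hypothetical buffer size */
#define MYDEV_REG_DMA_LO  0x10         /* hypothetical BAR0 register offsets */
#define MYDEV_REG_DMA_HI  0x14

struct mydev {
	void __iomem *bar0;   /* assumed to be mapped already, e.g. with pci_iomap() */
	void *buf;            /* kernel virtual address, for CPU access */
	dma_addr_t buf_dma;   /* token to hand to the adapter */
};

static int mydev_setup_dma(struct pci_dev *pdev, struct mydev *md)
{
	int rc;

	/* Declare that the adapter can generate 64-bit DMA addresses. */
	rc = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
	if (rc)
		return rc;

	/* Returns the CPU address and fills in the matching dma_addr_t. */
	md->buf = dma_alloc_coherent(&pdev->dev, MYDEV_BUF_SIZE,
				     &md->buf_dma, GFP_KERNEL);
	if (!md->buf)
		return -ENOMEM;

	/* Give the adapter the dma_addr_t, not a host physical address. */
	writel(lower_32_bits(md->buf_dma), md->bar0 + MYDEV_REG_DMA_LO);
	writel(upper_32_bits(md->buf_dma), md->bar0 + MYDEV_REG_DMA_HI);
	return 0;
}

static void mydev_teardown_dma(struct pci_dev *pdev, struct mydev *md)
{
	dma_free_coherent(&pdev->dev, MYDEV_BUF_SIZE, md->buf, md->buf_dma);
}
```

The point is that the value written to the adapter is buf_dma, never something
derived from virt_to_phys() or taken straight from the device tree.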

What you are seeing in the log below is an EEH (Enhanced Error Handling) error.
EEH is an error recovery feature of the Power PCI host bridge that allows the
platform to recover from various PCIe errors. In this case, it's a DMA write to
an invalid address. The address is 0x81166000 (it shows up in the RxeArbErr and
PE[000] lines of the PHB diag data), which is not a valid DMA address on an
LC922.
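
If you want to pull that address out of a diag dump mechanically, here is a
minimal userspace sketch. It assumes the failing address is the last word on
the RxeArbErr line, which matches the value quoted above for this log but is
not something I would rely on for every error type. last_hex_word is a made-up
helper name.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse the last whitespace-separated word of a log line as a 64-bit hex
 * value. Returns 0 on success, -1 if the line is empty or the last word is
 * not entirely hexadecimal.
 */
static int last_hex_word(const char *line, uint64_t *out)
{
	const char *end, *start;
	char *stop;

	assert(line != NULL && out != NULL);

	/* Trim trailing whitespace, including the newline dmesg leaves behind. */
	end = line + strlen(line);
	while (end > line && strchr(" \t\r\n", end[-1]) != NULL)
		end--;

	/* Walk back to the beginning of the last word. */
	start = end;
	while (start > line && start[-1] != ' ' && start[-1] != '\t')
		start--;
	if (start == end)
		return -1;

	*out = strtoull(start, &stop, 16);
	return stop == end ? 0 : -1;
}
```

Feeding it the RxeArbErr line from the log below yields 0x81166000. The same
helper applied to the PE[000] A/B line returns the full 64-bit word, so any
upper bits would still need to be masked off by hand.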

Thanks,

Brian

On 06/01/2018 11:56 AM, Brian Varney wrote:
> Hello all,
> 
> I have an LC922 system running Fedora 28 (4.16.10-300.fc28.ppc64le) and I am reserving memory by modifying the device tree passed in to the kernel as described by this forum entry: https://lists.ozlabs.org/pipermail/linuxppc-users/2017-September/000112.html
> 
> I have a PCIe adapter plugged into the system that I am testing. When the adapter performs a DMA operation to this reserved memory, things start to go south. All reads from the adapter's BAR space start returning all FFs. I suspect the reads aren't actually making it to the adapter, but I don't have a PCIe analyzer on there to verify. Then I get the following in dmesg:
> 
> [  340.316599] EEH: Frozen PHB#34-PE#0 detected
> [  340.316645] EEH: PE location: WIO Slot2, PHB location: N/A
> [  340.316675] CPU: 133 PID: 5380 Comm: mr Tainted: P           OE    4.16.10-300.fc28.ppc64le #1
> [  340.316676] Call Trace:
> [  340.316682] [c0002004287b7a40] [c000000000bec5d0] dump_stack+0xb4/0x104 (unreliable)
> [  340.316686] [c0002004287b7a80] [c00000000003f9d0] eeh_dev_check_failure+0x4b0/0x5b0
> [  340.316689] [c0002004287b7b20] [c0000000000b3ae8] pnv_pci_read_config+0x138/0x170
> [  340.316692] [c0002004287b7b70] [c0000000006c7e14] pci_user_read_config_byte+0x84/0x160
> [  340.316693] [c0002004287b7bc0] [c0000000006df1fc] pci_read_config+0x12c/0x2d0
> [  340.316696] [c0002004287b7c50] [c0000000004bceb4] sysfs_kf_bin_read+0x94/0xf0
> [  340.316698] [c0002004287b7c90] [c0000000004bbd30] kernfs_fop_read+0x130/0x2a0
> [  340.316699] [c0002004287b7ce0] [c0000000003e721c] __vfs_read+0x6c/0x1e0
> [  340.316701] [c0002004287b7d80] [c0000000003e744c] vfs_read+0xbc/0x1b0
> [  340.316703] [c0002004287b7dd0] [c0000000003e7ed4] SyS_pread64+0xc4/0x120
> [  340.316705] [c0002004287b7e30] [c00000000000b8e0] system_call+0x58/0x6c
> [  340.316734] EEH: Detected PCI bus error on PHB#34-PE#0
> [  340.316739] EEH: This PCI device has failed 1 times in the last hour
> [  340.316739] EEH: Notify device drivers to shutdown
> [  340.316746] EEH: Collect temporary log
> [  340.316780] EEH: of node=0034:01:00.0
> [  340.316783] EEH: PCI device/vendor: 00d11000
> [  340.316786] EEH: PCI cmd/status register: 00100146
> [  340.316787] EEH: PCI-E capabilities and status follow:
> [  340.316802] EEH: PCI-E 00: 0002b010 112c8023 00002950 00437d03
> [  340.316813] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
> [  340.316815] EEH: PCI-E 20: 00000000
> [  340.316816] EEH: PCI-E AER capability register set follows:
> [  340.316828] EEH: PCI-E AER 00: 14820001 00000000 00400000 00462030
> [  340.316839] EEH: PCI-E AER 10: 00000000 0000e000 000001e0 00000000
> [  340.316849] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
> [  340.316853] EEH: PCI-E AER 30: 00000000 00000000
> [  340.316855] PHB4 PHB#52 Diag-data (Version: 1)
> [  340.316855] brdgCtl:    00000002
> [  340.316857] RootSts:    00000040 00402000 a0830008 00100107 00000000
> [  340.316859] PhbSts:     0000001c00000000 0000001c00000000
> [  340.316860] Lem:        0000000010000000 0000000000000000 0000000010000000
> [  340.316862] PhbErr:     0000080000000000 0000080000000000 2148000098000240 a008400000000000
> [  340.316864] RxeArbErr:  0000000800000000 0000000800000000 7f1a01000000001b 0000000081166000
> [  340.316865] RegbErr:    0040000000000000 0000000000000000 a2000a4018000000 1800000000000000
> [  340.316868] PE[000] A/B: 8000802301000000 8000000081166000
> [  340.316871] PE[100] A/B: 80000000ff275c00 80000000300d088b
> [  340.316872] EEH: Reset with hotplug activity
> [  340.316898] iommu: Removing device 0034:01:00.0 from group 4
> 
> 
> Any idea why this is happening? My suspicion is that there is IOMMU hardware that is not allowing my adapter to access this memory, but I am not familiar with the POWER9 architecture. Is there a way to disable the IOMMU completely, or are there kernel functions to call to give my adapter "permission" to DMA to a memory range?
> 
> Thanks,
> Brian V.
> 
> 
> 
> _______________________________________________
> Linuxppc-users mailing list
> Linuxppc-users at lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-users
> 


-- 
Brian King
Power Linux I/O
IBM Linux Technology Center


