Guarded load and bus error
micha at neli.hopto.org
Fri Oct 23 07:40:26 EST 2009
I'm working on an MPC8548 processor, using its RapidIO bus. I have two
kernel trees ported for a board: a linux 2.6.24 (ppc) kernel and a
linux-2.6.31 (powerpc) kernel. I don't think this bus behaviour is
RapidIO-specific, though, since the PCI bus and local bus must also
handle malfunctioning devices. The HID1[RFXE] bit is enabled.
To test bus error behaviour, I'm doing reads from mapped (RapidIO) I/O
memory (mapped cache-inhibited, guarded). 32-bit aligned accesses work
fine, so the setup is good. A RapidIO error handler is installed
(error/port-write interrupt) which printks some error bits from the
RapidIO error registers and resets them. Now I'm provoking bus errors by:
1) reading from a RapidIO device that does not exist: a timeout is asserted
2) reading from an unaligned address
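For reference, a minimal user-space sketch of such a probe read. It
assumes the I/O window has already been mapped into the process (the
mapping step is not shown) and that the kernel delivers SIGBUS for the
failed load, as in the 2.6.24 case 1) below. The names probe_read32 and
probe_sigbus are illustrative only, and the sketch is not thread-safe:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>

/* Jump target used to unwind out of the faulting load. */
static sigjmp_buf probe_env;

/* SIGBUS handler: abandon the load and return to probe_read32(). */
static void probe_sigbus(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);
}

/*
 * Perform one 32-bit load from addr.
 * Returns 0 and stores the result in *value on success,
 * or -1 if the load raised SIGBUS (or the handler could not be set).
 */
int probe_read32(const volatile uint32_t *addr, uint32_t *value)
{
    struct sigaction sa, old;
    volatile int rc = -1;

    assert(value != NULL);

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = probe_sigbus;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGBUS, &sa, &old) != 0)
        return -1;

    /* savemask = 1 so the signal mask is restored after siglongjmp(). */
    if (sigsetjmp(probe_env, 1) == 0) {
        *value = *addr;   /* the load under test */
        rc = 0;
    }

    /* Put the previous SIGBUS disposition back. */
    sigaction(SIGBUS, &old, NULL);
    return rc;
}
```

With a handler like this the application survives the SIGBUS and can
report which address failed. It does not change what the kernel does
before the signal is delivered.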
The MPC8548ERM mentions that interrupt latency is indeterminate for
guarded loads. From this I conclude that the processor stalls until it
receives data from the bus: it is not interruptible (by machine checks,
interrupts or critical interrupts). However, the following behaviour is
seen:
Linux 2.6.24 ppc:
For 1) my application gets a SIGBUS; after this, the error interrupt
runs, reporting a packet timeout: good.
For 2) the kernel OOPSes while running do_IRQ, getting the irq number.
The kernel is not in interrupt mode though: my application is killed and I may
Linux 2.6.31 powerpc:
For 1) first some interrupt runs (apparently): the machine check handler
prints a stack trace showing do_IRQ retrieving the irq number. The
kernel in this instance detects that it is running an interrupt, panics
and resets immediately.
For 2) things are even worse ;-).
Case 1) may be "solved" by disabling my own RapidIO error interrupt
handling (I think that's the IRQ about to be executed, but the kernel
hasn't gotten far enough to read the proper registers to tell me). If
the error interrupt is disabled, then the application is killed.
This behaviour seems proper, except that I can't print my (diagnostic)
errors.
With this "fix" though, case 2) proceeds as follows: the kernel
OOPSes in the machine check handler with the stack trace showing it's
executing instructions in the softirq handler. The softirq process is
killed (I assume). After this my application may continue, and I think
it retries the I/O read because (after the timeout) the machine check
OOPSes again, this time showing a timer interrupt in progress (which is
trying to wake the softirq process), thereby panicking and resetting the
board.
If I "mangle" the machine check handler to print the RapidIO error
registers and always return immediately, then I keep getting machine
checks printing 'packet timeout' and/or 'illegal field in packet' ...
apparently the I/O operation is retried again and again. Not
particularly nice for a so-called "guarded load".
To verify that the "guarded load" really is guarded, I set the timeout
to maximum (~5 seconds), and tried to read from a nonexistent device.
Under these circumstances, the board is not pingable anymore, and telnet
sessions to it are dead. These come back to life when the timeout has
passed and a SIGBUS has killed the test application.
So, the guarded load does really seem to block external interrupts (at
least timer and ethernet), but on the other hand I'm seeing inconsistent
stack traces during the machine check handling (as the last instruction
was in user space, I shouldn't be seeing stack traces in the kernel,
softirq or wherever else).
The HID0 and HID1 registers are identical in the two kernels (except
that 2.6.31 sets DOZE mode, but disabling that had no effect).
How is it possible that behaviour differs between these two kernels?
How can I get my desired behaviour that my application is killed with a
SIGBUS, and the rest of the kernel keeps running properly?
Thanks in advance for any insight,
More information about the Linuxppc-dev