EDAC stats & PCI error recovery (was Re: [PATCH 2/2] powerpc: MPC85xx EDAC device driver)

Thu Aug 2 06:34:43 EST 2007

--- Linas Vepstas <linas at austin.ibm.com> wrote:

> On Mon, Jul 30, 2007 at 03:47:05PM -0700, Doug Thompson wrote:
> > 
> > --- Linas Vepstas <linas at austin.ibm.com> wrote:
> > > Also: please note that the linux kernel has a pci error recovery
> > > mechanism built in; it's used by pseries and PCI-E. I'm not clear
> > > on what any of this has to do with EDAC, which I thought was supposed 
> > > to be for RAM only. (The EDAC project once talked about doing pci error 
> > > recovery, but that was years ago, and there is a separate system for
> > > that, now.)
> > 
> > no, edac can/does harvest PCI bus errors, via polling and other hardware error detectors.
> 
> Ehh! I had no idea. A few years ago, when I was working on the PCI error
> recovery, I sent a number of emails to the various EDAC people and mailing 
> lists that I could find, and never got a response.  I assumed the
> project was dead. I guess it's not ... 

No, it's not; some company layoffs just stirred the pot, at least for me, for a while.
I did see the IBM patches go by, but didn't have time to follow up then. I actually didn't
know the recovery interface had gotten into the kernel (my failure to watch for it), so I
was pleasantly surprised to attend the presentation at this last OLS.

> 
> > But at the current time, few PCI device drivers initialize those callback functions and
> > thus errors are lost and some IO transactions fail.
> 
> There are patches for 6 drivers in mainline (e100, e1000, ixgb, s2io,
> ipr, lpfc), and two more pending (sym53cxxx, tg3).  So far, I've written 
> all of them. 

Great.
EDAC does nothing for recovery, just logging and stats gathering and presentation.

> 
> > Over time, as drivers get updated (might take some time) then drivers
> > can take some sort of action FOR THEMSELVES
> 
> I think I need to do more to raise awareness and interest.

good point

> 
> > Yet, there is no tracking of errors - except for a log message in the log file.
> > 
> > There is NO meter on frequency of errors, etc. One must grep the log file and that is not a
> > very cycle friendly mechanism.
> 
> Yeah, there was low interest in stats. There's a core set of stats in
> /proc/ppc64/eeh, but these are clearly arch-specific. I'd like to move
> away from those.  Some recent patches added stats to the /sys tree,
> under the individual pci bridge and device nodes.  Again, these are
> arch-specific; I'd like to move to some general/standardized presentation.

The memory error consumers really like the EDAC stats, which let them track trends.
Cluster types, with thousands of nodes, like the monitoring for both memory and PCI, as well as
the harvesting from some newer hardware detectors.

> 
> > The reason I added PCI parity/error device scanning was that when I was at Linux Networx, we
> > had parity errors on the PCI-X bus, but didn't know the cause.  We later discovered that a
> > simple PCI-X riser card had manufacturing (quality) problems and didn't drive lines properly,
> > which caused the parity errors.
> 
> Heh. Not unusual. I've seen/heard of cases with voltages being low,
> and/or ground-bounce in slots near the end. There's a whole zoo of
> hardware/firmware bugs that we've had to painfully crawl through and
> fix. That's why the IBM boxes cost big $$$; here's to hoping that 
> customers understand why.

I understand

> 
> > This feature allowed us to track nodes that were having parity problems, but we had
> > no METER to know it.
> > 
> > Recovery is a good thing, BUT how do you know you are having LOTS of errors/recovery events?
> > You need a meter. EDAC provides that METER.
> 
> I'm lazy. What source code should I be looking at?  I'm concerned about
> duplication of function and proliferation of interfaces. I've got my 
> metering data under (for example)
> /sys/bus/pci/devices/0001:c0:01.0/eeh_*, mostly very arch specific.
> The code for this is in arch/powerpc/platforms/pseries/eeh_sysfs.c

http://bluesmoke.sourceforge.net/

is the SF project zone (bluesmoke was the out-of-tree name, changed to EDAC when it came into
tree; SourceForge doesn't allow renaming)

EDAC info is under:

/sys/devices/system/edac/....

mc for memory controllers
pci for pci info.

very basic, just counters and some controls
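
To give a feel for how simple the counters are to consume, here is a minimal Python sketch that
walks the tree and collects every counter it finds. It is only an illustration: the helper name
read_counters is made up, and it assumes counter files are the ones whose names end in "_count"
(for example ce_count and ue_count under mc, pci_parity_count under pci), each holding a single
integer. Check the layout on your own kernel before relying on that.

```python
import os


def read_counters(root="/sys/devices/system/edac"):
    """Return {relative_path: value} for each '*_count' file under root.

    Assumption: every counter file holds one integer. Files that cannot
    be read or parsed are skipped rather than treated as errors.
    """
    counters = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith("_count"):
                continue  # controls and other attributes are ignored
            path = os.path.join(dirpath, name)
            try:
                with open(path) as f:
                    value = int(f.read().strip())
            except (OSError, ValueError):
                continue
            counters[os.path.relpath(path, root)] = value
    return counters
```

Calling read_counters() with no argument reads the live tree; passing another directory is handy
for trying it out on a copy.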

> 
> > I met with Yanmin Zhang of Intel at OLS after his paper presentation on PCI Express Advanced
> > Error Reporting in the Kernel, and we talked about this same thing. I am talking with him on
> > having the recovery code present information into the EDAC sysfs area. (hopefully, anyway)
> 
> Hmm. OK, where's that?  Back when, I'd talked to Yanmin about coming up 
> with a generic, arch-indep way of driving the recovery routines. But
> this wasn't exactly easy, and we were still grappling with just getting
> things working.  Now that things are working, it's time to broaden
> horizons.

Not very far, but I see the potential.
When EDAC was accepted, it was placed where it is in sysfs on the advice of various kernel
developers, as a good spot of its own.

> 
> Can you point me to the current edac code?
> find . -print |grep edac is not particularly revealing at the moment.

drivers/edac

latest is in 2.6.23-rc1.   

-rc2 will have a few vital bug fixes. This release carries a fairly large set of changes since
2.6.16, when EDAC first went into the tree.

> 
> > The recovery generates log messages BUT having to periodically 'grep' the log file looking for
> > errors is not a good use of CPU cycles. grep once for a count and then grep later for a count
> > and then compare the counts for a delta count per unit time. ugly.
> 
> Yep. Maybe send events up to udev?

That is a possibility, yet EDAC consumers like the stats being available as well.
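
With a counter file, the "delta count per unit time" becomes two cheap reads instead of two
greps. A rough Python sketch, under assumptions: error_rate and read_count are names made up for
this example, and the path you pass in (say /sys/devices/system/edac/pci/pci_parity_count) has to
exist on your system and hold a single integer.

```python
import time


def read_count(path):
    """Read one integer counter from a sysfs-style file."""
    with open(path) as f:
        return int(f.read().strip())


def error_rate(path, interval_s=60.0, read=read_count, sleep=time.sleep):
    """Sample a counter twice, interval_s apart.

    Returns (delta, events_per_second). The read and sleep callables are
    parameters only so the function can be exercised without real hardware.
    """
    before = read(path)
    sleep(interval_s)
    after = read(path)
    delta = after - before
    return delta, delta / interval_s
```

A negative delta would mean the counter was reset between samples (module reload, for instance),
so a real monitor should treat that case separately.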

> 
> > The EDAC solution is to be able to have a Listener thread in user space that can be notified
> > (via poll()) that an event has occurred.
> 
> Hmm. OK, I'm alarmingly naive about udev, but my initial gut instinct is
> to pipe all such events to udev. Most of user-space has already been
> given the marching orders to use udev and/or hal for this kind of stuff.
> So this makes sense to me.

I need to go through a learning process on udev as well, it being the unified highway of event
notification from kernel to user space. One caution: when memory errors fire, they can sometimes
generate a MASSIVE number of events, one on every memory access. PCI, being different, might have
different constraints. Each needs investigation and classification of timing, etc.
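
For comparison with udev, the poll()-based listener mentioned above could look roughly like this
in Python. Treat it as a sketch under assumptions, not as how EDAC works today: it relies on the
general sysfs convention that a pollable attribute reports POLLPRI/POLLERR when its value
changes, and whether any particular EDAC attribute actually sends such notifications is an
assumption. The names wait_for_change and listen are made up for the example.

```python
import os
import select


def wait_for_change(fd, timeout_ms=None,
                    events=select.POLLPRI | select.POLLERR):
    """Block until fd reports one of `events`; False if the timeout expires.

    timeout_ms=None waits forever.
    """
    poller = select.poll()
    poller.register(fd, events)
    return bool(poller.poll(timeout_ms))


def listen(path, handle, timeout_ms=None):
    """Call handle(text) with the attribute's contents after each change.

    Returns when a wait times out. Assumes the attribute fits in 4096 bytes.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        os.read(fd, 4096)  # initial read, so only later changes wake us
        while wait_for_change(fd, timeout_ms):
            os.lseek(fd, 0, os.SEEK_SET)  # rewind before re-reading
            handle(os.read(fd, 4096).decode().strip())
    finally:
        os.close(fd)
```

The listener sleeps in the kernel between events, so it costs nothing while the hardware is
quiet. Under an error storm like the memory case above, the handler would need some rate limiting
of its own.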

> 
> > There is more than one consumer (error recovery) of error events:
> > 1) driver recovery after a transaction (which is the recovery consumer above)
> 
> I had to argue loudly for recovery in the kernel. The problem was that
> it was impossible to recover errors on scsi devices from userspace (since
> the block device and filesystems would go bonkers).

I hear you. It took the cluster consumers demanding ROBUST "meters" for their machines, to give
'validation' (so to speak) of their calculations. Bluesmoke/EDAC became that in our sandbox.

> 
> > 2) Management agents for health of a node
> > 3) Maintenance agents for predictive component replacement
> 
> Yes, agreed. Care to ask your management agent friends for where they'd
> like to get these events from (i.e. udev, or somewhere else?)

They are on a learning curve as well.
Error event processing is still so "young" under Linux. Getting people aware of it is the current
push.

> 
> > We have MEMORY (edac_mc) devices for chipsets now, but via the new edac_device class, such
> > things as ECC error tracking on DMA error checkers, FABRIC switches, L1 and L2 cache ECC events,
> > core CPU data ECC checkers, etc can be done. I have an out of kernel tree MIPS driver doing just
> > this. Other types of harvesters can be generated as well for other and/or new hardware error
> > detectors.
> 
> Ohh. I've got hardware that does this, but it's not currently using EDAC.
> There must be some edac mailing list I'm not subscribed to??

The new edac_device class rolled out in 2.6.23-rc1.
There are no in-tree drivers yet, but I have a test_device_edac module that uses it, and it is in
the svn repos.

I need to update the edac tarball to contain this stuff.

for the mailing list, go to 

http://bluesmoke.sourceforge.net/

and find the mailing list subscribe item. It is fairly low bandwidth, but carries notices and
some discussion on things now and then.

bluesmoke-devel at lists.sourceforge.net is the development email

doug t

> 
> --linas
> 
>