Adding support for custom SEL records
yulei.sh at bytedance.com
Fri Oct 21 00:24:55 AEDT 2022
On Thu, Oct 20, 2022 at 2:05 AM Bills, Jason M
<jason.m.bills at linux.intel.com> wrote:
> On 10/19/2022 11:10 AM, Brad Bishop wrote:
> > Thanks Jason
> > On Wed, Oct 19, 2022 at 09:50:47AM -0600, Bills, Jason M wrote:
> >> Intel had a requirement to support storing at least 4000 log entries.
Bytedance has a requirement of 1000 log entries.
> > Ok. So is it fair to assume anyone using the DBus backend does not have
> > this requirement?
> That is my assumption, yes.
> >> At the time, we were able to get about 400 entries on D-Bus before
> >> D-Bus performance became unusable.
> > To anyone using the DBus backend - have you observed similar performance
> > issues?
We did hit the performance issue. Specifically, it is extremely slow
during BMC boot, when log-manager restores the log entries and puts them
That's when I started the discussion about
Later we resolved the issue by:
* Applying the patch
* Implementing the SEL cache in ipmid, which is already upstreamed
* Improving the SEL cache by serialization (not upstreamed)
Eventually we got fair performance on SEL handling (with 1000
entries); it should handle 4000 as well.
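To illustrate the idea of a serialized SEL cache, here is a minimal
sketch in Python. It is not the ipmid code: the class name, the entry
fields, the JSON format and the snapshot path are all assumptions made
for illustration. The point is only that a snapshot on disk lets the
cache be restored at boot without walking every log entry again.

```python
import json
import os
import tempfile
from dataclasses import dataclass, asdict


@dataclass
class SelEntry:
    # Hypothetical fields; a real SEL record carries more than this.
    record_id: int
    timestamp: int
    sensor_number: int
    event_data: str  # hex string


class SelCache:
    """In-memory SEL cache with a serialized snapshot on disk."""

    def __init__(self, path):
        self.path = path
        self.entries = {}  # record_id -> SelEntry

    def add(self, entry):
        self.entries[entry.record_id] = entry

    def save(self):
        # Write to a temporary file and rename it, so an interrupted
        # write never leaves a half-written snapshot behind.
        directory = os.path.dirname(self.path) or "."
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump([asdict(e) for e in self.entries.values()], f)
        os.replace(tmp, self.path)

    def load(self):
        # Returns True if a snapshot was restored. On False the caller
        # has to rebuild the cache from the log entries (the slow path).
        try:
            with open(self.path) as f:
                raw = json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            return False
        self.entries = {r["record_id"]: SelEntry(**r) for r in raw}
        return True
```

At startup a service would call load() first and only fall back to
reading every log entry when it returns False.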
> > Jason is there a testcase or scenario I can execute to highlight the
> > issues you refer to concretely? Maybe something like "create 4000 sels,
> > run ipmitool and see how long it takes?"
> To clarify, my understanding is the D-Bus performance issues were not
> isolated to just IPMI. All of D-Bus for every BMC service was impacted.
> If I remember correctly, Ed Tanous is who did the initial evaluation, so
> he may have more detail. But I think it was similar to what you
> suggest: Create 4000 logs on D-Bus and check the performance. This
> could be done with ipmitool.
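A rough way to run the kind of check described above is sketched below
in Python. It only times commands, so it can be pointed at ipmitool or
at any other client. The helper names are made up for this example,
and it is an assumption that "ipmitool event 1" (a predefined test
event) ends up as a log entry on the backend being measured; that
depends on the platform.

```python
import subprocess
import time


def time_command(argv, repeat=3):
    """Run argv `repeat` times; return wall-clock durations in seconds."""
    durations = []
    for _ in range(repeat):
        start = time.perf_counter()
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def fill_sel(count, event_number="1"):
    """Generate `count` test events, one ipmitool call per event."""
    for _ in range(count):
        subprocess.run(
            ["ipmitool", "event", event_number],
            check=True,
            stdout=subprocess.DEVNULL,
        )


def run_scenario(count=4000):
    """Fill the log, then time how long listing the SEL takes."""
    fill_sel(count)
    return time_command(["ipmitool", "sel", "list"])
```

Calling run_scenario() on the BMC (or with ipmitool pointed at it)
gives a few timings for "sel list" at the chosen entry count. Timing
some unrelated D-Bus client the same way before and after the fill
would show whether other services are affected too.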
> >> I'd also be curious about the reverse question. Is there any benefit
> >> to storing logs on D-Bus that makes it a better solution?
> > Yes, this is exactly the question I've been trying to ask. The answer
> > seems only to be that the code is in meta-intel/intel-ipmi-oem - but
> > that is easily fixed by moving the code to
> > meta-phosphor/phosphor-host-ipmid.
> >> At the risk of complicating things more (https://xkcd.com/927/), D-Bus
> >> was the primary solution when Intel joined. We created the rsyslog
> >> approach because of the limitation imposed by D-Bus. But I know there
> >> are still those who don't like the rsyslog approach. Is there a way
> >> we can now get together and define a new logging solution that is
> >> fully upstream and avoids the drawbacks of both existing solutions?
> > I hope so, because doing that would make things a lot easier for our
> > users adopting OpenBMC.
> My main requirements are to store many logs (at least 4000 was the
> original number, but I can try to get an updated number if needed) and
> have them persist across BMC reboots.
> We currently accomplish this using rsyslog to extract logs from the
> journal and store them in a persistent text file.
> How is best to approach starting a new design discussion? Should we
> continue discussing in this thread? Start a design doc review?
> Something else?
> > Thanks,
> > brad
I would like to add several notes (possibly limitations) about
rsyslog's SEL in intel-ipmi-oem; please correct me if I am wrong.
* It handles the SELs from phosphor-sel-logger, which mostly contain
only threshold events.
* It iterates over the SEL files and converts the file content into SEL
data on every request, which does not seem optimal.
* The "add sel entry" command does not really add a SEL log; it adds an
event entry to Redfish instead.
* With the above behavior, it basically has two separate types of logs:
SEL logs that come from rsyslog, and Redfish event logs that are created
by "add sel entry". Thus the implementation seems to only support SELs
for sensor threshold events, but not for discrete sensors.
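On the note about converting the file content on every request, one
possible way to avoid the repeated work is to keep the parsed records
and re-parse only when the file has changed. The sketch below is in
Python and is not based on the intel-ipmi-oem code; the class name and
the "timestamp id,type,data" line layout are assumptions for
illustration only.

```python
import os


class ParsedFileCache:
    """Re-parse a log file only when its mtime or size has changed."""

    def __init__(self, parse_line):
        self._parse_line = parse_line
        self._cache = {}  # path -> ((mtime_ns, size), records)
        self.parse_count = 0  # number of full parses, for inspection

    def records(self, path):
        st = os.stat(path)
        stamp = (st.st_mtime_ns, st.st_size)
        hit = self._cache.get(path)
        if hit is not None and hit[0] == stamp:
            return hit[1]  # file unchanged: reuse the parsed records
        with open(path) as f:
            recs = [self._parse_line(line) for line in f if line.strip()]
        self.parse_count += 1
        self._cache[path] = (stamp, recs)
        return recs


def parse_line(line):
    # Hypothetical layout: "<timestamp> <id>,<type>,<data>"
    timestamp, rest = line.rstrip("\n").split(" ", 1)
    record_id, record_type, data = rest.split(",", 2)
    return {
        "timestamp": timestamp,
        "id": int(record_id),
        "type": record_type,
        "data": data,
    }
```

With this, a request handler calls records(path) each time, but the
conversion cost is paid only after the file is actually written to.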
At Bytedance we need a "full" SEL feature that supports both
threshold and discrete sensors.
The whole solution is based on the DBus logging, but it involves
different repos (ipmid, phosphor-logging, fault-monitor). Part of the
implementation is upstreamed, but some parts are internal for now.
I would like to share the details when I have bandwidth :)