Inconsistent performance of dbus call GetManagedObjects to PSUSensor in dbus-sensors

Fri Aug 14 00:53:50 AEST 2020

On Wed, Aug 12, 2020 at 6:38 PM Sui Chen <suichen at google.com> wrote:
>
> Hello,
>
> Because machine configurations may change during development, we used a microbenchmark to try to isolate the cause and reproduce the long DBus latencies reliably, and another microbenchmark to demonstrate the idea we had tried that appeared to alleviate but not completely eliminate this DBus latency problem.

This works great, so long as you constrain things to the way they look
on the bmc, and don't look at things too closely.  One thread, pinned
to a single core tends to work best for reproducing this kind of
stuff.  If you can also pin something else to that same core that's
doing some work, it tends to be closer to a real bmc.  Also remember,
context switches are much cheaper on x86 multicore than they are on
other, less capable single core processors.  You need to keep that in
mind when running your microbenchmarks on a different platform, as the
profiles can vary significantly, especially in the worst case.

>
>
> The first microbenchmark, dbus-asio-bmk (https://gerrit.openbmc-project.xyz/c/openbmc/openbmc-tools/+/35576) mimics our patched psusensor: an ASIO worker that reads sensors at some fixed timer interval; the ASIO worker is also used by sdbusplus for handling DBus work. We continuously run "busctl tree" against the DBus interface created by the microbenchmark binary.

What patches did you apply?

>
>
> By importing the resultant DBus traffic dump and ASIO handler activity log onto the timeline view form dbus-vis, we can clearly see the the "sensor reading" ASIO work items block the DBus work items, which in turn causes very long DBus latencies to show up:

All I see is that there's a long lag between a single read of a busctl
tree call. I don't see any indication that it's because of sensor
reading specifically, at least from below.  Dumping the state of the
work queue would explain a lot about what's going on during these
periods.  I wonder if this is as easy as smearing the initial start
time of each sensor when the application starts to make sure the
timers don't synchronize like this.

>
> Although this benchmark is run on an x86 workstation instead of the BMC due to a run-time error in its dependencies, according to the results above, we see this "thundering herd" problem appear to occur on a desktop platform as well.

One odd thing I notice in that graph: we're using null_buffers?  I
thought we were using a direct async read call.

The other thing to realize is busctl tree is very inefficient.  It
executes a full introspection of all levels, then executes
GetProperties for each property.  You really need to move your
microbenchmark to GetManagedObjects (which I see you recognize below)
as that's more indicative of what a real caller would do.  I don't
think your profiles are worth looking at until you do this.

>
>
> As we modify various experimental parameters, it turns out that the more time is occupied by non-DBus ASIO work the more likely long DBus latencies are to happen, since there is a higher chance the DBus calls clash with the "fake sensor reading". Thus, we come up with an assumption that if we reduce the time spent (by the only ASIO worker) in non-DBus ASIO work, DBus latencies will be reduced.

Another option is to make it 2 threaded.  Have one thread (and
io_context) set up for handling dbus traffic, and one thread set up
for async reading of IO, with an atomic data structure in between for
passing values.

Another (probably preferred in this case) architecture would be to
implement an ASIO priority queue, and prioritize the dbus traffic
higher than the IO reading traffic.  There are examples of this in
asio, and I assumed the day where we needed this would've come a lot
sooner (we got away with a dumb fifo io queue for a long time), but
here we are.

>
>
> Based on this assumption, we attempted a few methods to reduce the time it takes psusensor to read the sensors. The second benchmark (https://gerrit.openbmc-project.xyz/c/openbmc/openbmc-tools/+/35387) explains the methods we had experimented with. It turns out we were able to reduce sensor reading time as well as the chance of long DBus method calls happening, but the inconsistent DBus call times do not completely go away. This is probably due to psusensors being much more complex than the two benchmarks with much other work contending for the ASIO worker's time.

Can you detail what methods?  It might help someone else pinpoint
another good option.

>
>
> So to summarize the point of this reply is to say:
> 1) We had attempted the ASIO handler dump as suggested and a method for analyzing DBus+ASIO performance has been embodied in dbus-vis.

Can you post the dump somewhere that others can help analyse?

>
> 2) We are interested to know if someone else is looking at similar problems.

Ps, I'd be remiss if I didn't point out that your last email was sent
as HTML.  This list prefers plaintext.

>
> 3) We will examine GetManagedObjects again when we get a chance.