Prioritizing URIs with tight performance requirement in OpenBMC with bmcweb

Rohit Pai ropai at nvidia.com
Fri Jun 23 22:19:46 AEST 2023


Hello Ed, 

> What two queries are you testing with?
They are MetricReport URIs:
http://${BMC}/redfish/v1/TelemetryService/MetricReports/
There is no crypto involved; I am using HTTP only.

> a. Backend dbus call turnaround time                                              - 584 ms
> This is quite high.  Have you looked at reducing this?  This would imply that you're doing blocking calls in your backend daemon.
There is no simple way to reduce this because we have around 2000 properties spread across different objects, so the dbus call uses GetManagedObjects, which triggers several internal get handlers.
One solution we have tried is to put all the needed info into a single aggregate property and fetch it with a single property get call. Even with that approach, the dbus round-trip time is 100+ ms for 2000 sensor values.
The time bmcweb took to prepare the response was mainly impacted by other services in the system that were consuming considerable CPU bandwidth.
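For reference, this is roughly the shape of the call in our route handler (a sketch only; the service and object path names are placeholders, and the usual bmcweb asyncResp/dbus_utility context is assumed):

    crow::connections::systemBus->async_method_call(
        [asyncResp](const boost::system::error_code ec,
                    const dbus::utility::ManagedObjectType& objects) {
            if (ec)
            {
                messages::internalError(asyncResp->res);
                return;
            }
            // ~2000 properties arrive in one reply; the handler then walks
            // every object/interface/property to build the report JSON.
            for (const auto& [path, interfaces] : objects)
            {
                for (const auto& [iface, properties] : interfaces)
                {
                    // ... copy each property into asyncResp->res.jsonValue ...
                }
            }
        },
        "xyz.openbmc_project.ExampleTelemetry",  // placeholder service name
        "/xyz/openbmc_project/sensors",          // placeholder object root
        "org.freedesktop.DBus.ObjectManager", "GetManagedObjects");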

> FD passing technique between backend and bmcweb 
> I'm happy to have those discussions, and the data you have above is interesting, but any sort of change would require much larger project buy-in.
What we have learnt from our experiment is that there is no way to make this solution robust without an explicit locking mechanism.
In our POC we had a backend service that writes 500KB of data to a file every 500ms, and bmcweb would read this file fd in the route handler. When we ran this code without any explicit locking or synchronization, we ran into data consistency issues. It is hard to guarantee that the reader always gets a complete, consistent snapshot, even when the reads and writes are done in a blocking manner.
If we add reader/writer locks, I expect the performance to come close to that of dbus IPC.
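Something like POSIX advisory locks on the shared file is what I have in mind. A minimal sketch of the reader side (the path is a placeholder; the writer would hold LOCK_EX around its rewrite):

    #include <fcntl.h>     // open
    #include <sys/file.h>  // flock
    #include <unistd.h>    // read, close
    #include <string>

    // Take a shared lock so the writer (holding LOCK_EX) cannot swap the
    // contents mid-read; returns the whole snapshot as a string.
    std::string readSnapshot()
    {
        std::string data;
        int fd = open("/run/telemetry/snapshot.json", O_RDONLY);
        if (fd < 0)
        {
            return data;
        }
        if (flock(fd, LOCK_SH) == 0)
        {
            char buf[4096];
            ssize_t n = 0;
            while ((n = read(fd, buf, sizeof(buf))) > 0)
            {
                data.append(buf, static_cast<size_t>(n));
            }
            flock(fd, LOCK_UN);
        }
        close(fd);
        return data;
    }

This still does a blocking read in the handler, which ties into your point about how blocking would be managed.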

> Optimization around CPU bound logic in handler code would certainly help the latency of the other requests pending in the queue.
> Is it CPU bound, or Memory bandwidth bound?  Most of the time I've seen the latter.  How did you collect the measurements on cpu versus IO versus memory bound?
All time tracking is done via explicit logging in bmcweb, so we cannot really tell whether it is CPU bound or memory bandwidth bound, but we can tell whether the CPU is waiting on an IO job or whether it is compute logic that is taking the time.
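For completeness, the instrumentation is nothing more than wall-clock deltas logged around each stage, roughly like this (a sketch using bmcweb's streaming log macro):

    #include <chrono>

    // Wall-clock delta around one stage of the handler. This measures elapsed
    // time only, so it cannot separate CPU-bound from memory-bandwidth-bound work.
    auto start = std::chrono::steady_clock::now();
    // ... dbus call / JSON population / socket write ...
    auto end = std::chrono::steady_clock::now();
    BMCWEB_LOG_DEBUG << "stage took "
                     << std::chrono::duration_cast<std::chrono::milliseconds>(
                            end - start)
                            .count()
                     << " ms";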


 > I will try the multi-threaded solution you proposed in the coming days and share the results.
When I tried the patch, I did not see any improvement. I also saw that bmcweb was still single-threaded: there was only one entry under /proc/<bmcweb-pid>/task.
In the patch we only pass a thread-count parameter to the Boost io_context. As per my reading, this does not create the threads; it only sets the concurrency hint for the io_context.
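For illustration (assuming the patch passes the count straight through to the io_context constructor):

    // Passing a count to the io_context constructor is only a concurrency
    // hint to Asio's internal locking strategy; it does not spawn any threads.
    boost::asio::io_context io(4);
    // Threads only exist if we create them ourselves and call io.run() from
    // each one, as in the snippet below.
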
I added the piece of code below to webserver_main.cpp to create the needed threads and call io.run() on each one of them.

.......
void runIOService() {
    boost::asio::io_context& io = crow::connections::getIoContext();
    io.run();
}
.......
    // Create a vector of threads
    std::vector<std::thread> threads;

    // Create and launch the threads
    for (unsigned int i = 0; i < 4; ++i) {
        threads.emplace_back(runIOService);
    }

    // Wait for all threads to finish
    for (auto& thread : threads) {
        thread.join();
    }
.........

This seems to create the threads we want, and it worked at the unit-test level, but when we ran a stress test looping a few URIs across multiple concurrent clients, it broke.
There is no crash dump from bmcweb; it just appears to hang and goes into a non-responsive state. I need to dig deeper to see whether some sort of deadlock is happening.
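One direction I plan to try is serializing handlers that share connection state onto a strand, so they never run on two threads at once. A rough sketch (not a tested bmcweb patch; it reuses the getIoContext() accessor from above):

    #include <boost/asio/io_context.hpp>
    #include <boost/asio/post.hpp>
    #include <boost/asio/strand.hpp>

    // With several threads calling io.run(), completion handlers that touch the
    // same connection (or any shared global) must be serialized, e.g. through a
    // per-connection strand.
    boost::asio::io_context& io = crow::connections::getIoContext();
    boost::asio::strand<boost::asio::io_context::executor_type> connStrand =
        boost::asio::make_strand(io);

    boost::asio::post(connStrand, [] {
        // handler body that touches this connection's state
    });
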
Let me know if you have already found a fix in your testing, or if you have any immediate thoughts on where it might be coming from.

Thanks 
Rohit 



-----Original Message-----
From: Ed Tanous <edtanous at google.com> 
Sent: Thursday, June 8, 2023 10:49 PM
To: Rohit Pai <ropai at nvidia.com>
Cc: openbmc at lists.ozlabs.org
Subject: Re: Prioritizing URIs with tight performance requirement in OpenBMC with bmcweb

On Sat, Jun 3, 2023 at 1:49 AM Rohit Pai <ropai at nvidia.com> wrote:
>
> Hello Ed,

The below is all really great data

>
> The thermal metric URI has around 100 sensors and a tight latency requirement of 500 ms.
> The stats/counter metric URI has around 2500 properties to fetch from the backend, which uses the GetManagedObjects API.
> The time analysis was done on the latency of the stats/counter URI, since it impacts the latency of the thermal metric URI given bmcweb's current single-threaded nature.

What two queries are you testing with?

>
>
>
> Method 1 - ObjectManager call to the backend service; the bmcweb handler code processes the response and prepares the required JSON objects.
> a. Backend dbus call turnaround time                                              - 584 ms

This is quite high.  Have you looked at reducing this?  This would imply that you're doing blocking calls in your backend daemon.

> b. Logic in bmcweb route handler code to prepare response      - 365 ms

This could almost certainly be reduced with some targeted things.  You didn't answer me on whether you're using TLS in this example, so based on your numbers I'm going to assume you're not.  I would've expected crypto to be a significant part of your profile.

> c. Total URI latency                                                                               - 1019 ms

a + b != c.  Is the rest the time spent writing to the socket?  What is the extra time?

>
> Method 2 - Backend populates all the needed properties in a single aggregate property.
> a. Backend dbus call turnaround time                                              - 161 ms

This is still higher than I would like to see, but in the realm of what I would expect.

> b. Logic in bmcweb route handler code to prepare response      - 71   ms

I would've expected to see this in single digit ms for a single property.  Can you profile here and see what's taking so long?

> c. Total URI latency                                                                               - 291 ms
>
> Method 3 - bmcweb reads all the properties from a file fd. Here the goal is to eliminate the latency and load that come from using dbus as an IPC for large payloads.
> a. fd read call in bmcweb                                                                     - 64 ms

This is roughly equivalent to the dbus call, so if we figure out where the bottleneck is in method 1B from the above, we could probably get this comparable.

> b. JSON object population from the read file contents             - 96 ms

This seems really high.

> c. Total URI latency                                                                                - 254 ms
> The file contents were in JSON format. If we can replace this with an efficient data structure that can be used with fd passing, then I think we can further optimize point b.

In Method 3 you've essentially invented a new internal OpenBMC API.  I would love to foster discussions of how to handle that, but we need to treat it holistically in the system, and understand how:
1. The schemas will be managed
2. Concurrency will be managed
3. Blocking will be managed (presumably you did a blocking filesystem read to get the data)

I'm happy to have those discussions, and the data you have above is interesting, but any sort of change would require much larger project buy-in.

> Optimization around CPU bound logic in handler code would certainly help the latency of the other requests pending in the queue.

Is it CPU bound or Memory bandwidth bound?  Most of the time I've seen the latter.  How did you collect the measurements on cpu versus IO versus memory bound?

>
> I will try the multi-threaded solution you proposed in the coming days and share the results.
>

Sounds good.  Thanks for the input.

