Prioritizing URIs with tight performance requirements in OpenBMC with bmcweb

Rohit Pai ropai at nvidia.com
Sat May 27 01:16:18 AEST 2023


Hello Ed, 

Thanks for your response and feedback. Please find my answers to your questions below. 

> > What other constraints are here?  We're talking about a TCP protocol, running on a network, on a multi-process CPU.  Are these hard realtime requirements?  It's unlikely you're going to get hard realtime guarantees from Redfish.
No, these are not hard real-time requirements. The latency can go beyond 500ms in some instances, but we want the outliers (samples which take more than 500ms of turnaround time) to be less than one percent of the total samples. 
We also want to limit the max latency of the power/thermal metric URI to under 1s. The fabric BMC periodically polls this URI from our platform and runs its PID control loop based on the response, so it is critical for us to bound the max latency of this URI. 

>>     A     B                 C      D
>> TCP──►TLS──►HTTP Connection──►DBus──►Daemon
>> Which location are you seeing queuing problems?  
The time spent in the route handler after the response comes back from the backend daemon is what delays the next URI waiting in the queue. 
From our tests, the backend daemon responds within 200ms on average. Since that call is async, it has no impact on other requests. 
The response received from the D-Bus backend has around 2500 properties to process. Processing these in the route handler and preparing the response takes 200ms on average (max of 400ms, with several instances close to 300ms). The average total CPU consumption during the test is 50%, and bmcweb's average consumption is between 12 and 15%. 

 
>> What is the ballpark for how big "huge amount" of data would be?  What processing is being done?   
The JSON in the HTTP response is around 400KB. 
The route handler makes a GetManagedObjects call to the backend service and processes the response, which has around 2500 properties to loop through. 
There is no specific time-consuming business logic here; it is the sheer number of properties which seems to have the significant impact. 
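
For reference, the call shape is roughly the following. This is only a simplified sketch: the service name, object path, and the variant member types are placeholders, not our real backend.

// Simplified sketch of the call shape.  Service name, object path, and the
// variant member types below are placeholders, not our real backend.
#include <sdbusplus/asio/connection.hpp>
#include <boost/asio/io_context.hpp>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <variant>

using DbusVariant =
    std::variant<std::string, bool, double, uint32_t, uint64_t, int64_t>;
using PropertyMap = std::map<std::string, DbusVariant>;
using InterfaceMap = std::map<std::string, PropertyMap>;
using ManagedObjects = std::map<sdbusplus::message::object_path, InterfaceMap>;

int main()
{
    boost::asio::io_context io;
    auto bus = std::make_shared<sdbusplus::asio::connection>(io);

    bus->async_method_call(
        [](const boost::system::error_code& ec, const ManagedObjects& objects) {
            if (ec)
            {
                std::cerr << "GetManagedObjects failed: " << ec.message() << "\n";
                return;
            }
            // This nested loop is the part that holds the reactor while the
            // ~2500 properties are turned into JSON.
            std::size_t count = 0;
            for (const auto& [path, interfaces] : objects)
            {
                for (const auto& [interface, properties] : interfaces)
                {
                    count += properties.size();
                }
            }
            std::cout << "saw " << count << " properties\n";
        },
        "com.example.StatsService",  // placeholder service
        "/com/example/stats",        // placeholder object manager root
        "org.freedesktop.DBus.ObjectManager", "GetManagedObjects");

    io.run();
    return 0;
}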

>> One thing I've considered before is switching bmcweb over to boost::json, which can do incremental chunked parsing, unlike nlohmann, which would let us unblock the flows as each processes the data.
Do you have any pointers for this? Where exactly would we unblock the flow: in the route handler after processing a chunk of properties from the backend, or somewhere else? 

>> Can you share your test?  Is your test using multiple connections to ensure that the thermal metric is being pulled from a different connection than the aggregation URI?
In the test we have two clients connected to our platform. 
One client polls the power/thermal metric URI every 500ms. We have a single aggregate URI for this, which combines all power and thermal sensors; there are close to 100 sensors. This is the URI with the strict performance requirement. 
The other client polls the stats and counters aggregate URI every 5 seconds. This is the URI with the roughly 400KB JSON response and 2500 D-Bus properties to process; we are not very particular about its latency. In the test results we see that the power/thermal metric URI has a max latency of 1.2 seconds (min 2ms, avg 9ms) and around 3.5% outliers (samples that took more than 500ms). Whenever bmcweb is busy preparing the response for the stats and counters URI and a request for the thermal metric URI arrives, the latency of the thermal metric URI is affected. 
If we limit the number of properties in the stats and counters URI we can meet the requirement, but we would need to create a lot of aggregate URIs, and it would not be convenient for customers to query many aggregate URIs and then combine the responses. Is there a way to process a chunk of properties in the route handler and voluntarily release the context so that bmcweb can process the next URI in the queue? Any other trick which could work here? 
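
To make the question concrete, something along these lines is what I had in mind. It is only a sketch with made-up names (processSlice, Property, the slice size), not existing bmcweb code:

// Sketch only; processSlice, Property, and sliceSize are made-up names,
// not existing bmcweb code.
#include <boost/asio/io_context.hpp>
#include <boost/asio/post.hpp>
#include <algorithm>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

struct Property
{
    // stand-in for one D-Bus property pulled out of GetManagedObjects
};

void processSlice(boost::asio::io_context& io,
                  std::shared_ptr<std::vector<Property>> props,
                  std::size_t index, std::function<void()> onDone)
{
    constexpr std::size_t sliceSize = 250; // needs tuning per platform

    const std::size_t end = std::min(index + sliceSize, props->size());
    for (; index < end; ++index)
    {
        // serialize (*props)[index] into the JSON response here
    }

    if (index == props->size())
    {
        onDone(); // all properties handled; send the HTTP response
        return;
    }

    // Re-queue the continuation instead of looping on, so other handlers
    // waiting on the single-threaded reactor (e.g. the thermal metric URI)
    // get a turn between slices.
    boost::asio::post(io,
                      [&io, props, index, onDone = std::move(onDone)]() mutable {
                          processSlice(io, props, index, std::move(onDone));
                      });
}

int main()
{
    boost::asio::io_context io;
    auto props = std::make_shared<std::vector<Property>>(2500);
    processSlice(io, props, 0, [] { /* would complete the response here */ });
    io.run();
    return 0;
}

The slice size is a guess and would need tuning against the 500ms budget; the point is only that boost::asio::post re-queues the continuation so a waiting thermal metric request gets serviced in between slices. Does something like this fit the bmcweb design? 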


> Here the idea is to develop a new application server to serve the URIs which have strong latency requirements and route the rest of the URIs to bmcweb.
>> This is the part I don't understand;  If the forwarding calls in this new server are blocking to bmcweb, what's the point of adding it?
The backends for the thermal metrics and the stats/counters metrics are different, so we would not be blocked on the same service. 
With a reverse proxy, our idea is to forward the thermal metric URIs to another backend application server (a new bmcweb instance or a similar lightweight server) and all other URIs to the existing bmcweb. 

I will share further results from our internal tests as we try out different things. 

Thanks 
Rohit PAI 


-----Original Message-----
From: Ed Tanous <edtanous at google.com> 
Sent: Wednesday, May 24, 2023 9:57 PM
To: Rohit Pai <ropai at nvidia.com>
Cc: openbmc at lists.ozlabs.org
Subject: Re: Prioritizing URIs with tight performance requirements in OpenBMC with bmcweb


On Wed, May 24, 2023 at 2:36 AM Rohit Pai <ropai at nvidia.com> wrote:
>
> Hello All,
>
>
>
> We have a requirement in our platform to serve a few specific URI with a tight performance requirement on the turnaround time (latency).
>
> One such example is the telemetry sensor metric URI, which carries power and thermal data and must have a max turnaround time of 500ms.

What other constraints are here?  We're talking about a TCP protocol, running on a network, on a multi-process CPU.  Are these hard realtime requirements?  It's unlikely you're going to get hard realtime guarantees from Redfish.

>
>
>
> The current bmcweb design uses only a single thread to serve all URI requests/responses.
>
> If bmcweb is processing a huge amount of data (which is common for aggregation URIs), then other requests get blocked and their latency is impacted.

The bmcweb queuing flow looks something like:

    A     B                 C      D
TCP──►TLS──►HTTP Connection──►DBus──►Daemon

Which location are you seeing queuing problems?  Keep in mind, HTTP
1.1 can only process a single request/response at a time per connection, so if your system is trying to process things from a single connection at A, you're right, long requests will block short ones.

>
> Here I am referring to the time bmcweb takes to prepare the JSON response after it has got the data from the backend service.

What is the ballpark for how big "huge amount" of data would be?  What
processing is actually being done?   This would be the first time that
json parsing itself has actually shown up on a performance profile, but with expand + aggregation, you're right, there's potential for that.

One thing I've considered before is switching bmcweb over to boost::json, which can do incremental chunked parsing, unlike nlohmann, which would let us unblock the flows as each processes the data.
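
To make that concrete, here's a rough sketch of the incremental interface (not bmcweb code, just boost::json::stream_parser being fed fragments; link Boost.JSON, or include <boost/json/src.hpp> in one translation unit, to build it):

// Minimal sketch (not bmcweb code).
#include <boost/json.hpp>
#include <iostream>
#include <string_view>

int main()
{
    boost::json::stream_parser parser;

    // Pretend these fragments arrived one at a time; the parser accepts
    // arbitrarily sized chunks.
    constexpr std::string_view chunks[] = {R"({"Temperatu)", R"(re": 42.5})"};

    for (std::string_view chunk : chunks)
    {
        parser.write(chunk.data(), chunk.size());
        // In bmcweb, control could be handed back to the io_context here
        // between chunks instead of holding the reactor for the whole body.
    }
    if (!parser.done())
    {
        parser.finish();
    }

    boost::json::value parsed = parser.release();
    std::cout << parsed.at("Temperature").as_double() << "\n";
    return 0;
}

The useful property is that write() accepts arbitrarily small buffers, so the parse can be interleaved with other work on the reactor.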

>
> In our platform, we see that the power/thermal metric URI can take more than 500ms when it's requested in parallel with other aggregation URIs which have huge response data.

Can you share your test?  Is your test using multiple connections to ensure that the thermal metric is being pulled from a different connection than the aggregation URI?

>
>
>
> To solve this problem, we thought of a couple of solutions.
>
>
>
> To introduce multi-threading support in bmcweb.

Sure, I have no problem with adding threads, and it really wouldn't be tough to accomplish as a test:
1. Link pthreads in meson.  Make this a meson option so platforms that don't need multiple threads can opt out of it.
2. Go to each async_read and async_write call, and ensure that they are using a strand (to keep processing on the same thread for any one call).  A rough sketch of the threads + strand shape follows the list below.
3. Locate all of the global and cross connection data structures, and add a mutex to each of them.  One of the global data structures is the Dbus connection itself, so if your performance problem exists on C or D above, it will likely still exist with multiple threads.
4. Update sdbusplus asio connection to support strands, ensuring that the callbacks happen on the same thread they're requested.
Alternatively, just set up a dbus connection per thread.
5. Test heavily to make sure we don't have threading access problems or missing mutexes.
6. Update the DEVELOPING.md doc to account for multiple threads in the way we review code. (reentrancy, etc).  Most of the existing code should be reentrant, but it's worth looking.
There are likely a few other minor things that would need to be fixed, but the above is the general gist.
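
For steps 1 and 2, the shape would be roughly the following. Illustrative only, with made-up names and structure, not bmcweb's actual code:

// Illustrative only: run the io_context on a small thread pool, with a
// strand per connection so any one connection's handlers stay serialized.
#include <boost/asio/io_context.hpp>
#include <boost/asio/post.hpp>
#include <boost/asio/strand.hpp>
#include <thread>
#include <vector>

int main()
{
    boost::asio::io_context io;

    // One strand per HTTP connection; in a real server this would be
    // created when the connection is accepted, and every async_read /
    // async_write completion for that connection bound to it.
    auto connectionStrand = boost::asio::make_strand(io);

    boost::asio::post(connectionStrand, [] {
        // parse request, run the route handler, serialize the response ...
    });

    // Worker pool; a meson option could pin this to 1 so platforms that
    // want to stay single-threaded can opt out.
    const unsigned threadCount = 2;
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < threadCount; ++i)
    {
        workers.emplace_back([&io] { io.run(); });
    }
    for (std::thread& t : workers)
    {
        t.join();
    }
    return 0;
}

What this sketch leaves out is exactly step 3: the mutexes around global and cross-connection state, and the per-thread (or strand-aware) dbus connection from step 4.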

>
> Does anyone have any experience/feedback on making this work?
>
> Is there any strong reason not to have multi-threading support in bmcweb other than general guidelines to avoid threads?

It increases the binary size beyond what can fit on a lot of BMCs (about 10-20%). This is fine so long as you keep it as a compile option so people can opt into threading support.  Historically, teaching and reviewing multi-threaded code has been an order of magnitude more difficult than single-threaded code, so keeping a single thread significantly simplifies the review process; please plan on having folks prepared to review code for multi-threaded correctness.

>
>
>
> To use a reverse proxy like nginx as the front end to redirect a few URIs to a new application server.

Please take a look at the OpenBMC tree around 2018-2019.  There were several platforms that formerly used nginx as the front end to bmcweb, and have since dropped it.  There was also a discussion on discord recently you might look at.  I'm not really sure how nginx would solve your problem though.  The bmcweb reactor design looks similar to nginx (we use asio, they use libuv) already, so it's not clear to me what you would gain here, unless you were running multiple processes of bmcweb?  Keep in mind, there'd need to be some sort of shared state in that case, so you have to do #3 in the above anyway.

>
> Here the idea is to develop a new application server to serve the URIs which have strong latency requirements and route the rest of the URIs to bmcweb.

This is the part I don't understand: if the forwarding calls in this new server are blocking calls to bmcweb, what's the point of adding it?
Feel free to just show the code of this working as well.

>
>        Has anyone experienced any limitations with nginx on openBmc platforms (w.r.t performance, memory footprint, etc)?
>
>        We also have the requirement to support SSE, Is there any known limitation to make such a feature work with nginx?

It can be made to work.  AuthX tends to be the harder part, as implementing CSRF for SSE or websockets is a huge pain.

>
>
>
>
>
> Any other suggestion or solution to the problem we are solving to meet our performance requirement with bmcweb?

1. Audit your code for any blocking calls.  If you have any, put them into a loop, process X bytes at a time, while calling boost::asio::dispatch in between, to not starve the other tasks.
2. Move the bmcweb core to a json library that can do incremental serialization/deserialization.  boost::json would be my first choice.
3. I have patches to turn on uring, which lets us use boost::asio::random_access_file to fix #1 for blocking filesystem calls (a rough sketch of that API follows the list below).
4. Set reasonable limits on the max aggregation size that is allowed at a system level.  There were some proposals on gerrit.
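
For #3, the asio side looks roughly like this. Sketch only: it assumes a Boost build with file/io_uring support (BOOST_ASIO_HAS_FILE / BOOST_ASIO_HAS_IO_URING, liburing at link time), and the path and sizes are made up:

// Sketch only: requires Boost.Asio built with file/io_uring support and
// liburing at link time.  The path is just an example.
#include <boost/asio/buffer.hpp>
#include <boost/asio/io_context.hpp>
#include <boost/asio/random_access_file.hpp>
#include <boost/system/error_code.hpp>
#include <array>
#include <cstddef>
#include <iostream>

int main()
{
    boost::asio::io_context io;

    boost::asio::random_access_file file(
        io, "/tmp/example.json", boost::asio::random_access_file::read_only);

    std::array<char, 4096> buffer{};

    // Asynchronous read of the first 4KiB at offset 0; the reactor stays
    // free to run other handlers (e.g. the thermal metric route) while the
    // kernel completes the I/O.
    file.async_read_some_at(
        0, boost::asio::buffer(buffer),
        [](const boost::system::error_code& ec, std::size_t bytesRead) {
            if (!ec)
            {
                std::cout << "read " << bytesRead << " bytes\n";
            }
        });

    io.run();
    return 0;
}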


I would be worried about separating code into two classes (high priority/low priority) because everyone's opinions will differ about what should be "high" priority, and what should be "low" priority.  If that isn't configurable easily, I worry that we're going to have problems agreeing on priority, and I don't want to be in a situation where every developer is having to make priority calls for every system class.  I'm open to the possibility here, but we need to make sure it keeps code moving.

>
>
>
>
>
> Thanks
>
> Rohit PAI
>
>

