Prioritizing URIs with tight performance requirement in openBmc with bmcweb

Ed Tanous edtanous at google.com
Sat May 27 02:29:17 AEST 2023


On Fri, May 26, 2023 at 8:16 AM Rohit Pai <ropai at nvidia.com> wrote:
>
> Hello Ed,
>
> Thanks for your response and feedback. Please find my answers to your questions below.

One additional test: can you enable handler tracking and provide the result?
https://www.boost.org/doc/libs/1_82_0/doc/html/boost_asio/overview/core/handler_tracking.html

That will give a better idea of what specific callback is stalling,
and for how long.
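
If it helps, enabling the tracking is just a preprocessor define before
any asio header is included (or the equivalent -D flag on the compile
line); the traces go to stderr.  A minimal sketch:

    // Must be defined before the first asio include; every handler
    // creation/invocation is then traced to stderr.
    #define BOOST_ASIO_ENABLE_HANDLER_TRACKING
    #include <boost/asio.hpp>
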
One thing I've considered is just putting a hard cap of, say, 100ms on
the individual handlers, where, if any handler takes longer than that,
we print a warning to the console.  That would allow us to much more
easily find the flows that are taking too long in practice.  I would
expect most individual context handlers to complete in < 10 ms, unless
they're processing too much data.
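
Something like the following is roughly what I have in mind for that
cap.  It's only a sketch (timedHandler is a made-up name, the 100ms
budget is arbitrary, and bmcweb doesn't have this today), but it shows
the shape of it:

    #include <chrono>
    #include <iostream>
    #include <utility>

    // Wrap a completion handler, time it, and complain on stderr if it
    // ran longer than the budget.
    template <typename Handler>
    auto timedHandler(Handler&& handler)
    {
        return [h = std::forward<Handler>(handler)](auto&&... args) mutable {
            auto start = std::chrono::steady_clock::now();
            h(std::forward<decltype(args)>(args)...);
            auto elapsed = std::chrono::steady_clock::now() - start;
            if (elapsed > std::chrono::milliseconds(100))
            {
                std::cerr << "handler took "
                          << std::chrono::duration_cast<
                                 std::chrono::milliseconds>(elapsed).count()
                          << "ms\n";
            }
        };
    }
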

>
> > > What other constraints are here?  We're talking about a TCP protocol, running on a network, on a multi-process CPU.  Are these hard realtime requirements?  It's unlikely you're going to get hard realtime guarantees from Redfish.
> No, these are not hard real time requirements. The latency can go beyond 500ms in some instances, but we want to target the outliers (samples which take more than 500ms of TAT) to be less than one percent of the total samples.
> We also want to limit the max latency of the power, thermal metric URI to under 1s. The fabric BMC periodically polls this URI from our platform and implements its PID control loop based on the response, so it's critical for us to limit the max latency of this URI.
>
> >> TCP ─A─► TLS ─B─► HTTP Connection ─C─► DBus ─D─► Daemon
> >> Which location are you seeing queuing problems?
> The time spent in the route handler after getting the response from the backend daemon is affecting the next URI, which is waiting in the queue.

To make sure we're clear, the route handler is the part JUST calling
DBus and turning it into json objects.  The json object serialization
happens outside the router call, so it isn't taking significant time
in your tests?

> From our tests we see that the backend daemon responds on average within 200ms. Since this is async, it has no impact.
> The response received from the dbus backend has around 2500 properties to process. Processing this in the route handler and preparing the response takes an average of 200ms (max of 400ms, with several instances close to 300ms). The average CPU consumption during the test is 50%, and bmcweb's average consumption is between 12 and 15%.

This is not so bad;  I'm assuming you're not running encryption?

>
>
> >> What is the ballpark for how big "huge amount" of data would be?  What processing is being done?
> The JSON in the HTTP response is around 400KB.

First of all, let's do the easy things to reduce the size.  Start by
making the json not pretty print (change the nlohmann::json::dump()
call in http_connection.hpp to do that), or switch your clients to use
CBOR encoding, which was added a few months ago, to reduce the payload
size being sent on the wire.
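
For reference, the difference is just the indent argument passed to
dump(); the exact call in http_connection.hpp may not look exactly like
this, so treat it as a sketch:

    #include <nlohmann/json.hpp>
    #include <string>

    std::string serialize(const nlohmann::json& jsonValue)
    {
        // Pretty printed (extra whitespace on the wire) would be:
        //   jsonValue.dump(2, ' ', true, nlohmann::json::error_handler_t::replace)
        // Compact: an indent of -1 emits no newlines or indentation.
        return jsonValue.dump(-1, ' ', true,
                              nlohmann::json::error_handler_t::replace);
    }
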

> The route handler logic makes the GetManagedObjects call to the backend service and processes the response which has around 2500 properties to loop through.

Can you please change this to Properties GetAll and see if that
improves your consistency?  We never really intended messages to get
THAT big, so the fact that you're seeing some blocking isn't that
surprising.  Also keep in mind, the data structures we use for the
GetManagedObjects call haven't really been optimized, and create a
bunch of intermediates.  If you profile this, you could likely get the
DBus message -> nlohmann json processing time down by an order of
magnitude.
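
As a sketch of what I mean (the service, object path, and interface
names below are placeholders, not your actual backend), a per-interface
GetAll call looks roughly like:

    #include <boost/asio.hpp>
    #include <sdbusplus/asio/connection.hpp>
    #include <map>
    #include <string>
    #include <variant>

    void getSensorValues(sdbusplus::asio::connection& bus)
    {
        using Variant =
            std::variant<int64_t, uint64_t, double, bool, std::string>;
        bus.async_method_call(
            [](const boost::system::error_code ec,
               const std::map<std::string, Variant>& props) {
                if (ec)
                {
                    return;
                }
                // A handful of properties to walk, instead of the whole
                // GetManagedObjects tree.
            },
            "xyz.openbmc_project.ExampleSensorService",          // placeholder
            "/xyz/openbmc_project/sensors/temperature/example",  // placeholder
            "org.freedesktop.DBus.Properties", "GetAll",
            "xyz.openbmc_project.Sensor.Value");
    }
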

> There is no specific business logic here which is time consuming, but it's the number of properties which seems to have a significant impact.
>
> >> One thing I've considered before is switching bmcweb over to boost::json, which can do incremental chunked parsing, unlike nlohmann, which would let us unblock the flows as each processes the data.
> Do you have any pointers for this? Where exactly would we unblock the flow: in the route handler after processing a chunk of properties from the backend, or somewhere else?

boost/json is in the package and has better documentation than I have.
In terms of unblocking, we could actually write out values
incrementally as the dbus responses are returned, so we reduce the
memory required, and in theory reduce the amount of time we block on
any one handler.  Even if we didn't do that fully, we could, say,
process 4kB worth of data, then return control to the io_context
through boost::asio::dispatch so that other traffic could be
processed.
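
A very rough sketch of that idea follows.  The names here are invented,
and I've used boost::asio::post rather than dispatch, since post is the
call that's guaranteed to go back through the queue instead of running
the handler inline:

    #include <boost/asio.hpp>
    #include <algorithm>
    #include <cstddef>
    #include <memory>
    #include <vector>

    // Handle a 4kB slice of a large payload, then yield so handlers from
    // other connections can run before the next slice.
    void processInSlices(boost::asio::io_context& io,
                         std::shared_ptr<std::vector<char>> data,
                         std::size_t offset = 0)
    {
        constexpr std::size_t sliceSize = 4096;
        std::size_t end = std::min(offset + sliceSize, data->size());
        // ... serialize/parse bytes [offset, end) here ...
        if (end < data->size())
        {
            boost::asio::post(
                io, [&io, data, end] { processInSlices(io, data, end); });
        }
    }
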

>
> >> Can you share your test?  Is your test using multiple connections to ensure that the thermal metric is being pulled from a different connection than the aggregation URI?
> In the test we have two clients connected to our platform.
> One client polls the power, thermal metric URI every 500ms. We have a single aggregate URI for this which combines all power and thermal sensors. We have close to 100 sensors. This is the URI where we have the strict performance requirement.
> The other client periodically polls the stats and counters aggregate URI every 5 seconds. This is the URI which has around 400KB of JSON response and 2500 dbus properties to process. We are not very particular about the latency of the
> stats and counters URI. In the test results we see that the power, thermal metric URI has a max latency of 1.2 seconds (min is 2ms and avg is 9ms) and around 3.5% outliers (samples that took more than 500ms). We can see that whenever the bmcweb code is busy preparing the response for the stats and counters URI and a request for the thermal metric URI arrives, the latency of the thermal metric URI is affected.
> If we limit the number of properties in the stats and counters URI, then we can meet the requirement, but we would need to create a lot of aggregate URIs. It would not be convenient for customers to use many aggregate URIs and then combine the responses.  Is there a way to process a chunk of properties in the route handler and voluntarily release the context so bmcweb can process the next URIs in the queue? Any other trick which could work here?

Can you share your test code?  Which URIs are you hitting that are
calling GetManagedObjects?

FWIW, as an overall theme, I do wonder if the speed gained by going to
GetManagedObjects is worth it in the long run.  It seems to
significantly affect stability, because even objects added that the
route doesn't care about have to be processed, and as systems are
getting larger, the GetManagedObjects responses are getting larger,
for no change that the user sees.  Said another way, we're throwing
away significantly more data.

>
>
> > Here the idea is to develop a new application server to serve the URIs which have strong latency requirements and route the rest of the URIs to bmcweb.
> >> This is the part I don't understand;  If the forwarding calls in this new server are blocking to bmcweb, what's the point of adding it?
> The backend for the thermal metrics and the stats/counters metrics is different, so we would not be blocked on the same service.

I find it unlikely that we'll all be able to agree on what is "high"
priority (some people see any HTTP call as low priority), so unless
the priority for a given route were easily configurable, I'm not sure
this kind of architecture would work.  My preference would be to:
1. Optimize the callbacks that are taking too much time, and make them
call boost::asio::dispatch during long loops.
2. Use threads to ensure there are more resources available that don't
block (this only works up to N connections, whatever N is for thread
count); a rough sketch of that setup is below.
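
For completeness, option 2 would look roughly like the following.  This
is generic asio, not bmcweb code, and the thread count of 4 is
arbitrary:

    #include <boost/asio.hpp>
    #include <thread>
    #include <vector>

    int main()
    {
        boost::asio::io_context io;
        // Each connection would hold one of these and bind its completion
        // handlers through it so they never run concurrently:
        auto connectionStrand = boost::asio::make_strand(io);

        auto work = boost::asio::make_work_guard(io);
        std::vector<std::thread> workers;
        for (int i = 0; i < 4; i++)
        {
            workers.emplace_back([&io] { io.run(); });
        }
        // ... accept connections; wrap their handlers with
        // boost::asio::bind_executor(connectionStrand, handler) ...
        work.reset();
        for (auto& t : workers)
        {
            t.join();
        }
    }
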


Thanks for the discussion;  Let me know what you find.


> With a reverse proxy, our idea is to forward the thermal metric URIs to another backend application server (a new bmcweb instance or a similar lightweight server) and all other URIs to the existing bmcweb.
>
> I will share any further results from our internal tests as we try out different things.
>
> Thanks
> Rohit PAI
>
>
> -----Original Message-----
> From: Ed Tanous <edtanous at google.com>
> Sent: Wednesday, May 24, 2023 9:57 PM
> To: Rohit Pai <ropai at nvidia.com>
> Cc: openbmc at lists.ozlabs.org
> Subject: Re: Prioritizing URIs with tight performance requirement in openBmc with bmcweb
>
>
> On Wed, May 24, 2023 at 2:36 AM Rohit Pai <ropai at nvidia.com> wrote:
> >
> > Hello All,
> >
> >
> >
> > We have a requirement in our platform to serve a few specific URIs with a tight performance requirement on the turnaround time (latency).
> >
> > One such example is the telemetry sensor metric URI, which carries power and thermal data and can have a max turnaround time of 500ms.
>
> What other constraints are here?  We're talking about a TCP protocol, running on a network, on a multi-process CPU.  Are these hard realtime requirements?  It's unlikely you're going to get hard realtime guarantees from Redfish.
>
> >
> >
> >
> > The current bmcweb design uses only a single thread to serve all URI requests/responses.
> >
> > If bmcweb is processing a huge amount of data (which is common for aggregation URIs) then other requests would get blocked and their latency would be impacted.
>
> The bmcweb queuing flow looks something like:
>
> TCP ─A─► TLS ─B─► HTTP Connection ─C─► DBus ─D─► Daemon
>
> Which location are you seeing queuing problems?  Keep in mind, HTTP
> 1.1 can only process a single request/response at a time per connection, so if your system is trying to process things from a single connection at A, you're right, long requests will block short ones.
>
> >
> > Here I am referring to the time bmcweb takes to prepare the JSON response after it has got the data from the backend service.
>
> What is the ballpark for how big "huge amount" of data would be?  What
> processing is actually being done?   This would be the first time that
> json parsing itself has actually shown up on a performance profile, but with expand + aggregation, you're right, there's potential for that.
>
> One thing I've considered before is switching bmcweb over to boost::json, which can do incremental chunked parsing, unlike nlohmann, which would let us unblock the flows as each processes the data.
>
> >
> > In our platform, we see that the power, thermal metric URI can take more than 500ms when it’s requested in parallel with other aggregation URIs which have huge response data.
>
> Can you share your test?  Is your test using multiple connections to ensure that the thermal metric is being pulled from a different connection than the aggregation URI?
>
> >
> >
> >
> > To solve this problem, we thought of a couple of solutions.
> >
> >
> >
> > To introduce multi-threading support in bmcweb.
>
> Sure, I have no problem with adding threads, and it really wouldn't be tough to accomplish as a test:
> 1. Link pthreads in meson.  Make this a meson option so platforms that don't need multiple threads can opt out of it.
> 2. go to each async_read and async_write call, and ensure that they are using a strand (to keep processing on the same thread for any one call).
> 3. Locate all of the global and cross connection data structures, and add a mutex to each of them.  One of the global data structures is the Dbus connection itself, so if your performance problem exists on C or D above, it will likely still exist with multiple threads.
> 4. Update sdbusplus asio connection to support strands, ensuring that the callbacks happen on the same thread they're requested.
> Alternatively, just set up a dbus connection per thread.
> 5. Test heavily to make sure we don't have threading access problems or missing mutexes.
> 6. Update the DEVELOPING.md doc to account for multiple threads in the way we review code. (reentrancy, etc).  Most of the existing code should be reentrant, but it's worth looking.
> There's likely a few other minor things that would need to be fixed, but the above is the general gist.
>
> >
> > Does anyone have any experience/feedback on making this work?
> >
> > Is there any strong reason not to have multi-threading support in bmcweb other than general guidelines to avoid threads?
>
> It increases the binary size beyond what can fit on a lot of BMCs (about 10-20%). This is fine so long as you keep it as a compile option so people can opt into threading support.  Historically, teaching and reviewing multi-threaded code has been an order of magnitude more difficult than single-threaded code, so keeping a single thread significantly improves the review process; please plan on having folks prepared to review code for multi-threaded correctness.
>
> >
> >
> >
> > To use a reverse proxy like nginx as the front end to redirect a few URIs to a new application server.
>
> Please take a look at the OpenBMC tree around 2018-2019.  There were several platforms that formerly used nginx as the front end to bmcweb, and have since dropped it.  There was also a discussion on Discord recently you might look at.  I'm not really sure how nginx would solve your problem though.  The bmcweb reactor design already looks similar to nginx (we use asio; they use their own epoll-based event loop), so it's not clear to me what you would gain here, unless you were running multiple processes of bmcweb?  Keep in mind, there'd need to be some sort of shared state in that case, so you have to do #3 in the above anyway.
>
> >
> > Here the idea is to develop a new application server to serve the URIs which have strong latency requirements and route the rest of the URIs to bmcweb.
>
> This is the part I don't understand;  If the forwarding calls in this new server are blocking to bmcweb, what's the point of adding it?
> Feel free to just show the code of this working as well.
>
> >
> >        Has anyone experienced any limitations with nginx on OpenBMC platforms (w.r.t. performance, memory footprint, etc.)?
> >
> >        We also have the requirement to support SSE. Is there any known limitation to making such a feature work with nginx?
>
> It can be made to work.  AuthX tends to be the harder part, as implementing CSRF for SSE or websockets is a huge pain.
>
> >
> >
> >
> >
> >
> > Any other suggestions or solutions to the problem we are solving to meet our performance requirement with bmcweb?
>
> 1. Audit your code for any blocking calls.  If you have any, put them into a loop, process X bytes at a time, while calling boost::asio::dispatch in between, to not starve the other tasks.
> 2. Move the bmcweb core to a json library that can do incremental serialization/deserialization.  boost::json would be my first choice.
> 3. I have patches to turn on uring, which lets us use boost::asio::random_access_file to fix #1 for blocking filesystem calls.
> 4. Set reasonable limits on the max aggregation size that is allowed at a system level.  There were some proposals on gerrit.
>
>
> I would be worried about separating code into two classes (high priority/low priority) because everyone's opinions will differ about what should be "high" priority, and what should be "low" priority.  If that isn't configurable easily, I worry that we're going to have problems agreeing on priority, and I don't want to be in a situation where every developer is having to make priority calls for every system class.  I'm open to the possibility here, but we need to make sure it keeps code moving.
>
> >
> >
> >
> >
> >
> > Thanks
> >
> > Rohit PAI
> >
> >

