Prioritizing URIs with tight performance requirements in OpenBMC with bmcweb
Ed Tanous
edtanous at google.com
Thu May 25 02:26:40 AEST 2023
On Wed, May 24, 2023 at 2:36 AM Rohit Pai <ropai at nvidia.com> wrote:
>
> Hello All,
>
>
>
> We have a requirement in our platform to serve a few specific URI with a tight performance requirement on the turnaround time (latency).
>
> One such example is the telemetry sensor metric URI, which carries
> power and thermal data and has a maximum turnaround time of 500ms.
What other constraints are there? We're talking about a TCP-based
protocol, running over a network, on a multitasking CPU. Are these
hard realtime requirements? It's unlikely you're going to get hard
realtime guarantees from Redfish.
>
>
>
> The current bmcweb design uses only a single thread to serve all URI requests/responses.
>
> If bmcweb is processing a huge amount of data (which is common for
> aggregation URIs), then other requests get blocked and their latency
> is impacted.
The bmcweb queuing flow looks something like:
    A     B                 C      D
TCP──►TLS──►HTTP Connection──►DBus──►Daemon
At which location are you seeing queuing problems? Keep in mind that
HTTP/1.1 can only process a single request/response at a time per
connection, so if your system is trying to process everything over a
single connection at A, you're right, long requests will block short
ones.
>
> Here I am referring to the time bmcweb takes to prepare the JSON response after it has got the data from the backend service.
What is the ballpark for how big a "huge amount" of data would be?
What processing is actually being done? This would be the first time
that JSON parsing itself has actually shown up on a performance
profile, but with $expand + aggregation, you're right, there's
potential for that.
One thing I've considered before is switching bmcweb over to
boost::json, which can do incremental chunked parsing, unlike
nlohmann, which would let us unblock the flows as each connection
processes its data.
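As a rough, standalone illustration (not bmcweb code), boost::json's
stream_parser keeps its state across write() calls, so a handler could
feed it one buffer at a time and yield in between; the key name and
chunk boundaries below are made up:

    #include <boost/json.hpp>
    #include <iostream>

    int main()
    {
        // Feed the document in arbitrary pieces; parser state is kept
        // between write() calls, so a reactor could yield to other
        // handlers between chunks instead of parsing in one long call.
        boost::json::stream_parser p;
        p.write(R"({"Reading":)");
        p.write(R"( 42.5})");
        p.finish();

        boost::json::value jv = p.release();
        std::cout << jv.at("Reading").as_double() << "\n";
    }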
>
> In our platform, we see that the power/thermal metric URI can take
> more than 500ms when it's requested in parallel with other
> aggregation URIs which have huge response data.
Can you share your test? Is your test using multiple connections to
ensure that the thermal metric is being pulled from a different
connection than the aggregation URI?
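For example, a minimal two-connection latency test might look like the
sketch below. The host and URIs are hypothetical, and it uses plain
HTTP with no auth for brevity; a real BMC needs TLS and a session
token on top of this:

    #include <boost/asio/ip/tcp.hpp>
    #include <boost/beast/core.hpp>
    #include <boost/beast/http.hpp>
    #include <chrono>
    #include <iostream>
    #include <thread>

    namespace http = boost::beast::http;

    // Fetch one URI over its own TCP connection and print the latency.
    void timedGet(const std::string& host, const std::string& target)
    {
        boost::asio::io_context ioc;
        boost::asio::ip::tcp::resolver resolver(ioc);
        boost::beast::tcp_stream stream(ioc);
        stream.connect(resolver.resolve(host, "80"));

        http::request<http::empty_body> req{http::verb::get, target, 11};
        req.set(http::field::host, host);

        auto start = std::chrono::steady_clock::now();
        http::write(stream, req);

        boost::beast::flat_buffer buffer;
        http::response<http::string_body> res;
        http::read(stream, buffer, res);

        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start);
        std::cout << target << " took " << ms.count() << " ms\n";
    }

    int main()
    {
        // Two independent connections, so HTTP/1.1 serialization on
        // one connection can't delay the other request.
        std::thread slow(timedGet, "bmc.example.com",
                         "/redfish/v1/Chassis?$expand=.($levels=2)");
        std::thread fast(timedGet, "bmc.example.com",
                         "/redfish/v1/Chassis/chassis/Sensors");
        slow.join();
        fast.join();
    }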
>
>
>
> To solve this problem, we thought of a couple of solutions.
>
>
>
> To introduce multi-threading support in bmcweb.
Sure, I have no problem with adding threads, and it really wouldn't be
tough to accomplish as a test:
1. Link pthreads in meson. Make this a meson option so platforms that
don't need multiple threads can opt out of it.
2. Go to each async_read and async_write call and ensure that they
are using a strand (to keep processing on the same thread for any one
call).
3. Locate all of the global and cross-connection data structures and
add a mutex to each of them. One of the global data structures is the
DBus connection itself, so if your performance problem exists at C or
D above, it will likely still exist with multiple threads.
4. Update the sdbusplus asio connection to support strands, ensuring
that the callbacks happen on the same thread they're requested from.
Alternatively, just set up a DBus connection per thread.
5. Test heavily to make sure we don't have threading access problems
or missing mutexes.
6. Update the DEVELOPING.md doc to account for multiple threads in the
way we review code (reentrancy, etc.). Most of the existing code
should be reentrant, but it's worth looking.
There are likely a few other minor things that would need to be fixed,
but the above is the general gist; a rough sketch of what steps 1-3
look like in isolation follows.
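This is a minimal standalone asio sketch (not bmcweb code): several
threads running one io_context, a strand serializing handlers that
would touch per-connection state, and a mutex guarding
cross-connection state:

    #include <boost/asio/io_context.hpp>
    #include <boost/asio/post.hpp>
    #include <boost/asio/strand.hpp>
    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <vector>

    int main()
    {
        boost::asio::io_context ioc;

        // Step 2: a strand serializes all handlers posted through it,
        // so per-connection state needs no lock even with many threads.
        auto connStrand = boost::asio::make_strand(ioc);

        // Step 3: cross-connection state still needs its own mutex.
        std::mutex globalMutex;
        int sharedCounter = 0;

        for (int i = 0; i < 8; i++)
        {
            boost::asio::post(connStrand, [&] {
                // Runs one at a time, whichever thread picks it up.
                std::lock_guard<std::mutex> lock(globalMutex);
                sharedCounter++;
            });
        }

        // Step 1: run the reactor on several threads.
        std::vector<std::thread> workers;
        for (int i = 0; i < 4; i++)
        {
            workers.emplace_back([&ioc] { ioc.run(); });
        }
        for (std::thread& t : workers)
        {
            t.join();
        }
        std::cout << sharedCounter << "\n"; // always prints 8
    }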
>
> Does anyone have any experience/feedback on making this work?
>
> Is there any strong reason not to have multi-threading support in bmcweb other than general guidelines to avoid threads?
It increases the binary size by about 10-20%, beyond what can fit on a
lot of BMCs. This is fine so long as you keep it a compile-time option
so people can opt into threading support. Historically, teaching and
reviewing multi-threaded code has been an order of magnitude more
difficult than single-threaded code, and keeping a single thread
significantly simplifies review, so please plan on having folks
prepared to review code for multi-threaded correctness.
>
>
>
> To use a reverse proxy like nginx as the front end to redirect a few URIs to a new application server.
Please take a look at the OpenBMC tree around 2018-2019. There were
several platforms that formerly used nginx as the front end to bmcweb,
and have since dropped it. There was also a discussion on Discord
recently you might look at. I'm not really sure how nginx would solve
your problem, though. The bmcweb reactor design already looks similar
to nginx's (we use asio; nginx has its own event loop), so it's not
clear to me what you would gain here, unless you were running multiple
processes of bmcweb? Keep in mind, there'd need to be some sort of
shared state in that case, so you'd have to do #3 from the list above
anyway.
>
> Here the idea is to develop a new application server to serve the URIs which have strong latency requirements and route the rest of the URIs to bmcweb.
This is the part I don't understand: if the forwarding calls from this
new server to bmcweb are blocking, what's the point of adding it?
Feel free to just show the code of this working as well.
>
> Has anyone experienced any limitations with nginx on OpenBMC
> platforms (w.r.t. performance, memory footprint, etc.)?
>
> We also have the requirement to support SSE. Is there any known
> limitation to making such a feature work with nginx?
It can be made to work. AuthX tends to be the harder part, as
implementing CSRF protection for SSE or websockets is a huge pain.
>
>
>
>
>
> Any other suggestions or solutions to the problem we are solving, to
> meet our performance requirements with bmcweb?
1. Audit your code for any blocking calls. If you have any, put them
into a loop that processes X bytes at a time, calling
boost::asio::post in between so the other tasks aren't starved (see
the first sketch after this list).
2. Move the bmcweb core to a JSON library that can do incremental
serialization/deserialization. boost::json would be my first choice.
3. I have patches to turn on io_uring, which lets us use
boost::asio::random_access_file to fix #1 for blocking filesystem
calls (second sketch below).
4. Set reasonable limits on the max aggregation size that is allowed
at a system level. There were some proposals on gerrit.
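For item 1, the loop might be shaped like this minimal sketch; the
payload, the byte-sum "work", and the 4KiB chunk size are all made up
for illustration. It uses boost::asio::post because post always queues
the continuation, where dispatch may run it inline on the same thread
and never actually yield:

    #include <boost/asio/io_context.hpp>
    #include <boost/asio/post.hpp>
    #include <algorithm>
    #include <iostream>
    #include <memory>
    #include <string>

    // Process a large payload a chunk at a time, re-posting to the
    // io_context between chunks so other queued handlers get to run.
    void processChunked(boost::asio::io_context& ioc,
                        std::shared_ptr<std::string> payload,
                        std::shared_ptr<size_t> sum, size_t offset)
    {
        constexpr size_t chunkSize = 4096;
        size_t end = std::min(offset + chunkSize, payload->size());

        // Stand-in for real work (e.g. serializing part of a response).
        for (size_t i = offset; i < end; i++)
        {
            *sum += static_cast<unsigned char>((*payload)[i]);
        }

        if (end < payload->size())
        {
            boost::asio::post(ioc, [&ioc, payload, sum, end]() {
                processChunked(ioc, payload, sum, end);
            });
        }
        else
        {
            std::cout << "done, sum=" << *sum << "\n";
        }
    }

    int main()
    {
        boost::asio::io_context ioc;
        auto payload = std::make_shared<std::string>(1 << 20, 'x');
        auto sum = std::make_shared<size_t>(0);
        boost::asio::post(ioc, [&ioc, payload, sum]() {
            processChunked(ioc, payload, sum, 0);
        });
        ioc.run();
    }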
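And for item 3, the rough shape of what io_uring enables; this only
compiles when asio is built with BOOST_ASIO_HAS_IO_URING (Boost 1.78
or later plus liburing), and the file path is just an example:

    #include <boost/asio/buffer.hpp>
    #include <boost/asio/io_context.hpp>
    #include <boost/asio/random_access_file.hpp>
    #include <array>
    #include <iostream>

    int main()
    {
        boost::asio::io_context ioc;
        boost::asio::random_access_file file(
            ioc, "/etc/os-release",
            boost::asio::random_access_file::read_only);

        std::array<char, 4096> buf{};
        // Asynchronous read at offset 0: the reactor keeps servicing
        // other handlers instead of stalling in a blocking read().
        file.async_read_some_at(
            0, boost::asio::buffer(buf),
            [&buf](boost::system::error_code ec, std::size_t n) {
                if (!ec)
                {
                    std::cout.write(buf.data(), n);
                }
            });
        ioc.run();
    }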
I would be worried about separating code into two classes (high
priority/low priority), because everyone's opinion will differ about
what should be "high" priority and what should be "low" priority. If
that isn't easily configurable, I worry that we're going to have
problems agreeing on priority, and I don't want to be in a situation
where every developer has to make priority calls for every system
class. I'm open to the possibility here, but we need to make sure it
keeps code moving.
>
>
>
>
>
> Thanks
>
> Rohit PAI
>
>