MCTP/PLDM BMC-host communication design

Jeremy Kerr jk at codeconstruct.com.au
Fri Jan 21 15:37:29 AEDT 2022


Hi Tung,

> We appreciate your comments. We are using the userspace MCTP and will
> consider moving to kernel-space MCTP as suggested.
> Because of our specific requirements, we are looking for a simpler
> way. In our case, we have on-chip sensors and events on both
> sockets, and we must send PLDM commands to poll the data.

Yes, that all sounds fine.

> If we use two interfaces to communicate with the host, I think it
> would be complex when sending to multiple sockets.

[We're at risk of overloading the term "socket" here, as it also refers
to the kernel interface to the MCTP stack - the sockets API. So I'll use
the word "CPU" instead, referring to the physical device that is the
MCTP/PLDM endpoint.]

If you're using the kernel stack, there's no real additional complexity
with the two-interface model; you would just configure each interface,
and set up routes to each CPU EID. This is a once-off configuration at
BMC boot time. If you're using dynamic addressing, mctpd takes care of
that for you.

The PLDM application only needs to know the EIDs of the CPUs - the
kernel decides which interface to transmit each packet over, based on
the packet's destination EID.
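
For illustration, here's a minimal sketch of what that looks like from
the application side, assuming the kernel AF_MCTP support (v5.15+) and
that routes to the CPUs are already configured (statically at boot, or
by mctpd). The EIDs (20 and 21), the request bytes and the helper name
are made-up examples, not anything your platform prescribes:

  #include <stdint.h>
  #include <stdio.h>
  #include <sys/socket.h>
  #include <linux/mctp.h>

  /* older libcs may not define AF_MCTP yet */
  #ifndef AF_MCTP
  #define AF_MCTP 45
  #endif

  /* send one PLDM request to the given endpoint; the kernel picks
   * the outgoing interface from its route table, based only on the
   * destination EID */
  static ssize_t send_pldm_request(int sd, mctp_eid_t eid,
                                   const void *req, size_t len)
  {
      struct sockaddr_mctp addr = {
          .smctp_family = AF_MCTP,
          .smctp_network = MCTP_NET_ANY,
          .smctp_addr.s_addr = eid,
          .smctp_type = 1,             /* MCTP message type: PLDM */
          .smctp_tag = MCTP_TAG_OWNER, /* kernel allocates our tag */
      };

      return sendto(sd, req, len, 0,
                    (struct sockaddr *)&addr, sizeof(addr));
  }

  int main(void)
  {
      /* example PLDM request: GetPLDMTypes (type 0, command 0x04) */
      uint8_t req[] = { 0x80, 0x00, 0x04 };
      int sd = socket(AF_MCTP, SOCK_DGRAM, 0);

      if (sd < 0) {
          perror("socket(AF_MCTP)");
          return 1;
      }

      /* one socket, two CPUs: no interface details needed here */
      send_pldm_request(sd, 20, req, sizeof(req));
      send_pldm_request(sd, 21, req, sizeof(req));

      return 0;
  }

Note that neither interface appears anywhere in the application code;
swapping the one-interface design for the two-interface design needs no
application changes at all.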

> The things that should be considered:
> + If a socket has a problem during runtime, is the MCTP/PLDM process
> still OK?

The MCTP stack on the BMC will be fine. I assume the BMC PLDM
application will time out any pending requests, and should handle that
gracefully too.
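
For what it's worth, a sketch of one way the application might
implement that timeout, using poll() on the MCTP socket; the 100ms
budget and the helper name here are arbitrary choices of mine:

  #include <poll.h>
  #include <sys/socket.h>
  #include <sys/types.h>

  /* hypothetical helper: wait up to timeout_ms for a response on an
   * AF_MCTP socket. Returns the response length, 0 on timeout, or -1
   * on error. A timeout is for the caller to handle gracefully
   * (retry, or mark the endpoint degraded) - no reboot required. */
  static ssize_t recv_with_timeout(int sd, void *buf, size_t len,
                                   int timeout_ms)
  {
      struct pollfd pfd = { .fd = sd, .events = POLLIN };
      int rc;

      rc = poll(&pfd, 1, timeout_ms);
      if (rc < 0)
          return -1;  /* poll() itself failed */
      if (rc == 0)
          return 0;   /* no response within timeout_ms */

      return recv(sd, buf, len, 0);
  }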

> + If one or more sockets have a problem, can we reboot the whole
> system to recover?

You could, but that's pretty heavy-handed. There should be no need to
reboot the BMC at all. And for the CPU's MCTP implementation, I assume
there's a way to perform recovery there, rather than requiring a host
reboot.

The two-interface architecture does give you more fault tolerance there;
if one CPU's MCTP stack is not reachable, it doesn't prevent
communication with the other.

> When using one interface, I think:
> + From the host side, socket 0 (the master) should manage the other
> sockets (perhaps not via SMBus, but via some faster inter-socket
> communication). Of course, more work would need to be implemented in
> the firmware, as you have pointed out.
> + The BMC just recovers the system (via reboot) when there is a
> socket 0 issue, otherwise it does properly

Not sure what you mean by "it does properly" there - but I think avoiding
host reboots would definitely be a good thing. Also, if the fault on
CPU0 isn't recoverable, you won't be able to perform any communication
with CPU1.

Regards,


Jeremy

