<div dir="ltr">Hi Oskar,<div><br></div><div>> C) ipmiStorageWriteFruData clears the cache immediately after WriteFru was called. Maybe we should keep that cache around until the timeout expires?</div><div><br></div><div><div>It seems like this is an issue of multiple clients taking to ipmid. In the middle of writing... There is another client that is reading or something</div><div>else and will reset the fruCache as you mentioned. In that case, I think it may be best to refactor it out to use another getFru method... maybe like getFruToWrite</div><div><br></div><div>Maybe just something like this</div><div><br></div><div>```</div><div>std::vector<uint8_t> getFruToWrite(...){</div><div>  if (writeTimer->isRunning()) {</div><div>    return fruCacheForWrite;</div><div>  }</div><div>  auto [_, fruCacheForWrite] = getFru(...);</div><div>  return fruCacheForWrite;</div><div>}</div><div>```</div><div><br></div><div>Also need to change `writeFruCache` and such to use `fruCacheForWrite` and such.</div></div><div><br></div><div>Please let me know if that works for you. Feel free to reach out internally for faster feedback.</div><div><br></div><div>Willy Tu</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, May 2, 2024 at 11:32 AM Oskar Senft <<a href="mailto:osk@google.com">osk@google.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi everyone<br>

<br>

tl;dr - Attempting "ipmitool fru write" with an input file that<br>

contains additional bytes beyond the actual FRU data does not actually<br>

update the FRU on OpenBMC at head w/ entity-manager.<br>

<br>

Details:<br>

<br>

I ran into an issue with updating the "baseboard" FRU (0), which is<br>

stored as /etc/fru/baseboard.fru.bin. However, I don't think this is<br>

specific to FRU 0 and could apply to other updateable FRUs in the same<br>

way.<br>

<br>

<br>

1. Start off with a "minimal" /etc/fru/baseboard.fru.bin which just<br>

contains chassis type (that's required for entity-manager's fru-device<br>

to recognize the file.<br>

<br>

root@akita:~# ls -al /etc/fru/baseboard.fru.bin<br>

-rw-r--r--    1 root     root           512 Sep 21 05:21<br>

/etc/fru/baseboard.fru.bin<br>

<br>

root@akita:~# hexdump -C /etc/fru/baseboard.fru.bin<br>

00000000  01 00 01 00 00 00 00 fe  01 01 17 c0 c0 c1 00 a6  |................|<br>

00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|<br>

*<br>

00000200<br>

<br>

root@akita:~# ipmitool fru print 0<br>

 Chassis Type          : Rack Mount Chassis<br>

 Chassis Area Checksum : OK<br>

<br>

<br>

2. Prepare a FRU file with additional data, e.g. with frugy:<br>

<br>

(frugy) osk@osk:~$ cat demo.yml<br>

ChassisInfo:<br>

  type: 23<br>

  part_number: '4711'<br>

  serial_number: '12345'<br>

<br>

(frugy) osk@osk:~$ frugy demo.yml -o demo.bin<br>

<br>

(frugy) osk@osk:~$ ls -al demo.bin<br>

-rw-r----- 1 osk primarygroup 24 May  2 13:51 demo.bin<br>

<br>

(frugy) osk@osk:~$ hexdump -C demo.bin<br>

00000000  01 00 01 00 00 00 00 fe  01 02 17 c4 34 37 31 31  |............4711|<br>

00000010  c5 31 32 33 34 35 c1 d0                           |.12345..|<br>

00000018<br>

<br>

Note that frugy generates a minimal binary by default. However, other<br>

processes may not and instead produce a fixed size binary blob. This<br>

happens, e.g. when performing "ipmitool fru read" which returns the<br>

whole contents of the underlying storage. A subsequent "ipmitool fru<br>

write" with that blob will fail as explained here.<br>

<br>

To simulate this here, increase the file to 256 bytes:<br>

<br>

(frugy) osk@osk:~$ cp demo.bin demo-256.bin<br>

(frugy) osk@osk:~$ dd if=/dev/zero bs=1 seek=256 count=0 of=demo-256.bin<br>

0+0 records in<br>

0+0 records out<br>

0 bytes copied, 5.1138e-05 s, 0.0 kB/s<br>

<br>

(frugy) osk@osk:~$ ls -al demo-256.bin<br>

-rw-r----- 1 osk primarygroup 256 May  2 13:57 demo-256.bin<br>

<br>

(frugy) osk@osk:~$ hexdump -C demo-256.bin<br>

00000000  01 00 01 00 00 00 00 fe  01 02 17 c4 34 37 31 31  |............4711|<br>

00000010  c5 31 32 33 34 35 c1 d0  00 00 00 00 00 00 00 00  |.12345..........|<br>

00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|<br>

*<br>

00000100<br>

<br>

<br>

3. Write the newly generated FRU:<br>

<br>

root@akita:~# ipmitool fru write 0 demo-256.bin<br>

Fru Size         : 512 bytes<br>

Size to Write    : 256 bytes<br>

<br>

Wait ~10 seconds for the fru-device process to reload the contents.<br>

<br>

<br>

4. Re-read the FRU contents:<br>

<br>

root@akita:~# ipmitool fru print 0<br>

 Chassis Type          : Rack Mount Chassis<br>

 Chassis Area Checksum : OK<br>

<br>

<br>

5. For comparison, write only the minimal FRU:<br>

<br>

root@akita:~# ipmitool fru write 0 demo.bin<br>

Fru Size         : 512 bytes<br>

Size to Write    : 24 bytes<br>

<br>

Wait ~10 seconds.<br>

<br>

root@akita:~# ipmitool fru print 0<br>

 Chassis Type          : Rack Mount Chassis<br>

 Chassis Part Number   : 4711<br>

 Chassis Serial        : 12345<br>

 Chassis Area Checksum : OK<br>

<br>

<br>

I dug into this and found that this is caused by an interaction<br>

between <a href="https://github.com/openbmc/phosphor-host-ipmid/blob/master/dbus-sdr/storagecommands.cpp" rel="noreferrer" target="_blank">https://github.com/openbmc/phosphor-host-ipmid/blob/master/dbus-sdr/storagecommands.cpp</a><br>

and <a href="https://github.com/openbmc/entity-manager/blob/master/src/fru_device.cpp" rel="noreferrer" target="_blank">https://github.com/openbmc/entity-manager/blob/master/src/fru_device.cpp</a>.<br>

<br>

Here's what happens:<br>

- The "ipmitool fru write" request is handled by storagecommands.cpp<br>

ipmiStorageWriteFruData.<br>

<br>

- ipmiStorageWriteFruData receives the whole FRU blob from IPMI across<br>

several calls. On the initial call, it requests the current FRU blob<br>

via D-bus from fru-device and caches it in the process. It then<br>

overwrites sections of that cache with the received data from IPMI.<br>

<br>

- ipmiStorageWriteFruData uses information from the FRU header to<br>

check whether it received all the bytes that make up the new FRU. Note<br>

that this could be fewer bytes than the size of the cached blob. Once<br>

it receives all the bytes for the new FRU, it calls<br>

/xyz/openbmc_project/FruDevice WriteFru via D-Bus with the full FRU<br>

blob (i.e. the full cache with modifications on top). After that the<br>

cache is cleared.<br>

<br>

- The D-Bus call to WriteFru is handled by fru_device.cpp writeFRU. It<br>

first performs a sanity check and then writes the blob to the<br>

underlying device (or the /etc/fru/baseboard.fru.bin for FRU 0). It<br>

subsequently calls rescanBusses() which actually executes after about<br>

1 second.<br>

<br>

- Meanwhile, "ipmitool fru write" continues to happily send additional<br>

bytes to ipmiStorageWriteFruData (as in step #3 above). Since its<br>

cache has just been cleared, it retrieves the current FRU from<br>

fru-device again. However, since that has not yet completed<br>

"rescanBusses", it actually received the original FRU again, not the<br>

modified version. The above cycle repeats, but this time the original<br>

FRU with additional modifications from the additional bytes.<br>

<br>

In the best case (if the new FRU data is larger than the original FRU<br>

data), the result is that the FRU did not get updated at all, since we<br>

effectively just wrote back the original FRU. However, if the new FRU<br>

data is smaller than the original FRU data, this may result in<br>

corrupted FRU data to be persisted.<br>

<br>

<br>

I was trying to figure out how to best fix that, but couldn't come up<br>

with a design that would still work. Some ideas:<br>

<br>

A)  I think what we're missing is feedback from fru-device to ipmid<br>

that the FRU write operation has actually completed and the FRU data<br>

was re-read. The ipmid should probably not accept any additional write<br>

requests until the previous write request has completed and the new<br>

FRU data is available.<br>

<br>

B) If ipmiStorageWriteFruData cannot detect the expected size of the<br>

FRU data, it relies on a timeout to eventually write the blob if no<br>

more data was received from IPMI. I wonder if this mechanism is "too<br>

clever" and we should simply always just wait for the timeout?<br>

<br>

C) ipmiStorageWriteFruData clears the cache immediately after WriteFru<br>

was called. Maybe we should keep that cache around until the timeout<br>

expires?<br>

<br>

Thoughts?<br>

<br>

Thanks<br>

Oskar.<br>

</blockquote></div>