[Skiboot] skiboot v6.1 tagged

Stewart Smith stewart at linux.ibm.com
Wed Jul 11 18:28:43 AEST 2018


skiboot v6.1 was released on Wednesday July 11th 2018. It is the first
release of skiboot 6.1, which is the new stable release of skiboot
following the 6.0 release, first released May 11th 2018.

Skiboot 6.1 is the basis for op-build v2.1 and contains all bug fixes
as of skiboot-6.0.5, and skiboot-5.4.9 (the currently maintained
stable releases). We expect further stable releases in the 6.0.x and
5.4.x series, while we do not expect to do any stable releases of

This final 6.1 release follows a single release candidate release, as
this cycle we have been rather quiet, with mainly cleanup and bug fix
patches going in.

For details on how the skiboot stable releases work, see Skiboot
stable tree rules and releases.

Over skiboot-6.0, we have the following changes:

General changes and bug fixes

Since skiboot-6.1-rc1:

* slw: Fix trivial typo in debug message

* vpd: Add vendor property to processor node

  Processor FRU vpd doesn’t contain vendor detail. We have to parse
  module VPD to get vendor detail.

* vpd: Sanitize VPD data

  On OpenPower systems, the VPD keyword size tells us the maximum size
  of the data. But the trailing end is filled with spaces (0x20)
  instead of NULL. Also, the spec doesn’t stop users from having a
  space (0x20) within the actual data.

  This patch discards trailing spaces before populating device tree.
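
  The trimming step can be sketched as follows. This is a minimal
  Python illustration, not skiboot's C code: the function name
  sanitize_vpd_keyword and the sample value are hypothetical. Only
  trailing 0x20 bytes are dropped, so spaces inside the value survive.

```python
def sanitize_vpd_keyword(raw: bytes) -> bytes:
    """Drop trailing space (0x20) padding from a VPD keyword value.

    Spaces inside the value are kept, since they may be real data.
    """
    return raw.rstrip(b" ")


# Hypothetical 16-byte keyword field, padded with spaces.
padded = b"IBM POWER9      "
print(sanitize_vpd_keyword(padded))
```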

* core: always flush console before stopping

  This catches a few cases (e.g., fast reboot failure messages) that
  don’t always make it to the console before the machine is rebooted.

* core/cpu: parallelise global CPU register setting jobs

  On a 176 thread system, before:

     [  122.319923233,5] OPAL: Switch to big-endian OS
     [  126.317897467,5] OPAL: Switch to little-endian OS

  after:

     [  212.439299889,5] OPAL: Switch to big-endian OS
     [  212.469323643,5] OPAL: Switch to little-endian OS
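
  Subtracting the timestamps within each pair of log lines above
  gives the time spent between the two endian switches: roughly 4
  seconds for the first pair and roughly 30 milliseconds for the
  second. The snippet below is only illustrative arithmetic on the
  quoted values, assuming the bracketed number is a time in seconds.

```python
from decimal import Decimal


def elapsed(start: str, end: str) -> Decimal:
    """Difference between two log timestamps, kept exact with Decimal."""
    return Decimal(end) - Decimal(start)


first = elapsed("122.319923233", "126.317897467")
second = elapsed("212.439299889", "212.469323643")

print(f"first pair:  {first} s")
print(f"second pair: {second} s")
print(f"ratio: about {first / second:.0f}x")
```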

* init, occ: Initialise OCC earlier on BMC systems

  We need to use the OCC to obtain presence data for the SXM2 slots on
  Witherspoon systems. This is needed to determine device type for
  NVLink GPUs and OpenCAPI devices which can be plugged into the same
  slot. Support for this will be implemented in a future patch.

  Currently, OCC initialisation is done just before handing over to
  Linux, which is well after NPU probe. On FSP systems, OCC boot
  starts very late, so we wait until the last possible moment to
  initialise the skiboot side in order to give it the maximum time to
  boot. On BMC systems, OCC boot starts earlier, so there aren’t any
  issues in moving it earlier in the skiboot init sequence.

  When running on a BMC machine, call occ_pstates_init() as early as
  possible in the init sequence. On FSP machines, continue to call it
  from its current location.

Since skiboot-6.0:

* GCC8 build fixes

* Add prepare_hbrt_update to hbrt interfaces

  Add placeholder support for prepare_hbrt_update call into hostboot
  runtime (opal-prd) code.  This interface is only called as part of a
  concurrent code update on a FSP based system.

* cpu: Clear PCR SPR in opal_reinit_cpus()

  Currently if Linux boots with a non-zero PCR, things can go bad:
  some early userspace programs can take illegal instruction
  exceptions. This is being fixed in Linux, but in the meantime, we
  should clean up in skiboot as well.

* pci: Fix PCI_DEVICE_ID()

  The vendor ID is 16 bits not 8. This error leaves the top of the
  vendor ID in the bottom bits of the device ID, which resulted in
  e.g. a failure to run the PCI quirk for the AST VGA device.
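
  A worked example of the effect, written in Python rather than as
  the C macro. The vendor ID occupies the low 16 bits of the combined
  vendor/device word and the device ID the high 16 bits. The
  shift-by-8 form of the old macro is an assumption inferred from the
  description above, and the 0x1a03/0x2000 pair is used only as an
  illustrative vendor/device value.

```python
def pci_vendor_id(vdid: int) -> int:
    """Vendor ID: low 16 bits of the combined word."""
    return vdid & 0xFFFF


def pci_device_id_buggy(vdid: int) -> int:
    """Assumed old behaviour: shifts as if the vendor ID were 8 bits."""
    return (vdid >> 8) & 0xFFFF


def pci_device_id_fixed(vdid: int) -> int:
    """Device ID: high 16 bits of the combined word."""
    return (vdid >> 16) & 0xFFFF


vdid = (0x2000 << 16) | 0x1A03  # illustrative device/vendor pair

print(hex(pci_vendor_id(vdid)))
print(hex(pci_device_id_buggy(vdid)))  # top byte of the vendor ID leaks in
print(hex(pci_device_id_fixed(vdid)))
```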

* Quieten console output on boot

  We print out a whole bunch of things on boot, most of which aren’t
  interesting, so we should *not* print them instead.

  Printing things like what CPUs we found and what PCI devices we
  found *are* useful, so continue to do that. But we don’t need to
  splat out a bunch of things that are always going to be true.

* core/console: fix deadlock when printing with console lock held

  Some debugging options will print while the console lock is held,
  which is why the console lock is taken as a recursive lock. However
  console_write calls __flush_console, which will drop and re-take the
  lock non-recursively in some cases.

  Just set con_need_flush and return from __flush_console if we are
  holding the console lock already.

  This stack usage message (taken with this patch applied) could lead
  to a deadlock without this:

     CPU 0000 lowest stack mark 11768 bytes left pc=300cb808 token=0
     CPU 0000 Backtrace:
     S: 0000000031c03370 R: 00000000300cb808   .list_check_node+0x1c
     S: 0000000031c03410 R: 00000000300cb910   .list_check+0x38
     S: 0000000031c034b0 R: 00000000300190ac   .try_lock_caller+0xb8
     S: 0000000031c03540 R: 00000000300192e0   .lock_caller+0x80
     S: 0000000031c03600 R: 0000000030012c70   .__flush_console+0x134
     S: 0000000031c036d0 R: 00000000300130cc   .console_write+0x68
     S: 0000000031c03780 R: 00000000300347bc   .vprlog+0xc8
     S: 0000000031c03970 R: 0000000030034844   ._prlog+0x50
     S: 0000000031c03a00 R: 00000000300364a4   .log_simple_error+0x74
     S: 0000000031c03b90 R: 000000003004ab48   .occ_pstates_init+0x184
     S: 0000000031c03d50 R: 000000003001480c   .load_and_boot_kernel+0x38c
     S: 0000000031c03e30 R: 000000003001571c   .main_cpu_entry+0x62c
     S: 0000000031c03f00 R: 0000000030002700   boot_entry+0x1c0
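
  The idea behind the fix can be modelled in a few lines. This is a
  toy, single-threaded Python sketch and not the skiboot code: the
  Console class, its fields and the debug_hook parameter are all
  hypothetical, and the recursive lock is reduced to a nesting
  counter. A nested write that finds the lock already held only
  records that a flush is owed; the outermost holder does the flush.

```python
class Console:
    """Toy model of a console guarded by a recursive lock."""

    def __init__(self):
        self.lock_depth = 0      # nesting level of the modelled lock
        self.need_flush = False  # stands in for con_need_flush
        self.pending = []        # written but not yet flushed
        self.output = []         # messages that reached the console

    def _flush(self, lock_already_held):
        if lock_already_held:
            # Dropping and re-taking the lock here is what could
            # deadlock, so just note that a flush is owed.
            self.need_flush = True
            return
        self.output.extend(self.pending)
        self.pending.clear()
        self.need_flush = False

    def write(self, msg, debug_hook=None):
        lock_already_held = self.lock_depth > 0
        self.lock_depth += 1
        try:
            self.pending.append(msg)
            if debug_hook is not None:
                debug_hook(self)  # a debug option printing under the lock
            self._flush(lock_already_held)
        finally:
            self.lock_depth -= 1


con = Console()
con.write("outer message",
          debug_hook=lambda c: c.write("nested debug message"))
print(con.output)
```

  The nested write leaves its message pending, and the outer write,
  which took the lock first, performs the flush for both.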

* opal-prd: Do not error out on first failure for soft/hard offline.

  The memory errors (CEs and UEs) that are detected as part of
  background memory scrubbing are reported by PRD asynchronously to
  opal-prd along with affected memory ranges. hservice_memory_error()
  converts these ranges into page granularity before hooking them up
  to the soft/hard offlining infrastructure.

  But the current implementation of hservice_memory_error() does not
  hook up all the pages to soft/hard offlining if any of the page
  offline actions fails, e.g. hard offline can fail for:

  * Pages that are not part of buddy managed pool.

  * Pages that are reserved by kernel using memblock_reserved()

  * Pages that are in use by kernel.

  But for the pages that are in use by user space application, the
  hard offline marks the page as hwpoison, sends SIGBUS signal to kill
  the affected application as recovery action and returns success.

  Hence, it is possible that some of the pages in that memory range
  are in use by an application or free. By stopping on the first error
  we lose the opportunity to hwpoison the subsequent pages which may
  be free or in use by an application. This patch fixes this issue.
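
  The change in control flow can be sketched as a loop that remembers
  a failure but keeps going. This is a minimal Python illustration
  under stated assumptions: offline_range, the offline_page callback,
  the 4096-byte page size and the -1 error value are hypothetical and
  do not reflect the actual opal-prd code.

```python
PAGE_SIZE = 4096  # assumed page size for the example


def offline_range(start, end, offline_page):
    """Try to offline every page in [start, end).

    Returns 0 if all pages succeeded, otherwise the last error seen.
    """
    rc = 0
    for addr in range(start, end, PAGE_SIZE):
        page_rc = offline_page(addr)
        if page_rc != 0:
            rc = page_rc  # remember the failure, but keep going
    return rc


attempted = []


def fake_offline_page(addr):
    """Pretend the second page is in use by the kernel and fails."""
    attempted.append(addr)
    return -1 if addr == PAGE_SIZE else 0


result = offline_range(0, 4 * PAGE_SIZE, fake_offline_page)
print(result, attempted)
```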

* libflash/blocklevel_write: Fix missing error handling

  Caught by scan-build, we seem to trap the errors in rc, but not take
  any recovery action during blocklevel_write.


* p8-i2c: fix wrong request status when a reset is needed

  If the bus is found in error state when starting a new request, the
  engine is reset and we enter recovery. However, once complete, the
  reset operation shows a status of complete in the status register.
  So any badly-timed call to check_status() will think the current
  top request is complete, even though it hasn’t run yet.

  So don’t update any request status while we are in recovery, as
  nothing useful for the request is supposed to happen in that state.

* p8-i2c: Remove force reset

  Force reset was added as an attempt to work around some issues with
  TPM devices locking up their I2C bus. In that particular case the
  problem was that the device would hold the SCL line down permanently
  due to a device firmware bug. The force reset doesn’t actually do
  anything to alleviate the situation here, it just happens to reset
  the internal master state enough to make the I2C driver appear to
  work until something tries to access the bus again.

  On P9 systems with secure boot enabled there is the added problem of
  the “diagnostic mode” not being supported on I2C masters A, B, C and
  D. Diagnostic mode allows the SCL and SDA lines to be driven
  directly by software. Without this, force reset is impossible to
  implement.

  This patch removes the force reset functionality entirely since:

  1. it doesn’t do what it’s supposed to, and

  2. it’s butt ugly code

  Additionally, turn p8_i2c_reset_engine() into p8_i2c_reset_port().
  There’s no need to reset every port on a master in response to an
  error that occurred on a specific port.

* libstb/i2c-driver: Bump max timeout

  We have observed some TPMs clock stretching the I2C bus for
  significant amounts of time when processing commands. The same TPMs
  also have errata that can result in permanently locking up a bus in
  response to an I2C transaction they don’t understand. Use an
  excessively long timeout to prevent this in the field.

* hdata: Add TPM timeout workaround

  Set the default timeout for any bus containing a TPM to one second.
  This is needed to work around a bug in the firmware of certain TPMs
  that will clock stretch the I2C port for up to a second.
  Additionally, when the TPM is clock stretching it responds to a STOP
  condition on the bus by bricking itself. Clearing this error
  requires a hard power cycle of the system since the TPM is powered
  by standby power.

* p8-i2c: Allow a per-port default timeout

  Add support for setting a default timeout for the I2C port to the
  device-tree. This is consumed by skiboot.

IPMI Watchdog

* ipmi-watchdog: Support handling re-initialization

  Watchdog resets can return an error code from the BMC indicating
  that the BMC watchdog was not initialized. Currently we abort
  skiboot due to a missing error handler. This patch implements
  handling re-initialization for the watchdog, automatically saving
  the last watchdog set values and re-issuing them if needed.
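
  The re-initialization flow can be sketched as below. This is a
  hedged Python model, not the skiboot implementation: FakeBmc,
  Watchdog and their methods are hypothetical names, and 0x80 is used
  as the completion code for a reset attempted on an uninitialized
  watchdog.

```python
WDT_NOT_INITIALIZED = 0x80  # completion code assumed for this sketch


class FakeBmc:
    """Stand-in for a BMC that can forget its watchdog settings."""

    def __init__(self):
        self.configured = None

    def set_wdt(self, action, timeout):
        self.configured = (action, timeout)
        return 0

    def reset_wdt(self):
        return 0 if self.configured else WDT_NOT_INITIALIZED

    def forget(self):
        self.configured = None  # e.g. the BMC restarted


class Watchdog:
    def __init__(self, bmc):
        self.bmc = bmc
        self.last_set = None  # last values passed to set()

    def set(self, action, timeout):
        self.last_set = (action, timeout)
        return self.bmc.set_wdt(action, timeout)

    def reset(self):
        rc = self.bmc.reset_wdt()
        if rc == WDT_NOT_INITIALIZED and self.last_set is not None:
            # Re-issue the saved settings, then retry the reset.
            self.bmc.set_wdt(*self.last_set)
            rc = self.bmc.reset_wdt()
        return rc


bmc = FakeBmc()
wdt = Watchdog(bmc)
wdt.set(action=0x1, timeout=100)
bmc.forget()
print(wdt.reset(), bmc.configured)
```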

* ipmi-watchdog: The stop action should disable reset

  Otherwise it is possible for the reset timer to elapse and trigger
  the watchdog to wake back up. This doesn’t affect the behavior of
  the system since we are providing a NONE action to the BMC. However
  we would like to avoid the action from taking place if possible.

* ipmi-watchdog: Add a flag to determine if we are still ticking

  This makes it easier for future changes to ensure that the watchdog
  stops ticking and doesn’t requeue itself for execution in the
  background. This way it is safe for resets to be performed after the
  ticks are assumed to be stopped and it won’t start the timer again.

* ipmi-watchdog: (prepare for) not disabling at shutdown

  The op-build linux kernel has been configured to support the ipmi
  watchdog. This driver will always handle the watchdog by either
  leaving it enabled if configured, or by disabling it during module
  load if no configuration is provided. This increases the coverage of
  the watchdog during the boot process. The watchdog should no longer
  be disabled at any point during skiboot execution.

  We’re not enabling this by default yet as people can (and do, at
  least in development) mix and match old BOOTKERNEL with new skiboot
  and we don’t want to break that too obviously.

* ipmi-watchdog: Don’t reset the watchdog twice

  There is no clarification for why this change was needed, but
  presumably this is due to a buggy BMC implementation where the
  Watchdog Set command was processed concurrently or after the initial
  Watchdog Reset. This inversion would cause the watchdog to stop
  since the DONT_STOP bit was not set. Since we are now using the
  DONT_STOP bit during initialization, the watchdog should not be
  stopped even if an inversion occurs.

* ipmi-watchdog: Make it possible to set DONT_STOP

  The IPMI standard supports setting a DONT_STOP bit during a
  Watchdog Set operation. Most of the time we don’t want to stop the
  Watchdog when updating the settings so we should be using this bit.
  This patch makes it possible for callers of set_wdt to prevent the
  watchdog from being stopped. This only changes the behavior of the
  watchdog during the initial settings update when initializing
  skiboot. The watchdog is no longer disabled and then immediately
  re-enabled.


  The IPMI specification denotes that action 0x1 is Host Reset and 0x3
  is Host Power Cycle. Use the correct name for Reset in our watchdog

POWER8 platforms

* astbmc: Enable mbox depending on scratch reg

  P8 boxes can opt in for mbox pnor support if they set the scratch
  register bit to indicate it is supported.

Simulator platforms

Since skiboot-6.1-rc1:

* pmem: volatile bindings for the poorly enabled

  PMEM_DISK bindings were added, but they rely on a rather recent mmap
  feature. This patch steals from those bindings to add volatile
  bindings. I’ve used these bindings with PMEM_VOLATILE to launch an
  instance with the publicly available systemsim-p9. The bindings are
  volatile and one should not expect any data to be saved/retrieved.

Since skiboot-6.0:

* plat/qemu: add PNOR support

  To access the PNOR, OPAL/skiboot drives the BMC SPI controller using
  the iLPC2AHB device of the BMC SuperIO controller and accesses the
  flash contents using the LPC FW address space on which the PNOR is

  The QEMU PowerNV machine now integrates such models (SuperIO
  controller, iLPC2AHB device) and also a pseudo Aspeed SoC AHB memory
  space populated with the SPI controller registers (same model as for
  ARM). The AHB window giving access to the contents of the BMC SPI
  controller flash modules is mapped on the LPC FW address space.

  The change should be compatible with machines without PNOR support.

* external/mambo: Add support for readline if it exists

  Add support for tclreadline package if it is present. This patch
  loads the package and uses it when the simulation stops for any
  reason.

FSP based platforms

* Disable fast reboot on FSP IPL side change

  If FSP changes next IPL side, then disable fast reboot.

  sample output:

     [  620.196442259,5] FSP: Got sysparam update, param ID 0xf0000007
     [  620.196444501,5] CUPD: FW IPL side changed. Disable fast reboot
     [  620.196445389,5] CUPD: Next IPL side : perm

* fsp/console: Always establish OPAL console API backend

  Currently we only call set_opal_console() to establish the backend
  used by the OPAL console API if we find at least one FSP serial port
  in HDAT.

  On systems where there is none (IPMI only), we fail to set it,
  causing the console code to try to use the dummy console causing an
  assertion failure during boot due to clashing on the device-tree
  node names.

  So always set it if an FSP is present

AST BMC based platforms

* AMI BMC: use 0x3a as OEM command

  The 0x3a OEM command is for IBM commands, while 0x32 was for AMI
  ones. Sometime in the P8 timeframe, AMI BMCs were changed to listen
  for our commands on either 0x32 or 0x3a. Since 0x3a is the direction
  forward, we’ll use that, as P9 machines with AMI BMCs probably also
  want these to work, and let’s not bet that 0x32 will continue to be

* astbmc: Set romulus BMC type to OpenBMC

* platform/astbmc: Do not delete compatible property

  From P9 onwards, OPAL builds the device tree for BMC based systems
  using HDAT. We populate the bmc/compatible node with the BMC
  version. Hence do not delete this property.


* external/xscom-utils: Add python library for xscom access

  Patch adds a simple python library module for xscom access. It
  directly manipulates the ‘/access’ file for scom read and write from
  the debugfs ‘scom’ directory.

  Example on how to generate a getscom using this module:

     from adu_scoms import *
     getscom = GetSCom()

  Sample output for above getscom.py:

     # ./getscom.py -l
     Chip ID  | Rev   | Chip type
     00000008 | DD2.0 | P9 (Nimbus) processor
     00000000 | DD2.0 | P9 (Nimbus) processor

* ffspart: Don’t require user to create blank partitions manually

  Add ‘--allow-empty’ which allows the filename for a given partition
  to be blank. If set, ffspart will set that part of the PNOR file
  ‘blank’ and set ECC bits if required. Without this option behaviour
  is unchanged and ffspart will return an error if it can not find the
  partition file.

* pflash: Use correct prefix when installing

  pflash uses lowercase prefix when running make install in its
  directory, but uppercase PREFIX when running it in shared. Use
  lowercase everywhere.

  With this the OpenBMC bitbake recipe can drop an out of tree patch
  it’s been carrying for years.


Since skiboot-6.1-rc1:

* occ: sensors: Fix the size of the phandle array ‘sensors’ in DT

  Fixes: 99505c03f493 (present in v5.10-rc4)

* phb4: Delay training till after PERST is deasserted

  This helps some cards train on the second PERST (i.e. fast-reboot).
  It is not clear why, but it helps, so YOLO!

Since skiboot-6.0:

* occ-sensor: Avoid using uninitialised struct cpu_thread

  When adding the sensors in occ_sensors_init, if the type is not
  OCC_SENSOR_LOC_CORE, then the loop to find ‘c’ will not be executed.
  Then c->pir is used for both of the add_sensor_node calls below.

  This provides a default value of 0 instead.

* NX: Add NX coprocessor init opal call

  The read offset (4:11) in the Receive FIFO control register is
  incremented by the FIFO size whenever a CRB is read by NX. But the
  index in RxFIFO has to match the corresponding entry in the FIFO
  maintained by VAS in the kernel. The VAS entry is reset to 0 when
  opening the receive window during driver initialization. So when
  NX842 is reloaded or on kexec boot, there is a possibility of a
  mismatch between the RxFIFO control register
  and VAS entries in kernel. It could cause CRB failure / timeout from

  This patch adds nx_coproc_init opal call for kernel to initialize
  readOffset (4:11) and Queued (15:23) in RxFIFO control register.

* SLW: Remove stop1_lite and stop2_lite

  stop1_lite has been removed since it adds no additional benefit over
  stop0_lite. stop2_lite has been removed since currently it adds
  minimal benefit over stop2. However, the benefit is eclipsed by the
  time required to ungate the clocks.

  Moreover, Lite states don’t give up the SMT resources and can
  potentially have a performance impact on sibling threads.

  Since current OSs (Linux) aren’t smart enough to make good decisions
  with these stop states, we’re (temporarily) removing them from what
  we expose to the OS, the idea being to bring them back in a new DT
  representation so that only an OS that knows what to do will do
  things with them.

* cpu: Use STOP1 on POWER9 for idle/sleep inside OPAL

  The current code requests STOP3, which means it gets STOP2 in

  STOP2 has proven to occasionally be unreliable depending on FW
  version and chip revision. It also requires a functional CME, so
  instead, let’s use STOP1. The difference is rather minimal for
  something that is only used for a few seconds during boot.

NPU2 (NVLink2 and OpenCAPI)

Since skiboot-6.1-rc1:

* capi: Select the correct IODA table entry for the mbt cache.

  With the current code, the capi mmio window is not correctly
  configured in the IODA table entry. The first entry (generally the
  non-prefetchable BAR) is overwritten. This patch sets the capi
  window bar at the right place.

* npu2/hw-procedures: Fence bricks via NTL instead of MISC

  There are a couple of places we can set/unset fence for a brick:


  2. NTL register for the brick: NPU2_NTL_MISC_CFG1(ndev)

  Recent testing of ATS in combination with GPU reset has exposed a
  side effect of using (1); if fence is set for all six bricks, it
  triggers a sticky nmmu latch which prevents the NPU from getting ATR
  responses. This manifests as a hang in the tests.

  We have npu2_dev_fence_brick() which uses (1), and only two calls to
  it. Replace the call which sets fence with a write to (2). Remove
  the corresponding unset call entirely. It’s unneeded because the
  procedures already do a progression from full fence to half to idle
  using (2).

* phb4/capp: Calculate STQ/DMA read engines based on link-width for

  Presently in CAPI mode the number of STQ/DMA-read engines allocated
  on PEC2 for CAPP is fixed to 6 and 0-30 respectively irrespective of
  the PCI link width. These values are only suitable for x8 cards and
  quickly run out if an x16 card is plugged into a PEC2 attached slot.
  This usually manifests as CAPP reporting TLBI timeout due to these
  messages getting stalled due to insufficient STQs.

  To fix this we update enable_capi_mode() to check if PEC2 chiplet is
  in x16 mode and if yes then we allocate 4/0-47 STQ/DMA-read engines
  for the CAPP traffic.

  Fixes: 37ea3cfdc852 (present in v5.7-rc1)

* npu2: Use same compatible string for NVLink and OpenCAPI link
  nodes in device tree

  Currently, we distinguish between NPU links for NVLink devices and
  OpenCAPI devices through the use of two different compatible strings
  - ibm,npu-link and ibm,npu-link-opencapi.

  As we move towards supporting configurations with both NVLink and
  OpenCAPI devices behind a single NPU, we need to detect the device
  type as part of presence detection, which can’t happen until well
  after the point where the HDAT or platform code has created the NPU
  device tree nodes. Changing a node’s compatible string after it’s
  been created is a bit ugly, so instead we should move the device
  type to a new property which we can add to the node later on.

  Get rid of the ibm,npu-link-opencapi compatible string, add a new
  ibm,npu-link-type property, and a helper function to check the link
  type. Add an “unknown” device type in preparation for later patches
  to detect device type dynamically.

  These device tree bindings are entirely internal to skiboot and are
  not consumed directly by Linux, so this shouldn’t break anything
  (other than internal BML lab environments).

* occ: Add support for GPU presence detection

  On the Witherspoon platform, we need to distinguish between NVLink
  GPUs and OpenCAPI accelerators. In order to do this, we first need
  to find out whether the SXM2 socket is populated.

  On Witherspoon, the SXM2 socket’s presence detection pin is only
  visible via I2C from the APSS, and thus can only be exposed to the
  host via the OCC. The OCC, per OCC Firmware Interface Specification
  for POWER9 version 0.22, now exposes this to skiboot through a field
  in the dynamic data shared memory.

  Add the necessary dynamic data changes required to read the version
  and GPU presence fields. Add a function, occ_get_gpu_presence(),
  that can be used to check GPU presence.

  If the OCC isn’t reporting presence (old OCC firmware, or some other
  reason), we default to assuming there is a device present and wait
  for link training to fail.

  This will be used in later patches to fix up the NPU2 probe path for
  OpenCAPI support on Witherspoon.

* hw/npu2, core/hmi: Use NPU instead of NPU2 as log message prefix

  The NPU2{DBG,INF,ERR} macros use “NPU%d” as a prefix to identify
  messages relating to a particular NPU.

  It’s slightly confusing to have per-NPU messages prefixed with
  “NPU0” or “NPU1” and NPU-generic messages prefixed with “NPU2”. On
  some future system we could potentially have a NPU #2 in which case
  it’d be really confusing.

  Use NPU rather than NPU2 for NPU-generic log messages. There’s no
  risk of confusion with the original npu.c code since that’s only for

Since skiboot-6.0:

* npu2: Reset NVLinks on hot reset

  This effectively fences GPU RAM on GPU reset so the host system does
  not have to crash every time we stop a KVM guest with a GPU passed

* npu2-opencapi: reduce number of retries to train the link

  We’ve been reliably training the opencapi link on the first attempt
  for quite a while. Furthermore, if it doesn’t train on the first
  attempt, retries haven’t been that useful. So let’s reduce the
  number of attempts we do to train the link.

  2 retries = 3 attempts to train.

  Each (failed) training sequence costs about 3 seconds.
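
  Put together, the worst case for a link that never trains is 3
  attempts at about 3 seconds each, so roughly 9 seconds. A small
  Python sketch of the retry loop (train_link and try_train are
  hypothetical names, and the 3 second figure is the approximate cost
  quoted above):

```python
TRAINING_RETRIES = 2            # 2 retries = 3 attempts
SECONDS_PER_FAILED_ATTEMPT = 3  # approximate


def train_link(try_train, retries=TRAINING_RETRIES):
    """Call try_train() until it succeeds or the attempts run out.

    Returns (trained, attempts_made).
    """
    attempts = 0
    for _ in range(retries + 1):
        attempts += 1
        if try_train():
            return True, attempts
    return False, attempts


trained, attempts = train_link(lambda: False)  # link never trains
print(trained, attempts, attempts * SECONDS_PER_FAILED_ATTEMPT)
```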

* opal/hmi: Display correct chip id while printing NPU FIRs.

  HMIs for NPU xstops are broadcast to all chips. All cores on all
  the chips receive the HMI. The HMI handler correctly identifies and
  extracts the NPU FIR details from the affected chip, but while
  printing FIR data it prints the chip id and location code details of
  this_cpu()->chip_id, which may not be correct. This patch fixes
  this issue.

* npu2-opencapi: Fix link state to report link down

  The PHB callback ‘get_link_state’ is always reporting the link
  width, irrespective of the link status and even when the link is
  down. It is causing too much work (and failures) when the PHB is
  probed during pci init. The fix is to look at the link status first
  and report the link as down when appropriate.
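
  The ordering of the checks can be illustrated as follows. This is a
  minimal Python sketch, not the PHB callback itself: the LINK_DOWN
  value and the function signature are assumptions made for the
  example.

```python
LINK_DOWN = 0  # hypothetical encoding for "link down"


def get_link_state(link_is_up: bool, link_width: int) -> int:
    """Report the link width only when the link is actually up."""
    if not link_is_up:
        return LINK_DOWN
    return link_width


print(get_link_state(False, 8))
print(get_link_state(True, 8))
```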

* npu2-opencapi: Cleanup traces printed during link training

  Now that links may train in parallel, traces shown during training
  can be all mixed up. So add a prefix to all the traces to clearly
  identify the chip and link the trace refers to:

     OCAPI[<chip id>:<link id>]: this is a very useful message

  The lower-level hardware procedures (npu2-hw-procedures.c) also
  print traces which would need work. But that code is being reworked
  to be better integrated with opencapi and nvidia, so leave it alone
  for now.
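
  A small Python sketch of building such a prefix, for illustration
  only. The helper names are hypothetical, and printing the chip and
  link ids in decimal is an assumption; the release note does not
  state the number format.

```python
def ocapi_prefix(chip_id: int, link_id: int) -> str:
    """Build the per-link trace prefix in the format shown above."""
    return f"OCAPI[{chip_id}:{link_id}]: "


def ocapi_log(chip_id: int, link_id: int, msg: str) -> str:
    """Return a trace line tagged with its chip and link."""
    return ocapi_prefix(chip_id, link_id) + msg


print(ocapi_log(0, 2, "link trained"))
print(ocapi_log(8, 3, "link trained"))
```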

* npu2-opencapi: Train links on fundamental reset

  Reorder our link training steps so that they are executed on
  fundamental reset instead of during the initial setup. Skiboot
  always calls a fundamental reset on all the PHBs during pci init.

  It is done through a state machine, similarly to what is done for
  ‘real’ PHBs.

  This is the first step for a longer term goal to be able to trigger
  an adapter reset from linux. We’ll need the reset callbacks of the
  PHB to be defined. We have to handle the various delays differently,
  since a linux thread shouldn’t stay stuck waiting in opal for too

* npu2-opencapi: Rework adapter reset

  Rework the code to reset the opencapi adapter a bit:

  * make clearer which i2c pin is resetting which device

  * break the reset operation into smaller chunks. This is really to
    prepare for a future patch.

  No functional changes.

* npu2-opencapi: Use presence detection

  Presence detection is not part of the opencapi specification. So
  each platform may choose to implement it the way it wants.

  All current platforms implement it through an i2c device where we
  can query a pin to know if a device is connected or not. ZZ and
  Zaius have a similar design and even use the same i2c information
  and pin numbers. However, presence detection on older ZZ planar
  (older than v4) doesn’t work, so we don’t activate it for now, until
  our lab systems are upgraded and it’s better tested.

  Presence detection on witherspoon is still being worked on. It’s
  shaping up to be quite different, so we may have to revisit the
  topic in a later patch.

Testing and CI

Since skiboot-6.1-rc1:

* test/qemu: start building qemu again, and use our built qemu for

  We need to use QEMU_BIN rather than QEMU as the makefiles define
  QEMU already.

* opal-ci: qemu: Use the powernv-3.0 branch

  This is based off the current development version of Qemu, and
  importantly it contains the patch that allows skiboot and Linux to
  clear the PCR that we require to boot.

Stewart Smith
OPAL Architect, IBM.
