[PATCH RFC 00/15] powerpc/eeh: Synchronize access to struct eeh_pe

Wed Oct 2 16:02:38 AEST 2019

Hi Everyone,

It's currently possible to cause a kernel crash by being (un)lucky while adding
or removing PCI devices during EEH recovery.  The cause of the problem is that
there are trees of objects (struct eeh_pe, implemented with pointers and kernel
linked lists) that are manipulated concurrently by the PCI core and the EEH
recovery system, without synchronization.  Unsurprisingly this allows the
kernel to hit list poison values or free'd memory, and crash.

Here's a first draft of a way to address this, and I would welcome feedback at
this stage: Does it make sense? Is it simple enough? Is it maintainable? Are
there other options I should explore?  etc...

The design:

We have to consider that:
- eeh_dev_check_failure() can be called in any context, so the only methods
  that are safe to use there are a spinlock or atomics.
- Some recovery operations are blocking (and can take a long time), so they
  can't be protected by a spinlock. These operations are performed while
  traversing lists of PEs.

The simplest solution might initially seem to be to hold the pci_rescan_remove
lock during EEH recovery, but it seems to have some significant problems:
- It's a mutex, and can't be used in eeh_dev_check_failure().
- If there is a crash during recovery, either in the core kernel or a driver's
  call-back function, the lock would be held forever and that would block a lot
  of the PCI core.
- The EEH code is quite recursive and it's very difficult to work out where to
  take the lock without causing deadlocks. If there are bugs, they would likely
  be deadlocks.

So instead, this proposal uses a spinlock (eeh_pe_lock) to protect access to
the list-membership fields of eeh_pe in combination with reference counting for
blocking operations.  Because the spinlock is only protecting the list fields
(and no state), the lock only needs to be used over small sections that can be
inspected easily.

For blocking operations I'm using the standard kernel refcounting. It only
involves one extra field on each PE but it does require tweaks to maintain the
refcount in most functions that use PEs and sometimes that can be tricky.

The way the refcount is used is fairly standard: it starts at 1 and is
incremented for each pointer pointing to the PE. The initial value of 1
represents the 'existance' of the PE and it's membership in the 'tree' under
it's PHB, regardless of where in that tree it is, or if it has children. A PE
is removed from those lists before it's inital reference is dropped.

The interaction of the spinlock and refcounting provides this property: If the
spinlock is held and a ref is held on a PE then it is safe to start from that
PE and traverse anywhere along the list fields or parent pointers to other PEs
without taking references to them.  (To keep a reference to a PE for use
outside of the spinlock, a ref must be taken before the spinlock is released.)

That property can be used to perform blocking operations while traversing the
lists by 'sliding' the reference along, only holding the spinlock while moving
to the next iteration.  It seems to work well but is not perfect: when the lock
is released, the current PE may be removed and there isn't any simple, safe way
to continue the traversal. The situation can be detected, so there won't be a
crash, but traversal can be left incomplete. Perhaps this could be fixed by
using a temporary list of struct eeh_pe that is collected while holding the
spinlock (I've experimented with this, and it works but seems a bit
complicated), or perhaps we can just work around it where it matters.  It
hasn't happened at all during my testing. I'll look at it if this proposal goes
ahead.

For clarity, I have also kept the scope very tightly on the lists of struct
eeh_pe and the parent pointer. This patchset does not protect the list of
eeh_dev in each eeh_pe or the pdev member of struct eeh_dev, both of which are
also affected by races. I've left a few 'this is unsafe' comments in the code
in those areas.  I'll look more at it if this proposal goes ahead, I don't
think that they will be difficult compared to eeh_pe.

Lastly, I've included an "orphan tracking system" that I used during
development to verify that reference couting was acting as expected, but I have
no idea if it should be included in the final version.  It keeps track of PEs
that have been removed from the PHB tree, but not yet freed and makes that list
available in debugfs.  Any PEs that remain orphans for very long are going to
be the result of bugs.  It's extra risk because it itself could contain bugs,
but it could also be useful during debugging.

Cheers,
Sam.

Sam Bobroff (15):
  powerpc/eeh: Introduce refcounting for struct eeh_pe
  powerpc/eeh: Rename eeh_pe_get() to eeh_pe_find()
  powerpc/eeh: Track orphaned struct eeh_pe
  powerpc/eeh: Sync eeh_pe_next(), eeh_pe_find() and early-out
    traversals
  powerpc/eeh: Sync eeh_pe_get_parent()
  powerpc/eeh: Sync eeh_phb_pe_get()
  powerpc/eeh: Sync eeh_add_to_parent_pe() and eeh_rmv_from_parent_pe()
  powerpc/eeh: Sync eeh_handle_normal_event()
  powerpw/eeh: Sync eeh_handle_special_event(), pnv_eeh_get_pe(),
    pnv_eeh_next_error()
  powerpc/eeh: Sync eeh_phb_check_failure()
  powerpc/eeh: Sync eeh_dev_check_failure()
  powerpc/eeh: Sync eeh_pe_get_state()
  powerpc/eeh: Sync pnv_eeh_ei_write()
  powerpc/eeh: Sync eeh_force_recover_write()
  powerpc/eeh: Sync pcibios_set_pcie_reset_state()

 arch/powerpc/include/asm/eeh.h               |  20 +-
 arch/powerpc/kernel/eeh.c                    |  65 +++++-
 arch/powerpc/kernel/eeh_driver.c             |  23 +-
 arch/powerpc/kernel/eeh_pe.c                 | 230 ++++++++++++++++---
 arch/powerpc/platforms/powernv/eeh-powernv.c |  43 +++-
 5 files changed, 329 insertions(+), 52 deletions(-)

-- 
2.22.0.216.g00a2a96fc9