[Skiboot] [PATCH 06/60] xive: Document exploitation mode

Thu Dec 22 14:16:14 AEDT 2016

(Pretty much work in progress)

Signed-off-by: Benjamin Herrenschmidt <benh at kernel.crashing.org>
---
 doc/xive.txt | 608 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 608 insertions(+)
 create mode 100644 doc/xive.txt

diff --git a/doc/xive.txt b/doc/xive.txt
new file mode 100644
index 0000000..c38dce0
--- /dev/null
+++ b/doc/xive.txt
@@ -0,0 +1,608 @@
+P9 XIVE Exploitation
+====================
+
+
+I - Device-tree updates
+-----------------------
+
+ 1) The existing OPAL "/interrupt-controller@0" node remains
+
+    This node represents both the emulated XICS source controller and
+    an abstraction of the virtualization engine. This reflects the
+    fact that the OPAL set_xive/get_xive functions are still supported,
+    though they don't provide access to the full functionality.
+
+    It is still the parent of all interrupts in the device-tree.
+
+    New or modified properties:
+
+    - "compatible" : This is extended with a new value "ibm,opal-xive-vc"
+
+  2) The new /interrupt-controller@<addr> node
+
+     This node represents both the emulated XICS presentation controller
+     and the new XIVE presentation layer.
+
+     Unlike the traditional XICS, there is only one such node for the whole
+     system.
+
+     New or modified properties:
+
+     - "compatible" : This contains at least the following strings:
+       - "ibm,opal-intc" : This represents the emulated XICS presentation
+         facility and might be the only string present if the version of
+         OPAL doesn't support XIVE exploitation.
+       - "ibm,opal-xive-pe" : This represents the XIVE presentation
+         engine.
+
+     - "ibm,xive-eq-sizes" : One cell per size supported, contains log2
+       of size, in ascending order.
+
+     - "ibm,xive-#priorities" : One cell, the number of supported priorities
+       (the priorities will be 0...n)
+
+     - "ibm,xive-provision-page-size" : Page size (in bytes) of the pages to
+       pass to OPAL for provisioning internal structures
+       (see opal_xive_donate_page). If this is absent, OPAL will never require
+       additional provisioning. The page must be naturally aligned.
+
+     - "ibm,xive-provision-chips" : The list of chip IDs for which provisioning
+       is required. Typically, if a VP allocation returns OPAL_XIVE_PROVISIONING,
+       opal_xive_donate_page() will need to be called to donate a page to
+       *each* of these chips before trying again.
+
+     - "reg" property contains the addresses & sizes for the register
+       ranges corresponding respectively to the 4 rings:
+          - Ultravisor level
+          - Hypervisor level
+          - Guest OS level
+          - User level
+       For any of these, a size of 0 means this level is not supported.
+
+  3) Interrupt descriptors
+
+     The interrupt descriptors (aka "interrupts" properties and parts
+     of "interrupt-map" properties) remain 2 cells. The first cell is
+     a global interrupt number which represents a unique interrupt
+     source in the system and is an abstraction provided by OPAL.
+
+     The default configuration for all sources in the IVT/EAS is to
+     issue that number (it's internally a combination of the source
+     chip and per-chip interrupt number but the details of that
+     combination are not exposed and subject to change).
+
+     The second cell remains as usual "0" for an edge interrupt and
+     "1" for a level interrupt.
+
+  4) IPIs
+
+     Each "cpu" node now contains an "interrupts" property which has
+     one entry (2 cells per entry) for each thread on that core
+     containing the interrupt number for the IPI targeted at that
+     thread.
+
+  5) Interrupt targets
+
+     Targeting of interrupts uses processor targets and priority
+     numbers. The processor target encoding depends on which API is
+     used:
+
+      - The legacy opal_set/get_xive() APIs only support the old
+      "mangled" (ie. shifted by 2) HW processor numbers.
+
+      - The new opal_xive_set/get_irq_config API (and other
+      exploitation mode APIs) use a "token" VP number which is
+      described in II-2. Unmodified HW processor numbers are valid
+      VP numbers for those APIs.
+
+II - General operations
+-----------------------
+
+Most configuration operations are abstracted via OPAL calls; there is
+no direct access or exposure of such things as real HW interrupt or VP
+numbers.
+
+OPAL sets up all the physical interrupts and assigns them numbers. It
+also allocates enough virtual interrupts to provide an IPI per physical
+thread in the system.
+
+All interrupts are pre-configured masked and must be set to an explicit
+target before first use. The default interrupt number is programmed
+in the EAS and will remain unchanged if the targeting/unmasking is
+done using the legacy set_xive() interface.
+
+An interrupt "target" is a combination of a target processor number
+and a priority.
+
+Processor numbers are in a single domain that represents both the
+physical processors and any virtual processor or group allocated
+using the interfaces defined in this specification. These numbers
+are an OPAL maintained abstraction and are only partially related
+to the real VP numbers:
+
+In order to maintain the grouping ability, when VPs are allocated
+in blocks of naturally aligned powers of 2, the underlying HW
+numbers will respect this alignment.
+
+Note: The block group mode extension makes the numbering scheme
+a bit trickier than simple powers of two, however; see below.
+
+  1) Interrupt numbering and allocation
+
+     As specified in the device-tree definition, interrupt numbers
+     are abstracted by OPAL to be a 30-bit number. All HW interrupts
+     are "allocated" and configured at boot time along with enough
+     IPIs for all processor threads.
+
+     Additionally, in order to be compatible with the XICS emulation,
+     all interrupt numbers present in the device-tree (ie all physical
+     sources or pre-allocated IPIs) will fit within a 24-bit number
+     space.
+
+     Interrupt sources that are only usable in exploitation mode, such
+     as escalation interrupts, can have numbers covering the full 30-bit
+     range. The same is true of interrupts allocated dynamically.
+
+     The hypervisor can allocate additional blocks of interrupts,
+     in which case OPAL will return the resulting abstracted global
+     numbers. They will have to be individually configured to map
+     to a given number at the target and be routed to a given target
+     and priority using opal_xive_set_irq_config(). This call is
+     semantically equivalent to the old opal_set_xive() which is
+     still supported with the addition that opal_xive_set_irq_config()
+     can also specify the logical interrupt number.
+
+  2) VP numbering and allocation
+
+     A VP number is a 64-bit number. The internal make-up of that number
+     is opaque to the OS. However, it is a discrete integer: when a chunk
+     of VPs is allocated, the "base" number representing that chunk is
+     naturally aligned to a power of two, so the OS can do basic
+     arithmetic to get to all the VPs in the range.
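
The "basic arithmetic" on VP numbers described above can be sketched as follows. This is only an illustration, assuming the block order is the log2 of the number of VPs in the chunk; the helper names vp_block_size(), vp_base_is_aligned() and vp_at() are hypothetical and not part of the OPAL API.

```c
#include <assert.h>
#include <stdint.h>

/* Number of VPs in a block of the given order (2^order). */
uint64_t vp_block_size(uint32_t alloc_order)
{
    assert(alloc_order < 64);
    return UINT64_C(1) << alloc_order;
}

/* True if "base" is naturally aligned for a block of that order. */
int vp_base_is_aligned(uint64_t base, uint32_t alloc_order)
{
    return (base & (vp_block_size(alloc_order) - 1)) == 0;
}

/* VP number of the index-th VP of the block (hypothetical helper). */
uint64_t vp_at(uint64_t base, uint32_t alloc_order, uint64_t index)
{
    assert(vp_base_is_aligned(base, alloc_order));
    assert(index < vp_block_size(alloc_order));
    /* Because base is aligned, OR and addition give the same result. */
    return base | index;
}
```

For instance, with a base of 0x40 and an order of 3, vp_at(0x40, 3, 5) gives 0x45. The numbers themselves stay opaque; only the offset within the block is meaningful to the OS.
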
+
+     Groups, when supported, will also be numbers in that space.
+
+     The physical processors numbering uses the same number space.
+
+     The underlying HW VP numbering is hidden from the OS. The APIs
+     use the system processor numbers as presented in the
+     "ibm,ppc-interrupt-server#s" property, which corresponds to the PIR
+     register content, to represent physical processors within the same
+     number space as dynamically allocated VPs.
+
+     Note about block group mode:
+
+     The block group mode shall as much as possible be handled
+     transparently by OPAL.
+
+     For example, on a 2-chip machine, a request to allocate
+     2^n VPs might result in an allocation of 2^(n-1) VPs per
+     chip allocated across 2 chips. The resulting VP numbers
+     will encode the order of the allocation allowing OPAL to
+     reconstitute which bits are the block ID bits and which bits
+     are the index bits in a way transparent to the OS. The overall
+     range of numbers passed to Linux will still be contiguous.
+
+     That implies however a limitation: We can only allocate within
+     power-of-two number of blocks. Thus the VP allocator will limit
+     itself to the largest power of two that can fit in the number
+     of available chips in the machine: A machine with 3 good chips
+     will only be able to allocate VPs from 2 of them.
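
The power-of-two limitation above amounts to rounding the number of good chips down to a power of two. A minimal sketch, assuming nothing about how OPAL actually implements its allocator; usable_chip_count() is a hypothetical name used only for illustration.

```c
#include <stdint.h>

/*
 * Largest power of two that is <= good_chips, i.e. the number of chips
 * a block-group allocation could spread across. Returns 0 for 0 chips.
 */
uint32_t usable_chip_count(uint32_t good_chips)
{
    uint32_t n = 1;

    if (good_chips == 0)
        return 0;
    /* Double n while the doubled value still fits in good_chips. */
    while (n <= good_chips / 2)
        n *= 2;
    return n;
}
```

With 3 good chips this yields 2, matching the example in the text; with 4 it yields 4.
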
+
+  3) Group numbering and allocation
+
+     The group numbers are in the *same* number space as the VP
+     numbers. OPAL will internally use some bits of the VP number
+     to encode the group geometry.
+
+     [TBD] OPAL may or may not allocate a default group of all physical
+     processors, per-chip groups or per-core groups. This will be
+     represented in the device-tree somehow...
+
+     [TBD] OPAL will provide interfaces for allocating groups
+
+
+  Note about P/Q bit operation on sources:
+  ----------------------------------------
+
+  opal_xive_get_irq_info() returns a certain number of flags
+  which define the type of operation supported. The following
+  rules apply based on what those flags say:
+
+        - The Q bit isn't functional on an LSI interrupt. There is no
+          guarantee that the special combination "01" will work for an
+          LSI (and in fact it will not work on the PHB LSIs). However
+          just setting P to 1 is sufficient to mask an LSI (just don't
+          EOI it while masked).
+
+        - The recommended setting for an interrupt that is
+          temporarily masked by a driver is "10". This means a new
+          occurrence while masked will be recorded and a "StoreEOI"
+          will replay it appropriately.
+
+
+III - Event queues
+------------------
+
+Each virtual processor or group has a certain number of event queues
+associated with it. Each corresponds to a given priority. The number
+of supported priorities is provided in the device-tree
+("ibm,xive-#priorities" property of the xive node).
+
+By default, OPAL populates at least one queue for every physical thread
+in the system. The number of queues and the size used is implementation
+specific. If the OS wants to re-use these to save memory, it can query
+the VP configuration.
+
+The opal_xive_get_queue_info() and opal_xive_set_queue_info() calls can be
+used to query and update a queue configuration (ie, to obtain the current
+page and size for the queue itself, but also to collect some configuration
+flags for that queue such as whether it coalesces notifications etc...)
+and to obtain the MMIO address of the queue EOI page (in the case where
+coalescing is enabled).
+
+IV - OPAL APIs
+--------------
+
+ WARNING: *All* the calls listed below may return OPAL_BUSY unless
+          explicitly documented not to. In that case, the call
+          should be performed again. The OS is allowed to insert a
+          delay though no minimum nor maximum delay is specified.
+          This will typically happen when performing cache update
+          operations in the XIVE, if they result in a collision.
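
The retry rule above can be sketched as a simple loop. This is an illustration only: fake_opal_xive_call() is a hypothetical stand-in for a real OPAL call, and the numeric value of OPAL_BUSY is assumed here rather than taken from this document.

```c
#include <stdint.h>

#define OPAL_SUCCESS 0
#define OPAL_BUSY    (-2)   /* assumed value, check the OPAL API header */

/* Hypothetical stand-in: reports "busy" for the first busy_left calls. */
int busy_left = 2;

int64_t fake_opal_xive_call(void)
{
    if (busy_left > 0) {
        busy_left--;
        return OPAL_BUSY;
    }
    return OPAL_SUCCESS;
}

/* Repeat a call for as long as it returns OPAL_BUSY. */
int64_t call_until_not_busy(int64_t (*call)(void), unsigned *attempts)
{
    int64_t rc;
    unsigned n = 0;

    do {
        rc = call();
        n++;
        /* An OS may insert a delay here; none is required. */
    } while (rc == OPAL_BUSY);

    if (attempts)
        *attempts = n;
    return rc;
}
```

A real caller would wrap the specific OPAL entry point with its own arguments; the function-pointer form is used here only to keep the sketch self-contained.
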
+
+ WARNING: Calls that are expected to be called at runtime
+          simultaneously without conflicts such as getting/setting
+          IRQ info or queue info are fine to do so concurrently.
+
+          However, there is no internal locking to prevent races
+          between things such as freeing a VP block and getting/setting
+          queue infos on that block.
+
+          These aren't fully specified (yet) but common sense shall
+          apply.
+
+ int64_t opal_xive_reset(uint64_t version)
+
+    The OS should call this once when starting up to re-initialize the
+    XIVE hardware and the OPAL XIVE related state back to all defaults.
+
+    It can call it a second time before handing over to another kernel
+    (ie. kexec) to re-enable XICS emulation.
+
+    The "version" argument should be set to 1 to enable the XIVE
+    exploitation mode APIs or 0 to switch back to the default XICS
+    emulation mode.
+
+    Future versions of OPAL might allow higher versions than 1 to
+    represent newer versions of this API. OPAL will return an error
+    if it doesn't recognize the requested version.
+
+    Any page of memory that the OS has "donated" to OPAL, either backing
+    store for EQDs or VPDs or actual queue buffers will be removed from
+    the various HW maps and can be re-used by the OS or freed after this
+    call regardless of the version information. The HW will be reset to
+    a (mostly) clean state.
+
+    It is the responsibility of the caller to ensure that no other
+    XIVE or XICS emulation call happens simultaneously to this. This
+    basically should happen on an otherwise quiescent system. In the
+    case of kexec, it is recommended that the CPPR of all processors be
+    lowered first.
+
+    Note: This call always executes fully synchronously, never returns
+    OPAL_BUSY and will work regardless of whether VPs and EQs are left
+    enabled or disabled. It *will* spend a significant amount of time
+    inside OPAL and as such is not suitable to be performed during normal
+    runtime.
+
+ int64_t opal_xive_get_irq_info(uint32_t girq,
+                                uint64_t *out_flags,
+                                uint64_t *out_eoi_page,
+                                uint64_t *out_trig_page,
+                                uint32_t *out_esb_shift,
+                                uint32_t *out_src_chip);
+
+    Returns info about an interrupt source. This call never returns
+    OPAL_BUSY.
+
+    * out_flags returns a set of flags. The following flags
+      are defined in the API (some bits are reserved, so any bit
+      not defined here should be ignored):
+
+     - OPAL_XIVE_IRQ_TRIGGER_PAGE
+
+       Indicates that the trigger page is a separate page. If that
+       bit is clear, there is either no trigger page or the trigger
+       can be done in the same page as the EOI, see below.
+
+     - OPAL_XIVE_IRQ_STORE_EOI
+
+       Indicates that the interrupt supports the "Store EOI" option,
+       ie a store to the EOI page will move Q into P and retrigger
+       if the resulting P bit is 1. If this flag is 0, then a store
+       to the EOI page will do a trigger if OPAL_XIVE_IRQ_TRIGGER_PAGE
+       is also 0.
+
+     - OPAL_XIVE_IRQ_LSI
+
+       Indicates that the source is a level sensitive source and thus
+       doesn't have a functional Q bit. The Q bit may or may not be
+       implemented in HW but SW shouldn't rely on it doing anything.
+
+     - OPAL_XIVE_IRQ_SHIFT_BUG
+
+       Indicates that the source has a HW bug that shifts the bits
+       of the "offset" inside the EOI page left by 4 bits. So when
+       this is set, use 0xc000, 0xd000... instead of 0xc00, 0xd00...
+       as offsets in the EOI page.
+
+     - OPAL_XIVE_IRQ_MASK_VIA_FW
+
+       Indicates that a FW call is needed (either opal_set_xive()
+       or opal_xive_set_irq_config()) to successfully mask and unmask
+       the interrupt. The operations via the ESB page aren't fully
+       functional.
+
+     - OPAL_XIVE_IRQ_EOI_VIA_FW
+
+       Indicates that a FW call to opal_xive_eoi() is needed to
+       successfully EOI the interrupt. The operation via the ESB page
+       isn't fully functional.
+
+    * out_eoi_page and out_trig_page outputs will be set to the
+      EOI page physical address (always) and the trigger page address
+      (if it exists).
+      The trigger page may exist even if OPAL_XIVE_IRQ_TRIGGER_PAGE
+      is not set. In that case out_trig_page is equal to out_eoi_page.
+
+    * out_esb_shift contains the size (as an order, ie 2^n) of the
+      EOI and trigger pages. Current supported values are 12 (4k)
+      and 16 (64k). Those cannot be configured by the OS and are set
+      by firmware but can be different for different interrupt sources.
+
+    * out_src_chip will be set to the chip ID of the HW entity this
+      interrupt is sourced from. It's meant to be informative only
+      and thus isn't guaranteed to be 100% accurate. The idea is for
+      the OS to use that to pick up a default target processor on
+      the same chip.
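
How the outputs of opal_xive_get_irq_info() combine into an ESB access address can be sketched as below. This is a hedged illustration: the bit value given to OPAL_XIVE_IRQ_SHIFT_BUG is an assumption (the text above does not define it), and the helper names esb_offset(), esb_page_size() and esb_address() are hypothetical.

```c
#include <stdint.h>

/* Assumed bit value, for illustration only. */
#define OPAL_XIVE_IRQ_SHIFT_BUG 0x8u

/* Apply the "shift bug" workaround: offsets move left by 4 bits. */
uint64_t esb_offset(uint64_t flags, uint64_t offset)
{
    return (flags & OPAL_XIVE_IRQ_SHIFT_BUG) ? offset << 4 : offset;
}

/* Size in bytes of the EOI/trigger pages from out_esb_shift (2^n). */
uint64_t esb_page_size(uint32_t esb_shift)
{
    return UINT64_C(1) << esb_shift;
}

/* Address to access for a given offset within the EOI page. */
uint64_t esb_address(uint64_t eoi_page, uint64_t flags, uint64_t offset)
{
    return eoi_page + esb_offset(flags, offset);
}
```

With the flag clear, offset 0xc00 is used as is; with it set, the same logical offset becomes 0xc000, as described for OPAL_XIVE_IRQ_SHIFT_BUG above.
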
+
+ int64_t opal_xive_eoi(uint32_t girq);
+
+    Performs an EOI on the interrupt. This should only be called if
+    OPAL_XIVE_IRQ_EOI_VIA_FW is set as otherwise direct ESB access
+    is preferred.
+
+    Note: This is the *same* opal_xive_eoi() call used by OPAL XICS
+    emulation. However the XIRR parameter is re-purposed as "GIRQ".
+
+    The call will perform the appropriate function depending on
+    whether OPAL is in XICS emulation mode or native XIVE exploitation
+    mode.
+
+ int64_t opal_xive_get_irq_config(uint32_t girq, uint64_t *out_vp,
+                                  uint8_t *out_prio, uint32_t *out_lirq);
+
+    Returns the current configuration of an interrupt source. This is
+    the equivalent of opal_get_xive() with the addition of the logical
+    interrupt number (the number that will be presented in the queue).
+
+    * girq: The interrupt number to get the configuration of as
+      provided by the device-tree.
+
+    * out_vp: Will contain the target virtual processor where the
+      interrupt is currently routed to. This can return 0xffffffff
+      if the interrupt isn't routed to a valid virtual processor.
+
+    * out_prio: Will contain the priority of the interrupt or 0xff
+      if masked
+
+    * out_lirq: Will contain the logical interrupt assigned to the
+      interrupt. By default this will be the same as girq.
+
+ int64_t opal_xive_set_irq_config(uint32_t girq, uint64_t vp, uint8_t prio,
+                                  uint32_t lirq);
+
+    This allows configuration and routing of a hardware interrupt. This is
+    equivalent to opal_set_xive() with the addition of the ability to
+    configure the logical IRQ number (the number that will be presented
+    in the target queue).
+
+    * girq: The interrupt number to configure, as provided by the
+      device-tree.
+
+    * vp: The target virtual processor. The target VP/Prio combination
+      must already exist, be enabled and populated (ie, a queue page must
+      be provisioned for that queue).
+
+    * prio: The priority of the interrupt.
+
+    * lirq: The logical interrupt number assigned to that interrupt
+
+    Note about masking:
+    -------------------
+
+    If the prio is set to 0xff, this call will cause the interrupt to be
+    masked.
+
+    Note: This function might clobber the source P/Q bits. An interrupt
+    masked this way will be in a state where the events will be lost
+    while masked and not replayed while unmasked. Unmasking *will* clear
+    the state of the source P/Q bits unconditionally.
+
+    It is recommended for an OS exploiting the XIVE directly to not use
+    this function for temporary driver-initiated masking of interrupts
+    but to directly mask using the P/Q bits of the source instead.
+
+    Masking using this function is intended for the case where the OS has
+    no handler registered for a given interrupt anymore or when registering
+    a new handler for an interrupt that had none. In these cases, losing
+    interrupts happening while no handler was attached is considered fine
+    and the source comes up in a "clean state" when used for the first time.
+
+ int64_t opal_xive_get_queue_info(uint64_t vp, uint32_t prio,
+                                  uint64_t *out_qpage,
+                                  uint64_t *out_qsize,
+                                  uint64_t *out_qeoi_page,
+                                  uint32_t *out_escalate_irq,
+                                  uint64_t *out_qflags);
+
+    This returns information about a given interrupt queue associated
+    with a virtual processor and a priority.
+
+    * out_qpage: will contain the physical address of the page where the
+      interrupt events will be posted.
+
+    * out_qsize: will contain the log2 of the size of the queue buffer
+      or 0 if the queue hasn't been populated. Example: 12 for a 4k page.
+
+    * out_qeoi_page: will contain the physical address of the MMIO page
+      used to perform EOIs for the queue notifications.
+
+    * out_escalate_irq: will contain a girq number for the escalation
+      interrupt associated with that queue.
+
+      WARNING: The "escalate_irq" is a special interrupt number, depending
+      on the implementation it may or may not correspond to a normal XIVE
+      source.  Masking of escalation IRQs is only supported using the PQ bits,
+      passing a priority of 0xff to opal_set_xive() or
+      opal_xive_set_irq_config() will in effect only affect the PQ bits.
+      Being MSIs though, they do support the special "01" combination for
+      'interrupt off'.
+
+    * out_qflags: will contain flags defined as follows:
+
+      - OPAL_XIVE_EQ_ENABLED
+
+        This must be set for the queue to be enabled and thus a valid
+        target for interrupts. Newly allocated queues are disabled by
+        default and must be disabled again before being freed (allocating
+        and freeing of queues currently only happens along with their
+        owner VP).
+
+        NOTE: A newly enabled queue will have the generation set to 1
+        and the queue pointer to 0. If the OS wants to "reset" a queue
+        generation and pointer, it thus must disable and re-enable
+        the queue.
+
+      - OPAL_XIVE_EQ_ALWAYS_NOTIFY
+
+        When this is set, the HW will always notify the VP on any new
+        entry in the queue, thus the queue's own P/Q bits won't be relevant
+        and using the EOI page will be unnecessary.
+
+      - OPAL_XIVE_EQ_ESCALATE
+
+        When this is set, the EQ will escalate to the escalation interrupt
+        when failing to notify.
+
+ int64_t opal_xive_set_queue_info(uint64_t vp, uint32_t prio,
+                                  uint64_t qpage,
+                                  uint64_t qsize,
+                                  uint64_t qflags);
+
+    This allows the OS to configure the queue page for a given processor
+    and priority and adjust the behaviour of the queue via flags.
+
+    * qpage: physical address of the page where the interrupt events will
+      be posted. This has to be naturally aligned.
+
+    * qsize: log2 of the size of the above page. A 0 here will disable
+      the queue.
+
+    * qflags: Flags (see definitions in opal_xive_get_queue_info)
+
+    NOTE: Should this have the side effect of resetting the toggle/generation ?
+
+    NOTE: This must be called at least once on a queue with the flag
+          OPAL_XIVE_EQ_ENABLED in order to enable it after it has been
+          allocated (along with its owner VP).
+
+ int64_t opal_xive_donate_page(uint32_t chip_id, uint64_t addr);
+
+    This call is used to donate pages to OPAL for use by VP/EQ provisioning.
+
+    The pages must be of the size specified by the "ibm,xive-provision-page-size"
+    property and naturally aligned.
+
+    All donated pages are forgotten by OPAL (and thus returned to the OS)
+    on any call to opal_xive_reset().
+
+    The chip_id should be the chip on which the pages were allocated or -1
+    if unspecified. Ideally, when a VP allocation request fails with the
+    OPAL_XIVE_PROVISIONING error, the OS should allocate one such page
+    for each chip in the system and hand it to OPAL before trying again.
+
+    Note: It is possible that the provisioning ends up requiring more than
+    one page per chip. OPAL will keep returning the above error until enough
+    pages have been provided.
+
+ int64_t opal_xive_alloc_vp_block(uint32_t alloc_order);
+
+    This call is used to allocate a block of VPs. It will return a number
+    representing the base of the block which will be aligned on the alloc
+    order, allowing the OS to do basic arithmetic to index VPs in the block.
+
+    The VPs will have queue structures reserved (but not initialized nor
+    provisioned) for all the priorities defined in the "ibm,xive-#priorities"
+    property.
+
+    This call might return OPAL_XIVE_PROVISIONING. In this case, the OS
+    must allocate pages and provision OPAL using opal_xive_donate_page(),
+    see the documentation for opal_xive_donate_page() for details.
+
+    The resulting VPs must be individually enabled with opal_xive_set_vp_info
+    below with the OPAL_XIVE_VP_ENABLED flag set before use.
+
+    For all priorities, the corresponding queues must also be individually
+    provisioned and enabled with opal_xive_set_queue_info.
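
The provisioning handshake between opal_xive_alloc_vp_block() and opal_xive_donate_page() can be sketched as a retry loop. Everything prefixed stub_ is a hypothetical stand-in for the real OPAL call, the chip list stands in for "ibm,xive-provision-chips", alloc_provision_page() is a made-up page allocator, and the numeric error values are assumptions rather than values given in this document.

```c
#include <stddef.h>
#include <stdint.h>

#define OPAL_SUCCESS            0
#define OPAL_PARAMETER          (-1)    /* assumed value */
#define OPAL_XIVE_PROVISIONING  (-31)   /* assumed value */

#define NUM_CHIPS 2
/* Stand-in for the "ibm,xive-provision-chips" property. */
static const uint32_t provision_chips[NUM_CHIPS] = { 0, 1 };

/* Stub state: pages each chip still wants before allocation succeeds. */
int pages_needed[NUM_CHIPS] = { 1, 2 };
unsigned pages_donated;

int64_t stub_donate_page(uint32_t chip_id, uint64_t addr)
{
    (void)addr;
    if (chip_id >= NUM_CHIPS)
        return OPAL_PARAMETER;
    if (pages_needed[chip_id] > 0)
        pages_needed[chip_id]--;
    pages_donated++;
    return OPAL_SUCCESS;
}

int64_t stub_alloc_vp_block(uint32_t alloc_order)
{
    (void)alloc_order;
    for (size_t i = 0; i < NUM_CHIPS; i++)
        if (pages_needed[i] > 0)
            return OPAL_XIVE_PROVISIONING;
    return INT64_C(1) << 20;    /* made-up, naturally aligned base */
}

/* Hypothetical allocator handing out aligned 64k page addresses. */
uint64_t alloc_provision_page(void)
{
    static uint64_t next = UINT64_C(0x10000000);
    uint64_t page = next;

    next += UINT64_C(0x10000);
    return page;
}

/* Allocate a VP block, donating one page per chip on each failure. */
int64_t alloc_vp_block_provisioned(uint32_t alloc_order, unsigned max_rounds)
{
    int64_t rc = stub_alloc_vp_block(alloc_order);

    while (rc == OPAL_XIVE_PROVISIONING && max_rounds-- > 0) {
        for (size_t i = 0; i < NUM_CHIPS; i++) {
            int64_t drc = stub_donate_page(provision_chips[i],
                                           alloc_provision_page());
            if (drc != OPAL_SUCCESS)
                return drc;
        }
        rc = stub_alloc_vp_block(alloc_order);
    }
    return rc;
}
```

The loop bound is a design choice of this sketch, so that a caller cannot spin forever if provisioning never completes; the text itself only says OPAL keeps returning the error until enough pages have been provided.
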
+
+ int64_t opal_xive_free_vp_block(uint64_t vp);
+
+    This call is used to free a block of VPs. It must be called with the same
+    *base* number as was returned by opal_xive_alloc_vp_block() (any index into the
+    block will result in an OPAL_PARAMETER error).
+
+    All the VPs must previously have been disabled with opal_xive_set_vp_info
+    below, with the OPAL_XIVE_VP_ENABLED flag cleared.
+
+    All the queues must also have been disabled.
+
+    Failure to do any of the above will result in an OPAL_XIVE_FREE_ACTIVE error.
+
+ int64_t opal_xive_get_vp_info(uint64_t vp,
+                               uint64_t *flags,
+                               uint64_t *cam_value,
+                               uint64_t *report_cl_pair);
+
+    This call returns information about an allocated VP:
+
+    * flags  :
+
+      - OPAL_XIVE_VP_ENABLED
+
+      This must be set for the VP to be usable and cleared before freeing it
+
+    * cam_value : This is the value to program into the thread management
+      area to dispatch that VP (ie, an encoding of the block + index).
+
+    * report_cl_pair:  This is the real address of the reporting cache line
+      pair for that VP (defaults to 0)
+
+ int64_t opal_xive_set_vp_info(uint64_t vp,
+                               uint64_t flags,
+                               uint64_t report_cl_pair);
+
+
+ int64_t opal_xive_allocate_irq(uint32_t chip_id);
+
+    This call allocates a software IRQ on a given chip. It returns the
+    interrupt number or an error.
+
+
+ int64_t opal_xive_free_irq(uint32_t girq);
+
+    This call frees a software IRQ that was allocated by
+    opal_xive_allocate_irq. Passing any other interrupt number
+    will result in an OPAL_PARAMETER error.
+
-- 
2.9.3