[Skiboot] [PATCH 04/32] xive: Document exploitation mode
benh at kernel.crashing.org
Tue Nov 22 13:13:06 AEDT 2016
(Pretty much work in progress)
Signed-off-by: Benjamin Herrenschmidt <benh at kernel.crashing.org>
doc/xive.txt | 580 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 580 insertions(+)
create mode 100644 doc/xive.txt
diff --git a/doc/xive.txt b/doc/xive.txt
new file mode 100644
@@ -0,0 +1,580 @@
+P9 XIVE Exploitation
+I - Device-tree updates
+ 1) The existing OPAL "/interrupt-controller@0" node remains
+ This node represents both the emulated XICS source controller and
+ an abstraction of the virtualization engine. This represents the
+ fact that OPAL set_xive/get_xive functions are still supported
+ though they don't provide access to the full functionality.
+ It is still the parent of all interrupts in the device-tree.
+ New or modified properties:
+ - "compatible" : This is extended with a new value "ibm,opal-xive-vc"
+ 2) The new /interrupt-controller@<addr> node
+ This node represents both the emulated XICS presentation controller
+ and the new XIVE presentation layer.
+ Unlike the traditional XICS, there is only one such node for the whole
+ system.
+ New or modified properties:
+ - "compatible" : This contains at least the following strings:
+ - "ibm,opal-intc" : This represents the emulated XICS presentation
+ facility and might be the only property present if the version of
+ OPAL doesn't support XIVE exploitation.
+ - "ibm,opal-xive-pe" : This represents the XIVE presentation
+ - "ibm,xive-eq-sizes" : One cell per size supported, contains log2
+ of size, in ascending order.
+ - "ibm,xive-#priorities" : One cell, the number of supported priorities
+ (the priorities will be 0...n)
+ - "ibm,xive-provision-page-size" : Page size (in bytes) of the pages to
+ pass to OPAL for provisioning internal structures
+ (see opal_xive_donate_page). If this is absent, OPAL will never require
+ additional provisioning. The page must be naturally aligned.
+ - "ibm,xive-provision-chips" : The list of chip IDs for which provisioning
+ is required. Typically, if a VP allocation returns OPAL_XIVE_PROVISIONING,
+ opal_xive_donate_page() will need to be called to donate a page to
+ *each* of these chips before trying again.
+ - "reg" property contains the addresses & sizes for the register
+ ranges corresponding respectively to the 4 rings:
+ - Ultravisor level
+ - Hypervisor level
+ - Guest OS level
+ - User level
+ For any of these, a size of 0 means this level is not supported.
+ 3) Interrupt descriptors
+ The interrupt descriptors (aka "interrupts" properties and parts
+ of "interrupt-map" properties) remain 2 cells. The first cell is
+ a global interrupt number which represents a unique interrupt
+ source in the system and is an abstraction provided by OPAL.
+ The default configuration for all sources in the IVT/EAS is to
+ issue that number (it's internally a combination of the source
+ chip and per-chip interrupt number but the details of that
+ combination are not exposed and subject to change).
+ The second cell remains as usual "0" for an edge interrupt and
+ "1" for a level interrupt.
+ 4) IPIs
+ Each "cpu" node now contains an "interrupts" property which has
+ one entry (2 cells per entry) for each thread on that core
+ containing the interrupt number for the IPI targeted at that
+ thread.
+ 5) Interrupt targets
+ Targeting of interrupts uses processor targets and priority
+ numbers. The processor target encoding depends on which API is
+ used:
+ - The legacy opal_set/get_xive() APIs only support the old
+ "mangled" (ie. shifted by 2) HW processor numbers.
+ - The new opal_xive_set/get_irq_config API (and other
+ exploitation mode APIs) use a "token" VP number which is
+ described in II-2. Unmodified HW processor numbers are valid
+ VP numbers for those APIs.
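The two target encodings above can be sketched as follows. This is a minimal illustration, not code from skiboot: the helper names are hypothetical, and the direction of the "shifted by 2" mangling (a left shift of the HW processor number) is an assumption based on how the legacy XICS server numbers are usually formed.

```c
#include <stdint.h>

/* Legacy opal_set_xive()/opal_get_xive(): "mangled" server number.
 * Assumption: mangling is a left shift by 2 of the HW processor number. */
static uint32_t xics_mangle_server(uint32_t hw_cpu)
{
    return hw_cpu << 2;
}

/* Inverse of the above, for values read back via opal_get_xive(). */
static uint32_t xics_unmangle_server(uint32_t server)
{
    return server >> 2;
}

/* Exploitation mode APIs: an unmodified HW processor number is
 * already a valid VP number, so this is only a widening conversion. */
static uint64_t xive_vp_from_hw_cpu(uint32_t hw_cpu)
{
    return (uint64_t)hw_cpu;
}
```

Keeping the two conversions in separate helpers makes it harder to pass a mangled number to an exploitation mode call by accident.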
+II - General operations
+Most configuration operations are abstracted via OPAL calls; there is
+no direct access or exposure of such things as real HW interrupt or VP
+numbers.
+OPAL sets up all the physical interrupts and assigns them numbers. It
+also allocates enough virtual interrupts to provide an IPI per physical
+thread in the system.
+All interrupts are pre-configured masked and must be set to an explicit
+target before first use. The default interrupt number is programmed
+in the EAS and will remain unchanged if the targeting/unmasking is
+done using the legacy set_xive() interface.
+An interrupt "target" is a combination of a target processor number
+and a priority.
+Processor numbers are in a single domain that represents both the
+physical processors and any virtual processor or group allocated
+using the interfaces defined in this specification. These numbers
+are an OPAL maintained abstraction and are only partially related
+to the real VP numbers:
+In order to maintain the grouping ability, when VPs are allocated
+in blocks of naturally aligned powers of 2, the underlying HW
+numbers will respect this alignment.
+Note: The block group mode extension makes the numbering scheme
+a bit more tricky than simple powers of two however, see below.
+ 1) Interrupt numbering and allocation
+ As specified in the device-tree definition, interrupt numbers
+ are abstracted by OPAL to be a 30-bit number. All HW interrupts
+ are "allocated" and configured at boot time along with enough
+ IPIs for all processor threads.
+ Additionally, in order to be compatible with the XICS emulation,
+ all interrupt numbers present in the device-tree (ie all physical
+ sources or pre-allocated IPIs) will fit within a 24-bit number.
+ Interrupt sources that are only usable in exploitation mode, such
+ as escalation interrupts, can have numbers covering the full 30-bit
+ range. The same is true of interrupts allocated dynamically.
+ The hypervisor can allocate additional blocks of interrupts,
+ in which case OPAL will return the resulting abstracted global
+ numbers. They will have to be individually configured to map
+ to a given number at the target and be routed to a given target
+ and priority using opal_xive_set_irq_config(). This call is
+ semantically equivalent to the old opal_set_xive() which is
+ still supported with the addition that opal_xive_set_irq_config()
+ can also specify the logical interrupt number.
+ 2) VP numbering and allocation
+ A VP number is a 64-bit number. The internal make-up of that number
+ is opaque to the OS. However, it is a discrete integer that will
+ be a naturally aligned power of two when allocating a chunk of
+ VPs representing the "base" number of that chunk, the OS will do
+ basic arithmetic to get to all the VPs in the range.
+ Groups, when supported, will also be numbers in that space.
+ The physical processors numbering uses the same number space.
+ The underlying HW VP numbering is hidden from the OS. The APIs
+ use the system processor numbers as presented in the
+ "ibm,ppc-interrupt-server#s" property, which corresponds to the PIR
+ register content, to represent physical processors within the same
+ number space as dynamically allocated VPs.
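The "basic arithmetic" on a VP block can be sketched like this. The helper names are hypothetical; the only properties taken from the text are that the base is naturally aligned on the allocation order and that the VPs of the block are base, base + 1, and so on.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if 'base' is naturally aligned for a block of 2^order VPs. */
static bool vp_block_base_is_aligned(uint64_t base, uint32_t order)
{
    if (order >= 64)
        return false;
    return (base & ((UINT64_C(1) << order) - 1)) == 0;
}

/* Compute the VP number of entry 'idx' in a block of 2^order VPs.
 * Returns false if the base is misaligned or idx is out of range. */
static bool vp_block_index(uint64_t base, uint32_t order,
                           uint64_t idx, uint64_t *out_vp)
{
    if (!vp_block_base_is_aligned(base, order))
        return false;
    if (idx >= (UINT64_C(1) << order))
        return false;
    *out_vp = base + idx;
    return true;
}
```

The OS never needs to interpret the bits of the resulting number; it only passes it back to OPAL.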
+ Note about block group mode:
+ The block group mode shall as much as possible be handled
+ transparently by OPAL.
+ For example, on a 2-chip machine, a request to allocate
+ 2^n VPs might result in an allocation of 2^(n-1) VPs per
+ chip allocated across 2 chips. The resulting VP numbers
+ will encode the order of the allocation allowing OPAL to
+ reconstitute which bits are the block ID bits and which bits
+ are the index bits in a way transparent to the OS. The overall
+ range of numbers passed to Linux will still be contiguous.
+ That implies however a limitation: We can only allocate within
+ power-of-two number of blocks. Thus the VP allocator will limit
+ itself to the largest power of two that can fit in the number
+ of available chips in the machine: A machine with 3 good chips
+ will only be able to allocate VPs from 2 of them.
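The power-of-two limitation can be illustrated with a small helper that computes how many chips the VP allocator could spread a block across. This is a sketch of the rule as stated above, with a hypothetical function name; it does not reproduce OPAL's actual allocator.

```c
#include <stdint.h>

/* Largest power of two that is <= good_chips (0 if there are none). */
static uint32_t vp_alloc_usable_chips(uint32_t good_chips)
{
    uint32_t pow2 = 1;

    if (good_chips == 0)
        return 0;
    /* Double while the next power of two still fits. */
    while (pow2 <= good_chips / 2)
        pow2 *= 2;
    return pow2;
}
```

With 3 good chips this yields 2, matching the example in the text.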
+ 3) Group numbering and allocation
+ The group numbers are in the *same* number space as the VP
+ numbers. OPAL will internally use some bits of the VP number
+ to encode the group geometry.
+ [TBD] OPAL may or may not allocate a default group of all physical
+ processors, per-chip groups or per-core groups. This will be
+ represented in the device-tree somewhat...
+ [TBD] OPAL will provide interfaces for allocating groups
+ Note about P/Q bit operation on sources:
+ opal_xive_get_irq_info() returns a certain number of flags
+ which define the type of operation supported. The following
+ rules apply based on what those flags say:
+ - The Q bit isn't functional on an LSI interrupt. There is no
+ guarantee that the special combination "01" will work for an
+ LSI (and in fact it will not work on the PHB LSIs). However
+ just setting P to 1 is sufficient to mask an LSI (just don't
+ EOI it while masked).
+ - The recommended setting for a masked interrupt that is
+ temporarily masked by a driver is "10". This means a new
+ occurrence while masked will be recorded and a "StoreEOI"
+ will replay it appropriately.
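A toy software model of the "10" masking case may help. It is only a sketch: the bit encoding (P as bit 1, Q as bit 0), the function names, and the behaviour for states not described in this document (in particular what an EOI does in the "01" state) are assumptions, and real sources are driven through MMIO, not through functions like these.

```c
#include <stdbool.h>
#include <stdint.h>

/* "PQ" written as a 2-bit value, P first: "10" is ESB_P, "01" is ESB_Q. */
enum { ESB_Q = 0x1, ESB_P = 0x2 };

/* A new occurrence of the interrupt. Returns true if it is forwarded. */
static bool esb_model_trigger(uint8_t *pq)
{
    if (*pq & ESB_P) {
        *pq |= ESB_Q;   /* already pending or masked: record it in Q */
        return false;
    }
    if (*pq & ESB_Q)
        return false;   /* "01": source is off, occurrence is lost */
    *pq = ESB_P;        /* "00" -> "10": forward the event */
    return true;
}

/* "StoreEOI": move Q into P. Returns true if the source retriggers. */
static bool esb_model_store_eoi(uint8_t *pq)
{
    bool replay;

    if (*pq == ESB_Q)
        return false;   /* assumption: an EOI leaves "01" untouched */
    replay = (*pq & ESB_Q) != 0;
    *pq = replay ? ESB_P : 0;
    return replay;
}
```

Starting from "10", a trigger moves the model to "11" without forwarding anything, and the following StoreEOI returns true, which is the replay described above.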
+III - Event queues
+Each virtual processor or group has a certain number of event queues
+associated with it. Each corresponds to a given priority. The number
+of supported priorities is provided in the device-tree
+("ibm,xive-#priorities" property of the xive node).
+By default, OPAL populates at least one queue for every physical thread
+in the system. The number of queues and the size used is implementation
+specific. If the OS wants to re-use these to save memory, it can query
+the VP configuration.
+The opal_xive_get_queue_info() and opal_xive_set_queue_info() calls can
+be used to query and update a queue configuration (ie, to obtain the
+current page and size for the queue itself, but also to collect some
+configuration flags for that queue such as whether it coalesces
+notifications etc...) and to obtain the MMIO address of the queue EOI
+page (in the case where coalescing is enabled).
+IV - OPAL APIs
+ WARNING: *All* the calls listed below may return OPAL_BUSY unless
+ explicitly documented not to. In that case, the call
+ should be performed again. The OS is allowed to insert a
+ delay though no minimum nor maximum delay is specified.
+ This will typically happen when performing cache update
+ operations in the XIVE, if they result in a collision.
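The retry rule can be sketched as a small wrapper. Everything here is illustrative: the numeric values of OPAL_SUCCESS and OPAL_BUSY are assumed, the callback types are hypothetical, and demo_busy_then_ok() is a stand-in for a real OPAL call that exists only to exercise the loop.

```c
#include <stddef.h>
#include <stdint.h>

#define OPAL_SUCCESS 0    /* assumed value */
#define OPAL_BUSY    (-2) /* assumed value */

typedef int64_t (*opal_op_fn)(void *arg);

/* Repeat 'op' until it returns something other than OPAL_BUSY.
 * 'delay' is optional; no minimum or maximum delay is specified. */
static int64_t opal_retry_while_busy(opal_op_fn op, void *arg,
                                     void (*delay)(void))
{
    for (;;) {
        int64_t rc = op(arg);

        if (rc != OPAL_BUSY)
            return rc;
        if (delay != NULL)
            delay();
    }
}

/* Hypothetical stand-in: busy for *arg calls, then succeeds. */
static int64_t demo_busy_then_ok(void *arg)
{
    int *remaining = arg;

    if (*remaining > 0) {
        (*remaining)--;
        return OPAL_BUSY;
    }
    return OPAL_SUCCESS;
}
```

A real caller would wrap the specific OPAL call and its arguments in the callback and would probably bound the number of retries.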
+ WARNING: Calls that are expected to be called at runtime
+ simultaneously without conflicts such as getting/setting
+ IRQ info or queue info are fine to do so concurrently.
+ However, there is no internal locking to prevent races
+ between things such as freeing a VP block and getting/setting
+ queue infos on that block.
+ These aren't fully specified (yet) but common sense shall
+ prevail.
+ int64_t opal_xive_reset(uint64_t version)
+ The OS should call this once when starting up to re-initialize the
+ XIVE hardware and the OPAL XIVE related state back to all defaults.
+ It can call it a second time before handing over to another kernel (ie.
+ kexec) to re-enable XICS emulation.
+ The "version" argument should be set to 1 to enable the XIVE
+ exploitation mode APIs or 0 to switch back to the default XICS
+ emulation mode.
+ Future versions of OPAL might allow higher versions than 1 to
+ represent newer versions of this API. OPAL will return an error
+ if it doesn't recognize the requested version.
+ Any page of memory that the OS has "donated" to OPAL, either backing
+ store for EQDs or VPDs or actual queue buffers will be removed from
+ the various HW maps and can be re-used by the OS or freed after this
+ call regardless of the version information. The HW will be reset to
+ a (mostly) clean state.
+ It is the responsibility of the caller to ensure that no other
+ XIVE or XICS emulation call happens simultaneously to this. This
+ basically should happen on an otherwise quiescent system. In the
+ case of kexec, it is recommended that all processors CPPR is lowered
+ Note: This call always executes fully synchronously, never returns
+ OPAL_BUSY and will work regardless of whether VPs and EQs are left
+ enabled or disabled. It *will* spend a significant amount of time
+ inside OPAL and as such is not suitable to be performed during normal
+ int64_t opal_xive_get_irq_info(uint32_t girq,
+ uint64_t *out_flags,
+ uint64_t *out_eoi_page,
+ uint64_t *out_trig_page,
+ uint32_t *out_esb_shift,
+ uint32_t *out_src_chip);
+ Returns info about an interrupt source.
+ * out_flags returns a set of flags. The following flags
+ are defined in the API (some bits are reserved, so any bit
+ not defined here should be ignored):
+ - OPAL_XIVE_IRQ_TRIGGER_PAGE
+ Indicate that the trigger page is a separate page. If that
+ bit is clear, there is either no trigger page or the trigger
+ can be done in the same page as the EOI, see below.
+ - OPAL_XIVE_IRQ_STORE_EOI
+ Indicates that the interrupt supports the "Store EOI" option,
+ ie a store to the EOI page will move Q into P and retrigger
+ if the resulting P bit is 1. If this flag is 0, then a store
+ to the EOI page will do a trigger if OPAL_XIVE_IRQ_TRIGGER_PAGE
+ is also 0.
+ - OPAL_XIVE_IRQ_LSI
+ Indicates that the source is a level sensitive source and thus
+ doesn't have a functional Q bit. The Q bit may or may not be
+ implemented in HW but SW shouldn't rely on it doing anything.
+ - OPAL_XIVE_IRQ_SHIFT_BUG
+ Indicates that the source has a HW bug that shifts the bits
+ of the "offset" inside the EOI page left by 4 bits. So when
+ this is set, use 0xc000, 0xd000... instead of 0xc00, 0xd00...
+ as offsets in the EOI page.
+ * out_eoi_page and out_trig_page outputs will be set to the
+ EOI page physical address (always) and the trigger page address
+ (if it exists). If OPAL_XIVE_IRQ_TRIGGER_PAGE is 0 then there
+ will be no separate trigger page and *out_trig_page will be 0.
+ * out_esb_shift contains the size (as an order, ie 2^n) of the
+ EOI and trigger pages. Current supported values are 12 (4k)
+ and 16 (64k). Those cannot be configured by the OS and are set
+ by firmware but can be different for different interrupt sources.
+ * out_src_chip will be set to the chip ID of the HW entity this
+ interrupt is sourced from. It's meant to be informative only
+ and thus isn't guaranteed to be 100% accurate. The idea is for
+ the OS to use that to pick up a default target processor on
+ the same chip.
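How an OS might interpret these outputs can be sketched as below. The numeric flag values are assumptions made for the example (only the flag names come from the text), the helper names are hypothetical, and the "no trigger possible" case is one reading of the flag descriptions above.

```c
#include <stdint.h>

/* Assumed bit values, for illustration only. */
#define OPAL_XIVE_IRQ_TRIGGER_PAGE 0x1ull
#define OPAL_XIVE_IRQ_STORE_EOI    0x2ull
#define OPAL_XIVE_IRQ_SHIFT_BUG    0x8ull

/* Size in bytes of the EOI and trigger pages (12 -> 4k, 16 -> 64k). */
static uint64_t esb_page_size(uint32_t esb_shift)
{
    return UINT64_C(1) << esb_shift;
}

/* Offset to use inside the EOI page, accounting for the shift bug. */
static uint64_t esb_eoi_offset(uint64_t flags, uint64_t offset)
{
    return (flags & OPAL_XIVE_IRQ_SHIFT_BUG) ? offset << 4 : offset;
}

/* Page to store to in order to trigger the source, or 0 if none:
 * a separate trigger page if there is one, else the EOI page when
 * the source does not implement Store EOI. */
static uint64_t esb_trigger_addr(uint64_t flags, uint64_t eoi_page,
                                 uint64_t trig_page)
{
    if (flags & OPAL_XIVE_IRQ_TRIGGER_PAGE)
        return trig_page;
    if (!(flags & OPAL_XIVE_IRQ_STORE_EOI))
        return eoi_page;
    return 0;
}
```

Bits not defined in the API are simply ignored by these helpers, as the text requires.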
+ int64_t opal_xive_get_irq_config(uint32_t girq, uint64_t *out_vp,
+ uint8_t *out_prio, uint32_t *out_lirq);
+ Returns the current configuration of an interrupt source. This is
+ the equivalent of opal_get_xive() with the addition of the logical
+ interrupt number (the number that will be presented in the queue).
+ * girq: The interrupt number to get the configuration of as
+ provided by the device-tree.
+ * out_vp: Will contain the target virtual processor where the
+ interrupt is currently routed to. This can return 0xffffffff
+ if the interrupt isn't routed to a valid virtual processor.
+ * out_prio: Will contain the priority of the interrupt or 0xff
+ if masked
+ * out_lirq: Will contain the logical interrupt assigned to the
+ interrupt. By default this will be the same as girq.
+ int64_t opal_xive_set_irq_config(uint32_t girq, uint64_t vp, uint8_t prio,
+ uint32_t lirq);
+ This allows configuration and routing of a hardware interrupt. This is
+ equivalent to opal_set_xive() with the addition of the ability to
+ configure the logical IRQ number (the number that will be presented
+ in the target queue).
+ * girq: The interrupt number to configure as provided by the
+ device-tree.
+ * vp: The target virtual processor. The target VP/Prio combination
+ must already exist, be enabled and populated (ie, a queue page must
+ be provisioned for that queue).
+ * prio: The priority of the interrupt.
+ * lirq: The logical interrupt number assigned to that interrupt
+ Note about masking:
+ If the prio is set to 0xff, this call will cause the interrupt to be
+ masked.
+ Note: This function might clobber the source P/Q bits. An interrupt
+ masked this way will be in a state where the events will be lost
+ while masked and not replayed while unmasked. Unmasking *will* clear
+ the state of the source P/Q bits unconditionally.
+ It is recommended for an OS exploiting the XIVE directly to not use
+ this function for temporary driver-initiated masking of interrupts
+ but to directly mask using the P/Q bits of the source instead.
+ Masking using this function is intended for the case where the OS has
+ no handler registered for a given interrupt anymore or when registering
+ a new handler for an interrupt that had none. In these cases, losing
+ interrupts happening while no handler was attached is considered fine
+ and the source comes up in a "clean state" when used for the first time.
+ int64_t opal_xive_get_queue_info(uint64_t vp, uint32_t prio,
+ uint64_t *out_qpage,
+ uint64_t *out_qsize,
+ uint64_t *out_qeoi_page,
+ uint32_t *out_escalate_irq,
+ uint64_t *out_qflags);
+ This returns information about a given interrupt queue associated
+ with a virtual processor and a priority.
+ * out_qpage: will contain the physical address of the page where the
+ interrupt events will be posted.
+ * out_qsize: will contain the log2 of the size of the queue buffer
+ or 0 if the queue hasn't been populated. Example: 12 for a 4k page.
+ * out_qeoi_page: will contain the physical address of the MMIO page
+ used to perform EOIs for the queue notifications.
+ * out_escalate_irq: will contain a girq number for the escalation
+ interrupt associated with that queue.
+ WARNING: The "escalate_irq" is a special interrupt number, depending
+ on the implementation it may or may not correspond to a normal XIVE
+ source. Masking of escalation IRQs is only supported using the PQ bits,
+ passing a priority of 0xff to opal_set_xive() or
+ opal_xive_set_irq_config() will in effect only change the PQ bits.
+ Being MSIs though, they do support the special "01" combination for
+ 'interrupt off'.
+ * out_qflags: will contain flags defined as follows:
+ - OPAL_XIVE_EQ_ENABLED
+ This must be set for the queue to be enabled and thus a valid
+ target for interrupts. Newly allocated queues are disabled by
+ default and must be disabled again before being freed (allocating
+ and freeing of queues currently only happens along with their
+ owner VP).
+ NOTE: A newly enabled queue will have the generation set to 1
+ and the queue pointer to 0. If the OS wants to "reset" a queue
+ generation and pointer, it thus must disable and re-enable
+ the queue.
+ - OPAL_XIVE_EQ_ALWAYS_NOTIFY
+ When this is set, the HW will always notify the VP on any new
+ entry in the queue, thus the queue's own P/Q bits won't be relevant
+ and using the EOI page will be unnecessary.
+ - OPAL_XIVE_EQ_ESCALATE
+ When this is set, the EQ will escalate to the escalation interrupt
+ when failing to notify.
+ int64_t opal_xive_set_queue_info(uint64_t vp, uint32_t prio,
+ uint64_t qpage,
+ uint64_t qsize,
+ uint64_t qflags);
+ This allows the OS to configure the queue page for a given processor
+ and priority and adjust the behaviour of the queue via flags.
+ * qpage: physical address of the page where the interrupt events will
+ be posted. This has to be naturally aligned.
+ * qsize: log2 of the size of the above page. A 0 here will disable
+ the queue.
+ * qflags: Flags (see definitions in opal_xive_get_queue_info)
+ NOTE: Should this have the side effect of resetting the toggle/generation ?
+ NOTE: This must be called at least once on a queue with the flag
+ OPAL_XIVE_EQ_ENABLED in order to enable it after it has been
+ allocated (along with its owner VP).
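Before calling opal_xive_set_queue_info(), an OS could sanity check its arguments along these lines. The helper names are hypothetical, and the list of sizes is expected to come from the "ibm,xive-eq-sizes" property; the values used in any test are made up.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if 'qsize' (log2 of the queue size) is listed in the
 * cells read from "ibm,xive-eq-sizes". */
static bool eq_size_is_supported(const uint32_t *sizes, size_t count,
                                 uint64_t qsize)
{
    for (size_t i = 0; i < count; i++) {
        if (sizes[i] == qsize)
            return true;
    }
    return false;
}

/* True if 'qpage' is naturally aligned for a queue of 2^qsize bytes.
 * A qsize of 0 disables the queue, so any page value is accepted. */
static bool eq_page_is_aligned(uint64_t qpage, uint64_t qsize)
{
    if (qsize == 0)
        return true;
    if (qsize >= 64)
        return false;
    return (qpage & ((UINT64_C(1) << qsize) - 1)) == 0;
}
```

Checking on the OS side gives a clearer failure than relying on whatever error OPAL returns for a bad parameter.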
+ int64_t opal_xive_donate_page(uint32_t chip_id, uint64_t addr);
+ This call is used to donate pages to OPAL for use by VP/EQ provisioning.
+ The pages must be of the size specified by the "ibm,xive-provision-page-size"
+ property and naturally aligned.
+ All donated pages are forgotten by OPAL (and thus returned to the OS)
+ on any call to opal_xive_reset().
+ The chip_id should be the chip on which the pages were allocated or -1
+ if unspecified. Ideally, when a VP allocation request fails with the
+ OPAL_XIVE_PROVISIONING error, the OS should allocate one such page
+ for each chip in the system and hand it to OPAL before trying again.
+ Note: It is possible that the provisioning ends up requiring more than
+ one page per chip. OPAL will keep returning the above error until enough
+ pages have been provided.
+ int64_t opal_xive_alloc_vp_block(uint32_t alloc_order);
+ This call is used to allocate a block of VPs. It will return a number
+ representing the base of the block which will be aligned on the alloc
+ order, allowing the OS to do basic arithmetic to index VPs in the block.
+ The VPs will have queue structures reserved (but not initialized nor
+ provisioned) for all the priorities defined in the "ibm,xive-#priorities"
+ property.
+ This call might return OPAL_XIVE_PROVISIONING. In this case, the OS
+ must allocate pages and provision OPAL using opal_xive_donate_page(),
+ see the documentation for opal_xive_donate_page() for details.
+ The resulting VPs must be individually enabled with opal_xive_set_vp_info
+ below with the OPAL_XIVE_VP_ENABLED flag set before use.
+ For all priorities, the corresponding queues must also be individually
+ provisioned and enabled with opal_xive_set_queue_info.
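The allocate, donate, retry sequence can be sketched as follows. This is an illustration under several assumptions: the value of OPAL_XIVE_PROVISIONING is assumed, the struct xive_ops indirection and the demo_* functions are hypothetical stand-ins for the OPAL calls and for the OS page allocator, and OPAL_BUSY handling is left out for brevity.

```c
#include <stddef.h>
#include <stdint.h>

#define OPAL_XIVE_PROVISIONING (-31) /* assumed value */

struct xive_ops {
    int64_t  (*alloc_vp_block)(uint32_t alloc_order);
    int64_t  (*donate_page)(uint32_t chip_id, uint64_t addr);
    uint64_t (*alloc_page)(uint32_t chip_id); /* OS allocator, 0 on failure */
};

/* Allocate a VP block, donating one page to each chip listed in
 * "ibm,xive-provision-chips" every time OPAL asks for provisioning. */
static int64_t alloc_vp_block_provisioned(const struct xive_ops *ops,
                                          const uint32_t *chips,
                                          size_t nchips, uint32_t order)
{
    for (;;) {
        int64_t rc = ops->alloc_vp_block(order);

        if (rc != OPAL_XIVE_PROVISIONING)
            return rc; /* VP block base, or some other error */
        for (size_t i = 0; i < nchips; i++) {
            uint64_t page = ops->alloc_page(chips[i]);

            if (page == 0)
                return OPAL_XIVE_PROVISIONING; /* OS is out of pages */
            rc = ops->donate_page(chips[i], page);
            if (rc != 0)
                return rc;
        }
    }
}

/* Hypothetical fake firmware: wants 4 pages before it can allocate. */
static unsigned demo_pages_needed = 4;

static int64_t demo_alloc_vp_block(uint32_t order)
{
    (void)order;
    return demo_pages_needed ? OPAL_XIVE_PROVISIONING : 0x80;
}

static int64_t demo_donate_page(uint32_t chip_id, uint64_t addr)
{
    (void)chip_id;
    (void)addr;
    if (demo_pages_needed)
        demo_pages_needed--;
    return 0;
}

static uint64_t demo_alloc_page(uint32_t chip_id)
{
    static uint64_t next = 0x10000;

    (void)chip_id;
    next += 0x10000;
    return next;
}
```

With two chips the fake firmware is satisfied after two rounds of donation, which mirrors the note that provisioning may need more than one page per chip.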
+int64_t opal_xive_free_vp_block(uint64_t vp);
+ This call is used to free a block of VPs. It must be called with the same
+ *base* number as was returned by opal_xive_alloc_vp_block() (any index into the
+ block will result in an OPAL_PARAMETER error).
+ The VPs must have been previously all disabled with opal_xive_set_vp_info
+ below with the OPAL_XIVE_VP_ENABLED flag cleared before use.
+ All the queues must also have been disabled.
+ Failure to do any of the above will result in an OPAL_XIVE_FREE_ACTIVE error.
+ int64_t opal_xive_get_vp_info(uint64_t vp,
+ uint64_t *flags,
+ uint64_t *cam_value,
+ uint64_t *report_cl_pair);
+ This call returns information about an allocated VP:
+ * flags :
+ - OPAL_XIVE_VP_ENABLED
+ This must be set for the VP to be usable and cleared before freeing it
+ * cam_value : This is the value to program into the thread management
+ area to dispatch that VP (ie, an encoding of the block + index).
+ * report_cl_pair: This is the real address of the reporting cache line
+ pair for that VP (defaults to 0)
+ int64_t opal_xive_set_vp_info(uint64_t vp,
+ uint64_t flags,
+ uint64_t report_cl_pair);
+ int64_t opal_xive_allocate_irq(uint32_t chip_id);
+ This call allocates a software IRQ on a given chip. It returns the
+ interrupt number or an error.
+ int64_t opal_xive_free_irq(uint32_t girq);
+ This call frees a software IRQ that was allocated by
+ opal_xive_allocate_irq. Passing any other interrupt number
+ will result in an OPAL_PARAMETER error.