[RESEND PATCH v7 3/5] powerpc/papr_scm: Fetch nvdimm health information from PHYP

Fri May 22 00:31:41 AEST 2020

Vaibhav Jain <vaibhav at linux.ibm.com> writes:
> Thanks for reviewing this this patch Ira. My responses below:
> Ira Weiny <ira.weiny at intel.com> writes:
>> On Wed, May 20, 2020 at 12:30:56AM +0530, Vaibhav Jain wrote:
>>> Implement support for fetching nvdimm health information via
>>> H_SCM_HEALTH hcall as documented in Ref[1]. The hcall returns a pair
>>> of 64-bit big-endian integers, bitwise-and of which is then stored in
>>> 'struct papr_scm_priv' and subsequently partially exposed to
>>> user-space via newly introduced dimm specific attribute
>>> 'papr/flags'. Since the hcall is costly, the health information is
>>> cached and only re-queried, 60s after the previous successful hcall.
...
>>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
>>> index f35592423380..142636e1a59f 100644
>>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>>> @@ -39,6 +78,15 @@ struct papr_scm_priv {
>>>  	struct resource res;
>>>  	struct nd_region *region;
>>>  	struct nd_interleave_set nd_set;
>>> +
>>> +	/* Protect dimm health data from concurrent read/writes */
>>> +	struct mutex health_mutex;
>>> +
>>> +	/* Last time the health information of the dimm was updated */
>>> +	unsigned long lasthealth_jiffies;
>>> +
>>> +	/* Health information for the dimm */
>>> +	u64 health_bitmap;
>>
>> I wonder if this should be typed big endian as you mention that it is in the
>> commit message?
> This was discussed in an earlier review of the patch series at
> https://lore.kernel.org/linux-nvdimm/878sjetcis.fsf@mpe.ellerman.id.au
>
> Even though health bitmap is returned in big endian format (For ex
> value 0xC00000000000000 indicates bits 0,1 set), its value is never
> used. Instead only test for specific bits being set in the register is
> done.

This has already caused a lot of confusion, so let me try and clear it
up. I will probably fail :)

The value is not big endian.

It's returned in a GPR (a register), from the hypervisor. The ordering
of bytes in a register is not dependent on what endian we're executing
in.

It's true that the hypervisor will have been running big endian, and
when it returns to us we will now be running little endian. But the
value is unchanged, it was 0xC00000000000000 in the GPR while the HV was
running and it's still 0xC00000000000000 when we return to Linux. You
can see this in mambo, see below for an example.

_However_, the specification of the bits in the bitmap value uses MSB 0
ordering, as is traditional for IBM documentation. That means the most
significant bit, aka. the left most bit, is called "bit 0".

See: https://en.wikipedia.org/wiki/Bit_numbering#MSB_0_bit_numbering

That is the opposite numbering from what most people use, and in
particular what most code in Linux uses, which is that bit 0 is the
least significant bit.

Which is where the confusion comes in. It's not that the bytes are
returned in a different order, it's that the bits are numbered
differently in the IBM documentation.

The way to fix this kind of thing is to read the docs, and convert all
the bits into correct numbering (LSB=0), and then throw away the docs ;)

cheers

In mambo we can set a breakpoint just before the kernel enters skiboot,
towards the end of __opal_call. The kernel is running LE and skiboot
runs BE.

  systemsim-p9 [~/skiboot/skiboot/external/mambo] b 0xc0000000000c1744
  breakpoint set at [0:0:0]: 0xc0000000000c1744 (0x00000000000C1744) Enc:0x2402004C : hrfid

Then run:

  systemsim-p9 [~/skiboot/skiboot/external/mambo] c
  [0:0:0]: 0xC0000000000C1744 (0x00000000000C1744) Enc:0x2402004C : hrfid
  INFO: 121671618: (121671618): ** Execution stopped: user (tcl),  **
  121671618: ** finished running 121671618 instructions **

And we stop there, on an hrfid that we haven't executed yet.
We can print r0, to see the OPAL token:

  systemsim-p9 [~/skiboot/skiboot/external/mambo] p r0
  0x0000000000000019

ie. we're calling OPAL_CONSOLE_WRITE_BUFFER_SPACE (25).

And we can print the MSR:

  systemsim-p9 [~/skiboot/skiboot/external/mambo] p msr
  0x9000000002001033

                     64-bit mode (SF): 0x1 [64-bit mode]
                Hypervisor State (HV): 0x1
               Vector Available (VEC): 0x1
  Machine Check Interrupt Enable (ME): 0x1
            Instruction Relocate (IR): 0x1
                   Data Relocate (DR): 0x1
           Recoverable Interrupt (RI): 0x1
              Little-Endian Mode (LE): 0x1 [little-endian]

ie. we're little endian.

We then step one instruction:

  systemsim-p9 [~/skiboot/skiboot/external/mambo] s
  [0:0:0]: 0x0000000030002BF0 (0x0000000030002BF0) Enc:0x7D9FFAA6 : mfspr   r12,PIR

Now we're in skiboot. Print the MSR again:

  systemsim-p9 [~/skiboot/skiboot/external/mambo] p msr
  0x9000000002001002

                     64-bit mode (SF): 0x1 [64-bit mode]
                Hypervisor State (HV): 0x1
               Vector Available (VEC): 0x1
  Machine Check Interrupt Enable (ME): 0x1
           Recoverable Interrupt (RI): 0x1

We're big endian.
Print r0:

  systemsim-p9 [~/skiboot/skiboot/external/mambo] p r0
  0x0000000000000019

r0 is unchanged!