[RESEND PATCH v7 3/5] powerpc/papr_scm: Fetch nvdimm health information from PHYP

Fri May 22 02:59:38 AEST 2020

Michael Ellerman <michaele at au1.ibm.com> writes:

> Vaibhav Jain <vaibhav at linux.ibm.com> writes:
>> Thanks for reviewing this this patch Ira. My responses below:
>> Ira Weiny <ira.weiny at intel.com> writes:
>>> On Wed, May 20, 2020 at 12:30:56AM +0530, Vaibhav Jain wrote:
>>>> Implement support for fetching nvdimm health information via
>>>> H_SCM_HEALTH hcall as documented in Ref[1]. The hcall returns a pair
>>>> of 64-bit big-endian integers, bitwise-and of which is then stored in
>>>> 'struct papr_scm_priv' and subsequently partially exposed to
>>>> user-space via newly introduced dimm specific attribute
>>>> 'papr/flags'. Since the hcall is costly, the health information is
>>>> cached and only re-queried, 60s after the previous successful hcall.
> ...
>>>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
>>>> index f35592423380..142636e1a59f 100644
>>>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>>>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>>>> @@ -39,6 +78,15 @@ struct papr_scm_priv {
>>>>  	struct resource res;
>>>>  	struct nd_region *region;
>>>>  	struct nd_interleave_set nd_set;
>>>> +
>>>> +	/* Protect dimm health data from concurrent read/writes */
>>>> +	struct mutex health_mutex;
>>>> +
>>>> +	/* Last time the health information of the dimm was updated */
>>>> +	unsigned long lasthealth_jiffies;
>>>> +
>>>> +	/* Health information for the dimm */
>>>> +	u64 health_bitmap;
>>>
>>> I wonder if this should be typed big endian as you mention that it is in the
>>> commit message?
>> This was discussed in an earlier review of the patch series at
>> https://lore.kernel.org/linux-nvdimm/878sjetcis.fsf@mpe.ellerman.id.au
>>
>> Even though health bitmap is returned in big endian format (For ex
>> value 0xC00000000000000 indicates bits 0,1 set), its value is never
>> used. Instead only test for specific bits being set in the register is
>> done.
>
> This has already caused a lot of confusion, so let me try and clear it
> up. I will probably fail :)
>
> The value is not big endian.
>
> It's returned in a GPR (a register), from the hypervisor. The ordering
> of bytes in a register is not dependent on what endian we're executing
> in.
>
> It's true that the hypervisor will have been running big endian, and
> when it returns to us we will now be running little endian. But the
> value is unchanged, it was 0xC00000000000000 in the GPR while the HV was
> running and it's still 0xC00000000000000 when we return to Linux. You
> can see this in mambo, see below for an example.
>
>
> _However_, the specification of the bits in the bitmap value uses MSB 0
> ordering, as is traditional for IBM documentation. That means the most
> significant bit, aka. the left most bit, is called "bit 0".
>
> See: https://en.wikipedia.org/wiki/Bit_numbering#MSB_0_bit_numbering
>
> That is the opposite numbering from what most people use, and in
> particular what most code in Linux uses, which is that bit 0 is the
> least significant bit.
>
> Which is where the confusion comes in. It's not that the bytes are
> returned in a different order, it's that the bits are numbered
> differently in the IBM documentation.
>
> The way to fix this kind of thing is to read the docs, and convert all
> the bits into correct numbering (LSB=0), and then throw away the docs ;)
>
> cheers
Thanks a lot for clarifying this Mpe and for this detailed explaination.

I have removed the term Big-Endian from v8 patch description to avoid
any further confusion.

>
>
>
> In mambo we can set a breakpoint just before the kernel enters skiboot,
> towards the end of __opal_call. The kernel is running LE and skiboot
> runs BE.
>
>   systemsim-p9 [~/skiboot/skiboot/external/mambo] b 0xc0000000000c1744
>   breakpoint set at [0:0:0]: 0xc0000000000c1744 (0x00000000000C1744) Enc:0x2402004C : hrfid
>
> Then run:
>
>   systemsim-p9 [~/skiboot/skiboot/external/mambo] c
>   [0:0:0]: 0xC0000000000C1744 (0x00000000000C1744) Enc:0x2402004C : hrfid
>   INFO: 121671618: (121671618): ** Execution stopped: user (tcl),  **
>   121671618: ** finished running 121671618 instructions **
>
> And we stop there, on an hrfid that we haven't executed yet.
> We can print r0, to see the OPAL token:
>
>   systemsim-p9 [~/skiboot/skiboot/external/mambo] p r0
>   0x0000000000000019
>
> ie. we're calling OPAL_CONSOLE_WRITE_BUFFER_SPACE (25).
>
> And we can print the MSR:
>
>   systemsim-p9 [~/skiboot/skiboot/external/mambo] p msr
>   0x9000000002001033
>   
>                      64-bit mode (SF): 0x1 [64-bit mode]
>                 Hypervisor State (HV): 0x1
>                Vector Available (VEC): 0x1
>   Machine Check Interrupt Enable (ME): 0x1
>             Instruction Relocate (IR): 0x1
>                    Data Relocate (DR): 0x1
>            Recoverable Interrupt (RI): 0x1
>               Little-Endian Mode (LE): 0x1 [little-endian]
>
> ie. we're little endian.
>
> We then step one instruction:
>
>   systemsim-p9 [~/skiboot/skiboot/external/mambo] s
>   [0:0:0]: 0x0000000030002BF0 (0x0000000030002BF0) Enc:0x7D9FFAA6 : mfspr   r12,PIR
>
> Now we're in skiboot. Print the MSR again:
>
>   systemsim-p9 [~/skiboot/skiboot/external/mambo] p msr
>   0x9000000002001002
>   
>                      64-bit mode (SF): 0x1 [64-bit mode]
>                 Hypervisor State (HV): 0x1
>                Vector Available (VEC): 0x1
>   Machine Check Interrupt Enable (ME): 0x1
>            Recoverable Interrupt (RI): 0x1
>
> We're big endian.
> Print r0:
>
>   systemsim-p9 [~/skiboot/skiboot/external/mambo] p r0
>   0x0000000000000019
>
> r0 is unchanged!
Got it. Thanks again.

-- 
Cheers
~ Vaibhav