[OpenPower-Firmware] Problem with CCS

Tue Apr 6 22:45:15 AEST 2021

Update: I have dealt with the write leveling issue. I accidentally
shifted a bit twice when trying to set PAR_A17_MASK in SEQ_CONTROL0, so
it was left unmasked.
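
For illustration, here is a minimal sketch of one way a double shift of
this kind can leave a mask bit unset. It assumes IBM bit numbering (bit 0
is the most significant bit of a 64-bit SCOM register). The bit position
used for PAR_A17_MASK and the helper names are hypothetical and are not
taken from the register specification or from the actual code.

```python
# Sketch of a double-shift bug. PAR_A17_MASK_BIT is a made-up position and
# ppc_bit()/set_bit() are hypothetical helpers, not real firmware APIs.

def ppc_bit(n, width=64):
    """Return a mask for bit n in IBM numbering (bit 0 = MSB)."""
    return 1 << (width - 1 - n)

def set_bit(reg, n):
    """Set bit n (an index, not a mask) in reg."""
    return reg | ppc_bit(n)

PAR_A17_MASK_BIT = 62  # hypothetical bit index, for illustration only

# Correct: pass the bit index, the helper shifts exactly once.
correct = set_bit(0, PAR_A17_MASK_BIT)

# Bug: the index is converted to a mask first, then the helper shifts
# again. A different bit is set and the intended one stays 0 (unmasked).
buggy = set_bit(0, ppc_bit(PAR_A17_MASK_BIT))

print(hex(correct), hex(buggy))
print("intended bit set in buggy value:",
      bool(buggy & ppc_bit(PAR_A17_MASK_BIT)))
```

Reading the register back and checking the intended bit, as in the last
line, would catch this kind of mistake early.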

Now I'm back to the initial issue with the loop in CCS. This time,
however, I see a difference between the original code (refresh on):

     0x0000000000000000 - APB_ERROR_STATUS0
     0x0000000000001000 - RC_ERROR_STATUS0
     0x0000000000000000 - SEQ_ERROR_STATUS0
     0x0000000000000000 - WC_ERROR_STATUS0
     0x0000000000000400 - PC_ERROR_STATUS0
     0x0000000000002008 - PC_INIT_CAL_ERROR
     0x0000000000000688 - DDRPHY_PC_INIT_CAL_STATUS
     0x0000000000000080 - IOM_PHY0_DDRPHY_FIR_REG

and the code after setting DDRPHY_PC_INIT_CAL_CONFIG1_P0 as described in
the previous mail (refresh off):

     0x0000000000000000 - APB_ERROR_STATUS0
     0x0000000000001000 - RC_ERROR_STATUS0
     0x0000000000000000 - SEQ_ERROR_STATUS0
     0x0000000000000000 - WC_ERROR_STATUS0
     0x0000000000000000 - PC_ERROR_STATUS0
     0x0000000000000000 - PC_INIT_CAL_ERROR
     0x0000000000000608 - DDRPHY_PC_INIT_CAL_STATUS
     0x0000000000000000 - IOM_PHY0_DDRPHY_FIR_REG

PC_INIT_CAL_ERROR no longer reports an error, but
DDRPHY_PC_INIT_CAL_STATUS still doesn't report success. No DQ/DQS bits
are disabled, either with or without refresh.
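
To make the comparison of the two dumps above easier to repeat, a small
sketch like the following lists which bits changed in each register. The
values are copied from the dumps; the function names are hypothetical,
and the bit numbers are printed in IBM numbering (bit 0 = MSB of the
64-bit value), which is an assumption about how the reader wants to look
them up. It makes no claim about what the individual bits mean.

```python
# Register values copied from the two dumps (refresh on / refresh off).
REFRESH_ON = {
    "APB_ERROR_STATUS0": 0x0000,
    "RC_ERROR_STATUS0": 0x1000,
    "SEQ_ERROR_STATUS0": 0x0000,
    "WC_ERROR_STATUS0": 0x0000,
    "PC_ERROR_STATUS0": 0x0400,
    "PC_INIT_CAL_ERROR": 0x2008,
    "DDRPHY_PC_INIT_CAL_STATUS": 0x0688,
    "IOM_PHY0_DDRPHY_FIR_REG": 0x0080,
}
REFRESH_OFF = {
    "APB_ERROR_STATUS0": 0x0000,
    "RC_ERROR_STATUS0": 0x1000,
    "SEQ_ERROR_STATUS0": 0x0000,
    "WC_ERROR_STATUS0": 0x0000,
    "PC_ERROR_STATUS0": 0x0000,
    "PC_INIT_CAL_ERROR": 0x0000,
    "DDRPHY_PC_INIT_CAL_STATUS": 0x0608,
    "IOM_PHY0_DDRPHY_FIR_REG": 0x0000,
}

def set_bits(value, width=64):
    """IBM bit numbers (bit 0 = MSB) of the bits set in value."""
    return [i for i in range(width) if value & (1 << (width - 1 - i))]

def diff_dumps(before, after):
    """Map each changed register to (bits cleared, bits newly set)."""
    changes = {}
    for name, old in before.items():
        new = after[name]
        if old != new:
            changes[name] = (set_bits(old & ~new), set_bits(new & ~old))
    return changes

for name, (cleared, newly_set) in diff_dumps(REFRESH_ON, REFRESH_OFF).items():
    print(f"{name}: cleared {cleared}, set {newly_set}")
```

RC_ERROR_STATUS0 is 0x1000 in both dumps, so it does not show up in the
diff even though it is non-zero.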

On 06.04.2021 12:28, Krystian Hebel wrote:
>
> Hi Daniel,
>
> Thanks for the quick and informative response.
>> I got these answers from one of our memory experts.
>> Hi Krystian,
>>
>>  1. IBM mostly uses x4 DIMMs. Is it possible to run with an x4 DIMM
>>     for debug purposes to see if the problem persists? This will help
>>     debug configuration issues with the x8 DIMMs.
>>
> This may be difficult due to remote work, but I'll see what can be done.
>>
>>  2. Have you tried disabling refresh to see if the issues go away?
>>
> Is it enough to just modify DDRPHY_PC_INIT_CAL_CONFIG1_P0? If so, I
> changed all of REFRESH_COUNT, REFRESH_CONTROL and REFRESH_ALL_RANKS to
> all 0's and REFRESH_INTERVAL to all 1's. It still fails the same way,
> but a few microseconds faster than before.
>>
>>  3. For calibration fails (which it looks like you are experiencing),
>>     I would recommend dumping the following registers for rank 0
>>     DQS disable bits
>>     0x8000007d0701103f
>>     0x8000047d0701103f
>>     0x8000087d0701103f
>>     0x80000c7d0701103f
>>     0x8000107d0701103f
>>
>>     DQ disable bits
>>     0x8000007c0701103f
>>     0x8000047c0701103f
>>     0x8000087c0701103f
>>     0x80000c7c0701103f
>>     0x8000107c0701103f
>>
>>     If calibration is passing on a given DRAM, all of the bits should
>>     be 0's. Fails are noted by 1's in the register. As with all PHY
>>     registers, only the rightmost 16 bits matter.
>>
> Here I can see some fails: all DQ bits on the first and second DP16
> and all configured DQS bits (0xc300 for the first and 0x3c00 for the
> second, which is consistent with the settings from [1]). The rest of
> the DP16s pass. This DIMM works with Hostboot, so I think the clock
> bits are selected properly.
>
> I hadn't realized that these are updated by hardware and then used as
> input for the next steps. Now I know that what I thought was a
> successful write leveling was actually skipping bad bits. I was
> misled by the fact that the second attempt took more time than the
> first one, but it makes sense, as it starts from a higher initial
> delay and has a longer way to go down and up again, if I understand
> this step correctly.
>
> I went a step further and dumped all WR_DELAY_VALUE_x_RP0_REG. For
> passing bits the value is somewhere in the range 0x1900-0x2b00, where
> every set of 8 DQ bits and its accompanying DQS bit have the same
> value, which I believe is expected for x8 memory. For failed bits the
> value is always 0x3a00 for DQ bits (and whatever is in
> DELAY_VALUE_16-22 that isn't configured as a DQS), but 0x4200 for DQS
> bits. Unlike on the passing DP16s, these values don't change between
> boots. They can change slightly when I modify DDRPHY_WC_CONFIG1_P0,
> but there is still no pass.
>
>>  4. To my knowledge, there should not be an issue sending the RCW
>>     commands via i2c.
>>  5. Running in our test environment, I am seeing the following scoms
>>     for DQS align:
>>     CRONUSDEBUG(30807) : PUTSCOM   : p9n.mcbist:k0:n0:s0:p01:c1 : 070123A5 4000000000000000 # Stop CCS
>>     CRONUSDEBUG(30818) : PUTSCOM   : p9n.mcbist:k0:n0:s0:p01:c1 : 07012315 000000F0CC0000C0 # Configure init calibration
>>     CRONUSDEBUG(30823) : PUTSCOM   : p9n.mcbist:k0:n0:s0:p01:c1 : 07012335 0000000000000041 # Go to instruction 1
>>     CRONUSDEBUG(30826) : PUTSCOM   : p9n.mcbist:k0:n0:s0:p01:c1 : 07012316 000008F0CC000000 # don't do anything
>>     CRONUSDEBUG(30831) : PUTSCOM   : p9n.mcbist:k0:n0:s0:p01:c1 : 07012336 0000000000000020 # End CCS
>>     CRONUSDEBUG(30839) : PUTSCOM   : p9n.mcbist:k0:n0:s0:p01:c1 : 070123DB 0400000000000000 # Configure the port to run
>>     CRONUSDEBUG(30848) : PUTSCOM   : p9n.mcbist:k0:n0:s0:p01:c1 : 070123A5 8000000000000000 # Kick off CCS
>>
>>     I hope that this trace helps.
>>
> So, DDR_CAL_RANK in ARR1 is a number, and not a bit map of selected 
> ranks? That was my initial understanding, but then I changed the code 
> to treat it as a bit map. Still, fixing the code doesn't help, even 
> though now it is identical to the trace above.
>
>
> [1] 
> https://git.raptorcs.com/git/talos-hostboot/tree/src/import/chips/p9/procedures/hwp/memory/lib/phy/dp16.C#n1963
> -- 
> Krystian Hebel
> Firmware Engineer
> https://3mdeb.com  | @3mdeb_com
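
The rank 0 disable-bit addresses quoted above differ only in one field:
each successive DP16 adds 0x0000040000000000 to the address. A minimal
sketch that generates those addresses and lists which of the 16
meaningful bits are set in a dumped value might look like this. The
function names and the choice to number the low 16 bits from the left
are assumptions for illustration; only the addresses and the
0xc300/0x3c00 values come from the thread.

```python
DQ_DISABLE_BASE = 0x8000007C0701103F   # first DQ disable address in the list
DQS_DISABLE_BASE = 0x8000007D0701103F  # first DQS disable address in the list
DP16_STRIDE = 0x0000040000000000       # difference between consecutive entries

def dp16_addresses(base, count=5):
    """Addresses of the same register in DP16 instances 0..count-1."""
    return [base + i * DP16_STRIDE for i in range(count)]

def failed_bits(value):
    """Indices of set bits in the low 16 bits, 0 being the leftmost of them."""
    low16 = value & 0xFFFF  # only the rightmost 16 bits matter
    return [i for i in range(16) if low16 & (0x8000 >> i)]

for addr in dp16_addresses(DQS_DISABLE_BASE):
    print(f"0x{addr:016x}")

# DQS disable values reported for the first and second DP16.
print(failed_bits(0xC300))
print(failed_bits(0x3C00))
```

A non-empty list from failed_bits() for any DP16 means calibration
disabled at least one bit there, as described in point 3.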

-- 
Krystian Hebel
Firmware Engineer
https://3mdeb.com | @3mdeb_com
