[OpenPower-Firmware] Problem with CCS

Thu Apr 8 21:29:32 AEST 2021

No, I haven't. How can I get it?
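
For the comparison step itself, once a trace from each code base exists, a minimal sketch in Python could look like the one below. It assumes both traces use the CRONUSDEBUG PUTSCOM line format quoted further down in this thread; the helper names and the "candidate" trace are hypothetical and only serve as an illustration.

```python
import re
from itertools import zip_longest

# One write looks like this once a wrapped entry is rejoined:
#   CRONUSDEBUG(30807) : PUTSCOM : p9n.mcbist:k0:n0:s0:p01:c1 : 070123A5 4000000000000000 # Stop CCS
PUTSCOM_RE = re.compile(
    r"PUTSCOM\s*:\s*\S+\s*:\s*(?P<addr>[0-9A-Fa-f]+)\s+(?P<data>[0-9A-Fa-f]{16})\b"
)


def parse_putscoms(text):
    """Return the ordered (address, data) pairs of all PUTSCOM entries in text."""
    flat = " ".join(text.split())  # undo line wrapping introduced by mail clients
    return [(int(m["addr"], 16), int(m["data"], 16))
            for m in PUTSCOM_RE.finditer(flat)]


def diff_traces(reference, candidate):
    """List (position, reference_write, candidate_write) where the traces differ."""
    pairs = zip_longest(parse_putscoms(reference), parse_putscoms(candidate))
    return [(i, ref, new) for i, (ref, new) in enumerate(pairs) if ref != new]


def describe(write):
    """Format one write as 'ADDRESS DATA', or 'missing' if one trace is shorter."""
    return "missing" if write is None else "%08X %016X" % write


# Two entries taken from the trace in this thread.
reference_trace = """
CRONUSDEBUG(30807) : PUTSCOM   :
p9n.mcbist:k0:n0:s0:p01:c1 :
070123A5             4000000000000000 # Stop CCS
CRONUSDEBUG(30818) : PUTSCOM   :
p9n.mcbist:k0:n0:s0:p01:c1 : 07012315
000000F0CC0000C0 # Configure init calibration
"""

# Hypothetical trace from the new code, with one data word changed on purpose.
candidate_trace = """
CRONUSDEBUG(1) : PUTSCOM : p9n.mcbist:k0:n0:s0:p01:c1 : 070123A5 4000000000000000
CRONUSDEBUG(2) : PUTSCOM : p9n.mcbist:k0:n0:s0:p01:c1 : 07012315 000000F0CC000000
"""

for position, ref, new in diff_traces(reference_trace, candidate_trace):
    print("write %d: reference %s, candidate %s"
          % (position, describe(ref), describe(new)))
```

The comparison is positional, so a single extra or missing write shifts everything after it; for long traces a sequence-alignment diff may be more readable.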

On 06.04.2021 16:00, Daniel M Crowell wrote:
>
> Have you attempted to get a complete scom trace from the original 
> Hostboot code and compare it to your new code? That is a pretty 
> typical debug strategy on our side when migrating from the initial 
> hardware bringup scripts into the firmware implementation.
>
> --
> Dan Crowell
> Senior Software Engineer - Power Systems Enablement Firmware
> IBM Rochester: t/l 553-2987
> dcrowell at us.ibm.com
>
> From: Krystian Hebel <krystian.hebel at 3mdeb.com>
> To: Daniel M Crowell <dcrowell at us.ibm.com>
> Cc: firmware at 3mdeb.com, openpower-firmware at lists.ozlabs.org
> Date: 04/06/2021 07:45 AM
> Subject: [EXTERNAL] Re: [OpenPower-Firmware] Problem with CCS
>
> ------------------------------------------------------------------------
>
>
>
> Update: I have dealt with write leveling issue, I accidentally shifted 
> a bit twice when trying to set PAR_A17_MASK in SEQ_CONTROL0, so it was 
> left unmasked.
>
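
The double shift described above is a common slip with IBM (MSB-0) bit numbering, where bit 0 is the most significant bit of the 64-bit register. A minimal sketch of one way to guard against it is to convert a bit number into a mask in exactly one place and reject anything that is not a valid bit number. The helper names and the bit position used here are hypothetical; the real position of PAR_A17_MASK is not stated in this thread.

```python
REG_WIDTH = 64
MASK64 = (1 << REG_WIDTH) - 1


def ppc_bit(bit):
    """Mask for one bit in IBM (MSB-0) numbering: bit 0 is the most significant."""
    if not 0 <= bit < REG_WIDTH:
        raise ValueError("not a bit number: %#x" % bit)
    return 1 << (REG_WIDTH - 1 - bit)


def set_bit(value, bit):
    """Return value with the given IBM-numbered bit set."""
    return (value | ppc_bit(bit)) & MASK64


EXAMPLE_BIT = 48  # hypothetical position, NOT taken from the register spec

mask = ppc_bit(EXAMPLE_BIT)
print(hex(set_bit(0, EXAMPLE_BIT)))

# Passing an already converted mask where a bit number is expected is the
# "shifted twice" pattern; the range check turns it into an explicit error.
try:
    ppc_bit(mask)
except ValueError as err:
    print("rejected:", err)
```

In C the same idea is usually a single macro that takes the bit number, so that register field definitions hold either bit numbers or masks, never a mix of both.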
> Now I'm back to initial issue with loop in CCS. This time however I 
> see a difference between original code (refresh on):
>
>     0x0000000000000000 - APB_ERROR_STATUS0
>     0x0000000000001000 - RC_ERROR_STATUS0
>     0x0000000000000000 - SEQ_ERROR_STATUS0
>     0x0000000000000000 - WC_ERROR_STATUS0
>     0x0000000000000400 - PC_ERROR_STATUS0
>     0x0000000000002008 - PC_INIT_CAL_ERROR
>     0x0000000000000688 - DDRPHY_PC_INIT_CAL_STATUS
>     0x0000000000000080 - IOM_PHY0_DDRPHY_FIR_REG
>
> and after setting DDRPHY_PC_INIT_CAL_CONFIG1_P0 as in previous mail:
>
>     0x0000000000000000 - APB_ERROR_STATUS0
>     0x0000000000001000 - RC_ERROR_STATUS0
>     0x0000000000000000 - SEQ_ERROR_STATUS0
>     0x0000000000000000 - WC_ERROR_STATUS0
>     0x0000000000000000 - PC_ERROR_STATUS0
>     0x0000000000000000 - PC_INIT_CAL_ERROR
>     0x0000000000000608 - DDRPHY_PC_INIT_CAL_STATUS
>     0x0000000000000000 - IOM_PHY0_DDRPHY_FIR_REG
>
> PC_INIT_CAL_ERROR no longer reports an error, but 
> DDRPHY_PC_INIT_CAL_STATUS still doesn't report a success. No DQ/DQS 
> bits are disabled, neither with nor without refresh.
>
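
To make the difference between the two dumps explicit, the values above can be compared register by register. The sketch below (Python, helper names hypothetical) lists only the registers that changed and reports the changed bit positions in IBM (MSB-0) numbering for a 64-bit register. It does not attempt to name the bits, since their meaning has to come from the register specification.

```python
def ibm_bits(value, width=64):
    """IBM (MSB-0) positions of the set bits in value, in ascending order."""
    return [width - 1 - i for i in range(width - 1, -1, -1) if (value >> i) & 1]


# Values copied from the two dumps in this message.
refresh_on = {
    "APB_ERROR_STATUS0": 0x0000,
    "RC_ERROR_STATUS0": 0x1000,
    "SEQ_ERROR_STATUS0": 0x0000,
    "WC_ERROR_STATUS0": 0x0000,
    "PC_ERROR_STATUS0": 0x0400,
    "PC_INIT_CAL_ERROR": 0x2008,
    "DDRPHY_PC_INIT_CAL_STATUS": 0x0688,
    "IOM_PHY0_DDRPHY_FIR_REG": 0x0080,
}
after_config1_change = {
    "APB_ERROR_STATUS0": 0x0000,
    "RC_ERROR_STATUS0": 0x1000,
    "SEQ_ERROR_STATUS0": 0x0000,
    "WC_ERROR_STATUS0": 0x0000,
    "PC_ERROR_STATUS0": 0x0000,
    "PC_INIT_CAL_ERROR": 0x0000,
    "DDRPHY_PC_INIT_CAL_STATUS": 0x0608,
    "IOM_PHY0_DDRPHY_FIR_REG": 0x0000,
}

# Map each changed register to (old value, new value, changed bit positions).
changed = {
    name: (old, after_config1_change[name], ibm_bits(old ^ after_config1_change[name]))
    for name, old in refresh_on.items()
    if old != after_config1_change[name]
}

for name, (old, new, bits) in changed.items():
    print("%-26s %#06x -> %#06x, changed bits %s" % (name, old, new, bits))
```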
> On 06.04.2021 12:28, Krystian Hebel wrote:
>
>         Hi Daniel,
>
>         Thanks for quick and informative response.
>
>                 I got these answers from one of our memory experts.
>
>                 Hi Krystian,
>                     1. IBM mostly uses x4 DIMMs. Is it possible to
>                     run with an x4 DIMM for debug purposes to see if
>                     the problem persists? This will help debug
>                     configuration issues with the x8 DIMMs.
>
>         This may be difficult due to remote work, but I'll see what
>         can be done.
>
>                     2. Have you tried disabling refresh to see if the
>                     issues go away?
>
>         Is it enough to just modify DDRPHY_PC_INIT_CAL_CONFIG1_P0? If
>         yes, I changed all of REFRESH_COUNT, REFRESH_CONTROL and
>         REFRESH_ALL_RANKS to all 0's and REFRESH_INTERVAL to all 1's.
>         It still fails the same way, but a few microseconds faster
>         than before.
>
>                     3. For calibration fails (which it looks like you
>                     are experiencing), I would recommend dumping the
>                     following registers for rank 0
>                     DQS disable bits
>                     0x8000007d0701103f
>                     0x8000047d0701103f
>                     0x8000087d0701103f
>                     0x80000c7d0701103f
>                     0x8000107d0701103f
>
>                     DQ disable bits
>                     0x8000007c0701103f
>                     0x8000047c0701103f
>                     0x8000087c0701103f
>                     0x80000c7c0701103f
>                     0x8000107c0701103f
>
>                     If calibration is passing on a given DRAM, all of
>                     the bits should be 0's. Fails are noted by 1's in
>                     the register. As with all PHY registers, only the
>                     rightmost 16 bits matter.
>
>         Here I can see some fails: all DQ bits on the first and second
>         DP16 and all configured DQS bits (0xc300 for the first and 0x3c00
>         for the second, which is consistent with the settings from [1]).
>         The rest of the DP16s pass. This DIMM works with Hostboot, so I
>         think the clock bits are selected properly.
>
>         I hadn't realized that these are updated by hardware and
>         then used as an input for the next steps. Now I know that what I
>         thought was a successful write leveling was actually skipping
>         bad bits. I was misled by the fact that the second attempt
>         took more time than the first one, but it makes sense, as it
>         starts from a higher initial delay and has a longer way to go
>         down and up again, if I understand this step correctly.
>
>         I went a step further and dumped all WR_DELAY_VALUE_x_RP0_REG
>         registers. For passing bits the value is somewhere in the range
>         0x1900-0x2b00, and every set of 8 DQ bits and its accompanying
>         DQS bit have the same value, which I believe is expected for x8
>         memory. For failing bits the value is always 0x3a00 for DQ bits
>         (and for whatever is in DELAY_VALUE_16-22 that isn't configured
>         as a DQS), but 0x4200 for DQS bits. Unlike on the passing DP16s,
>         these values don't change between boots. They can change
>         slightly when I modify DDRPHY_WC_CONFIG1_P0, but still no pass.
>
>                     4. To my knowledge, there should not be an issue
>                     sending the RCW commands via i2c.
>                     5. Running in our test environment, I am seeing
>                     the following scoms for DQS align:
>                         CRONUSDEBUG(30807) : PUTSCOM : p9n.mcbist:k0:n0:s0:p01:c1 : 070123A5 4000000000000000 # Stop CCS
>                         CRONUSDEBUG(30818) : PUTSCOM : p9n.mcbist:k0:n0:s0:p01:c1 : 07012315 000000F0CC0000C0 # Configure init calibration
>                         CRONUSDEBUG(30823) : PUTSCOM : p9n.mcbist:k0:n0:s0:p01:c1 : 07012335 0000000000000041 # Go to instruction 1
>                         CRONUSDEBUG(30826) : PUTSCOM : p9n.mcbist:k0:n0:s0:p01:c1 : 07012316 000008F0CC000000 # don't do anything
>                         CRONUSDEBUG(30831) : PUTSCOM : p9n.mcbist:k0:n0:s0:p01:c1 : 07012336 0000000000000020 # End CCS
>                         CRONUSDEBUG(30839) : PUTSCOM : p9n.mcbist:k0:n0:s0:p01:c1 : 070123DB 0400000000000000 # Configure the port to run
>                         CRONUSDEBUG(30848) : PUTSCOM : p9n.mcbist:k0:n0:s0:p01:c1 : 070123A5 8000000000000000 # Kick off CCS
>
>                         I hope that this trace helps.
>
>         So, DDR_CAL_RANK in ARR1 is a number, and not a bit map of
>         selected ranks? That was my initial understanding, but then I
>         changed the code to treat it as a bit map. Still, fixing the
>         code doesn't help, even though now it is identical to the
>         trace above.
>
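
The DQ and DQS disable registers listed in item 3 follow a regular pattern: within each group, consecutive addresses differ by 0x04 in the byte at bit offset 40. The sketch below (Python) regenerates both lists and applies the "rightmost 16 bits" rule to a raw value. The stride and the helper names are inferred only from the five addresses quoted in this thread, not from the register specification, so they should be treated as assumptions.

```python
DQ_DISABLE_BASE = 0x8000007C0701103F   # first "DQ disable bits" address above
DQS_DISABLE_BASE = 0x8000007D0701103F  # first "DQS disable bits" address above
DP16_STRIDE = 0x04 << 40               # assumed step between DP16 instances
DP16_COUNT = 5                         # five addresses are listed per group


def disable_reg_addrs(base):
    """Addresses of one disable register across all listed DP16 instances."""
    return [base + n * DP16_STRIDE for n in range(DP16_COUNT)]


def disable_bits(raw):
    """Only the rightmost 16 bits of a PHY register carry data."""
    return raw & 0xFFFF


def has_fails(raw):
    """A nonzero disable field means at least one bit failed calibration."""
    return disable_bits(raw) != 0


for label, base in (("DQS", DQS_DISABLE_BASE), ("DQ", DQ_DISABLE_BASE)):
    for n, addr in enumerate(disable_reg_addrs(base)):
        print("%-3s disable, DP16 %d: 0x%016x" % (label, n, addr))

# Values reported in the reply above for the DQS bits of the first two DP16s.
for n, raw in enumerate((0xC300, 0x3C00)):
    print("DP16 %d DQS disable = %#06x, fails: %s" % (n, disable_bits(raw), has_fails(raw)))
```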
>         [1]
>         https://git.raptorcs.com/git/talos-hostboot/tree/src/import/chips/p9/procedures/hwp/memory/lib/phy/dp16.C#n1963
>         -- 
>         Krystian Hebel
>         Firmware Engineer
>         https://3mdeb.com | @3mdeb_com
>
> -- 
> Krystian Hebel
> Firmware Engineer
> https://3mdeb.com | @3mdeb_com
>
>
-- 
Krystian Hebel
Firmware Engineer
https://3mdeb.com | @3mdeb_com
