[OpenPower-Firmware] Problem with CCS

Wed Apr 7 00:00:06 AEST 2021

Have you attempted to get a complete scom trace from the original Hostboot
code and compared it to the trace from your new code?  That is a pretty
typical debug strategy on our side when migrating from the initial hardware
bringup scripts into the firmware implementation.

--
Dan Crowell
Senior Software Engineer - Power Systems Enablement Firmware
IBM Rochester: t/l 553-2987
dcrowell at us.ibm.com

From:	Krystian Hebel <krystian.hebel at 3mdeb.com>
To:	Daniel M Crowell <dcrowell at us.ibm.com>
Cc:	firmware at 3mdeb.com, openpower-firmware at lists.ozlabs.org
Date:	04/06/2021 07:45 AM
Subject:	[EXTERNAL] Re: [OpenPower-Firmware] Problem with CCS

Update: I have dealt with the write leveling issue. I accidentally shifted
a bit twice when trying to set PAR_A17_MASK in SEQ_CONTROL0, so it was left
unmasked.

Now I'm back to the initial issue with the loop in CCS. This time, however,
I see a difference between the original code (refresh on):

    0x0000000000000000 - APB_ERROR_STATUS0
    0x0000000000001000 - RC_ERROR_STATUS0
    0x0000000000000000 - SEQ_ERROR_STATUS0
    0x0000000000000000 - WC_ERROR_STATUS0
    0x0000000000000400 - PC_ERROR_STATUS0
    0x0000000000002008 - PC_INIT_CAL_ERROR
    0x0000000000000688 - DDRPHY_PC_INIT_CAL_STATUS
    0x0000000000000080 - IOM_PHY0_DDRPHY_FIR_REG

and after setting DDRPHY_PC_INIT_CAL_CONFIG1_P0 as in the previous mail:

    0x0000000000000000 - APB_ERROR_STATUS0
    0x0000000000001000 - RC_ERROR_STATUS0
    0x0000000000000000 - SEQ_ERROR_STATUS0
    0x0000000000000000 - WC_ERROR_STATUS0
    0x0000000000000000 - PC_ERROR_STATUS0
    0x0000000000000000 - PC_INIT_CAL_ERROR
    0x0000000000000608 - DDRPHY_PC_INIT_CAL_STATUS
    0x0000000000000000 - IOM_PHY0_DDRPHY_FIR_REG

PC_INIT_CAL_ERROR no longer reports an error, but DDRPHY_PC_INIT_CAL_STATUS
still doesn't report success. No DQ/DQS bits are disabled, either with or
without refresh.
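
To make the difference between the two dumps above easier to see, here is a
minimal Python sketch that parses both listings and prints only the
registers whose values changed, together with the XOR of the old and new
value. The helper names (parse_dump, diff_dumps) and the variable names are
made up for illustration; the register names and values are copied from the
dumps above.

```python
def parse_dump(text):
    """Parse lines of the form '0x<hex> - <NAME>' into a {NAME: int} dict."""
    regs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        value, name = line.split(" - ", 1)
        regs[name.strip()] = int(value, 16)
    return regs


def diff_dumps(before, after):
    """Return {NAME: (old, new, old ^ new)} for registers that differ."""
    changed = {}
    for name, old in before.items():
        new = after.get(name)
        if new is not None and new != old:
            changed[name] = (old, new, old ^ new)
    return changed


# Dump taken with the original code (refresh on).
ORIGINAL = """
0x0000000000000000 - APB_ERROR_STATUS0
0x0000000000001000 - RC_ERROR_STATUS0
0x0000000000000000 - SEQ_ERROR_STATUS0
0x0000000000000000 - WC_ERROR_STATUS0
0x0000000000000400 - PC_ERROR_STATUS0
0x0000000000002008 - PC_INIT_CAL_ERROR
0x0000000000000688 - DDRPHY_PC_INIT_CAL_STATUS
0x0000000000000080 - IOM_PHY0_DDRPHY_FIR_REG
"""

# Dump taken after changing DDRPHY_PC_INIT_CAL_CONFIG1_P0.
MODIFIED = """
0x0000000000000000 - APB_ERROR_STATUS0
0x0000000000001000 - RC_ERROR_STATUS0
0x0000000000000000 - SEQ_ERROR_STATUS0
0x0000000000000000 - WC_ERROR_STATUS0
0x0000000000000000 - PC_ERROR_STATUS0
0x0000000000000000 - PC_INIT_CAL_ERROR
0x0000000000000608 - DDRPHY_PC_INIT_CAL_STATUS
0x0000000000000000 - IOM_PHY0_DDRPHY_FIR_REG
"""

changes = diff_dumps(parse_dump(ORIGINAL), parse_dump(MODIFIED))
for name, (old, new, xor) in changes.items():
    print(f"{name}: {old:#018x} -> {new:#018x} (changed bits {xor:#x})")
```

The sketch only reports which bits differ; what each bit means still has to
be looked up in the register documentation.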

On 06.04.2021 12:28, Krystian Hebel wrote:

      Hi Daniel,

      Thanks for the quick and informative response.
            I got these answers from one of our memory experts.

            Hi Krystian,
               1.	IBM mostly uses x4 DIMMs. Is it possible to run with a
                  x4 DIMM for debug purposes to see if the problem
                  persists? This will help debug configuration issues with
                  the x8 DIMMs
      This may be difficult due to remote work, but I'll see what can be
      done.
               2.	Have you tried disabling refresh to see if the issues go
                  away?
      Is it enough to just modify DDRPHY_PC_INIT_CAL_CONFIG1_P0? If yes, I
      changed all of REFRESH_COUNT, REFRESH_CONTROL and REFRESH_ALL_RANKS
      to all 0's and REFRESH_INTERVAL to all 1's. It still fails the same
      way, but a few microseconds faster than before.
               3.	For calibration fails (which it looks like you are
                  experiencing), I would recommend dumping the following
                  registers for rank 0
                  DQS disable bits
                  0x8000007d0701103f
                  0x8000047d0701103f
                  0x8000087d0701103f
                  0x80000c7d0701103f
                  0x8000107d0701103f

                  DQ disable bits
                  0x8000007c0701103f
                  0x8000047c0701103f
                  0x8000087c0701103f
                  0x80000c7c0701103f
                  0x8000107c0701103f

                  If calibration is passing on a given DRAM, all of the
                  bits should be 0's. Fails are noted by 1's in the
                  register. As with all PHY registers, only the rightmost
                  16 bits matter.

      Here I can see some fails: all DQ bits on the first and second DP16
      and all configured DQS bits (0xc300 for the first and 0x3c00 for the
      second, which is consistent with the settings from [1]). The rest of
      the DP16s pass. This DIMM works with Hostboot, so I think the clock
      bits are selected properly.
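
As a rough illustration of reading these disable registers, the Python
sketch below keeps only the rightmost 16 bits of a raw scom value and lists
which bits are set, i.e. which bits are marked as failed. The function name
failing_bits is made up, and numbering the bits 0..15 starting from the
most significant bit of the 16-bit field is an assumption made here for
readability, not something stated in this thread.

```python
def failing_bits(raw, width=16):
    """List the set bits in the rightmost `width` bits of a scom value.

    Bits are numbered 0..width-1 starting from the most significant bit of
    the field (an assumption; adjust if your register docs number them
    differently). A set bit means the corresponding bit is disabled/failed.
    """
    field = raw & ((1 << width) - 1)
    return [i for i in range(width) if field & (1 << (width - 1 - i))]


# DQS disable values reported for the first and second DP16.
for value in (0xC300, 0x3C00, 0x0000):
    print(f"{value:#06x}: failing bits {failing_bits(value)}")
```

An all-zero field gives an empty list, which corresponds to the passing
case described in item 3.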

      I hadn't realized that these are updated by the hardware and then
      used as an input for the next steps. Now I know that what I thought
      was a successful write leveling was actually skipping bad bits. I was
      misled by the fact that the second attempt took more time than the
      first one, but it makes sense, as it starts from a higher initial
      delay and has a longer way to go down and up again, if I understand
      this step correctly.

      I went a step further and dumped all WR_DELAY_VALUE_x_RP0_REG. For
      passed bits it is somewhere in the range 0x1900-0x2b00, where every
      set of 8 DQ bits and its accompanying DQS bit have the same value,
      which I believe is expected for x8 memory. For failed bits this value
      is always 0x3a00 for DQ bits (and whatever is in DELAY_VALUE_16-22
      which isn't configured as a DQS), but 0x4200 for DQS bits. Unlike on
      the passing DP16s, these values don't change between boots. They can
      change slightly when I modify DDRPHY_WC_CONFIG1_P0, but there is
      still no pass.
               4.	To my knowledge, there should not be an issue sending the
                  RCW commands via i2c.
               5.	Running in our test environment, I am seeing the
                  following scoms for DQS align:
                  CRONUSDEBUG(30807) : PUTSCOM   :
                  p9n.mcbist:k0:n0:s0:p01:c1 : 070123A5
                  4000000000000000 # Stop CCS
                  CRONUSDEBUG(30818) : PUTSCOM   :
                  p9n.mcbist:k0:n0:s0:p01:c1 : 07012315
                  000000F0CC0000C0 # Configure init calibration
                  CRONUSDEBUG(30823) : PUTSCOM   :
                  p9n.mcbist:k0:n0:s0:p01:c1 : 07012335
                  0000000000000041 # Go to instruction 1
                  CRONUSDEBUG(30826) : PUTSCOM   :
                  p9n.mcbist:k0:n0:s0:p01:c1 : 07012316
                  000008F0CC000000 # don't do anything
                  CRONUSDEBUG(30831) : PUTSCOM   :
                  p9n.mcbist:k0:n0:s0:p01:c1 : 07012336
                  0000000000000020 # End CCS
                  CRONUSDEBUG(30839) : PUTSCOM   :
                  p9n.mcbist:k0:n0:s0:p01:c1 : 070123DB
                  0400000000000000 # Configure the port to run
                  CRONUSDEBUG(30848) : PUTSCOM   :
                  p9n.mcbist:k0:n0:s0:p01:c1 : 070123A5
                  8000000000000000 # Kick off CCS

                  I hope that this trace helps.

      So, DDR_CAL_RANK in ARR1 is a number, and not a bit map of selected
      ranks? That was my initial understanding, but then I changed the code
      to treat it as a bit map. Still, fixing the code doesn't help, even
      though now it is identical to the trace above.
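
For comparing such a trace against what the new code writes, a small
Python sketch along these lines may help. It extracts (address, data) pairs
from the CRONUSDEBUG PUTSCOM lines quoted above, tolerating the line
wrapping introduced by the mail client, and reports the first position
where two sequences differ. The function names and the idea of logging
your own writes as a list of pairs are assumptions for illustration only.

```python
import re

TRACE = """
CRONUSDEBUG(30807) : PUTSCOM   :
p9n.mcbist:k0:n0:s0:p01:c1 : 070123A5
4000000000000000 # Stop CCS
CRONUSDEBUG(30818) : PUTSCOM   :
p9n.mcbist:k0:n0:s0:p01:c1 : 07012315
000000F0CC0000C0 # Configure init calibration
CRONUSDEBUG(30823) : PUTSCOM   :
p9n.mcbist:k0:n0:s0:p01:c1 : 07012335
0000000000000041 # Go to instruction 1
CRONUSDEBUG(30826) : PUTSCOM   :
p9n.mcbist:k0:n0:s0:p01:c1 : 07012316
000008F0CC000000 # don't do anything
CRONUSDEBUG(30831) : PUTSCOM   :
p9n.mcbist:k0:n0:s0:p01:c1 : 07012336
0000000000000020 # End CCS
CRONUSDEBUG(30839) : PUTSCOM   :
p9n.mcbist:k0:n0:s0:p01:c1 : 070123DB
0400000000000000 # Configure the port to run
CRONUSDEBUG(30848) : PUTSCOM   :
p9n.mcbist:k0:n0:s0:p01:c1 : 070123A5
8000000000000000 # Kick off CCS
"""

PUTSCOM_RE = re.compile(r"PUTSCOM : \S+ : ([0-9A-Fa-f]+) ([0-9A-Fa-f]{16})")


def parse_putscoms(text):
    """Return [(address, data), ...] for every PUTSCOM in the trace."""
    # Collapse all whitespace so wrapped lines become one long line.
    flat = " ".join(text.split())
    return [(int(a, 16), int(d, 16)) for a, d in PUTSCOM_RE.findall(flat)]


def first_mismatch(expected, actual):
    """Return the index of the first differing write, or None if equal."""
    for i, (exp, act) in enumerate(zip(expected, actual)):
        if exp != act:
            return i
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


reference = parse_putscoms(TRACE)
for addr, data in reference:
    print(f"{addr:08X} {data:016X}")

# `mine` stands in for the writes logged by the new firmware (hypothetical).
mine = list(reference)
print("first mismatch:", first_mismatch(reference, mine))
```

This only compares the writes themselves; read-modify-write sequences or
polling reads in a full trace would need their own handling.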

      [1]
      https://git.raptorcs.com/git/talos-hostboot/tree/src/import/chips/p9/procedures/hwp/memory/lib/phy/dp16.C#n1963
      --
      Krystian Hebel
      Firmware Engineer
      https://3mdeb.com | @3mdeb_com
--
Krystian Hebel
Firmware Engineer
https://3mdeb.com | @3mdeb_com
