<div class="socmaildefaultfont" dir="ltr" style="font-family:Arial, Helvetica, sans-serif;font-size:10pt" ><div dir="ltr" >I got these answers from one of our memory experts.</div>

<div dir="ltr" > </div>

<div dir="ltr" ><div dir="ltr" >Hi <font size="2" >Krystian,</font></div>

<ol dir="ltr" >        <li>IBM mostly uses x4 DIMMs. Is it possible to run with an x4 DIMM for debug purposes to see if the problem persists? This will help debug configuration issues with the x8 DIMMs.</li>        <li>Have you tried disabling refresh to see if the issues go away?</li>        <li>For calibration fails (which it looks like you are experiencing), I would recommend dumping the following registers for rank 0:<br>        DQS disable bits<br>        0x8000007d0701103f<br>        0x8000047d0701103f<br>        0x8000087d0701103f<br>        0x80000c7d0701103f<br>        0x8000107d0701103f<br>        <br>        DQ disable bits<br>        0x8000007c0701103f<br>        0x8000047c0701103f<br>        0x8000087c0701103f<br>        0x80000c7c0701103f<br>        0x8000107c0701103f<br>        <br>        If calibration is passing on a given DRAM, all of the bits should be 0s. Fails are noted by 1s in the register. As with all PHY registers, only the rightmost 16 bits matter.</li>        <li>To my knowledge, there should not be an issue sending the RCW commands via I2C.</li>        <li>Running in our test environment, I am seeing the following SCOMs for DQS align:

        <div>        <div>CRONUSDEBUG(30807) : PUTSCOM   : p9n.mcbist:k0:n0:s0:p01:c1 : 070123A5             4000000000000000 # Stop CCS<br>        CRONUSDEBUG(30818) : PUTSCOM   : p9n.mcbist:k0:n0:s0:p01:c1 : 07012315             000000F0CC0000C0 # Configure init calibration<br>        CRONUSDEBUG(30823) : PUTSCOM   : p9n.mcbist:k0:n0:s0:p01:c1 : 07012335             0000000000000041 # Go to instruction 1<br>        CRONUSDEBUG(30826) : PUTSCOM   : p9n.mcbist:k0:n0:s0:p01:c1 : 07012316             000008F0CC000000 # don't do anything<br>        CRONUSDEBUG(30831) : PUTSCOM   : p9n.mcbist:k0:n0:s0:p01:c1 : 07012336             0000000000000020 # End CCS<br>        CRONUSDEBUG(30839) : PUTSCOM   : p9n.mcbist:k0:n0:s0:p01:c1 : 070123DB             0400000000000000 # Configure the port to run<br>        CRONUSDEBUG(30848) : PUTSCOM   : p9n.mcbist:k0:n0:s0:p01:c1 : 070123A5             8000000000000000 # Kick off CCS<br>        <br>        I hope that this trace helps.</div>        </div>        </li></ol></div>
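<div dir="ltr" >For the register dump suggested in item 3, a small script can make the pass/fail reading less error-prone. The following is only a minimal sketch in Python, not part of Cronus or Hostboot: the helper names <code>failing_bits</code> and <code>summarize</code> and the sample readings are made up for illustration, and reporting positions as bits 48 to 63 assumes the usual IBM convention of numbering bit 0 as the most significant bit of the 64-bit SCOM value. It keeps only the rightmost 16 bits of each value and lists which of those bits are set.</div>
<pre>
```python
# Hypothetical helpers for reading DQ/DQS disable-bit dumps.
# Not part of Cronus or Hostboot; names are illustrative only.

DISABLE_BIT_WIDTH = 16  # per the reply: only the rightmost 16 bits matter
# Assumption: IBM bit numbering, where bit 0 is the MSB of the 64-bit value,
# so the rightmost 16 bits are bits 48..63.
IBM_FIRST_BIT = 64 - DISABLE_BIT_WIDTH


def failing_bits(scom_value):
    """Return the IBM-style bit numbers (48..63) set in the low 16 bits."""
    low16 = scom_value % (2 ** DISABLE_BIT_WIDTH)  # drop the upper 48 bits
    bits = format(low16, "016b")  # leftmost character corresponds to bit 48
    return [IBM_FIRST_BIT + i for i, b in enumerate(bits) if b == "1"]


def summarize(dump):
    """Map each register address to its failing bits; empty list means pass."""
    return {addr: failing_bits(value) for addr, value in dump.items()}


# Made-up readings, for illustration only (not real hardware output).
example_dump = {
    0x8000007D0701103F: 0x0000000000000000,
    0x8000047D0701103F: 0x000000000000FF00,
}

for addr, fails in summarize(example_dump).items():
    status = "pass" if not fails else "fail at bits " + str(fails)
    print(format(addr, "#018x"), status)
```
</pre>
<div dir="ltr" >An empty list means no disable bits are set for that register. How a bit position maps to a physical DQ or DQS lane is not covered here and would need to come from the PHY documentation.</div>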

<div dir="ltr" ><br>--<br>Dan Crowell<br>Senior Software Engineer - Power Systems Enablement Firmware<br>IBM Rochester: t/l 553-2987<br>dcrowell@us.ibm.com</div>

<div dir="ltr" > </div>

<div dir="ltr" > </div>

<blockquote data-history-content-modified="1" dir="ltr" style="border-left:solid #aaaaaa 2px; margin-left:5px; padding-left:5px; direction:ltr; margin-right:0px" >----- Original message -----<br>From: Krystian Hebel &lt;krystian.hebel@3mdeb.com&gt;<br>Sent by: "OpenPower-Firmware" &lt;openpower-firmware-bounces+dcrowell=us.ibm.com@lists.ozlabs.org&gt;<br>To: openpower-firmware@lists.ozlabs.org<br>Cc: firmware@3mdeb.com<br>Subject: [EXTERNAL] [OpenPower-Firmware] Problem with CCS<br>Date: Thu, Apr 1, 2021 8:30 AM<br> 

<div><font size="2" face="Default Monospace,Courier New,Courier,monospace" >Hello,<br><br>
I am currently working on an implementation of memory training in coreboot's port for Talos II (POWER9). I'm stuck at DQS alignment, due to what I believe is a CCS problem.<br><br>
During this step CCS goes crazy: the first command gets written in the place of the last one, minus the GOTO_CMD field, which is zeroed. This results in an infinite loop followed by a timeout. It works correctly for previous operations - MRS loading, ZQ calibration, write leveling and initial pattern write. For loading RCD control words I'm using I2C instead of CCS - unlike MR{0-6}, I haven't seen its state being mirrored in MC registers, so I think this is acceptable; please correct me if I'm wrong.<br><br>
This is an example of a working command (initial pattern write):<br><br>
     Sending PHY calibration command 0x4000 to CCS - 1 instruction(s)<br>
     0Last ARR0 (1) = 0x000008f0cc000000<br>
     0Last ARR1 (1) = 0x0000000000000020, 38 us timeout<br>
     1Last ARR0 (1) = 0x000008f0cc000000<br>
     1Last ARR1 (1) = 0x0000000000000020<br>
     2Last ARR1 (1) = 0x0000000000000020, took 5 us<br><br>
and this is for DQS alignment:<br><br>
     Sending PHY calibration command 0x2000 to CCS - 1 instruction(s)<br>
     0Last ARR0 (1) = 0x000008f0cc000000<br>
     0Last ARR1 (1) = 0x0000000000000020, 40 us timeout<br>
     1Last ARR0 (1) = 0x000000f0cc0000c0<br>
     1Last ARR1 (1) = 0x0000000000000440<br><br>
'0Last' is just before setting CCS_CNTLQ_CCS_START, '1Last' is after the program succeeds or times out, '2Last' is mostly just to print out the time elapsed. The number in brackets is the index of the last instruction. Full code can be found at [1], mostly in files 'ccs.c' and 'istep_13_11.c'.<br><br>
For even more info, I also read those registers right after setting CCS_START and between the initial delay and polling. Those lines are removed from the code as they heavily impacted time calculation. For DQS alignment the bad values were present immediately after CCS_START and held until the end, at least at the points where they were read. What I find surprising is that just after CCS_START the values also change for working CCS programs, but then they return to normal in reads after the initial timeout.<br><br>
I also dumped error/status registers, which in most cases report no errors (except for write leveling, which has to be run twice to complete successfully, but that is another issue). For DQS alignment, APB, SEQ and WC, as well as all DP16 status registers, are all zeroes. This is a list of registers which have any of the bits set:<br><br>
     0x0000000000001000 - RC_ERROR_STATUS0<br>
     0x0000000000000400 - PC_ERROR_STATUS0<br>
     0x0000000000002008 - PC_INIT_CAL_ERROR<br>
     0x0000000000000688 - DDRPHY_PC_INIT_CAL_STATUS<br>
     0x0000000000000080 - IOM_PHY0_DDRPHY_FIR_REG<br><br>
Values of INIT_CAL_STATUS and RC_ERROR_STATUS say there was an overflow of the refresh pending counter, but whether it is a cause or a result of the CCS error is beyond my current knowledge.<br><br>
This is what I've tried, without success:<br>
- playing with the settings in PC_INIT_CAL_CONFIG1: halving and zeroing REFRESH_COUNT and changing REFRESH_CONTROL between non-reserved values; I haven't touched REFRESH_ALL_RANKS because I'm testing it on just one 1R x8 DIMM anyway, so it shouldn't make a difference<br>
- manually sending REF commands before calibration, both instead of and in addition to those configured in PC_INIT_CAL_CONFIG1<br>
- increasing the timeout for this step - both the initial delay and the duration of polling<br>
- re-running DQS alignment after the error<br>
- sending CCS_STOP and waiting for completion before starting a new program<br>
- adding delays between calibration steps (500 us, much more than 9*tREFI)<br>
- doing initial pattern write and DQS calibration with one CCS instruction<br><br>
What am I missing? Are there any other SCOM registers I can read that would help with debugging?<br><br>
[1] <a href="https://github.com/3mdeb/coreboot/tree/istep_13_11/src/soc/ibm/power9" target="_blank">https://github.com/3mdeb/coreboot/tree/istep_13_11/src/soc/ibm/power9</a> <br><br>
--<br>Krystian Hebel<br>Firmware Engineer<br><a href="https://3mdeb.com" target="_blank">https://3mdeb.com</a>  | @3mdeb_com<br><br>
_______________________________________________<br>OpenPower-Firmware mailing list<br>OpenPower-Firmware@lists.ozlabs.org<br><a href="https://lists.ozlabs.org/listinfo/openpower-firmware" target="_blank">https://lists.ozlabs.org/listinfo/openpower-firmware</a> </font><br> </div></blockquote>

<div dir="ltr" > </div></div><BR>