[Skiboot] [PATCH] opal-prd: Direct systemd to always restart opal-prd

Thu Mar 9 16:16:53 AEDT 2017

On 2017-03-09 10:33, Ananth N Mavinakayanahalli wrote:
> On Thu, Mar 09, 2017 at 09:17:08AM +0530, Vaibhav Jain wrote:
>> Hi Vaidy,
>> 
>> Vaidyanathan Srinivasan <svaidy at linux.vnet.ibm.com> writes:
>> 
>> > There should be a restart limit.  If we actually crash starting
>> > opal-prd, then we will get stuck infinitely trying to start it.
>> >
>> > This could happen if there are events pending and actually the event
>> > action is crashing prd.  We will keep getting the same attention and
>> > not make any progress.
>> >
>> > Does the systemd unit file allow a limited number of restart attempts?
>> 
>> By default systemd rate-limits the number of times a unit is started
>> via the StartLimitBurst and StartLimitIntervalSec options, which
>> default to 5 starts per 10 seconds.
>> 
>> Also, the default RestartSec interval is 100ms. So during an error
>> spike with the default RestartSec, opal-prd may be restarted too
>> quickly and hit the rate limit, thereby missing reported prd errors.
>> 
>> So I would suggest RestartSec=1 and StartLimitIntervalSec=5 so that
>> opal-prd is restarted after an interval of about 1 second without
>> any rate limiting.
> 
> One second is too long a time to not run opal-prd. I would think
> keeping the rate limit as it is, and getting to know that systemd
> stopped trying to restart it, is a better option than to keep trying
> and failing.
> 
> Ananth
> 

I have seen opal-prd crash a couple of times while it was trying to
process bad DIMMs. In that state it gets signal 11, as shown below.

[   13.661462] opal-prd[1290]: unhandled signal 11 at 0000000000000000 nip 00003fff8371fd70 lr 00003fff8371fa98 code 30001

Once it gets signal 11, further attempts to restart the daemon also
fail.

service opal-prd status
● opal-prd.service - OPAL PRD daemon
    Loaded: loaded (/lib/systemd/system/opal-prd.service; enabled; vendor preset: enabled)
    Active: failed (Result: signal) since Tue 2017-03-07 14:24:09 IST; 45s ago
   Process: 1290 ExecStart=/usr/sbin/opal-prd --pnor /dev/mtd0 (code=killed, signal=SEGV)
  Main PID: 1290 (code=killed, signal=SEGV)

Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: FAPI:dimmBadDqBitmapAccessHwp.C: >>dimmBadDqBitmapAccessHwp: Getting bitmap
Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: FAPI:dimmBadDqBitmapAccessHwp.C: <<dimmBadDqBitmapAccessHwp
Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: FAPI:dimmBadDqBitmapFuncs.C: <<dimmGetBadDqBitmap
Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: FAPI:dimmBadDqBitmapFuncs.C: >>dimmSetBadDqBitmap. centaur.mba     k0:n0:s0:p01:c1    :0:0:1
Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: FAPI:dimmBadDqBitmapAccessHwp.C: >>dimmBadDqBitmapAccessHwp: Getting bitmap
Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: FAPI:dimmBadDqBitmapAccessHwp.C: <<dimmBadDqBitmapAccessHwp
Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: FAPI:dimmBadDqBitmapAccessHwp.C: >>dimmBadDqBitmapAccessHwp: Setting bitmap
Mar 07 14:24:09 ltc-test-hab02 systemd[1]: opal-prd.service: Main process exited, code=killed, status=11/SEGV
Mar 07 14:24:09 ltc-test-hab02 systemd[1]: opal-prd.service: Unit entered failed state.
Mar 07 14:24:09 ltc-test-hab02 systemd[1]: opal-prd.service: Failed with result 'signal'.

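So rate limiting to 5 start attempts (which is the default) is a good
idea: a daemon that crashes on the same event every time would
otherwise be restarted indefinitely.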
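
For reference, a rough sketch of how the settings discussed in this
thread could be spelled out in a unit file. This is only an
illustration, not the actual patch: Restart=always is assumed from the
patch subject, the drop-in path is hypothetical, and the values simply
restate the defaults quoted above (5 starts per 10 seconds, 100ms
between restarts). It also assumes a systemd version that accepts
StartLimitIntervalSec= and StartLimitBurst= in the [Unit] section;
older releases use StartLimitInterval= in [Service] instead.

```ini
# Hypothetical drop-in, for example:
#   /etc/systemd/system/opal-prd.service.d/restart.conf

[Unit]
# Allow at most 5 starts within any 10 second window (the defaults
# mentioned above). Once the limit is hit, systemd stops trying and
# the unit stays in the failed state.
StartLimitIntervalSec=10
StartLimitBurst=5

[Service]
# Assumed from the patch subject: restart the daemon however it exits.
Restart=always
# Delay between an exit and the next start attempt (default is 100ms).
RestartSec=100ms
```

With these values a daemon that crashes immediately on every start
would use up its 5 attempts in well under a second and then stay
failed, which "systemctl status opal-prd" would show.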

Thanks
Pridhiviraj
