[Skiboot] [PATCH] opal-prd: Direct systemd to always restart opal-prd
ppaidipe
ppaidipe at linux.vnet.ibm.com
Thu Mar 9 16:16:53 AEDT 2017
On 2017-03-09 10:33, Ananth N Mavinakayanahalli wrote:
> On Thu, Mar 09, 2017 at 09:17:08AM +0530, Vaibhav Jain wrote:
>> Hi Vaidy,
>>
>> Vaidyanathan Srinivasan <svaidy at linux.vnet.ibm.com> writes:
>>
>> > There should be a restart limit. If we actually crash starting
>> > opal-prd, then we will get stuck infinitely trying to start it.
>> >
>> > This could happen if there are events pending and actually the event
>> > action is crashing prd. We will keep getting the same attention and
>> > not make any progress.
>> >
>> > Does the systemd unit file allow a limited number of restart attempts?
>>
>> By default systemd rate-limits how often a unit is started via the
>> StartLimitBurst and StartLimitIntervalSec options, which default to
>> 5 starts per 10 seconds.
>>
>> Also, the default RestartSec interval is 100ms. So during an error
>> spike with the default RestartSec, opal-prd may be restarted too
>> quickly, hit the rate limit, and miss reported prd errors.
>>
>> So I would suggest RestartSec=1 and StartLimitIntervalSec=5 so that
>> opal-prd is restarted after an interval of about 1 second without
>> any rate limiting.
>
> One second is too long a time to not run opal-prd. I would think
> rate-limiting as it is, and getting to know systemd stopped trying to
> restart it is a better option than to keep trying and failing.
>
> Ananth
>
I have seen opal-prd crash a couple of times while it was trying to
process bad DIMMs. In that state it gets the signal 11 shown below.
[ 13.661462] opal-prd[1290]: unhandled signal 11 at 0000000000000000 nip 00003fff8371fd70 lr 00003fff8371fa98 code 30001
Once it gets signal 11, subsequent restarts of the daemon fail to
start it.
service opal-prd status
● opal-prd.service - OPAL PRD daemon
Loaded: loaded (/lib/systemd/system/opal-prd.service; enabled; vendor preset: enabled)
Active: failed (Result: signal) since Tue 2017-03-07 14:24:09 IST; 45s ago
Process: 1290 ExecStart=/usr/sbin/opal-prd --pnor /dev/mtd0 (code=killed, signal=SEGV)
Main PID: 1290 (code=killed, signal=SEGV)
Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: FAPI:dimmBadDqBitmapAccessHwp.C: >>dimmBadDqBitmapAccessHwp: Getting bitmap
Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: FAPI:dimmBadDqBitmapAccessHwp.C: <<dimmBadDqBitmapAccessHwp
Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: FAPI:dimmBadDqBitmapFuncs.C: <<dimmGetBadDqBitmap
Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: FAPI:dimmBadDqBitmapFuncs.C: >>dimmSetBadDqBitmap. centaur.mba k0:n0:s0:p01:c1 :0:0:1
Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: FAPI:dimmBadDqBitmapAccessHwp.C: >>dimmBadDqBitmapAccessHwp: Getting bitmap
Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: FAPI:dimmBadDqBitmapAccessHwp.C: <<dimmBadDqBitmapAccessHwp
Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: FAPI:dimmBadDqBitmapAccessHwp.C: >>dimmBadDqBitmapAccessHwp: Setting bitmap
Mar 07 14:24:09 ltc-test-hab02 systemd[1]: opal-prd.service: Main process exited, code=killed, status=11/SEGV
Mar 07 14:24:09 ltc-test-hab02 systemd[1]: opal-prd.service: Unit entered failed state.
Mar 07 14:24:09 ltc-test-hab02 systemd[1]: opal-prd.service: Failed with result 'signal'.
So rate limiting to 5 start attempts (which is the default) is a good idea.
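
For a rough feel of how the settings discussed in this thread interact,
here is a small Python model. It is only a sketch under assumptions, not
systemd's actual code: it assumes a fixed-window limiter (a new window
opens once more than the interval has passed since the window began) and
a daemon that crashes after a fixed run time. The function and parameter
names (simulate_restarts, run_ms, attempts) are made up for this example.

```python
def simulate_restarts(restart_ms, burst, interval_ms, run_ms=0, attempts=50):
    """Model a daemon that always crashes run_ms after it starts.

    Returns (started, tripped_at): the start times in milliseconds that
    the limiter allowed, and the time at which a start was refused
    (None if no start was refused within `attempts` tries).
    """
    window_begin = None   # start time of the current rate-limit window
    count = 0             # starts counted in the current window
    started = []
    t = 0
    for _ in range(attempts):
        if window_begin is None or t - window_begin > interval_ms:
            # Assumed behaviour: open a fresh window once the old one expired.
            window_begin, count = t, 1
        elif count < burst:
            count += 1
        else:
            return started, t   # limit hit, no further restarts
        started.append(t)
        t += run_ms + restart_ms  # crash, then wait RestartSec
    return started, None


# Defaults quoted in this thread: RestartSec=100ms, 5 starts per 10s.
default_starts, default_tripped = simulate_restarts(100, 5, 10000, run_ms=10)
# Suggested values: RestartSec=1, StartLimitIntervalSec=5.
suggested_starts, suggested_tripped = simulate_restarts(1000, 5, 5000, run_ms=10)

print("defaults: ", len(default_starts), "starts, refused at", default_tripped)
print("suggested:", len(suggested_starts), "starts, refused at", suggested_tripped)
```

In this model the defaults stop restarting after 5 starts in well under
a second, while RestartSec=1 with StartLimitIntervalSec=5 keeps
restarting a crashing daemon without ever hitting the limit. The second
result sits right on the window boundary (with run_ms=0 the model does
trip), so real systemd timing may behave differently.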
Thanks
Pridhiviraj