[PATCH 0/5] cpufreq: cppc: Fix suspend/resume specific races with FIE code

Wed Jun 16 14:57:09 AEST 2021

On 15-06-21, 08:17, Qian Cai wrote:
> On 6/15/2021 3:50 AM, Viresh Kumar wrote:
> > This is a strange place to get the issue from. And this is a new
> > issue.
> 
> Well, it was still the same exercises with CPU online/offline.
> 
> > 
> >> [  488.151939][  T670]  kthread+0x3ac/0x460
> >> [  488.155854][  T670]  ret_from_fork+0x10/0x18
> >> [  488.160120][  T670] Code: 911e8000 aa1303e1 910a0000 941b595b (d4210000)
> >> [  488.166901][  T670] ---[ end trace e637e2d38b2cc087 ]---
> >> [  488.172206][  T670] Kernel panic - not syncing: Oops - BUG: Fatal exception
> >> [  488.179182][  T670] SMP: stopping secondary CPUs
> >> [  489.209347][  T670] SMP: failed to stop secondary CPUs 0-1,10-11,16-17,31
> >> [  489.216128][  T][  T670] Memoryn ]---
> > 
> > Can you give details on what exactly did you try to do, to get this ?
> > Normal boot or something more ?
> 
> Basically, it has the cpufreq driver as CPPC and the governor as
> schedutil. Running a few workloads to get CPU scaling up and down.
> Later, try to offline all CPUs until the last one and then online
> all CPUs.

Hmm, okay.

So I basically have very similar setup with 8 cores (1-policy
per-cpu), the only difference is I don't end up reading the
performance counters, everything else remains same. So I should see
issues now just like you, in case there are any.

Since the insmod/rmmod setup is a bit different, this is what I tried
today for around an hour with CONFIG_DEBUG_LIST and RCU debugging
options.

while true; do
    for i in `seq 1 7`;
    do
        echo 0 > /sys/devices/system/cpu/cpu$i/online;
    done;

    for i in `seq 1 7`;
    do
        echo 1 > /sys/devices/system/cpu/cpu$i/online;
    done;
done

I don't see any crashes, oops or warnings with latest stuff.

> I am hesitate to try this at the moment because this all feel like
> shooting in the dark.

I understand your point and you aren't completely wrong here. It
wasn't completely in dark but since I am unable to reproduce the issue
at my end, I asked for help.

FWIW, I think one of the possible cause of corruption of kthread thing
could have been because of the race in the topology related code. I
already fixed that in my tree yesterday.

> Ideally, you will be able to get access to one
> of those arm64 servers (Huawei, Ampere, TX2, FJ etc) eventually and
> really try the same exercises yourself with those debugging options
> like list debugging and KASAN on. That way you could fix things way
> efficiently.

Yeah, I thought of this work being over and I am not a user of it
normally. I had to enable it for ARM servers and I took help of my
colleagues (Vincent Guittot and Ionela) for testing the same.

I have also asked Vincent to give it a try again.

> I could share you the .config once you are there. Last
> but not least, once you get better narrow down of the issues, I'd
> hope to see someone else familiar with the code there to get review
> of those patches first (feel free to Cc me once you are ready to
> post) before I'll rerun the whole things again. That way we don't
> waste time on each other backing and forth chasing the shadow.

I did send the stuff up for review and this last thing (you reported)
was a different race altogether, so asked for testing without reviews.

Anyway, I am quite sure my tests have covered such issues now. I will
send out patches again soon.

Thanks Qian.

-- 
viresh