instruction storage exception handling

Wed Oct 27 18:47:34 AEDT 2021

Excerpts from Christophe Leroy's message of October 27, 2021 3:51 pm:
> 
> 
> Le 27/10/2021 à 07:25, Nicholas Piggin a écrit :
>> Excerpts from Christophe Leroy's message of October 27, 2021 3:00 pm:
>>>
>>>
>>> Le 27/10/2021 à 06:10, Nicholas Piggin a écrit :
>>>> Excerpts from Jacques de Laval's message of October 26, 2021 6:07 am:
>>>>> Hi,
>>>>>
>>>>> We are trying to upgrade kernel from 5.10 to 5.14.11. We have a Freescale/NXP
>>>>> T1023 SOC with two e5500 cores, and are running in 32-bit mode:
>>>>>
>>>>> 	CONFIG_PPC32=y
>>>>> 	# CONFIG_PPC64 is not set
>>>>>
>>>>> 	#
>>>>> 	# Processor support
>>>>> 	#
>>>>> 	# CONFIG_PPC_BOOK3S_32 is not set
>>>>> 	CONFIG_PPC_85xx=y
>>>>> 	# CONFIG_PPC_8xx is not set
>>>>> 	# CONFIG_40x is not set
>>>>> 	# CONFIG_44x is not set
>>>>> 	CONFIG_GENERIC_CPU=y
>>>>> 	# CONFIG_E5500_CPU is not set
>>>>> 	# CONFIG_E6500_CPU is not set
>>>>> 	CONFIG_E500=y
>>>>> 	CONFIG_PPC_E500MC=y
>>>>> 	CONFIG_PPC_FPU_REGS=y
>>>>> 	CONFIG_PPC_FPU=y
>>>>> 	CONFIG_FSL_EMB_PERFMON=y
>>>>> 	CONFIG_FSL_EMB_PERF_EVENT=y
>>>>> 	CONFIG_FSL_EMB_PERF_EVENT_E500=y
>>>>> 	CONFIG_BOOKE=y
>>>>> 	CONFIG_FSL_BOOKE=y
>>>>> 	CONFIG_PPC_FSL_BOOK3E=y
>>>>> 	CONFIG_PTE_64BIT=y
>>>>> 	CONFIG_PHYS_64BIT=y
>>>>> 	CONFIG_PPC_MMU_NOHASH=y
>>>>> 	CONFIG_PPC_BOOK3E_MMU=y
>>>>> 	# CONFIG_PMU_SYSFS is not set
>>>>> 	CONFIG_SMP=y
>>>>> 	CONFIG_NR_CPUS=2
>>>>> 	CONFIG_PPC_DOORBELL=y
>>>>> 	# end of Processor support
>>>>>
>>>>> We compile using 32-bit Bootlin PPC toolchain:
>>>>>
>>>>> 	powerpc-e500mc glibc bleeding-edge 2020.08-1.
>>>>>
>>>>> When booting, and starting PID 1 we sometimes get a hang. Nothing but our init
>>>>> process is running, and for debugging purposes our init currently looks like
>>>>> this:
>>>>>
>>>>> 	int main(int argc, char *argv[]){
>>>>> 		for (int i = 0; ; i++) {
>>>>> 			FILE *fp = fopen("/dev/kmsg", "w");
>>>>> 			if (fp) {
>>>>> 				fprintf(fp, "%d\n", i);
>>>>> 				fclose(fp);
>>>>> 			}
>>>>> 			sleep(1);
>>>>> 		}
>>>>> 		return 0;
>>>>> 	}
>>>>>
>>>>> When the hangup occur we don't get any output at all from our PID 1.
>>>>> The last output is from the kernel:
>>>>>
>>>>> 	Run /sbin/init as init process
>>>>> 	  with arguments:
>>>>> 	    /sbin/init
>>>>> 	  with environment:
>>>>> 	    HOME=/
>>>>> 	    TERM=linux
>>>>> 	    kgdboc=ttyS0,115200
>>>>>
>>>>> When issuing a backtrace on all active cpus we can see that the kernel is
>>>>> handling an instruction storage exception:
>>>>>
>>>>> 	sysrq: Show backtrace of all active CPUs
>>>>> 	sysrq: CPU0:
>>>>> 	CPU: 0 PID: 1 Comm: init Not tainted 5.14.11 #1
>>>>> 	NIP:  c02aac78 LR: c02aac2c CTR: 00000000
>>>>> 	REGS: c1907d40 TRAP: 0500   Not tainted  (5.14.11)
>>>>> 	MSR:  00029002 <CE,EE,ME>  CR: 82244284  XER: 20000000
>>>>> 	GPR00: 0000000f c1907e20 c1910000 0065a000 000001d0 01100cca c1907e84 0000000c
>>>>> 	GPR08: d39a8000 000001d3 0000000c c1907f10 42244284 00000000 00740514 bfb71670
>>>>> 	GPR16: 007040e6 00701418 b7c1a5f0 00702f18 00000000 bfb71690 0000fff1 b7c1c478
>>>>> 	GPR24: 00708558 00701698 d3994040 00029002 c1907f20 0065a238 00000355 d39a2790
>>>>> 	NIP [c02aac78] handle_mm_fault+0xf8/0x11f0
>>>>> 	LR [c02aac2c] handle_mm_fault+0xac/0x11f0
>>>>> 	Call Trace:
>>>>> 	[c1907e20] [c02aac10] handle_mm_fault+0x90/0x11f0 (unreliable)
>>>>> 	[c1907ec0] [c003078c] ___do_page_fault+0x26c/0x780
>>>>> 	[c1907ef0] [c0030cd4] do_page_fault+0x34/0x100
>>>>> 	[c1907f10] [c0000988] InstructionStorage+0x108/0x120
>>>>> 	--- interrupt: 400 at 0x65a238
>>>>> 	NIP:  0065a238 LR: 0052f26c CTR: 0052f260
>>>>> 	REGS: c1907f20 TRAP: 0400   Not tainted  (5.14.11)
>>>>> 	MSR:  0002d002 <CE,EE,PR,ME>  CR: 42242284  XER: 00000000
>>>>> 	GPR00: b7be9914 bfb71620 b7c203a0 8c008000 0070400d b7c182a0 000b8260 0052f260
>>>>> 	GPR08: 0047d448 0052f260 0000000a 00000003 42242284 00000000 00740514 bfb71670
>>>>> 	GPR16: 007040e6 00701418 b7c1a5f0 00702f18 00000000 bfb71690 0000fff1 b7c1c478
>>>>> 	GPR24: 00708558 00701698 00700000 00000015 b7c1c2b0 00707e20 b7c1b8a8 bfb71660
>>>>> 	NIP [0065a238] 0x65a238
>>>>> 	LR [0052f26c] 0x52f26c
>>>>> 	--- interrupt: 400
>>>>> 	Instruction dump:
>>>>> 	60a500c0 811f0020 57aa6cfa 813f0000 57a30026 809f004c 81080024 7d29e850
>>>>> 	90a1002c 5529a33e 93c10038 7d244a14 <90610034> 7d485215 91210030 41c203dc
>>>>>
>>>>> We have also observed that the CPU is continuously servicing the same interrupt
>>>>> (north of 140k times per sec), it is not deadlocked.
>>>>>
>>>>> We have not yet been able to reproduce this behavior under QEMU system
>>>>> emulation.
>>>>>
>>>>> When bisecting between 5.10 and 5.14.11 we can see that this behavior started
>>>>> with commit a01a3f2ddbcda83e8572787c0ec1dcbeba86915a:
>>>>>
>>>>> 	powerpc: remove arguments from fault handler functions
>>>>
>>>> Thank you for the excellent work to investigate and report this.
>>>>
>>>>>
>>>>> Our best guess that the instruction storage exception is not properly handled
>>>>> and the kernel is never able to recover from the page fault, but we don't
>>>>> really know how to proceed. Does anyone have any suggestions or insights?
>>>>
>>>> Before my patch, zero was passed to the page fault error code, after
>>>> my patch it passes the contents of ESR SPR.
>>>>
>>>> It looks like the BookE instruction access interrupt does not set ESR
>>>> (except for BO interrupts maybe?) so you're getting what was in the ESR
>>>> register from a previous interrupt, and maybe if that was a store then
>>>> access_error won't cause a segfault because is_exec is true so that
>>>> test bails out early, then it might just keep retrying the interrupt.
>>>>
>>>> That could explain why you don't always see the same thing.
>>>>
>>>> Now previous code still saved ESR in regs->esr/dsisr for some reason.
>>>> I can't quite see why that should have been necessary though. Does
> 
> According to the e500 Reference Manual, on ISI:
> 
> BO is set if the instruction fetch caused a byte-ordering exception; 
> otherwise cleared. *All* other defined ESR bits are *cleared*.

You're right. In that case it shouldn't change anything unless there
was a BO fault. I'm not sure what the problem is then. Guessing based
on the NIP and instructions, it looks like it's probably got the correct 
user address that it's storing into vmf on the stack, so it has got past 
the access checks so my theory would be wrong anyway.

Must be something simple but I can't see it yet. 

Thanks,
Nick