From engebret at vnet.ibm.com  Tue May  7 06:31:39 2002
From: engebret at vnet.ibm.com (David Engebretsen)
Date: Mon, 06 May 2002 15:31:39 -0500
Subject: eieio in atomics
Message-ID: <3CD6E82B.F8F5BD97@vnet.ibm.com>

Anton,

I see that you added an eieio in atomic operations in the 2.5 tree.
Can you shed some light on what was discovered here?  I thought these
were not required to imply a memory barrier.  Was an actual problem
hit, and do we need a fix put into the 2.4 code base?

Thanks -
Dave.

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From anton at samba.org  Tue May  7 10:04:11 2002
From: anton at samba.org (Anton Blanchard)
Date: Tue, 7 May 2002 10:04:11 +1000
Subject: eieio in atomics
In-Reply-To: <3CD6E82B.F8F5BD97@vnet.ibm.com>
References: <3CD6E82B.F8F5BD97@vnet.ibm.com>
Message-ID: <20020507000411.GB21495@krispykreme>

Hi Dave,

> I see that you added an eieio in atomic operations in the 2.5 tree.
> Can you shed some light on what was discovered here?  I thought these
> were not required to imply a memory barrier.  Was an actual problem
> hit and do we need a fix put into the 2.4 code base?

I started looking through the memory barriers in the kernel and was
worried about places where atomics were used as reference counts:

	foo->value = 1;
	if (atomic_dec_and_test(&foo->refcnt))
		kfree(foo);

If we do this on two cpus, we need to make sure the first instruction
does not end up moving down past the atomic operation.  The window
would be small; the isync at the end of the atomic would prevent it
from moving any further down, so the window would be the few
instructions between the successful store conditional and the isync.

Having said that, I haven't convinced myself 100% that the race is real.

My current understanding is that atomics and bitops that do not return
values imply no memory barrier, but a memory barrier is implied for
ones that return values.
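[Editor's note: the ordering concern above maps directly onto modern C11 release/acquire semantics. A minimal sketch, assuming a made-up `struct obj`/`obj_put()` (these names are not from the kernel source): the release ordering on the decrement keeps the earlier plain store from sinking below it, which is the job the added barrier does on PPC, and the acquire fence on the last-reference path pairs with releases done by other droppers.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object, for illustration only. */
struct obj {
	int value;
	atomic_int refcnt;
};

/* Drop a reference.  Returns true when the caller dropped the last
 * reference and must free the object.  The release ordering on the
 * decrement prevents the plain store from moving down past the atomic;
 * the acquire fence on the zero path makes all droppers' stores
 * visible before the object is freed. */
static bool obj_put(struct obj *o)
{
	o->value = 1;	/* plain store that must not sink past the dec */
	if (atomic_fetch_sub_explicit(&o->refcnt, 1,
				      memory_order_release) == 1) {
		atomic_thread_fence(memory_order_acquire);
		return true;	/* last reference: safe to kfree() */
	}
	return false;
}
```

On PPC of the era, the release half is an lwsync/eieio before the store conditional, which is exactly the barrier the 2.5 change adds.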
I started looking for operations that may imply memory barriers; below
is a summary of the ones I found so far.  The PA-RISC and MIPS
maintainers are interested in sorting this mess out, as are the SGI
IA64 guys wrt IO barriers.

Anton

atomics
	with return values
	without return values
	smp_mb__{before|after}_atomic_{dec|inc}

bitops
	with return values
	without return values
	smp_mb__{before|after}_clear_bit

xchg*
cmpxchg*

mb rmb wmb
smp_mb smp_rmb smp_wmb

in* out*
read* write*

ppc specific:
	iobarrier_rw iobarrier_r iobarrier_w
	iosync

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From engebret at vnet.ibm.com  Wed May  8 23:51:16 2002
From: engebret at vnet.ibm.com (David Engebretsen)
Date: Wed, 08 May 2002 08:51:16 -0500
Subject: eieio in atomics
References: <3CD6E82B.F8F5BD97@vnet.ibm.com> <20020507000411.GB21495@krispykreme>
Message-ID: <3CD92D54.55453B6@vnet.ibm.com>

Anton Blanchard wrote:
>
> Hi Dave,
>
> > I see that you added an eieio in atomic operations in the 2.5 tree.
> > Can you shed some light on what was discovered here?  I thought these
> > were not required to imply a memory barrier.  Was an actual problem
> > hit and do we need a fix put into the 2.4 code base?
>
> I started looking through the memory barriers in the kernel and
> was worried about places where atomics were used as reference counts:
>
>	foo->value = 1;
>	if (atomic_dec_and_test(&foo->refcnt))
>		kfree(foo);
>
> If we do this on two cpus, we need to make sure the first instruction
> does not end up moving down past the atomic operation. The window
> would be small, the isync at the end of the atomic would prevent
> it from moving any further down, so the window would be the few
> instructions between the successful store conditional and the isync.
>
> Having said that I haven't convinced myself 100% that the race is real.
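[Editor's note: the `smp_mb__{before|after}` hooks in Anton's summary exist precisely because non-value-returning atomics imply no barrier, so callers needing ordering must bracket them. A sketch of that pattern in C11 terms; `dec_with_barriers` is a made-up name, and on architectures where the atomic already implies a barrier the fences compile to nothing:]

```c
#include <assert.h>
#include <stdatomic.h>

/* A non-value-returning atomic decrement with explicit full barriers
 * around it, mirroring the Linux idiom:
 *
 *	smp_mb__before_atomic_dec();
 *	atomic_dec(&v);
 *	smp_mb__after_atomic_dec();
 */
static void dec_with_barriers(atomic_int *v)
{
	atomic_thread_fence(memory_order_seq_cst);	/* before hook */
	atomic_fetch_sub_explicit(v, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* after hook */
}
```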
>

The window you are worried about is this, if I follow your thinking:

	CPU 0                    CPU 1
	-------                  ------
	Store A
	Store B
	isync
	                         Free A
	                         Alloc A
	                         Store A  <-- this gets stomped by prior Store A

For this case, the isync does not provide any value (now I am wondering
why we have it at all in these atomic functions ...).  It does not
cause the Store A to be visible; it only ensures that the store
instruction has executed to completion.  An isync in a spinlock
acquisition is only an import barrier (loads following the lock do not
execute until the lock is taken).  Spinlocks rely on the spin-unlock
code to do a lwsync or eieio in order to provide correctness in the
context of locks.

Having said all this, for _this_ case I do not think there is an actual
problem, as the kfree path does a lwsync: a spinlock is taken &
released in free_block on that path.

Now there are probably cases where there is actually an exposure.  I
had hoped Rusty's statement was sufficient that this is not the case,
but perhaps we need to look more carefully.  The trick will be to find
them.  It would be in a place where the atomics are used as a spinlock
to gate access to a data structure.  Do you have some other concrete
examples in hand we can look at more closely?

Dave.

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From anton at samba.org  Thu May  9 17:40:51 2002
From: anton at samba.org (Anton Blanchard)
Date: Thu, 9 May 2002 17:40:51 +1000
Subject: icache coherency
Message-ID: <20020509074051.GC10468@krispykreme>

Hi,

I've been playing with the cache flush avoidance code, looking for
areas we can optimise.  The first obvious area is to avoid flushing the
icache after a COW fault when the VMA is not marked executable.
Unfortunately we cannot, for a few reasons:

1. The ppc32 ABI has a blrl instruction in the word before the start of
   the GOT.
   The aim here is to be able to calculate the address of the GOT by
   doing:

	bl GOT-4
	mflr r30

   Unfortunately the segment with the blrl is not always marked
   executable.  Alan Modra just fixed this in binutils, but there are
   an awful lot of binaries and libraries out there with this problem.

2. The PLT was not always marked executable.  More bad news.

I've put together workarounds in ld.so and the kernel that get around
these two problems, but it doesn't help us much performance-wise (I
have to mark the entire data section executable, and we end up syncing
the icache on COW faults for all executables and libraries.)

Anton

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From anton at samba.org  Thu May  9 20:15:07 2002
From: anton at samba.org (Anton Blanchard)
Date: Thu, 9 May 2002 20:15:07 +1000
Subject: icache coherency
In-Reply-To: <20020509074051.GC10468@krispykreme>
References: <20020509074051.GC10468@krispykreme>
Message-ID: <20020509101506.GD10468@krispykreme>

> 1. The segment with the blrl is not always marked executable.
>
> 2. The PLT was not always marked executable. More bad news.

3. PLTs are in mmap(/dev/zero) memory, so I can't even optimise icache
   syncing in anonymous memory regions.

Milton suggested adding something in the ELF header to say the
executable is compiled correctly; this might be worth pursuing.

Dave Gibson tried enabling per-page execute permission a while ago (the
4XX supports it).  Of course nothing worked :)

We should look towards fixing this mess.

Anton

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From anton at samba.org  Fri May 10 15:45:26 2002
From: anton at samba.org (Anton Blanchard)
Date: Fri, 10 May 2002 15:45:26 +1000
Subject: nasty bug in ppc32 ld.so
Message-ID: <20020510054526.GZ10468@krispykreme>

Hi,

I tracked down why SDET on recent 2.4 and 2.5 kernels sometimes SEGVs.
This happens mostly on RS64, probably due to its large icache.
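[Editor's note: stale icache lines are a per-cacheline affair, so a correct "modified code" flush must touch every line the modified range spans, not just its endpoints. A sketch under assumptions: the 128-byte line size and the `flush_line()`/`sync_code_range()` names are illustrative, not from any source here; on real PPC each per-line step is a dcbst/sync/icbi/isync sequence.]

```c
#include <assert.h>
#include <stdint.h>

#define LINE_SIZE 128	/* assumed icache line size */

static unsigned flushed;	/* counts lines flushed, for the demo */

/* Stand-in for the per-line dcbst; sync; icbi; isync sequence. */
static void flush_line(uintptr_t addr)
{
	(void)addr;
	flushed++;
}

/* Flush every icache line covering [start, start + len): round the
 * start down to a line boundary and step one line at a time until the
 * whole range is covered. */
static void sync_code_range(uintptr_t start, uintptr_t len)
{
	uintptr_t addr = start & ~(uintptr_t)(LINE_SIZE - 1);

	for (; addr < start + len; addr += LINE_SIZE)
		flush_line(addr);
}
```

Any shortcut that only touches the first and last line leaves every line in between stale whenever the range spans more than two lines.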
ld.so tries to optimise away a series of icbis:

	/* Now, we've modified code.  We need to write the changes from
	   the data cache to a second-level unified cache, then make
	   sure that stale data in the instruction cache is removed.
	   (In a multiprocessor system, the effect is more complex.)
	   Most of the PLT shouldn't be in the instruction cache, but
	   there may be a little overlap at the start and the end.  */

It dcbst's the entire range but only icbi's the first and last
cachelines.  This is clearly invalid: the kernel does not have to
icache-synchronise a zero-filled page when handing it out to a process,
yet this is exactly what the above behaviour relies on.

Finding this bug was difficult because ld.so did do the dcbst's.  So
you could never get an invalid instruction in the icache, just a stale
one.  And since it was in the PLT, which was always in the same spot,
we would jump through a stale branch in the PLT to the wrong function.
Weird symptoms result :)

I'll submit a patch to the libc guys.

In other news, I'll test out enforcing execute permission on 64 bit
binaries; hopefully we can flush out any places we execute out of areas
not marked executable.  Let's catch them before it all gets cast in
stone.

Anton

--- glibc-2.2.5/sysdeps/powerpc/dl-machine.c~	Sun Sep  9 07:21:34 2001
+++ glibc-2.2.5/sysdeps/powerpc/dl-machine.c	Fri May 10 12:58:04 2002
@@ -311,7 +311,8 @@
       PPC_DCBST (plt + i);
       PPC_DCBST (plt + size_modified - 1);
       PPC_SYNC;
-      PPC_ICBI (plt);
+      for (i = 0; i < size_modified; i += 4)
+	PPC_ICBI (plt + i);
       PPC_ICBI (plt + size_modified - 1);
       PPC_ISYNC;
     }
@@ -534,5 +535,5 @@
       return;
     }
 
-  MODIFIED_CODE_NOQUEUE (reloc_addr);
+  MODIFIED_CODE (reloc_addr);
 }

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From anton at samba.org  Mon May 13 18:48:42 2002
From: anton at samba.org (Anton Blanchard)
Date: Mon, 13 May 2002 18:48:42 +1000
Subject: signal handling...
Message-ID: <20020513084842.GA10248@krispykreme>

Hi,

Rusty pointed out that we don't fill in the si_addr field in the
siginfo struct for a SIGILL.  I've been going through the spec, and it
turns out our siginfo handling is a mess.

si_code is also broken because we leak the upper 4 bytes to userspace.
At the moment I'm masking it in the 32 bit signal handling code, but it
should probably be done in a copy_siginfo_to_user32 routine.

Here is my current arch/ppc64/kernel/traps.c.  I'll be committing stuff
as I go to the 2.5 tree; once it's been tested for a bit we can start
backporting the more critical bits.

Our SIGFPE handling is broken too: we don't clear the fpscr bits when
we jump into the handler.  When we try to do FP in the signal handler
(eg we want to print out an FP register dump for debugging) we die.
Paulus has fixed this in the ppc32 port and I'm in the process of
merging the fix into ppc64.  While I'm at it, I'll try and add the bits
to support changing FP modes via prctl.

traps.c summary:

- Clean up all our oops printing: serialise it, call console_verbose
  and bust_spinlocks.
- System reset exception checks the RI bit and allows us to continue if
  it is recoverable.  Useful when using it for debugging.  I wonder
  what we should do when debugging is disabled; maybe initiate a
  shutdown.
- Fix lots of cases where we missed setting si_code and si_addr.
- Return FPE errors in si_code for overflow, underflow, divide by zero,
  inexact, invalid operation.
- Remove CONFIG_KDB special cases.  It should be using the generic
  debugger hooks like xmon and kgdb.

Anton

/*
 * linux/arch/ppc/kernel/traps.c
 *
 * Copyright (C) 1995-1996 Gary Thomas (gdt at linuxppc.org)
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version
 * 2 of the License, or (at your option) any later version.
 *
 * Modified by Cort Dougan (cort at cs.nmt.edu)
 * and Paul Mackerras (paulus at cs.anu.edu.au)
 */

/*
 * This file handles the architecture-dependent parts of hardware exceptions
 */

#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include

#ifdef CONFIG_KDB
#include
#endif

#include
#include
#include
#include
#include
#include

extern int fix_alignment(struct pt_regs *);
extern void bad_page_fault(struct pt_regs *, unsigned long);

/* This is true if we are using the firmware NMI handler (typically LPAR) */
extern int fwnmi_active;

#ifdef CONFIG_XMON
extern void xmon(struct pt_regs *regs);
extern int xmon_bpt(struct pt_regs *regs);
extern int xmon_sstep(struct pt_regs *regs);
extern int xmon_iabr_match(struct pt_regs *regs);
extern int xmon_dabr_match(struct pt_regs *regs);
extern void (*xmon_fault_handler)(struct pt_regs *regs);
#endif

#ifdef CONFIG_XMON
#define CONFIG_DEBUGGER
void (*debugger)(struct pt_regs *regs) = xmon;
int (*debugger_bpt)(struct pt_regs *regs) = xmon_bpt;
int (*debugger_sstep)(struct pt_regs *regs) = xmon_sstep;
int (*debugger_iabr_match)(struct pt_regs *regs) = xmon_iabr_match;
int (*debugger_dabr_match)(struct pt_regs *regs) = xmon_dabr_match;
void (*debugger_fault_handler)(struct pt_regs *regs);
#else
#ifdef CONFIG_KGDB
#define CONFIG_DEBUGGER
void (*debugger)(struct pt_regs *regs);
int (*debugger_bpt)(struct pt_regs *regs);
int (*debugger_sstep)(struct pt_regs *regs);
int (*debugger_iabr_match)(struct pt_regs *regs);
int (*debugger_dabr_match)(struct pt_regs *regs);
void (*debugger_fault_handler)(struct pt_regs *regs);
#endif
#endif

/*
 * Trap & Exception support
 */

/* Should we panic on bad kernel exceptions or try to recover */
#undef PANIC_ON_ERROR

static spinlock_t die_lock = SPIN_LOCK_UNLOCKED;

void die(const char *str, struct pt_regs *regs, long err)
{
	console_verbose();
	spin_lock_irq(&die_lock);
	bust_spinlocks(1);
	printk("Oops: %s, sig: %ld\n", str, err);
	show_regs(regs);
	print_backtrace((unsigned long *)regs->gpr[1]);
	bust_spinlocks(0);
	spin_unlock_irq(&die_lock);
#ifdef PANIC_ON_ERROR
	panic(str);
#else
	do_exit(SIGSEGV);
#endif
}

static void _exception(int signr, siginfo_t *info, struct pt_regs *regs)
{
	if (!user_mode(regs)) {
#ifdef CONFIG_DEBUGGER
		if (debugger)
			debugger(regs);
#endif
		die("Exception in kernel mode\n", regs, signr);
	}

	force_sig_info(signr, info, current);
}

/* Get the error information for errors coming through the
 * FWNMI vectors.  The pt_regs' r3 will be updated to reflect
 * the actual r3 if possible, and a ptr to the error log entry
 * will be returned if found.
 */
static struct rtas_error_log *FWNMI_get_errinfo(struct pt_regs *regs)
{
	unsigned long errdata = regs->gpr[3];
	struct rtas_error_log *errhdr = NULL;
	unsigned long *savep;

	if ((errdata >= 0x7000 && errdata < 0x7fff0) ||
	    (errdata >= rtas.base && errdata < rtas.base + rtas.size - 16)) {
		savep = __va(errdata);
		regs->gpr[3] = savep[0];	/* restore original r3 */
		errhdr = (struct rtas_error_log *)(savep + 1);
	} else {
		printk("FWNMI: corrupt r3\n");
	}
	return errhdr;
}

/* Call this when done with the data returned by FWNMI_get_errinfo.
 * It will release the saved data area for other CPUs in the
 * partition to receive FWNMI errors.
 */
static void FWNMI_release_errinfo(void)
{
	unsigned long ret = rtas_call(rtas_token("ibm,nmi-interlock"), 0, 1, NULL);
	if (ret != 0)
		printk("FWNMI: nmi-interlock failed: %ld\n", ret);
}

void SystemResetException(struct pt_regs *regs)
{
	if (fwnmi_active) {
		char *msg;
		unsigned long *r3 = __va(regs->gpr[3]);	/* for FWNMI debug */
		struct rtas_error_log *errlog;

		msg = "FWNMI is active with save area at %016lx\n";
		udbg_printf(msg, r3);
		printk(msg, r3);
		errlog = FWNMI_get_errinfo(regs);
	}

#ifdef CONFIG_DEBUGGER
	if (debugger)
		debugger(regs);
#endif

#ifdef PANIC_ON_ERROR
	panic("System Reset");
#else
	/* Must die if the interrupt is not recoverable */
	if (!(regs->msr & MSR_RI))
		panic("Unrecoverable System Reset");
#endif

	/* What should we do here?  We could issue a shutdown or hard reset. */
}

static int power4_handle_mce(struct pt_regs *regs)
{
	return 0;
}

void MachineCheckException(struct pt_regs *regs)
{
	siginfo_t info;

	if (fwnmi_active) {
		struct rtas_error_log *errhdr = FWNMI_get_errinfo(regs);
		if (errhdr) {
			/* ToDo: attempt to recover from some errors here */
		}
		FWNMI_release_errinfo();
	}

	if (!user_mode(regs)) {
		/* Attempt to recover if the interrupt is recoverable */
		if (regs->msr & MSR_RI) {
			if (__is_processor(PV_POWER4) && power4_handle_mce(regs))
				return;
		}

#ifdef CONFIG_DEBUGGER
		if (debugger_fault_handler) {
			debugger_fault_handler(regs);
			return;
		}
		if (debugger)
			debugger(regs);
#endif
		console_verbose();
		spin_lock_irq(&die_lock);
		bust_spinlocks(1);
		printk("Machine check in kernel mode.\n");
		printk("Caused by (from SRR1=%lx): ", regs->msr);
		show_regs(regs);
		print_backtrace((unsigned long *)regs->gpr[1]);
		bust_spinlocks(0);
		spin_unlock_irq(&die_lock);
		panic("Unrecoverable Machine Check");
	}

	/*
	 * XXX we should check RI bit on exception exit and kill the
	 * task if it was cleared
	 */
	info.si_signo = SIGBUS;
	info.si_errno = 0;
	info.si_code = BUS_ADRERR;
	info.si_addr = (void *)regs->nip;
	_exception(SIGBUS, &info, regs);
}

void UnknownException(struct pt_regs *regs)
{
	siginfo_t info;

	printk("Bad trap at PC: %lx, SR: %lx, vector=%lx\n",
	       regs->nip, regs->msr, regs->trap);
	info.si_signo = SIGTRAP;
	info.si_errno = 0;
	info.si_code = 0;
	info.si_addr = 0;
	_exception(SIGTRAP, &info, regs);
}

void InstructionBreakpointException(struct pt_regs *regs)
{
	siginfo_t info;

#ifdef CONFIG_DEBUGGER
	if (debugger_iabr_match && debugger_iabr_match(regs))
		return;
#endif
	info.si_signo = SIGTRAP;
	info.si_errno = 0;
	info.si_code = TRAP_BRKPT;
	info.si_addr = (void *)regs->nip;
	_exception(SIGTRAP, &info, regs);
}

static void parse_fpe(struct pt_regs *regs)
{
	siginfo_t info;
	unsigned long fpscr;

	if (regs->msr & MSR_FP)
		giveup_fpu(current);

	fpscr = current->thread.fpscr;

	if (fpscr & (FPSCR_VE|FPSCR_VX))
		/* Invalid operation */
		info.si_code = FPE_FLTINV;
	else if (fpscr & (FPSCR_OE|FPSCR_OX))
		/* Overflow */
		info.si_code = FPE_FLTOVF;
	else if (fpscr & (FPSCR_UE|FPSCR_UX))
		/* Underflow */
		info.si_code = FPE_FLTUND;
	else if (fpscr & (FPSCR_ZE|FPSCR_ZX))
		/* Divide by zero */
		info.si_code = FPE_FLTDIV;
	else if (fpscr & (FPSCR_XE|FPSCR_XX))
		/* Inexact result */
		info.si_code = FPE_FLTRES;
	else
		info.si_code = 0;

	info.si_signo = SIGFPE;
	info.si_errno = 0;
	info.si_addr = (void *)regs->nip;
	_exception(SIGFPE, &info, regs);
}

void ProgramCheckException(struct pt_regs *regs)
{
	siginfo_t info;

	if (regs->msr & 0x100000) {
		/* IEEE FP exception */
		parse_fpe(regs);
	} else if (regs->msr & 0x40000) {
		/* Privileged instruction */
		info.si_signo = SIGILL;
		info.si_errno = 0;
		info.si_code = ILL_PRVOPC;
		info.si_addr = (void *)regs->nip;
		_exception(SIGILL, &info, regs);
	} else if (regs->msr & 0x20000) {
		/* trap exception */
#ifdef CONFIG_DEBUGGER
		if (debugger_bpt && debugger_bpt(regs))
			return;
#endif
		info.si_signo = SIGTRAP;
		info.si_errno = 0;
		info.si_code = TRAP_BRKPT;
		info.si_addr = (void *)regs->nip;
		_exception(SIGTRAP, &info, regs);
	} else {
		/* Illegal instruction */
		info.si_signo = SIGILL;
		info.si_errno = 0;
		info.si_code = ILL_ILLTRP;
		info.si_addr = (void *)regs->nip;
		_exception(SIGILL, &info, regs);
	}
}

void SingleStepException(struct pt_regs *regs)
{
	siginfo_t info;

	regs->msr &= ~MSR_SE;	/* Turn off 'trace' bit */

#ifdef CONFIG_DEBUGGER
	if (debugger_sstep && debugger_sstep(regs))
		return;
#endif
	info.si_signo = SIGTRAP;
	info.si_errno = 0;
	info.si_code = TRAP_TRACE;
	info.si_addr = (void *)regs->nip;
	_exception(SIGTRAP, &info, regs);
}

void PerformanceMonitorException(struct pt_regs *regs)
{
	siginfo_t info;

	info.si_signo = SIGTRAP;
	info.si_errno = 0;
	info.si_code = TRAP_BRKPT;
	info.si_addr = 0;
	_exception(SIGTRAP, &info, regs);
}

void AlignmentException(struct pt_regs *regs)
{
	int fixed;
	siginfo_t info;

	fixed = fix_alignment(regs);

	if (fixed == 1) {
		if (!user_mode(regs))
			PPCDBG(PPCDBG_ALIGNFIXUP, "fix alignment at %lx\n",
			       regs->nip);
		regs->nip += 4;	/* skip over emulated instruction */
		return;
	}

	/* Operand address was bad */
	if (fixed == -EFAULT) {
		if (user_mode(regs)) {
			info.si_signo = SIGSEGV;
			info.si_errno = 0;
			info.si_code = SEGV_MAPERR;
			info.si_addr = (void *)regs->dar;
			force_sig_info(SIGSEGV, &info, current);
		} else {
			/* Search exception table */
			bad_page_fault(regs, regs->dar);
		}
		return;
	}

	info.si_signo = SIGBUS;
	info.si_errno = 0;
	info.si_code = BUS_ADRALN;
	info.si_addr = (void *)regs->nip;
	_exception(SIGBUS, &info, regs);
}

void __init trap_init(void)
{
}

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/