From engebret at vnet.ibm.com  Tue May  7 06:31:39 2002
From: engebret at vnet.ibm.com (David Engebretsen)
Date: Mon, 06 May 2002 15:31:39 -0500
Subject: eieio in atomics
Message-ID: <3CD6E82B.F8F5BD97@vnet.ibm.com>


Anton,

I see that you added an eieio in atomic operations in the 2.5 tree.  Can you
shed some light on what was discovered here?  I thought these were not required
to imply a memory barrier.  Was an actual problem hit and do we need a fix put
into the 2.4 code base?

Thanks -

Dave.

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Tue May  7 10:04:11 2002
From: anton at samba.org (Anton Blanchard)
Date: Tue, 7 May 2002 10:04:11 +1000
Subject: eieio in atomics
In-Reply-To: <3CD6E82B.F8F5BD97@vnet.ibm.com>
References: <3CD6E82B.F8F5BD97@vnet.ibm.com>
Message-ID: <20020507000411.GB21495@krispykreme>


Hi Dave,

> I see that you added an eieio in atomic operations in the 2.5 tree.
> Can you shed some light on what was discovered here?  I thought these
> were not required to imply a memory barrier.  Was an actual problem
> hit and do we need a fix put into the 2.4 code base?

I started looking through the memory barriers in the kernel and
was worried about places where atomics were used as reference counts:

	foo->value = 1;
	if (!atomic_dec_and_test(&foo->refcnt))
		kfree(foo);

If we do this on two cpus, we need to make sure the first instruction
does not end up moving down past the atomic operation. The window
would be small, the isync at the end of the atomic would prevent
it from moving any further down, so the window would be the few
instructions between the successful store conditional and the isync.

Having said that I havent convinced myself 100% that the race is real.

My current understanding is that atomics and bitops that do not return
values imply no memory barrier, but a memory barrier is implied for
ones that return values.

I started looking for operations that may imply memory barriers,
below is a summary of the ones I found so far. The PA RISC and MIPS
maintainers are interested in sorting this mess out, as are the SGI
IA64 guys wrt IO barriers.

Anton

atomics
        with return values
        without return values
        smp_mb__{before|after}_atomic_{dec|inc}

bitops
        with return values
        without return values
        smp_mb__{before|after}_clear_bit

xchg*
cmpxchg*

mb
rmb
wmb

smp_mb
smp_rmb
smp_wmb

in*
out*

read*
write*

ppc specific:
	iobarrier_rw
	iobarrier_r
	iobarrier_w
	iosync

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From engebret at vnet.ibm.com  Wed May  8 23:51:16 2002
From: engebret at vnet.ibm.com (David Engebretsen)
Date: Wed, 08 May 2002 08:51:16 -0500
Subject: eieio in atomics
References: <3CD6E82B.F8F5BD97@vnet.ibm.com> <20020507000411.GB21495@krispykreme>
Message-ID: <3CD92D54.55453B6@vnet.ibm.com>


Anton Blanchard wrote:
>
> Hi Dave,
>
> > I see that you added an eieio in atomic operations in the 2.5 tree.
> > Can you shed some light on what was discovered here?  I thought these
> > were not required to imply a memory barrier.  Was an actual problem
> > hit and do we need a fix put into the 2.4 code base?
>
> I started looking through the memory barriers in the kernel and
> was worried about places where atomics were used as reference counts:
>
>         foo->value = 1;
>         if (!atomic_dec_and_test(&foo->refcnt))
>                 kfree(foo);
>
> If we do this on two cpus, we need to make sure the first instruction
> does not end up moving down past the atomic operation. The window
> would be small, the isync at the end of the atomic would prevent
> it from moving any further down, so the window would be the few
> instructions between the successful store conditional and the isync.
>
> Having said that I havent convinced myself 100% that the race is real.
>

The window you are worried about is this, if I follow your thinking:

CPU 0		CPU1
-------		------
Store A
Store B
isync
Free A
		Alloc A
		Store A <-- this gets stomped by prior Store A

For this case, the isync does not provide any value (now I am wondering why we
have it at all in these atomic functions ...).  It does not cause the Store A to
be visible, only that the store instruction has executed to completion.  An
isync in a spinlock acquisition is only an import barrier (loads following the
lock do not execute until the lock is taken).  Spinlocks rely on the spin-unlock
code to do a lwsync or eieio in order to provide correctness in the context of
locks.

Having said all this, for _this_ case, there I do not think there is an actual
problem as the kfree path does a lwsync as there is a spinlock taken & released
in free_block on that path.

Now there are probabily cases where there is actually an exposure.  I had hoped
Rusty's statement was sufficient that this is not the case, but perhaps we need
to look more carefully.  Trick will be to find them.  It would be in a place
where the atomics are used as a spinlock to gate access to a data structure.

Do you have some other concrete examples in had we can look at more closely?

Dave.

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Thu May  9 17:40:51 2002
From: anton at samba.org (Anton Blanchard)
Date: Thu, 9 May 2002 17:40:51 +1000
Subject: icache coherency
Message-ID: <20020509074051.GC10468@krispykreme>


Hi,

Ive been playing with the cache flush avoidance code, looking for
areas we can optimise. The first obvious area is to avoid flushing
the icache after a COW fault when the VMA is not marked executable.

Unfortunately we cannot for a few reasons:

1. The ppc32 ABI has a blrl instruction in the word before the start
of the GOT. The aim here is to be able to calculate the address of the
GOT by doing:

	bl GOT-4
	mflr r30

Unfortunately the segment with the blrl is not always marked executable.
Alan Modra just fixed this in binutils but there are an awful number of
binaries and libraries out there with this problem.

2. The PLT was not always marked executable. More bad news.


Ive put together workarounds in ld.so and the kernel that get around
these two problems, but it doesnt help us much performance wise (I have
to mark the entire data section executable and we end up syncing the
icache on COW faults for all executables and libraries.)

Anton

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Thu May  9 20:15:07 2002
From: anton at samba.org (Anton Blanchard)
Date: Thu, 9 May 2002 20:15:07 +1000
Subject: icache coherency
In-Reply-To: <20020509074051.GC10468@krispykreme>
References: <20020509074051.GC10468@krispykreme>
Message-ID: <20020509101506.GD10468@krispykreme>


> 1. The segment with the blrl is not always marked executable.
>
> 2. The PLT was not always marked executable. More bad news.

3. PLTs are in mmap(/dev/zero) memory, so I cant even optimise
icache syncing in anonymous memory regions.

Milton suggested adding something in the ELF header to say the
executable is compiled correctly, this might be worth pursuing.

Dave Gibson tried enabling per page execute permission a while ago
(the 4XX supports it). Of course nothing worked :) We should
look towards fixing this mess.

Anton

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Fri May 10 15:45:26 2002
From: anton at samba.org (Anton Blanchard)
Date: Fri, 10 May 2002 15:45:26 +1000
Subject: nasty bug in ppc32 ld.so
Message-ID: <20020510054526.GZ10468@krispykreme>


Hi,

I tracked down why SDET on recent 2.4 and 2.5 kernels sometimes SEGVs.
This happens mostly on RS64, probably due to its large icache.

ld.so tries to optimise away a series of icbis:

      /* Now, we've modified code.  We need to write the changes from
         the data cache to a second-level unified cache, then make
         sure that stale data in the instruction cache is removed.
         (In a multiprocessor system, the effect is more complex.)
         Most of the PLT shouldn't be in the instruction cache, but
         there may be a little overlap at the start and the end.

It dcbst's the entire range but only icbi's the first and last
cachelines. This is clearly invalid, the kernel does not have to
icache synchronise a zero filled page when handing it out to a process,
yet this is exactly what the above behaviour relies on.

Finding this bug was difficult because ld.so did do the dcbst's. So you
could never get an invalid instruction in the icache, just a stale one.
And since it was in the PLT which was always in the same spot, we would
jump through a stale branch in the PLT to the wrong function. Weird
symptoms result :)

I'll submit a patch to the libc guys.

In other news I'll test out enforcing execute permission on 64 bit
binaries, hopefully we can flush out any places we execute out of areas
not marked executable. Lets catch them before it all gets cast in stone.

Anton

--- glibc-2.2.5/sysdeps/powerpc/dl-machine.c~	Sun Sep  9 07:21:34 2001
+++ glibc-2.2.5/sysdeps/powerpc/dl-machine.c	Fri May 10 12:58:04 2002
@@ -311,7 +311,8 @@
 	PPC_DCBST (plt + i);
       PPC_DCBST (plt + size_modified - 1);
       PPC_SYNC;
-      PPC_ICBI (plt);
+      for (i = 0; i < size_modified; i += 4)
+        PPC_ICBI (plt + i);
       PPC_ICBI (plt + size_modified - 1);
       PPC_ISYNC;
     }
@@ -534,5 +535,5 @@
       return;
     }

-  MODIFIED_CODE_NOQUEUE (reloc_addr);
+  MODIFIED_CODE (reloc_addr);
 }

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Mon May 13 18:48:42 2002
From: anton at samba.org (Anton Blanchard)
Date: Mon, 13 May 2002 18:48:42 +1000
Subject: signal handling...
Message-ID: <20020513084842.GA10248@krispykreme>


Hi,

Rusty pointed out that we don't fill in the si_addr field in the siginfo
struct for a SIGILL. Ive been going through the spec and it turns out our
siginfo handling is a mess.

si_code is also broken because we leak the upper 4 bytes to userspace.
At the moment Im masking it in the 32 bit signal handling code
but it should probably be done in a copy_siginfo_to_user32 routine.

Here is my current arch/ppc64/kernel/traps.c. I'll be committing stuff
as I go to the 2.5 tree, once its been tested for a bit we can start
backporting the more critical bits.

Our SIGFPE handling is broken too, we dont clear the fpscr bits when
we jump into the handler. When we try to do FP in the signal handler
(eg we want to print out a FP register dump for debugging) we die.
Paulus has fixed this in the ppc32 port and Im in the process of
merging the fix into ppc64. While Im at it, I'll try and add the bits
to support changing FP modes via prctl.

traps.c summary:

Clean up all our oops printing: serialise it, call console_verbose
and bust_spinlocks.

System reset exception checks RI bit and allows us to continue if it
    is recoverable. Useful when using it for debugging. I wonder what we
    should do when debugging is disabled, maybe initiate a shutdown.

Fix lots of cases where we missed setting si_code and si_addr.

Return FPE errors in si_code for overflow, underflow, divide by zero,
    inexact, invalid operation.

Remove CONFIG_KDB special cases. It should be using the generic debugger
    hooks like xmon and kgdb

Anton


/*
 *  linux/arch/ppc/kernel/traps.c
 *
 *  Copyright (C) 1995-1996  Gary Thomas (gdt at linuxppc.org)
 *
 *  This program is free software; you can redistribute it and/or
 *  modify it under the terms of the GNU General Public License
 *  as published by the Free Software Foundation; either version
 *  2 of the License, or (at your option) any later version.
 *
 *  Modified by Cort Dougan (cort at cs.nmt.edu)
 *  and Paul Mackerras (paulus at cs.anu.edu.au)
 */

/*
 * This file handles the architecture-dependent parts of hardware exceptions
 */

#include <linux/errno.h>
#include <linux/sched.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/stddef.h>
#include <linux/unistd.h>
#include <linux/ptrace.h>
#include <linux/slab.h>
#include <linux/user.h>
#include <linux/a.out.h>
#include <linux/interrupt.h>
#include <linux/config.h>
#include <linux/init.h>

#ifdef CONFIG_KDB
#include <linux/kdb.h>
#endif

#include <asm/pgtable.h>
#include <asm/uaccess.h>
#include <asm/system.h>
#include <asm/io.h>
#include <asm/processor.h>
#include <asm/ppcdebug.h>

extern int fix_alignment(struct pt_regs *);
extern void bad_page_fault(struct pt_regs *, unsigned long);

/* This is true if we are using the firmware NMI handler (typically LPAR) */
extern int fwnmi_active;

#ifdef CONFIG_XMON
extern void xmon(struct pt_regs *regs);
extern int xmon_bpt(struct pt_regs *regs);
extern int xmon_sstep(struct pt_regs *regs);
extern int xmon_iabr_match(struct pt_regs *regs);
extern int xmon_dabr_match(struct pt_regs *regs);
extern void (*xmon_fault_handler)(struct pt_regs *regs);
#endif

#ifdef CONFIG_XMON
#define CONFIG_DEBUGGER
void (*debugger)(struct pt_regs *regs) = xmon;
int (*debugger_bpt)(struct pt_regs *regs) = xmon_bpt;
int (*debugger_sstep)(struct pt_regs *regs) = xmon_sstep;
int (*debugger_iabr_match)(struct pt_regs *regs) = xmon_iabr_match;
int (*debugger_dabr_match)(struct pt_regs *regs) = xmon_dabr_match;
void (*debugger_fault_handler)(struct pt_regs *regs);
#else
#ifdef CONFIG_KGDB
#define CONFIG_DEBUGGER
void (*debugger)(struct pt_regs *regs);
int (*debugger_bpt)(struct pt_regs *regs);
int (*debugger_sstep)(struct pt_regs *regs);
int (*debugger_iabr_match)(struct pt_regs *regs);
int (*debugger_dabr_match)(struct pt_regs *regs);
void (*debugger_fault_handler)(struct pt_regs *regs);
#endif
#endif

/*
 * Trap & Exception support
 */

/* Should we panic on bad kernel exceptions or try to recover */
#undef PANIC_ON_ERROR

static spinlock_t die_lock = SPIN_LOCK_UNLOCKED;

void die(const char *str, struct pt_regs *regs, long err)
{
	console_verbose();
	spin_lock_irq(&die_lock);
	bust_spinlocks(1);
	printk("Oops: %s, sig: %ld\n", str, err);
	show_regs(regs);
	print_backtrace((unsigned long *)regs->gpr[1]);
	bust_spinlocks(0);
	spin_unlock_irq(&die_lock);

#ifdef PANIC_ON_ERROR
	panic(str);
#else
	do_exit(SIGSEGV);
#endif
}

static void
_exception(int signr, siginfo_t *info, struct pt_regs *regs)
{
	if (!user_mode(regs)) {
#ifdef CONFIG_DEBUGGER
		if (debugger)
			debugger(regs);
#endif
		die("Exception in kernel mode\n", regs, signr);
	}

	force_sig_info(signr, info, current);
}

/* Get the error information for errors coming through the
 * FWNMI vectors.  The pt_regs' r3 will be updated to reflect
 * the actual r3 if possible, and a ptr to the error log entry
 * will be returned if found.
 */
static struct rtas_error_log *FWNMI_get_errinfo(struct pt_regs *regs)
{
	unsigned long errdata = regs->gpr[3];
	struct rtas_error_log *errhdr = NULL;
	unsigned long *savep;

	if ((errdata >= 0x7000 && errdata < 0x7fff0) ||
	    (errdata >= rtas.base && errdata < rtas.base + rtas.size - 16)) {
		savep = __va(errdata);
		regs->gpr[3] = savep[0];	/* restore original r3 */
		errhdr = (struct rtas_error_log *)(savep + 1);
	} else {
		printk("FWNMI: corrupt r3\n");
	}
	return errhdr;
}

/* Call this when done with the data returned by FWNMI_get_errinfo.
 * It will release the saved data area for other CPUs in the
 * partition to receive FWNMI errors.
 */
static void FWNMI_release_errinfo(void)
{
	unsigned long ret = rtas_call(rtas_token("ibm,nmi-interlock"), 0, 1, NULL);
	if (ret != 0)
		printk("FWNMI: nmi-interlock failed: %ld\n", ret);
}

void
SystemResetException(struct pt_regs *regs)
{
	if (fwnmi_active) {
		char *msg;
		unsigned long *r3 = __va(regs->gpr[3]); /* for FWNMI debug */
		struct rtas_error_log *errlog;

		msg = "FWNMI is active with save area at %016lx\n";
		udbg_printf(msg, r3); printk(msg, r3);
		errlog = FWNMI_get_errinfo(regs);
	}

#ifdef CONFIG_DEBUGGER
	if (debugger)
		debugger(regs);
#endif

#ifdef PANIC_ON_ERROR
	panic("System Reset");
#else
	/* Must die if the interrupt is not recoverable */
	if (!(regs->msr & MSR_RI))
		panic("Unrecoverable System Reset");
#endif

	/* What should we do here? We could issue a shutdown or hard reset. */
}

static int power4_handle_mce(struct pt_regs *regs)
{
	return 0;
}

void
MachineCheckException(struct pt_regs *regs)
{
	siginfo_t info;

	if (fwnmi_active) {
		struct rtas_error_log *errhdr = FWNMI_get_errinfo(regs);
		if (errhdr) {
			/* ToDo: attempt to recover from some errors here */
		}
		FWNMI_release_errinfo();
	}

	if (!user_mode(regs)) {
		/* Attempt to recover if the interrupt is recoverable */
		if (regs->msr & MSR_RI) {
			if (__is_processor(PV_POWER4) &&
			    power4_handle_mce(regs))
				return;
		}

#ifdef CONFIG_DEBUGGER
		if (debugger_fault_handler) {
			debugger_fault_handler(regs);
			return;
		}
		if (debugger)
			debugger(regs);
#endif
		console_verbose();
		spin_lock_irq(&die_lock);
		bust_spinlocks(1);
		printk("Machine check in kernel mode.\n");
		printk("Caused by (from SRR1=%lx): ", regs->msr);
		show_regs(regs);
		print_backtrace((unsigned long *)regs->gpr[1]);
		bust_spinlocks(0);
		spin_unlock_irq(&die_lock);
		panic("Unrecoverable Machine Check");
	}

	/*
	 * XXX we should check RI bit on exception exit and kill the
	 * task if it was cleared
	 */
	info.si_signo = SIGBUS;
	info.si_errno = 0;
	info.si_code = BUS_ADRERR;
	info.si_addr = (void *)regs->nip;
	_exception(SIGSEGV, &info, regs);
}

void
UnknownException(struct pt_regs *regs)
{
	siginfo_t info;

	printk("Bad trap at PC: %lx, SR: %lx, vector=%lx\n",
	       regs->nip, regs->msr, regs->trap);

	info.si_signo = SIGTRAP;
	info.si_errno = 0;
	info.si_code = 0;
	info.si_addr = 0;
	_exception(SIGTRAP, &info, regs);
}

void
InstructionBreakpointException(struct pt_regs *regs)
{
	siginfo_t info;

#ifdef CONFIG_DEBUGGER
	if (debugger_iabr_match && debugger_iabr_match(regs))
		return;
#endif

	info.si_signo = SIGTRAP;
	info.si_errno = 0;
	info.si_code = TRAP_BRKPT;
	info.si_addr = (void *)regs->nip;
	_exception(SIGTRAP, &info, regs);
}

static void parse_fpe(struct pt_regs *regs)
{
	siginfo_t info;
	unsigned long fpscr;

	if (regs->msr & MSR_FP)
		giveup_fpu(current);

	fpscr = current->thread.fpscr;

	if (fpscr & (FPSCR_VE|FPSCR_VX))	/* Invalid operation */
		info.si_code = FPE_FLTINV;
	else if (fpscr & (FPSCR_OE|FPSCR_OX))	/* Overflow */
		info.si_code = FPE_FLTOVF;
	else if (fpscr & (FPSCR_UE|FPSCR_UX))	/* Underflow */
		info.si_code = FPE_FLTUND;
	else if (fpscr & (FPSCR_ZE|FPSCR_ZX))	/* Divide by zero */
		info.si_code = FPE_FLTDIV;
	else if (fpscr & (FPSCR_XE|FPSCR_XX))	/* Inexact result */
		info.si_code = FPE_FLTRES;
	else
		info.si_code = 0;

	info.si_signo = SIGFPE;
	info.si_errno = 0;
	info.si_addr = (void *)regs->nip;
	_exception(SIGFPE, &info, regs);
}

void
ProgramCheckException(struct pt_regs *regs)
{
	siginfo_t info;

	if (regs->msr & 0x100000) {
		/* IEEE FP exception */

		parse_fpe(regs);
	} else if (regs->msr & 0x40000) {
		/* Privileged instruction */

		info.si_signo = SIGILL;
		info.si_errno = 0;
		info.si_code = ILL_PRVOPC;
		info.si_addr = (void *)regs->nip;
		_exception(SIGILL, &info, regs);
	} else if (regs->msr & 0x20000) {
		/* trap exception */

#ifdef CONFIG_DEBUGGER
		if (debugger_bpt && debugger_bpt(regs))
			return;
#endif
		info.si_signo = SIGTRAP;
		info.si_errno = 0;
		info.si_code = TRAP_BRKPT;
		info.si_addr = (void *)regs->nip;
		_exception(SIGTRAP, &info, regs);
	} else {
		/* Illegal instruction */

		info.si_signo = SIGILL;
		info.si_errno = 0;
		info.si_code = ILL_ILLTRP;
		info.si_addr = (void *)regs->nip;
		_exception(SIGILL, &info, regs);
	}
}

void
SingleStepException(struct pt_regs *regs)
{
	siginfo_t info;

	regs->msr &= ~MSR_SE;  /* Turn off 'trace' bit */

#ifdef CONFIG_DEBUGGER
	if (debugger_sstep && debugger_sstep(regs))
		return;
#endif

	info.si_signo = SIGTRAP;
	info.si_errno = 0;
	info.si_code = TRAP_TRACE;
	info.si_addr = (void *)regs->nip;
	_exception(SIGTRAP, &info, regs);
}

void
PerformanceMonitorException(struct pt_regs *regs)
{
	siginfo_t info;

	info.si_signo = SIGTRAP;
	info.si_errno = 0;
	info.si_code = TRAP_BRKPT;
	info.si_addr = 0;
	_exception(SIGTRAP, &info, regs);
}

void
AlignmentException(struct pt_regs *regs)
{
	int fixed;
	siginfo_t info;

	fixed = fix_alignment(regs);

	if (fixed == 1) {
		if (!user_mode(regs))
			PPCDBG(PPCDBG_ALIGNFIXUP, "fix alignment at %lx\n",
			       regs->nip);
		regs->nip += 4;	/* skip over emulated instruction */
		return;
	}

	/* Operand address was bad */
	if (fixed == -EFAULT) {
		if (user_mode(regs)) {
			info.si_signo = SIGSEGV;
			info.si_errno = 0;
			info.si_code = SEGV_MAPERR;
			info.si_addr = (void *)regs->dar;
			force_sig_info(SIGSEGV, &info, current);
		} else {
			/* Search exception table */
			bad_page_fault(regs, regs->dar);
		}

		return;
	}

	info.si_signo = SIGBUS;
	info.si_errno = 0;
	info.si_code = BUS_ADRALN;
	info.si_addr = (void *)regs->nip;
	_exception(SIGBUS, &info, regs);
}

void __init trap_init(void)
{
}

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/