From paulus at samba.org Wed Mar 1 09:51:30 2006
From: paulus at samba.org (Paul Mackerras)
Date: Wed, 1 Mar 2006 09:51:30 +1100
Subject: Membership stats (Was: Re: merge these lists?)
In-Reply-To: 
References: <20060208110718.57e9f9f5.sfr@canb.auug.org.au>
Message-ID: <17412.54258.868906.846176@cargo.ozlabs.ibm.com>

Kumar Gala writes:

> Where did we leave on with this?  I was about to request that
> marc.theaimsgroup.com start archiving some of the ppc lists but figured
> doing it after we merged lists would be better.

My current thought is to call the combined list "linuxppc-dev" and put
the appropriate redirection in place from linuxppc64-dev to
linuxppc-dev.

If anyone really objects to that, shout now.

Paul.

From pradeep at us.ibm.com Wed Mar 1 13:20:54 2006
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Tue, 28 Feb 2006 19:20:54 -0700
Subject: Problems loading some select modules
In-Reply-To: <17411.55866.172377.50234@cargo.ozlabs.ibm.com>
Message-ID: 

Hello Paul,

Please find the tar file with the files that you requested.

(See attached file: findex.tar.gz)

This was picked up from the svn tree (5421) of openib.org. This is not yet
in the mainline.

To set the context - this was seen on a Sles9sp2 machine and not a Rhel4U3
machine, as indicated in the previous mail.

Thanks for looking into this!

Pradeep
pradeep at us.ibm.com

Paul Mackerras wrote on 02/27/2006 09:06:02 PM:

> Pradeep Satyanarayana writes:
> 
> > I was trying to load some Infiniband modules (using modprobe) on Power5
> > machine (p570), and I get the following error:
> > 
> > WARNING: Error inserting findex
> > (/lib/modules/2.6.16-rc2/kernel/drivers/infiniband/core/findex.ko): Invalid
> > module format
> > 
> > Also, in /var/log/messages I see the following error about the same module:
> > 
> > kernel: findex: doesn't contain .toc or .stubs.
> 
> Interesting.  I don't see findex.c in the kernel sources anywhere.  It
> could be that a very simple module that only accesses variables on the
> stack would not need a toc, and maybe in this case the toolchain
> doesn't generate a toc.  Could you send me the source of your module
> plus the generated findex.ko?
> 
> Paul.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060228/8291c691/attachment.htm
-------------- next part --------------
A non-text attachment was scrubbed...
Name: findex.tar.gz
Type: application/octet-stream
Size: 38685 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060228/8291c691/attachment.obj

From mbligh at mbligh.org Wed Mar 1 10:56:24 2006
From: mbligh at mbligh.org (Martin Bligh)
Date: Tue, 28 Feb 2006 15:56:24 -0800
Subject: 2.6.16-rc5-mm1
In-Reply-To: <20060228042439.43e6ef41.akpm@osdl.org>
References: <20060228042439.43e6ef41.akpm@osdl.org>
Message-ID: <4404E328.7070807@mbligh.org>

Andrew Morton wrote:
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc5/2.6.16-rc5-mm1/

New panic on IBM power4 lpar of P690. 2.6.16-rc5-git3 is OK.
(config: http://ftp.kernel.org/pub/linux/kernel/people/mbligh/config/abat/power4)

http://test.kernel.org/24165/debug/console.log

STAFProc version 2.6.2 initialized
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=32 NUMA PSERIES LPAR
Modules linked in:
NIP: C000000000064748 LR: C000000000064764 CTR: C0000000000A8B10
REGS: c00000077dfaad30 TRAP: 0300 Not tainted (2.6.16-rc5-mm1-autokern1)
MSR: 8000000000009032 CR: 28000488 XER: 00000000
DAR: 0000000000000000, DSISR: 0000000040000000
TASK = c00000076a1ac720[11058] 'mingetty' THREAD: c00000077dfa8000 CPU: 1
GPR00: 0000000000000007 C00000077DFAAFB0 C000000000644F70 C00000076F303F08
GPR04: C00000076D478E00 0000000000000000 C00000076D478E00 0000000000000001
GPR08: 000000000000000A 0000000000000000 00000000000001AA C00000076F303F08
GPR12: 0000000048000442 C000000000548B80 0000000010060000 0000000010060000
GPR16: 0000000010060000 0000000010080000 0000000010080000 0000000010060000
GPR20: 0000000010010000 0000000000000001 0000000000000000 0000000000000001
GPR24: 0000000010003738 C00000077255F010 0000000000000001 0000000000000000
GPR28: 0000000000000007 0000000000000000 C00000000055F4A8 C0000000041F82A0
NIP [C000000000064748] .__rcu_process_callbacks+0x1fc/0x2f8
LR [C000000000064764] .__rcu_process_callbacks+0x218/0x2f8
Call Trace:
[C00000077DFAAFB0] [C000000000064764] .__rcu_process_callbacks+0x218/0x2f8 (unreliable)
[C00000077DFAB040] [C000000000064874] .rcu_process_callbacks+0x30/0x58
[C00000077DFAB0C0] [C000000000053848] .tasklet_action+0xe4/0x19c
[C00000077DFAB160] [C000000000053EA8] .__do_softirq+0x9c/0x16c
[C00000077DFAB200] [C00000000000B6C8] .do_softirq+0x74/0xac
[C00000077DFAB280] [C000000000054388] .irq_exit+0x64/0x7c
[C00000077DFAB300] [C00000000001E0AC] .timer_interrupt+0x460/0x48c
[C00000077DFAB3E0] [C0000000000034DC] decrementer_common+0xdc/0x100
--- Exception: 901 at ._atomic_dec_and_lock+0x3c/0xb8 LR = .mntput_no_expire+0x30/0xcc
[C00000077DFAB6D0] [C0000007680CF438] 0xc0000007680cf438 (unreliable)
[C00000077DFAB750] [C0000000000CD6D0] .mntput_no_expire+0x30/0xcc
[C00000077DFAB7E0] [C0000000000B9F40] .path_release+0x44/0x5c
[C00000077DFAB870] [C0000000000ED840] .proc_pid_follow_link+0x34/0xf0
[C00000077DFAB900] [C0000000000BD01C] .__link_path_walk+0xe64/0x1394
[C00000077DFAB9E0] [C0000000000BD5DC] .link_path_walk+0x90/0x168
[C00000077DFABAE0] [C0000000000BDE28] .do_path_lookup+0x2fc/0x364
[C00000077DFABB90] [C0000000000BF4E0] .__user_walk_fd+0x68/0xa8
[C00000077DFABC30] [C0000000000B5578] .vfs_stat_fd+0x24/0x70
[C00000077DFABD30] [C0000000000B56BC] .sys_stat64+0x1c/0x50
[C00000077DFABE30] [C00000000000871C] syscall_exit+0x0/0x40
Instruction dump:
38000000 901d0080 e87f0040 2fa30000 419e00fc 7c6b1b78 3b800000 ebab0000
7d635b78 fbbf0040 60000000 e92b0008 f8410028 60000000 e9690010
<0>Kernel panic - not syncing: Fatal exception in interrupt
smp_call_function on cpu 1: other cpus not responding (1)
-- 0:conmux-control -- time-stamp -- Feb/28/06 5:08:52 --

From huangjq at cn.ibm.com Wed Mar 1 18:26:42 2006
From: huangjq at cn.ibm.com (Jin Qi Huang)
Date: Wed, 1 Mar 2006 15:26:42 +0800
Subject: Kernel oops then panic when perform a soft reset on ppc64 box
In-Reply-To: <17401.9075.295712.950980@cargo.ozlabs.ibm.com>
Message-ID: 

I have found some information from the IBM website:

1. The state of the processor after taking the Soft Reset exception is
unremarkable, because SRESET# merely causes an exception.

2. The SRESET# pin only causes the processor to take the System Reset
exception.
This makes me doubt the behaviour: since a soft reset only causes a System
Reset exception and the state of the processor is unremarkable, why does
the Linux exception handler let the system die?

Thanks for your reply!

--
Regards,

Paul Mackerras 
2006-02-20 10:03

To
Jin Qi Huang/China/Contr/IBM at IBMCN
cc
linuxppc64-dev at ozlabs.org
Subject
Re: Kernel oops then panic when perform a soft reset on ppc64 box

Jin Qi Huang writes:

> When I perform a soft reset on HMC console to a ppc64 box, the kernel oops
> then panic, here is the procedure to reproduce it:

That's normal, what did you expect it to do?

Paul.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060301/8e671e00/attachment.htm

From olof at lixom.net Thu Mar 2 03:45:31 2006
From: olof at lixom.net (Olof Johansson)
Date: Wed, 1 Mar 2006 10:45:31 -0600
Subject: [PATCH] Fix powerpc bad_page_fault output (Re: 2.6.16-rc5-mm1)
In-Reply-To: <4404E328.7070807@mbligh.org>
References: <20060228042439.43e6ef41.akpm@osdl.org> <4404E328.7070807@mbligh.org>
Message-ID: <20060301164531.GA17755@pb15.lixom.net>

On Tue, Feb 28, 2006 at 03:56:24PM -0800, Martin Bligh wrote:
> Andrew Morton wrote:
> >ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc5/2.6.16-rc5-mm1/
> 
> New panic on IBM power4 lpar of P690. 2.6.16-rc5-git3 is OK.
> 
> (config: 
> http://ftp.kernel.org/pub/linux/kernel/people/mbligh/config/abat/power4)
> 
> http://test.kernel.org/24165/debug/console.log

For what it's worth, this is a NULL pointer dereference in the RCU
code.

Seems that the human-readable parts are printed at a different printk level
(well, _at_ a level), so they fell off. Not good.

Andrew and/or Paulus, see patch below.


Thanks,

Olof


---

It seems that the die() output is printk'd without any printk level,
so some distros will log the register dumps and the human-readable
format differently.

(I.e. see http://test.kernel.org/24165/debug/console.log, which lacks
the KERN_ALERT parts)

Changing the die() output to include a level will likely confuse users
that currently rely on getting the output where they're getting it,
so instead remove it from the bad_page_fault() output.

Signed-off-by: Olof Johansson 

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index ec4adcb..fee050a 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -389,7 +389,7 @@ void bad_page_fault(struct pt_regs *regs
 
 	/* kernel has accessed a bad area */
 
-	printk(KERN_ALERT "Unable to handle kernel paging request for ");
+	printk("Unable to handle kernel paging request for ");
 	switch (regs->trap) {
 	case 0x300:
 	case 0x380:
@@ -402,8 +402,7 @@ void bad_page_fault(struct pt_regs *regs
 	default:
 		printk("unknown fault\n");
 	}
-	printk(KERN_ALERT "Faulting instruction address: 0x%08lx\n",
-		regs->nip);
+	printk("Faulting instruction address: 0x%08lx\n", regs->nip);
 
 	die("Kernel access of bad area", regs, sig);
 }

From greg at kroah.com Thu Mar 2 08:46:00 2006
From: greg at kroah.com (Greg KH)
Date: Wed, 1 Mar 2006 13:46:00 -0800
Subject: fix build breakage in eeh.c in 2.6.16-rc5-git5
Message-ID: <20060301214600.GA17702@kroah.com>

This patch should fix a problem with eeh_add_device_late() not being
defined in the ppc64 build process, causing the build to break.
Signed-off-by: Greg Kroah-Hartman --- include/asm-powerpc/eeh.h | 1 + 1 files changed, 1 insertion(+) --- linux-2.6.15.orig/include/asm-powerpc/eeh.h 2006-03-01 11:30:19.000000000 -0800 +++ linux-2.6.15/include/asm-powerpc/eeh.h 2006-03-01 12:04:25.000000000 -0800 @@ -61,6 +61,7 @@ * to finish the eeh setup for this device. */ void eeh_add_device_early(struct device_node *); +void eeh_add_device_late(struct pci_dev *dev); void eeh_add_device_tree_early(struct device_node *); void eeh_add_device_tree_late(struct pci_bus *); From paulus at samba.org Thu Mar 2 08:54:56 2006 From: paulus at samba.org (Paul Mackerras) Date: Thu, 2 Mar 2006 08:54:56 +1100 Subject: fix build breakage in eeh.c in 2.6.16-rc5-git5 In-Reply-To: <20060301214600.GA17702@kroah.com> References: <20060301214600.GA17702@kroah.com> Message-ID: <17414.6192.426294.502401@cargo.ozlabs.ibm.com> Greg KH writes: > This patch should fixe a problem with eeh_add_device_late() not being > defined in the ppc64 build process, causing the build to break. John Rose just sent a patch making eeh_add_device_late static and moving it to be defined before it is called in arch/powerpc/platforms/pseries/eeh.c. Since he maintains this stuff, I'm more inclined to take his patch. Paul. From greg at kroah.com Thu Mar 2 09:03:28 2006 From: greg at kroah.com (Greg KH) Date: Wed, 1 Mar 2006 14:03:28 -0800 Subject: fix build breakage in eeh.c in 2.6.16-rc5-git5 In-Reply-To: <17414.6192.426294.502401@cargo.ozlabs.ibm.com> References: <20060301214600.GA17702@kroah.com> <17414.6192.426294.502401@cargo.ozlabs.ibm.com> Message-ID: <20060301220328.GB7354@kroah.com> On Thu, Mar 02, 2006 at 08:54:56AM +1100, Paul Mackerras wrote: > Greg KH writes: > > > This patch should fixe a problem with eeh_add_device_late() not being > > defined in the ppc64 build process, causing the build to break. > > John Rose just sent a patch making eeh_add_device_late static and > moving it to be defined before it is called in > arch/powerpc/platforms/pseries/eeh.c. > > Since he maintains this stuff, I'm more inclined to take his patch. That's fine with me, as long as it makes it into 2.6.16-final :) thanks, greg k-h From greg at kroah.com Thu Mar 2 09:15:40 2006 From: greg at kroah.com (Greg KH) Date: Wed, 1 Mar 2006 14:15:40 -0800 Subject: fix build breakage in eeh.c in 2.6.16-rc5-git5 In-Reply-To: <20060301220328.GB7354@kroah.com> References: <20060301214600.GA17702@kroah.com> <17414.6192.426294.502401@cargo.ozlabs.ibm.com> <20060301220328.GB7354@kroah.com> Message-ID: <20060301221540.GA9638@kroah.com> On Wed, Mar 01, 2006 at 02:03:28PM -0800, Greg KH wrote: > On Thu, Mar 02, 2006 at 08:54:56AM +1100, Paul Mackerras wrote: > > Greg KH writes: > > > > > This patch should fixe a problem with eeh_add_device_late() not being > > > defined in the ppc64 build process, causing the build to break. > > > > John Rose just sent a patch making eeh_add_device_late static and > > moving it to be defined before it is called in > > arch/powerpc/platforms/pseries/eeh.c. > > > > Since he maintains this stuff, I'm more inclined to take his patch. > > That's fine with me, as long as it makes it into 2.6.16-final :) Hm, looks like my fix made it into Linus's tree, so you might want to send him the "correct" way to do this against that. thanks, greg k-h From paulmck at us.ibm.com Thu Mar 2 11:09:36 2006 From: paulmck at us.ibm.com (Paul E. 
McKenney) Date: Wed, 1 Mar 2006 16:09:36 -0800 Subject: [PATCH] Fix powerpc bad_page_fault output (Re: 2.6.16-rc5-mm1) In-Reply-To: <20060301164531.GA17755@pb15.lixom.net> References: <20060228042439.43e6ef41.akpm@osdl.org> <4404E328.7070807@mbligh.org> <20060301164531.GA17755@pb15.lixom.net> Message-ID: <20060302000936.GE1296@us.ibm.com> On Wed, Mar 01, 2006 at 10:45:31AM -0600, Olof Johansson wrote: > On Tue, Feb 28, 2006 at 03:56:24PM -0800, Martin Bligh wrote: > > Andrew Morton wrote: > > >ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc5/2.6.16-rc5-mm1/ > > > > New panic on IBM power4 lpar of P690. 2.6.16-rc5-git3 is OK. > > > > (config: > > http://ftp.kernel.org/pub/linux/kernel/people/mbligh/config/abat/power4) > > > > http://test.kernel.org/24165/debug/console.log > > For what it's worth, this is a NULL pointer dereference in the RCU > code. And in an area where it is tougher than usual to blame the problem on a broken use of RCU, as well. ;-) The "rcp" argument to __rcu_process_callbacks() is C00000076F303F08 and "rdp" is C00000076F303F08, or am I mis-remembering the POWER ABI? Thanx, Paul > Seems that the human-readible parts are printed at a differnet printk level > (well, _at_ a level), so they fell off. Not good. > > Andrew and/or Paulus, see patch below. > > > Thanks, > > Olof > > > --- > > It seems that the die() output is printk'd without any prink level, > so some distros will log the register dumps and the human readible > format differently. > > (I.e. see http://test.kernel.org/24165/debug/console.log, which lacks > the KERN_ALERT parts) > > Changing the die() output to include a level will likely confuse users > that currently rely on getting the output where they're getting it, > so instead remove it from the bad_page_fault() output. > > Signed-off-by: Olof Johansson > > > diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c > index ec4adcb..fee050a 100644 > --- a/arch/powerpc/mm/fault.c > +++ b/arch/powerpc/mm/fault.c > @@ -389,7 +389,7 @@ void bad_page_fault(struct pt_regs *regs > > /* kernel has accessed a bad area */ > > - printk(KERN_ALERT "Unable to handle kernel paging request for "); > + printk("Unable to handle kernel paging request for "); > switch (regs->trap) { > case 0x300: > case 0x380: > @@ -402,8 +402,7 @@ void bad_page_fault(struct pt_regs *regs > default: > printk("unknown fault\n"); > } > - printk(KERN_ALERT "Faulting instruction address: 0x%08lx\n", > - regs->nip); > + printk("Faulting instruction address: 0x%08lx\n", regs->nip); > > die("Kernel access of bad area", regs, sig); > } > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > From paulus at samba.org Thu Mar 2 11:35:18 2006 From: paulus at samba.org (Paul Mackerras) Date: Thu, 2 Mar 2006 11:35:18 +1100 Subject: [PATCH] Fix powerpc bad_page_fault output (Re: 2.6.16-rc5-mm1) In-Reply-To: <20060301164531.GA17755@pb15.lixom.net> References: <20060228042439.43e6ef41.akpm@osdl.org> <4404E328.7070807@mbligh.org> <20060301164531.GA17755@pb15.lixom.net> Message-ID: <17414.15814.146349.883153@cargo.ozlabs.ibm.com> Olof Johansson writes: > Seems that the human-readible parts are printed at a differnet printk level > (well, _at_ a level), so they fell off. Not good. My understanding was that printk lines without a level are considered to be at KERN_ERR or so. 
Is that wrong? > Andrew and/or Paulus, see patch below. It really seems strange to be *removing* printk level tags. I'd like to nack this until I understand why it will improve things. At the very least it needs a big fat comment so some janitor doesn't come along and put the tags back in. Paul. From mbligh at mbligh.org Thu Mar 2 12:14:21 2006 From: mbligh at mbligh.org (Martin Bligh) Date: Wed, 01 Mar 2006 17:14:21 -0800 Subject: [PATCH] Fix powerpc bad_page_fault output (Re: 2.6.16-rc5-mm1) In-Reply-To: <17414.15814.146349.883153@cargo.ozlabs.ibm.com> References: <20060228042439.43e6ef41.akpm@osdl.org> <4404E328.7070807@mbligh.org> <20060301164531.GA17755@pb15.lixom.net> <17414.15814.146349.883153@cargo.ozlabs.ibm.com> Message-ID: <440646ED.2030108@mbligh.org> Paul Mackerras wrote: > Olof Johansson writes: > > >>Seems that the human-readible parts are printed at a differnet printk level >>(well, _at_ a level), so they fell off. Not good. > > > My understanding was that printk lines without a level are considered > to be at KERN_ERR or so. Is that wrong? > > >>Andrew and/or Paulus, see patch below. > > > It really seems strange to be *removing* printk level tags. I'd like > to nack this until I understand why it will improve things. At the > very least it needs a big fat comment so some janitor doesn't come > along and put the tags back in. He's removing KERN_ALERT ... I guess it could get switched from KERN_ALERT to KERN_ERR, but ... Either way, KERN_ALERT seems way too low to me. I object to getting half the oops, and not the other half ;-) M. From olof at lixom.net Thu Mar 2 13:22:44 2006 From: olof at lixom.net (Olof Johansson) Date: Wed, 1 Mar 2006 20:22:44 -0600 Subject: [PATCH] Fix powerpc bad_page_fault output (Re: 2.6.16-rc5-mm1) In-Reply-To: <440646ED.2030108@mbligh.org> References: <20060228042439.43e6ef41.akpm@osdl.org> <4404E328.7070807@mbligh.org> <20060301164531.GA17755@pb15.lixom.net> <17414.15814.146349.883153@cargo.ozlabs.ibm.com> <440646ED.2030108@mbligh.org> Message-ID: <20060302022244.GB17755@pb15.lixom.net> On Wed, Mar 01, 2006 at 05:14:21PM -0800, Martin Bligh wrote: > He's removing KERN_ALERT ... I guess it could get switched from > KERN_ALERT to KERN_ERR, but ... > > Either way, KERN_ALERT seems way too low to me. I object to getting > half the oops, and not the other half ;-) Right. The new printk's were added recently, and I took the KERN_ALERT level from the x86 code then without double-checking what die() uses. I guess I could move the die() output over instead, or move them both to KERN_ERR. -Olof From paulus at samba.org Thu Mar 2 16:16:30 2006 From: paulus at samba.org (Paul Mackerras) Date: Thu, 2 Mar 2006 16:16:30 +1100 Subject: [PATCH] Fix powerpc bad_page_fault output (Re: 2.6.16-rc5-mm1) In-Reply-To: <440646ED.2030108@mbligh.org> References: <20060228042439.43e6ef41.akpm@osdl.org> <4404E328.7070807@mbligh.org> <20060301164531.GA17755@pb15.lixom.net> <17414.15814.146349.883153@cargo.ozlabs.ibm.com> <440646ED.2030108@mbligh.org> Message-ID: <17414.32686.589133.160989@cargo.ozlabs.ibm.com> Martin Bligh writes: > He's removing KERN_ALERT ... I guess it could get switched from > KERN_ALERT to KERN_ERR, but ... > > Either way, KERN_ALERT seems way too low to me. I object to getting > half the oops, and not the other half ;-) KERN_ALERT is two steps higher in priority (lower number) than KERN_ERR. Why on earth would we see KERN_ERR messages but not KERN_ALERT messages? In fact die() should probably be using KERN_EMERG. 
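For reference, a minimal sketch of the levels in question — the macro
values below are the 2.6-era ones from include/linux/kernel.h (a lower
number is a *higher* priority), and the two printk() calls show the two
code paths this thread is arguing about:

	#define KERN_EMERG	"<0>"	/* system is unusable */
	#define KERN_ALERT	"<1>"	/* action must be taken immediately */
	#define KERN_CRIT	"<2>"	/* critical conditions */
	#define KERN_ERR	"<3>"	/* error conditions */
	#define KERN_WARNING	"<4>"	/* warning conditions */

	/* bad_page_fault() tagged its lines explicitly: */
	printk(KERN_ALERT "Unable to handle kernel paging request for ");

	/* ...whereas the die() output is emitted bare, so it picks up the
	 * default message loglevel (see below): */
	printk("Oops: Kernel access of bad area, sig: 11 [#1]\n");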
Messages without a loglevel are by default logged at KERN_WARNING level, one step lower in priority than KERN_ERR. This all sounds to me like there is something wacky going on somewhere, and we need to get to the bottom of it rather than just remove printk tags. Paul. From anton at samba.org Thu Mar 2 16:24:29 2006 From: anton at samba.org (Anton Blanchard) Date: Thu, 2 Mar 2006 16:24:29 +1100 Subject: [PATCH] Fix powerpc bad_page_fault output (Re: 2.6.16-rc5-mm1) In-Reply-To: <20060302022244.GB17755@pb15.lixom.net> References: <20060228042439.43e6ef41.akpm@osdl.org> <4404E328.7070807@mbligh.org> <20060301164531.GA17755@pb15.lixom.net> <17414.15814.146349.883153@cargo.ozlabs.ibm.com> <440646ED.2030108@mbligh.org> <20060302022244.GB17755@pb15.lixom.net> Message-ID: <20060302052428.GF5552@krispykreme> > Right. The new printk's were added recently, and I took the KERN_ALERT > level from the x86 code then without double-checking what die() uses. I > guess I could move the die() output over instead, or move them both to > KERN_ERR. I just noticed x86 can now pass the log level around via show_trace_log_lvl and show_stack_log_lvl. Something we might want to add so we can KERN_EMERG the whole oops. Anton From santil at us.ibm.com Fri Mar 3 06:40:18 2006 From: santil at us.ibm.com (Santiago Leon) Date: Thu, 02 Mar 2006 13:40:18 -0600 Subject: [PATCH] powerpc: ibmveth: Harden driver initilisation for kexec In-Reply-To: <20060131042903.GF28896@krispykreme> References: <20060131041055.5623C68A46@ozlabs.org> <20060131042903.GF28896@krispykreme> Message-ID: <44074A22.8060705@us.ibm.com> From: Michael Ellerman After a kexec the veth driver will fail when trying to register with the Hypervisor because the previous kernel has not unregistered. So if the registration fails, we unregister and then try again. Signed-off-by: Michael Ellerman Acked-by: Anton Blanchard Signed-off-by: Santiago Leon --- drivers/net/ibmveth.c | 32 ++++++++++++++++++++++++++------ 1 files changed, 26 insertions(+), 6 deletions(-) Looks good to me, and has been around for a couple of months. Index: kexec/drivers/net/ibmveth.c =================================================================== --- kexec.orig/drivers/net/ibmveth.c +++ kexec/drivers/net/ibmveth.c @@ -436,6 +436,31 @@ static void ibmveth_cleanup(struct ibmve ibmveth_free_buffer_pool(adapter, &adapter->rx_buff_pool[i]); } +static int ibmveth_register_logical_lan(struct ibmveth_adapter *adapter, + union ibmveth_buf_desc rxq_desc, u64 mac_address) +{ + int rc, try_again = 1; + + /* After a kexec the adapter will still be open, so our attempt to + * open it will fail. So if we get a failure we free the adapter and + * try again, but only once. 
*/ +retry: + rc = h_register_logical_lan(adapter->vdev->unit_address, + adapter->buffer_list_dma, rxq_desc.desc, + adapter->filter_list_dma, mac_address); + + if (rc != H_Success && try_again) { + do { + rc = h_free_logical_lan(adapter->vdev->unit_address); + } while (H_isLongBusy(rc) || (rc == H_Busy)); + + try_again = 0; + goto retry; + } + + return rc; +} + static int ibmveth_open(struct net_device *netdev) { struct ibmveth_adapter *adapter = netdev->priv; @@ -504,12 +529,7 @@ static int ibmveth_open(struct net_devic ibmveth_debug_printk("filter list @ 0x%p\n", adapter->filter_list_addr); ibmveth_debug_printk("receive q @ 0x%p\n", adapter->rx_queue.queue_addr); - - lpar_rc = h_register_logical_lan(adapter->vdev->unit_address, - adapter->buffer_list_dma, - rxq_desc.desc, - adapter->filter_list_dma, - mac_address); + lpar_rc = ibmveth_register_logical_lan(adapter, rxq_desc, mac_address); if(lpar_rc != H_Success) { ibmveth_error_printk("h_register_logical_lan failed with %ld\n", lpar_rc); From michael at ellerman.id.au Fri Mar 3 11:22:45 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Fri, 3 Mar 2006 11:22:45 +1100 Subject: [PATCH] powerpc: ibmveth: Harden driver initilisation for kexec In-Reply-To: <44074A22.8060705@us.ibm.com> References: <20060131041055.5623C68A46@ozlabs.org> <20060131042903.GF28896@krispykreme> <44074A22.8060705@us.ibm.com> Message-ID: <200603031122.51174.michael@ellerman.id.au> Hi Jeff, I realise it's late, but it'd be really good if you could send this up for 2.6.16, we're hosed without it. cheers On Fri, 3 Mar 2006 06:40, Santiago Leon wrote: > From: Michael Ellerman > > After a kexec the veth driver will fail when trying to register with the > Hypervisor because the previous kernel has not unregistered. > > So if the registration fails, we unregister and then try again. > > Signed-off-by: Michael Ellerman > Acked-by: Anton Blanchard > Signed-off-by: Santiago Leon > --- > > drivers/net/ibmveth.c | 32 ++++++++++++++++++++++++++------ > 1 files changed, 26 insertions(+), 6 deletions(-) > > Looks good to me, and has been around for a couple of months. > > Index: kexec/drivers/net/ibmveth.c > =================================================================== > --- kexec.orig/drivers/net/ibmveth.c > +++ kexec/drivers/net/ibmveth.c > @@ -436,6 +436,31 @@ static void ibmveth_cleanup(struct ibmve > ibmveth_free_buffer_pool(adapter, &adapter->rx_buff_pool[i]); > } > > +static int ibmveth_register_logical_lan(struct ibmveth_adapter *adapter, > + union ibmveth_buf_desc rxq_desc, u64 mac_address) > +{ > + int rc, try_again = 1; > + > + /* After a kexec the adapter will still be open, so our attempt to > + * open it will fail. So if we get a failure we free the adapter and > + * try again, but only once. 
*/
> +retry:
> +	rc = h_register_logical_lan(adapter->vdev->unit_address,
> +			adapter->buffer_list_dma, rxq_desc.desc,
> +			adapter->filter_list_dma, mac_address);
> +
> +	if (rc != H_Success && try_again) {
> +		do {
> +			rc = h_free_logical_lan(adapter->vdev->unit_address);
> +		} while (H_isLongBusy(rc) || (rc == H_Busy));
> +
> +		try_again = 0;
> +		goto retry;
> +	}
> +
> +	return rc;
> +}
> +
>  static int ibmveth_open(struct net_device *netdev)
>  {
>  	struct ibmveth_adapter *adapter = netdev->priv;
> @@ -504,12 +529,7 @@ static int ibmveth_open(struct net_devic
>  	ibmveth_debug_printk("filter list @ 0x%p\n", adapter->filter_list_addr);
>  	ibmveth_debug_printk("receive q @ 0x%p\n", adapter->rx_queue.queue_addr);
> 
> -
> -	lpar_rc = h_register_logical_lan(adapter->vdev->unit_address,
> -					 adapter->buffer_list_dma,
> -					 rxq_desc.desc,
> -					 adapter->filter_list_dma,
> -					 mac_address);
> +	lpar_rc = ibmveth_register_logical_lan(adapter, rxq_desc, mac_address);
> 
>  	if(lpar_rc != H_Success) {
>  		ibmveth_error_printk("h_register_logical_lan failed with %ld\n",
>  				     lpar_rc);
> 
> 
> 
> _______________________________________________
> Linuxppc64-dev mailing list
> Linuxppc64-dev at ozlabs.org
> https://ozlabs.org/mailman/listinfo/linuxppc64-dev

-- 
Michael Ellerman
IBM OzLabs

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060303/c8e585b2/attachment.pgp

From rdunlap at xenotime.net Fri Mar 3 11:34:23 2006
From: rdunlap at xenotime.net (Randy.Dunlap)
Date: Thu, 2 Mar 2006 16:34:23 -0800
Subject: [PATCH] powerpc: ibmveth: Harden driver initilisation for kexec
In-Reply-To: <200603031122.51174.michael@ellerman.id.au>
References: <20060131041055.5623C68A46@ozlabs.org>
	<20060131042903.GF28896@krispykreme>
	<44074A22.8060705@us.ibm.com>
	<200603031122.51174.michael@ellerman.id.au>
Message-ID: <20060302163423.f758c5bc.rdunlap@xenotime.net>

On Fri, 3 Mar 2006 11:22:45 +1100 Michael Ellerman wrote:

> Hi Jeff,
> 
> I realise it's late, but it'd be really good if you could send this up for
> 2.6.16, we're hosed without it.

I'm wondering if this means that for every virtual/hypervisor
situation, we have to modify any $interested_drivers.
Why wouldn't we come up with a cleaner solution (in the long term)?

E.g., could the hypervisor know when one of its virtual OSes
dies or reboots and release its resources then?

This patch just looks like a short-term solution to me.

> cheers
> 
> On Fri, 3 Mar 2006 06:40, Santiago Leon wrote:
> > From: Michael Ellerman 
> >
> > After a kexec the veth driver will fail when trying to register with the
> > Hypervisor because the previous kernel has not unregistered.
> >
> > So if the registration fails, we unregister and then try again.
> >
> > Signed-off-by: Michael Ellerman 
> > Acked-by: Anton Blanchard 
> > Signed-off-by: Santiago Leon 
> > ---
> >
> >  drivers/net/ibmveth.c |   32 ++++++++++++++++++++++++++------
> >  1 files changed, 26 insertions(+), 6 deletions(-)
> >
> > Looks good to me, and has been around for a couple of months.
> > > > Index: kexec/drivers/net/ibmveth.c > > =================================================================== > > --- kexec.orig/drivers/net/ibmveth.c > > +++ kexec/drivers/net/ibmveth.c > > @@ -436,6 +436,31 @@ static void ibmveth_cleanup(struct ibmve > > ibmveth_free_buffer_pool(adapter, &adapter->rx_buff_pool[i]); > > } > > > > +static int ibmveth_register_logical_lan(struct ibmveth_adapter *adapter, > > + union ibmveth_buf_desc rxq_desc, u64 mac_address) > > +{ > > + int rc, try_again = 1; > > + > > + /* After a kexec the adapter will still be open, so our attempt to > > + * open it will fail. So if we get a failure we free the adapter and > > + * try again, but only once. */ > > +retry: > > + rc = h_register_logical_lan(adapter->vdev->unit_address, > > + adapter->buffer_list_dma, rxq_desc.desc, > > + adapter->filter_list_dma, mac_address); > > + > > + if (rc != H_Success && try_again) { > > + do { > > + rc = h_free_logical_lan(adapter->vdev->unit_address); > > + } while (H_isLongBusy(rc) || (rc == H_Busy)); > > + > > + try_again = 0; > > + goto retry; > > + } > > + > > + return rc; > > +} > > + > > static int ibmveth_open(struct net_device *netdev) > > { > > struct ibmveth_adapter *adapter = netdev->priv; > > @@ -504,12 +529,7 @@ static int ibmveth_open(struct net_devic > > ibmveth_debug_printk("filter list @ 0x%p\n", adapter->filter_list_addr); > > ibmveth_debug_printk("receive q @ 0x%p\n", > > adapter->rx_queue.queue_addr); > > > > - > > - lpar_rc = h_register_logical_lan(adapter->vdev->unit_address, > > - adapter->buffer_list_dma, > > - rxq_desc.desc, > > - adapter->filter_list_dma, > > - mac_address); > > + lpar_rc = ibmveth_register_logical_lan(adapter, rxq_desc, mac_address); > > > > if(lpar_rc != H_Success) { > > ibmveth_error_printk("h_register_logical_lan failed with %ld\n", > > lpar_rc); > > > > > > > > _______________________________________________ > > Linuxppc64-dev mailing list > > Linuxppc64-dev at ozlabs.org > > https://ozlabs.org/mailman/listinfo/linuxppc64-dev > > -- > Michael Ellerman > IBM OzLabs --- ~Randy From paulus at samba.org Fri Mar 3 12:00:54 2006 From: paulus at samba.org (Paul Mackerras) Date: Fri, 3 Mar 2006 12:00:54 +1100 Subject: [PATCH] powerpc: ibmveth: Harden driver initilisation for kexec In-Reply-To: <20060302163423.f758c5bc.rdunlap@xenotime.net> References: <20060131041055.5623C68A46@ozlabs.org> <20060131042903.GF28896@krispykreme> <44074A22.8060705@us.ibm.com> <200603031122.51174.michael@ellerman.id.au> <20060302163423.f758c5bc.rdunlap@xenotime.net> Message-ID: <17415.38214.77398.803632@cargo.ozlabs.ibm.com> Randy.Dunlap writes: > E.g., could the hypervisor know when one of it's virtual OSes > dies or reboots and release its resources then? I think the point is that with kexec, the same virtual machine keeps running, so the hypervisor doesn't see the OS dying or rebooting. Paul. 
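Reduced to a shape independent of ibmveth, the workaround being discussed
is a minimal sketch like the following, where claim() and release() are
hypothetical stand-ins for paired hypervisor calls such as
h_register_logical_lan() and h_free_logical_lan():

	/* After a kexec the hypervisor-side resource claimed by the
	 * previous kernel is still live, so the first claim fails;
	 * release the stale claim and retry exactly once. */
	static int claim_with_stale_retry(int (*claim)(void *),
					  int (*release)(void *),
					  void *res)
	{
		int rc = claim(res);

		if (rc != 0) {
			release(res);
			rc = claim(res);
		}
		return rc;
	}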
From michael at ellerman.id.au Fri Mar 3 12:10:47 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Fri, 3 Mar 2006 12:10:47 +1100 Subject: [PATCH] powerpc: ibmveth: Harden driver initilisation for kexec In-Reply-To: <20060302163423.f758c5bc.rdunlap@xenotime.net> References: <20060131041055.5623C68A46@ozlabs.org> <200603031122.51174.michael@ellerman.id.au> <20060302163423.f758c5bc.rdunlap@xenotime.net> Message-ID: <200603031210.53220.michael@ellerman.id.au> On Fri, 3 Mar 2006 11:34, Randy.Dunlap wrote: > On Fri, 3 Mar 2006 11:22:45 +1100 Michael Ellerman wrote: > > Hi Jeff, > > > > I realise it's late, but it'd be really good if you could send this up > > for 2.6.16, we're hosed without it. > > I'm wondering if this means that for every virtual/hypervisor > situation, we have to modify any $interested_drivers. > Why wouldn't we come up with a cleaner solution (in the long term)? > > E.g., could the hypervisor know when one of it's virtual OSes > dies or reboots and release its resources then? It does exactly that for a regular reboot, but when we kexec we _don't_ die or reboot, as far as the Hypervisor is concerned it's all systems go. It's something of a double-edged sword, we're totally in control which gives us lots of flexibility, and _fast_ reboot times, but we also have to do a bit of extra stuff (ie. this patch) to keep things sane. cheers -- Michael Ellerman IBM OzLabs wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060303/911a8749/attachment.pgp From jgarzik at pobox.com Fri Mar 3 12:04:39 2006 From: jgarzik at pobox.com (Jeff Garzik) Date: Thu, 02 Mar 2006 20:04:39 -0500 Subject: [PATCH] powerpc: ibmveth: Harden driver initilisation for kexec In-Reply-To: <44074A22.8060705@us.ibm.com> References: <20060131041055.5623C68A46@ozlabs.org> <20060131042903.GF28896@krispykreme> <44074A22.8060705@us.ibm.com> Message-ID: <44079627.6070100@pobox.com> Santiago Leon wrote: > From: Michael Ellerman > > After a kexec the veth driver will fail when trying to register with the > Hypervisor because the previous kernel has not unregistered. > > So if the registration fails, we unregister and then try again. > > Signed-off-by: Michael Ellerman > Acked-by: Anton Blanchard > Signed-off-by: Santiago Leon > --- > > drivers/net/ibmveth.c | 32 ++++++++++++++++++++++++++------ > 1 files changed, 26 insertions(+), 6 deletions(-) > > Looks good to me, and has been around for a couple of months. This seems completely bonkers to me: are resources available? if no free resources try again It makes resource checking pointless. 
Jeff

From michael at ellerman.id.au Fri Mar 3 13:11:56 2006
From: michael at ellerman.id.au (Michael Ellerman)
Date: Fri, 3 Mar 2006 13:11:56 +1100
Subject: [PATCH] powerpc: ibmveth: Harden driver initilisation for kexec
In-Reply-To: <44079627.6070100@pobox.com>
References: <20060131041055.5623C68A46@ozlabs.org>
	<44074A22.8060705@us.ibm.com>
	<44079627.6070100@pobox.com>
Message-ID: <200603031312.00787.michael@ellerman.id.au>

On Fri, 3 Mar 2006 12:04, Jeff Garzik wrote:
> Santiago Leon wrote:
> > From: Michael Ellerman 
> >
> > After a kexec the veth driver will fail when trying to register with the
> > Hypervisor because the previous kernel has not unregistered.
> >
> > So if the registration fails, we unregister and then try again.
> >
> > Signed-off-by: Michael Ellerman 
> > Acked-by: Anton Blanchard 
> > Signed-off-by: Santiago Leon 
> > ---
> >
> >  drivers/net/ibmveth.c |   32 ++++++++++++++++++++++++++------
> >  1 files changed, 26 insertions(+), 6 deletions(-)
> >
> > Looks good to me, and has been around for a couple of months.
>
> This seems completely bonkers to me:
>
> 	are resources available?
> 	if no
> 		free resources
> 		try again

I'm not sure I follow, are you suggesting we do the h_free_logical_lan()
unconditionally, followed by h_register_logical_lan() ??

If that's what you mean, I didn't do it that way because it would affect the
normal code path. This patch only modifies the behaviour if we fail to
register the adapter. I'm much more comfortable changing the failure case
than the default.

cheers

-- 
Michael Ellerman
IBM OzLabs

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060303/b9cd998a/attachment.pgp

From rdunlap at xenotime.net Fri Mar 3 15:12:42 2006
From: rdunlap at xenotime.net (Randy.Dunlap)
Date: Thu, 2 Mar 2006 20:12:42 -0800
Subject: [PATCH] powerpc: ibmveth: Harden driver initilisation for kexec
In-Reply-To: <200603031210.53220.michael@ellerman.id.au>
References: <20060131041055.5623C68A46@ozlabs.org>
	<200603031122.51174.michael@ellerman.id.au>
	<20060302163423.f758c5bc.rdunlap@xenotime.net>
	<200603031210.53220.michael@ellerman.id.au>
Message-ID: <20060302201242.b688f811.rdunlap@xenotime.net>

On Fri, 3 Mar 2006 12:10:47 +1100 Michael Ellerman wrote:

> On Fri, 3 Mar 2006 11:34, Randy.Dunlap wrote:
> > On Fri, 3 Mar 2006 11:22:45 +1100 Michael Ellerman wrote:
> > > Hi Jeff,
> > >
> > > I realise it's late, but it'd be really good if you could send this up
> > > for 2.6.16, we're hosed without it.
> >
> > I'm wondering if this means that for every virtual/hypervisor
> > situation, we have to modify any $interested_drivers.
> > Why wouldn't we come up with a cleaner solution (in the long term)?
> >
> > E.g., could the hypervisor know when one of it's virtual OSes
> > dies or reboots and release its resources then?
> 
> It does exactly that for a regular reboot, but when we kexec we _don't_ die or
> reboot, as far as the Hypervisor is concerned it's all systems go.
> 
> It's something of a double-edged sword, we're totally in control which gives
> us lots of flexibility, and _fast_ reboot times, but we also have to do a bit
> of extra stuff (ie. this patch) to keep things sane.
s/this patch/some patch/

Yes, you have certainly thought about this more/longer than I have,
so why is something more generic like this bad instead of good:

Somewhere early in start_kernel() (e.g.), do an hv call that says
"free all assigned resources".  Maybe hv doesn't know "all assigned
resources."

Maybe it's just that this patch is simpler than an hv change,
although this (current) patch could leave some other drivers that
need to be "fixed," while an hv change wouldn't do that.

So I'm not opposed to this current patch as a short-term solution,
but I don't think it's the right long-term solution.

---
~Randy

From david at gibson.dropbear.id.au Fri Mar 3 16:24:06 2006
From: david at gibson.dropbear.id.au (David Gibson)
Date: Fri, 3 Mar 2006 16:24:06 +1100
Subject: powerpc: Fix pud_ERROR() message
Message-ID: <20060303052406.GK23766@localhost.localdomain>

Paulus, please apply

The powerpc pud_ERROR() function misleadingly prints a message
indicating a pmd error.  This patch fixes that.

Signed-off-by: David Gibson 

Index: working-2.6/include/asm-powerpc/pgtable-4k.h
===================================================================
--- working-2.6.orig/include/asm-powerpc/pgtable-4k.h	2006-03-03 16:21:31.000000000 +1100
+++ working-2.6/include/asm-powerpc/pgtable-4k.h	2006-03-03 16:21:53.000000000 +1100
@@ -93,4 +93,4 @@
 	 (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1)))
 
 #define pud_ERROR(e) \
-	printk("%s:%d: bad pmd %08lx.\n", __FILE__, __LINE__, pud_val(e))
+	printk("%s:%d: bad pud %08lx.\n", __FILE__, __LINE__, pud_val(e))

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

From dhowells at redhat.com Sat Mar 4 03:03:00 2006
From: dhowells at redhat.com (David Howells)
Date: Fri, 03 Mar 2006 16:03:00 +0000
Subject: Memory barriers and spin_unlock safety
Message-ID: <32518.1141401780@warthog.cambridge.redhat.com>

Hi,

We've just had an interesting discussion on IRC and this has come up with two
unanswered questions:

 (1) Is spin_unlock() entirely safe on Pentium3+ and x86_64 where ?FENCE
     instructions are available?

     Consider the following case, where you want to do two reads effectively
     atomically, and so wrap them in a spinlock:

	spin_lock(&mtx);
	a = *A;
	b = *B;
	spin_unlock(&mtx);

     On x86 Pentium3+ and x86_64, what's to stop you from getting the reads
     done after the unlock since there's no LFENCE instruction there to stop
     you?
     What you'd expect is:

	LOCK WRITE mtx
	--> implies MFENCE
	READ *A		} which may be reordered
	READ *B		}
	WRITE mtx

     But what you might get instead is this:

	LOCK WRITE mtx
	--> implies MFENCE
	WRITE mtx
	--> implies SFENCE
	READ *A		} which may be reordered
	READ *B		}

     There doesn't seem to be anything that says that the reads can't leak
     outside of the locked section; at least, there doesn't in AMD's system
     programming manual for Amd64 (book 2, section 7.1).

     Writes on the other hand may not happen out of order, so changing
     things inside a critical section would seem to be okay.

     On PowerPC, on the other hand, the barriers have to be made explicit
     because they're not implied by LWARX/STWCX or by ordinary stores:

	LWARX mtx
	STWCX mtx
	ISYNC
	READ *A		} which may be reordered
	READ *B		}
	LWSYNC
	WRITE mtx

     So, should the spin_unlock() on i386 and x86_64 be doing an LFENCE
     instruction before unlocking?

 (2) What is the minimum functionality that can be expected of a memory
     barrier?  I was of the opinion that all we could expect is for the CPU
     executing one of them to force the instructions it is executing to be
     complete up to a point - depending on the type of barrier - before
     continuing past it.

     On pentiums, x86_64, and frv this seems to be exactly what you get for
     a barrier; there doesn't seem to be any external evidence of it that
     appears on the bus, other than the CPU does a load of memory
     transactions.

     However, on ppc/ppc64, it seems to be more thorough than that, and
     there seems to be some special interaction between the CPU processing
     the instruction and the other CPUs in the system.  It's not entirely
     obvious from the manual just what this does.

     As I understand it, Andrew Morton is of the opinion that issuing a read
     barrier on one CPU will cause the other CPUs in the system to sync up,
     but that doesn't look likely on all archs.

David

From dhowells at redhat.com Sat Mar 4 03:45:46 2006
From: dhowells at redhat.com (David Howells)
Date: Fri, 03 Mar 2006 16:45:46 +0000
Subject: Memory barriers and spin_unlock safety
In-Reply-To: <32518.1141401780@warthog.cambridge.redhat.com>
References: <32518.1141401780@warthog.cambridge.redhat.com>
Message-ID: <1146.1141404346@warthog.cambridge.redhat.com>

David Howells wrote:

> 	WRITE mtx
> 	--> implies SFENCE

Actually, I'm not sure this is true.  The AMD64 Instruction Manual's writeup of
SFENCE implies that writes can be reordered, which sort of contradicts what
the AMD64 System Programming Manual says.

If this isn't true, then x86_64 at least should do MFENCE before the store in
spin_unlock() or change the store to be LOCK'ed.  The same may also apply for
Pentium3+ class CPUs with the i386 arch.

David

From torvalds at osdl.org Sat Mar 4 03:55:35 2006
From: torvalds at osdl.org (Linus Torvalds)
Date: Fri, 3 Mar 2006 08:55:35 -0800 (PST)
Subject: Memory barriers and spin_unlock safety
In-Reply-To: <32518.1141401780@warthog.cambridge.redhat.com>
References: <32518.1141401780@warthog.cambridge.redhat.com>
Message-ID: 

On Fri, 3 Mar 2006, David Howells wrote:
> 
> We've just had an interesting discussion on IRC and this has come up with two
> unanswered questions:
> 
> (1) Is spin_unlock() is entirely safe on Pentium3+ and x86_64 where ?FENCE
>     instructions are available?
> > Consider the following case, where you want to do two reads effectively > atomically, and so wrap them in a spinlock: > > spin_lock(&mtx); > a = *A; > b = *B; > spin_unlock(&mtx); > > On x86 Pentium3+ and x86_64, what's to stop you from getting the reads > done after the unlock since there's no LFENCE instruction there to stop > you? The rules are, afaik, that reads can pass buffered writes, BUT WRITES CANNOT PASS READS (aka "writes to memory are always carried out in program order"). IOW, reads can bubble up, but writes cannot. So the way I read the Intel rules is that "passing" is always about being done earlier than otherwise allowed, not about being done later. (You only "pass" somebody in traffic when you go ahead of them. If you fall behind them, you don't "pass" them, _they_ pass you). Now, this is not so much meant to be a semantic argument (the meaning of the word "pass") as to an explanation of what I believe Intel meant, since we know from Intel designers that the simple non-atomic write is supposedly a perfectly fine unlock instruction. So when Intel says "reads can be carried out speculatively and in any order", that just says that reads are not ordered wrt other _reads_. They _are_ ordered wrt other writes, but only one way: they can pass an earlier write, but they can't fall back behind a later one. This is consistent with (a) optimization (you want to do reads _early_, not late) (b) behaviour (we've been told that a single write is sufficient, with the exception of an early P6 core revision) (c) at least one way of reading the documentation. And I claim that (a) and (b) are the important parts, and that (c) is just the rationale. > (2) What is the minimum functionality that can be expected of a memory > barriers? I was of the opinion that all we could expect is for the CPU > executing one them to force the instructions it is executing to be > complete up to a point - depending on the type of barrier - before > continuing past it. Well, no. You should expect even _less_. The core can continue doing things past a barrier. For example, a write barrier may not actually serialize anything at all: the sane way of doing write barriers is to just put a note in the write-queue, and that note just disallows write queue entries from being moved around it. So you might have a write barrier with two writes on either side, and the writes might _both_ be outstanding wrt the core despite the barrier. So there's not necessarily any synchronization at all on a execution core level, just a partial ordering between the resulting actions of the core. > However, on ppc/ppc64, it seems to be more thorough than that, and there > seems to be some special interaction between the CPU processing the > instruction and the other CPUs in the system. It's not entirely obvious > from the manual just what this does. PPC has an absolutely _horrible_ memory ordering implementation, as far as I can tell. The thing is broken. I think it's just implementation breakage, not anything really fundamental, but the fact that their write barriers are expensive is a big sign that they are doing something bad. For example, their write buffers may not have a way to serialize in the buffers, and at that point from an _implementation_ standpoint, you just have to serialize the whole core to make sure that writes don't pass each other. 
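As a toy model of the write-queue behaviour described above (illustrative
C, not kernel code): a write barrier is just an entry in the store queue
that other entries may not cross, so the core itself never has to stall.

	enum wq_type { WQ_STORE, WQ_BARRIER };

	struct wq_entry {
		enum wq_type	type;
		unsigned long	addr, val;
	};

	/* A store may retire, in any order relative to the other stores
	 * around it, as long as no older barrier is still queued; nothing
	 * ever moves past a WQ_BARRIER in either direction. */
	static int may_retire(const struct wq_entry *q, int i)
	{
		int j;

		for (j = 0; j < i; j++)
			if (q[j].type == WQ_BARRIER)
				return 0;
		return q[i].type == WQ_STORE;
	}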
> As I understand it, Andrew Morton is of the opinion that issuing a read > barrier on one CPU will cause the other CPUs in the system to sync up, but > that doesn't look likely on all archs. No. Issuing a read barrier on one CPU will do absolutely _nothing_ on the other CPU. All barriers are purely local to one CPU, and do not generate any bus traffic what-so-ever. They only potentially affect the order of bus traffic due to the instructions around them (obviously). So a read barrier on one CPU _has_ to be paired with a write barrier on the other side in order to make sense (although the write barrier can obviously be of the implied kind, ie a lock/unlock event, or just architecture-specific knowledge of write behaviour, ie for example knowing that writes are always seen in-order on x86). Linus From torvalds at osdl.org Sat Mar 4 04:03:05 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Fri, 3 Mar 2006 09:03:05 -0800 (PST) Subject: Memory barriers and spin_unlock safety In-Reply-To: <1146.1141404346@warthog.cambridge.redhat.com> References: <32518.1141401780@warthog.cambridge.redhat.com> <1146.1141404346@warthog.cambridge.redhat.com> Message-ID: On Fri, 3 Mar 2006, David Howells wrote: > David Howells wrote: > > > WRITE mtx > > --> implies SFENCE > > Actually, I'm not sure this is true. The AMD64 Instruction Manual's writeup of > SFENCE implies that writes can be reordered, which sort of contradicts what > the AMD64 System Programming Manual says. Note that _normal_ writes never need an SFENCE, because they are ordered by the core. The reason to use SFENCE is because of _special_ writes. For example, if you use a non-temporal store, then the write buffer ordering goes away, because there is no write buffer involved (the store goes directly to the L2 or outside the bus). Or when you talk to weakly ordered memory (ie a frame buffer that isn't cached, and where the MTRR memory ordering bits say that writes be done speculatively), you may want to say "I'm going to do the store that starts the graphics pipeline, all my previous stores need to be done now". THAT is when you need to use SFENCE. So SFENCE really isn't about the "smp_wmb()" kind of fencing at all. It's about the much weaker ordering that is allowed by the special IO memory types and nontemporal instructions. (Actually, I think one special case of non-temporal instruction is the "repeat movs/stos" thing: I think you should _not_ use a "repeat stos" to unlock a spinlock, exactly because those stores are not ordered wrt each other, and they can bypass the write queue. Of course, doing that would be insane anyway, so no harm done ;^). > If this isn't true, then x86_64 at least should do MFENCE before the store in > spin_unlock() or change the store to be LOCK'ed. The same may also apply for > Pentium3+ class CPUs with the i386 arch. No. But if you want to make sure, you can always check with Intel engineers. I'm pretty sure I have this right, though, because Intel engineers have certainly looked at Linux sources and locking, and nobody has ever said that we'd need an SFENCE. 
Linus

From arjan at infradead.org Sat Mar 4 07:02:12 2006
From: arjan at infradead.org (Arjan van de Ven)
Date: Fri, 03 Mar 2006 21:02:12 +0100
Subject: Memory barriers and spin_unlock safety
In-Reply-To: <1146.1141404346@warthog.cambridge.redhat.com>
References: <32518.1141401780@warthog.cambridge.redhat.com>
	<1146.1141404346@warthog.cambridge.redhat.com>
Message-ID: <1141416133.10732.65.camel@laptopd505.fenrus.org>

On Fri, 2006-03-03 at 16:45 +0000, David Howells wrote:
> David Howells wrote:
> 
> > 	WRITE mtx
> > 	--> implies SFENCE
> 
> Actually, I'm not sure this is true. The AMD64 Instruction Manual's writeup of
> SFENCE implies that writes can be reordered, which sort of contradicts what
> the AMD64 System Programming Manual says.

there are 2 or 3 special instructions which do "non temporal stores"
(movntq and movnti and maybe one more). sfence is designed for those.

From dhowells at redhat.com Sat Mar 4 07:15:35 2006
From: dhowells at redhat.com (David Howells)
Date: Fri, 03 Mar 2006 20:15:35 +0000
Subject: Memory barriers and spin_unlock safety
In-Reply-To: 
References: <32518.1141401780@warthog.cambridge.redhat.com>
Message-ID: <5001.1141416935@warthog.cambridge.redhat.com>

Linus Torvalds wrote:

> The rules are, afaik, that reads can pass buffered writes, BUT WRITES
> CANNOT PASS READS (aka "writes to memory are always carried out in program
> order").

So in the example I gave, a read after the spin_unlock() may actually get
executed before the store in the spin_unlock(), but a read before the unlock
will not get executed after.

> No. Issuing a read barrier on one CPU will do absolutely _nothing_ on the
> other CPU.

Well, I think you mean will guarantee absolutely _nothing_ on the other CPU for
the Linux kernel.  According to the IBM powerpc book I have, it does actually
do something on the other CPUs, though it doesn't say exactly what.

Anyway, thanks.  I'll write up some documentation on barriers for inclusion
in the kernel.

David

From dhowells at redhat.com Sat Mar 4 07:17:07 2006
From: dhowells at redhat.com (David Howells)
Date: Fri, 03 Mar 2006 20:17:07 +0000
Subject: Memory barriers and spin_unlock safety
In-Reply-To: 
References: <32518.1141401780@warthog.cambridge.redhat.com>
	<1146.1141404346@warthog.cambridge.redhat.com>
Message-ID: <5041.1141417027@warthog.cambridge.redhat.com>

Linus Torvalds wrote:

> Note that _normal_ writes never need an SFENCE, because they are ordered
> by the core.
> 
> The reason to use SFENCE is because of _special_ writes.

I suspect, then, that x86_64 should not have an SFENCE for smp_wmb(), and
that only io_wmb() should have that.

David

From benh at kernel.crashing.org Sat Mar 4 08:06:05 2006
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 04 Mar 2006 08:06:05 +1100
Subject: Memory barriers and spin_unlock safety
In-Reply-To: 
References: <32518.1141401780@warthog.cambridge.redhat.com>
Message-ID: <1141419966.3888.67.camel@localhost.localdomain>

> PPC has an absolutely _horrible_ memory ordering implementation, as far as
> I can tell. The thing is broken. I think it's just implementation
> breakage, not anything really fundamental, but the fact that their write
> barriers are expensive is a big sign that they are doing something bad.

Are they ? read barriers and full barriers are, write barriers should be
fairly cheap (but then, I haven't measured).
> For example, their write buffers may not have a way to serialize in the
> buffers, and at that point from an _implementation_ standpoint, you just
> have to serialize the whole core to make sure that writes don't pass each
> other.

The main problem I've had in the past with the ppc barriers is more a
subtle thing in the spec that unfortunately was taken to the word by
implementors, and is that the simple write barrier (eieio) will only
order within the same storage space, that is will not order between
cacheable and non-cacheable storage.

That means IOs could leak out of locks etc... Which is why we use
expensive barriers in MMIO wrappers for now (though we might investigate
the use of mmioXb instead in the future).

> No. Issuing a read barrier on one CPU will do absolutely _nothing_ on the
> other CPU. All barriers are purely local to one CPU, and do not generate
> any bus traffic what-so-ever. They only potentially affect the order of
> bus traffic due to the instructions around them (obviously).

Actually, the ppc's full barrier (sync) will generate bus traffic, and I
think in some cases eieio barriers can propagate to the chipset to
enforce ordering there too depending on some voodoo settings and whether
the storage space is cacheable or not.

> So a read barrier on one CPU _has_ to be paired with a write barrier on
> the other side in order to make sense (although the write barrier can
> obviously be of the implied kind, ie a lock/unlock event, or just
> architecture-specific knowledge of write behaviour, ie for example knowing
> that writes are always seen in-order on x86).
> 
> Linus
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

From torvalds at osdl.org Sat Mar 4 08:31:08 2006
From: torvalds at osdl.org (Linus Torvalds)
Date: Fri, 3 Mar 2006 13:31:08 -0800 (PST)
Subject: Memory barriers and spin_unlock safety
In-Reply-To: <5001.1141416935@warthog.cambridge.redhat.com>
References: <32518.1141401780@warthog.cambridge.redhat.com>
	<5001.1141416935@warthog.cambridge.redhat.com>
Message-ID: 

On Fri, 3 Mar 2006, David Howells wrote:
> 
> So in the example I gave, a read after the spin_unlock() may actually get
> executed before the store in the spin_unlock(), but a read before the unlock
> will not get executed after.

Yes.

> > No. Issuing a read barrier on one CPU will do absolutely _nothing_ on the
> > other CPU.
> 
> Well, I think you mean will guarantee absolutely _nothing_ on the other CPU for
> the Linux kernel. According to the IBM powerpc book I have, it does actually
> do something on the other CPUs, though it doesn't say exactly what.

Yeah, Power really does have some funky stuff in their memory ordering.
I'm not quite sure why, though. And it definitely isn't implied by any of
the Linux kernel barriers.

(They also do TLB coherency in hw etc strange things).
Linus From torvalds at osdl.org Sat Mar 4 08:34:17 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Fri, 3 Mar 2006 13:34:17 -0800 (PST) Subject: Memory barriers and spin_unlock safety In-Reply-To: <5041.1141417027@warthog.cambridge.redhat.com> References: <32518.1141401780@warthog.cambridge.redhat.com> <1146.1141404346@warthog.cambridge.redhat.com> <5041.1141417027@warthog.cambridge.redhat.com> Message-ID: On Fri, 3 Mar 2006, David Howells wrote: > > I suspect, then, that x86_64 should not have an SFENCE for smp_wmb(), and that > only io_wmb() should have that. Indeed. I think smp_wmb() should be a compiler fence only on x86(-64), ie just compile to a "barrier()" (and not even that on UP, of course). Linus From davem at davemloft.net Sat Mar 4 08:52:17 2006 From: davem at davemloft.net (David S. Miller) Date: Fri, 03 Mar 2006 13:52:17 -0800 (PST) Subject: Memory barriers and spin_unlock safety In-Reply-To: <200603031518.15806.hollisb@us.ibm.com> References: <1141419966.3888.67.camel@localhost.localdomain> <200603031518.15806.hollisb@us.ibm.com> Message-ID: <20060303.135217.65983538.davem@davemloft.net> From: Hollis Blanchard Date: Fri, 3 Mar 2006 15:18:13 -0600 > On Friday 03 March 2006 15:06, Benjamin Herrenschmidt wrote: > > The main problem I've had in the past with the ppc barriers is more a > > subtle thing in the spec that unfortunately was taken to the word by > > implementors, and is that the simple write barrier (eieio) will only > > order within the same storage space, that is will not order between > > cacheable and non-cacheable storage. > > I've heard Sparc has the same issue... in which case it may not be a "chip > designer was too literal" thing, but rather it really simplifies chip > implementation to do it that way. There is a "membar #MemIssue" that is meant to deal with this should it ever matter, but for most sparc64 chips it doesn't which is why we don't use that memory barrier type at all in the Linux kernel. For UltraSPARC-I and II it technically could matter in Relaxed Memory Ordering (RMO) mode which is what we run the kernel and 64-bit userspace in, but I've never seen an issue resulting from it. For UltraSPARC-III and later, the chip only implements the Total Store Ordering (TSO) memory model and the manual explicitly states that cacheable and non-cacheable memory operations are ordered, even using language such as "there is an implicit 'membar #MemIssue' between them". It further goes on to say: The UltraSPARCIII Cu processor maintains ordering between cacheable and non-cacheable accesses. The UltraSPARC III Cu processor maintains TSO ordering between memory references regardless of their cacheability. Niagara behaves almost identically to UltraSPARC-III in this area. From torvalds at osdl.org Sat Mar 4 09:04:21 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Fri, 3 Mar 2006 14:04:21 -0800 (PST) Subject: Memory barriers and spin_unlock safety In-Reply-To: <1141419966.3888.67.camel@localhost.localdomain> References: <32518.1141401780@warthog.cambridge.redhat.com> <1141419966.3888.67.camel@localhost.localdomain> Message-ID: On Sat, 4 Mar 2006, Benjamin Herrenschmidt wrote: > > The main problem I've had in the past with the ppc barriers is more a > subtle thing in the spec that unfortunately was taken to the word by > implementors, and is that the simple write barrier (eieio) will only > order within the same storage space, that is will not order between > cacheable and non-cacheable storage. If so, a simple write barrier should be sufficient. 
That's exactly what the x86 write barriers do too, ie stores to magic IO
space are _not_ ordered wrt a normal [smp_]wmb() (or, as per how this thread
started, a spin_unlock()) at all.

On x86, we actually have this "CONFIG_X86_OOSTORE" configuration option that
gets enabled when you select a WINCHIP device, because that allows a weaker
memory ordering for normal memory too, and that will end up using an "sfence"
instruction for store buffers. But it's not normally enabled.

So the eieio should be sufficient, then.

Of course, the x86 store buffers do tend to flush out stuff after a certain
cycle-delay too, so there may be drivers that technically are buggy on x86,
but where the store buffer in practice is small and flushes out quickly
enough that you'll never _see_ the bug.

> Actually, the ppc's full barrier (sync) will generate bus traffic, and I
> think in some case eieio barriers can propagate to the chipset to
> enforce ordering there too depending on some voodoo settings and wether
> the storage space is cacheable or not.

Well, the regular kernel ops definitely won't depend on that, since that's
not the case anywhere else.

		Linus

From bcrl at linux.intel.com Sat Mar 4 08:51:14 2006
From: bcrl at linux.intel.com (Benjamin LaHaise)
Date: Fri, 3 Mar 2006 13:51:14 -0800
Subject: Memory barriers and spin_unlock safety
In-Reply-To: 
References: <32518.1141401780@warthog.cambridge.redhat.com>
	<1146.1141404346@warthog.cambridge.redhat.com>
	<5041.1141417027@warthog.cambridge.redhat.com>
Message-ID: <20060303215114.GA13893@linux.intel.com>

On Fri, Mar 03, 2006 at 01:34:17PM -0800, Linus Torvalds wrote:
> Indeed. I think smp_wmb() should be a compiler fence only on x86(-64), ie
> just compile to a "barrier()" (and not even that on UP, of course).

Actually, no. At least in testing an implementation of Dekker's and
Peterson's algorithms as a replacement for the locked operation in our
spinlocks, it is absolutely necessary to have an sfence in the lock to
ensure the lock is visible to the other CPU before proceeding. I'd use
smp_wmb() as the fence is completely unnecessary on UP and is even
irq-safe. Here's a copy of the Peterson's implementation to illustrate (it
works, it's just slower than the existing spinlocks).
-ben diff --git a/include/asm-x86_64/spinlock.h b/include/asm-x86_64/spinlock.h index fe484a6..45bd386 100644 --- a/include/asm-x86_64/spinlock.h +++ b/include/asm-x86_64/spinlock.h @@ -4,6 +4,8 @@ #include #include #include +#include +#include #include /* @@ -18,50 +20,53 @@ */ #define __raw_spin_is_locked(x) \ - (*(volatile signed int *)(&(x)->slock) <= 0) - -#define __raw_spin_lock_string \ - "\n1:\t" \ - "lock ; decl %0\n\t" \ - "js 2f\n" \ - LOCK_SECTION_START("") \ - "2:\t" \ - "rep;nop\n\t" \ - "cmpl $0,%0\n\t" \ - "jle 2b\n\t" \ - "jmp 1b\n" \ - LOCK_SECTION_END - -#define __raw_spin_unlock_string \ - "movl $1,%0" \ - :"=m" (lock->slock) : : "memory" + ((*(volatile signed int *)(x) & ~0xff) != 0) static inline void __raw_spin_lock(raw_spinlock_t *lock) { - __asm__ __volatile__( - __raw_spin_lock_string - :"=m" (lock->slock) : : "memory"); + int cpu = read_pda(cpunumber); + + barrier(); + lock->flags[cpu] = 1; + lock->turn = cpu ^ 1; + barrier(); + + asm volatile("sfence":::"memory"); + + while (lock->flags[cpu ^ 1] && (lock->turn != cpu)) { + cpu_relax(); + barrier(); + } } #define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock) static inline int __raw_spin_trylock(raw_spinlock_t *lock) { - int oldval; - - __asm__ __volatile__( - "xchgl %0,%1" - :"=q" (oldval), "=m" (lock->slock) - :"0" (0) : "memory"); - - return oldval > 0; + int cpu = read_pda(cpunumber); + barrier(); + if (__raw_spin_is_locked(lock)) + return 0; + + lock->flags[cpu] = 1; + lock->turn = cpu ^ 1; + asm volatile("sfence":::"memory"); + + if (lock->flags[cpu ^ 1] && (lock->turn != cpu)) { + lock->flags[cpu] = 0; + barrier(); + return 0; + } + return 1; } static inline void __raw_spin_unlock(raw_spinlock_t *lock) { - __asm__ __volatile__( - __raw_spin_unlock_string - ); + int cpu; + //asm volatile("lfence":::"memory"); + cpu = read_pda(cpunumber); + lock->flags[cpu] = 0; + barrier(); } #define __raw_spin_unlock_wait(lock) \ diff --git a/include/asm-x86_64/spinlock_types.h b/include/asm-x86_64/spinlock_types.h index 59efe84..a409cbf 100644 --- a/include/asm-x86_64/spinlock_types.h +++ b/include/asm-x86_64/spinlock_types.h @@ -6,10 +6,11 @@ #endif typedef struct { - volatile unsigned int slock; + volatile unsigned char turn; + volatile unsigned char flags[3]; } raw_spinlock_t; -#define __RAW_SPIN_LOCK_UNLOCKED { 1 } +#define __RAW_SPIN_LOCK_UNLOCKED { 0, { 0, } } typedef struct { volatile unsigned int lock; From torvalds at osdl.org Sat Mar 4 09:21:58 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Fri, 3 Mar 2006 14:21:58 -0800 (PST) Subject: Memory barriers and spin_unlock safety In-Reply-To: <20060303215114.GA13893@linux.intel.com> References: <32518.1141401780@warthog.cambridge.redhat.com> <1146.1141404346@warthog.cambridge.redhat.com> <5041.1141417027@warthog.cambridge.redhat.com> <20060303215114.GA13893@linux.intel.com> Message-ID: On Fri, 3 Mar 2006, Benjamin LaHaise wrote: > > Actually, no. At least in testing an implementation of Dekker's and > Peterson's algorithms as a replacement for the locked operation in > our spinlocks, it is absolutely necessary to have an sfence in the lock > to ensure the lock is visible to the other CPU before proceeding. I suspect you have some bug in your implementation. I think Dekker's algorithm depends on the reads and writes being ordered, and you don't seem to do that. The thing is, you pretty much _have_ to be wrong, because the x86-64 memory ordering rules are _exactly_ the same as for x86, and we've had that simple store as an unlock for a long long time. 
Linus From torvalds at osdl.org Sat Mar 4 09:36:34 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Fri, 3 Mar 2006 14:36:34 -0800 (PST) Subject: Memory barriers and spin_unlock safety In-Reply-To: References: <32518.1141401780@warthog.cambridge.redhat.com> <1146.1141404346@warthog.cambridge.redhat.com> <5041.1141417027@warthog.cambridge.redhat.com> <20060303215114.GA13893@linux.intel.com> Message-ID: On Fri, 3 Mar 2006, Linus Torvalds wrote: > > I suspect you have some bug in your implementation. I think Dekker's > algorithm depends on the reads and writes being ordered, and you don't > seem to do that. IOW, I think you need a full memory barrier after the "lock->turn = cpu ^ 1;" and you should have a "smp_rmb()" in between your reads of "lock->flags[cpu ^ 1]" and "lock->turn" to give the ordering that Dekker (or Peterson) expects. IOW, the code should be something like lock->flags[other] = 1; smp_wmb(); lock->turn = other smp_mb(); while (lock->turn == cpu) { smp_rmb(); if (!lock->flags[other]) break; } where the wmb's are no-ops on x86, but the rmb's certainly are not. I _suspect_ that the fact that it starts working with an 'sfence' in there somewhere is just because the sfence ends up being "serializing enough" that it just happens to work, but that it has nothing to do with the current kernel wmb() being wrong. Linus From hollisb at us.ibm.com Sat Mar 4 08:18:13 2006 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Fri, 3 Mar 2006 15:18:13 -0600 Subject: Memory barriers and spin_unlock safety In-Reply-To: <1141419966.3888.67.camel@localhost.localdomain> References: <32518.1141401780@warthog.cambridge.redhat.com> <1141419966.3888.67.camel@localhost.localdomain> Message-ID: <200603031518.15806.hollisb@us.ibm.com> On Friday 03 March 2006 15:06, Benjamin Herrenschmidt wrote: > The main problem I've had in the past with the ppc barriers is more a > subtle thing in the spec that unfortunately was taken to the word by > implementors, and is that the simple write barrier (eieio) will only > order within the same storage space, that is will not order between > cacheable and non-cacheable storage. I've heard Sparc has the same issue... in which case it may not be a "chip designer was too literal" thing, but rather it really simplifies chip implementation to do it that way. -- Hollis Blanchard IBM Linux Technology Center From paulus at samba.org Sat Mar 4 21:58:04 2006 From: paulus at samba.org (Paul Mackerras) Date: Sat, 4 Mar 2006 21:58:04 +1100 Subject: Memory barriers and spin_unlock safety In-Reply-To: <1141419966.3888.67.camel@localhost.localdomain> References: <32518.1141401780@warthog.cambridge.redhat.com> <1141419966.3888.67.camel@localhost.localdomain> Message-ID: <17417.29372.744064.211813@cargo.ozlabs.ibm.com> Benjamin Herrenschmidt writes: > Actually, the ppc's full barrier (sync) will generate bus traffic, and I > think in some case eieio barriers can propagate to the chipset to > enforce ordering there too depending on some voodoo settings and wether > the storage space is cacheable or not. Eieio has to go to the PCI host bridge because it is supposed to prevent write-combining, both in the host bridge and in the CPU. Paul. 
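For reference, a minimal C sketch of the two-CPU Peterson's lock with the
barrier placement Linus describes above. This is illustrative only -- the
type and function names are invented, it is not the kernel's spinlock, and
it assumes the kernel's smp_*() barrier macros and cpu_relax():

	struct peterson_lock {
		volatile int	flags[2];	/* "CPU n wants the lock" */
		volatile int	turn;		/* which CPU has to wait */
	};

	static void peterson_lock(struct peterson_lock *lock, int cpu)
	{
		int other = cpu ^ 1;

		lock->flags[cpu] = 1;
		smp_wmb();	/* flag store visible before turn store */
		lock->turn = other;
		smp_mb();	/* both stores ordered before the reads below */

		while (lock->flags[other] && lock->turn == other) {
			cpu_relax();
			smp_rmb();	/* re-read flags[] and turn in order */
		}
	}

	static void peterson_unlock(struct peterson_lock *lock, int cpu)
	{
		smp_mb();	/* critical section completes before release */
		lock->flags[cpu] = 0;
	}

On x86 the smp_wmb()/smp_rmb() here would compile down to little or nothing,
which is the point under debate: the algorithm needs the ordering, not
necessarily an SFENCE instruction.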
From paulus at samba.org Sat Mar 4 21:58:06 2006 From: paulus at samba.org (Paul Mackerras) Date: Sat, 4 Mar 2006 21:58:06 +1100 Subject: Memory barriers and spin_unlock safety In-Reply-To: References: <32518.1141401780@warthog.cambridge.redhat.com> Message-ID: <17417.29375.87604.537434@cargo.ozlabs.ibm.com> Linus Torvalds writes: > PPC has an absolutely _horrible_ memory ordering implementation, as far as > I can tell. The thing is broken. I think it's just implementation > breakage, not anything really fundamental, but the fact that their write > barriers are expensive is a big sign that they are doing something bad. An smp_wmb() is just an eieio on PPC, which is pretty cheap. I made wmb() be a sync though, because it seemed that there were drivers that expected wmb() to provide an ordering between a write to memory and a write to an MMIO register. If that is a bogus assumption then we could make wmb() lighter-weight (after auditing all the drivers we're interested in, of course, ...). And in a subsequent message: > If so, a simple write barrier should be sufficient. That's exactly what > the x86 write barriers do too, ie stores to magic IO space are _not_ > ordered wrt a normal [smp_]wmb() (or, as per how this thread started, a > spin_unlock()) at all. By magic IO space, do you mean just any old memory-mapped device register in a PCI device, or do you mean something else? Paul. From schwab at suse.de Sun Mar 5 01:03:52 2006 From: schwab at suse.de (Andreas Schwab) Date: Sat, 04 Mar 2006 15:03:52 +0100 Subject: GigE on PowerMac G5 Message-ID: I suppose the NIC in the PowerMac G5 can do GigE, yet when plugged into a GB switch it is only willing to talk 100MB with it. Any idea why? Kernel is 2.6.16-rc5-git2. # lsprop /proc/device-tree/ht at 0,f2000000/pci at 6/ethernet at f name "ethernet" linux,phandle ff9c53d8 interrupt-parent ff9779b0 gbit-phy assigned-addresses 82047810 00000000 80400000 00000000 00200000 82047830 00000000 80300000 00000000 00100000 local-mac-address 00 0a 95 ba b8 70 .....p stats 00000000 00000000 00000000 00000000 00000000 reg 00047800 00000000 00000000 00000000 00000000 02047810 00000000 00000000 00000000 00020000 02047830 00000000 00000000 00000000 00010000 max-frame-size 000005ee (1518) address-bits 00000030 (48) built-in compatible "K2-GMAC" category "net" removable "network" network-type "ethernet" device_type "network" fast-back-to-back devsel-speed 00000002 max-latency 00000040 (64) min-grant 00000040 (64) interrupts 00000029 00000001 class-code 00020000 (131072) revision-id 00000000 device-id 0000004c (76) vendor-id 0000106b (4203) # ethtool eth0 Settings for eth0: Supported ports: [ TP MII ] Supported link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Half 1000baseT/Full Supports auto-negotiation: Yes Advertised link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Half 1000baseT/Full Advertised auto-negotiation: No Speed: 100Mb/s Duplex: Full Port: MII PHYAD: 0 Transceiver: external Auto-negotiation: on Supports Wake-on: g Wake-on: d Current message level: 0x00000007 (7) Link detected: yes Andreas. -- Andreas Schwab, SuSE Labs, schwab at suse.de SuSE Linux Products GmbH, Maxfeldstra?e 5, 90409 N?rnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." 
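As an aside on chasing this sort of autonegotiation problem: the values
ethtool prints can also be read programmatically through the SIOCETHTOOL
ioctl, which is all the ethtool utility does underneath. A rough userspace
sketch (interface name assumed, most error handling trimmed):

	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <sys/socket.h>
	#include <net/if.h>
	#include <linux/ethtool.h>
	#include <linux/sockios.h>

	/* query the negotiated link speed, the same way ethtool does */
	static int get_link_speed(const char *ifname)
	{
		struct ethtool_cmd ecmd;
		struct ifreq ifr;
		int fd, speed = -1;

		fd = socket(AF_INET, SOCK_DGRAM, 0);
		if (fd < 0)
			return -1;

		memset(&ifr, 0, sizeof(ifr));
		strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
		memset(&ecmd, 0, sizeof(ecmd));
		ecmd.cmd = ETHTOOL_GSET;
		ifr.ifr_data = (char *)&ecmd;

		if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
			speed = ecmd.speed;	/* 10, 100 or 1000 */

		close(fd);
		return speed;
	}

	int main(void)
	{
		printf("eth0: %d Mb/s\n", get_link_speed("eth0"));
		return 0;
	}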
From schwab at suse.de Sun Mar 5 01:53:38 2006 From: schwab at suse.de (Andreas Schwab) Date: Sat, 04 Mar 2006 15:53:38 +0100 Subject: GigE on PowerMac G5 Message-ID: [Sorry for duplicate posting, I've used the wrong list address.] I suppose the NIC in the PowerMac G5 can do GigE, yet when plugged into a GB switch it is only willing to talk 100MB with it. Any idea why? Kernel is 2.6.16-rc5-git2. # lsprop /proc/device-tree/ht at 0,f2000000/pci at 6/ethernet at f name "ethernet" linux,phandle ff9c53d8 interrupt-parent ff9779b0 gbit-phy assigned-addresses 82047810 00000000 80400000 00000000 00200000 82047830 00000000 80300000 00000000 00100000 local-mac-address 00 0a 95 ba b8 70 .....p stats 00000000 00000000 00000000 00000000 00000000 reg 00047800 00000000 00000000 00000000 00000000 02047810 00000000 00000000 00000000 00020000 02047830 00000000 00000000 00000000 00010000 max-frame-size 000005ee (1518) address-bits 00000030 (48) built-in compatible "K2-GMAC" category "net" removable "network" network-type "ethernet" device_type "network" fast-back-to-back devsel-speed 00000002 max-latency 00000040 (64) min-grant 00000040 (64) interrupts 00000029 00000001 class-code 00020000 (131072) revision-id 00000000 device-id 0000004c (76) vendor-id 0000106b (4203) # ethtool eth0 Settings for eth0: Supported ports: [ TP MII ] Supported link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Half 1000baseT/Full Supports auto-negotiation: Yes Advertised link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Half 1000baseT/Full Advertised auto-negotiation: No Speed: 100Mb/s Duplex: Full Port: MII PHYAD: 0 Transceiver: external Auto-negotiation: on Supports Wake-on: g Wake-on: d Current message level: 0x00000007 (7) Link detected: yes Andreas. -- Andreas Schwab, SuSE Labs, schwab at suse.de SuSE Linux Products GmbH, Maxfeldstra?e 5, 90409 N?rnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." From torvalds at osdl.org Sun Mar 5 04:28:54 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Sat, 4 Mar 2006 09:28:54 -0800 (PST) Subject: Memory barriers and spin_unlock safety In-Reply-To: <17417.29375.87604.537434@cargo.ozlabs.ibm.com> References: <32518.1141401780@warthog.cambridge.redhat.com> <17417.29375.87604.537434@cargo.ozlabs.ibm.com> Message-ID: On Sat, 4 Mar 2006, Paul Mackerras wrote: > > > If so, a simple write barrier should be sufficient. That's exactly what > > the x86 write barriers do too, ie stores to magic IO space are _not_ > > ordered wrt a normal [smp_]wmb() (or, as per how this thread started, a > > spin_unlock()) at all. > > By magic IO space, do you mean just any old memory-mapped device > register in a PCI device, or do you mean something else? Any old memory-mapped device that has been marked as write-combining in the MTRR's or page tables. So the rules from the PC side (and like it or not, they end up being what all the drivers are tested with) are: - regular stores are ordered by write barriers - PIO stores are always synchronous - MMIO stores are ordered by IO semantics - PCI ordering must be honored: * write combining is only allowed on PCI memory resources that are marked prefetchable. If your host bridge does write combining in general, it's not a "host bridge", it's a "host disaster". 
* for others, writes can always be posted, but they cannot be re-ordered wrt either reads or writes to that device (ie a read will always be fully synchronizing) - io_wmb must be honored In addition, it will help a hell of a lot if you follow the PC notion of "per-region extra rules", ie you'd default to the non-prefetchable behaviour even for areas that are prefetchable from a PCI standpoint, but allow some way to relax the ordering rules in various ways. PC's use MTRR's or page table hints for this, but it's actually perfectly possible to do it by virtual address (ie decide on "ioremap()" time by looking at some bits that you've saved away to remap it to a certain virtual address range, and then use the virtual address as a hint for readl/writel whether you need to serialize or not). On x86, we already use the "virtual address" trick to distinguish between PIO and MMIO for the newer ioread/iowrite interface (the older inb/outb/readb/writeb interfaces obviously don't need that, since the IO space is statically encoded in the function call itself). The reason I mention the MTRR emulation is again just purely compatibility with drivers that get 99.9% of all the testing on a PC platform. Linus From benh at kernel.crashing.org Sun Mar 5 08:16:40 2006 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sun, 05 Mar 2006 08:16:40 +1100 Subject: GigE on PowerMac G5 In-Reply-To: References: Message-ID: <1141507000.17127.4.camel@localhost.localdomain> On Sat, 2006-03-04 at 15:53 +0100, Andreas Schwab wrote: > [Sorry for duplicate posting, I've used the wrong list address.] > > I suppose the NIC in the PowerMac G5 can do GigE, yet when plugged into a > GB switch it is only willing to talk 100MB with it. Any idea why? Kernel > is 2.6.16-rc5-git2. Works for me... Must be a problem with auto-neg and your switch, or the cable.... Can you check how the switch is configured maybe ? You can also try forcing the link speed with ethtool. Ben. > # lsprop /proc/device-tree/ht at 0,f2000000/pci at 6/ethernet at f > name "ethernet" > linux,phandle ff9c53d8 > interrupt-parent ff9779b0 > gbit-phy > assigned-addresses 82047810 00000000 80400000 00000000 00200000 > 82047830 00000000 80300000 00000000 00100000 > local-mac-address 00 0a 95 ba b8 70 .....p > stats 00000000 00000000 00000000 00000000 00000000 > reg 00047800 00000000 00000000 00000000 00000000 > 02047810 00000000 00000000 00000000 00020000 > 02047830 00000000 00000000 00000000 00010000 > max-frame-size 000005ee (1518) > address-bits 00000030 (48) > built-in > compatible "K2-GMAC" > category "net" > removable "network" > network-type "ethernet" > device_type "network" > fast-back-to-back > devsel-speed 00000002 > max-latency 00000040 (64) > min-grant 00000040 (64) > interrupts 00000029 00000001 > class-code 00020000 (131072) > revision-id 00000000 > device-id 0000004c (76) > vendor-id 0000106b (4203) > # ethtool eth0 > Settings for eth0: > Supported ports: [ TP MII ] > Supported link modes: 10baseT/Half 10baseT/Full > 100baseT/Half 100baseT/Full > 1000baseT/Half 1000baseT/Full > Supports auto-negotiation: Yes > Advertised link modes: 10baseT/Half 10baseT/Full > 100baseT/Half 100baseT/Full > 1000baseT/Half 1000baseT/Full > Advertised auto-negotiation: No > Speed: 100Mb/s > Duplex: Full > Port: MII > PHYAD: 0 > Transceiver: external > Auto-negotiation: on > Supports Wake-on: g > Wake-on: d > Current message level: 0x00000007 (7) > Link detected: yes > > Andreas. 
>

From benh at kernel.crashing.org Sun Mar 5 09:49:53 2006
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sun, 05 Mar 2006 09:49:53 +1100
Subject: Memory barriers and spin_unlock safety
In-Reply-To: <17417.29372.744064.211813@cargo.ozlabs.ibm.com>
References: <32518.1141401780@warthog.cambridge.redhat.com>
	<1141419966.3888.67.camel@localhost.localdomain>
	<17417.29372.744064.211813@cargo.ozlabs.ibm.com>
Message-ID: <1141512594.17127.16.camel@localhost.localdomain>

On Sat, 2006-03-04 at 21:58 +1100, Paul Mackerras wrote:
> Benjamin Herrenschmidt writes:
>
> > Actually, the ppc's full barrier (sync) will generate bus traffic, and I
> > think in some case eieio barriers can propagate to the chipset to
> > enforce ordering there too depending on some voodoo settings and wether
> > the storage space is cacheable or not.
>
> Eieio has to go to the PCI host bridge because it is supposed to
> prevent write-combining, both in the host bridge and in the CPU.

That can be disabled with HID bits tho ;)

Ben.

From paulus at samba.org Sun Mar 5 10:36:18 2006
From: paulus at samba.org (Paul Mackerras)
Date: Sun, 5 Mar 2006 10:36:18 +1100
Subject: GigE on PowerMac G5
In-Reply-To: 
References: 
Message-ID: <17418.9330.763002.180595@cargo.ozlabs.ibm.com>

Andreas Schwab writes:

> I suppose the NIC in the PowerMac G5 can do GigE, yet when plugged into a
> GB switch it is only willing to talk 100MB with it. Any idea why? Kernel
> is 2.6.16-rc5-git2.

It does 1000Mb/s here...

# ethtool eth0
Settings for eth0:
	Supported ports: [ TP MII ]
	Supported link modes:   10baseT/Half 10baseT/Full
	                        100baseT/Half 100baseT/Full
	                        1000baseT/Half 1000baseT/Full
	Supports auto-negotiation: Yes
	Advertised link modes:  10baseT/Half 10baseT/Full
	                        100baseT/Half 100baseT/Full
	                        1000baseT/Half 1000baseT/Full
	Advertised auto-negotiation: No
	Speed: 1000Mb/s
	Duplex: Full
	Port: MII
	PHYAD: 0
	Transceiver: external
	Auto-negotiation: on
	Supports Wake-on: g
	Wake-on: d
	Current message level: 0x00000007 (7)
	Link detected: yes

Paul.

From mbuesch at freenet.de Sun Mar 5 13:04:40 2006
From: mbuesch at freenet.de (Michael Buesch)
Date: Sun, 5 Mar 2006 03:04:40 +0100
Subject: Memory barriers and spin_unlock safety
In-Reply-To: <17417.29375.87604.537434@cargo.ozlabs.ibm.com>
References: <32518.1141401780@warthog.cambridge.redhat.com>
	<17417.29375.87604.537434@cargo.ozlabs.ibm.com>
Message-ID: <200603050304.41436.mbuesch@freenet.de>

On Saturday 04 March 2006 11:58, you wrote:
> Linus Torvalds writes:
>
> > PPC has an absolutely _horrible_ memory ordering implementation, as far as
> > I can tell. The thing is broken. I think it's just implementation
> > breakage, not anything really fundamental, but the fact that their write
> > barriers are expensive is a big sign that they are doing something bad.
>
> An smp_wmb() is just an eieio on PPC, which is pretty cheap. I made
> wmb() be a sync though, because it seemed that there were drivers that
> expected wmb() to provide an ordering between a write to memory and a
> write to an MMIO register. If that is a bogus assumption then we
> could make wmb() lighter-weight (after auditing all the drivers we're
> interested in, of course, ...).

In the bcm43xx driver there is code which looks like the following:

	/* Write some coherent DMA memory */
	wmb();
	/* Write MMIO, which depends on the DMA memory
	 * write to be finished.
	 */

Are the assumptions in this code correct? Is wmb() the correct thing to do
here? I heavily tested this code on PPC UP and did not see any anomaly yet.

-- 
Greetings Michael.
-------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060305/578694ae/attachment.pgp From david at gibson.dropbear.id.au Mon Mar 6 12:51:29 2006 From: david at gibson.dropbear.id.au (David Gibson) Date: Mon, 6 Mar 2006 12:51:29 +1100 Subject: powerpc: Make pmd_bad() and pud_bad() checks non-trivial Message-ID: <20060306015129.GA21408@localhost.localdomain> Paulus, please apply. At present, the powerpc pmd_bad() and pud_bad() macros return false unless the fiven pmd or pud is zero. This patch makes these tests more thorough, checking if the given pmd or pud looks like a plausible pte page or pmd page pointer respectively. This can result in helpful error messages when messing with the pagetable code. Signed-off-by: David Gibson Index: working-2.6/include/asm-powerpc/pgtable.h =================================================================== --- working-2.6.orig/include/asm-powerpc/pgtable.h 2006-03-06 11:38:45.000000000 +1100 +++ working-2.6/include/asm-powerpc/pgtable.h 2006-03-06 12:51:14.000000000 +1100 @@ -188,9 +188,13 @@ static inline pte_t pfn_pte(unsigned lon #define pte_pfn(x) ((unsigned long)((pte_val(x)>>PTE_RPN_SHIFT))) #define pte_page(x) pfn_to_page(pte_pfn(x)) +#define PMD_BAD_BITS (PTE_TABLE_SIZE-1) +#define PUD_BAD_BITS (PMD_TABLE_SIZE-1) + #define pmd_set(pmdp, pmdval) (pmd_val(*(pmdp)) = (pmdval)) #define pmd_none(pmd) (!pmd_val(pmd)) -#define pmd_bad(pmd) (pmd_val(pmd) == 0) +#define pmd_bad(pmd) (!is_kernel_addr(pmd_val(pmd)) \ + || (pmd_val(pmd) & PMD_BAD_BITS)) #define pmd_present(pmd) (pmd_val(pmd) != 0) #define pmd_clear(pmdp) (pmd_val(*(pmdp)) = 0) #define pmd_page_kernel(pmd) (pmd_val(pmd) & ~PMD_MASKED_BITS) @@ -198,7 +202,8 @@ static inline pte_t pfn_pte(unsigned lon #define pud_set(pudp, pudval) (pud_val(*(pudp)) = (pudval)) #define pud_none(pud) (!pud_val(pud)) -#define pud_bad(pud) ((pud_val(pud)) == 0) +#define pud_bad(pud) (!is_kernel_addr(pud_val(pud)) \ + || (pud_val(pud) & PUD_BAD_BITS)) #define pud_present(pud) (pud_val(pud) != 0) #define pud_clear(pudp) (pud_val(*(pudp)) = 0) #define pud_page(pud) (pud_val(pud) & ~PUD_MASKED_BITS) -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson From michael at ellerman.id.au Mon Mar 6 13:29:07 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Mon, 6 Mar 2006 13:29:07 +1100 Subject: dead hvc_console with kdump kernel In-Reply-To: References: Message-ID: <200603061329.08262.michael@ellerman.id.au> On Fri, 24 Feb 2006 07:41, Ryan Arnold wrote: > If interrupts end up being disabled by the kexec call and you still need > the console you could try to find a way to set hp->irq = NO_IRQ in this > case such that the khvcd thread is continually rescheduled to poll the > hypervisor buffer and never sleeps indefinitely, as via the interrupt > driven method. Still not sure what's going on here, some interrupt weirdness. 
This patch serves as a workaround for the moment: Index: kdump/drivers/char/hvc_console.c =================================================================== --- kdump.orig/drivers/char/hvc_console.c 2006-03-06 12:19:42.000000000 +1100 +++ kdump/drivers/char/hvc_console.c 2006-03-06 12:22:32.000000000 +1100 @@ -591,10 +591,12 @@ static int hvc_poll(struct hvc_struct *h if (test_bit(TTY_THROTTLED, &tty->flags)) goto throttled; +#ifndef CONFIG_CRASH_DUMP /* If we aren't interrupt driven and aren't throttled, we always * request a reschedule */ if (hp->irq == NO_IRQ) +#endif poll_mask |= HVC_POLL_READ; /* Read data if any */ From michael at ellerman.id.au Mon Mar 6 17:26:30 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Mon, 6 Mar 2006 17:26:30 +1100 Subject: dead hvc_console with kdump kernel In-Reply-To: <200603061329.08262.michael@ellerman.id.au> References: <200603061329.08262.michael@ellerman.id.au> Message-ID: <200603061726.35175.michael@ellerman.id.au> On Mon, 6 Mar 2006 13:29, Michael Ellerman wrote: > On Fri, 24 Feb 2006 07:41, Ryan Arnold wrote: > > If interrupts end up being disabled by the kexec call and you still need > > the console you could try to find a way to set hp->irq = NO_IRQ in this > > case such that the khvcd thread is continually rescheduled to poll the > > hypervisor buffer and never sleeps indefinitely, as via the interrupt > > driven method. > > Still not sure what's going on here, some interrupt weirdness. I'm stuck on this, when we switch to the kdump kernel we just stop getting interrupts for the console. We never see them in xics_get_irq(), but everything else seems to be working dandy. :( -- Michael Ellerman IBM OzLabs wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060306/9f3bc794/attachment.pgp From schwab at suse.de Mon Mar 6 21:40:59 2006 From: schwab at suse.de (Andreas Schwab) Date: Mon, 06 Mar 2006 11:40:59 +0100 Subject: GigE on PowerMac G5 In-Reply-To: <1141507000.17127.4.camel@localhost.localdomain> (Benjamin Herrenschmidt's message of "Sun, 05 Mar 2006 08:16:40 +1100") References: <1141507000.17127.4.camel@localhost.localdomain> Message-ID: Benjamin Herrenschmidt writes: > On Sat, 2006-03-04 at 15:53 +0100, Andreas Schwab wrote: >> I suppose the NIC in the PowerMac G5 can do GigE, yet when plugged into a >> GB switch it is only willing to talk 100MB with it. Any idea why? Kernel >> is 2.6.16-rc5-git2. > > Works for me... Must be a problem with auto-neg and your switch, or the > cable.... Can you check how the switch is configured maybe ? You can > also try forcing the link speed with ethtool. It's not the cable, I have swapped it with another system where Gb is working fine. Neither it's the switch port, I have swapped it too. I can't force the speed with ethtool either. Any other idea what to look for? Andreas. -- Andreas Schwab, SuSE Labs, schwab at suse.de SuSE Linux Products GmbH, Maxfeldstra?e 5, 90409 N?rnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." 
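One more thing that can be checked from userspace, assuming the sungem
driver wires up the generic MII ioctls (if it does not, the calls below
simply fail): read the raw PHY registers, roughly as mii-tool does.
Register 10 (MII_STAT1000) shows whether the link partner ever advertises
1000baseT. Needs root; a sketch:

	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <sys/socket.h>
	#include <net/if.h>
	#include <linux/mii.h>
	#include <linux/sockios.h>

	/* read one PHY register via the MII ioctls */
	static int mii_read(int fd, const char *ifname, int reg)
	{
		struct ifreq ifr;
		struct mii_ioctl_data *mii =
			(struct mii_ioctl_data *)&ifr.ifr_data;

		memset(&ifr, 0, sizeof(ifr));
		strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
		if (ioctl(fd, SIOCGMIIPHY, &ifr) < 0)	/* fills in phy_id */
			return -1;
		mii->reg_num = reg;
		if (ioctl(fd, SIOCGMIIREG, &ifr) < 0)
			return -1;
		return mii->val_out;
	}

	int main(void)
	{
		int fd = socket(AF_INET, SOCK_DGRAM, 0);

		if (fd < 0)
			return 1;
		/* regs 4/5 cover 10/100 autoneg; 9/10 cover 1000baseT */
		printf("advertise=%04x lpa=%04x ctrl1000=%04x stat1000=%04x\n",
		       mii_read(fd, "eth0", MII_ADVERTISE),
		       mii_read(fd, "eth0", MII_LPA),
		       mii_read(fd, "eth0", MII_CTRL1000),
		       mii_read(fd, "eth0", MII_STAT1000));
		close(fd);
		return 0;
	}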
From benh at kernel.crashing.org Tue Mar 7 00:15:06 2006 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 07 Mar 2006 00:15:06 +1100 Subject: GigE on PowerMac G5 In-Reply-To: References: <1141507000.17127.4.camel@localhost.localdomain> Message-ID: <1141650907.11221.61.camel@localhost.localdomain> On Mon, 2006-03-06 at 11:40 +0100, Andreas Schwab wrote: > Benjamin Herrenschmidt writes: > > > On Sat, 2006-03-04 at 15:53 +0100, Andreas Schwab wrote: > >> I suppose the NIC in the PowerMac G5 can do GigE, yet when plugged into a > >> GB switch it is only willing to talk 100MB with it. Any idea why? Kernel > >> is 2.6.16-rc5-git2. > > > > Works for me... Must be a problem with auto-neg and your switch, or the > > cable.... Can you check how the switch is configured maybe ? You can > > also try forcing the link speed with ethtool. > > It's not the cable, I have swapped it with another system where Gb is > working fine. Neither it's the switch port, I have swapped it too. I > can't force the speed with ethtool either. Any other idea what to look > for? At this point, all I can say is... does it work in OS X ? Ben. From olh at suse.de Tue Mar 7 06:38:17 2006 From: olh at suse.de (Olaf Hering) Date: Mon, 6 Mar 2006 20:38:17 +0100 Subject: [PATCH] change compat shmget size arg to signed In-Reply-To: <20060224111242.08f14bd9.sfr@canb.auug.org.au> References: <20060224101644.548b0c24.sfr@canb.auug.org.au> <20060223232717.GB29454@suse.de> <20060224111242.08f14bd9.sfr@canb.auug.org.au> Message-ID: <20060306193817.GA3214@suse.de> On Fri, Feb 24, Stephen Rothwell wrote: > On Fri, 24 Feb 2006 00:27:17 +0100 Olaf Hering wrote: > > > > On Fri, Feb 24, Stephen Rothwell wrote: > > > > > Does the ltp test fail on a standard kernel(where SHMMAX is 0x2000000), or > > > only on a SLES kernel (where SHMMAX is ULONG_MAX)? > > > > It fails with SLES9 and SLES10. SLES9 has 0x2000000 as default. > > So what was shm_ctlmax set to when the test was run. > > I am trying to figure out why this test: > > if (size < SHMMIN || size > shm_ctlmax) > return -EINVAL; > > Doesn't return -EINVAL for size == 0xffffffff if shm_ctlmax is 0x2000000? shm_ctlmax is a sysctrl, so it can have anything. The ltp test is invalid. shmget02 dos not fail after: echo $(( 0x2000000 )) > /proc/sys/kernel/shmmax From johnrose at austin.ibm.com Tue Mar 7 12:03:28 2006 From: johnrose at austin.ibm.com (John Rose) Date: Mon, 06 Mar 2006 19:03:28 -0600 Subject: [PATCH 1/3] cleanup PCI Host Bridge setup Message-ID: <1141693408.8166.16.camel@sinatra.austin.ibm.com> Since setup_phb() and pci_process_bridge_OF_ranges() are always called together, and since the latter falls under the category of "setup", move the latter into the former. Thanks- John Signed-off-by: John Rose diff -puN arch/powerpc/kernel/rtas_pci.c~cleanup_phb_setup arch/powerpc/kernel/rtas_pci.c --- 2_6_p5/arch/powerpc/kernel/rtas_pci.c~cleanup_phb_setup 2006-03-03 15:42:35.000000000 -0600 +++ 2_6_p5-johnrose/arch/powerpc/kernel/rtas_pci.c 2006-03-06 17:23:50.000000000 -0600 @@ -292,6 +292,8 @@ static int __devinit setup_phb(struct de phb->ops = &rtas_pci_ops; phb->buid = get_phb_buid(dev); + pci_process_bridge_OF_ranges(phb, dev, 0); + return 0; } @@ -323,7 +325,6 @@ unsigned long __init find_and_init_phbs( if (!phb) continue; setup_phb(node, phb); - pci_process_bridge_OF_ranges(phb, node, 0); pci_setup_phb_io(phb, index == 0); #ifdef CONFIG_PPC_PSERIES /* XXX This code need serious fixing ... 
--BenH */ @@ -369,7 +370,6 @@ struct pci_controller * __devinit init_p if (!phb) return NULL; setup_phb(dn, phb); - pci_process_bridge_OF_ranges(phb, dn, primary); pci_setup_phb_io_dynamic(phb, primary); _ From johnrose at austin.ibm.com Tue Mar 7 12:03:58 2006 From: johnrose at austin.ibm.com (John Rose) Date: Mon, 06 Mar 2006 19:03:58 -0600 Subject: [PATCH 2/3] move init_phb_dynamic() to pseries Message-ID: <1141693438.8166.20.camel@sinatra.austin.ibm.com> Since init_phb_dynamic() only comes into play during dynamic partitioning on POWER systems, move it to pseries-specific file. This is also necessary for the addition of some pseries-specific fixups during PHB creation. Thanks- John Signed-off-by: John Rose diff -puN arch/powerpc/kernel/rtas_pci.c~move_init_phb_dyn arch/powerpc/kernel/rtas_pci.c --- 2_6_p5/arch/powerpc/kernel/rtas_pci.c~move_init_phb_dyn 2006-03-03 15:43:03.000000000 -0600 +++ 2_6_p5-johnrose/arch/powerpc/kernel/rtas_pci.c 2006-03-03 15:43:03.000000000 -0600 @@ -280,8 +280,7 @@ static int phb_set_bus_ranges(struct dev return 0; } -static int __devinit setup_phb(struct device_node *dev, - struct pci_controller *phb) +int __devinit setup_phb(struct device_node *dev, struct pci_controller *phb) { if (is_python(dev)) python_countermeasures(dev); @@ -360,26 +359,6 @@ unsigned long __init find_and_init_phbs( return 0; } -struct pci_controller * __devinit init_phb_dynamic(struct device_node *dn) -{ - struct pci_controller *phb; - int primary; - - primary = list_empty(&hose_list); - phb = pcibios_alloc_controller(dn); - if (!phb) - return NULL; - setup_phb(dn, phb); - - pci_setup_phb_io_dynamic(phb, primary); - - pci_devs_phb_init_dynamic(phb); - scan_phb(phb); - - return phb; -} -EXPORT_SYMBOL(init_phb_dynamic); - /* RPA-specific bits for removing PHBs */ int pcibios_remove_root_bus(struct pci_controller *phb) { diff -puN arch/powerpc/kernel/pci_64.c~move_init_phb_dyn arch/powerpc/kernel/pci_64.c diff -puN include/asm-powerpc/ppc-pci.h~move_init_phb_dyn include/asm-powerpc/ppc-pci.h --- 2_6_p5/include/asm-powerpc/ppc-pci.h~move_init_phb_dyn 2006-03-03 15:43:03.000000000 -0600 +++ 2_6_p5-johnrose/include/asm-powerpc/ppc-pci.h 2006-03-03 15:43:03.000000000 -0600 @@ -38,6 +38,7 @@ void *traverse_pci_devices(struct device void pci_devs_phb_init(void); void pci_devs_phb_init_dynamic(struct pci_controller *phb); +int setup_phb(struct device_node *dev, struct pci_controller *phb); void __devinit scan_phb(struct pci_controller *hose); /* From rtas_pci.h */ diff -puN arch/powerpc/platforms/pseries/pci_dlpar.c~move_init_phb_dyn arch/powerpc/platforms/pseries/pci_dlpar.c --- 2_6_p5/arch/powerpc/platforms/pseries/pci_dlpar.c~move_init_phb_dyn 2006-03-03 15:43:03.000000000 -0600 +++ 2_6_p5-johnrose/arch/powerpc/platforms/pseries/pci_dlpar.c 2006-03-06 17:23:46.000000000 -0600 @@ -27,6 +27,7 @@ #include #include +#include static struct pci_bus * find_bus_among_children(struct pci_bus *bus, @@ -179,3 +180,23 @@ pcibios_add_pci_devices(struct pci_bus * } } EXPORT_SYMBOL_GPL(pcibios_add_pci_devices); + +struct pci_controller * __devinit init_phb_dynamic(struct device_node *dn) +{ + struct pci_controller *phb; + int primary; + + primary = list_empty(&hose_list); + phb = pcibios_alloc_controller(dn); + if (!phb) + return NULL; + setup_phb(dn, phb); + + pci_setup_phb_io_dynamic(phb, primary); + + pci_devs_phb_init_dynamic(phb); + scan_phb(phb); + + return phb; +} +EXPORT_SYMBOL_GPL(init_phb_dynamic); _ From johnrose at austin.ibm.com Tue Mar 7 12:04:25 2006 From: johnrose at austin.ibm.com (John 
Rose) Date: Mon, 06 Mar 2006 19:04:25 -0600 Subject: [PATCH 3/3] properly configure DDR/P5IOC children devs Message-ID: <1141693465.8166.22.camel@sinatra.austin.ibm.com> The dynamic add path for PCI Host Bridges can fail to configure children adapters under P5IOC controllers. It fails to properly fixup bus/device resources, and it fails to properly enable EEH. Both of these steps need to occur before any children devices are enabled in pci_bus_add_devices(). This fix has been tested for P5IOC and non-P5IOC slots. Thanks- John Signed-off-by: John Rose diff -puN arch/powerpc/kernel/pci_64.c~fixup_phb_devs arch/powerpc/kernel/pci_64.c --- 2_6_p5/arch/powerpc/kernel/pci_64.c~fixup_phb_devs 2006-03-03 15:43:38.000000000 -0600 +++ 2_6_p5-johnrose/arch/powerpc/kernel/pci_64.c 2006-03-03 15:43:39.000000000 -0600 @@ -589,7 +589,6 @@ void __devinit scan_phb(struct pci_contr #endif /* CONFIG_PPC_MULTIPLATFORM */ if (mode == PCI_PROBE_NORMAL) hose->last_busno = bus->subordinate = pci_scan_child_bus(bus); - pci_bus_add_devices(bus); } static int __init pcibios_init(void) @@ -608,8 +607,10 @@ static int __init pcibios_init(void) printk("PCI: Probing PCI hardware\n"); /* Scan all of the recorded PCI controllers. */ - list_for_each_entry_safe(hose, tmp, &hose_list, list_node) + list_for_each_entry_safe(hose, tmp, &hose_list, list_node) { scan_phb(hose); + pci_bus_add_devices(hose->bus); + } #ifndef CONFIG_PPC_ISERIES if (pci_probe_only) diff -puN arch/powerpc/platforms/pseries/pci_dlpar.c~fixup_phb_devs arch/powerpc/platforms/pseries/pci_dlpar.c --- 2_6_p5/arch/powerpc/platforms/pseries/pci_dlpar.c~fixup_phb_devs 2006-03-03 15:44:04.000000000 -0600 +++ 2_6_p5-johnrose/arch/powerpc/platforms/pseries/pci_dlpar.c 2006-03-03 15:46:25.000000000 -0600 @@ -195,7 +195,13 @@ struct pci_controller * __devinit init_p pci_setup_phb_io_dynamic(phb, primary); pci_devs_phb_init_dynamic(phb); + + if (dn->child) + eeh_add_device_tree_early(dn); + scan_phb(phb); + pcibios_fixup_new_pci_devices(phb->bus, 0); + pci_bus_add_devices(phb->bus); return phb; } _ From schwab at suse.de Tue Mar 7 23:53:21 2006 From: schwab at suse.de (Andreas Schwab) Date: Tue, 07 Mar 2006 13:53:21 +0100 Subject: GigE on PowerMac G5 In-Reply-To: <1141650907.11221.61.camel@localhost.localdomain> (Benjamin Herrenschmidt's message of "Tue, 07 Mar 2006 00:15:06 +1100") References: <1141507000.17127.4.camel@localhost.localdomain> <1141650907.11221.61.camel@localhost.localdomain> Message-ID: Benjamin Herrenschmidt writes: > At this point, all I can say is... does it work in OS X ? Strange, OS X can't do it either. Looks like I have a hardware problem. Andreas. -- Andreas Schwab, SuSE Labs, schwab at suse.de SuSE Linux Products GmbH, Maxfeldstra?e 5, 90409 N?rnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." From dhowells at redhat.com Wed Mar 8 04:36:59 2006 From: dhowells at redhat.com (David Howells) Date: Tue, 07 Mar 2006 17:36:59 +0000 Subject: Memory barriers and spin_unlock safety In-Reply-To: <5041.1141417027@warthog.cambridge.redhat.com> References: <5041.1141417027@warthog.cambridge.redhat.com> <32518.1141401780@warthog.cambridge.redhat.com> <1146.1141404346@warthog.cambridge.redhat.com> Message-ID: <31420.1141753019@warthog.cambridge.redhat.com> David Howells wrote: > I suspect, then, that x86_64 should not have an SFENCE for smp_wmb(), and > that only io_wmb() should have that. Hmmm... We don't actually have io_wmb()... 
Should the following be added to all archs? io_mb() io_rmb() io_wmb() David From dhowells at redhat.com Wed Mar 8 04:40:45 2006 From: dhowells at redhat.com (David Howells) Date: Tue, 07 Mar 2006 17:40:45 +0000 Subject: [PATCH] Document Linux's memory barriers Message-ID: <31492.1141753245@warthog.cambridge.redhat.com> The attached patch documents the Linux kernel's memory barriers. Signed-Off-By: David Howells --- warthog>diffstat -p1 mb.diff Documentation/memory-barriers.txt | 359 ++++++++++++++++++++++++++++++++++++++ 1 files changed, 359 insertions(+) diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt new file mode 100644 index 0000000..c2fc51b --- /dev/null +++ b/Documentation/memory-barriers.txt @@ -0,0 +1,359 @@ + ============================ + LINUX KERNEL MEMORY BARRIERS + ============================ + +Contents: + + (*) What are memory barriers? + + (*) Linux kernel memory barrier functions. + + (*) Implied kernel memory barriers. + + (*) i386 and x86_64 arch specific notes. + + +========================= +WHAT ARE MEMORY BARRIERS? +========================= + +Memory barriers are instructions to both the compiler and the CPU to impose a +partial ordering between the memory access operations specified either side of +the barrier. + +Older and less complex CPUs will perform memory accesses in exactly the order +specified, so if one is given the following piece of code: + + a = *A; + *B = b; + c = *C; + d = *D; + *E = e; + +It can be guaranteed that it will complete the memory access for each +instruction before moving on to the next line, leading to a definite sequence +of operations on the bus: + + read *A, write *B, read *C, read *D, write *E. + +However, with newer and more complex CPUs, this isn't always true because: + + (*) they can rearrange the order of the memory accesses to promote better use + of the CPU buses and caches; + + (*) reads are synchronous and may need to be done immediately to permit + progress, whereas writes can often be deferred without a problem; + + (*) and they are able to combine reads and writes to improve performance when + talking to the SDRAM (modern SDRAM chips can do batched accesses of + adjacent locations, cutting down on transaction setup costs). + +So what you might actually get from the above piece of code is: + + read *A, read *C+*D, write *E, write *B + +Under normal operation, this is probably not going to be a problem; however, +there are two circumstances where it definitely _can_ be a problem: + + (1) I/O + + Many I/O devices can be memory mapped, and so appear to the CPU as if + they're just memory locations. However, to control the device, the driver + has to make the right accesses in exactly the right order. + + Consider, for example, an ethernet chipset such as the AMD PCnet32. It + presents to the CPU an "address register" and a bunch of "data registers". + The way it's accessed is to write the index of the internal register you + want to access to the address register, and then read or write the + appropriate data register to access the chip's internal register: + + *ADR = ctl_reg_3; + reg = *DATA; + + The problem with a clever CPU or a clever compiler is that the write to + the address register isn't guaranteed to happen before the access to the + data register, if the CPU or the compiler thinks it is more efficient to + defer the address write: + + read *DATA, write *ADR + + then things will break. 
+ + The way to deal with this is to insert an I/O memory barrier between the + two accesses: + + *ADR = ctl_reg_3; + mb(); + reg = *DATA; + + In this case, the barrier makes a guarantee that all memory accesses + before the barrier will happen before all the memory accesses after the + barrier. It does _not_ guarantee that all memory accesses before the + barrier will be complete by the time the barrier is complete. + + (2) Multiprocessor interaction + + When there's a system with more than one processor, these may be working + on the same set of data, but attempting not to use locks as locks are + quite expensive. This means that accesses that affect both CPUs may have + to be carefully ordered to prevent error. + + Consider the R/W semaphore slow path. In that, a waiting process is + queued on the semaphore, as noted by it having a record on its stack + linked to the semaphore's list: + + struct rw_semaphore { + ... + struct list_head waiters; + }; + + struct rwsem_waiter { + struct list_head list; + struct task_struct *task; + }; + + To wake up the waiter, the up_read() or up_write() functions have to read + the pointer from this record to know as to where the next waiter record + is, clear the task pointer, call wake_up_process() on the task, and + release the task struct reference held: + + READ waiter->list.next; + READ waiter->task; + WRITE waiter->task; + CALL wakeup + RELEASE task + + If any of these steps occur out of order, then the whole thing may fail. + + Note that the waiter does not get the semaphore lock again - it just waits + for its task pointer to be cleared. Since the record is on its stack, this + means that if the task pointer is cleared _before_ the next pointer in the + list is read, then another CPU might start processing the waiter and it + might clobber its stack before up*() functions have a chance to read the + next pointer. + + CPU 0 CPU 1 + =============================== =============================== + down_xxx() + Queue waiter + Sleep + up_yyy() + READ waiter->task; + WRITE waiter->task; + + Resume processing + down_xxx() returns + call foo() + foo() clobbers *waiter + + READ waiter->list.next; + --- OOPS --- + + This could be dealt with using a spinlock, but then the down_xxx() + function has to get the spinlock again after it's been woken up, which is + a waste of resources. + + The way to deal with this is to insert an SMP memory barrier: + + READ waiter->list.next; + READ waiter->task; + smp_mb(); + WRITE waiter->task; + CALL wakeup + RELEASE task + + In this case, the barrier makes a guarantee that all memory accesses + before the barrier will happen before all the memory accesses after the + barrier. It does _not_ guarantee that all memory accesses before the + barrier will be complete by the time the barrier is complete. + + SMP memory barriers are normally no-ops on a UP system because the CPU + orders overlapping accesses with respect to itself. + + +===================================== +LINUX KERNEL MEMORY BARRIER FUNCTIONS +===================================== + +The Linux kernel has six basic memory barriers: + + MANDATORY (I/O) SMP + =============== ================ + GENERAL mb() smp_mb() + READ rmb() smp_rmb() + WRITE wmb() smp_wmb() + +General memory barriers make a guarantee that all memory accesses specified +before the barrier will happen before all memory accesses specified after the +barrier. 
+ +Read memory barriers make a guarantee that all memory reads specified before +the barrier will happen before all memory reads specified after the barrier. + +Write memory barriers make a guarantee that all memory writes specified before +the barrier will happen before all memory writes specified after the barrier. + +SMP memory barriers are no-ops on uniprocessor compiled systems because it is +assumed that a CPU will be self-consistent, and will order overlapping accesses +with respect to itself. + +There is no guarantee that any of the memory accesses specified before a memory +barrier will be complete by the completion of a memory barrier; the barrier can +be considered to draw a line in the access queue that accesses of the +appropriate type may not cross. + +There is no guarantee that issuing a memory barrier on one CPU will have any +direct effect on another CPU or any other hardware in the system. The indirect +effect will be the order the first CPU commits its accesses to the bus. + +Note that these are the _minimum_ guarantees. Different architectures may give +more substantial guarantees, but they may not be relied upon outside of arch +specific code. + + +There are some more advanced barriering functions: + + (*) set_mb(var, value) + (*) set_wmb(var, value) + + These assign the value to the variable and then insert at least a write + barrier after it, depending on the function. + + +============================== +IMPLIED KERNEL MEMORY BARRIERS +============================== + +Some of the other functions in the linux kernel imply memory barriers. For +instance all the following (pseudo-)locking functions imply barriers. + + (*) interrupt disablement and/or interrupts + (*) spin locks + (*) R/W spin locks + (*) mutexes + (*) semaphores + (*) R/W semaphores + +In all cases there are variants on a LOCK operation and an UNLOCK operation. + + (*) LOCK operation implication: + + Memory accesses issued after the LOCK will be completed after the LOCK + accesses have completed. + + Memory accesses issued before the LOCK may be completed after the LOCK + accesses have completed. + + (*) UNLOCK operation implication: + + Memory accesses issued before the UNLOCK will be completed before the + UNLOCK accesses have completed. + + Memory accesses issued after the UNLOCK may be completed before the UNLOCK + accesses have completed. + + (*) LOCK vs UNLOCK implication: + + The LOCK accesses will be completed before the unlock accesses. + +Locks and semaphores may not provide any guarantee of ordering on UP compiled +systems, and so can't be counted on in such a situation to actually do +anything at all, especially with respect to I/O memory barriering. + +Either interrupt disablement (LOCK) and enablement (UNLOCK) will barrier +memory and I/O accesses individually, or interrupt handling will barrier +memory and I/O accesses on entry and on exit. This prevents an interrupt +routine interfering with accesses made in a disabled-interrupt section of code +and vice versa. + +This specification is a _minimum_ guarantee; any particular architecture may +provide more substantial guarantees, but these may not be relied upon outside +of arch specific code. 
+
+
+As an example, consider the following:
+
+	*A = a;
+	*B = b;
+	LOCK
+	*C = c;
+	*D = d;
+	UNLOCK
+	*E = e;
+	*F = f;
+
+The following sequence of events on the bus is acceptable:
+
+	LOCK, *F+*A, *E, *C+*D, *B, UNLOCK
+
+But none of the following are:
+
+	*F+*A, *B, LOCK, *C, *D, UNLOCK, *E
+	*A, *B, *C, LOCK, *D, UNLOCK, *E, *F
+	*A, *B, LOCK, *C, UNLOCK, *D, *E, *F
+	*B, LOCK, *C, *D, UNLOCK, *F+*A, *E
+
+
+Consider also the following (going back to the AMD PCnet example):
+
+	DISABLE IRQ
+	*ADR = ctl_reg_3;
+	mb();
+	x = *DATA;
+	*ADR = ctl_reg_4;
+	mb();
+	*DATA = y;
+	*ADR = ctl_reg_5;
+	mb();
+	z = *DATA;
+	ENABLE IRQ
+
+	*ADR = ctl_reg_7;
+	mb();
+	q = *DATA;
+
+
+What's to stop "z = *DATA" crossing "*ADR = ctl_reg_7" and reading from the
+wrong register? (There's no guarantee that the process of handling an
+interrupt will barrier memory accesses in any way).
+
+
+==============================
+I386 AND X86_64 SPECIFIC NOTES
+==============================
+
+Earlier i386 CPUs (pre-Pentium-III) are fully ordered - the operations on the
+bus appear in program order - and so there's no requirement for any sort of
+explicit memory barriers.
+
+From the Pentium-III onwards, three new memory barrier instructions were
+added: LFENCE, SFENCE and MFENCE, which correspond to the kernel memory
+barrier functions rmb(), wmb() and mb(). However, there are additional
+implicit memory barriers in the CPU implementation:
+
+ (*) Interrupt processing implies mb().
+
+ (*) The LOCK prefix adds implication of mb() on whatever instruction it is
+     attached to.
+
+ (*) Normal writes to memory imply wmb() [and so SFENCE is normally not
+     required].
+
+ (*) Normal writes imply a semi-rmb(): reads before a write may not complete
+     after that write, but reads after a write may complete before the write
+     (ie: reads may go _ahead_ of writes).
+
+ (*) Non-temporal writes imply no memory barrier, and are the intended target
+     of SFENCE.
+
+ (*) Accesses to uncached memory imply mb() [eg: memory mapped I/O].
+
+
+======================
+POWERPC SPECIFIC NOTES
+======================
+
+The powerpc is weakly ordered, and its read and write accesses may be
+completed generally in any order. Its memory barriers are also to some extent
+more substantial than the minimum requirement, and may directly affect
+hardware outside of the CPU.

From matthew at wil.cx Wed Mar 8 04:40:57 2006
From: matthew at wil.cx (Matthew Wilcox)
Date: Tue, 7 Mar 2006 10:40:57 -0700
Subject: Memory barriers and spin_unlock safety
In-Reply-To: <31420.1141753019@warthog.cambridge.redhat.com>
References: <5041.1141417027@warthog.cambridge.redhat.com>
	<32518.1141401780@warthog.cambridge.redhat.com>
	<1146.1141404346@warthog.cambridge.redhat.com>
	<31420.1141753019@warthog.cambridge.redhat.com>
Message-ID: <20060307174057.GD7301@parisc-linux.org>

On Tue, Mar 07, 2006 at 05:36:59PM +0000, David Howells wrote:
> David Howells wrote:
>
> > I suspect, then, that x86_64 should not have an SFENCE for smp_wmb(), and
> > that only io_wmb() should have that.
>
> Hmmm... We don't actually have io_wmb()... Should the following be added to
> all archs?
>
> 	io_mb()
> 	io_rmb()
> 	io_wmb()

it's spelled mmiowb(), and reads from IO space are synchronous, so don't
need barriers.
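To make the wmb()/mmiowb() distinction concrete, here is a sketch of the
pattern Michael's bcm43xx snippet relies on, extended with mmiowb(). The
device, structure and register names are all invented for the example;
only the barrier, locking and MMIO calls are real kernel interfaces:

	/* post a descriptor to coherent DMA memory, then ring the doorbell */
	static void foo_kick_dma(struct foo_dev *fd, struct foo_desc *desc,
				 dma_addr_t addr, int len)
	{
		spin_lock(&fd->lock);

		desc->addr = addr;	/* write coherent DMA memory... */
		desc->len = len;
		wmb();			/* ...and order it before the MMIO
					 * doorbell write below */
		writel(FOO_DMA_GO, fd->mmio + FOO_CTRL);

		mmiowb();		/* order the MMIO write wrt another
					 * CPU taking fd->lock and writing
					 * this device's registers */
		spin_unlock(&fd->lock);
	}

Whether plain wmb() is strong enough for the memory-vs-MMIO ordering is
exactly what Paul's question above is about; on ppc64 it currently is,
because wmb() is a sync rather than an eieio.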
From ak at suse.de Tue Mar 7 21:34:52 2006 From: ak at suse.de (Andi Kleen) Date: Tue, 7 Mar 2006 11:34:52 +0100 Subject: [PATCH] Document Linux's memory barriers In-Reply-To: <31492.1141753245@warthog.cambridge.redhat.com> References: <31492.1141753245@warthog.cambridge.redhat.com> Message-ID: <200603071134.52962.ak@suse.de> On Tuesday 07 March 2006 18:40, David Howells wrote: > +Older and less complex CPUs will perform memory accesses in exactly the order > +specified, so if one is given the following piece of code: > + > + a = *A; > + *B = b; > + c = *C; > + d = *D; > + *E = e; > + > +It can be guaranteed that it will complete the memory access for each > +instruction before moving on to the next line, leading to a definite sequence > +of operations on the bus: Actually gcc is free to reorder it (often it will not when it cannot prove that they don't alias, but sometimes it can) > + > + Consider, for example, an ethernet chipset such as the AMD PCnet32. It > + presents to the CPU an "address register" and a bunch of "data registers". > + The way it's accessed is to write the index of the internal register you > + want to access to the address register, and then read or write the > + appropriate data register to access the chip's internal register: > + > + *ADR = ctl_reg_3; > + reg = *DATA; You're not supposed to do it this way anyways. The official way to access MMIO space is using read/write[bwlq] Haven't read all of it sorry, but thanks for the work of documenting it. -Andi From torvalds at osdl.org Wed Mar 8 05:28:45 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Tue, 7 Mar 2006 10:28:45 -0800 (PST) Subject: Memory barriers and spin_unlock safety In-Reply-To: <1141755496.31814.56.camel@localhost.localdomain> References: <5041.1141417027@warthog.cambridge.redhat.com> <32518.1141401780@warthog.cambridge.redhat.com> <1146.1141404346@warthog.cambridge.redhat.com> <31420.1141753019@warthog.cambridge.redhat.com> <1141755496.31814.56.camel@localhost.localdomain> Message-ID: On Tue, 7 Mar 2006, Alan Cox wrote: > > What kind of mb/rmb/wmb goes with ioread/iowrite ? It seems we actually > need one that can work out what to do for the general io API ? The ioread/iowrite things only guarantee the laxer MMIO rules, since it _might_ be mmio. So you'd use the mmio barriers. In fact, I would suggest that architectures that can do PIO in a more relaxed manner (x86 cannot, since all the serialization is in hardware) would do even a PIO in the more relaxed ordering (ie writes can at least be posted, but obviously not merged, since that would be against PCI specs). x86 tends to serialize PIO too much (I think at least Intel CPU's will actually wait for the PIO write to be acknowledged by _something_ on the bus, although it obviously can't wait for the device to have acted on it). Linus From dhowells at redhat.com Wed Mar 8 05:30:40 2006 From: dhowells at redhat.com (David Howells) Date: Tue, 07 Mar 2006 18:30:40 +0000 Subject: [PATCH] Document Linux's memory barriers In-Reply-To: <200603071134.52962.ak@suse.de> References: <200603071134.52962.ak@suse.de> <31492.1141753245@warthog.cambridge.redhat.com> Message-ID: <7621.1141756240@warthog.cambridge.redhat.com> Andi Kleen wrote: > Actually gcc is free to reorder it > (often it will not when it cannot prove that they don't alias, but sometimes > it can) Yeah... I have mentioned the fact that compilers can reorder too, but obviously not enough. > You're not supposed to do it this way anyways. 
The official way to access > MMIO space is using read/write[bwlq] True, I suppose. I should make it clear that these accessor functions imply memory barriers, if indeed they do, and that you should use them rather than accessing I/O registers directly (at least, outside the arch you should). David From ak at suse.de Tue Mar 7 22:13:46 2006 From: ak at suse.de (Andi Kleen) Date: Tue, 7 Mar 2006 12:13:46 +0100 Subject: [PATCH] Document Linux's memory barriers In-Reply-To: <7621.1141756240@warthog.cambridge.redhat.com> References: <200603071134.52962.ak@suse.de> <31492.1141753245@warthog.cambridge.redhat.com> <7621.1141756240@warthog.cambridge.redhat.com> Message-ID: <200603071213.47885.ak@suse.de> On Tuesday 07 March 2006 19:30, David Howells wrote: > > You're not supposed to do it this way anyways. The official way to access > > MMIO space is using read/write[bwlq] > > True, I suppose. I should make it clear that these accessor functions imply > memory barriers, if indeed they do, I don't think they do. > and that you should use them rather than > accessing I/O registers directly (at least, outside the arch you should). Even inside the architecture it's a good idea. -Andi From alan at lxorguk.ukuu.org.uk Wed Mar 8 05:40:25 2006 From: alan at lxorguk.ukuu.org.uk (Alan Cox) Date: Tue, 07 Mar 2006 18:40:25 +0000 Subject: [PATCH] Document Linux's memory barriers In-Reply-To: <31492.1141753245@warthog.cambridge.redhat.com> References: <31492.1141753245@warthog.cambridge.redhat.com> Message-ID: <1141756825.31814.75.camel@localhost.localdomain> On Maw, 2006-03-07 at 17:40 +0000, David Howells wrote: > +Older and less complex CPUs will perform memory accesses in exactly the order > +specified, so if one is given the following piece of code: Not really true. Some of the fairly old dumb processors don't do this to the bus, and just about anything with a cache wont (as it'll burst cache lines to main memory) > + want to access to the address register, and then read or write the > + appropriate data register to access the chip's internal register: > + > + *ADR = ctl_reg_3; > + reg = *DATA; Not allowed anyway > + In this case, the barrier makes a guarantee that all memory accesses > + before the barrier will happen before all the memory accesses after the > + barrier. It does _not_ guarantee that all memory accesses before the > + barrier will be complete by the time the barrier is complete. Better meaningful example would be barriers versus an IRQ handler. Which leads nicely onto section 2 > +General memory barriers make a guarantee that all memory accesses specified > +before the barrier will happen before all memory accesses specified after the > +barrier. No. They guarantee that to an observer also running on that set of processors the accesses to main memory will appear to be ordered in that manner. They don't guarantee I/O related ordering for non main memory due to things like PCI posting rules and NUMA goings on. As an example of the difference here a Geode will reorder stores as it feels but snoop the bus such that it can ensure an external bus master cannot observe this by holding it off the bus to fix up ordering violations first. > +Read memory barriers make a guarantee that all memory reads specified before > +the barrier will happen before all memory reads specified after the barrier. > + > +Write memory barriers make a guarantee that all memory writes specified before > +the barrier will happen before all memory writes specified after the barrier. 
Both with the caveat above > +There is no guarantee that any of the memory accesses specified before a memory > +barrier will be complete by the completion of a memory barrier; the barrier can > +be considered to draw a line in the access queue that accesses of the > +appropriate type may not cross. CPU generated accesses to main memory > + (*) interrupt disablement and/or interrupts > + (*) spin locks > + (*) R/W spin locks > + (*) mutexes > + (*) semaphores > + (*) R/W semaphores Should probably cover schedule() here. > +Locks and semaphores may not provide any guarantee of ordering on UP compiled > +systems, and so can't be counted on in such a situation to actually do > +anything at all, especially with respect to I/O memory barriering. _irqsave/_irqrestore ... > +============================== > +I386 AND X86_64 SPECIFIC NOTES > +============================== > + > +Earlier i386 CPUs (pre-Pentium-III) are fully ordered - the operations on the > +bus appear in program order - and so there's no requirement for any sort of > +explicit memory barriers. Actually they are not. Processors prior to Pentium Pro ensure that the perceived ordering between processors of writes to main memory is preserved. The Pentium Pro is supposed to but does not in SMP cases. Our spin_unlock code knows about this. It also has some problems with this situation when handling write combining memory. The IDT Winchip series processors are run in out of order store mode and our lock functions and dmamappers should know enough about this. On x86 memory barriers for read serialize order using lock instructions, on write the winchip at least generates serializing instructions. barrier() is pure CPU level of course > + (*) Normal writes to memory imply wmb() [and so SFENCE is normally not > + required]. Only at an on processor level and not for all clones, also there are errata here for PPro. > + (*) Accesses to uncached memory imply mb() [eg: memory mapped I/O]. Not always. MMIO ordering is outside of the CPU ordering rules and into PCI and other bus ordering rules. Consider writel(STOP_DMA, &foodev->ctrl); free_dma_buffers(foodev); This leads to horrible disasters. > + > +====================== > +POWERPC SPECIFIC NOTES Can't comment on PPC From jbarnes at virtuousgeek.org Wed Mar 8 05:46:11 2006 From: jbarnes at virtuousgeek.org (Jesse Barnes) Date: Tue, 7 Mar 2006 10:46:11 -0800 Subject: [PATCH] Document Linux's memory barriers In-Reply-To: <7621.1141756240@warthog.cambridge.redhat.com> References: <200603071134.52962.ak@suse.de> <31492.1141753245@warthog.cambridge.redhat.com> <7621.1141756240@warthog.cambridge.redhat.com> Message-ID: <200603071046.11980.jbarnes@virtuousgeek.org> On Tuesday, March 7, 2006 10:30 am, David Howells wrote: > True, I suppose. I should make it clear that these accessor functions > imply memory barriers, if indeed they do, and that you should use them > rather than accessing I/O registers directly (at least, outside the > arch you should). But they don't, that's why we have mmiowb(). There are lots of cases to handle: 1) memory vs. memory 2) memory vs. I/O 3) I/O vs. I/O (reads and writes for every case). AFAIK, we have (1) fairly well handled with a plethora of barrier ops. (2) is a bit fuzzy with the current operations I think, and for (3) all we have is mmiowb() afaik. Maybe one of the ppc64 guys can elaborate on the barriers their hw needs for the above cases (I think they're the pathological case, so covering them should be good enough everybody). 
Btw, thanks for putting together this documentation, it's desperately
needed.

Jesse

From jbarnes at virtuousgeek.org  Wed Mar  8 04:56:16 2006
From: jbarnes at virtuousgeek.org (Jesse Barnes)
Date: Tue, 7 Mar 2006 09:56:16 -0800
Subject: Memory barriers and spin_unlock safety
In-Reply-To: <20060307174057.GD7301@parisc-linux.org>
References: <5041.1141417027@warthog.cambridge.redhat.com>
	<31420.1141753019@warthog.cambridge.redhat.com>
	<20060307174057.GD7301@parisc-linux.org>
Message-ID: <200603070956.16763.jbarnes@virtuousgeek.org>

On Tuesday, March 7, 2006 9:40 am, Matthew Wilcox wrote:
> On Tue, Mar 07, 2006 at 05:36:59PM +0000, David Howells wrote:
> > David Howells wrote:
> > > I suspect, then, that x86_64 should not have an SFENCE for
> > > smp_wmb(), and that only io_wmb() should have that.
> >
> > Hmmm... We don't actually have io_wmb()... Should the following be
> > added to all archs?
> >
> > 	io_mb()
> > 	io_rmb()
> > 	io_wmb()
>
> it's spelled mmiowb(), and reads from IO space are synchronous, so
> don't need barriers.

To expand on willy's note, the reason it's called mmiowb as opposed to iowb
is that I/O port accesses (inX/outX) are inherently synchronous and don't
need barriers. mmio writes (writeX), however, need barrier operations to
ensure ordering on some platforms. This raises the question of what
semantics the unified I/O mapping routines have... are ioreadX/iowriteX
synchronous or should we define the barriers you mention above for them?
(IIRC ppc64 can use an io read ordering op).

Jesse

From alan at lxorguk.ukuu.org.uk  Wed Mar  8 05:55:06 2006
From: alan at lxorguk.ukuu.org.uk (Alan Cox)
Date: Tue, 07 Mar 2006 18:55:06 +0000
Subject: Memory barriers and spin_unlock safety
In-Reply-To:
References: <5041.1141417027@warthog.cambridge.redhat.com>
	<32518.1141401780@warthog.cambridge.redhat.com>
	<1146.1141404346@warthog.cambridge.redhat.com>
	<31420.1141753019@warthog.cambridge.redhat.com>
	<1141755496.31814.56.camel@localhost.localdomain>
Message-ID: <1141757706.31814.80.camel@localhost.localdomain>

On Maw, 2006-03-07 at 10:28 -0800, Linus Torvalds wrote:
> x86 tends to serialize PIO too much (I think at least Intel CPU's will
> actually wait for the PIO write to be acknowledged by _something_ on the
> bus, although it obviously can't wait for the device to have acted on it).

Don't bet on that 8(

In the PCI case the I/O write appears to be acked by the bridges used on
x86 when the write completes on the PCI bus and then back to the CPU.
MMIO is thankfully posted. At least that's how the timings on some
devices look.

From alan at lxorguk.ukuu.org.uk  Wed Mar  8 05:18:16 2006
From: alan at lxorguk.ukuu.org.uk (Alan Cox)
Date: Tue, 07 Mar 2006 18:18:16 +0000
Subject: Memory barriers and spin_unlock safety
In-Reply-To: <31420.1141753019@warthog.cambridge.redhat.com>
References: <5041.1141417027@warthog.cambridge.redhat.com>
	<32518.1141401780@warthog.cambridge.redhat.com>
	<1146.1141404346@warthog.cambridge.redhat.com>
	<31420.1141753019@warthog.cambridge.redhat.com>
Message-ID: <1141755496.31814.56.camel@localhost.localdomain>

On Maw, 2006-03-07 at 17:36 +0000, David Howells wrote:
> Hmmm... We don't actually have io_wmb()... Should the following be added to
> all archs?
>
> 	io_mb()
> 	io_rmb()
> 	io_wmb()

What kind of mb/rmb/wmb goes with ioread/iowrite ? It seems we actually
need one that can work out what to do for the general io API ?
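A sketch of the distinction Jesse draws, for a hypothetical device reachable
both through I/O ports and through an MMIO window (the names and register
offsets are made up for illustration):

	#include <linux/types.h>
	#include <asm/io.h>

	#define FOO_ADDR	0x12	/* hypothetical index ("address") register */
	#define FOO_DATA	0x10	/* hypothetical data register */

	/* Port I/O: inw()/outw() are synchronous, so no barrier is needed. */
	static u16 foo_pio_read_reg(unsigned long ioport, u16 index)
	{
		outw(index, ioport + FOO_ADDR);
		return inw(ioport + FOO_DATA);
	}

	/*
	 * MMIO: the write may be posted; the read from the same device
	 * pulls the posted write through, and on some platforms ordering
	 * against other CPUs' MMIO writes additionally needs mmiowb().
	 */
	static u16 foo_mmio_read_reg(void __iomem *regs, u16 index)
	{
		writew(index, regs + FOO_ADDR);
		return readw(regs + FOO_DATA);
	}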
From matthew at wil.cx  Wed Mar  8 06:06:02 2006
From: matthew at wil.cx (Matthew Wilcox)
Date: Tue, 7 Mar 2006 12:06:02 -0700
Subject: [PATCH] Document Linux's memory barriers
In-Reply-To:
References: <31492.1141753245@warthog.cambridge.redhat.com>
	<1141756825.31814.75.camel@localhost.localdomain>
Message-ID: <20060307190602.GE7301@parisc-linux.org>

On Tue, Mar 07, 2006 at 01:54:33PM -0500, linux-os (Dick Johnson) wrote:
> This might be a good place to document:
> 	dummy = readl(&foodev->ctrl);
>
> Will flush all pending writes to the PCI bus and that:
> 	(void) readl(&foodev->ctrl);
> ... won't because `gcc` may optimize it away. In fact, variable
> "dummy" should be global or `gcc` may make it go away as well.

static inline unsigned int readl(const volatile void __iomem *addr)
{
	return *(volatile unsigned int __force *) addr;
}

The cast is volatile, so gcc knows not to optimise it away.

From linux-os at analogic.com  Wed Mar  8 06:15:42 2006
From: linux-os at analogic.com (linux-os (Dick Johnson))
Date: Tue, 7 Mar 2006 14:15:42 -0500
Subject: [PATCH] Document Linux's memory barriers
In-Reply-To: <20060307190602.GE7301@parisc-linux.org>
References: <31492.1141753245@warthog.cambridge.redhat.com>
	<1141756825.31814.75.camel@localhost.localdomain>
	<20060307190602.GE7301@parisc-linux.org>
Message-ID:

On Tue, 7 Mar 2006, Matthew Wilcox wrote:

> On Tue, Mar 07, 2006 at 01:54:33PM -0500, linux-os (Dick Johnson) wrote:
>> This might be a good place to document:
>> 	dummy = readl(&foodev->ctrl);
>>
>> Will flush all pending writes to the PCI bus and that:
>> 	(void) readl(&foodev->ctrl);
>> ... won't because `gcc` may optimize it away. In fact, variable
>> "dummy" should be global or `gcc` may make it go away as well.
>
> static inline unsigned int readl(const volatile void __iomem *addr)
> {
> 	return *(volatile unsigned int __force *) addr;
> }
>
> The cast is volatile, so gcc knows not to optimise it away.
>

When the assignment is not made (i.e. the result is cast to void), or when
the assignment is made to an otherwise unused variable, `gcc` does, indeed,
make it go away. These problems caused weeks of chagrin after it was found
that a PCI DMA operation took 20 or more times longer than it should. The
writel(START_DMA, &control), followed by a dummy = readl(&control), ended
up with the readl() missing. That meant that the DMA didn't start until
some timer code read a status register, wondering why it hadn't completed
yet.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.15.4 on an i686 machine (5589.50 BogoMips).
Warning : 98.36% of all statistics are fiction, book release in April.
 _

****************************************************************
The information transmitted in this message is confidential and may be
privileged.  Any review, retransmission, dissemination, or other use of
this information by persons or entities other than the intended recipient
is prohibited.  If you are not the intended recipient, please notify
Analogic Corporation immediately - by replying to this message or by
sending an email to DeliveryErrors at analogic.com - and destroy all copies
of this information, including any attachments, without reading or
disclosing them.  Thank you.
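Putting Alan's and Matthew's points together, the read-back flush pattern
looks like this. It is a sketch only: foodev, its ctrl register, STOP_DMA
and free_dma_buffers() are carried over from Alan's hypothetical example,
with stub declarations added here so the fragment is self-contained:

	#include <linux/types.h>
	#include <asm/io.h>

	#define STOP_DMA 0		/* hypothetical command value */

	struct foo_device {
		u32 ctrl;		/* hypothetical ioremap()ed control register */
	};

	extern void free_dma_buffers(struct foo_device *foodev);

	static void foo_stop_dma(struct foo_device *foodev)
	{
		writel(STOP_DMA, &foodev->ctrl);
		/* Read back from the same device: a PCI read cannot pass the
		 * posted write, so when readl() returns the device has seen
		 * STOP_DMA.  The volatile cast inside readl() stops gcc from
		 * discarding the read even though its value is unused. */
		readl(&foodev->ctrl);
		free_dma_buffers(foodev);
	}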
From dhowells at redhat.com Wed Mar 8 06:24:03 2006 From: dhowells at redhat.com (David Howells) Date: Tue, 07 Mar 2006 19:24:03 +0000 Subject: [PATCH] Document Linux's memory barriers In-Reply-To: <200603071213.47885.ak@suse.de> References: <200603071213.47885.ak@suse.de> <200603071134.52962.ak@suse.de> <31492.1141753245@warthog.cambridge.redhat.com> <7621.1141756240@warthog.cambridge.redhat.com> Message-ID: <8620.1141759443@warthog.cambridge.redhat.com> Andi Kleen wrote: > > > You're not supposed to do it this way anyways. The official way to access > > > MMIO space is using read/write[bwlq] > > > > True, I suppose. I should make it clear that these accessor functions imply > > memory barriers, if indeed they do, > > I don't think they do. Hmmm.. Seems Stephen Hemminger disagrees: | > > 1) Access to i/o mapped memory does not need memory barriers. | > | > There's no guarantee of that. On FRV you have to insert barriers as | > appropriate when you're accessing I/O mapped memory if ordering is required | > (accessing an ethernet card vs accessing a frame buffer), but support for | > inserting the appropriate barriers is built into gcc - which knows the rules | > for when to insert them. | > | > Or are you referring to the fact that this should be implicit in inX(), | > outX(), readX(), writeX() and similar? | | yes David From linux-os at analogic.com Wed Mar 8 05:54:33 2006 From: linux-os at analogic.com (linux-os (Dick Johnson)) Date: Tue, 7 Mar 2006 13:54:33 -0500 Subject: [PATCH] Document Linux's memory barriers In-Reply-To: <1141756825.31814.75.camel@localhost.localdomain> References: <31492.1141753245@warthog.cambridge.redhat.com> <1141756825.31814.75.camel@localhost.localdomain> Message-ID: On Tue, 7 Mar 2006, Alan Cox wrote: [SNIPPED...] > > Not always. MMIO ordering is outside of the CPU ordering rules and into > PCI and other bus ordering rules. Consider > > writel(STOP_DMA, &foodev->ctrl); > free_dma_buffers(foodev); > > This leads to horrible disasters. This might be a good place to document: dummy = readl(&foodev->ctrl); Will flush all pending writes to the PCI bus and that: (void) readl(&foodev->ctrl); ... won't because `gcc` may optimize it away. In fact, variable "dummy" should be global or `gcc` may make it go away as well. Cheers, Dick Johnson Penguin : Linux version 2.6.15.4 on an i686 machine (5589.50 BogoMips). Warning : 98.36% of all statistics are fiction, book release in April. _  **************************************************************** The information transmitted in this message is confidential and may be privileged. Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited. If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to DeliveryErrors at analogic.com - and destroy all copies of this information, including any attachments, without reading or disclosing them. Thank you. 
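Barriers also matter in the other direction, when a bus-mastering device
writes into main memory. A sketch, assuming a hypothetical completion word
in coherent DMA memory that the device writes only after the data it
transfers (all names here are illustrative):

	#include <linux/types.h>
	#include <linux/errno.h>
	#include <linux/jiffies.h>

	#define FOO_DONE 1		/* hypothetical completion value */

	struct foo_status {
		volatile u32 done;	/* written last by the device */
		u32 data[16];		/* DMA'd in first by the device */
	};

	static int foo_wait(struct foo_status *st, unsigned long deadline)
	{
		while (st->done != FOO_DONE) {
			if (time_after(jiffies, deadline))
				return -ETIMEDOUT;
			cpu_relax();	/* also a compiler barrier */
		}
		rmb();	/* see "done" set before reading data[] */
		return 0;
	}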
From ak at suse.de Tue Mar 7 22:57:23 2006 From: ak at suse.de (Andi Kleen) Date: Tue, 7 Mar 2006 12:57:23 +0100 Subject: [PATCH] Document Linux's memory barriers In-Reply-To: <1141759408.2617.9.camel@serpentine.pathscale.com> References: <200603071134.52962.ak@suse.de> <7621.1141756240@warthog.cambridge.redhat.com> <1141759408.2617.9.camel@serpentine.pathscale.com> Message-ID: <200603071257.24234.ak@suse.de> On Tuesday 07 March 2006 20:23, Bryan O'Sullivan wrote: > On Tue, 2006-03-07 at 18:30 +0000, David Howells wrote: > > > True, I suppose. I should make it clear that these accessor functions imply > > memory barriers, if indeed they do, > > They don't, but according to Documentation/DocBook/deviceiobook.tmpl > they are performed by the compiler in the order specified. I don't think that's correct. Probably the documentation should be fixed. -Andi From alan at lxorguk.ukuu.org.uk Wed Mar 8 06:33:41 2006 From: alan at lxorguk.ukuu.org.uk (Alan Cox) Date: Tue, 07 Mar 2006 19:33:41 +0000 Subject: [PATCH] Document Linux's memory barriers In-Reply-To: References: <31492.1141753245@warthog.cambridge.redhat.com> <1141756825.31814.75.camel@localhost.localdomain> Message-ID: <1141760021.2455.1.camel@localhost.localdomain> On Maw, 2006-03-07 at 13:54 -0500, linux-os (Dick Johnson) wrote: > On Tue, 7 Mar 2006, Alan Cox wrote: > > writel(STOP_DMA, &foodev->ctrl); > > free_dma_buffers(foodev); > > > > This leads to horrible disasters. > > This might be a good place to document: > dummy = readl(&foodev->ctrl); Absolutely. And this falls outside of the memory barrier functions. > > Will flush all pending writes to the PCI bus and that: > (void) readl(&foodev->ctrl); > ... won't because `gcc` may optimize it away. In fact, variable > "dummy" should be global or `gcc` may make it go away as well. If they were ordinary functions then maybe, but they are not so a simple readl(&foodev->ctrl) will be sufficient and isn't optimised away. Alan From bos at serpentine.com Wed Mar 8 06:23:28 2006 From: bos at serpentine.com (Bryan O'Sullivan) Date: Tue, 07 Mar 2006 11:23:28 -0800 Subject: [PATCH] Document Linux's memory barriers In-Reply-To: <7621.1141756240@warthog.cambridge.redhat.com> References: <200603071134.52962.ak@suse.de> <31492.1141753245@warthog.cambridge.redhat.com> <7621.1141756240@warthog.cambridge.redhat.com> Message-ID: <1141759408.2617.9.camel@serpentine.pathscale.com> On Tue, 2006-03-07 at 18:30 +0000, David Howells wrote: > True, I suppose. I should make it clear that these accessor functions imply > memory barriers, if indeed they do, They don't, but according to Documentation/DocBook/deviceiobook.tmpl they are performed by the compiler in the order specified. They also convert between PCI byte order and CPU byte order. If you want to avoid that, you need the __raw_* versions, which are not guaranteed to be provided by all arches. References: <200603071134.52962.ak@suse.de> <1141759408.2617.9.camel@serpentine.pathscale.com> <200603071257.24234.ak@suse.de> Message-ID: <200603071201.22397.jbarnes@virtuousgeek.org> On Tuesday, March 7, 2006 3:57 am, Andi Kleen wrote: > On Tuesday 07 March 2006 20:23, Bryan O'Sullivan wrote: > > On Tue, 2006-03-07 at 18:30 +0000, David Howells wrote: > > > True, I suppose. I should make it clear that these accessor > > > functions imply memory barriers, if indeed they do, > > > > They don't, but according to Documentation/DocBook/deviceiobook.tmpl > > they are performed by the compiler in the order specified. > > I don't think that's correct. 
Probably the documentation should > be fixed. On ia64 I'm pretty sure it's true, and it seems like it should be in the general case too. The compiler shouldn't reorder uncached memory accesses with volatile semantics... Jesse From dhowells at redhat.com Wed Mar 8 07:09:07 2006 From: dhowells at redhat.com (David Howells) Date: Tue, 07 Mar 2006 20:09:07 +0000 Subject: [PATCH] Document Linux's memory barriers In-Reply-To: <1141756825.31814.75.camel@localhost.localdomain> References: <1141756825.31814.75.camel@localhost.localdomain> <31492.1141753245@warthog.cambridge.redhat.com> Message-ID: <9551.1141762147@warthog.cambridge.redhat.com> Alan Cox wrote: > Better meaningful example would be barriers versus an IRQ handler. Which > leads nicely onto section 2 Yes, except that I can't think of one that's feasible that doesn't have to do with I/O - which isn't a problem if you are using the proper accessor functions. Such an example has to involve more than one CPU, because you don't tend to get memory/memory ordering problems on UP. The obvious one might be circular buffers, except there's no problem there provided you have a memory barrier between accessing the buffer and updating your pointer into it. David From torvalds at osdl.org Wed Mar 8 07:21:18 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Tue, 7 Mar 2006 12:21:18 -0800 (PST) Subject: Memory barriers and spin_unlock safety In-Reply-To: <1141757706.31814.80.camel@localhost.localdomain> References: <5041.1141417027@warthog.cambridge.redhat.com> <32518.1141401780@warthog.cambridge.redhat.com> <1146.1141404346@warthog.cambridge.redhat.com> <31420.1141753019@warthog.cambridge.redhat.com> <1141755496.31814.56.camel@localhost.localdomain> <1141757706.31814.80.camel@localhost.localdomain> Message-ID: On Tue, 7 Mar 2006, Alan Cox wrote: > > In the PCI case the I/O write appears to be acked by the bridges used on > x86 when the write completes on the PCI bus and then back to the CPU. > MMIO is thankfully posted. At least thats how the timings on some > devices look. Oh, absolutely. I'm sayign that you shouldn't wait for even that, since it's totally pointless (it's not synchronized _anyway_) and adds absolutely zero gain. To really synchronize, you need to read from the device anyway. So the "wait for bus activity" is just making PIO slower for no good reason, and keeps the core waiting when it could do something more useful. On an x86, there are legacy reasons to do it (people expect certain timings). But that was what I was saying - on non-x86 architectures, there's no reason for the ioread/iowrite interfaces to be as serializing as the old-fashioned PIO ones are. Might as well do the MMIO rules for a non-cacheable region: no re-ordering, but no waiting for the bus either. Linus From bos at serpentine.com Wed Mar 8 08:14:45 2006 From: bos at serpentine.com (Bryan O'Sullivan) Date: Tue, 07 Mar 2006 13:14:45 -0800 Subject: [PATCH] Document Linux's memory barriers In-Reply-To: <200603071257.24234.ak@suse.de> References: <200603071134.52962.ak@suse.de> <7621.1141756240@warthog.cambridge.redhat.com> <1141759408.2617.9.camel@serpentine.pathscale.com> <200603071257.24234.ak@suse.de> Message-ID: <1141766085.5255.12.camel@serpentine.pathscale.com> On Tue, 2006-03-07 at 12:57 +0100, Andi Kleen wrote: > > > True, I suppose. 
I should make it clear that these accessor functions imply > > > memory barriers, if indeed they do, > > > > They don't, but according to Documentation/DocBook/deviceiobook.tmpl > > they are performed by the compiler in the order specified. > > I don't think that's correct. Probably the documentation should > be fixed. That's why I hedged my words with "according to ..." :-) But on most arches those accesses do indeed seem to happen in-order. On i386 and x86_64, it's a natural consequence of program store ordering. On at least some other arches, there are explicit memory barriers in the implementation of the access macros to force this ordering to occur. References: <200603071134.52962.ak@suse.de> <200603071257.24234.ak@suse.de> <1141766085.5255.12.camel@serpentine.pathscale.com> Message-ID: <200603072224.09976.ak@suse.de> On Tuesday 07 March 2006 22:14, Bryan O'Sullivan wrote: > On Tue, 2006-03-07 at 12:57 +0100, Andi Kleen wrote: > > > > True, I suppose. I should make it clear that these accessor functions > > > > imply memory barriers, if indeed they do, > > > > > > They don't, but according to Documentation/DocBook/deviceiobook.tmpl > > > they are performed by the compiler in the order specified. > > > > I don't think that's correct. Probably the documentation should > > be fixed. > > That's why I hedged my words with "according to ..." :-) > > But on most arches those accesses do indeed seem to happen in-order. On > i386 and x86_64, it's a natural consequence of program store ordering. Not true for reads on x86. -Andi From alan at lxorguk.ukuu.org.uk Wed Mar 8 11:32:03 2006 From: alan at lxorguk.ukuu.org.uk (Alan Cox) Date: Wed, 08 Mar 2006 00:32:03 +0000 Subject: [PATCH] Document Linux's memory barriers In-Reply-To: <9551.1141762147@warthog.cambridge.redhat.com> References: <1141756825.31814.75.camel@localhost.localdomain> <31492.1141753245@warthog.cambridge.redhat.com> <9551.1141762147@warthog.cambridge.redhat.com> Message-ID: <1141777924.2455.21.camel@localhost.localdomain> On Maw, 2006-03-07 at 20:09 +0000, David Howells wrote: > Alan Cox wrote: > > > Better meaningful example would be barriers versus an IRQ handler. Which > > leads nicely onto section 2 > > Yes, except that I can't think of one that's feasible that doesn't have to do > with I/O - which isn't a problem if you are using the proper accessor > functions. We get them off bus masters for one and you can construct silly versions of the other. There are several kernel instances of while(*ptr != HAVE_RESPONDED && time_before(jiffies, timeout)) rmb(); where we wait for hardware to bus master respond when it is fast and doesn't IRQ. From alan at lxorguk.ukuu.org.uk Wed Mar 8 11:35:16 2006 From: alan at lxorguk.ukuu.org.uk (Alan Cox) Date: Wed, 08 Mar 2006 00:35:16 +0000 Subject: [PATCH] Document Linux's memory barriers In-Reply-To: <200603071257.24234.ak@suse.de> References: <200603071134.52962.ak@suse.de> <7621.1141756240@warthog.cambridge.redhat.com> <1141759408.2617.9.camel@serpentine.pathscale.com> <200603071257.24234.ak@suse.de> Message-ID: <1141778116.2455.23.camel@localhost.localdomain> On Maw, 2006-03-07 at 12:57 +0100, Andi Kleen wrote: > > They don't, but according to Documentation/DocBook/deviceiobook.tmpl > > they are performed by the compiler in the order specified. > > I don't think that's correct. Probably the documentation should > be fixed. It would be wiser to ensure they are performed in the order specified. 
As far as I can see this is currently true due to the volatile cast and most drivers rely on this property so the brown and sticky will impact the rotating air impeller pretty fast if it isnt. From alan at lxorguk.ukuu.org.uk Wed Mar 8 11:36:58 2006 From: alan at lxorguk.ukuu.org.uk (Alan Cox) Date: Wed, 08 Mar 2006 00:36:58 +0000 Subject: [PATCH] Document Linux's memory barriers In-Reply-To: <200603072224.09976.ak@suse.de> References: <200603071134.52962.ak@suse.de> <200603071257.24234.ak@suse.de> <1141766085.5255.12.camel@serpentine.pathscale.com> <200603072224.09976.ak@suse.de> Message-ID: <1141778218.2455.25.camel@localhost.localdomain> On Maw, 2006-03-07 at 22:24 +0100, Andi Kleen wrote: > > But on most arches those accesses do indeed seem to happen in-order. On > > i386 and x86_64, it's a natural consequence of program store ordering. > > Not true for reads on x86. You must have a strange kernel Andi. Mine marks them as volatile unsigned char * references. Alan From nickpiggin at yahoo.com.au Wed Mar 8 13:07:37 2006 From: nickpiggin at yahoo.com.au (Nick Piggin) Date: Wed, 08 Mar 2006 13:07:37 +1100 Subject: [PATCH] Document Linux's memory barriers In-Reply-To: <31492.1141753245@warthog.cambridge.redhat.com> References: <31492.1141753245@warthog.cambridge.redhat.com> Message-ID: <440E3C69.2040902@yahoo.com.au> David Howells wrote: >The attached patch documents the Linux kernel's memory barriers. > >Signed-Off-By: David Howells >--- > > Good :) >+============================== >+IMPLIED KERNEL MEMORY BARRIERS >+============================== >+ >+Some of the other functions in the linux kernel imply memory barriers. For >+instance all the following (pseudo-)locking functions imply barriers. >+ >+ (*) interrupt disablement and/or interrupts > Is this really the case? I mean interrupt disablement only synchronises with the local CPU, so it probably should not _have_ to imply barriers (eg. some architectures are playing around with "virtual" interrupt disablement). [...] >+ >+Either interrupt disablement (LOCK) and enablement (UNLOCK) will barrier >+memory and I/O accesses individually, or interrupt handling will barrier >+memory and I/O accesses on entry and on exit. This prevents an interrupt >+routine interfering with accesses made in a disabled-interrupt section of code >+and vice versa. >+ > But CPUs should always be consistent WRT themselves, so I'm not sure that it is needed? Thanks, Nick -- Send instant messages to your online friends http://au.messenger.yahoo.com From paulus at samba.org Wed Mar 8 14:10:01 2006 From: paulus at samba.org (Paul Mackerras) Date: Wed, 8 Mar 2006 14:10:01 +1100 Subject: [PATCH] Document Linux's memory barriers In-Reply-To: <31492.1141753245@warthog.cambridge.redhat.com> References: <31492.1141753245@warthog.cambridge.redhat.com> Message-ID: <17422.19209.60360.178668@cargo.ozlabs.ibm.com> David Howells writes: > The attached patch documents the Linux kernel's memory barriers. Thanks for venturing into this particular lion's den. :) > +Memory barriers are instructions to both the compiler and the CPU to impose a > +partial ordering between the memory access operations specified either side of > +the barrier. ... as observed from another agent in the system - another CPU or a bus-mastering I/O device. A given CPU will always see its own memory accesses in order. > + (*) reads are synchronous and may need to be done immediately to permit Leave out the "are synchronous and". It's not true. I also think you need to avoid talking about "the bus". 
Some systems don't have a bus, but rather have an interconnection fabric
between the CPUs and the memories.  Talking about a bus implies that all
memory accesses in fact get serialized (by having to be sent one after the
other over the bus) and that you can therefore talk about the order in which
they get to memory.  In some systems, no such order exists.

It's possible to talk sensibly about the order in which memory accesses get
done without talking about a bus or requiring a total ordering on the memory
accesses.  The PowerPC architecture spec does this by specifying that in
certain circumstances one load or store has to be "performed with respect to
other processors and mechanisms" before another.  A load is said to be
performed with respect to another agent when a store by that agent can no
longer change the value returned by the load.  Similarly, a store is
performed w.r.t. an agent when any load done by the agent will return the
value stored (or a later value).

> +	The way to deal with this is to insert an I/O memory barrier between the
> +	two accesses:
> +
> +		*ADR = ctl_reg_3;
> +		mb();
> +		reg = *DATA;

Ummm, this implies mb() is "an I/O memory barrier".  I can see people
getting confused if they read this and then see mb() being used when no I/O
is being done.

> +The Linux kernel has six basic memory barriers:
> +
> +		MANDATORY (I/O)	SMP
> +		===============	================
> +	GENERAL	mb()		smp_mb()
> +	READ	rmb()		smp_rmb()
> +	WRITE	wmb()		smp_wmb()
> +
> +General memory barriers make a guarantee that all memory accesses specified
> +before the barrier will happen before all memory accesses specified after the
> +barrier.

By "memory accesses" do you mean accesses to system memory, or do you mean
loads and stores - which may be to system memory, memory on an I/O device
(e.g. a framebuffer) or to memory-mapped I/O registers?

Linus explained recently that wmb() on x86 does not order stores to system
memory w.r.t. stores to prefetchable I/O memory (at least that's what I
think he said ;).

> +Some of the other functions in the linux kernel imply memory barriers.  For
> +instance all the following (pseudo-)locking functions imply barriers.
> +
> + (*) interrupt disablement and/or interrupts

Enabling/disabling interrupts doesn't imply a barrier on powerpc, and nor
does taking an interrupt or returning from one.

> + (*) spin locks

I think it's still an open question as to whether spin locks do any ordering
between accesses to system memory and accesses to I/O registers.

> + (*) R/W spin locks
> + (*) mutexes
> + (*) semaphores
> + (*) R/W semaphores
> +
> +In all cases there are variants on a LOCK operation and an UNLOCK operation.
> +
> + (*) LOCK operation implication:
> +
> +     Memory accesses issued after the LOCK will be completed after the LOCK
> +     accesses have completed.
> +
> +     Memory accesses issued before the LOCK may be completed after the LOCK
> +     accesses have completed.
> +
> + (*) UNLOCK operation implication:
> +
> +     Memory accesses issued before the UNLOCK will be completed before the
> +     UNLOCK accesses have completed.
> +
> +     Memory accesses issued after the UNLOCK may be completed before the UNLOCK
> +     accesses have completed.

And therefore an UNLOCK followed by a LOCK is equivalent to a full barrier,
but a LOCK followed by an UNLOCK isn't.

> +Either interrupt disablement (LOCK) and enablement (UNLOCK) will barrier
> +memory and I/O accesses individually, or interrupt handling will barrier
> +memory and I/O accesses on entry and on exit.
This prevents an interrupt > +routine interfering with accesses made in a disabled-interrupt section of code > +and vice versa. I don't think this is right, and I don't think it is necessary to achieve the end you state, since a CPU will always see its own memory accesses in program order. > +The following sequence of events on the bus is acceptable: > + > + LOCK, *F+*A, *E, *C+*D, *B, UNLOCK What does *F+*A mean? > +Consider also the following (going back to the AMD PCnet example): > + > + DISABLE IRQ > + *ADR = ctl_reg_3; > + mb(); > + x = *DATA; > + *ADR = ctl_reg_4; > + mb(); > + *DATA = y; > + *ADR = ctl_reg_5; > + mb(); > + z = *DATA; > + ENABLE IRQ > + > + *ADR = ctl_reg_7; > + mb(); > + q = *DATA > + > + > +What's to stop "z = *DATA" crossing "*ADR = ctl_reg_7" and reading from the > +wrong register? (There's no guarantee that the process of handling an > +interrupt will barrier memory accesses in any way). Well, the driver should *not* be doing *ADR at all, it should be using read[bwl]/write[bwl]. The architecture code has to implement read*/write* in such a way that the accesses generated can't be reordered. I _think_ it also has to make sure the write accesses can't be write-combined, but it would be good to have that clarified. > +====================== > +POWERPC SPECIFIC NOTES > +====================== > + > +The powerpc is weakly ordered, and its read and write accesses may be > +completed generally in any order. It's memory barriers are also to some extent > +more substantial than the mimimum requirement, and may directly effect > +hardware outside of the CPU. Unfortunately mb()/smp_mb() are quite expensive on PowerPC, since the only instruction we have that implies a strong enough barrier is sync, which also performs several other kinds of synchronization, such as waiting until all previous instructions have completed executing to the point where they can no longer cause an exception. Paul. From paulus at samba.org Wed Mar 8 14:20:57 2006 From: paulus at samba.org (Paul Mackerras) Date: Wed, 8 Mar 2006 14:20:57 +1100 Subject: Memory barriers and spin_unlock safety In-Reply-To: References: <32518.1141401780@warthog.cambridge.redhat.com> <17417.29375.87604.537434@cargo.ozlabs.ibm.com> Message-ID: <17422.19865.635112.820824@cargo.ozlabs.ibm.com> Linus Torvalds writes: > So the rules from the PC side (and like it or not, they end up being > what all the drivers are tested with) are: > > - regular stores are ordered by write barriers I thought regular stores were always ordered anyway? > - PIO stores are always synchronous By synchronous, do you mean ordered with respect to all other accesses (regular memory, MMIO, prefetchable MMIO, PIO)? In other words, if I store a value in regular memory, then do an outb() to a device, and the device does a DMA read to the location I just stored to, is the device guaranteed to see the value I just stored (assuming no other later store to the location)? > - MMIO stores are ordered by IO semantics > - PCI ordering must be honored: > * write combining is only allowed on PCI memory resources > that are marked prefetchable. If your host bridge does write > combining in general, it's not a "host bridge", it's a "host > disaster". Presumably the host bridge doesn't know what sort of PCI resource is mapped at a given address, so that information (whether the resource is prefetchable) must come from the CPU, which would get it from the TLB entry or an MTRR entry - is that right? 
Or is there some gentleman's agreement between the host bridge and the BIOS that certain address ranges are only used for certain types of PCI memory resources? > * for others, writes can always be posted, but they cannot > be re-ordered wrt either reads or writes to that device > (ie a read will always be fully synchronizing) > - io_wmb must be honored What ordering is there between stores to regular memory and stores to non-prefetchable MMIO? If a store to regular memory can be performed before a store to MMIO, does a wmb() suffice to enforce an ordering, or do you have to use mmiowb()? Do PCs ever use write-through caching on prefetchable MMIO resources? Thanks, Paul. From torvalds at osdl.org Wed Mar 8 14:30:33 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Tue, 7 Mar 2006 19:30:33 -0800 (PST) Subject: [PATCH] Document Linux's memory barriers In-Reply-To: <17422.19209.60360.178668@cargo.ozlabs.ibm.com> References: <31492.1141753245@warthog.cambridge.redhat.com> <17422.19209.60360.178668@cargo.ozlabs.ibm.com> Message-ID: On Wed, 8 Mar 2006, Paul Mackerras wrote: > > Linus explained recently that wmb() on x86 does not order stores to > system memory w.r.t. stores to stores to prefetchable I/O memory (at > least that's what I think he said ;). In fact, it won't order stores to normal memory even wrt any _non-prefetchable_ IO memory. PCI (and any other sane IO fabric, for that matter) will do IO posting, so the fact that the CPU _core_ may order them due to a wmb() doesn't actually mean anything. The only way to _really_ synchronize with a store to an IO device is literally to read from that device (*). No amount of memory barriers will do it. So you can really only order stores to regular memory wrt each other, and stores to IO memory wrt each other. For the former, "smp_wmb()" does it. For IO memory, normal IO memory is _always_ supposed to be in program order (at least for PCI. It's part of how the bus is supposed to work), unless the IO range allows prefetching (and you've set some MTRR). And if you do, that, currently you're kind of screwed. mmiowb() should do it, but nobody really uses it, and I think it's broken on x86 (it's a no-op, it really should be an "sfence"). A full "mb()" is probably most likely to work in practice. And yes, we should clean this up. Linus (*) The "read" can of course be any event that tells you that the store has happened - it doesn't necessarily have to be an actual "read[bwl]()" operation. Eg the store might start a command, and when you get the completion interrupt, you obviously know that the store is done, just from a causal reason. From torvalds at osdl.org Wed Mar 8 14:54:46 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Tue, 7 Mar 2006 19:54:46 -0800 (PST) Subject: Memory barriers and spin_unlock safety In-Reply-To: <17422.19865.635112.820824@cargo.ozlabs.ibm.com> References: <32518.1141401780@warthog.cambridge.redhat.com> <17417.29375.87604.537434@cargo.ozlabs.ibm.com> <17422.19865.635112.820824@cargo.ozlabs.ibm.com> Message-ID: On Wed, 8 Mar 2006, Paul Mackerras wrote: > > Linus Torvalds writes: > > > So the rules from the PC side (and like it or not, they end up being > > what all the drivers are tested with) are: > > > > - regular stores are ordered by write barriers > > I thought regular stores were always ordered anyway? For the hw, yes. For the compiler no. So you actually do end up needing write barriers even on x86. 
It won't compile to any actual _instruction_, but it will be a compiler barrier (ie it just ends up being an empty inline asm that "modifies" memory). So forgetting the wmb() is a bug even on x86, unless you happen to program in assembly. Of course, the x86 hw semantics _do_ mean that forgetting it is less likely to cause problems, just because the compiler re-ordering is fairly unlikely most of the time. > > - PIO stores are always synchronous > > By synchronous, do you mean ordered with respect to all other accesses > (regular memory, MMIO, prefetchable MMIO, PIO)? Close, yes. HOWEVER, it's only really ordered wrt the "innermost" bus. I don't think PCI bridges are supposed to post PIO writes, but a x86 CPU basically won't stall for them forever. I _think_ they'll wait for it to hit that external bus, though. So it's totally serializing in the sense that all preceding reads have completed and all preceding writes have hit the cache-coherency point, but you don't necessarily know when the write itself will hit the device (the write will return before that necessarily happens). > In other words, if I store a value in regular memory, then do an > outb() to a device, and the device does a DMA read to the location I > just stored to, is the device guaranteed to see the value I just > stored (assuming no other later store to the location)? Yes, assuming that the DMA read is in respose to (ie causally related to) the write. > > - MMIO stores are ordered by IO semantics > > - PCI ordering must be honored: > > * write combining is only allowed on PCI memory resources > > that are marked prefetchable. If your host bridge does write > > combining in general, it's not a "host bridge", it's a "host > > disaster". > > Presumably the host bridge doesn't know what sort of PCI resource is > mapped at a given address, so that information (whether the resource > is prefetchable) must come from the CPU, which would get it from the > TLB entry or an MTRR entry - is that right? Correct. Although it could of course be a map in the host bridge itself, not on the CPU. If the host bridge doesn't know, then the host bridge had better not combine or the CPU had better tell it not to combine, using something like a "sync" instruction that causes bus traffic. Either of those approaches is likely a performance disaster, so you do want to have the CPU and/or hostbridge do this all automatically for you. Which is what the PC world does. > Or is there some gentleman's agreement between the host bridge and the > BIOS that certain address ranges are only used for certain types of > PCI memory resources? Not that I know. I _think_ all of the PC world just depends on the CPU doing the write combining, and the CPU knows thanks to MTRR's and page tables. But I could well imagine that there is some situation where the logic is further out. > What ordering is there between stores to regular memory and stores to > non-prefetchable MMIO? Non-prefetchable MMIO will be in-order on x86 wrt regular memory (unless you use one of the non-temporal stores). To get out-of-order stores you have to use a special MTRR setting (mtrr type "WC" for "write combining"). Or possibly non-temporal writes to an uncached area. I don't think we do. > If a store to regular memory can be performed before a store to MMIO, > does a wmb() suffice to enforce an ordering, or do you have to use > mmiowb()? On x86, MMIO normally doesn't need memory barriers either for the normal case (see above). 
We don't even need the compiler barrier, because we use a "volatile" pointer for that, telling the compiler to keep its hands off. > Do PCs ever use write-through caching on prefetchable MMIO resources? Basically only for frame buffers, with MTRR rules (and while write-through is an option, normally you'd use "write-combining", which doesn't cache at all, but write combines in the write buffers and writes the combined results out to the bus - there's usually something like four or eight write buffers of up to a cacheline in size for combining). Yeah, I realize this can be awkward. PC's actually get good performance (ie they normally can easily fill the bus bandwidth) _and_ the sw doesn't even need to do anything. That's what you get from several decades of hw tweaking with a fixed - or almost-fixed - software base. I _like_ PC's. Almost every other architecture decided to be lazy in hw, and put the onus on the software to tell it what was right. The PC platform hardware competition didn't allow for the "let's recompile the software" approach, so the hardware does it all for you. Very well too. It does make it somewhat hard for other platforms. Linus From nickpiggin at yahoo.com.au Wed Mar 8 18:41:30 2006 From: nickpiggin at yahoo.com.au (Nick Piggin) Date: Wed, 08 Mar 2006 18:41:30 +1100 Subject: [PATCH] Document Linux's memory barriers In-Reply-To: <17422.19209.60360.178668@cargo.ozlabs.ibm.com> References: <31492.1141753245@warthog.cambridge.redhat.com> <17422.19209.60360.178668@cargo.ozlabs.ibm.com> Message-ID: <440E8AAA.9030609@yahoo.com.au> Paul Mackerras wrote: > David Howells writes: >>+ The way to deal with this is to insert an I/O memory barrier between the >>+ two accesses: >>+ >>+ *ADR = ctl_reg_3; >>+ mb(); >>+ reg = *DATA; > > > Ummm, this implies mb() is "an I/O memory barrier". I can see people > getting confused if they read this and then see mb() being used when > no I/O is being done. > Isn't it? Why wouldn't you just use smp_mb() if no IO is being done? -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com From olh at suse.de Wed Mar 8 22:45:34 2006 From: olh at suse.de (Olaf Hering) Date: Wed, 8 Mar 2006 12:45:34 +0100 Subject: dead hvc_console with kdump kernel In-Reply-To: <200603081736.41406.michael@ellerman.id.au> References: <200603061329.08262.michael@ellerman.id.au> <200603061726.35175.michael@ellerman.id.au> <200603081736.41406.michael@ellerman.id.au> Message-ID: <20060308114534.GA18522@suse.de> On Wed, Mar 08, Michael Ellerman wrote: > So we take an interrupt, and while we're processing it we decide we should > panic, this happens for example if we panic via sysrq. You are right, sleep 2 ; echo c > /proc/sysrq-trigger ; sleep 2 gives me a working console. From dhowells at redhat.com Wed Mar 8 23:34:25 2006 From: dhowells at redhat.com (David Howells) Date: Wed, 08 Mar 2006 12:34:25 +0000 Subject: [PATCH] Document Linux's memory barriers In-Reply-To: References: <31492.1141753245@warthog.cambridge.redhat.com> <17422.19209.60360.178668@cargo.ozlabs.ibm.com> Message-ID: <27607.1141821265@warthog.cambridge.redhat.com> Linus Torvalds wrote: > > Linus explained recently that wmb() on x86 does not order stores to > > system memory w.r.t. stores to stores to prefetchable I/O memory (at > > least that's what I think he said ;). On i386 and x86_64, do IN and OUT instructions imply MFENCE? It's not obvious from the x86_64 docs. 
David From alan at lxorguk.ukuu.org.uk Thu Mar 9 00:12:57 2006 From: alan at lxorguk.ukuu.org.uk (Alan Cox) Date: Wed, 08 Mar 2006 13:12:57 +0000 Subject: Memory barriers and spin_unlock safety In-Reply-To: References: <32518.1141401780@warthog.cambridge.redhat.com> <17417.29375.87604.537434@cargo.ozlabs.ibm.com> <17422.19865.635112.820824@cargo.ozlabs.ibm.com> Message-ID: <1141823577.7605.31.camel@localhost.localdomain> On Maw, 2006-03-07 at 19:54 -0800, Linus Torvalds wrote: > Close, yes. HOWEVER, it's only really ordered wrt the "innermost" bus. I > don't think PCI bridges are supposed to post PIO writes, but a x86 CPU > basically won't stall for them forever. The bridges I have will stall forever. You can observe this directly if an IDE device decides to hang the IORDY line on the IDE cable or you crash the GPU on an S3 card. Alan From dhowells at redhat.com Thu Mar 9 00:19:52 2006 From: dhowells at redhat.com (David Howells) Date: Wed, 08 Mar 2006 13:19:52 +0000 Subject: [PATCH] Document Linux's memory barriers In-Reply-To: <17422.19209.60360.178668@cargo.ozlabs.ibm.com> References: <17422.19209.60360.178668@cargo.ozlabs.ibm.com> <31492.1141753245@warthog.cambridge.redhat.com> Message-ID: <28393.1141823992@warthog.cambridge.redhat.com> Paul Mackerras wrote: > By "memory accesses" do you mean accesses to system memory, or do you > mean loads and stores - which may be to system memory, memory on an I/O > device (e.g. a framebuffer) or to memory-mapped I/O registers? Well, I meant all loads and stores, irrespective of their destination. However, on i386, for example, you've actually got at least two different I/O access domains, and I don't know how they impinge upon each other (IN/OUT vs MOV). > Enabling/disabling interrupts doesn't imply a barrier on powerpc, and > nor does taking an interrupt or returning from one. Surely it ought to, otherwise what's to stop accesses done with interrupts disabled crossing with accesses done inside an interrupt handler? > > +Either interrupt disablement (LOCK) and enablement (UNLOCK) will barrier > ... > I don't think this is right, and I don't think it is necessary to > achieve the end you state, since a CPU will always see its own memory > accesses in program order. But what about a driver accessing some memory that its device is going to observe under irq disablement, and then getting an interrupt immediately after from that same device, the handler for which communicates with the device, possibly then being broken because the CPU hasn't completed all the memory accesses that the driver made while interrupts are disabled? Alternatively, might it be possible for communications between two CPUs to be stuffed because one took an interrupt that also modified common data before the it had committed the memory accesses done under interrupt disablement? This would suggest using a lock though. I'm not sure that I can come up with a feasible example for this, but Alan Cox seems to think that it's a valid problem too. The only likely way I can see this being a problem is with unordered I/O writes, which would suggest you have to place an mmiowb() before unlocking the spinlock in such a case, assuming it is possible to get unordered I/O writes (which I think it is). > What does *F+*A mean? Combined accesses. > Well, the driver should *not* be doing *ADR at all, it should be using > read[bwl]/write[bwl]. The architecture code has to implement > read*/write* in such a way that the accesses generated can't be > reordered. 
> I _think_ it also has to make sure the write accesses
> can't be write-combined, but it would be good to have that clarified.

Then what use is mmiowb()?  Surely write combining and out-of-order reads
are reasonable for cacheable devices like framebuffers.

David

From dhowells at redhat.com  Thu Mar  9 01:37:58 2006
From: dhowells at redhat.com (David Howells)
Date: Wed, 08 Mar 2006 14:37:58 +0000
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: <31492.1141753245@warthog.cambridge.redhat.com>
References: <31492.1141753245@warthog.cambridge.redhat.com>
Message-ID: <29826.1141828678@warthog.cambridge.redhat.com>

The attached patch documents the Linux kernel's memory barriers.

I've updated it from the comments I've been given.

Note that the per-arch notes sections are gone because it's clear that there
are so many exceptions that it's not worth having them.

I've added a list of references to other documents.

I've tried to get rid of the concept of memory accesses appearing on the bus;
what matters is apparent behaviour with respect to other observers in the
system.

I'm not sure that any mention of interrupts vs interrupt disablement should
be retained... it's unclear that there is actually anything that guarantees
that stuff won't leak out of an interrupt-disabled section and into an
interrupt handler.  Paul Mackerras says this isn't valid on powerpc, and
looking at the code seems to confirm that, barring implicit enforcement by
the CPU.

Signed-Off-By: David Howells
---

warthog>diffstat -p1 /tmp/mb.diff
 Documentation/memory-barriers.txt |  589 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 589 insertions(+)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
new file mode 100644
index 0000000..1340c8d
--- /dev/null
+++ b/Documentation/memory-barriers.txt
@@ -0,0 +1,589 @@
+			 ============================
+			 LINUX KERNEL MEMORY BARRIERS
+			 ============================
+
+Contents:
+
+ (*) What are memory barriers?
+
+ (*) Where are memory barriers needed?
+
+     - Accessing devices.
+     - Multiprocessor interaction.
+     - Interrupts.
+
+ (*) Linux kernel compiler barrier functions.
+
+ (*) Linux kernel memory barrier functions.
+
+ (*) Implicit kernel memory barriers.
+
+     - Locking functions.
+     - Interrupt disablement functions.
+     - Miscellaneous functions.
+
+ (*) Linux kernel I/O barriering.
+
+ (*) References.
+
+
+=========================
+WHAT ARE MEMORY BARRIERS?
+=========================
+
+Memory barriers are instructions to both the compiler and the CPU to impose an
+apparent partial ordering between the memory access operations specified either
+side of the barrier.  They request that the sequence of memory events generated
+appears to other components of the system as if the barrier is effective on
+that CPU.
+
+Note that:
+
+ (*) there's no guarantee that the sequence of memory events is _actually_ so
+     ordered.  It's possible for the CPU to do out-of-order accesses _as long
+     as no-one is looking_, and then fix up the memory if someone else tries to
+     see what's going on (for instance a bus master device); what matters is
+     the _apparent_ order as far as other processors and devices are concerned;
+     and
+
+ (*) memory barriers are only guaranteed to act within the CPU processing them,
+     and are not, for the most part, guaranteed to percolate down to other CPUs
+     in the system or to any I/O hardware that that CPU may communicate with.
+ + +For example, a programmer might take it for granted that the CPU will perform +memory accesses in exactly the order specified, so that if a CPU is, for +example, given the following piece of code: + + a = *A; + *B = b; + c = *C; + d = *D; + *E = e; + +They would then expect that the CPU will complete the memory access for each +instruction before moving on to the next one, leading to a definite sequence of +operations as seen by external observers in the system: + + read *A, write *B, read *C, read *D, write *E. + + +Reality is, of course, much messier. With many CPUs and compilers, this isn't +always true because: + + (*) reads are more likely to need to be completed immediately to permit + execution progress, whereas writes can often be deferred without a + problem; + + (*) reads can be done speculatively, and then the result discarded should it + prove not to be required; + + (*) the order of the memory accesses may be rearranged to promote better use + of the CPU buses and caches; + + (*) reads and writes may be combined to improve performance when talking to + the memory or I/O hardware that can do batched accesses of adjacent + locations, thus cutting down on transaction setup costs (memory and PCI + devices may be able to do this); and + + (*) the CPU's data cache may affect the ordering, though cache-coherency + mechanisms should alleviate this - once the write has actually hit the + cache. + +So what another CPU, say, might actually observe from the above piece of code +is: + + read *A, read {*C,*D}, write *E, write *B + + (By "read {*C,*D}" I mean a combined single read). + + +It is also guaranteed that a CPU will be self-consistent: it will see its _own_ +accesses appear to be correctly ordered, without the need for a memory +barrier. For instance with the following code: + + X = *A; + *A = Y; + Z = *A; + +assuming no intervention by an external influence, it can be taken that: + + (*) X will hold the old value of *A, and will never happen after the write and + thus end up being given the value that was assigned to *A from Y instead; + and + + (*) Z will always be given the value in *A that was assigned there from Y, and + will never happen before the write, and thus end up with the same value + that was in *A initially. + +(This is ignoring the fact that the value initially in *A may appear to be the +same as the value assigned to *A from Y). + + +================================= +WHERE ARE MEMORY BARRIERS NEEDED? +================================= + +Under normal operation, access reordering is probably not going to be a problem +as a linear program will still appear to operate correctly. There are, +however, three circumstances where reordering definitely _could_ be a problem: + + +ACCESSING DEVICES +----------------- + +Many devices can be memory mapped, and so appear to the CPU as if they're just +memory locations. However, to control the device, the driver has to make the +right accesses in exactly the right order. + +Consider, for example, an ethernet chipset such as the AMD PCnet32. It +presents to the CPU an "address register" and a bunch of "data registers". 
The +way it's accessed is to write the index of the internal register to be accessed +to the address register, and then read or write the appropriate data register +to access the chip's internal register, which could - theoretically - be done +by: + + *ADR = ctl_reg_3; + reg = *DATA; + +The problem with a clever CPU or a clever compiler is that the write to the +address register isn't guaranteed to happen before the access to the data +register, if the CPU or the compiler thinks it is more efficient to defer the +address write: + + read *DATA, write *ADR + +then things will break. + + +In the Linux kernel, however, I/O should be done through the appropriate +accessor routines - such as inb() or writel() - which know how to make such +accesses appropriately sequential. + +On some systems, I/O writes are not strongly ordered across all CPUs, and so +locking should be used, and mmiowb() should be issued prior to unlocking the +critical section. + +See Documentation/DocBook/deviceiobook.tmpl for more information. + + +MULTIPROCESSOR INTERACTION +-------------------------- + +When there's a system with more than one processor, these may be working on the +same set of data, but attempting not to use locks as locks are quite expensive. +This means that accesses that affect both CPUs may have to be carefully ordered +to prevent error. + +Consider the R/W semaphore slow path. In that, a waiting process is queued on +the semaphore, as noted by it having a record on its stack linked to the +semaphore's list: + + struct rw_semaphore { + ... + struct list_head waiters; + }; + + struct rwsem_waiter { + struct list_head list; + struct task_struct *task; + }; + +To wake up the waiter, the up_read() or up_write() functions have to read the +pointer from this record to know as to where the next waiter record is, clear +the task pointer, call wake_up_process() on the task, and release the reference +held on the waiter's task struct: + + READ waiter->list.next; + READ waiter->task; + WRITE waiter->task; + CALL wakeup + RELEASE task + +If any of these steps occur out of order, then the whole thing may fail. + +Note that the waiter does not get the semaphore lock again - it just waits for +its task pointer to be cleared. Since the record is on its stack, this means +that if the task pointer is cleared _before_ the next pointer in the list is +read, another CPU might start processing the waiter and it might clobber its +stack before up*() functions have a chance to read the next pointer. + + CPU 0 CPU 1 + =============================== =============================== + down_xxx() + Queue waiter + Sleep + up_yyy() + READ waiter->task; + WRITE waiter->task; + + Resume processing + down_xxx() returns + call foo() + foo() clobbers *waiter + + READ waiter->list.next; + --- OOPS --- + +This could be dealt with using a spinlock, but then the down_xxx() function has +to get the spinlock again after it's been woken up, which is a waste of +resources. + +The way to deal with this is to insert an SMP memory barrier: + + READ waiter->list.next; + READ waiter->task; + smp_mb(); + WRITE waiter->task; + CALL wakeup + RELEASE task + +In this case, the barrier makes a guarantee that all memory accesses before the +barrier will appear to happen before all the memory accesses after the barrier +with respect to the other CPUs on the system. It does _not_ guarantee that all +the memory accesses before the barrier will be complete by the time the barrier +itself is complete. 
+
+SMP memory barriers are normally mere compiler barriers on a UP system because
+the CPU orders overlapping accesses with respect to itself.
+
+
+INTERRUPTS
+----------
+
+A driver may be interrupted by its own interrupt service routine, and thus they
+may interfere with each other's attempts to control or access the device.
+
+This may be alleviated - at least in part - by disabling interrupts (a form of
+locking), such that the critical operations are all contained within the
+disabled-interrupt section in the driver.  Whilst the driver's interrupt
+routine is executing, the driver's core may not run on the same CPU, and its
+interrupt is not permitted to happen again until the current interrupt has been
+handled, thus the interrupt handler does not need to lock against that.
+
+
+However, consider the following example:
+
+	CPU 1				CPU 2
+	===============================	===============================
+	[A is 0 and B is 0]
+	DISABLE IRQ
+	*A = 1;
+	smp_wmb();
+	*B = 2;
+	ENABLE IRQ
+
+	*A = 3
+					a = *A;
+					b = *B;
+	smp_wmb();
+	*B = 4;
+
+
+CPU 2 might see *A == 3 and *B == 0, when what it probably ought to see is *B
+== 2 and *A == 1 or *A == 3, or *B == 4 and *A == 3.
+
+This might happen because the write "*B = 2" might occur after the write "*A =
+3" - in which case the former write has leaked from the interrupt-disabled
+section into the interrupt handler.  In this case a lock of some description
+should very probably be used.
+
+
+This sort of problem might also occur with relaxed I/O ordering rules, if it's
+permitted for I/O writes to cross.  For instance, if a driver was talking to an
+ethernet card that sports an address register and a data register:
+
+	DISABLE IRQ
+	writew(ADR, ctl_reg_3);
+	writew(DATA, y);
+	ENABLE IRQ
+
+	writew(ADR, ctl_reg_4);
+	q = readw(DATA);
+
+
+In such a case, an mmiowb() is needed, firstly to prevent the first write to
+the address register from occurring after the write to the data register, and
+secondly to prevent the write to the data register from happening after the
+second write to the address register.
+
+
+=======================================
+LINUX KERNEL COMPILER BARRIER FUNCTIONS
+=======================================
+
+The Linux kernel has an explicit compiler barrier function that prevents the
+compiler from moving the memory accesses either side of it to the other side:
+
+	barrier();
+
+This has no direct effect on the CPU, which may then reorder things however it
+wishes.
+
+In addition, accesses to "volatile" memory locations and volatile asm
+statements act as implicit compiler barriers.
+
+
+=====================================
+LINUX KERNEL MEMORY BARRIER FUNCTIONS
+=====================================
+
+The Linux kernel has six basic CPU memory barriers:
+
+		MANDATORY		SMP CONDITIONAL
+		===============		===============
+	GENERAL	mb()			smp_mb()
+	READ	rmb()			smp_rmb()
+	WRITE	wmb()			smp_wmb()
+
+General memory barriers give a guarantee that all memory accesses specified
+before the barrier will appear to happen before all memory accesses specified
+after the barrier with respect to the other components of the system.
+
+Read and write memory barriers give similar guarantees, but only for memory
+reads versus memory reads and memory writes versus memory writes respectively.
+
+All memory barriers imply compiler barriers.
+
+SMP memory barriers are only compiler barriers on uniprocessor compiled systems
+because it is assumed that a CPU will be apparently self-consistent, and will
+order overlapping accesses correctly with respect to itself.
+
+There is no guarantee that any of the memory accesses specified before a memory
+barrier will be complete by the completion of a memory barrier; the barrier can
+be considered to draw a line in that CPU's access queue that accesses of the
+appropriate type may not cross.
+
+There is no guarantee that issuing a memory barrier on one CPU will have any
+direct effect on another CPU or any other hardware in the system.  The indirect
+effect will be the order in which the second CPU sees the first CPU's accesses
+occur.
+
+There is no guarantee that some intervening piece of off-the-CPU hardware will
+not reorder the memory accesses.  CPU cache coherency mechanisms should
+propagate the indirect effects of a memory barrier between CPUs.
+
+Note that these are the _minimum_ guarantees.  Different architectures may give
+more substantial guarantees, but they may not be relied upon outside of arch
+specific code.
+
+
+There are some more advanced barriering functions:
+
+ (*) set_mb(var, value)
+ (*) set_wmb(var, value)
+
+     These assign the value to the variable and then insert at least a write
+     barrier after it, depending on the function.
+
+
+===============================
+IMPLICIT KERNEL MEMORY BARRIERS
+===============================
+
+Some of the other functions in the linux kernel imply memory barriers; amongst
+them are locking and scheduling functions and interrupt management functions.
+
+This specification is a _minimum_ guarantee; any particular architecture may
+provide more substantial guarantees, but these may not be relied upon outside
+of arch specific code.
+
+
+LOCKING FUNCTIONS
+-----------------
+
+For instance all the following locking functions imply barriers:
+
+ (*) spin locks
+ (*) R/W spin locks
+ (*) mutexes
+ (*) semaphores
+ (*) R/W semaphores
+
+In all cases there are variants on a LOCK operation and an UNLOCK operation.
+
+ (*) LOCK operation implication:
+
+     Memory accesses issued after the LOCK will be completed after the LOCK
+     accesses have completed.
+
+     Memory accesses issued before the LOCK may be completed after the LOCK
+     accesses have completed.
+
+ (*) UNLOCK operation implication:
+
+     Memory accesses issued before the UNLOCK will be completed before the
+     UNLOCK accesses have completed.
+
+     Memory accesses issued after the UNLOCK may be completed before the UNLOCK
+     accesses have completed.
+
+ (*) LOCK vs UNLOCK implication:
+
+     The LOCK accesses will be completed before the UNLOCK accesses.
+
+And therefore an UNLOCK followed by a LOCK is equivalent to a full barrier, but
+a LOCK followed by an UNLOCK isn't.
+
+Locks and semaphores may not provide any guarantee of ordering on UP compiled
+systems, and so can't be counted on in such a situation to actually do anything
+at all, especially with respect to I/O barriering, unless combined with
+interrupt disablement operations.
+ + +As an example, consider the following: + + *A = a; + *B = b; + LOCK + *C = c; + *D = d; + UNLOCK + *E = e; + *F = f; + +The following sequence of events is acceptable: + + LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK + +But none of the following are: + + {*F,*A}, *B, LOCK, *C, *D, UNLOCK, *E + *A, *B, *C, LOCK, *D, UNLOCK, *E, *F + *A, *B, LOCK, *C, UNLOCK, *D, *E, *F + *B, LOCK, *C, *D, UNLOCK, {*F,*A}, *E + + +INTERRUPT DISABLEMENT FUNCTIONS +------------------------------- + +Interrupt disablement (LOCK equivalent) and enablement (UNLOCK equivalent) will +barrier memory and I/O accesses versus memory and I/O accesses done in the +interrupt handler. This prevents an interrupt routine interfering with +accesses made in a disabled-interrupt section of code and vice versa. + +Note that whilst interrupt disablement barriers all act as compiler barriers, +they only act as memory barriers with respect to interrupts, not with respect +to nested sections. + +Consider the following: + + + *X = x; + + *A = a; + SAVE IRQ AND DISABLE + *B = b; + SAVE IRQ AND DISABLE + *C = c; + RESTORE IRQ + *D = d; + RESTORE IRQ + *E = e; + + *Y = y; + + +It is acceptable to observe the following sequences of events: + + { INT, *X }, *A, SAVE, *B, SAVE, *C, REST, *D, REST, *E, { INT, *Y } + { INT, *X }, *A, SAVE, *B, SAVE, *C, REST, *D, REST, { INT, *Y, *E } + { INT, *X }, SAVE, SAVE, *A, *B, *C, *D, *E, REST, REST, { INT, *Y } + { INT }, *X, SAVE, SAVE, *A, *B, *C, *D, *E, REST, REST, { INT, *Y } + { INT }, *A, *X, SAVE, SAVE, *B, *C, *D, *E, REST, REST, { INT, *Y } + +But not the following: + + { INT }, SAVE, *A, *B, *X, SAVE, *C, REST, *D, REST, *E, { INT, *Y } + { INT, *X }, *A, SAVE, *B, SAVE, *C, REST, REST, { INT, *Y, *D, *E } + + +MISCELLANEOUS FUNCTIONS +----------------------- + +Other functions that imply barriers: + + (*) schedule() and similar imply full memory barriers. + + +=========================== +LINUX KERNEL I/O BARRIERING +=========================== + +When accessing I/O memory, drivers should use the appropriate accessor +functions: + + (*) inX(), outX(): + + These are intended to talk to legacy i386 hardware using an alternate bus + addressing mode. They are synchronous as far as the x86 CPUs are + concerned, but other CPUs and intermediary bridges may not honour that. + + They are guaranteed to be fully ordered with respect to each other. + + (*) readX(), writeX(): + + These are guaranteed to be fully ordered and uncombined with respect to + each other on the issuing CPU, provided they're not accessing a + prefetchable device. However, intermediary hardware (such as a PCI + bridge) may indulge in deferral if it so wishes; to flush a write, a read + from the same location must be performed. + + Used with prefetchable I/O memory, an mmiowb() barrier may be required to + force writes to be ordered. + + (*) readX_relaxed() + + These are not guaranteed to be ordered in any way. There is no I/O read + barrier available. + + (*) ioreadX(), iowriteX() + + These will perform as appropriate for the type of access they're actually + doing, be it in/out or read/write. 
+
+
+==========
+REFERENCES
+==========
+
+AMD64 Architecture Programmer's Manual Volume 2: System Programming
+	Chapter 7.1: Memory-Access Ordering
+	Chapter 7.4: Buffering and Combining Memory Writes
+
+IA-32 Intel Architecture Software Developer's Manual, Volume 3:
+System Programming Guide
+	Chapter 7.1: Locked Atomic Operations
+	Chapter 7.2: Memory Ordering
+	Chapter 7.4: Serializing Instructions
+
+The SPARC Architecture Manual, Version 9
+	Chapter 8: Memory Models
+	Appendix D: Formal Specification of the Memory Models
+	Appendix J: Programming with the Memory Models
+
+UltraSPARC Programmer Reference Manual
+	Chapter 5: Memory Accesses and Cacheability
+	Chapter 15: Sparc-V9 Memory Models
+
+UltraSPARC III Cu User's Manual
+	Chapter 9: Memory Models
+
+UltraSPARC IIIi Processor User's Manual
+	Chapter 8: Memory Models
+
+UltraSPARC Architecture 2005
+	Chapter 9: Memory
+	Appendix D: Formal Specifications of the Memory Models
+
+UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005
+	Chapter 8: Memory Models
+	Appendix F: Caches and Cache Coherency
+
+Solaris Internals, Core Kernel Architecture, p63-68:
+	Chapter 3.3: Hardware Considerations for Locks and
+			Synchronization
+
+Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching
+for Kernel Programmers:
+	Chapter 13: Other Memory Models

From hch at lst.de  Thu Mar  9 02:19:39 2006
From: hch at lst.de (Christoph Hellwig)
Date: Wed, 8 Mar 2006 16:19:39 +0100
Subject: asm/cputable.h sparse warnings
Message-ID: <20060308151939.GA12762@lst.de>

Looks like not many people are running sparse on powerpc ;-)  For every file
compiled I get the churn of warnings below.  The reason seems to be that it's
using large values in enums, something that's very murky in the C standards,
and gcc adds even less well-defined extensions to it that make this code work
in practice.  I think the only sane fix is to switch the cputype constants to
cpp macros, although that'd make the file a lot larger.
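(To illustrate what sparse is objecting to - a minimal sketch, not the actual
cputable.h definitions; the name and value here are made up:)

	/* A 64-bit enumerator relies on a GCC extension with murky
	 * semantics; sparse warns wherever a cast narrows the value. */
	enum {
		CPU_FTR_EXAMPLE = 0x2000000241ULL,	/* needs > 32 bits */
	};

	unsigned int w = (unsigned int)CPU_FTR_EXAMPLE;	/* becomes 0x241 */

	/* The suggested fix: a plain cpp macro with an explicit suffix,
	 * which has a well-defined 64-bit type. */
	#define CPU_FTR_EXAMPLE_MACRO	0x2000000241ULL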
include/asm/cputable.h:330:6: warning: cast truncates bits from constant value (2000000241 becomes 241) include/asm/cputable.h:330:24: warning: cast truncates bits from constant value (e000000241 becomes 241) include/asm/cputable.h:330:40: warning: cast truncates bits from constant value (cf00100241 becomes 100241) include/asm/cputable.h:331:6: warning: cast truncates bits from constant value (cf00100649 becomes 100649) include/asm/cputable.h:331:24: warning: cast truncates bits from constant value (fcf00100241 becomes 100241) include/asm/cputable.h:331:42: warning: cast truncates bits from constant value (21cf00100249 becomes 100249) include/asm/cputable.h:372:6: warning: cast truncates bits from constant value (2000000241 becomes 241) include/asm/cputable.h:372:24: warning: cast truncates bits from constant value (e000000241 becomes 241) include/asm/cputable.h:372:40: warning: cast truncates bits from constant value (cf00100241 becomes 100241) include/asm/cputable.h:373:6: warning: cast truncates bits from constant value (cf00100649 becomes 100649) include/asm/cputable.h:373:24: warning: cast truncates bits from constant value (fcf00100241 becomes 100241) include/asm/cputable.h:373:42: warning: cast truncates bits from constant value (21cf00100249 becomes 100249) include/asm/cputable.h:409:6: warning: cast truncates bits from constant value (100000100649 becomes 100649) From torvalds at osdl.org Thu Mar 9 02:30:19 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Wed, 8 Mar 2006 07:30:19 -0800 (PST) Subject: Memory barriers and spin_unlock safety In-Reply-To: <1141823577.7605.31.camel@localhost.localdomain> References: <32518.1141401780@warthog.cambridge.redhat.com> <17417.29375.87604.537434@cargo.ozlabs.ibm.com> <17422.19865.635112.820824@cargo.ozlabs.ibm.com> <1141823577.7605.31.camel@localhost.localdomain> Message-ID: On Wed, 8 Mar 2006, Alan Cox wrote: > > On Maw, 2006-03-07 at 19:54 -0800, Linus Torvalds wrote: > > Close, yes. HOWEVER, it's only really ordered wrt the "innermost" bus. I > > don't think PCI bridges are supposed to post PIO writes, but a x86 CPU > > basically won't stall for them forever. > > The bridges I have will stall forever. You can observe this directly if > an IDE device decides to hang the IORDY line on the IDE cable or you > crash the GPU on an S3 card. Ok. The only thing I have tested is the timing of "outb()" on its own, which is definitely long enough that it clearly waits for _some_ bus activity (ie the CPU doesn't just post the write internally), but I don't know exactly what the rules are as far as the core itself is concerned: I suspect the core just waits until it has hit the northbridge or something. In contrast, a MMIO write to a WC region at least will not necessarily pause the core at all: it just hits the write queue in the core, and the core continues on (and may generate other writes that will be combined in the write buffers before the first one even hits the bus). 
Linus From matthew at wil.cx Thu Mar 9 02:41:57 2006 From: matthew at wil.cx (Matthew Wilcox) Date: Wed, 8 Mar 2006 08:41:57 -0700 Subject: [PATCH] Document Linux's memory barriers [try #2] In-Reply-To: <20060308145506.GA5095@devserv.devel.redhat.com> References: <31492.1141753245@warthog.cambridge.redhat.com> <29826.1141828678@warthog.cambridge.redhat.com> <20060308145506.GA5095@devserv.devel.redhat.com> Message-ID: <20060308154157.GI7301@parisc-linux.org> On Wed, Mar 08, 2006 at 09:55:06AM -0500, Alan Cox wrote: > On Wed, Mar 08, 2006 at 02:37:58PM +0000, David Howells wrote: > > + (*) reads can be done speculatively, and then the result discarded should it > > + prove not to be required; > > That might be worth an example with an if() because PPC will do this and if > its a read with a side effect (eg I/O space) you get singed.. PPC does speculative memory accesses to IO? Are you *sure*? > > +same set of data, but attempting not to use locks as locks are quite expensive. > > s/are quite/is quite > > and is quite confusing to read His grammar's right ... but I'd just leave out the 'as' part. As you're right that it's confusing ;-) > > +SMP memory barriers are normally mere compiler barriers on a UP system because > > s/mere// > > Makes it easier to read if you are not 1st language English. Maybe s/mere/only/? > > +SMP memory barriers are only compiler barriers on uniprocessor compiled systems > > +because it is assumed that a CPU will be apparently self-consistent, and will > > +order overlapping accesses correctly with respect to itself. > > Is this true of IA-64 ?? Yes: #else # define smp_mb() barrier() # define smp_rmb() barrier() # define smp_wmb() barrier() # define smp_read_barrier_depends() do { } while(0) #endif > > + (*) inX(), outX(): > > + > > + These are intended to talk to legacy i386 hardware using an alternate bus > > + addressing mode. They are synchronous as far as the x86 CPUs are > > Not really true. Lots of PCI devices use them. Need to talk about "I/O space" Port space is deprecated though. PCI 2.3 says: "Devices are recommended always to map control functions into Memory Space." > > + > > + These are guaranteed to be fully ordered and uncombined with respect to > > + each other on the issuing CPU, provided they're not accessing a > > MTRRs > > > + prefetchable device. However, intermediary hardware (such as a PCI > > + bridge) may indulge in deferral if it so wishes; to flush a write, a read > > + from the same location must be performed. > > False. Its not so tightly restricted and many devices the location you write > is not safe to read so you must use another. I'd have to dig the PCI spec > out but I believe it says the same devfn. It also says stuff about rules for > visibility of bus mastering relative to these accesses and PCI config space > accesses relative to the lot (the latter serveral chipsets get wrong). We > should probably point people at the PCI 2.2 spec . 3.2.5 of PCI 2.3 seems most relevant: Since memory write transactions may be posted in bridges anywhere in the system, and I/O writes may be posted in the host bus bridge, a master cannot automatically tell when its write transaction completes at the final destination. For a device driver to guarantee that a write has completed at the actual target (and not at an intermediate bridge), it must complete a read to the same device that the write targeted. 
The read (memory or I/O) forces all bridges between the originating
master and the actual target to flush all posted data before allowing the
read to complete. For additional details on device drivers, refer to
Section 6.5. Refer to Section 3.10., item 6, for other cases where a read
is necessary.

Appendix E is also of interest:

	2. Memory writes can be posted in both directions in a bridge. I/O
	and Configuration writes are not posted. (I/O writes can be posted
	in the Host Bridge, but some restrictions apply.) Read transactions
	(Memory, I/O, or Configuration) are not posted.

	5. A read transaction must push ahead of it through the bridge any
	posted writes originating on the same side of the bridge and posted
	before the read. Before the read transaction can complete on its
	originating bus, it must pull out of the bridge any posted writes
	that originated on the opposite side and were posted before the read
	command completes on the read-destination bus.

I like the way they contradict each other slightly wrt config reads and
whether you have to read from the same device, or merely the same bus.
One thing that is clear is that a read of a status register on the bridge
isn't enough, it needs to be *through* the bridge, not *to* the bridge.

I wonder if a config read of a non-existent device on the other side of the
bridge would force the write to complete ...

From hch at lst.de  Thu Mar  9 02:47:00 2006
From: hch at lst.de (Christoph Hellwig)
Date: Wed, 8 Mar 2006 16:47:00 +0100
Subject: [PATCH] powerpc: add for_each_node_by_foo helpers
Message-ID: <20060308154700.GA15859@lst.de>

Typical use of of_find_node_by_name and of_find_node_by_type is to iterate
over all nodes of a given type/name.  Add a helper macro to do that (in the
spirit of the list_for_each* macros).

Signed-off-by: Christoph Hellwig

Index: linux-2.6/include/asm-powerpc/prom.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/prom.h	2006-02-10 19:45:44.000000000 +0100
+++ linux-2.6/include/asm-powerpc/prom.h	2006-03-08 16:41:46.000000000 +0100
@@ -126,8 +126,14 @@
 /* New style node lookup */
 extern struct device_node *of_find_node_by_name(struct device_node *from,
 	const char *name);
+#define for_each_node_by_name(dn, name) \
+	for (dn = of_find_node_by_name(NULL, name); dn; \
+	     dn = of_find_node_by_name(dn, name))
 extern struct device_node *of_find_node_by_type(struct device_node *from,
 	const char *type);
+#define for_each_node_by_type(dn, type) \
+	for (dn = of_find_node_by_type(NULL, type); dn; \
+	     dn = of_find_node_by_type(dn, type))
 extern struct device_node *of_find_compatible_node(struct device_node *from,
 	const char *type, const char *compat);
 extern struct device_node *of_find_node_by_path(const char *path);

From hch at lst.de  Thu Mar  9 02:49:46 2006
From: hch at lst.de (Christoph Hellwig)
Date: Wed, 8 Mar 2006 16:49:46 +0100
Subject: [PATCH] cell: cleanup iommu initialization
Message-ID: <20060308154946.GB15859@lst.de>

 - add a cell_map_one_iommu helper to factor out some duplicated code
 - use for_each_node_by_type from my last patch
 - add ul postfix to some large constants in the hardcoded case to fix
   up sparse warnings.
 - minor formatting fixes.

Note that the hardcoded case still doesn't look very nice.  The hardcoded
addresses should probably get some meaningful defines, and mapping both
iommus into the same dma window looks at least slightly odd to me.
Signed-off-by: Christoph Hellwig Index: linux-2.6/arch/powerpc/platforms/cell/iommu.c =================================================================== --- linux-2.6.orig/arch/powerpc/platforms/cell/iommu.c 2006-03-08 16:39:56.000000000 +0100 +++ linux-2.6/arch/powerpc/platforms/cell/iommu.c 2006-03-08 16:45:13.000000000 +0100 @@ -335,67 +335,56 @@ iommu_devnode_setup(d); } - -static int cell_map_iommu_hardcoded(int num_nodes) +static void cell_map_one_iommu(struct cell_iommu *iommu, + unsigned long base, unsigned long mmio_base) { - struct cell_iommu *iommu = NULL; - - pr_debug("%s(%d): Using hardcoded defaults\n", __FUNCTION__, __LINE__); + iommu->base = base; + iommu->mmio_base = mmio_base; - /* node 0 */ - iommu = &cell_iommus[0]; - iommu->mapped_base = __ioremap(0x20000511000, 0x1000, _PAGE_NO_CACHE); - iommu->mapped_mmio_base = __ioremap(0x20000510000, 0x1000, _PAGE_NO_CACHE); + iommu->mapped_base = __ioremap(base, 0x1000, _PAGE_NO_CACHE); + iommu->mapped_mmio_base = __ioremap(mmio_base, 0x1000, _PAGE_NO_CACHE); enable_mapping(iommu->mapped_base, iommu->mapped_mmio_base); +} - cell_do_map_iommu(iommu, 0x048a, - 0x20000000ul,0x20000000ul); - - if (num_nodes < 2) - return 0; - - /* node 1 */ - iommu = &cell_iommus[1]; - iommu->mapped_base = __ioremap(0x30000511000, 0x1000, _PAGE_NO_CACHE); - iommu->mapped_mmio_base = __ioremap(0x30000510000, 0x1000, _PAGE_NO_CACHE); - - enable_mapping(iommu->mapped_base, iommu->mapped_mmio_base); +static int cell_map_iommu_hardcoded(int num_nodes) +{ + pr_debug("%s(%d): Using hardcoded defaults\n", __FUNCTION__, __LINE__); - cell_do_map_iommu(iommu, 0x048a, - 0x20000000,0x20000000ul); + /* node 0 */ + cell_map_one_iommu(&cell_iommus[0], 0x20000511000ul, 0x20000510000ul); + cell_do_map_iommu(&cell_iommus[0], 0x048a, 0x20000000ul, 0x20000000ul); + if (num_nodes > 1) { + /* node 1 */ + cell_map_one_iommu(&cell_iommus[1], 0x30000511000ul, 0x30000510000ul); + cell_do_map_iommu(&cell_iommus[1], 0x048a, 0x20000000ul, 0x20000000ul); + } return 0; } - static int cell_map_iommu(void) { unsigned int num_nodes = 0, *node_id; unsigned long *base, *mmio_base; struct device_node *dn; - struct cell_iommu *iommu = NULL; /* determine number of nodes (=iommus) */ pr_debug("%s(%d): determining number of nodes...", __FUNCTION__, __LINE__); - for(dn = of_find_node_by_type(NULL, "cpu"); - dn; - dn = of_find_node_by_type(dn, "cpu")) { - node_id = (unsigned int *)get_property(dn, "node-id", NULL); + for_each_node_by_type(dn, "cpu") { + node_id = (unsigned int *)get_property(dn, "node-id", NULL); if (num_nodes < *node_id) num_nodes = *node_id; - } + } num_nodes++; pr_debug("%i found.\n", num_nodes); /* map the iommu registers for each node */ pr_debug("%s(%d): Looping through nodes\n", __FUNCTION__, __LINE__); - for(dn = of_find_node_by_type(NULL, "cpu"); - dn; - dn = of_find_node_by_type(dn, "cpu")) { + for_each_node_by_type(dn, "cpu") { node_id = (unsigned int *)get_property(dn, "node-id", NULL); base = (unsigned long *)get_property(dn, "ioc-cache", NULL); mmio_base = (unsigned long *)get_property(dn, "ioc-translation", NULL); @@ -403,16 +392,7 @@ if (!base || !mmio_base || !node_id) return cell_map_iommu_hardcoded(num_nodes); - iommu = &cell_iommus[*node_id]; - iommu->base = *base; - iommu->mmio_base = *mmio_base; - - iommu->mapped_base = __ioremap(*base, 0x1000, _PAGE_NO_CACHE); - iommu->mapped_mmio_base = __ioremap(*mmio_base, 0x1000, _PAGE_NO_CACHE); - - enable_mapping(iommu->mapped_base, - iommu->mapped_mmio_base); - + cell_map_one_iommu(&cell_iommus[*node_id], *base, 
*mmio_base); /* everything else will be done in iommu_bus_setup */ } From duncan.sands at math.u-psud.fr Wed Mar 8 19:25:18 2006 From: duncan.sands at math.u-psud.fr (Duncan Sands) Date: Wed, 8 Mar 2006 09:25:18 +0100 Subject: [PATCH] Document Linux's memory barriers In-Reply-To: <9551.1141762147@warthog.cambridge.redhat.com> References: <1141756825.31814.75.camel@localhost.localdomain> <31492.1141753245@warthog.cambridge.redhat.com> <9551.1141762147@warthog.cambridge.redhat.com> Message-ID: <200603080925.19425.duncan.sands@math.u-psud.fr> On Tuesday 7 March 2006 21:09, David Howells wrote: > Alan Cox wrote: > > > Better meaningful example would be barriers versus an IRQ handler. Which > > leads nicely onto section 2 > > Yes, except that I can't think of one that's feasible that doesn't have to do > with I/O - which isn't a problem if you are using the proper accessor > functions. > > Such an example has to involve more than one CPU, because you don't tend to > get memory/memory ordering problems on UP. On UP you at least need compiler barriers, right? You're in trouble if you think you are writing in a certain order, and expect to see the same order from an interrupt handler, but the compiler decided to rearrange the order of the writes... > The obvious one might be circular buffers, except there's no problem there > provided you have a memory barrier between accessing the buffer and updating > your pointer into it. > > David Ciao, Duncan. From alan at redhat.com Thu Mar 9 01:55:06 2006 From: alan at redhat.com (Alan Cox) Date: Wed, 8 Mar 2006 09:55:06 -0500 Subject: [PATCH] Document Linux's memory barriers [try #2] In-Reply-To: <29826.1141828678@warthog.cambridge.redhat.com> References: <31492.1141753245@warthog.cambridge.redhat.com> <29826.1141828678@warthog.cambridge.redhat.com> Message-ID: <20060308145506.GA5095@devserv.devel.redhat.com> On Wed, Mar 08, 2006 at 02:37:58PM +0000, David Howells wrote: > + (*) reads can be done speculatively, and then the result discarded should it > + prove not to be required; That might be worth an example with an if() because PPC will do this and if its a read with a side effect (eg I/O space) you get singed.. > +same set of data, but attempting not to use locks as locks are quite expensive. s/are quite/is quite and is quite confusing to read > +SMP memory barriers are normally mere compiler barriers on a UP system because s/mere// Makes it easier to read if you are not 1st language English. > +In addition, accesses to "volatile" memory locations and volatile asm > +statements act as implicit compiler barriers. Add The use of volatile generates poorer code and hides the serialization in type declarations that may be far from the code. The Linux coding style therefore strongly favours the use of explicit barriers except in small and specific cases. > +SMP memory barriers are only compiler barriers on uniprocessor compiled systems > +because it is assumed that a CPU will be apparently self-consistent, and will > +order overlapping accesses correctly with respect to itself. Is this true of IA-64 ?? > +There is no guarantee that some intervening piece of off-the-CPU hardware will > +not reorder the memory accesses. CPU cache coherency mechanisms should > +propegate the indirect effects of a memory barrier between CPUs. [For information on bus mastering DMA and coherency please read ....] sincee have a doc on this > +There are some more advanced barriering functions: "barriering" ... ick, barrier. 
> +LOCKING FUNCTIONS > +----------------- > + > +For instance all the following locking functions imply barriers: s/For instance// > + (*) spin locks > + (*) R/W spin locks > + (*) mutexes > + (*) semaphores > + (*) R/W semaphores > + > +In all cases there are variants on a LOCK operation and an UNLOCK operation. > + > + (*) LOCK operation implication: > + > + Memory accesses issued after the LOCK will be completed after the LOCK > + accesses have completed. > + > + Memory accesses issued before the LOCK may be completed after the LOCK > + accesses have completed. > + > + (*) UNLOCK operation implication: > + > + Memory accesses issued before the UNLOCK will be completed before the > + UNLOCK accesses have completed. > + > + Memory accesses issued after the UNLOCK may be completed before the UNLOCK > + accesses have completed. > + > + (*) LOCK vs UNLOCK implication: > + > + The LOCK accesses will be completed before the UNLOCK accesses. > + > +And therefore an UNLOCK followed by a LOCK is equivalent to a full barrier, but > +a LOCK followed by an UNLOCK isn't. > + > +Locks and semaphores may not provide any guarantee of ordering on UP compiled > +systems, and so can't be counted on in such a situation to actually do anything > +at all, especially with respect to I/O barriering, unless combined with > +interrupt disablement operations. s/disablement/disabling/ Should clarify local ordering v SMP ordering for locks implied here. > +INTERRUPT DISABLEMENT FUNCTIONS > +------------------------------- s/Disablement/Disabling/ > +Interrupt disablement (LOCK equivalent) and enablement (UNLOCK equivalent) will disable > +=========================== > +LINUX KERNEL I/O BARRIERING /barriering/barriers > + (*) inX(), outX(): > + > + These are intended to talk to legacy i386 hardware using an alternate bus > + addressing mode. They are synchronous as far as the x86 CPUs are Not really true. Lots of PCI devices use them. Need to talk about "I/O space" > + concerned, but other CPUs and intermediary bridges may not honour that. > + > + They are guaranteed to be fully ordered with respect to each other. And make clear I/O space is a CPU property and that inX()/outX() may well map to read/write variant functions on many processors > + (*) readX(), writeX(): > + > + These are guaranteed to be fully ordered and uncombined with respect to > + each other on the issuing CPU, provided they're not accessing a MTRRs > + prefetchable device. However, intermediary hardware (such as a PCI > + bridge) may indulge in deferral if it so wishes; to flush a write, a read > + from the same location must be performed. False. Its not so tightly restricted and many devices the location you write is not safe to read so you must use another. I'd have to dig the PCI spec out but I believe it says the same devfn. It also says stuff about rules for visibility of bus mastering relative to these accesses and PCI config space accesses relative to the lot (the latter serveral chipsets get wrong). We should probably point people at the PCI 2.2 spec . 
Looks much much better than the first version and just goes to prove how complex this all is From miltonm at bga.com Thu Mar 9 03:00:14 2006 From: miltonm at bga.com (Milton Miller) Date: Wed, 8 Mar 2006 10:00:14 -0600 Subject: dead hvc_console with kdump kernel In-Reply-To: <200603081736.41406.michael@ellerman.id.au> References: <200603061329.08262.michael@ellerman.id.au> <200603061726.35175.michael@ellerman.id.au> <200603081736.41406.michael@ellerman.id.au> Message-ID: <4e714f8f51a3ec10b00a99cb45d35f9f@bga.com> On Mar 8, 2006, at 12:36 AM, Michael Ellerman wrote: > On Mon, 6 Mar 2006 17:26, Michael Ellerman wrote: >> On Mon, 6 Mar 2006 13:29, Michael Ellerman wrote: >>> On Fri, 24 Feb 2006 07:41, Ryan Arnold wrote: >>>> If interrupts end up being disabled by the kexec call and you still >>>> need the console you could try to find a way to set hp->irq = >>>> NO_IRQ in >>>> this case such that the khvcd thread is continually rescheduled to >>>> poll >>>> the hypervisor buffer and never sleeps indefinitely, as via the >>>> interrupt driven method. >>> >>> Still not sure what's going on here, some interrupt weirdness. > > So we take an interrupt, and while we're processing it we decide we > should > panic, this happens for example if we panic via sysrq. > > The interrupt we took never gets eoi'ed, and so the second kernel will > never > get another one. This is what's causing the console "hang", should have > thought of it earlier :/ I've confirmed it happens for any irq. > > The question is how/where do we do the eoi. On real xics it is safe to do in the initial register irq. Don't do it every irq register though. Hypervisor will cover us for emulated xics if we do that. For real mpic? I don't know. Maybe we do an mpic reset and that will cover us? Ben? milton From jschopp at austin.ibm.com Thu Mar 9 03:49:23 2006 From: jschopp at austin.ibm.com (Joel Schopp) Date: Wed, 08 Mar 2006 10:49:23 -0600 Subject: [PATCH] cell: cleanup iommu initialization In-Reply-To: <20060308154946.GB15859@lst.de> References: <20060308154946.GB15859@lst.de> Message-ID: <440F0B13.1070205@austin.ibm.com> > - iommu->mapped_base = __ioremap(0x20000511000, 0x1000, _PAGE_NO_CACHE); > - iommu->mapped_mmio_base = __ioremap(0x20000510000, 0x1000, _PAGE_NO_CACHE); > + iommu->mapped_base = __ioremap(base, 0x1000, _PAGE_NO_CACHE); > + iommu->mapped_mmio_base = __ioremap(mmio_base, 0x1000, _PAGE_NO_CACHE); While we are cleaning this up can we name these constants. Many of these unnamed constants throughout the patch. From hch at lst.de Thu Mar 9 03:51:44 2006 From: hch at lst.de (Christoph Hellwig) Date: Wed, 8 Mar 2006 17:51:44 +0100 Subject: [PATCH] cell: cleanup iommu initialization In-Reply-To: <440F0B13.1070205@austin.ibm.com> References: <20060308154946.GB15859@lst.de> <440F0B13.1070205@austin.ibm.com> Message-ID: <20060308165144.GA17966@lst.de> On Wed, Mar 08, 2006 at 10:49:23AM -0600, Joel Schopp wrote: > > >- iommu->mapped_base = __ioremap(0x20000511000, 0x1000, > >_PAGE_NO_CACHE); > >- iommu->mapped_mmio_base = __ioremap(0x20000510000, 0x1000, > >_PAGE_NO_CACHE); > >+ iommu->mapped_base = __ioremap(base, 0x1000, _PAGE_NO_CACHE); > >+ iommu->mapped_mmio_base = __ioremap(mmio_base, 0x1000, > >_PAGE_NO_CACHE); > > While we are cleaning this up can we name these constants. Many of these > unnamed constants throughout the patch. I already mentioned that in the description. Where in the cell documentation could I find canonical names for these? 
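(Since no canonical names appear to exist, any defines would have to be
invented.  A sketch of what that might look like - the names below are made
up, the values are the ones from the patch, following the "ioc-cache" and
"ioc-translation" device tree properties they correspond to:)

	#define CELL_IOC_CACHE_BASE_NODE0	0x20000511000ul
	#define CELL_IOC_TRANSLATION_BASE_NODE0	0x20000510000ul
	#define CELL_IOC_CACHE_BASE_NODE1	0x30000511000ul
	#define CELL_IOC_TRANSLATION_BASE_NODE1	0x30000510000ul

	/* cell_map_iommu_hardcoded() could then read: */
	cell_map_one_iommu(&cell_iommus[0], CELL_IOC_CACHE_BASE_NODE0,
			   CELL_IOC_TRANSLATION_BASE_NODE0);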
From bos at serpentine.com  Thu Mar  9 03:40:11 2006
From: bos at serpentine.com (Bryan O'Sullivan)
Date: Wed, 08 Mar 2006 08:40:11 -0800
Subject: [PATCH] Document Linux's memory barriers
In-Reply-To: <27607.1141821265@warthog.cambridge.redhat.com>
References: <31492.1141753245@warthog.cambridge.redhat.com> <17422.19209.60360.178668@cargo.ozlabs.ibm.com> <27607.1141821265@warthog.cambridge.redhat.com>
Message-ID: <1141836011.4347.22.camel@camp4.serpentine.com>

On Wed, 2006-03-08 at 12:34 +0000, David Howells wrote:

> On i386 and x86_64, do IN and OUT instructions imply MFENCE?

No.

From dhowells at redhat.com  Thu Mar  9 04:04:51 2006
From: dhowells at redhat.com (David Howells)
Date: Wed, 08 Mar 2006 17:04:51 +0000
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: <20060308145506.GA5095@devserv.devel.redhat.com>
References: <20060308145506.GA5095@devserv.devel.redhat.com> <31492.1141753245@warthog.cambridge.redhat.com> <29826.1141828678@warthog.cambridge.redhat.com>
Message-ID: <9834.1141837491@warthog.cambridge.redhat.com>

Alan Cox wrote:

> [For information on bus mastering DMA and coherency please read ....]
> 
> sincee have a doc on this

Documentation/pci.txt?

> The use of volatile generates poorer code and hides the serialization in
> type declarations that may be far from the code.

I'm not sure what you mean by that.

> Is this true of IA-64 ??

Are you referring to non-temporal loads and stores?

> > +There are some more advanced barriering functions:
> 
> "barriering" ... ick, barrier.

Picky:-)

> Should clarify local ordering v SMP ordering for locks implied here.

Do you mean explain what each sort of lock does?

> > + (*) inX(), outX():
> > +
> > +     These are intended to talk to legacy i386 hardware using an alternate bus
> > +     addressing mode.  They are synchronous as far as the x86 CPUs are
> 
> Not really true. Lots of PCI devices use them. Need to talk about "I/O space"

Which bit is not really true?

David

From dhowells at redhat.com  Thu Mar  9 04:19:41 2006
From: dhowells at redhat.com (David Howells)
Date: Wed, 08 Mar 2006 17:19:41 +0000
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: <20060308154157.GI7301@parisc-linux.org>
References: <20060308154157.GI7301@parisc-linux.org> <31492.1141753245@warthog.cambridge.redhat.com> <29826.1141828678@warthog.cambridge.redhat.com> <20060308145506.GA5095@devserv.devel.redhat.com>
Message-ID: <10095.1141838381@warthog.cambridge.redhat.com>

Matthew Wilcox wrote:

> > That might be worth an example with an if() because PPC will do this and
> > if its a read with a side effect (eg I/O space) you get singed..
> 
> PPC does speculative memory accesses to IO? Are you *sure*?

Can you do speculative reads from frame buffers?

> # define smp_read_barrier_depends() do { } while(0)

What's this one meant to do?

> Port space is deprecated though. PCI 2.3 says:

That's sort of irrelevant here. I still need to document the interaction.

> Since memory write transactions may be posted in bridges anywhere
> in the system, and I/O writes may be posted in the host bus bridge,

I'm not sure whether this is beyond the scope of this document. Maybe the
document's scope needs to be expanded.

David
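(For reference, the pattern smp_read_barrier_depends() exists for is the
dependent read through a just-published pointer - a sketch of the usual
illustration rather than anything stated in this thread; the barrier is a
no-op on everything except Alpha, where the dependent read can otherwise see
pre-initialisation data:)

	struct foo { int data; };
	static struct foo *global_ptr;

	static void publisher(struct foo *p)
	{
		p->data = 42;
		smp_wmb();			/* order init before publish */
		global_ptr = p;
	}

	static int consumer(void)
	{
		struct foo *q = global_ptr;
		smp_read_barrier_depends();	/* no-op except on Alpha */
		return q->data;			/* sees 42, not stale data */
	}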
From alan at redhat.com  Thu Mar  9 04:36:05 2006
From: alan at redhat.com (Alan Cox)
Date: Wed, 8 Mar 2006 12:36:05 -0500
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: <9834.1141837491@warthog.cambridge.redhat.com>
References: <20060308145506.GA5095@devserv.devel.redhat.com> <31492.1141753245@warthog.cambridge.redhat.com> <29826.1141828678@warthog.cambridge.redhat.com> <9834.1141837491@warthog.cambridge.redhat.com>
Message-ID: <20060308173605.GB13063@devserv.devel.redhat.com>

On Wed, Mar 08, 2006 at 05:04:51PM +0000, David Howells wrote:
> > [For information on bus mastering DMA and coherency please read ....]
> > sincee have a doc on this
> 
> Documentation/pci.txt?

and:

Documentation/DMA-mapping.txt
Documentation/DMA-API.txt

> > > The use of volatile generates poorer code and hides the serialization in
> > > type declarations that may be far from the code.
> 
> I'm not sure what you mean by that.

in foo.h

	struct blah {
		volatile int x;	/* need serialization */
	};

2 million miles away

	blah.x = 1;
	blah.y = 4;

And you've no idea that it's magically serialized due to a type declaration
in a header you've never read. Hence the "don't use volatile" rule

> > Is this true of IA-64 ??
> 
> Are you referring to non-temporal loads and stores?

Yep. But Matthew answered that

> > Should clarify local ordering v SMP ordering for locks implied here.
> 
> Do you mean explain what each sort of lock does?

spin_unlock ensures that local CPU writes before the lock are visible
to all processors before the lock is dropped but it has no effect on
I/O ordering. Just a need for clarity.

> > > + (*) inX(), outX():
> > > +
> > > +     These are intended to talk to legacy i386 hardware using an alternate bus
> > > +     addressing mode.  They are synchronous as far as the x86 CPUs are
> > 
> > Not really true. Lots of PCI devices use them. Need to talk about "I/O space"
> 
> Which bit is not really true?

The "legacy i386 hardware" bit. Many processors have an I/O space.

From dhowells at redhat.com  Thu Mar  9 05:35:07 2006
From: dhowells at redhat.com (David Howells)
Date: Wed, 08 Mar 2006 18:35:07 +0000
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: <20060308173605.GB13063@devserv.devel.redhat.com>
References: <20060308173605.GB13063@devserv.devel.redhat.com> <20060308145506.GA5095@devserv.devel.redhat.com> <31492.1141753245@warthog.cambridge.redhat.com> <29826.1141828678@warthog.cambridge.redhat.com> <9834.1141837491@warthog.cambridge.redhat.com>
Message-ID: <11922.1141842907@warthog.cambridge.redhat.com>

Alan Cox wrote:

> spin_unlock ensures that local CPU writes before the lock are visible
> to all processors before the lock is dropped but it has no effect on
> I/O ordering. Just a need for clarity.

So I can't use spinlocks in my driver to make sure two different CPUs don't
interfere with each other when trying to communicate with a device because the
spinlocks don't guarantee that I/O operations will stay in effect within the
locking section?

David
From alan at redhat.com  Thu Mar  9 05:45:00 2006
From: alan at redhat.com (Alan Cox)
Date: Wed, 8 Mar 2006 13:45:00 -0500
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: <11922.1141842907@warthog.cambridge.redhat.com>
References: <20060308173605.GB13063@devserv.devel.redhat.com> <20060308145506.GA5095@devserv.devel.redhat.com> <31492.1141753245@warthog.cambridge.redhat.com> <29826.1141828678@warthog.cambridge.redhat.com> <9834.1141837491@warthog.cambridge.redhat.com> <11922.1141842907@warthog.cambridge.redhat.com>
Message-ID: <20060308184500.GA17716@devserv.devel.redhat.com>

On Wed, Mar 08, 2006 at 06:35:07PM +0000, David Howells wrote:
> Alan Cox wrote:
> 
> > spin_unlock ensures that local CPU writes before the lock are visible
> > to all processors before the lock is dropped but it has no effect on
> > I/O ordering. Just a need for clarity.
> 
> So I can't use spinlocks in my driver to make sure two different CPUs don't
> interfere with each other when trying to communicate with a device because the
> spinlocks don't guarantee that I/O operations will stay in effect within the
> locking section?

If you have

	CPU #0

	spin_lock(&foo->lock)
	writel(0, &foo->regnum)
	writel(1, &foo->data);
	spin_unlock(&foo->lock);

	CPU #1

	spin_lock(&foo->lock);
	writel(4, &foo->regnum);
	writel(5, &foo->data);
	spin_unlock(&foo->lock);

then on some NUMA infrastructures the order may not be as you expect. The CPU
will execute writel 0, writel 1 and the second CPU later will execute writel 4
writel 5, but the order they hit the PCI bridge may not be the same order.
Usually such things don't matter but in a register windowed case getting
0/4/1/5 might be rather unfortunate.

See Documentation/DocBook/deviceiobook.tmpl (or its output)

The following case is safe

	spin_lock(&foo->lock);
	writel(0, &foo->regnum);
	reg = readl(&foo->data);
	spin_unlock(&foo->lock);

as the read must complete and it forces the write to complete. The pure write
case used above should be implemented as

	spin_lock(&foo->lock);
	writel(0, &foo->regnum);
	writel(1, &foo->data);
	mmiowb();
	spin_unlock(&foo->lock);

The mmiowb ensures that the writels will occur before the writel from another
CPU that then takes the lock and issues a writel.

Welcome to the wonderful world of NUMA

Alan

From dhowells at redhat.com  Thu Mar  9 05:59:53 2006
From: dhowells at redhat.com (David Howells)
Date: Wed, 08 Mar 2006 18:59:53 +0000
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: <20060308184500.GA17716@devserv.devel.redhat.com>
References: <20060308184500.GA17716@devserv.devel.redhat.com> <20060308173605.GB13063@devserv.devel.redhat.com> <20060308145506.GA5095@devserv.devel.redhat.com> <31492.1141753245@warthog.cambridge.redhat.com> <29826.1141828678@warthog.cambridge.redhat.com> <9834.1141837491@warthog.cambridge.redhat.com> <11922.1141842907@warthog.cambridge.redhat.com>
Message-ID: <14067.1141844393@warthog.cambridge.redhat.com>

Alan Cox wrote:

> then on some NUMA infrastructures the order may not be as you expect.

Oh, yuck!

Okay... does NUMA guarantee the same for ordinary memory accesses inside the
critical section?

David
From ak at suse.de  Wed Mar  8 22:38:38 2006
From: ak at suse.de (Andi Kleen)
Date: Wed, 8 Mar 2006 12:38:38 +0100
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: <14067.1141844393@warthog.cambridge.redhat.com>
References: <20060308184500.GA17716@devserv.devel.redhat.com> <11922.1141842907@warthog.cambridge.redhat.com> <14067.1141844393@warthog.cambridge.redhat.com>
Message-ID: <200603081238.39372.ak@suse.de>

On Wednesday 08 March 2006 19:59, David Howells wrote:
> Alan Cox wrote:
> 
> > then on some NUMA infrastructures the order may not be as you expect.
> 
> Oh, yuck!
> 
> Okay... does NUMA guarantee the same for ordinary memory accesses inside the
> critical section?

If you use barriers the ordering should be the same on cc/NUMA vs SMP.
Otherwise it wouldn't be "cc"

But it might be quite unfair.

-Andi

From abergman at de.ibm.com  Thu Mar  9 04:56:50 2006
From: abergman at de.ibm.com (Arnd Bergmann)
Date: Wed, 8 Mar 2006 18:56:50 +0100
Subject: [PATCH] cell: cleanup iommu initialization
In-Reply-To: <20060308165144.GA17966@lst.de>
References: <20060308154946.GB15859@lst.de> <440F0B13.1070205@austin.ibm.com> <20060308165144.GA17966@lst.de>
Message-ID: <200603081856.50755.abergman@de.ibm.com>

On Wednesday 08 March 2006 17:51, Christoph Hellwig wrote:
> On Wed, Mar 08, 2006 at 10:49:23AM -0600, Joel Schopp wrote:
> > 
> > >-	iommu->mapped_base = __ioremap(0x20000511000, 0x1000,
> > >_PAGE_NO_CACHE);
> > >-	iommu->mapped_mmio_base = __ioremap(0x20000510000, 0x1000,
> > >_PAGE_NO_CACHE);
> > >+	iommu->mapped_base = __ioremap(base, 0x1000, _PAGE_NO_CACHE);
> > >+	iommu->mapped_mmio_base = __ioremap(mmio_base, 0x1000,
> > >_PAGE_NO_CACHE);
> > 
> > While we are cleaning this up can we name these constants.  Many of these
> > unnamed constants throughout the patch.
> 
> I already mentioned that in the description.  Where in the cell
> documentation could I find canonical names for these?

Hmm, the reason we did not have the base addresses here so far is that they
are set up by the firmware and therefore not really constants that are
documented anywhere.

In the new code we get them from the device tree and the hardcoded case will
go away one day when the boards that have been shipped to customers so far
have all upgraded their firmware (they need to do that anyway if they want
NUMA support).

Thinking about this again, we should probably just define constants anyway.
The CBE public registers document probably names only the offsets relative to
the areas, so we need to make up names for ourselves.

	Arnd <><

From dhowells at redhat.com  Thu Mar  9 06:08:42 2006
From: dhowells at redhat.com (David Howells)
Date: Wed, 08 Mar 2006 19:08:42 +0000
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: <20060308184500.GA17716@devserv.devel.redhat.com>
References: <20060308184500.GA17716@devserv.devel.redhat.com> <20060308173605.GB13063@devserv.devel.redhat.com> <20060308145506.GA5095@devserv.devel.redhat.com> <31492.1141753245@warthog.cambridge.redhat.com> <29826.1141828678@warthog.cambridge.redhat.com> <9834.1141837491@warthog.cambridge.redhat.com> <11922.1141842907@warthog.cambridge.redhat.com>
Message-ID: <14275.1141844922@warthog.cambridge.redhat.com>

Alan Cox wrote:

> 	spin_lock(&foo->lock);
> 	writel(0, &foo->regnum);

I presume there only needs to be an mmiowb() here if you've got the
appropriate CPU's I/O memory window set up to be weakly ordered.
> 	writel(1, &foo->data);
> 	mmiowb();
> 	spin_unlock(&foo->lock);

David

From torvalds at osdl.org  Thu Mar  9 06:26:41 2006
From: torvalds at osdl.org (Linus Torvalds)
Date: Wed, 8 Mar 2006 11:26:41 -0800 (PST)
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: <14275.1141844922@warthog.cambridge.redhat.com>
References: <20060308184500.GA17716@devserv.devel.redhat.com> <20060308173605.GB13063@devserv.devel.redhat.com> <20060308145506.GA5095@devserv.devel.redhat.com> <31492.1141753245@warthog.cambridge.redhat.com> <29826.1141828678@warthog.cambridge.redhat.com> <9834.1141837491@warthog.cambridge.redhat.com> <11922.1141842907@warthog.cambridge.redhat.com> <14275.1141844922@warthog.cambridge.redhat.com>
Message-ID:

On Wed, 8 Mar 2006, David Howells wrote:
> Alan Cox wrote:
> 
> > 	spin_lock(&foo->lock);
> > 	writel(0, &foo->regnum);
> 
> I presume there only needs to be an mmiowb() here if you've got the
> appropriate CPU's I/O memory window set up to be weakly ordered.

Actually, since the different NUMA things may have different paths to the PCI
thing, I don't think even the mmiowb() will really help. It has nothing to
serialize _with_.

It only orders mmio from within _one_ CPU and "path" to the destination. The
IO might be posted somewhere on a PCI bridge, and depending on the posting
rules, the mmiowb() just isn't relevant for IO coming through another path.

Of course, to get into that deep doo-doo, your IO fabric must be separate
from the memory fabric, and the hardware must be pretty special, I think.

So for example, if you are using an Opteron with its NUMA memory setup
between CPUs over HT links, from an _IO_ standpoint it's not really anything
strange, since it uses the same fabric for memory coherency and IO coherency,
and from an IO ordering standpoint it's just normal SMP.

But if you have a separate IO fabric and basically two different CPUs can get
to one device through two different paths, no amount of write barriers of any
kind will ever help you.

So in the really general case, it's still basically true that the _only_
thing that serializes a MMIO write to a device is a _read_ from that device,
since then the _device_ ends up being the serialization point.

So in the extreme case, you literally have to do a read from the device
before you release the spinlock, if ordering to the device from two different
CPUs matters to you. The IO paths simply may not be serializable with the
normal memory paths, so spinlocks have absolutely _zero_ ordering capability,
and a write barrier on either the normal memory side or the IO side doesn't
affect anything.

Now, I'm by no means claiming that we necessarily get this right in general,
or even very commonly. The undeniable fact is that "big NUMA" machines need
to validate the drivers they use separately. The fact that it works on a
normal PC - and that it's been tested to death there - does not guarantee
much anything.

The good news, of course, is that you don't use that kind of "big NUMA"
system the same way you'd use a regular desktop SMP. You don't plug random
devices into it and just expect them to work.

I'd hope ;)

		Linus
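(Linus's read-back rule in code - a sketch reusing Alan's hypothetical device
from earlier in the thread, with an invented status register; the readl()
completes at the device, making the device itself the serialisation point
before the lock is dropped:)

	spin_lock(&foo->lock);
	writel(0, &foo->regnum);
	writel(1, &foo->data);
	(void) readl(&foo->status);	/* pulls the posted writes all the
					 * way out to the device */
	spin_unlock(&foo->lock);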
From dhowells at redhat.com  Thu Mar  9 06:31:42 2006
From: dhowells at redhat.com (David Howells)
Date: Wed, 08 Mar 2006 19:31:42 +0000
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To:
References: <20060308184500.GA17716@devserv.devel.redhat.com> <20060308173605.GB13063@devserv.devel.redhat.com> <20060308145506.GA5095@devserv.devel.redhat.com> <31492.1141753245@warthog.cambridge.redhat.com> <29826.1141828678@warthog.cambridge.redhat.com> <9834.1141837491@warthog.cambridge.redhat.com> <11922.1141842907@warthog.cambridge.redhat.com> <14275.1141844922@warthog.cambridge.redhat.com>
Message-ID: <19984.1141846302@warthog.cambridge.redhat.com>

Linus Torvalds wrote:

> Actually, since the different NUMA things may have different paths to the
> PCI thing, I don't think even the mmiowb() will really help. It has
> nothing to serialize _with_.

On NUMA PowerPC, should mmiowb() be a SYNC or an EIEIO instruction then? Those
do inter-component synchronisation.

David

From dhowells at redhat.com  Thu Mar  9 06:37:11 2006
From: dhowells at redhat.com (David Howells)
Date: Wed, 08 Mar 2006 19:37:11 +0000
Subject: [PATCH] Document Linux's memory barriers [try #3]
In-Reply-To: <29826.1141828678@warthog.cambridge.redhat.com>
References: <29826.1141828678@warthog.cambridge.redhat.com> <31492.1141753245@warthog.cambridge.redhat.com>
Message-ID: <21627.1141846631@warthog.cambridge.redhat.com>

The attached patch documents the Linux kernel's memory barriers.

I've updated it from the comments I've been given.

Note that the per-arch notes sections are gone because it's clear that there
are so many exceptions that it's not worth having them.

I've added a list of references to other documents.

I've tried to get rid of the concept of memory accesses appearing on the bus;
what matters is apparent behaviour with respect to other observers in the
system.

I'm not sure that any mention of interrupts vs interrupt disablement should be
retained... it's unclear that there is actually anything that guarantees that
stuff won't leak out of an interrupt-disabled section and into an interrupt
handler. Paul Mackerras says this isn't valid on powerpc, and looking at the
code seems to confirm that, barring implicit enforcement by the CPU.

There's also some uncertainty with respect to spinlocks vs I/O accesses on
NUMA.

Signed-Off-By: David Howells
---
warthog>diffstat -p1 /tmp/mb.diff
 Documentation/memory-barriers.txt |  781 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 781 insertions(+)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
new file mode 100644
index 0000000..6eeb7e4
--- /dev/null
+++ b/Documentation/memory-barriers.txt
@@ -0,0 +1,781 @@
+			 ============================
+			 LINUX KERNEL MEMORY BARRIERS
+			 ============================
+
+Contents:
+
+ (*) What are memory barriers?
+
+ (*) Where are memory barriers needed?
+
+     - Accessing devices.
+     - Multiprocessor interaction.
+     - Interrupts.
+
+ (*) Explicit kernel compiler barriers.
+
+ (*) Explicit kernel memory barriers.
+
+ (*) Implicit kernel memory barriers.
+
+     - Locking functions.
+     - Interrupt disabling functions.
+     - Miscellaneous functions.
+
+ (*) Inter-CPU locking barrier effects.
+
+     - Locks vs memory accesses.
+     - Locks vs I/O accesses.
+
+ (*) Kernel I/O barrier effects.
+
+ (*) References.
+
+
+=========================
+WHAT ARE MEMORY BARRIERS?
+=========================
+
+Memory barriers are instructions to both the compiler and the CPU to impose an
+apparent partial ordering between the memory access operations specified on
+either side of the barrier.  They request that the sequence of memory events
+generated appears to other components of the system as if the barrier is
+effective on that CPU.
+
+Note that:
+
+ (*) there's no guarantee that the sequence of memory events is _actually_ so
+     ordered.  It's possible for the CPU to do out-of-order accesses _as long
+     as no-one is looking_, and then fix up the memory if someone else tries to
+     see what's going on (for instance a bus master device); what matters is
+     the _apparent_ order as far as other processors and devices are concerned;
+     and
+
+ (*) memory barriers are only guaranteed to act within the CPU processing them,
+     and are not, for the most part, guaranteed to percolate down to other CPUs
+     in the system or to any I/O hardware that that CPU may communicate with.
+
+
+For example, a programmer might take it for granted that the CPU will perform
+memory accesses in exactly the order specified, so that if a CPU is, for
+example, given the following piece of code:
+
+	a = *A;
+	*B = b;
+	c = *C;
+	d = *D;
+	*E = e;
+
+The programmer would then expect that the CPU will complete the memory access
+for each instruction before moving on to the next one, leading to a definite
+sequence of operations as seen by external observers in the system:
+
+	read *A, write *B, read *C, read *D, write *E.
+
+
+Reality is, of course, much messier.  With many CPUs and compilers, this isn't
+always true because:
+
+ (*) reads are more likely to need to be completed immediately to permit
+     execution progress, whereas writes can often be deferred without a
+     problem;
+
+ (*) reads can be done speculatively, and then the result discarded should it
+     prove not to be required;
+
+ (*) the order of the memory accesses may be rearranged to promote better use
+     of the CPU buses and caches;
+
+ (*) reads and writes may be combined to improve performance when talking to
+     the memory or I/O hardware that can do batched accesses of adjacent
+     locations, thus cutting down on transaction setup costs (memory and PCI
+     devices may be able to do this); and
+
+ (*) the CPU's data cache may affect the ordering, though cache-coherency
+     mechanisms should alleviate this - once the write has actually hit the
+     cache.
+
+So what another CPU, say, might actually observe from the above piece of code
+is:
+
+	read *A, read {*C,*D}, write *E, write *B
+
+	(By "read {*C,*D}" I mean a combined single read.)
+
+
+It is also guaranteed that a CPU will be self-consistent: it will see its _own_
+accesses appear to be correctly ordered, without the need for a memory
+barrier.  For instance with the following code:
+
+	X = *A;
+	*A = Y;
+	Z = *A;
+
+assuming no intervention by an external influence, it can be taken that:
+
+ (*) X will hold the old value of *A: the load will never happen after the
+     write, and so X will never end up being given the value that was assigned
+     to *A from Y; and
+
+ (*) Z will always be given the value in *A that was assigned there from Y:
+     the load will never happen before the write, and so Z will never end up
+     with the value that was in *A initially.
+
+(This is ignoring the fact that the value initially in *A may appear to be the
+same as the value assigned to *A from Y.)
+
+
+=================================
+WHERE ARE MEMORY BARRIERS NEEDED?
+=================================
+
+Under normal operation, access reordering is probably not going to be a problem
+as a linear program will still appear to operate correctly.  There are,
+however, three circumstances where reordering definitely _could_ be a problem:
+
+
+ACCESSING DEVICES
+-----------------
+
+Many devices can be memory mapped, and so appear to the CPU as if they're just
+memory locations.  However, to control the device, the driver has to make the
+right accesses in exactly the right order.
+
+Consider, for example, an ethernet chipset such as the AMD PCnet32.  It
+presents to the CPU an "address register" and a bunch of "data registers".  The
+way it's accessed is to write the index of the internal register to be accessed
+to the address register, and then read or write the appropriate data register
+to access the chip's internal register, which could - theoretically - be done
+by:
+
+	*ADR = ctl_reg_3;
+	reg = *DATA;
+
+The problem with a clever CPU or a clever compiler is that the write to the
+address register isn't guaranteed to happen before the access to the data
+register if the CPU or the compiler thinks it is more efficient to defer the
+address write.  If the accesses end up reordered as:
+
+	read *DATA, write *ADR
+
+then things will break.
+
+
+In the Linux kernel, however, I/O should be done through the appropriate
+accessor routines - such as inb() or writel() - which know how to make such
+accesses appropriately sequential.
+
+On some systems, I/O writes are not strongly ordered across all CPUs, and so
+locking should be used, and mmiowb() should be issued prior to unlocking the
+critical section.
+
+See Documentation/DocBook/deviceiobook.tmpl for more information.
+
+
+MULTIPROCESSOR INTERACTION
+--------------------------
+
+When there's a system with more than one processor, the CPUs in the system may
+be working on the same set of data at the same time.  This can cause
+synchronisation problems, and the usual way of dealing with them is to use
+locks - but locks are quite expensive, and so it may be preferable to operate
+without the use of a lock if at all possible.  In such a case accesses that
+affect both CPUs may have to be carefully ordered to prevent error.
+
+Consider the R/W semaphore slow path.  In that, a waiting process is queued on
+the semaphore, as noted by it having a record on its stack linked to the
+semaphore's list:
+
+	struct rw_semaphore {
+		...
+		struct list_head waiters;
+	};
+
+	struct rwsem_waiter {
+		struct list_head list;
+		struct task_struct *task;
+	};
+
+To wake up the waiter, the up_read() or up_write() functions have to read the
+pointer from this record to know where the next waiter record is, clear the
+task pointer, call wake_up_process() on the task, and release the reference
+held on the waiter's task struct:
+
+	READ waiter->list.next;
+	READ waiter->task;
+	WRITE waiter->task;
+	CALL wakeup
+	RELEASE task
+
+If any of these steps occur out of order, then the whole thing may fail.
+
+Note that the waiter does not get the semaphore lock again - it just waits for
+its task pointer to be cleared.  Since the record is on its stack, this means
+that if the task pointer is cleared _before_ the next pointer in the list is
+read, another CPU might start processing the waiter, and it might clobber its
+stack before the up*() functions have a chance to read the next pointer.
+
+	CPU 0				CPU 1
+	===============================	===============================
+	down_xxx()
+	Queue waiter
+	Sleep
+					up_yyy()
+					READ waiter->task;
+					WRITE waiter->task;
+
+	Resume processing
+	down_xxx() returns
+	call foo()
+	foo() clobbers *waiter
+
+					READ waiter->list.next;
+					--- OOPS ---
+
+This could be dealt with using a spinlock, but then the down_xxx() function has
+to get the spinlock again after it's been woken up, which is a waste of
+resources.
+
+The way to deal with this is to insert an SMP memory barrier:
+
+	READ waiter->list.next;
+	READ waiter->task;
+	smp_mb();
+	WRITE waiter->task;
+	CALL wakeup
+	RELEASE task
+
+In this case, the barrier makes a guarantee that all memory accesses before the
+barrier will appear to happen before all the memory accesses after the barrier
+with respect to the other CPUs on the system.  It does _not_ guarantee that all
+the memory accesses before the barrier will be complete by the time the barrier
+itself is complete.
+
+SMP memory barriers are normally nothing more than compiler barriers on a
+kernel compiled for a UP system because the CPU orders overlapping accesses
+with respect to itself, and so CPU barriers aren't needed.
+
+
+INTERRUPTS
+----------
+
+A driver may be interrupted by its own interrupt service routine, and thus the
+two parts of the driver may interfere with each other's attempts to control or
+access the device.
+
+This may be alleviated - at least in part - by disabling interrupts (a form of
+locking), such that the critical operations are all contained within the
+interrupt-disabled section in the driver.  Whilst the driver's interrupt
+routine is executing, the driver's core may not run on the same CPU, and its
+interrupt is not permitted to happen again until the current interrupt has been
+handled, thus the interrupt handler does not need to lock against that.
+
+
+However, consider the following example:
+
+	CPU 1				CPU 2
+	===============================	===============================
+	[A is 0 and B is 0]
+	DISABLE IRQ
+	*A = 1;
+	smp_wmb();
+	*B = 2;
+	ENABLE IRQ
+
+	*A = 3
+					a = *A;
+					b = *B;
+	smp_wmb();
+	*B = 4;
+
+CPU 2 might see *A == 3 and *B == 0, when what it probably ought to see is *B
+== 2 and *A == 1 or *A == 3, or *B == 4 and *A == 3.
+
+This might happen because the write "*B = 2" might occur after the write "*A =
+3" - in which case the former write has leaked from the interrupt-disabled
+section into the interrupt handler.  In this case a lock of some description
+should very probably be used.
+
+
+This sort of problem might also occur with relaxed I/O ordering rules, if it's
+permitted for I/O writes to cross.  For instance, if a driver was talking to an
+ethernet card that sports an address register and a data register:
+
+	DISABLE IRQ
+	writew(ctl_reg_3, ADR);
+	writew(y, DATA);
+	ENABLE IRQ
+
+	writew(ctl_reg_4, ADR);
+	q = readw(DATA);
+
+In such a case, an mmiowb() is needed, firstly to prevent the first write to
+the address register from occurring after the write to the data register, and
+secondly to prevent the write to the data register from happening after the
+second write to the address register.
+
+
+=================================
+EXPLICIT KERNEL COMPILER BARRIERS
+=================================
+
+The Linux kernel has an explicit compiler barrier function that prevents the
+compiler from moving the memory accesses either side of it to the other side:
+
+	barrier();
+
+This has no direct effect on the CPU, which may then reorder things however it
+wishes.
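+
+For example, when the kernel polls a flag that an interrupt handler will set,
+the compiler must not be allowed to cache the flag in a register across the
+loop.  A minimal sketch (the flag and its setter are invented for
+illustration):
+
+	static int irq_done;		/* set to 1 by an interrupt handler */
+
+	while (!irq_done)
+		barrier();		/* force irq_done to be re-read */
+
+Without the barrier, the compiler would be entitled to read irq_done once and
+spin on the cached value forever.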
+
+
+In addition, accesses to "volatile" memory locations and volatile asm
+statements act as implicit compiler barriers.  Note, however, that the use of
+volatile has two negative consequences:
+
+ (1) it causes the generation of poorer code, and
+
+ (2) it can affect serialisation of events in code distant from the declaration
+     (consider a structure defined in a header file that has a volatile member
+     being accessed by the code in a source file).
+
+The Linux coding style therefore strongly favours the use of explicit barriers
+except in small and specific cases.  In general, volatile should be avoided.
+
+
+===============================
+EXPLICIT KERNEL MEMORY BARRIERS
+===============================
+
+The Linux kernel has six basic CPU memory barriers:
+
+			MANDATORY	SMP CONDITIONAL
+			===============	===============
+	GENERAL		mb()		smp_mb()
+	READ		rmb()		smp_rmb()
+	WRITE		wmb()		smp_wmb()
+
+General memory barriers give a guarantee that all memory accesses specified
+before the barrier will appear to happen before all memory accesses specified
+after the barrier with respect to the other components of the system.
+
+Read and write memory barriers give similar guarantees, but only for memory
+reads versus memory reads and memory writes versus memory writes respectively.
+
+All memory barriers imply compiler barriers.
+
+SMP memory barriers are only compiler barriers on uniprocessor compiled systems
+because it is assumed that a CPU will be apparently self-consistent, and will
+order overlapping accesses correctly with respect to itself.
+
+There is no guarantee that any of the memory accesses specified before a memory
+barrier will be complete by the completion of a memory barrier; the barrier can
+be considered to draw a line in that CPU's access queue that accesses of the
+appropriate type may not cross.
+
+There is no guarantee that issuing a memory barrier on one CPU will have any
+direct effect on another CPU or any other hardware in the system.  The indirect
+effect will be the order in which the second CPU sees the first CPU's accesses
+occur.
+
+There is no guarantee that some intervening piece of off-the-CPU hardware[*]
+will not reorder the memory accesses.  CPU cache coherency mechanisms should
+propagate the indirect effects of a memory barrier between CPUs.
+
+ [*] For information on bus mastering DMA and coherency please read:
+
+     Documentation/pci.txt
+     Documentation/DMA-mapping.txt
+     Documentation/DMA-API.txt
+
+Note that these are the _minimum_ guarantees.  Different architectures may give
+more substantial guarantees, but they may not be relied upon outside of arch
+specific code.
+
+
+There are some more advanced barrier functions:
+
+ (*) set_mb(var, value)
+ (*) set_wmb(var, value)
+
+     These assign the value to the variable and then insert at least a write
+     barrier after it, depending on the function.  They aren't guaranteed to
+     insert anything more than a compiler barrier in a UP compilation.
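+
+As an example of how the above barriers pair up, consider one CPU publishing a
+datum for another to consume (a minimal sketch; the variable names are
+invented for illustration):
+
+	CPU 1				CPU 2
+	===============================	===============================
+	shared_datum = 42;
+	smp_wmb();
+	datum_ready = 1;
+					while (!datum_ready)
+						barrier();
+					smp_rmb();
+					x = shared_datum;
+
+The write barrier on CPU 1 pairs with the read barrier on CPU 2; omit either
+and CPU 2 may observe datum_ready set whilst still seeing the old value of
+shared_datum.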
+
+
+===============================
+IMPLICIT KERNEL MEMORY BARRIERS
+===============================
+
+Some of the other functions in the Linux kernel imply memory barriers, amongst
+them are locking and scheduling functions and interrupt management functions.
+
+This specification is a _minimum_ guarantee; any particular architecture may
+provide more substantial guarantees, but these may not be relied upon outside
+of arch specific code.
+
+
+LOCKING FUNCTIONS
+-----------------
+
+All the following locking functions imply barriers:
+
+ (*) spin locks
+ (*) R/W spin locks
+ (*) mutexes
+ (*) semaphores
+ (*) R/W semaphores
+
+In all cases there are variants on a LOCK operation and an UNLOCK operation.
+
+ (*) LOCK operation implication:
+
+     Memory accesses issued after the LOCK will be completed after the LOCK
+     accesses have completed.
+
+     Memory accesses issued before the LOCK may be completed after the LOCK
+     accesses have completed.
+
+ (*) UNLOCK operation implication:
+
+     Memory accesses issued before the UNLOCK will be completed before the
+     UNLOCK accesses have completed.
+
+     Memory accesses issued after the UNLOCK may be completed before the UNLOCK
+     accesses have completed.
+
+ (*) LOCK vs UNLOCK implication:
+
+     The LOCK accesses will be completed before the UNLOCK accesses.
+
+And therefore an UNLOCK followed by a LOCK is equivalent to a full barrier, but
+a LOCK followed by an UNLOCK isn't.
+
+Locks and semaphores may not provide any guarantee of ordering on UP compiled
+systems, and so can't be counted on in such a situation to actually do anything
+at all, especially with respect to I/O accesses, unless combined with interrupt
+disabling operations.
+
+See also the section on "Inter-CPU locking barrier effects".
+
+
+As an example, consider the following:
+
+	*A = a;
+	*B = b;
+	LOCK
+	*C = c;
+	*D = d;
+	UNLOCK
+	*E = e;
+	*F = f;
+
+The following sequence of events is acceptable:
+
+	LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK
+
+But none of the following are:
+
+	{*F,*A}, *B,	LOCK, *C, *D,	UNLOCK, *E
+	*A, *B, *C,	LOCK, *D,	UNLOCK, *E, *F
+	*A, *B,		LOCK, *C,	UNLOCK, *D, *E, *F
+	*B,		LOCK, *C, *D,	UNLOCK, {*F,*A}, *E
+
+
+INTERRUPT DISABLING FUNCTIONS
+-----------------------------
+
+Functions that disable interrupts (LOCK equivalent) and enable interrupts
+(UNLOCK equivalent) will barrier memory and I/O accesses versus memory and I/O
+accesses done in the interrupt handler.  This prevents an interrupt routine
+interfering with accesses made in an interrupt-disabled section of code and
+vice versa.
+
+Note that whilst disabling or enabling interrupts acts as a compiler barrier
+under all circumstances, it only acts as a memory barrier with respect to
+interrupts, not with respect to nested sections.
+
+Consider the following:
+
+
+	*X = x;
+
+	*A = a;
+	SAVE IRQ AND DISABLE
+	*B = b;
+	SAVE IRQ AND DISABLE
+	*C = c;
+	RESTORE IRQ
+	*D = d;
+	RESTORE IRQ
+	*E = e;
+
+	*Y = y;
+
+
+It is acceptable to observe the following sequences of events:
+
+	{ INT, *X }, *A, SAVE, *B, SAVE, *C, REST, *D, REST, *E, { INT, *Y }
+	{ INT, *X }, *A, SAVE, *B, SAVE, *C, REST, *D, REST, { INT, *Y, *E }
+	{ INT, *X }, SAVE, SAVE, *A, *B, *C, *D, *E, REST, REST, { INT, *Y }
+	{ INT }, *X, SAVE, SAVE, *A, *B, *C, *D, *E, REST, REST, { INT, *Y }
+	{ INT }, *A, *X, SAVE, SAVE, *B, *C, *D, *E, REST, REST, { INT, *Y }
+
+But not the following:
+
+	{ INT }, SAVE, *A, *B, *X, SAVE, *C, REST, *D, REST, *E, { INT, *Y }
+	{ INT, *X }, *A, SAVE, *B, SAVE, *C, REST, REST, { INT, *Y, *D, *E }
+
+
+MISCELLANEOUS FUNCTIONS
+-----------------------
+
+Other functions that imply barriers:
+
+ (*) schedule() and similar imply full memory barriers.
+
+
+=================================
+INTER-CPU LOCKING BARRIER EFFECTS
+=================================
+
+On SMP systems locking primitives give a more substantial form of barrier: one
+that does affect memory access ordering on other CPUs, within the context of
+conflict on any particular lock.
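+
+For instance (a minimal sketch; the lock and structure here are invented for
+illustration), if CPU 1 does:
+
+	shared->datum = 1;
+	spin_unlock(&mylock);
+
+and CPU 2 subsequently does:
+
+	spin_lock(&mylock);
+	x = shared->datum;
+
+then CPU 2 is guaranteed to see x == 1: acquiring the lock that CPU 1 released
+orders CPU 2's subsequent reads after CPU 1's prior writes.  The sections that
+follow spell out what such conflicting locks do and do not guarantee.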
+
+
+LOCKS VS MEMORY ACCESSES
+------------------------
+
+Consider the following: the system has a pair of spinlocks (M) and (Q), and
+three CPUs; then should the following sequence of events occur:
+
+	CPU 1				CPU 2
+	===============================	===============================
+	*A = a;				*E = e;
+	LOCK M				LOCK Q
+	*B = b;				*F = f;
+	*C = c;				*G = g;
+	UNLOCK M			UNLOCK Q
+	*D = d;				*H = h;
+
+Then there is no guarantee as to what order CPU #3 will see the accesses to *A
+through *H occur in, other than the constraints imposed by the separate locks
+on the separate CPUs.  It might, for example, see:
+
+	*E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M
+
+But it won't see any of:
+
+	*B, *C or *D preceding LOCK M
+	*A, *B or *C following UNLOCK M
+	*F, *G or *H preceding LOCK Q
+	*E, *F or *G following UNLOCK Q
+
+
+However, if the following occurs:
+
+	CPU 1				CPU 2
+	===============================	===============================
+	*A = a;
+	LOCK M		[1]
+	*B = b;
+	*C = c;
+	UNLOCK M	[1]
+	*D = d;				*E = e;
+					LOCK M		[2]
+					*F = f;
+					*G = g;
+					UNLOCK M	[2]
+					*H = h;
+
+CPU #3 might see:
+
+	*E, LOCK M [1], *C, *B, *A, UNLOCK M [1],
+		LOCK M [2], *H, *F, *G, UNLOCK M [2], *D
+
+But assuming CPU #1 gets the lock first, it won't see any of:
+
+	*B, *C, *D, *F, *G or *H preceding LOCK M [1]
+	*A, *B or *C following UNLOCK M [1]
+	*F, *G or *H preceding LOCK M [2]
+	*A, *B, *C, *E, *F or *G following UNLOCK M [2]
+
+
+LOCKS VS I/O ACCESSES
+---------------------
+
+Under certain circumstances (such as NUMA), I/O accesses within two spinlocked
+sections on two different CPUs may be seen as interleaved by the PCI bridge.
+
+For example:
+
+	CPU 1				CPU 2
+	===============================	===============================
+	spin_lock(Q);
+	writel(0, ADDR);
+	writel(1, DATA);
+	spin_unlock(Q);
+					spin_lock(Q);
+					writel(4, ADDR);
+					writel(5, DATA);
+					spin_unlock(Q);
+
+may be seen by the PCI bridge as follows:
+
+	WRITE *ADDR = 0, WRITE *ADDR = 4, WRITE *DATA = 1, WRITE *DATA = 5
+
+which would probably break.
+
+What is necessary here is to insert an mmiowb() before dropping the spinlock,
+for example:
+
+	CPU 1				CPU 2
+	===============================	===============================
+	spin_lock(Q);
+	writel(0, ADDR);
+	writel(1, DATA);
+	mmiowb();
+	spin_unlock(Q);
+					spin_lock(Q);
+					writel(4, ADDR);
+					writel(5, DATA);
+					mmiowb();
+					spin_unlock(Q);
+
+This will ensure that the two writes issued on CPU #1 appear at the PCI bridge
+before either of the writes issued on CPU #2.
+
+
+Furthermore, following a write by a read from the same device is okay, because
+the read forces the write to complete before the read is performed:
+
+	CPU 1				CPU 2
+	===============================	===============================
+	spin_lock(Q);
+	writel(0, ADDR);
+	a = readl(DATA);
+	spin_unlock(Q);
+					spin_lock(Q);
+					writel(4, ADDR);
+					b = readl(DATA);
+					spin_unlock(Q);
+
+
+See Documentation/DocBook/deviceiobook.tmpl for more information.
+
+
+==========================
+KERNEL I/O BARRIER EFFECTS
+==========================
+
+When accessing I/O memory, drivers should use the appropriate accessor
+functions:
+
+ (*) inX(), outX():
+
+     These are intended to talk to I/O space rather than memory space, but
+     that's primarily a CPU-specific concept.  The i386 and x86_64 processors
+     do indeed have special I/O space access cycles and instructions, but many
+     CPUs don't have such a concept.
+
+     The PCI bus, amongst others, defines an I/O space concept - which on such
+     CPUs as i386 and x86_64 readily maps to the CPU's concept of I/O space.
+     However, it may also be mapped as a virtual I/O space in the CPU's memory
+     map, particularly on those CPUs that don't support alternate I/O spaces.
+
+     Accesses to this space may be fully synchronous (as on i386), but
+     intermediary bridges (such as the PCI host bridge) may not fully honour
+     that.
+
+     They are guaranteed to be fully ordered with respect to each other.
+
+     They are not guaranteed to be fully ordered with respect to other types of
+     memory and I/O operation.
+
+ (*) readX(), writeX():
+
+     Whether these are guaranteed to be fully ordered and uncombined with
+     respect to each other on the issuing CPU depends on the characteristics
+     defined for the memory window through which they're accessing.  On later
+     i386 architecture machines, for example, this is controlled by way of the
+     MTRR registers.
+
+     Ordinarily, these will be guaranteed to be fully ordered and uncombined,
+     provided they're not accessing a prefetchable device.
+
+     However, intermediary hardware (such as a PCI bridge) may indulge in
+     deferral if it so wishes; to flush a write, a read from the same location
+     is preferred[*], but a read from the same device or from configuration
+     space should suffice for PCI.
+
+     [*] NOTE! attempting to read from the same location as was written to may
+         cause a malfunction - consider the 16550 Rx/Tx serial registers for
+         example.
+
+     Used with prefetchable I/O memory, an mmiowb() barrier may be required to
+     force writes to be ordered.
+
+     Please refer to the PCI specification for more information on interactions
+     between PCI transactions.
+
+ (*) readX_relaxed()
+
+     These are similar to readX(), but are not guaranteed to be ordered in any
+     way.  Be aware that there is no I/O read barrier available.
+
+ (*) ioreadX(), iowriteX()
+
+     These will perform as appropriate for the type of access they're actually
+     doing, be it inX()/outX() or readX()/writeX().
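+
+By way of illustration, a driver would normally combine these accessors along
+the following lines (a minimal sketch; the device address, register offset and
+flag value are invented):
+
+	void __iomem *regs = ioremap(dev_phys_addr, 0x100);
+
+	writel(CTRL_DMA_ENABLE, regs + CTRL_REG);	/* posted write... */
+	(void) readl(regs + CTRL_REG);			/* ...flushed by reading back */
+
+The read from the device forces the preceding write out of any intermediary
+buffers before the driver continues.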
+
+
+==========
+REFERENCES
+==========
+
+AMD64 Architecture Programmer's Manual Volume 2: System Programming
+	Chapter 7.1: Memory-Access Ordering
+	Chapter 7.4: Buffering and Combining Memory Writes
+
+IA-32 Intel Architecture Software Developer's Manual, Volume 3:
+System Programming Guide
+	Chapter 7.1: Locked Atomic Operations
+	Chapter 7.2: Memory Ordering
+	Chapter 7.4: Serializing Instructions
+
+The SPARC Architecture Manual, Version 9
+	Chapter 8: Memory Models
+	Appendix D: Formal Specification of the Memory Models
+	Appendix J: Programming with the Memory Models
+
+UltraSPARC Programmer Reference Manual
+	Chapter 5: Memory Accesses and Cacheability
+	Chapter 15: Sparc-V9 Memory Models
+
+UltraSPARC III Cu User's Manual
+	Chapter 9: Memory Models
+
+UltraSPARC IIIi Processor User's Manual
+	Chapter 8: Memory Models
+
+UltraSPARC Architecture 2005
+	Chapter 9: Memory
+	Appendix D: Formal Specifications of the Memory Models
+
+UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005
+	Chapter 8: Memory Models
+	Appendix F: Caches and Cache Coherency
+
+Solaris Internals, Core Kernel Architecture, p63-68:
+	Chapter 3.3: Hardware Considerations for Locks and
+			Synchronization
+
+Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching
+for Kernel Programmers:
+	Chapter 13: Other Memory Models
+
+Intel Itanium Architecture Software Developer's Manual: Volume 1:
+	Section 2.6: Speculation
+	Section 4.4: Memory Access

From matthew at wil.cx  Thu Mar  9 06:40:37 2006
From: matthew at wil.cx (Matthew Wilcox)
Date: Wed, 8 Mar 2006 12:40:37 -0700
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: 
References: <20060308184500.GA17716@devserv.devel.redhat.com>
	<20060308173605.GB13063@devserv.devel.redhat.com>
	<20060308145506.GA5095@devserv.devel.redhat.com>
	<31492.1141753245@warthog.cambridge.redhat.com>
	<29826.1141828678@warthog.cambridge.redhat.com>
	<9834.1141837491@warthog.cambridge.redhat.com>
	<11922.1141842907@warthog.cambridge.redhat.com>
	<14275.1141844922@warthog.cambridge.redhat.com>
Message-ID: <20060308194037.GO7301@parisc-linux.org>

On Wed, Mar 08, 2006 at 11:26:41AM -0800, Linus Torvalds wrote:
> > I presume there only needs to be an mmiowb() here if you've got the
> > appropriate CPU's I/O memory window set up to be weakly ordered.
>
> Actually, since the different NUMA things may have different paths to the
> PCI thing, I don't think even the mmiowb() will really help. It has
> nothing to serialize _with_.
>
> It only orders mmio from within _one_ CPU and "path" to the destination.
> The IO might be posted somewhere on a PCI bridge, and depending on the
> posting rules, the mmiowb() just isn't relevant for IO coming through
> another path.

Looking at the SGI implementation, it's smarter than you think.  Looks
like there's a register in the local I/O hub that lets you determine
when this write has been queued in the appropriate host->pci bridge.
So by the time __sn_mmiowb() returns, you're guaranteed no other CPU
can bypass the write because the write's got far enough.
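
In outline, the idea is something like this (not the actual code, just the
shape of it):

	/*
	 * spin until the local I/O hub reports that all of this CPU's
	 * outstanding PIO writes have reached the target bridge
	 */
	while (hub_pending_pio_writes() > 0)
		cpu_relax();

where hub_pending_pio_writes() stands in for the real status register read.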
From jbarnes at virtuousgeek.org  Thu Mar  9 06:54:21 2006
From: jbarnes at virtuousgeek.org (Jesse Barnes)
Date: Wed, 8 Mar 2006 11:54:21 -0800
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: 
References: <20060308184500.GA17716@devserv.devel.redhat.com>
	<14275.1141844922@warthog.cambridge.redhat.com>
Message-ID: <200603081154.21960.jbarnes@virtuousgeek.org>

On Wednesday, March 8, 2006 11:26 am, Linus Torvalds wrote:
> But if you have a separate IO fabric and basically two different CPUs
> can get to one device through two different paths, no amount of write
> barriers of any kind will ever help you.

No, that's exactly the case that mmiowb() was designed to protect
against.  It ensures that your writes have arrived at the destination
bridge, which means after that point any other CPUs writing to the same
device will have their data actually hit the device afterwards.

Hopefully deviceiobook.tmpl makes that clear...

Jesse

From alan at redhat.com  Thu Mar  9 07:02:46 2006
From: alan at redhat.com (Alan Cox)
Date: Wed, 8 Mar 2006 15:02:46 -0500
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: 
References: <20060308184500.GA17716@devserv.devel.redhat.com>
	<20060308173605.GB13063@devserv.devel.redhat.com>
	<20060308145506.GA5095@devserv.devel.redhat.com>
	<31492.1141753245@warthog.cambridge.redhat.com>
	<29826.1141828678@warthog.cambridge.redhat.com>
	<9834.1141837491@warthog.cambridge.redhat.com>
	<11922.1141842907@warthog.cambridge.redhat.com>
	<14275.1141844922@warthog.cambridge.redhat.com>
Message-ID: <20060308200246.GA17886@devserv.devel.redhat.com>

On Wed, Mar 08, 2006 at 11:26:41AM -0800, Linus Torvalds wrote:
> Actually, since the different NUMA things may have different paths to the
> PCI thing, I don't think even the mmiowb() will really help. It has
> nothing to serialize _with_.

It serializes to the bridge. On the Altix for example this is done by
reading a local status register with the pending write count in it and
waiting until the chip reports the write has propagated across the
fabric. At that point it has hit the bridge and the usual PCI posting
applies, but the PCI ordering rule will also apply so the write won't
be passed by another write issued after the spinlock is then dropped.

> The IO might be posted somewhere on a PCI bridge, and depending on the
> posting rules, the mmiowb() just isn't relevant for IO coming through
> another path.

Yes. mmiowb only serializes to the bridge. That's how it is defined in
the documentation. That's enough to sort out things like the example
with locks, but where a read from the device would be overkill.

> general, or even very commonly. The undeniable fact is that "big NUMA"
> machines need to validate the drivers they use separately. The fact that
> it works on a normal PC - and that it's been tested to death there - does
> not guarantee much of anything.

mmiowb comes about from the Altix folks strangely enough.

> The good news, of course, is that you don't use that kind of "big NUMA"
> system the same way you'd use a regular desktop SMP. You don't plug in
> random devices into it and just expect them to work.
> I'd hope ;)

Various core drivers like tg3 use mmiowb()

Alan

From dhowells at redhat.com  Thu Mar  9 07:16:11 2006
From: dhowells at redhat.com (David Howells)
Date: Wed, 08 Mar 2006 20:16:11 +0000
Subject: [PATCH] Document Linux's memory barriers
In-Reply-To: <20060308161829.GC3669@elf.ucw.cz>
References: <20060308161829.GC3669@elf.ucw.cz>
	<31492.1141753245@warthog.cambridge.redhat.com>
Message-ID: <24309.1141848971@warthog.cambridge.redhat.com>

Pavel Machek wrote:

> > + (*) set_mb(var, value)
> > + (*) set_wmb(var, value)
> > +
> > +     These assign the value to the variable and then insert at least a write
> > +     barrier after it, depending on the function.
> > +
>
> I... don't understand what these do. Better explanation would
> help... what is "function" here?

I can only guess, and hope someone corrects me if I'm wrong.

> Does it try to say that set_mb(var, value) is equivalent to var =
> value; mb();

Yes.

> but here mb() affects that one variable, only?

No. set_*mb() is simply a canned sequence of assignment, memory barrier. The
type of barrier inserted depends on which function you choose. set_mb()
inserts an mb() and set_wmb() inserts a wmb().

> "LOCK access"?

The LOCK and UNLOCK functions presumably make at least one memory write apiece
to manipulate the target lock (on SMP at least).

> Does it try to say that ...will be completed after any access inside lock
> region is completed?

No. What you get in effect is something like:

	LOCK { *lock = q; }
	*A = a;
	*B = b;
	UNLOCK { *lock = u; }

Except that the accesses to the lock memory are made using special procedures
(LOCK prefixed instructions, XCHG, CAS/CMPXCHG, LL/SC, etc).

> This makes it sound like pentium-III+ is incompatible with previous
> CPUs. Is it really the case?

Yes - hence the alternative instruction stuff.

David

From pavel at ucw.cz  Thu Mar  9 03:18:29 2006
From: pavel at ucw.cz (Pavel Machek)
Date: Wed, 8 Mar 2006 17:18:29 +0100
Subject: [PATCH] Document Linux's memory barriers
In-Reply-To: <31492.1141753245@warthog.cambridge.redhat.com>
References: <31492.1141753245@warthog.cambridge.redhat.com>
Message-ID: <20060308161829.GC3669@elf.ucw.cz>

Hi!

> +There are some more advanced barriering functions:
> +
> + (*) set_mb(var, value)
> + (*) set_wmb(var, value)
> +
> +     These assign the value to the variable and then insert at least a write
> +     barrier after it, depending on the function.
> +

I... don't understand what these do. Better explanation would
help... what is "function" here?

Does it try to say that set_mb(var, value) is equivalent to var =
value; mb();

but here mb() affects that one variable, only?

> +In all cases there are variants on a LOCK operation and an UNLOCK operation.
> +
> + (*) LOCK operation implication:
> +
> +     Memory accesses issued after the LOCK will be completed after the LOCK
> +     accesses have completed.

"LOCK access"?

Does it try to say that ...will be completed after any access inside lock
region is completed?

("LOCK" looks very much like well-known i386 prefix. Calling it
*_lock() or something would avoid that confusion. Fortunately there's
no UNLOCK instruction :-)

> + (*) UNLOCK operation implication:
> +
> +     Memory accesses issued before the UNLOCK will be completed before the
> +     UNLOCK accesses have completed.
> +
> +     Memory accesses issued after the UNLOCK may be completed before the UNLOCK
> +     accesses have completed.
> +
> + (*) LOCK vs UNLOCK implication:
> +
> +     The LOCK accesses will be completed before the unlock accesses.
                                                       ~~~~~~
capital? Or lower it everywhere?
> +==============================
> +I386 AND X86_64 SPECIFIC NOTES
> +==============================
> +
> +Earlier i386 CPUs (pre-Pentium-III) are fully ordered - the operations on the
> +bus appear in program order - and so there's no requirement for any sort of
> +explicit memory barriers.
> +
> +From the Pentium-III onwards were three new memory barrier instructions:
> +LFENCE, SFENCE and MFENCE which correspond to the kernel memory barrier
> +functions rmb(), wmb() and mb(). However, there are additional implicit memory
> +barriers in the CPU implementation:
> +
> + (*) Normal writes imply a semi-rmb(): reads before a write may not complete
> +     after that write, but reads after a write may complete before the write
> +     (ie: reads may go _ahead_ of writes).

This makes it sound like pentium-III+ is incompatible with previous
CPUs. Is it really the case?

								Pavel
-- 
Web maintainer for suspend.sf.net (www.sf.net/projects/suspend) wanted...

From benh at kernel.crashing.org  Thu Mar  9 08:31:56 2006
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Thu, 09 Mar 2006 08:31:56 +1100
Subject: dead hvc_console with kdump kernel
In-Reply-To: <4e714f8f51a3ec10b00a99cb45d35f9f@bga.com>
References: <200603061329.08262.michael@ellerman.id.au>
	<200603061726.35175.michael@ellerman.id.au>
	<200603081736.41406.michael@ellerman.id.au>
	<4e714f8f51a3ec10b00a99cb45d35f9f@bga.com>
Message-ID: <1141853516.11221.174.camel@localhost.localdomain>


> On real xics it is safe to do in the initial register irq.  Don't
> do it every irq register though.
>
> Hypervisor will cover us for emulated xics if we do that.
>
> For real mpic?  I don't know.  Maybe we do an mpic reset and that
> will cover us?  Ben?

We should probably eoi all pending interrupts. That's especially true
with machines with HT APICs like the js2x or the quad g5 since if we
don't EOI on the APIC, the interrupt will remain blocked.

Ben

From benh at kernel.crashing.org  Thu Mar  9 08:34:44 2006
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Thu, 09 Mar 2006 08:34:44 +1100
Subject: dead hvc_console with kdump kernel
In-Reply-To: <4e714f8f51a3ec10b00a99cb45d35f9f@bga.com>
References: <200603061329.08262.michael@ellerman.id.au>
	<200603061726.35175.michael@ellerman.id.au>
	<200603081736.41406.michael@ellerman.id.au>
	<4e714f8f51a3ec10b00a99cb45d35f9f@bga.com>
Message-ID: <1141853684.11221.178.camel@localhost.localdomain>

To be more complete, what about a loop that iterates all irq_desc, and
for each of them does

	disable_irq()
	if (desc->handler->end)
		desc->handler->end()

Or something like that... you could try to test the PENDING and
INPROGRESS flags maybe though that wouldn't handle IPIs (but then, I
think there should be no problem with those if we do an MPIC reset, not
100% clear there)

Ben.

From paulus at samba.org  Thu Mar  9 08:49:50 2006
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 9 Mar 2006 08:49:50 +1100
Subject: [PATCH] Document Linux's memory barriers
In-Reply-To: <28393.1141823992@warthog.cambridge.redhat.com>
References: <17422.19209.60360.178668@cargo.ozlabs.ibm.com>
	<31492.1141753245@warthog.cambridge.redhat.com>
	<28393.1141823992@warthog.cambridge.redhat.com>
Message-ID: <17423.20862.764098.732463@cargo.ozlabs.ibm.com>

David Howells writes:

> > Enabling/disabling interrupts doesn't imply a barrier on powerpc, and
> > nor does taking an interrupt or returning from one.
>
> Surely it ought to, otherwise what's to stop accesses done with interrupts
> disabled crossing with accesses done inside an interrupt handler?
The rule that the CPU always sees its own loads and stores in program
order.

If a CPU takes an interrupt after doing some stores, and the interrupt
handler does loads from the same location(s), it has to see the new
values, even if they haven't got to memory yet.

The interrupt isn't special in this situation; if the instruction
stream has a store to a location followed by a load from it, the load
*has* to see the value stored by the store (assuming no other store to
the same location in the meantime, of course).  That's true whether or
not the CPU takes an exception or interrupt between the store and the
load.  Anything else would make programming really ... um ...
interesting. :)

> > > +Either interrupt disablement (LOCK) and enablement (UNLOCK) will barrier
> ...
> > I don't think this is right, and I don't think it is necessary to
> > achieve the end you state, since a CPU will always see its own memory
> > accesses in program order.
>
> But what about a driver accessing some memory that its device is going to
> observe under irq disablement, and then getting an interrupt immediately after
> from that same device, the handler for which communicates with the device,
> possibly then being broken because the CPU hasn't completed all the memory
> accesses that the driver made while interrupts are disabled?

Well, we have to be clear about what causes what here.  Is the device
accessing this memory just at a random time, or is the access caused
by (in response to) an MMIO store?  And what causes the interrupt?
Does it just happen to come along at this time or is it in response to
one of the stores?

If the device accesses to memory are in response to an MMIO store,
then the code needs an explicit wmb() between the memory stores and
the MMIO store.  Disabling interrupts isn't going to help here because
the device doesn't see the CPU interrupt enable state.

In general it is possible for the CPU to see a different state of
memory than the device sees.  If the driver needs to be sure that they
both see the same view then it needs to use some sort of
synchronization.  A memory barrier followed by a store to the device,
with no further stores to memory until we have an indication from the
device that it has received the MMIO store, would be a suitable way to
synchronize.  Enabling or disabling interrupts does nothing useful
here because the device doesn't see that.  That applies whether we are
in an interrupt routine or not.

Do you have a specific scenario in mind, with a particular device and
driver?

One thing that driver writers do need to be careful about is that if a
device writes some data to memory and then causes an interrupt, the
fact that the interrupt has reached the CPU and the CPU has invoked
the driver's interrupt routine does *not* mean that the data has got
to memory from the CPU's point of view.  The data could still be
queued up in the PCI host bridge or elsewhere.  Doing an MMIO read
from the device is sufficient to ensure that the CPU will then see the
correct data in memory.

> Alternatively, might it be possible for communications between two CPUs to be
> stuffed because one took an interrupt that also modified common data before
> it had committed the memory accesses done under interrupt disablement?
> This would suggest using a lock though.

Disabling interrupts doesn't do *anything* to help with communication
between CPUs.  You have to use locks or explicit barriers for that.
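
(To make the device case above concrete, the pattern needed is roughly:

	ring->desc[i].addr = buf_dma;	/* set up the descriptor in memory... */
	ring->desc[i].len = len;
	wmb();				/* ...order those stores... */
	writel(GO, dev->doorbell);	/* ...before the MMIO store that
					   kicks the device */

with made-up names, of course.)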
It is possible for one CPU to see memory accesses done by another CPU
in a different order from the program order on the CPU that did the
accesses.  That applies whether or not some of the accesses were done
inside an interrupt routine.

> > What does *F+*A mean?
>
> Combined accesses.

Still opaque, sorry: you mean they both happen in some unspecified
order?

> > Well, the driver should *not* be doing *ADR at all, it should be using
> > read[bwl]/write[bwl].  The architecture code has to implement
> > read*/write* in such a way that the accesses generated can't be
> > reordered.  I _think_ it also has to make sure the write accesses
> > can't be write-combined, but it would be good to have that clarified.
>
> Then what use mmiowb()?

That was introduced to help some platforms that have difficulty
ensuring that MMIO accesses hit the device in the right order, IIRC.
I'm still not entirely clear on exactly where it's needed or what
guarantees you can rely on if you do or don't use it.

> Surely write combining and out-of-order reads are reasonable for cacheable
> devices like framebuffers.

They are.  read*/write* to non-cacheable non-prefetchable MMIO
shouldn't be reordered or write-combined, but for prefetchable MMIO
I'm not sure whether read*/write* should allow reordering, or whether
drivers should use __raw_read/write* if they want that.  (Of course,
with the __raw_ functions they don't get the endian conversion
either...)

Paul.

From alan at lxorguk.ukuu.org.uk  Thu Mar  9 09:01:44 2006
From: alan at lxorguk.ukuu.org.uk (Alan Cox)
Date: Wed, 08 Mar 2006 22:01:44 +0000
Subject: [PATCH] Document Linux's memory barriers
In-Reply-To: <24309.1141848971@warthog.cambridge.redhat.com>
References: <20060308161829.GC3669@elf.ucw.cz>
	<31492.1141753245@warthog.cambridge.redhat.com>
	<24309.1141848971@warthog.cambridge.redhat.com>
Message-ID: <1141855305.10606.6.camel@localhost.localdomain>

On Mer, 2006-03-08 at 20:16 +0000, David Howells wrote:
> The LOCK and UNLOCK functions presumably make at least one memory write apiece
> to manipulate the target lock (on SMP at least).

No they merely perform the bus transactions necessary to perform an
update atomically. They are however "serializing" instructions which
means they do cause a certain amount of serialization (see the intel
architecture manual on serializing instructions for detail).

Athlon and later know how to turn it from locked memory accesses into
merely an exclusive cache line grab.

> > This makes it sound like pentium-III+ is incompatible with previous
> > CPUs. Is it really the case?
>
> Yes - hence the alternative instruction stuff.

It is the case for certain specialist instructions and the fences are
provided to go with those but can also help in other cases. PIII and
later in particular support explicit non temporal stores.

From alan at lxorguk.ukuu.org.uk  Thu Mar  9 09:05:14 2006
From: alan at lxorguk.ukuu.org.uk (Alan Cox)
Date: Wed, 08 Mar 2006 22:05:14 +0000
Subject: [PATCH] Document Linux's memory barriers
In-Reply-To: <17423.20862.764098.732463@cargo.ozlabs.ibm.com>
References: <17422.19209.60360.178668@cargo.ozlabs.ibm.com>
	<31492.1141753245@warthog.cambridge.redhat.com>
	<28393.1141823992@warthog.cambridge.redhat.com>
	<17423.20862.764098.732463@cargo.ozlabs.ibm.com>
Message-ID: <1141855514.10606.10.camel@localhost.localdomain>

On Iau, 2006-03-09 at 08:49 +1100, Paul Mackerras wrote:
> If the device accesses to memory are in response to an MMIO store,
> then the code needs an explicit wmb() between the memory stores and
> the MMIO store.
> Disabling interrupts isn't going to help here because
> the device doesn't see the CPU interrupt enable state.

Interrupts are themselves entirely asynchronous anyway. The following
can occur on SMP Pentium-PIII:

	Device				CPU
	Raise IRQ
					writel(MASK_IRQ, &dev->ctrl);
					readl(&dev->ctrl);
	IRQ arrives

CPU specific IRQ masking is synchronous, but IRQ delivery is not,
including IPI delivery (which is asynchronous and not guaranteed to
occur only once per IPI but can be replayed in obscure cases on x86).

From paulus at samba.org  Thu Mar  9 09:01:57 2006
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 9 Mar 2006 09:01:57 +1100
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: <20060308145506.GA5095@devserv.devel.redhat.com>
References: <31492.1141753245@warthog.cambridge.redhat.com>
	<29826.1141828678@warthog.cambridge.redhat.com>
	<20060308145506.GA5095@devserv.devel.redhat.com>
Message-ID: <17423.21589.385336.68518@cargo.ozlabs.ibm.com>

Alan Cox writes:

> On Wed, Mar 08, 2006 at 02:37:58PM +0000, David Howells wrote:
> > + (*) reads can be done speculatively, and then the result discarded should it
> > +     prove not to be required;
>
> That might be worth an example with an if() because PPC will do this and if
> it's a read with a side effect (eg I/O space) you get singed..

On PPC machines, the PTE has a bit called G (for Guarded) which
indicates that the memory mapped by it has side effects.  It prevents
the CPU from doing speculative accesses (i.e. the CPU can't send out a
load from the page until it knows for sure that the program will get
to that instruction) and from prefetching from the page.

The kernel sets G=1 on MMIO and PIO pages in general, as you would
expect, although you can get G=0 mappings for framebuffers etc. if you
ask specifically for that.

Paul.

From paulus at samba.org  Thu Mar  9 09:10:49 2006
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 9 Mar 2006 09:10:49 +1100
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: <10095.1141838381@warthog.cambridge.redhat.com>
References: <20060308154157.GI7301@parisc-linux.org>
	<31492.1141753245@warthog.cambridge.redhat.com>
	<29826.1141828678@warthog.cambridge.redhat.com>
	<20060308145506.GA5095@devserv.devel.redhat.com>
	<10095.1141838381@warthog.cambridge.redhat.com>
Message-ID: <17423.22121.254026.487964@cargo.ozlabs.ibm.com>

David Howells writes:

> > # define smp_read_barrier_depends()	do { } while(0)
>
> What's this one meant to do?

On most CPUs, if you load one value and use the value you get to
compute the address for a second load, there is an implicit read
barrier between the two loads because of the dependency.  That's not
true on alpha, apparently, because of the way their caches are
structured.  The smp_read_barrier_depends is a read barrier that you
use between two loads when there is already a dependency between the
loads, and it is a no-op on everything except alpha (IIRC).

Paul.

From paulus at samba.org  Thu Mar  9 09:06:05 2006
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 9 Mar 2006 09:06:05 +1100
Subject: [PATCH] Document Linux's memory barriers
In-Reply-To: <200603080925.19425.duncan.sands@math.u-psud.fr>
References: <1141756825.31814.75.camel@localhost.localdomain>
	<31492.1141753245@warthog.cambridge.redhat.com>
	<9551.1141762147@warthog.cambridge.redhat.com>
	<200603080925.19425.duncan.sands@math.u-psud.fr>
Message-ID: <17423.21837.304330.623519@cargo.ozlabs.ibm.com>

Duncan Sands writes:

> On UP you at least need compiler barriers, right?
> You're in trouble if you think you are writing in a certain order, and
> expect to see the same order from an interrupt handler, but the compiler
> decided to rearrange the order of the writes...

I'd be interested to know what the C standard says about whether the
compiler can reorder writes that may be visible to a signal handler.
An interrupt handler in the kernel is logically equivalent to a signal
handler in normal C code.

Surely there are some C language lawyers on one of the lists that this
thread is going to?

Paul.

From davem at davemloft.net  Thu Mar  9 09:23:26 2006
From: davem at davemloft.net (David S. Miller)
Date: Wed, 08 Mar 2006 14:23:26 -0800 (PST)
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: <17423.21589.385336.68518@cargo.ozlabs.ibm.com>
References: <29826.1141828678@warthog.cambridge.redhat.com>
	<20060308145506.GA5095@devserv.devel.redhat.com>
	<17423.21589.385336.68518@cargo.ozlabs.ibm.com>
Message-ID: <20060308.142326.116048199.davem@davemloft.net>

From: Paul Mackerras 
Date: Thu, 9 Mar 2006 09:01:57 +1100

> On PPC machines, the PTE has a bit called G (for Guarded) which
> indicates that the memory mapped by it has side effects.  It prevents
> the CPU from doing speculative accesses (i.e. the CPU can't send out a
> load from the page until it knows for sure that the program will get
> to that instruction) and from prefetching from the page.
>
> The kernel sets G=1 on MMIO and PIO pages in general, as you would
> expect, although you can get G=0 mappings for framebuffers etc. if you
> ask specifically for that.

Sparc64 has a similar PTE bit called "E" for "side-Effect".  And we
also do the same thing as powerpc for framebuffers.

Note that on sparc64 in our asm/io.h PIO/MMIO accessor macros we use
physical addresses, so we don't have to map anything in ioremap(),
and use a special address space identifier on the loads and stores
that indicates "E" behavior is desired.

From davem at davemloft.net  Thu Mar  9 09:24:01 2006
From: davem at davemloft.net (David S. Miller)
Date: Wed, 08 Mar 2006 14:24:01 -0800 (PST)
Subject: [PATCH] Document Linux's memory barriers
In-Reply-To: <17423.21837.304330.623519@cargo.ozlabs.ibm.com>
References: <9551.1141762147@warthog.cambridge.redhat.com>
	<200603080925.19425.duncan.sands@math.u-psud.fr>
	<17423.21837.304330.623519@cargo.ozlabs.ibm.com>
Message-ID: <20060308.142401.72886733.davem@davemloft.net>

From: Paul Mackerras 
Date: Thu, 9 Mar 2006 09:06:05 +1100

> I'd be interested to know what the C standard says about whether the
> compiler can reorder writes that may be visible to a signal handler.
> An interrupt handler in the kernel is logically equivalent to a signal
> handler in normal C code.
>
> Surely there are some C language lawyers on one of the lists that this
> thread is going to?

Just like for setjmp() I think you have to mark such things
as volatile.

From torvalds at osdl.org  Thu Mar  9 09:31:33 2006
From: torvalds at osdl.org (Linus Torvalds)
Date: Wed, 8 Mar 2006 14:31:33 -0800 (PST)
Subject: [PATCH] Document Linux's memory barriers
In-Reply-To: <20060308.142401.72886733.davem@davemloft.net>
References: <9551.1141762147@warthog.cambridge.redhat.com>
	<200603080925.19425.duncan.sands@math.u-psud.fr>
	<17423.21837.304330.623519@cargo.ozlabs.ibm.com>
	<20060308.142401.72886733.davem@davemloft.net>
Message-ID: 

On Wed, 8 Mar 2006, David S. Miller wrote:
>
> Just like for setjmp() I think you have to mark such things
> as volatile.

.. and sig_atomic_t.
		Linus

From alan at lxorguk.ukuu.org.uk  Thu Mar  9 09:42:29 2006
From: alan at lxorguk.ukuu.org.uk (Alan Cox)
Date: Wed, 08 Mar 2006 22:42:29 +0000
Subject: [PATCH] Document Linux's memory barriers
In-Reply-To: <17423.21837.304330.623519@cargo.ozlabs.ibm.com>
References: <1141756825.31814.75.camel@localhost.localdomain>
	<31492.1141753245@warthog.cambridge.redhat.com>
	<9551.1141762147@warthog.cambridge.redhat.com>
	<200603080925.19425.duncan.sands@math.u-psud.fr>
	<17423.21837.304330.623519@cargo.ozlabs.ibm.com>
Message-ID: <1141857749.10606.17.camel@localhost.localdomain>

On Iau, 2006-03-09 at 09:06 +1100, Paul Mackerras wrote:
> I'd be interested to know what the C standard says about whether the
> compiler can reorder writes that may be visible to a signal handler.
> An interrupt handler in the kernel is logically equivalent to a signal
> handler in normal C code.

The C standard doesn't have much to say. POSIX has a lot to say and yes
it can do this. You do need volatile or store barriers in signal touched
code quite often, or for that matter locks.

POSIX/SuS also has stuff to say about what functions are signal safe
and what is not allowed.

Alan

From michael at ellerman.id.au  Thu Mar  9 10:57:02 2006
From: michael at ellerman.id.au (Michael Ellerman)
Date: Thu, 9 Mar 2006 10:57:02 +1100
Subject: dead hvc_console with kdump kernel
In-Reply-To: <20060308114534.GA18522@suse.de>
References: <200603081736.41406.michael@ellerman.id.au>
	<20060308114534.GA18522@suse.de>
Message-ID: <200603091057.03476.michael@ellerman.id.au>

On Wed, 8 Mar 2006 22:45, Olaf Hering wrote:
> On Wed, Mar 08, Michael Ellerman wrote:
> > So we take an interrupt, and while we're processing it we decide we
> > should panic, this happens for example if we panic via sysrq.
>
> You are right, sleep 2 ; echo c > /proc/sysrq-trigger ; sleep 2 gives me
> a working console.

Yep, and with this patch, my veth (irq 185) is dead in the second kernel.

Index: kdump/arch/powerpc/kernel/irq.c
===================================================================
--- kdump.orig/arch/powerpc/kernel/irq.c	2006-03-07 10:00:41.000000000 +1100
+++ kdump/arch/powerpc/kernel/irq.c	2006-03-08 17:22:44.000000000 +1100
@@ -40,6 +40,7 @@
 #include
 #include
 #include
+#include <linux/debugfs.h>
 #include
 #include
 #include
@@ -181,6 +182,16 @@ void fixup_irqs(cpumask_t map)
 }
 #endif
 
+static u32 dodgy_hack_should_crash = 0;
+
+void dodgy_hack_init(void)
+{
+	debugfs_create_bool("dodgy_hack_should_crash",
+			    S_IFREG | (S_IRWXUGO & ~S_IXUGO),
+			    NULL, &dodgy_hack_should_crash);
+}
+__initcall(dodgy_hack_init);
+
 void do_IRQ(struct pt_regs *regs)
 {
 	int irq;
@@ -214,6 +225,11 @@ void do_IRQ(struct pt_regs *regs)
 	 */
 	irq = ppc_md.get_irq(regs);
 
+	if (dodgy_hack_should_crash && irq == 185) {
+		printk("do_IRQ: got irq %d, crashing.\n", irq);
+		crash_kexec(regs);
+	}
+
 	if (irq >= 0) {
 #ifdef CONFIG_IRQSTACKS
 		/* Switch to the irq stack to handle this */

From david at gibson.dropbear.id.au  Thu Mar  9 11:08:07 2006
From: david at gibson.dropbear.id.au (David Gibson)
Date: Thu, 9 Mar 2006 11:08:07 +1100
Subject: asm/cputable.h sparse warnings
In-Reply-To: <20060308151939.GA12762@lst.de>
References: <20060308151939.GA12762@lst.de>
Message-ID: <20060309000807.GD17590@localhost.localdomain>

On Wed, Mar 08, 2006 at 04:19:39PM +0100, Christoph Hellwig wrote:
> Looks like not many people are running sparse on powerpc ;-)  For every
> file compiled I get the churn of warnings below.
> The reason seems to be that it's using large values in enums, something
> that's very murky in the C standards, and gcc adds even less well-defined
> extensions to it that make this code work in practice.  I think the only
> sane fix is to switch the cputype constants to cpp macros, although that'd
> make the file a lot larger..

Removing the enums is problematical, because the method we use with
#ifdefs to get CPU_FTRS_POSSIBLE and CPU_FTRS_ALWAYS correct won't
work with cpp defines.  Or at least defining POSSIBLE and ALWAYS as
macros would be not only much longer, but way harder to maintain
correctly (in fact, I would tend to say impossibly hard).

I was working on a patch to compute the POSSIBLE and ALWAYS masks from
the actual cpu table data structure in cputable.c, which would be
nicer than the #ifdef mess in cputable.h anyway.  I got bogged down in
Kbuild dependency hell and other logistical details, though.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

From paulus at samba.org  Thu Mar  9 11:35:17 2006
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 9 Mar 2006 11:35:17 +1100
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: <19984.1141846302@warthog.cambridge.redhat.com>
References: <20060308184500.GA17716@devserv.devel.redhat.com>
	<20060308173605.GB13063@devserv.devel.redhat.com>
	<20060308145506.GA5095@devserv.devel.redhat.com>
	<31492.1141753245@warthog.cambridge.redhat.com>
	<29826.1141828678@warthog.cambridge.redhat.com>
	<9834.1141837491@warthog.cambridge.redhat.com>
	<11922.1141842907@warthog.cambridge.redhat.com>
	<14275.1141844922@warthog.cambridge.redhat.com>
	<19984.1141846302@warthog.cambridge.redhat.com>
Message-ID: <17423.30789.214209.462657@cargo.ozlabs.ibm.com>

David Howells writes:

> On NUMA PowerPC, should mmiowb() be a SYNC or an EIEIO instruction then? Those
> do inter-component synchronisation.

We actually have quite heavy synchronization in read*/write* on PPC,
and mmiowb can safely be a no-op.  It would be nice to be able to have
lighter-weight synchronization, but I'm sure we would see lots of
subtle driver bugs cropping up if we did.

write* do a full memory barrier (sync) after the store, and read*
explicitly wait for the data to come back before.

If you ask me, the need for mmiowb on some platforms merely shows that
those platforms' implementations of spinlocks and read*/write* are
buggy...

Paul.

From sfr at canb.auug.org.au  Thu Mar  9 11:50:36 2006
From: sfr at canb.auug.org.au (Stephen Rothwell)
Date: Thu, 9 Mar 2006 11:50:36 +1100
Subject: asm/cputable.h sparse warnings
In-Reply-To: <20060308151939.GA12762@lst.de>
References: <20060308151939.GA12762@lst.de>
Message-ID: <20060309115036.0bb4bec4.sfr@canb.auug.org.au>

Hi Christoph,

On Wed, 8 Mar 2006 16:19:39 +0100 Christoph Hellwig wrote:
>
> Looks like not many people are running sparse on powerpc ;-)  For every
> file compiled I get the churn of warnings below.  The reason seems to be
> that it's using large values in enums, something that's very murky in
> the C standards, and gcc adds even less well-defined extensions to it
> that make this code work in practice.  I think the only sane fix is
> to switch the cputype constants to cpp macros, although that'd make the
> file a lot larger..

Does the following patch help?  I don't think we necessarily want to do
this, but we could.
-- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ diff --git a/include/asm-powerpc/cputable.h b/include/asm-powerpc/cputable.h index 99d12ff..a33ce33 100644 --- a/include/asm-powerpc/cputable.h +++ b/include/asm-powerpc/cputable.h @@ -186,153 +186,154 @@ extern void do_cpu_ftr_fixups(unsigned l !defined(CONFIG_POWER3) && !defined(CONFIG_POWER4) && \ !defined(CONFIG_BOOKE)) -enum { - CPU_FTRS_PPC601 = CPU_FTR_COMMON | CPU_FTR_601 | CPU_FTR_HPTE_TABLE, - CPU_FTRS_603 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | - CPU_FTR_MAYBE_CAN_NAP, - CPU_FTRS_604 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | CPU_FTR_604_PERF_MON | CPU_FTR_HPTE_TABLE, - CPU_FTRS_740_NOTAU = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | - CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP, - CPU_FTRS_740 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | - CPU_FTR_TAU | CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP, - CPU_FTRS_750 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | - CPU_FTR_TAU | CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP, - CPU_FTRS_750FX1 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | - CPU_FTR_TAU | CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP | - CPU_FTR_DUAL_PLL_750FX | CPU_FTR_NO_DPM, - CPU_FTRS_750FX2 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | - CPU_FTR_TAU | CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP | - CPU_FTR_NO_DPM, - CPU_FTRS_750FX = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | - CPU_FTR_TAU | CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP | - CPU_FTR_DUAL_PLL_750FX | CPU_FTR_HAS_HIGH_BATS, - CPU_FTRS_750GX = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_MAYBE_CAN_DOZE | - CPU_FTR_USE_TB | CPU_FTR_L2CR | CPU_FTR_TAU | - CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP | - CPU_FTR_DUAL_PLL_750FX | CPU_FTR_HAS_HIGH_BATS, - CPU_FTRS_7400_NOTAU = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | - CPU_FTR_ALTIVEC_COMP | CPU_FTR_HPTE_TABLE | - CPU_FTR_MAYBE_CAN_NAP, - CPU_FTRS_7400 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | - CPU_FTR_TAU | CPU_FTR_ALTIVEC_COMP | CPU_FTR_HPTE_TABLE | - CPU_FTR_MAYBE_CAN_NAP, - CPU_FTRS_7450_20 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | - CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | - CPU_FTR_NEED_COHERENT, - CPU_FTRS_7450_21 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | - CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | - CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | - CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_L3_DISABLE_NAP | - CPU_FTR_NEED_COHERENT, - CPU_FTRS_7450_23 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | - CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | - CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | - CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_NEED_COHERENT, - CPU_FTRS_7455_1 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | - CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | CPU_FTR_L3CR | - CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | CPU_FTR_HAS_HIGH_BATS | - CPU_FTR_NEED_COHERENT, - CPU_FTRS_7455_20 = CPU_FTR_COMMON | 
CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | - CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | - CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | - CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_L3_DISABLE_NAP | - CPU_FTR_NEED_COHERENT | CPU_FTR_HAS_HIGH_BATS, - CPU_FTRS_7455 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | - CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | - CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | - CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_HAS_HIGH_BATS | - CPU_FTR_NEED_COHERENT, - CPU_FTRS_7447_10 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | - CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | - CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | - CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_HAS_HIGH_BATS | - CPU_FTR_NEED_COHERENT | CPU_FTR_NO_BTIC, - CPU_FTRS_7447 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | - CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | - CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | - CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_HAS_HIGH_BATS | - CPU_FTR_NEED_COHERENT, - CPU_FTRS_7447A = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | - CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | - CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | - CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_HAS_HIGH_BATS | - CPU_FTR_NEED_COHERENT, - CPU_FTRS_82XX = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB, - CPU_FTRS_G2_LE = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_MAYBE_CAN_DOZE | - CPU_FTR_USE_TB | CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_HAS_HIGH_BATS, - CPU_FTRS_E300 = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_MAYBE_CAN_DOZE | - CPU_FTR_USE_TB | CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_HAS_HIGH_BATS | - CPU_FTR_COMMON, - CPU_FTRS_CLASSIC32 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | CPU_FTR_HPTE_TABLE, - CPU_FTRS_POWER3_32 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | CPU_FTR_HPTE_TABLE, - CPU_FTRS_POWER4_32 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | CPU_FTR_HPTE_TABLE | CPU_FTR_NODSISRALIGN, - CPU_FTRS_970_32 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | CPU_FTR_HPTE_TABLE | CPU_FTR_ALTIVEC_COMP | - CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_NODSISRALIGN, - CPU_FTRS_8XX = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB, - CPU_FTRS_40X = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | - CPU_FTR_NODSISRALIGN, - CPU_FTRS_44X = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | - CPU_FTR_NODSISRALIGN, - CPU_FTRS_E200 = CPU_FTR_USE_TB | CPU_FTR_NODSISRALIGN, - CPU_FTRS_E500 = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | - CPU_FTR_NODSISRALIGN, - CPU_FTRS_E500_2 = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | - CPU_FTR_BIG_PHYS | CPU_FTR_NODSISRALIGN, - CPU_FTRS_GENERIC_32 = CPU_FTR_COMMON | CPU_FTR_NODSISRALIGN, +#define CPU_FTRS_PPC601 (CPU_FTR_COMMON | CPU_FTR_601 | CPU_FTR_HPTE_TABLE) +#define CPU_FTRS_603 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | \ + CPU_FTR_MAYBE_CAN_NAP) +#define CPU_FTRS_604 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | CPU_FTR_604_PERF_MON | CPU_FTR_HPTE_TABLE) +#define CPU_FTRS_740_NOTAU (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | \ + CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP) +#define CPU_FTRS_740 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | \ + CPU_FTR_TAU | CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP) +#define 
CPU_FTRS_750 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | \ + CPU_FTR_TAU | CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP) +#define CPU_FTRS_750FX1 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | \ + CPU_FTR_TAU | CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP | \ + CPU_FTR_DUAL_PLL_750FX | CPU_FTR_NO_DPM) +#define CPU_FTRS_750FX2 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | \ + CPU_FTR_TAU | CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP | \ + CPU_FTR_NO_DPM) +#define CPU_FTRS_750FX (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | \ + CPU_FTR_TAU | CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP | \ + CPU_FTR_DUAL_PLL_750FX | CPU_FTR_HAS_HIGH_BATS) +#define CPU_FTRS_750GX (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_MAYBE_CAN_DOZE | \ + CPU_FTR_USE_TB | CPU_FTR_L2CR | CPU_FTR_TAU | \ + CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP | \ + CPU_FTR_DUAL_PLL_750FX | CPU_FTR_HAS_HIGH_BATS) +#define CPU_FTRS_7400_NOTAU (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | \ + CPU_FTR_ALTIVEC_COMP | CPU_FTR_HPTE_TABLE | \ + CPU_FTR_MAYBE_CAN_NAP) +#define CPU_FTRS_7400 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | \ + CPU_FTR_TAU | CPU_FTR_ALTIVEC_COMP | CPU_FTR_HPTE_TABLE | \ + CPU_FTR_MAYBE_CAN_NAP) +#define CPU_FTRS_7450_20 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | \ + CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | \ + CPU_FTR_NEED_COHERENT) +#define CPU_FTRS_7450_21 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | \ + CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | \ + CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | \ + CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_L3_DISABLE_NAP | \ + CPU_FTR_NEED_COHERENT) +#define CPU_FTRS_7450_23 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | \ + CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | \ + CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | \ + CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_NEED_COHERENT) +#define CPU_FTRS_7455_1 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | \ + CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | CPU_FTR_L3CR | \ + CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | CPU_FTR_HAS_HIGH_BATS | \ + CPU_FTR_NEED_COHERENT) +#define CPU_FTRS_7455_20 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | \ + CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | \ + CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | \ + CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_L3_DISABLE_NAP | \ + CPU_FTR_NEED_COHERENT | CPU_FTR_HAS_HIGH_BATS) +#define CPU_FTRS_7455 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | \ + CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | \ + CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | \ + CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_HAS_HIGH_BATS | \ + CPU_FTR_NEED_COHERENT) +#define CPU_FTRS_7447_10 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | \ + CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | \ + CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | \ + CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_HAS_HIGH_BATS | \ + CPU_FTR_NEED_COHERENT | CPU_FTR_NO_BTIC) +#define CPU_FTRS_7447 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + 
CPU_FTR_USE_TB | \ + CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | \ + CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | \ + CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_HAS_HIGH_BATS | \ + CPU_FTR_NEED_COHERENT) +#define CPU_FTRS_7447A (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | \ + CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | \ + CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | \ + CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_HAS_HIGH_BATS | \ + CPU_FTR_NEED_COHERENT) +#define CPU_FTRS_82XX (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB) +#define CPU_FTRS_G2_LE (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_MAYBE_CAN_DOZE | \ + CPU_FTR_USE_TB | CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_HAS_HIGH_BATS) +#define CPU_FTRS_E300 (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_MAYBE_CAN_DOZE | \ + CPU_FTR_USE_TB | CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_HAS_HIGH_BATS | \ + CPU_FTR_COMMON) +#define CPU_FTRS_CLASSIC32 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | CPU_FTR_HPTE_TABLE) +#define CPU_FTRS_POWER3_32 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | CPU_FTR_HPTE_TABLE) +#define CPU_FTRS_POWER4_32 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | CPU_FTR_HPTE_TABLE | CPU_FTR_NODSISRALIGN) +#define CPU_FTRS_970_32 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | CPU_FTR_HPTE_TABLE | CPU_FTR_ALTIVEC_COMP | \ + CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_NODSISRALIGN) +#define CPU_FTRS_8XX (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB) +#define CPU_FTRS_40X (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | \ + CPU_FTR_NODSISRALIGN) +#define CPU_FTRS_44X (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | \ + CPU_FTR_NODSISRALIGN) +#define CPU_FTRS_E200 (CPU_FTR_USE_TB | CPU_FTR_NODSISRALIGN) +#define CPU_FTRS_E500 (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | \ + CPU_FTR_NODSISRALIGN) +#define CPU_FTRS_E500_2 (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | \ + CPU_FTR_BIG_PHYS | CPU_FTR_NODSISRALIGN) +#define CPU_FTRS_GENERIC_32 (CPU_FTR_COMMON | CPU_FTR_NODSISRALIGN) #ifdef __powerpc64__ - CPU_FTRS_POWER3 = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | - CPU_FTR_HPTE_TABLE | CPU_FTR_IABR, - CPU_FTRS_RS64 = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | - CPU_FTR_HPTE_TABLE | CPU_FTR_IABR | - CPU_FTR_MMCRA | CPU_FTR_CTRL, - CPU_FTRS_POWER4 = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | - CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_MMCRA, - CPU_FTRS_PPC970 = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | - CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | - CPU_FTR_ALTIVEC_COMP | CPU_FTR_CAN_NAP | CPU_FTR_MMCRA, - CPU_FTRS_POWER5 = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | - CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | - CPU_FTR_MMCRA | CPU_FTR_SMT | - CPU_FTR_COHERENT_ICACHE | CPU_FTR_LOCKLESS_TLBIE | - CPU_FTR_MMCRA_SIHV | CPU_FTR_PURR, - CPU_FTRS_CELL = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | - CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | - CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | - CPU_FTR_CTRL | CPU_FTR_PAUSE_ZERO, - CPU_FTRS_COMPATIBLE = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | - CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2, +#define CPU_FTRS_POWER3 (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | \ + CPU_FTR_HPTE_TABLE | CPU_FTR_IABR) +#define CPU_FTRS_RS64 (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | \ + CPU_FTR_HPTE_TABLE | CPU_FTR_IABR | \ + CPU_FTR_MMCRA | CPU_FTR_CTRL) +#define CPU_FTRS_POWER4 (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | \ + CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_MMCRA) +#define CPU_FTRS_PPC970 
(CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | \ + CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | \ + CPU_FTR_ALTIVEC_COMP | CPU_FTR_CAN_NAP | CPU_FTR_MMCRA) +#define CPU_FTRS_POWER5 (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | \ + CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | \ + CPU_FTR_MMCRA | CPU_FTR_SMT | \ + CPU_FTR_COHERENT_ICACHE | CPU_FTR_LOCKLESS_TLBIE | \ + CPU_FTR_MMCRA_SIHV | CPU_FTR_PURR) +#define CPU_FTRS_CELL (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | \ + CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | \ + CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \ + CPU_FTR_CTRL | CPU_FTR_PAUSE_ZERO) +#define CPU_FTRS_COMPATIBLE (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | \ + CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2) #endif - CPU_FTRS_POSSIBLE = #ifdef __powerpc64__ - CPU_FTRS_POWER3 | CPU_FTRS_RS64 | CPU_FTRS_POWER4 | - CPU_FTRS_PPC970 | CPU_FTRS_POWER5 | CPU_FTRS_CELL | - CPU_FTR_CI_LARGE_PAGE | +#define CPU_FTRS_POSSIBLE \ + (CPU_FTRS_POWER3 | CPU_FTRS_RS64 | CPU_FTRS_POWER4 | \ + CPU_FTRS_PPC970 | CPU_FTRS_POWER5 | CPU_FTRS_CELL | \ + CPU_FTR_CI_LARGE_PAGE) #else +enum { + CPU_FTRS_POSSIBLE = #if CLASSIC_PPC CPU_FTRS_PPC601 | CPU_FTRS_603 | CPU_FTRS_604 | CPU_FTRS_740_NOTAU | CPU_FTRS_740 | CPU_FTRS_750 | CPU_FTRS_750FX1 | @@ -366,14 +367,18 @@ enum { #ifdef CONFIG_E500 CPU_FTRS_E500 | CPU_FTRS_E500_2 | #endif -#endif /* __powerpc64__ */ 0, +}; +#endif /* __powerpc64__ */ - CPU_FTRS_ALWAYS = #ifdef __powerpc64__ - CPU_FTRS_POWER3 & CPU_FTRS_RS64 & CPU_FTRS_POWER4 & - CPU_FTRS_PPC970 & CPU_FTRS_POWER5 & CPU_FTRS_CELL & +#define CPU_FTRS_ALWAYS \ + (CPU_FTRS_POWER3 & CPU_FTRS_RS64 & CPU_FTRS_POWER4 & \ + CPU_FTRS_PPC970 & CPU_FTRS_POWER5 & CPU_FTRS_CELL & \ + CPU_FTRS_POSSIBLE) #else +enum { + CPU_FTRS_ALWAYS = #if CLASSIC_PPC CPU_FTRS_PPC601 & CPU_FTRS_603 & CPU_FTRS_604 & CPU_FTRS_740_NOTAU & CPU_FTRS_740 & CPU_FTRS_750 & CPU_FTRS_750FX1 & @@ -407,9 +412,9 @@ enum { #ifdef CONFIG_E500 CPU_FTRS_E500 & CPU_FTRS_E500_2 & #endif -#endif /* __powerpc64__ */ CPU_FTRS_POSSIBLE, }; +#endif /* __powerpc64__ */ static inline int cpu_has_feature(unsigned long feature) { From torvalds at osdl.org Thu Mar 9 11:54:08 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Wed, 8 Mar 2006 16:54:08 -0800 (PST) Subject: [PATCH] Document Linux's memory barriers [try #2] In-Reply-To: <17423.30789.214209.462657@cargo.ozlabs.ibm.com> References: <20060308184500.GA17716@devserv.devel.redhat.com> <20060308173605.GB13063@devserv.devel.redhat.com> <20060308145506.GA5095@devserv.devel.redhat.com> <31492.1141753245@warthog.cambridge.redhat.com> <29826.1141828678@warthog.cambridge.redhat.com> <9834.1141837491@warthog.cambridge.redhat.com> <11922.1141842907@warthog.cambridge.redhat.com> <14275.1141844922@warthog.cambridge.redhat.com> <19984.1141846302@warthog.cambridge.redhat.com> <17423.30789.214209.462657@cargo.ozlabs.ibm.com> Message-ID: On Thu, 9 Mar 2006, Paul Mackerras wrote: > > If you ask me, the need for mmiowb on some platforms merely shows that > those platforms' implementations of spinlocks and read*/write* are > buggy... You could also state that same as "If you ask me, the need for mmiowb on some platforms merely shows that those platforms perform like a bat out of hell, and I think they should be slower" because the fact is, x86 memory barrier rules are just about optimal for performance. 
Linus From paulus at samba.org Thu Mar 9 11:37:32 2006 From: paulus at samba.org (Paul Mackerras) Date: Thu, 9 Mar 2006 11:37:32 +1100 Subject: [PATCH] Document Linux's memory barriers [try #2] In-Reply-To: <20060308194037.GO7301@parisc-linux.org> References: <20060308184500.GA17716@devserv.devel.redhat.com> <20060308173605.GB13063@devserv.devel.redhat.com> <20060308145506.GA5095@devserv.devel.redhat.com> <31492.1141753245@warthog.cambridge.redhat.com> <29826.1141828678@warthog.cambridge.redhat.com> <9834.1141837491@warthog.cambridge.redhat.com> <11922.1141842907@warthog.cambridge.redhat.com> <14275.1141844922@warthog.cambridge.redhat.com> <20060308194037.GO7301@parisc-linux.org> Message-ID: <17423.30924.278031.151438@cargo.ozlabs.ibm.com> Matthew Wilcox writes: > Looking at the SGI implementation, it's smarter than you think. Looks > like there's a register in the local I/O hub that lets you determine > when this write has been queued in the appropriate host->pci bridge. Given that mmiowb takes no arguments, how does it know which is the appropriate PCI host bridge? Paul. From jbarnes at virtuousgeek.org Thu Mar 9 11:55:13 2006 From: jbarnes at virtuousgeek.org (Jesse Barnes) Date: Wed, 8 Mar 2006 16:55:13 -0800 Subject: [PATCH] Document Linux's memory barriers [try #2] In-Reply-To: <17423.30789.214209.462657@cargo.ozlabs.ibm.com> References: <19984.1141846302@warthog.cambridge.redhat.com> <17423.30789.214209.462657@cargo.ozlabs.ibm.com> Message-ID: <200603081655.13672.jbarnes@virtuousgeek.org> On Wednesday, March 8, 2006 4:35 pm, Paul Mackerras wrote: > David Howells writes: > > On NUMA PowerPC, should mmiowb() be a SYNC or an EIEIO instruction > > then? Those do inter-component synchronisation. > > We actually have quite heavy synchronization in read*/write* on PPC, > and mmiowb can safely be a no-op. It would be nice to be able to have > lighter-weight synchronization, but I'm sure we would see lots of > subtle driver bugs cropping up if we did. write* do a full memory > barrier (sync) after the store, and read* explicitly wait for the data > to come back before. > > If you ask me, the need for mmiowb on some platforms merely shows that > those platforms' implementations of spinlocks and read*/write* are > buggy... Or maybe they just wanted to keep them fast. I don't know why you compromised so much in your implementation of read/write and lock/unlock, but given how expensive synchronization is, I'd think it would be better in the long run to make the barrier types explicit (or at least a subset of them) to maximize performance. The rules for using the barriers really aren't that bad... for mmiowb() you basically want to do it before an unlock in any critical section where you've done PIO writes. Of course, that doesn't mean there isn't confusion about existing barriers. There was a long thread a few years ago (Jes worked it all out, iirc) regarding some subtle memory ordering bugs in the tty layer that ended up being due to ia64's very weak spin_unlock ordering guarantees (one way memory barrier only), but I think that's mainly an artifact of how ill defined the semantics of the various arch specific routines are in some cases. That's why I suggested in an earlier thread that you enumerate all the memory ordering combinations on ppc and see if we can't define them all. Then David can roll the implicit ones up into his document, or we can add the appropriate new operations to the kernel. 
Really getting barriers right shouldn't be much harder than getting DMA mapping right, from a driver writer's POV (though people often get that wrong I guess). Jesse From jbarnes at virtuousgeek.org Thu Mar 9 11:59:05 2006 From: jbarnes at virtuousgeek.org (Jesse Barnes) Date: Wed, 8 Mar 2006 16:59:05 -0800 Subject: [PATCH] Document Linux's memory barriers [try #2] In-Reply-To: <17423.30924.278031.151438@cargo.ozlabs.ibm.com> References: <20060308184500.GA17716@devserv.devel.redhat.com> <20060308194037.GO7301@parisc-linux.org> <17423.30924.278031.151438@cargo.ozlabs.ibm.com> Message-ID: <200603081659.05786.jbarnes@virtuousgeek.org> On Wednesday, March 8, 2006 4:37 pm, Paul Mackerras wrote: > Matthew Wilcox writes: > > Looking at the SGI implementation, it's smarter than you think. > > Looks like there's a register in the local I/O hub that lets you > > determine when this write has been queued in the appropriate > > host->pci bridge. > > Given that mmiowb takes no arguments, how does it know which is the > appropriate PCI host bridge? It uses a per-node address space to reference the local bridge. The local bridge waits until the remote bridge has acked the write, and then sets the outstanding write register to the appropriate value. Jesse From paulus at samba.org Thu Mar 9 12:01:45 2006 From: paulus at samba.org (Paul Mackerras) Date: Thu, 9 Mar 2006 12:01:45 +1100 Subject: [PATCH] Document Linux's memory barriers [try #2] In-Reply-To: <20060309020851.D9651@jurassic.park.msu.ru> References: <20060308154157.GI7301@parisc-linux.org> <31492.1141753245@warthog.cambridge.redhat.com> <29826.1141828678@warthog.cambridge.redhat.com> <20060308145506.GA5095@devserv.devel.redhat.com> <10095.1141838381@warthog.cambridge.redhat.com> <17423.22121.254026.487964@cargo.ozlabs.ibm.com> <20060309020851.D9651@jurassic.park.msu.ru> Message-ID: <17423.32377.460820.710578@cargo.ozlabs.ibm.com> Ivan Kokshaysky writes: > On Thu, Mar 09, 2006 at 09:10:49AM +1100, Paul Mackerras wrote: > > David Howells writes: > > > > > > # define smp_read_barrier_depends() do { } while(0) > > > > > > What's this one meant to do? > > > > On most CPUs, if you load one value and use the value you get to > > compute the address for a second load, there is an implicit read > > barrier between the two loads because of the dependency. That's not > > true on alpha, apparently, because of the way their caches are > > structured. > > Who said?! ;-) Paul McKenney, after much discussion with Alpha chip designers IIRC. > > The smp_read_barrier_depends is a read barrier that you > > use between two loads when there is already a dependency between the > > loads, and it is a no-op on everything except alpha (IIRC). > > My "Compiler Writer's Guide for the Alpha 21264" says that if the > result of the first load contributes to the address calculation > of the second load, then the second load cannot issue until the data > from the first load is available. Sure, but because of the partitioned caches on some systems, the second load can get older data than the first load, even though it issues later. If you do:

	CPU 0				CPU 1

	foo = val;
	wmb();
	p = &foo;
					reg = p;
					bar = *reg;

it is apparently possible for CPU 1 to see the new value of p (i.e. &foo) but an old value of foo (i.e. not val). This can happen if p and foo are in different halves of the cache on CPU 1, and there are a lot of updates coming in for the half containing foo but the half containing p is quiet. I added Paul McKenney to the cc list so he can correct anything I have wrong here.
Paul. From paulus at samba.org Thu Mar 9 12:08:40 2006 From: paulus at samba.org (Paul Mackerras) Date: Thu, 9 Mar 2006 12:08:40 +1100 Subject: [PATCH] Document Linux's memory barriers [try #2] In-Reply-To: References: <20060308184500.GA17716@devserv.devel.redhat.com> <20060308173605.GB13063@devserv.devel.redhat.com> <20060308145506.GA5095@devserv.devel.redhat.com> <31492.1141753245@warthog.cambridge.redhat.com> <29826.1141828678@warthog.cambridge.redhat.com> <9834.1141837491@warthog.cambridge.redhat.com> <11922.1141842907@warthog.cambridge.redhat.com> <14275.1141844922@warthog.cambridge.redhat.com> <19984.1141846302@warthog.cambridge.redhat.com> <17423.30789.214209.462657@cargo.ozlabs.ibm.com> Message-ID: <17423.32792.500628.226831@cargo.ozlabs.ibm.com> Linus Torvalds writes: > > If you ask me, the need for mmiowb on some platforms merely shows that > > those platforms' implementations of spinlocks and read*/write* are > > buggy... > > You could also state that same as > > "If you ask me, the need for mmiowb on some platforms merely shows > that those platforms perform like a bat out of hell, and I think > they should be slower" > > because the fact is, x86 memory barrier rules are just about optimal for > performance. ... and x86 mmiowb is a no-op. It's not x86 that I think is buggy. Paul. From miltonm at bga.com Thu Mar 9 12:12:32 2006 From: miltonm at bga.com (Milton Miller) Date: Wed, 8 Mar 2006 19:12:32 -0600 Subject: dead hvc_console with kdump kernel In-Reply-To: <1141853516.11221.174.camel@localhost.localdomain> References: <200603061329.08262.michael@ellerman.id.au> <200603061726.35175.michael@ellerman.id.au> <200603081736.41406.michael@ellerman.id.au> <4e714f8f51a3ec10b00a99cb45d35f9f@bga.com> <1141853516.11221.174.camel@localhost.localdomain> Message-ID: <6f1fad66c932831c02e638e14de8c56b@bga.com> On Mar 8, 2006, at 3:31 PM, Benjamin Herrenschmidt wrote: > >> On real xics it is safe to do in the initial register irq. Don't >> do it every irq register though. >> >> Hypervisor will cover us for emulated xics if we do that. >> >> For real mpic? I don't know. Maybe we do an mpic reset and that >> will cover us? Ben? > > We should probably eoi all pending interrupts. That's especially true > with machines with HT APICs like the js2x or the quad g5 since if we > don't EOI on the APIC, the interrupt will remain blocked. and later: > To be more complete, what about a loop that iterates all irq_desc, and > for each of them does > > disable_irq() > if (desc->handler->end) > desc->handler->end() > > Or something like that... you could try to test the PENDING and > INPROGRESS flags maybe though that wouldn't handle IPIs (but then, I > think there should be no problem with those if we do an MPIC reset, not > 100% clear there) The problem is this is counter to the goal of kdump not trusting anything in the old kernel when going to the new kernel. In fact this would be walking dynamic data structures. Also, does it cover us on interrupts that are sent against a cpu that we choose not to online in the second kernel? Or interrupts that are sent after the loop (devices are not stopped at that point)? Hence the asked question: does the reset cover us? This can now be tested experimentally with Michael's debugfs patch harness.
milton From torvalds at osdl.org Thu Mar 9 12:27:05 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Wed, 8 Mar 2006 17:27:05 -0800 (PST) Subject: [PATCH] Document Linux's memory barriers [try #2] In-Reply-To: <17423.32792.500628.226831@cargo.ozlabs.ibm.com> References: <20060308184500.GA17716@devserv.devel.redhat.com> <20060308173605.GB13063@devserv.devel.redhat.com> <20060308145506.GA5095@devserv.devel.redhat.com> <31492.1141753245@warthog.cambridge.redhat.com> <29826.1141828678@warthog.cambridge.redhat.com> <9834.1141837491@warthog.cambridge.redhat.com> <11922.1141842907@warthog.cambridge.redhat.com> <14275.1141844922@warthog.cambridge.redhat.com> <19984.1141846302@warthog.cambridge.redhat.com> <17423.30789.214209.462657@cargo.ozlabs.ibm.com> <17423.32792.500628.226831@cargo.ozlabs.ibm.com> Message-ID: On Thu, 9 Mar 2006, Paul Mackerras wrote: > > ... and x86 mmiowb is a no-op. It's not x86 that I think is buggy. x86 mmiowb would have to be a real op too if there were any multi-pathed PCI buses out there for x86, methinks. Basically, the issue boils down to one thing: no "normal" barrier will _ever_ show up on the bus on x86 (ie ia64, afaik). That, together with any situation where there are multiple paths to one physical device means that mmiowb() _has_ to be a special op, and no spinlocks etc will _ever_ do the serialization you look for. Put another way: the only way to avoid mmiowb() being special is either one of: (a) have the bus fabric itself be synchronizing (b) pay a huge expense on the much more critical _regular_ barriers Now, I claim that (b) is just broken. I'd rather take the hit when I need to, than every time. Now, (a) is trivial for small cases, but scales badly unless you do some fancy footwork. I suspect you could do some scalable multi-pathable version with using similar approaches to resolving device conflicts as the cache coherency protocol does (or by having a token-passing thing), but it seems SGI's solution was fairly well thought out. That said, when I heard of the NUMA IO issues on the SGI platform, I was initially pretty horrified. It seems to have worked out ok, and as long as we're talking about machines where you can concentrate on validating just a few drivers, it seems to be a good tradeoff. Would I want the hard-to-think-about IO ordering on a regular desktop platform? No. Linus From michael at ellerman.id.au Thu Mar 9 12:45:49 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Thu, 9 Mar 2006 12:45:49 +1100 Subject: dead hvc_console with kdump kernel In-Reply-To: <6f1fad66c932831c02e638e14de8c56b@bga.com> References: <1141853516.11221.174.camel@localhost.localdomain> <6f1fad66c932831c02e638e14de8c56b@bga.com> Message-ID: <200603091245.54610.michael@ellerman.id.au> On Thu, 9 Mar 2006 12:12, Milton Miller wrote: > On Mar 8, 2006, at 3:31 PM, Benjamin Herrenschmidt wrote: > >> On real xics it is safe to do in the initial register irq. Don't > >> do it every irq register though. > >> > >> Hypervisor will cover us for emulated xics if we do that. > >> > >> For real mpic? I don't know. Maybe we do an mpic reset and that > >> will cover us? Ben? > > > > We should probably eoi all pending interrupts. That's especially true > > with machines with HT APICs like the js2x or the quad g5 since if we > > don't EOI on the APIC, the interrupt will remain blocked. 
> > and later: > > To be more complete, what about a loop that iterates all irq_desc, and > > for each of them does > > > > disable_irq() > > if (desc->handler->end) > > desc->handler->end() > > > > Or something like that... you could try to test the PENDING and > > INPROGRESS flags maybe though that wouldn't handle IPIs (but then, I > > think there should be no problem with those if we do an MPIC reset, not > > 100% clear there) > > The problem is this is counter to the goal of kdump not trusting anything > in the old kernel when going to the new kernel. In fact this would be > walking dynamic data structures. Yeah I agree. Basically if we don't register the irq in the second kernel then we're better off just leaving it. I can't think of a clean way to do this though: xics_enable_irq gets called multiple times and we don't want to end for each one, so we'd need to track that, which gets ugly. It'd be nice to come up with a generic (non-xics) solution too. Still thinking. cheers -- Michael Ellerman IBM OzLabs wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060309/a9f5f19a/attachment.pgp From paulus at samba.org Thu Mar 9 12:36:07 2006 From: paulus at samba.org (Paul Mackerras) Date: Thu, 9 Mar 2006 12:36:07 +1100 Subject: [PATCH] Document Linux's memory barriers [try #2] In-Reply-To: <200603081655.13672.jbarnes@virtuousgeek.org> References: <19984.1141846302@warthog.cambridge.redhat.com> <17423.30789.214209.462657@cargo.ozlabs.ibm.com> <200603081655.13672.jbarnes@virtuousgeek.org> Message-ID: <17423.35719.758492.297725@cargo.ozlabs.ibm.com> Jesse Barnes writes: > On Wednesday, March 8, 2006 4:35 pm, Paul Mackerras wrote: > > David Howells writes: > > > On NUMA PowerPC, should mmiowb() be a SYNC or an EIEIO instruction > > > then? Those do inter-component synchronisation. > > > > We actually have quite heavy synchronization in read*/write* on PPC, > > and mmiowb can safely be a no-op. It would be nice to be able to have > > lighter-weight synchronization, but I'm sure we would see lots of > > subtle driver bugs cropping up if we did. write* do a full memory > > barrier (sync) after the store, and read* explicitly wait for the data > > to come back before.
> > > > If you ask me, the need for mmiowb on some platforms merely shows that > > those platforms' implementations of spinlocks and read*/write* are > > buggy... > > Or maybe they just wanted to keep them fast. I don't know why you > compromised so much in your implementation of read/write and > lock/unlock, but given how expensive synchronization is, I'd think it > would be better in the long run to make the barrier types explicit (or > at least a subset of them) to maximize performance. The PPC read*/write* and in*/out* aim to implement x86 semantics, in order to minimize the number of subtle driver bugs that only show up under heavy load. I agree that in the long run making the barriers more explicit is a good thing. > The rules for using > the barriers really aren't that bad... for mmiowb() you basically want > to do it before an unlock in any critical section where you've done PIO > writes. Do you mean just PIO, or do you mean PIO or MMIO writes? > Of course, that doesn't mean there isn't confusion about existing > barriers. There was a long thread a few years ago (Jes worked it all > out, iirc) regarding some subtle memory ordering bugs in the tty layer > that ended up being due to ia64's very weak spin_unlock ordering > guarantees (one way memory barrier only), but I think that's mainly an > artifact of how ill defined the semantics of the various arch specific > routines are in some cases. Yes, there is a lot of confusion, unfortunately. There is also some difficulty in defining things to be any different from what x86 does. > That's why I suggested in an earlier thread that you enumerate all the > memory ordering combinations on ppc and see if we can't define them all. The main difficulty we strike on PPC is that cacheable accesses tend to get ordered independently of noncacheable accesses. The only instruction we have that orders cacheable accesses with respect to noncacheable accesses is the sync instruction, which is a heavyweight "synchronize everything" operation. It acts as a full memory barrier for both cacheable and noncacheable loads and stores. The other barriers we have are the lwsync instruction and the eieio instruction. The lwsync instruction (light-weight sync) acts as a memory barrier for cacheable loads and stores except that it allows a following load to go before a preceding store. The eieio instruction has two separate and independent effects. It acts as a full barrier for accesses to noncacheable nonprefetchable memory (i.e. MMIO or PIO registers), and it acts as a write barrier for accesses to cacheable memory. It doesn't do any ordering between cacheable and noncacheable accesses though. There is also the isync (instruction synchronize) instruction, which isn't explicitly a memory barrier. It prevents any following instructions from executing until the outcome of any previous conditional branches are known, and until it is known that no previous instruction can generate an exception. Thus it can be used to create a one-way barrier in spin_lock and read*. > Then David can roll the implicit ones up into his document, or we can > add the appropriate new operations to the kernel. Really getting > barriers right shouldn't be much harder than getting DMA mapping right, > from a driver writers POV (though people often get that wrong I guess). Unfortunately, if you get the barriers wrong your driver will still work most of the time on pretty much any machine, whereas if you get the DMA mapping wrong your driver won't work at all on some machines. 
Nevertheless, we should get these things defined properly and then try to make sure drivers do the right things. Paul. From benh at kernel.crashing.org Thu Mar 9 12:59:48 2006 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 09 Mar 2006 12:59:48 +1100 Subject: dead hvc_console with kdump kernel In-Reply-To: <6f1fad66c932831c02e638e14de8c56b@bga.com> References: <200603061329.08262.michael@ellerman.id.au> <200603061726.35175.michael@ellerman.id.au> <200603081736.41406.michael@ellerman.id.au> <4e714f8f51a3ec10b00a99cb45d35f9f@bga.com> <1141853516.11221.174.camel@localhost.localdomain> <6f1fad66c932831c02e638e14de8c56b@bga.com> Message-ID: <1141869589.11221.209.camel@localhost.localdomain> > The problem is this is counter to the goal of kdump not trusting anything > in the old kernel when going to the new kernel. In fact this would be > walking dynamic data structures. Well, I suppose you could just send EOIs to the APICs for every interrupt, won't harm... but you still need the list of APICs... You should also mask on them. > Also, does it cover us on interrupts that are sent against a cpu that > we choose not to online in the second kernel? Or interrupts that are > sent after the loop (devices are not stopped at that point) > > Hence the asked question, does the reset cover us? Dunno. Definitely doesn't cover APICs. You really need to disable all IRQ sources on the MPIC; I wouldn't trust reset to be implemented properly. > This can now be tested experimentally with Michael's debugfs > patch harness. > > milton From kamitch at cisco.com Thu Mar 9 13:00:40 2006 From: kamitch at cisco.com (Keith Mitchell) Date: Wed, 08 Mar 2006 21:00:40 -0500 Subject: 2.6.15.6 on G5/1.8 (9,1) Message-ID: <440F8C48.6000706@cisco.com> Hi, I have a bunch of different powermac machines that I am trying to upgrade and/or install and am having some difficulty with the 1.8 (9,1) powermacs as well as the newer Dual Core (2.0) machines. The other two types of machines that I have seem to be working well enough (Dual-Proc 2.0, Dual-Proc 2.7 -- Both 7,3). Originally the systems were running a beta version of Yellowdog that had a custom kernel based on 2.6.12.3. That kernel works great on all of the machines except the dual core which doesn't work at all with the kernel (no surprise). When YDL 4.1 came out (with a kernel based on 2.6.15-rc5 plus some patches) I wanted to upgrade to that and have the same image on all of the machines. The hope was that the 1.8ghz-single machines would get thermal support and I would get rudimentary support for the dual core machine. I want to have the same load on all of the machines to make my job easier (since I have 30+ machines total to keep running). But... The stock YDL kernel does not work so well on the 1.8ghz-single machines.... I am able to install the distribution on these machines and reboot. The system will stay up for something like 30 seconds and then it freezes and shows: hda: lost interrupt mipc_enable_irq timeout The dual core machine does something a little different. I logged onto the console and tried to run Xautoconfig and I used tab completion and then it starts scrolling that all up the screen and occasionally I see the above errors but then it keeps scrolling and I can't use the machine. Then I tried to use the stock 2.6.15.6 kernel that I d/l'd from kernel.org. At first I tried to use the 'arch/powerpc/g5_defconfig' but that wouldn't compile, so I tried 'arch/powerpc/ppc64_defconfig' and that didn't compile either (same error).
So, then I tried taking the config file from the YDL srpm (i.e. 2.6.15-rc5 based) for the kernel and tried that (the g5-smp version) running it through 'make oldconfig' and taking the default for the new options. This kernel compiled but wouldn't boot at all. It went through the PROM code, cleared the screen and showed me the little logo at the top of the screen and then a blinking cursor but nothing after that.... This was on the single-1.8 machine... I did not try this on the Dual-core machine or the Dual-Proc machines due to lack of time today. FWIW the compile errors I got from the defconfig compiles was: drivers/md/raid6int8.c: In function `raid6_int8_gen_syndrome': drivers/md/raid6int8.c:185: error: unable to find a register to spill in class `FLOAT_REGS' drivers/md/raid6int8.c:185: error: this is the insn: (insn:HI 619 621 640 4 (set (mem:DI (plus:DI (reg/v/f:DI 122 [ p ]) (reg/v:DI 66 ctr [orig:124 d ] [124])) [0 S8 A64]) (reg/v:DI 129 [ wp0 ])) 320 {*movdi_internal64} (nil) (expr_list:REG_DEAD (reg/v:DI 129 [ wp0 ]) (nil))) drivers/md/raid6int8.c:185: confused by earlier errors, bailing out make[2]: *** [drivers/md/raid6int8.o] Error 1 make[1]: *** [drivers/md] Error 2 make: *** [drivers] Error 2 [root at kamitch-lnx linux-2.6.15.6]# gcc --version gcc (GCC) 3.4.4 20050721 (Yellow Dog 3.4.4-2.ydl.2) Copyright (C) 2004 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Does any of this sound familiar to anyone... What kernel/config combo would be recommended for this smattering of Powermac machines? Thanks. From ink at jurassic.park.msu.ru Thu Mar 9 10:08:51 2006 From: ink at jurassic.park.msu.ru (Ivan Kokshaysky) Date: Thu, 9 Mar 2006 02:08:51 +0300 Subject: [PATCH] Document Linux's memory barriers [try #2] In-Reply-To: <17423.22121.254026.487964@cargo.ozlabs.ibm.com>; from paulus@samba.org on Thu, Mar 09, 2006 at 09:10:49AM +1100 References: <20060308154157.GI7301@parisc-linux.org> <31492.1141753245@warthog.cambridge.redhat.com> <29826.1141828678@warthog.cambridge.redhat.com> <20060308145506.GA5095@devserv.devel.redhat.com> <10095.1141838381@warthog.cambridge.redhat.com> <17423.22121.254026.487964@cargo.ozlabs.ibm.com> Message-ID: <20060309020851.D9651@jurassic.park.msu.ru> On Thu, Mar 09, 2006 at 09:10:49AM +1100, Paul Mackerras wrote: > David Howells writes: > > > > # define smp_read_barrier_depends() do { } while(0) > > > > What's this one meant to do? > > On most CPUs, if you load one value and use the value you get to > compute the address for a second load, there is an implicit read > barrier between the two loads because of the dependency. That's not > true on alpha, apparently, because of the way their caches are > structured. Who said?! ;-) > The smp_read_barrier_depends is a read barrier that you > use between two loads when there is already a dependency between the > loads, and it is a no-op on everything except alpha (IIRC). My "Compiler Writer's Guide for the Alpha 21264" says that if the result of the first load contributes to the address calculation of the second load, then the second load cannot issue until the data from the first load is available. Obviously, we don't care about earlier alphas as they are executing strictly in program order. Ivan. 
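To make the Alpha dependent-load case concrete, the pattern that smp_read_barrier_depends() exists for looks like this in kernel style (a minimal sketch reusing the variable names from Paul's example above; it is illustrative, not code from any real driver):

	/* CPU 0: publish the data, then the pointer */
	foo = val;
	smp_wmb();			/* order the store to foo before the store to p */
	p = &foo;

	/* CPU 1: load the pointer, then the data it points to */
	reg = p;
	smp_read_barrier_depends();	/* no-op on everything except Alpha */
	bar = *reg;			/* now guaranteed to observe val */

Without the smp_read_barrier_depends(), a 21264 can issue the two loads in program order and still return a stale foo from the other cache bank, exactly as described above.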
From nickpiggin at yahoo.com.au Thu Mar 9 13:38:38 2006 From: nickpiggin at yahoo.com.au (Nick Piggin) Date: Thu, 09 Mar 2006 13:38:38 +1100 Subject: [PATCH] Document Linux's memory barriers [try #2] In-Reply-To: References: <20060308184500.GA17716@devserv.devel.redhat.com> <20060308173605.GB13063@devserv.devel.redhat.com> <20060308145506.GA5095@devserv.devel.redhat.com> <31492.1141753245@warthog.cambridge.redhat.com> <29826.1141828678@warthog.cambridge.redhat.com> <9834.1141837491@warthog.cambridge.redhat.com> <11922.1141842907@warthog.cambridge.redhat.com> <14275.1141844922@warthog.cambridge.redhat.com> <19984.1141846302@warthog.cambridge.redhat.com> <17423.30789.214209.462657@cargo.ozlabs.ibm.com> <17423.32792.500628.226831@cargo.ozlabs.ibm.com> Message-ID: <440F952E.90808@yahoo.com.au> Linus Torvalds wrote: > >On Thu, 9 Mar 2006, Paul Mackerras wrote: > >>... and x86 mmiowb is a no-op. It's not x86 that I think is buggy. >> > >x86 mmiowb would have to be a real op too if there were any multi-pathed >PCI buses out there for x86, methinks. > >Basically, the issue boils down to one thing: no "normal" barrier will >_ever_ show up on the bus on x86 (ie ia64, afaik). That, together with any >situation where there are multiple paths to one physical device means that >mmiowb() _has_ to be a special op, and no spinlocks etc will _ever_ do the >serialization you look for. > >Put another way: the only way to avoid mmiowb() being special is either >one of: > (a) have the bus fabric itself be synchronizing > (b) pay a huge expense on the much more critical _regular_ barriers > >Now, I claim that (b) is just broken. I'd rather take the hit when I need >to, than every time. > I'm not very driver-minded; would it make sense to have io versions of locks, which can provide critical sections for IO operations? The number of (uncommented) memory barriers sprinkled around drivers looks pretty scary... -- Send instant messages to your online friends http://au.messenger.yahoo.com From paulus at samba.org Thu Mar 9 14:45:13 2006 From: paulus at samba.org (Paul Mackerras) Date: Thu, 9 Mar 2006 14:45:13 +1100 Subject: [PATCH] Document Linux's memory barriers [try #2] In-Reply-To: References: <20060308184500.GA17716@devserv.devel.redhat.com> <20060308173605.GB13063@devserv.devel.redhat.com> <20060308145506.GA5095@devserv.devel.redhat.com> <31492.1141753245@warthog.cambridge.redhat.com> <29826.1141828678@warthog.cambridge.redhat.com> <9834.1141837491@warthog.cambridge.redhat.com> <11922.1141842907@warthog.cambridge.redhat.com> <14275.1141844922@warthog.cambridge.redhat.com> <19984.1141846302@warthog.cambridge.redhat.com> <17423.30789.214209.462657@cargo.ozlabs.ibm.com> <17423.32792.500628.226831@cargo.ozlabs.ibm.com> Message-ID: <17423.42185.78767.837295@cargo.ozlabs.ibm.com> Linus Torvalds writes: > x86 mmiowb would have to be a real op too if there were any multi-pathed > PCI buses out there for x86, methinks. Not if the manufacturers wanted to be able to run existing standard x86 operating systems on it, surely. I presume that on x86 the PCI host bridges and caches are all part of the coherence domain, and that the rule about stores being observed in order applies to what the PCI host bridge can see as much as it does to any other agent in the coherence domain. And if I have understood you correctly, the store ordering rule applies both to stores to regular cacheable memory and stores to noncacheable nonprefetchable MMIO registers without distinction. 
If that is so, then I don't see how the writel's can get out of order. Put another way, we expect spinlock regions to order stores to regular memory, and AFAICS the x86 ordering rules mean that the same guarantee should apply to stores to MMIO registers. (It's entirely possible that I don't fully understand the x86 memory ordering rules, of course. :) > Basically, the issue boils down to one thing: no "normal" barrier will > _ever_ show up on the bus on x86 (ie ia64, afaik). That, together with any A spin_lock does show up on the bus, doesn't it? > Would I want the hard-to-think-about IO ordering on a regular desktop > platform? No. In fact I think that mmiowb can actually be useful on PPC, if we can be sure that all the drivers we care about will use it correctly. If we can have the following rules: * If you have stores to regular memory, followed by an MMIO store, and you want the device to see the stores to regular memory at the point where it receives the MMIO store, then you need a wmb() between the stores to regular memory and the MMIO store. * If you have PIO or MMIO accesses, and you need to ensure the PIO/MMIO accesses don't get reordered with respect to PIO/MMIO accesses on another CPU, put the accesses inside a spin-locked region, and put a mmiowb() between the last access and the spin_unlock. * smp_wmb() doesn't necessarily do any ordering of MMIO accesses vs. other accesses, and in that sense it is weaker than wmb(). ... then I can remove the sync from write*, which would be nice, and make mmiowb() be a sync. I wonder how long we're going to spend chasing driver bugs after that, though. :) Paul. From jbarnes at virtuousgeek.org Thu Mar 9 15:18:04 2006 From: jbarnes at virtuousgeek.org (Jesse Barnes) Date: Wed, 8 Mar 2006 20:18:04 -0800 Subject: [PATCH] Document Linux's memory barriers [try #2] In-Reply-To: <17423.34439.977741.295065@cargo.ozlabs.ibm.com> References: <20060308184500.GA17716@devserv.devel.redhat.com> <200603081659.05786.jbarnes@virtuousgeek.org> <17423.34439.977741.295065@cargo.ozlabs.ibm.com> Message-ID: <200603082018.04385.jbarnes@virtuousgeek.org> On Wednesday, March 08, 2006 5:36 pm, Paul Mackerras wrote: > Jesse Barnes writes: > > It uses a per-node address space to reference the local bridge. > > The local bridge then waits until the remote bridge has acked the > > write before, then sets the outstanding write register to the > > appropriate value. > > That sounds like mmiowb can only be used when preemption is disabled, > such as inside a spin-locked region - is that right? There's a scheduler hook to flush things if a process moves. I think Brent Casavant submitted that patch recently. Jesse From jbarnes at virtuousgeek.org Thu Mar 9 15:26:29 2006 From: jbarnes at virtuousgeek.org (Jesse Barnes) Date: Wed, 8 Mar 2006 20:26:29 -0800 Subject: [PATCH] Document Linux's memory barriers [try #2] In-Reply-To: <17423.35719.758492.297725@cargo.ozlabs.ibm.com> References: <200603081655.13672.jbarnes@virtuousgeek.org> <17423.35719.758492.297725@cargo.ozlabs.ibm.com> Message-ID: <200603082026.29725.jbarnes@virtuousgeek.org> On Wednesday, March 08, 2006 5:57 pm, Paul Mackerras wrote: > > The rules for using > > the barriers really aren't that bad... for mmiowb() you basically > > want to do it before an unlock in any critical section where you've > > done PIO writes. > > Do you mean just PIO, or do you mean PIO or MMIO writes? I'd have to check, but iirc it was just MMIO. We assumed PIO (inX/outX) was defined to be very strongly ordered (and thus slow) in Linux. 
But Linus is apparently flexible on that point for the new ioreadX/iowriteX stuff. > Yes, there is a lot of confusion, unfortunately. There is also some > difficulty in defining things to be any different from what x86 does. Well, Alpha has smp_barrier_depends or whatever, that's *really* funky. > > That's why I suggested in an earlier thread that you enumerate all > > the memory ordering combinations on ppc and see if we can't define > > them all. > > The main difficulty we strike on PPC is that cacheable accesses tend > to get ordered independently of noncacheable accesses. The only > instruction we have that orders cacheable accesses with respect to > noncacheable accesses is the sync instruction, which is a heavyweight > "synchronize everything" operation. It acts as a full memory barrier > for both cacheable and noncacheable loads and stores. Ah, ok, sounds like your chip needs an ISA extension or two then. :) > The other barriers we have are the lwsync instruction and the eieio > instruction. The lwsync instruction (light-weight sync) acts as a > memory barrier for cacheable loads and stores except that it allows a > following load to go before a preceding store. This sounds like ia64 acquire semantics, a fence, but only in the downward direction. > The eieio instruction has two separate and independent effects. It > acts as a full barrier for accesses to noncacheable nonprefetchable > memory (i.e. MMIO or PIO registers), and it acts as a write barrier > for accesses to cacheable memory. It doesn't do any ordering between > cacheable and noncacheable accesses though. Weird, ok, so for cacheable stuff it's equivalent to ia64's release semantics, but has additional effects for noncacheable accesses. Too bad it doesn't tie the two together somehow. > There is also the isync (instruction synchronize) instruction, which > isn't explicitly a memory barrier. It prevents any following > instructions from executing until the outcome of any previous > conditional branches are known, and until it is known that no > previous instruction can generate an exception. Thus it can be used > to create a one-way barrier in spin_lock and read*. Hm, interesting. > Unfortunately, if you get the barriers wrong your driver will still > work most of the time on pretty much any machine, whereas if you get > the DMA mapping wrong your driver won't work at all on some machines. > Nevertheless, we should get these things defined properly and then > try to make sure drivers do the right things. Agreed. Having a set of rules that driver writers can use would help too. Given that PPC doesn't appear to have a lightweight way of synchronizing between I/O and memory accesses, it sounds like full syncs will be needed in a lot of cases. Jesse From jbarnes at virtuousgeek.org Thu Mar 9 15:34:00 2006 From: jbarnes at virtuousgeek.org (Jesse Barnes) Date: Wed, 8 Mar 2006 20:34:00 -0800 Subject: [PATCH] Document Linux's memory barriers [try #2] In-Reply-To: References: <17423.32792.500628.226831@cargo.ozlabs.ibm.com> Message-ID: <200603082034.00238.jbarnes@virtuousgeek.org> On Wednesday, March 08, 2006 5:27 pm, Linus Torvalds wrote: > That said, when I heard of the NUMA IO issues on the SGI platform, I > was initially pretty horrified. It seems to have worked out ok, and > as long as we're talking about machines where you can concentrate on > validating just a few drivers, it seems to be a good tradeoff. It's actually not too bad. We tried hard to make the arch code support the semantics that Linux drivers expect. 
mmiowb() was an optimization we added (though it's much less of an optimization than read_relaxed() was) to make things a little faster. Like you say, the alternative was to embed the same functionality into spin_unlock or something (IRIX actually had an io_spin_unlock that did that iirc), but that would mean an MMIO access on every unlock, which would be bad. So ultimately mmiowb() is the only thing drivers really have to care about on Altix (assuming they do DMA mapping correctly), and the rules for that are fairly simple. Then they can additionally use read_relaxed() to optimize performance a bit (quite a bit on big systems). > Would I want the hard-to-think-about IO ordering on a regular desktop > platform? No. I guess you don't want anyone to send you an O2 then? :) Jesse From jbarnes at virtuousgeek.org Thu Mar 9 15:36:19 2006 From: jbarnes at virtuousgeek.org (Jesse Barnes) Date: Wed, 8 Mar 2006 20:36:19 -0800 Subject: [PATCH] Document Linux's memory barriers [try #2] In-Reply-To: <17423.42185.78767.837295@cargo.ozlabs.ibm.com> References: <17423.42185.78767.837295@cargo.ozlabs.ibm.com> Message-ID: <200603082036.19811.jbarnes@virtuousgeek.org> On Wednesday, March 08, 2006 7:45 pm, Paul Mackerras wrote: > If we can have the following rules: > > * If you have stores to regular memory, followed by an MMIO store, > and you want the device to see the stores to regular memory at the > point where it receives the MMIO store, then you need a wmb() between > the stores to regular memory and the MMIO store. > > * If you have PIO or MMIO accesses, and you need to ensure the > PIO/MMIO accesses don't get reordered with respect to PIO/MMIO > accesses on another CPU, put the accesses inside a spin-locked > region, and put a mmiowb() between the last access and the > spin_unlock. > > * smp_wmb() doesn't necessarily do any ordering of MMIO accesses > vs. other accesses, and in that sense it is weaker than wmb(). This is a good set of rules. Hopefully David can add something like this to his doc. > ... then I can remove the sync from write*, which would be nice, and > make mmiowb() be a sync. I wonder how long we're going to spend > chasing driver bugs after that, though. :) Hm, a static checker should be able to find this stuff, shouldn't it? Jesse From paulus at samba.org Thu Mar 9 15:43:29 2006 From: paulus at samba.org (Paul Mackerras) Date: Thu, 9 Mar 2006 15:43:29 +1100 Subject: [PATCH] Document Linux's memory barriers [try #2] In-Reply-To: <200603082034.00238.jbarnes@virtuousgeek.org> References: <17423.32792.500628.226831@cargo.ozlabs.ibm.com> <200603082034.00238.jbarnes@virtuousgeek.org> Message-ID: <17423.45681.476222.143773@cargo.ozlabs.ibm.com> Jesse Barnes writes: > So ultimately mmiowb() is the only thing drivers really have to care > about on Altix (assuming they do DMA mapping correctly), and the rules > for that are fairly simple. Then they can additionally use > read_relaxed() to optimize performance a bit (quite a bit on big > systems). If I can be sure that all the drivers we care about on PPC use mmiowb correctly, I can reduce or eliminate the barrier in write*, which would be nice. Which drivers have been audited to make sure they use mmiowb correctly? In particular, has the USB driver been audited? Paul. 
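For reference, the usage pattern such an audit would look for is the one Paul set out earlier: MMIO stores done under a lock, with an mmiowb() before the unlock. A minimal sketch (the device, lock, and register names here are made up):

	spin_lock(&dev->lock);
	writel(tail, dev->regs + RING_TAIL);	/* MMIO store inside the critical section */
	mmiowb();				/* keep MMIO stores in lock order across CPUs */
	spin_unlock(&dev->lock);

Without the mmiowb(), two CPUs taking the lock in turn can still have their writel()s arrive at the device out of order on a platform like Altix.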
From torvalds at osdl.org Thu Mar 9 16:38:36 2006
From: torvalds at osdl.org (Linus Torvalds)
Date: Wed, 8 Mar 2006 21:38:36 -0800 (PST)
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: <17423.42185.78767.837295@cargo.ozlabs.ibm.com>
References: <20060308184500.GA17716@devserv.devel.redhat.com>
	<20060308173605.GB13063@devserv.devel.redhat.com>
	<20060308145506.GA5095@devserv.devel.redhat.com>
	<31492.1141753245@warthog.cambridge.redhat.com>
	<29826.1141828678@warthog.cambridge.redhat.com>
	<9834.1141837491@warthog.cambridge.redhat.com>
	<11922.1141842907@warthog.cambridge.redhat.com>
	<14275.1141844922@warthog.cambridge.redhat.com>
	<19984.1141846302@warthog.cambridge.redhat.com>
	<17423.30789.214209.462657@cargo.ozlabs.ibm.com>
	<17423.32792.500628.226831@cargo.ozlabs.ibm.com>
	<17423.42185.78767.837295@cargo.ozlabs.ibm.com>
Message-ID: 

On Thu, 9 Mar 2006, Paul Mackerras wrote:
>
> A spin_lock does show up on the bus, doesn't it?

Nope.

If the lock entity is in an exclusive cache-line, a spinlock does not show
up on the bus at _all_. It's all purely in the core.

In fact, I think AMD does a spinlock in ~15 CPU cycles (that's the
serialization overhead in the core). I think a P-M core is ~25, while the
NetBurst (P4) core is much more because they have horrible serialization
issues (I think it's on the order of 100 cycles there).

Anyway, try doing a spinlock in 15 CPU cycles and going out on the bus for
it..

(Couple that with spin_unlock basically being free).

Now, if the spinlocks end up _bouncing_ between CPU's, they'll obviously
be a lot more expensive.

		Linus

From paulus at samba.org Thu Mar 9 18:41:04 2006
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 9 Mar 2006 18:41:04 +1100
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: <200603082036.19811.jbarnes@virtuousgeek.org>
References: <17423.42185.78767.837295@cargo.ozlabs.ibm.com>
	<200603082036.19811.jbarnes@virtuousgeek.org>
Message-ID: <17423.56336.993754.818818@cargo.ozlabs.ibm.com>

Jesse Barnes writes:

> Hm, a static checker should be able to find this stuff, shouldn't it?

Good idea. I wonder if sparse could be extended to do it.

Alternatively, it wouldn't be hard to check dynamically. Just have a
per-cpu count of outstanding MMIO stores. Zero it in spin_lock and
mmiowb, increment it in write*, and grizzle if spin_unlock finds it
non-zero. Should be very little overhead.

Paul.

From jes at sgi.com Thu Mar 9 21:05:39 2006
From: jes at sgi.com (Jes Sorensen)
Date: 09 Mar 2006 05:05:39 -0500
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: <17423.45681.476222.143773@cargo.ozlabs.ibm.com>
References: <17423.32792.500628.226831@cargo.ozlabs.ibm.com>
	<200603082034.00238.jbarnes@virtuousgeek.org>
	<17423.45681.476222.143773@cargo.ozlabs.ibm.com>
Message-ID: 

>>>>> "Paul" == Paul Mackerras writes:

Paul> Jesse Barnes writes:
>> So ultimately mmiowb() is the only thing drivers really have to
>> care about on Altix (assuming they do DMA mapping correctly), and
>> the rules for that are fairly simple. Then they can additionally
>> use read_relaxed() to optimize performance a bit (quite a bit on
>> big systems).

Paul> If I can be sure that all the drivers we care about on PPC use
Paul> mmiowb correctly, I can reduce or eliminate the barrier in
Paul> write*, which would be nice.

Paul> Which drivers have been audited to make sure they use mmiowb
Paul> correctly? In particular, has the USB driver been audited?
I think the primary drivers we've looked at are drivers/net/tg3.c,
drivers/net/s2io.c, drivers/scsi/qla1280.c, and possibly the qla[234]xxx
series - that's probably it!

While we have USB on the systems, I don't think anyone has spent a lot of
time verifying it in this context. At least the keyboard and mouse I have
on this box seem to behave.

Cheers,
Jes

From dhowells at redhat.com Thu Mar 9 22:41:02 2006
From: dhowells at redhat.com (David Howells)
Date: Thu, 09 Mar 2006 11:41:02 +0000
Subject: [PATCH] Document Linux's memory barriers
In-Reply-To: <1141855305.10606.6.camel@localhost.localdomain>
References: <1141855305.10606.6.camel@localhost.localdomain>
	<20060308161829.GC3669@elf.ucw.cz>
	<31492.1141753245@warthog.cambridge.redhat.com>
	<24309.1141848971@warthog.cambridge.redhat.com>
Message-ID: <24280.1141904462@warthog.cambridge.redhat.com>

Alan Cox wrote:

> > The LOCK and UNLOCK functions presumably make at least one memory write apiece
> > to manipulate the target lock (on SMP at least).
>
> No they merely perform the bus transactions necessary to perform an
> update atomically. They are however "serializing" instructions which
> means they do cause a certain amount of serialization (see the intel
> architecture manual on serializing instructions for detail).
>
> Athlon and later know how to turn it from locked memory accesses into
> merely an exclusive cache line grab.

So, you're saying that the LOCK and UNLOCK primitives don't actually modify
memory, but rather simply pin the cacheline into the CPU's cache and refuse to
let anyone else touch it?

No... it can't work like that. It *must* make a memory modification - after
all, the CPU doesn't know that what it's doing is a spin_unlock(), say, rather
than an atomic_set().

David

From mbuesch at freenet.de Thu Mar 9 22:44:09 2006
From: mbuesch at freenet.de (Michael Buesch)
Date: Thu, 9 Mar 2006 12:44:09 +0100
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: <17423.42185.78767.837295@cargo.ozlabs.ibm.com>
References: <17423.42185.78767.837295@cargo.ozlabs.ibm.com>
Message-ID: <200603091244.09621.mbuesch@freenet.de>

On Thursday 09 March 2006 04:45, you wrote:
> ... then I can remove the sync from write*, which would be nice, and
> make mmiowb() be a sync. I wonder how long we're going to spend
> chasing driver bugs after that, though. :)

Can you do a patch which makes the change, so people can actually test
their drivers?

-- 
Greetings Michael.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060309/6a25adfe/attachment.pgp

From alan at lxorguk.ukuu.org.uk Thu Mar 9 23:28:59 2006
From: alan at lxorguk.ukuu.org.uk (Alan Cox)
Date: Thu, 09 Mar 2006 12:28:59 +0000
Subject: [PATCH] Document Linux's memory barriers
In-Reply-To: <24280.1141904462@warthog.cambridge.redhat.com>
References: <1141855305.10606.6.camel@localhost.localdomain>
	<20060308161829.GC3669@elf.ucw.cz>
	<31492.1141753245@warthog.cambridge.redhat.com>
	<24309.1141848971@warthog.cambridge.redhat.com>
	<24280.1141904462@warthog.cambridge.redhat.com>
Message-ID: <1141907339.16745.2.camel@localhost.localdomain>

On Iau, 2006-03-09 at 11:41 +0000, David Howells wrote:
> Alan Cox wrote:
> So, you're saying that the LOCK and UNLOCK primitives don't actually modify
> memory, but rather simply pin the cacheline into the CPU's cache and refuse to
> let anyone else touch it?
Basically yes

> No... it can't work like that. It *must* make a memory modification

Then you'll have to argue with the chip designers because it doesn't.

It's all built around the cache coherency. To make a write to a cache
line I must be the sole owner of the line. Look up "MESI cache" in a
good book on the subject. If we own the affected line then we can update
just the cache and still get locked semantics: since we own the cache
line, we will write it back if anyone else asks for it (or, nowadays on
some systems, transfer it directly to the other CPU).

From dhowells at redhat.com Thu Mar 9 23:27:54 2006
From: dhowells at redhat.com (David Howells)
Date: Thu, 09 Mar 2006 12:27:54 +0000
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: 
References: <20060308184500.GA17716@devserv.devel.redhat.com>
	<20060308173605.GB13063@devserv.devel.redhat.com>
	<20060308145506.GA5095@devserv.devel.redhat.com>
	<31492.1141753245@warthog.cambridge.redhat.com>
	<29826.1141828678@warthog.cambridge.redhat.com>
	<9834.1141837491@warthog.cambridge.redhat.com>
	<11922.1141842907@warthog.cambridge.redhat.com>
	<14275.1141844922@warthog.cambridge.redhat.com>
	<19984.1141846302@warthog.cambridge.redhat.com>
	<17423.30789.214209.462657@cargo.ozlabs.ibm.com>
	<17423.32792.500628.226831@cargo.ozlabs.ibm.com>
	<17423.42185.78767.837295@cargo.ozlabs.ibm.com>
Message-ID: <25437.1141907274@warthog.cambridge.redhat.com>

Linus Torvalds wrote:

> > A spin_lock does show up on the bus, doesn't it?
>
> Nope.

Yes, sort of, under some circumstances. If the CPU doing the spin_lock()
doesn't own the cacheline with the lock, it'll have to resort to the bus to
grab the cacheline from the current owner (so another CPU would at least see
a read).

The effect of the spin_lock() might not be seen outside of the CPU before the
spin_unlock() occurs, but it *will* be committed to the CPU's cache, and given
cache coherency mechanisms, that's effectively the same as main memory.

So it's in effect visible on the bus, given that it will be transferred to
another CPU when requested; and as long as the other CPUs expect to see the
effects and the ordering imposed, it's immaterial whether the content of the
spinlock is actually ever committed to SDRAM or whether it remains perpetually
in one or another's CPU cache.

David

From dhowells at redhat.com Fri Mar 10 00:02:01 2006
From: dhowells at redhat.com (David Howells)
Date: Thu, 09 Mar 2006 13:02:01 +0000
Subject: [PATCH] Document Linux's memory barriers
In-Reply-To: <1141907339.16745.2.camel@localhost.localdomain>
References: <1141907339.16745.2.camel@localhost.localdomain>
	<1141855305.10606.6.camel@localhost.localdomain>
	<20060308161829.GC3669@elf.ucw.cz>
	<31492.1141753245@warthog.cambridge.redhat.com>
	<24309.1141848971@warthog.cambridge.redhat.com>
	<24280.1141904462@warthog.cambridge.redhat.com>
Message-ID: <26313.1141909321@warthog.cambridge.redhat.com>

Alan Cox wrote:

> > Alan Cox wrote:
> > So, you're saying that the LOCK and UNLOCK primitives don't actually modify
> > memory, but rather simply pin the cacheline into the CPU's cache and refuse to
> > let anyone else touch it?
>
> Basically yes

What you said is incomplete: the cacheline is wangled into the Exclusive
state, and there it sits until modified (at which point it shifts to the
Modified state) or stolen (when it shifts to the Shared state). Whilst the
x86 CPU might pin it there for the duration of the execution of the locked
instruction, it can't leave it there until it detects a spin_unlock() or
equivalent.
I guess LL/SC and LWARX/STWCX work by the reserved load wangling the
cacheline into the Exclusive state, and then the conditional store only doing
the store if the cacheline is still in that state. I don't know whether the
conditional store may modify a cacheline that's in the Modified state, but
I'd guess you'd need more state than that, because you have to pair it with a
load reserved.

With inter-CPU memory barriers I think you have to consider the cache part of
the memory, not part of the CPU. The CPU _does_ make a memory modification;
it's just that it doesn't proceed any further than the cache, until the cache
coherency mechanisms transfer the change to another CPU, or until the cache
becomes full and the lock's line gets ejected.

> > No... it can't work like that. It *must* make a memory modification
>
> Then you'll have to argue with the chip designers because it doesn't.
>
> It's all built around the cache coherency. To make a write to a cache
> line I must be the sole owner of the line. Look up "MESI cache" in a
> good book on the subject.

http://en.wikipedia.org/wiki/MESI_protocol

And a picture of the state machine may be found here:

https://www.cs.tcd.ie/Jeremy.Jones/vivio/caches/MESIHelp.htm

David

From dhowells at redhat.com Fri Mar 10 01:01:16 2006
From: dhowells at redhat.com (David Howells)
Date: Thu, 09 Mar 2006 14:01:16 +0000
Subject: [PATCH] Document Linux's memory barriers [try #3]
In-Reply-To: <21627.1141846631@warthog.cambridge.redhat.com>
References: <21627.1141846631@warthog.cambridge.redhat.com>
	<29826.1141828678@warthog.cambridge.redhat.com>
	<31492.1141753245@warthog.cambridge.redhat.com>
Message-ID: <27749.1141912876@warthog.cambridge.redhat.com>

I'm thinking of adding the attached to the document. Any comments or
objections?

David

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 6eeb7e4..f9a9192 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -4,6 +4,8 @@ Contents:
 
+ (*) What do we consider memory?
+
  (*) What are memory barriers?
 
  (*) Where are memory barriers needed?
@@ -32,6 +34,82 @@ Contents:
  (*) References.
 
 
+===========================
+WHAT DO WE CONSIDER MEMORY?
+===========================
+
+For the purpose of this specification, "memory", at least as far as cached CPU
+vs CPU interactions go, has to include the CPU caches in the system. Although
+any particular read or write may not actually appear outside of the CPU that
+issued it because the CPU was able to satisfy it from its own cache, it's still
+as if the memory access had taken place as far as the other CPUs are concerned
+since the cache coherency and ejection mechanisms will propagate the effects
+upon conflict.
+
+Consider the system logically as:
+
+            <--- CPU --->    :    <----------- Memory ----------->
+                             :
+ +--------+    +--------+    :    +--------+    +-----------+
+ |        |    |        |    :    |        |    |           |    +---------+
+ |  CPU   |    | Memory |    :    |  CPU   |    |           |    |         |
+ |  Core  |--->| Access |-------->| Cache  |<-->|           |    |         |
+ |        |    | Queue  |    :    |        |    |           |--->| Memory  |
+ |        |    |        |    :    |        |    |           |    |         |
+ +--------+    +--------+    :    +--------+    |           |    |         |
+                             :                  |   Cache   |    +---------+
+                             :                  | Coherency |
+                             :                  | Mechanism |    +---------+
+ +--------+    +--------+    :    +--------+    |           |    |         |
+ |        |    |        |    :    |        |    |           |    |         |
+ |  CPU   |    | Memory |    :    |  CPU   |    |           |--->| Device  |
+ |  Core  |--->| Access |-------->| Cache  |<-->|           |    |         |
+ |        |    | Queue  |    :    |        |    |           |    |         |
+ |        |    |        |    :    |        |    |           |    +---------+
+ +--------+    +--------+    :    +--------+    +-----------+
+                             :
+                             :
+
+The CPU core may execute instructions in any order it deems fit, provided the
+expected program causality appears to be maintained. Some of the instructions
+generate load and store operations which then go into the memory access queue
+to be performed. The core may place these in the queue in any order it wishes,
+and continue execution until it is forced to wait for an instruction to
+complete.
+
+What memory barriers are concerned with is controlling the order in which
+accesses cross from the CPU side of things to the memory side of things, and
+the order in which the effects are perceived to happen by the other observers
+in the system.
+
+
+Note that the above model does not show uncached memory or I/O accesses. These
+proceed directly from the queue to the memory or the devices, bypassing any
+cache coherency:
+
+            <--- CPU --->    :
+                             :           +-----+
+ +--------+    +--------+    :           |     |
+ |        |    |        |    :           |     |          +---------+
+ |  CPU   |    | Memory |    :           |     |          |         |
+ |  Core  |--->| Access |--------------->|     |          |         |
+ |        |    | Queue  |    :           |     |--------->| Memory  |
+ |        |    |        |    :           |     |          |         |
+ +--------+    +--------+    :           |     |          |         |
+                             :           |     |          +---------+
+                             :           | Bus |
+                             :           |     |          +---------+
+ +--------+    +--------+    :           |     |          |         |
+ |        |    |        |    :           |     |          |         |
+ |  CPU   |    | Memory |    :           |     |<-------->| Device  |
+ |  Core  |--->| Access |--------------->|     |          |         |
+ |        |    | Queue  |    :           |     |          |         |
+ |        |    |        |    :           |     |          +---------+
+ +--------+    +--------+    :           |     |
+                             :           +-----+
+                             :
+
+
 =========================
 WHAT ARE MEMORY BARRIERS?
 =========================
@@ -448,8 +526,8 @@ In all cases there are variants on a LOC
 
     The LOCK accesses will be completed before the UNLOCK accesses.
 
-And therefore an UNLOCK followed by a LOCK is equivalent to a full barrier, but
-a LOCK followed by an UNLOCK isn't.
+    Therefore an UNLOCK followed by a LOCK is equivalent to a full barrier,
+    but a LOCK followed by an UNLOCK is not.
 
 Locks and semaphores may not provide any guarantee of ordering on UP compiled
 systems, and so can't be counted on in such a situation to actually do anything

From osv at javad.com Thu Mar 9 23:02:15 2006
From: osv at javad.com (Sergei Organov)
Date: Thu, 09 Mar 2006 15:02:15 +0300
Subject: [PATCH] Document Linux's memory barriers [try #2]
References: <31492.1141753245@warthog.cambridge.redhat.com>
	<29826.1141828678@warthog.cambridge.redhat.com>
Message-ID: 

David Howells writes:
[...]
> +=======================================
> +LINUX KERNEL COMPILER BARRIER FUNCTIONS
> +=======================================
> +
> +The Linux kernel has an explicit compiler barrier function that prevents the
> +compiler from moving the memory accesses either side of it to the other side:
> +
> +	barrier();
> +
> +This has no direct effect on the CPU, which may then reorder things however it
> +wishes.
> +
> +In addition, accesses to "volatile" memory locations and volatile asm
> +statements act as implicit compiler barriers.

This last statement seems to contradict what the GCC manual says about
volatile asm statements:

  "You can prevent an `asm' instruction from being deleted by writing the
  keyword `volatile' after the `asm'. [...] The `volatile' keyword indicates
  that the instruction has important side-effects. GCC will not delete a
  volatile `asm' if it is reachable. (The instruction can still be deleted if
  GCC can prove that control-flow will never reach the location of the
  instruction.) *Note that even a volatile `asm' instruction can be moved
  relative to other code, including across jump instructions.*"

I think that volatile memory locations aren't compiler barriers either --
GCC only guarantees that it won't remove the access and that it won't
re-arrange the access w.r.t. other *volatile* accesses. On the other hand,
barrier() indeed prevents *any* memory access from being moved across the
barrier.

-- 
Sergei.

From ink at jurassic.park.msu.ru Fri Mar 10 03:02:01 2006
From: ink at jurassic.park.msu.ru (Ivan Kokshaysky)
Date: Thu, 9 Mar 2006 19:02:01 +0300
Subject: [PATCH] Document Linux's memory barriers [try #2]
In-Reply-To: <17423.32377.460820.710578@cargo.ozlabs.ibm.com>; from paulus@samba.org on Thu, Mar 09, 2006 at 12:01:45PM +1100
References: <20060308154157.GI7301@parisc-linux.org>
	<31492.1141753245@warthog.cambridge.redhat.com>
	<29826.1141828678@warthog.cambridge.redhat.com>
	<20060308145506.GA5095@devserv.devel.redhat.com>
	<10095.1141838381@warthog.cambridge.redhat.com>
	<17423.22121.254026.487964@cargo.ozlabs.ibm.com>
	<20060309020851.D9651@jurassic.park.msu.ru>
	<17423.32377.460820.710578@cargo.ozlabs.ibm.com>
Message-ID: <20060309190201.A19243@jurassic.park.msu.ru>

On Thu, Mar 09, 2006 at 12:01:45PM +1100, Paul Mackerras wrote:
> If you do:
>
> 	CPU 0			CPU 1
>
> 	foo = val;
> 	wmb();
> 	p = &foo;
> 				reg = p;
> 				bar = *reg;
>
> it is apparently possible for CPU 1 to see the new value of p
> (i.e. &foo) but an old value of foo (i.e. not val). This can happen
> if p and foo are in different halves of the cache on CPU 1, and there
> are a lot of updates coming in for the half containing foo but the
> half containing p is quiet.

Indeed, this can happen according to the architecture reference manual,
so CPU 1 needs mb() as well. Thanks for clarification.

Ivan.

From hch at lst.de Fri Mar 10 03:25:34 2006
From: hch at lst.de (Christoph Hellwig)
Date: Thu, 9 Mar 2006 17:25:34 +0100
Subject: asm/cputable.h sparse warnings
In-Reply-To: <20060309115036.0bb4bec4.sfr@canb.auug.org.au>
References: <20060308151939.GA12762@lst.de>
	<20060309115036.0bb4bec4.sfr@canb.auug.org.au>
Message-ID: <20060309162534.GA15777@lst.de>

On Thu, Mar 09, 2006 at 11:50:36AM +1100, Stephen Rothwell wrote:
> Hi Christoph,
>
> On Wed, 8 Mar 2006 16:19:39 +0100 Christoph Hellwig wrote:
> >
> > Looks like not many people are running sparse on powerpc ;-) For every
> > file compiled I get the churn of warnings below. The reason seems to be
> > that it's using large values in enums, something that's very murky in
> > the C standards, and gcc adds even less well-defined extensions to it
> > that make this code work in practice. I think the only sane fix is
> > to switch the cputype constants to cpp macros, although that'd make the
> > file a lot larger..
>
> Does the following patch help? I don't think we necessarily want to do this,
> but we could.
The first hunk (which has most of the changes) doesn't apply to mainline -
looks like something changed there in your tree.

From torvalds at osdl.org Fri Mar 10 03:32:50 2006
From: torvalds at osdl.org (Linus Torvalds)
Date: Thu, 9 Mar 2006 08:32:50 -0800 (PST)
Subject: [PATCH] Document Linux's memory barriers
In-Reply-To: <24280.1141904462@warthog.cambridge.redhat.com>
References: <1141855305.10606.6.camel@localhost.localdomain>
	<20060308161829.GC3669@elf.ucw.cz>
	<31492.1141753245@warthog.cambridge.redhat.com>
	<24309.1141848971@warthog.cambridge.redhat.com>
	<24280.1141904462@warthog.cambridge.redhat.com>
Message-ID: 

On Thu, 9 Mar 2006, David Howells wrote:
>
> So, you're saying that the LOCK and UNLOCK primitives don't actually modify
> memory, but rather simply pin the cacheline into the CPU's cache and refuse to
> let anyone else touch it?
>
> No... it can't work like that. It *must* make a memory modification - after
> all, the CPU doesn't know that what it's doing is a spin_unlock(), say, rather
> than an atomic_set().

Basically, as long as nobody else is reading the lock, the lock will stay
in the caches. Only old and stupid architectures go out to the bus for
locking.

For example, I remember the original alpha "load-locked"/"store-conditional",
and it was totally _horrible_ for anything that wanted performance, because
it would do the "pending lock" bit on the bus, so it took hundreds of cycles
even on UP. Gods, how I hated that. It made it almost totally useless for
anything that just wanted to be irq-safe - it was cheaper to just disable
interrupts, iirc. STUPID.

All modern CPU's do atomic operations entirely within the cache coherency
logic. I think x86 still support the notion of a "locked cycle" on the bus,
but I think that's entirely relegated to horrible people doing locked
operations across PCI, and quite frankly, I suspect that it doesn't actually
mean a thing (ie I'd expect no external hardware to actually react to the
lock signal). However, nobody really cares, since nobody would be crazy
enough to do locked cycles over PCI even if they were to work.

So in practice, as far as I know, the way _all_ modern CPU's do locked
cycles is that they do it by getting exclusive ownership on the cacheline on
the read, and either having logic in place to refuse to release the
cacheline until the write is complete (ie "locked cycles to the cache"), or
to re-try the instruction if the cacheline has been released by the time the
write is ready (ie "load-locked" + "store-conditional" + "potentially loop"
to the cache).

NOBODY goes out to the bus for locking any more. That would be insane and
stupid.

Yes, many spinlocks see contention, and end up going out to the bus. But
similarly, many spinlocks do _not_ see any contention at all (or other
CPU's even looking at them), and may end up staying exclusive in a CPU
cache for a long time.

The "no contention" case is actually pretty important. Many real loads on
SMP end up being largely single-threaded, and together with some basic CPU
affinity, you really _really_ want to make that single-threaded case go as
fast as possible. And a pretty big part of that is locking: the difference
between a lock that goes to the bus and one that does not is _huge_.

And lots of trivial code is almost dominated by locking costs. In some
system calls on an SMP kernel, the locking cost can be (depending on how
good or bad the CPU is at them) quite noticeable.
Just a simple small read() will take several locks and/or do atomic ops, even if it was cached and it looks "trivial". Linus From dhowells at redhat.com Fri Mar 10 04:39:05 2006 From: dhowells at redhat.com (David Howells) Date: Thu, 09 Mar 2006 17:39:05 +0000 Subject: [PATCH] Document Linux's memory barriers In-Reply-To: References: <1141855305.10606.6.camel@localhost.localdomain> <20060308161829.GC3669@elf.ucw.cz> <31492.1141753245@warthog.cambridge.redhat.com> <24309.1141848971@warthog.cambridge.redhat.com> <24280.1141904462@warthog.cambridge.redhat.com> Message-ID: <12101.1141925945@warthog.cambridge.redhat.com> Linus Torvalds wrote: > Basically, as long as nobody else is reading the lock, the lock will stay > in the caches. I think for the purposes of talking about memory barriers, we consider the cache to be part of the memory since the cache coherency mechanisms will give the same effect. I suppose the way the cache can be viewed as working is that bits of memory are shuttled around between the CPUs, RAM and any other devices that partake of the coherency mechanism. > All modern CPU's do atomic operations entirely within the cache coherency > logic. I know that, and I think it's irrelevant to specifying memory barriers. > I think x86 still support the notion of a "locked cycle" on the > bus, I wonder if that's what XCHG and XADD do... There's no particular reason they should be that much slower than LOCK INCL/DECL. Of course, I've only measured this on my Dual-PPro test box, so other i386 arch CPUs may exhibit other behaviour. David From torvalds at osdl.org Fri Mar 10 04:54:07 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Thu, 9 Mar 2006 09:54:07 -0800 (PST) Subject: [PATCH] Document Linux's memory barriers In-Reply-To: <12101.1141925945@warthog.cambridge.redhat.com> References: <1141855305.10606.6.camel@localhost.localdomain> <20060308161829.GC3669@elf.ucw.cz> <31492.1141753245@warthog.cambridge.redhat.com> <24309.1141848971@warthog.cambridge.redhat.com> <24280.1141904462@warthog.cambridge.redhat.com> <12101.1141925945@warthog.cambridge.redhat.com> Message-ID: On Thu, 9 Mar 2006, David Howells wrote: > > I think for the purposes of talking about memory barriers, we consider the > cache to be part of the memory since the cache coherency mechanisms will give > the same effect. Yes and no. The yes comes from the normal "smp_xxx()" barriers. As far as they are concerned, the cache coherency means that caches are invisible. The "no" comes from the IO side. Basically, since IO bypasses caches and sometimes write buffers, it's simply not ordered wrt normal accesses. And that's where "bus cycles" actually matter wrt barriers. If you have a barrier that creates a bus cycle, it suddenly can be ordered wrt IO. So the fact that x86 SMP ops basically never guarantee any bus cycles basically means that they are fundamentally no-ops when it comes to IO serialization. That was really my only point. > > I think x86 still support the notion of a "locked cycle" on the > > bus, > > I wonder if that's what XCHG and XADD do... There's no particular reason they > should be that much slower than LOCK INCL/DECL. Of course, I've only measured > this on my Dual-PPro test box, so other i386 arch CPUs may exhibit other > behaviour. I think it's an internal core implementation detail. I don't think they do anything on the bus, but I suspect that they could easily generate less optimized uops, simply because they didn't matter as much and didn't fit the "normal" core uop sequence. 
Linus

From torvalds at osdl.org Fri Mar 10 04:56:17 2006
From: torvalds at osdl.org (Linus Torvalds)
Date: Thu, 9 Mar 2006 09:56:17 -0800 (PST)
Subject: [PATCH] Document Linux's memory barriers
In-Reply-To: 
References: <1141855305.10606.6.camel@localhost.localdomain>
	<20060308161829.GC3669@elf.ucw.cz>
	<31492.1141753245@warthog.cambridge.redhat.com>
	<24309.1141848971@warthog.cambridge.redhat.com>
	<24280.1141904462@warthog.cambridge.redhat.com>
	<12101.1141925945@warthog.cambridge.redhat.com>
Message-ID: 

On Thu, 9 Mar 2006, Linus Torvalds wrote:
>
> So the fact that x86 SMP ops basically never guarantee any bus cycles
> basically means that they are fundamentally no-ops when it comes to IO
> serialization. That was really my only point.

Side note: of course, locked cycles _do_ "serialize" the core. So they'll
stop at least the core write merging, and speculative reads. So they do
have some impact on IO, but they have no way of impacting things like
write posting etc that is outside the CPU.

		Linus

From dhowells at redhat.com Fri Mar 10 07:29:22 2006
From: dhowells at redhat.com (David Howells)
Date: Thu, 09 Mar 2006 20:29:22 +0000
Subject: [PATCH] Document Linux's memory barriers [try #4]
Message-ID: <16835.1141936162@warthog.cambridge.redhat.com>

The attached patch documents the Linux kernel's memory barriers. I've updated
it from the comments I've been given.

The per-arch notes sections are gone because it's clear that there are so many
exceptions, that it's not worth having them.

I've added a list of references to other documents.

I've tried to get rid of the concept of memory accesses appearing on the bus;
what matters is apparent behaviour with respect to other observers in the
system.

Interrupts' barrier effects are now considered to be non-existent. They may be
there, but you may not rely on them.

I've added a couple of definition sections at the top of the document: one to
specify the minimum execution model that may be assumed, the other to specify
what this document refers to by the term "memory".

Signed-Off-By: David Howells 
---
warthog>diffstat -p1 /tmp/mb.diff
 Documentation/memory-barriers.txt |  855 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 855 insertions(+)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
new file mode 100644
index 0000000..04c5c88
--- /dev/null
+++ b/Documentation/memory-barriers.txt
@@ -0,0 +1,855 @@
+			 ============================
+			 LINUX KERNEL MEMORY BARRIERS
+			 ============================
+
+Contents:
+
+ (*) Assumed minimum execution ordering model.
+
+ (*) What is considered memory?
+
+     - Cached interactions.
+     - Uncached interactions.
+
+ (*) What are memory barriers?
+
+ (*) Where are memory barriers needed?
+
+     - Accessing devices.
+     - Multiprocessor interaction.
+     - Interrupts.
+
+ (*) Explicit kernel compiler barriers.
+
+ (*) Explicit kernel memory barriers.
+
+ (*) Implicit kernel memory barriers.
+
+     - Locking functions.
+     - Interrupt disabling functions.
+     - Miscellaneous functions.
+
+ (*) Inter-CPU locking barrier effects.
+
+     - Locks vs memory accesses.
+     - Locks vs I/O accesses.
+
+ (*) Kernel I/O barrier effects.
+
+ (*) References.
+
+
+========================================
+ASSUMED MINIMUM EXECUTION ORDERING MODEL
+========================================
+
+It has to be assumed that the conceptual CPU is weakly-ordered in all respects
+but that it will maintain the appearance of program causality with respect to
+itself.
Some CPUs (such as i386 or x86_64) are more constrained than others
+(such as powerpc or frv), and so the worst case must be assumed.
+
+This means that it must be considered that the CPU will execute its instruction
+stream in any order it feels like - or even in parallel - provided that if an
+instruction in the stream depends on an earlier instruction, then that
+earlier instruction must be sufficiently complete[*] before the later
+instruction may proceed.
+
+ [*] Some instructions have more than one effect[**] and different instructions
+     may depend on different effects.
+
+ [**] Eg: changes to condition codes and registers; memory changes; barriers.
+
+A CPU may also discard any instruction sequence that ultimately winds up having
+no effect. For example if two adjacent instructions both load an immediate
+value into the same register, the first may be discarded.
+
+
+Similarly, it has to be assumed that the compiler might reorder the instruction
+stream in any way it sees fit, again provided the appearance of causality is
+maintained.
+
+
+==========================
+WHAT IS CONSIDERED MEMORY?
+==========================
+
+For the purpose of this specification what's meant by "memory" needs to be
+defined, and the division between CPU and memory needs to be marked out.
+
+
+CACHED INTERACTIONS
+-------------------
+
+As far as cached CPU vs CPU[*] interactions go, "memory" has to include the CPU
+caches in the system. Although any particular read or write may not actually
+appear outside of the CPU that issued it (the CPU may have been able to
+satisfy it from its own cache), it's still as if the memory access had taken
+place as far as the other CPUs are concerned since the cache coherency and
+ejection mechanisms will propagate the effects upon conflict.
+
+ [*] Also applies to CPU vs device when accessed through a cache.
+
+The system can be considered logically as:
+
+            <--- CPU --->    :    <----------- Memory ----------->
+                             :
+ +--------+    +--------+    :    +--------+    +-----------+
+ |        |    |        |    :    |        |    |           |    +---------+
+ |  CPU   |    | Memory |    :    |  CPU   |    |           |    |         |
+ |  Core  |--->| Access |-------->| Cache  |<-->|           |    |         |
+ |        |    | Queue  |    :    |        |    |           |--->| Memory  |
+ |        |    |        |    :    |        |    |           |    |         |
+ +--------+    +--------+    :    +--------+    |           |    |         |
+                             :                  |   Cache   |    +---------+
+                             :                  | Coherency |
+                             :                  | Mechanism |    +---------+
+ +--------+    +--------+    :    +--------+    |           |    |         |
+ |        |    |        |    :    |        |    |           |    |         |
+ |  CPU   |    | Memory |    :    |  CPU   |    |           |--->| Device  |
+ |  Core  |--->| Access |-------->| Cache  |<-->|           |    |         |
+ |        |    | Queue  |    :    |        |    |           |    |         |
+ |        |    |        |    :    |        |    |           |    +---------+
+ +--------+    +--------+    :    +--------+    +-----------+
+                             :
+                             :
+
+The CPU core may execute instructions in any order it deems fit, provided the
+expected program causality appears to be maintained. Some of the instructions
+generate load and store operations which then go into the memory access queue
+to be performed. The core may place these in the queue in any order it wishes,
+and continue execution until it is forced to wait for an instruction to
+complete.
+
+What memory barriers are concerned with is controlling the order in which
+accesses cross from the CPU side of things to the memory side of things, and
+the order in which the effects are perceived to happen by the other observers
+in the system.
+
+
+UNCACHED INTERACTIONS
+---------------------
+
+Note that the above model does not show uncached memory or I/O accesses.
These
+proceed directly from the queue to the memory or the devices, bypassing any
+cache coherency:
+
+            <--- CPU --->    :
+                             :           +-----+
+ +--------+    +--------+    :           |     |
+ |        |    |        |    :           |     |          +---------+
+ |  CPU   |    | Memory |    :           |     |          |         |
+ |  Core  |--->| Access |--------------->|     |          |         |
+ |        |    | Queue  |    :           |     |--------->| Memory  |
+ |        |    |        |    :           |     |          |         |
+ +--------+    +--------+    :           |     |          |         |
+                             :           |     |          +---------+
+                             :           | Bus |
+                             :           |     |          +---------+
+ +--------+    +--------+    :           |     |          |         |
+ |        |    |        |    :           |     |          |         |
+ |  CPU   |    | Memory |    :           |     |<-------->| Device  |
+ |  Core  |--->| Access |--------------->|     |          |         |
+ |        |    | Queue  |    :           |     |          |         |
+ |        |    |        |    :           |     |          +---------+
+ +--------+    +--------+    :           |     |
+                             :           +-----+
+                             :
+
+
+=========================
+WHAT ARE MEMORY BARRIERS?
+=========================
+
+Memory barriers are instructions to both the compiler and the CPU to impose an
+apparent partial ordering between the memory access operations specified either
+side of the barrier. They request that the sequence of memory events generated
+appears to other components of the system as if the barrier is effective on
+that CPU.
+
+Note that:
+
+ (*) there's no guarantee that the sequence of memory events is _actually_ so
+     ordered. It's possible for the CPU to do out-of-order accesses _as long
+     as no-one is looking_, and then fix up the memory if someone else tries to
+     see what's going on (for instance a bus master device); what matters is
+     the _apparent_ order as far as other processors and devices are concerned;
+     and
+
+ (*) memory barriers are only guaranteed to act within the CPU processing them,
+     and are not, for the most part, guaranteed to percolate down to other CPUs
+     in the system or to any I/O hardware that that CPU may communicate with.
+
+
+For example, a programmer might take it for granted that the CPU will perform
+memory accesses in exactly the order specified, so that if a CPU is given the
+following piece of code:
+
+	a = *A;
+	*B = b;
+	c = *C;
+	d = *D;
+	*E = e;
+
+They would then expect that the CPU will complete the memory access for each
+instruction before moving on to the next one, leading to a definite sequence of
+operations as seen by external observers in the system:
+
+	read *A, write *B, read *C, read *D, write *E.
+
+
+Reality is, of course, much messier. With many CPUs and compilers, this isn't
+always true because:
+
+ (*) reads are more likely to need to be completed immediately to permit
+     execution progress, whereas writes can often be deferred without a
+     problem;
+
+ (*) reads can be done speculatively, and then the result discarded should it
+     prove not to be required;
+
+ (*) the order of the memory accesses may be rearranged to promote better use
+     of the CPU buses and caches;
+
+ (*) reads and writes may be combined to improve performance when talking to
+     the memory or I/O hardware that can do batched accesses of adjacent
+     locations, thus cutting down on transaction setup costs (memory and PCI
+     devices may be able to do this); and
+
+ (*) the CPU's data cache may affect the ordering, though cache-coherency
+     mechanisms should alleviate this - once the write has actually hit the
+     cache.
+
+So what another CPU, say, might actually observe from the above piece of code
+is:
+
+	read *A, read {*C,*D}, write *E, write *B
+
+	(By "read {*C,*D}" I mean a combined single read).
+
+
+It is also guaranteed that a CPU will be self-consistent: it will see its _own_
+accesses appear to be correctly ordered, without the need for a memory barrier.
+For instance with the following code: + + X = *A; + *A = Y; + Z = *A; + +assuming no intervention by an external influence, it can be taken that: + + (*) X will hold the old value of *A, and will never happen after the write and + thus end up being given the value that was assigned to *A from Y instead; + and + + (*) Z will always be given the value in *A that was assigned there from Y, and + will never happen before the write, and thus end up with the same value + that was in *A initially. + +(This is ignoring the fact that the value initially in *A may appear to be the +same as the value assigned to *A from Y). + + +================================= +WHERE ARE MEMORY BARRIERS NEEDED? +================================= + +Under normal operation, access reordering is probably not going to be a problem +as a linear program will still appear to operate correctly. There are, +however, three circumstances where reordering definitely _could_ be a problem: + + +ACCESSING DEVICES +----------------- + +Many devices can be memory mapped, and so appear to the CPU as if they're just +memory locations. However, to control the device, the driver has to make the +right accesses in exactly the right order. + +Consider, for example, an ethernet chipset such as the AMD PCnet32. It +presents to the CPU an "address register" and a bunch of "data registers". The +way it's accessed is to write the index of the internal register to be accessed +to the address register, and then read or write the appropriate data register +to access the chip's internal register, which could - theoretically - be done +by: + + *ADR = ctl_reg_3; + reg = *DATA; + +The problem with a clever CPU or a clever compiler is that the write to the +address register isn't guaranteed to happen before the access to the data +register, if the CPU or the compiler thinks it is more efficient to defer the +address write: + + read *DATA, write *ADR + +then things will break. + + +In the Linux kernel, however, I/O should be done through the appropriate +accessor routines - such as inb() or writel() - which know how to make such +accesses appropriately sequential. + +On some systems, I/O writes are not strongly ordered across all CPUs, and so +locking should be used, and mmiowb() should be issued prior to unlocking the +critical section. + +See Documentation/DocBook/deviceiobook.tmpl for more information. + + +MULTIPROCESSOR INTERACTION +-------------------------- + +When there's a system with more than one processor, the CPUs in the system may +be working on the same set of data at the same time. This can cause +synchronisation problems, and the usual way of dealing with them is to use +locks - but locks are quite expensive, and so it may be preferable to operate +without the use of a lock if at all possible. In such a case accesses that +affect both CPUs may have to be carefully ordered to prevent error. + +Consider the R/W semaphore slow path. In that, a waiting process is queued on +the semaphore, as noted by it having a record on its stack linked to the +semaphore's list: + + struct rw_semaphore { + ... 
+		struct list_head	waiters;
+	};
+
+	struct rwsem_waiter {
+		struct list_head	list;
+		struct task_struct	*task;
+	};
+
+To wake up the waiter, the up_read() or up_write() functions have to read the
+pointer from this record to know where the next waiter record is, clear the
+task pointer, call wake_up_process() on the task, and release the reference
+held on the waiter's task struct:
+
+	READ waiter->list.next;
+	READ waiter->task;
+	WRITE waiter->task;
+	CALL wakeup
+	RELEASE task
+
+If any of these steps occur out of order, then the whole thing may fail.
+
+Note that the waiter does not get the semaphore lock again - it just waits for
+its task pointer to be cleared. Since the record is on its stack, this means
+that if the task pointer is cleared _before_ the next pointer in the list is
+read, another CPU might start processing the waiter and it might clobber its
+stack before the up*() functions have a chance to read the next pointer.
+
+	CPU 0				CPU 1
+	===============================	===============================
+	down_xxx()
+	Queue waiter
+	Sleep
+					up_yyy()
+					READ waiter->task;
+					WRITE waiter->task;
+	Resume processing
+	down_xxx() returns
+	call foo()
+	foo() clobbers *waiter
+					READ waiter->list.next;
+					--- OOPS ---
+
+This could be dealt with using a spinlock, but then the down_xxx() function has
+to get the spinlock again after it's been woken up, which is a waste of
+resources.
+
+The way to deal with this is to insert an SMP memory barrier:
+
+	READ waiter->list.next;
+	READ waiter->task;
+	smp_mb();
+	WRITE waiter->task;
+	CALL wakeup
+	RELEASE task
+
+In this case, the barrier makes a guarantee that all memory accesses before the
+barrier will appear to happen before all the memory accesses after the barrier
+with respect to the other CPUs on the system. It does _not_ guarantee that all
+the memory accesses before the barrier will be complete by the time the barrier
+itself is complete.
+
+SMP memory barriers are normally nothing more than compiler barriers on a
+kernel compiled for a UP system because the CPU orders overlapping accesses
+with respect to itself, and so CPU barriers aren't needed.
+
+
+INTERRUPTS
+----------
+
+A driver may be interrupted by its own interrupt service routine, and thus the
+two parts of the driver may interfere with each other's attempts to control or
+access the device.
+
+This may be alleviated - at least in part - by disabling interrupts (a form of
+locking), such that the critical operations are all contained within the
+interrupt-disabled section in the driver. Whilst the driver's interrupt
+routine is executing, the driver's core may not run on the same CPU, and its
+interrupt is not permitted to happen again until the current interrupt has been
+handled, thus the interrupt handler does not need to lock against that.
+
+However, consider a driver talking to an ethernet card that sports an address
+register and a data register. If that driver's core talks to the card under
+interrupt-disablement and then the driver's interrupt handler is invoked:
+
+	DISABLE IRQ
+	writew(ADDR, ctl_reg_3);
+	writew(DATA, y);
+	ENABLE IRQ
+
+	writew(ADDR, ctl_reg_4);
+	q = readw(DATA);
+
+
+If ordering rules are sufficiently relaxed, the write to the data register
+might happen after the second write to the address register.
+
+
+It must be assumed that accesses done inside an interrupt disabled section may
+leak outside of it and may interleave with accesses performed in an interrupt
+and vice versa unless implicit or explicit barriers are used.
+
+Normally this won't be a problem because the I/O accesses done inside such
+sections will include synchronous read operations on strictly ordered I/O
+registers that form implicit I/O barriers. If this isn't sufficient then an
+mmiowb() may need to be used explicitly.
+
+
+A similar situation may occur between an interrupt routine and two routines
+running on separate CPUs that communicate with each other. If such a case is
+likely, then interrupt-disabling locks should be used to guarantee ordering.
+
+
+=================================
+EXPLICIT KERNEL COMPILER BARRIERS
+=================================
+
+The Linux kernel has an explicit compiler barrier function that prevents the
+compiler from moving the memory accesses either side of it to the other side:
+
+	barrier();
+
+This has no direct effect on the CPU, which may then reorder things however it
+wishes.
+
+
+In addition, accesses to "volatile" memory locations and volatile asm
+statements act as implicit compiler barriers. Note, however, that the use of
+volatile has two negative consequences:
+
+ (1) it causes the generation of poorer code, and
+
+ (2) it can affect serialisation of events in code distant from the declaration
+     (consider a structure defined in a header file that has a volatile member
+     being accessed by the code in a source file).
+
+The Linux coding style therefore strongly favours the use of explicit barriers
+except in small and specific cases. In general, volatile should be avoided.
+
+
+===============================
+EXPLICIT KERNEL MEMORY BARRIERS
+===============================
+
+The Linux kernel has six basic CPU memory barriers:
+
+		MANDATORY	SMP CONDITIONAL
+		===============	===============
+	GENERAL	mb()		smp_mb()
+	READ	rmb()		smp_rmb()
+	WRITE	wmb()		smp_wmb()
+
+General memory barriers give a guarantee that all memory accesses specified
+before the barrier will appear to happen before all memory accesses specified
+after the barrier with respect to the other components of the system.
+
+Read and write memory barriers give similar guarantees, but only for memory
+reads versus memory reads and memory writes versus memory writes respectively.
+
+All memory barriers imply compiler barriers.
+
+SMP memory barriers are only compiler barriers on uniprocessor compiled systems
+because it is assumed that a CPU will be apparently self-consistent, and will
+order overlapping accesses correctly with respect to itself.
+
+There is no guarantee that any of the memory accesses specified before a memory
+barrier will be complete by the completion of a memory barrier; the barrier can
+be considered to draw a line in that CPU's access queue that accesses of the
+appropriate type may not cross.
+
+There is no guarantee that issuing a memory barrier on one CPU will have any
+direct effect on another CPU or any other hardware in the system. The indirect
+effect will be the order in which the second CPU sees the first CPU's accesses
+occur.
+
+There is no guarantee that some intervening piece of off-the-CPU hardware[*]
+will not reorder the memory accesses. CPU cache coherency mechanisms should
+propagate the indirect effects of a memory barrier between CPUs.
+
+ [*] For information on bus mastering DMA and coherency please read:
+
+     Documentation/pci.txt
+     Documentation/DMA-mapping.txt
+     Documentation/DMA-API.txt
+
+Note that these are the _minimum_ guarantees. Different architectures may give
+more substantial guarantees, but they may not be relied upon outside of arch
+specific code.
+
+
+There are some more advanced barrier functions:
+
+ (*) set_mb(var, value)
+ (*) set_wmb(var, value)
+
+     These assign the value to the variable and then insert at least a write
+     barrier after it, depending on the function. They aren't guaranteed to
+     insert anything more than a compiler barrier in a UP compilation.
+
+
+===============================
+IMPLICIT KERNEL MEMORY BARRIERS
+===============================
+
+Some of the other functions in the linux kernel imply memory barriers, amongst
+which are locking and scheduling functions.
+
+This specification is a _minimum_ guarantee; any particular architecture may
+provide more substantial guarantees, but these may not be relied upon outside
+of arch specific code.
+
+
+LOCKING FUNCTIONS
+-----------------
+
+All the following locking functions imply barriers:
+
+ (*) spin locks
+ (*) R/W spin locks
+ (*) mutexes
+ (*) semaphores
+ (*) R/W semaphores
+
+In all cases there are variants on a LOCK operation and an UNLOCK operation.
+
+ (*) LOCK operation implication:
+
+     Memory accesses issued after the LOCK will be completed after the LOCK
+     accesses have completed.
+
+     Memory accesses issued before the LOCK may be completed after the LOCK
+     accesses have completed.
+
+ (*) UNLOCK operation implication:
+
+     Memory accesses issued before the UNLOCK will be completed before the
+     UNLOCK accesses have completed.
+
+     Memory accesses issued after the UNLOCK may be completed before the UNLOCK
+     accesses have completed.
+
+ (*) LOCK vs UNLOCK implication:
+
+     The LOCK accesses will be completed before the UNLOCK accesses.
+
+     Therefore an UNLOCK followed by a LOCK is equivalent to a full barrier,
+     but a LOCK followed by an UNLOCK is not.
+
+Locks and semaphores may not provide any guarantee of ordering on UP compiled
+systems, and so can't be counted on in such a situation to actually do anything
+at all, especially with respect to I/O accesses, unless combined with interrupt
+disabling operations.
+
+See also the section on "Inter-CPU locking barrier effects".
+
+
+As an example, consider the following:
+
+	*A = a;
+	*B = b;
+	LOCK
+	*C = c;
+	*D = d;
+	UNLOCK
+	*E = e;
+	*F = f;
+
+The following sequence of events is acceptable:
+
+	LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK
+
+But none of the following are:
+
+	{*F,*A}, *B,	LOCK, *C, *D,	UNLOCK, *E
+	*A, *B, *C,	LOCK, *D,	UNLOCK, *E, *F
+	*A, *B,		LOCK, *C,	UNLOCK, *D, *E, *F
+	*B,		LOCK, *C, *D,	UNLOCK, {*F,*A}, *E
+
+
+INTERRUPT DISABLING FUNCTIONS
+-----------------------------
+
+Functions that disable interrupts (LOCK equivalent) and enable interrupts
+(UNLOCK equivalent) will act as compiler barriers only. So if memory or I/O
+barriers are required in such a situation, they must be provided by some
+other means.
+
+
+MISCELLANEOUS FUNCTIONS
+-----------------------
+
+Other functions that imply barriers:
+
+ (*) schedule() and similar imply full memory barriers.
+
+
+=================================
+INTER-CPU LOCKING BARRIER EFFECTS
+=================================
+
+On SMP systems locking primitives give a more substantial form of barrier: one
+that does affect memory access ordering on other CPUs, within the context of
+conflict on any particular lock.
+
+
+LOCKS VS MEMORY ACCESSES
+------------------------
+
+Consider the following: the system has a pair of spinlocks (M) and (Q), and
+three CPUs; then should the following sequence of events occur:
+
+	CPU 1				CPU 2
+	===============================	===============================
+	*A = a;				*E = e;
+	LOCK M				LOCK Q
+	*B = b;				*F = f;
+	*C = c;				*G = g;
+	UNLOCK M			UNLOCK Q
+	*D = d;				*H = h;
+
+Then there is no guarantee as to what order CPU #3 will see the accesses to *A
+through *H occur in, other than the constraints imposed by the separate locks
+on the separate CPUs. It might, for example, see:
+
+	*E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M
+
+But it won't see any of:
+
+	*B, *C or *D preceding LOCK M
+	*A, *B or *C following UNLOCK M
+	*F, *G or *H preceding LOCK Q
+	*E, *F or *G following UNLOCK Q
+
+
+However, if the following occurs:
+
+	CPU 1				CPU 2
+	===============================	===============================
+	*A = a;
+	LOCK M		[1]
+	*B = b;
+	*C = c;
+	UNLOCK M	[1]
+	*D = d;				*E = e;
+					LOCK M		[2]
+					*F = f;
+					*G = g;
+					UNLOCK M	[2]
+					*H = h;
+
+CPU #3 might see:
+
+	*E, LOCK M [1], *C, *B, *A, UNLOCK M [1],
+		LOCK M [2], *H, *F, *G, UNLOCK M [2], *D
+
+But assuming CPU #1 gets the lock first, it won't see any of:
+
+	*B, *C, *D, *F, *G or *H preceding LOCK M [1]
+	*A, *B or *C following UNLOCK M [1]
+	*F, *G or *H preceding LOCK M [2]
+	*A, *B, *C, *E, *F or *G following UNLOCK M [2]
+
+
+LOCKS VS I/O ACCESSES
+---------------------
+
+Under certain circumstances (such as NUMA), I/O accesses within two spinlocked
+sections on two different CPUs may be seen as interleaved by the PCI bridge.
+
+For example:
+
+	CPU 1				CPU 2
+	===============================	===============================
+	spin_lock(Q)
+	writel(0, ADDR)
+	writel(1, DATA);
+	spin_unlock(Q);
+					spin_lock(Q);
+					writel(4, ADDR);
+					writel(5, DATA);
+					spin_unlock(Q);
+
+may be seen by the PCI bridge as follows:
+
+	WRITE *ADDR = 0, WRITE *ADDR = 4, WRITE *DATA = 1, WRITE *DATA = 5
+
+which would probably break.
+
+What is necessary here is to insert an mmiowb() before dropping the spinlock,
+for example:
+
+	CPU 1				CPU 2
+	===============================	===============================
+	spin_lock(Q)
+	writel(0, ADDR)
+	writel(1, DATA);
+	mmiowb();
+	spin_unlock(Q);
+					spin_lock(Q);
+					writel(4, ADDR);
+					writel(5, DATA);
+					mmiowb();
+					spin_unlock(Q);
+
+this will ensure that the two writes issued on CPU #1 appear at the PCI bridge
+before either of the writes issued on CPU #2.
+
+
+Furthermore, following a write by a read to the same device is okay, because
+the read forces the write to complete before the read is performed:
+
+	CPU 1				CPU 2
+	===============================	===============================
+	spin_lock(Q)
+	writel(0, ADDR)
+	a = readl(DATA);
+	spin_unlock(Q);
+					spin_lock(Q);
+					writel(4, ADDR);
+					b = readl(DATA);
+					spin_unlock(Q);
+
+
+See Documentation/DocBook/deviceiobook.tmpl for more information.
+
+
+==========================
+KERNEL I/O BARRIER EFFECTS
+==========================
+
+When accessing I/O memory, drivers should use the appropriate accessor
+functions:
+
+ (*) inX(), outX():
+
+     These are intended to talk to I/O space rather than memory space, but
+     that's primarily a CPU-specific concept. The i386 and x86_64 processors do
+     indeed have special I/O space access cycles and instructions, but many
+     CPUs don't have such a concept.
+
+     The PCI bus, amongst others, defines an I/O space concept - which on such
+     CPUs as i386 and x86_64 readily maps to the CPU's concept of I/O space.
+     However, it may also be mapped as a virtual I/O space in the CPU's memory
+     map, particularly on those CPUs that don't support alternate I/O spaces.
+
+     Accesses to this space may be fully synchronous (as on i386), but
+     intermediary bridges (such as the PCI host bridge) may not fully honour
+     that.
+
+     They are guaranteed to be fully ordered with respect to each other.
+
+     They are not guaranteed to be fully ordered with respect to other types of
+     memory and I/O operation.
+
+ (*) readX(), writeX():
+
+     Whether these are guaranteed to be fully ordered and uncombined with
+     respect to each other on the issuing CPU depends on the characteristics
+     defined for the memory window through which they're accessing. On later
+     i386 architecture machines, for example, this is controlled by way of the
+     MTRR registers.
+
+     Ordinarily, these will be guaranteed to be fully ordered and uncombined,
+     provided they're not accessing a prefetchable device.
+
+     However, intermediary hardware (such as a PCI bridge) may indulge in
+     deferral if it so wishes; to flush a write, a read from the same location
+     is preferred[*], but a read from the same device or from configuration
+     space should suffice for PCI.
+
+     [*] NOTE! attempting to read from the same location as was written to may
+         cause a malfunction - consider the 16550 Rx/Tx serial registers for
+         example.
+
+     Used with prefetchable I/O memory, an mmiowb() barrier may be required to
+     force writes to be ordered.
+
+     Please refer to the PCI specification for more information on interactions
+     between PCI transactions.
+
+ (*) readX_relaxed()
+
+     These are similar to readX(), but are not guaranteed to be ordered in any
+     way. Be aware that there is no I/O read barrier available.
+
+ (*) ioreadX(), iowriteX()
+
+     These will perform as appropriate for the type of access they're actually
+     doing, be it inX()/outX() or readX()/writeX().
+
+
+==========
+REFERENCES
+==========
+
+AMD64 Architecture Programmer's Manual Volume 2: System Programming
+	Chapter 7.1: Memory-Access Ordering
+	Chapter 7.4: Buffering and Combining Memory Writes
+
+IA-32 Intel Architecture Software Developer's Manual, Volume 3:
+System Programming Guide
+	Chapter 7.1: Locked Atomic Operations
+	Chapter 7.2: Memory Ordering
+	Chapter 7.4: Serializing Instructions
+
+The SPARC Architecture Manual, Version 9
+	Chapter 8: Memory Models
+	Appendix D: Formal Specification of the Memory Models
+	Appendix J: Programming with the Memory Models
+
+UltraSPARC Programmer Reference Manual
+	Chapter 5: Memory Accesses and Cacheability
+	Chapter 15: Sparc-V9 Memory Models
+
+UltraSPARC III Cu User's Manual
+	Chapter 9: Memory Models
+
+UltraSPARC IIIi Processor User's Manual
+	Chapter 8: Memory Models
+
+UltraSPARC Architecture 2005
+	Chapter 9: Memory
+	Appendix D: Formal Specifications of the Memory Models
+
+UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005
+	Chapter 8: Memory Models
+	Appendix F: Caches and Cache Coherency
+
+Solaris Internals, Core Kernel Architecture, p63-68:
+	Chapter 3.3: Hardware Considerations for Locks and
+			Synchronization
+
+Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching
+for Kernel Programmers:
+	Chapter 13: Other Memory Models
+
+Intel Itanium Architecture Software Developer's Manual: Volume 1:
+	Section 2.6: Speculation
+	Section 4.4: Memory Access

From paulus at samba.org Fri Mar 10 10:34:53 2006
From: paulus at samba.org (Paul Mackerras)
Date: Fri, 10 Mar 2006 10:34:53 +1100
Subject: [PATCH] Document Linux's memory barriers [try #4]
In-Reply-To: <16835.1141936162@warthog.cambridge.redhat.com>
References: <16835.1141936162@warthog.cambridge.redhat.com>
Message-ID: <17424.48029.481013.502855@cargo.ozlabs.ibm.com>

David Howells writes:

> +On some systems, I/O writes are not strongly ordered across all CPUs, and so
> +locking should be used, and mmiowb() should be issued prior to unlocking the
> +critical section.

I think we should say more strongly that mmiowb() is required where
MMIO accesses are done under a spinlock, and that if your driver is
missing them then that is a bug. I don't think it makes sense to say
that mmiowb is required "on some systems". At least, we should either
make that statement, or we should not require any driver on any
platform to use mmiowb explicitly. (In that case, the platforms that
need it could do something like keep a per-cpu count of MMIO accesses,
which is zeroed in spin_lock and incremented by read*/write*, and if
spin_unlock finds it non-zero, it does the mmiowb().)

Also, this section doesn't sound right to me:

> +	DISABLE IRQ
> +	writew(ADDR, ctl_reg_3);
> +	writew(DATA, y);
> +	ENABLE IRQ
> +
> +	writew(ADDR, ctl_reg_4);
> +	q = readw(DATA);
> +
> +
> +If ordering rules are sufficiently relaxed, the write to the data register
> +might happen after the second write to the address register.
> +
> +
> +It must be assumed that accesses done inside an interrupt disabled section may
> +leak outside of it and may interleave with accesses performed in an interrupt
> +and vice versa unless implicit or explicit barriers are used.
> +
> +Normally this won't be a problem because the I/O accesses done inside such
> +sections will include synchronous read operations on strictly ordered I/O
> +registers that form implicit I/O barriers. If this isn't sufficient then an
> +mmiowb() may need to be used explicitly.
There shouldn't be any problem here, because readw/writew _must_ ensure that the device accesses are serialized. Just saying "if this isn't sufficient" leaves the reader wondering when it might not be sufficient or how they would know when it wasn't sufficient, and introduces doubt where there needn't be any. Of course, on an SMP system it would be quite possible for the interrupt to be taken on another CPU, and in that case disabling interrupts (I assume that by "DISABLE IRQ" you mean local_irq_disable() or some such) gets you absolutely nothing; you need to use a spinlock, and then the mmiowb is required. You may like to include these words describing some of the rules: * If you have stores to regular memory, followed by an MMIO store, and you want the device to see the stores to regular memory at the point where it receives the MMIO store, then you need a wmb() between the stores to regular memory and the MMIO store. * If you have PIO or MMIO accesses, and you need to ensure the PIO/MMIO accesses don't get reordered with respect to PIO/MMIO accesses on another CPU, put the accesses inside a spin-locked region, and put a mmiowb() between the last access and the spin_unlock. * smp_wmb() doesn't necessarily do any ordering of MMIO accesses vs. other accesses, and in that sense it is weaker than wmb(). Paul. From mbuesch at freenet.de Fri Mar 10 10:45:09 2006 From: mbuesch at freenet.de (Michael Buesch) Date: Fri, 10 Mar 2006 00:45:09 +0100 Subject: [PATCH] Document Linux's memory barriers [try #4] In-Reply-To: <17424.48029.481013.502855@cargo.ozlabs.ibm.com> References: <16835.1141936162@warthog.cambridge.redhat.com> <17424.48029.481013.502855@cargo.ozlabs.ibm.com> Message-ID: <200603100045.10375.mbuesch@freenet.de> On Friday 10 March 2006 00:34, you wrote: > David Howells writes: > > > +On some systems, I/O writes are not strongly ordered across all CPUs, and so > > +locking should be used, and mmiowb() should be issued prior to unlocking the > > +critical section. > > I think we should say more strongly that mmiowb() is required where > MMIO accesses are done under a spinlock, and that if your driver is > missing them then that is a bug. I don't think it makes sense to say > that mmiowb is required "on some systems". So what about: #define spin_lock_mmio(lock) spin_lock(lock) #define spin_unlock_mmio(lock) do { spin_unlock(lock); mmiowb(); } while (0) #define spin_lock_mmio_irqsave(lock, flags) spin_lock_irqsave(lock, flags) #define spin_unlock_mmio_irqrestore(lock, flags) do { spin_unlock_irqrestore(lock, flags); mmiowb(); } while (0) -- Greetings Michael. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060310/83ab9987/attachment.pgp From torvalds at osdl.org Fri Mar 10 10:56:35 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Thu, 9 Mar 2006 15:56:35 -0800 (PST) Subject: [PATCH] Document Linux's memory barriers [try #4] In-Reply-To: <200603100045.10375.mbuesch@freenet.de> References: <16835.1141936162@warthog.cambridge.redhat.com> <17424.48029.481013.502855@cargo.ozlabs.ibm.com> <200603100045.10375.mbuesch@freenet.de> Message-ID: On Fri, 10 Mar 2006, Michael Buesch wrote: > > So what about: > > #define spin_lock_mmio(lock) spin_lock(lock) > #define spin_unlock_mmio(lock) do { spin_unlock(lock); mmiowb(); } while (0) You need to put the mmiowb() inside the spinlock. Yes, that is painful. 
But the point being that if it's outside, then when somebody else gets the
lock, the previous lock-owner's MMIO stores may still be in flight, which is
what you didn't want in the first place.

Anyway, no need to make a new name for it, since you might as well just
use the mmiowb() explicitly. At least until this has been shown to be a
really common pattern (it clearly isn't, right now ;)

		Linus

From mbuesch at freenet.de  Fri Mar 10 11:07:46 2006
From: mbuesch at freenet.de (Michael Buesch)
Date: Fri, 10 Mar 2006 01:07:46 +0100
Subject: [PATCH] Document Linux's memory barriers [try #4]
In-Reply-To: 
References: <16835.1141936162@warthog.cambridge.redhat.com>
	<200603100045.10375.mbuesch@freenet.de>
Message-ID: <200603100107.46655.mbuesch@freenet.de>

On Friday 10 March 2006 00:56, you wrote:
> 
> On Fri, 10 Mar 2006, Michael Buesch wrote:
> > 
> > So what about:
> > 
> > #define spin_lock_mmio(lock) spin_lock(lock)
> > #define spin_unlock_mmio(lock) do { spin_unlock(lock); mmiowb(); } while (0)
> 
> You need to put the mmiowb() inside the spinlock.

Ok, sorry. That was a typo. I should not do
more than 3 things at the same time. :)

> Yes, that is painful. But the point being that if it's outside, then when
> somebody else gets the lock, the previous lock-owner's MMIO stores may
> still be in flight, which is what you didn't want in the first place.
> 
> Anyway, no need to make a new name for it, since you might as well just
> use the mmiowb() explicitly. At least until this has been shown to be a
> really common pattern (it clearly isn't, right now ;)

Ok, so maybe it is best if every device creates its own macros
for convenience (if needed => if it is a common pattern
in the scope of the driver).
Example:

#define bcm43xx_lock(bcm, flags)	spin_lock_irqsave(&(bcm)->lock, flags)
#define bcm43xx_unlock(bcm, flags)	do { mmiowb(); spin_unlock_irqrestore(&(bcm)->lock, flags); } while (0)

-- 
Greetings Michael.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060310/983b546e/attachment.pgp
From alan at redhat.com  Fri Mar 10 11:48:15 2006
From: alan at redhat.com (Alan Cox)
Date: Thu, 9 Mar 2006 19:48:15 -0500
Subject: [PATCH] Document Linux's memory barriers [try #4]
In-Reply-To: <17424.48029.481013.502855@cargo.ozlabs.ibm.com>
References: <16835.1141936162@warthog.cambridge.redhat.com>
	<17424.48029.481013.502855@cargo.ozlabs.ibm.com>
Message-ID: <20060310004815.GD24904@devserv.devel.redhat.com>

On Fri, Mar 10, 2006 at 10:34:53AM +1100, Paul Mackerras wrote:
> MMIO accesses are done under a spinlock, and that if your driver is
> missing them then that is a bug.  I don't think it makes sense to say
> that mmiowb is required "on some systems".

Agreed. But if it is missing it may not be a bug. It depends what the lock
actually protects.
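For reference, the driver-side pattern the thread converges on looks roughly
like the following minimal sketch. The device structure, the lock, and the
register offsets here are invented purely for illustration and are not taken
from any real driver:

	#include <linux/spinlock.h>
	#include <linux/types.h>
	#include <asm/io.h>

	struct mydev {
		spinlock_t lock;	/* also orders our MMIO stores vs. other CPUs */
		void __iomem *regs;
	};

	static void mydev_submit(struct mydev *dev, u32 data)
	{
		unsigned long flags;

		spin_lock_irqsave(&dev->lock, flags);
		writel(data, dev->regs + 0x04);	/* hypothetical DATA register */
		writel(1, dev->regs + 0x00);	/* hypothetical GO register */
		mmiowb();	/* inside the lock: stores reach the device before */
		spin_unlock_irqrestore(&dev->lock, flags);	/* ...the next owner's */
	}

On most platforms mmiowb() expands to nothing; on platforms where MMIO stores
from different CPUs can be reordered between the CPUs and the host bridge, it
is what stops a second CPU's stores, issued after it takes the lock, from
reaching the device ahead of the first CPU's.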
From paulus at samba.org Fri Mar 10 11:54:19 2006 From: paulus at samba.org (Paul Mackerras) Date: Fri, 10 Mar 2006 11:54:19 +1100 Subject: [PATCH] Document Linux's memory barriers [try #4] In-Reply-To: <20060310004815.GD24904@devserv.devel.redhat.com> References: <16835.1141936162@warthog.cambridge.redhat.com> <17424.48029.481013.502855@cargo.ozlabs.ibm.com> <20060310004815.GD24904@devserv.devel.redhat.com> Message-ID: <17424.52795.142571.746064@cargo.ozlabs.ibm.com> Alan Cox writes: > On Fri, Mar 10, 2006 at 10:34:53AM +1100, Paul Mackerras wrote: > > MMIO accesses are done under a spinlock, and that if your driver is > > missing them then that is a bug. I don't think it makes sense to say > > that mmiowb is required "on some systems". > > Agreed. But if it is missing it may not be a bug. It depends what the lock > actually protects. True. What I want is a statement that if one of the purposes of the spinlock is to provide ordering of the MMIO accesses, then leaving out the mmiowb is a bug. I want it to be like the PCI DMA API in that drivers are required to use it even on platforms where it's a no-op. Paul. From sfr at canb.auug.org.au Fri Mar 10 12:00:49 2006 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Fri, 10 Mar 2006 12:00:49 +1100 Subject: asm/cputable.h sparse warnings In-Reply-To: <20060309162534.GA15777@lst.de> References: <20060308151939.GA12762@lst.de> <20060309115036.0bb4bec4.sfr@canb.auug.org.au> <20060309162534.GA15777@lst.de> Message-ID: <20060310120049.4b5ab37d.sfr@canb.auug.org.au> On Thu, 9 Mar 2006 17:25:34 +0100 Christoph Hellwig wrote: > > The first hunk (which has most of the changes) doesn't apply to > mainline, looks like something changes there in your tree. The patch was against the current powerpc tree. Here is one against current mainline. 
-- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ diff --git a/include/asm-powerpc/cputable.h b/include/asm-powerpc/cputable.h index 5638518..ba6f557 100644 --- a/include/asm-powerpc/cputable.h +++ b/include/asm-powerpc/cputable.h @@ -186,153 +186,154 @@ extern void do_cpu_ftr_fixups(unsigned l !defined(CONFIG_POWER3) && !defined(CONFIG_POWER4) && \ !defined(CONFIG_BOOKE)) -enum { - CPU_FTRS_PPC601 = CPU_FTR_COMMON | CPU_FTR_601 | CPU_FTR_HPTE_TABLE, - CPU_FTRS_603 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | - CPU_FTR_MAYBE_CAN_NAP, - CPU_FTRS_604 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | CPU_FTR_604_PERF_MON | CPU_FTR_HPTE_TABLE, - CPU_FTRS_740_NOTAU = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | - CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP, - CPU_FTRS_740 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | - CPU_FTR_TAU | CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP, - CPU_FTRS_750 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | - CPU_FTR_TAU | CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP, - CPU_FTRS_750FX1 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | - CPU_FTR_TAU | CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP | - CPU_FTR_DUAL_PLL_750FX | CPU_FTR_NO_DPM, - CPU_FTRS_750FX2 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | - CPU_FTR_TAU | CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP | - CPU_FTR_NO_DPM, - CPU_FTRS_750FX = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | - CPU_FTR_TAU | CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP | - CPU_FTR_DUAL_PLL_750FX | CPU_FTR_HAS_HIGH_BATS, - CPU_FTRS_750GX = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_MAYBE_CAN_DOZE | - CPU_FTR_USE_TB | CPU_FTR_L2CR | CPU_FTR_TAU | - CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP | - CPU_FTR_DUAL_PLL_750FX | CPU_FTR_HAS_HIGH_BATS, - CPU_FTRS_7400_NOTAU = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | - CPU_FTR_ALTIVEC_COMP | CPU_FTR_HPTE_TABLE | - CPU_FTR_MAYBE_CAN_NAP, - CPU_FTRS_7400 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | - CPU_FTR_TAU | CPU_FTR_ALTIVEC_COMP | CPU_FTR_HPTE_TABLE | - CPU_FTR_MAYBE_CAN_NAP, - CPU_FTRS_7450_20 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | - CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | - CPU_FTR_NEED_COHERENT, - CPU_FTRS_7450_21 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | - CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | - CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | - CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_L3_DISABLE_NAP | - CPU_FTR_NEED_COHERENT, - CPU_FTRS_7450_23 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | - CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | - CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | - CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_NEED_COHERENT, - CPU_FTRS_7455_1 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | - CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | CPU_FTR_L3CR | - CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | CPU_FTR_HAS_HIGH_BATS | - CPU_FTR_NEED_COHERENT, - CPU_FTRS_7455_20 = CPU_FTR_COMMON | 
CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | - CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | - CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | - CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_L3_DISABLE_NAP | - CPU_FTR_NEED_COHERENT | CPU_FTR_HAS_HIGH_BATS, - CPU_FTRS_7455 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | - CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | - CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | - CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_HAS_HIGH_BATS | - CPU_FTR_NEED_COHERENT, - CPU_FTRS_7447_10 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | - CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | - CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | - CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_HAS_HIGH_BATS | - CPU_FTR_NEED_COHERENT | CPU_FTR_NO_BTIC, - CPU_FTRS_7447 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | - CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | - CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | - CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_HAS_HIGH_BATS | - CPU_FTR_NEED_COHERENT, - CPU_FTRS_7447A = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | - CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | - CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | - CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_HAS_HIGH_BATS | - CPU_FTR_NEED_COHERENT, - CPU_FTRS_82XX = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB, - CPU_FTRS_G2_LE = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_MAYBE_CAN_DOZE | - CPU_FTR_USE_TB | CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_HAS_HIGH_BATS, - CPU_FTRS_E300 = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_MAYBE_CAN_DOZE | - CPU_FTR_USE_TB | CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_HAS_HIGH_BATS | - CPU_FTR_COMMON, - CPU_FTRS_CLASSIC32 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | CPU_FTR_HPTE_TABLE, - CPU_FTRS_POWER3_32 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | CPU_FTR_HPTE_TABLE, - CPU_FTRS_POWER4_32 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | CPU_FTR_HPTE_TABLE | CPU_FTR_NODSISRALIGN, - CPU_FTRS_970_32 = CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | - CPU_FTR_USE_TB | CPU_FTR_HPTE_TABLE | CPU_FTR_ALTIVEC_COMP | - CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_NODSISRALIGN, - CPU_FTRS_8XX = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB, - CPU_FTRS_40X = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | - CPU_FTR_NODSISRALIGN, - CPU_FTRS_44X = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | - CPU_FTR_NODSISRALIGN, - CPU_FTRS_E200 = CPU_FTR_USE_TB | CPU_FTR_NODSISRALIGN, - CPU_FTRS_E500 = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | - CPU_FTR_NODSISRALIGN, - CPU_FTRS_E500_2 = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | - CPU_FTR_BIG_PHYS | CPU_FTR_NODSISRALIGN, - CPU_FTRS_GENERIC_32 = CPU_FTR_COMMON | CPU_FTR_NODSISRALIGN, +#define CPU_FTRS_PPC601 (CPU_FTR_COMMON | CPU_FTR_601 | CPU_FTR_HPTE_TABLE) +#define CPU_FTRS_603 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | \ + CPU_FTR_MAYBE_CAN_NAP) +#define CPU_FTRS_604 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | CPU_FTR_604_PERF_MON | CPU_FTR_HPTE_TABLE) +#define CPU_FTRS_740_NOTAU (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | \ + CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP) +#define CPU_FTRS_740 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | \ + CPU_FTR_TAU | CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP) +#define 
CPU_FTRS_750 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | \ + CPU_FTR_TAU | CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP) +#define CPU_FTRS_750FX1 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | \ + CPU_FTR_TAU | CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP | \ + CPU_FTR_DUAL_PLL_750FX | CPU_FTR_NO_DPM) +#define CPU_FTRS_750FX2 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | \ + CPU_FTR_TAU | CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP | \ + CPU_FTR_NO_DPM) +#define CPU_FTRS_750FX (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | \ + CPU_FTR_TAU | CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP | \ + CPU_FTR_DUAL_PLL_750FX | CPU_FTR_HAS_HIGH_BATS) +#define CPU_FTRS_750GX (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_MAYBE_CAN_DOZE | \ + CPU_FTR_USE_TB | CPU_FTR_L2CR | CPU_FTR_TAU | \ + CPU_FTR_HPTE_TABLE | CPU_FTR_MAYBE_CAN_NAP | \ + CPU_FTR_DUAL_PLL_750FX | CPU_FTR_HAS_HIGH_BATS) +#define CPU_FTRS_7400_NOTAU (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | \ + CPU_FTR_ALTIVEC_COMP | CPU_FTR_HPTE_TABLE | \ + CPU_FTR_MAYBE_CAN_NAP) +#define CPU_FTRS_7400 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | CPU_FTR_L2CR | \ + CPU_FTR_TAU | CPU_FTR_ALTIVEC_COMP | CPU_FTR_HPTE_TABLE | \ + CPU_FTR_MAYBE_CAN_NAP) +#define CPU_FTRS_7450_20 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | \ + CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | \ + CPU_FTR_NEED_COHERENT) +#define CPU_FTRS_7450_21 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | \ + CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | \ + CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | \ + CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_L3_DISABLE_NAP | \ + CPU_FTR_NEED_COHERENT) +#define CPU_FTRS_7450_23 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | \ + CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | \ + CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | \ + CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_NEED_COHERENT) +#define CPU_FTRS_7455_1 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | \ + CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | CPU_FTR_L3CR | \ + CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | CPU_FTR_HAS_HIGH_BATS | \ + CPU_FTR_NEED_COHERENT) +#define CPU_FTRS_7455_20 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | \ + CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | \ + CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | \ + CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_L3_DISABLE_NAP | \ + CPU_FTR_NEED_COHERENT | CPU_FTR_HAS_HIGH_BATS) +#define CPU_FTRS_7455 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | \ + CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | \ + CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | \ + CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_HAS_HIGH_BATS | \ + CPU_FTR_NEED_COHERENT) +#define CPU_FTRS_7447_10 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | \ + CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | \ + CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | \ + CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_HAS_HIGH_BATS | \ + CPU_FTR_NEED_COHERENT | CPU_FTR_NO_BTIC) +#define CPU_FTRS_7447 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + 
CPU_FTR_USE_TB | \ + CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | \ + CPU_FTR_L3CR | CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | \ + CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_HAS_HIGH_BATS | \ + CPU_FTR_NEED_COHERENT) +#define CPU_FTRS_7447A (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | \ + CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_L2CR | CPU_FTR_ALTIVEC_COMP | \ + CPU_FTR_HPTE_TABLE | CPU_FTR_SPEC7450 | \ + CPU_FTR_NAP_DISABLE_L2_PR | CPU_FTR_HAS_HIGH_BATS | \ + CPU_FTR_NEED_COHERENT) +#define CPU_FTRS_82XX (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB) +#define CPU_FTRS_G2_LE (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_MAYBE_CAN_DOZE | \ + CPU_FTR_USE_TB | CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_HAS_HIGH_BATS) +#define CPU_FTRS_E300 (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_MAYBE_CAN_DOZE | \ + CPU_FTR_USE_TB | CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_HAS_HIGH_BATS | \ + CPU_FTR_COMMON) +#define CPU_FTRS_CLASSIC32 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | CPU_FTR_HPTE_TABLE) +#define CPU_FTRS_POWER3_32 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | CPU_FTR_HPTE_TABLE) +#define CPU_FTRS_POWER4_32 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | CPU_FTR_HPTE_TABLE | CPU_FTR_NODSISRALIGN) +#define CPU_FTRS_970_32 (CPU_FTR_COMMON | CPU_FTR_SPLIT_ID_CACHE | \ + CPU_FTR_USE_TB | CPU_FTR_HPTE_TABLE | CPU_FTR_ALTIVEC_COMP | \ + CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_NODSISRALIGN) +#define CPU_FTRS_8XX (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB) +#define CPU_FTRS_40X (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | \ + CPU_FTR_NODSISRALIGN) +#define CPU_FTRS_44X (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | \ + CPU_FTR_NODSISRALIGN) +#define CPU_FTRS_E200 (CPU_FTR_USE_TB | CPU_FTR_NODSISRALIGN) +#define CPU_FTRS_E500 (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | \ + CPU_FTR_NODSISRALIGN) +#define CPU_FTRS_E500_2 (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | \ + CPU_FTR_BIG_PHYS | CPU_FTR_NODSISRALIGN) +#define CPU_FTRS_GENERIC_32 (CPU_FTR_COMMON | CPU_FTR_NODSISRALIGN) #ifdef __powerpc64__ - CPU_FTRS_POWER3 = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | - CPU_FTR_HPTE_TABLE | CPU_FTR_IABR, - CPU_FTRS_RS64 = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | - CPU_FTR_HPTE_TABLE | CPU_FTR_IABR | - CPU_FTR_MMCRA | CPU_FTR_CTRL, - CPU_FTRS_POWER4 = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | - CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_MMCRA, - CPU_FTRS_PPC970 = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | - CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | - CPU_FTR_ALTIVEC_COMP | CPU_FTR_CAN_NAP | CPU_FTR_MMCRA, - CPU_FTRS_POWER5 = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | - CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | - CPU_FTR_MMCRA | CPU_FTR_SMT | - CPU_FTR_COHERENT_ICACHE | CPU_FTR_LOCKLESS_TLBIE | - CPU_FTR_MMCRA_SIHV, - CPU_FTRS_CELL = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | - CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | - CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | - CPU_FTR_CTRL | CPU_FTR_PAUSE_ZERO, - CPU_FTRS_COMPATIBLE = CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | - CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2, +#define CPU_FTRS_POWER3 (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | \ + CPU_FTR_HPTE_TABLE | CPU_FTR_IABR) +#define CPU_FTRS_RS64 (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | \ + CPU_FTR_HPTE_TABLE | CPU_FTR_IABR | \ + CPU_FTR_MMCRA | CPU_FTR_CTRL) +#define CPU_FTRS_POWER4 (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | \ + CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_MMCRA) +#define CPU_FTRS_PPC970 (CPU_FTR_SPLIT_ID_CACHE | 
CPU_FTR_USE_TB | \ + CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | \ + CPU_FTR_ALTIVEC_COMP | CPU_FTR_CAN_NAP | CPU_FTR_MMCRA) +#define CPU_FTRS_POWER5 (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | \ + CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | \ + CPU_FTR_MMCRA | CPU_FTR_SMT | \ + CPU_FTR_COHERENT_ICACHE | CPU_FTR_LOCKLESS_TLBIE | \ + CPU_FTR_MMCRA_SIHV) +#define CPU_FTRS_CELL (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | \ + CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | \ + CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \ + CPU_FTR_CTRL | CPU_FTR_PAUSE_ZERO) +#define CPU_FTRS_COMPATIBLE (CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | \ + CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2) #endif - CPU_FTRS_POSSIBLE = #ifdef __powerpc64__ - CPU_FTRS_POWER3 | CPU_FTRS_RS64 | CPU_FTRS_POWER4 | - CPU_FTRS_PPC970 | CPU_FTRS_POWER5 | CPU_FTRS_CELL | - CPU_FTR_CI_LARGE_PAGE | +#define CPU_FTRS_POSSIBLE \ + (CPU_FTRS_POWER3 | CPU_FTRS_RS64 | CPU_FTRS_POWER4 | \ + CPU_FTRS_PPC970 | CPU_FTRS_POWER5 | CPU_FTRS_CELL | \ + CPU_FTR_CI_LARGE_PAGE) #else +enum { + CPU_FTRS_POSSIBLE = #if CLASSIC_PPC CPU_FTRS_PPC601 | CPU_FTRS_603 | CPU_FTRS_604 | CPU_FTRS_740_NOTAU | CPU_FTRS_740 | CPU_FTRS_750 | CPU_FTRS_750FX1 | @@ -366,14 +367,18 @@ enum { #ifdef CONFIG_E500 CPU_FTRS_E500 | CPU_FTRS_E500_2 | #endif -#endif /* __powerpc64__ */ 0, +}; +#endif /* __powerpc64__ */ - CPU_FTRS_ALWAYS = #ifdef __powerpc64__ - CPU_FTRS_POWER3 & CPU_FTRS_RS64 & CPU_FTRS_POWER4 & - CPU_FTRS_PPC970 & CPU_FTRS_POWER5 & CPU_FTRS_CELL & +#define CPU_FTRS_ALWAYS \ + (CPU_FTRS_POWER3 & CPU_FTRS_RS64 & CPU_FTRS_POWER4 & \ + CPU_FTRS_PPC970 & CPU_FTRS_POWER5 & CPU_FTRS_CELL & \ + CPU_FTRS_POSSIBLE) #else +enum { + CPU_FTRS_ALWAYS = #if CLASSIC_PPC CPU_FTRS_PPC601 & CPU_FTRS_603 & CPU_FTRS_604 & CPU_FTRS_740_NOTAU & CPU_FTRS_740 & CPU_FTRS_750 & CPU_FTRS_750FX1 & @@ -407,9 +412,9 @@ enum { #ifdef CONFIG_E500 CPU_FTRS_E500 & CPU_FTRS_E500_2 & #endif -#endif /* __powerpc64__ */ CPU_FTRS_POSSIBLE, }; +#endif /* __powerpc64__ */ static inline int cpu_has_feature(unsigned long feature) { From michael at ellerman.id.au Fri Mar 10 15:01:08 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Fri, 10 Mar 2006 15:01:08 +1100 Subject: [PATCH] powerpc: Clarify wording for CRASH_DUMP Kconfig option Message-ID: <20060310040136.5C973679E7@ozlabs.org> The wording of the CRASH_DUMP Kconfig option is not very clear. It gives you a kernel that can be used _as_ the kdump kernel, not a kernel that can boot into a kdump kernel. Signed-off-by: Michael Ellerman --- arch/powerpc/Kconfig | 2 +- 1 files changed, 1 insertion(+), 1 deletion(-) Index: to-merge/arch/powerpc/Kconfig =================================================================== --- to-merge.orig/arch/powerpc/Kconfig +++ to-merge/arch/powerpc/Kconfig @@ -607,7 +607,7 @@ config KEXEC strongly in flux, so no good recommendation can be made. config CRASH_DUMP - bool "kernel crash dumps (EXPERIMENTAL)" + bool "Build a kdump crash kernel (EXPERIMENTAL)" depends on PPC_MULTIPLATFORM && PPC64 && EXPERIMENTAL help Build a kernel suitable for use as a kdump capture kernel. 
From nickpiggin at yahoo.com.au  Fri Mar 10 16:28:51 2006
From: nickpiggin at yahoo.com.au (Nick Piggin)
Date: Fri, 10 Mar 2006 16:28:51 +1100
Subject: [PATCH] Document Linux's memory barriers [try #4]
In-Reply-To: <16835.1141936162@warthog.cambridge.redhat.com>
References: <16835.1141936162@warthog.cambridge.redhat.com>
Message-ID: <44110E93.8060504@yahoo.com.au>

David Howells wrote:

> +==========================
> +WHAT IS CONSIDERED MEMORY?
> +==========================
> +
> +For the purpose of this specification what's meant by "memory" needs to be
> +defined, and the division between CPU and memory needs to be marked out.
> +
> +
> +CACHED INTERACTIONS
> +-------------------
> +
> +As far as cached CPU vs CPU[*] interactions go, "memory" has to include the CPU
> +caches in the system. Although any particular read or write may not actually
> +appear outside of the CPU that issued it (the CPU may have been able to
> +satisfy it from its own cache), it's still as if the memory access had taken
> +place as far as the other CPUs are concerned since the cache coherency and
> +ejection mechanisms will propagate the effects upon conflict.
> +

Aren't the Alpha's split caches a counter-example of your model,
because the coherency itself is out of order?

Why do you need to include caches and queues in your model? Do
programmers care? Isn't the following sufficient...

	          :  | m |
	CPU ----->   | e |
	          :  | m |
	          :  | o |
	CPU ----->   | r |
	          :  | y |

... and bugger the implementation details?

> + [*] Also applies to CPU vs device when accessed through a cache.
> +
> +The system can be considered logically as:
> +
> +	    <--- CPU --->          :      <----------- Memory ----------->
> +	                           :
> +	+--------+    +--------+   :   +--------+    +-----------+
> +	|        |    |        |   :   |        |    |           |    +---------+
> +	|  CPU   |    | Memory |   :   |  CPU   |    |           |    |         |
> +	|  Core  |--->| Access |------>|  Cache |<-->|           |    |         |
> +	|        |    | Queue  |   :   |        |    |           |--->|  Memory |
> +	|        |    |        |   :   |        |    |           |    |         |
> +	+--------+    +--------+   :   +--------+    |           |    |         |
> +	                           :                 |   Cache   |    +---------+
> +	                           :                 | Coherency |
> +	                           :                 | Mechanism |    +---------+
> +	+--------+    +--------+   :   +--------+    |           |    |         |
> +	|        |    |        |   :   |        |    |           |    |         |
> +	|  CPU   |    | Memory |   :   |        |    |           |--->|  Device |
> +	|  Core  |--->| Access |------>|  Cache |<-->|           |    |         |
> +	|        |    | Queue  |   :   |        |    |           |    |         |
> +	+--------+    +--------+   :   +--------+    +-----------+    +---------+
> +	                           :
> +	                           :

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

From dhowells at redhat.com  Sat Mar 11 02:19:10 2006
From: dhowells at redhat.com (David Howells)
Date: Fri, 10 Mar 2006 15:19:10 +0000
Subject: [PATCH] Document Linux's memory barriers [try #4]
In-Reply-To: <17424.48029.481013.502855@cargo.ozlabs.ibm.com>
References: <17424.48029.481013.502855@cargo.ozlabs.ibm.com>
	<16835.1141936162@warthog.cambridge.redhat.com>
Message-ID: <26486.1142003950@warthog.cambridge.redhat.com>

Paul Mackerras wrote:

> > +On some systems, I/O writes are not strongly ordered across all CPUs, and so
> > +locking should be used, and mmiowb() should be issued prior to unlocking the
> > +critical section.
> 
> I think we should say more strongly that mmiowb() is required where
> MMIO accesses are done under a spinlock, and that if your driver is
> missing them then that is a bug.  I don't think it makes sense to say
> that mmiowb is required "on some systems".

The point I was trying to make was that on some systems writes are not
strongly ordered, so we need mmiowb() on _all_ systems.  I'll fix the text to
make that point.
> There shouldn't be any problem here, because readw/writew _must_
> ensure that the device accesses are serialized.

No.  That depends on the properties of the memory window readw/writew write
through, the properties of the CPU wrt memory accesses, and what explicit
barriers are interpolated inside readw/writew themselves.

If we're accessing a frame buffer, for instance, we might want it to be able
to reorder and combine reads and writes.

> Of course, on an SMP system it would be quite possible for the
> interrupt to be taken on another CPU, and in that case disabling
> interrupts (I assume that by "DISABLE IRQ" you mean
> local_irq_disable() or some such)

Yes.  There are quite a few different ways to disable interrupts.

> gets you absolutely nothing; you need to use a spinlock, and then the mmiowb
> is required.

I believe I've said that, though perhaps not sufficiently clearly.

> You may like to include these words describing some of the rules:

Thanks, I probably will.

David

From hch at lst.de  Sat Mar 11 04:11:42 2006
From: hch at lst.de (Christoph Hellwig)
Date: Fri, 10 Mar 2006 18:11:42 +0100
Subject: asm/cputable.h sparse warnings
In-Reply-To: <20060310120049.4b5ab37d.sfr@canb.auug.org.au>
References: <20060308151939.GA12762@lst.de>
	<20060309115036.0bb4bec4.sfr@canb.auug.org.au>
	<20060309162534.GA15777@lst.de>
	<20060310120049.4b5ab37d.sfr@canb.auug.org.au>
Message-ID: <20060310171142.GA12664@lst.de>

On Fri, Mar 10, 2006 at 12:00:49PM +1100, Stephen Rothwell wrote:
> On Thu, 9 Mar 2006 17:25:34 +0100 Christoph Hellwig wrote:
> >
> > The first hunk (which has most of the changes) doesn't apply to
> > mainline, looks like something changes there in your tree.
> 
> The patch was against the current powerpc tree.  Here is one against
> current mainline.

Thanks.  This one makes all these warnings go away.

From hch at lst.de  Sat Mar 11 06:15:25 2006
From: hch at lst.de (Christoph Hellwig)
Date: Fri, 10 Mar 2006 20:15:25 +0100
Subject: asm/cputable.h sparse warnings
In-Reply-To: <20060308151939.GA12762@lst.de>
References: <20060308151939.GA12762@lst.de>
Message-ID: <20060310191524.GA14818@lst.de>

For some reason I got all the replies to my mail only directly and not
through the list as well.  Could it be that mailman on ozlabs.org is
misconfigured?

From markh at osdl.org  Sat Mar 11 06:23:51 2006
From: markh at osdl.org (Mark Haverkamp)
Date: Fri, 10 Mar 2006 11:23:51 -0800
Subject: asm/cputable.h sparse warnings
In-Reply-To: <20060310191524.GA14818@lst.de>
References: <20060308151939.GA12762@lst.de>
	<20060310191524.GA14818@lst.de>
Message-ID: <1142018631.3700.13.camel@markh3.pdx.osdl.net>

On Fri, 2006-03-10 at 20:15 +0100, Christoph Hellwig wrote:
> For some reason I got all the replies to my mail only directly and not
> through the list as well.  Could it be that mailman on ozlabs.org is
> misconfigured?

I can't seem to get to the web page just now, but I think that there is
a preference setting to eliminate duplicate mail which may be on by
default.
> _______________________________________________
> Linuxppc64-dev mailing list
> Linuxppc64-dev at ozlabs.org
> https://ozlabs.org/mailman/listinfo/linuxppc64-dev
-- 
Mark Haverkamp

From hch at lst.de  Sat Mar 11 06:53:52 2006
From: hch at lst.de (Christoph Hellwig)
Date: Fri, 10 Mar 2006 20:53:52 +0100
Subject: asm/cputable.h sparse warnings
In-Reply-To: <1142018631.3700.13.camel@markh3.pdx.osdl.net>
References: <20060308151939.GA12762@lst.de>
	<20060310191524.GA14818@lst.de>
	<1142018631.3700.13.camel@markh3.pdx.osdl.net>
Message-ID: <20060310195352.GA15498@lst.de>

On Fri, Mar 10, 2006 at 11:23:51AM -0800, Mark Haverkamp wrote:
> On Fri, 2006-03-10 at 20:15 +0100, Christoph Hellwig wrote:
> > For some reason I got all the replies to my mail only directly and not
> > through the list as well.  Could it be that mailman on ozlabs.org is
> > misconfigured?
> 
> I can't seem to get to the web page just now, but I think that there is
> a preference setting to eliminate duplicate mail which may be on by
> default.

Yes, I know that this setting exists.  But until recently mailman on
ozlabs.org didn't misbehave in such a way, and switching it on silently
for everyone would be really bad.  It's nice to have such an option for
those who want it, but breaking existing setups with it by default is
really bad.

From sfr at ozlabs.org  Sat Mar 11 10:38:03 2006
From: sfr at ozlabs.org (Stephen Rothwell)
Date: Sat, 11 Mar 2006 10:38:03 +1100 (EST)
Subject: asm/cputable.h sparse warnings
In-Reply-To: <20060310191524.GA14818@lst.de>
Message-ID: <20060310233803.D205A679E7@ozlabs.org>

Hi Christoph,

> For some reason I got all the replies to my mail only directly and not
> through the list as well.  Could it be that mailman on ozlabs.org is
> misconfigured?

We have records in our logs of those messages being accepted by your
mail server (verein.lst.de) after being sent by the mailing list.  I can
send you the logs if you like.

Cheers,
Stephen Rothwell

From sfr at ozlabs.org  Sat Mar 11 10:41:44 2006
From: sfr at ozlabs.org (Stephen Rothwell)
Date: Sat, 11 Mar 2006 10:41:44 +1100 (EST)
Subject: asm/cputable.h sparse warnings
In-Reply-To: <1142018631.3700.13.camel@markh3.pdx.osdl.net>
Message-ID: <20060310234144.8A34E679E7@ozlabs.org>

> I can't seem to get to the web page just now, but I think that there is
> a preference setting to eliminate duplicate mail which may be on by
> default.

Which makes it all the more intriguing :-)

Cheers,
Stephen Rothwell

From paulus at samba.org  Sat Mar 11 11:01:53 2006
From: paulus at samba.org (Paul Mackerras)
Date: Sat, 11 Mar 2006 11:01:53 +1100
Subject: [PATCH] Document Linux's memory barriers [try #4]
In-Reply-To: <26486.1142003950@warthog.cambridge.redhat.com>
References: <17424.48029.481013.502855@cargo.ozlabs.ibm.com>
	<16835.1141936162@warthog.cambridge.redhat.com>
	<26486.1142003950@warthog.cambridge.redhat.com>
Message-ID: <17426.4977.893926.803202@cargo.ozlabs.ibm.com>

David Howells writes:

> Paul Mackerras wrote:
> 
> > There shouldn't be any problem here, because readw/writew _must_
> > ensure that the device accesses are serialized.
> 
> No.  That depends on the properties of the memory window readw/writew write
> through, the properties of the CPU wrt memory accesses, and what explicit
> barriers are interpolated inside readw/writew themselves.

The properties of the memory window are certainly relevant.  For a
non-prefetchable PCI MMIO region, the readw/writew must ensure that
the accesses are serialized w.r.t.
each other, although not necessarily serialized with accesses to
normal memory.  That is a requirement that the driver writer can rely
on, and the implementor of readw/writew must ensure is met, taking
into account the properties of the CPU (presumably by putting explicit
barriers inside readw/writew).

For prefetchable regions, or if the cookie used with readw/writew has
been obtained by something other than the normal ioremap{,_nocache},
then it's more of an open question.

> > Of course, on an SMP system it would be quite possible for the
> > interrupt to be taken on another CPU, and in that case disabling
> > interrupts (I assume that by "DISABLE IRQ" you mean
> > local_irq_disable() or some such)
> 
> Yes.  There are quite a few different ways to disable interrupts.

I think it wasn't clear to me which of the following you meant:

(a) Telling this CPU not to take any interrupts (e.g. local_irq_disable())

(b) Telling the interrupt controller not to allow interrupts from that
    device (e.g. disable_irq(irq_num))

(c) Telling the device not to generate interrupts in some
    device-specific fashion

They all have different characteristics w.r.t. timing and
synchronization, so I think it's important to be clear which one you
mean.  For example, if it's (c), then after doing a writel (or
whatever) to the device, you then need at least to do a readl to make
sure the write has got to the device, and even then there might be an
interrupt signal still wending its way through the interrupt
controller etc., which might arrive after the readl has finished.

I think you meant (a), but DISABLE_IRQ actually sounds more like (b).

Paul.

From benh at kernel.crashing.org  Sun Mar 12 10:22:19 2006
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sun, 12 Mar 2006 10:22:19 +1100
Subject: [PATCH] powerpc: add for_each_node_by_foo helpers
In-Reply-To: <20060308154700.GA15859@lst.de>
References: <20060308154700.GA15859@lst.de>
Message-ID: <1142119340.4057.38.camel@localhost.localdomain>

On Wed, 2006-03-08 at 16:47 +0100, Christoph Hellwig wrote:
> Typical use for of_find_node_by_name and of_find_node_by_type is to
> iterate over all nodes of a given type/name.  Add a helper macro to
> do that (in spirit of the list_for_each* macros).

Looks good. Maybe you have noted however that we aren't entirely safe
yet vs. device-tree being dynamic however. There is a per-node lock, but
no lock protecting the global node chain, which as usual means iterating
the list isn't entirely safe vs. removal...

I've been wondering about using some of the new klist stuff for that...

Ben.

From benh at kernel.crashing.org  Sun Mar 12 10:25:29 2006
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sun, 12 Mar 2006 10:25:29 +1100
Subject: 2.6.15.6 on G5/1.8 (9,1)
In-Reply-To: <440F8C48.6000706@cisco.com>
References: <440F8C48.6000706@cisco.com>
Message-ID: <1142119529.4057.42.camel@localhost.localdomain>

On Wed, 2006-03-08 at 21:00 -0500, Keith Mitchell wrote:
> Hi,
> 
> I have a bunch of different powermac machines that I am trying to
> upgrade and/or install and am having some difficulty with the 1.8 (9,1)
> powermacs as well as the newer Dual Core (2.0) machines.  The other two
> types of machines that I have seem to be working well enough (Dual-Proc
> 2.0, Dual-Proc 2.7 -- Both 7,3).
> 
> Originally the systems were running a beta version of Yellowdog that had
> a custom kernel based on 2.6.12.3.  That kernel works great on all of
> the machines except the dual core which doesn't work at all with the
> kernel (no surprise).
> When YDL 4.1 came out (with a kernel based on 2.6.15-rc5 plus some
> patches) I wanted to upgrade to that and have the same image on all of
> the machines.  The hope was that the 1.8ghz-single machines would get
> thermal support and I would get rudimentary support for the dual core
> machine.  I want to have the same load on all of the machines to make
> my job easier (since I have 30+ machines total to keep running).
> But...  The stock YDL kernel does not work so well on the 1.8ghz-single
> machines....  I am able to install the distribution on these machines
> and reboot.  The system will stay up for something like 30 seconds and
> then it freezes and shows:

I'm not sure what's up there but could you try 2.6.16-rc6?  It should be
working on both machine types and have working thermal control for both
too.

> Then I tried to use the stock 2.6.15.6 kernel that I d/l'd from
> kernel.org.  At first I tried to use the 'arch/powerpc/g5_defconfig' but
> that wouldn't compile, so I tried 'arch/powerpc/ppc64_defconfig' and
> that didn't compile either (same error).

Weird... from the error you copied below, it smells like you are hitting
a gcc bug.  I really wonder why the compiler would try to use a floating
point register to generate that altivec code.

In the meantime, disable RAID6 support, and report to YDL so they can
maybe try to update their gcc.

Ben.

From michael at ellerman.id.au  Sun Mar 12 11:03:45 2006
From: michael at ellerman.id.au (Michael Ellerman)
Date: Sun, 12 Mar 2006 11:03:45 +1100
Subject: 2.6.15.6 on G5/1.8 (9,1)
In-Reply-To: <440F8C48.6000706@cisco.com>
References: <440F8C48.6000706@cisco.com>
Message-ID: <200603121103.48788.michael@ellerman.id.au>

On Thu, 9 Mar 2006 13:00, Keith Mitchell wrote:
> FWIW the compile errors I got from the defconfig compiles were:
>
> drivers/md/raid6int8.c: In function `raid6_int8_gen_syndrome':
> drivers/md/raid6int8.c:185: error: unable to find a register to spill in
> class `FLOAT_REGS' drivers/md/raid6int8.c:185: error: this is the insn:
> (insn:HI 619 621 640 4 (set (mem:DI (plus:DI (reg/v/f:DI 122 [ p ])
>             (reg/v:DI 66 ctr [orig:124 d ] [124])) [0 S8 A64])
>         (reg/v:DI 129 [ wp0 ])) 320 {*movdi_internal64} (nil)
>     (expr_list:REG_DEAD (reg/v:DI 129 [ wp0 ])
>         (nil)))
> drivers/md/raid6int8.c:185: confused by earlier errors, bailing out
> make[2]: *** [drivers/md/raid6int8.o] Error 1
> make[1]: *** [drivers/md] Error 2
> make: *** [drivers] Error 2
> [root at kamitch-lnx linux-2.6.15.6]# gcc --version
> gcc (GCC) 3.4.4 20050721 (Yellow Dog 3.4.4-2.ydl.2)
> Copyright (C) 2004 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I used to see that one, I think moving to GCC 4 made it go away, although I
didn't pay it much attention because I don't need RAID.

cheers

-- 
Michael Ellerman
IBM OzLabs

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060312/e854a63c/attachment.pgp From anton at samba.org Sun Mar 12 12:52:02 2006 From: anton at samba.org (Anton Blanchard) Date: Sun, 12 Mar 2006 12:52:02 +1100 Subject: [PATCH] ppc64: Allow non zero boot cpuids Message-ID: <20060312015202.GH19683@krispykreme> Hi, We currently have a hack to flip the boot cpu and its secondary thread to logical cpuid 0 and 1. There are a few issues: - The logical - physical mapping will differ depending on which cpu is boot cpu. This is most apparent on kexec, where we might kexec on any cpu and therefore change the mapping. - There are some strange bugs when we flip cpus between nodes - the scheduler locks up. The patch below does a first pass early on to work out the logical cpuid of the boot thread. We then fix up some paca structures to match. Ive also removed the boot_cpuid_phys variable for ppc64, to be consistent we use get_hard_smp_processor_id(boot_cpuid) everywhere. Signed-off-by: Anton Blanchard --- Index: build/arch/powerpc/kernel/setup-common.c =================================================================== --- build.orig/arch/powerpc/kernel/setup-common.c 2006-03-08 14:54:48.000000000 +1100 +++ build/arch/powerpc/kernel/setup-common.c 2006-03-08 14:55:39.000000000 +1100 @@ -353,12 +353,13 @@ void __init check_for_initrd(void) * must be called before using this. * * While we're here, we may as well set the "physical" cpu ids in the paca. + * + * NOTE: This must match the parsing done in early_init_dt_scan_cpus. */ void __init smp_setup_cpu_maps(void) { struct device_node *dn = NULL; int cpu = 0; - int swap_cpuid = 0; while ((dn = of_find_node_by_type(dn, "cpu")) && cpu < NR_CPUS) { int *intserv; @@ -377,24 +378,11 @@ void __init smp_setup_cpu_maps(void) for (j = 0; j < nthreads && cpu < NR_CPUS; j++) { cpu_set(cpu, cpu_present_map); set_hard_smp_processor_id(cpu, intserv[j]); - - if (intserv[j] == boot_cpuid_phys) - swap_cpuid = cpu; cpu_set(cpu, cpu_possible_map); cpu++; } } - /* Swap CPU id 0 with boot_cpuid_phys, so we can always assume that - * boot cpu is logical 0. - */ - if (boot_cpuid_phys != get_hard_smp_processor_id(0)) { - u32 tmp; - tmp = get_hard_smp_processor_id(0); - set_hard_smp_processor_id(0, boot_cpuid_phys); - set_hard_smp_processor_id(swap_cpuid, tmp); - } - #ifdef CONFIG_PPC64 /* * On pSeries LPAR, we need to know how many cpus Index: build/arch/powerpc/kernel/setup_64.c =================================================================== --- build.orig/arch/powerpc/kernel/setup_64.c 2006-03-08 14:54:55.000000000 +1100 +++ build/arch/powerpc/kernel/setup_64.c 2006-03-08 14:55:39.000000000 +1100 @@ -73,7 +73,6 @@ int have_of = 1; int boot_cpuid = 0; -int boot_cpuid_phys = 0; dev_t boot_dev; u64 ppc64_pft_size; @@ -208,7 +207,6 @@ static struct machdep_calls __initdata * void __init early_setup(unsigned long dt_ptr) { - struct paca_struct *lpaca = get_paca(); static struct machdep_calls **mach; /* Enable early debugging if any specified (see udbg.h) */ @@ -223,6 +221,14 @@ void __init early_setup(unsigned long dt */ early_init_devtree(__va(dt_ptr)); + /* Now we know the logical id of our boot cpu, setup the paca. 
*/ + setup_boot_paca(); + + /* Fix up paca fields required for the boot cpu */ + get_paca()->cpu_start = 1; + get_paca()->stab_real = __pa((u64)&initial_stab); + get_paca()->stab_addr = (u64)&initial_stab; + /* * Iterate all ppc_md structures until we find the proper * one for the current machine type @@ -260,7 +266,7 @@ void __init early_setup(unsigned long dt if (cpu_has_feature(CPU_FTR_SLB)) slb_initialize(); else - stab_initialize(lpaca->stab_real); + stab_initialize(get_paca()->stab_real); } DBG(" <- early_setup()\n"); Index: build/arch/powerpc/platforms/pseries/xics.c =================================================================== --- build.orig/arch/powerpc/platforms/pseries/xics.c 2006-03-08 14:54:48.000000000 +1100 +++ build/arch/powerpc/platforms/pseries/xics.c 2006-03-08 14:55:39.000000000 +1100 @@ -499,7 +499,7 @@ nextnode: np; np = of_find_node_by_type(np, "cpu")) { ireg = (uint *)get_property(np, "reg", &ilen); - if (ireg && ireg[0] == boot_cpuid_phys) { + if (ireg && ireg[0] == get_hard_smp_processor_id(boot_cpuid)) { ireg = (uint *)get_property(np, "ibm,ppc-interrupt-gserver#s", &ilen); i = ilen / sizeof(int); Index: build/include/asm-powerpc/smp.h =================================================================== --- build.orig/include/asm-powerpc/smp.h 2006-03-08 14:54:49.000000000 +1100 +++ build/include/asm-powerpc/smp.h 2006-03-08 14:55:39.000000000 +1100 @@ -29,7 +29,6 @@ #endif extern int boot_cpuid; -extern int boot_cpuid_phys; extern void cpu_die(void); @@ -99,6 +98,7 @@ extern void smp_release_cpus(void); #else /* 32-bit */ #ifndef CONFIG_SMP +extern int boot_cpuid_phys; #define get_hard_smp_processor_id(cpu) boot_cpuid_phys #define set_hard_smp_processor_id(cpu, phys) #endif Index: build/arch/powerpc/kernel/head_64.S =================================================================== --- build.orig/arch/powerpc/kernel/head_64.S 2006-03-08 14:54:48.000000000 +1100 +++ build/arch/powerpc/kernel/head_64.S 2006-03-08 14:55:39.000000000 +1100 @@ -1840,21 +1840,6 @@ _STATIC(start_here_multiplatform) bl .__save_cpu_setup sync - /* Setup a valid physical PACA pointer in SPRG3 for early_setup - * note that boot_cpuid can always be 0 nowadays since there is - * nowhere it can be initialized differently before we reach this - * code - */ - LOAD_REG_IMMEDIATE(r27, boot_cpuid) - add r27,r27,r26 - lwz r27,0(r27) - - LOAD_REG_IMMEDIATE(r24, paca) /* Get base vaddr of paca array */ - mulli r13,r27,PACA_SIZE /* Calculate vaddr of right paca */ - add r13,r13,r24 /* for this processor. */ - add r13,r13,r26 /* convert to physical addr */ - mtspr SPRN_SPRG3,r13 - /* Do very early kernel initializations, including initial hash table, * stab and slb setup before we turn on relocation. */ @@ -1923,6 +1908,17 @@ _STATIC(start_here_common) /* Not reached */ BUG_OPCODE +/* Put the paca pointer into r13 and SPRG3 */ +_GLOBAL(setup_boot_paca) + LOAD_REG_IMMEDIATE(r3, boot_cpuid) + lwz r3,0(r3) + LOAD_REG_IMMEDIATE(r4, paca) /* Get base vaddr of paca array */ + mulli r3,r3,PACA_SIZE /* Calculate vaddr of right paca */ + add r13,r3,r4 /* for this processor. */ + mtspr SPRN_SPRG3,r13 + + blr + /* * We put a few things here that have to be page-aligned. * This stuff goes at the beginning of the bss, which is page-aligned. 
Index: build/arch/powerpc/kernel/prom.c =================================================================== --- build.orig/arch/powerpc/kernel/prom.c 2006-03-08 14:54:48.000000000 +1100 +++ build/arch/powerpc/kernel/prom.c 2006-03-08 14:55:39.000000000 +1100 @@ -858,35 +858,70 @@ void __init unflatten_device_tree(void) DBG(" <- unflatten_device_tree()\n"); } - static int __init early_init_dt_scan_cpus(unsigned long node, - const char *uname, int depth, void *data) + const char *uname, int depth, + void *data) { - u32 *prop; - unsigned long size; - char *type = of_get_flat_dt_prop(node, "device_type", &size); + static int logical_cpuid = 0; + char *type = of_get_flat_dt_prop(node, "device_type", NULL); + u32 *prop, *intserv; + int i, nthreads; + unsigned long len; + int found = 0; /* We are scanning "cpu" nodes only */ if (type == NULL || strcmp(type, "cpu") != 0) return 0; - boot_cpuid = 0; - boot_cpuid_phys = 0; - if (initial_boot_params && initial_boot_params->version >= 2) { - /* version 2 of the kexec param format adds the phys cpuid - * of booted proc. - */ - boot_cpuid_phys = initial_boot_params->boot_cpuid_phys; + /* Get physical cpuid */ + intserv = of_get_flat_dt_prop(node, "ibm,ppc-interrupt-server#s", &len); + if (intserv) { + nthreads = len / sizeof(int); } else { - /* Check if it's the boot-cpu, set it's hw index now */ - if (of_get_flat_dt_prop(node, + intserv = of_get_flat_dt_prop(node, "reg", NULL); + nthreads = 1; + } + + /* + * Now see if any of these threads match our boot cpu. + * NOTE: This must match the parsing done in smp_setup_cpu_maps. + */ + for (i = 0; i < nthreads; i++) { + /* + * version 2 of the kexec param format adds the phys cpuid of + * booted proc. + */ + if (initial_boot_params && initial_boot_params->version >= 2) { + if (intserv[i] == + initial_boot_params->boot_cpuid_phys) { + found = 1; + break; + } + } else { + /* + * Check if it's the boot-cpu, set it's hw index now, + * unfortunately this format did not support booting + * off secondary threads. + */ + if (of_get_flat_dt_prop(node, "linux,boot-cpu", NULL) != NULL) { - prop = of_get_flat_dt_prop(node, "reg", NULL); - if (prop != NULL) - boot_cpuid_phys = *prop; + found = 1; + break; + } } + +#ifdef CONFIG_SMP + /* logical cpu id is always 0 on UP kernels */ + logical_cpuid++; +#endif + } + + if (found) { + DBG("boot cpu: logical %d physical %d\n", logical_cpuid, + intserv[i]); + boot_cpuid = logical_cpuid; + set_hard_smp_processor_id(boot_cpuid, intserv[i]); } - set_hard_smp_processor_id(0, boot_cpuid_phys); #ifdef CONFIG_ALTIVEC /* Check if we have a VMX and eventually update CPU features */ @@ -905,16 +940,10 @@ static int __init early_init_dt_scan_cpu #endif /* CONFIG_ALTIVEC */ #ifdef CONFIG_PPC_PSERIES - /* - * Check for an SMT capable CPU and set the CPU feature. We do - * this by looking at the size of the ibm,ppc-interrupt-server#s - * property - */ - prop = (u32 *)of_get_flat_dt_prop(node, "ibm,ppc-interrupt-server#s", - &size); - cur_cpu_spec->cpu_features &= ~CPU_FTR_SMT; - if (prop && ((size / sizeof(u32)) > 1)) + if (nthreads > 1) cur_cpu_spec->cpu_features |= CPU_FTR_SMT; + else + cur_cpu_spec->cpu_features &= ~CPU_FTR_SMT; #endif return 0; Index: build/arch/powerpc/kernel/paca.c =================================================================== --- build.orig/arch/powerpc/kernel/paca.c 2006-03-08 14:54:48.000000000 +1100 +++ build/arch/powerpc/kernel/paca.c 2006-03-08 14:55:39.000000000 +1100 @@ -56,14 +56,11 @@ struct lppaca lppaca[] = { * processors. 
The processor VPD array needs one entry per physical
 * processor (not thread).
 */
-#define PACA_INIT_COMMON(number, start, asrr, asrv)			    \
+#define PACA_INIT_COMMON(number)					    \
 	.lppaca_ptr = &lppaca[number],					    \
 	.lock_token = 0x8000,						    \
 	.paca_index = (number),		/* Paca Index */		    \
 	.kernel_toc = (unsigned long)(&__toc_start) + 0x8000UL,	    \
-	.stab_real = (asrr), 		/* Real pointer to segment table */ \
-	.stab_addr = (asrv),		/* Virt pointer to segment table */ \
-	.cpu_start = (start),		/* Processor start */		    \
 	.hw_cpu_id = 0xffff,
 
 #ifdef CONFIG_PPC_ISERIES
@@ -72,30 +69,20 @@ struct lppaca lppaca[] = {
 
 #define PACA_INIT(number)						    \
 {									    \
-	PACA_INIT_COMMON(number, 0, 0, 0)				    \
-	PACA_INIT_ISERIES(number)					    \
-}
-
-#define BOOTCPU_PACA_INIT(number)					    \
-{									    \
-	PACA_INIT_COMMON(number, 1, 0, (u64)&initial_stab)		    \
+	PACA_INIT_COMMON(number)					    \
 	PACA_INIT_ISERIES(number)					    \
 }
 
 #else
 #define PACA_INIT(number)						    \
 {									    \
-	PACA_INIT_COMMON(number, 0, 0, 0)				    \
+	PACA_INIT_COMMON(number)					    \
 }
-#define BOOTCPU_PACA_INIT(number)					    \
-{									    \
-	PACA_INIT_COMMON(number, 1, STAB0_PHYS_ADDR, (u64)&initial_stab)   \
-}
 #endif
 
 struct paca_struct paca[] = {
-	BOOTCPU_PACA_INIT(0),
+	PACA_INIT(0),
 #if NR_CPUS > 1
 	PACA_INIT( 1), PACA_INIT( 2), PACA_INIT( 3),
 #if NR_CPUS > 4

From ebiederm at xmission.com  Mon Mar 13 04:15:20 2006
From: ebiederm at xmission.com (Eric W. Biederman)
Date: Sun, 12 Mar 2006 10:15:20 -0700
Subject: [PATCH] Document Linux's memory barriers [try #4]
In-Reply-To: <16835.1141936162@warthog.cambridge.redhat.com> (David
	Howells's message of "Thu, 09 Mar 2006 20:29:22 +0000")
References: <16835.1141936162@warthog.cambridge.redhat.com>
Message-ID: 

A small nit.  You are not documenting the most subtle memory barrier:

smp_read_barrier_depends();

This is a deep requirement of the RCU code, as I understand it.  On some
architectures (alpha) without at least this barrier, a load from a pointer
can load from an old pointer value.

At one point it was suggested this be called:
read_memory_barrier_data_dependent().

Simply calling rcu_dereference() is what all users should do, but the
semantics should at least be documented so that people porting Linux
have a chance of getting it right.

Eric

From segher at kernel.crashing.org  Mon Mar 13 08:54:42 2006
From: segher at kernel.crashing.org (Segher Boessenkool)
Date: Sun, 12 Mar 2006 22:54:42 +0100
Subject: 2.6.15.6 on G5/1.8 (9,1)
In-Reply-To: <200603121103.48788.michael@ellerman.id.au>
References: <440F8C48.6000706@cisco.com>
	<200603121103.48788.michael@ellerman.id.au>
Message-ID: <0d1795383782ab2e605b7cc47ded66e8@kernel.crashing.org>

>> drivers/md/raid6int8.c: In function `raid6_int8_gen_syndrome':
>> drivers/md/raid6int8.c:185: error: unable to find a register to spill
>> in
>> class `FLOAT_REGS' drivers/md/raid6int8.c:185: error: this is the
>> insn:

> I used to see that one, I think moving to GCC 4 made it go away,
> although I
> didn't pay it much attention because I don't need RAID.

This is GCC PR26610; this problem does not exist in GCC-4.x.  It might
be fixed in GCC-3.4.6; if not, that is the last release ever from the
3.4 series, so you're out of luck...
better update to 4.1 or so :-) Segher From ntl at pobox.com Mon Mar 13 09:18:17 2006 From: ntl at pobox.com (Nathan Lynch) Date: Sun, 12 Mar 2006 16:18:17 -0600 Subject: [PATCH] powerpc: add for_each_node_by_foo helpers In-Reply-To: <1142119340.4057.38.camel@localhost.localdomain> References: <20060308154700.GA15859@lst.de> <1142119340.4057.38.camel@localhost.localdomain> Message-ID: <20060312221817.GA3205@localhost.localdomain> Benjamin Herrenschmidt wrote: > On Wed, 2006-03-08 at 16:47 +0100, Christoph Hellwig wrote: > > Typical use for of_find_node_by_name and of_find_node_by_type is to > > iterate over all nodes of a given type/name. Add a helper macro to > > do that (in spirit of the list_for_each* macros). > > Looks good. Maybe you have noted however that we aren't entirely safe > yet vs. device-tree being dynamic however. There is a per-node lock, but > no lock protecting the global node chain, which as usual means iterating > the list isn't entirely safe vs. removal... Eh? There isn't any per-node lock, and the global chain is protected by a rwlock (devtree_lock). From benh at kernel.crashing.org Mon Mar 13 10:29:52 2006 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 13 Mar 2006 10:29:52 +1100 Subject: [PATCH] powerpc: add for_each_node_by_foo helpers In-Reply-To: <20060312221817.GA3205@localhost.localdomain> References: <20060308154700.GA15859@lst.de> <1142119340.4057.38.camel@localhost.localdomain> <20060312221817.GA3205@localhost.localdomain> Message-ID: <1142206193.4433.17.camel@localhost.localdomain> On Sun, 2006-03-12 at 16:18 -0600, Nathan Lynch wrote: > Benjamin Herrenschmidt wrote: > > On Wed, 2006-03-08 at 16:47 +0100, Christoph Hellwig wrote: > > > Typical use for of_find_node_by_name and of_find_node_by_type is to > > > iterate over all nodes of a given type/name. Add a helper macro to > > > do that (in spirit of the list_for_each* macros). > > > > Looks good. Maybe you have noted however that we aren't entirely safe > > yet vs. device-tree being dynamic however. There is a per-node lock, but > > no lock protecting the global node chain, which as usual means iterating > > the list isn't entirely safe vs. removal... > > Eh? There isn't any per-node lock, and the global chain is protected > by a rwlock (devtree_lock). I recently showed a variety of scenarios that would blow up in funny ways.. It's not enough. It would only work if the global chain lock was taken by callers before iterating. Ben. From ntl at pobox.com Mon Mar 13 14:11:38 2006 From: ntl at pobox.com (Nathan Lynch) Date: Sun, 12 Mar 2006 21:11:38 -0600 Subject: [PATCH] powerpc: add for_each_node_by_foo helpers In-Reply-To: <1142206193.4433.17.camel@localhost.localdomain> References: <20060308154700.GA15859@lst.de> <1142119340.4057.38.camel@localhost.localdomain> <20060312221817.GA3205@localhost.localdomain> <1142206193.4433.17.camel@localhost.localdomain> Message-ID: <20060313031138.GB3205@localhost.localdomain> Benjamin Herrenschmidt wrote: > On Sun, 2006-03-12 at 16:18 -0600, Nathan Lynch wrote: > > Benjamin Herrenschmidt wrote: > > > On Wed, 2006-03-08 at 16:47 +0100, Christoph Hellwig wrote: > > > > Typical use for of_find_node_by_name and of_find_node_by_type is to > > > > iterate over all nodes of a given type/name. Add a helper macro to > > > > do that (in spirit of the list_for_each* macros). > > > > > > Looks good. Maybe you have noted however that we aren't entirely safe > > > yet vs. device-tree being dynamic however. 
There is a per-node lock, but > > > no lock protecting the global node chain, which as usual means iterating > > > the list isn't entirely safe vs. removal... > > > > Eh? There isn't any per-node lock, and the global chain is protected > > by a rwlock (devtree_lock). > > I recently showed a variety of scenarios that would blow up in funny > ways.. It's not enough. A variety of scenarios? I'm aware of only one [1], which is kind of a corner case, and has never come up in practice to my knowledge. If you know of others, I'd like to know the details. > It would only work if the global chain lock was > taken by callers before iterating. Are we talking about the same thing? I'm talking about the of_find_node_by_foo iterators and they all hold the lock while traversing the tree. [1] IIRC, it's something like this: Nodes A, B, and C are adjacent in the allnext list -- A->allnext = B, B->allnext = C. A is unplugged from the tree (of_detach_node), but something still holds a reference to it so A is not freed yet. B is then unplugged from the tree, but we can't get to A to fix up its allnext field to point to C. B has no outstanding references and is freed. The user of A attempts to dereference A->allnext, but B isn't there anymore. This could happen with the sibling list and parent pointer as well. So the basic issue is that traversing back into the tree from a node that has been unplugged from the tree is invalid -- the state of the tree is indeterminate with respect to the removed node's links into it. Seems to me that one way to address this is to keep a node from being unplugged until all other users have dropped their references to it.
From benh at kernel.crashing.org Mon Mar 13 14:17:52 2006 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 13 Mar 2006 14:17:52 +1100 Subject: [PATCH] powerpc: add for_each_node_by_foo helpers In-Reply-To: <20060313031138.GB3205@localhost.localdomain> References: <20060308154700.GA15859@lst.de> <1142119340.4057.38.camel@localhost.localdomain> <20060312221817.GA3205@localhost.localdomain> <1142206193.4433.17.camel@localhost.localdomain> <20060313031138.GB3205@localhost.localdomain> Message-ID: <1142219872.32330.12.camel@localhost.localdomain> > > I recently showed a variety of scenarios that would blow up in funny > > ways.. It's not enough. > > A variety of scenarios? I'm aware of only one [1], which is kind of a > corner case, and has never come up in practice to my knowledge. > > If you know of others, I'd like to know the details. I actually have one in mind that I remember, but I'm pretty sure I saw another one... it's not that much of a corner case. I think in general it's almost impossible to handle safe management of a list/tree without an external lock taken by iterators around the iteration, which I think is why klists were created... Unless maybe we deal with node deletion using some kind of RCU mechanism > Are we talking about the same thing? I'm talking about the > of_find_node_by_foo iterators and they all hold the lock while > traversing the tree. Yes but not between calls... > [1] IIRC, it's something like this: > > Nodes A, B, and C are adjacent in the allnext list -- A->allnext = B, > B->allnext = C. > > A is unplugged from the tree (of_detach_node), but something still > holds a reference to it so A is not freed yet. > > B is then unplugged from the tree, but we can't get to A to fix up > its allnext field to point to C. B has no outstanding references > and is freed.
> > The user of A attempts to dereference A->allnext, but B isn't > there anymore. > > This could happen with the sibling list and parent pointer as well. > > So the basic issue is that traversing back into the tree from a node > that has been unplugged from the tree is invalid -- the state of the > tree is indeterminate with respect to the removed node's links into > it. > > Seems to me that one way to address this is to keep a node from being > unplugged until all other users have dropped their references to it. It might help by having references associated with links, but we end up with cases where the tree will still reference a node after returning from remove... Maybe putting a flag in the node when unlinked to cause other iterators to "skip" it and making sure that references to a node held by its peers are counted would help... Ben.
From ntl at pobox.com Mon Mar 13 19:15:12 2006 From: ntl at pobox.com (Nathan Lynch) Date: Mon, 13 Mar 2006 02:15:12 -0600 Subject: [PATCH] powerpc: add for_each_node_by_foo helpers In-Reply-To: <1142219872.32330.12.camel@localhost.localdomain> References: <20060308154700.GA15859@lst.de> <1142119340.4057.38.camel@localhost.localdomain> <20060312221817.GA3205@localhost.localdomain> <1142206193.4433.17.camel@localhost.localdomain> <20060313031138.GB3205@localhost.localdomain> <1142219872.32330.12.camel@localhost.localdomain> Message-ID: <20060313081511.GC3205@localhost.localdomain> Benjamin Herrenschmidt wrote: > > > > I recently showed a variety of scenarios that would blow up in funny > > > ways.. It's not enough. > > > > A variety of scenarios? I'm aware of only one [1], which is kind of a > > corner case, and has never come up in practice to my knowledge. > > > > If you know of others, I'd like to know the details. > > I actually have one in mind that I remember, but I'm pretty sure I saw > another one... More details, please. If you know of another issue besides the one pointed out already, I'd prefer to know the nature of it before putting more time into fixing this one. > > Are we talking about the same thing? I'm talking about the > > of_find_node_by_foo iterators and they all hold the lock while > > traversing the tree. > > Yes but not between calls... That's by design. It's better to avoid exposing the lock to users of the iterators. > > So the basic issue is that traversing back into the tree from a node > > that has been unplugged from the tree is invalid -- the state of the > > tree is indeterminate with respect to the removed node's links into > > it. > > > > Seems to me that one way to address this is to keep a node from being > > unplugged until all other users have dropped their references to it. Here's an attempt at that -- build-tested only. It's a bit hacky since it's checking the kref refcount directly, but maybe I can come up with something nicer later. (It looks like klist would support doing the same thing, but at the cost of bloating up struct device_node.) @@ -1672,10 +1672,11 @@ * simplify writing of callers * */ -void of_node_put(struct device_node *node) +int of_node_put(struct device_node *node) { if (node) - kref_put(&node->kref, of_node_release); + return kref_put(&node->kref, of_node_release); + return 0; } EXPORT_SYMBOL(of_node_put); @@ -1697,11 +1698,38 @@ * a reference to the node. The memory associated with the node * is not freed until its refcount goes to zero.
*/ -void of_detach_node(const struct device_node *np) +void of_detach_node(struct device_node *np, int refs) { struct device_node *parent; + /* Account for the reference the device tree itself has to the + * node. + */ + of_node_put(np); +retry: + /* Wait until refcount == the number of references this caller + * has to the node. + */ + while (atomic_read(&np->kref.refcount) != refs) + schedule(); + + /* Someone else can grab a reference in this window... */ + write_lock(&devtree_lock); + + /* ... but not now (assuming they're using the proper + * iterators). Test whether the refcount got bumped in the + * window and retry if necessary. + * + * The reason we do this is to prevent someone from iterating + * from a detached node into the tree, e.g. through ->allnext. + * The detached node's former neighbor in the list might not + * be there. + */ + if (atomic_read(&np->kref.refcount) > refs) { + write_unlock(&devtree_lock); + goto retry; + } parent = np->parent; diff -r 8e71be242cff arch/powerpc/platforms/pseries/reconfig.c --- a/arch/powerpc/platforms/pseries/reconfig.c Mon Mar 13 08:41:27 2006 +0800 +++ b/arch/powerpc/platforms/pseries/reconfig.c Mon Mar 13 01:52:53 2006 -0600 @@ -156,7 +156,7 @@ return err; } -static int pSeries_reconfig_remove_node(struct device_node *np) +static int pSeries_reconfig_remove_node(struct device_node *np, int refs) { struct device_node *parent, *child; @@ -173,10 +173,9 @@ notifier_call_chain(&pSeries_reconfig_chain, PSERIES_RECONFIG_REMOVE, np); - of_detach_node(np); + of_detach_node(np, refs); of_node_put(parent); - of_node_put(np); /* Must decrement the refcount */ return 0; } @@ -341,12 +340,13 @@ static int do_remove_node(char *buf) { struct device_node *node; - int rv = -ENODEV; + int deleted, rv = -ENODEV; if ((node = of_find_node_by_path(buf))) - rv = pSeries_reconfig_remove_node(node); - - of_node_put(node); + rv = pSeries_reconfig_remove_node(node, 1); + + deleted = of_node_put(node); + BUG_ON(!deleted); return rv; } diff -r 8e71be242cff include/asm-powerpc/prom.h --- a/include/asm-powerpc/prom.h Mon Mar 13 08:41:27 2006 +0800 +++ b/include/asm-powerpc/prom.h Mon Mar 13 01:52:53 2006 -0600 @@ -140,7 +140,7 @@ const char *name, int *lenp); extern struct device_node *of_node_get(struct device_node *node); -extern void of_node_put(struct device_node *node); +extern int of_node_put(struct device_node *node); /* For scanning the flat device-tree at boot time */ int __init of_scan_flat_dt(int (*it)(unsigned long node, @@ -152,7 +152,7 @@ /* For updating the device tree at runtime */ extern void of_attach_node(struct device_node *); -extern void of_detach_node(const struct device_node *); +extern void of_detach_node(struct device_node *, int); /* Other Prototypes */ extern void finish_device_tree(void); From michael at ellerman.id.au Mon Mar 13 19:16:32 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Mon, 13 Mar 2006 19:16:32 +1100 Subject: [RFC/PATCH] powerpc: Fix kdump EOI bug (often exhibits as dead console) Message-ID: <20060313081709.20471679F2@ozlabs.org> If we take an interrupt, and while processing it, decide to kdump we never EOI that interrupt. This can happen for any interrupt, but most commonly it's the console interrupt from a user hitting 'sysrq-c', which prevents the console from working in the second kernel. We're panicking so we don't want to do any more than we need to in the first kernel, so leave things alone there. When we come back up iff we reenable the interrupt in question, do an EOI then. 
This fixes the bug for me, and appears to cause no issue for other interrupts. Tested on P5 LPAR. Signed-off-by: Michael Ellerman --- arch/powerpc/platforms/pseries/xics.c | 31 +++++++++++++++++++++---------- 1 files changed, 21 insertions(+), 10 deletions(-) Index: kdump/arch/powerpc/platforms/pseries/xics.c =================================================================== --- kdump.orig/arch/powerpc/platforms/pseries/xics.c +++ kdump/arch/powerpc/platforms/pseries/xics.c @@ -208,11 +208,32 @@ xics_ops pSeriesLP_ops = { pSeriesLP_qirr_info }; +static unsigned int real_irq_to_virt(unsigned int real_irq) +{ + unsigned int *ptr; + + ptr = radix_tree_lookup(&irq_map, real_irq); + if (ptr == NULL) + return NO_IRQ; + return ptr - virt_irq_to_real_map; +} + static unsigned int xics_startup(unsigned int virq) { unsigned int irq; irq = irq_offset_down(virq); + +#ifdef CONFIG_CRASH_DUMP + /* We may have kdumped from an interrupt handler in which case we + * won't have EOI'ed that irq, so do it now. We only want to do + * this the first time we're called for each irq, so exploit the + * radix tree as a way to check if we've already enabled this irq. + */ + if (real_irq_to_virt(virt_irq_to_real(irq)) == NO_IRQ) + xics_end_irq(virq); +#endif + if (radix_tree_insert(&irq_map, virt_irq_to_real(irq), &virt_irq_to_real_map[irq]) == -ENOMEM) printk(KERN_CRIT "Out of memory creating real -> virtual" @@ -222,16 +243,6 @@ static unsigned int xics_startup(unsigne return 0; /* return value is ignored */ } -static unsigned int real_irq_to_virt(unsigned int real_irq) -{ - unsigned int *ptr; - - ptr = radix_tree_lookup(&irq_map, real_irq); - if (ptr == NULL) - return NO_IRQ; - return ptr - virt_irq_to_real_map; -} - #ifdef CONFIG_SMP static int get_irq_server(unsigned int irq) { From benh at kernel.crashing.org Mon Mar 13 19:44:12 2006 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 13 Mar 2006 19:44:12 +1100 Subject: [RFC/PATCH] powerpc: Fix kdump EOI bug (often exhibits as dead console) In-Reply-To: <20060313081709.20471679F2@ozlabs.org> References: <20060313081709.20471679F2@ozlabs.org> Message-ID: <1142239453.32330.51.camel@localhost.localdomain> On Mon, 2006-03-13 at 19:16 +1100, Michael Ellerman wrote: > If we take an interrupt, and while processing it, decide to kdump we never > EOI that interrupt. This can happen for any interrupt, but most commonly it's > the console interrupt from a user hitting 'sysrq-c', which prevents the > console from working in the second kernel. > > We're panicking so we don't want to do any more than we need to in the first > kernel, so leave things alone there. When we come back up iff we reenable the > interrupt in question, do an EOI then. This fixes the bug for me, and appears > to cause no issue for other interrupts. You may want to do the same for mpic.c ... Ben. From schwab at suse.de Tue Mar 14 01:49:12 2006 From: schwab at suse.de (Andreas Schwab) Date: Mon, 13 Mar 2006 15:49:12 +0100 Subject: GigE on PowerMac G5 In-Reply-To: (Andreas Schwab's message of "Tue, 07 Mar 2006 13:53:21 +0100") References: <1141507000.17127.4.camel@localhost.localdomain> <1141650907.11221.61.camel@localhost.localdomain> Message-ID: Andreas Schwab writes: > Benjamin Herrenschmidt writes: > >> At this point, all I can say is... does it work in OS X ? > > Strange, OS X can't do it either. Looks like I have a hardware problem. It turned out that one of the contacts in the RJ-45 jack was twisted. After straightening it the Gb connection is working now. Andreas. 
-- Andreas Schwab, SuSE Labs, schwab at suse.de SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different."
From kamitch at cisco.com Tue Mar 14 02:39:03 2006 From: kamitch at cisco.com (Keith Mitchell) Date: Mon, 13 Mar 2006 10:39:03 -0500 Subject: 2.6.15.6 on G5/1.8 (9,1) In-Reply-To: <1142119529.4057.42.camel@localhost.localdomain> References: <440F8C48.6000706@cisco.com> <1142119529.4057.42.camel@localhost.localdomain> Message-ID: <44159217.1000603@cisco.com> Benjamin Herrenschmidt wrote: > On Wed, 2006-03-08 at 21:00 -0500, Keith Mitchell wrote: > >> Hi, >> >> I have a bunch of different powermac machines that I am trying to >> upgrade and/or install and am having some difficulty with the 1.8 (9,1) >> powermacs as well as the newer Dual Core (2.0) machines. The other two >> types of machines that I have seem to be working well enough (Dual-Proc >> 2.0, Dual-Proc 2.7 -- Both 7,3). >> >> Originally the systems were running a beta version of Yellowdog that had >> a custom kernel based on 2.6.12.3. That kernel works great on all of >> the machines except the dual core which doesn't work at all with the >> kernel (no surprise). When YDL 4.1 came out (with a kernel based on >> 2.6.15-rc5 plus some patches) I wanted to upgrade to that and have the >> same image on all of the machines. The hope was that the 1.8ghz-single >> machines would get thermal support and I would get rudimentary support >> for the dual core machine. I want to have the same load on all of the >> machines to make my job easier (since I have 30+ machines total to keep >> running). But... The stock YDL kernel does not work so well on the >> 1.8ghz-single machines.... I am able to install the distribution on >> these machines and reboot. The system will stay up for something like >> 30 seconds and then it freezes and shows: >> > > I'm not sure what's up there but could you try 2.6.16-rc6 ? It should be > working on both machines types and have working thermal control for both > too. > It looks like 2.6.16-rc6 works fine on my single 1.8 systems. BTW is there any equivalent to the "server_mode" that the PMU driver provides that will set up the system so that it will auto-poweron on a power failure? Or does this still need to be set up in MacOS before installing linux on the machine? Thanks. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060313/98b276e8/attachment.htm
From miltonm at bga.com Tue Mar 14 02:59:15 2006 From: miltonm at bga.com (Milton Miller) Date: Mon, 13 Mar 2006 09:59:15 -0600 Subject: [RFC/PATCH] powerpc: Fix kdump EOI bug (often exhibits as dead console) In-Reply-To: <1142239453.32330.51.camel@localhost.localdomain> References: <20060313081709.20471679F2@ozlabs.org> <1142239453.32330.51.camel@localhost.localdomain> Message-ID: <0646a9f7bbbf563af2f23fe7229a4b01@bga.com> On Mar 13, 2006, at 2:44 AM, Benjamin Herrenschmidt wrote: > On Mon, 2006-03-13 at 19:16 +1100, Michael Ellerman wrote: >> If we take an interrupt, and while processing it, decide to kdump we >> never >> EOI that interrupt. This can happen for any interrupt, but most >> commonly it's >> the console interrupt from a user hitting 'sysrq-c', which prevents >> the >> console from working in the second kernel. >> >> We're panicking so we don't want to do any more than we need to in >> the first >> kernel, so leave things alone there.
When we come back up iff we >> reenable the >> interrupt in question, do an EOI then. This fixes the bug for me, and >> appears >> to cause no issue for other interrupts. > > You may want to do the same for mpic.c ... > > Ben. Is that possible? My memory says that, at least for the distributed pic in Power3 boxes, part of the information to do the EOI was remembered in a stack in the interrupt controller. This means (1) the EOI must be issued from the processor server that took the interrupt, (2) there are a limited number of interrupts that can be presented before they are EOId, (3) they must be EOId in reverse order, and (4) I don't know what happens if we issue an EOI with no hardware. However, a write to the reset register sent a packet to all pics to reset them. What we are doing here is a possibly extraneous, third-party EOI (device X interrupted cpu Y, and cpu Z is issuing the EOI to allow the device to reissue the interrupt). For real XICS I know that is both possible and safe as long as the interrupt X exists; I am familiar with the hardware implementation. milton
From cfriesen at nortel.com Tue Mar 14 04:49:01 2006 From: cfriesen at nortel.com (Christopher Friesen) Date: Mon, 13 Mar 2006 11:49:01 -0600 Subject: long-term support for G5s? Message-ID: <4415B08D.1060309@nortel.com> With Apple switching away from the 970, what are the long-term plans for linux on the G5? Are there enough users/developers that we can expect linux to continue to run on it for the foreseeable future? We've got a bunch of ppc/ppc64-based products in the field, and the question has come up about what we're going to use as development machines. Thanks, Chris
From olh at suse.de Tue Mar 14 07:09:10 2006 From: olh at suse.de (Olaf Hering) Date: Mon, 13 Mar 2006 21:09:10 +0100 Subject: 2.6.16-rc6: known regressions In-Reply-To: <20060313200544.GG13973@stusta.de> References: <20060313200544.GG13973@stusta.de> Message-ID: <20060313200910.GA11366@suse.de> On Mon, Mar 13, Adrian Bunk wrote: > Subject : 2.6.16-rc5-git14 crash in spin_bug on ppc64 > References : http://lkml.org/lkml/2006/3/10/190 > Submitter : Olaf Hering > Status : unknown I have seen it only once, and I rebooted that kernel a lot.
From gregkh at suse.de Mon Mar 13 23:09:15 2006 From: gregkh at suse.de (Greg KH) Date: Mon, 13 Mar 2006 12:09:15 +0000 Subject: 2.6.16-rc6: known regressions In-Reply-To: <20060313200544.GG13973@stusta.de> References: <20060313200544.GG13973@stusta.de> Message-ID: <20060313120915.GA13652@suse.de> On Mon, Mar 13, 2006 at 09:05:44PM +0100, Adrian Bunk wrote: > Subject : Slab corruption in usbserial when disconnecting device > References : http://lkml.org/lkml/2006/3/8/58 > Submitter : pete.chapman at exgate.tek.com > Status : unknown Should already be fixed in 2.6.16-rc6, with this patch that went in after 2.6.16-rc5 came out: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=91c0bce29e4050a59ee5fdc1192b60bbf8693a6d Pete, can you verify this change works for you?
thanks, greg k-h From gregkh at suse.de Mon Mar 13 23:12:19 2006 From: gregkh at suse.de (Greg KH) Date: Mon, 13 Mar 2006 12:12:19 +0000 Subject: 2.6.16-rc6: known regressions In-Reply-To: <20060313200544.GG13973@stusta.de> References: <20060313200544.GG13973@stusta.de> Message-ID: <20060313121219.GB13652@suse.de> On Mon, Mar 13, 2006 at 09:05:44PM +0100, Adrian Bunk wrote: > Subject : Stradis driver udev brekage > References : http://bugzilla.kernel.org/show_bug.cgi?id=6170 > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=181063 > http://lkml.org/lkml/2006/2/18/204 > Submitter : Tom Seeley > Dave Jones > Handled-By : Jiri Slaby > Status : unknown Jiri, why did you create a kernel.org bugzilla bug with almost no information in it? Anyway, this is the first I've heard of this, more information is needed to help track it down. How about the contents of /sys/class/dvb/ ? thanks, greg k-h From benh at kernel.crashing.org Tue Mar 14 08:14:33 2006 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 14 Mar 2006 08:14:33 +1100 Subject: 2.6.15.6 on G5/1.8 (9,1) In-Reply-To: <44159217.1000603@cisco.com> References: <440F8C48.6000706@cisco.com> <1142119529.4057.42.camel@localhost.localdomain> <44159217.1000603@cisco.com> Message-ID: <1142284473.32330.56.camel@localhost.localdomain> > BTW is there any equivalent to the "server_mode" that the PMU driver > provides that will setup the system so that it will auto-poweron on a > power failure? Or does this need to be setup still in MacOS before > installing linux on the machine? I havent looked at it for SMU based machines yet, though it shouldn't be too hard... I'll let you know if I find something. Ben. From hollis at penguinppc.org Tue Mar 14 08:16:34 2006 From: hollis at penguinppc.org (Hollis Blanchard) Date: Mon, 13 Mar 2006 15:16:34 -0600 Subject: long-term support for G5s? In-Reply-To: <4415B08D.1060309@nortel.com> References: <4415B08D.1060309@nortel.com> Message-ID: <200603131516.34816.hollis@penguinppc.org> On Monday 13 March 2006 11:49, Christopher Friesen wrote: > > With Apple switching away from the 970, what are the long-term plans for > linux on the G5? Are you asking about the Apple G5 in particular? Or the PowerPC 970? > Are there enough users/developers that we can expect linux to continue > to run on it for the reasonable future? I don't think anybody will rip out G5 support. After all, we still have PReP support in the kernel, and those systems are all about 10 years old now. > We've got a bunch of ppc/ppc64-based products in the field, and the > question has come up about what we're going to use as development machines. A quick survey of current and announced PPC970-based platforms: Apple G5s (PowerMac, Xserve, etc) Maple reference board IBM JS20 blades IBM p185 workstation Mercury XR9 IBM JS21 blades (expected to GA in March) Genesi Open Server Workstation (I haven't seen a GA date) There may be others, but these are the ones I know off the top of my head. Not all of them may fit your needs as a "development machine"... so what *are* your needs for a development machine? 
-Hollis
From benh at kernel.crashing.org Tue Mar 14 08:32:09 2006 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 14 Mar 2006 08:32:09 +1100 Subject: [RFC/PATCH] powerpc: Fix kdump EOI bug (often exhibits as dead console) In-Reply-To: <0646a9f7bbbf563af2f23fe7229a4b01@bga.com> References: <20060313081709.20471679F2@ozlabs.org> <1142239453.32330.51.camel@localhost.localdomain> <0646a9f7bbbf563af2f23fe7229a4b01@bga.com> Message-ID: <1142285529.32330.68.camel@localhost.localdomain> > Is that possible? My memory says that, at least for the distributed > pic in Power3 boxes, part of the information to do the EOI was > remembered in a stack in the interrupt controller. This means (1) the > EOI must be issued from the processor server that took the interrupt, > (2) there are a limited number of interrupts that can be presented > before they are EOId, (3) they must be EOId in reverse order, and (4) I > don't know what happens if we issue an EOI with no hardware. However, > a write to the reset register sent a packet to all pics to reset them. > > What we are doing here is a possibly extraneous, third-party EOI > (device X interrupted cpu Y, and cpu Z is issuing the EOI to allow the > device to reissue the interrupt). For real XICS I know that is both > possible and safe as long as the interrupt X exists; I am familiar with > the hardware implementation. We can maybe send the EOIs to all processor blocks (EOI registers have separate addresses for each CPU). I'm sure if we did something like sending 128 EOIs to all CPUs it would keep the mpic quiet :) But then, if an MPIC reset works, then go for it... though it will not be enough in the case of HT APICs as I said earlier (thus on js20/21 & quad g5). EOI on MPIC doesn't carry the irq number; the MPIC will automatically EOI whatever was pending on that CPU. So a sequence of: - disable all irqs - go through all HT APICs and disable all irqs & EOI them - either reset the MPIC or send a bunch of EOIs to each processor block Would probably work Ben.
From jirislaby at gmail.com Tue Mar 14 08:06:22 2006 From: jirislaby at gmail.com (Jiri Slaby) Date: Mon, 13 Mar 2006 22:06:22 +0100 Subject: 2.6.16-rc6: known regressions In-Reply-To: <20060313121219.GB13652@suse.de> References: <20060313200544.GG13973@stusta.de> Message-ID: Greg KH wrote: >On Mon, Mar 13, 2006 at 09:05:44PM +0100, Adrian Bunk wrote: >> Subject : Stradis driver udev brekage >> References : http://bugzilla.kernel.org/show_bug.cgi?id=6170 >> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=181063 >> http://lkml.org/lkml/2006/2/18/204 >> Submitter : Tom Seeley >> Dave Jones >> Handled-By : Jiri Slaby >> Status : unknown > >Jiri, why did you create a kernel.org bugzilla bug with almost no >information in it? > >Anyway, this is the first I've heard of this, more information is >needed to help track it down. How about the contents of /sys/class/dvb/ ? Hello, sorry for that, I expected Tom to help us track this down -- he has this problem, but he hasn't replied yet. Nobody else is complaining; should we defer or close it for now? best regards, -- Jiri Slaby www.fi.muni.cz/~xslaby \_.-^-._ jirislaby at gmail.com _.-^-._/ B67499670407CE62ACC8 22A032CC55C339D47A7E
From benh at kernel.crashing.org Tue Mar 14 08:41:11 2006 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 14 Mar 2006 08:41:11 +1100 Subject: long-term support for G5s?
In-Reply-To: <4415B08D.1060309@nortel.com> References: <4415B08D.1060309@nortel.com> Message-ID: <1142286071.32330.79.camel@localhost.localdomain> On Mon, 2006-03-13 at 11:49 -0600, Christopher Friesen wrote: > With Apple switching away from the 970, what are the long-term plans for > linux on the G5? It will be supported for as long as paulus and I have hardware :) If you are talking about the 970 CPU, it's fully supported by IBM and will continue to be as we are still developing/selling hardware with that processor. Besides, linux still works on POWER3's so ... :) > Are there enough users/developers that we can expect linux to continue > to run on it for the reasonable future? I think so, yes. > We've got a bunch of ppc/ppc64-based products in the field, and the > question has come up about what we're going to use as development machines. I would say current G5 machines still qualify pretty well, at least until genesi releases their workstation, in which case it may become a better choice. You may want to contact them for info about possible release dates. Ben.
From bunk at stusta.de Tue Mar 14 07:05:44 2006 From: bunk at stusta.de (Adrian Bunk) Date: Mon, 13 Mar 2006 21:05:44 +0100 Subject: 2.6.16-rc6: known regressions In-Reply-To: References: Message-ID: <20060313200544.GG13973@stusta.de> This email lists some known regressions in 2.6.16-rc6 compared to 2.6.15. If you find your name in the Cc header, you are either the submitter of one of the bugs or the maintainer of an affected subsystem or driver, a patch of yours was declared guilty for a breakage, or I consider you possibly involved in some other way with one or more of these issues. Due to the huge number of recipients, please trim the Cc when answering. Subject : XFS oopses on my box sometimes References : http://bugzilla.kernel.org/show_bug.cgi?id=6180 Submitter : Avuton Olrich Status : unknown Subject : 2.6.16-rc5 acpi slab corruption References : http://lkml.org/lkml/2006/3/1/223 Submitter : Dave Jones Status : unknown Subject : edac slab corruption References : http://lkml.org/lkml/2006/3/5/14 Submitter : Dave Jones Status : unknown Subject : yet more slab corruption References : https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=184310 Submitter : Dave Jones Status : unknown Subject : Slab corruption in usbserial when disconnecting device References : http://lkml.org/lkml/2006/3/8/58 Submitter : pete.chapman at exgate.tek.com Status : unknown Subject : 2.6.16-rc5-git14 crash in spin_bug on ppc64 References : http://lkml.org/lkml/2006/3/10/190 Submitter : Olaf Hering Status : unknown Subject : Stradis driver udev brekage References : http://bugzilla.kernel.org/show_bug.cgi?id=6170 https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=181063 http://lkml.org/lkml/2006/2/18/204 Submitter : Tom Seeley Dave Jones Handled-By : Jiri Slaby Status : unknown cu Adrian -- "Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S.
Buck - Dragon Seed From js at linuxtv.org Tue Mar 14 08:22:15 2006 From: js at linuxtv.org (Johannes Stezenbach) Date: Mon, 13 Mar 2006 22:22:15 +0100 Subject: [v4l-dvb-maintainer] Re: 2.6.16-rc6: known regressions In-Reply-To: <20060313121219.GB13652@suse.de> References: <20060313200544.GG13973@stusta.de> <20060313121219.GB13652@suse.de> Message-ID: <20060313212215.GA6041@linuxtv.org> On Mon, Mar 13, 2006 at 12:12:19PM +0000, Greg KH wrote: > On Mon, Mar 13, 2006 at 09:05:44PM +0100, Adrian Bunk wrote: > > Subject : Stradis driver udev brekage > > References : http://bugzilla.kernel.org/show_bug.cgi?id=6170 > > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=181063 > > http://lkml.org/lkml/2006/2/18/204 > > Submitter : Tom Seeley > > Dave Jones > > Handled-By : Jiri Slaby > > Status : unknown > > Jiri, why did you create a kernel.org bugzilla bug with almost no > information in it? > > Anyway, this is the first I've heard of this, more information is > needed to help track it down. How about the contents of /sys/class/dvb/ ? Stradis is not a DVB driver. AFAIK it uses V4L devices. http://bugzilla.kernel.org/show_bug.cgi?id=6170 and https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=181063 seem to be two totally different bugs. First thing to check for the Nova-T is dmesg, to see if the device was recognized at all by the driver, so we know if it is an udev problem or not. BTW: http://mpeg.openprojects.net/ doesn't exist diff --git a/MAINTAINERS b/MAINTAINERS index 3d7d30d..922a290 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -2525,7 +2525,6 @@ S: Unsupported ? STRADIS MPEG-2 DECODER DRIVER P: Nathan Laredo M: laredo at gnu.org -W: http://mpeg.openprojects.net/ W: http://www.stradis.com/ S: Maintained Johannes From laredo at gnu.org Tue Mar 14 09:14:22 2006 From: laredo at gnu.org (Nathan Laredo) Date: Mon, 13 Mar 2006 14:14:22 -0800 Subject: [v4l-dvb-maintainer] Re: 2.6.16-rc6: known regressions In-Reply-To: <20060313212215.GA6041@linuxtv.org> References: <20060313200544.GG13973@stusta.de> <20060313121219.GB13652@suse.de> <20060313212215.GA6041@linuxtv.org> Message-ID: Stradis does not support my driver. Please use http://stradis.nathanlaredo.com/ such as it is now and I'll update it later. Secondly, please confirm that the person reporting this bug actually has the hardware since the driver *will* refuse to load without hardware installed. To my knowledge I am currently the only one of about 10 people using this hardware under linux. Thanks, -- Nathan Laredo laredo at gnu.org On 3/13/06, Johannes Stezenbach wrote: > On Mon, Mar 13, 2006 at 12:12:19PM +0000, Greg KH wrote: > > On Mon, Mar 13, 2006 at 09:05:44PM +0100, Adrian Bunk wrote: > > > Subject : Stradis driver udev brekage > > > References : http://bugzilla.kernel.org/show_bug.cgi?id=6170 > > > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=181063 > > > http://lkml.org/lkml/2006/2/18/204 > > > Submitter : Tom Seeley > > > Dave Jones > > > Handled-By : Jiri Slaby > > > Status : unknown > > > > Jiri, why did you create a kernel.org bugzilla bug with almost no > > information in it? > > > > Anyway, this is the first I've heard of this, more information is > > needed to help track it down. How about the contents of /sys/class/dvb/ ? > > Stradis is not a DVB driver. AFAIK it uses V4L devices. > > http://bugzilla.kernel.org/show_bug.cgi?id=6170 and > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=181063 > seem to be two totally different bugs. 
First thing to check > for the Nova-T is dmesg, to see if the device was recognized > at all by the driver, so we know if it is an udev > problem or not. > > > BTW: http://mpeg.openprojects.net/ doesn't exist > > diff --git a/MAINTAINERS b/MAINTAINERS > index 3d7d30d..922a290 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -2525,7 +2525,6 @@ S: Unsupported ? > STRADIS MPEG-2 DECODER DRIVER > P: Nathan Laredo > M: laredo at gnu.org > -W: http://mpeg.openprojects.net/ > W: http://www.stradis.com/ > S: Maintained > > > Johannes >
From michael at ellerman.id.au Tue Mar 14 15:18:32 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Tue, 14 Mar 2006 15:18:32 +1100 Subject: [RFC/PATCH] powerpc: Fix kdump EOI bug (often exhibits as dead console) In-Reply-To: <0646a9f7bbbf563af2f23fe7229a4b01@bga.com> References: <20060313081709.20471679F2@ozlabs.org> <1142239453.32330.51.camel@localhost.localdomain> <0646a9f7bbbf563af2f23fe7229a4b01@bga.com> Message-ID: <200603141518.40447.michael@ellerman.id.au> On Tue, 14 Mar 2006 02:59, Milton Miller wrote: > What we are doing here is a possibly extraneous, third-party EOI > (device X interrupted cpu Y, and cpu Z is issuing the EOI to allow the > device to reissue the interrupt). For real XICS I know that is both > possible and safe as long as the interrupt X exists; I am familiar with > the hardware implementation. I'll take your word for it ;) I've tested on virtual xics that if we take the interrupt on Y and EOI on Z then it works just fine. cheers -- Michael Ellerman IBM OzLabs wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060314/8b0df63b/attachment.pgp
From mikey at neuling.org Tue Mar 14 16:17:39 2006 From: mikey at neuling.org (Michael Neuling) Date: Tue, 14 Mar 2006 16:17:39 +1100 Subject: [PATCH] powerpc: LMB bogus loop conditional removal Message-ID: <20060314051741.18FF0679EB@ozlabs.org> In lmb_add_region in mm/lmb.c we have: /* Couldn't coalesce the LMB, so add it to the sorted table. */ for (i = rgn->cnt-1; i >= 0; i--) { if (base < rgn->region[i].base) { rgn->region[i+1].base = rgn->region[i].base; rgn->region[i+1].size = rgn->region[i].size; } else { rgn->region[i+1].base = base; rgn->region[i+1].size = size; break; } } but i is an unsigned long, so i >= 0 is always true. This is OK since in lmb_init we have an entry where base == 0 and hence we'll always hit the break. Patch below removes the bogus i >= 0 and updates the comment. Signed-off-by: Michael Neuling --- arch/powerpc/mm/lmb.c | 7 +++++-- 1 files changed, 5 insertions(+), 2 deletions(-) Index: linux-2.6-powerpc-merge/arch/powerpc/mm/lmb.c =================================================================== --- linux-2.6-powerpc-merge.orig/arch/powerpc/mm/lmb.c +++ linux-2.6-powerpc-merge/arch/powerpc/mm/lmb.c @@ -164,8 +164,11 @@ static long __init lmb_add_region(struct if (rgn->cnt >= MAX_LMB_REGIONS) return -1; - /* Couldn't coalesce the LMB, so add it to the sorted table. */ - for (i = rgn->cnt-1; i >= 0; i--) { + /* Couldn't coalesce the LMB, so add it to the sorted table. + * lmb_init ensures we have a region with base == 0, so we'll + * always hit the break eventually.
+ */ + for (i = rgn->cnt - 1; ; i--) { if (base < rgn->region[i].base) { rgn->region[i+1].base = rgn->region[i].base; rgn->region[i+1].size = rgn->region[i].size; From mikey at neuling.org Tue Mar 14 17:11:51 2006 From: mikey at neuling.org (Michael Neuling) Date: Tue, 14 Mar 2006 17:11:51 +1100 Subject: [PATCH] powerpc: RTC memory corruption Message-ID: <20060314061155.93F5E679F7@ozlabs.org> We should be memset'ing the data we are pointing to, not the pointer itself. This is in an error path so we probably don't hit it much. Signed-off-by: Michael Neuling --- arch/powerpc/kernel/rtas-rtc.c | 2 +- 1 files changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6-powerpc-merge/arch/powerpc/kernel/rtas-rtc.c =================================================================== --- linux-2.6-powerpc-merge.orig/arch/powerpc/kernel/rtas-rtc.c +++ linux-2.6-powerpc-merge/arch/powerpc/kernel/rtas-rtc.c @@ -52,7 +52,7 @@ void rtas_get_rtc_time(struct rtc_time * error = rtas_call(rtas_token("get-time-of-day"), 0, 8, ret); if (error == RTAS_CLOCK_BUSY || rtas_is_extended_busy(error)) { if (in_interrupt() && printk_ratelimit()) { - memset(&rtc_tm, 0, sizeof(struct rtc_time)); + memset(rtc_tm, 0, sizeof(struct rtc_time)); printk(KERN_WARNING "error: reading clock" " would delay interrupt\n"); return; /* delay not allowed */ From redhat at tomseeley.co.uk Tue Mar 14 20:36:30 2006 From: redhat at tomseeley.co.uk (Tom Seeley) Date: Tue, 14 Mar 2006 09:36:30 +0000 Subject: 2.6.16-rc6: known regressions In-Reply-To: References: <20060313200544.GG13973@stusta.de> Message-ID: <44168E9E.7050503@tomseeley.co.uk> Jiri Slaby wrote: > Greg KH wrote: >> On Mon, Mar 13, 2006 at 09:05:44PM +0100, Adrian Bunk wrote: >>> Subject : Stradis driver udev brekage >>> References : http://bugzilla.kernel.org/show_bug.cgi?id=6170 >>> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=181063 >>> http://lkml.org/lkml/2006/2/18/204 >>> Submitter : Tom Seeley >>> Dave Jones >>> Handled-By : Jiri Slaby >>> Status : unknown >> Jiri, why did you create a kernel.org bugzilla bug with almost no >> information in it? >> >> Anyway, this is the first I've heard of this, more information is >> needed to help track it down. How about the contents of /sys/class/dvb/ ? > Hello, > > sorry for that, I expected Tom to help us track this down -- he has this > problem, but he haven't replied yet. Nobody else is complaining, would we defer > or close it for now? > > best regards, Apologies for the lack of additional information, this is simply a lack of time on my behalf. My first attempt to bisect 2.6.15 <-> 2.6.16-rc5 produced a kernel which caused udev to crash (and stop init). I will shift the goalposts and try again. Once I have the results I will post them to bugzilla. I will also post the contents of /sys/class/dvb as requested above. Thanks, Tom. From dhowells at redhat.com Wed Mar 15 08:26:52 2006 From: dhowells at redhat.com (David Howells) Date: Tue, 14 Mar 2006 21:26:52 +0000 Subject: [PATCH] Document Linux's memory barriers [try #4] In-Reply-To: References: <16835.1141936162@warthog.cambridge.redhat.com> Message-ID: <32068.1142371612@warthog.cambridge.redhat.com> Eric W. Biederman wrote: > A small nit. You are not documenting the most subtle memory barrier: > smp_read_barrier_depends(); Which is a deep requirement of the RCU > code. How about this the attached adjustment? 
David diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index 3ec9ff4..0c38bea 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -457,13 +457,14 @@ except in small and specific cases. In EXPLICIT KERNEL MEMORY BARRIERS =============================== -The Linux kernel has six basic CPU memory barriers: +The Linux kernel has eight basic CPU memory barriers: - MANDATORY SMP CONDITIONAL - =============== =============== - GENERAL mb() smp_mb() - READ rmb() smp_rmb() - WRITE wmb() smp_wmb() + TYPE MANDATORY SMP CONDITIONAL + =============== ======================= =========================== + GENERAL mb() smp_mb() + WRITE wmb() smp_wmb() + READ rmb() smp_rmb() + DATA DEPENDENCY read_barrier_depends() smp_read_barrier_depends() General memory barriers give a guarantee that all memory accesses specified before the barrier will appear to happen before all memory accesses specified @@ -472,6 +473,36 @@ after the barrier with respect to the ot Read and write memory barriers give similar guarantees, but only for memory reads versus memory reads and memory writes versus memory writes respectively. +Data dependency memory barriers ensure that if two reads are issued that +depend on each other, that the first read is completed _before_ the dependency +comes into effect. For instance, consider a case where the address used in +the second read is calculated from the result of the first read: + + CPU 1 CPU 2 COMMENT + =============== =============== ======================================= + a == 0, b == 1 and p == &a, q == &a + b = 2; + smp_wmb(); Make sure b is changed before p + p = &b; q = p; + d = *q; + +then old data values may be used in the address calculation for the second +value, potentially resulting in q == &b and d == 0 being seen, which is never +correct. What is required is a data dependency memory barrier: + + CPU 1 CPU 2 COMMENT + =============== =============== ======================================= + a == 0, b == 1 and p == &a, q == &a + b = 2; + smp_wmb(); Make sure b is changed before p + p = &b; q = p; + smp_read_barrier_depends(); + Make sure q is changed before d is read + d = *q; + +This forces the result to be either q == &a and d == 0 or q == &b and d == 2. +The result of q == &b and d == 0 will never be seen. + All memory barriers imply compiler barriers. SMP memory barriers are only compiler barriers on uniprocessor compiled systems From paulus at samba.org Wed Mar 15 08:48:03 2006 From: paulus at samba.org (Paul Mackerras) Date: Wed, 15 Mar 2006 08:48:03 +1100 Subject: [PATCH] Document Linux's memory barriers [try #4] In-Reply-To: <32068.1142371612@warthog.cambridge.redhat.com> References: <16835.1141936162@warthog.cambridge.redhat.com> <32068.1142371612@warthog.cambridge.redhat.com> Message-ID: <17431.14867.211423.851470@cargo.ozlabs.ibm.com> David Howells writes: > + CPU 1 CPU 2 COMMENT > + =============== =============== ======================================= > + a == 0, b == 1 and p == &a, q == &a > + b = 2; > + smp_wmb(); Make sure b is changed before p > + p = &b; q = p; > + d = *q; > + > +then old data values may be used in the address calculation for the second > +value, potentially resulting in q == &b and d == 0 being seen, which is never > +correct. What is required is a data dependency memory barrier: No, that's not the problem. The problem is that you can get q == &b and d == 1, believe it or not. That is, you can see the new value of the pointer but the old value of the thing pointed to. 
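In code, the two sides of that example look roughly like the sketch below. This is illustrative only -- struct obj, global_obj and the function names are invented here, not kernel code, and the barrier macros are assumed to come from the usual asm/system.h definitions. rcu_dereference() is the helper Eric mentioned that wraps the read-side barrier for exactly this case:

struct obj {
	int data;
};

static struct obj new_obj;
static struct obj *global_obj;		/* the "p" of the example */

void publish(void)			/* runs on CPU 1 */
{
	new_obj.data = 2;		/* "b = 2"  */
	smp_wmb();			/* order the init before the pointer store */
	global_obj = &new_obj;		/* "p = &b" */
}

int consume(void)			/* runs on CPU 2 */
{
	struct obj *q = global_obj;	/* "q = p"  */

	smp_read_barrier_depends();	/* without this, Alpha may return stale q->data */
	return q->data;			/* "d = *q" */
}

On most architectures smp_read_barrier_depends() compiles away to nothing; on Alpha it is a real barrier, which is why RCU funnels all such dependent loads through rcu_dereference().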
Paul. From johnrose at austin.ibm.com Wed Mar 15 10:46:45 2006 From: johnrose at austin.ibm.com (John Rose) Date: Tue, 14 Mar 2006 17:46:45 -0600 Subject: [PATCH] powerpc: properly configure DDR/P5IOC children devs Message-ID: <1142380005.11994.90.camel@sinatra.austin.ibm.com> The dynamic add path for PCI Host Bridges can fail to configure children adapters under P5IOC controllers. It fails to properly fixup bus/device resources, and it fails to properly enable EEH. Both of these steps need to occur before any children devices are enabled in pci_bus_add_devices(). Signed-off-by: John Rose --- This is a single concise respin of the 3 patches sent on March 6th: http://ozlabs.org/pipermail/linuxppc64-dev/2006-March/008275.html The first patch has been set aside as not essential, and the other two have been combined. This is a bug fix that will hopefully make 2.6.16. This has been tested for P5IOC and non-P5IOC slots. Thanks- John diff -puN arch/powerpc/kernel/rtas_pci.c~move_init_phb_dyn arch/powerpc/kernel/rtas_pci.c --- 2_6_p5_2/arch/powerpc/kernel/rtas_pci.c~move_init_phb_dyn 2006-03-14 17:28:18.000000000 -0600 +++ 2_6_p5_2-johnrose/arch/powerpc/kernel/rtas_pci.c 2006-03-14 17:29:25.000000000 -0600 @@ -280,8 +280,7 @@ static int phb_set_bus_ranges(struct dev return 0; } -static int __devinit setup_phb(struct device_node *dev, - struct pci_controller *phb) +int __devinit setup_phb(struct device_node *dev, struct pci_controller *phb) { if (is_python(dev)) python_countermeasures(dev); @@ -359,27 +358,6 @@ unsigned long __init find_and_init_phbs( return 0; } -struct pci_controller * __devinit init_phb_dynamic(struct device_node *dn) -{ - struct pci_controller *phb; - int primary; - - primary = list_empty(&hose_list); - phb = pcibios_alloc_controller(dn); - if (!phb) - return NULL; - setup_phb(dn, phb); - pci_process_bridge_OF_ranges(phb, dn, primary); - - pci_setup_phb_io_dynamic(phb, primary); - - pci_devs_phb_init_dynamic(phb); - scan_phb(phb); - - return phb; -} -EXPORT_SYMBOL(init_phb_dynamic); - /* RPA-specific bits for removing PHBs */ int pcibios_remove_root_bus(struct pci_controller *phb) { diff -puN arch/powerpc/platforms/pseries/pci_dlpar.c~move_init_phb_dyn arch/powerpc/platforms/pseries/pci_dlpar.c --- 2_6_p5_2/arch/powerpc/platforms/pseries/pci_dlpar.c~move_init_phb_dyn 2006-03-14 17:28:18.000000000 -0600 +++ 2_6_p5_2-johnrose/arch/powerpc/platforms/pseries/pci_dlpar.c 2006-03-14 17:37:09.000000000 -0600 @@ -27,6 +27,7 @@ #include #include +#include static struct pci_bus * find_bus_among_children(struct pci_bus *bus, @@ -179,3 +180,30 @@ pcibios_add_pci_devices(struct pci_bus * } } EXPORT_SYMBOL_GPL(pcibios_add_pci_devices); + +struct pci_controller * __devinit init_phb_dynamic(struct device_node *dn) +{ + struct pci_controller *phb; + int primary; + + primary = list_empty(&hose_list); + phb = pcibios_alloc_controller(dn); + if (!phb) + return NULL; + setup_phb(dn, phb); + pci_process_bridge_OF_ranges(phb, dn, 0); + + pci_setup_phb_io_dynamic(phb, primary); + + pci_devs_phb_init_dynamic(phb); + + if (dn->child) + eeh_add_device_tree_early(dn); + + scan_phb(phb); + pcibios_fixup_new_pci_devices(phb->bus, 0); + pci_bus_add_devices(phb->bus); + + return phb; +} +EXPORT_SYMBOL_GPL(init_phb_dynamic); diff -puN include/asm-powerpc/ppc-pci.h~move_init_phb_dyn include/asm-powerpc/ppc-pci.h --- 2_6_p5_2/include/asm-powerpc/ppc-pci.h~move_init_phb_dyn 2006-03-14 17:28:18.000000000 -0600 +++ 2_6_p5_2-johnrose/include/asm-powerpc/ppc-pci.h 2006-03-14 17:28:18.000000000 -0600 @@ -38,6 +38,7 @@ 
void *traverse_pci_devices(struct device void pci_devs_phb_init(void); void pci_devs_phb_init_dynamic(struct pci_controller *phb); +int setup_phb(struct device_node *dev, struct pci_controller *phb); void __devinit scan_phb(struct pci_controller *hose); /* From rtas_pci.h */ diff -puN arch/powerpc/kernel/pci_64.c~move_init_phb_dyn arch/powerpc/kernel/pci_64.c --- 2_6_p5_2/arch/powerpc/kernel/pci_64.c~move_init_phb_dyn 2006-03-14 17:37:01.000000000 -0600 +++ 2_6_p5_2-johnrose/arch/powerpc/kernel/pci_64.c 2006-03-14 17:37:09.000000000 -0600 @@ -589,7 +589,6 @@ void __devinit scan_phb(struct pci_contr #endif /* CONFIG_PPC_MULTIPLATFORM */ if (mode == PCI_PROBE_NORMAL) hose->last_busno = bus->subordinate = pci_scan_child_bus(bus); - pci_bus_add_devices(bus); } static int __init pcibios_init(void) @@ -608,8 +607,10 @@ static int __init pcibios_init(void) printk("PCI: Probing PCI hardware\n"); /* Scan all of the recorded PCI controllers. */ - list_for_each_entry_safe(hose, tmp, &hose_list, list_node) + list_for_each_entry_safe(hose, tmp, &hose_list, list_node) { scan_phb(hose); + pci_bus_add_devices(hose->bus); + } #ifndef CONFIG_PPC_ISERIES if (pci_probe_only) _ From dhowells at redhat.com Wed Mar 15 10:59:28 2006 From: dhowells at redhat.com (David Howells) Date: Tue, 14 Mar 2006 23:59:28 +0000 Subject: [PATCH] Document Linux's memory barriers [try #4] In-Reply-To: <17431.14867.211423.851470@cargo.ozlabs.ibm.com> References: <17431.14867.211423.851470@cargo.ozlabs.ibm.com> <16835.1141936162@warthog.cambridge.redhat.com> <32068.1142371612@warthog.cambridge.redhat.com> Message-ID: <2301.1142380768@warthog.cambridge.redhat.com> Paul Mackerras wrote: > No, that's not the problem. The problem is that you can get q == &b > and d == 1, believe it or not. That is, you can see the new value of > the pointer but the old value of the thing pointed to. But that doesn't make any sense! That would mean we that we'd've read b into d before having read the new value of p into q, and thus before having calculated the address from which to read d (ie: &b) - so how could we know we were supposed to read d from b and not from a without first having read p? Unless, of course, the smp_wmb() isn't effective, and the write to b happens after the write to p; or the Alpha's cache isn't fully coherent. David From torvalds at osdl.org Wed Mar 15 11:20:29 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Tue, 14 Mar 2006 16:20:29 -0800 (PST) Subject: [PATCH] Document Linux's memory barriers [try #4] In-Reply-To: <2301.1142380768@warthog.cambridge.redhat.com> References: <17431.14867.211423.851470@cargo.ozlabs.ibm.com> <16835.1141936162@warthog.cambridge.redhat.com> <32068.1142371612@warthog.cambridge.redhat.com> <2301.1142380768@warthog.cambridge.redhat.com> Message-ID: On Tue, 14 Mar 2006, David Howells wrote: > > But that doesn't make any sense! > > That would mean we that we'd've read b into d before having read the new value > of p into q, and thus before having calculated the address from which to read d > (ie: &b) - so how could we know we were supposed to read d from b and not from > a without first having read p? > > Unless, of course, the smp_wmb() isn't effective, and the write to b happens > after the write to p; or the Alpha's cache isn't fully coherent. The cache is fully coherent, but the coherency isn't _ordered_. Remember: the smp_wmb() only orders on the _writer_ side. Not on the reader side. 
The writer may send out the stuff in a particular order, but the reader might see them in a different order because _it_ might queue the bus events internally for its caches (in particular, it could end up delaying updating a particular way in the cache because it's busy). [ The issue of read_barrier_depends() can also come up if you do data speculation. Currently I don't think anybody does speculation for anything but control speculation, but it's at least possible that a read that "depends" on a previous read actually could take place before the read it depends on if the previous read had its result speculated. For example, you already have to handle the case of if (read a) read b; where we can read b _before_ we read a, because the CPU speculated the branch as being not taken, and then re-ordered the reads, even though they are "dependent" on each other. That's not that different from doing ptr = read a data = read [ptr] and speculating the result of the first read. Such a CPU would also need a non-empty read-barrier-depends ] So memory ordering is interesting. Some "clearly impossible" orderings actually suddenly become possible just because the CPU can do things speculatively and thus things aren't necessarily causally ordered any more. Linus From paulus at samba.org Wed Mar 15 11:54:45 2006 From: paulus at samba.org (Paul Mackerras) Date: Wed, 15 Mar 2006 11:54:45 +1100 Subject: [PATCH] Document Linux's memory barriers [try #4] In-Reply-To: <2301.1142380768@warthog.cambridge.redhat.com> References: <17431.14867.211423.851470@cargo.ozlabs.ibm.com> <16835.1141936162@warthog.cambridge.redhat.com> <32068.1142371612@warthog.cambridge.redhat.com> <2301.1142380768@warthog.cambridge.redhat.com> Message-ID: <17431.26069.168489.609677@cargo.ozlabs.ibm.com> David Howells writes: > Paul Mackerras wrote: > > > No, that's not the problem. The problem is that you can get q == &b > > and d == 1, believe it or not. That is, you can see the new value of > > the pointer but the old value of the thing pointed to. > > But that doesn't make any sense! It certainly violates the principle of least surprise. :) Apparently this can occur on some Alpha machines that have a partitioned cache. Although CPU 1 sends out the updates to b and p in the right order because of the smp_wmb(), it's possible that b and p are present in CPU 2's cache, one in each half of the cache. If there are a lot of updates coming in for the half containing b, but the half containing p is quiet, it is possible for CPU 2 to see a new value of p but an old value of b, unless you put an rmb instruction between the two loads from memory. I haven't heard of this being an issue on any other architecture. On PowerPC it can't happen because the architecture specifies that a data dependency creates an implicit read barrier. Paul. From dhowells at redhat.com Wed Mar 15 12:19:02 2006 From: dhowells at redhat.com (David Howells) Date: Wed, 15 Mar 2006 01:19:02 +0000 Subject: [PATCH] Document Linux's memory barriers [try #4] In-Reply-To: References: <17431.14867.211423.851470@cargo.ozlabs.ibm.com> <16835.1141936162@warthog.cambridge.redhat.com> <32068.1142371612@warthog.cambridge.redhat.com> <2301.1142380768@warthog.cambridge.redhat.com> Message-ID: <3762.1142385542@warthog.cambridge.redhat.com> Linus Torvalds wrote: > That's not that different from doing > > ptr = read a > data = read [ptr] > > and speculating the result of the first read. 
But that would lead to the situation I suggested (q == &b and d == a), not the one Paul suggested (q == &b and d == old b) because we'd speculate on the old value of the pointer, and so see it before it's updated, and thus still pointing to a. > The cache is fully coherent, but the coherency isn't _ordered_. > > Remember: the smp_wmb() only orders on the _writer_ side. Not on the > reader side. The writer may send out the stuff in a particular order, but > the reader might see them in a different order because _it_ might queue > the bus events internally for its caches (in particular, it could end up > delaying updating a particular way in the cache because it's busy). Ummm... So whilst smp_wmb() commits writes to the mercy of the cache coherency system in a particular order, the updates can be passed over from one cache to another and committed to the reader's cache in any order, and can even be delayed: CPU 1 CPU 2 COMMENT =============== =============== ======================================= a == 0, b == 1 and p == &a, q == &a b = 2; smp_wmb(); Make sure b is changed before p p = &b; q = p; d = *q; Reads from b before b updated in cache I presume the Alpha MB instruction forces cache queue completion in addition to a partial ordering on memory accesses: CPU 1 CPU 2 COMMENT =============== =============== ======================================= a == 0, b == 1 and p == &a, q == &a b = 2; smp_wmb(); Make sure b is changed before p p = &b; q = p; smp_read_barrier_depends(); d = *q; Reads new value of b David From nickpiggin at yahoo.com.au Wed Mar 15 12:25:13 2006 From: nickpiggin at yahoo.com.au (Nick Piggin) Date: Wed, 15 Mar 2006 12:25:13 +1100 Subject: [PATCH] Document Linux's memory barriers [try #4] In-Reply-To: References: <17431.14867.211423.851470@cargo.ozlabs.ibm.com> <16835.1141936162@warthog.cambridge.redhat.com> <32068.1142371612@warthog.cambridge.redhat.com> <2301.1142380768@warthog.cambridge.redhat.com> Message-ID: <44176CF9.90909@yahoo.com.au> Linus Torvalds wrote: > >On Tue, 14 Mar 2006, David Howells wrote: > >>But that doesn't make any sense! >> >>That would mean we that we'd've read b into d before having read the new value >>of p into q, and thus before having calculated the address from which to read d >>(ie: &b) - so how could we know we were supposed to read d from b and not from >>a without first having read p? >> >>Unless, of course, the smp_wmb() isn't effective, and the write to b happens >>after the write to p; or the Alpha's cache isn't fully coherent. >> > >The cache is fully coherent, but the coherency isn't _ordered_. > > This is what I was referring to when I said your (David's) idea of "memory" WRT memory consistency isn't correct -- cache coherency can be out of order. 
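The pairing rule that falls out of this is worth spelling out: a writer-side smp_wmb() buys nothing unless the reader orders its loads as well. A minimal sketch, with invented flag/data names -- smp_rmb() is needed for ordinary reads; the weaker smp_read_barrier_depends() is enough only when the second load's address comes from the first:

static int data;
static int flag;

void writer(void)		/* CPU 1 */
{
	data = 42;
	smp_wmb();		/* data enters the coherency order before flag */
	flag = 1;
}

int reader(void)		/* CPU 2 */
{
	while (!flag)		/* wait until the flag store is visible */
		cpu_relax();
	smp_rmb();		/* don't satisfy the data load early */
	return data;		/* guaranteed to see 42 */
}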
-- Send instant messages to your online friends http://au.messenger.yahoo.com From torvalds at osdl.org Wed Mar 15 12:47:40 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Tue, 14 Mar 2006 17:47:40 -0800 (PST) Subject: [PATCH] Document Linux's memory barriers [try #4] In-Reply-To: <3762.1142385542@warthog.cambridge.redhat.com> References: <17431.14867.211423.851470@cargo.ozlabs.ibm.com> <16835.1141936162@warthog.cambridge.redhat.com> <32068.1142371612@warthog.cambridge.redhat.com> <2301.1142380768@warthog.cambridge.redhat.com> <3762.1142385542@warthog.cambridge.redhat.com> Message-ID: On Wed, 15 Mar 2006, David Howells wrote: > Linus Torvalds wrote: > > > That's not that different from doing > > > > ptr = read a > > data = read [ptr] > > > > and speculating the result of the first read. > > But that would lead to the situation I suggested (q == &b and d == a), not the > one Paul suggested (q == &b and d == old b) because we'd speculate on the old > value of the pointer, and so see it before it's updated, and thus still > pointing to a. No. If it _speculates_ the old value, and the value has actually changed when it checks the speculation, it would generally result in a uarch trap, and re-do of the instruction without speculation. So for data speculation to make a difference in this case, it would speculate the _new_ value (hey, doesn't matter _why_ - it could be that a previous load at a previous time had gotten that value), and then load the old value off the new pointer, and when the speculation ends up being checked, it all pans out (the speculated value matched the value when "a" was actually later read), and you get a "non-causal" result. Now, nobody actually does this kind of data speculation as far as I know, and there are perfectly valid arguments for why outside of control speculation nobody likely will (at least partly due to the fact that it would screw up existing expectations for memory ordering). It's also pretty damn complicated to do. But data speculation has certainly been a research subject, and there are papers on it. > > Remember: the smp_wmb() only orders on the _writer_ side. Not on the > > reader side. The writer may send out the stuff in a particular order, but > > the reader might see them in a different order because _it_ might queue > > the bus events internally for its caches (in particular, it could end up > > delaying updating a particular way in the cache because it's busy). > > Ummm... So whilst smp_wmb() commits writes to the mercy of the cache coherency > system in a particular order, the updates can be passed over from one cache to > another and committed to the reader's cache in any order, and can even be > delayed: Right. You should _always_ have as a rule of thinking that a "smp_wmb()" on one side absolutely _has_ to be paired with a "smp_rmb()" on the other side. If they aren't paired, something is _wrong_. Now, the data-dependent reads is actually a very specific optimization where we say that on certain architectures you don't need it, so we relax the rule to be "the reader has to have a smp_rmb() _or_ a smp_read_barrier_depends(), where the latter is only valid if the address of the dependent read depends directly on the first one". But the read barrier always has to be there, even though it can be of the "weaker" type. And note that the address really has to have a _data_ dependency, not a control dependency. 
If the address is dependent on the first read, but the dependency is through a
conditional rather than actually reading the address itself, then it's a
control dependency, and existing CPUs already short-circuit those through
branch prediction.

		Linus

From ntl at pobox.com Wed Mar 15 16:16:57 2006
From: ntl at pobox.com (Nathan Lynch)
Date: Tue, 14 Mar 2006 23:16:57 -0600
Subject: [PATCH] powerpc: add for_each_node_by_foo helpers
In-Reply-To: <20060313081511.GC3205@localhost.localdomain>
References: <20060308154700.GA15859@lst.de>
	<1142119340.4057.38.camel@localhost.localdomain>
	<20060312221817.GA3205@localhost.localdomain>
	<1142206193.4433.17.camel@localhost.localdomain>
	<20060313031138.GB3205@localhost.localdomain>
	<1142219872.32330.12.camel@localhost.localdomain>
	<20060313081511.GC3205@localhost.localdomain>
Message-ID: <20060315051657.GE3205@localhost.localdomain>

Nathan Lynch wrote:
> Benjamin Herrenschmidt wrote:
> >
> > > > I recently showed a variety of scenarios that would blow up in funny
> > > > ways.. It's not enough.
> > >
> > > A variety of scenarios? I'm aware of only one [1], which is kind of a
> > > corner case, and has never come up in practice to my knowledge.
> > >
> > > If you know of others, I'd like to know the details.
> >
> > I actually have one in mind that I remember, but I'm pretty sure I saw
> > another one...
>
> More details, please. If you know of another issue besides the one
> pointed out already, I'd prefer to know the nature of it before
> putting more time into fixing this one.

Milton just pointed out to me another problem. There is a race between
add and remove operations in that a child can be added to a node that is
undergoing removal, which would cause an inconsistent tree:

static int pSeries_reconfig_remove_node(struct device_node *np, int refs)
{
	struct device_node *parent, *child;

	parent = of_get_parent(np);
	if (!parent)
		return -EINVAL;

	if ((child = of_get_next_child(np, NULL))) {
		of_node_put(child);
		return -EBUSY;
	}

>>>>	/* child could be added now */

	remove_node_proc_entries(np);
	notifier_call_chain(&pSeries_reconfig_chain,
			    PSERIES_RECONFIG_REMOVE, np);
	of_detach_node(np, refs);

Unlike the double-remove problem, this one seems easy to fix :)
The check for the child should be moved under the devtree_lock, or we
could serialize reconfig operations with a mutex.

From hch at lst.de Wed Mar 15 21:30:44 2006
From: hch at lst.de (Christoph Hellwig)
Date: Wed, 15 Mar 2006 11:30:44 +0100
Subject: [PATCH] spidernet: select FW_LOADER
Message-ID: <20060315103044.GA14919@lst.de>

The spidernet driver uses request_firmware() and thus needs to select
FW_LOADER.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: systemsim/drivers/net/Kconfig
===================================================================
--- systemsim.orig/drivers/net/Kconfig	2006-03-14 17:08:07.000000000 +0100
+++ systemsim/drivers/net/Kconfig	2006-03-15 11:03:20.000000000 +0100
@@ -2184,6 +2184,7 @@
 config SPIDER_NET
 	tristate "Spider Gigabit Ethernet driver"
 	depends on PCI && PPC_CELL
+	select FW_LOADER
 	help
 	  This driver supports the Gigabit Ethernet chips present on the
 	  Cell Processor-Based Blades from IBM.
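For reference, the request_firmware() pattern that needs FW_LOADER looks
roughly like this (a hand-waved sketch, not the actual spidernet code; the
firmware name here is made up):

	const struct firmware *fw;
	int err;

	err = request_firmware(&fw, "spider_fw.bin", &pdev->dev);
	if (err) {
		dev_err(&pdev->dev, "firmware not available\n");
		return err;
	}
	/* ... download fw->data (fw->size bytes) to the chip ... */
	release_firmware(fw);

Without FW_LOADER built, request_firmware() isn't available and the driver
would fail with unresolved symbols.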
From dhowells at redhat.com Wed Mar 15 22:10:18 2006 From: dhowells at redhat.com (David Howells) Date: Wed, 15 Mar 2006 11:10:18 +0000 Subject: [PATCH] Document Linux's memory barriers [try #4] In-Reply-To: <44110E93.8060504@yahoo.com.au> References: <44110E93.8060504@yahoo.com.au> <16835.1141936162@warthog.cambridge.redhat.com> Message-ID: <14886.1142421018@warthog.cambridge.redhat.com> Nick Piggin wrote: > Isn't the Alpha's split caches a counter-example of your model, > because the coherency itself is out of order? I'd forgotten I need to adjust my documentation to deal with this. It seems this is the reason for read_barrier_depends(), and that read_barrier_depends() is also a partial cache sync. Do you know of any docs on Alpha's split caches? The Alpha Arch Handbook doesn't say very much about cache operation on the Alpha. I've looked around for what exactly is meant by "split cache" in conjunction with Alpha CPUs, and I've found three different opinions of what it means: (1) Separate Data and Instruction caches. (2) Serial data caches (CPU -> L1 Cache -> L2 Cache -> L3 Cache -> Memory). (3) Parallel linked data caches, where a CPU's request can be satisfied by either data cache, in which whilst one data cache is being interrogated by the CPU, the other one can use the memory bus (at least, that's what I understand). > Why do you need to include caches and queues in your model? Do > programmers care? Isn't the following sufficient... I don't think it is sufficient, given the number of times the way the cache interacts with everything has come up in this discussion. > : | m | > CPU -----> | e | > : | m | > : | o | > CPU -----> | r | > : | y | > > ... and bugger the implementation details? Ah, but if the cache is on the CPU side of the dotted line, does that then mean that a write memory barrier guarantees the CPU's cache to have updated memory? David From nickpiggin at yahoo.com.au Wed Mar 15 22:51:30 2006 From: nickpiggin at yahoo.com.au (Nick Piggin) Date: Wed, 15 Mar 2006 22:51:30 +1100 Subject: [PATCH] Document Linux's memory barriers [try #4] In-Reply-To: <14886.1142421018@warthog.cambridge.redhat.com> References: <44110E93.8060504@yahoo.com.au> <16835.1141936162@warthog.cambridge.redhat.com> <14886.1142421018@warthog.cambridge.redhat.com> Message-ID: <4417FFC2.8040909@yahoo.com.au> David Howells wrote: > Nick Piggin wrote: > > >>Isn't the Alpha's split caches a counter-example of your model, >>because the coherency itself is out of order? > > > I'd forgotten I need to adjust my documentation to deal with this. It seems > this is the reason for read_barrier_depends(), and that read_barrier_depends() > is also a partial cache sync. > > Do you know of any docs on Alpha's split caches? The Alpha Arch Handbook > doesn't say very much about cache operation on the Alpha. > > I've looked around for what exactly is meant by "split cache" in conjunction > with Alpha CPUs, and I've found three different opinions of what it means: > > (1) Separate Data and Instruction caches. > > (2) Serial data caches (CPU -> L1 Cache -> L2 Cache -> L3 Cache -> Memory). > > (3) Parallel linked data caches, where a CPU's request can be satisfied by > either data cache, in which whilst one data cache is being interrogated by > the CPU, the other one can use the memory bus (at least, that's what I > understand). > I don't have any docs myself, Paul might be the one to talk to as he's done the most recent research on this (though some of it directly with Alpha engineers, if I remember correctly). 
IIRC some alpha models have split cache close to what you describe in #3, however I'm not sure of the fine details (eg. I don't think the split caches are redundant, but just treated as two entities by the cache coherence protocol). Again, I don't have anything definitive that you can put in your docco, sorry. > >>Why do you need to include caches and queues in your model? Do >>programmers care? Isn't the following sufficient... > > > I don't think it is sufficient, given the number of times the way the cache > interacts with everything has come up in this discussion. > > >> : | m | >> CPU -----> | e | >> : | m | >> : | o | >> CPU -----> | r | >> : | y | >> >>... and bugger the implementation details? > > > Ah, but if the cache is on the CPU side of the dotted line, does that then mean > that a write memory barrier guarantees the CPU's cache to have updated memory? > I don't think it has to[*]. It would guarantee the _order_ in which "global memory" of this model ie. visibility for other "CPUs" see the writes, whether that visibility ultimately be implemented by cache coherency protocol or something else, I don't think matters (for a discussion of memory ordering). If anything it confused the matter for the case of Alpha. All the programmer needs to know is that there is some horizon (memory) beyond which stores are visible to other CPUs, and stores can travel there at different speeds so later ones can overtake earlier ones. And likewise loads can come from memory to the CPU at different speeds too, so later loads can contain earlier results. [*] Nor would your model require a smp_wmb() to update CPU caches either, I think: it wouldn't have to flush the store buffer, just order it. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com From dhowells at redhat.com Thu Mar 16 00:47:34 2006 From: dhowells at redhat.com (David Howells) Date: Wed, 15 Mar 2006 13:47:34 +0000 Subject: [PATCH] Document Linux's memory barriers [try #4] In-Reply-To: <4417FFC2.8040909@yahoo.com.au> References: <4417FFC2.8040909@yahoo.com.au> <44110E93.8060504@yahoo.com.au> <16835.1141936162@warthog.cambridge.redhat.com> <14886.1142421018@warthog.cambridge.redhat.com> Message-ID: <17625.1142430454@warthog.cambridge.redhat.com> Nick Piggin wrote: > > Ah, but if the cache is on the CPU side of the dotted line, does that then > > mean that a write memory barrier guarantees the CPU's cache to have > > updated memory? > > I don't think it has to[*]. It would guarantee the _order_ in which "global > memory" of this model ie. visibility for other "CPUs" see the writes, > whether that visibility ultimately be implemented by cache coherency > protocol or something else, I don't think matters (for a discussion of > memory ordering). It does matter, because I have to make it clear that the effect of the memory barrier usually stops at the cache, and in fact memory barriers may have no visibility at all on another CPU because it's all done inside a CPU's cache, until that other CPU tries to observe the results. > If anything it confused the matter for the case of Alpha. Nah... Alpha is self-confusing:-) > All the programmer needs to know is that there is some horizon (memory) > beyond which stores are visible to other CPUs, and stores can travel there > at different speeds so later ones can overtake earlier ones. And likewise > loads can come from memory to the CPU at different speeds too, so later > loads can contain earlier results. 
They also need to know that memory barriers don't imply an ordering on the
cache.

> [*] Nor would your model require a smp_wmb() to update CPU caches either, I
> think: it wouldn't have to flush the store buffer, just order it.

Exactly.

But in your diagram, given that it doesn't show the cache, you don't know that
the memory barrier doesn't extend through the cache and all the way to memory.

David

From dhowells at redhat.com Thu Mar 16 01:23:19 2006
From: dhowells at redhat.com (David Howells)
Date: Wed, 15 Mar 2006 14:23:19 +0000
Subject: [PATCH] Document Linux's memory barriers [try #5]
In-Reply-To: <16835.1141936162@warthog.cambridge.redhat.com>
References: <16835.1141936162@warthog.cambridge.redhat.com>
Message-ID: <18351.1142432599@warthog.cambridge.redhat.com>


The attached patch documents the Linux kernel's memory barriers.

I've updated it from the comments I've been given.

The per-arch notes sections are gone because it's clear that there are so many
exceptions, that it's not worth having them.

I've added a list of references to other documents.

I've tried to get rid of the concept of memory accesses appearing on the bus;
what matters is apparent behaviour with respect to other observers in the
system.

Interrupt barrier effects are now considered to be non-existent. They may be
there, but you may not rely on them.

I've added a couple of definition sections at the top of the document: one to
specify the minimum execution model that may be assumed, the other to specify
what this document refers to by the term "memory".

I've made greater mention of the use of mmiowb().

I've adjusted the way in which caches are described, and described the fun
that can be had with cache coherence maintenance being unordered and data
dependency not being necessarily implicit.

I've described (smp_)read_barrier_depends().

Signed-Off-By: David Howells <dhowells@redhat.com>
---
warthog>diffstat -p1 /tmp/mb.diff
 Documentation/memory-barriers.txt | 1039 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 1039 insertions(+)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
new file mode 100644
index 0000000..fd7a6f1
--- /dev/null
+++ b/Documentation/memory-barriers.txt
@@ -0,0 +1,1039 @@
+                         ============================
+                         LINUX KERNEL MEMORY BARRIERS
+                         ============================
+
+Contents:
+
+ (*) Assumed minimum execution ordering model.
+
+ (*) What is considered memory?
+
+     - Cached interactions.
+     - Cache coherency.
+     - Uncached interactions.
+
+ (*) What are memory barriers?
+
+ (*) Where are memory barriers needed?
+
+     - Accessing devices.
+     - Multiprocessor interaction.
+     - Interrupts.
+
+ (*) Explicit kernel compiler barriers.
+
+ (*) Explicit kernel memory barriers.
+
+     - Barrier pairing.
+
+ (*) Implicit kernel memory barriers.
+
+     - Locking functions.
+     - Interrupt disabling functions.
+     - Miscellaneous functions.
+
+ (*) Inter-CPU locking barrier effects.
+
+     - Locks vs memory accesses.
+     - Locks vs I/O accesses.
+
+ (*) Kernel I/O barrier effects.
+
+ (*) References.
+
+
+========================================
+ASSUMED MINIMUM EXECUTION ORDERING MODEL
+========================================
+
+It has to be assumed that the conceptual CPU is weakly-ordered in all respects
+but that it will maintain the appearance of program causality with respect to
+itself. Some CPUs (such as i386 or x86_64) are more constrained than others
+(such as powerpc or frv), and so the most relaxed case must be assumed outside
+of arch-specific code.
+
+This means that it must be considered that the CPU will execute its instruction
+stream in any order it feels like - or even in parallel - provided that if an
+instruction in the stream depends on an earlier instruction, then that
+earlier instruction must be sufficiently complete[*] before the later
+instruction may proceed.
+
+  [*] Some instructions have more than one effect[**] and different instructions
+      may depend on different effects.
+
+  [**] Eg: changes to condition codes and registers; memory events.
+
+A CPU may also discard any instruction sequence that ultimately winds up having
+no effect. For example if two adjacent instructions both load an immediate
+value into the same register, the first may be discarded.
+
+
+Similarly, it has to be assumed that the compiler might reorder the instruction
+stream in any way it sees fit, again provided the appearance of causality is
+maintained.
+
+
+==========================
+WHAT IS CONSIDERED MEMORY?
+==========================
+
+For the purpose of this specification what's meant by "memory" needs to be
+defined, and the division between CPU and memory needs to be marked out.
+
+
+CACHED INTERACTIONS
+-------------------
+
+As far as cached CPU vs CPU[*] interactions go, "memory" has to include the CPU
+caches in the system. Although any particular read or write may not actually
+appear outside of the CPU that issued it (the CPU may have been able to
+satisfy it from its own cache), it's still as if the memory access had taken
+place as far as the other CPUs are concerned since the cache coherency and
+ejection mechanisms will propagate the effects upon conflict.
+
+  [*] Also applies to CPU vs device when accessed through a cache.
+
+The system can be considered logically as:
+
+        <--- CPU --->             :    <----------- Memory ----------->
+                                  :
+    +--------+    +--------+      :    +--------+    +-----------+
+    |        |    |        |      :    |        |    |           |    +--------+
+    |  CPU   |    | Memory |      :    |  CPU   |    |           |    |        |
+    |  Core  |--->| Access |---------->| Cache  |<-->|           |    |        |
+    |        |    | Queue  |      :    |        |    |           |--->| Memory |
+    |        |    |        |      :    |        |    |           |    |        |
+    +--------+    +--------+      :    +--------+    |           |    |        |
+                                  :                  |   Cache   |    +--------+
+                                  :                  | Coherency |
+                                  :                  | Mechanism |    +--------+
+    +--------+    +--------+      :    +--------+    |           |    |        |
+    |        |    |        |      :    |        |    |           |    |        |
+    |  CPU   |    | Memory |      :    |  CPU   |    |           |--->| Device |
+    |  Core  |--->| Access |---------->| Cache  |<-->|           |    |        |
+    |        |    | Queue  |      :    |        |    |           |    |        |
+    |        |    |        |      :    |        |    |           |    +--------+
+    +--------+    +--------+      :    +--------+    +-----------+
+                                  :
+                                  :
+
+The CPU core may execute instructions in any order it deems fit, provided the
+expected program causality appears to be maintained. Some of the instructions
+generate load and store operations which then go into the memory access queue
+to be performed. The core may place these in the queue in any order it wishes,
+and continue execution until it is forced to wait for an instruction to
+complete.
+
+What memory barriers are concerned with is controlling the order in which
+accesses cross from the CPU side of things to the memory side of things, and
+the order in which the effects are perceived to happen by the other observers
+in the system.
+
+
+CACHE COHERENCY
+---------------
+
+Life isn't quite as simple as it may appear above, however: for while the
+caches are expected to be coherent, there's no guarantee that that coherency
+will be ordered. This means that whilst changes made on one CPU will
+eventually become visible on all CPUs, there's no guarantee that they will
+become apparent in the same order on those other CPUs.
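+
+As a concrete sketch of the scenario dissected below (written in C, and using
+the same variables as the following tables), CPU 1 publishes a value through a
+pointer whilst CPU 2 chases that pointer:
+
+        int a = 0, b = 1;
+        int *p = &a, *q;
+        int d;
+
+        /* CPU 1 */                     /* CPU 2 */
+        b = 2;
+        smp_wmb();                      /* pairs with a read-side barrier */
+        p = &b;                         q = p;
+                                        d = *q;         /* may see the old b */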
+
+
+Consider dealing with a pair of CPUs (1 & 2), each of which has a pair of
+parallel data caches (A,B and C,D) with the following properties:
+
+ (*) a cacheline may be in one of the caches;
+
+ (*) whilst the CPU core is interrogating one cache, the other cache may be
+     making use of the bus to access the rest of the system - perhaps to
+     displace a dirty cacheline or to do a speculative read;
+
+ (*) each cache has a queue of operations that need to be applied to that cache
+     to maintain coherency with the rest of the system;
+
+ (*) the coherency queue is not flushed by normal reads to lines already
+     present in the cache, even though the contents of the queue may
+     potentially affect those reads.
+
+Imagine, then, two writes made on the first CPU, with a barrier between them to
+guarantee that they will reach that CPU's caches in the requisite order:
+
+        CPU 1           CPU 2           COMMENT
+        =============== =============== =======================================
+        a == 0, b == 1 and p == &a, q == &a
+        b = 2;
+        smp_wmb();                      Make sure b is changed before p
+                                        The cacheline of b is now ours
+        p = &b;
+                                        The cacheline of p is now ours
+
+The write memory barrier forces the local CPU caches to be updated in the
+correct order. But now imagine that the second CPU wants to read those
+values:
+
+        CPU 1           CPU 2           COMMENT
+        =============== =============== =======================================
+        ...
+                        q = p;
+                        d = *q;
+
+The above pair of reads may then fail to apparently happen in the expected
+order, as the cacheline holding p may get updated in one of the second CPU's
+caches whilst the update to the cacheline holding b is delayed in the other of
+the second CPU's caches by some other cache event:
+
+        CPU 1           CPU 2           COMMENT
+        =============== =============== =======================================
+        a == 0, b == 1 and p == &a, q == &a
+        b = 2;
+        smp_wmb();                      Make sure b is changed before p
+
+
+        p = &b;         q = p;
+
+
+
+                        d = *q;
+                                        Reads from b before b updated in cache
+
+
+
+Basically, whilst both cachelines will be updated on CPU 2 eventually, there's
+no guarantee that, without intervention, the order of update will be the same
+as that committed on CPU 1.
+
+
+To intervene, we need to emplace a data dependency barrier or a read barrier.
+This will force the cache to commit its coherency queue before processing any
+further requests:
+
+        CPU 1           CPU 2           COMMENT
+        =============== =============== =======================================
+        a == 0, b == 1 and p == &a, q == &a
+        b = 2;
+        smp_wmb();                      Make sure b is changed before p
+
+
+        p = &b;         q = p;
+
+
+
+                        smp_read_barrier_depends()
+
+
+
+                        d = *q;
+                                        Reads from b after b updated in cache
+
+
+This sort of problem can be encountered on Alpha processors as they have a
+split cache that improves performance by making better use of the data bus.
+Whilst most CPUs do imply a data dependency barrier on the read when a memory
+access depends on a read, not all do, so it may not be relied on.
+
+
+UNCACHED INTERACTIONS
+---------------------
+
+Note that the above model does not show uncached memory or I/O accesses.  These
+proceed directly from the queue to the memory or the devices, bypassing any
+caches or cache coherency:
+
+        <--- CPU --->             :
+                                  :            +-----+
+    +--------+    +--------+      :            |     |
+    |        |    |        |      :            |     |            +---------+
+    |  CPU   |    | Memory |      :            |     |            |         |
+    |  Core  |--->| Access |------------------>|     |            |         |
+    |        |    | Queue  |      :            |     |----------->| Memory  |
+    |        |    |        |      :            |     |            |         |
+    +--------+    +--------+      :            |     |            |         |
+                                  :            |     |            +---------+
+                                  :            | Bus |
+                                  :            |     |            +---------+
+    +--------+    +--------+      :            |     |            |         |
+    |        |    |        |      :            |     |            |         |
+    |  CPU   |    | Memory |      :            |     |<---------->| Device  |
+    |  Core  |--->| Access |------------------>|     |            |         |
+    |        |    | Queue  |      :            |     |            |         |
+    |        |    |        |      :            |     |            +---------+
+    +--------+    +--------+      :            |     |
+                                  :            +-----+
+                                  :
+
+
+=========================
+WHAT ARE MEMORY BARRIERS?
+=========================
+
+Memory barriers are instructions to both the compiler and the CPU to impose an
+apparent partial ordering between the memory access operations specified either
+side of the barrier. They request that the sequence of memory events generated
+appears to other components of the system as if the barrier is effective on
+that CPU.
+
+Note that:
+
+ (*) there's no guarantee that the sequence of memory events is _actually_ so
+     ordered. It's possible for the CPU to do out-of-order accesses _as long
+     as no-one is looking_, and then fix up the memory if someone else tries to
+     see what's going on (for instance a bus master device); what matters is
+     the _apparent_ order as far as other processors and devices are concerned;
+     and
+
+ (*) memory barriers are only guaranteed to act within the CPU processing them,
+     and are not, for the most part, guaranteed to percolate down to other CPUs
+     in the system or to any I/O hardware that that CPU may communicate with.
+
+
+For example, a programmer might take it for granted that the CPU will perform
+memory accesses in exactly the order specified, so that if a CPU is, for
+example, given the following piece of code:
+
+        a = *A;
+        *B = b;
+        c = *C;
+        d = *D;
+        *E = e;
+
+They would then expect that the CPU will complete the memory access for each
+instruction before moving on to the next one, leading to a definite sequence of
+operations as seen by external observers in the system:
+
+        read *A, write *B, read *C, read *D, write *E.
+
+
+Reality is, of course, much messier. With many CPUs and compilers, this isn't
+always true because:
+
+ (*) reads are more likely to need to be completed immediately to permit
+     execution progress, whereas writes can often be deferred without a
+     problem;
+
+ (*) reads can be done speculatively, and then the result discarded should it
+     prove not to be required;
+
+ (*) the order of the memory accesses may be rearranged to promote better use
+     of the CPU buses and caches;
+
+ (*) reads and writes may be combined to improve performance when talking to
+     the memory or I/O hardware that can do batched accesses of adjacent
+     locations, thus cutting down on transaction setup costs (memory and PCI
+     devices may be able to do this); and
+
+ (*) the CPU's data cache may affect the ordering, and whilst cache-coherency
+     mechanisms may alleviate this - once the write has actually hit the cache
+     - there's no guarantee that the coherency management will be propagated in
+     order to other CPUs.
+
+So what another CPU, say, might actually observe from the above piece of code
+is:
+
+        read *A, read {*C,*D}, write *E, write *B
+
+        (By "read {*C,*D}" I mean a combined single read).
+
+
+It is also guaranteed that a CPU will be self-consistent: it will see its _own_
+accesses appear to be correctly ordered, without the need for a memory barrier.
+For instance with the following code:
+
+        X = *A;
+        *A = Y;
+        Z = *A;
+
+assuming no intervention by an external influence, it can be taken that:
+
+ (*) the read of *A into X will never happen after the write, and thus X will
+     hold the old value of *A rather than the value that was assigned to *A
+     from Y; and
+
+ (*) the read of *A into Z will never happen before the write, and thus Z will
+     always be given the value that was assigned to *A from Y, rather than the
+     value that was in *A initially.
+
+(This is ignoring the fact that the value initially in *A may appear to be the
+same as the value assigned to *A from Y).
+
+
+=================================
+WHERE ARE MEMORY BARRIERS NEEDED?
+=================================
+
+Under normal operation, access reordering is probably not going to be a problem
+as a linear program will still appear to operate correctly. There are,
+however, three circumstances where reordering definitely _could_ be a problem:
+
+
+ACCESSING DEVICES
+-----------------
+
+Many devices can be memory mapped, and so appear to the CPU as if they're just
+memory locations. However, to control the device, the driver has to make the
+right accesses in exactly the right order.
+
+Consider, for example, an ethernet chipset such as the AMD PCnet32. It
+presents to the CPU an "address register" and a bunch of "data registers". The
+way it's accessed is to write the index of the internal register to be accessed
+to the address register, and then read or write the appropriate data register
+to access the chip's internal register, which could - theoretically - be done
+by:
+
+        *ADR = ctl_reg_3;
+        reg = *DATA;
+
+The problem with a clever CPU or a clever compiler is that the write to the
+address register isn't guaranteed to happen before the access to the data
+register, if the CPU or the compiler thinks it is more efficient to defer the
+address write:
+
+        read *DATA, write *ADR
+
+then things will break.
+
+
+In the Linux kernel, however, I/O should be done through the appropriate
+accessor routines - such as inb() or writel() - which know how to make such
+accesses appropriately sequential.
+
+On some systems, I/O writes are not strongly ordered across all CPUs, and so
+locking should be used, and mmiowb() must be issued prior to unlocking the
+critical section.
+
+See Documentation/DocBook/deviceiobook.tmpl for more information.
+
+
+MULTIPROCESSOR INTERACTION
+--------------------------
+
+When there's a system with more than one processor, the CPUs in the system may
+be working on the same set of data at the same time. This can cause
+synchronisation problems, and the usual way of dealing with them is to use
+locks - but locks are quite expensive, and so it may be preferable to operate
+without the use of a lock if at all possible. In such a case accesses that
+affect both CPUs may have to be carefully ordered to prevent error.
+
+Consider the R/W semaphore slow path. In that, a waiting process is queued on
+the semaphore, as noted by it having a record on its stack linked to the
+semaphore's list:
+
+        struct rw_semaphore {
+                ...
+                struct list_head waiters;
+        };
+
+        struct rwsem_waiter {
+                struct list_head list;
+                struct task_struct *task;
+        };
+
+To wake up the waiter, the up_read() or up_write() functions have to read the
+pointer from this record to know where the next waiter record is, clear the
+task pointer, call wake_up_process() on the task, and release the reference
+held on the waiter's task struct:
+
+        READ waiter->list.next;
+        READ waiter->task;
+        WRITE waiter->task;
+        CALL wakeup
+        RELEASE task
+
+If any of these steps occur out of order, then the whole thing may fail.
+
+Note that the waiter does not get the semaphore lock again - it just waits for
+its task pointer to be cleared. Since the record is on its stack, this means
+that if the task pointer is cleared _before_ the next pointer in the list is
+read, another CPU might start processing the waiter and it might clobber its
+stack before the up*() functions have a chance to read the next pointer.
+
+        CPU 1                           CPU 2
+        =============================== ===============================
+        down_xxx()
+        Queue waiter
+        Sleep
+                                        up_yyy()
+                                        READ waiter->task;
+                                        WRITE waiter->task;
+
+        Resume processing
+        down_xxx() returns
+        call foo()
+        foo() clobbers *waiter
+
+                                        READ waiter->list.next;
+                                        --- OOPS ---
+
+This could be dealt with using a spinlock, but then the down_xxx() function has
+to get the spinlock again after it's been woken up, which is a waste of
+resources.
+
+The way to deal with this is to insert an SMP memory barrier:
+
+        READ waiter->list.next;
+        READ waiter->task;
+        smp_mb();
+        WRITE waiter->task;
+        CALL wakeup
+        RELEASE task
+
+In this case, the barrier makes a guarantee that all memory accesses before the
+barrier will appear to happen before all the memory accesses after the barrier
+with respect to the other CPUs on the system. It does _not_ guarantee that all
+the memory accesses before the barrier will be complete by the time the barrier
+itself is complete.
+
+SMP memory barriers are normally nothing more than compiler barriers on a
+kernel compiled for a UP system because the CPU orders overlapping accesses
+with respect to itself, and so CPU barriers aren't needed.
+
+
+INTERRUPTS
+----------
+
+A driver may be interrupted by its own interrupt service routine, and thus the
+two may interfere with each other's attempts to control or access the device.
+
+This may be alleviated - at least in part - by disabling interrupts (a form of
+locking), such that the critical operations are all contained within the
+interrupt-disabled section in the driver. Whilst the driver's interrupt
+routine is executing, the driver's core may not run on the same CPU, and its
+interrupt is not permitted to happen again until the current interrupt has been
+handled, thus the interrupt handler does not need to lock against that.
+
+However, consider a driver talking to an ethernet card that sports an
+address register and a data register. If that driver's core talks to the
+card under interrupt-disablement and then the driver's interrupt handler is
+invoked:
+
+        LOCAL IRQ DISABLE
+        writew(ADDR, ctl_reg_3);
+        writew(DATA, y);
+        LOCAL IRQ ENABLE
+
+        writew(ADDR, ctl_reg_4);
+        q = readw(DATA);
+
+
+If ordering rules are sufficiently relaxed, the write to the data register
+might happen after the second write to the address register.
+
+If ordering rules are relaxed, it must be assumed that accesses done inside an
+interrupt disabled section may leak outside of it and may interleave with
+accesses performed in an interrupt and vice versa unless implicit or explicit
+barriers are used.
+
+Normally this won't be a problem because the I/O accesses done inside such
+sections will include synchronous read operations on strictly ordered I/O
+registers that form implicit I/O barriers. If this isn't sufficient then an
+mmiowb() may need to be used explicitly.
+
+
+A similar situation may occur between an interrupt routine and two routines
+running on separate CPUs that communicate with each other. If such a case is
+likely, then interrupt-disabling locks should be used to guarantee ordering.
+
+
+=================================
+EXPLICIT KERNEL COMPILER BARRIERS
+=================================
+
+The Linux kernel has an explicit compiler barrier function that prevents the
+compiler from moving the memory accesses either side of it to the other side:
+
+        barrier();
+
+This has no direct effect on the CPU, which may then reorder things however it
+wishes.
+
+
+===============================
+EXPLICIT KERNEL MEMORY BARRIERS
+===============================
+
+The Linux kernel has eight basic CPU memory barriers:
+
+        TYPE            MANDATORY               SMP CONDITIONAL
+        =============== ======================= ===========================
+        GENERAL         mb()                    smp_mb()
+        WRITE           wmb()                   smp_wmb()
+        READ            rmb()                   smp_rmb()
+        DATA DEPENDENCY read_barrier_depends()  smp_read_barrier_depends()
+
+General memory barriers give a guarantee that all memory accesses specified
+before the barrier will appear to happen before all memory accesses specified
+after the barrier with respect to the other components of the system.
+
+Read and write memory barriers give similar guarantees, but only for memory
+reads versus memory reads and memory writes versus memory writes respectively.
+
+Data dependency memory barriers ensure that if two reads are issued such that
+the second depends on the result of the first, then the first read is completed
+and the cache coherence is up to date before the second read is performed. The
+primary case is where the address used in a subsequent read is calculated from
+the result of the first read:
+
+        CPU 1           CPU 2
+        =============== ===============
+        { a == 0, b == 1 and p == &a, q == &a }
+        ...
+        b = 2;
+        smp_wmb();
+        p = &b;         q = p;
+                        smp_read_barrier_depends();
+                        d = *q;
+
+Without the data dependency barrier, any of the following results could be
+seen:
+
+        POSSIBLE RESULT PERMISSIBLE     ORIGIN
+        =============== =============== =======================================
+        q == &a, d == 0 Yes
+        q == &b, d == 2 Yes
+        q == &b, d == 1 No              Cache coherency maintenance delay
+        q == &b, d == 0 No              q read after a
+
+See the "Cache Coherency" section above.
+
+
+All memory barriers imply compiler barriers.
+
+Read memory barriers imply data dependency barriers.
+
+SMP memory barriers are reduced to compiler barriers on uniprocessor compiled
+systems because it is assumed that a CPU will be apparently self-consistent,
+and will order overlapping accesses correctly with respect to itself.
+
+There is no guarantee that any of the memory accesses specified before a memory
+barrier will be complete by the completion of a memory barrier; the barrier can
+be considered to draw a line in that CPU's access queue that accesses of the
+appropriate type may not cross.
+
+There is no guarantee that issuing a memory barrier on one CPU will have any
+direct effect on another CPU or any other hardware in the system. The indirect
+effect will be the order in which the second CPU sees the effects of the first
+CPU's accesses occur.
+
+There is no guarantee that some intervening piece of off-the-CPU hardware[*]
+will not reorder the memory accesses. CPU cache coherency mechanisms should
+propagate the indirect effects of a memory barrier between CPUs, but may not do
+so in order (see above).
+
+  [*] For information on bus mastering DMA and coherency please read:
+
+      Documentation/pci.txt
+      Documentation/DMA-mapping.txt
+      Documentation/DMA-API.txt
+
+Note that these are the _minimum_ guarantees. Different architectures may give
+more substantial guarantees, but they may not be relied upon outside of arch
+specific code.
+
+
+There are some more advanced barrier functions:
+
+ (*) set_mb(var, value)
+ (*) set_wmb(var, value)
+
+     These assign the value to the variable and then insert at least a write
+     barrier after it, depending on the function. They aren't guaranteed to
+     insert anything more than a compiler barrier in a UP compilation.
+
+
+BARRIER PAIRING
+---------------
+
+Certain types of memory barrier should always be paired. A lack of an
+appropriate pairing is almost certainly an error.
+
+An smp_wmb() should always be paired with an smp_read_barrier_depends() or an
+smp_rmb(), though an smp_mb() would also be acceptable. Similarly an smp_rmb()
+or an smp_read_barrier_depends() should always be paired with at least an
+smp_wmb(), though, again, an smp_mb() is acceptable:
+
+        CPU 1           CPU 2           COMMENT
+        =============== =============== =======================
+        a = 1;
+        smp_wmb();                      Could be smp_mb()
+        b = 2;
+                        x = b;
+                        smp_rmb();      Could be smp_mb()
+                        y = a;
+
+The data dependency barrier - smp_read_barrier_depends() - is a very specific
+optimisation. It's a weaker form of the read memory barrier, for use in the
+case where there's a dependency between two reads - since on some architectures
+we can't rely on the CPU to imply a data dependency barrier. The typical case
+is where a read is used to generate an address for a subsequent memory access:
+
+        CPU 1           CPU 2                           COMMENT
+        =============== =============================== =======================
+        a = 1;
+        smp_wmb();                                      Could be smp_mb()
+        b = &a;
+                        x = b;
+                        smp_read_barrier_depends();     Or smp_rmb()/smp_mb()
+                        y = *x;
+
+This is used in the RCU system.
+
+If the two reads are independent, an smp_read_barrier_depends() is not
+sufficient, and an smp_rmb() or better must be employed.
+
+Basically, the read barrier always has to be there, even though it can be of
+the "weaker" type.
+
+Note also that the address really has to have a _data_ dependency, not a
+control dependency. If the address is dependent on the first read, but the
+dependency is through a conditional rather than actually reading the address
+itself, then it's a control dependency, and existing CPUs already deal with
+those through branch prediction.
+
+
+===============================
+IMPLICIT KERNEL MEMORY BARRIERS
+===============================
+
+Some of the other functions in the linux kernel imply memory barriers, amongst
+which are locking and scheduling functions.
+
+This specification is a _minimum_ guarantee; any particular architecture may
+provide more substantial guarantees, but these may not be relied upon outside
+of arch specific code.
+
+
+LOCKING FUNCTIONS
+-----------------
+
+The Linux kernel has a number of locking constructs:
+
+ (*) spin locks
+ (*) R/W spin locks
+ (*) mutexes
+ (*) semaphores
+ (*) R/W semaphores
+
+In all cases there are variants on "LOCK" operations and "UNLOCK" operations
+for each construct. These operations all imply certain barriers:
+
+ (*) LOCK operation implication:
+
+     Memory accesses issued after the LOCK will be completed after the LOCK
+     accesses have completed.
+
+     Memory accesses issued before the LOCK may be completed after the LOCK
+     accesses have completed.
+
+ (*) UNLOCK operation implication:
+
+     Memory accesses issued before the UNLOCK will be completed before the
+     UNLOCK accesses have completed.
+
+     Memory accesses issued after the UNLOCK may be completed before the UNLOCK
+     accesses have completed.
+
+ (*) LOCK vs UNLOCK implication:
+
+     The LOCK accesses will be completed before the UNLOCK accesses.
+
+     Therefore an UNLOCK followed by a LOCK is equivalent to a full barrier,
+     but a LOCK followed by an UNLOCK is not.
+
+ (*) Failed conditional LOCK implication:
+
+     Certain variants of the LOCK operation may fail, either due to being
+     unable to get the lock immediately, or due to receiving an unblocked
+     signal whilst asleep waiting for the lock to become available. Failed
+     locks do not imply any sort of barrier.
+
+Locks and semaphores may not provide any guarantee of ordering on UP compiled
+systems, and so cannot be counted on in such a situation to actually achieve
+anything at all - especially with respect to I/O accesses - unless combined
+with interrupt disabling operations.
+
+See also the section on "Inter-CPU locking barrier effects".
+
+
+As an example, consider the following:
+
+        *A = a;
+        *B = b;
+        LOCK
+        *C = c;
+        *D = d;
+        UNLOCK
+        *E = e;
+        *F = f;
+
+The following sequence of events is acceptable:
+
+        LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK
+
+But none of the following are:
+
+        {*F,*A}, *B, LOCK, *C, *D, UNLOCK, *E
+        *A, *B, *C, LOCK, *D, UNLOCK, *E, *F
+        *A, *B, LOCK, *C, UNLOCK, *D, *E, *F
+        *B, LOCK, *C, *D, UNLOCK, {*F,*A}, *E
+
+
+INTERRUPT DISABLING FUNCTIONS
+-----------------------------
+
+Functions that disable interrupts (LOCK equivalent) and enable interrupts
+(UNLOCK equivalent) will act as compiler barriers only. So if memory or I/O
+barriers are required in such a situation, they must be provided by some
+other means.
+
+
+MISCELLANEOUS FUNCTIONS
+-----------------------
+
+Other functions that imply barriers:
+
+ (*) schedule() and similar imply full memory barriers.
+
+
+=================================
+INTER-CPU LOCKING BARRIER EFFECTS
+=================================
+
+On SMP systems locking primitives give a more substantial form of barrier: one
+that does affect memory access ordering on other CPUs, within the context of
+conflict on any particular lock.
+
+
+LOCKS VS MEMORY ACCESSES
+------------------------
+
+Consider the following: the system has a pair of spinlocks (M) and (Q), and
+three CPUs; then should the following sequence of events occur:
+
+        CPU 1                           CPU 2
+        =============================== ===============================
+        *A = a;                         *E = e;
+        LOCK M                          LOCK Q
+        *B = b;                         *F = f;
+        *C = c;                         *G = g;
+        UNLOCK M                        UNLOCK Q
+        *D = d;                         *H = h;
+
+Then there is no guarantee as to what order CPU #3 will see the accesses to *A
+through *H occur in, other than the constraints imposed by the separate locks
+on the separate CPUs.  It might, for example, see:
+
+        *E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M
+
+But it won't see any of:
+
+        *B, *C or *D preceding LOCK M
+        *A, *B or *C following UNLOCK M
+        *F, *G or *H preceding LOCK Q
+        *E, *F or *G following UNLOCK Q
+
+
+However, if the following occurs:
+
+        CPU 1                           CPU 2
+        =============================== ===============================
+        *A = a;
+        LOCK M [1]
+        *B = b;
+        *C = c;
+        UNLOCK M [1]
+        *D = d;                         *E = e;
+                                        LOCK M [2]
+                                        *F = f;
+                                        *G = g;
+                                        UNLOCK M [2]
+                                        *H = h;
+
+CPU #3 might see:
+
+        *E, LOCK M [1], *C, *B, *A, UNLOCK M [1],
+                LOCK M [2], *H, *F, *G, UNLOCK M [2], *D
+
+But assuming CPU #1 gets the lock first, it won't see any of:
+
+        *B, *C, *D, *F, *G or *H preceding LOCK M [1]
+        *A, *B or *C following UNLOCK M [1]
+        *F, *G or *H preceding LOCK M [2]
+        *A, *B, *C, *E, *F or *G following UNLOCK M [2]
+
+
+LOCKS VS I/O ACCESSES
+---------------------
+
+Under certain circumstances (such as NUMA), I/O accesses within two spinlocked
+sections on two different CPUs may be seen as interleaved by the PCI bridge.
+
+For example:
+
+        CPU 1                           CPU 2
+        =============================== ===============================
+        spin_lock(Q)
+        writel(0, ADDR)
+        writel(1, DATA);
+        spin_unlock(Q);
+                                        spin_lock(Q);
+                                        writel(4, ADDR);
+                                        writel(5, DATA);
+                                        spin_unlock(Q);
+
+may be seen by the PCI bridge as follows:
+
+        WRITE *ADDR = 0, WRITE *ADDR = 4, WRITE *DATA = 1, WRITE *DATA = 5
+
+which would probably break.
+
+What is necessary here is to insert an mmiowb() before dropping the spinlock,
+for example:
+
+        CPU 1                           CPU 2
+        =============================== ===============================
+        spin_lock(Q)
+        writel(0, ADDR)
+        writel(1, DATA);
+        mmiowb();
+        spin_unlock(Q);
+                                        spin_lock(Q);
+                                        writel(4, ADDR);
+                                        writel(5, DATA);
+                                        mmiowb();
+                                        spin_unlock(Q);
+
+this will ensure that the two writes issued on CPU #1 appear at the PCI bridge
+before either of the writes issued on CPU #2.
+
+
+Furthermore, following a write by a read to the same device is okay, because
+the read forces the write to complete before the read is performed:
+
+        CPU 1                           CPU 2
+        =============================== ===============================
+        spin_lock(Q)
+        writel(0, ADDR)
+        a = readl(DATA);
+        spin_unlock(Q);
+                                        spin_lock(Q);
+                                        writel(4, ADDR);
+                                        b = readl(DATA);
+                                        spin_unlock(Q);
+
+
+See Documentation/DocBook/deviceiobook.tmpl for more information.
+
+
+==========================
+KERNEL I/O BARRIER EFFECTS
+==========================
+
+When accessing I/O memory, drivers should use the appropriate accessor
+functions:
+
+ (*) inX(), outX():
+
+     These are intended to talk to I/O space rather than memory space, but
+     that's primarily a CPU-specific concept. The i386 and x86_64 processors do
+     indeed have special I/O space access cycles and instructions, but many
+     CPUs don't have such a concept.
+
+     The PCI bus, amongst others, defines an I/O space concept - which on such
+     CPUs as i386 and x86_64 readily maps to the CPU's concept of I/O
+     space. However, it may also be mapped as a virtual I/O space in the CPU's
+     memory map, particularly on those CPUs that don't support alternate
+     I/O spaces.
+
+     Accesses to this space may be fully synchronous (as on i386), but
+     intermediary bridges (such as the PCI host bridge) may not fully honour
+     that.
+
+     They are guaranteed to be fully ordered with respect to each other.
+
+     They are not guaranteed to be fully ordered with respect to other types of
+     memory and I/O operation.
+
+ (*) readX(), writeX():
+
+     Whether these are guaranteed to be fully ordered and uncombined with
+     respect to each other on the issuing CPU depends on the characteristics
+     defined for the memory window through which they're accessing. On later
+     i386 architecture machines, for example, this is controlled by way of the
+     MTRR registers.
+
+     Ordinarily, these will be guaranteed to be fully ordered and uncombined,
+     provided they're not accessing a prefetchable device.
+
+     However, intermediary hardware (such as a PCI bridge) may indulge in
+     deferral if it so wishes; to flush a write, a read from the same location
+     is preferred[*], but a read from the same device or from configuration
+     space should suffice for PCI.
+
+     [*] NOTE! attempting to read from the same location as was written to may
+         cause a malfunction - consider the 16550 Rx/Tx serial registers for
+         example.
+
+     Used with prefetchable I/O memory, an mmiowb() barrier may be required to
+     force writes to be ordered.
+
+     Please refer to the PCI specification for more information on interactions
+     between PCI transactions.
+
+ (*) readX_relaxed()
+
+     These are similar to readX(), but are not guaranteed to be ordered in any
+     way. Be aware that there is no I/O read barrier available.
+
+ (*) ioreadX(), iowriteX()
+
+     These will perform as appropriate for the type of access they're actually
+     doing, be it inX()/outX() or readX()/writeX().
+
+
+==========
+REFERENCES
+==========
+
+AMD64 Architecture Programmer's Manual Volume 2: System Programming
+	Chapter 7.1: Memory-Access Ordering
+	Chapter 7.4: Buffering and Combining Memory Writes
+
+IA-32 Intel Architecture Software Developer's Manual, Volume 3:
+System Programming Guide
+	Chapter 7.1: Locked Atomic Operations
+	Chapter 7.2: Memory Ordering
+	Chapter 7.4: Serializing Instructions
+
+The SPARC Architecture Manual, Version 9
+	Chapter 8: Memory Models
+	Appendix D: Formal Specification of the Memory Models
+	Appendix J: Programming with the Memory Models
+
+UltraSPARC Programmer Reference Manual
+	Chapter 5: Memory Accesses and Cacheability
+	Chapter 15: Sparc-V9 Memory Models
+
+UltraSPARC III Cu User's Manual
+	Chapter 9: Memory Models
+
+UltraSPARC IIIi Processor User's Manual
+	Chapter 8: Memory Models
+
+UltraSPARC Architecture 2005
+	Chapter 9: Memory
+	Appendix D: Formal Specifications of the Memory Models
+
+UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005
+	Chapter 8: Memory Models
+	Appendix F: Caches and Cache Coherency
+
+Solaris Internals, Core Kernel Architecture, p63-68:
+	Chapter 3.3: Hardware Considerations for Locks and
+			Synchronization
+
+Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching
+for Kernel Programmers:
+	Chapter 13: Other Memory Models
+
+Intel Itanium Architecture Software Developer's Manual: Volume 1:
+	Section 2.6: Speculation
+	Section 4.4: Memory Access

From nickpiggin at yahoo.com.au Thu Mar 16 10:21:53 2006
From: nickpiggin at yahoo.com.au (Nick Piggin)
Date: Thu, 16 Mar 2006 10:21:53 +1100
Subject: [PATCH] Document Linux's memory barriers [try #4]
In-Reply-To: <17625.1142430454@warthog.cambridge.redhat.com>
References: <4417FFC2.8040909@yahoo.com.au>
	<44110E93.8060504@yahoo.com.au>
	<16835.1141936162@warthog.cambridge.redhat.com>
	<14886.1142421018@warthog.cambridge.redhat.com>
	<17625.1142430454@warthog.cambridge.redhat.com>
Message-ID: <4418A191.5010108@yahoo.com.au>

David Howells wrote:

>Nick Piggin wrote:
>
>
>>>Ah, but if the cache is on the CPU side of the dotted line, does that then
>>>mean
that a write memory barrier guarantees the CPU's cache to have >>>updated memory? >>> >>I don't think it has to[*]. It would guarantee the _order_ in which "global >>memory" of this model ie. visibility for other "CPUs" see the writes, >>whether that visibility ultimately be implemented by cache coherency >>protocol or something else, I don't think matters (for a discussion of >>memory ordering). >> > >It does matter, because I have to make it clear that the effect of the memory >barrier usually stops at the cache, and in fact memory barriers may have no >visibility at all on another CPU because it's all done inside a CPU's cache, >until that other CPU tries to observe the results. > > But that's a cache coherency issue that is really orthogonal to the memory consistency one. WHY, when explaining memory consistency, do they need to know that a barrier "usually stops at cache" (except for alpha)? They already _know_ that barriers may have no visibility on any other CPU because you should tell them that barriers only imply an ordering over the horizon, nothing more (ie. they need not imply a "push"). >>If anything it confused the matter for the case of Alpha. >> > >Nah... Alpha is self-confusing:-) > > Well maybe ;) But for better or worse, it is what kernel programmers now have to deal with. >>All the programmer needs to know is that there is some horizon (memory) >>beyond which stores are visible to other CPUs, and stores can travel there >>at different speeds so later ones can overtake earlier ones. And likewise >>loads can come from memory to the CPU at different speeds too, so later >>loads can contain earlier results. >> > >They also need to know that memory barriers don't imply an ordering on the >cache. > > Why? I'm contending that this is exactly what they don't need to know. >>[*] Nor would your model require a smp_wmb() to update CPU caches either, I >>think: it wouldn't have to flush the store buffer, just order it. >> > >Exactly. > >But in your diagram, given that it doesn't show the cache, you don't know that >the memory barrier doesn't extend through the cache and all the way to memory. > > What do you mean "extend"? I don't think that is good terminology. What it does is provide an ordering of traffic going over the vertical line dividing CPU and memory. It does not matter whether "memory" is actually "cache + coherency" or not, just that the vertical line is the horizon between "visible to other CPUs" and "not". Nick -- Send instant messages to your online friends http://au.messenger.yahoo.com From david at gibson.dropbear.id.au Thu Mar 16 11:58:35 2006 From: david at gibson.dropbear.id.au (David Gibson) Date: Thu, 16 Mar 2006 11:58:35 +1100 Subject: dtc git tree has moved (and maintainership announcement) Message-ID: <20060316005835.GB1512@localhost.localdomain> For the next six months or so, I will be available intermittently at best for dtc work. For this reason, Jon Loeliger of Freescale has agreed to take over dtc maintainership for the interim. As part of the logistics of this handover, the location of the dtc git tree has changed. It is now available *only* via git daemon (not http or rsync, sorry), the address is: git://ozlabs.org/srv/projects/dtc/dtc.git There is also now a web page for dtc, although at present there's basically no information there, at: http://dtc.ozlabs.org/ -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! 
http://www.ozlabs.org/~dgibson

From michael at ellerman.id.au Thu Mar 16 14:47:20 2006
From: michael at ellerman.id.au (Michael Ellerman)
Date: Thu, 16 Mar 2006 14:47:20 +1100
Subject: [PATCH] powerpc: Fix bug in bug fix for bug in lmb_alloc()
Message-ID: <20060316034749.F18B9679FD@ozlabs.org>

My patch (d7a5b2ffa1352f0310630934a56aecbdfb617b72) to always panic if
lmb_alloc() fails is broken because it checks alloc < 0, but should be
checking alloc == 0.

This should go into 2.6.16 :/

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
---
 arch/powerpc/mm/lmb.c |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

Index: to-merge/arch/powerpc/mm/lmb.c
===================================================================
--- to-merge.orig/arch/powerpc/mm/lmb.c
+++ to-merge/arch/powerpc/mm/lmb.c
@@ -232,7 +232,7 @@ unsigned long __init lmb_alloc_base(unsi
 
 	alloc = __lmb_alloc_base(size, align, max_addr);
 
-	if (alloc < 0)
+	if (alloc == 0)
 		panic("ERROR: Failed to allocate 0x%lx bytes below 0x%lx.\n",
 				size, max_addr);

From ananth at in.ibm.com Thu Mar 16 21:01:24 2006
From: ananth at in.ibm.com (Ananth N Mavinakayanahalli)
Date: Thu, 16 Mar 2006 15:31:24 +0530
Subject: [PATCH] kprobes: fix single-stepping when probing a trap variant
Message-ID: <20060316100124.GA5497@in.ibm.com>

Hi,

We currently single-step inline if the instruction on which a kprobe is
set is a trap variant.

- variants (such as tdnei, used by BUG()) typically evaluate a condition
  and cause a trap only if the condition is satisfied

- kprobes uses the unconditional trap instruction (0x7fe00008), and
  single-stepping again on this instruction, resulting in another trap
  without the condition being evaluated, is obviously incorrect

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
---
 arch/powerpc/kernel/kprobes.c |   12 +++++++-----
 1 files changed, 7 insertions(+), 5 deletions(-)

Index: linux-2.6.16-rc6/arch/powerpc/kernel/kprobes.c
===================================================================
--- linux-2.6.16-rc6.orig/arch/powerpc/kernel/kprobes.c
+++ linux-2.6.16-rc6/arch/powerpc/kernel/kprobes.c
@@ -92,11 +92,13 @@
 
 	regs->msr |= MSR_SE;
 
-	/* single step inline if it is a trap variant */
-	if (is_trap(insn))
-		regs->nip = (unsigned long)p->addr;
-	else
-		regs->nip = (unsigned long)p->ainsn.insn;
+	/*
+	 * On powerpc we should single step on the original
+	 * instruction even if the probed insn is a trap
+	 * variant as values in regs could play a part in
+	 * if the trap is taken or not
+	 */
+	regs->nip = (unsigned long)p->ainsn.insn;
 }
 
 static inline void save_previous_kprobe(struct kprobe_ctlblk *kcb)

From paulmck at us.ibm.com Fri Mar 17 10:17:23 2006
From: paulmck at us.ibm.com (Paul E. McKenney)
Date: Thu, 16 Mar 2006 15:17:23 -0800
Subject: [PATCH] Document Linux's memory barriers [try #5]
In-Reply-To: <18351.1142432599@warthog.cambridge.redhat.com>
References: <16835.1141936162@warthog.cambridge.redhat.com>
	<18351.1142432599@warthog.cambridge.redhat.com>
Message-ID: <20060316231723.GB1323@us.ibm.com>

On Wed, Mar 15, 2006 at 02:23:19PM +0000, David Howells wrote:
> 
> The attached patch documents the Linux kernel's memory barriers.
> 
> I've updated it from the comments I've been given.
> 
> The per-arch notes sections are gone because it's clear that there are so many
> exceptions, that it's not worth having them.
> 
> I've added a list of references to other documents.
> > I've tried to get rid of the concept of memory accesses appearing on the bus; > what matters is apparent behaviour with respect to other observers in the > system. > > Interrupts barrier effects are now considered to be non-existent. They may be > there, but you may not rely on them. > > I've added a couple of definition sections at the top of the document: one to > specify the minimum execution model that may be assumed, the other to specify > what this document refers to by the term "memory". > > I've made greater mention of the use of mmiowb(). > > I've adjusted the way in which caches are described, and described the fun > that can be had with cache coherence maintenance being unordered and data > dependency not being necessarily implicit. > > I've described (smp_)read_barrier_depends(). Good stuff!!! Please see comments interspersed, search for empty lines. One particularly serious issue involves your smp_read_barrier_depends() example. > Signed-Off-By: David Howells > --- > warthog>diffstat -p1 /tmp/mb.diff > Documentation/memory-barriers.txt | 1039 ++++++++++++++++++++++++++++++++++++++ > 1 files changed, 1039 insertions(+) > > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt > new file mode 100644 > index 0000000..fd7a6f1 > --- /dev/null > +++ b/Documentation/memory-barriers.txt > @@ -0,0 +1,1039 @@ > + ============================ > + LINUX KERNEL MEMORY BARRIERS > + ============================ > + > +Contents: > + > + (*) Assumed minimum execution ordering model. > + > + (*) What is considered memory? Suggest the following -- most people need to rethink what memory means: (*) What is memory? Or maybe "What is 'memory'"? > + > + - Cached interactions. > + - Cache coherency. > + - Uncached interactions. > + > + (*) What are memory barriers? > + > + (*) Where are memory barriers needed? > + > + - Accessing devices. > + - Multiprocessor interaction. > + - Interrupts. > + > + (*) Explicit kernel compiler barriers. > + > + (*) Explicit kernel memory barriers. > + > + - Barrier pairing. > + > + (*) Implicit kernel memory barriers. > + > + - Locking functions. > + - Interrupt disabling functions. > + - Miscellaneous functions. > + > + (*) Inter-CPU locking barrier effects. > + > + - Locks vs memory accesses. > + - Locks vs I/O accesses. > + > + (*) Kernel I/O barrier effects. > + > + (*) References. > + > + > +======================================== > +ASSUMED MINIMUM EXECUTION ORDERING MODEL > +======================================== > + > +It has to be assumed that the conceptual CPU is weakly-ordered in all respects > +but that it will maintain the appearance of program causality with respect to > +itself. Some CPUs (such as i386 or x86_64) are more constrained than others > +(such as powerpc or frv), and so the most relaxed case must be assumed outside Might as well call it out: +(such as powerpc or frv), and so the most relaxed case (namely DEC Alpha) +must be assumed outside > +of arch-specific code. Also, I have some verbiage and diagrams of Alpha's operation at http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2006.03.13a.pdf Feel free to take any that helps. (Source for paper is Latex and xfig, for whatever that is worth.)
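To make the "most relaxed case" concrete, it might be worth showing the one pattern that only Alpha is known to break. The following is a hypothetical kernel-style sketch (the structure, the global pointer and the function names are invented here, not taken from your patch); it is essentially your b/p example further down, rendered as compilable code:

	struct foo {
		int a;
	};

	struct foo *global_p;			/* assumed shared between CPUs */

	/* CPU 1: initialise the structure, then publish the pointer */
	void publish(struct foo *p)
	{
		p->a = 42;
		smp_wmb();			/* order the init before the publish */
		global_p = p;
	}

	/* CPU 2: read the pointer, then dereference it
	 * (assumes publish() has already run) */
	int consume(void)
	{
		struct foo *q = global_p;
		smp_read_barrier_depends();	/* a no-op everywhere except Alpha */
		return q->a;			/* guaranteed to see 42 */
	}

Without the smp_read_barrier_depends(), consume() really can return the pre-initialisation contents of q->a on Alpha, as discussed further down.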
> +This means that it must be considered that the CPU will execute its instruction > +stream in any order it feels like - or even in parallel - provided that if an > +instruction in the stream depends on an earlier instruction, then that > +earlier instruction must be sufficiently complete[*] before the later > +instruction may proceed. Suggest replacing the last two lines with: +instruction may proceed, in other words, provided that the appearance of +causality is maintained. > + Suggest just folding the following footnotes into the above paragraph: > + [*] Some instructions have more than one effect[**] and different instructions > + may depend on different effects. > + > + [**] Eg: changes to condition codes and registers; memory events. > + > +A CPU may also discard any instruction sequence that ultimately winds up having > +no effect. For example if two adjacent instructions both load an immediate > +value into the same register, the first may be discarded. > + > + > +Similarly, it has to be assumed that the compiler might reorder the instruction > +stream in any way it sees fit, again provided the appearance of causality is > +maintained. > + > + > +========================== > +WHAT IS CONSIDERED MEMORY? > +========================== +=============== +WHAT IS MEMORY? +=============== > +For the purpose of this specification what's meant by "memory" needs to be > +defined, and the division between CPU and memory needs to be marked out. > + > + > +CACHED INTERACTIONS > +------------------- > + > +As far as cached CPU vs CPU[*] interactions go, "memory" has to include the CPU > +caches in the system. Although any particular read or write may not actually > +appear outside of the CPU that issued it (the CPU may have been able to > +satisfy it from its own cache), it's still as if the memory access had taken > +place as far as the other CPUs are concerned since the cache coherency and > +ejection mechanisms will propagate the effects upon conflict. > + > + [*] Also applies to CPU vs device when accessed through a cache. > + > +The system can be considered logically as: Suggest showing the (typical) possibility of MMIO bypassing the CPU cache, as happens in many systems. See below for a hacky suggested change. > + <--- CPU ---> : <----------- Memory -----------> > + : > + +--------+ +--------+ : +--------+ +-----------+ > + | | | | : | | | | +--------+ > + | CPU | | Memory | : | CPU | | | | | > + | Core |--->| Access |----->| Cache |<-->| | | | > + | | | Queue | : | | | |--->| Memory | > + | | | | : | | | | | | > + +--------+ +--------+ : +--------+ | | | | > + V : | Cache | +--------+ > + | : | Coherency | > + | : | Mechanism | +--------+ > + +--------+ +--------+ : +--------+ | | | | > + | | | | : | | | | | | > + | CPU | | Memory | : | CPU | | |--->| Device | > + | Core |--->| Access |----->| Cache |<-->| | | | > + | | | Queue | : | | | | | | > + | | | | : | | | | +----+---+ > + +--------+ +--------+ : +--------+ +-----------+ ^ > + V : | > + | : | > + +----------------------------------------------+ > + : > + : In the past, there have been systems in which the devices did -not- see coherent memory, where (for example) CPU caches had to be manually flushed before a DMA operation was allowed to proceed.
The core may place these in the queue in any order it wishes, > +and continue execution until it is forced to wait for an instruction to > +complete. > + > +What memory barriers are concerned with is controlling the order in which > +accesses cross from the CPU side of things to the memory side of things, and > +the order in which the effects are perceived to happen by the other observers > +in the system. Suggest emphasizing this point: +Memory barriers are -not- needed within a given CPU, as CPUs always see +their own loads and stores as if they had happened in program order. Might also want to define program order, execution order, and perceived order, or similar terms. > +CACHE COHERENCY > +--------------- > + > +Life isn't quite as simple as it may appear above, however: for while the > +caches are expected to be coherent, there's no guarantee that that coherency > +will be ordered. This means that whilst changes made on one CPU will > +eventually become visible on all CPUs, there's no guarantee that they will > +become apparent in the same order on those other CPUs. Nit, perhaps not worth calling out -- a given change might well be overwritten by a second change, so that the first change is never seen by other CPUs. > +Consider dealing with a pair of CPUs (1 & 2), each of which has a pair of > +parallel data caches (A,B and C,D) with the following properties: > + > + (*) a cacheline may be in one of the caches; > + > + (*) whilst the CPU core is interrogating one cache, the other cache may be > + making use of the bus to access the rest of the system - perhaps to > + displace a dirty cacheline or to do a speculative read; > + > + (*) each cache has a queue of operations that need to be applied to that cache > + to maintain coherency with the rest of the system; > + > + (*) the coherency queue is not flushed by normal reads to lines already > + present in the cache, even though the contents of the queue may > + potentially affect those reads. Suggest the following: + (*) the coherency queue is not flushed by normal reads to lines already + present in the cache, however, the reads will access the data in + the coherency queue in preference to the (obsoleted) data in the cache. + This is necessary to give the CPU the illusion that its operations + are being performed in order. > +Imagine, then, two writes made on the first CPU, with a barrier between them to > +guarantee that they will reach that CPU's caches in the requisite order: > + > + CPU 1 CPU 2 COMMENT > + =============== =============== ======================================= > + a == 0, b == 1 and p == &a, q == &a > + b = 2; > + smp_wmb(); Make sure b is changed before p Suggest the more accurate: + smp_wmb(); Make sure change to b visible before p The CPU is within its rights to execute these out of order as long as all the other CPUs perceive the change to b as preceding the change to p. And I believe that many CPUs in fact do just this sort of sneaky reordering, e.g., via speculative execution. > + The cacheline of b is now ours > + p = &b; > + The cacheline of p is now ours > + > +The write memory barrier forces the local CPU caches to be updated in the > +correct order. It actually forces the other CPUs to see the updates in the correct order; the updates to the local CPU caches might well be misordered. Also, shouldn't one or the other of the "A:"s above be "B:"? I am assuming that they are supposed to refer to the "parallel data caches (A,B and C,D)" mentioned earlier.
If they don't, I have no idea what they are supposed to mean. > + But now imagine that the second CPU wants to read those > +values: > + > + CPU 1 CPU 2 COMMENT > + =============== =============== ======================================= > + ... > + q = p; > + d = *q; > + > +The above pair of reads may then fail to apparently happen in expected order, > +as the cacheline holding p may get updated in one of the second CPU's caches > +whilst the update to the cacheline holding b is delayed in the other of the > +second CPU's caches by some other cache event: > + > + CPU 1 CPU 2 COMMENT > + =============== =============== ======================================= > + a == 0, b == 1 and p == &a, q == &a > + b = 2; > + smp_wmb(); Make sure b is changed before p > + > + > + p = &b; q = p; > + > + > + > + d = *q; > + Reads from b before b updated in cache > + > + I don't understand the "A:", "B:", etc. in the above example. Is variable "b" in cacheline A or in cacheline C? Or do A and C mean something other than "cacheline" as indicated earlier. > +Basically, whilst both cachelines will be updated on CPU 2 eventually, there's > +no guarantee that, without intervention, the order of update will be the same > +as that committed on CPU 1. > + > + > +To intervene, we need to emplace a data dependency barrier or a read barrier. > +This will force the cache to commit its coherency queue before processing any > +further requests: > + > + CPU 1 CPU 2 COMMENT > + =============== =============== ======================================= > + a == 0, b == 1 and p == &a, q == &a > + b = 2; > + smp_wmb(); Make sure b is changed before p > + > + > + p = &b; q = p; > + > + > + > + smp_read_barrier_depends() > + > + > + d = *q; > + Reads from b after b updated in cache Again, I don't understand the "A:", "B:", etc. in the above example. Is variable "b" in cacheline A or in cacheline C? Or do A and C mean something other than "cacheline" as indicated earlier. > +This sort of problem can be encountered on Alpha processors as they have a > +split cache that improves performance by making better use of the data bus. > +Whilst most CPUs do imply a data dependency barrier on the read when a memory > +access depends on a read, not all do, so it may not be relied on. > + > + > +UNCACHED INTERACTIONS > +--------------------- > + > +Note that the above model does not show uncached memory or I/O accesses. These > +proceed directly from the queue to the memory or the devices, bypassing any > +caches or cache coherency: > + > + <--- CPU ---> : > + : +-----+ There are some gratuitous tabs in the above line. > + +--------+ +--------+ : | | > + | | | | : | | +---------+ > + | CPU | | Memory | : | | | | > + | Core |--->| Access |--------------->| | | | > + | | | Queue | : | |------------->| Memory | > + | | | | : | | | | > + +--------+ +--------+ : | | | | > + : | | +---------+ > + : | Bus | > + : | | +---------+ > + +--------+ +--------+ : | | | | > + | | | | : | | | | > + | CPU | | Memory | : | |<------------>| Device | > + | Core |--->| Access |--------------->| | | | > + | | | Queue | : | | | | > + | | | | : | | +---------+ > + +--------+ +--------+ : | | > + : +-----+ > + : > + > + > +========================= > +WHAT ARE MEMORY BARRIERS? > +========================= > + > +Memory barriers are instructions to both the compiler and the CPU to impose an > +apparent partial ordering between the memory access operations specified either Suggest s/apparent/perceived/
They request that the sequence of memory events generated > +appears to other components of the system as if the barrier is effective on > +that CPU. > + > +Note that: > + > + (*) there's no guarantee that the sequence of memory events is _actually_ so > + ordered. It's possible for the CPU to do out-of-order accesses _as long > + as no-one is looking_, and then fix up the memory if someone else tries to > + see what's going on (for instance a bus master device); what matters is > + the _apparent_ order as far as other processors and devices are concerned; > + and > + > + (*) memory barriers are only guaranteed to act within the CPU processing them, > + and are not, for the most part, guaranteed to percolate down to other CPUs > + in the system or to any I/O hardware that that CPU may communicate with. Suggest s/not, for the most part,/_not_ necessarily/ > +For example, a programmer might take it for granted that the CPU will perform > +memory accesses in exactly the order specified, so that if a CPU is, for > +example, given the following piece of code: > + > + a = *A; > + *B = b; > + c = *C; > + d = *D; > + *E = e; > + > +They would then expect that the CPU will complete the memory access for each > +instruction before moving on to the next one, leading to a definite sequence of > +operations as seen by external observers in the system: > + > + read *A, write *B, read *C, read *D, write *E. > + > + > +Reality is, of course, much messier. With many CPUs and compilers, this isn't > +always true because: > + > + (*) reads are more likely to need to be completed immediately to permit > + execution progress, whereas writes can often be deferred without a > + problem; > + > + (*) reads can be done speculatively, and then the result discarded should it > + prove not to be required; > + > + (*) the order of the memory accesses may be rearranged to promote better use > + of the CPU buses and caches; > + > + (*) reads and writes may be combined to improve performance when talking to > + the memory or I/O hardware that can do batched accesses of adjacent > + locations, thus cutting down on transaction setup costs (memory and PCI > + devices may be able to do this); and > + > + (*) the CPU's data cache may affect the ordering, and whilst cache-coherency > + mechanisms may alleviate this - once the write has actually hit the cache > + - there's no guarantee that the coherency management will be propegated in > + order to other CPUs. > + > +So what another CPU, say, might actually observe from the above piece of code > +is: > + > + read *A, read {*C,*D}, write *E, write *B > + > + (By "read {*C,*D}" I mean a combined single read). > + > + > +It is also guaranteed that a CPU will be self-consistent: it will see its _own_ > +accesses appear to be correctly ordered, without the need for a memory barrier. > +For instance with the following code: > + > + X = *A; > + *A = Y; > + Z = *A; > + > +assuming no intervention by an external influence, it can be taken that: > + > + (*) X will hold the old value of *A, and will never happen after the write and > + thus end up being given the value that was assigned to *A from Y instead; > + and > + > + (*) Z will always be given the value in *A that was assigned there from Y, and > + will never happen before the write, and thus end up with the same value > + that was in *A initially. > + > +(This is ignoring the fact that the value initially in *A may appear to be the > +same as the value assigned to *A from Y). 
> + > + > +================================= > +WHERE ARE MEMORY BARRIERS NEEDED? > +================================= > + > +Under normal operation, access reordering is probably not going to be a problem > +as a linear program will still appear to operate correctly. There are, > +however, three circumstances where reordering definitely _could_ be a problem: Suggest listing the three circumstances (accessing devices, multiprocessor interaction, interrupts) before the individual sections. Don't leave us in suspense! ;-) > +ACCESSING DEVICES > +----------------- > + > +Many devices can be memory mapped, and so appear to the CPU as if they're just > +memory locations. However, to control the device, the driver has to make the > +right accesses in exactly the right order. > + > +Consider, for example, an ethernet chipset such as the AMD PCnet32. It > +presents to the CPU an "address register" and a bunch of "data registers". The > +way it's accessed is to write the index of the internal register to be accessed > +to the address register, and then read or write the appropriate data register > +to access the chip's internal register, which could - theoretically - be done > +by: > + > + *ADR = ctl_reg_3; > + reg = *DATA; > + > +The problem with a clever CPU or a clever compiler is that the write to the > +address register isn't guaranteed to happen before the access to the data > +register, if the CPU or the compiler thinks it is more efficient to defer the > +address write: > + > + read *DATA, write *ADR > + > +then things will break. > + > + > +In the Linux kernel, however, I/O should be done through the appropriate > +accessor routines - such as inb() or writel() - which know how to make such > +accesses appropriately sequential. > + > +On some systems, I/O writes are not strongly ordered across all CPUs, and so > +locking should be used, and mmiowb() must be issued prior to unlocking the > +critical section. > + > +See Documentation/DocBook/deviceiobook.tmpl for more information. > + > + > +MULTIPROCESSOR INTERACTION > +-------------------------- > + > +When there's a system with more than one processor, the CPUs in the system may > +be working on the same set of data at the same time. This can cause > +synchronisation problems, and the usual way of dealing with them is to use > +locks - but locks are quite expensive, and so it may be preferable to operate > +without the use of a lock if at all possible. In such a case accesses that > +affect both CPUs may have to be carefully ordered to prevent error. > + > +Consider the R/W semaphore slow path. In that, a waiting process is queued on > +the semaphore, as noted by it having a record on its stack linked to the > +semaphore's list: > + > + struct rw_semaphore { > + ... > + struct list_head waiters; > + }; > + > + struct rwsem_waiter { > + struct list_head list; > + struct task_struct *task; > + }; > + > +To wake up the waiter, the up_read() or up_write() functions have to read the > +pointer from this record to know where the next waiter record is, clear > +the task pointer, call wake_up_process() on the task, and release the reference > +held on the waiter's task struct: > + > + READ waiter->list.next; > + READ waiter->task; > + WRITE waiter->task; > + CALL wakeup > + RELEASE task > + > +If any of these steps occur out of order, then the whole thing may fail. > + > +Note that the waiter does not get the semaphore lock again - it just waits for
Since the record is on its stack, this means > +that if the task pointer is cleared _before_ the next pointer in the list is > +read, another CPU might start processing the waiter and it might clobber its > +stack before up*() functions have a chance to read the next pointer. > + > + CPU 1 CPU 2 > + =============================== =============================== > + down_xxx() > + Queue waiter > + Sleep > + up_yyy() > + READ waiter->task; > + WRITE waiter->task; > + > + Resume processing > + down_xxx() returns > + call foo() > + foo() clobbers *waiter > + > + READ waiter->list.next; > + --- OOPS --- > + > +This could be dealt with using a spinlock, but then the down_xxx() function has > +to get the spinlock again after it's been woken up, which is a waste of > +resources. > + > +The way to deal with this is to insert an SMP memory barrier: > + > + READ waiter->list.next; > + READ waiter->task; > + smp_mb(); > + WRITE waiter->task; > + CALL wakeup > + RELEASE task > + > +In this case, the barrier makes a guarantee that all memory accesses before the > +barrier will appear to happen before all the memory accesses after the barrier > +with respect to the other CPUs on the system. It does _not_ guarantee that all > +the memory accesses before the barrier will be complete by the time the barrier > +itself is complete. > + > +SMP memory barriers are normally nothing more than compiler barriers on a Suggest s/barriers are/barriers (e.g., smp_mb(), as opposed to mb()) are/ > +kernel compiled for a UP system because the CPU orders overlapping accesses > +with respect to itself, and so CPU barriers aren't needed. > + > + > +INTERRUPTS > +---------- > + > +A driver may be interrupted by its own interrupt service routine, and thus they > +may interfere with each other's attempts to control or access the device. > + > +This may be alleviated - at least in part - by disabling interrupts (a form of > +locking), such that the critical operations are all contained within the > +interrupt-disabled section in the driver. Whilst the driver's interrupt > +routine is executing, the driver's core may not run on the same CPU, and its > +interrupt is not permitted to happen again until the current interrupt has been > +handled, thus the interrupt handler does not need to lock against that. > + > +However, consider a driver was talking to an ethernet card that sports an > +address register and a data register. If that driver's core is talks to the > +card under interrupt-disablement and then the driver's interrupt handler is > +invoked: > + > + LOCAL IRQ DISABLE > + writew(ADDR, ctl_reg_3); > + writew(DATA, y); > + LOCAL IRQ ENABLE > + > + writew(ADDR, ctl_reg_4); > + q = readw(DATA); > + > + > +If ordering rules are sufficiently relaxed, the write to the data register > +might happen after the second write to the address register. > + > +If ordering rules are relaxed, it must be assumed that accesses done inside an > +interrupt disabled section may leak outside of it and may interleave with > +accesses performed in an interrupt and vice versa unless implicit or explicit > +barriers are used. > + > +Normally this won't be a problem because the I/O accesses done inside such > +sections will include synchronous read operations on strictly ordered I/O > +registers that form implicit I/O barriers. If this isn't sufficient then an > +mmiowb() may need to be used explicitly. > + > + > +A similar situation may occur between an interrupt routine and two routines > +running on separate CPUs that communicate with each other. 
If such a case is > +likely, then interrupt-disabling locks should be used to guarantee ordering. > + > + > +================================= > +EXPLICIT KERNEL COMPILER BARRIERS > +================================= > + > +The Linux kernel has an explicit compiler barrier function that prevents the > +compiler from moving the memory accesses either side of it to the other side: > + > + barrier(); > + > +This has no direct effect on the CPU, which may then reorder things however it > +wishes. > + > + > +=============================== > +EXPLICIT KERNEL MEMORY BARRIERS > +=============================== > + > +The Linux kernel has eight basic CPU memory barriers: > + > + TYPE MANDATORY SMP CONDITIONAL > + =============== ======================= =========================== > + GENERAL mb() smp_mb() > + WRITE wmb() smp_wmb() > + READ rmb() smp_rmb() > + DATA DEPENDENCY read_barrier_depends() smp_read_barrier_depends() > + > +General memory barriers give a guarantee that all memory accesses specified > +before the barrier will appear to happen before all memory accesses specified > +after the barrier with respect to the other components of the system. > + > +Read and write memory barriers give similar guarantees, but only for memory > +reads versus memory reads and memory writes versus memory writes respectively. > + > +Data dependency memory barriers ensure that if two reads are issued such that > +the second depends on the result of the first, then the first read is completed > +and the cache coherence is up to date before the second read is performed. The > +primary case is where the address used in a subsequent read is calculated from > +the result of the first read: > + > + CPU 1 CPU 2 > + =============== =============== > + { a == 0, b == 1 and p == &a, q == &a } > + ... > + b = 2; > + smp_wmb(); > + p = &b; q = p; > + smp_read_barrier_depends(); > + d = *q; > + > +Without the data dependency barrier, any of the following results could be > +seen: > + > + POSSIBLE RESULT PERMISSIBLE ORIGIN > + =============== =============== ======================================= > + q == &a, d == 0 Yes > + q == &b, d == 2 Yes > + q == &b, d == 1 No Cache coherency maintenance delay Either s/No/Yes/ in the preceding line, or s/Without/With/ earlier. I believe the former is better. In absence of the smp_read_barrier_depends(), you really -can- see the (q == &b, d == 1) case on DEC Alpha!!! So the preceding line should instead be: + q == &b, d == 1 Yes This extremely counterintuitive situation arises most easily on machines with split caches, so that (for example) one cache bank processes even-numbered cache lines and the other bank processes odd-numbered cache lines. The pointer p might be stored in an odd-numbered cache line, and the variable b might be stored in an even-numbered cache line. Then, if the even-numbered bank of the reading CPU's cache is extremely busy while the odd-numbered bank is idle, one can see the new value of the pointer (&b), but the old value of the variable (1). I must confess to having strenuously resisted this scenario when first introduced to it, and Wayne Cardoza should be commended on his patience in leading me through it... Please see http://www.openvms.compaq.com/wizard/wiz_2637.html if you think that I am making all this up! This DEC-Alpha example is extremely important, since it defines Linux's memory-barrier semantics. I have some verbiage and diagrams in the aforementioned PDF, feel free to use them. 
> + q == &b, d == 0 No q read after a > + > +See the "Cache Coherency" section above. > + > + > +All memory barriers imply compiler barriers. > + > +Read memory barriers imply data dependency barriers. > + > +SMP memory barriers are reduced to compiler barriers on uniprocessor compiled > +systems because it is assumed that a CPU will be apparently self-consistent, > +and will order overlapping accesses correctly with respect to itself. Suggest adding: However, mandatory memory barriers still have effect in uniprocessor kernel builds due to the need to correctly order reads and writes to device registers. to this paragraph. > +There is no guarantee that any of the memory accesses specified before a memory > +barrier will be complete by the completion of a memory barrier; the barrier can > +be considered to draw a line in that CPU's access queue that accesses of the > +appropriate type may not cross. > + > +There is no guarantee that issuing a memory barrier on one CPU will have any > +direct effect on another CPU or any other hardware in the system. The indirect > +effect will be the order in which the second CPU sees the effects of the first > +CPU's accesses occur. And the second CPU may need to execute memory barriers of its own to reliably see the ordering effects. Might be good to explicitly note this. > +There is no guarantee that some intervening piece of off-the-CPU hardware[*] > +will not reorder the memory accesses. CPU cache coherency mechanisms should > +propegate the indirect effects of a memory barrier between CPUs, but may not do Suggest s/propegate/propagate/ > +so in order (see above). > + > + [*] For information on bus mastering DMA and coherency please read: > + > + Documentation/pci.txt > + Documentation/DMA-mapping.txt > + Documentation/DMA-API.txt > + > +Note that these are the _minimum_ guarantees. Different architectures may give > +more substantial guarantees, but they may not be relied upon outside of arch > +specific code. > + > + > +There are some more advanced barrier functions: > + > + (*) set_mb(var, value) > + (*) set_wmb(var, value) > + > + These assign the value to the variable and then insert at least a write > + barrier after it, depending on the function. They aren't guaranteed to > + insert anything more than a compiler barrier in a UP compilation. smp_mb__before_atomic_dec() and friends as well? > +BARRIER PAIRING > +--------------- Great section, good how-to information!!! > +Certain types of memory barrier should always be paired. A lack of an > +appropriate pairing is almost certainly an error. > + > +An smp_wmb() should always be paired with an smp_read_barrier_depends() or an > +smp_rmb(), though an smp_mb() would also be acceptable. Similarly an smp_rmb() > +or an smp_read_barrier_depends() should always be paired with at least an > +smp_wmb(), though, again, an smp_mb() is acceptable: > + > + CPU 1 CPU 2 COMMENT > + =============== =============== ======================= > + a = 1; > + smp_wmb(); Could be smp_mb() > + b = 2; > + x = a; > + smp_rmb(); Could be smp_mb() > + y = b; Need to note that (for example) set_mb(), smp_mb__before_atomic_dec(), and friends can stand in for smp_mb(). Similarly with related primitives. > +The data dependency barrier - smp_read_barrier_depends() - is a very specific > +optimisation. It's a weaker form of the read memory barrier, for use in the > +case where there's a dependency between two reads - since on some architectures > +we can't rely on the CPU to imply a data dependency barrier. 
The typical case > +is where a read is used to generate an address for a subsequent memory access: > + > + CPU 1 CPU 2 COMMENT > + =============== =============================== ======================= > + a = 1; > + smp_wmb(); Could be smp_mb() > + b = &a; > + x = b; > + smp_read_barrier_depends(); Or smp_rmb()/smp_mb() > + y = *x; > + > +This is used in the RCU system. Strictly speaking, this is not dependent on RCU, but probably close enough for now... Might be worth noting that array accesses are data-dependent on the corresponding index. Some architectures have a more expansive definition of data dependency, including then- and else-clauses being data-dependent on the if-condition, but this is probably too much detail. > +If the two reads are independent, an smp_read_barrier_depends() is not > +sufficient, and an smp_rmb() or better must be employed. > + > +Basically, the read barrier always has to be there, even though it can be of > +the "weaker" type. > + > +Note also that the address really has to have a _data_ dependency, not a > +control dependency. If the address is dependent on the first read, but the > +dependency is through a conditional rather than actually reading the address > +itself, then it's a control dependency, and existing CPUs already deal with > +those through branch prediction. > + > + > +=============================== > +IMPLICIT KERNEL MEMORY BARRIERS > +=============================== > + > +Some of the other functions in the linux kernel imply memory barriers, amongst > +which are locking and scheduling functions. > + > +This specification is a _minimum_ guarantee; any particular architecture may > +provide more substantial guarantees, but these may not be relied upon outside > +of arch specific code. > + > + > +LOCKING FUNCTIONS > +----------------- > + > +The Linux kernel has a number of locking constructs: > + > + (*) spin locks > + (*) R/W spin locks > + (*) mutexes > + (*) semaphores > + (*) R/W semaphores > + > +In all cases there are variants on "LOCK" operations and "UNLOCK" operations > +for each construct. These operations all imply certain barriers: > + > + (*) LOCK operation implication: > + > + Memory accesses issued after the LOCK will be completed after the LOCK > + accesses have completed. Suggest s/accesses have/operation has/ here and below. > + Memory accesses issued before the LOCK may be completed after the LOCK > + accesses have completed. Suggest making the consequence very clear, perhaps adding ", so that code preceding a lock acquisition can "bleed" into the critical section". > + (*) UNLOCK operation implication: > + > + Memory accesses issued before the UNLOCK will be completed before the > + UNLOCK accesses have completed. > + > + Memory accesses issued after the UNLOCK may be completed before the UNLOCK > + accesses have completed. Again, suggest making the consequence very clear, perhaps adding ", so that code following a lock release can "bleed into" the critical section". > + (*) LOCK vs UNLOCK implication: > + > + The LOCK accesses will be completed before the UNLOCK accesses. > + > + Therefore an UNLOCK followed by a LOCK is equivalent to a full barrier, > + but a LOCK followed by an UNLOCK is not. > + > + (*) Failed conditional LOCK implication: > + > + Certain variants of the LOCK operation may fail, either due to being > + unable to get the lock immediately, or due to receiving an unblocked > + signal whilst asleep waiting for the lock to become available. Failed > + locks do not imply any sort of barrier.
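Perhaps a small sketch of the failed-lock rule would make it concrete; the lock and function names below are invented for illustration:

	static DEFINE_SPINLOCK(my_lock);
	static int shared_count;

	void try_update(void)
	{
		if (spin_trylock(&my_lock)) {
			/* the successful LOCK implies a barrier: this
			 * access cannot appear before the lock is taken */
			shared_count++;
			spin_unlock(&my_lock);
		}
		/* if spin_trylock() failed, no barrier of any kind was
		 * implied, and accesses before and after the failed
		 * attempt may be freely reordered around it */
	}

This would also make a nice lead-in to the UP discussion below.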
> + > +Locks and semaphores may not provide any guarantee of ordering on UP compiled > +systems, and so cannot be counted on in such a situation to actually achieve > +anything at all - especially with respect to I/O accesses - unless combined > +with interrupt disabling operations. > + > +See also the section on "Inter-CPU locking barrier effects". > + > + > +As an example, consider the following: > + > + *A = a; > + *B = b; > + LOCK > + *C = c; > + *D = d; > + UNLOCK > + *E = e; > + *F = f; > + > +The following sequence of events is acceptable: > + > + LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK > + > +But none of the following are: > + > + {*F,*A}, *B, LOCK, *C, *D, UNLOCK, *E > + *A, *B, *C, LOCK, *D, UNLOCK, *E, *F > + *A, *B, LOCK, *C, UNLOCK, *D, *E, *F > + *B, LOCK, *C, *D, UNLOCK, {*F,*A}, *E > + > + > +INTERRUPT DISABLING FUNCTIONS > +----------------------------- > + > +Functions that disable interrupts (LOCK equivalent) and enable interrupts > +(UNLOCK equivalent) will act as compiler barriers only. So if memory or I/O > +barriers are required in such a situation, they must be provided by some > +other means. > + > + > +MISCELLANEOUS FUNCTIONS > +----------------------- > + > +Other functions that imply barriers: > + > + (*) schedule() and similar imply full memory barriers. > + > + > +================================= > +INTER-CPU LOCKING BARRIER EFFECTS > +================================= > + > +On SMP systems locking primitives give a more substantial form of barrier: one > +that does affect memory access ordering on other CPUs, within the context of > +conflict on any particular lock. > + > + > +LOCKS VS MEMORY ACCESSES > +------------------------ > + > +Consider the following: the system has a pair of spinlocks (M) and (Q), and > +three CPUs; then should the following sequence of events occur: > + > + CPU 1 CPU 2 > + =============================== =============================== > + *A = a; *E = e; > + LOCK M LOCK Q > + *B = b; *F = f; > + *C = c; *G = g; > + UNLOCK M UNLOCK Q > + *D = d; *H = h; > + > +Then there is no guarantee as to what order CPU #3 will see the accesses to *A > +through *H occur in, other than the constraints imposed by the separate locks > +on the separate CPUs. It might, for example, see: > + > + *E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M > + > +But it won't see any of: > + > + *B, *C or *D preceding LOCK M > + *A, *B or *C following UNLOCK M > + *F, *G or *H preceding LOCK Q > + *E, *F or *G following UNLOCK Q > + > + > +However, if the following occurs: > + > + CPU 1 CPU 2 > + =============================== =============================== > + *A = a; > + LOCK M [1] > + *B = b; > + *C = c; > + UNLOCK M [1] > + *D = d; *E = e; > + LOCK M [2] > + *F = f; > + *G = g; > + UNLOCK M [2] > + *H = h; > + > +CPU #3 might see: > + > + *E, LOCK M [1], *C, *B, *A, UNLOCK M [1], > + LOCK M [2], *H, *F, *G, UNLOCK M [2], *D > + > +But assuming CPU #1 gets the lock first, it won't see any of: > + > + *B, *C, *D, *F, *G or *H preceding LOCK M [1] > + *A, *B or *C following UNLOCK M [1] > + *F, *G or *H preceding LOCK M [2] > + *A, *B, *C, *E, *F or *G following UNLOCK M [2] > + > + > +LOCKS VS I/O ACCESSES > +--------------------- > + > +Under certain circumstances (such as NUMA), I/O accesses within two spinlocked > +sections on two different CPUs may be seen as interleaved by the PCI bridge.
Suggest adding something to the effect of ", because the PCI bridge is not participating in the cache-coherence protocol, and therefore is incapable of issuing the required read-side memory barriers." > +For example: > + > + CPU 1 CPU 2 > + =============================== =============================== > + spin_lock(Q); > + writel(0, ADDR); > + writel(1, DATA); > + spin_unlock(Q); > + spin_lock(Q); > + writel(4, ADDR); > + writel(5, DATA); > + spin_unlock(Q); > + > +may be seen by the PCI bridge as follows: > + > + WRITE *ADDR = 0, WRITE *ADDR = 4, WRITE *DATA = 1, WRITE *DATA = 5 > + > +which would probably break. > + > +What is necessary here is to insert an mmiowb() before dropping the spinlock, > +for example: > + > + CPU 1 CPU 2 > + =============================== =============================== > + spin_lock(Q); > + writel(0, ADDR); > + writel(1, DATA); > + mmiowb(); > + spin_unlock(Q); > + spin_lock(Q); > + writel(4, ADDR); > + writel(5, DATA); > + mmiowb(); > + spin_unlock(Q); > + > +this will ensure that the two writes issued on CPU #1 appear at the PCI bridge > +before either of the writes issued on CPU #2. > + > + > +Furthermore, following a write by a read to the same device is okay, because > +the read forces the write to complete before the read is performed: > + > + CPU 1 CPU 2 > + =============================== =============================== > + spin_lock(Q); > + writel(0, ADDR); > + a = readl(DATA); > + spin_unlock(Q); > + spin_lock(Q); > + writel(4, ADDR); > + b = readl(DATA); > + spin_unlock(Q); > + > + > +See Documentation/DocBook/deviceiobook.tmpl for more information. > + > + > +========================== > +KERNEL I/O BARRIER EFFECTS > +========================== > + > +When accessing I/O memory, drivers should use the appropriate accessor > +functions: > + > + (*) inX(), outX(): > + > + These are intended to talk to I/O space rather than memory space, but > + that's primarily a CPU-specific concept. The i386 and x86_64 processors do > + indeed have special I/O space access cycles and instructions, but many > + CPUs don't have such a concept. > + > + The PCI bus, amongst others, defines an I/O space concept - which on such > + CPUs as i386 and x86_64 readily maps to the CPU's concept of I/O > + space. However, it may also be mapped as a virtual I/O space in the CPU's > + memory map, particularly on those CPUs that don't support alternate > + I/O spaces. > + > + Accesses to this space may be fully synchronous (as on i386), but > + intermediary bridges (such as the PCI host bridge) may not fully honour > + that. > + > + They are guaranteed to be fully ordered with respect to each other. > + > + They are not guaranteed to be fully ordered with respect to other types of > + memory and I/O operation. > + > + (*) readX(), writeX(): > + > + Whether these are guaranteed to be fully ordered and uncombined with > + respect to each other on the issuing CPU depends on the characteristics > + defined for the memory window through which they're accessing. On later > + i386 architecture machines, for example, this is controlled by way of the > + MTRR registers. > + > + Ordinarily, these will be guaranteed to be fully ordered and uncombined, > + provided they're not accessing a prefetchable device. > + > + However, intermediary hardware (such as a PCI bridge) may indulge in > + deferral if it so wishes; to flush a write, a read from the same location > + is preferred[*], but a read from the same device or from configuration > + space should suffice for PCI. > + > + [*] NOTE!
attempting to read from the same location as was written to may > + cause a malfunction - consider the 16550 Rx/Tx serial registers for > + example. > + > + Used with prefetchable I/O memory, an mmiowb() barrier may be required to > + force writes to be ordered. > + > + Please refer to the PCI specification for more information on interactions > + between PCI transactions. > + > + (*) readX_relaxed() > + > + These are similar to readX(), but are not guaranteed to be ordered in any > + way. Be aware that there is no I/O read barrier available. > + > + (*) ioreadX(), iowriteX() > + > + These will perform as appropriate for the type of access they're actually > + doing, be it inX()/outX() or readX()/writeX(). > + > + > +========== > +REFERENCES > +========== Suggest adding: Alpha AXP Architecture Reference Manual, Second Edition (Sites & Witek, Digital Press) Chapter 5.2: Physical Address Space Characteristics Chapter 5.4: Caches and Write Buffers Chapter 5.5: Data Sharing Chapter 5.6: Read/Write Ordering > +AMD64 Architecture Programmer's Manual Volume 2: System Programming > + Chapter 7.1: Memory-Access Ordering > + Chapter 7.4: Buffering and Combining Memory Writes > + > +IA-32 Intel Architecture Software Developer's Manual, Volume 3: > +System Programming Guide > + Chapter 7.1: Locked Atomic Operations > + Chapter 7.2: Memory Ordering > + Chapter 7.4: Serializing Instructions > + > +The SPARC Architecture Manual, Version 9 > + Chapter 8: Memory Models > + Appendix D: Formal Specification of the Memory Models > + Appendix J: Programming with the Memory Models > + > +UltraSPARC Programmer Reference Manual > + Chapter 5: Memory Accesses and Cacheability > + Chapter 15: Sparc-V9 Memory Models > + > +UltraSPARC III Cu User's Manual > + Chapter 9: Memory Models > + > +UltraSPARC IIIi Processor User's Manual > + Chapter 8: Memory Models > + > +UltraSPARC Architecture 2005 > + Chapter 9: Memory > + Appendix D: Formal Specifications of the Memory Models > + > +UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005 > + Chapter 8: Memory Models > + Appendix F: Caches and Cache Coherency > + > +Solaris Internals, Core Kernel Architecture, p63-68: > + Chapter 3.3: Hardware Considerations for Locks and > + Synchronization > + > +Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching > +for Kernel Programmers: > + Chapter 13: Other Memory Models > + > +Intel Itanium Architecture Software Developer's Manual: Volume 1: > + Section 2.6: Speculation > + Section 4.4: Memory Access > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > From torvalds at osdl.org Fri Mar 17 10:55:07 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Thu, 16 Mar 2006 15:55:07 -0800 (PST) Subject: [PATCH] Document Linux's memory barriers [try #5] In-Reply-To: <20060316231723.GB1323@us.ibm.com> References: <16835.1141936162@warthog.cambridge.redhat.com> <18351.1142432599@warthog.cambridge.redhat.com> <20060316231723.GB1323@us.ibm.com> Message-ID: On Thu, 16 Mar 2006, Paul E. McKenney wrote: > > Also, I have some verbiage and diagrams of Alpha's operation at > http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2006.03.13a.pdf > Feel free to take any that helps. (Source for paper is Latex and xfig, > for whatever that is worth.) 
This paper too claims that x86-64 has somehow different memory ordering constraints than regular x86. Do you actually have a source for that statement, or is it just a continuation of what looks like confusion in the Linux x86-64 header files? (Also, x86 doesn't have an incoherent instruction cache - some older x86 cores have an incoherent instruction decode _buffer_, but that's a slightly different issue with basically no effect on any sane program). Linus From torvalds at osdl.org Fri Mar 17 16:32:03 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Thu, 16 Mar 2006 21:32:03 -0800 (PST) Subject: [PATCH] Document Linux's memory barriers [try #5] In-Reply-To: <20060317012916.GD1323@us.ibm.com> References: <16835.1141936162@warthog.cambridge.redhat.com> <18351.1142432599@warthog.cambridge.redhat.com> <20060316231723.GB1323@us.ibm.com> <20060317012916.GD1323@us.ibm.com> Message-ID: On Thu, 16 Mar 2006, Paul E. McKenney wrote: > > In section 7.1.1 on page 195, it says: > > For cacheable memory types, the following rules govern > read ordering: > > o Out-of-order reads are allowed. Out-of-order reads > can occur as a result of out-of-order instruction > execution or speculative execution. The processor > can read memory out-of-order to allow out-of-order > to proceed. > > o Speculative reads are allows ... [but no effect on > ordering beyond that given in the other rules, near > as I can tell] > > o Reads can be reordered ahead of writes. Reads are > generally given a higher priority by the processor > than writes because instruction execution stalls > if the read data required by an instruction is not > immediately available. Allowing reads ahead of > writes usually maximizes software performance. These are just the same as the x86 ordering. Notice how reads can pass (earlier) writes, but won't be pushed back after later writes. That's very much the x86 ordering (together with the "CPU ordering" for writes). > > (Also, x86 doesn't have an incoherent instruction cache - some older x86 > > cores have an incoherent instruction decode _buffer_, but that's a > > slightly different issue with basically no effect on any sane program). > > Newer cores check the linear address, so code generated in a different > address space now needs to do CPUID. This is admittedly an unusual > case -- perhaps I was getting overly worked up about it. I based this > on Section 10.6 on page 10-21 (physical page 405) of Intel's "IA-32 > Intel Architecture Software Developer's Manual Volume 3: System > Programming Guide", 2004. PDF available (as of 2/16/2005) from: > > ftp://download.intel.com/design/Pentium4/manuals/25366814.pdf Not according to the docs I have. The _prefetch_ queue is invalidated based on the linear address, but not the caches. The caches are coherent, and the prefetch is also coherent in modern cores wrt linear address (but old cores, like the original i386, would literally not see the write, so you could do movl $1234,1f 1: xorl %eax,%eax and the "movl" would overwrite the "xorl", but the "xorl" would still get executed if it was in the 16-byte prefetch buffer or whatever). Modern cores will generally be _totally_ serialized, so if you write to the next instruction, I think most modern cores will notice it. It's only if you use paging or something to write to the physical address to something that is in the prefetch buffers that it can get you. 
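And if someone really is in one of those strange situations, the cure is cheap: execute any serializing instruction after writing the code. Something like this sketch (an invented helper, not from any real JIT or kernel code):

	/* Resynchronise the instruction prefetch with memory: cpuid is
	 * architecturally defined to be a serializing instruction. */
	static inline void serialize_cpu(void)
	{
		unsigned int eax = 0;

		asm volatile("cpuid"
			     : "+a" (eax)
			     : /* no other inputs */
			     : "ebx", "ecx", "edx", "memory");
	}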
Now, the prefetching has gotten longer over time, but it is basically still just a few tens of instructions, and any serializing instruction will force it to be serialized with the cache. It's really a non-issue, because regular self-modifying code will trigger the linear address check, and system code will always end up doing an "iret" or other serializing instruction, so it really doesn't trigger. So in practice, you really should see it as being entirely coherent. You have to do some _really_ strange sh*t to ever see anything different. Linus From paulmck at us.ibm.com Fri Mar 17 17:23:00 2006 From: paulmck at us.ibm.com (Paul E. McKenney) Date: Thu, 16 Mar 2006 22:23:00 -0800 Subject: [PATCH] Document Linux's memory barriers [try #5] In-Reply-To: References: <16835.1141936162@warthog.cambridge.redhat.com> <18351.1142432599@warthog.cambridge.redhat.com> <20060316231723.GB1323@us.ibm.com> <20060317012916.GD1323@us.ibm.com> Message-ID: <20060317062300.GE1323@us.ibm.com> On Thu, Mar 16, 2006 at 09:32:03PM -0800, Linus Torvalds wrote: > > > On Thu, 16 Mar 2006, Paul E. McKenney wrote: > > > > In section 7.1.1 on page 195, it says: > > > > For cacheable memory types, the following rules govern > > read ordering: > > > > o Out-of-order reads are allowed. Out-of-order reads > > can occur as a result of out-of-order instruction > > execution or speculative execution. The processor > > can read memory out-of-order to allow out-of-order > > to proceed. > > > > o Speculative reads are allows ... [but no effect on > > ordering beyond that given in the other rules, near > > as I can tell] > > > > o Reads can be reordered ahead of writes. Reads are > > generally given a higher priority by the processor > > than writes because instruction execution stalls > > if the read data required by an instruction is not > > immediately available. Allowing reads ahead of > > writes usually maximizes software performance. > > These are just the same as the x86 ordering. Notice how reads can pass > (earlier) writes, but won't be pushed back after later writes. That's very > much the x86 ordering (together with the "CPU ordering" for writes). OK, so you are not arguing with the "AMD" row, but rather with the x86 row. So I was looking in the wrong manual. Specifically, you are saying that the x86's "Loads Reordered After Stores" cell should be blank rather than "Y", right? > > > (Also, x86 doesn't have an incoherent instruction cache - some older x86 > > > cores have an incoherent instruction decode _buffer_, but that's a > > > slightly different issue with basically no effect on any sane program). > > > > Newer cores check the linear address, so code generated in a different > > address space now needs to do CPUID. This is admittedly an unusual > > case -- perhaps I was getting overly worked up about it. I based this > > on Section 10.6 on page 10-21 (physical page 405) of Intel's "IA-32 > > Intel Architecture Software Developer's Manual Volume 3: System > > Programming Guide", 2004. PDF available (as of 2/16/2005) from: > > > > ftp://download.intel.com/design/Pentium4/manuals/25366814.pdf > > Not according to the docs I have. > > The _prefetch_ queue is invalidated based on the linear address, but not > the caches. 
The caches are coherent, and the prefetch is also coherent in > modern cores wrt linear address (but old cores, like the original i386, > would literally not see the write, so you could do > > movl $1234,1f > 1: xorl %eax,%eax > > and the "movl" would overwrite the "xorl", but the "xorl" would still get > executed if it was in the 16-byte prefetch buffer or whatever). > > Modern cores will generally be _totally_ serialized, so if you write to > the next instruction, I think most modern cores will notice it. It's only > if you use paging or something to write to the physical address to > something that is in the prefetch buffers that it can get you. Yep. But I would not put it past some JIT writer to actually do something like double-mapping the JITed code to two different linear addresses. > Now, the prefetching has gotten longer over time, but it is basically > still just a few tens of instructions, and any serializing instruction > will force it to be serialized with the cache. Agreed. > It's really a non-issue, because regular self-modifying code will trigger > the linear address check, and system code will always end up doing an > "iret" or other serializing instruction, so it really doesn't trigger. Only if there is a context switch between the writing of the instruction and the executing of it. Might not be the case if someone double-maps the memory or some other similar stunt. And I agree modern cores seem to be getting less aggressive in their search for instruction-level parallelism, but it doesn't take too much speculative-execution capability to (sometimes!) get some pretty strange stuff loaded into the instruction prefetch buffer. > So in practice, you really should see it as being entirely coherent. You > have to do some _really_ strange sh*t to ever see anything different. No argument with your last sentence! (And, believe it or not, there was a time when self-modifying code was considered manly rather than strange. But that was back in the days of 4096-byte main memories...) I believe we are in violent agreement on this one. The column label in the table is "Incoherent Instruction Cache/Pipeline". You are saying that only the pipeline can be incoherent, and even then, only in strange situations. But the only situation in which I would leave a cell in this column blank would be if -both- the cache -and- the pipeline were -always- coherent, even in strange situations. So I believe that this one still needs to stay "Y". I would rather see someone's JIT execute an extra CPUID after generating a new chunk of code than to see it fail strangely -- but only sometimes, and not reproducibly. Would it help if the column were instead labelled "Incoherent Instruction Cache or Pipeline", replacing the current "/" with "or"? Thanx, Paul From jirislaby at gmail.com Sat Mar 18 04:30:26 2006 From: jirislaby at gmail.com (Jiri Slaby) Date: Fri, 17 Mar 2006 18:29:26 +0059 Subject: 2.6.16-rc6: known regressions In-Reply-To: <20060313200544.GG13973@stusta.de> References: <20060313200544.GG13973@stusta.de> Message-ID: <441AF20D.5010901@gmail.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Adrian Bunk wrote: > This email lists some known regressions in 2.6.16-rc6 compared to 2.6.15.
[snip] > Subject : Stradis driver udev brekage > References : http://bugzilla.kernel.org/show_bug.cgi?id=6170 > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=181063 > http://lkml.org/lkml/2006/2/18/204 > Submitter : Tom Seeley > Dave Jones > Handled-By : Jiri Slaby > Status : unknown Solved, see https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=181063#c16 regards, - -- Jiri Slaby www.fi.muni.cz/~xslaby \_.-^-._ jirislaby at gmail.com _.-^-._/ B67499670407CE62ACC8 22A032CC55C339D47A7E -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.1 (GNU/Linux) Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org iD8DBQFEGvINMsxVwznUen4RAqW3AJ9vgpxMrf7oXzj46zxtee4J4WthmQCgqYU0 xmCArHJ8Nr3UCyt68HAdbDI= =u8yx -----END PGP SIGNATURE----- From sfr at canb.auug.org.au Tue Mar 21 10:34:09 2006 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Tue, 21 Mar 2006 10:34:09 +1100 Subject: merging lists Message-ID: <20060321103409.48b10200.sfr@canb.auug.org.au> Hi all, I am about to merge the two mailing lists (so that only linuxppc-dev will continue). To do this, I will subscribe all the members of the linuxppc64-dev list that are not members of linuxppc-dev to the latter list. If you are subscribed under different addresses to the two different lists, you will start getting two copies of everything until you unsubscribe one of your addresses. There may be a small amount of disruption, I apologise in advance. I will redirect the linuxppc64-dev mail addresses to the linuxppc-dev ones and the archives and patchwork will remain available at their current addresses. -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 191 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20060321/29167371/attachment.pgp