From sfr at canb.auug.org.au Mon Dec 1 13:42:11 2003 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Mon, 1 Dec 2003 13:42:11 +1100 Subject: Remove flight recorder Message-ID: <20031201134211.0db3853a.sfr@canb.auug.org.au> Hi all, During the porting of the 2.6 kernel to iSeries (or the iSeries kernel to 2.6 :-)) I have been #defining out the flight recorder stuff as I had no idea what it was for. Anton suggested that it may have been a useful bringup tool, but that we don't have the necessary tools to use it. Does anyone mind if I just excise the flight recorder code (from iSeries in particular)? Please don't bite my head off, I am just asking. :-) -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Mon Dec 1 16:15:05 2003 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 01 Dec 2003 16:15:05 +1100 Subject: [Fwd: [PATCH] ppc64: 2.6 altivec support, sys_swapcontext & signal32 rework] Message-ID: <1070255704.658.108.camel@gaston> (Re-post with the patch bzip2'ed...) Hi ! It wasn't very easy to split this patch, so here it is in one piece for now. It does a few things: - Adds basic Altivec support to the context switching code - Adds Altivec support to the 64 bits sigcontext - Rewrite part of the signal32 compat code based on the new ppc32 implementation, with Altivec support in the contexts, factors out some sigset flipping code, and fixes a long standing bug where an 32 bits RT context would have an incorrectly flipped sigset (a 64 bits one instead of a 32 bits one). - Adds sys_swapcontext syscall (and sys32_swapcontext) for kernel based implementation of {set,get,swap}_context with Altivec support So far, it appears to work fine (when run as part of my G5 kernel which contains a bunch of other changes though). It would need some more testing hopefully. What is needed now is a glibc implementation of the ucontext calls for both 32 and 64 bits that makes use of the new sys_swapcontext, at least with 2.6. I may give it a try, but I'm sure somebody more familiar with glibc than I am would get this done much much more quickly... So if you are that person, please speak up ;) Regarding the details of the Altivec stuff: On a ppc64 signal frame, a kernel that supports altivec (AT_HWCAP) will always fill properly the pointer to the altivec context. In there, VRSAVE is always set, regardless of the usage of altivec done by the process. MSR:VEC in the regs context will be set is the other altivec registers (vr0..31 and vscr) have valid values in the context. It is important to split vrsave from the rest of the context as a process may set vrsave prior to doing its first vector operation, and get preempted with a signal in between those, thus having a valid vrsave context that needs to be saved & restored without having taken its first altivec exception yet, thus not having an altivec context to save yet. I'm not sure what's the best way to deal with the availability of vrsave on ppc32 contexts, it's a bit more nasty here. (kernel version ?). Some ppc32 kernels will report supporting altivec via AT_HWCAP without actually implementing altivec sig/u context stuff. We may need to based ourself on some kernel versioning here, maybe consider 2.4.23 as the minimum version to rely on kernel sigcontext containing proper vrsave for ppc32 ? Ben. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: ppc64-altivec_and_sig.diff.bz2 Type: application/x-bzip Size: 17700 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20031201/2e38bba6/attachment.bin From boutcher at us.ibm.com Tue Dec 2 00:09:46 2003 From: boutcher at us.ibm.com (David Boutcher) Date: Mon, 1 Dec 2003 07:09:46 -0600 Subject: Remove flight recorder In-Reply-To: <20031201134211.0db3853a.sfr@canb.auug.org.au> Message-ID: owner-linuxppc64-dev at lists.linuxppc.org wrote on 11/30/2003 08:42:11 PM: > Hi all, > > During the porting of the 2.6 kernel to iSeries (or the iSeries kernel > to 2.6 :-)) I have been #defining out the flight recorder stuff as I > had no idea what it was for. Anton suggested that it may have been a > useful bringup tool, but that we don't have the necessary tools to use > it. > > Does anyone mind if I just excise the flight recorder code (from iSeries > in particular)? Hi Stephen, Which flight recorder stuff? Do you mean the HvCall_WriteLogBuffer stuff? In that case, it probably should be left in. Those calls write the console output to a hypervisor buffer that can be retreived later through a couple of different mechanisms. One of the key uses for that is if you don't have a console connected when your linux crashes, you can come in later and dump the last console output. It is also handy when linux crashes, because frequently not all output makes it out through the tortuous console connection, and you can dump the tail end of any kernel output. To dump the hypervisor log buffer, type ctrl-x ctrl-x on the console screen, or there is a path through the green-screen SST screens if you really want to use it. If I just answered the wrong question, remind me which flight recorder you are pulling out? Thanks, Dave Boutcher IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From sfr at canb.auug.org.au Tue Dec 2 01:23:02 2003 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Tue, 2 Dec 2003 01:23:02 +1100 Subject: Remove flight recorder In-Reply-To: References: <20031201134211.0db3853a.sfr@canb.auug.org.au> Message-ID: <20031202012302.6ad03b7f.sfr@canb.auug.org.au> On Mon, 1 Dec 2003 07:09:46 -0600 "David Boutcher" wrote: > > If I just answered the wrong question, remind me which flight recorder you > are pulling out? Wrong question :-) sorry. In 2.4 there is arch/ppc64/flight_recorder.c which allows you to log to a buffer that is accessible through the proc file system. It doesn't exist in the 2.6 kernel (so I had to ifdef out the places in the 2.4 iSeries code that I am forward porting) so presumably the pSeries guys won't miss it :-) I know about the Hypervisor log and have used it quite a lot so far. -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From boutcher at us.ibm.com Tue Dec 2 01:45:03 2003 From: boutcher at us.ibm.com (David Boutcher) Date: Mon, 1 Dec 2003 08:45:03 -0600 Subject: Remove flight recorder In-Reply-To: <20031202012302.6ad03b7f.sfr@canb.auug.org.au> Message-ID: owner-linuxppc64-dev at lists.linuxppc.org wrote on 12/01/2003 08:23:02 AM: > On Mon, 1 Dec 2003 07:09:46 -0600 "David Boutcher" com> wrote: > Wrong question :-) sorry. In 2.4 there is arch/ppc64/flight_recorder.c > which allows you to log to a buffer that is accessible through the proc > file system. 
It doesn't exist in the 2.6 kernel (so I had to ifdef out > the places in the 2.4 iSeries code that I am forward porting) so > presumably the pSeries guys won't miss it :-) Oh THAT flight recorder. Ya, I don't think anyone is using that flight recorder. Blow it away. Dave Boutcher IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Tue Dec 2 08:09:40 2003 From: anton at samba.org (Anton Blanchard) Date: Tue, 2 Dec 2003 08:09:40 +1100 Subject: [PATCH] nvram buffering/error logging port to 2.6 In-Reply-To: <1068210343.21219.17.camel@tin.ibm.com> References: <1068210343.21219.17.camel@tin.ibm.com> Message-ID: <20031201210940.GD22620@krispykreme> Hi, > This is a port of the nvram buffering/error logging code from 2.4 to > 2.6. I should also note that I included moving /proc/rtas to > /proc/ppc64/rtas. Taking a step back, why do we need to buffer error log entries in NVRAM? My thoughts when I added the original event-scan userspace interface were: 1. Machine boots, we execute event-scans but dont request error log entries. 2. When the rtas proc file is opened we then start requesting error log information. Is this less reliable? Well we already have a window between where we do the event scan and when we write the information to NVRAM. Im guessing writing NVRAM isnt fast, we could easily lose or get corrupted event scan data if the machine locked up in this window. NVRAM is a limited resource, how do we avoid overflowing it during boot? Could we lose error log information if we end up with a bunch of event-scan error logs? The real way to fix this window is to have a better interface to the error log information (ie a read error log RTAS call and a discard error log RTAS call, you call discard error log once you have successfully committed the error log to disk). Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From moilanen at austin.ibm.com Tue Dec 2 10:19:06 2003 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Mon, 01 Dec 2003 17:19:06 -0600 Subject: [PATCH] nvram buffering/error logging port to 2.6 In-Reply-To: <20031201210940.GD22620@krispykreme> References: <1068210343.21219.17.camel@tin.ibm.com> <20031201210940.GD22620@krispykreme> Message-ID: <1070320746.1129.676.camel@tin.ibm.com > > 1. Machine boots, we execute event-scans but dont request error log > entries. > 2. When the rtas proc file is opened we then start requesting error log > information. > > Is this less reliable? In a little more detail: 1.) On boot, what was in NVRAM is store into memory. 2.) Event-scans wills start pulling error logs and if there is an error-log entry from rtas, that data overwrites what was in NVRAM. 3.) rtas_errd will pull from /proc/ppc64/rtas/error_log 4.) When the data is stored on disk, rtas_errd will go and read from error_log again and this signals that it is safe to clear NVRAM of the event that was stored. So it is possible to lose the event stored from last boot if on the current boot the system goes down inbetween the first event-scan (and the case that there is a new event-log entry) and when the rtas_errd runs for the first time. I do not feel that this is a big hole, but this hole could be closed by not starting event-scans until rtas_errd has started. This does not seem smart, as if rtas_errd is not installed on the system we will get a surveillance timeout. > Well we already have a window between where we do > the event scan and when we write the information to NVRAM. 
Im guessing > writing NVRAM isnt fast, we could easily lose or get corrupted event > scan data if the machine locked up in this window. There is nothing that can be done about this. > NVRAM is a limited resource, how do we avoid overflowing it during boot? The OS is guaranteed 1K of NVRAM per partition. If for some reason we do not have the space we should not do the NVRAM buffering of the events coming in. > Could we lose error log information if we end up with a bunch of > event-scan error logs? Yes, if we are over 64 error-logs and rtas_errd is not processing them fast enough it is possible. The most I have ever seen is 3 come in at once. If 64 come in at one time, then there is something severly broken. > The real way to fix this window is to have a better interface to the error > log information (ie a read error log RTAS call and a discard error log > RTAS call, you call discard error log once you have successfully > committed the error log to disk). I'm not clear. How is this different then what is currently there? Do you mean storing every single error-log in NVRAM until it is on disk? Currently we only store 1 error log because we are only guaranteed that much space in NVRAM (i.e. could lose that NVRAM space on the next boot and nullify the buffering of error-logs in NVRAM). So the last fatal error-log received is what is stored into NVRAM or if there was no fatal then just the last error-log received. Thanks, Jake ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From moilanen at austin.ibm.com Wed Dec 3 04:43:07 2003 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Tue, 02 Dec 2003 11:43:07 -0600 Subject: bogus changes for generic pci_scan_slot() in ameslab-2.5 In-Reply-To: <20031128154151.GA30606@suse.de> References: <20031128154151.GA30606@suse.de> Message-ID: <1070385249.1123.1659.camel@tin.ibm.com > This looks like the workaround for the pci multifunc problem that is seen on LPARs. The linux-2.6-pcibios-scan-all-fns-1.patch I posted in the JS20 support email fixes the same problem, does not break ppc32, and should probably used instead. Thanks, Jake On Fri, 2003-11-28 at 09:41, Olaf Hering wrote: > Good morning, > > what is the purpose of this change? > > > diff -purN linux-2.5/drivers/pci/probe.c linuxppc64-2.5/drivers/pci/probe.c > --- linux-2.5/drivers/pci/probe.c 2003-08-06 15:34:30.000000000 +0000 > +++ linuxppc64-2.5/drivers/pci/probe.c 2003-11-05 22:12:33.000000000 +0000 > @@ -552,6 +552,7 @@ int __devinit pci_scan_slot(struct pci_b > struct pci_dev *dev; > > dev = pci_scan_device(bus, devfn); > +#if 0 > if (func == 0) { > if (!dev) > break; > @@ -560,6 +561,10 @@ int __devinit pci_scan_slot(struct pci_b > continue; > dev->multifunction = 1; > } > +#else > + if (!dev) > + continue; > +#endif > > /* Fix up broken headers */ > pci_fixup_device(PCI_FIXUP_HEADER, dev); > > It breaks on ppc32, B&W G3, dies in indirect_read_config() because the > pointer *cfg_data becomes bogus, devfn is > 0xff (no idea if that > matters). > > turning #if 0 into #if 1 cures it. I havent tried it on other systems > yet, but at least a PReP MTX+ works with the patch above. > > > > -- > USB is for mice, FireWire is for men! > > sUse lINUX ag, n?RNBERG > ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From olh at suse.de Wed Dec 3 04:51:15 2003 From: olh at suse.de (Olaf Hering) Date: Tue, 2 Dec 2003 18:51:15 +0100 Subject: bogus changes for generic pci_scan_slot() in ameslab-2.5 In-Reply-To: <1070385249.1123.1659.camel@tin.ibm.com > References: <20031128154151.GA30606@suse.de> <1070385249.1123.1659.camel@tin.ibm.com > Message-ID: <20031202175115.GA3508@suse.de> On Tue, Dec 02, Jake Moilanen wrote: > This looks like the workaround for the pci multifunc problem that is > seen on LPARs. The linux-2.6-pcibios-scan-all-fns-1.patch I posted in > the JS20 support email fixes the same problem, does not break ppc32, and > should probably used instead. I will give it a try on the beige G3, thanks. -- USB is for mice, FireWire is for men! sUse lINUX ag, n?RNBERG ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From jimix at watson.ibm.com Sat Dec 6 02:47:29 2003 From: jimix at watson.ibm.com (Jimi Xenidis) Date: Fri, 5 Dec 2003 10:47:29 -0500 Subject: alignment and correction bug in glibc Message-ID: <16336.43153.27131.648204@kitch0.watson.ibm.com> File linuxthreads/sysdeps/unix/sysv/linux/powerpc/powerpc64/sysdep-cancel.h performs a ld, cmpdi with 0 on a 32 bit value before every system call in a threaded app. diff of proposed fixe below. -JX --- sysdep-cancel.h Tue Jun 17 18:22:57 2003 +++ /tmp/fix.S Fri Dec 5 10:43:37 2003 @@ -103,8 +103,8 @@ .tc __local_multiple_threads[TC],__local_multiple_threads; \ .previous; \ ld 10,.LC__local_multiple_threads at toc(2); \ - ld 10,0(10); \ - cmpdi 10,0 + lwz 10,0(10); \ + cmpwi 10,0 # endif #elif !defined __ASSEMBLER__ ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From aprasad at in.ibm.com Sun Dec 7 11:58:06 2003 From: aprasad at in.ibm.com (Anil K Prasad) Date: Sun, 7 Dec 2003 06:28:06 +0530 Subject: [PATCH] ppc64 kdb to print SDR1 along with other SPRs Message-ID: Hi, I am not sure why there is no option to see SDR1 value in kdb. I had expected it to be under 'superreg' option. Anyway, I have just added extra printf for SDR1.. and below is patch for the same. Thanks, Anil. --------------------------------------------------------------------------PATCH BEGIN--------------------------------------------------------------------------------- --- linux-2.4.21/arch/ppc64/kdb/kdbasupport.c 2003-12-06 16:30:53.000000000 -0800 +++ linux-myfix/arch/ppc64/kdb/kdbasupport.c 2003-12-06 15:13:10.000000000 -0800 @@ -1661,6 +1661,7 @@ kdb_printf("toc = %.16lx dar = %.16lx\n", toc, get_dar()); kdb_printf("srr0 = %.16lx srr1 = %.16lx\n", get_srr0(), get_srr1()); kdb_printf("asr = %.16lx\n", mfasr()); + kdb_printf("sdr1 = %.16lx\n", mfsdr1()); for (i = 0; i < 8; ++i) kdb_printf("sr%.2ld = %.16lx sr%.2ld = %.16lx\n", (long int)i, (unsigned long)get_sr(i), (long int)(i+8), (long unsigned int) get_sr(i+8)); --- linux-2.4.21/include/asm-ppc64/processor.h 2003-12-06 16:29:13.000000000 -0800 +++ linux-myfix/include/asm-ppc64/processor.h 2003-12-06 16:28:56.000000000 -0800 @@ -594,6 +594,8 @@ #define mfasr() ({unsigned long rval; \ asm volatile("mfasr %0" : "=r" (rval)); rval;}) +#define mfsdr1() ({unsigned long rval; \ + asm volatile("mfsdr1 %0" : "=r" (rval)); rval;}) #ifndef __ASSEMBLY__ extern int have_of; ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From sjmunroe at us.ibm.com Mon Dec 8 07:40:55 2003 From: sjmunroe at us.ibm.com (Steve Munroe) Date: Sun, 7 Dec 2003 14:40:55 -0600 Subject: [Fwd: [PATCH] ppc64: 2.6 altivec support, sys_swapcontext & signal32 rework] Message-ID: Ben thanks. With Tom Gall's I have your kernel running on my G5. This was pull from your BK about our Friday noon. We had a few glitches and had to deconfigure NVRAM and pmac_seriel. The plan is to build glibc with some VMX patches and start testing next week. Steven J. Munroe Power Linux Toolchain Architect IBM Corporation, Linux Technology Center Benjamin Herrenschmidt To: linuxppc64-dev at lists.linuxppc.org Sent by: cc: owner-linuxppc64-dev at lists.l Subject: [Fwd: [PATCH] ppc64: 2.6 altivec support, sys_swapcontext & signal32 inuxppc.org rework] 11/30/03 11:15 PM (Re-post with the patch bzip2'ed...) Hi ! It wasn't very easy to split this patch, so here it is in one piece for now. It does a few things: - Adds basic Altivec support to the context switching code - Adds Altivec support to the 64 bits sigcontext - Rewrite part of the signal32 compat code based on the new ppc32 implementation, with Altivec support in the contexts, factors out some sigset flipping code, and fixes a long standing bug where an 32 bits RT context would have an incorrectly flipped sigset (a 64 bits one instead of a 32 bits one). - Adds sys_swapcontext syscall (and sys32_swapcontext) for kernel based implementation of {set,get,swap}_context with Altivec support So far, it appears to work fine (when run as part of my G5 kernel which contains a bunch of other changes though). It would need some more testing hopefully. What is needed now is a glibc implementation of the ucontext calls for both 32 and 64 bits that makes use of the new sys_swapcontext, at least with 2.6. I may give it a try, but I'm sure somebody more familiar with glibc than I am would get this done much much more quickly... So if you are that person, please speak up ;) Regarding the details of the Altivec stuff: On a ppc64 signal frame, a kernel that supports altivec (AT_HWCAP) will always fill properly the pointer to the altivec context. In there, VRSAVE is always set, regardless of the usage of altivec done by the process. MSR:VEC in the regs context will be set is the other altivec registers (vr0..31 and vscr) have valid values in the context. It is important to split vrsave from the rest of the context as a process may set vrsave prior to doing its first vector operation, and get preempted with a signal in between those, thus having a valid vrsave context that needs to be saved & restored without having taken its first altivec exception yet, thus not having an altivec context to save yet. I'm not sure what's the best way to deal with the availability of vrsave on ppc32 contexts, it's a bit more nasty here. (kernel version ?). Some ppc32 kernels will report supporting altivec via AT_HWCAP without actually implementing altivec sig/u context stuff. We may need to based ourself on some kernel versioning here, maybe consider 2.4.23 as the minimum version to rely on kernel sigcontext containing proper vrsave for ppc32 ? Ben. #### ppc64-altivec_and_sig.diff.bz2 has been removed from this note on December 07, 2003 by Steve Munroe ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From benh at kernel.crashing.org Mon Dec 8 10:28:19 2003 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 08 Dec 2003 10:28:19 +1100 Subject: [Fwd: [PATCH] ppc64: 2.6 altivec support, sys_swapcontext & signal32 rework] In-Reply-To: References: Message-ID: <1070839698.12501.58.camel@gaston> On Mon, 2003-12-08 at 07:40, Steve Munroe wrote: > Ben thanks. > > With Tom Gall's I have your kernel running on my G5. This was pull from > your BK about our Friday noon. We had a few glitches and had to deconfigure > NVRAM and pmac_seriel. Ah ? I have both working here. I use pmac_zilog for serial console using a stealth serial adapter and nvram works fine so far. What kind of glitches did you have ? > The plan is to build glibc with some VMX patches and start testing next > week. Great ! Note that paulus also noticed that the saved_msr & saved_ee thingy in the ppc64 signal handling appear to be broken (and makes little sense in the first place). The plan is to remove it completely. I'll do that later this week. Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From johnrose at austin.ibm.com Tue Dec 9 02:58:00 2003 From: johnrose at austin.ibm.com (John Rose) Date: Mon, 08 Dec 2003 09:58:00 -0600 Subject: [PATCH] 2.6 - OF dynamic update __init funcs Message-ID: <1070899080.18287.10.camel@verve> The current of_finish_dynamic_node() calls some prom.c functions that are marked __init. Since this function is for use after boot, the functions should be changed to __devinit. If there are no comments, I'll push this to 2.6 shortly. Thanks- John diff -Nru a/arch/ppc64/kernel/prom.c b/arch/ppc64/kernel/prom.c --- a/arch/ppc64/kernel/prom.c Sun Dec 7 20:18:16 2003 +++ b/arch/ppc64/kernel/prom.c Sun Dec 7 20:18:16 2003 @@ -1701,7 +1701,7 @@ /* * Find the interrupt parent of a node. */ -static struct device_node * __init +static struct device_node * __devinit intr_parent(struct device_node *p) { phandle *parp; @@ -1716,7 +1716,7 @@ * Find out the size of each entry of the interrupts property * for a node. */ -static int __init +static int __devinit prom_n_intr_cells(struct device_node *np) { struct device_node *p; @@ -1744,7 +1744,7 @@ * Map an interrupt from a device up to the platform interrupt * descriptor. */ -static int __init +static int __devinit map_interrupt(unsigned int **irq, struct device_node **ictrler, struct device_node *np, unsigned int *ints, int nintrc) { ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From sjmunroe at us.ibm.com Tue Dec 9 03:29:24 2003 From: sjmunroe at us.ibm.com (Steve Munroe) Date: Mon, 8 Dec 2003 10:29:24 -0600 Subject: [Fwd: [PATCH] ppc64: 2.6 altivec support, sys_swapcontext & signal32 rework] Message-ID: Ben Herrenschmidt writes: >> sjmunroe writes: >> With Tom Gall's I have your kernel running on my G5. This was pull from >> your BK about our Friday noon. We had a few glitches and had to deconfigure >> NVRAM and pmac_seriel. > >Ah ? I have both working here. I use pmac_zilog for serial console >using a stealth serial adapter and nvram works fine so far. What kind >of glitches did you have ? These where compile time fails: I think in one case NVRAM_BYTES? was not defined. I don't remember the specifics of why pmac_zilog failed. Steven J. Munroe Power Linux Toolchain Architect IBM Corporation, Linux Technology Center ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From olh at suse.de Tue Dec 9 09:42:30 2003 From: olh at suse.de (Olaf Hering) Date: Mon, 8 Dec 2003 23:42:30 +0100 Subject: stopping all cpus in one go Message-ID: <20031208224230.GA22205@suse.de> Is there a good reason to leave the other cpus running? could this change deadlock? diff -p -purNX kernel_exclude.txt orig/linux-2.6.0-test11/arch/ppc64/xmon/xmon.c linux-2.6.0-test11/arch/ppc64/xmon/xmon.c --- orig/linux-2.6.0-test11/arch/ppc64/xmon/xmon.c 2003-11-26 20:45:27.000000000 +0000 +++ linux-2.6.0-test11/arch/ppc64/xmon/xmon.c 2003-12-08 22:15:43.000000000 +0000 @@ -228,7 +228,7 @@ xmon(struct pt_regs *excp) { struct pt_regs regs; int cmd; - unsigned long msr; + unsigned long msr, cpu; if (excp == NULL) { /* Ok, grab regs as they are now. @@ -300,6 +300,8 @@ xmon(struct pt_regs *excp) #endif /* CONFIG_SMP */ remove_bpts(); disable_surveillance(); + cpu = MSG_ALL_BUT_SELF; + smp_send_xmon_break(cpu); cmd = cmds(excp); if (cmd == 's') { xmon_trace[smp_processor_id()] = SSTEP; -- USB is for mice, FireWire is for men! sUse lINUX ag, n?RNBERG ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Tue Dec 9 10:20:06 2003 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 09 Dec 2003 10:20:06 +1100 Subject: [Fwd: [PATCH] ppc64: 2.6 altivec support, sys_swapcontext & signal32 rework] In-Reply-To: References: Message-ID: <1070925605.11006.138.camel@gaston> On Tue, 2003-12-09 at 03:29, Steve Munroe wrote: > Ben Herrenschmidt writes: > > >> sjmunroe writes: > >> With Tom Gall's I have your kernel running on my G5. This was pull from > >> your BK about our Friday noon. We had a few glitches and had to > deconfigure > >> NVRAM and pmac_seriel. > > > >Ah ? I have both working here. I use pmac_zilog for serial console > >using a stealth serial adapter and nvram works fine so far. What kind > >of glitches did you have ? > > These where compile time fails: I think in one case NVRAM_BYTES? was not > defined. I don't remember the specifics of why pmac_zilog failed. Weird. I'll check that with Tom. Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Tue Dec 9 18:56:35 2003 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 09 Dec 2003 18:56:35 +1100 Subject: hash table Message-ID: <1070956595.11006.183.camel@gaston> Here's a first shot at my rework of __hash_page, if you want to have a quick look... It did a few tests but didn't really stress the box that badly, so there may be bugs in there. At this point, the goal isn't (yet) perfs, it is to get rid of the page table lock in hash_page(). (though my simple tests showed an approximate 10% improvement of hash_page duration). There is still room for optimisation. It would be nice for example to move the lazy cache flush to asm to avoid the overhead of function calls & additional stackframe, and I could rewrite the non-HV verions of the low level ppc_md. functions in asm with some wins I supposed looking at the C code... There is also room for optimisation in my asm code (like some bit manipulations or better scheduling). I think there is no race with the flush code. The reason is that the case where flush is called on a present page seem to be strictly limited to a PP bits update (or an accessed bits update in some error case, but we can dismiss that one completely I beleive). Since flush uses pte_update, it will not race with a pending _hash_page. 
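[Editorial aside -- not part of Ben's mail: the _PAGE_BUSY handshake he relies on above is implemented in assembly in the hashtable.S patch further down; purely as an illustration, the same idea in C might look like the sketch below. The helper name pte_try_hash() and the use of GCC atomic builtins are mine, not code from the patch; the _PAGE_* values are the ones defined in the pgtable.h hunk of the patch.]

/* Illustration only: claim a Linux PTE for hashing without taking the
 * page_table_lock.  _PAGE_BUSY acts as a per-PTE lock bit; because the
 * flush path also goes through a pte_update() that spins on _PAGE_BUSY,
 * a concurrent flush can never observe a half-done update.
 */
#include <stdint.h>

#define _PAGE_PRESENT  0x0001UL   /* values from the pgtable.h hunk below */
#define _PAGE_ACCESSED 0x0100UL
#define _PAGE_HASHPTE  0x0400UL
#define _PAGE_BUSY     0x0800UL

static int pte_try_hash(uint64_t *ptep, uint64_t access)
{
        uint64_t old, new;

        access |= _PAGE_PRESENT;
        for (;;) {
                old = __atomic_load_n(ptep, __ATOMIC_RELAXED);
                if (access & ~old)          /* missing access rights: fault */
                        return -1;
                if (old & _PAGE_BUSY)       /* another cpu owns the PTE */
                        continue;           /* spin, as the ldarx loop does */
                /* (the real code also turns _PAGE_RW in access into
                 *  _PAGE_DIRTY at this point) */
                new = old | _PAGE_BUSY | _PAGE_ACCESSED | _PAGE_HASHPTE;
                if (__atomic_compare_exchange_n(ptep, &old, new, 0,
                                                __ATOMIC_ACQUIRE,
                                                __ATOMIC_RELAXED))
                        return 0;   /* PTE is ours: update the HPTE, then
                                       clear _PAGE_BUSY on the way out */
        }
}

[End of aside; Ben's mail continues.]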
The only possible race is a __hash_page occuring during a flush. But in this case, the PTE will have the new PP bits already so at worst, we exit flush with an entry present... but that has the new bits. So it's ok. I don't think we can race on the content of the HPTE neither as we have the HPTE lock bit there. I'd still like your point of view though. Ben. diff -urN linux-g5-ppc64/arch/ppc64/kernel/htab.c linux-g5-htab/arch/ppc64/kernel/htab.c --- linux-g5-ppc64/arch/ppc64/kernel/htab.c 2003-12-08 20:27:20.084329896 +1100 +++ linux-g5-htab/arch/ppc64/kernel/htab.c 2003-12-09 18:15:10.064315512 +1100 @@ -27,6 +27,7 @@ #include #include #include +#include #include #include @@ -129,7 +130,7 @@ } } -void +void __init htab_initialize(void) { unsigned long table, htab_size_bytes; @@ -231,6 +232,47 @@ } /* + * Called by asm hashtable.S for doing lazy icache flush + */ +unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap) +{ + struct page *page; + +#define PPC64_HWNOEXEC (1 << 2) + + if (!pfn_valid(pte_pfn(pte))) + return pp; + + page = pte_page(pte); + + /* page is dirty */ + if (!PageReserved(page) && !test_bit(PG_arch_1, &page->flags)) { + if (trap == 0x400) { + __flush_dcache_icache(page_address(page)); + set_bit(PG_arch_1, &page->flags); + } else + pp |= PPC64_HWNOEXEC; + } + return pp; +} + +/* + * Called by asm hashtable.S in case of critical insert failure + */ +void htab_insert_failure(void) +{ + panic("hash_page: pte_insert failed\n"); +} + +/* + * Handle a fault by adding an HPTE. If the address can't be determined + * to be valid via Linux page tables, return 1. If handled return 0 + */ +extern int __hash_page(unsigned long ea, unsigned long access, unsigned long vsid, + pte_t *ptep, unsigned long trap, int local); + +#if 0 +/* * Handle a fault by adding an HPTE. If the address can't be determined * to be valid via Linux page tables, return 1. If handled return 0 */ @@ -380,6 +422,7 @@ return 0; } +#endif int hash_page(unsigned long ea, unsigned long access, unsigned long trap) { @@ -444,24 +487,20 @@ if (pgdir == NULL) return 1; - /* - * Lock the Linux page table to prevent mmap and kswapd - * from modifying entries while we search and update - */ - spin_lock(&mm->page_table_lock); - tmp = cpumask_of_cpu(smp_processor_id()); if (user_region && cpus_equal(mm->cpu_vm_mask, tmp)) local = 1; - ret = hash_huge_page(mm, access, ea, vsid, local); - if (ret < 0) { + /* Is this a huge page ? 
*/ + if (unlikely(in_hugepage_area(mm->context, ea))) + ret = hash_huge_page(mm, access, ea, vsid, local); + else { ptep = find_linux_pte(pgdir, ea); + if (ptep == NULL) + return 1; ret = __hash_page(ea, access, vsid, ptep, trap, local); } - spin_unlock(&mm->page_table_lock); - #ifdef CONFIG_HTABLE_STATS if (ret == 0) { duration = mftb() - duration; @@ -519,3 +558,26 @@ local); } } + +static inline void make_bl(unsigned int *insn_addr, void *func) +{ + unsigned long funcp = *((unsigned long *)func); + int offset = funcp - (unsigned long)insn_addr; + + *insn_addr = (unsigned int)(0x48000001 | (offset & 0x03fffffc)); + flush_icache_range((unsigned long)insn_addr, 4+ + (unsigned long)insn_addr); +} + +void __init htab_finish_init(void) +{ + extern unsigned int *htab_call_hpte_insert1; + extern unsigned int *htab_call_hpte_insert2; + extern unsigned int *htab_call_hpte_remove; + extern unsigned int *htab_call_hpte_updatepp; + + make_bl(htab_call_hpte_insert1, ppc_md.hpte_insert); + make_bl(htab_call_hpte_insert2, ppc_md.hpte_insert); + make_bl(htab_call_hpte_remove, ppc_md.hpte_remove); + make_bl(htab_call_hpte_updatepp, ppc_md.hpte_updatepp); +} diff -urN linux-g5-ppc64/arch/ppc64/kernel/setup.c linux-g5-htab/arch/ppc64/kernel/setup.c --- linux-g5-ppc64/arch/ppc64/kernel/setup.c 2003-12-08 20:15:07.922635392 +1100 +++ linux-g5-htab/arch/ppc64/kernel/setup.c 2003-12-09 18:14:21.331723992 +1100 @@ -246,6 +246,10 @@ pmac_init(r3, r4, r5, r6, r7); } #endif + /* Finish initializing the hash table (do the dynamic + * patching for the fast-path hashtable.S code) + */ + htab_finish_init(); printk("Starting Linux PPC64 %s\n", UTS_RELEASE); diff -urN linux-g5-ppc64/arch/ppc64/mm/Makefile linux-g5-htab/arch/ppc64/mm/Makefile --- linux-g5-ppc64/arch/ppc64/mm/Makefile 2003-11-19 21:20:09.000000000 +1100 +++ linux-g5-htab/arch/ppc64/mm/Makefile 2003-12-08 17:33:31.452722880 +1100 @@ -4,6 +4,6 @@ EXTRA_CFLAGS += -mno-minimal-toc -obj-y := fault.o init.o extable.o imalloc.o +obj-y := fault.o init.o extable.o imalloc.o hashtable.o obj-$(CONFIG_DISCONTIGMEM) += numa.o obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o diff -urN linux-g5-ppc64/arch/ppc64/mm/hashtable.S linux-g5-htab/arch/ppc64/mm/hashtable.S --- linux-g5-ppc64/arch/ppc64/mm/hashtable.S Thu Jan 01 10:00:00 1970 +++ linux-g5-htab/arch/ppc64/mm/hashtable.S Tue Dec 09 18:54:52 2003 @@ -0,0 +1,289 @@ +/* + * ppc64 MMU hashtable management routines + * + * (c) Copyright IBM Corp. 2003 + * + * Maintained by: Benjamin Herrenschmidt + * + * + * This file is covered by the GNU Public Licence v2 as + * described in the kernel's COPYING file. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + + .text + +/* + * Stackframe: + * + * +-> Back chain (SP + 256) + * | General register save area (SP + 112) + * | Parameter save area (SP + 48) + * | TOC save area (SP + 40) + * | link editor doubleword (SP + 32) + * | compiler doubleword (SP + 24) + * | LR save area (SP + 16) + * | CR save area (SP + 8) + * SP ---> +-- Back chain (SP + 0) + */ +#define STACKFRAMESIZE 256 + +/* Save parameters offsets */ +#define STK_PARM(i) (STACKFRAMESIZE + 48 + ((i)-3)*8) + +/* Save non-volatile offsets */ +#define STK_REG(i) (112 + ((i)-14)*8) + +/* + * _hash_page(unsigned long ea, unsigned long access, unsigned long vsid, + * pte_t *ptep, unsigned long trap, int local) + * + * Adds a page to the hash table. 
This is the non-LPAR version for now + */ + +_GLOBAL(__hash_page) + mflr r0 + std r0,16(r1) + stdu r1,-STACKFRAMESIZE(r1) + /* Save all params that we need after a function call */ + std r6,STK_PARM(r6)(r1) + std r8,STK_PARM(r8)(r1) + + /* Add _PAGE_PRESENT to access */ + ori r4,r4,_PAGE_PRESENT + + /* Save non-volatile registers. + * r31 will hold "old PTE" + * r30 is "new PTE" + * r29 is "va" + * r28 is a hash value + * r27 is hashtab mask (maybe dynamic patched instead ?) + */ + std r27,STK_REG(r27)(r1) + std r28,STK_REG(r28)(r1) + std r29,STK_REG(r29)(r1) + std r30,STK_REG(r30)(r1) + std r31,STK_REG(r31)(r1) + + /* Step 1: + * + * Check permissions, atomically mark the linux PTE busy + * and hashed. + */ +1: + ldarx r31,0,r6 + /* Check access rights (access & ~(pte_val(*ptep))) */ + andc. r0,r4,r31 + bne- htab_wrong_access + /* Check if PTE is busy */ + andi. r0,r31,_PAGE_BUSY + bne- 1b + /* Prepare new PTE value (turn access RW into DIRTY, then + * add BUSY,HASHPTE and ACCESSED) + */ + rlwinm r30,r4,5,24,24 /* _PAGE_RW -> _PAGE_DIRTY */ + or r30,r30,r31 + ori r30,r30,_PAGE_BUSY | _PAGE_ACCESSED | _PAGE_HASHPTE + /* Write the linux PTE atomically (setting busy) */ + stdcx. r30,0,r6 + bne- 1b + + + /* Step 2: + * + * Insert/Update the HPTE in the hash table. At this point, + * r4 (access) is re-useable, we use it for the new HPTE flags + */ + + /* Calc va and put it in r29 */ + rldicr r29,r5,28,63-28 + rldicl r3,r3,0,36 + or r29,r3,r29 + + /* Calculate hash value for primary slot and store it in r28 */ + rldicl r5,r5,0,25 /* vsid & 0x0000007fffffffff */ + rldicl r0,r3,64-12,48 /* (ea >> 12) & 0xffff */ + xor r28,r5,r0 + + /* Convert linux PTE bits into HW equivalents. Fix using + * mask inserts instead + */ + rlwinm r3,r30,32-1,31,31 /* _PAGE_USER -> PP lsb */ + rlwinm r0,r30,32-2,31,31 /* _PAGE_RW -> r0 lsb */ + rlwinm r4,r30,32-7,31,31 /* _PAGE_DIRTY -> r4 lsb */ + and r0,r0,r4 /* _PAGE_RW & _PAGE_DIRTY -> r0 lsb */ + andc r3,r3,r0 /* PP lsb &= ~(PAGE_RW & _PAGE_DIRTY) */ + andi. r4,r30,_PAGE_USER /* _PAGE_USER -> r4 msb */ + or r3,r3,r4 /* PP msb = r4 msb */ + andi. r0,r30,0x1f8 /* Add in other flags */ + or r3,r3,r0 + + /* We eventually do the icache sync here (maybe inline that + * code rather than call a C function... + */ +BEGIN_FTR_SECTION + mr r4,r30 + mr r5,r7 + bl .hash_page_do_lazy_icache +END_FTR_SECTION_IFSET(CPU_FTR_NOEXECUTE) + + /* At this point, r3 contains new PP bits, save them in + * place of "access" in the param area (sic) + */ + std r3,STK_PARM(r4)(r1) + + /* Get htab_hash_mask */ + ld r4,htab_data at got(2) + ld r27,16(r4) /* htab_data.htab_hash_mask -> r27 */ + + /* Check if we may already be in the hashtable, in this case, we + * go to out-of-line code to try to modify the HPTE + */ + andi. r0,r31,_PAGE_HASHPTE + bne htab_modify_pte + +htab_insert_pte: + /* Clear hpte bits in new pte (we also clear BUSY btw) and + * add _PAGE_HASHPTE + */ + lis r0,_PAGE_HPTEFLAGS at h + ori r0,r0,_PAGE_HPTEFLAGS at l + andc r30,r30,r0 + ori r30,r30,_PAGE_HASHPTE + +1: + /* page number in r5 */ + rldicl r5,r31,64-PTE_SHIFT,PTE_SHIFT + + /* Calculate primary group hash */ + and r0,r28,r27 + rldicr r3,r0,3,63-3 /* r0 = (hash & mask) << 3 */ + + /* Call ppc_md.hpte_insert */ + ld r7,STK_PARM(r4)(r1) /* Retreive new pp bits */ + mr r4,r29 /* Retreive va */ + li r6,0 /* primary slot * + li r8,0 /* not bolted and not large */ + li r9,0 +_GLOBAL(htab_call_hpte_insert1) + bl . 
/* Will be patched by htab_finish_init() */ + cmpi 0,r3,0 + bge htab_pte_insert_ok /* Insertion successful */ + cmpi 0,r3,-2 /* Critical failure */ + beq- htab_pte_insert_failure + + /* Now try secondary slot */ + ori r30,r30,_PAGE_SECONDARY + + /* page number in r5 */ + rldicl r5,r31,64-PTE_SHIFT,PTE_SHIFT + + /* Calculate secondary group hash */ + not r3,r28 + and r0,r3,r27 + rldicr r3,r0,3,63-3 /* r0 = (~hash & mask) << 3 */ + + /* Call ppc_md.hpte_insert */ + ld r7,STK_PARM(r4)(r1) /* Retreive new pp bits */ + mr r4,r29 /* Retreive va */ + li r6,1 /* secondary slot * + li r8,0 /* not bolted and not large */ + li r9,0 +_GLOBAL(htab_call_hpte_insert2) + bl . /* Will be patched by htab_finish_init() */ + cmpi 0,r3,0 + bge+ htab_pte_insert_ok /* Insertion successful */ + cmpi 0,r3,-2 /* Critical failure */ + beq- htab_pte_insert_failure + + /* Both are full, we need to evict something */ + mftb r0 + /* Pick a random group based on TB */ + andi. r0,r0,1 + mr r5,r28 + bne 2f + not r5,r5 +2: and r0,r5,r27 + rldicr r3,r0,3,63-3 /* r0 = (hash & mask) << 3 */ + /* Call ppc_md.hpte_remove */ +_GLOBAL(htab_call_hpte_remove) + bl . /* Will be patched by htab_finish_init() */ + + /* Try all again */ + b 1b + +htab_pte_insert_ok: + /* Insert slot number in PTE */ + rldimi r30,r3,12,63-14 + + /* Write out the PTE with a normal write + * (maybe add eieio may be good still ?) + */ +htab_write_out_pte: + ld r6,STK_PARM(r6)(r1) + std r30,0(r6) + li r3, 0 +bail: + ld r27,STK_REG(r27)(r1) + ld r28,STK_REG(r28)(r1) + ld r29,STK_REG(r29)(r1) + ld r30,STK_REG(r30)(r1) + ld r31,STK_REG(r31)(r1) + addi r1,r1,STACKFRAMESIZE + ld r0,16(r1) + mtlr r0 + blr + +htab_modify_pte: + /* Keep PP bits in r4 and slot idx from the PTE around in r3 */ + mr r4,r3 + rlwinm r3,r31,32-12,29,31 + + /* Secondary group ? if yes, get a inverted hash value */ + mr r5,r28 + andi. r0,r31,_PAGE_SECONDARY + beq 1f + not r5,r5 +1: + /* Calculate proper slot value for ppc_md.hpte_updatepp */ + and r0,r5,r27 + rldicr r0,r0,3,63-3 /* r0 = (hash & mask) << 3 */ + add r3,r0,r3 /* add slot idx */ + + /* Call ppc_md.hpte_updatepp */ + mr r5,r29 /* va */ + li r6,0 /* large is 0 */ + ld r7,STK_PARM(r8)(r1) /* get "local" param */ +_GLOBAL(htab_call_hpte_updatepp) + bl . /* Will be patched by htab_finish_init() */ + + /* if we failed because typically the HPTE wasn't really here + * we try an insertion. + */ + cmpi 0,r3,-1 + beq- htab_insert_pte + + /* Clear the BUSY bit and Write out the PTE */ + li r0,_PAGE_BUSY + andc r30,r30,r0 + b htab_write_out_pte + +htab_wrong_access: + /* Bail out clearing reservation */ + stdcx. r31,0,r6 + li r3,1 + b bail + +htab_pte_insert_failure: + b .htab_insert_failure + + diff -urN linux-g5-ppc64/arch/ppc64/mm/hugetlbpage.c linux-g5-htab/arch/ppc64/mm/hugetlbpage.c --- linux-g5-ppc64/arch/ppc64/mm/hugetlbpage.c 2003-12-02 13:11:59.000000000 +1100 +++ linux-g5-htab/arch/ppc64/mm/hugetlbpage.c 2003-12-08 15:52:18.100012832 +1100 @@ -655,10 +655,6 @@ unsigned long hpteflags, prpn; long slot; - /* Is this for us? 
*/ - if (!in_hugepage_area(mm->context, ea)) - return -1; - ea &= ~(HPAGE_SIZE-1); /* We have to find the first hugepte in the batch, since diff -urN linux-g5-ppc64/include/asm-ppc64/mmu.h linux-g5-htab/include/asm-ppc64/mmu.h --- linux-g5-ppc64/include/asm-ppc64/mmu.h 2003-12-01 14:40:29.000000000 +1100 +++ linux-g5-htab/include/asm-ppc64/mmu.h 2003-12-09 17:23:43.436554256 +1100 @@ -225,6 +225,8 @@ asm volatile("ptesync": : :"memory"); } +extern void htab_finish_init(void); + #endif /* __ASSEMBLY__ */ /* diff -urN linux-g5-ppc64/include/asm-ppc64/pgtable.h linux-g5-htab/include/asm-ppc64/pgtable.h --- linux-g5-ppc64/include/asm-ppc64/pgtable.h 2003-12-05 13:58:59.000000000 +1100 +++ linux-g5-htab/include/asm-ppc64/pgtable.h 2003-12-09 13:57:28.174879992 +1100 @@ -74,22 +74,23 @@ * Bits in a linux-style PTE. These match the bits in the * (hardware-defined) PowerPC PTE as closely as possible. */ -#define _PAGE_PRESENT 0x001UL /* software: pte contains a translation */ -#define _PAGE_USER 0x002UL /* matches one of the PP bits */ -#define _PAGE_RW 0x004UL /* software: user write access allowed */ -#define _PAGE_GUARDED 0x008UL -#define _PAGE_COHERENT 0x010UL /* M: enforce memory coherence (SMP systems) */ -#define _PAGE_NO_CACHE 0x020UL /* I: cache inhibit */ -#define _PAGE_WRITETHRU 0x040UL /* W: cache write-through */ -#define _PAGE_DIRTY 0x080UL /* C: page changed */ -#define _PAGE_ACCESSED 0x100UL /* R: page referenced */ -#define _PAGE_FILE 0x200UL /* software: pte holds file offset */ -#define _PAGE_HASHPTE 0x400UL /* software: pte has an associated HPTE */ -#define _PAGE_EXEC 0x800UL /* software: i-cache coherence required */ -#define _PAGE_SECONDARY 0x8000UL /* software: HPTE is in secondary group */ -#define _PAGE_GROUP_IX 0x7000UL /* software: HPTE index within group */ +#define _PAGE_PRESENT 0x0001 /* software: pte contains a translation */ +#define _PAGE_USER 0x0002 /* matches one of the PP bits */ +#define _PAGE_FILE 0x0002 /* (!present only) software: pte holds file offset */ +#define _PAGE_RW 0x0004 /* software: user write access allowed */ +#define _PAGE_GUARDED 0x0008 +#define _PAGE_COHERENT 0x0010 /* M: enforce memory coherence (SMP systems) */ +#define _PAGE_NO_CACHE 0x0020 /* I: cache inhibit */ +#define _PAGE_WRITETHRU 0x0040 /* W: cache write-through */ +#define _PAGE_DIRTY 0x0080 /* C: page changed */ +#define _PAGE_ACCESSED 0x0100 /* R: page referenced */ +#define _PAGE_EXEC 0x0200 /* software: i-cache coherence required */ +#define _PAGE_HASHPTE 0x0400 /* software: pte has an associated HPTE */ +#define _PAGE_BUSY 0x0800 /* software: PTE & hash are busy */ +#define _PAGE_SECONDARY 0x8000 /* software: HPTE is in secondary group */ +#define _PAGE_GROUP_IX 0x7000 /* software: HPTE index within group */ /* Bits 0x7000 identify the index within an HPT Group */ -#define _PAGE_HPTEFLAGS (_PAGE_HASHPTE | _PAGE_SECONDARY | _PAGE_GROUP_IX) +#define _PAGE_HPTEFLAGS (_PAGE_BUSY | _PAGE_HASHPTE | _PAGE_SECONDARY | _PAGE_GROUP_IX) /* PAGE_MASK gives the right answer below, but only by accident */ /* It should be preserving the high 48 bits and then specifically */ /* preserving _PAGE_SECONDARY | _PAGE_GROUP_IX */ @@ -157,8 +158,10 @@ #define _PMD_HUGEPAGE 0x00000001U #define HUGEPTE_BATCH_SIZE (1<<(HPAGE_SHIFT-PMD_SHIFT)) +#ifndef __ASSEMBLY__ int hash_huge_page(struct mm_struct *mm, unsigned long access, unsigned long ea, unsigned long vsid, int local); +#endif /* __ASSEMBLY__ */ #define HAVE_ARCH_UNMAPPED_AREA #else @@ -288,9 +291,12 @@ unsigned long set ) { unsigned long old, 
tmp; - + extern void udbg_putc(unsigned char c); + __asm__ __volatile__( "1: ldarx %0,0,%3 # pte_update\n\ + andi. %1,%0,0x0800 \n\ + bne- 1b \n\ andc %1,%0,%4 \n\ or %1,%1,%5 \n\ stdcx. %1,0,%3 \n\ ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Tue Dec 9 19:02:41 2003 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 09 Dec 2003 19:02:41 +1100 Subject: hash table In-Reply-To: <1070956595.11006.183.camel@gaston> References: <1070956595.11006.183.camel@gaston> Message-ID: <1070956960.11009.186.camel@gaston> On Tue, 2003-12-09 at 18:56, Benjamin Herrenschmidt wrote: > Here's a first shot at my rework of __hash_page, if you want > to have a quick look... It did a few tests but didn't really > stress the box that badly, so there may be bugs in there. > > .../... And we also want that patch (which can be included in ameslab asap I suppose). It fixes the .got to be right before the .toc so that @got accesses done from assembly work properly. According to Alan Modra, the old stuff with .got in data segment was bogus. Ben. ===== arch/ppc64/kernel/vmlinux.lds.S 1.18 vs edited ===== --- 1.18/arch/ppc64/kernel/vmlinux.lds.S Fri Sep 12 21:01:40 2003 +++ edited/arch/ppc64/kernel/vmlinux.lds.S Mon Dec 8 19:04:13 2003 @@ -53,7 +53,6 @@ *(.data1) *(.sdata) *(.sdata2) - *(.got.plt) *(.got) *(.dynamic) CONSTRUCTORS } @@ -126,6 +125,7 @@ /* freed after init ends here */ __toc_start = .; + .got : { *(.got.plt) *(.got) } .toc : { *(.toc) } . = ALIGN(4096); __toc_end = .; ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olof at austin.ibm.com Wed Dec 10 07:20:19 2003 From: olof at austin.ibm.com (Olof Johansson) Date: Tue, 09 Dec 2003 14:20:19 -0600 Subject: hash table In-Reply-To: <1070956595.11006.183.camel@gaston> References: <1070956595.11006.183.camel@gaston> Message-ID: <3FD62E83.6070207@austin.ibm.com> Benjamin Herrenschmidt wrote: >I think there is no race with the flush code. The reason is that >the case where flush is called on a present page seem to be strictly >limited to a PP bits update (or an accessed bits update in some >error case, but we can dismiss that one completely I beleive). > > >Since flush uses pte_update, it will not race with a pending >_hash_page. The only possible race is a __hash_page occuring during >a flush. But in this case, the PTE will have the new PP bits already >so at worst, we exit flush with an entry present... but that has the >new bits. So it's ok. I don't think we can race on the content of >the HPTE neither as we have the HPTE lock bit there. I'd still >like your point of view though. > > I can see a race between the find_linux_pte() and the use of ptep in __hash_page. Another CPU can come in during that window and deallocate the PTE, can't it? One solution for this is to set _PAGE_BUSY in find_linux_pte() atomically during lookup. There's even more subtle races in the sense that the tree is walked while someone might update it underneath of the lookup, but maybe they can be ignored? Also two minor comments: * in pte_update, use _PAGE_BUSY instead of hardcoded 0x0800? Would increase readability a little. * in __hash_page / htab_wrong_access: There's no check for failed stdcx. -Olof ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From olof at austin.ibm.com Wed Dec 10 07:39:05 2003 From: olof at austin.ibm.com (Olof Johansson) Date: Tue, 09 Dec 2003 14:39:05 -0600 Subject: hash table In-Reply-To: <3FD62E83.6070207@austin.ibm.com> References: <1070956595.11006.183.camel@gaston> <3FD62E83.6070207@austin.ibm.com> Message-ID: <3FD632E9.2030606@austin.ibm.com> Olof Johansson wrote: > > There's even more subtle races in the sense that the tree is walked > while someone might update it > underneath of the lookup, but maybe they can be ignored? Hmm, I'm used to thinking about this in 2.4, where we don't have the HPTE lock bit. I'm guessing that will protect us here. -Olof ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Wed Dec 10 11:02:52 2003 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Wed, 10 Dec 2003 11:02:52 +1100 Subject: hash table In-Reply-To: <3FD62E83.6070207@austin.ibm.com> References: <1070956595.11006.183.camel@gaston> <3FD62E83.6070207@austin.ibm.com> Message-ID: <1071014571.12500.215.camel@gaston> > I can see a race between the find_linux_pte() and the use of ptep in > __hash_page. Another CPU can come in during that window and deallocate > the PTE, can't it? One solution for this is to set _PAGE_BUSY in > find_linux_pte() atomically during lookup. There's even more subtle > races in the sense that the tree is walked while someone might update it > underneath of the lookup, but maybe they can be ignored? Yup, this race is on my list already ;) I want to move find_linux_pte down into __hash_page anyway, but that's not how to fix this race. AFAIK, the only race is (very unlikely but definitely there) if we free a PTE page on one CPU while we are in hash_page() on another CPU. Paulus proposed a fix for this which consist of delaying the actual freeing of PTE pages. We gather them into a list that we free either after a given threshold or after a while at idle time. When we actually go to free it, we use an IPI to sync with othe CPUs, making sure they aren't in hash_page(). At that point, we'll have already cleared the pmd entries, so we know no CPU will go down to the PTE any more on a further hash_page(). >Also two minor comments: > > * in pte_update, use _PAGE_BUSY instead of hardcoded 0x0800? Would > increase readability a little. Yah, maybe, I didn't feel like adding another argument to the asm statement, I hate that syntax, but you are probably right ;) > * in __hash_page / htab_wrong_access: There's no check for failed stdcx. That's normal, the only point of this stdcx. is to not leave a dangling reservation, I don't care if it succeed as the value I'm writing back is the original value intact. Thanks for your comments, Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From brking at us.ibm.com Thu Dec 11 01:51:41 2003 From: brking at us.ibm.com (Brian King) Date: Wed, 10 Dec 2003 08:51:41 -0600 Subject: pci_map_single return value Message-ID: <3FD732FD.10903@us.ibm.com> Why is it that pci_map_single returns NO_TCE on failure on PPC64 and 0 on failure on IA64? Which one is correct? If NO_TCE is correct, then why is it defined in a ppc64 include? This makes it difficult for device drivers to actually check for it. -- Brian King eServer Storage I/O IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From boutcher at us.ibm.com Thu Dec 11 02:27:04 2003 From: boutcher at us.ibm.com (David Boutcher) Date: Wed, 10 Dec 2003 09:27:04 -0600 Subject: pci_map_single return value In-Reply-To: <3FD732FD.10903@us.ibm.com> Message-ID: Because 0 is a valid TCE in the current ppc64 implementation :-) Though actually Dave E made some changes lately that may make that not true any more, I'm not sure. owner-linuxppc64-dev at lists.linuxppc.org wrote on 12/10/2003 08:51:41 AM: > Why is it that pci_map_single returns NO_TCE on failure on PPC64 and 0 > on failure on IA64? Which one is correct? If NO_TCE is correct, then why > is it defined in a ppc64 include? This makes it difficult for device > drivers to actually check for it. Dave Boutcher IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olh at suse.de Thu Dec 11 03:45:53 2003 From: olh at suse.de (Olaf Hering) Date: Wed, 10 Dec 2003 17:45:53 +0100 Subject: bogus changes for generic pci_scan_slot() in ameslab-2.5 In-Reply-To: <1070385249.1123.1659.camel@tin.ibm.com > References: <20031128154151.GA30606@suse.de> <1070385249.1123.1659.camel@tin.ibm.com > Message-ID: <20031210164553.GB29983@suse.de> On Tue, Dec 02, Jake Moilanen wrote: > This looks like the workaround for the pci multifunc problem that is > seen on LPARs. The linux-2.6-pcibios-scan-all-fns-1.patch I posted in > the JS20 support email fixes the same problem, does not break ppc32, and > should probably used instead. I have tried this patch on the beige g3 and it booted ok. The one below should be reverted from ameslab. > On Fri, 2003-11-28 at 09:41, Olaf Hering wrote: > > Good morning, > > > > what is the purpose of this change? > > > > > > diff -purN linux-2.5/drivers/pci/probe.c linuxppc64-2.5/drivers/pci/probe.c > > --- linux-2.5/drivers/pci/probe.c 2003-08-06 15:34:30.000000000 +0000 > > +++ linuxppc64-2.5/drivers/pci/probe.c 2003-11-05 22:12:33.000000000 +0000 > > @@ -552,6 +552,7 @@ int __devinit pci_scan_slot(struct pci_b > > struct pci_dev *dev; > > > > dev = pci_scan_device(bus, devfn); > > +#if 0 > > if (func == 0) { > > if (!dev) > > break; > > @@ -560,6 +561,10 @@ int __devinit pci_scan_slot(struct pci_b > > continue; > > dev->multifunction = 1; > > } > > +#else > > + if (!dev) > > + continue; > > +#endif > > > > /* Fix up broken headers */ > > pci_fixup_device(PCI_FIXUP_HEADER, dev); > > > > It breaks on ppc32, B&W G3, dies in indirect_read_config() because the > > pointer *cfg_data becomes bogus, devfn is > 0xff (no idea if that > > matters). > > > > turning #if 0 into #if 1 cures it. I havent tried it on other systems > > yet, but at least a PReP MTX+ works with the patch above. > > > > > > > > -- > > USB is for mice, FireWire is for men! > > > > sUse lINUX ag, n?RNBERG > > > -- USB is for mice, FireWire is for men! sUse lINUX ag, n?RNBERG ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From nathanl at austin.ibm.com Thu Dec 11 11:52:06 2003 From: nathanl at austin.ibm.com (Nathan Lynch) Date: Wed, 10 Dec 2003 18:52:06 -0600 Subject: lparcfg code In-Reply-To: <20031210215427.CDC3E24064@source.scl.ameslab.gov> References: <20031210215427.CDC3E24064@source.scl.ameslab.gov> Message-ID: <3FD7BFB6.1000102@austin.ibm.com> Hi- I noticed this just got committed. 
ppc64 at source.scl.ameslab.gov wrote: > full patch URL: > http://source.scl.ameslab.gov:14690//linux-2.5/patch at 1.1343 > > ChangeSet > 1.1343 03/12/10 15:46:38 will_schmidt at vnet.ibm.com +5 -0 > add/forward port of lparcfg > > arch/ppc64/kernel/lparcfg.c > 1.1 03/12/10 15:43:51 will_schmidt at vnet.ibm.com +474 -0 > > include/asm-ppc64/hvcall.h > 1.10 03/12/10 15:43:51 will_schmidt at vnet.ibm.com +17 -0 > add hcall for 4 output parms > > arch/ppc64/kernel/pSeries_hvCall.S > 1.6 03/12/10 15:43:51 will_schmidt at vnet.ibm.com +38 -0 > add hcall for 4 output parms > > arch/ppc64/kernel/lparcfg.c > 1.0 03/12/10 15:43:51 will_schmidt at vnet.ibm.com +0 -0 > BitKeeper file /development/willschm/kernels/bk25.dec10/linux-2.5/arch/ppc64/kernel/lparcfg.c > > arch/ppc64/kernel/Makefile > 1.30 03/12/10 15:43:51 will_schmidt at vnet.ibm.com +2 -0 > add/forward port of lparcfg > > arch/ppc64/Kconfig > 1.31 03/12/10 15:43:51 will_schmidt at vnet.ibm.com +4 -0 > add/forward port of lparcfg > > ======== ChangeSet 1.1343 ======== > will_schmidt at vnet.ibm.com|ChangeSet|20031210214638|52756 The lparcfg thing doesn't want to build as a module: *** Warning: "cur_cpu_spec" [arch/ppc64/kernel/lparcfg.ko] undefined! *** Warning: "systemcfg" [arch/ppc64/kernel/lparcfg.ko] undefined! *** Warning: ".plpar_hcall_4out" [arch/ppc64/kernel/lparcfg.ko] undefined! CC arch/ppc64/kernel/lparcfg.mod.o LD [M] arch/ppc64/kernel/lparcfg.ko Do we really need it to be a module? To me, it seems like something that could just be a boolean config option instead of tristate. Some other points and questions: - the init function in arch/ppc64/kernel/lparcfg.c should check whether this is an lpar system; if not, it should return without creating /proc/ppc64/lparcfg. - the way the lparcfg code uses the proc_dir_entry->data pointer as a "scratch" buffer before copying to user space is pretty weird. The read function should allocate a separate buffer for each invocation, or use the stack. The data member of proc_dir_entry is usually used for symlinks or static data, such as device tree properties. - the lparcfg_open function seems unnecessary. - the h_get_ppp function should be declared static, I suspect. It should also check the return value of plpar_hcall_4out. - functions' opening braces should be on their own lines. - why is the pSeries version of lparcfg_data gathering information (e.g. serial number, system type) that is already available to users in /proc/device-tree? - in 2.5, of_find_node_by_path should be used instead of find_path_device. - I noticed a few lines like this: if (cur_cpu_spec->firmware_features && FW_FEATURE_SPLPAR) I think these should be bitwise-and'ing instead of logical. - Is /proc/ppc64/lparcfg going to be writable in order to change certain parameters? I am doing some work along these lines and I don't want to duplicate effort. Thanks, Nathan ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From paulus at samba.org Thu Dec 11 15:39:39 2003 From: paulus at samba.org (Paul Mackerras) Date: Thu, 11 Dec 2003 15:39:39 +1100 Subject: lparcfg code In-Reply-To: <3FD7BFB6.1000102@austin.ibm.com> References: <20031210215427.CDC3E24064@source.scl.ameslab.gov> <3FD7BFB6.1000102@austin.ibm.com> Message-ID: <16343.62731.254445.650129@cargo.ozlabs.ibm.com> Nathan Lynch writes: > I noticed this just got committed. 
> > ppc64 at source.scl.ameslab.gov wrote: > > full patch URL: > > http://source.scl.ameslab.gov:14690//linux-2.5/patch at 1.1343 > > > > ChangeSet > > 1.1343 03/12/10 15:46:38 will_schmidt at vnet.ibm.com +5 -0 > > add/forward port of lparcfg I noticed that the config option doesn't have any help text. It's not necessarily obvious to people what "LPAR configuration data" is or whether they might want it. Will, please add some reasonable help text. > - functions' opening braces should be on their own lines. Yes. Paul. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olof at austin.ibm.com Fri Dec 12 04:47:47 2003 From: olof at austin.ibm.com (Olof Johansson) Date: Thu, 11 Dec 2003 11:47:47 -0600 Subject: [2.5] [PATCH] Don't loop looking for free SLB entries in do_slb_bolted Message-ID: <3FD8ADC3.4090401@austin.ibm.com> Anton and I have done some work in 2.4 to optimize the SLB reload path. One big time waster for a busy system is the search for a free entry. This is the corresponding patch for 2.5/2.6. It's smaller since we don't have to do slbie's on 2.6. For a system with smaller working set than the SLB can fit (<16GB), evicting a used entry is no biggie: It'll be fauled in again and we'll reach a stable state after a few iterations through this. For systems with larger working sets, there is no stable state and we will save alot of time by not scanning all 63 entries on every fault just to find them all valid and fall back to round-robin. Comments are welcome. I haven't tested the 2.6 patch as much as 2.4 due to lack of big hardware and good workloads. -Olof -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: slb-noloop-patch.25 Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20031211/da429f90/attachment.txt From anton at samba.org Fri Dec 12 06:40:05 2003 From: anton at samba.org (Anton Blanchard) Date: Fri, 12 Dec 2003 06:40:05 +1100 Subject: [2.5] [PATCH] Don't loop looking for free SLB entries in do_slb_bolted In-Reply-To: <3FD8ADC3.4090401@austin.ibm.com> References: <3FD8ADC3.4090401@austin.ibm.com> Message-ID: <20031211194005.GB17683@krispykreme> Hi Olof, > Anton and I have done some work in 2.4 to optimize the SLB reload > path. One big time waster for a busy system is the search for a free > entry. This is the corresponding patch for 2.5/2.6. It's smaller since > we don't have to do slbie's on 2.6. Heres what Im testing at the moment. Its a fairly big patch but Im really hoping to get our SLB reload overhead under control in 2.6 :) - nop out some stuff that is POWER3/RS64 specific - we were checking some bits in the DSISR in DataAccess_common that I cant find in the architecture manual so I nuked it. (0xa4500000 -> 0x04500000) - put do_slb_bolted on a diet, dont do a search for empty entries, similar to Olofs patch. Use the POWER4 optimsed mtcrf instruction. - flush the kernel segment out of the SLB on context switch to avoid the race where the translation is in the ERAT but not in the SLB and it gets invalidated by another cpu doing tlbie at just the wrong time (eg exception exit after srr0/srr1 has been loaded) - split segment handling and slb handling code apart. - preload PC, SP and TASK_UNMAPPED_BASE segments on a context switch. - create an SLB cache and only flush those segments if possible on a context switch. - optimise switch_mm, we were flushing the stab/slb more often than we needed to (eg when switching between a user task and a kernel thread). 
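[Editorial aside -- not part of Anton's mail: his reload.patch was attached rather than inlined, so as a purely illustrative sketch of the "SLB cache" item in the list above, the idea is roughly the following. All names (slb_cache, slb_cache_ptr, slbie_user_segment, slb_flush_all) and the cache size are placeholders of mine, not code from the patch; per-cpu storage and the isync/class-bit details of slbie are glossed over.]

/* Sketch: remember which user segments were faulted into the SLB since
 * the last context switch, so switch_mm() can invalidate just those with
 * individual slbie's instead of wiping all 64 entries with slbia.
 */
#define SLB_CACHE_ENTRIES 8

static unsigned long slb_cache[SLB_CACHE_ENTRIES];  /* user ESIDs */
static int slb_cache_ptr;       /* > SLB_CACHE_ENTRIES means "overflowed" */

/* Called from the SLB miss path after inserting a user entry. */
static void slb_cache_note(unsigned long esid)
{
        if (slb_cache_ptr < SLB_CACHE_ENTRIES)
                slb_cache[slb_cache_ptr] = esid;
        slb_cache_ptr++;
}

/* Called from switch_mm() when switching to a different user mm. */
static void switch_slb(void)
{
        int i;

        if (slb_cache_ptr <= SLB_CACHE_ENTRIES) {
                for (i = 0; i < slb_cache_ptr; i++)
                        slbie_user_segment(slb_cache[i]); /* one slbie each */
        } else {
                slb_flush_all();                          /* slbia fallback */
        }
        slb_cache_ptr = 0;
        /* ...then preload the new task's PC, stack and TASK_UNMAPPED_BASE
         * segments, as described in the list above. */
}

[End of aside.]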
It's been soaking on a large box for a while now. It's completely untested on POWER3 however :) Anton -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: reload.patch Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20031212/358726fc/attachment.txt From willschm at us.ibm.com Fri Dec 12 09:06:39 2003 From: willschm at us.ibm.com (Will Schmidt) Date: Thu, 11 Dec 2003 16:06:39 -0600 Subject: lparcfg code Message-ID: I've just pushed up some changes that include help text, got rid of those extra '&' chars, and moved the function opening braces. I will need to revisit the code before too long to fill in the guts of the get_splpar_potential_characteristics() function, and will see what I can do to resolve the rest of the comments. The purpose of the lparcfg interface was initially intended to be an interface for a License Manager tool to determine what sort of system capabilities exist. Some of the contents are available elsewhere in the device tree, but this is meant to be a 'one-stop shopping' interface. :-) -Will willschm at us.ibm.com Linux on PowerPC-64 Development IBM Rochester Paul Mackerras (sent by owner-linuxppc64-dev at lists.linuxppc.org) wrote on 12/10/2003 10:39 PM, To: Nathan Lynch, Will Schmidt/Rochester/IBM at IBMUS, cc: linuxppc64-dev at lists.linuxppc.org, Subject: Re: lparcfg code: Nathan Lynch writes: > I noticed this just got committed. > > ppc64 at source.scl.ameslab.gov wrote: > > full patch URL: > > http://source.scl.ameslab.gov:14690//linux-2.5/patch at 1.1343 > > > > ChangeSet > > 1.1343 03/12/10 15:46:38 will_schmidt at vnet.ibm.com +5 -0 > > add/forward port of lparcfg I noticed that the config option doesn't have any help text. It's not necessarily obvious to people what "LPAR configuration data" is or whether they might want it. Will, please add some reasonable help text. > - functions' opening braces should be on their own lines. Yes. Paul. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From nathanl at austin.ibm.com Fri Dec 12 10:11:00 2003 From: nathanl at austin.ibm.com (Nathan Lynch) Date: Thu, 11 Dec 2003 17:11:00 -0600 Subject: lparcfg code In-Reply-To: References: Message-ID: <3FD8F984.7050104@austin.ibm.com> Will Schmidt wrote: > I've just pushed up some changes that include help text, got rid of those > extra '&' chars, and moved the function opening braces. > I will need to revisit the code before too long to fill in the guts of the > get_splpar_potential_characteristics() function, and will see what I can > do to resolve the rest of the comments. > > The purpose of the lparcfg interface was initially intended to be an > interface for a License Manager tool to determine what sort of system > capabilities exist. Some of the contents are available elsewhere in the > device tree, but this is meant to be a 'one-stop shopping' interface. :-) Thanks, Will. I noticed a couple other little things (compiler warnings, some more && vs & stuff, and struct initializers) in the meantime, patch is attached. Look ok?
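As a hypothetical illustration of the && vs & point above (these lines are not from the attached patch; the helper name splpar_capable is made up): firmware_features is a bit mask, so a feature test needs a bitwise AND.

static int splpar_capable(void)
{
	/* Wrong: logical AND. FW_FEATURE_SPLPAR is a non-zero constant, so
	 * this is true whenever *any* firmware feature bit is set. */
	/* return cur_cpu_spec->firmware_features && FW_FEATURE_SPLPAR; */

	/* Right: bitwise AND tests the SPLPAR bit specifically. */
	return (cur_cpu_spec->firmware_features & FW_FEATURE_SPLPAR) != 0;
}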
Nathan -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: lparcfg_cleanup.patch Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20031211/69ae8ed8/attachment.txt From nathanl at austin.ibm.com Fri Dec 12 13:13:18 2003 From: nathanl at austin.ibm.com (Nathan Lynch) Date: Thu, 11 Dec 2003 20:13:18 -0600 Subject: [PATCH] make debugger optional, depend on CONFIG_DEBUG_KERNEL Message-ID: <3FD9243E.6030600@austin.ibm.com> Hi- Currently in the 2.5 tree, selecting "Kernel hacking" in the top level config menu entails enabling a debugger (either xmon or kdb). Patch allows one to not use a debugger at all, and ensures that one cannot enable a debugger without selecting CONFIG_DEBUG_KERNEL. Nathan -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: debugger_optional.patch Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20031211/405bc3c1/attachment.txt From paulus at au1.ibm.com Fri Dec 12 16:08:03 2003 From: paulus at au1.ibm.com (Paul Mackerras) Date: Fri, 12 Dec 2003 16:08:03 +1100 Subject: srp.h and viosrp.h Message-ID: <16345.19763.163666.223631@cargo.ozlabs.ibm.com> I just had a look at srp.h and viosrp.h in drivers/scsi in the ameslab 2.4 tree. These can't be submitted in their present form. Firstly, they have no copyright notice, and secondly, they have no comments at the top explaining what the files are and what they relate to (and very few comments explaining the definitions mean). Finally, why are these files in drivers/scsi rather than include/something? Paul. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Fri Dec 12 16:51:11 2003 From: anton at samba.org (Anton Blanchard) Date: Fri, 12 Dec 2003 16:51:11 +1100 Subject: [2.5] [PATCH] Don't loop looking for free SLB entries in do_slb_bolted In-Reply-To: <20031211194005.GB17683@krispykreme> References: <3FD8ADC3.4090401@austin.ibm.com> <20031211194005.GB17683@krispykreme> Message-ID: <20031212055111.GC17683@krispykreme> > Its been soaking on a large box for a while now. Its completely untested > on POWER3 however :) And we popped a bug in the patch. The valid bit is in the correct spot, no idea what we were moving from 2^11 (I suspect it was my bad spec reading). Anton diff -puN arch/ppc64/kernel/head.S~debug_slb_rewrite arch/ppc64/kernel/head.S --- foo_work/arch/ppc64/kernel/head.S~debug_slb_rewrite 2003-12-11 15:51:17.110465423 -0600 +++ foo_work-anton/arch/ppc64/kernel/head.S 2003-12-11 15:51:31.704685017 -0600 @@ -997,7 +997,9 @@ SLB_NUM_ENTRIES = 64 * non recoverable point (after setting srr0/1) - Anton */ slbmfee r21,r22 +#if 0 insrdi r21,r21,12,36 /* move valid bit 2^11 to 2^27 */ +#endif srdi r21,r21,27 /* * This is incorrect (r1 is not the kernel stack) if we entered _ ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Fri Dec 12 17:12:03 2003 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Fri, 12 Dec 2003 17:12:03 +1100 Subject: [PATCH] Fix race between pte_free and hash_page Message-ID: <1071209522.12496.344.camel@gaston> Hi ! As discussed earlier with Olof, there is (and has always been afaik) a race between freeing a PTE page and dereferencing PTEs in that page from hash_page on another CPU. Typically can happen with a threaded application if one thread unmap's something that the other thread still uses. 
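To spell the window out, here is an illustrative interleaving (not text from the patch below):

/*
 * CPU 0 (munmap path)                 CPU 1 (hash_page, no page_table_lock)
 * --------------------                -------------------------------------
 * pmd_clear(pmd)                       reads the old pmd value
 * pte_free(ptepage)                    ptep = pte_offset_kernel(pmd, ea)
 * page reused for something else       pte = *ptep   <- dereferences freed memory
 *
 * Deferring the actual free until after an RCU grace period guarantees that
 * every CPU has left any hash_page() invocation that could still hold a
 * pointer into the old PTE page.
 */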
After much thinking and at least 2 different implementations and Rusty wise advice, here is an implementation that should be both fast and scalable. The idea is to defer the freeing until some safe point where we know the PTE page will no longer be used (since the PMD has been cleared, since a point where we know all pending hash_page are completed). This is just what RCU gives us so let's use it. Unless I missed something, this should probably be applied to ameslab-2.5 now. ===== include/asm/pgalloc.h 1.11 vs edited ===== --- 1.11/include/asm-ppc64/pgalloc.h Fri Sep 19 16:55:11 2003 +++ edited/include/asm/pgalloc.h Fri Dec 12 17:07:17 2003 @@ -3,7 +3,10 @@ #include #include +#include +#include #include +#include extern kmem_cache_t *zero_cache; @@ -62,15 +65,54 @@ return NULL; } - -static inline void -pte_free_kernel(pte_t *pte) + +static inline void pte_free_kernel(pte_t *pte) { kmem_cache_free(zero_cache, pte); } #define pte_free(pte_page) pte_free_kernel(page_address(pte_page)) -#define __pte_free_tlb(tlb, pte) pte_free(pte) + +struct pte_freelist_batch +{ + struct rcu_head rcu; + unsigned int index; + struct page * pages[0]; +}; + +#define PTE_FREELIST_SIZE ((PAGE_SIZE - sizeof(struct pte_freelist_batch) / \ + sizeof(struct page *))) + +extern void pte_free_now(struct page *ptepage); +extern void pte_free_submit(struct pte_freelist_batch *batch); + +DECLARE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur); + +static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage) +{ + /* This is safe as we are holding page_table_lock */ + cpumask_t local_cpumask = cpumask_of_cpu(smp_processor_id()); + struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur); + + if (cpus_equal(tlb->mm->cpu_vm_mask, local_cpumask) || + cpus_equal(tlb->mm->cpu_vm_mask, CPU_MASK_NONE)) { + pte_free(ptepage); + return; + } + + if (*batchp == NULL) { + *batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC); + if (*batchp == NULL) { + pte_free_now(ptepage); + return; + } + } + (*batchp)->pages[(*batchp)->index++] = ptepage; + if ((*batchp)->index == PTE_FREELIST_SIZE) { + pte_free_submit(*batchp); + *batchp = NULL; + } +} #define check_pgt_cache() do { } while (0) ===== include/asm/tlb.h 1.9 vs edited ===== --- 1.9/include/asm-ppc64/tlb.h Tue Aug 19 12:46:23 2003 +++ edited/include/asm/tlb.h Fri Dec 12 13:48:28 2003 @@ -74,6 +74,8 @@ batch->index = i; } +extern void pte_free_finish(void); + static inline void tlb_flush(struct mmu_gather *tlb) { int cpu = smp_processor_id(); @@ -86,6 +88,8 @@ flush_hash_range(tlb->mm->context, batch->index, local); batch->index = 0; + + pte_free_finish(); } #endif /* _PPC64_TLB_H */ ===== arch/ppc64/mm/init.c 1.52 vs edited ===== --- 1.52/arch/ppc64/mm/init.c Fri Oct 24 00:10:29 2003 +++ edited/arch/ppc64/mm/init.c Fri Dec 12 17:09:58 2003 @@ -94,6 +94,52 @@ * include/asm-ppc64/tlb.h file -- tgall */ DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); +DEFINE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur); +unsigned long pte_freelist_forced_free; + +static void pte_free_smp_sync(void *arg) +{ + /* Do nothing, just ensure we sync with all CPUs */ +} + +/* This is only called when we are critically out of memory + * (and fail to get a page in pte_free_tlb). 
+ */ +void pte_free_now(struct page *ptepage) +{ + pte_freelist_forced_free++; + + smp_call_function(pte_free_smp_sync, NULL, 0, 1); + + pte_free(ptepage); +} + +static void pte_free_rcu_callback(void *arg) +{ + struct pte_freelist_batch *batch = arg; + unsigned int i; + + for (i = 0; i < batch->index; i++) + pte_free(batch->pages[i]); + free_page((unsigned long)batch); +} + +void pte_free_submit(struct pte_freelist_batch *batch) +{ + INIT_RCU_HEAD(&batch->rcu); + call_rcu(&batch->rcu, pte_free_rcu_callback, batch); +} + +void pte_free_finish(void) +{ + /* This is safe as we are holding page_table_lock */ + struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur); + + if (*batchp == NULL) + return; + pte_free_submit(*batchp); + *batchp = NULL; +} void show_mem(void) { ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From boutcher at us.ibm.com Fri Dec 12 18:38:46 2003 From: boutcher at us.ibm.com (David Boutcher) Date: Fri, 12 Dec 2003 01:38:46 -0600 Subject: srp.h and viosrp.h In-Reply-To: <16345.19763.163666.223631@cargo.ozlabs.ibm.com> Message-ID: owner-linuxppc64-dev at lists.linuxppc.org wrote on 12/11/2003 11:08:03 PM: > I just had a look at srp.h and viosrp.h in drivers/scsi in the ameslab > 2.4 tree. These can't be submitted in their present form. Firstly, > they have no copyright notice, and secondly, they have no comments at > the top explaining what the files are and what they relate to (and > very few comments explaining the definitions mean). Finally, why are > these files in drivers/scsi rather than include/something? Ya, Hollis already made similar comments. I'm in Germany right now with limited access to the network, so I'll fix them up next week. I debated where they should live....they are very SCSI oriented, so drivers/scsi seemed like a reasonable place. A fact that will be clearer once I put comments in them :-) Dave Boutcher IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olh at suse.de Sat Dec 13 02:07:31 2003 From: olh at suse.de (Olaf Hering) Date: Fri, 12 Dec 2003 16:07:31 +0100 Subject: [PATCH] boot Makefile dependency for ld.script Message-ID: <20031212150731.GA12935@suse.de> A ld.script update will not trigger a rebuild of the boot files. diff -purN linux-2.6.0-test11/arch/ppc64/boot/Makefile linux-2.6.0-test11.ppc32/arch/ppc64/boot/Makefile --- linux-2.6.0-test11/arch/ppc64/boot/Makefile 2003-11-26 21:45:27.000000000 +0100 +++ linux-2.6.0-test11.ppc32/arch/ppc64/boot/Makefile 2003-12-12 16:04:16.000000000 +0100 @@ -106,11 +106,11 @@ $(call obj-sec, $(required) $(initrd)): $(call addsection, $@) $(obj)/zImage: obj-boot += $(call obj-sec, $(required)) -$(obj)/zImage: $(call obj-sec, $(required)) $(obj-boot) $(obj)/addnote FORCE +$(obj)/zImage: $(call obj-sec, $(required)) $(obj-boot) $(obj)/zImage.lds $(obj)/addnote FORCE $(call if_changed,addnote) $(obj)/zImage.initrd: obj-boot += $(call obj-sec, $(required) $(initrd)) -$(obj)/zImage.initrd: $(call obj-sec, $(required) $(initrd)) $(obj-boot) $(obj)/addnote FORCE +$(obj)/zImage.initrd: $(call obj-sec, $(required) $(initrd)) $(obj-boot) $(obj)/zImage.lds $(obj)/addnote FORCE $(call if_changed,addnote) $(obj)/imagesize.c: vmlinux -- USB is for mice, FireWire is for men! sUse lINUX ag, n?RNBERG ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From anton at samba.org Sat Dec 13 03:17:01 2003 From: anton at samba.org (Anton Blanchard) Date: Sat, 13 Dec 2003 03:17:01 +1100 Subject: [PATCH] Fix race between pte_free and hash_page In-Reply-To: <1071209522.12496.344.camel@gaston> References: <1071209522.12496.344.camel@gaston> Message-ID: <20031212161701.GE17683@krispykreme> Hi Ben, > +static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage) > +{ > + /* This is safe as we are holding page_table_lock */ > + cpumask_t local_cpumask = cpumask_of_cpu(smp_processor_id()); > + struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur); > + > + if (cpus_equal(tlb->mm->cpu_vm_mask, local_cpumask) || > + cpus_equal(tlb->mm->cpu_vm_mask, CPU_MASK_NONE)) { > + pte_free(ptepage); > + return; > + } Looks good. Since we hold the pagetable lock, can we also check for mm users == 1 and take the fast path? Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olof at austin.ibm.com Sat Dec 13 04:20:12 2003 From: olof at austin.ibm.com (Olof Johansson) Date: Fri, 12 Dec 2003 11:20:12 -0600 Subject: [PATCH] Fix race between pte_free and hash_page In-Reply-To: <1071209522.12496.344.camel@gaston> References: <1071209522.12496.344.camel@gaston> Message-ID: <3FD9F8CC.2020003@austin.ibm.com> Benjamin Herrenschmidt wrote: >Unless I missed something, this should probably be applied to >ameslab-2.5 now. > > > Ben, I like it! Two comments and one question: * Unless I'm missing something myself, (*batchp)->index is never initialized. I guess we might get lucky most the time, but it could cause badness. * pte_freelist_forced_free is unprotected/nonatomic. It only seems to be used as an indicator of memory pressure so it's not a big problem. I'm guessing we don't really want to waste cycles on syncronization and can live with it not being exact. * Do we know how much extra IPI activity this causes on a fairly loaded system? It would be interesting to know. Thanks, -Olof ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Sat Dec 13 15:32:48 2003 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sat, 13 Dec 2003 15:32:48 +1100 Subject: [PATCH] Fix race between pte_free and hash_page In-Reply-To: <3FD9F8CC.2020003@austin.ibm.com> References: <1071209522.12496.344.camel@gaston> <3FD9F8CC.2020003@austin.ibm.com> Message-ID: <1071289968.12499.354.camel@gaston> On Sat, 2003-12-13 at 04:20, Olof Johansson wrote: > Benjamin Herrenschmidt wrote: > > >Unless I missed something, this should probably be applied to > >ameslab-2.5 now. > > > > > > > Ben, > > I like it! Two comments and one question: > > * Unless I'm missing something myself, (*batchp)->index is never > initialized. I guess we might get lucky most the time, but it could > cause badness. Right. New patch on the way. > * pte_freelist_forced_free is unprotected/nonatomic. It only seems to be > used as an indicator of memory pressure so it's not a big problem. I'm > guessing we don't really want to waste cycles on syncronization and can > live with it not being exact. Yup. > * Do we know how much extra IPI activity this causes on a fairly loaded > system? It would be interesting to know. I don't know at this point, I expect no much though, a system that can't get one page with GFP_ATOMIC must be close to total oom... Ben. ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From benh at kernel.crashing.org Sat Dec 13 15:52:40 2003 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sat, 13 Dec 2003 15:52:40 +1100 Subject: [PATCH] Fix race between pte_free and hash_page In-Reply-To: <20031212161701.GE17683@krispykreme> References: <1071209522.12496.344.camel@gaston> <20031212161701.GE17683@krispykreme> Message-ID: <1071291159.12499.365.camel@gaston> On Sat, 2003-12-13 at 03:17, Anton Blanchard wrote: > Looks good. Since we hold the pagetable lock, can we also check for > mm users == 1 and take the fast path? Yup. Here's a new one doing that. I also removed the test with CPU_MASK_NONE (seems useless especially with the mm_users one) and added proper initialization of the "index" field in the batch as pointed out by Olof. ===== include/asm/pgalloc.h 1.11 vs edited ===== --- 1.11/include/asm-ppc64/pgalloc.h Fri Sep 19 16:55:11 2003 +++ edited/include/asm/pgalloc.h Sat Dec 13 15:49:57 2003 @@ -3,7 +3,10 @@ #include #include +#include +#include #include +#include extern kmem_cache_t *zero_cache; @@ -62,15 +65,55 @@ return NULL; } - -static inline void -pte_free_kernel(pte_t *pte) + +static inline void pte_free_kernel(pte_t *pte) { kmem_cache_free(zero_cache, pte); } #define pte_free(pte_page) pte_free_kernel(page_address(pte_page)) -#define __pte_free_tlb(tlb, pte) pte_free(pte) + +struct pte_freelist_batch +{ + struct rcu_head rcu; + unsigned int index; + struct page * pages[0]; +}; + +#define PTE_FREELIST_SIZE ((PAGE_SIZE - sizeof(struct pte_freelist_batch) / \ + sizeof(struct page *))) + +extern void pte_free_now(struct page *ptepage); +extern void pte_free_submit(struct pte_freelist_batch *batch); + +DECLARE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur); + +static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage) +{ + /* This is safe as we are holding page_table_lock */ + cpumask_t local_cpumask = cpumask_of_cpu(smp_processor_id()); + struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur); + + if (atomic_read(&tlb->mm->mm_users) < 2 || + cpus_equal(tlb->mm->cpu_vm_mask, local_cpumask)) { + pte_free(ptepage); + return; + } + + if (*batchp == NULL) { + *batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC); + if (*batchp == NULL) { + pte_free_now(ptepage); + return; + } + (*batchp)->index = 0; + } + (*batchp)->pages[(*batchp)->index++] = ptepage; + if ((*batchp)->index == PTE_FREELIST_SIZE) { + pte_free_submit(*batchp); + *batchp = NULL; + } +} #define check_pgt_cache() do { } while (0) ===== include/asm/tlb.h 1.9 vs edited ===== --- 1.9/include/asm-ppc64/tlb.h Tue Aug 19 12:46:23 2003 +++ edited/include/asm/tlb.h Fri Dec 12 13:48:28 2003 @@ -74,6 +74,8 @@ batch->index = i; } +extern void pte_free_finish(void); + static inline void tlb_flush(struct mmu_gather *tlb) { int cpu = smp_processor_id(); @@ -86,6 +88,8 @@ flush_hash_range(tlb->mm->context, batch->index, local); batch->index = 0; + + pte_free_finish(); } #endif /* _PPC64_TLB_H */ ===== arch/ppc64/mm/init.c 1.52 vs edited ===== --- 1.52/arch/ppc64/mm/init.c Fri Oct 24 00:10:29 2003 +++ edited/arch/ppc64/mm/init.c Fri Dec 12 17:09:58 2003 @@ -94,6 +94,52 @@ * include/asm-ppc64/tlb.h file -- tgall */ DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); +DEFINE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur); +unsigned long pte_freelist_forced_free; + +static void pte_free_smp_sync(void *arg) +{ + /* Do nothing, just ensure we sync with all CPUs */ +} + +/* This is only called when we are critically out of 
memory + * (and fail to get a page in pte_free_tlb). + */ +void pte_free_now(struct page *ptepage) +{ + pte_freelist_forced_free++; + + smp_call_function(pte_free_smp_sync, NULL, 0, 1); + + pte_free(ptepage); +} + +static void pte_free_rcu_callback(void *arg) +{ + struct pte_freelist_batch *batch = arg; + unsigned int i; + + for (i = 0; i < batch->index; i++) + pte_free(batch->pages[i]); + free_page((unsigned long)batch); +} + +void pte_free_submit(struct pte_freelist_batch *batch) +{ + INIT_RCU_HEAD(&batch->rcu); + call_rcu(&batch->rcu, pte_free_rcu_callback, batch); +} + +void pte_free_finish(void) +{ + /* This is safe as we are holding page_table_lock */ + struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur); + + if (*batchp == NULL) + return; + pte_free_submit(*batchp); + *batchp = NULL; +} void show_mem(void) { ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From dwm at austin.ibm.com Tue Dec 16 04:06:58 2003 From: dwm at austin.ibm.com (Doug Maxey) Date: Mon, 15 Dec 2003 11:06:58 -0600 Subject: [RFC] JS20 IDE perf patch Message-ID: <200312151706.hBFH6wv1008098@falcon30.maxey.austin.rr.com> Howdy, Have done some investigation of the AMD IDE implementation on the JS20 and have come up with a set of patches to address this. The underlying assumption that the PCI bus can only do 66mhz is no longer true with the HT implementation on this platform. The specific issues seen were: 1) ide.c has code that assumes the PCI bus only runs between 20 and 66 mhz. 2) FW has not completely configured the chipset to run at the higher IDE bus speeds. 3) amd74xx.c does not recognize and cannot use the higher PCI bus speeds to enable the higher IDE speeds. Item 1 is addressed by creating ATA_PCIMHZ_MIN and ATA_PCIMHZ_MAX in linux/ide.h. These values are used in ide/ide.c and amd74xx.c. If there are arch specific values, they override the the values set in linux/ide.h via the arch specific header. Item 2 is addressed by fixup_amd_ide in arch/ppc64/kernel/pci.c. Item 3 uses the ATA_PCIMHZ* to calculate the PCI bus frequency in amd74xx.c and is used to determine the UDMA settings for the drives. Please review these changes. Comments are welcome. ++doug -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/octet-stream Size: 6001 bytes Desc: amd-js20-ide-perf.diff Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20031215/c7e40981/attachment.obj From lxiep at us.ibm.com Tue Dec 16 11:26:32 2003 From: lxiep at us.ibm.com (Linda Xie) Date: Mon, 15 Dec 2003 18:26:32 -0600 Subject: [PATCH] Add unlimited name length support to PHP core Message-ID: <3FDE5138.7000507@us.ltcfwd.linux.ibm.com> Hi All, The attached patch adds unlimited name length support to PHP core and some fixes needed in kobject.c and symlink.c. Please review it and send your comments by 12/19/03. Thanks, Linda -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: kobj-patch.patch Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20031215/300f2ff1/attachment.txt From benh at kernel.crashing.org Tue Dec 16 12:59:14 2003 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 16 Dec 2003 12:59:14 +1100 Subject: [RFC] JS20 IDE perf patch In-Reply-To: <200312151706.hBFH6wv1008098@falcon30.maxey.austin.rr.com> References: <200312151706.hBFH6wv1008098@falcon30.maxey.austin.rr.com> Message-ID: <1071539954.12497.443.camel@gaston> > Item 1 is addressed by creating ATA_PCIMHZ_MIN and ATA_PCIMHZ_MAX in > linux/ide.h. These values are used in ide/ide.c and amd74xx.c. If > there are arch specific values, they override the the values set in > linux/ide.h via the arch specific header. > > Item 2 is addressed by fixup_amd_ide in arch/ppc64/kernel/pci.c. > > Item 3 uses the ATA_PCIMHZ* to calculate the PCI bus frequency in > amd74xx.c and is used to determine the UDMA settings for the > drives. > > Please review these changes. Comments are welcome. I'd rather keep generic drivers/ide.c out of the picture and do things locally to the AMD IDE driver... Wouldn't it be possible to get the proper values in the OF device-tree and add code to the AMD driver to pick them from there ? Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From segher at kernel.crashing.org Wed Dec 17 01:54:26 2003 From: segher at kernel.crashing.org (Segher Boessenkool) Date: Tue, 16 Dec 2003 15:54:26 +0100 Subject: [RFC] JS20 IDE perf patch In-Reply-To: <200312151706.hBFH6wv1008098@falcon30.maxey.austin.rr.com> References: <200312151706.hBFH6wv1008098@falcon30.maxey.austin.rr.com> Message-ID: On 15-dec-03, at 18:06, Doug Maxey wrote: > Please review these changes. Comments are welcome. +#ifdef CONFIG_PPC_PSERIES +#define ATA_PCIMHZ_MIN 20 +#define ATA_PCIMHZ_MAX 512 +#endif /* CONFIG_PPC_PSERIES */ The 8111 HT is an 8-bit link at 200MHz, i.e. 800MB/s total or 400MB/s each way; that would translate into a 200MHz or a 100MHz PCI bus equivalent, whichever value you like better (and they're both somewhat bogus anyway), but 512 seems a bit high. Anything I'm missing here? Thanks, Segher ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From dwm at austin.ibm.com Wed Dec 17 02:59:54 2003 From: dwm at austin.ibm.com (Doug Maxey) Date: Tue, 16 Dec 2003 09:59:54 -0600 Subject: [RFC] JS20 IDE perf patch Message-ID: <200312161559.hBGFxsdd024620@localhost.localdomain> Ben, First, my bad, the "hack at falcon30..." was a garbled address, not using the same alias file on both machines... Correct addr here. Did not see a way to pass any interesting data, in particular the bus speed via pci_dev. Do you have a suggestion on that? The primary reason to change the generic ide.c was to allow the setting on the append line to turn on the changes. Cannot answer regarding the device tree contents. ++doug On Tue, 16 Dec 2003 12:59:14 +1100, Benjamin Herrenschmidt wrote: [snip] > >I'd rather keep generic drivers/ide.c out of the picture and do things >locally to the AMD IDE driver... Wouldn't it be possible to get the >proper values in the OF device-tree and add code to the AMD driver >to pick them from there ? > >Ben. ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From dwm at austin.ibm.com Wed Dec 17 03:19:19 2003 From: dwm at austin.ibm.com (Doug Maxey) Date: Tue, 16 Dec 2003 10:19:19 -0600 Subject: [RFC] JS20 IDE perf patch Message-ID: <200312161619.hBGGJJ8X024652@localhost.localdomain> Segher, Thanks for the info. Have not spent as much time with the 8111 spec as I probably should(: I was extrapolating the ATA_PCIMHZ_MAX for future systems that might have the maxed out PCIX bus. You never know what HW throws at you here. For this particular instance, the 100 would be good. I need to rework the amd74xx.c changes for that one. ++doug On Tue, 16 Dec 2003 15:54:26 +0100, Segher Boessenkool wrote: > >On 15-dec-03, at 18:06, Doug Maxey wrote: >> Please review these changes. Comments are welcome. > >+#ifdef CONFIG_PPC_PSERIES >+#define ATA_PCIMHZ_MIN 20 >+#define ATA_PCIMHZ_MAX 512 >+#endif /* CONFIG_PPC_PSERIES */ > > >The 8111 HT is an 8-bit link at 200MHz, i.e. 800MB/s total or 400MB/s >each way; >that would translate into a 200MHz or a 100MHz PCI bus equivalent, >whichever >value you like better (and they're both somewhat bogus anyway), but 512 >seems >a bit high. Anything I'm missing here? > > >Thanks, > >Segher > ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From gregkh at us.ibm.com Wed Dec 17 03:55:25 2003 From: gregkh at us.ibm.com (Greg KH) Date: Tue, 16 Dec 2003 08:55:25 -0800 Subject: [PATCH] Add unlimited name length support to PHP core In-Reply-To: <3FDE5138.7000507@us.ltcfwd.linux.ibm.com> References: <3FDE5138.7000507@us.ltcfwd.linux.ibm.com> Message-ID: <20031216165525.GA16517@us.ibm.com> On Mon, Dec 15, 2003 at 06:26:32PM -0600, Linda Xie wrote: > Hi All, > > > The attached patch adds unlimited name length support to PHP core > and some fixes needed in kobject.c and symlink.c. Please _ALWAYS_ send patches like this, that are not PPC specific, to the proper mailing lists (in this case, the pci hotplug one, and linux-kernel, or just linux-kernel, that would work.) Also, please explain why the change is needed to kobjects and sysfs, breaking them all up into individual patches if they do different things. > Please review it and send your comments by 12/19/03. Heh, asking for review by a specific date is not the nicest thing to do in the open source community :) thanks, greg k-h ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From johnrose at us.ibm.com Wed Dec 17 04:17:10 2003 From: johnrose at us.ibm.com (John H Rose) Date: Tue, 16 Dec 2003 11:17:10 -0600 Subject: [PATCH] Add unlimited name length support to PHP core In-Reply-To: <20031216165525.GA16517@us.ibm.com> Message-ID: Hi Greg- > Heh, asking for review by a specific date is not the nicest thing to do > in the open source community :) How can one convey the importance of timing for review comments without sounding inconsiderate? Given the reality of project deadlines and code drop dates, such requests seem like a necessary evil. Thoughts? 
Just curious- John ----------------------- John Rose pSeries Linux Development johnrose at austin.ibm.com Office: 512-838-0298 Tieline: 678-0298 gregkh at us.ltcfwd.linux.ibm.com@lists.linuxppc.org on 12/16/2003 10:55:25 AM Sent by: owner-linuxppc64-dev at lists.linuxppc.org To: lxiep at us.ltcfwd.linux.ibm.com cc: linuxppc64-dev at lists.linuxppc.org, Linda Xie/Austin/IBM at IBMUS Subject: Re: [PATCH] Add unlimited name length support to PHP core On Mon, Dec 15, 2003 at 06:26:32PM -0600, Linda Xie wrote: > Hi All, > > > The attached patch adds unlimited name length support to PHP core > and some fixes needed in kobject.c and symlink.c. Please _ALWAYS_ send patches like this, that are not PPC specific, to the proper mailing lists (in this case, the pci hotplug one, and linux-kernel, or just linux-kernel, that would work.) Also, please explain why the change is needed to kobjects and sysfs, breaking them all up into individual patches if they do different things. > Please review it and send your comments by 12/19/03. Heh, asking for review by a specific date is not the nicest thing to do in the open source community :) thanks, greg k-h ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From segher at kernel.crashing.org Wed Dec 17 04:22:56 2003 From: segher at kernel.crashing.org (Segher Boessenkool) Date: Tue, 16 Dec 2003 18:22:56 +0100 Subject: [RFC] JS20 IDE perf patch In-Reply-To: References: Message-ID: <7D17378E-2FEC-11D8-9472-000A95A4DC02@kernel.crashing.org> On 16-dec-03, at 17:51, Mark Hack wrote: > The HT link is 400Mhz DDR 8 bit. The 200Mhz bus is the default but > this is increased during init. > > The bus speed is therefore 400 x 2 MBytes /sec That's not what the AMD spec (http://www.amd.com/us-en/assets/content_type/ white_papers_and_tech_docs/24674.pdf , sorry if this link gets broken up) says -- right on the front page it says 200MHz DDR is max speed, and again it says so at page 39 (bottom). This DDR stuff is confusing sometimes ;-) Segher ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From gregkh at us.ibm.com Wed Dec 17 04:29:27 2003 From: gregkh at us.ibm.com (Greg KH) Date: Tue, 16 Dec 2003 09:29:27 -0800 Subject: [PATCH] Add unlimited name length support to PHP core In-Reply-To: References: <20031216165525.GA16517@us.ibm.com> Message-ID: <20031216172927.GA16655@us.ibm.com> On Tue, Dec 16, 2003 at 11:17:10AM -0600, John H Rose wrote: > > > Hi Greg- > > > Heh, asking for review by a specific date is not the nicest thing to do > > in the open source community :) > > How can one convey the importance of timing for review comments without > sounding inconsiderate? Given the reality of project deadlines and code > drop dates, such requests seem like a necessary evil. Thoughts? First off, don't say anything. Almost all maintainers are very responsive. If you don't hear back after a while, a repost is acceptable, and usually expected. If you still don't hear back, oh well... But for you to try to impose your project deadlines on community members is not a very nice thing to do. Odds are, none of them really care about your deadlines, as they have their own, differing schedules, and priorities :) Hope this helps, greg k-h ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From hollisb at us.ibm.com Wed Dec 17 06:17:26 2003 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Tue, 16 Dec 2003 13:17:26 -0600 Subject: OpenAFS? 
Message-ID: <7BD4DC5B-2FFC-11D8-BB49-000A95A0560C@us.ibm.com> I just got a random call yesterday asking about AFS on ppc64, and I don't think I've seen the issue(s) described anywhere... I remember hearing people complain OpenAFS didn't work on ppc64. One issue was the code trying to insert a function pointer into the syscall table at runtime, which doesn't work because it ends up inserting a pointer to the ppc64 function descriptor. Are there other issues? Has anyone seriously tried to make it work? -- Hollis Blanchard IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From hollisb at us.ibm.com Wed Dec 17 07:16:02 2003 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Tue, 16 Dec 2003 14:16:02 -0600 Subject: code review requests In-Reply-To: <20031216172927.GA16655@us.ibm.com> Message-ID: On Tuesday, Dec 16, 2003, at 11:29 US/Central, Greg KH wrote: > > First off, don't say anything. Almost all maintainers are very > responsive. That hasn't been my experience, at least... Maybe you've found some maintainers who aren't overworked?! :) Seriously, it seems to me that the "gets dropped" option ends up happening pretty frequently, unless you're working in an area that's personally interesting to the maintainer. (That goes for email RFCs as well, private or public.) Those areas of interest may not coincide with work assignments you've been given... -- Hollis Blanchard IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From jschopp at austin.ibm.com Wed Dec 17 07:22:09 2003 From: jschopp at austin.ibm.com (jschopp at austin.ibm.com) Date: Tue, 16 Dec 2003 14:22:09 -0600 (CST) Subject: [PATCH] minor lock improvements Message-ID: I've attached a patch that should help locks be just a smidge faster on ppc64 machines. I am not a performance guy so I ran the only benchmark I had handy (sdet) which I am unfortunatly not allowed to publish a number on to show the increase. I got a overall throughput increase of .436%, with a confidence of 95% that the increase is between .232% and .639%. I would expect other tests to show larger improvements (performance guys welcome to help me out here). The patch needs some feedback (comments in code show where) on how to do a couple things correctly. -------------- next part -------------- diff -Nru a/include/asm-ppc64/ppc_asm.h b/include/asm-ppc64/ppc_asm.h --- a/include/asm-ppc64/ppc_asm.h Tue Dec 16 14:15:20 2003 +++ b/include/asm-ppc64/ppc_asm.h Tue Dec 16 14:15:20 2003 @@ -44,10 +44,25 @@ ld ra,PACALPPACA+LPPACAANYINT(rb); /* Get pending interrupt flags */\ cmpldi 0,ra,0; -/* Macros to adjust thread priority for Iseries hardware multithreading */ +/* Macros to adjust thread priority for RPA hardware multithreading + * and iSeries hardware nultithreading. This way is kidof hackish, + * looking for suggestions on how to do it better. Joel S. 
+ */ +#ifdef CONFIG_HMT #define HMT_LOW or 1,1,1 #define HMT_MEDIUM or 2,2,2 #define HMT_HIGH or 3,3,3 +#else /* CONFIG_HMT */ +#ifdef CONFIG_PPC_ISERIES +#define HMT_LOW or 1,1,1 +#define HMT_MEDIUM or 2,2,2 +#define HMT_HIGH or 3,3,3 +#else /* CONFIG_PPC_ISERES */ +#define HMT_LOW +#define HMT_MEDIUM +#define HMT_HIGH +#endif /* CONFIG_PPC_ISERIES */ +#endif /* CONFIG_HMT */ /* Insert the high 32 bits of the MSR into what will be the new MSR (via SRR1 and rfid) This preserves the MSR.SF and MSR.ISF diff -Nru a/include/asm-ppc64/spinlock.h b/include/asm-ppc64/spinlock.h --- a/include/asm-ppc64/spinlock.h Tue Dec 16 14:15:20 2003 +++ b/include/asm-ppc64/spinlock.h Tue Dec 16 14:15:20 2003 @@ -22,7 +22,18 @@ * locking when running on an RPA platform. As we do more performance * tuning, I would expect this selection mechanism to change. Dave E. */ +/* XXX- Need some way to test if SPLPAR is possible on this machine + * this way is kindof hackish. HMT and SPLPAR don't really have anything + * to do with eachother. Open for suggestions. Joel S. + */ +#ifdef CONFIG_PPC_PSERIES +#ifndef CONFIG_HMT +#undef SPLPAR_LOCKS +#else /* CONIFG_HMT is defined */ #define SPLPAR_LOCKS +#endif /* CONFIG_HMT */ +#endif /* CONFIG_PPC_PSERIES */ + #define HVSC ".long 0x44000022\n" typedef struct { @@ -107,7 +118,7 @@ unsigned long tmp, tmp2; __asm__ __volatile__( - "b 2f # spin_lock\n\ + "b 3f # spin_lock\n\ 1:" HMT_LOW " ldx %0,0,%2 # load the lock value\n\ @@ -127,11 +138,12 @@ " b 1b\n\ 2: \n" HMT_MEDIUM -" ldarx %0,0,%2\n\ +"3: \n\ + ldarx %0,0,%2\n\ cmpdi 0,%0,0\n\ bne- 1b\n\ stdcx. 13,0,%2\n\ - bne- 2b\n\ + bne- 3b\n\ isync" : "=&r"(tmp), "=&r"(tmp2) : "r"(&lock->lock) @@ -148,10 +160,10 @@ HMT_LOW " ldx %0,0,%1 # load the lock value\n\ cmpdi 0,%0,0 # if not locked, try to acquire\n\ - bne+ 1b\n\ -2: \n" + bne+ 1b\n" HMT_MEDIUM -" ldarx %0,0,%1\n\ +"2: \n\ + ldarx %0,0,%1\n\ cmpdi 0,%0,0\n\ bne- 1b\n\ stdcx. 13,0,%1\n\ @@ -224,7 +236,7 @@ unsigned long tmp, tmp2; __asm__ __volatile__( - "b 2f # read_lock\n\ + "b 3f # read_lock\n\ 1:" HMT_LOW " ldx %0,0,%2\n\ @@ -247,11 +259,12 @@ sc # do the hcall \n\ 2: \n" HMT_MEDIUM -" ldarx %0,0,%2\n\ +"3:\n\ + ldarx %0,0,%2\n\ addic. %0,%0,1\n\ ble- 1b\n\ stdcx. %0,0,%2\n\ - bne- 2b\n\ + bne- 3b\n\ isync" : "=&r"(tmp), "=&r"(tmp2) : "r"(&rw->lock) @@ -265,7 +278,7 @@ unsigned long tmp, tmp2; __asm__ __volatile__( - "b 2f # read_lock\n\ + "b 3f # read_lock\n\ 1:" HMT_LOW " ldx %0,0,%2\n\ @@ -284,11 +297,12 @@ HVSC "2: \n" HMT_MEDIUM -" ldarx %0,0,%2\n\ +"3: \n\ + ldarx %0,0,%2\n\ addic. %0,%0,1\n\ ble- 1b\n\ stdcx. %0,0,%2\n\ - bne- 2b\n\ + bne- 3b\n\ isync" : "=&r"(tmp), "=&r"(tmp2) : "r"(&rw->lock) @@ -305,10 +319,10 @@ HMT_LOW " ldx %0,0,%1\n\ cmpdi 0,%0,0\n\ - blt+ 1b\n\ -2: \n" + blt+ 1b\n" HMT_MEDIUM -" ldarx %0,0,%1\n\ +"2: \n\ + ldarx %0,0,%1\n\ addic. %0,%0,1\n\ ble- 1b\n\ stdcx. %0,0,%1\n\ @@ -363,7 +377,7 @@ unsigned long tmp, tmp2; __asm__ __volatile__( - "b 2f # spin_lock\n\ + "b 3f # spin_lock\n\ 1:" HMT_LOW " ldx %0,0,%2 # load the lock value\n\ @@ -387,11 +401,12 @@ sc # do the hcall \n\ 2: \n" HMT_MEDIUM -" ldarx %0,0,%2\n\ +"3: \n\ + ldarx %0,0,%2\n\ cmpdi 0,%0,0\n\ bne- 1b\n\ stdcx. 
13,0,%2\n\ - bne- 2b\n\ + bne- 3b\n\ isync" : "=&r"(tmp), "=&r"(tmp2) : "r"(&rw->lock) @@ -405,7 +420,7 @@ unsigned long tmp, tmp2; __asm__ __volatile__( - "b 2f # spin_lock\n\ + "b 3f # spin_lock\n\ 1:" HMT_LOW " ldx %0,0,%2 # load the lock value\n\ @@ -427,11 +442,12 @@ " b 1b\n\ 2: \n" HMT_MEDIUM -" ldarx %0,0,%2\n\ +"3: \n\ + ldarx %0,0,%2\n\ cmpdi 0,%0,0\n\ bne- 1b\n\ stdcx. 13,0,%2\n\ - bne- 2b\n\ + bne- 3b\n\ isync" : "=&r"(tmp), "=&r"(tmp2) : "r"(&rw->lock) @@ -443,7 +459,7 @@ unsigned long tmp; __asm__ __volatile__( - "b 2f # spin_lock\n\ + "b 3f # spin_lock\n\ 1:" HMT_LOW " ldx %0,0,%1 # load the lock value\n\ @@ -451,11 +467,12 @@ bne+ 1b\n\ 2: \n" HMT_MEDIUM -" ldarx %0,0,%1\n\ +"3: \n\ + ldarx %0,0,%1\n\ cmpdi 0,%0,0\n\ bne- 1b\n\ stdcx. 13,0,%1\n\ - bne- 2b\n\ + bne- 3b\n\ isync" : "=&r"(tmp) : "r"(&rw->lock) From gregkh at us.ibm.com Wed Dec 17 08:04:18 2003 From: gregkh at us.ibm.com (Greg KH) Date: Tue, 16 Dec 2003 13:04:18 -0800 Subject: code review requests In-Reply-To: References: <20031216172927.GA16655@us.ibm.com> Message-ID: <20031216210418.GA29601@us.ibm.com> On Tue, Dec 16, 2003 at 02:16:02PM -0600, Hollis Blanchard wrote: > Seriously, it seems to me that the "gets dropped" option ends up > happening pretty frequently, unless you're working in an area that's > personally interesting to the maintainer. (That goes for email RFCs as > well, private or public.) Those areas of interest may not coincide with > work assignments you've been given... That's the joy/headache of dealing with the opensource community :) But telling someone that they need to review your patch by a specific date would do nothing more than piss off a overworked maintainer. It's just bad form, just like sending email questions privately to them: http://www.arm.linux.org.uk/news/?newsitem=11 greg k-h ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From peter at bergner.org Wed Dec 17 08:55:46 2003 From: peter at bergner.org (Peter Bergner) Date: Tue, 16 Dec 2003 15:55:46 -0600 Subject: OpenAFS? In-Reply-To: <7BD4DC5B-2FFC-11D8-BB49-000A95A0560C@us.ibm.com> References: <7BD4DC5B-2FFC-11D8-BB49-000A95A0560C@us.ibm.com> Message-ID: <1071611744.3077.29.camel@otta.rchland.ibm.com> On Tue, 2003-12-16 at 13:17, Hollis Blanchard wrote: > I just got a random call yesterday asking about AFS on ppc64, and I > don't think I've seen the issue(s) described anywhere... > > I remember hearing people complain OpenAFS didn't work on ppc64. One > issue was the code trying to insert a function pointer into the syscall > table at runtime, which doesn't work because it ends up inserting a > pointer to the ppc64 function descriptor. > > Are there other issues? Has anyone seriously tried to make it work? That's not enough of a problem for you!?!? ;-) Actually, I think SUSE is working on getting it to work. Try asking _Marcus_ on #ppc64 on how far he's gotten. Peter ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From dje at watson.ibm.com Wed Dec 17 09:27:13 2003 From: dje at watson.ibm.com (David Edelsohn) Date: Tue, 16 Dec 2003 17:27:13 -0500 Subject: OpenAFS? In-Reply-To: Message from Hollis Blanchard of "Tue, 16 Dec 2003 13:17:26 CST." 
<7BD4DC5B-2FFC-11D8-BB49-000A95A0560C@us.ibm.com> References: <7BD4DC5B-2FFC-11D8-BB49-000A95A0560C@us.ibm.com> Message-ID: <200312162227.hBGMRDT34098@makai.watson.ibm.com> >>>>> Hollis Blanchard writes: Hollis> I just got a random call yesterday asking about AFS on ppc64, and I Hollis> don't think I've seen the issue(s) described anywhere... Hollis> I remember hearing people complain OpenAFS didn't work on ppc64. One Hollis> issue was the code trying to insert a function pointer into the syscall Hollis> table at runtime, which doesn't work because it ends up inserting a Hollis> pointer to the ppc64 function descriptor. A port to IA-64 should have a similar problem. Has OpenAFS been ported to IA-64 Linux? David ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Wed Dec 17 10:23:16 2003 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Wed, 17 Dec 2003 10:23:16 +1100 Subject: [RFC] JS20 IDE perf patch In-Reply-To: <200312161559.hBGFxsdd024620@localhost.localdomain> References: <200312161559.hBGFxsdd024620@localhost.localdomain> Message-ID: <1071616996.753.26.camel@gaston> On Wed, 2003-12-17 at 02:59, Doug Maxey wrote: > Ben, > > First, my bad, the "hack at falcon30..." was a garbled address, not > using the same alias file on both machines... Correct addr here. > > Did not see a way to pass any interesting data, in particular the bus > speed via pci_dev. Do you have a suggestion on that? > > The primary reason to change the generic ide.c was to allow the > setting on the append line to turn on the changes. > > Cannot answer regarding the device tree contents. No, we don't have a good way yet to provide the bus speed per pci dev in a generic way afaik, though that would be a good addition I beleive. We could add a ppc-specific pcibios_get_bus_speed() thingy eventually... If the information is available in the device-tree, you can peek there too from the driver. The append line is fine for testing, I don't like the idea of having to rely on that for proper IDE timings in a product though Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From johnrose at austin.ibm.com Wed Dec 17 10:54:43 2003 From: johnrose at austin.ibm.com (John Rose) Date: Tue, 16 Dec 2003 17:54:43 -0600 Subject: [PATCH] xmon fixes for 2.6 Message-ID: <1071618882.7664.37.camel@verve> This patch was written mostly by Jake Moilanen. It fixes problems that prevented the use of xmon breakpoints on pSeries LPAR systems. Also updates help text to reflect obsolete subcommands, and adds some necessary lines for single stepping. If there are no comments on this, will push soon (thurs?). 
Thanks- John diff -Nru a/arch/ppc64/xmon/xmon.c b/arch/ppc64/xmon/xmon.c --- a/arch/ppc64/xmon/xmon.c Tue Dec 16 17:49:24 2003 +++ b/arch/ppc64/xmon/xmon.c Tue Dec 16 17:49:24 2003 @@ -37,9 +37,10 @@ #define skipbl xmon_skipbl #ifdef CONFIG_SMP -unsigned long cpus_in_xmon = 0; +volatile unsigned long cpus_in_xmon = 0; static unsigned long got_xmon = 0; static volatile int take_xmon = -1; +static volatile int leaving_xmon = 0; #endif /* CONFIG_SMP */ static unsigned long adrs; @@ -162,7 +163,6 @@ mz zero a block of memory\n\ mx translation information for an effective address\n\ mi show information about memory allocation\n\ - M print System.map\n\ p show the task list\n\ r print registers\n\ s single step\n\ @@ -227,7 +227,7 @@ xmon(struct pt_regs *excp) { struct pt_regs regs; - int cmd; + int cmd = 0; unsigned long msr; if (excp == NULL) { @@ -285,9 +285,15 @@ xmon_regs[smp_processor_id()] = excp; excprint(excp); #ifdef CONFIG_SMP - if (test_and_set_bit(smp_processor_id(), &cpus_in_xmon)) + leaving_xmon = 0; + /* possible race condition here if a CPU is held up and gets + * here while we are exiting */ + if (test_and_set_bit(smp_processor_id(), &cpus_in_xmon)) { + /* xmon probably caused an exception itself */ + printf("We are already in xmon\n"); for (;;) ; + } while (test_and_set_bit(0, &got_xmon)) { if (take_xmon == smp_processor_id()) { take_xmon = -1; @@ -304,6 +310,9 @@ if (cmd == 's') { xmon_trace[smp_processor_id()] = SSTEP; excp->msr |= MSR_SE; +#ifdef CONFIG_SMP + take_xmon = smp_processor_id(); +#endif } else if (at_breakpoint(excp->nip)) { xmon_trace[smp_processor_id()] = BRSTEP; excp->msr |= MSR_SE; @@ -313,7 +322,9 @@ } xmon_regs[smp_processor_id()] = 0; #ifdef CONFIG_SMP - clear_bit(0, &got_xmon); + leaving_xmon = 1; + if (cmd != 's') + clear_bit(0, &got_xmon); clear_bit(smp_processor_id(), &cpus_in_xmon); #endif /* CONFIG_SMP */ set_msrd(msr); /* restore interrupt enable */ @@ -421,8 +432,6 @@ int i; struct bpt *bp; - if (systemcfg->platform != PLATFORM_PSERIES) - return; bp = bpts; for (i = 0; i < NBPTS; ++i, ++bp) { if (!bp->enabled) @@ -450,9 +459,6 @@ struct bpt *bp; unsigned instr; - if (systemcfg->platform != PLATFORM_PSERIES) - return; - if ((cur_cpu_spec->cpu_features & CPU_FTR_DABR)) set_dabr(0); if ((cur_cpu_spec->cpu_features & CPU_FTR_IABR)) @@ -478,11 +484,15 @@ static int cmds(struct pt_regs *excp) { - int cmd; + int cmd = 0; last_cmd = NULL; for(;;) { #ifdef CONFIG_SMP + /* Need to check if we should take any commands on + this CPU. */ + if (leaving_xmon) + return cmd; printf("%d:", smp_processor_id()); #endif /* CONFIG_SMP */ printf("mon> "); @@ -832,6 +842,12 @@ } break; } + + if (!(systemcfg->platform & PLATFORM_PSERIES)) { + printf("Not supported for this platform\n"); + break; + } + bp = at_breakpoint(a); if (bp == 0) { for (bp = bpts; bp < &bpts[NBPTS]; ++bp) ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From paulus at samba.org Wed Dec 17 11:45:08 2003 From: paulus at samba.org (Paul Mackerras) Date: Wed, 17 Dec 2003 11:45:08 +1100 Subject: [PATCH] xmon fixes for 2.6 In-Reply-To: <1071618882.7664.37.camel@verve> References: <1071618882.7664.37.camel@verve> Message-ID: <16351.42772.717656.989995@cargo.ozlabs.ibm.com> John Rose writes: > This patch was written mostly by Jake Moilanen. It fixes problems > that prevented the use of xmon breakpoints on pSeries LPAR systems. > Also updates help text to reflect obsolete subcommands, and adds > some necessary lines for single stepping. Hmmmm. 
I am leaning towards the view that on entering xmon, we should send an IPI to try to get all cpus into xmon, and that on exiting xmon, we release all cpus. At the moment the breakpoint stuff is a real mess on SMP systems - we can easily get into the situation where xmon thinks that the trap instruction that it put at a particular place is the real instruction that should be there. I also think it is wrong to hang if we try to re-enter xmon. Regards, Paul. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Thu Dec 18 09:53:28 2003 From: anton at samba.org (Anton Blanchard) Date: Thu, 18 Dec 2003 09:53:28 +1100 Subject: pci_map_single return value In-Reply-To: <3FD732FD.10903@us.ibm.com> References: <3FD732FD.10903@us.ibm.com> Message-ID: <20031217225328.GB25456@krispykreme> Hi, > Why is it that pci_map_single returns NO_TCE on failure on PPC64 and 0 > on failure on IA64? Which one is correct? If NO_TCE is correct, then why > is it defined in a ppc64 include? This makes it difficult for device > drivers to actually check for it. I brought this up with the architecture maintainers and the concensus seems to be having a per arch pci_dma_error() function, like x86-64 does: extern dma_addr_t bad_dma_address; #define pci_dma_error(x) ((x) == bad_dma_address) We also need a consistent error return value from pci_map_sg(). Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From nathanl at austin.ibm.com Thu Dec 18 13:28:37 2003 From: nathanl at austin.ibm.com (Nathan Lynch) Date: Wed, 17 Dec 2003 20:28:37 -0600 Subject: [PATCH] Re: lparcfg code In-Reply-To: References: Message-ID: <3FE110D5.5060601@austin.ibm.com> I've attached a patch which gives /proc/ppc64/lparcfg a write() function for changing certain of the system parameters ("partition_entitled_capacity" and "capacity_weight"). Nathan -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: lparcfg_write.patch Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20031217/79e39e51/attachment.txt From boutcher at us.ibm.com Fri Dec 19 00:05:13 2003 From: boutcher at us.ibm.com (David Boutcher) Date: Thu, 18 Dec 2003 07:05:13 -0600 Subject: Removing ibmveth /proc stuff Message-ID: I noticed the following go by in a changeset: 1.1299 03/12/18 14:19:45 dgibson at superego.ozlabs.ibm.com +2 -0 Substantial cleanups to iSeries virtual ethernet driver (backported from ameslab 2.5). Removes the barely useful /proc stuff, amongst other things. There is iSeries documentation as well as testcases that rely on ibmveth information in /proc. Could you describe what you removed? Thanks, Dave B ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From meissner at suse.de Fri Dec 19 01:14:55 2003 From: meissner at suse.de (Marcus Meissner) Date: Thu, 18 Dec 2003 15:14:55 +0100 Subject: OpenAFS? In-Reply-To: <1071611744.3077.29.camel@otta.rchland.ibm.com> References: <7BD4DC5B-2FFC-11D8-BB49-000A95A0560C@us.ibm.com> <1071611744.3077.29.camel@otta.rchland.ibm.com> Message-ID: <20031218141454.GA27975@suse.de> On Tue, Dec 16, 2003 at 03:55:46PM -0600, Peter Bergner wrote: > > On Tue, 2003-12-16 at 13:17, Hollis Blanchard wrote: > > I just got a random call yesterday asking about AFS on ppc64, and I > > don't think I've seen the issue(s) described anywhere... > > > > I remember hearing people complain OpenAFS didn't work on ppc64. 
> > One issue was the code trying to insert a function pointer into the > > syscall table at runtime, which doesn't work because it ends up > > inserting a pointer to the ppc64 function descriptor. > > > > Are there other issues? Has anyone seriously tried to make it work? > > That's not enough of a problem for you!?!? ;-) Actually, I think SUSE > is working on getting it to work. Try asking _Marcus_ on #ppc64 on how > far he's gotten. Looks good so far, but I had to employ some pretty bad hacks, especial for the kernel system calltable hooking (as mentioned). Check the openafs*.patch files on: ftp://ftp.suse.com/pub/people/aj/afs-test/ # uname -a Linux tamarillo 2.4.21-128-pseries64 #1 SMP Tue Dec 16 14:11:31 UTC 2003 ppc64 unknown # ls /afs/suse.de/ . .. afsdoc space user usr # Ciao, Marcus ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From lxiep at us.ibm.com Fri Dec 19 08:05:14 2003 From: lxiep at us.ibm.com (Linda Xie) Date: Thu, 18 Dec 2003 15:05:14 -0600 Subject: [PATCH] -- export symbols needed by rpadlpar module Message-ID: <3FE2168A.1030503@us.ltcfwd.linux.ibm.com> Hi, The attached patch exports of_find_node_by_name and of_get_next_child needed by rpadlpar module. Comments are welcome. Linda diff -Nru a/arch/ppc64/kernel/prom.c b/arch/ppc64/kernel/prom.c --- a/arch/ppc64/kernel/prom.c Thu Dec 18 14:39:08 2003 +++ b/arch/ppc64/kernel/prom.c Thu Dec 18 14:39:08 2003 @@ -2185,6 +2185,7 @@ read_unlock(&devtree_lock); return np; } +EXPORT_SYMBOL(of_find_node_by_name); /** * of_find_node_by_type - Find a node by its "device_type" property @@ -2335,6 +2336,7 @@ read_unlock(&devtree_lock); return next; } +EXPORT_SYMBOL(of_get_next_child); /** * of_node_get - Increment refcount of a node ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From hozer at hozed.org Sat Dec 20 02:32:40 2003 From: hozer at hozed.org (Troy Benjegerdes) Date: Fri, 19 Dec 2003 09:32:40 -0600 Subject: srp.h and viosrp.h In-Reply-To: References: <16345.19763.163666.223631@cargo.ozlabs.ibm.com> Message-ID: <20031219153239.GH18377@kalmia.hozed.org> On Fri, Dec 12, 2003 at 01:38:46AM -0600, David Boutcher wrote: > > owner-linuxppc64-dev at lists.linuxppc.org wrote on 12/11/2003 11:08:03 PM: > > I just had a look at srp.h and viosrp.h in drivers/scsi in the ameslab > > 2.4 tree. These can't be submitted in their present form. Firstly, > > they have no copyright notice, and secondly, they have no comments at > > the top explaining what the files are and what they relate to (and > > very few comments explaining the definitions mean). Finally, why are > > these files in drivers/scsi rather than include/something? > > Ya, Hollis already made similar comments. I'm in Germany right now with > limited access to the network, so I'll fix them up next week. > > I debated where they should live....they are very SCSI oriented, so > drivers/scsi seemed like a reasonable place. A fact that will be clearer > once I put comments in them :-) > > Dave Boutcher > IBM Linux Technology Center Is this the same SRP I keep hearing about in Infiniband? If so, are we going to wind up with conflicts with infiniband drivers? Have you take a look at the sourceforge infiniband stack at all? (the only open source stack I know of so far) This sounds like something that should go in include/ somewhere, and we ought to try and work with other people working on it elsewhere. 
-- -------------------------------------------------------------------------- Troy Benjegerdes 'da hozer' hozer at drgw.net ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From boutcher at us.ibm.com Sat Dec 20 03:22:38 2003 From: boutcher at us.ibm.com (David Boutcher) Date: Fri, 19 Dec 2003 10:22:38 -0600 Subject: srp.h and viosrp.h In-Reply-To: <20031219153239.GH18377@kalmia.hozed.org> Message-ID: Troy Benjegerdes wrote on 12/19/2003 09:32:40 AM: > Is this the same SRP I keep hearing about in Infiniband? If so, are we > going to wind up with conflicts with infiniband drivers? > > Have you take a look at the sourceforge infiniband stack at all? (the > only open source stack I know of so far) Yes, this is exactly the same SRP, and all the definitions in these files are from the SRP specification that will be used for Infiniband. Since the "S" in SRP stands for SCSI, /drivers/scsi SEEMED like a good place....and probably a better place than /include/scsi, since those header files are mainly concerned with the interface from user to kernel for SCSI ops. The SRP definitions define the interface out the bottom of the kernel to the device (over some interface like infiniband.) I looked at the sourceforge code a while ago, I should go back and see if it has changed lately....it wasn't moving too fast last time I looked at it. Dave Boutcher IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From brking at us.ibm.com Wed Dec 24 08:34:13 2003 From: brking at us.ibm.com (Brian King) Date: Tue, 23 Dec 2003 15:34:13 -0600 Subject: pci_map_single return value References: <3FD732FD.10903@us.ibm.com> <20031217225328.GB25456@krispykreme> Message-ID: <3FE8B4D5.2070405@us.ibm.com> Anton Blanchard wrote: >>Why is it that pci_map_single returns NO_TCE on failure on PPC64 and 0 >>on failure on IA64? Which one is correct? If NO_TCE is correct, then why >>is it defined in a ppc64 include? This makes it difficult for device >>drivers to actually check for it. > > > I brought this up with the architecture maintainers and the concensus > seems to be having a per arch pci_dma_error() function, like x86-64 does: > > extern dma_addr_t bad_dma_address; > #define pci_dma_error(x) ((x) == bad_dma_address) > How does this look? -- Brian King eServer Storage I/O IBM Linux Technology Center -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: dma_mapping_error.patch Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20031223/00570dcf/attachment.txt From anton at samba.org Wed Dec 24 10:56:32 2003 From: anton at samba.org (Anton Blanchard) Date: Wed, 24 Dec 2003 10:56:32 +1100 Subject: ppc64 PTE hacks Message-ID: <20031223235632.GE934@krispykreme> Hi, I just remembered we never merged this patch from Paul. It would be great to get rid of the flush_tlb_* functions. Anton ----- Forwarded message from Paul Mackerras ----- From: Paul Mackerras To: anton at samba.org Subject: ppc64 PTE hacks Anton, Here is the patch that changes the HPTE handling so that we queue up a HPTE invalidation at the time when we change the linux PTE, instead of later in the flush_tlb_* call. Could you run some benchmarks for me with and without this patch on a decent-sized POWER4 box sometime? (I just noticed that this patch gives a net removal of 66 lines from the kernel, which is nice. :) Thanks, Paul. 
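The shape of the change, as a rough C sketch (illustrative only: hpte_update_sketch is a made-up name, and the real hook sits where the Linux PTE is updated; the diff below shows what is actually removed and added):

static inline void hpte_update_sketch(unsigned long addr, pte_t old_pte)
{
	struct ppc64_tlb_batch *batch = &ppc64_tlb_batch[smp_processor_id()];

	/* Only PTEs that ever had a hash table entry need an invalidate. */
	if (!(pte_val(old_pte) & _PAGE_HASHPTE))
		return;

	batch->pte[batch->index] = old_pte;
	batch->addr[batch->index] = addr;
	if (++batch->index == PPC64_TLB_BATCH_NR)
		flush_tlb_pending();	/* flush_hash_range() over the batch */
}

The payoff is that flush_tlb_mm/flush_tlb_range no longer have to re-walk the page tables hunting for _PAGE_HASHPTE entries; the work is captured at the moment the PTE changes, and the context switch path just calls flush_tlb_pending() to drain whatever is queued.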
diff -urN linux-2.5/arch/ppc64/kernel/process.c ppc64/arch/ppc64/kernel/process.c --- linux-2.5/arch/ppc64/kernel/process.c 2003-02-23 21:45:50.000000000 +1100 +++ ppc64/arch/ppc64/kernel/process.c 2003-03-19 16:37:25.000000000 +1100 @@ -45,6 +45,7 @@ #include #include #include +#include struct task_struct *last_task_used_math = NULL; @@ -103,6 +104,8 @@ giveup_fpu(prev); #endif /* CONFIG_SMP */ + flush_tlb_pending(); + new_thread = &new->thread; old_thread = ¤t->thread; diff -urN linux-2.5/arch/ppc64/mm/Makefile ppc64/arch/ppc64/mm/Makefile --- linux-2.5/arch/ppc64/mm/Makefile 2002-12-16 10:50:39.000000000 +1100 +++ ppc64/arch/ppc64/mm/Makefile 2003-02-24 17:14:52.000000000 +1100 @@ -4,5 +4,5 @@ EXTRA_CFLAGS += -mno-minimal-toc -obj-y := fault.o init.o extable.o imalloc.o +obj-y := fault.o init.o extable.o imalloc.o tlb.o obj-$(CONFIG_DISCONTIGMEM) += numa.o diff -urN linux-2.5/arch/ppc64/mm/init.c ppc64/arch/ppc64/mm/init.c --- linux-2.5/arch/ppc64/mm/init.c 2003-02-23 21:45:50.000000000 +1100 +++ ppc64/arch/ppc64/mm/init.c 2003-02-24 17:15:30.000000000 +1100 @@ -242,147 +242,6 @@ } } -void -flush_tlb_mm(struct mm_struct *mm) -{ - struct vm_area_struct *mp; - - spin_lock(&mm->page_table_lock); - - for (mp = mm->mmap; mp != NULL; mp = mp->vm_next) - __flush_tlb_range(mm, mp->vm_start, mp->vm_end); - - /* XXX are there races with checking cpu_vm_mask? - Anton */ - mm->cpu_vm_mask = 0; - - spin_unlock(&mm->page_table_lock); -} - -/* - * Callers should hold the mm->page_table_lock - */ -void -flush_tlb_page(struct vm_area_struct *vma, unsigned long vmaddr) -{ - unsigned long context = 0; - pgd_t *pgd; - pmd_t *pmd; - pte_t *ptep; - pte_t pte; - int local = 0; - - switch( REGION_ID(vmaddr) ) { - case VMALLOC_REGION_ID: - pgd = pgd_offset_k( vmaddr ); - break; - case IO_REGION_ID: - pgd = pgd_offset_i( vmaddr ); - break; - case USER_REGION_ID: - pgd = pgd_offset( vma->vm_mm, vmaddr ); - context = vma->vm_mm->context; - - /* XXX are there races with checking cpu_vm_mask? - Anton */ - if (vma->vm_mm->cpu_vm_mask == (1 << smp_processor_id())) - local = 1; - - break; - default: - panic("flush_tlb_page: invalid region 0x%016lx", vmaddr); - - } - - if (!pgd_none(*pgd)) { - pmd = pmd_offset(pgd, vmaddr); - if (!pmd_none(*pmd)) { - ptep = pte_offset_kernel(pmd, vmaddr); - /* Check if HPTE might exist and flush it if so */ - pte = __pte(pte_update(ptep, _PAGE_HPTEFLAGS, 0)); - if ( pte_val(pte) & _PAGE_HASHPTE ) { - flush_hash_page(context, vmaddr, pte, local); - } - } - } -} - -struct ppc64_tlb_batch ppc64_tlb_batch[NR_CPUS]; - -void -__flush_tlb_range(struct mm_struct *mm, unsigned long start, unsigned long end) -{ - pgd_t *pgd; - pmd_t *pmd; - pte_t *ptep; - pte_t pte; - unsigned long pgd_end, pmd_end; - unsigned long context = 0; - struct ppc64_tlb_batch *batch = &ppc64_tlb_batch[smp_processor_id()]; - unsigned long i = 0; - int local = 0; - - switch(REGION_ID(start)) { - case VMALLOC_REGION_ID: - pgd = pgd_offset_k(start); - break; - case IO_REGION_ID: - pgd = pgd_offset_i(start); - break; - case USER_REGION_ID: - pgd = pgd_offset(mm, start); - context = mm->context; - - /* XXX are there races with checking cpu_vm_mask? 
- Anton */ - if (mm->cpu_vm_mask == (1 << smp_processor_id())) - local = 1; - - break; - default: - panic("flush_tlb_range: invalid region for start (%016lx) and end (%016lx)\n", start, end); - } - - do { - pgd_end = (start + PGDIR_SIZE) & PGDIR_MASK; - if (pgd_end > end) - pgd_end = end; - if (!pgd_none(*pgd)) { - pmd = pmd_offset(pgd, start); - do { - pmd_end = (start + PMD_SIZE) & PMD_MASK; - if (pmd_end > end) - pmd_end = end; - if (!pmd_none(*pmd)) { - ptep = pte_offset_kernel(pmd, start); - do { - if (pte_val(*ptep) & _PAGE_HASHPTE) { - pte = __pte(pte_update(ptep, _PAGE_HPTEFLAGS, 0)); - if (pte_val(pte) & _PAGE_HASHPTE) { - batch->pte[i] = pte; - batch->addr[i] = start; - i++; - if (i == PPC64_TLB_BATCH_NR) { - flush_hash_range(context, i, local); - i = 0; - } - } - } - start += PAGE_SIZE; - ++ptep; - } while (start < pmd_end); - } else { - start = pmd_end; - } - ++pmd; - } while (start < pgd_end); - } else { - start = pgd_end; - } - ++pgd; - } while (start < end); - - if (i) - flush_hash_range(context, i, local); -} - void free_initmem(void) { unsigned long addr; diff -urN linux-2.5/arch/ppc64/mm/tlb.c ppc64/arch/ppc64/mm/tlb.c --- linux-2.5/arch/ppc64/mm/tlb.c Thu Jan 01 10:00:00 1970 +++ ppc64/arch/ppc64/mm/tlb.c Tue Feb 25 15:51:52 2003 @@ -0,0 +1,96 @@ +/* + * This file contains the routines for flushing entries from the + * TLB and MMU hash table. + * + * Derived from arch/ppc64/mm/init.c: + * Copyright (C) 1995-1996 Gary Thomas (gdt at linuxppc.org) + * + * Modifications by Paul Mackerras (PowerMac) (paulus at cs.anu.edu.au) + * and Cort Dougan (PReP) (cort at cs.nmt.edu) + * Copyright (C) 1996 Paul Mackerras + * Amiga/APUS changes by Jesper Skov (jskov at cygnus.co.uk). + * + * Derived from "arch/i386/mm/init.c" + * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds + * + * Dave Engebretsen + * Rework for PPC64 port. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ +#include +#include +#include +#include +#include +#include +#include +#include + +#if 0 +struct ppc64_tlb_batch { + unsigned long index; + unsigned long context; + struct mm_struct *mm; + pte_t pte[PPC64_TLB_BATCH_NR]; + unsigned long addr[PPC64_TLB_BATCH_NR]; + //unsigned long vaddr[PPC64_TLB_BATCH_NR]; +}; +#endif + +struct ppc64_tlb_batch ppc64_tlb_batch[NR_CPUS]; + +/* + * Update the MMU hash table to correspond with a change to + * a Linux PTE. If wrprot is true, it is permissible to + * change the existing HPTE to read-only rather than removing it + * (if we remove it we should clear the _PTE_HPTEFLAGS bits). 
+ */ +void hpte_update(pte_t *ptep, unsigned long pte, int wrprot) +{ + struct page *ptepage; + struct mm_struct *mm; + unsigned long addr; + int i; + unsigned long context = 0; + struct ppc64_tlb_batch *batch = &ppc64_tlb_batch[smp_processor_id()]; + + ptepage = virt_to_page(ptep); + mm = (struct mm_struct *) ptepage->mapping; + addr = ptepage->index + (((unsigned long)ptep & ~PAGE_MASK) << 9); + if (REGION_ID(addr) == USER_REGION_ID) + context = mm->context; + i = batch->index; + if (unlikely(i != 0 && context != batch->context)) { + flush_tlb_pending(); + i = 0; + } + if (i == 0) { + batch->context = context; + batch->mm = mm; + } + batch->pte[i] = __pte(pte); + batch->addr[i] = addr; + batch->index = ++i; + if (i >= PPC64_TLB_BATCH_NR) + flush_tlb_pending(); +} + +void __flush_tlb_pending(struct ppc64_tlb_batch *batch) +{ + int i; + int local = 0; + + i = batch->index; + if (batch->mm->cpu_vm_mask == (1 << smp_processor_id())) + local = 1; + if (i == 1) + flush_hash_page(batch->context, batch->addr[0], batch->pte[0], + local); + else + flush_hash_range(batch->context, i, local); + batch->index = 0; +} diff -urN linux-2.5/include/asm-ppc64/pgtable.h ppc64/include/asm-ppc64/pgtable.h --- linux-2.5/include/asm-ppc64/pgtable.h 2003-02-27 08:12:37.000000000 +1100 +++ ppc64/include/asm-ppc64/pgtable.h 2003-03-19 16:03:12.000000000 +1100 @@ -10,6 +10,7 @@ #include /* For TASK_SIZE */ #include #include +#include #endif /* __ASSEMBLY__ */ /* PMD_SHIFT determines what a second-level page table entry can map */ @@ -262,64 +263,85 @@ /* Atomic PTE updates */ -static inline unsigned long pte_update( pte_t *p, unsigned long clr, - unsigned long set ) +static inline unsigned long pte_update(pte_t *p, unsigned long clr) { unsigned long old, tmp; __asm__ __volatile__( "1: ldarx %0,0,%3 # pte_update\n\ andc %1,%0,%4 \n\ - or %1,%1,%5 \n\ stdcx. 
%1,0,%3 \n\ bne- 1b" : "=&r" (old), "=&r" (tmp), "=m" (*p) - : "r" (p), "r" (clr), "r" (set), "m" (*p) + : "r" (p), "r" (clr), "m" (*p) : "cc" ); return old; } +/* PTE updating functions */ +extern void hpte_update(pte_t *ptep, unsigned long pte, int wrprot); + static inline int ptep_test_and_clear_young(pte_t *ptep) { - return (pte_update(ptep, _PAGE_ACCESSED, 0) & _PAGE_ACCESSED) != 0; + unsigned long old; + + old = pte_update(ptep, _PAGE_ACCESSED | _PAGE_HPTEFLAGS); + if (old & _PAGE_HASHPTE) { + hpte_update(ptep, old, 0); + flush_tlb_pending(); /* XXX generic code doesn't flush */ + } + return (old & _PAGE_ACCESSED) != 0; } static inline int ptep_test_and_clear_dirty(pte_t *ptep) { - return (pte_update(ptep, _PAGE_DIRTY, 0) & _PAGE_DIRTY) != 0; -} + unsigned long old; -static inline pte_t ptep_get_and_clear(pte_t *ptep) -{ - return __pte(pte_update(ptep, ~_PAGE_HPTEFLAGS, 0)); + old = pte_update(ptep, _PAGE_DIRTY); + if ((~old & (_PAGE_HASHPTE | _PAGE_RW | _PAGE_DIRTY)) == 0) + hpte_update(ptep, old, 1); + return (old & _PAGE_DIRTY) != 0; } static inline void ptep_set_wrprotect(pte_t *ptep) { - pte_update(ptep, _PAGE_RW, 0); + unsigned long old; + + old = pte_update(ptep, _PAGE_RW); + if ((~old & (_PAGE_HASHPTE | _PAGE_RW | _PAGE_DIRTY)) == 0) + hpte_update(ptep, old, 1); } -static inline void ptep_mkdirty(pte_t *ptep) +static inline pte_t ptep_get_and_clear(pte_t *ptep) { - pte_update(ptep, 0, _PAGE_DIRTY); + unsigned long old = pte_update(ptep, ~0UL); + + if (old & _PAGE_HASHPTE) + hpte_update(ptep, old, 0); + return __pte(old); } -#define pte_same(A,B) (((pte_val(A) ^ pte_val(B)) & ~_PAGE_HPTEFLAGS) == 0) +static inline void pte_clear(pte_t * ptep) +{ + unsigned long old = pte_update(ptep, ~0UL); + + if (old & _PAGE_HASHPTE) + hpte_update(ptep, old, 0); +} /* * set_pte stores a linux PTE into the linux page table. - * On machines which use an MMU hash table we avoid changing the - * _PAGE_HASHPTE bit. 
*/ static inline void set_pte(pte_t *ptep, pte_t pte) { - pte_update(ptep, ~_PAGE_HPTEFLAGS, pte_val(pte) & ~_PAGE_HPTEFLAGS); + if (pte_present(*ptep)) + pte_clear(ptep); + if (pte_present(pte)) + flush_tlb_pending(); + *ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS; } -static inline void pte_clear(pte_t * ptep) -{ - pte_update(ptep, ~_PAGE_HPTEFLAGS, 0); -} +#define pte_same(A,B) (((pte_val(A) ^ pte_val(B)) & ~_PAGE_HPTEFLAGS) == 0) extern unsigned long ioremap_bot, ioremap_base; diff -urN linux-2.5/include/asm-ppc64/tlb.h ppc64/include/asm-ppc64/tlb.h --- linux-2.5/include/asm-ppc64/tlb.h 2003-01-12 18:45:40.000000000 +1100 +++ ppc64/include/asm-ppc64/tlb.h 2003-02-25 15:52:01.000000000 +1100 @@ -13,11 +13,10 @@ #define _PPC64_TLB_H #include -#include #include #include -static inline void tlb_flush(struct mmu_gather *tlb); +#define tlb_flush(tlb) flush_tlb_pending() /* Avoid pulling in another include just for this */ #define check_pgt_cache() do { } while (0) @@ -29,61 +28,6 @@ #define tlb_start_vma(tlb, vma) do { } while (0) #define tlb_end_vma(tlb, vma) do { } while (0) -/* Should make this at least as large as the generic batch size, but it - * takes up too much space */ -#define PPC64_TLB_BATCH_NR 192 - -struct ppc64_tlb_batch { - unsigned long index; - pte_t pte[PPC64_TLB_BATCH_NR]; - unsigned long addr[PPC64_TLB_BATCH_NR]; - unsigned long vaddr[PPC64_TLB_BATCH_NR]; -}; - -extern struct ppc64_tlb_batch ppc64_tlb_batch[NR_CPUS]; - -static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep, - unsigned long address) -{ - int cpu = smp_processor_id(); - struct ppc64_tlb_batch *batch = &ppc64_tlb_batch[cpu]; - unsigned long i = batch->index; - pte_t pte; - - if (pte_val(*ptep) & _PAGE_HASHPTE) { - pte = __pte(pte_update(ptep, _PAGE_HPTEFLAGS, 0)); - if (pte_val(pte) & _PAGE_HASHPTE) { - - batch->pte[i] = pte; - batch->addr[i] = address; - i++; - - if (i == PPC64_TLB_BATCH_NR) { - int local = 0; - - if (tlb->mm->cpu_vm_mask == (1UL << cpu)) - local = 1; - - flush_hash_range(tlb->mm->context, i, local); - i = 0; - } - } - } - - batch->index = i; -} - -static inline void tlb_flush(struct mmu_gather *tlb) -{ - int cpu = smp_processor_id(); - struct ppc64_tlb_batch *batch = &ppc64_tlb_batch[cpu]; - int local = 0; - - if (tlb->mm->cpu_vm_mask == (1UL << smp_processor_id())) - local = 1; - - flush_hash_range(tlb->mm->context, batch->index, local); - batch->index = 0; -} +#define __tlb_remove_tlb_entry(tlb, pte, address) do { } while (0) #endif /* _PPC64_TLB_H */ diff -urN linux-2.5/include/asm-ppc64/tlbflush.h ppc64/include/asm-ppc64/tlbflush.h --- linux-2.5/include/asm-ppc64/tlbflush.h 2002-06-07 18:21:41.000000000 +1000 +++ ppc64/include/asm-ppc64/tlbflush.h 2003-02-25 15:51:59.000000000 +1100 @@ -1,10 +1,6 @@ #ifndef _PPC64_TLBFLUSH_H #define _PPC64_TLBFLUSH_H -#include -#include -#include - /* * TLB flushing: * @@ -15,22 +11,36 @@ * - flush_tlb_pgtables(mm, start, end) flushes a range of page tables */ -extern void flush_tlb_mm(struct mm_struct *mm); -extern void flush_tlb_page(struct vm_area_struct *vma, unsigned long vmaddr); -extern void __flush_tlb_range(struct mm_struct *mm, - unsigned long start, unsigned long end); -#define flush_tlb_range(vma, start, end) \ - __flush_tlb_range(vma->vm_mm, start, end) +struct mm_struct; + +#define PPC64_TLB_BATCH_NR 192 + +struct ppc64_tlb_batch { + unsigned long index; + unsigned long context; + struct mm_struct *mm; + pte_t pte[PPC64_TLB_BATCH_NR]; + unsigned long addr[PPC64_TLB_BATCH_NR]; + unsigned long 
vaddr[PPC64_TLB_BATCH_NR]; +}; -#define flush_tlb_kernel_range(start, end) \ - __flush_tlb_range(&init_mm, (start), (end)) +extern struct ppc64_tlb_batch ppc64_tlb_batch[]; +extern void __flush_tlb_pending(struct ppc64_tlb_batch *batch); -static inline void flush_tlb_pgtables(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline void flush_tlb_pending(void) { - /* PPC has hw page tables. */ + struct ppc64_tlb_batch *batch = &ppc64_tlb_batch[smp_processor_id()]; + + if (batch->index) + __flush_tlb_pending(batch); } +#define flush_tlb_mm(mm) flush_tlb_pending() +#define flush_tlb_page(vma, addr) flush_tlb_pending() +#define flush_tlb_range(vma, start, end) flush_tlb_pending() +#define flush_tlb_kernel_range(start, end) flush_tlb_pending() +#define flush_tlb_pgtables(mm, start, end) do { } while (0) + extern void flush_hash_page(unsigned long context, unsigned long ea, pte_t pte, int local); void flush_hash_range(unsigned long context, unsigned long number, int local); ----- End forwarded message ----- ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Wed Dec 24 15:58:48 2003 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Wed, 24 Dec 2003 15:58:48 +1100 Subject: ppc64 PTE hacks In-Reply-To: <20031223235632.GE934@krispykreme> References: <20031223235632.GE934@krispykreme> Message-ID: <1072241928.820.39.camel@gaston> On Wed, 2003-12-24 at 10:56, Anton Blanchard wrote: > Hi, > > I just remembered we never merged this patch from Paul. It would be > great to get rid of the flush_tlb_* functions. We actually _need_ it to fix the problem you discovered where we never flush HPTEs on ptep_test_and_clear_young(), thus defeating page aging. Paul: same problem with ppc32, I suppose. Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Sat Dec 27 23:15:25 2003 From: anton at samba.org (Anton Blanchard) Date: Sat, 27 Dec 2003 23:15:25 +1100 Subject: per page execute Message-ID: <20031227121524.GA24358@krispykreme> Hi, We need to move towards enforcing exec permission on all mappings. Here's a start: - Switch _PAGE_EXEC and _PAGE_RW so _PAGE_EXEC matches up with the hardware noexec bit - Add _PAGE_EXEC to the hash_page permission check - Check for exec permission in do_page_fault - Remove the redundant set of _PAGE_PRESENT in do_hash_page_DSI; we do it again in __hash_page - Invert the linux _PAGE_EXEC bit and enter it in the ppc64 hpte - Awful bss hack to force program BSS sections to be marked exec - Split 32- and 64-bit data and stack flags, only enforce exec permission on mmap/brk for the moment (i.e. always mark the stack executable) - More rigid pte_modify, avoid turning off the no-cache bits when doing mprotect (not related to the rest of this patch; I took the opportunity to fix it while I was in the area) - Our PXXX and SXXX were backwards :) They are in XWR order due to the way our mmap flags are laid out. Wow, that bug must date back a few years. - Kill unused PAGE_KERNEL_CI, pte_cache, pte_uncache When mapping in an executable, it seems the kernel doesn't look at the permissions of non-load segments. We explode pretty early because our plt in ppc32 is in such an area. The awful hack was an attempt to fix this problem quickly without marking the entire bss as exec by default.
Its crying for a proper fix :) Another thing that worries me: [Nr] Name Type Addr Off Size ES Flg Lk Inf Al [25] .plt NOBITS 10010c08 000c00 0000c0 00 WAX 0 0 4 [26] .bss NOBITS 10010cc8 000c00 000004 00 WA 0 0 1 Look how the non executable bss butts right onto the executable plt. Even with the patch below, we are failing some security tests that try and exec stuff out of the bss. Thats because the stuff ends up in the same page as the plt. Alan, could this be considered a toolchain bug? We also need to fix the kernel signal trampoline code before turning off exec permission on the stack. If we did the fixmap trick that x86 does and the trampoline always ended up in this page, that would work well. Does glibc rely on the stack being executable? We may need a boot option for people on old toolchains/glibcs (eg the bug where the toolchain forgot to mark sections executable or the other bug where our GOT was not marked executable). Anton ===== arch/ppc64/kernel/head.S 1.42 vs edited ===== --- 1.42/arch/ppc64/kernel/head.S Wed Dec 17 15:27:52 2003 +++ edited/arch/ppc64/kernel/head.S Sat Dec 27 11:23:37 2003 @@ -35,6 +35,7 @@ #include #include #include +#include #ifdef CONFIG_PPC_ISERIES #define DO_SOFT_DISABLE @@ -658,7 +659,7 @@ andis. r0,r3,0xa450 /* weird error? */ bne 1f /* if not, try to put a PTE */ andis. r0,r3,0x0020 /* Is it a page table fault? */ - rlwinm r4,r3,32-23,29,29 /* DSISR_STORE -> _PAGE_RW */ + rlwinm r4,r3,32-25+9,31-9,31-9 /* DSISR_STORE -> _PAGE_RW */ ld r3,_DAR(r1) /* into the hash table */ beq+ 2f /* If so handle it */ @@ -818,10 +819,9 @@ b .ret_from_except _GLOBAL(do_hash_page_ISI) - li r4,0 + li r4,_PAGE_EXEC _GLOBAL(do_hash_page_DSI) rlwimi r4,r23,32-13,30,30 /* Insert MSR_PR as _PAGE_USER */ - ori r4,r4,1 /* add _PAGE_PRESENT */ mflr r21 /* Save LR in r21 */ ===== arch/ppc64/mm/fault.c 1.14 vs edited ===== --- 1.14/arch/ppc64/mm/fault.c Fri Sep 12 21:01:40 2003 +++ edited/arch/ppc64/mm/fault.c Sat Dec 27 16:32:23 2003 @@ -59,6 +59,7 @@ siginfo_t info; unsigned long code = SEGV_MAPERR; unsigned long is_write = error_code & 0x02000000; + unsigned long is_exec = regs->trap == 0x400; #ifdef CONFIG_DEBUG_KERNEL if (debugger_fault_handler && (regs->trap == 0x300 || @@ -102,16 +103,20 @@ good_area: code = SEGV_ACCERR; + if (is_exec) { + /* XXX huh? */ + /* protection fault */ + if (error_code & 0x08000000) + goto bad_area; + if (!(vma->vm_flags & VM_EXEC)) + goto bad_area; /* a write */ - if (is_write) { + } else if (is_write) { if (!(vma->vm_flags & VM_WRITE)) goto bad_area; /* a read */ } else { - /* protection fault */ - if (error_code & 0x08000000) - goto bad_area; - if (!(vma->vm_flags & (VM_READ | VM_EXEC))) + if (!(vma->vm_flags & VM_READ)) goto bad_area; } ===== arch/ppc64/mm/hash_low.S 1.1 vs edited ===== --- 1.1/arch/ppc64/mm/hash_low.S Wed Dec 17 15:55:14 2003 +++ edited/arch/ppc64/mm/hash_low.S Sat Dec 27 11:25:59 2003 @@ -90,7 +90,7 @@ /* Prepare new PTE value (turn access RW into DIRTY, then * add BUSY,HASHPTE and ACCESSED) */ - rlwinm r30,r4,5,24,24 /* _PAGE_RW -> _PAGE_DIRTY */ + rlwinm r30,r4,32-9+7,31-7,31-7 /* _PAGE_RW -> _PAGE_DIRTY */ or r30,r30,r31 ori r30,r30,_PAGE_BUSY | _PAGE_ACCESSED | _PAGE_HASHPTE /* Write the linux PTE atomically (setting busy) */ @@ -113,11 +113,11 @@ rldicl r5,r5,0,25 /* vsid & 0x0000007fffffffff */ rldicl r0,r3,64-12,48 /* (ea >> 12) & 0xffff */ xor r28,r5,r0 - - /* Convert linux PTE bits into HW equivalents - */ - andi. 
r3,r30,0x1fa /* Get basic set of flags */ - rlwinm r0,r30,32-2+1,30,30 /* _PAGE_RW -> _PAGE_USER (r0) */ + + /* Convert linux PTE bits into HW equivalents */ + andi. r3,r30,0x1fe /* Get basic set of flags */ + xori r3,r3,_PAGE_EXEC /* _PAGE_EXEC -> NOEXEC */ + rlwinm r0,r30,32-9+1,30,30 /* _PAGE_RW -> _PAGE_USER (r0) */ rlwinm r4,r30,32-7+1,30,30 /* _PAGE_DIRTY -> _PAGE_USER (r4) */ and r0,r0,r4 /* _PAGE_RW & _PAGE_DIRTY -> r0 bit 30 */ andc r0,r30,r0 /* r0 = pte & ~r0 */ ===== fs/binfmt_elf.c 1.54 vs edited ===== --- 1.54/fs/binfmt_elf.c Thu Oct 23 08:29:22 2003 +++ edited/fs/binfmt_elf.c Sat Dec 27 22:00:22 2003 @@ -86,8 +86,10 @@ { start = ELF_PAGEALIGN(start); end = ELF_PAGEALIGN(end); - if (end > start) + if (end > start) { do_brk(start, end - start); + sys_mprotect(start, end-start, PROT_READ|PROT_WRITE|PROT_EXEC); + } current->mm->start_brk = current->mm->brk = end; } ===== include/asm-ppc64/page.h 1.22 vs edited ===== --- 1.22/include/asm-ppc64/page.h Fri Sep 12 21:06:51 2003 +++ edited/include/asm-ppc64/page.h Sat Dec 27 17:43:57 2003 @@ -234,8 +234,25 @@ #define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT) -#define VM_DATA_DEFAULT_FLAGS (VM_READ | VM_WRITE | VM_EXEC | \ +#define VM_DATA_DEFAULT_FLAGS32 (VM_READ | VM_WRITE | \ VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) + +#define VM_STACK_DEFAULT_FLAGS32 (VM_READ | VM_WRITE | VM_EXEC | \ + VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) + +#define VM_DATA_DEFAULT_FLAGS64 (VM_READ | VM_WRITE | \ + VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) + +#define VM_STACK_DEFAULT_FLAGS64 (VM_READ | VM_WRITE | VM_EXEC | \ + VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) + +#define VM_DATA_DEFAULT_FLAGS \ + (test_thread_flag(TIF_32BIT) ? \ + VM_DATA_DEFAULT_FLAGS32 : VM_DATA_DEFAULT_FLAGS64) + +#define VM_STACK_DEFAULT_FLAGS \ + (test_thread_flag(TIF_32BIT) ? 
\ + VM_STACK_DEFAULT_FLAGS32 : VM_STACK_DEFAULT_FLAGS64) #endif /* __KERNEL__ */ #endif /* _PPC64_PAGE_H */ ===== include/asm-ppc64/pgtable.h 1.30 vs edited ===== --- 1.30/include/asm-ppc64/pgtable.h Wed Dec 17 16:08:23 2003 +++ edited/include/asm-ppc64/pgtable.h Sat Dec 27 15:07:05 2003 @@ -78,24 +78,25 @@ #define _PAGE_PRESENT 0x0001 /* software: pte contains a translation */ #define _PAGE_USER 0x0002 /* matches one of the PP bits */ #define _PAGE_FILE 0x0002 /* (!present only) software: pte holds file offset */ -#define _PAGE_RW 0x0004 /* software: user write access allowed */ +#define _PAGE_EXEC 0x0004 /* No execute on POWER4 and newer (we invert) */ #define _PAGE_GUARDED 0x0008 #define _PAGE_COHERENT 0x0010 /* M: enforce memory coherence (SMP systems) */ #define _PAGE_NO_CACHE 0x0020 /* I: cache inhibit */ #define _PAGE_WRITETHRU 0x0040 /* W: cache write-through */ #define _PAGE_DIRTY 0x0080 /* C: page changed */ #define _PAGE_ACCESSED 0x0100 /* R: page referenced */ -#define _PAGE_EXEC 0x0200 /* software: i-cache coherence required */ +#define _PAGE_RW 0x0200 /* software: user write access allowed */ #define _PAGE_HASHPTE 0x0400 /* software: pte has an associated HPTE */ #define _PAGE_BUSY 0x0800 /* software: PTE & hash are busy */ #define _PAGE_SECONDARY 0x8000 /* software: HPTE is in secondary group */ #define _PAGE_GROUP_IX 0x7000 /* software: HPTE index within group */ /* Bits 0x7000 identify the index within an HPT Group */ #define _PAGE_HPTEFLAGS (_PAGE_BUSY | _PAGE_HASHPTE | _PAGE_SECONDARY | _PAGE_GROUP_IX) + /* PAGE_MASK gives the right answer below, but only by accident */ /* It should be preserving the high 48 bits and then specifically */ /* preserving _PAGE_SECONDARY | _PAGE_GROUP_IX */ -#define _PAGE_CHG_MASK (PAGE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_HPTEFLAGS) +#define _PAGE_CHG_MASK (_PAGE_GUARDED | _PAGE_COHERENT | _PAGE_NO_CACHE | _PAGE_WRITETHRU | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_HPTEFLAGS | PAGE_MASK) #define _PAGE_BASE (_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_COHERENT) @@ -111,31 +112,32 @@ #define PAGE_READONLY __pgprot(_PAGE_BASE | _PAGE_USER) #define PAGE_READONLY_X __pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_EXEC) #define PAGE_KERNEL __pgprot(_PAGE_BASE | _PAGE_WRENABLE) -#define PAGE_KERNEL_CI __pgprot(_PAGE_PRESENT | _PAGE_ACCESSED | \ - _PAGE_WRENABLE | _PAGE_NO_CACHE | _PAGE_GUARDED) /* - * The PowerPC can only do execute protection on a segment (256MB) basis, - * not on a page basis. So we consider execute permission the same as read. + * POWER4 and newer have per page execute protection, older chips can only + * do this on a segment (256MB) basis. + * * Also, write permissions imply read permissions. * This is the closest we can get.. 
+ * + * Note due to the way vm flags are laid out, the bits are XWR */ #define __P000 PAGE_NONE -#define __P001 PAGE_READONLY_X +#define __P001 PAGE_READONLY #define __P010 PAGE_COPY -#define __P011 PAGE_COPY_X -#define __P100 PAGE_READONLY +#define __P011 PAGE_COPY +#define __P100 PAGE_READONLY_X #define __P101 PAGE_READONLY_X -#define __P110 PAGE_COPY +#define __P110 PAGE_COPY_X #define __P111 PAGE_COPY_X #define __S000 PAGE_NONE -#define __S001 PAGE_READONLY_X +#define __S001 PAGE_READONLY #define __S010 PAGE_SHARED -#define __S011 PAGE_SHARED_X -#define __S100 PAGE_READONLY +#define __S011 PAGE_SHARED +#define __S100 PAGE_READONLY_X #define __S101 PAGE_READONLY_X -#define __S110 PAGE_SHARED +#define __S110 PAGE_SHARED_X #define __S111 PAGE_SHARED_X #ifndef __ASSEMBLY__ @@ -191,7 +193,8 @@ }) #define pte_modify(_pte, newprot) \ - (__pte((pte_val(_pte) & _PAGE_CHG_MASK) | pgprot_val(newprot))) + (__pte((pte_val(_pte) & _PAGE_CHG_MASK) | \ + (pgprot_val(newprot) & ~_PAGE_CHG_MASK))) #define pte_none(pte) ((pte_val(pte) & ~_PAGE_HPTEFLAGS) == 0) #define pte_present(pte) (pte_val(pte) & _PAGE_PRESENT) @@ -260,9 +263,6 @@ static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY;} static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED;} static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE;} - -static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; } -static inline void pte_cache(pte_t pte) { pte_val(pte) &= ~_PAGE_NO_CACHE; } static inline pte_t pte_rdprotect(pte_t pte) { pte_val(pte) &= ~_PAGE_USER; return pte; } ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Sun Dec 28 16:29:55 2003 From: anton at samba.org (Anton Blanchard) Date: Sun, 28 Dec 2003 16:29:55 +1100 Subject: spinlocks Message-ID: <20031228052954.GD24358@krispykreme> Hi, We really have to get the new spinlocks beaten into shape... 1. They are massive: 24 inline instructions. They eat hot icache for breakfast. 2. They add a bunch of clobbers: : "=&r"(tmp), "=&r"(tmp2) : "r"(&lock->lock) : "r3", "r4", "r5", "cr0", "cr1", "ctr", "xer", "memory"); We tie gcc's hands behind its back for the unlikely case that we have to call into the hypervisor. 3. Separate spinlocks for iseries and pseries where most of it is duplicated. 4. They add more reliance on the paca. We have to stop using the paca for everything that isn't architecturally required and move to per-cpu data. In the end we may have to put the processor virtual area in the paca, but we need to be thinking about this issue. As an aside, can someone explain why we reread the lock holder: lwsync # if odd, give up cycles\n\ ldx %1,0,%2 # reverify the lock holder\n\ cmpd %0,%1\n\ bne 1b # new holder so restart\n\ Won't there be a race regardless of whether this code is there? 5. I like how we store r13 into the lock, since it could save one register and will make the guys wanting debug spinlocks a bit happier (you can work out which cpu has the lock via the spinlock value). I'm proposing a few things: 1. Recognise that once we are in SPLPAR mode, all performance bets are off and we can burn more cycles. If we are calling into the hypervisor, the path length there is going to dwarf us, so why optimise for it? 2. Move the slow path out of line. We had problems with this due to the limited reach of a conditional branch, but we can fix this by compiling with -ffunction-sections. We only then encounter problems if we get a function that is larger than 32kB.
If that happens, something is really wrong :) 3. In the slow path call a single out of line function when calling into the hypervisor that saves/restores all relevant registers. The call will be nop'ed out by the cpufeature fixup stuff on non SPLPAR. With the new module interface we should be able to handle cpufeature fixups in modules. Outstanding stuff: - implement the out of line splpar_spinlock code - fix cpu features to fixup stuff in modules - work out how to use FW_FEATURE_SPLPAR in the FTR_SECTION code Here is what Im thinking the spinlocks should look like: static inline void _raw_spin_lock(spinlock_t *lock) { unsigned long tmp; asm volatile( "1: ldarx %0,0,%1 # spin_lock\n\ cmpdi 0,%0,0\n\ bne- 2f\n\ stdcx. 13,0,%1\n\ bne- 1b\n\ isync\n\ .subsection 1\n\ 2:" HMT_LOW BEGIN_FTR_SECTION " mflr %0\n\ bl .splpar_spinlock\n" END_FTR_SECTION_IFSET(CPU_FTR_SPLPAR) " ldx %0,0,%1\n\ cmpdi 0,%0,0\n\ bne- 2b\n" HMT_MEDIUM " b 1b\n\ .previous" : "=&r"(tmp) : "r"(&lock->lock) : "cr0", "memory"); } Anton ===== arch/ppc64/Makefile 1.39 vs edited ===== --- 1.39/arch/ppc64/Makefile Tue Dec 9 03:23:33 2003 +++ edited/arch/ppc64/Makefile Sun Dec 28 13:41:49 2003 @@ -28,7 +28,8 @@ LDFLAGS := -m elf64ppc LDFLAGS_vmlinux := -Bstatic -e $(KERNELLOAD) -Ttext $(KERNELLOAD) -CFLAGS += -msoft-float -pipe -Wno-uninitialized -mminimal-toc +CFLAGS += -msoft-float -pipe -Wno-uninitialized -mminimal-toc \ + -mtraceback=none -ffunction-sections ifeq ($(CONFIG_POWER4_ONLY),y) CFLAGS += -mcpu=power4 ===== include/asm-ppc64/spinlock.h 1.7 vs edited ===== --- 1.7/include/asm-ppc64/spinlock.h Sat Nov 15 05:45:32 2003 +++ edited/include/asm-ppc64/spinlock.h Sun Dec 28 13:50:18 2003 @@ -15,14 +15,14 @@ * 2 of the License, or (at your option) any later version. */ -#include +#include /* * The following define is being used to select basic or shared processor * locking when running on an RPA platform. As we do more performance * tuning, I would expect this selection mechanism to change. Dave E. */ -#define SPLPAR_LOCKS +#undef SPLPAR_LOCKS #define HVSC ".long 0x44000022\n" typedef struct { @@ -138,25 +138,33 @@ : "r3", "r4", "r5", "cr0", "cr1", "ctr", "xer", "memory"); } #else -static __inline__ void _raw_spin_lock(spinlock_t *lock) + +static inline void _raw_spin_lock(spinlock_t *lock) { unsigned long tmp; - __asm__ __volatile__( - "b 2f # spin_lock\n\ -1:" - HMT_LOW -" ldx %0,0,%1 # load the lock value\n\ - cmpdi 0,%0,0 # if not locked, try to acquire\n\ - bne+ 1b\n\ -2: \n" - HMT_MEDIUM -" ldarx %0,0,%1\n\ + asm volatile( +"1: ldarx %0,0,%1 # spin_lock\n\ cmpdi 0,%0,0\n\ - bne- 1b\n\ + bne- 2f\n\ stdcx. 13,0,%1\n\ - bne- 2b\n\ - isync" + bne- 1b\n\ + isync\n\ + .subsection 1\n\ +2:" + HMT_LOW +#if 0 +BEGIN_FTR_SECTION +" mflr %0\n\ + bl .splpar_spinlock\n" +END_FTR_SECTION_IFSET(CPU_FTR_SPLPAR) +#endif +" ldx %0,0,%1\n\ + cmpdi 0,%0,0\n\ + bne- 2b\n" + HMT_MEDIUM +" b 1b\n\ + .previous" : "=&r"(tmp) : "r"(&lock->lock) : "cr0", "memory"); ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From anton at samba.org Wed Dec 31 01:36:58 2003 From: anton at samba.org (Anton Blanchard) Date: Wed, 31 Dec 2003 01:36:58 +1100 Subject: 2.6.0 oops Message-ID: <20031230143658.GA28023@krispykreme> Hi, Running bash-shared-mappings on ameslab BK from last night I got the following oops: cpu 0: Vector: 300 (Data Access) at [c00000000ee63760] pc: c00000000000e158 (.__hash_page+0x2c/0xa4) lr: c00000000004af14 (.update_mmu_cache+0x15c/0x1d4) sp: c00000000ee639e0 msr: a000000000009032 dar: 0 dsisr: 40000000 current = 0xc000000001e18300 paca = 0xc00000000051a000 pid = 17778, comm = init 0:mon> t c00000000ee63ae0 c00000000004af14 .update_mmu_cache+0x15c/0x1d4 c00000000ee63b90 c00000000008fe40 .do_no_page+0x288/0x554 c00000000ee63c60 c0000000000903c8 .handle_mm_fault+0x194/0x268 c00000000ee63d00 c000000000049734 .do_page_fault+0x160/0x48c c00000000ee63e30 c00000000000acb0 InstructionAccess_common+0x10c/0x110 0:mon> di %pc c00000000000e158 7fe030a8 ldarx r31,r0,r6 We died nice and early in __hash_page; all arguments should be preserved: int __hash_page(unsigned long ea, unsigned long access, unsigned long vsid, pte_t *ptep, unsigned long trap, int local); R03 = 000000000ff4e7d8 R04 = 0000000000000003 R05 = 0000000e3779b970 R06 = 0000000000000000 R07 = 0000000000000300 R08 = 0000000000000000 ea is probably somewhere in shared libraries. access is read/write, vsid looks ok, but ptep is NULL. How did this happen? We hold the page table spinlock around the whole region that does the set_pte in do_no_page right through to update_mmu_cache where we died. So I'm guessing we did put a pte in there, but why didn't find_linux_pte find it? There are two ways find_linux_pte (in update_mmu_cache) could return NULL: either the pgd/pmd is empty or the pte doesn't have _PAGE_PRESENT set. We did check that the pte had _PAGE_ACCESSED set earlier in update_mmu_cache, so it sounds like we are asking to insert a pte with _PAGE_ACCESSED set and _PAGE_PRESENT not set. The safe thing to do is add a check for NULL after find_linux_pte; update_mmu_cache is just an optimisation and it makes sense to guard against weird things like this. However, I would like to understand how we ended up in this scenario in the first place... Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From amodra at bigpond.net.au Wed Dec 31 09:18:32 2003 From: amodra at bigpond.net.au (Alan Modra) Date: Wed, 31 Dec 2003 08:48:32 +1030 Subject: spinlocks In-Reply-To: <20031228052954.GD24358@krispykreme> References: <20031228052954.GD24358@krispykreme> Message-ID: <20031230221832.GA22998@bubble.sa.bigpond.net.au> On Sun, Dec 28, 2003 at 04:29:55PM +1100, Anton Blanchard wrote: > static inline void _raw_spin_lock(spinlock_t *lock) > { > unsigned long tmp; > > asm volatile( > "1: ldarx %0,0,%1 # spin_lock\n\ > cmpdi 0,%0,0\n\ > bne- 2f\n\ > stdcx. 13,0,%1\n\ > bne- 1b\n\ > isync\n\ > .subsection 1\n\ > 2:" > HMT_LOW > BEGIN_FTR_SECTION > " mflr %0\n\ > bl .splpar_spinlock\n" > END_FTR_SECTION_IFSET(CPU_FTR_SPLPAR) > " ldx %0,0,%1\n\ > cmpdi 0,%0,0\n\ > bne- 2b\n" > HMT_MEDIUM > " b 1b\n\ > .previous" > : "=&r"(tmp) > : "r"(&lock->lock) > : "cr0", "memory"); > } You might want to restore lr somewhere in there, unless there's something magic about those FTR_SECTION macros. :) Do you really want to tell gcc that all memory is potentially changed by _raw_spin_lock? Hmm, I guess if you're accessing something protected by a lock then you want to say that old values of the "something" are stale.
However, I think it would be better to explicitly say that &lock->lock is an output of the asm, rather than relying on the "memory" clobber to do that. Also, you might find it a little tricky to write splpar_spinlock. The problem is that you can't use any registers (since you haven't told gcc about any), and you'll need to be careful about using the stack. If _raw_spin_lock is called from a leaf function foo, then gcc may not set up a stack frame for foo. As per the ABI, gcc may use 288 bytes below r1 as scratch that isn't saved over calls. Since you haven't told gcc that you're making a call, you need to skip this area if using the stack in splpar_spinlock. I wonder if you wouldn't do better by making _raw_spin_lock a function written in asm. OK, that would mean the overhead of a function call, but I reckon many people forget that inline code blows icache, which probably hurts more.. -- Alan Modra IBM OzLabs - Linux Technology Centre ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From amodra at bigpond.net.au Wed Dec 31 09:18:41 2003 From: amodra at bigpond.net.au (Alan Modra) Date: Wed, 31 Dec 2003 08:48:41 +1030 Subject: per page execute In-Reply-To: <20031227121524.GA24358@krispykreme> References: <20031227121524.GA24358@krispykreme> Message-ID: <20031230221841.GB22998@bubble.sa.bigpond.net.au> On Sat, Dec 27, 2003 at 11:15:25PM +1100, Anton Blanchard wrote: > [25] .plt NOBITS 10010c08 000c00 0000c0 00 WAX 0 0 4 > [26] .bss NOBITS 10010cc8 000c00 000004 00 WA 0 0 1 > > Look how the non executable bss butts right onto the executable plt. > Even with the patch below, we are failing some security tests that try > and exec stuff out of the bss. Thats because the stuff ends up in the same > page as the plt. Alan, could this be considered a toolchain bug? Possibly. What about .got (exec) and adjacent .sdata (non-exec)? The ABI says that shared libs access .sdata via the got pointer, so there's no hope of separating them. -- Alan Modra IBM OzLabs - Linux Technology Centre ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Wed Dec 31 10:58:36 2003 From: anton at samba.org (Anton Blanchard) Date: Wed, 31 Dec 2003 10:58:36 +1100 Subject: spinlocks In-Reply-To: <20031230221832.GA22998@bubble.sa.bigpond.net.au> References: <20031228052954.GD24358@krispykreme> <20031230221832.GA22998@bubble.sa.bigpond.net.au> Message-ID: <20031230235836.GC28023@krispykreme> Hi, > You might want to restore lr somewhere in there, unless there's > something magic about those FTR_SECTION macros. :) No magic just not enough thought has gone into my code yet :) > Do you really want to tell gcc that all memory is potentially changed > by _raw_spin_lock? Hmm, I guess if you're accessing something > protected by a lock then you want to say that old values of the > "something" are stale. However, I think it would be better to > explicitly say that &lock->lock is an output of the asm, rather than > relying on the "memory" clobber to do that. Yeah we need to force a full gcc memory barrier there. If you think we should add the explicit clobber as well I can, we have a lot of code that does that however (atomic and bitop code). > Also, you might find it a little tricky to write splpar_spinlock. The > problem is that you can't use any registers (since you haven't told > gcc about any), and you'll need to be careful about using the stack. > If _raw_spin_lock is called from a leaf function foo, then gcc may not > set up a stack frame for foo. 
As per the ABI, gcc may use 288 bytes > below r1 as scratch that isn't saved over calls. Since you haven't > told gcc that you're making a call, you need to skip this area if > using the stack in splpar_spinlock. Yeah, I was thinking we force tmp to be an explicit register in the clobbers, then we have something to start from. I'd expect splpar_spinlock will allocate a stack frame and go from there. > I wonder if you wouldn't do better by making _raw_spin_lock a function > written in asm. OK, that would mean the overhead of a function call, > but I reckon many people forget that inline code blows icache, which > probably hurts more.. Well, I'd do that if we could specify clobbers in function prototypes in gcc :) Otherwise the overhead of a function call is reasonably high. Also it makes profiling a bitch when you spend 50% of your time in the spinlock function and have no idea how that is broken up. FYI, enable -ffunction-sections and notice how it takes a few minutes to do the final link stage... The profile looks like (numbers are % of CPU time): 22.9499 ld __udivmoddi4 8.0067 libc-2.3.2.so strcmp 7.8211 ld lang_check_section_addresses 5.3252 ld lang_output_section_find 4.1369 ld gldelf64ppc_place_orphan 3.8997 make (no symbols) 2.7411 libpthread-0.10.so __pthread_alt_unlock 1.2113 libpthread-0.10.so __pthread_alt_lock 1.1746 ld __udivdi3 0.8079 libc-2.3.2.so __ctype_b_loc Ouch, ld really doesn't like 10,000 sections :) GNU ld version 2.14.90 20030814 Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/