From moilanen at austin.ibm.com Tue Nov 1 08:45:40 2005 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Mon, 31 Oct 2005 15:45:40 -0600 Subject: [PATCH 2/2] Export Physical IO base address In-Reply-To: <1130540087.29054.128.camel@gaston> References: <20051028150035.3d1da846.moilanen@austin.ibm.com> <20051028150804.73b5cedb.moilanen@austin.ibm.com> <1130540087.29054.128.camel@gaston> Message-ID: <20051031154540.195b0de0.moilanen@austin.ibm.com> On Sat, 29 Oct 2005 08:54:47 +1000 Benjamin Herrenschmidt wrote: > On Fri, 2005-10-28 at 15:08 -0500, Jake Moilanen wrote: > > This patch exports the physical IO base address so drivers can pick it > > up when using addresses from the device-tree. > > Why ? What is your driver exactly trying to do ? TPM needs to get the base address for IO as an offset into IO space. This base physical address is stored in the reg property in the device-node. To calculate the offset, need to do: TPM_base_phy_addr - io_base_phys. Thus the need to export this address. Jake From arndb at de.ibm.com Tue Nov 1 12:08:38 2005 From: arndb at de.ibm.com (Arnd Bergmann) Date: Mon, 31 Oct 2005 20:08:38 -0500 Subject: [patch 2/5] powerpc: create a new arch/powerpc/platforms/cell/smp.c References: <20051101010836.771791000@localhost> Message-ID: <20051101011133.300238000@localhost> An embedded and charset-unspecified text was scrubbed... Name: cell-smp.diff Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20051031/03409828/attachment.txt From arndb at de.ibm.com Tue Nov 1 12:08:37 2005 From: arndb at de.ibm.com (Arnd Bergmann) Date: Mon, 31 Oct 2005 20:08:37 -0500 Subject: [patch 1/5] powerpc: Rename BPA to Cell References: <20051101010836.771791000@localhost> Message-ID: <20051101011133.134984000@localhost> An embedded and charset-unspecified text was scrubbed... Name: cell-kconfig.diff Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20051031/751fab3b/attachment.txt From arndb at de.ibm.com Tue Nov 1 12:08:36 2005 From: arndb at de.ibm.com (Arnd Bergmann) Date: Mon, 31 Oct 2005 20:08:36 -0500 Subject: [patch 0/5] Move Cell stuff to arch/powerpc Message-ID: <20051101010836.771791000@localhost> As promised, here is my new patch set moving all Cell stuff over to arch/powerpc. Please apply. Arnd <>< From arndb at de.ibm.com Tue Nov 1 12:08:40 2005 From: arndb at de.ibm.com (Arnd Bergmann) Date: Mon, 31 Oct 2005 20:08:40 -0500 Subject: [patch 4/5] powerpc: move mmio_nvram.c over to arch/powerpc References: <20051101010836.771791000@localhost> Message-ID: <20051101011133.623411000@localhost> An embedded and charset-unspecified text was scrubbed... Name: mmio-nvram.diff Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20051031/f47181fa/attachment.txt From arndb at de.ibm.com Tue Nov 1 12:08:39 2005 From: arndb at de.ibm.com (Arnd Bergmann) Date: Mon, 31 Oct 2005 20:08:39 -0500 Subject: [patch 3/5] powerpc: move rtas_fw.c out of platforms/pseries References: <20051101010836.771791000@localhost> Message-ID: <20051101011133.463223000@localhost> An embedded and charset-unspecified text was scrubbed... Name: rtas-flash.diff Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20051031/5894915b/attachment.txt From arndb at de.ibm.com Tue Nov 1 12:08:41 2005 From: arndb at de.ibm.com (Arnd Bergmann) Date: Mon, 31 Oct 2005 20:08:41 -0500 Subject: [patch 5/5] powerpc: move arch/ppc64/kernel/bpa* to arch/powerpc/platforms/cell References: <20051101010836.771791000@localhost> Message-ID: <20051101011133.788778000@localhost> An embedded and charset-unspecified text was scrubbed... Name: cell-platform.diff Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20051031/13827568/attachment.txt From michael at ellerman.id.au Tue Nov 1 10:26:55 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Tue, 1 Nov 2005 10:26:55 +1100 Subject: [patch 2/5] powerpc: create a new arch/powerpc/platforms/cell/smp.c In-Reply-To: <20051101011133.300238000@localhost> References: <20051101010836.771791000@localhost> <20051101011133.300238000@localhost> Message-ID: <200511011026.59266.michael@ellerman.id.au> On Tue, 1 Nov 2005 12:08, Arnd Bergmann wrote: > During the conversion to the merge tree, the Cell specific > SMP initialization was removed from the pSeries code. > > This creates a new Cell specific SMP implementation file. > > Signed-off-by: Arnd Bergmann > > --- > > arch/powerpc/platforms/Makefile | 1 > arch/powerpc/platforms/cell/Makefile | 1 > arch/powerpc/platforms/cell/smp.c | 230 ++++++++++++++++++++ > include/asm-ppc64/smp.h | 1 > 4 files changed, 233 insertions(+) A lot of your smp routines are identical to the pSeries versions. Wouldn't it be preferable to only have one implementation? cheers -- Michael Ellerman IBM OzLabs email: michael:ellerman.id.au inmsg: mpe:jabber.org wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20051101/0d649a53/attachment.pgp From benh at kernel.crashing.org Tue Nov 1 10:27:23 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 01 Nov 2005 10:27:23 +1100 Subject: [PATCH 2/2] Export Physical IO base address In-Reply-To: <20051031154540.195b0de0.moilanen@austin.ibm.com> References: <20051028150035.3d1da846.moilanen@austin.ibm.com> <20051028150804.73b5cedb.moilanen@austin.ibm.com> <1130540087.29054.128.camel@gaston> <20051031154540.195b0de0.moilanen@austin.ibm.com> Message-ID: <1130801243.29054.376.camel@gaston> On Mon, 2005-10-31 at 15:45 -0600, Jake Moilanen wrote: > On Sat, 29 Oct 2005 08:54:47 +1000 > Benjamin Herrenschmidt wrote: > > > On Fri, 2005-10-28 at 15:08 -0500, Jake Moilanen wrote: > > > This patch exports the physical IO base address so drivers can pick it > > > up when using addresses from the device-tree. > > > > Why ? What is your driver exactly trying to do ? > > TPM needs to get the base address for IO as an offset into IO space. > > This base physical address is stored in the reg property in the > device-node. > > To calculate the offset, need to do: TPM_base_phy_addr - io_base_phys. > > Thus the need to export this address. Hrm... that is sooo bogus. If the device-tree exposes a full physical address in "reg", then just use that with ioremap (ignore the fact that it's actually IO space). Additionally, tell the firmware folks to fix their device-tree, this is all very bogus to me. The TPM device is on the LPC bus right ? Thus it should appear below the LPC/ISA bridge and thus get proper address space. It's totally broken to put it anywhere else Ben. From arnd at arndb.de Tue Nov 1 10:50:48 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Tue, 1 Nov 2005 00:50:48 +0100 Subject: [patch 2/5] powerpc: create a new arch/powerpc/platforms/cell/smp.c In-Reply-To: <200511011026.59266.michael@ellerman.id.au> References: <20051101010836.771791000@localhost> <20051101011133.300238000@localhost> <200511011026.59266.michael@ellerman.id.au> Message-ID: <200511010050.48828.arnd@arndb.de> On Dinsdag 01 November 2005 00:26, Michael Ellerman wrote: > A lot of your smp routines are identical to the pSeries versions. Wouldn't it > be preferable to only have one implementation? Yes it would. I'm not sure how that would best be done however. Until 2.6.14, we've just used the pSeries implementation, which does not work any more now that we want to keep the platform stuff in separate directories. One idea might be to split out the rtas calls (startup_cpu, give_timebase, take_timebase) to rtas.c so they can be included by all chrp-descendants. smp_init_cell() can be further simplified under the assumption that we're always SMT and never LPAR, although the latter might change in the future. Arnd <>< From benh at kernel.crashing.org Tue Nov 1 11:05:08 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 01 Nov 2005 11:05:08 +1100 Subject: [PATCH] tpm: support PPC64 hardware In-Reply-To: <1130769479.4882.35.camel@localhost.localdomain> References: <1130769479.4882.35.camel@localhost.localdomain> Message-ID: <1130803508.29054.388.camel@gaston> On Mon, 2005-10-31 at 08:37 -0600, Kylene Jo Hall wrote: > The TPM is discovered differently on PPC64 because the device must be > discovered through the device tree in order to open the proper holes in > the io_page_mask for reading and writing in the low memory space. This > does not happen automatically like most devices because the tpm is not a > normal pci device and lives under the root node. > > This patch contains the necessary changes to the tpm logic. > > This depends on patches submitted by Jake Moilanen (10/28) to allow for > the opening of holes in the io_page_mask for this device. Please submit to the appropriate list (linuxppc64-dev at ozlabs.org). There are some issues with that patch. One, I intend to get rid of the io_page_mask, so that part at least is gone. I don't like the exporting of io_base_phys neither, it's an ugly hack. Other comments inline. > +/* Verify this is a 1.1 Atmel TPM */ > +static int atmel_verify_tpm11(void) > +{ > + struct device_node * dn; > + char *compat; > + int compat_len; > + > + dn = find_devices("tpm"); find_devices() is a deprecated interface. Use the of_find_node_* series and do an of_node_put() once done. > + if (!dn) > + return 1; > + > + compat = (char *) get_property(dn, "compatible", &compat_len); > + if (!compat) > + return 1; > + > + if ( strcmp( compat,"AT97SC3201_r") == 0 ) > + return 0; > + Testing the "compatible" property this way is bogus. Use device_is_compatible(). Or better, use of_find_compatible_node() which allows you to find by type & compatible in one step. > + dn = find_devices("tpm"); Same comment as above. In addition, why do you have to do it twice ? You should rethink your changes. Only one "probe" should be needed that retreives the base addresses. > + if (!dn) > + return 0; > + > + reg = (unsigned int *) get_property(dn, "reg", ®len); > + naddrc = prom_n_addr_cells(dn); > + nsizec = prom_n_size_cells(dn); > + > + for (i = 0; i < reglen; i = i + naddrc + nsizec) { > + > + if (naddrc == 2) > + address = ((unsigned long)reg[i] << 32) | reg[i+1]; > + else > + address = reg[i]; > + > + address = address - pci_io_base_phys; That is bogosity. Your address is an ISA IO address, It should be relative to the parent LPC bus and thus useable as is. It looks like the firmware folks crapped the device-tree. Please check that with them. If they decide to stick with a broken device-tree, then you'll have to consider the address as an MMIO address. That mean you'll have to change the IO accesses of the TPM driver to use the iomap API thus making it immune of IO cs. MMIO distinction. > + allow_isa_address(address, address+size-1); That is going away. Ben. From michael at ellerman.id.au Tue Nov 1 11:13:23 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Tue, 1 Nov 2005 11:13:23 +1100 Subject: [patch 2/5] powerpc: create a new arch/powerpc/platforms/cell/smp.c In-Reply-To: <200511010050.48828.arnd@arndb.de> References: <20051101010836.771791000@localhost> <200511011026.59266.michael@ellerman.id.au> <200511010050.48828.arnd@arndb.de> Message-ID: <200511011113.26609.michael@ellerman.id.au> On Tue, 1 Nov 2005 10:50, Arnd Bergmann wrote: > On Dinsdag 01 November 2005 00:26, Michael Ellerman wrote: > > A lot of your smp routines are identical to the pSeries versions. > > Wouldn't it be preferable to only have one implementation? > > Yes it would. I'm not sure how that would best be done however. Until > 2.6.14, we've just used the pSeries implementation, which does not work any > more now that we want to keep the platform stuff in separate directories. > > One idea might be to split out the rtas calls (startup_cpu, give_timebase, > take_timebase) to rtas.c so they can be included by all chrp-descendants. > smp_init_cell() can be further simplified under the assumption that we're > always SMT and never LPAR, although the latter might change in the future. OK, I'm not sure what the best spot is. arch/powerpc/sysdev is apparently the place for stuff that's not core-kernel but shared between platforms, although maybe smp ops are core, I dunno. cheers -- Michael Ellerman IBM OzLabs email: michael:ellerman.id.au inmsg: mpe:jabber.org wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20051101/dc38ac71/attachment.pgp From david at gibson.dropbear.id.au Tue Nov 1 15:30:26 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Tue, 1 Nov 2005 15:30:26 +1100 Subject: powerpc: Move naca.h to platforms/iseries Message-ID: <20051101043026.GB27961@localhost.localdomain> Paulus, please apply. These days, the NACA only exists on iSeries. Therefore, this patch moves naca.h from include/asm-ppc64 to arch/powerpc/platforms/iseries. There was one file including naca.h outside of platforms/iseries - arch/ppc64/kernel/udbg_scc.c. However, that's obviously a hangover from older days. The include is not necessary, so this patch simply removes it. Built and booted on iSeries, built for G5 (which uses udbg_scc.o). Signed-off-by: David Gibson Index: working-2.6/arch/powerpc/platforms/iseries/lpardata.c =================================================================== --- working-2.6.orig/arch/powerpc/platforms/iseries/lpardata.c 2005-10-31 15:20:20.000000000 +1100 +++ working-2.6/arch/powerpc/platforms/iseries/lpardata.c 2005-11-01 15:15:57.000000000 +1100 @@ -13,7 +13,6 @@ #include #include #include -#include #include #include #include @@ -23,6 +22,7 @@ #include #include +#include "naca.h" #include "vpd_areas.h" #include "spcomm_area.h" #include "ipl_parms.h" Index: working-2.6/arch/powerpc/platforms/iseries/naca.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ working-2.6/arch/powerpc/platforms/iseries/naca.h 2005-11-01 15:28:03.000000000 +1100 @@ -0,0 +1,24 @@ +#ifndef _PLATFORMS_ISERIES_NACA_H +#define _PLATFORMS_ISERIES_NACA_H + +/* + * c 2001 PPC 64 Team, IBM Corp + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ + +#include + +struct naca_struct { + /* Kernel only data - undefined for user space */ + void *xItVpdAreas; /* VPD Data 0x00 */ + void *xRamDisk; /* iSeries ramdisk 0x08 */ + u64 xRamDiskSize; /* In pages 0x10 */ +}; + +extern struct naca_struct naca; + +#endif /* _PLATFORMS_ISERIES_NACA_H */ Index: working-2.6/arch/powerpc/platforms/iseries/release_data.h =================================================================== --- working-2.6.orig/arch/powerpc/platforms/iseries/release_data.h 2005-10-31 15:20:20.000000000 +1100 +++ working-2.6/arch/powerpc/platforms/iseries/release_data.h 2005-11-01 15:15:57.000000000 +1100 @@ -24,7 +24,7 @@ * address of the OS's NACA). */ #include -#include +#include "naca.h" /* * When we IPL a secondary partition, we will check if if the Index: working-2.6/arch/powerpc/platforms/iseries/setup.c =================================================================== --- working-2.6.orig/arch/powerpc/platforms/iseries/setup.c 2005-10-31 15:44:59.000000000 +1100 +++ working-2.6/arch/powerpc/platforms/iseries/setup.c 2005-11-01 15:15:57.000000000 +1100 @@ -40,7 +40,6 @@ #include #include -#include #include #include #include @@ -53,6 +52,7 @@ #include #include +#include "naca.h" #include "setup.h" #include "irq.h" #include "vpd_areas.h" Index: working-2.6/arch/ppc64/kernel/udbg_scc.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/udbg_scc.c 2005-10-25 11:59:53.000000000 +1000 +++ working-2.6/arch/ppc64/kernel/udbg_scc.c 2005-11-01 15:15:57.000000000 +1100 @@ -12,7 +12,6 @@ #include #include #include -#include #include #include #include Index: working-2.6/include/asm-ppc64/naca.h =================================================================== --- working-2.6.orig/include/asm-ppc64/naca.h 2005-10-25 11:59:59.000000000 +1000 +++ /dev/null 1970-01-01 00:00:00.000000000 +0000 @@ -1,24 +0,0 @@ -#ifndef _NACA_H -#define _NACA_H - -/* - * c 2001 PPC 64 Team, IBM Corp - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - */ - -#include - -struct naca_struct { - /* Kernel only data - undefined for user space */ - void *xItVpdAreas; /* VPD Data 0x00 */ - void *xRamDisk; /* iSeries ramdisk 0x08 */ - u64 xRamDiskSize; /* In pages 0x10 */ -}; - -extern struct naca_struct naca; - -#endif /* _NACA_H */ -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson From david at gibson.dropbear.id.au Tue Nov 1 16:53:24 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Tue, 1 Nov 2005 16:53:24 +1100 Subject: powerpc: Merge ipcbuf.h Message-ID: <20051101055324.GA3551@localhost.localdomain> Paulus, please apply. This patch merges ppc32 and ppc64 versions of ipcbuf.h. The merge is essentially trivial, since the structure defined in each version was already identical. Only wrinkle is that the merged version now includes linux/types.h in order to get the fixed width integer types. In fact, the old versions probably should have been including that anyway, since the file uses various __kernel_*_t types. Built and booted on G5, built for 32-bit pmac, but not booted, since the merge tree currently doesn't boot there. Signed-off-by: David Gibson Index: working-2.6/include/asm-powerpc/ipcbuf.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ working-2.6/include/asm-powerpc/ipcbuf.h 2005-11-01 15:44:01.000000000 +1100 @@ -0,0 +1,34 @@ +#ifndef _ASM_POWERPC_IPCBUF_H +#define _ASM_POWERPC_IPCBUF_H + +/* + * The ipc64_perm structure for the powerpc is identical to + * kern_ipc_perm as we have always had 32-bit UIDs and GIDs in the + * kernel. Note extra padding because this structure is passed back + * and forth between kernel and user space. Pad space is left for: + * - 1 32-bit value to fill up for 8-byte alignment + * - 2 miscellaneous 64-bit values + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ + +#include + +struct ipc64_perm +{ + __kernel_key_t key; + __kernel_uid_t uid; + __kernel_gid_t gid; + __kernel_uid_t cuid; + __kernel_gid_t cgid; + __kernel_mode_t mode; + unsigned int seq; + unsigned int __pad1; + u64 __unused1; + u64 __unused2; +}; + +#endif /* _ASM_POWERPC_IPCBUF_H */ Index: working-2.6/include/asm-ppc64/ipcbuf.h =================================================================== --- working-2.6.orig/include/asm-ppc64/ipcbuf.h 2005-10-25 11:59:59.000000000 +1000 +++ /dev/null 1970-01-01 00:00:00.000000000 +0000 @@ -1,28 +0,0 @@ -#ifndef __PPC64_IPCBUF_H__ -#define __PPC64_IPCBUF_H__ - -/* - * The ipc64_perm structure for the PPC is identical to kern_ipc_perm - * as we have always had 32-bit UIDs and GIDs in the kernel. - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - */ - -struct ipc64_perm -{ - __kernel_key_t key; - __kernel_uid_t uid; - __kernel_gid_t gid; - __kernel_uid_t cuid; - __kernel_gid_t cgid; - __kernel_mode_t mode; - unsigned int seq; - unsigned int __pad1; - unsigned long __unused1; - unsigned long __unused2; -}; - -#endif /* __PPC64_IPCBUF_H__ */ Index: working-2.6/include/asm-ppc/ipcbuf.h =================================================================== --- working-2.6.orig/include/asm-ppc/ipcbuf.h 2005-10-25 11:59:59.000000000 +1000 +++ /dev/null 1970-01-01 00:00:00.000000000 +0000 @@ -1,29 +0,0 @@ -#ifndef __PPC_IPCBUF_H__ -#define __PPC_IPCBUF_H__ - -/* - * The ipc64_perm structure for PPC architecture. - * Note extra padding because this structure is passed back and forth - * between kernel and user space. - * - * Pad space is left for: - * - 1 32-bit value to fill up for 8-byte alignment - * - 2 miscellaneous 64-bit values (so that this structure matches - * PPC64 ipc64_perm) - */ - -struct ipc64_perm -{ - __kernel_key_t key; - __kernel_uid_t uid; - __kernel_gid_t gid; - __kernel_uid_t cuid; - __kernel_gid_t cgid; - __kernel_mode_t mode; - unsigned long seq; - unsigned int __pad2; - unsigned long long __unused1; - unsigned long long __unused2; -}; - -#endif /* __PPC_IPCBUF_H__ */ -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson From david at gibson.dropbear.id.au Tue Nov 1 17:28:10 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Tue, 1 Nov 2005 17:28:10 +1100 Subject: powerpc: Merge bitops.h In-Reply-To: <20051031064823.GD6622@localhost.localdomain> References: <20051031064823.GD6622@localhost.localdomain> Message-ID: <20051101062810.GC3551@localhost.localdomain> Here's a revised version. This re-introduces the set_bits() function from ppc64, which I removed because I thought it was unused (it exists on no other arch). In fact it is used in the powermac interrupt code (but not on pSeries). This seems to be running fine on my G5 (ARCH=powerpc), but it still hasn't been tested on 32-bit, which should probably happen before merging. - We use LARXL/STCXL macros to generate the right (32 or 64 bit) instructions, similar to LDL/STL from ppc_asm.h, used in fpu.S - ppc32 previously used a full "sync" barrier at the end of test_and_*_bit(), whereas ppc64 used an "isync". The merged version uses "isync", since I believe that's sufficient. - The ppc64 versions of then minix_*() bitmap functions have changed semantics. Previously on ppc64, these functions were big-endian (that is bit 0 was the LSB in the first 64-bit, big-endian word). On ppc32 (and x86, for that matter, they were little-endian. As far as I can tell, the big-endian usage was simply wrong - I guess no-one ever tried to use minixfs on ppc64. - On ppc32 find_next_bit() and find_next_zero_bit() are no longer inline (they were already out-of-line on ppc64). - For ppc64, sched_find_first_bit() has moved from mmu_context.h to the merged bitops. What it was doing in mmu_context.h in the first place, I have no idea. - The fls() function is now implemented using the cntlzw instruction on ppc64, instead of generic_fls(), as it already was on ppc32. - For ARCH=ppc, this patch requires adding arch/powerpc/lib to the arch/ppc/Makefile. This in turn requires some changes to arch/powerpc/lib/Makefile which didn't correctly handle ARCH=ppc. Built and running on G5. Signed-off-by: David Gibson Index: working-2.6/include/asm-ppc/bitops.h =================================================================== --- working-2.6.orig/include/asm-ppc/bitops.h 2005-10-25 11:59:59.000000000 +1000 +++ /dev/null 1970-01-01 00:00:00.000000000 +0000 @@ -1,460 +0,0 @@ -/* - * bitops.h: Bit string operations on the ppc - */ - -#ifdef __KERNEL__ -#ifndef _PPC_BITOPS_H -#define _PPC_BITOPS_H - -#include -#include -#include -#include - -/* - * The test_and_*_bit operations are taken to imply a memory barrier - * on SMP systems. - */ -#ifdef CONFIG_SMP -#define SMP_WMB "eieio\n" -#define SMP_MB "\nsync" -#else -#define SMP_WMB -#define SMP_MB -#endif /* CONFIG_SMP */ - -static __inline__ void set_bit(int nr, volatile unsigned long * addr) -{ - unsigned long old; - unsigned long mask = 1 << (nr & 0x1f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 5); - - __asm__ __volatile__("\n\ -1: lwarx %0,0,%3 \n\ - or %0,%0,%2 \n" - PPC405_ERR77(0,%3) -" stwcx. %0,0,%3 \n\ - bne- 1b" - : "=&r" (old), "=m" (*p) - : "r" (mask), "r" (p), "m" (*p) - : "cc" ); -} - -/* - * non-atomic version - */ -static __inline__ void __set_bit(int nr, volatile unsigned long *addr) -{ - unsigned long mask = 1 << (nr & 0x1f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 5); - - *p |= mask; -} - -/* - * clear_bit doesn't imply a memory barrier - */ -#define smp_mb__before_clear_bit() smp_mb() -#define smp_mb__after_clear_bit() smp_mb() - -static __inline__ void clear_bit(int nr, volatile unsigned long *addr) -{ - unsigned long old; - unsigned long mask = 1 << (nr & 0x1f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 5); - - __asm__ __volatile__("\n\ -1: lwarx %0,0,%3 \n\ - andc %0,%0,%2 \n" - PPC405_ERR77(0,%3) -" stwcx. %0,0,%3 \n\ - bne- 1b" - : "=&r" (old), "=m" (*p) - : "r" (mask), "r" (p), "m" (*p) - : "cc"); -} - -/* - * non-atomic version - */ -static __inline__ void __clear_bit(int nr, volatile unsigned long *addr) -{ - unsigned long mask = 1 << (nr & 0x1f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 5); - - *p &= ~mask; -} - -static __inline__ void change_bit(int nr, volatile unsigned long *addr) -{ - unsigned long old; - unsigned long mask = 1 << (nr & 0x1f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 5); - - __asm__ __volatile__("\n\ -1: lwarx %0,0,%3 \n\ - xor %0,%0,%2 \n" - PPC405_ERR77(0,%3) -" stwcx. %0,0,%3 \n\ - bne- 1b" - : "=&r" (old), "=m" (*p) - : "r" (mask), "r" (p), "m" (*p) - : "cc"); -} - -/* - * non-atomic version - */ -static __inline__ void __change_bit(int nr, volatile unsigned long *addr) -{ - unsigned long mask = 1 << (nr & 0x1f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 5); - - *p ^= mask; -} - -/* - * test_and_*_bit do imply a memory barrier (?) - */ -static __inline__ int test_and_set_bit(int nr, volatile unsigned long *addr) -{ - unsigned int old, t; - unsigned int mask = 1 << (nr & 0x1f); - volatile unsigned int *p = ((volatile unsigned int *)addr) + (nr >> 5); - - __asm__ __volatile__(SMP_WMB "\n\ -1: lwarx %0,0,%4 \n\ - or %1,%0,%3 \n" - PPC405_ERR77(0,%4) -" stwcx. %1,0,%4 \n\ - bne 1b" - SMP_MB - : "=&r" (old), "=&r" (t), "=m" (*p) - : "r" (mask), "r" (p), "m" (*p) - : "cc", "memory"); - - return (old & mask) != 0; -} - -/* - * non-atomic version - */ -static __inline__ int __test_and_set_bit(int nr, volatile unsigned long *addr) -{ - unsigned long mask = 1 << (nr & 0x1f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 5); - unsigned long old = *p; - - *p = old | mask; - return (old & mask) != 0; -} - -static __inline__ int test_and_clear_bit(int nr, volatile unsigned long *addr) -{ - unsigned int old, t; - unsigned int mask = 1 << (nr & 0x1f); - volatile unsigned int *p = ((volatile unsigned int *)addr) + (nr >> 5); - - __asm__ __volatile__(SMP_WMB "\n\ -1: lwarx %0,0,%4 \n\ - andc %1,%0,%3 \n" - PPC405_ERR77(0,%4) -" stwcx. %1,0,%4 \n\ - bne 1b" - SMP_MB - : "=&r" (old), "=&r" (t), "=m" (*p) - : "r" (mask), "r" (p), "m" (*p) - : "cc", "memory"); - - return (old & mask) != 0; -} - -/* - * non-atomic version - */ -static __inline__ int __test_and_clear_bit(int nr, volatile unsigned long *addr) -{ - unsigned long mask = 1 << (nr & 0x1f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 5); - unsigned long old = *p; - - *p = old & ~mask; - return (old & mask) != 0; -} - -static __inline__ int test_and_change_bit(int nr, volatile unsigned long *addr) -{ - unsigned int old, t; - unsigned int mask = 1 << (nr & 0x1f); - volatile unsigned int *p = ((volatile unsigned int *)addr) + (nr >> 5); - - __asm__ __volatile__(SMP_WMB "\n\ -1: lwarx %0,0,%4 \n\ - xor %1,%0,%3 \n" - PPC405_ERR77(0,%4) -" stwcx. %1,0,%4 \n\ - bne 1b" - SMP_MB - : "=&r" (old), "=&r" (t), "=m" (*p) - : "r" (mask), "r" (p), "m" (*p) - : "cc", "memory"); - - return (old & mask) != 0; -} - -/* - * non-atomic version - */ -static __inline__ int __test_and_change_bit(int nr, volatile unsigned long *addr) -{ - unsigned long mask = 1 << (nr & 0x1f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 5); - unsigned long old = *p; - - *p = old ^ mask; - return (old & mask) != 0; -} - -static __inline__ int test_bit(int nr, __const__ volatile unsigned long *addr) -{ - return ((addr[nr >> 5] >> (nr & 0x1f)) & 1) != 0; -} - -/* Return the bit position of the most significant 1 bit in a word */ -static __inline__ int __ilog2(unsigned long x) -{ - int lz; - - asm ("cntlzw %0,%1" : "=r" (lz) : "r" (x)); - return 31 - lz; -} - -static __inline__ int ffz(unsigned long x) -{ - if ((x = ~x) == 0) - return 32; - return __ilog2(x & -x); -} - -static inline int __ffs(unsigned long x) -{ - return __ilog2(x & -x); -} - -/* - * ffs: find first bit set. This is defined the same way as - * the libc and compiler builtin ffs routines, therefore - * differs in spirit from the above ffz (man ffs). - */ -static __inline__ int ffs(int x) -{ - return __ilog2(x & -x) + 1; -} - -/* - * fls: find last (most-significant) bit set. - * Note fls(0) = 0, fls(1) = 1, fls(0x80000000) = 32. - */ -static __inline__ int fls(unsigned int x) -{ - int lz; - - asm ("cntlzw %0,%1" : "=r" (lz) : "r" (x)); - return 32 - lz; -} - -/* - * hweightN: returns the hamming weight (i.e. the number - * of bits set) of a N-bit word - */ - -#define hweight32(x) generic_hweight32(x) -#define hweight16(x) generic_hweight16(x) -#define hweight8(x) generic_hweight8(x) - -/* - * Find the first bit set in a 140-bit bitmap. - * The first 100 bits are unlikely to be set. - */ -static inline int sched_find_first_bit(const unsigned long *b) -{ - if (unlikely(b[0])) - return __ffs(b[0]); - if (unlikely(b[1])) - return __ffs(b[1]) + 32; - if (unlikely(b[2])) - return __ffs(b[2]) + 64; - if (b[3]) - return __ffs(b[3]) + 96; - return __ffs(b[4]) + 128; -} - -/** - * find_next_bit - find the next set bit in a memory region - * @addr: The address to base the search on - * @offset: The bitnumber to start searching at - * @size: The maximum size to search - */ -static __inline__ unsigned long find_next_bit(const unsigned long *addr, - unsigned long size, unsigned long offset) -{ - unsigned int *p = ((unsigned int *) addr) + (offset >> 5); - unsigned int result = offset & ~31UL; - unsigned int tmp; - - if (offset >= size) - return size; - size -= result; - offset &= 31UL; - if (offset) { - tmp = *p++; - tmp &= ~0UL << offset; - if (size < 32) - goto found_first; - if (tmp) - goto found_middle; - size -= 32; - result += 32; - } - while (size >= 32) { - if ((tmp = *p++) != 0) - goto found_middle; - result += 32; - size -= 32; - } - if (!size) - return result; - tmp = *p; - -found_first: - tmp &= ~0UL >> (32 - size); - if (tmp == 0UL) /* Are any bits set? */ - return result + size; /* Nope. */ -found_middle: - return result + __ffs(tmp); -} - -/** - * find_first_bit - find the first set bit in a memory region - * @addr: The address to start the search at - * @size: The maximum size to search - * - * Returns the bit-number of the first set bit, not the number of the byte - * containing a bit. - */ -#define find_first_bit(addr, size) \ - find_next_bit((addr), (size), 0) - -/* - * This implementation of find_{first,next}_zero_bit was stolen from - * Linus' asm-alpha/bitops.h. - */ -#define find_first_zero_bit(addr, size) \ - find_next_zero_bit((addr), (size), 0) - -static __inline__ unsigned long find_next_zero_bit(const unsigned long *addr, - unsigned long size, unsigned long offset) -{ - unsigned int * p = ((unsigned int *) addr) + (offset >> 5); - unsigned int result = offset & ~31UL; - unsigned int tmp; - - if (offset >= size) - return size; - size -= result; - offset &= 31UL; - if (offset) { - tmp = *p++; - tmp |= ~0UL >> (32-offset); - if (size < 32) - goto found_first; - if (tmp != ~0U) - goto found_middle; - size -= 32; - result += 32; - } - while (size >= 32) { - if ((tmp = *p++) != ~0U) - goto found_middle; - result += 32; - size -= 32; - } - if (!size) - return result; - tmp = *p; -found_first: - tmp |= ~0UL << size; - if (tmp == ~0UL) /* Are any bits zero? */ - return result + size; /* Nope. */ -found_middle: - return result + ffz(tmp); -} - - -#define ext2_set_bit(nr, addr) __test_and_set_bit((nr) ^ 0x18, (unsigned long *)(addr)) -#define ext2_set_bit_atomic(lock, nr, addr) test_and_set_bit((nr) ^ 0x18, (unsigned long *)(addr)) -#define ext2_clear_bit(nr, addr) __test_and_clear_bit((nr) ^ 0x18, (unsigned long *)(addr)) -#define ext2_clear_bit_atomic(lock, nr, addr) test_and_clear_bit((nr) ^ 0x18, (unsigned long *)(addr)) - -static __inline__ int ext2_test_bit(int nr, __const__ void * addr) -{ - __const__ unsigned char *ADDR = (__const__ unsigned char *) addr; - - return (ADDR[nr >> 3] >> (nr & 7)) & 1; -} - -/* - * This implementation of ext2_find_{first,next}_zero_bit was stolen from - * Linus' asm-alpha/bitops.h and modified for a big-endian machine. - */ - -#define ext2_find_first_zero_bit(addr, size) \ - ext2_find_next_zero_bit((addr), (size), 0) - -static __inline__ unsigned long ext2_find_next_zero_bit(const void *addr, - unsigned long size, unsigned long offset) -{ - unsigned int *p = ((unsigned int *) addr) + (offset >> 5); - unsigned int result = offset & ~31UL; - unsigned int tmp; - - if (offset >= size) - return size; - size -= result; - offset &= 31UL; - if (offset) { - tmp = cpu_to_le32p(p++); - tmp |= ~0UL >> (32-offset); - if (size < 32) - goto found_first; - if (tmp != ~0U) - goto found_middle; - size -= 32; - result += 32; - } - while (size >= 32) { - if ((tmp = cpu_to_le32p(p++)) != ~0U) - goto found_middle; - result += 32; - size -= 32; - } - if (!size) - return result; - tmp = cpu_to_le32p(p); -found_first: - tmp |= ~0U << size; - if (tmp == ~0UL) /* Are any bits zero? */ - return result + size; /* Nope. */ -found_middle: - return result + ffz(tmp); -} - -/* Bitmap functions for the minix filesystem. */ -#define minix_test_and_set_bit(nr,addr) ext2_set_bit(nr,addr) -#define minix_set_bit(nr,addr) ((void)ext2_set_bit(nr,addr)) -#define minix_test_and_clear_bit(nr,addr) ext2_clear_bit(nr,addr) -#define minix_test_bit(nr,addr) ext2_test_bit(nr,addr) -#define minix_find_first_zero_bit(addr,size) ext2_find_first_zero_bit(addr,size) - -#endif /* _PPC_BITOPS_H */ -#endif /* __KERNEL__ */ Index: working-2.6/include/asm-ppc64/bitops.h =================================================================== --- working-2.6.orig/include/asm-ppc64/bitops.h 2005-10-31 15:20:22.000000000 +1100 +++ /dev/null 1970-01-01 00:00:00.000000000 +0000 @@ -1,360 +0,0 @@ -/* - * PowerPC64 atomic bit operations. - * Dave Engebretsen, Todd Inglett, Don Reed, Pat McCarthy, Peter Bergner, - * Anton Blanchard - * - * Originally taken from the 32b PPC code. Modified to use 64b values for - * the various counters & memory references. - * - * Bitops are odd when viewed on big-endian systems. They were designed - * on little endian so the size of the bitset doesn't matter (low order bytes - * come first) as long as the bit in question is valid. - * - * Bits are "tested" often using the C expression (val & (1< - -/* - * clear_bit doesn't imply a memory barrier - */ -#define smp_mb__before_clear_bit() smp_mb() -#define smp_mb__after_clear_bit() smp_mb() - -static __inline__ int test_bit(unsigned long nr, __const__ volatile unsigned long *addr) -{ - return (1UL & (addr[nr >> 6] >> (nr & 63))); -} - -static __inline__ void set_bit(unsigned long nr, volatile unsigned long *addr) -{ - unsigned long old; - unsigned long mask = 1UL << (nr & 0x3f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 6); - - __asm__ __volatile__( -"1: ldarx %0,0,%3 # set_bit\n\ - or %0,%0,%2\n\ - stdcx. %0,0,%3\n\ - bne- 1b" - : "=&r" (old), "=m" (*p) - : "r" (mask), "r" (p), "m" (*p) - : "cc"); -} - -static __inline__ void clear_bit(unsigned long nr, volatile unsigned long *addr) -{ - unsigned long old; - unsigned long mask = 1UL << (nr & 0x3f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 6); - - __asm__ __volatile__( -"1: ldarx %0,0,%3 # clear_bit\n\ - andc %0,%0,%2\n\ - stdcx. %0,0,%3\n\ - bne- 1b" - : "=&r" (old), "=m" (*p) - : "r" (mask), "r" (p), "m" (*p) - : "cc"); -} - -static __inline__ void change_bit(unsigned long nr, volatile unsigned long *addr) -{ - unsigned long old; - unsigned long mask = 1UL << (nr & 0x3f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 6); - - __asm__ __volatile__( -"1: ldarx %0,0,%3 # change_bit\n\ - xor %0,%0,%2\n\ - stdcx. %0,0,%3\n\ - bne- 1b" - : "=&r" (old), "=m" (*p) - : "r" (mask), "r" (p), "m" (*p) - : "cc"); -} - -static __inline__ int test_and_set_bit(unsigned long nr, volatile unsigned long *addr) -{ - unsigned long old, t; - unsigned long mask = 1UL << (nr & 0x3f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 6); - - __asm__ __volatile__( - EIEIO_ON_SMP -"1: ldarx %0,0,%3 # test_and_set_bit\n\ - or %1,%0,%2 \n\ - stdcx. %1,0,%3 \n\ - bne- 1b" - ISYNC_ON_SMP - : "=&r" (old), "=&r" (t) - : "r" (mask), "r" (p) - : "cc", "memory"); - - return (old & mask) != 0; -} - -static __inline__ int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr) -{ - unsigned long old, t; - unsigned long mask = 1UL << (nr & 0x3f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 6); - - __asm__ __volatile__( - EIEIO_ON_SMP -"1: ldarx %0,0,%3 # test_and_clear_bit\n\ - andc %1,%0,%2\n\ - stdcx. %1,0,%3\n\ - bne- 1b" - ISYNC_ON_SMP - : "=&r" (old), "=&r" (t) - : "r" (mask), "r" (p) - : "cc", "memory"); - - return (old & mask) != 0; -} - -static __inline__ int test_and_change_bit(unsigned long nr, volatile unsigned long *addr) -{ - unsigned long old, t; - unsigned long mask = 1UL << (nr & 0x3f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 6); - - __asm__ __volatile__( - EIEIO_ON_SMP -"1: ldarx %0,0,%3 # test_and_change_bit\n\ - xor %1,%0,%2\n\ - stdcx. %1,0,%3\n\ - bne- 1b" - ISYNC_ON_SMP - : "=&r" (old), "=&r" (t) - : "r" (mask), "r" (p) - : "cc", "memory"); - - return (old & mask) != 0; -} - -static __inline__ void set_bits(unsigned long mask, unsigned long *addr) -{ - unsigned long old; - - __asm__ __volatile__( -"1: ldarx %0,0,%3 # set_bit\n\ - or %0,%0,%2\n\ - stdcx. %0,0,%3\n\ - bne- 1b" - : "=&r" (old), "=m" (*addr) - : "r" (mask), "r" (addr), "m" (*addr) - : "cc"); -} - -/* - * non-atomic versions - */ -static __inline__ void __set_bit(unsigned long nr, volatile unsigned long *addr) -{ - unsigned long mask = 1UL << (nr & 0x3f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 6); - - *p |= mask; -} - -static __inline__ void __clear_bit(unsigned long nr, volatile unsigned long *addr) -{ - unsigned long mask = 1UL << (nr & 0x3f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 6); - - *p &= ~mask; -} - -static __inline__ void __change_bit(unsigned long nr, volatile unsigned long *addr) -{ - unsigned long mask = 1UL << (nr & 0x3f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 6); - - *p ^= mask; -} - -static __inline__ int __test_and_set_bit(unsigned long nr, volatile unsigned long *addr) -{ - unsigned long mask = 1UL << (nr & 0x3f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 6); - unsigned long old = *p; - - *p = old | mask; - return (old & mask) != 0; -} - -static __inline__ int __test_and_clear_bit(unsigned long nr, volatile unsigned long *addr) -{ - unsigned long mask = 1UL << (nr & 0x3f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 6); - unsigned long old = *p; - - *p = old & ~mask; - return (old & mask) != 0; -} - -static __inline__ int __test_and_change_bit(unsigned long nr, volatile unsigned long *addr) -{ - unsigned long mask = 1UL << (nr & 0x3f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 6); - unsigned long old = *p; - - *p = old ^ mask; - return (old & mask) != 0; -} - -/* - * Return the zero-based bit position (from RIGHT TO LEFT, 63 -> 0) of the - * most significant (left-most) 1-bit in a double word. - */ -static __inline__ int __ilog2(unsigned long x) -{ - int lz; - - asm ("cntlzd %0,%1" : "=r" (lz) : "r" (x)); - return 63 - lz; -} - -/* - * Determines the bit position of the least significant (rightmost) 0 bit - * in the specified double word. The returned bit position will be zero-based, - * starting from the right side (63 - 0). - */ -static __inline__ unsigned long ffz(unsigned long x) -{ - /* no zero exists anywhere in the 8 byte area. */ - if ((x = ~x) == 0) - return 64; - - /* - * Calculate the bit position of the least signficant '1' bit in x - * (since x has been changed this will actually be the least signficant - * '0' bit in * the original x). Note: (x & -x) gives us a mask that - * is the least significant * (RIGHT-most) 1-bit of the value in x. - */ - return __ilog2(x & -x); -} - -static __inline__ int __ffs(unsigned long x) -{ - return __ilog2(x & -x); -} - -/* - * ffs: find first bit set. This is defined the same way as - * the libc and compiler builtin ffs routines, therefore - * differs in spirit from the above ffz (man ffs). - */ -static __inline__ int ffs(int x) -{ - unsigned long i = (unsigned long)x; - return __ilog2(i & -i) + 1; -} - -/* - * fls: find last (most-significant) bit set. - * Note fls(0) = 0, fls(1) = 1, fls(0x80000000) = 32. - */ -#define fls(x) generic_fls(x) - -/* - * hweightN: returns the hamming weight (i.e. the number - * of bits set) of a N-bit word - */ -#define hweight64(x) generic_hweight64(x) -#define hweight32(x) generic_hweight32(x) -#define hweight16(x) generic_hweight16(x) -#define hweight8(x) generic_hweight8(x) - -extern unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size, unsigned long offset); -#define find_first_zero_bit(addr, size) \ - find_next_zero_bit((addr), (size), 0) - -extern unsigned long find_next_bit(const unsigned long *addr, unsigned long size, unsigned long offset); -#define find_first_bit(addr, size) \ - find_next_bit((addr), (size), 0) - -extern unsigned long find_next_zero_le_bit(const unsigned long *addr, unsigned long size, unsigned long offset); -#define find_first_zero_le_bit(addr, size) \ - find_next_zero_le_bit((addr), (size), 0) - -static __inline__ int test_le_bit(unsigned long nr, __const__ unsigned long * addr) -{ - __const__ unsigned char *ADDR = (__const__ unsigned char *) addr; - return (ADDR[nr >> 3] >> (nr & 7)) & 1; -} - -#define test_and_clear_le_bit(nr, addr) \ - test_and_clear_bit((nr) ^ 0x38, (addr)) -#define test_and_set_le_bit(nr, addr) \ - test_and_set_bit((nr) ^ 0x38, (addr)) - -/* - * non-atomic versions - */ - -#define __set_le_bit(nr, addr) \ - __set_bit((nr) ^ 0x38, (addr)) -#define __clear_le_bit(nr, addr) \ - __clear_bit((nr) ^ 0x38, (addr)) -#define __test_and_clear_le_bit(nr, addr) \ - __test_and_clear_bit((nr) ^ 0x38, (addr)) -#define __test_and_set_le_bit(nr, addr) \ - __test_and_set_bit((nr) ^ 0x38, (addr)) - -#define ext2_set_bit(nr,addr) \ - __test_and_set_le_bit((nr), (unsigned long*)addr) -#define ext2_clear_bit(nr, addr) \ - __test_and_clear_le_bit((nr), (unsigned long*)addr) - -#define ext2_set_bit_atomic(lock, nr, addr) \ - test_and_set_le_bit((nr), (unsigned long*)addr) -#define ext2_clear_bit_atomic(lock, nr, addr) \ - test_and_clear_le_bit((nr), (unsigned long*)addr) - - -#define ext2_test_bit(nr, addr) test_le_bit((nr),(unsigned long*)addr) -#define ext2_find_first_zero_bit(addr, size) \ - find_first_zero_le_bit((unsigned long*)addr, size) -#define ext2_find_next_zero_bit(addr, size, off) \ - find_next_zero_le_bit((unsigned long*)addr, size, off) - -#define minix_test_and_set_bit(nr,addr) test_and_set_bit(nr,addr) -#define minix_set_bit(nr,addr) set_bit(nr,addr) -#define minix_test_and_clear_bit(nr,addr) test_and_clear_bit(nr,addr) -#define minix_test_bit(nr,addr) test_bit(nr,addr) -#define minix_find_first_zero_bit(addr,size) find_first_zero_bit(addr,size) - -#endif /* __KERNEL__ */ -#endif /* _PPC64_BITOPS_H */ Index: working-2.6/arch/powerpc/lib/bitops.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ working-2.6/arch/powerpc/lib/bitops.c 2005-11-01 15:51:56.000000000 +1100 @@ -0,0 +1,150 @@ +#include +#include +#include +#include + +/** + * find_next_bit - find the next set bit in a memory region + * @addr: The address to base the search on + * @offset: The bitnumber to start searching at + * @size: The maximum size to search + */ +unsigned long find_next_bit(const unsigned long *addr, unsigned long size, + unsigned long offset) +{ + const unsigned long *p = addr + BITOP_WORD(offset); + unsigned long result = offset & ~(BITS_PER_LONG-1); + unsigned long tmp; + + if (offset >= size) + return size; + size -= result; + offset %= BITS_PER_LONG; + if (offset) { + tmp = *(p++); + tmp &= (~0UL << offset); + if (size < BITS_PER_LONG) + goto found_first; + if (tmp) + goto found_middle; + size -= BITS_PER_LONG; + result += BITS_PER_LONG; + } + while (size & ~(BITS_PER_LONG-1)) { + if ((tmp = *(p++))) + goto found_middle; + result += BITS_PER_LONG; + size -= BITS_PER_LONG; + } + if (!size) + return result; + tmp = *p; + +found_first: + tmp &= (~0UL >> (64 - size)); + if (tmp == 0UL) /* Are any bits set? */ + return result + size; /* Nope. */ +found_middle: + return result + __ffs(tmp); +} +EXPORT_SYMBOL(find_next_bit); + +/* + * This implementation of find_{first,next}_zero_bit was stolen from + * Linus' asm-alpha/bitops.h. + */ +unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size, + unsigned long offset) +{ + const unsigned long *p = addr + BITOP_WORD(offset); + unsigned long result = offset & ~(BITS_PER_LONG-1); + unsigned long tmp; + + if (offset >= size) + return size; + size -= result; + offset %= BITS_PER_LONG; + if (offset) { + tmp = *(p++); + tmp |= ~0UL >> (BITS_PER_LONG - offset); + if (size < BITS_PER_LONG) + goto found_first; + if (~tmp) + goto found_middle; + size -= BITS_PER_LONG; + result += BITS_PER_LONG; + } + while (size & ~(BITS_PER_LONG-1)) { + if (~(tmp = *(p++))) + goto found_middle; + result += BITS_PER_LONG; + size -= BITS_PER_LONG; + } + if (!size) + return result; + tmp = *p; + +found_first: + tmp |= ~0UL << size; + if (tmp == ~0UL) /* Are any bits zero? */ + return result + size; /* Nope. */ +found_middle: + return result + ffz(tmp); +} +EXPORT_SYMBOL(find_next_zero_bit); + +static inline unsigned int ext2_ilog2(unsigned int x) +{ + int lz; + + asm("cntlzw %0,%1": "=r"(lz):"r"(x)); + return 31 - lz; +} + +static inline unsigned int ext2_ffz(unsigned int x) +{ + u32 rc; + if ((x = ~x) == 0) + return 32; + rc = ext2_ilog2(x & -x); + return rc; +} + +unsigned long find_next_zero_le_bit(const unsigned long *addr, + unsigned long size, unsigned long offset) +{ + const unsigned int *p = ((const unsigned int *)addr) + (offset >> 5); + unsigned int result = offset & ~31; + unsigned int tmp; + + if (offset >= size) + return size; + size -= result; + offset &= 31; + if (offset) { + tmp = cpu_to_le32p(p++); + tmp |= ~0U >> (32 - offset); /* bug or feature ? */ + if (size < 32) + goto found_first; + if (tmp != ~0) + goto found_middle; + size -= 32; + result += 32; + } + while (size >= 32) { + if ((tmp = cpu_to_le32p(p++)) != ~0) + goto found_middle; + result += 32; + size -= 32; + } + if (!size) + return result; + tmp = cpu_to_le32p(p); +found_first: + tmp |= ~0 << size; + if (tmp == ~0) /* Are any bits zero? */ + return result + size; /* Nope. */ +found_middle: + return result + ext2_ffz(tmp); +} +EXPORT_SYMBOL(find_next_zero_le_bit); Index: working-2.6/arch/ppc64/kernel/bitops.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/bitops.c 2005-10-25 11:59:53.000000000 +1000 +++ /dev/null 1970-01-01 00:00:00.000000000 +0000 @@ -1,147 +0,0 @@ -/* - * These are too big to be inlined. - */ - -#include -#include -#include -#include - -unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size, - unsigned long offset) -{ - const unsigned long *p = addr + (offset >> 6); - unsigned long result = offset & ~63UL; - unsigned long tmp; - - if (offset >= size) - return size; - size -= result; - offset &= 63UL; - if (offset) { - tmp = *(p++); - tmp |= ~0UL >> (64 - offset); - if (size < 64) - goto found_first; - if (~tmp) - goto found_middle; - size -= 64; - result += 64; - } - while (size & ~63UL) { - if (~(tmp = *(p++))) - goto found_middle; - result += 64; - size -= 64; - } - if (!size) - return result; - tmp = *p; - -found_first: - tmp |= ~0UL << size; - if (tmp == ~0UL) /* Are any bits zero? */ - return result + size; /* Nope. */ -found_middle: - return result + ffz(tmp); -} - -EXPORT_SYMBOL(find_next_zero_bit); - -unsigned long find_next_bit(const unsigned long *addr, unsigned long size, - unsigned long offset) -{ - const unsigned long *p = addr + (offset >> 6); - unsigned long result = offset & ~63UL; - unsigned long tmp; - - if (offset >= size) - return size; - size -= result; - offset &= 63UL; - if (offset) { - tmp = *(p++); - tmp &= (~0UL << offset); - if (size < 64) - goto found_first; - if (tmp) - goto found_middle; - size -= 64; - result += 64; - } - while (size & ~63UL) { - if ((tmp = *(p++))) - goto found_middle; - result += 64; - size -= 64; - } - if (!size) - return result; - tmp = *p; - -found_first: - tmp &= (~0UL >> (64 - size)); - if (tmp == 0UL) /* Are any bits set? */ - return result + size; /* Nope. */ -found_middle: - return result + __ffs(tmp); -} - -EXPORT_SYMBOL(find_next_bit); - -static inline unsigned int ext2_ilog2(unsigned int x) -{ - int lz; - - asm("cntlzw %0,%1": "=r"(lz):"r"(x)); - return 31 - lz; -} - -static inline unsigned int ext2_ffz(unsigned int x) -{ - u32 rc; - if ((x = ~x) == 0) - return 32; - rc = ext2_ilog2(x & -x); - return rc; -} - -unsigned long find_next_zero_le_bit(const unsigned long *addr, unsigned long size, - unsigned long offset) -{ - const unsigned int *p = ((const unsigned int *)addr) + (offset >> 5); - unsigned int result = offset & ~31; - unsigned int tmp; - - if (offset >= size) - return size; - size -= result; - offset &= 31; - if (offset) { - tmp = cpu_to_le32p(p++); - tmp |= ~0U >> (32 - offset); /* bug or feature ? */ - if (size < 32) - goto found_first; - if (tmp != ~0) - goto found_middle; - size -= 32; - result += 32; - } - while (size >= 32) { - if ((tmp = cpu_to_le32p(p++)) != ~0) - goto found_middle; - result += 32; - size -= 32; - } - if (!size) - return result; - tmp = cpu_to_le32p(p); -found_first: - tmp |= ~0 << size; - if (tmp == ~0) /* Are any bits zero? */ - return result + size; /* Nope. */ -found_middle: - return result + ext2_ffz(tmp); -} - -EXPORT_SYMBOL(find_next_zero_le_bit); Index: working-2.6/arch/powerpc/kernel/ppc_ksyms.c =================================================================== --- working-2.6.orig/arch/powerpc/kernel/ppc_ksyms.c 2005-10-31 15:20:57.000000000 +1100 +++ working-2.6/arch/powerpc/kernel/ppc_ksyms.c 2005-11-01 15:51:56.000000000 +1100 @@ -81,15 +81,6 @@ EXPORT_SYMBOL(ucSystemType); #endif -#if !defined(__INLINE_BITOPS) -EXPORT_SYMBOL(set_bit); -EXPORT_SYMBOL(clear_bit); -EXPORT_SYMBOL(change_bit); -EXPORT_SYMBOL(test_and_set_bit); -EXPORT_SYMBOL(test_and_clear_bit); -EXPORT_SYMBOL(test_and_change_bit); -#endif /* __INLINE_BITOPS */ - EXPORT_SYMBOL(strcpy); EXPORT_SYMBOL(strncpy); EXPORT_SYMBOL(strcat); Index: working-2.6/arch/ppc/kernel/bitops.c =================================================================== --- working-2.6.orig/arch/ppc/kernel/bitops.c 2005-10-25 11:59:53.000000000 +1000 +++ /dev/null 1970-01-01 00:00:00.000000000 +0000 @@ -1,126 +0,0 @@ -/* - * Copyright (C) 1996 Paul Mackerras. - */ - -#include -#include - -/* - * If the bitops are not inlined in bitops.h, they are defined here. - * -- paulus - */ -#if !__INLINE_BITOPS -void set_bit(int nr, volatile void * addr) -{ - unsigned long old; - unsigned long mask = 1 << (nr & 0x1f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 5); - - __asm__ __volatile__(SMP_WMB "\n\ -1: lwarx %0,0,%3 \n\ - or %0,%0,%2 \n" - PPC405_ERR77(0,%3) -" stwcx. %0,0,%3 \n\ - bne 1b" - SMP_MB - : "=&r" (old), "=m" (*p) - : "r" (mask), "r" (p), "m" (*p) - : "cc" ); -} - -void clear_bit(int nr, volatile void *addr) -{ - unsigned long old; - unsigned long mask = 1 << (nr & 0x1f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 5); - - __asm__ __volatile__(SMP_WMB "\n\ -1: lwarx %0,0,%3 \n\ - andc %0,%0,%2 \n" - PPC405_ERR77(0,%3) -" stwcx. %0,0,%3 \n\ - bne 1b" - SMP_MB - : "=&r" (old), "=m" (*p) - : "r" (mask), "r" (p), "m" (*p) - : "cc"); -} - -void change_bit(int nr, volatile void *addr) -{ - unsigned long old; - unsigned long mask = 1 << (nr & 0x1f); - unsigned long *p = ((unsigned long *)addr) + (nr >> 5); - - __asm__ __volatile__(SMP_WMB "\n\ -1: lwarx %0,0,%3 \n\ - xor %0,%0,%2 \n" - PPC405_ERR77(0,%3) -" stwcx. %0,0,%3 \n\ - bne 1b" - SMP_MB - : "=&r" (old), "=m" (*p) - : "r" (mask), "r" (p), "m" (*p) - : "cc"); -} - -int test_and_set_bit(int nr, volatile void *addr) -{ - unsigned int old, t; - unsigned int mask = 1 << (nr & 0x1f); - volatile unsigned int *p = ((volatile unsigned int *)addr) + (nr >> 5); - - __asm__ __volatile__(SMP_WMB "\n\ -1: lwarx %0,0,%4 \n\ - or %1,%0,%3 \n" - PPC405_ERR77(0,%4) -" stwcx. %1,0,%4 \n\ - bne 1b" - SMP_MB - : "=&r" (old), "=&r" (t), "=m" (*p) - : "r" (mask), "r" (p), "m" (*p) - : "cc"); - - return (old & mask) != 0; -} - -int test_and_clear_bit(int nr, volatile void *addr) -{ - unsigned int old, t; - unsigned int mask = 1 << (nr & 0x1f); - volatile unsigned int *p = ((volatile unsigned int *)addr) + (nr >> 5); - - __asm__ __volatile__(SMP_WMB "\n\ -1: lwarx %0,0,%4 \n\ - andc %1,%0,%3 \n" - PPC405_ERR77(0,%4) -" stwcx. %1,0,%4 \n\ - bne 1b" - SMP_MB - : "=&r" (old), "=&r" (t), "=m" (*p) - : "r" (mask), "r" (p), "m" (*p) - : "cc"); - - return (old & mask) != 0; -} - -int test_and_change_bit(int nr, volatile void *addr) -{ - unsigned int old, t; - unsigned int mask = 1 << (nr & 0x1f); - volatile unsigned int *p = ((volatile unsigned int *)addr) + (nr >> 5); - - __asm__ __volatile__(SMP_WMB "\n\ -1: lwarx %0,0,%4 \n\ - xor %1,%0,%3 \n" - PPC405_ERR77(0,%4) -" stwcx. %1,0,%4 \n\ - bne 1b" - SMP_MB - : "=&r" (old), "=&r" (t), "=m" (*p) - : "r" (mask), "r" (p), "m" (*p) - : "cc"); - - return (old & mask) != 0; -} -#endif /* !__INLINE_BITOPS */ Index: working-2.6/arch/ppc64/kernel/Makefile =================================================================== --- working-2.6.orig/arch/ppc64/kernel/Makefile 2005-10-31 15:20:57.000000000 +1100 +++ working-2.6/arch/ppc64/kernel/Makefile 2005-11-01 15:51:56.000000000 +1100 @@ -13,7 +13,7 @@ obj-y += irq.o idle.o dma.o \ signal.o \ - align.o bitops.o pacaData.o \ + align.o pacaData.o \ udbg.o ioctl32.o \ rtc.o \ cpu_setup_power4.o \ Index: working-2.6/include/asm-powerpc/bitops.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ working-2.6/include/asm-powerpc/bitops.h 2005-11-01 15:51:56.000000000 +1100 @@ -0,0 +1,437 @@ +/* + * PowerPC atomic bit operations. + * + * Merged version by David Gibson . + * Based on ppc64 versions by: Dave Engebretsen, Todd Inglett, Don + * Reed, Pat McCarthy, Peter Bergner, Anton Blanchard. They + * originally took it from the ppc32 code. + * + * Within a word, bits are numbered LSB first. Lot's of places make + * this assumption by directly testing bits with (val & (1< 1 word) bitmaps on a + * big-endian system because, unlike little endian, the number of each + * bit depends on the word size. + * + * The bitop functions are defined to work on unsigned longs, so for a + * ppc64 system the bits end up numbered: + * |63..............0|127............64|191...........128|255...........196| + * and on ppc32: + * |31.....0|63....31|95....64|127...96|159..128|191..160|223..192|255..224| + * + * There are a few little-endian macros used mostly for filesystem + * bitmaps, these work on similar bit arrays layouts, but + * byte-oriented: + * |7...0|15...8|23...16|31...24|39...32|47...40|55...48|63...56| + * + * The main difference is that bit 3-5 (64b) or 3-4 (32b) in the bit + * number field needs to be reversed compared to the big-endian bit + * fields. This can be achieved by XOR with 0x38 (64b) or 0x18 (32b). + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ + +#ifndef _ASM_POWERPC_BITOPS_H +#define _ASM_POWERPC_BITOPS_H + +#ifdef __KERNEL__ + +#include +#include +#include + +/* + * clear_bit doesn't imply a memory barrier + */ +#define smp_mb__before_clear_bit() smp_mb() +#define smp_mb__after_clear_bit() smp_mb() + +#define BITOP_MASK(nr) (1UL << ((nr) % BITS_PER_LONG)) +#define BITOP_WORD(nr) ((nr) / BITS_PER_LONG) +#define BITOP_LE_SWIZZLE ((BITS_PER_LONG-1) & ~0x7) + +#ifdef CONFIG_PPC64 +#define LARXL "ldarx" +#define STCXL "stdcx." +#define CNTLZL "cntlzd" +#else +#define LARXL "lwarx" +#define STCXL "stwcx." +#define CNTLZL "cntlzw" +#endif + +static __inline__ void set_bit(int nr, volatile unsigned long *addr) +{ + unsigned long old; + unsigned long mask = BITOP_MASK(nr); + unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr); + + __asm__ __volatile__( +"1:" LARXL " %0,0,%3 # set_bit\n" + "or %0,%0,%2\n" + PPC405_ERR77(0,%3) + STCXL " %0,0,%3\n" + "bne- 1b" + : "=&r"(old), "=m"(*p) + : "r"(mask), "r"(p), "m"(*p) + : "cc" ); +} + +static __inline__ void clear_bit(int nr, volatile unsigned long *addr) +{ + unsigned long old; + unsigned long mask = BITOP_MASK(nr); + unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr); + + __asm__ __volatile__( +"1:" LARXL " %0,0,%3 # set_bit\n" + "andc %0,%0,%2\n" + PPC405_ERR77(0,%3) + STCXL " %0,0,%3\n" + "bne- 1b" + : "=&r"(old), "=m"(*p) + : "r"(mask), "r"(p), "m"(*p) + : "cc" ); +} + +static __inline__ void change_bit(int nr, volatile unsigned long *addr) +{ + unsigned long old; + unsigned long mask = BITOP_MASK(nr); + unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr); + + __asm__ __volatile__( +"1:" LARXL " %0,0,%3 # set_bit\n" + "xor %0,%0,%2\n" + PPC405_ERR77(0,%3) + STCXL " %0,0,%3\n" + "bne- 1b" + : "=&r"(old), "=m"(*p) + : "r"(mask), "r"(p), "m"(*p) + : "cc" ); +} + +static __inline__ int test_and_set_bit(unsigned long nr, + volatile unsigned long *addr) +{ + unsigned long old, t; + unsigned long mask = BITOP_MASK(nr); + unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr); + + __asm__ __volatile__( + EIEIO_ON_SMP +"1:" LARXL " %0,0,%3 # test_and_set_bit\n" + "or %1,%0,%2 \n" + PPC405_ERR77(0,%3) + STCXL " %1,0,%3 \n" + "bne- 1b" + ISYNC_ON_SMP + : "=&r" (old), "=&r" (t) + : "r" (mask), "r" (p) + : "cc", "memory"); + + return (old & mask) != 0; +} + +static __inline__ int test_and_clear_bit(unsigned long nr, + volatile unsigned long *addr) +{ + unsigned long old, t; + unsigned long mask = BITOP_MASK(nr); + unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr); + + __asm__ __volatile__( + EIEIO_ON_SMP +"1:" LARXL " %0,0,%3 # test_and_clear_bit\n" + "andc %1,%0,%2 \n" + PPC405_ERR77(0,%3) + STCXL " %1,0,%3 \n" + "bne- 1b" + ISYNC_ON_SMP + : "=&r" (old), "=&r" (t) + : "r" (mask), "r" (p) + : "cc", "memory"); + + return (old & mask) != 0; +} + +static __inline__ int test_and_change_bit(unsigned long nr, + volatile unsigned long *addr) +{ + unsigned long old, t; + unsigned long mask = BITOP_MASK(nr); + unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr); + + __asm__ __volatile__( + EIEIO_ON_SMP +"1:" LARXL " %0,0,%3 # test_and_change_bit\n" + "xor %1,%0,%2 \n" + PPC405_ERR77(0,%3) + STCXL " %1,0,%3 \n" + "bne- 1b" + ISYNC_ON_SMP + : "=&r" (old), "=&r" (t) + : "r" (mask), "r" (p) + : "cc", "memory"); + + return (old & mask) != 0; +} + +static __inline__ void set_bits(unsigned long mask, unsigned long *addr) +{ + unsigned long old; + + __asm__ __volatile__( +"1:" LARXL " %0,0,%3 # set_bit\n" + "or %0,%0,%2\n" + STCXL " %0,0,%3\n" + "bne- 1b" + : "=&r" (old), "=m" (*addr) + : "r" (mask), "r" (addr), "m" (*addr) + : "cc"); +} + +/* Non-atomic versions */ +static __inline__ int test_bit(unsigned long nr, + __const__ volatile unsigned long *addr) +{ + return 1UL & (addr[BITOP_WORD(nr)] >> (nr & (BITS_PER_LONG-1))); +} + +static __inline__ void __set_bit(unsigned long nr, + volatile unsigned long *addr) +{ + unsigned long mask = BITOP_MASK(nr); + unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr); + + *p |= mask; +} + +static __inline__ void __clear_bit(unsigned long nr, + volatile unsigned long *addr) +{ + unsigned long mask = BITOP_MASK(nr); + unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr); + + *p &= ~mask; +} + +static __inline__ void __change_bit(unsigned long nr, + volatile unsigned long *addr) +{ + unsigned long mask = BITOP_MASK(nr); + unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr); + + *p ^= mask; +} + +static __inline__ int __test_and_set_bit(unsigned long nr, + volatile unsigned long *addr) +{ + unsigned long mask = BITOP_MASK(nr); + unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr); + unsigned long old = *p; + + *p = old | mask; + return (old & mask) != 0; +} + +static __inline__ int __test_and_clear_bit(unsigned long nr, + volatile unsigned long *addr) +{ + unsigned long mask = BITOP_MASK(nr); + unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr); + unsigned long old = *p; + + *p = old & ~mask; + return (old & mask) != 0; +} + +static __inline__ int __test_and_change_bit(unsigned long nr, + volatile unsigned long *addr) +{ + unsigned long mask = BITOP_MASK(nr); + unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr); + unsigned long old = *p; + + *p = old ^ mask; + return (old & mask) != 0; +} + +/* + * Return the zero-based bit position (LE, not IBM bit numbering) of + * the most significant 1-bit in a double word. + */ +static __inline__ int __ilog2(unsigned long x) +{ + int lz; + + asm (CNTLZL " %0,%1" : "=r" (lz) : "r" (x)); + return BITS_PER_LONG - 1 - lz; +} + +/* + * Determines the bit position of the least significant 0 bit in the + * specified double word. The returned bit position will be + * zero-based, starting from the right side (63/31 - 0). + */ +static __inline__ unsigned long ffz(unsigned long x) +{ + /* no zero exists anywhere in the 8 byte area. */ + if ((x = ~x) == 0) + return BITS_PER_LONG; + + /* + * Calculate the bit position of the least signficant '1' bit in x + * (since x has been changed this will actually be the least signficant + * '0' bit in * the original x). Note: (x & -x) gives us a mask that + * is the least significant * (RIGHT-most) 1-bit of the value in x. + */ + return __ilog2(x & -x); +} + +static __inline__ int __ffs(unsigned long x) +{ + return __ilog2(x & -x); +} + +/* + * ffs: find first bit set. This is defined the same way as + * the libc and compiler builtin ffs routines, therefore + * differs in spirit from the above ffz (man ffs). + */ +static __inline__ int ffs(int x) +{ + unsigned long i = (unsigned long)x; + return __ilog2(i & -i) + 1; +} + +/* + * fls: find last (most-significant) bit set. + * Note fls(0) = 0, fls(1) = 1, fls(0x80000000) = 32. + */ +static __inline__ int fls(unsigned int x) +{ + int lz; + + asm ("cntlzw %0,%1" : "=r" (lz) : "r" (x)); + return 32 - lz; +} + +/* + * hweightN: returns the hamming weight (i.e. the number + * of bits set) of a N-bit word + */ +#define hweight64(x) generic_hweight64(x) +#define hweight32(x) generic_hweight32(x) +#define hweight16(x) generic_hweight16(x) +#define hweight8(x) generic_hweight8(x) + +#define find_first_zero_bit(addr, size) find_next_zero_bit((addr), (size), 0) +unsigned long find_next_zero_bit(const unsigned long *addr, + unsigned long size, unsigned long offset); +/** + * find_first_bit - find the first set bit in a memory region + * @addr: The address to start the search at + * @size: The maximum size to search + * + * Returns the bit-number of the first set bit, not the number of the byte + * containing a bit. + */ +#define find_first_bit(addr, size) find_next_bit((addr), (size), 0) +unsigned long find_next_bit(const unsigned long *addr, + unsigned long size, unsigned long offset); + +/* Little-endian versions */ + +static __inline__ int test_le_bit(unsigned long nr, + __const__ unsigned long *addr) +{ + __const__ unsigned char *tmp = (__const__ unsigned char *) addr; + return (tmp[nr >> 3] >> (nr & 7)) & 1; +} + +#define __set_le_bit(nr, addr) \ + __set_bit((nr) ^ BITOP_LE_SWIZZLE, (addr)) +#define __clear_le_bit(nr, addr) \ + __clear_bit((nr) ^ BITOP_LE_SWIZZLE, (addr)) + +#define test_and_set_le_bit(nr, addr) \ + test_and_set_bit((nr) ^ BITOP_LE_SWIZZLE, (addr)) +#define test_and_clear_le_bit(nr, addr) \ + test_and_clear_bit((nr) ^ BITOP_LE_SWIZZLE, (addr)) + +#define __test_and_set_le_bit(nr, addr) \ + __test_and_set_bit((nr) ^ BITOP_LE_SWIZZLE, (addr)) +#define __test_and_clear_le_bit(nr, addr) \ + __test_and_clear_bit((nr) ^ BITOP_LE_SWIZZLE, (addr)) + +#define find_first_zero_le_bit(addr, size) find_next_zero_le_bit((addr), (size), 0) +unsigned long find_next_zero_le_bit(const unsigned long *addr, + unsigned long size, unsigned long offset); + +/* Bitmap functions for the ext2 filesystem */ + +#define ext2_set_bit(nr,addr) \ + __test_and_set_le_bit((nr), (unsigned long*)addr) +#define ext2_clear_bit(nr, addr) \ + __test_and_clear_le_bit((nr), (unsigned long*)addr) + +#define ext2_set_bit_atomic(lock, nr, addr) \ + test_and_set_le_bit((nr), (unsigned long*)addr) +#define ext2_clear_bit_atomic(lock, nr, addr) \ + test_and_clear_le_bit((nr), (unsigned long*)addr) + +#define ext2_test_bit(nr, addr) test_le_bit((nr),(unsigned long*)addr) + +#define ext2_find_first_zero_bit(addr, size) \ + find_first_zero_le_bit((unsigned long*)addr, size) +#define ext2_find_next_zero_bit(addr, size, off) \ + find_next_zero_le_bit((unsigned long*)addr, size, off) + +/* Bitmap functions for the minix filesystem. */ + +#define minix_test_and_set_bit(nr,addr) \ + __test_and_set_le_bit(nr, (unsigned long *)addr) +#define minix_set_bit(nr,addr) \ + __set_le_bit(nr, (unsigned long *)addr) +#define minix_test_and_clear_bit(nr,addr) \ + __test_and_clear_le_bit(nr, (unsigned long *)addr) +#define minix_test_bit(nr,addr) \ + test_le_bit(nr, (unsigned long *)addr) + +#define minix_find_first_zero_bit(addr,size) \ + find_first_zero_le_bit((unsigned long *)addr, size) + +/* + * Every architecture must define this function. It's the fastest + * way of searching a 140-bit bitmap where the first 100 bits are + * unlikely to be set. It's guaranteed that at least one of the 140 + * bits is cleared. + */ +static inline int sched_find_first_bit(const unsigned long *b) +{ +#ifdef CONFIG_PPC64 + if (unlikely(b[0])) + return __ffs(b[0]); + if (unlikely(b[1])) + return __ffs(b[1]) + 64; + return __ffs(b[2]) + 128; +#else + if (unlikely(b[0])) + return __ffs(b[0]); + if (unlikely(b[1])) + return __ffs(b[1]) + 32; + if (unlikely(b[2])) + return __ffs(b[2]) + 64; + if (b[3]) + return __ffs(b[3]) + 96; + return __ffs(b[4]) + 128; +#endif +} + +#endif /* __KERNEL__ */ + +#endif /* _ASM_POWERPC_BITOPS_H */ Index: working-2.6/include/asm-ppc64/mmu_context.h =================================================================== --- working-2.6.orig/include/asm-ppc64/mmu_context.h 2005-10-25 11:59:59.000000000 +1000 +++ working-2.6/include/asm-ppc64/mmu_context.h 2005-11-01 15:51:56.000000000 +1100 @@ -16,21 +16,6 @@ * 2 of the License, or (at your option) any later version. */ -/* - * Every architecture must define this function. It's the fastest - * way of searching a 140-bit bitmap where the first 100 bits are - * unlikely to be set. It's guaranteed that at least one of the 140 - * bits is cleared. - */ -static inline int sched_find_first_bit(unsigned long *b) -{ - if (unlikely(b[0])) - return __ffs(b[0]); - if (unlikely(b[1])) - return __ffs(b[1]) + 64; - return __ffs(b[2]) + 128; -} - static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) { } Index: working-2.6/arch/ppc/Makefile =================================================================== --- working-2.6.orig/arch/ppc/Makefile 2005-10-31 15:20:57.000000000 +1100 +++ working-2.6/arch/ppc/Makefile 2005-11-01 15:51:56.000000000 +1100 @@ -66,7 +66,8 @@ core-y += arch/ppc/kernel/ arch/powerpc/kernel/ \ arch/ppc/platforms/ \ arch/ppc/mm/ arch/ppc/lib/ \ - arch/ppc/syslib/ arch/powerpc/sysdev/ + arch/ppc/syslib/ arch/powerpc/sysdev/ \ + arch/powerpc/lib/ core-$(CONFIG_4xx) += arch/ppc/platforms/4xx/ core-$(CONFIG_83xx) += arch/ppc/platforms/83xx/ core-$(CONFIG_85xx) += arch/ppc/platforms/85xx/ Index: working-2.6/arch/powerpc/lib/Makefile =================================================================== --- working-2.6.orig/arch/powerpc/lib/Makefile 2005-10-31 15:20:57.000000000 +1100 +++ working-2.6/arch/powerpc/lib/Makefile 2005-11-01 15:51:56.000000000 +1100 @@ -3,13 +3,14 @@ # ifeq ($(CONFIG_PPC_MERGE),y) -obj-y := string.o +obj-y := string.o strcase.o +obj-$(CONFIG_PPC32) += div64.o copy_32.o checksum_32.o endif -obj-y += strcase.o -obj-$(CONFIG_PPC32) += div64.o copy_32.o checksum_32.o +obj-y += bitops.o obj-$(CONFIG_PPC64) += checksum_64.o copypage_64.o copyuser_64.o \ - memcpy_64.o usercopy_64.o mem_64.o + memcpy_64.o usercopy_64.o mem_64.o \ + strcase.o obj-$(CONFIG_PPC_ISERIES) += e2a.o obj-$(CONFIG_XMON) += sstep.o -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson From mikey at neuling.org Tue Nov 1 18:14:40 2005 From: mikey at neuling.org (Michael Neuling) Date: Tue, 1 Nov 2005 18:14:40 +1100 (EST) Subject: [PATCH 0/3] powerpc: Fix legacy drivers for remove io_page_mask patch Message-ID: <1130829279.790724.373020147318.qpush@coopers> These patches are the same as I send the other day, except updated for the merge tree. Also now contains an updated version of Anton's original patch which will apply cleanly to the merge tree. Anton's patch is necessary for running kexec with some e1000 revisions. The issue we have is that the reset is not being sent to the e1000 correctly, resulting in it still running during the second boot. Other two patches fix drivers which have issues with the first patch. Mikey From mikey at neuling.org Tue Nov 1 18:14:41 2005 From: mikey at neuling.org (Michael Neuling) Date: Tue, 1 Nov 2005 18:14:41 +1100 (EST) Subject: [PATCH 1/3] powerpc: Updated remove io_page_mask In-Reply-To: <1130829279.790724.373020147318.qpush@coopers> Message-ID: <20051101071441.1C53668662@ozlabs.org> From: Anton Blanchard Retransmit of Anton's patch from here: http://ozlabs.org/pipermail/linuxppc64-dev/2005-May/003922.html Updated for merge tree. Signed-off-by: Michael Neuling arch/powerpc/platforms/iseries/pci.c | 3 --- arch/powerpc/platforms/maple/pci.c | 3 --- arch/powerpc/platforms/powermac/pci.c | 3 --- arch/ppc64/kernel/iomap.c | 2 -- arch/ppc64/kernel/pci.c | 30 +++--------------------------- include/asm-ppc64/eeh.h | 15 +++------------ include/asm-ppc64/io.h | 6 ------ 7 files changed, 6 insertions(+), 56 deletions(-) Index: linux-2.6/arch/powerpc/platforms/iseries/pci.c =================================================================== --- linux-2.6.orig/arch/powerpc/platforms/iseries/pci.c 2005-11-01 10:30:35.000000000 +1100 +++ linux-2.6/arch/powerpc/platforms/iseries/pci.c 2005-11-01 11:04:27.000000000 +1100 @@ -45,8 +45,6 @@ #include "pci.h" #include "call_pci.h" -extern unsigned long io_page_mask; - /* * Forward declares of prototypes. */ @@ -288,7 +286,6 @@ PPCDBG(PPCDBG_BUSWALK, "iSeries_pcibios_init Entry.\n"); iomm_table_initialize(); find_and_init_phbs(); - io_page_mask = -1; PPCDBG(PPCDBG_BUSWALK, "iSeries_pcibios_init Exit.\n"); } Index: linux-2.6/arch/powerpc/platforms/maple/pci.c =================================================================== --- linux-2.6.orig/arch/powerpc/platforms/maple/pci.c 2005-11-01 10:30:35.000000000 +1100 +++ linux-2.6/arch/powerpc/platforms/maple/pci.c 2005-11-01 11:06:30.000000000 +1100 @@ -455,9 +455,6 @@ /* Tell pci.c to use the common resource allocation mecanism */ pci_probe_only = 0; - - /* Allow all IO */ - io_page_mask = -1; } int maple_pci_get_legacy_ide_irq(struct pci_dev *pdev, int channel) Index: linux-2.6/arch/powerpc/platforms/powermac/pci.c =================================================================== --- linux-2.6.orig/arch/powerpc/platforms/powermac/pci.c 2005-11-01 10:30:35.000000000 +1100 +++ linux-2.6/arch/powerpc/platforms/powermac/pci.c 2005-11-01 11:10:07.000000000 +1100 @@ -926,9 +926,6 @@ /* Tell pci.c to not use the common resource allocation mechanism */ pci_probe_only = 1; - /* Allow all IO */ - io_page_mask = -1; - #else /* CONFIG_PPC64 */ init_p2pbridge(); fixup_nec_usb2(); Index: linux-2.6/arch/ppc64/kernel/iomap.c =================================================================== --- linux-2.6.orig/arch/ppc64/kernel/iomap.c 2005-11-01 10:30:32.000000000 +1100 +++ linux-2.6/arch/ppc64/kernel/iomap.c 2005-11-01 11:05:01.000000000 +1100 @@ -108,8 +108,6 @@ void __iomem *ioport_map(unsigned long port, unsigned int len) { - if (!_IO_IS_VALID(port)) - return NULL; return (void __iomem *) (port+pci_io_base); } Index: linux-2.6/arch/ppc64/kernel/pci.c =================================================================== --- linux-2.6.orig/arch/ppc64/kernel/pci.c 2005-11-01 10:30:35.000000000 +1100 +++ linux-2.6/arch/ppc64/kernel/pci.c 2005-11-01 11:08:44.000000000 +1100 @@ -42,14 +42,6 @@ unsigned long pci_probe_only = 1; unsigned long pci_assign_all_buses = 0; -/* - * legal IO pages under MAX_ISA_PORT. This is to ensure we don't touch - * devices we don't have access to. - */ -unsigned long io_page_mask; - -EXPORT_SYMBOL(io_page_mask); - #ifdef CONFIG_PPC_MULTIPLATFORM static void fixup_resource(struct resource *res, struct pci_dev *dev); static void do_bus_setup(struct pci_bus *bus); @@ -995,8 +987,6 @@ pci_process_ISA_OF_ranges(isa_dn, hose->io_base_phys, hose->io_base_virt); of_node_put(isa_dn); - /* Allow all IO */ - io_page_mask = -1; } } @@ -1132,27 +1122,13 @@ static void __devinit fixup_resource(struct resource *res, struct pci_dev *dev) { struct pci_controller *hose = pci_bus_to_host(dev->bus); - unsigned long start, end, mask, offset; + unsigned long offset; if (res->flags & IORESOURCE_IO) { offset = (unsigned long)hose->io_base_virt - pci_io_base; - start = res->start += offset; - end = res->end += offset; - - /* Need to allow IO access to pages that are in the - ISA range */ - if (start < MAX_ISA_PORT) { - if (end > MAX_ISA_PORT) - end = MAX_ISA_PORT; - - start >>= PAGE_SHIFT; - end >>= PAGE_SHIFT; - - /* get the range of pages for the map */ - mask = ((1 << (end+1)) - 1) ^ ((1 << start) - 1); - io_page_mask |= mask; - } + res->start += offset; + res->end += offset; } else if (res->flags & IORESOURCE_MEM) { res->start += hose->pci_mem_offset; res->end += hose->pci_mem_offset; Index: linux-2.6/include/asm-ppc64/eeh.h =================================================================== --- linux-2.6.orig/include/asm-ppc64/eeh.h 2005-11-01 10:30:32.000000000 +1100 +++ linux-2.6/include/asm-ppc64/eeh.h 2005-11-01 11:13:05.000000000 +1100 @@ -311,8 +311,6 @@ static inline u8 eeh_inb(unsigned long port) { u8 val; - if (!_IO_IS_VALID(port)) - return ~0; val = in_8((u8 __iomem *)(port+pci_io_base)); if (EEH_POSSIBLE_ERROR(val, u8)) return eeh_check_failure((void __iomem *)(port), val); @@ -321,15 +319,12 @@ static inline void eeh_outb(u8 val, unsigned long port) { - if (_IO_IS_VALID(port)) - out_8((u8 __iomem *)(port+pci_io_base), val); + out_8((u8 __iomem *)(port+pci_io_base), val); } static inline u16 eeh_inw(unsigned long port) { u16 val; - if (!_IO_IS_VALID(port)) - return ~0; val = in_le16((u16 __iomem *)(port+pci_io_base)); if (EEH_POSSIBLE_ERROR(val, u16)) return eeh_check_failure((void __iomem *)(port), val); @@ -338,15 +333,12 @@ static inline void eeh_outw(u16 val, unsigned long port) { - if (_IO_IS_VALID(port)) - out_le16((u16 __iomem *)(port+pci_io_base), val); + out_le16((u16 __iomem *)(port+pci_io_base), val); } static inline u32 eeh_inl(unsigned long port) { u32 val; - if (!_IO_IS_VALID(port)) - return ~0; val = in_le32((u32 __iomem *)(port+pci_io_base)); if (EEH_POSSIBLE_ERROR(val, u32)) return eeh_check_failure((void __iomem *)(port), val); @@ -355,8 +347,7 @@ static inline void eeh_outl(u32 val, unsigned long port) { - if (_IO_IS_VALID(port)) - out_le32((u32 __iomem *)(port+pci_io_base), val); + out_le32((u32 __iomem *)(port+pci_io_base), val); } /* in-string eeh macros */ Index: linux-2.6/include/asm-ppc64/io.h =================================================================== --- linux-2.6.orig/include/asm-ppc64/io.h 2005-11-01 10:30:35.000000000 +1100 +++ linux-2.6/include/asm-ppc64/io.h 2005-11-01 11:13:40.000000000 +1100 @@ -33,12 +33,6 @@ extern unsigned long isa_io_base; extern unsigned long pci_io_base; -extern unsigned long io_page_mask; - -#define MAX_ISA_PORT 0x10000 - -#define _IO_IS_VALID(port) ((port) >= MAX_ISA_PORT || (1 << (port>>PAGE_SHIFT)) \ - & io_page_mask) #ifdef CONFIG_PPC_ISERIES /* __raw_* accessors aren't supported on iSeries */ From mikey at neuling.org Tue Nov 1 18:14:41 2005 From: mikey at neuling.org (Michael Neuling) Date: Tue, 1 Nov 2005 18:14:41 +1100 (EST) Subject: [PATCH 2/3] powerpc: Updated parallel port init fix In-Reply-To: <1130829279.790724.373020147318.qpush@coopers> Message-ID: <20051101071441.1B6BB68665@ozlabs.org> Updated for powerpc merge tree. Signed-off-by: Michael Neuling include/asm-powerpc/parport.h | 28 ++++++++++++++++++++++++++-- 1 files changed, 26 insertions(+), 2 deletions(-) Index: linux-2.6/include/asm-powerpc/parport.h =================================================================== --- linux-2.6.orig/include/asm-powerpc/parport.h 2005-11-01 10:30:35.000000000 +1100 +++ linux-2.6/include/asm-powerpc/parport.h 2005-11-01 11:35:05.000000000 +1100 @@ -9,10 +9,34 @@ #ifndef _ASM_POWERPC_PARPORT_H #define _ASM_POWERPC_PARPORT_H -static int __devinit parport_pc_find_isa_ports (int autoirq, int autodma); +#include + +extern struct parport *parport_pc_probe_port (unsigned long int base, + unsigned long int base_hi, + int irq, int dma, + struct pci_dev *dev); + static int __devinit parport_pc_find_nonpci_ports (int autoirq, int autodma) { - return parport_pc_find_isa_ports (autoirq, autodma); + struct device_node *np; + u32 *prop; + u32 io1, io2; + int propsize; + int count = 0; + for (np = NULL; (np = of_find_compatible_node(np, + "parallel", + "pnpPNP,400")) != NULL;) { + prop = (u32 *)get_property(np, "reg", &propsize); + if (!prop || propsize > 6*sizeof(u32)) + continue; + io1 = prop[1]; io2 = prop[2]; + prop = (u32 *)get_property(np, "interrupts", NULL); + if (!prop) + continue; + if (parport_pc_probe_port(io1, io2, prop[0], autodma, NULL) != NULL) + count++; + } + return count; } #endif /* !(_ASM_POWERPC_PARPORT_H) */ From mikey at neuling.org Tue Nov 1 18:14:42 2005 From: mikey at neuling.org (Michael Neuling) Date: Tue, 1 Nov 2005 18:14:42 +1100 (EST) Subject: [PATCH 3/3] powerpc: Updated PC speaker init fix In-Reply-To: <1130829279.790724.373020147318.qpush@coopers> Message-ID: <20051101071442.CD1C66866A@ozlabs.org> Updated for powerpc merge tree. Adds architecture specific init to pcspkr. Signed-off-by: Michael Neuling Acked-by: Paul Mackerras drivers/input/misc/pcspkr.c | 5 +++++ include/asm-powerpc/8253pit.h | 13 +++++++++++++ 2 files changed, 18 insertions(+) Index: linux-2.6/drivers/input/misc/pcspkr.c =================================================================== --- linux-2.6.orig/drivers/input/misc/pcspkr.c 2005-10-31 15:16:39.000000000 +1100 +++ linux-2.6/drivers/input/misc/pcspkr.c 2005-10-31 15:21:13.000000000 +1100 @@ -66,6 +66,11 @@ static int __init pcspkr_init(void) { +#ifdef HAS_PCSPKR_ARCH_INIT + int rc = pcspkr_arch_init(); + if (rc) + return rc; +#endif pcspkr_dev = input_allocate_device(); if (!pcspkr_dev) return -ENOMEM; Index: linux-2.6/include/asm-powerpc/8253pit.h =================================================================== --- linux-2.6.orig/include/asm-powerpc/8253pit.h 2005-10-31 15:02:18.000000000 +1100 +++ linux-2.6/include/asm-powerpc/8253pit.h 2005-10-31 15:20:30.000000000 +1100 @@ -5,6 +5,19 @@ * 8253/8254 Programmable Interval Timer */ +#include + #define PIT_TICK_RATE 1193182UL +#define HAS_PCSPKR_ARCH_INIT + +static inline int pcspkr_arch_init(void) +{ + struct device_node *np; + + np = of_find_compatible_node(NULL, NULL, "pnpPNP,100"); + of_node_put(np); + return np ? 0 : -ENODEV; +} + #endif /* _ASM_POWERPC_8253PIT_H */ From paulus at samba.org Tue Nov 1 16:53:37 2005 From: paulus at samba.org (Paul Mackerras) Date: Tue, 1 Nov 2005 16:53:37 +1100 Subject: Patches for 2.6.15 In-Reply-To: <20051028103041.B15268@cox.net> References: <17250.8725.358204.62510@cargo.ozlabs.ibm.com> <20051028103041.B15268@cox.net> Message-ID: <17255.737.832440.137041@cargo.ozlabs.ibm.com> Matt Porter writes: > Ok, we have a set of 4xx patches that I plan to send to Andrew. > They are some existing 4xx SoC/board updates as well as a new > SoC/board. They are obviously mostly confined to the 4xx code paths > but there's likely conflicts in changes to Makefiles, etc. > > Would you prefer these going upstream before or after the > powerpc-merge pull? Did you send them yet? Linus has pulled the powerpc-merge tree, as I'm sure you've noticed. Paul. From paulus at samba.org Tue Nov 1 16:46:29 2005 From: paulus at samba.org (Paul Mackerras) Date: Tue, 1 Nov 2005 16:46:29 +1100 Subject: [PATCH] VMX get_user w/ irq disabled In-Reply-To: <20051028115509.1bb23cb6.moilanen@austin.ibm.com> References: <20051028115509.1bb23cb6.moilanen@austin.ibm.com> Message-ID: <17255.309.688169.531174@cargo.ozlabs.ibm.com> Jake Moilanen writes: > Looks like we have a get_user() call with interrupts disabled. While I > haven't seen the problem, I believe we have the same hole in mainline. > > The patch below fixed the problem on Redhat (rebased at 2.6.14). The problem is that altivec_assist_exception gets called with interrupts disabled on ppc64. I haven't decided whether to change head_64.S or just do a local_irq_enable() inside altivec_assist_exception(). Paul. From paulus at samba.org Tue Nov 1 16:57:58 2005 From: paulus at samba.org (Paul Mackerras) Date: Tue, 1 Nov 2005 16:57:58 +1100 Subject: Patches for 2.6.15 In-Reply-To: References: <17250.8725.358204.62510@cargo.ozlabs.ibm.com> Message-ID: <17255.998.459385.399953@cargo.ozlabs.ibm.com> Kumar Gala writes: > Can you merge this in: > > http://patchwork.ozlabs.org/linuxppc/patch?id=2931 Having the same extern declaration in several C files raises a red flag. Could we have that in a suitable header file instead please? Paul. From paulus at samba.org Tue Nov 1 20:55:17 2005 From: paulus at samba.org (Paul Mackerras) Date: Tue, 1 Nov 2005 20:55:17 +1100 Subject: [patch 2/5] powerpc: create a new arch/powerpc/platforms/cell/smp.c In-Reply-To: <200511010050.48828.arnd@arndb.de> References: <20051101010836.771791000@localhost> <20051101011133.300238000@localhost> <200511011026.59266.michael@ellerman.id.au> <200511010050.48828.arnd@arndb.de> Message-ID: <17255.15237.695554.410362@cargo.ozlabs.ibm.com> Arnd Bergmann writes: > On Dinsdag 01 November 2005 00:26, Michael Ellerman wrote: > > A lot of your smp routines are identical to the pSeries versions. Wouldn't it > > be preferable to only have one implementation? > > Yes it would. I'm not sure how that would best be done however. Until 2.6.14, > we've just used the pSeries implementation, which does not work any more now > that we want to keep the platform stuff in separate directories. I don't mind putting generic rtas stuff in arch/powerpc/kernel. Paul. From mporter at kernel.crashing.org Wed Nov 2 00:22:50 2005 From: mporter at kernel.crashing.org (Matt Porter) Date: Tue, 1 Nov 2005 06:22:50 -0700 Subject: Patches for 2.6.15 In-Reply-To: <17255.737.832440.137041@cargo.ozlabs.ibm.com>; from paulus@samba.org on Tue, Nov 01, 2005 at 04:53:37PM +1100 References: <17250.8725.358204.62510@cargo.ozlabs.ibm.com> <20051028103041.B15268@cox.net> <17255.737.832440.137041@cargo.ozlabs.ibm.com> Message-ID: <20051101062250.A28639@cox.net> On Tue, Nov 01, 2005 at 04:53:37PM +1100, Paul Mackerras wrote: > Matt Porter writes: > > > Ok, we have a set of 4xx patches that I plan to send to Andrew. > > They are some existing 4xx SoC/board updates as well as a new > > SoC/board. They are obviously mostly confined to the 4xx code paths > > but there's likely conflicts in changes to Makefiles, etc. > > > > Would you prefer these going upstream before or after the > > powerpc-merge pull? > > Did you send them yet? Linus has pulled the powerpc-merge tree, as > I'm sure you've noticed. Yes I did. I saw the merge go into mainline and rebased what was necessary off of that. Andrew now has them queued up for Linus so we are set. BTW, we're starting to look at merging 4xx to arch/powerpc/ now. -Matt From dwmw2 at infradead.org Wed Nov 2 02:54:04 2005 From: dwmw2 at infradead.org (David Woodhouse) Date: Tue, 01 Nov 2005 15:54:04 +0000 Subject: please pull the powerpc-merge.git tree In-Reply-To: <17253.39993.502458.390760@cargo.ozlabs.ibm.com> References: <17253.39993.502458.390760@cargo.ozlabs.ibm.com> Message-ID: <1130860444.21212.52.camel@hades.cambridge.redhat.com> On Mon, 2005-10-31 at 15:23 +1100, Paul Mackerras wrote: > It is now possible to build kernels for powermac, pSeries, iSeries and > maple with ARCH=powerpc, and for powermac, both 32-bit and 64-bit > build and run. Hm. Not entirely in line with my experience. Can you share the configs you used? Using http://david/woodhou.se/powerpc-merge-32.config it doesn't actually boot on my powerbook. I'll try it on the Pegasos later or tomorrow, where I have a serial console; it dies very early. Aside from disabling CONFIG_NVRAM because call_rtas() isn't implemented anywhere, I also needed to do this to make that config build: --- linux-2.6.14/arch/powerpc/kernel/setup-common.c.orig 2005-11-01 10:14:32.000000000 +0000 +++ linux-2.6.14/arch/powerpc/kernel/setup-common.c 2005-11-01 10:15:03.000000000 +0000 @@ -203,11 +203,11 @@ static int show_cpuinfo(struct seq_file #ifdef CONFIG_TAU_AVERAGE /* more straightforward, but potentially misleading */ seq_printf(m, "temperature \t: %u C (uncalibrated)\n", - cpu_temp(i)); + cpu_temp(cpu_id)); #else /* show the actual temp sensor range */ u32 temp; - temp = cpu_temp_both(i); + temp = cpu_temp_both(cpu_id); seq_printf(m, "temperature \t: %u-%u C (uncalibrated)\n", temp & 0xff, temp >> 16); #endif -- dwmw2 From dwmw2 at infradead.org Wed Nov 2 03:06:47 2005 From: dwmw2 at infradead.org (David Woodhouse) Date: Tue, 01 Nov 2005 16:06:47 +0000 Subject: please pull the powerpc-merge.git tree In-Reply-To: <17253.39993.502458.390760@cargo.ozlabs.ibm.com> References: <17253.39993.502458.390760@cargo.ozlabs.ibm.com> Message-ID: <1130861207.21212.66.camel@hades.cambridge.redhat.com> On Mon, 2005-10-31 at 15:23 +1100, Paul Mackerras wrote: > It is now possible to build kernels for powermac, pSeries, iSeries and > maple with ARCH=powerpc, and for powermac, both 32-bit and 64-bit > build and run. The ppc64 build (http://david.woodhou.se/powerpc-merge-64.config) fares worse than ppc32 for me -- it doesn't even build. arch/powerpc/platforms/powermac/pic.c:614: error: ?ppc_cached_irq_mask? undeclared (first use in this function) arch/powerpc/platforms/powermac/pic.c:620: error: ?pmac_irq_hw? undeclared (first use in this function) arch/powerpc/platforms/powermac/pic.c:621: error: ?max_real_irqs? undeclared (first use in this function) arch/powerpc/platforms/powermac/pic.c:641: warning: implicit declaration of function ?pmac_unmask_irq? If I leave CONFIG_ADB_PMU enabled (as I think I should since some G5s have it?) I also see this: drivers/macintosh/via-pmu.c:2410: undefined reference to `.pmac_tweak_clock_spreading' drivers/macintosh/via-pmu.c:2494: undefined reference to `.set_context' drivers/macintosh/via-pmu.c:2670: undefined reference to `._nmask_and_or_msr' drivers/macintosh/via-pmu.c:2592: undefined reference to `.set_context' If I turn CONFIG_ADB_PMU off, I see this: arch/powerpc/platforms/powermac/time.c:335: undefined reference to `.pmu_register_sleep_notifier' I think I'll leave the task of switching the Fedora rawhide kernel to arch/powerpc to another day :) -- dwmw2 From miltonm at bga.com Wed Nov 2 02:02:08 2005 From: miltonm at bga.com (Milton Miller) Date: Tue, 1 Nov 2005 09:02:08 -0600 Subject: Patches for 2.6.15 Message-ID: Paul wrote: > Kumar Gala writes: > > http://patchwork.ozlabs.org/linuxppc/patch?id=2931 > > Having the same extern declaration in several C files raises a red > flag. Could we have that in a suitable header file instead please? And the patch header says: > Having a prototype that uses seq_file without always including > seq_file.h generates a lot of warnings. This happened when asm/irq.h > was merged. Hint: a simple struct seq_file; in the approprate header should suffice. milton From arnd at arndb.de Wed Nov 2 06:13:50 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Tue, 1 Nov 2005 20:13:50 +0100 Subject: powerpc: Merge ipcbuf.h In-Reply-To: <20051101055324.GA3551@localhost.localdomain> References: <20051101055324.GA3551@localhost.localdomain> Message-ID: <200511012013.51443.arnd@arndb.de> On Dinsdag 01 November 2005 06:53, David Gibson wrote: > +struct ipc64_perm > +{ > +???????__kernel_key_t??key; > +???????__kernel_uid_t??uid; > +???????__kernel_gid_t??gid; > +???????__kernel_uid_t??cuid; > +???????__kernel_gid_t??cgid; > +???????__kernel_mode_t?mode; > +???????unsigned int????seq; > +???????unsigned int????__pad1; > +???????u64?????????????__unused1; > +???????u64?????????????__unused2; > +}; ipc64_perm is a user visible structure, so you have to use __u64 here instead of u64. Even that does not exists if you build with 32 bit and __STRICT_ANSI__, so it might be better yet to use four __u32 for the unused fields. Arnd <>< From ingvar at linpro.no Wed Nov 2 09:49:26 2005 From: ingvar at linpro.no (Ingvar Hagelund) Date: 01 Nov 2005 23:49:26 +0100 Subject: dlpar problem on sles9/openpower Message-ID: This is a pure user question We have an IBM OpenPower 720 with hypervisor and a set of lpars. Earlier we got dlpar to work, at least a couple of times, but not anymore. Here's an example: Ran this on the hmc: ~> chhwres -r virtualio -m Server-9124-720-SNXXXXXXX -p mgmt -o a \ --rsubtype eth -s 11 \ -a "ieee_virtual_eth=1,port_vlan_id=1,is_trunk=1,\"addl_vlan_ids=600,601\"" HSCL294C DLPAR ADD Virtual I/O resources failed: HMC adding Virtual I/O ...... HMC Virtual slot DLPAR operation failed. Here are the virtual slot IDs that failed and the reasons for failure: 11 The dynamic logical partitioning operation failed. On the lpar "mgmt", running sles9 sp2 kernel 2.6.5-7.193-pseries64, we have the magic rpms from IBM installed: evlog-drv-tmpl-0.8-1 diagela-1.3.0.0-6 lsvpd-0.12.7-1 ppc64-utils-2.5-2 librtas-1.2-1 ... and the following IBM services are running: # lssrc -a Subsystem Group PID Status ctrmc rsct 5011 active IBM.ServiceRM rsct_rm 5103 active IBM.DRM rsct_rm 5110 active IBM.HostRM rsct_rm 5186 active IBM.CSMAgentRM rsct_rm 5217 active IBM.ERRM rsct_rm 5221 active IBM.AuditRM rsct_rm 5266 active ctcas rsct 7705 active IBM.SensorRM rsct_rm 7721 active IBM.FSRM rsct_rm 7722 active IBM.ConfigRM rsct_rm 7723 active The hmc seems to be able to talk to the managed system ~> lshwres -r virtualio --rsubtype eth -m Server-9124-720-SNXXXXXXX \ --level lpar --filter "lpar_names=mgmt" lpar_name=mgmt,lpar_id=1,slot_num=10,state=null,ieee_virtual_eth=1,port_vlan_id=1,"addl_vlan_ids=500,501,502,503",is_trunk=1,is_required=0,mac_addr=AE808000100A ... and to the lpar: ~> lspartition -dlpar <#0> Partition:<1, power0.somewhere.tld, 10.0.0.2> Active:<1>, OS:, DCaps:<0xf>, CmdCaps:<0x1, 0x1>, PinnedMem:<0> So, where can I start digging? Regards, Ingvar -- Many that live deserve death. And some that die deserve life. Can you give it to them? Then do not be too eager to deal out death in judgement. For even the very wise cannot see all ends. Gandalf From tdgarcia at us.ibm.com Wed Nov 2 09:51:24 2005 From: tdgarcia at us.ibm.com (tdgarcia) Date: Tue, 01 Nov 2005 16:51:24 -0600 Subject: lockmeter port for ppc64 Message-ID: <4367F16C.5080003@us.ibm.com> My team and I adapted your lockmeter code to the ppc64 architecture. I am attaching the ppc64 specific code to this email. This code should patch cleanly to the 2.6.13 kernel. diff -Narup linux-2.6.13/arch/ppc64/Kconfig.debug linux-2.6.13-lockmeter/arch/ppc64/Kconfig.debug --- linux-2.6.13/arch/ppc64/Kconfig.debug 2005-08-28 16:41:01.000000000 -0700 +++ linux-2.6.13-lockmeter/arch/ppc64/Kconfig.debug 2005-10-11 07:40:28.000000000 -0700 @@ -19,6 +19,13 @@ config KPROBES for kernel debugging, non-intrusive instrumentation and testing. If in doubt, say "N". +config LOCKMETER + bool "Kernel lock metering" + depends on SMP && !PREEMPT + help + Say Y to enable kernel lock metering, which adds overhead to SMP locks, + but allows you to see various statistics using the lockstat command. + config DEBUG_STACK_USAGE bool "Stack utilization instrumentation" depends on DEBUG_KERNEL diff -Narup linux-2.6.13/arch/ppc64/lib/dec_and_lock.c linux-2.6.13-lockmeter/arch/ppc64/lib/dec_and_lock.c --- linux-2.6.13/arch/ppc64/lib/dec_and_lock.c 2005-08-28 16:41:01.000000000 -0700 +++ linux-2.6.13-lockmeter/arch/ppc64/lib/dec_and_lock.c 2005-10-10 13:02:47.000000000 -0700 @@ -28,7 +28,13 @@ */ #ifndef ATOMIC_DEC_AND_LOCK + +#ifndef CONFIG_LOCKMETER +int atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock) +#else int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock) +#endif /* CONFIG_LOCKMETER */ + { int counter; int newcount; @@ -51,5 +57,10 @@ int _atomic_dec_and_lock(atomic_t *atomi return 0; } +#ifndef CONFIG_LOCKMETER +EXPORT_SYMBOL(atomic_dec_and_lock); +#else EXPORT_SYMBOL(_atomic_dec_and_lock); +#endif /* CONFIG_LOCKMETER */ + #endif /* ATOMIC_DEC_AND_LOCK */ diff -Narup linux-2.6.13/include/asm-ppc64/lockmeter.h linux-2.6.13-lockmeter/include/asm-ppc64/lockmeter.h --- linux-2.6.13/include/asm-ppc64/lockmeter.h 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.13-lockmeter/include/asm-ppc64/lockmeter.h 2005-10-11 08:48:48.000000000 -0700 @@ -0,0 +1,110 @@ +/* + * Copyright (C) 1999,2000 Silicon Graphics, Inc. + * + * Written by John Hawkes (hawkes at sgi.com) + * Based on klstat.h by Jack Steiner (steiner at sgi.com) + * + * Modified by Ray Bryant (raybry at us.ibm.com) + * Changes Copyright (C) 2000 IBM, Inc. + * Added save of index in spinlock_t to improve efficiency + * of "hold" time reporting for spinlocks. + * Added support for hold time statistics for read and write + * locks. + * Moved machine dependent code here from include/lockmeter.h. + * + * Modified by Tony Garcia (garcia1 at us.ibm.com) + * Ported to Power PC 64 + */ + +#ifndef _PPC64_LOCKMETER_H +#define _PPC64_LOCKMETER_H + + +#include +#include +#include + +#include /* definitions for SPRN_TBRL + SPRN_TBRU, mftb() */ +extern unsigned long ppc_proc_freq; + +#define CPU_CYCLE_FREQUENCY ppc_proc_freq + +#define THIS_CPU_NUMBER smp_processor_id() + +/* + * macros to cache and retrieve an index value inside of a spin lock + * these macros assume that there are less than 65536 simultaneous + * (read mode) holders of a rwlock. Not normally a problem!! + * we also assume that the hash table has less than 65535 entries. + */ +/* + * instrumented spinlock structure -- never used to allocate storage + * only used in macros below to overlay a spinlock_t + */ +typedef struct inst_spinlock_s { + volatile unsigned int lock; + unsigned int index; +} inst_spinlock_t; + +#define PUT_INDEX(lock_ptr,indexv) ((inst_spinlock_t *)(lock_ptr))->index = indexv +#define GET_INDEX(lock_ptr) ((inst_spinlock_t *)(lock_ptr))->index + +/* + * macros to cache and retrieve an index value in a read/write lock + * as well as the cpu where a reader busy period started + * we use the 2nd word (the debug word) for this, so require the + * debug word to be present + */ +/* + * instrumented rwlock structure -- never used to allocate storage + * only used in macros below to overlay a rwlock_t + */ +typedef struct inst_rwlock_s { + volatile signed int lock; + unsigned int index; + unsigned int cpu; +} inst_rwlock_t; + +#define PUT_RWINDEX(rwlock_ptr,indexv) ((inst_rwlock_t *)(rwlock_ptr))->index = indexv +#define GET_RWINDEX(rwlock_ptr) ((inst_rwlock_t *)(rwlock_ptr))->index +#define PUT_RW_CPU(rwlock_ptr,cpuv) ((inst_rwlock_t *)(rwlock_ptr))->cpu = cpuv +#define GET_RW_CPU(rwlock_ptr) ((inst_rwlock_t *)(rwlock_ptr))->cpu + +/* + * return the number of readers for a rwlock_t + */ +#define RWLOCK_READERS(rwlock_ptr) rwlock_readers(rwlock_ptr) + +/* Return number of readers */ +extern inline int rwlock_readers(rwlock_t *rwlock_ptr) +{ + signed int tmp = rwlock_ptr->lock; + + if ( tmp > 0 ) + return tmp; + else + return 0; +} + +/* + * return true if rwlock is write locked + * (note that other lock attempts can cause the lock value to be negative) + */ +#define RWLOCK_IS_WRITE_LOCKED(rwlock_ptr) ((signed int)(rwlock_ptr)->lock < 0) +#define RWLOCK_IS_READ_LOCKED(rwlock_ptr) ((signed int)(rwlock_ptr)->lock > 0 ) + +/*Written by Carl L. to get the time base counters on ppc, + rplaces the Intel only call rtds*/ +static inline long get_cycles64 (void) +{ + unsigned long tb; + + /* read the upper and lower 32 bit Time base counter */ + tb = mfspr(SPRN_TBRU); + tb = (tb << 32) | mfspr(SPRN_TBRL); + + return(tb); +} + +#endif /* _PPC64_LOCKMETER_H */ diff -Narup linux-2.6.13/include/asm-ppc64/spinlock.h linux-2.6.13-lockmeter/include/asm-ppc64/spinlock.h --- linux-2.6.13/include/asm-ppc64/spinlock.h 2005-08-28 16:41:01.000000000 -0700 +++ linux-2.6.13-lockmeter/include/asm-ppc64/spinlock.h 2005-10-10 14:04:25.000000000 -0700 @@ -23,6 +23,9 @@ typedef struct { volatile unsigned int lock; +#ifdef CONFIG_LOCKMETER + unsigned int lockmeter_magic; +#endif /* CONFIG_LOCKMETER */ #ifdef CONFIG_PREEMPT unsigned int break_lock; #endif @@ -30,13 +33,20 @@ typedef struct { typedef struct { volatile signed int lock; +#ifdef CONFIG_LOCKMETER + unsigned int index; + unsigned int cpu; +#endif /* CONFIG_LOCKMETER */ #ifdef CONFIG_PREEMPT unsigned int break_lock; #endif } rwlock_t; -#ifdef __KERNEL__ -#define SPIN_LOCK_UNLOCKED (spinlock_t) { 0 } +#ifdef CONFIG_LOCKMETER + #define SPIN_LOCK_UNLOCKED (spinlock_t) { 0 } +#else + #define SPIN_LOCK_UNLOCKED (spinlock_t) { 0 , 0} +#endif /* CONFIG_LOCKMETER */ #define spin_is_locked(x) ((x)->lock != 0) #define spin_lock_init(x) do { *(x) = SPIN_LOCK_UNLOCKED; } while(0) @@ -144,7 +154,7 @@ static void __inline__ _raw_spin_lock_fl * irq-safe write-lock, but readers can get non-irqsafe * read-locks. */ -#define RW_LOCK_UNLOCKED (rwlock_t) { 0 } +#define RW_LOCK_UNLOCKED (rwlock_t) { 0 , 0 , 0 } #define rwlock_init(x) do { *(x) = RW_LOCK_UNLOCKED; } while(0) @@ -157,6 +167,44 @@ static __inline__ void _raw_write_unlock rw->lock = 0; } +#if defined(CONFIG_LOCKMETER) && defined(CONFIG_HAVE_DEC_LOCK) +extern void _metered_spin_lock (spinlock_t *lock, void *caller_pc); +extern void _metered_spin_unlock(spinlock_t *lock); + +/* + * Matches what is in arch/ppc64/lib/dec_and_lock.c, except this one is + * "static inline" so that the spin_lock(), if actually invoked, is charged + * against the real caller, not against the catch-all atomic_dec_and_lock + */ +static inline int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock) +{ + int counter; + int newcount; + + for (;;) { + counter = atomic_read(atomic); + newcount = counter - 1; + if (!newcount) + break; /* do it the slow way */ + + newcount = cmpxchg(&atomic->counter, counter, newcount); + if (newcount == counter) + return 0; + } + + preempt_disable(); + _metered_spin_lock(lock, __builtin_return_address(0)); + if (atomic_dec_and_test(atomic)) + return 1; + _metered_spin_unlock(lock); + preempt_enable(); + + return 0; +} + +#define ATOMIC_DEC_AND_LOCK +#endif /* CONFIG_LOCKMETER and CONFIG_HAVE_DEC_LOCK */ + /* * This returns the old value in the lock + 1, * so we got a read lock if the return value is > 0. @@ -256,5 +304,4 @@ static void __inline__ _raw_write_lock(r } } -#endif /* __KERNEL__ */ #endif /* __ASM_SPINLOCK_H */ From hien at us.ibm.com Wed Nov 2 11:14:42 2005 From: hien at us.ibm.com (Hien Nguyen) Date: Tue, 01 Nov 2005 16:14:42 -0800 Subject: [PATCH] exporting validate_sp Message-ID: <1130890483.4032.20.camel@dyn9047022138.beaverton.ibm.com> This patch will export the validate_sp() function (part of dump_stack code). I am developers for the systemtap project: http://sourceware.org/systemtap/ The SystemTap runtime includes a function for capturing a stack trace as a string. For the ppc64 port, we need to have the validate_sp() function exported so it is accessible to our stack-trace function, which is part of a SystemTap-generated kernel module. This patch should apply to kernel 2.6.14-rc5-mm1. Thanks, Hien. Signed-off-by: Hien Nguyen --- linux-2.6.14-rc5.org/arch/ppc64/kernel/process.c 2005-10-19 23:23:05.000000000 -0700 +++ linux-2.6.14-rc5/arch/ppc64/kernel/process.c 2005-11-01 12:54:23.000000000 -0800 @@ -626,6 +626,7 @@ return 0; } +EXPORT_SYMBOL_GPL(validate_sp); unsigned long get_wchan(struct task_struct *p) { From david at gibson.dropbear.id.au Wed Nov 2 11:06:44 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Wed, 2 Nov 2005 11:06:44 +1100 Subject: please pull the powerpc-merge.git tree In-Reply-To: <1130860444.21212.52.camel@hades.cambridge.redhat.com> References: <17253.39993.502458.390760@cargo.ozlabs.ibm.com> <1130860444.21212.52.camel@hades.cambridge.redhat.com> Message-ID: <20051102000644.GB8308@localhost.localdomain> On Tue, Nov 01, 2005 at 03:54:04PM +0000, David Woodhouse wrote: > On Mon, 2005-10-31 at 15:23 +1100, Paul Mackerras wrote: > > It is now possible to build kernels for powermac, pSeries, iSeries and > > maple with ARCH=powerpc, and for powermac, both 32-bit and 64-bit > > build and run. > > Hm. Not entirely in line with my experience. Can you share the configs > you used? I gather paulus doesn't believe in CONFIG_TAU. > Using http://david/woodhou.se/powerpc-merge-32.config it doesn't > actually boot on my powerbook. I'll try it on the Pegasos later or > tomorrow, where I have a serial console; it dies very early. > > Aside from disabling CONFIG_NVRAM because call_rtas() isn't implemented > anywhere, I also needed to do this to make that config build: > > --- linux-2.6.14/arch/powerpc/kernel/setup-common.c.orig 2005-11-01 10:14:32.000000000 +0000 > +++ linux-2.6.14/arch/powerpc/kernel/setup-common.c 2005-11-01 10:15:03.000000000 +0000 > @@ -203,11 +203,11 @@ static int show_cpuinfo(struct seq_file > #ifdef CONFIG_TAU_AVERAGE > /* more straightforward, but potentially misleading */ > seq_printf(m, "temperature \t: %u C (uncalibrated)\n", > - cpu_temp(i)); > + cpu_temp(cpu_id)); > #else > /* show the actual temp sensor range */ > u32 temp; > - temp = cpu_temp_both(i); > + temp = cpu_temp_both(cpu_id); > seq_printf(m, "temperature \t: %u-%u C (uncalibrated)\n", > temp & 0xff, temp >> 16); > #endif > > -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson From david at gibson.dropbear.id.au Wed Nov 2 11:44:26 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Wed, 2 Nov 2005 11:44:26 +1100 Subject: powerpc: Merge ipcbuf.h In-Reply-To: <200511012013.51443.arnd@arndb.de> References: <20051101055324.GA3551@localhost.localdomain> <200511012013.51443.arnd@arndb.de> Message-ID: <20051102004426.GC8308@localhost.localdomain> On Tue, Nov 01, 2005 at 08:13:50PM +0100, Arnd Bergmann wrote: > On Dinsdag 01 November 2005 06:53, David Gibson wrote: > > +struct ipc64_perm > > +{ > > +???????__kernel_key_t??key; > > +???????__kernel_uid_t??uid; > > +???????__kernel_gid_t??gid; > > +???????__kernel_uid_t??cuid; > > +???????__kernel_gid_t??cgid; > > +???????__kernel_mode_t?mode; > > +???????unsigned int????seq; > > +???????unsigned int????__pad1; > > +???????u64?????????????__unused1; > > +???????u64?????????????__unused2; > > +}; > > ipc64_perm is a user visible structure, so you have to use > __u64 here instead of u64. Even that does not exists if > you build with 32 bit and __STRICT_ANSI__, so it might > be better yet to use four __u32 for the unused fields. Oops. I realised it was user visible, but forgot the wrinkle that the 'uXX' names can't be used there. Here's a patch to correct it. Paulus, please apply. Oops, when merging ipcbuf.h, I forgot that 'u64' can't be used in user-visible headers. This patch corrects the problem, replacing the unused fields with an array of four __u32s. Signed-off-by: David Gibson Index: working-2.6/include/asm-powerpc/ipcbuf.h =================================================================== --- working-2.6.orig/include/asm-powerpc/ipcbuf.h 2005-11-02 10:41:06.000000000 +1100 +++ working-2.6/include/asm-powerpc/ipcbuf.h 2005-11-02 11:41:36.000000000 +1100 @@ -27,8 +27,7 @@ __kernel_mode_t mode; unsigned int seq; unsigned int __pad1; - u64 __unused1; - u64 __unused2; + __u32 __unused[4]; }; #endif /* _ASM_POWERPC_IPCBUF_H */ -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson From paulus at samba.org Wed Nov 2 12:03:26 2005 From: paulus at samba.org (Paul Mackerras) Date: Wed, 2 Nov 2005 12:03:26 +1100 Subject: please pull the powerpc-merge.git tree In-Reply-To: <1130860444.21212.52.camel@hades.cambridge.redhat.com> References: <17253.39993.502458.390760@cargo.ozlabs.ibm.com> <1130860444.21212.52.camel@hades.cambridge.redhat.com> Message-ID: <17256.4190.184855.821331@cargo.ozlabs.ibm.com> David Woodhouse writes: > Hm. Not entirely in line with my experience. Can you share the configs > you used? Sure, attached (as a .tar.gz). For 32-bit pmac, you currently have to disable CONFIG_PREP and (I believe) the TAU options. For the 64-bit configs I basically just used the defconfigs in arch/ppc64/configs. > Using http://david/woodhou.se/powerpc-merge-32.config it doesn't > actually boot on my powerbook. I'll try it on the Pegasos later or > tomorrow, where I have a serial console; it dies very early. That's probably either the pci quirk that got added to do USB host controller handoff unconditionally on all platforms, and which touches the device without doing pci_enable_device or checking whether MMIO is enabled. A fix has gone into Linus' tree for that. There was also a bug added to the adbhid.c driver which would cause an oops when you pressed a key if you had an ADB keyboard (which powerbooks do). That's also fixed in Linus' tree. > Aside from disabling CONFIG_NVRAM because call_rtas() isn't implemented > anywhere, I also needed to do this to make that config build: > > --- linux-2.6.14/arch/powerpc/kernel/setup-common.c.orig 2005-11-01 10:14:32.000000000 +0000 > +++ linux-2.6.14/arch/powerpc/kernel/setup-common.c 2005-11-01 10:15:03.000000000 +0000 > @@ -203,11 +203,11 @@ static int show_cpuinfo(struct seq_file > #ifdef CONFIG_TAU_AVERAGE > /* more straightforward, but potentially misleading */ > seq_printf(m, "temperature \t: %u C (uncalibrated)\n", > - cpu_temp(i)); > + cpu_temp(cpu_id)); > #else > /* show the actual temp sensor range */ > u32 temp; > - temp = cpu_temp_both(i); > + temp = cpu_temp_both(cpu_id); > seq_printf(m, "temperature \t: %u-%u C (uncalibrated)\n", > temp & 0xff, temp >> 16); > #endif Thanks, I'll put that in. Paul. From paulus at samba.org Wed Nov 2 12:04:07 2005 From: paulus at samba.org (Paul Mackerras) Date: Wed, 2 Nov 2005 12:04:07 +1100 Subject: please pull the powerpc-merge.git tree In-Reply-To: <1130860444.21212.52.camel@hades.cambridge.redhat.com> References: <17253.39993.502458.390760@cargo.ozlabs.ibm.com> <1130860444.21212.52.camel@hades.cambridge.redhat.com> Message-ID: <17256.4231.124394.723713@cargo.ozlabs.ibm.com> David Woodhouse writes: > Hm. Not entirely in line with my experience. Can you share the configs > you used? Forgot to attach the configs on my previous reply. Paul. -------------- next part -------------- A non-text attachment was scrubbed... Name: configs.tar.gz Type: application/octet-stream Size: 19644 bytes Desc: config tarball Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20051102/55f01850/attachment.obj From david at gibson.dropbear.id.au Wed Nov 2 13:58:22 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Wed, 2 Nov 2005 13:58:22 +1100 Subject: powerpc: Merge futex.h Message-ID: <20051102025822.GB10682@localhost.localdomain> This patch merges the ppc32 and ppc64 versions of futex.h, essentially by taking the ppc64 version as the powerpc version. The old ppc32 version did not implement the futex_atomic_op_inuser() callback (it always returned -ENOSYS), so FUTEX_WAKE_OP would not work on ppc32. In fact the ppc64 version of this function is almost suitable for ppc32 as well - the only change needed is to extend ppc_asm.h with a macro expanding to to the right pseudo-op to store a pointer (either ".long" or ".llong"). Built and booted on pSeries. Built for 32-bit powermac. Signed-off-by: David Gibson Index: working-2.6/include/asm-powerpc/futex.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ working-2.6/include/asm-powerpc/futex.h 2005-11-02 13:43:08.000000000 +1100 @@ -0,0 +1,84 @@ +#ifndef _ASM_POWERPC_FUTEX_H +#define _ASM_POWERPC_FUTEX_H + +#ifdef __KERNEL__ + +#include +#include +#include +#include +#include + +#define __futex_atomic_op(insn, ret, oldval, uaddr, oparg) \ + __asm__ __volatile ( \ + SYNC_ON_SMP \ +"1: lwarx %0,0,%2\n" \ + insn \ +"2: stwcx. %1,0,%2\n" \ + "bne- 1b\n" \ + "li %1,0\n" \ +"3: .section .fixup,\"ax\"\n" \ +"4: li %1,%3\n" \ + "b 3b\n" \ + ".previous\n" \ + ".section __ex_table,\"a\"\n" \ + ".align 3\n" \ + DATAL " 1b,4b,2b,4b\n" \ + ".previous" \ + : "=&r" (oldval), "=&r" (ret) \ + : "b" (uaddr), "i" (-EFAULT), "1" (oparg) \ + : "cr0", "memory") + +static inline int futex_atomic_op_inuser (int encoded_op, int __user *uaddr) +{ + int op = (encoded_op >> 28) & 7; + int cmp = (encoded_op >> 24) & 15; + int oparg = (encoded_op << 8) >> 20; + int cmparg = (encoded_op << 20) >> 20; + int oldval = 0, ret; + if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28)) + oparg = 1 << oparg; + + if (! access_ok (VERIFY_WRITE, uaddr, sizeof(int))) + return -EFAULT; + + inc_preempt_count(); + + switch (op) { + case FUTEX_OP_SET: + __futex_atomic_op("", ret, oldval, uaddr, oparg); + break; + case FUTEX_OP_ADD: + __futex_atomic_op("add %1,%0,%1\n", ret, oldval, uaddr, oparg); + break; + case FUTEX_OP_OR: + __futex_atomic_op("or %1,%0,%1\n", ret, oldval, uaddr, oparg); + break; + case FUTEX_OP_ANDN: + __futex_atomic_op("andc %1,%0,%1\n", ret, oldval, uaddr, oparg); + break; + case FUTEX_OP_XOR: + __futex_atomic_op("xor %1,%0,%1\n", ret, oldval, uaddr, oparg); + break; + default: + ret = -ENOSYS; + } + + dec_preempt_count(); + + if (!ret) { + switch (cmp) { + case FUTEX_OP_CMP_EQ: ret = (oldval == cmparg); break; + case FUTEX_OP_CMP_NE: ret = (oldval != cmparg); break; + case FUTEX_OP_CMP_LT: ret = (oldval < cmparg); break; + case FUTEX_OP_CMP_GE: ret = (oldval >= cmparg); break; + case FUTEX_OP_CMP_LE: ret = (oldval <= cmparg); break; + case FUTEX_OP_CMP_GT: ret = (oldval > cmparg); break; + default: ret = -ENOSYS; + } + } + return ret; +} + +#endif /* __KERNEL__ */ +#endif /* _ASM_POWERPC_FUTEX_H */ Index: working-2.6/include/asm-ppc/futex.h =================================================================== --- working-2.6.orig/include/asm-ppc/futex.h 2005-10-25 11:59:59.000000000 +1000 +++ /dev/null 1970-01-01 00:00:00.000000000 +0000 @@ -1,53 +0,0 @@ -#ifndef _ASM_FUTEX_H -#define _ASM_FUTEX_H - -#ifdef __KERNEL__ - -#include -#include -#include - -static inline int -futex_atomic_op_inuser (int encoded_op, int __user *uaddr) -{ - int op = (encoded_op >> 28) & 7; - int cmp = (encoded_op >> 24) & 15; - int oparg = (encoded_op << 8) >> 20; - int cmparg = (encoded_op << 20) >> 20; - int oldval = 0, ret; - if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28)) - oparg = 1 << oparg; - - if (! access_ok (VERIFY_WRITE, uaddr, sizeof(int))) - return -EFAULT; - - inc_preempt_count(); - - switch (op) { - case FUTEX_OP_SET: - case FUTEX_OP_ADD: - case FUTEX_OP_OR: - case FUTEX_OP_ANDN: - case FUTEX_OP_XOR: - default: - ret = -ENOSYS; - } - - dec_preempt_count(); - - if (!ret) { - switch (cmp) { - case FUTEX_OP_CMP_EQ: ret = (oldval == cmparg); break; - case FUTEX_OP_CMP_NE: ret = (oldval != cmparg); break; - case FUTEX_OP_CMP_LT: ret = (oldval < cmparg); break; - case FUTEX_OP_CMP_GE: ret = (oldval >= cmparg); break; - case FUTEX_OP_CMP_LE: ret = (oldval <= cmparg); break; - case FUTEX_OP_CMP_GT: ret = (oldval > cmparg); break; - default: ret = -ENOSYS; - } - } - return ret; -} - -#endif -#endif Index: working-2.6/include/asm-ppc64/futex.h =================================================================== --- working-2.6.orig/include/asm-ppc64/futex.h 2005-10-31 15:20:22.000000000 +1100 +++ /dev/null 1970-01-01 00:00:00.000000000 +0000 @@ -1,83 +0,0 @@ -#ifndef _ASM_FUTEX_H -#define _ASM_FUTEX_H - -#ifdef __KERNEL__ - -#include -#include -#include -#include - -#define __futex_atomic_op(insn, ret, oldval, uaddr, oparg) \ - __asm__ __volatile (SYNC_ON_SMP \ -"1: lwarx %0,0,%2\n" \ - insn \ -"2: stwcx. %1,0,%2\n\ - bne- 1b\n\ - li %1,0\n\ -3: .section .fixup,\"ax\"\n\ -4: li %1,%3\n\ - b 3b\n\ - .previous\n\ - .section __ex_table,\"a\"\n\ - .align 3\n\ - .llong 1b,4b,2b,4b\n\ - .previous" \ - : "=&r" (oldval), "=&r" (ret) \ - : "b" (uaddr), "i" (-EFAULT), "1" (oparg) \ - : "cr0", "memory") - -static inline int -futex_atomic_op_inuser (int encoded_op, int __user *uaddr) -{ - int op = (encoded_op >> 28) & 7; - int cmp = (encoded_op >> 24) & 15; - int oparg = (encoded_op << 8) >> 20; - int cmparg = (encoded_op << 20) >> 20; - int oldval = 0, ret; - if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28)) - oparg = 1 << oparg; - - if (! access_ok (VERIFY_WRITE, uaddr, sizeof(int))) - return -EFAULT; - - inc_preempt_count(); - - switch (op) { - case FUTEX_OP_SET: - __futex_atomic_op("", ret, oldval, uaddr, oparg); - break; - case FUTEX_OP_ADD: - __futex_atomic_op("add %1,%0,%1\n", ret, oldval, uaddr, oparg); - break; - case FUTEX_OP_OR: - __futex_atomic_op("or %1,%0,%1\n", ret, oldval, uaddr, oparg); - break; - case FUTEX_OP_ANDN: - __futex_atomic_op("andc %1,%0,%1\n", ret, oldval, uaddr, oparg); - break; - case FUTEX_OP_XOR: - __futex_atomic_op("xor %1,%0,%1\n", ret, oldval, uaddr, oparg); - break; - default: - ret = -ENOSYS; - } - - dec_preempt_count(); - - if (!ret) { - switch (cmp) { - case FUTEX_OP_CMP_EQ: ret = (oldval == cmparg); break; - case FUTEX_OP_CMP_NE: ret = (oldval != cmparg); break; - case FUTEX_OP_CMP_LT: ret = (oldval < cmparg); break; - case FUTEX_OP_CMP_GE: ret = (oldval >= cmparg); break; - case FUTEX_OP_CMP_LE: ret = (oldval <= cmparg); break; - case FUTEX_OP_CMP_GT: ret = (oldval > cmparg); break; - default: ret = -ENOSYS; - } - } - return ret; -} - -#endif -#endif Index: working-2.6/include/asm-powerpc/ppc_asm.h =================================================================== --- working-2.6.orig/include/asm-powerpc/ppc_asm.h 2005-10-31 15:20:57.000000000 +1100 +++ working-2.6/include/asm-powerpc/ppc_asm.h 2005-11-02 13:48:08.000000000 +1100 @@ -506,6 +506,13 @@ #else #define __ASM_CONST(x) x##UL #define ASM_CONST(x) __ASM_CONST(x) + +#ifdef CONFIG_PPC64 +#define DATAL ".llong" +#else +#define DATAL ".long" +#endif + #endif /* __ASSEMBLY__ */ #endif /* _ASM_POWERPC_PPC_ASM_H */ -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson From david at gibson.dropbear.id.au Wed Nov 2 15:13:20 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Wed, 2 Nov 2005 15:13:20 +1100 Subject: powerpc: Move dart.h Message-ID: <20051102041320.GA15666@localhost.localdomain> asm-ppc64/dart.h is included in exactly one place - arch/powerpc/sysdev/u3_iommu.c. This patch, therefore, moves it into arch/powerpc/sysdev. While we're at it, update the #ifndef/#define protecting the include, and the filename in the comments of u3_iommu.c. Built and booted on pSeries and G5, built for ppc32 powermac. Signed-off-by: David Gibson Index: working-2.6/arch/powerpc/sysdev/dart.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ working-2.6/arch/powerpc/sysdev/dart.h 2005-11-02 14:53:28.000000000 +1100 @@ -0,0 +1,59 @@ +/* + * Copyright (C) 2004 Olof Johansson , IBM Corporation + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef _POWERPC_SYSDEV_DART_H +#define _POWERPC_SYSDEV_DART_H + + +/* physical base of DART registers */ +#define DART_BASE 0xf8033000UL + +/* Offset from base to control register */ +#define DARTCNTL 0 +/* Offset from base to exception register */ +#define DARTEXCP 0x10 +/* Offset from base to TLB tag registers */ +#define DARTTAG 0x1000 + + +/* Control Register fields */ + +/* base address of table (pfn) */ +#define DARTCNTL_BASE_MASK 0xfffff +#define DARTCNTL_BASE_SHIFT 12 + +#define DARTCNTL_FLUSHTLB 0x400 +#define DARTCNTL_ENABLE 0x200 + +/* size of table in pages */ +#define DARTCNTL_SIZE_MASK 0x1ff +#define DARTCNTL_SIZE_SHIFT 0 + + +/* DART table fields */ + +#define DARTMAP_VALID 0x80000000 +#define DARTMAP_RPNMASK 0x00ffffff + + +#define DART_PAGE_SHIFT 12 +#define DART_PAGE_SIZE (1 << DART_PAGE_SHIFT) +#define DART_PAGE_FACTOR (PAGE_SHIFT - DART_PAGE_SHIFT) + + +#endif /* _POWERPC_SYSDEV_DART_H */ Index: working-2.6/include/asm-ppc64/dart.h =================================================================== --- working-2.6.orig/include/asm-ppc64/dart.h 2005-10-31 15:20:22.000000000 +1100 +++ /dev/null 1970-01-01 00:00:00.000000000 +0000 @@ -1,59 +0,0 @@ -/* - * Copyright (C) 2004 Olof Johansson , IBM Corporation - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA - */ - -#ifndef _ASM_DART_H -#define _ASM_DART_H - - -/* physical base of DART registers */ -#define DART_BASE 0xf8033000UL - -/* Offset from base to control register */ -#define DARTCNTL 0 -/* Offset from base to exception register */ -#define DARTEXCP 0x10 -/* Offset from base to TLB tag registers */ -#define DARTTAG 0x1000 - - -/* Control Register fields */ - -/* base address of table (pfn) */ -#define DARTCNTL_BASE_MASK 0xfffff -#define DARTCNTL_BASE_SHIFT 12 - -#define DARTCNTL_FLUSHTLB 0x400 -#define DARTCNTL_ENABLE 0x200 - -/* size of table in pages */ -#define DARTCNTL_SIZE_MASK 0x1ff -#define DARTCNTL_SIZE_SHIFT 0 - - -/* DART table fields */ - -#define DARTMAP_VALID 0x80000000 -#define DARTMAP_RPNMASK 0x00ffffff - - -#define DART_PAGE_SHIFT 12 -#define DART_PAGE_SIZE (1 << DART_PAGE_SHIFT) -#define DART_PAGE_FACTOR (PAGE_SHIFT - DART_PAGE_SHIFT) - - -#endif Index: working-2.6/arch/powerpc/sysdev/u3_iommu.c =================================================================== --- working-2.6.orig/arch/powerpc/sysdev/u3_iommu.c 2005-10-31 15:20:20.000000000 +1100 +++ working-2.6/arch/powerpc/sysdev/u3_iommu.c 2005-11-02 14:54:16.000000000 +1100 @@ -1,5 +1,5 @@ /* - * arch/ppc64/kernel/u3_iommu.c + * arch/powerpc/sysdev/u3_iommu.c * * Copyright (C) 2004 Olof Johansson , IBM Corporation * @@ -44,9 +44,10 @@ #include #include #include -#include #include +#include "dart.h" + extern int iommu_force_on; /* Physical base address and size of the DART table */ -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson From paulus at samba.org Wed Nov 2 15:43:44 2005 From: paulus at samba.org (Paul Mackerras) Date: Wed, 2 Nov 2005 15:43:44 +1100 Subject: please pull the powerpc-merge.git tree In-Reply-To: <1130861207.21212.66.camel@hades.cambridge.redhat.com> References: <17253.39993.502458.390760@cargo.ozlabs.ibm.com> <1130861207.21212.66.camel@hades.cambridge.redhat.com> Message-ID: <17256.17408.247708.622755@cargo.ozlabs.ibm.com> David Woodhouse writes: > The ppc64 build (http://david.woodhou.se/powerpc-merge-64.config) fares > worse than ppc32 for me -- it doesn't even build. Those errors were all due to getting powerbook sleep code included because you have CONFIG_PM=y. I have changed things so that that code doesn't get included on a 64-bit build (at least until BenH gets sleep going on the G5 :). I also pulled in Linus' tree, and now drivers/char/tlclk.c fails to build for some reason. I claim that's not my fault, however. :) Paul. From michael at ellerman.id.au Wed Nov 2 18:23:33 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Wed, 2 Nov 2005 18:23:33 +1100 (EST) Subject: [PATCH] powerpc: Fix random memory corruption in merged elf.h Message-ID: <20051102072333.B91D568664@ozlabs.org> The merged verison of ELF_CORE_COPY_REGS is basically the PPC64 version, with a memset that came from PPC and a few types abstracted out into #defines. But it's not _quite_ right. The first problem is we calculate the number of registers with: nregs = sizeof(struct pt_regs) / sizeof(ELF_GREG_TYPE) For a 32-bit process on a 64-bit kernel that's bogus because the registers are 64 bits, but ELF_GREG_TYPE is u32, so nregs == 88 which is wrong. The other problem is the memset, which assumes a struct pt_regs is smaller than a struct elf_regs. For a 32-bit process on a 64-bit kernel that's false. The fix is to calculate the number of regs using sizeof(unsigned long), which should always be right, and just memset the whole damn thing _before_ copying the registers in. Signed-off-by: Michael Ellerman --- include/asm-powerpc/elf.h | 22 +++++++++++++--------- 1 files changed, 13 insertions(+), 9 deletions(-) Index: kexec/include/asm-powerpc/elf.h =================================================================== --- kexec.orig/include/asm-powerpc/elf.h +++ kexec/include/asm-powerpc/elf.h @@ -178,18 +178,22 @@ typedef elf_vrreg_t elf_vrregset_t32[ELF static inline void ppc_elf_core_copy_regs(elf_gregset_t elf_regs, struct pt_regs *regs) { - int i; - int gprs = sizeof(struct pt_regs)/sizeof(ELF_GREG_TYPE); + int i, nregs; - if (gprs > ELF_NGREG) - gprs = ELF_NGREG; + memset((void *)elf_regs, 0, sizeof(elf_gregset_t)); - for (i=0; i < gprs; i++) + /* Our registers are always unsigned longs, whether we're a 32 bit + * process or 64 bit, on either a 64 bit or 32 bit kernel. + * Don't use ELF_GREG_TYPE here. */ + nregs = sizeof(struct pt_regs) / sizeof(unsigned long); + if (nregs > ELF_NGREG) + nregs = ELF_NGREG; + + for (i = 0; i < nregs; i++) { + /* This will correctly truncate 64 bit registers to 32 bits + * for a 32 bit process on a 64 bit kernel. */ elf_regs[i] = (elf_greg_t)((ELF_GREG_TYPE *)regs)[i]; - - memset((char *)(elf_regs) + sizeof(struct pt_regs), 0, \ - sizeof(elf_gregset_t) - sizeof(struct pt_regs)); - + } } #define ELF_CORE_COPY_REGS(gregs, regs) ppc_elf_core_copy_regs(gregs, regs); From benh at kernel.crashing.org Wed Nov 2 18:23:18 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Wed, 02 Nov 2005 18:23:18 +1100 Subject: [PATCH] ppc64: 64K pages support In-Reply-To: <1130915220.20136.14.camel@gaston> References: <1130915220.20136.14.camel@gaston> Message-ID: <1130916198.20136.17.camel@gaston> On Wed, 2005-11-02 at 18:07 +1100, Benjamin Herrenschmidt wrote: > It took a while, but finally, here is the 64K pages support patch for > ppc64. This patch adds a new CONFIG_PPC_64K_PAGES which, when enabled, > changes the kernel base page size to 64K. The resulting kernel still > boots on any hardware. On current machines with 4K pages support only, > the kernel will maintain 16 "subpages" for each 64K page > transparently. > > Note that while real 64K capable HW has been tested, the current patch > will not enable it yet as such hardware is not released yet, and I'm > still verifying with the firmware architects the proper to get the > information from the newer hypervisors. > > Signed-off-by: Benjamin Herrenschmidt Oh, and since the mailing lists are probably filtering this out due to the patch size, here's an URL where you can find it too: http://gate.crashing.org/~benh/ppc64-64k-pages.diff Ben. From dwmw2 at infradead.org Wed Nov 2 18:35:12 2005 From: dwmw2 at infradead.org (David Woodhouse) Date: Wed, 02 Nov 2005 07:35:12 +0000 Subject: please pull the powerpc-merge.git tree In-Reply-To: <17256.17408.247708.622755@cargo.ozlabs.ibm.com> References: <17253.39993.502458.390760@cargo.ozlabs.ibm.com> <1130861207.21212.66.camel@hades.cambridge.redhat.com> <17256.17408.247708.622755@cargo.ozlabs.ibm.com> Message-ID: <1130916912.10031.143.camel@baythorne.infradead.org> On Wed, 2005-11-02 at 15:43 +1100, Paul Mackerras wrote: > I also pulled in Linus' tree, and now drivers/char/tlclk.c fails to > build for some reason. I claim that's not my fault, however. :) It just needs included. That reminds me -- I needed that in platforms/chrp/pegasos_eth.c too. diff --git a/arch/powerpc/platforms/chrp/pegasos_eth.c b/arch/powerpc/platforms/chrp/pegasos_eth.c --- a/arch/powerpc/platforms/chrp/pegasos_eth.c +++ b/arch/powerpc/platforms/chrp/pegasos_eth.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #define PEGASOS2_MARVELL_REGBASE (0xf1000000) diff --git a/drivers/char/tlclk.c b/drivers/char/tlclk.c --- a/drivers/char/tlclk.c +++ b/drivers/char/tlclk.c @@ -43,6 +43,7 @@ #include #include #include +#include #include /* inb/outb */ #include -- dwmw2 From dwmw2 at infradead.org Wed Nov 2 20:06:19 2005 From: dwmw2 at infradead.org (David Woodhouse) Date: Wed, 02 Nov 2005 09:06:19 +0000 Subject: please pull the powerpc-merge.git tree In-Reply-To: <17256.4190.184855.821331@cargo.ozlabs.ibm.com> References: <17253.39993.502458.390760@cargo.ozlabs.ibm.com> <1130860444.21212.52.camel@hades.cambridge.redhat.com> <17256.4190.184855.821331@cargo.ozlabs.ibm.com> Message-ID: <1130922379.10031.154.camel@baythorne.infradead.org> On Wed, 2005-11-02 at 12:03 +1100, Paul Mackerras wrote: > That's probably either the pci quirk that got added to do USB host > controller handoff unconditionally on all platforms, and which touches > the device without doing pci_enable_device or checking whether MMIO is > enabled. A fix has gone into Linus' tree for that. > > There was also a bug added to the adbhid.c driver which would cause an > oops when you pressed a key if you had an ADB keyboard (which > powerbooks do). That's also fixed in Linus' tree. It was neither of those -- after a few warnings about sleeping in inappropriate contexts it just seems to stop. The Pegasos is a little more informative -- lots of 'hda: lost interrupt' on that. Keyboard seems to work though, and Bogomips calculation -- so maybe it's just PCI interrupts which are missing. I'll poke at it further. I'll also try again on the powerbook today and see if I can get anything more useful out of it. -- dwmw2 From vst at vlnb.net Wed Nov 2 17:51:41 2005 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Wed, 02 Nov 2005 09:51:41 +0300 Subject: [PATCH 0/3] ibmvscsis scsi target In-Reply-To: <435EE61C.8020404@torque.net> References: <20051017143644.GA9992@cs.umn.edu> <435EE61C.8020404@torque.net> Message-ID: <436861FD.6080300@vlnb.net> Douglas Gilbert wrote: > Dave Boutcher wrote: > >>James, >> >>Here's the ibmvscsis SCSI target submitted for inclusion in 2.4.15. >>This driver meets a couple of akpm's criteria for worthiness, in that >>its actually been shipping for a while in a distro kernel, and (given >>the posts when I broke compatibility) is being used. >> >>This version is basically the same as the recent RFC version I sent >>out, with a few bug fixes. It addresses a comment from Anton about >>using gratuitously small max_sectors limits, and has a few other >>miscellanious fixes. >> >>The only other significant comment generated by the the RFC was from >>Christoph, and requested that this work be combined with the sgtg work >>that Mike Christie and Tomonori Fujita are working on. I definitely >>will start contributing to that work, and will convert this driver to >>their framework when it becomes complete. I would rather not keep >>this driver out of mainline for the amount of time that may take. > > > Dave, > While I'm partial to things that start with "sg...", I > had problems finding that project until I tried "stgt". Doug, Dave, Have you seen SCST (SCSI target mid-layer for Linux) on http://scst.sourceforge.net? It's much more advanced, than stgt, and much more moved ahead. Vlad From schwab at suse.de Thu Nov 3 00:12:01 2005 From: schwab at suse.de (Andreas Schwab) Date: Wed, 02 Nov 2005 14:12:01 +0100 Subject: powerpc: Merge ipcbuf.h In-Reply-To: <20051102004426.GC8308@localhost.localdomain> (David Gibson's message of "Wed, 2 Nov 2005 11:44:26 +1100") References: <20051101055324.GA3551@localhost.localdomain> <200511012013.51443.arnd@arndb.de> <20051102004426.GC8308@localhost.localdomain> Message-ID: David Gibson writes: > Oops, when merging ipcbuf.h, I forgot that 'u64' can't be used in > user-visible headers. This patch corrects the problem, replacing the > unused fields with an array of four __u32s. > > Signed-off-by: David Gibson > > Index: working-2.6/include/asm-powerpc/ipcbuf.h > =================================================================== > --- working-2.6.orig/include/asm-powerpc/ipcbuf.h 2005-11-02 10:41:06.000000000 +1100 > +++ working-2.6/include/asm-powerpc/ipcbuf.h 2005-11-02 11:41:36.000000000 +1100 > @@ -27,8 +27,7 @@ > __kernel_mode_t mode; > unsigned int seq; > unsigned int __pad1; > - u64 __unused1; > - u64 __unused2; > + __u32 __unused[4]; I think you are changing the alignment of the structure. A u64 has bigger alignment than a u32[2]. Andreas. -- Andreas Schwab, SuSE Labs, schwab at suse.de SuSE Linux Products GmbH, Maxfeldstra?e 5, 90409 N?rnberg, Germany Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." From johnrose at austin.ibm.com Thu Nov 3 03:29:55 2005 From: johnrose at austin.ibm.com (John Rose) Date: Wed, 02 Nov 2005 10:29:55 -0600 Subject: [PATCH] fix add notifier crashes Message-ID: <1130948995.32348.35.camel@sinatra.austin.ibm.com> Hi Paul- The extraction of PCI stuff from struct device_node left some false assumptions in notifier code. As a result, dynamic add crashes when non-PCI nodes are added. This patch fixes these assumptions. Thanks- John Signed-off-by: John Rose diff -puN arch/ppc64/kernel/pci_dn.c~add_crash_fix arch/ppc64/kernel/pci_dn.c --- 2_6_linus_2/arch/ppc64/kernel/pci_dn.c~add_crash_fix 2005-10-31 10:51:19.000000000 -0600 +++ 2_6_linus_2-johnrose/arch/ppc64/kernel/pci_dn.c 2005-10-31 10:56:47.000000000 -0600 @@ -181,13 +181,14 @@ EXPORT_SYMBOL(fetch_dev_dn); static int pci_dn_reconfig_notifier(struct notifier_block *nb, unsigned long action, void *node) { struct device_node *np = node; - struct pci_dn *pci; + struct pci_dn *pci = NULL; int err = NOTIFY_OK; switch (action) { case PSERIES_RECONFIG_ADD: pci = np->parent->data; - update_dn_pci_info(np, pci->phb); + if (pci) + update_dn_pci_info(np, pci->phb); break; default: err = NOTIFY_DONE; diff -puN arch/powerpc/platforms/pseries/iommu.c~add_crash_fix arch/powerpc/platforms/pseries/iommu.c --- 2_6_linus_2/arch/powerpc/platforms/pseries/iommu.c~add_crash_fix 2005-10-31 15:19:14.000000000 -0600 +++ 2_6_linus_2-johnrose/arch/powerpc/platforms/pseries/iommu.c 2005-10-31 15:20:44.000000000 -0600 @@ -498,7 +498,7 @@ static int iommu_reconfig_notifier(struc switch (action) { case PSERIES_RECONFIG_REMOVE: - if (pci->iommu_table && + if (pci && pci->iommu_table && get_property(np, "ibm,dma-window", NULL)) iommu_free_table(np); break; _ From johnrose at austin.ibm.com Thu Nov 3 03:40:06 2005 From: johnrose at austin.ibm.com (John Rose) Date: Wed, 02 Nov 2005 10:40:06 -0600 Subject: dlpar problem on sles9/openpower In-Reply-To: References: Message-ID: <1130949606.32348.45.camel@sinatra.austin.ibm.com> Hi Ingvar- > On the lpar "mgmt", running sles9 sp2 kernel 2.6.5-7.193-pseries64, we > have the magic rpms from IBM installed: > > evlog-drv-tmpl-0.8-1 > diagela-1.3.0.0-6 > lsvpd-0.12.7-1 > ppc64-utils-2.5-2 > librtas-1.2-1 I assume that you have the rpa-dlpar package as well, since you said DLPAR worked at an earlier point. :) I would check two things. First, check the dmesg for any blurbs around the time of the failure. Second, use "rpttr /var/ct/IW/log/mc/IBM.DRM/trace". The output from this includes much gibberish, but the translated hex strings can include the inputs to the "drmgr" command and any error messages. Check near the bottom of the file. Unfortunately, the HMC annoyingly hides error messages for DLPAR of virtual adapters, so we have to dig here. Good luck- John From dwmw2 at infradead.org Thu Nov 3 03:54:46 2005 From: dwmw2 at infradead.org (David Woodhouse) Date: Wed, 02 Nov 2005 16:54:46 +0000 Subject: please pull the powerpc-merge.git tree In-Reply-To: <17256.17408.247708.622755@cargo.ozlabs.ibm.com> References: <17253.39993.502458.390760@cargo.ozlabs.ibm.com> <1130861207.21212.66.camel@hades.cambridge.redhat.com> <17256.17408.247708.622755@cargo.ozlabs.ibm.com> Message-ID: <1130950487.21212.89.camel@hades.cambridge.redhat.com> On Wed, 2005-11-02 at 15:43 +1100, Paul Mackerras wrote: > Those errors were all due to getting powerbook sleep code included > because you have CONFIG_PM=y. I have changed things so that that code > doesn't get included on a 64-bit build (at least until BenH gets sleep > going on the G5 :). OK, now the Fedora rawhide kernel builds for ppc64 with arch/powerpc and runs on both my POWER5 and G5 test boxes. I need this if I want nvram support on the G5 though. Should we be using CONFIG_GENERIC_NVRAM on ppc64, and actually allowing the nvram support to be optional? --- a/arch/powerpc/platforms/powermac/setup.c +++ b/arch/powerpc/platforms/powermac/setup.c @@ -351,7 +350,7 @@ void __init pmac_setup_arch(void) find_via_pmu(); smu_init(); -#ifdef CONFIG_NVRAM +#if defined(CONFIG_NVRAM) || defined(CONFIG_PPC64) pmac_nvram_init(); #endif -- dwmw2 From hch at lst.de Thu Nov 3 04:40:37 2005 From: hch at lst.de (Christoph Hellwig) Date: Wed, 2 Nov 2005 18:40:37 +0100 Subject: [PATCH] exporting validate_sp In-Reply-To: <1130890483.4032.20.camel@dyn9047022138.beaverton.ibm.com> References: <1130890483.4032.20.camel@dyn9047022138.beaverton.ibm.com> Message-ID: <20051102174037.GA23650@lst.de> On Tue, Nov 01, 2005 at 04:14:42PM -0800, Hien Nguyen wrote: > This patch will export the validate_sp() function (part of dump_stack > code). > > I am developers for the systemtap project: > http://sourceware.org/systemtap/ > > The SystemTap runtime includes a function for capturing a stack trace as > a string. For the ppc64 port, we need to have the validate_sp() > function exported so it is accessible to our stack-trace function, which > is part of a SystemTap-generated kernel module. > > This patch should apply to kernel 2.6.14-rc5-mm1. NACK. this is not something that should be exported. especiall not for some odd crap that hopefully never will get merged. From hien at us.ibm.com Thu Nov 3 04:55:22 2005 From: hien at us.ibm.com (Hien Nguyen) Date: Wed, 02 Nov 2005 09:55:22 -0800 Subject: [PATCH] exporting validate_sp In-Reply-To: <20051102174037.GA23650@lst.de> References: <1130890483.4032.20.camel@dyn9047022138.beaverton.ibm.com> <20051102174037.GA23650@lst.de> Message-ID: <4368FD8A.3090000@us.ibm.com> Christoph Hellwig wrote: > especiall not for >some odd crap that hopefully never will get merged. > > > > I disagree, systemtap is not some odd crap (Redhat, IBM, Intel, Hitachi actively work on this project for a while). And systemtap itself does not try to merge any code to the main kernel. From david at gibson.dropbear.id.au Thu Nov 3 10:13:58 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Thu, 3 Nov 2005 10:13:58 +1100 Subject: powerpc: Merge ipcbuf.h In-Reply-To: References: <20051101055324.GA3551@localhost.localdomain> <200511012013.51443.arnd@arndb.de> <20051102004426.GC8308@localhost.localdomain> Message-ID: <20051102231358.GA24772@localhost.localdomain> On Wed, Nov 02, 2005 at 02:12:01PM +0100, Andreas Schwab wrote: > David Gibson writes: > > > Oops, when merging ipcbuf.h, I forgot that 'u64' can't be used in > > user-visible headers. This patch corrects the problem, replacing the > > unused fields with an array of four __u32s. > > > > Signed-off-by: David Gibson > > > > Index: working-2.6/include/asm-powerpc/ipcbuf.h > > =================================================================== > > --- working-2.6.orig/include/asm-powerpc/ipcbuf.h 2005-11-02 10:41:06.000000000 +1100 > > +++ working-2.6/include/asm-powerpc/ipcbuf.h 2005-11-02 11:41:36.000000000 +1100 > > @@ -27,8 +27,7 @@ > > __kernel_mode_t mode; > > unsigned int seq; > > unsigned int __pad1; > > - u64 __unused1; > > - u64 __unused2; > > + __u32 __unused[4]; > > I think you are changing the alignment of the structure. A u64 has bigger > alignment than a u32[2]. Bother, so it does. Paulus, please apply. powerpc: Keep fixing merged ipcbuf.h Oops, replacing the two u64s in struct ipc64_perm with __u32s changed the alignment of that structure, which could mess up userspace. Revert to using two unsigned long longs (which is what ppc32 had originally). ppc64 orignally had two unsigned longs, but long long is the same size on 64 bit, so this should be ok there too. Signed-off-by: David Gibson Index: working-2.6/include/asm-powerpc/ipcbuf.h =================================================================== --- working-2.6.orig/include/asm-powerpc/ipcbuf.h 2005-11-02 15:47:11.000000000 +1100 +++ working-2.6/include/asm-powerpc/ipcbuf.h 2005-11-03 10:10:58.000000000 +1100 @@ -27,7 +27,8 @@ __kernel_mode_t mode; unsigned int seq; unsigned int __pad1; - __u32 __unused[4]; + unsigned long long __unused1; + unsigned long long __unused2; }; #endif /* _ASM_POWERPC_IPCBUF_H */ -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson From hch at lst.de Thu Nov 3 10:30:11 2005 From: hch at lst.de (Christoph Hellwig) Date: Thu, 3 Nov 2005 00:30:11 +0100 Subject: [PATCH] exporting validate_sp In-Reply-To: <4368FD8A.3090000@us.ibm.com> References: <1130890483.4032.20.camel@dyn9047022138.beaverton.ibm.com> <20051102174037.GA23650@lst.de> <4368FD8A.3090000@us.ibm.com> Message-ID: <20051102233011.GA29200@lst.de> On Wed, Nov 02, 2005 at 09:55:22AM -0800, Hien Nguyen wrote: > Christoph Hellwig wrote: > > > especiall not for > >some odd crap that hopefully never will get merged. > > > > > > > > > I disagree, systemtap is not some odd crap (Redhat, IBM, Intel, Hitachi > actively work on this project for a while). all these companies are known for producing lots of crap. > And systemtap itself does not try to merge any code to the main kernel. and we're never adding exports for out of tree code. you're out of luck. From paulus at samba.org Thu Nov 3 14:16:29 2005 From: paulus at samba.org (Paul Mackerras) Date: Thu, 3 Nov 2005 14:16:29 +1100 Subject: [PATCH] ppc64: 64K pages support In-Reply-To: <1130916198.20136.17.camel@gaston> References: <1130915220.20136.14.camel@gaston> <1130916198.20136.17.camel@gaston> Message-ID: <17257.33037.210237.986072@cargo.ozlabs.ibm.com> Benjamin Herrenschmidt writes: > It took a while, but finally, here is the 64K pages support patch for > ppc64. This patch adds a new CONFIG_PPC_64K_PAGES which, when enabled, > changes the kernel base page size to 64K. The resulting kernel still > boots on any hardware. On current machines with 4K pages support only, > the kernel will maintain 16 "subpages" for each 64K page > transparently. > > Note that while real 64K capable HW has been tested, the current patch > will not enable it yet as such hardware is not released yet, and I'm > still verifying with the firmware architects the proper to get the > information from the newer hypervisors. > > Signed-off-by: Benjamin Herrenschmidt Acked-by: Paul Mackerras From david at gibson.dropbear.id.au Thu Nov 3 16:26:34 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Thu, 3 Nov 2005 16:26:34 +1100 Subject: ppc64: Fix bug in SLB miss handler for hugepages In-Reply-To: <17257.33037.210237.986072@cargo.ozlabs.ibm.com> References: <1130915220.20136.14.camel@gaston> <1130916198.20136.17.camel@gaston> <17257.33037.210237.986072@cargo.ozlabs.ibm.com> Message-ID: <20051103052634.GD24772@localhost.localdomain> On Thu, Nov 03, 2005 at 02:16:29PM +1100, Paul Mackerras wrote: > Benjamin Herrenschmidt writes: > > > It took a while, but finally, here is the 64K pages support patch for > > ppc64. This patch adds a new CONFIG_PPC_64K_PAGES which, when enabled, > > changes the kernel base page size to 64K. The resulting kernel still > > boots on any hardware. On current machines with 4K pages support only, > > the kernel will maintain 16 "subpages" for each 64K page > > transparently. > > > > Note that while real 64K capable HW has been tested, the current patch > > will not enable it yet as such hardware is not released yet, and I'm > > still verifying with the firmware architects the proper to get the > > information from the newer hypervisors. > > > > Signed-off-by: Benjamin Herrenschmidt > > Acked-by: Paul Mackerras This patch, however, should be applied on top to fix some problems with hugepage (some pre-existing, another introduced by this patch). The patch fixes a bug in the SLB miss handler for hugepages on ppc64 introduced by the dynamic hugepage patch (commit id c594adad5653491813959277fb87a2fef54c4e05) due to a misunderstanding of the srd instruction's behaviour (mea culpa). The problem arises when a 64-bit process maps some hugepages in the low 4GB of the address space (unusual). In this case, as well as the 256M segment in question being marked for hugepages, other segments at 32G intervals will be incorrectly marked for hugepages. In the process, this patch tweaks the semantics of the hugepage bitmaps to be more sensible. Previously, an address below 4G was marked for hugepages if the appropriate segment bit in the "low areas" bitmask was set *or* if the low bit in the "high areas" bitmap was set (which would mark all addresses below 1TB for hugepage). With this patch, any given address is governed by a single bitmap. Addresses below 4GB are marked for hugepage if and only if their bit is set in the "low areas" bitmap (256M granularity). Addresses between 4GB and 1TB are marked for hugepage iff the low bit in the "high areas" bitmap is set. Higher addresses are marked for hugepage iff their bit in the "high areas" bitmap is set (1TB granularity). To avoid conflicts, this patch must be applied on top of BenH's pending patch for 64k base page size [0]. As such, this patch also addresses a hugepage problem introduced by that patch. That patch allows hugepages of 1MB in size on hardware which supports it, however, that won't work when using 4k pages (4 level pagetable), because in that case hugepage PTEs are stored at the PMD level, and each PMD entry maps 2MB. This patch simply disallows hugepages in that case (we can do something cleverer to re-enable them some other day). Built, booted, and a handful of hugepage related tests passed on POWER5 LPAR (both ARCH=powerpc and ARCH=ppc64). [0] http://gate.crashing.org/~benh/ppc64-64k-pages.diff Signed-off-by: David Gibson Index: working-2.6/arch/powerpc/mm/slb_low.S =================================================================== --- working-2.6.orig/arch/powerpc/mm/slb_low.S 2005-11-03 14:52:16.000000000 +1100 +++ working-2.6/arch/powerpc/mm/slb_low.S 2005-11-03 14:55:56.000000000 +1100 @@ -80,12 +80,17 @@ BEGIN_FTR_SECTION b 1f END_FTR_SECTION_IFCLR(CPU_FTR_16M_PAGE) + cmpldi r10,16 + + lhz r9,PACALOWHTLBAREAS(r13) + mr r11,r10 + blt 5f + lhz r9,PACAHIGHHTLBAREAS(r13) srdi r11,r10,(HTLB_AREA_SHIFT-SID_SHIFT) - srd r9,r9,r11 - lhz r11,PACALOWHTLBAREAS(r13) - srd r11,r11,r10 - or. r9,r9,r11 + +5: srd r9,r9,r11 + andi. r9,r9,1 beq 1f _GLOBAL(slb_miss_user_load_huge) li r11,0 Index: working-2.6/arch/powerpc/mm/hash_utils_64.c =================================================================== --- working-2.6.orig/arch/powerpc/mm/hash_utils_64.c 2005-11-03 14:52:16.000000000 +1100 +++ working-2.6/arch/powerpc/mm/hash_utils_64.c 2005-11-03 15:40:56.000000000 +1100 @@ -329,12 +329,14 @@ */ if (mmu_psize_defs[MMU_PAGE_16M].shift) mmu_huge_psize = MMU_PAGE_16M; + /* With 4k/4level pagetables, we can't (for now) cope with a + * huge page size < PMD_SIZE */ else if (mmu_psize_defs[MMU_PAGE_1M].shift) mmu_huge_psize = MMU_PAGE_1M; /* Calculate HPAGE_SHIFT and sanity check it */ - if (mmu_psize_defs[mmu_huge_psize].shift > 16 && - mmu_psize_defs[mmu_huge_psize].shift < 28) + if (mmu_psize_defs[mmu_huge_psize].shift > MIN_HUGEPTE_SHIFT && + mmu_psize_defs[mmu_huge_psize].shift < SID_SHIFT) HPAGE_SHIFT = mmu_psize_defs[mmu_huge_psize].shift; else HPAGE_SHIFT = 0; /* No huge pages dude ! */ Index: working-2.6/include/asm-ppc64/pgtable-4k.h =================================================================== --- working-2.6.orig/include/asm-ppc64/pgtable-4k.h 2005-11-03 14:52:16.000000000 +1100 +++ working-2.6/include/asm-ppc64/pgtable-4k.h 2005-11-03 15:38:40.000000000 +1100 @@ -23,6 +23,9 @@ #define PMD_SIZE (1UL << PMD_SHIFT) #define PMD_MASK (~(PMD_SIZE-1)) +/* With 4k base page size, hugepage PTEs go at the PMD level */ +#define MIN_HUGEPTE_SHIFT PMD_SHIFT + /* PUD_SHIFT determines what a third-level page table entry can map */ #define PUD_SHIFT (PMD_SHIFT + PMD_INDEX_SIZE) #define PUD_SIZE (1UL << PUD_SHIFT) Index: working-2.6/include/asm-ppc64/pgtable-64k.h =================================================================== --- working-2.6.orig/include/asm-ppc64/pgtable-64k.h 2005-11-03 14:52:16.000000000 +1100 +++ working-2.6/include/asm-ppc64/pgtable-64k.h 2005-11-03 15:39:07.000000000 +1100 @@ -14,6 +14,9 @@ #define PTRS_PER_PMD (1 << PMD_INDEX_SIZE) #define PTRS_PER_PGD (1 << PGD_INDEX_SIZE) +/* With 4k base page size, hugepage PTEs go at the PMD level */ +#define MIN_HUGEPTE_SHIFT PAGE_SHIFT + /* PMD_SHIFT determines what a second-level page table entry can map */ #define PMD_SHIFT (PAGE_SHIFT + PTE_INDEX_SIZE) #define PMD_SIZE (1UL << PMD_SHIFT) Index: working-2.6/arch/powerpc/mm/hugetlbpage.c =================================================================== --- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c 2005-11-03 14:52:16.000000000 +1100 +++ working-2.6/arch/powerpc/mm/hugetlbpage.c 2005-11-03 15:56:34.000000000 +1100 @@ -212,6 +212,12 @@ BUG_ON(area >= NUM_HIGH_AREAS); + /* Hack, so that each addresses is controlled by exactly one + * of the high or low area bitmaps, the first high area starts + * at 4GB, not 0 */ + if (start == 0) + start = 0x100000000UL; + /* Check no VMAs are in the region */ vma = find_vma(mm, start); if (vma && (vma->vm_start < end)) -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson From olof at lixom.net Fri Nov 4 06:49:27 2005 From: olof at lixom.net (Olof Johansson) Date: Thu, 3 Nov 2005 11:49:27 -0800 Subject: [PATCH] POWERPC/PPC64: Fix CONFIG_SMP=n build for ppc64 Message-ID: <20051103194927.GC8515@pb15.lixom.net> Hi, Below is against 2.6.14-git5: --- Two CONFIG_SMP=n build fixes due to missing includes. Signed-off-by: Olof Johansson Index: 2.6/arch/ppc64/kernel/sysfs.c =================================================================== --- 2.6.orig/arch/ppc64/kernel/sysfs.c 2005-11-03 10:33:42.000000000 -0800 +++ 2.6/arch/ppc64/kernel/sysfs.c 2005-11-03 10:33:51.000000000 -0800 @@ -20,6 +20,7 @@ #include #include #include +#include static DEFINE_PER_CPU(struct cpu, cpu_devices); Index: 2.6/arch/powerpc/kernel/time.c =================================================================== --- 2.6.orig/arch/powerpc/kernel/time.c 2005-11-03 10:45:43.000000000 -0800 +++ 2.6/arch/powerpc/kernel/time.c 2005-11-03 10:49:52.000000000 -0800 @@ -69,6 +69,7 @@ #include #include #endif +#include /* keep track of when we need to update the rtc */ time_t last_rtc_update; From tim.bird at am.sony.com Fri Nov 4 06:59:31 2005 From: tim.bird at am.sony.com (Tim Bird) Date: Thu, 03 Nov 2005 11:59:31 -0800 Subject: [PATCH] exporting validate_sp In-Reply-To: <20051102233011.GA29200@lst.de> References: <1130890483.4032.20.camel@dyn9047022138.beaverton.ibm.com> <20051102174037.GA23650@lst.de> <4368FD8A.3090000@us.ibm.com> <20051102233011.GA29200@lst.de> Message-ID: <436A6C23.2050604@am.sony.com> Christoph Hellwig wrote: > On Wed, Nov 02, 2005 at 09:55:22AM -0800, Hien Nguyen wrote: > >>Christoph Hellwig wrote: >>>especiall not for >>>some odd crap that hopefully never will get merged. >> >>I disagree, systemtap is not some odd crap (Redhat, IBM, Intel, Hitachi >>actively work on this project for a while). > > all these companies are known for producing lots of crap. > >>And systemtap itself does not try to merge any code to the main kernel. > > and we're never adding exports for out of tree code. you're out of > luck. These lines must be directly from Christoph's "How to motivate people" management handbook. Don't worry Hein. Other people (though maybe quieter than Christoph) see value in the SystemTap work. I hope it will continue to be developed and improved. -- Tim ============================= Tim Bird Architecture Group Chair, CE Linux Forum Senior Staff Engineer, Sony Electronics ============================= From hien at us.ibm.com Fri Nov 4 07:42:28 2005 From: hien at us.ibm.com (Hien Nguyen) Date: Thu, 03 Nov 2005 12:42:28 -0800 Subject: [PATCH] exporting validate_sp In-Reply-To: <436A6C23.2050604@am.sony.com> References: <1130890483.4032.20.camel@dyn9047022138.beaverton.ibm.com> <20051102174037.GA23650@lst.de> <4368FD8A.3090000@us.ibm.com> <20051102233011.GA29200@lst.de> <436A6C23.2050604@am.sony.com> Message-ID: <436A7634.20602@us.ibm.com> Tim Bird wrote: >These lines must be directly from Christoph's >"How to motivate people" management handbook. > >Don't worry Hein. Other people (though maybe quieter than >Christoph) see value in the SystemTap work. I hope it will >continue to be developed and improved. > -- Tim > >============================= >Tim Bird >Architecture Group Chair, CE Linux Forum >Senior Staff Engineer, Sony Electronics >============================= > > > > Thanks for the kind words. Yes, our intention is to make systemtap better, safer. Hien. From linas at austin.ibm.com Fri Nov 4 07:53:31 2005 From: linas at austin.ibm.com (linas) Date: Thu, 3 Nov 2005 14:53:31 -0600 Subject: [PATCH] fix add notifier crashes In-Reply-To: <1130948995.32348.35.camel@sinatra.austin.ibm.com> References: <1130948995.32348.35.camel@sinatra.austin.ibm.com> Message-ID: <20051103205331.GN19593@austin.ibm.com> On Wed, Nov 02, 2005 at 10:29:55AM -0600, John Rose was heard to remark: > Hi Paul- > > The extraction of PCI stuff from struct device_node left some false > assumptions in notifier code. As a result, dynamic add crashes when > non-PCI nodes are added. This patch fixes these assumptions. This is more or less the same as the patch I sent on 4 October. It was called "crash-on-pci-slot-add.patch" There's another closely related null ptr deref that is fixed in the patch "rpaphp-crashing.patch" that was the next one in that series. Anyway, other than that, it looks good to me. --linas From david at gibson.dropbear.id.au Fri Nov 4 11:16:53 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Fri, 4 Nov 2005 11:16:53 +1100 Subject: powerpc: Kill ppcdebug Message-ID: <20051104001653.GC29025@localhost.localdomain> The ancient ppcdebug/PPCDBG mechanism is now only used in two places. First, in the hash setup code, one of the bits allows the size of the hash table to be reduced by a factor of 8 - which would be better accomplished with a command line option for that purpose. The other was a bunch of bus walking related messages in the iSeries code, which would seem to be insufficient reason to keep the mechanism. This patch removes the last traces of this mechanism. Built and booted on iSeries and pSeries POWER5 LPAR (ARCH=powerpc). Signed-off-by: David Gibson Index: working-2.6/arch/powerpc/kernel/signal_32.c =================================================================== --- working-2.6.orig/arch/powerpc/kernel/signal_32.c 2005-11-04 10:21:12.000000000 +1100 +++ working-2.6/arch/powerpc/kernel/signal_32.c 2005-11-04 10:23:36.000000000 +1100 @@ -44,7 +44,6 @@ #include #ifdef CONFIG_PPC64 #include "ppc32.h" -#include #include #include #else Index: working-2.6/arch/powerpc/mm/init_64.c =================================================================== --- working-2.6.orig/arch/powerpc/mm/init_64.c 2005-10-31 15:20:20.000000000 +1100 +++ working-2.6/arch/powerpc/mm/init_64.c 2005-11-04 10:23:20.000000000 +1100 @@ -57,7 +57,6 @@ #include #include #include -#include #include #include #include Index: working-2.6/arch/powerpc/mm/pgtable_64.c =================================================================== --- working-2.6.orig/arch/powerpc/mm/pgtable_64.c 2005-10-31 15:44:59.000000000 +1100 +++ working-2.6/arch/powerpc/mm/pgtable_64.c 2005-11-04 10:23:20.000000000 +1100 @@ -59,7 +59,6 @@ #include #include #include -#include #include #include #include Index: working-2.6/arch/powerpc/platforms/iseries/smp.c =================================================================== --- working-2.6.orig/arch/powerpc/platforms/iseries/smp.c 2005-11-03 16:26:57.000000000 +1100 +++ working-2.6/arch/powerpc/platforms/iseries/smp.c 2005-11-04 10:23:20.000000000 +1100 @@ -40,7 +40,6 @@ #include #include #include -#include #include #include #include Index: working-2.6/arch/powerpc/platforms/pseries/iommu.c =================================================================== --- working-2.6.orig/arch/powerpc/platforms/pseries/iommu.c 2005-11-04 10:21:12.000000000 +1100 +++ working-2.6/arch/powerpc/platforms/pseries/iommu.c 2005-11-04 10:23:20.000000000 +1100 @@ -37,7 +37,6 @@ #include #include #include -#include #include #include #include Index: working-2.6/arch/powerpc/platforms/pseries/lpar.c =================================================================== --- working-2.6.orig/arch/powerpc/platforms/pseries/lpar.c 2005-10-31 15:20:20.000000000 +1100 +++ working-2.6/arch/powerpc/platforms/pseries/lpar.c 2005-11-04 10:23:20.000000000 +1100 @@ -31,7 +31,6 @@ #include #include #include -#include #include #include #include @@ -39,6 +38,7 @@ #include #include #include +#include #ifdef DEBUG #define DBG(fmt...) udbg_printf(fmt) Index: working-2.6/arch/powerpc/platforms/pseries/ras.c =================================================================== --- working-2.6.orig/arch/powerpc/platforms/pseries/ras.c 2005-10-31 15:20:20.000000000 +1100 +++ working-2.6/arch/powerpc/platforms/pseries/ras.c 2005-11-04 10:23:20.000000000 +1100 @@ -48,7 +48,7 @@ #include #include #include -#include +#include static unsigned char ras_log_buf[RTAS_ERROR_LOG_MAX]; static DEFINE_SPINLOCK(ras_log_buf_lock); Index: working-2.6/arch/ppc64/kernel/prom.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/prom.c 2005-10-31 15:44:59.000000000 +1100 +++ working-2.6/arch/ppc64/kernel/prom.c 2005-11-04 10:23:20.000000000 +1100 @@ -46,7 +46,6 @@ #include #include #include -#include #include #include #include Index: working-2.6/arch/ppc64/kernel/prom_init.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/prom_init.c 2005-11-03 16:26:57.000000000 +1100 +++ working-2.6/arch/ppc64/kernel/prom_init.c 2005-11-04 10:23:20.000000000 +1100 @@ -44,7 +44,6 @@ #include #include #include -#include #include #include #include Index: working-2.6/arch/powerpc/sysdev/u3_iommu.c =================================================================== --- working-2.6.orig/arch/powerpc/sysdev/u3_iommu.c 2005-11-03 16:26:57.000000000 +1100 +++ working-2.6/arch/powerpc/sysdev/u3_iommu.c 2005-11-04 10:23:20.000000000 +1100 @@ -37,7 +37,6 @@ #include #include #include -#include #include #include #include Index: working-2.6/arch/powerpc/kernel/setup_64.c =================================================================== --- working-2.6.orig/arch/powerpc/kernel/setup_64.c 2005-11-03 16:26:57.000000000 +1100 +++ working-2.6/arch/powerpc/kernel/setup_64.c 2005-11-04 10:23:20.000000000 +1100 @@ -41,7 +41,6 @@ #include #include #include -#include #include #include #include @@ -60,6 +59,7 @@ #include #include #include +#include #ifdef DEBUG #define DBG(fmt...) udbg_printf(fmt) @@ -352,12 +352,6 @@ DBG(" -> early_setup()\n"); /* - * Fill the default DBG level (do we want to keep - * that old mecanism around forever ?) - */ - ppcdbg_initialize(); - - /* * Do early initializations using the flattened device * tree, like retreiving the physical memory map or * calculating/retreiving the hash table size @@ -605,7 +599,6 @@ printk("-----------------------------------------------------\n"); printk("ppc64_pft_size = 0x%lx\n", ppc64_pft_size); - printk("ppc64_debug_switch = 0x%lx\n", ppc64_debug_switch); printk("ppc64_interrupt_controller = 0x%ld\n", ppc64_interrupt_controller); printk("systemcfg = 0x%p\n", systemcfg); printk("systemcfg->platform = 0x%x\n", systemcfg->platform); Index: working-2.6/include/asm-ppc64/ppcdebug.h =================================================================== --- working-2.6.orig/include/asm-ppc64/ppcdebug.h 2005-10-25 11:59:59.000000000 +1000 +++ /dev/null 1970-01-01 00:00:00.000000000 +0000 @@ -1,108 +0,0 @@ -#ifndef __PPCDEBUG_H -#define __PPCDEBUG_H -/******************************************************************** - * Author: Adam Litke, IBM Corp - * (c) 2001 - * - * This file contains definitions and macros for a runtime debugging - * system for ppc64 (This should also work on 32 bit with a few - * adjustments. - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * - ********************************************************************/ - -#include -#include -#include -#include - -#define PPCDBG_BITVAL(X) ((1UL)<<((unsigned long)(X))) - -/* Defined below are the bit positions of various debug flags in the - * ppc64_debug_switch variable. - * -- When adding new values, please enter them into trace names below -- - * - * Values 62 & 63 can be used to stress the hardware page table management - * code. They must be set statically, any attempt to change them dynamically - * would be a very bad idea. - */ -#define PPCDBG_MMINIT PPCDBG_BITVAL(0) -#define PPCDBG_MM PPCDBG_BITVAL(1) -#define PPCDBG_SYS32 PPCDBG_BITVAL(2) -#define PPCDBG_SYS32NI PPCDBG_BITVAL(3) -#define PPCDBG_SYS32X PPCDBG_BITVAL(4) -#define PPCDBG_SYS32M PPCDBG_BITVAL(5) -#define PPCDBG_SYS64 PPCDBG_BITVAL(6) -#define PPCDBG_SYS64NI PPCDBG_BITVAL(7) -#define PPCDBG_SYS64X PPCDBG_BITVAL(8) -#define PPCDBG_SIGNAL PPCDBG_BITVAL(9) -#define PPCDBG_SIGNALXMON PPCDBG_BITVAL(10) -#define PPCDBG_BINFMT32 PPCDBG_BITVAL(11) -#define PPCDBG_BINFMT64 PPCDBG_BITVAL(12) -#define PPCDBG_BINFMTXMON PPCDBG_BITVAL(13) -#define PPCDBG_BINFMT_32ADDR PPCDBG_BITVAL(14) -#define PPCDBG_ALIGNFIXUP PPCDBG_BITVAL(15) -#define PPCDBG_TCEINIT PPCDBG_BITVAL(16) -#define PPCDBG_TCE PPCDBG_BITVAL(17) -#define PPCDBG_PHBINIT PPCDBG_BITVAL(18) -#define PPCDBG_SMP PPCDBG_BITVAL(19) -#define PPCDBG_BOOT PPCDBG_BITVAL(20) -#define PPCDBG_BUSWALK PPCDBG_BITVAL(21) -#define PPCDBG_PROM PPCDBG_BITVAL(22) -#define PPCDBG_RTAS PPCDBG_BITVAL(23) -#define PPCDBG_HTABSTRESS PPCDBG_BITVAL(62) -#define PPCDBG_HTABSIZE PPCDBG_BITVAL(63) -#define PPCDBG_NONE (0UL) -#define PPCDBG_ALL (0xffffffffUL) - -/* The default initial value for the debug switch */ -#define PPC_DEBUG_DEFAULT 0 -/* #define PPC_DEBUG_DEFAULT PPCDBG_ALL */ - -#define PPCDBG_NUM_FLAGS 64 - -extern u64 ppc64_debug_switch; - -#ifdef WANT_PPCDBG_TAB -/* A table of debug switch names to allow name lookup in xmon - * (and whoever else wants it. - */ -char *trace_names[PPCDBG_NUM_FLAGS] = { - /* Known debug names */ - "mminit", "mm", - "syscall32", "syscall32_ni", "syscall32x", "syscall32m", - "syscall64", "syscall64_ni", "syscall64x", - "signal", "signal_xmon", - "binfmt32", "binfmt64", "binfmt_xmon", "binfmt_32addr", - "alignfixup", "tceinit", "tce", "phb_init", - "smp", "boot", "buswalk", "prom", - "rtas" -}; -#else -extern char *trace_names[64]; -#endif /* WANT_PPCDBG_TAB */ - -#ifdef CONFIG_PPCDBG -/* Macro to conditionally print debug based on debug_switch */ -#define PPCDBG(...) udbg_ppcdbg(__VA_ARGS__) - -/* Macro to conditionally call a debug routine based on debug_switch */ -#define PPCDBGCALL(FLAGS,FUNCTION) ifppcdebug(FLAGS) FUNCTION - -/* Macros to test for debug states */ -#define ifppcdebug(FLAGS) if (udbg_ifdebug(FLAGS)) -#define ppcdebugset(FLAGS) (udbg_ifdebug(FLAGS)) -#define PPCDBG_BINFMT (test_thread_flag(TIF_32BIT) ? PPCDBG_BINFMT32 : PPCDBG_BINFMT64) - -#else -#define PPCDBG(...) do {;} while (0) -#define PPCDBGCALL(FLAGS,FUNCTION) do {;} while (0) -#define ifppcdebug(...) if (0) -#define ppcdebugset(FLAGS) (0) -#endif /* CONFIG_PPCDBG */ - -#endif /*__PPCDEBUG_H */ Index: working-2.6/arch/ppc64/kernel/udbg.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/udbg.c 2005-10-25 11:59:53.000000000 +1000 +++ working-2.6/arch/ppc64/kernel/udbg.c 2005-11-04 10:23:20.000000000 +1100 @@ -10,12 +10,10 @@ */ #include -#define WANT_PPCDBG_TAB /* Only defined here */ #include #include #include #include -#include #include void (*udbg_putc)(unsigned char c); @@ -89,59 +87,6 @@ va_end(args); } -/* PPCDBG stuff */ - -u64 ppc64_debug_switch; - -/* Special print used by PPCDBG() macro */ -void udbg_ppcdbg(unsigned long debug_flags, const char *fmt, ...) -{ - unsigned long active_debugs = debug_flags & ppc64_debug_switch; - - if (active_debugs) { - va_list ap; - unsigned char buf[UDBG_BUFSIZE]; - unsigned long i, len = 0; - - for (i=0; i < PPCDBG_NUM_FLAGS; i++) { - if (((1U << i) & active_debugs) && - trace_names[i]) { - len += strlen(trace_names[i]); - udbg_puts(trace_names[i]); - break; - } - } - - snprintf(buf, UDBG_BUFSIZE, " [%s]: ", current->comm); - len += strlen(buf); - udbg_puts(buf); - - while (len < 18) { - udbg_puts(" "); - len++; - } - - va_start(ap, fmt); - vsnprintf(buf, UDBG_BUFSIZE, fmt, ap); - udbg_puts(buf); - va_end(ap); - } -} - -unsigned long udbg_ifdebug(unsigned long flags) -{ - return (flags & ppc64_debug_switch); -} - -/* - * Initialize the PPCDBG state. Called before relocation has been enabled. - */ -void __init ppcdbg_initialize(void) -{ - ppc64_debug_switch = PPC_DEBUG_DEFAULT; /* | PPCDBG_BUSWALK | */ - /* PPCDBG_PHBINIT | PPCDBG_MM | PPCDBG_MMINIT | PPCDBG_TCEINIT | PPCDBG_TCE */; -} - /* * Early boot console based on udbg */ Index: working-2.6/include/asm-ppc64/udbg.h =================================================================== --- working-2.6.orig/include/asm-ppc64/udbg.h 2005-10-31 15:20:22.000000000 +1100 +++ working-2.6/include/asm-ppc64/udbg.h 2005-11-04 10:23:20.000000000 +1100 @@ -23,9 +23,6 @@ extern void register_early_udbg_console(void); extern void udbg_printf(const char *fmt, ...); -extern void udbg_ppcdbg(unsigned long flags, const char *fmt, ...); -extern unsigned long udbg_ifdebug(unsigned long flags); -extern void __init ppcdbg_initialize(void); extern void udbg_init_uart(void __iomem *comport, unsigned int speed); Index: working-2.6/arch/powerpc/mm/hash_utils_64.c =================================================================== --- working-2.6.orig/arch/powerpc/mm/hash_utils_64.c 2005-10-31 15:20:20.000000000 +1100 +++ working-2.6/arch/powerpc/mm/hash_utils_64.c 2005-11-04 10:23:20.000000000 +1100 @@ -32,7 +32,6 @@ #include #include -#include #include #include #include @@ -194,12 +193,6 @@ htab_size_bytes = get_hashtable_size(); pteg_count = htab_size_bytes >> 7; - /* For debug, make the HTAB 1/8 as big as it normally would be. */ - ifppcdebug(PPCDBG_HTABSIZE) { - pteg_count >>= 3; - htab_size_bytes = pteg_count << 7; - } - htab_hash_mask = pteg_count - 1; if (systemcfg->platform & PLATFORM_LPAR) { Index: working-2.6/arch/powerpc/platforms/iseries/irq.c =================================================================== --- working-2.6.orig/arch/powerpc/platforms/iseries/irq.c 2005-11-03 16:26:57.000000000 +1100 +++ working-2.6/arch/powerpc/platforms/iseries/irq.c 2005-11-04 10:23:20.000000000 +1100 @@ -35,7 +35,6 @@ #include #include -#include #include #include #include @@ -227,8 +226,6 @@ /* Unmask secondary INTA */ mask = 0x80000000; HvCallPci_unmaskInterrupts(bus, subBus, deviceId, mask); - PPCDBG(PPCDBG_BUSWALK, "iSeries_enable_IRQ 0x%02X.%02X.%02X 0x%04X\n", - bus, subBus, deviceId, irq); } /* This is called by iSeries_activate_IRQs */ @@ -310,8 +307,6 @@ /* Mask secondary INTA */ mask = 0x80000000; HvCallPci_maskInterrupts(bus, subBus, deviceId, mask); - PPCDBG(PPCDBG_BUSWALK, "iSeries_disable_IRQ 0x%02X.%02X.%02X 0x%04X\n", - bus, subBus, deviceId, irq); } /* Index: working-2.6/arch/powerpc/platforms/iseries/pci.c =================================================================== --- working-2.6.orig/arch/powerpc/platforms/iseries/pci.c 2005-11-03 16:26:57.000000000 +1100 +++ working-2.6/arch/powerpc/platforms/iseries/pci.c 2005-11-04 10:23:20.000000000 +1100 @@ -32,7 +32,6 @@ #include #include #include -#include #include #include @@ -207,10 +206,6 @@ struct device_node *node; struct pci_dn *pdn; - PPCDBG(PPCDBG_BUSWALK, - "-build_device_node 0x%02X.%02X.%02X Function: %02X\n", - Bus, SubBus, AgentId, Function); - node = kmalloc(sizeof(struct device_node), GFP_KERNEL); if (node == NULL) return NULL; @@ -243,8 +238,6 @@ struct pci_controller *phb; HvBusNumber bus; - PPCDBG(PPCDBG_BUSWALK, "find_and_init_phbs Entry\n"); - /* Check all possible buses. */ for (bus = 0; bus < 256; bus++) { int ret = HvCallXm_testBus(bus); @@ -261,9 +254,6 @@ phb->last_busno = bus; phb->ops = &iSeries_pci_ops; - PPCDBG(PPCDBG_BUSWALK, "PCI:Create iSeries pci_controller(%p), Bus: %04X\n", - phb, bus); - /* Find and connect the devices. */ scan_PHB_slots(phb); } @@ -285,11 +275,9 @@ */ void iSeries_pcibios_init(void) { - PPCDBG(PPCDBG_BUSWALK, "iSeries_pcibios_init Entry.\n"); iomm_table_initialize(); find_and_init_phbs(); io_page_mask = -1; - PPCDBG(PPCDBG_BUSWALK, "iSeries_pcibios_init Exit.\n"); } /* @@ -301,8 +289,6 @@ struct device_node *node; int DeviceCount = 0; - PPCDBG(PPCDBG_BUSWALK, "iSeries_pcibios_fixup Entry.\n"); - /* Fix up at the device node and pci_dev relationship */ mf_display_src(0xC9000100); @@ -316,9 +302,6 @@ ++DeviceCount; pdev->sysdata = (void *)node; PCI_DN(node)->pcidev = pdev; - PPCDBG(PPCDBG_BUSWALK, - "pdev 0x%p <==> DevNode 0x%p\n", - pdev, node); allocate_device_bars(pdev); iSeries_Device_Information(pdev, DeviceCount); iommu_devnode_init_iSeries(node); @@ -333,13 +316,10 @@ void pcibios_fixup_bus(struct pci_bus *PciBus) { - PPCDBG(PPCDBG_BUSWALK, "iSeries_pcibios_fixup_bus(0x%04X) Entry.\n", - PciBus->number); } void pcibios_fixup_resources(struct pci_dev *pdev) { - PPCDBG(PPCDBG_BUSWALK, "fixup_resources pdev %p\n", pdev); } /* @@ -401,9 +381,6 @@ printk("found device at bus %d idsel %d func %d (AgentId %x)\n", bus, IdSel, Function, AgentId); /* Connect EADs: 0x18.00.12 = 0x00 */ - PPCDBG(PPCDBG_BUSWALK, - "PCI:Connect EADs: 0x%02X.%02X.%02X\n", - bus, SubBus, AgentId); HvRc = HvCallPci_getBusUnitInfo(bus, SubBus, AgentId, iseries_hv_addr(BridgeInfo), sizeof(struct HvCallPci_BridgeInfo)); @@ -414,14 +391,6 @@ BridgeInfo->maxAgents, BridgeInfo->maxSubBusNumber, BridgeInfo->logicalSlotNumber); - PPCDBG(PPCDBG_BUSWALK, - "PCI: BridgeInfo, Type:0x%02X, SubBus:0x%02X, MaxAgents:0x%02X, MaxSubBus: 0x%02X, LSlot: 0x%02X\n", - BridgeInfo->busUnitInfo.deviceType, - BridgeInfo->subBusNumber, - BridgeInfo->maxAgents, - BridgeInfo->maxSubBusNumber, - BridgeInfo->logicalSlotNumber); - if (BridgeInfo->busUnitInfo.deviceType == HvCallPci_BridgeDevice) { /* Scan_Bridge_Slot...: 0x18.00.12 */ @@ -454,9 +423,6 @@ /* iSeries_allocate_IRQ.: 0x18.00.12(0xA3) */ Irq = iSeries_allocate_IRQ(Bus, 0, EADsIdSel); - PPCDBG(PPCDBG_BUSWALK, - "PCI:- allocate and assign IRQ 0x%02X.%02X.%02X = 0x%02X\n", - Bus, 0, EADsIdSel, Irq); /* * Connect all functions of any device found. @@ -482,9 +448,6 @@ printk("read vendor ID: %x\n", VendorId); /* FoundDevice: 0x18.28.10 = 0x12AE */ - PPCDBG(PPCDBG_BUSWALK, - "PCI:- FoundDevice: 0x%02X.%02X.%02X = 0x%04X, irq %d\n", - Bus, SubBus, AgentId, VendorId, Irq); HvRc = HvCallPci_configStore8(Bus, SubBus, AgentId, PCI_INTERRUPT_LINE, Irq); if (HvRc != 0) Index: working-2.6/arch/powerpc/platforms/iseries/setup.c =================================================================== --- working-2.6.orig/arch/powerpc/platforms/iseries/setup.c 2005-11-03 16:26:57.000000000 +1100 +++ working-2.6/arch/powerpc/platforms/iseries/setup.c 2005-11-04 10:23:20.000000000 +1100 @@ -71,8 +71,6 @@ #endif /* Function Prototypes */ -extern void ppcdbg_initialize(void); - static void build_iSeries_Memory_Map(void); static void iseries_shared_idle(void); static void iseries_dedicated_idle(void); @@ -309,8 +307,6 @@ ppc64_firmware_features = FW_FEATURE_ISERIES; - ppcdbg_initialize(); - ppc64_interrupt_controller = IC_ISERIES; #if defined(CONFIG_BLK_DEV_INITRD) Index: working-2.6/arch/ppc64/Kconfig.debug =================================================================== --- working-2.6.orig/arch/ppc64/Kconfig.debug 2005-10-25 11:59:53.000000000 +1000 +++ working-2.6/arch/ppc64/Kconfig.debug 2005-11-04 10:23:20.000000000 +1100 @@ -55,10 +55,6 @@ xmon is normally disabled unless booted with 'xmon=on'. Use 'xmon=off' to disable xmon init during runtime. -config PPCDBG - bool "Include PPCDBG realtime debugging" - depends on DEBUG_KERNEL - config IRQSTACKS bool "Use separate kernel stacks when processing interrupts" help Index: working-2.6/arch/powerpc/kernel/signal_64.c =================================================================== --- working-2.6.orig/arch/powerpc/kernel/signal_64.c 2005-11-04 10:21:12.000000000 +1100 +++ working-2.6/arch/powerpc/kernel/signal_64.c 2005-11-04 10:23:47.000000000 +1100 @@ -33,7 +33,6 @@ #include #include #include -#include #include #include #include -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson From linas at linas.org Fri Nov 4 10:59:18 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 17:59:18 -0600 Subject: [PATCH 0/42] PCI Error Recovery for PPC64 and misc device drivers Message-ID: <20051103235918.GA25616@mail.gnucash.org> What follows is a long sequence of mostly small patches to implement PCI Error Recovery by adding notification callbacks to the PCI device driver structure, implementing the recovery in 5 device drivers (3 ethernet, two scsi drivers), and adding the actual error detection and recovery code to the ppc64/powerpc arch tree. Highlights: -- Patches 1-14: Misc required ppc64/powerpc cleanup/bugfixes/restructuring -- Patch 15: Overview documentation -- Patch 16: changes to include/linux/pci.h -- Patches 17-26: error detection and recovery for pSeries PCI bridge chips -- Patchs 27-32: recovery patches for ethernet, scsi device drivers -- Patches 33-42: More misc ppc64-specific changes Signed-off-by: Linas Vepstas From michael at ellerman.id.au Fri Nov 4 11:34:26 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Fri, 4 Nov 2005 11:34:26 +1100 Subject: powerpc: Kill ppcdebug In-Reply-To: <20051104001653.GC29025@localhost.localdomain> References: <20051104001653.GC29025@localhost.localdomain> Message-ID: <200511041134.30106.michael@ellerman.id.au> On Fri, 4 Nov 2005 11:16, David Gibson wrote: > The ancient ppcdebug/PPCDBG mechanism is now only used in two places. > First, in the hash setup code, one of the bits allows the size of the > hash table to be reduced by a factor of 8 - which would be better > accomplished with a command line option for that purpose. The other > was a bunch of bus walking related messages in the iSeries code, which > would seem to be insufficient reason to keep the mechanism. > > This patch removes the last traces of this mechanism. I agree it's pretty ugly, but I thought the concept was at least nice, ie. runttime enablable debugging. The current scheme of having to #define DEBUG in a gazillion different files is pretty painful. Oh well. -- Michael Ellerman IBM OzLabs email: michael:ellerman.id.au inmsg: mpe:jabber.org wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20051104/204e9ca5/attachment.pgp From linas at austin.ibm.com Fri Nov 4 11:42:26 2005 From: linas at austin.ibm.com (linas) Date: Thu, 3 Nov 2005 18:42:26 -0600 Subject: [PATCH 1/42] ppc64: uniform usage of bus unit id interfaces In-Reply-To: <20051103235918.GA25616@mail.gnucash.org> References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104004226.GQ19593@austin.ibm.com> 01-pci-dn-uniformization.patch This patch changes the rtas_pci interface to use the new struct pci_dn structure for two routines that work with pci device nodes. This patch also does some minor janitorial work: it uses some handy macros and cleans up some trailing whitespace in the affected file. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/ppc64/kernel/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/ppc64/kernel/eeh.c 2005-10-31 11:59:11.879644789 -0600 +++ linux-2.6.14-git3/arch/ppc64/kernel/eeh.c 2005-10-31 12:01:21.403477910 -0600 @@ -71,10 +71,6 @@ * and sent out for processing. */ -/** Bus Unit ID macros; get low and hi 32-bits of the 64-bit BUID */ -#define BUID_HI(buid) ((buid) >> 32) -#define BUID_LO(buid) ((buid) & 0xffffffff) - /* EEH event workqueue setup. */ static DEFINE_SPINLOCK(eeh_eventlist_lock); LIST_HEAD(eeh_eventlist); Index: linux-2.6.14-git3/include/asm-powerpc/ppc-pci.h =================================================================== --- linux-2.6.14-git3.orig/include/asm-powerpc/ppc-pci.h 2005-10-31 11:59:11.880644649 -0600 +++ linux-2.6.14-git3/include/asm-powerpc/ppc-pci.h 2005-10-31 12:01:21.404477769 -0600 @@ -26,6 +26,10 @@ extern struct pci_dev *ppc64_isabridge_dev; /* may be NULL if no ISA bus */ +/** Bus Unit ID macros; get low and hi 32-bits of the 64-bit BUID */ +#define BUID_HI(buid) ((buid) >> 32) +#define BUID_LO(buid) ((buid) & 0xffffffff) + /* PCI device_node operations */ struct device_node; typedef void *(*traverse_func)(struct device_node *me, void *data); Index: linux-2.6.14-git3/arch/ppc64/kernel/rtas_pci.c =================================================================== --- linux-2.6.14-git3.orig/arch/ppc64/kernel/rtas_pci.c 2005-10-31 11:59:11.879644789 -0600 +++ linux-2.6.14-git3/arch/ppc64/kernel/rtas_pci.c 2005-10-31 12:01:21.407477349 -0600 @@ -5,19 +5,19 @@ * Copyright (C) 2003 Anton Blanchard , IBM * * RTAS specific routines for PCI. - * + * * Based on code from pci.c, chrp_pci.c and pSeries_pci.c * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. - * + * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. - * + * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA @@ -47,7 +47,7 @@ static int ibm_read_pci_config; static int ibm_write_pci_config; -static int config_access_valid(struct pci_dn *dn, int where) +static inline int config_access_valid(struct pci_dn *dn, int where) { if (where < 256) return 1; @@ -72,16 +72,14 @@ return 0; } -static int rtas_read_config(struct device_node *dn, int where, int size, u32 *val) +static int rtas_read_config(struct pci_dn *pdn, int where, int size, u32 *val) { int returnval = -1; unsigned long buid, addr; int ret; - struct pci_dn *pdn; - if (!dn || !dn->data) + if (!pdn) return PCIBIOS_DEVICE_NOT_FOUND; - pdn = dn->data; if (!config_access_valid(pdn, where)) return PCIBIOS_BAD_REGISTER_NUMBER; @@ -90,7 +88,7 @@ buid = pdn->phb->buid; if (buid) { ret = rtas_call(ibm_read_pci_config, 4, 2, &returnval, - addr, buid >> 32, buid & 0xffffffff, size); + addr, BUID_HI(buid), BUID_LO(buid), size); } else { ret = rtas_call(read_pci_config, 2, 2, &returnval, addr, size); } @@ -100,7 +98,7 @@ return PCIBIOS_DEVICE_NOT_FOUND; if (returnval == EEH_IO_ERROR_VALUE(size) && - eeh_dn_check_failure (dn, NULL)) + eeh_dn_check_failure (pdn->node, NULL)) return PCIBIOS_DEVICE_NOT_FOUND; return PCIBIOS_SUCCESSFUL; @@ -118,23 +116,23 @@ busdn = bus->sysdata; /* must be a phb */ /* Search only direct children of the bus */ - for (dn = busdn->child; dn; dn = dn->sibling) - if (dn->data && PCI_DN(dn)->devfn == devfn + for (dn = busdn->child; dn; dn = dn->sibling) { + struct pci_dn *pdn = PCI_DN(dn); + if (pdn && pdn->devfn == devfn && of_device_available(dn)) - return rtas_read_config(dn, where, size, val); + return rtas_read_config(pdn, where, size, val); + } return PCIBIOS_DEVICE_NOT_FOUND; } -int rtas_write_config(struct device_node *dn, int where, int size, u32 val) +int rtas_write_config(struct pci_dn *pdn, int where, int size, u32 val) { unsigned long buid, addr; int ret; - struct pci_dn *pdn; - if (!dn || !dn->data) + if (!pdn) return PCIBIOS_DEVICE_NOT_FOUND; - pdn = dn->data; if (!config_access_valid(pdn, where)) return PCIBIOS_BAD_REGISTER_NUMBER; @@ -142,7 +140,8 @@ (pdn->devfn << 8) | (where & 0xff); buid = pdn->phb->buid; if (buid) { - ret = rtas_call(ibm_write_pci_config, 5, 1, NULL, addr, buid >> 32, buid & 0xffffffff, size, (ulong) val); + ret = rtas_call(ibm_write_pci_config, 5, 1, NULL, addr, + BUID_HI(buid), BUID_LO(buid), size, (ulong) val); } else { ret = rtas_call(write_pci_config, 3, 1, NULL, addr, size, (ulong)val); } @@ -165,10 +164,12 @@ busdn = bus->sysdata; /* must be a phb */ /* Search only direct children of the bus */ - for (dn = busdn->child; dn; dn = dn->sibling) - if (dn->data && PCI_DN(dn)->devfn == devfn + for (dn = busdn->child; dn; dn = dn->sibling) { + struct pci_dn *pdn = PCI_DN(dn); + if (pdn && pdn->devfn == devfn && of_device_available(dn)) - return rtas_write_config(dn, where, size, val); + return rtas_write_config(pdn, where, size, val); + } return PCIBIOS_DEVICE_NOT_FOUND; } @@ -221,7 +222,7 @@ /* Python's register file is 1 MB in size. */ chip_regs = ioremap(reg_struct.address & ~(0xfffffUL), 0x100000); - /* + /* * Firmware doesn't always clear this bit which is critical * for good performance - Anton */ @@ -292,7 +293,7 @@ if (bus_range == NULL || len < 2 * sizeof(int)) { return 1; } - + phb->first_busno = bus_range[0]; phb->last_busno = bus_range[1]; From linas at linas.org Fri Nov 4 11:47:50 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:47:50 -0600 Subject: [PATCH 2/42]: ppc64: misc minor cleanup References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104004750.GA26782@mail.gnucash.org> 02-eeh-minor-cleanup.patch This patch performs some minor cleanup of the eeh.c file, including: -- trim some trailing whitespace -- remove extraneous #includes -- use the macro PCI_DN uniformly, instead of the void pointer chase. -- typos in comments -- improved debug printk's Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/ppc64/kernel/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/ppc64/kernel/eeh.c 2005-10-31 12:01:21.403477910 -0600 +++ linux-2.6.14-git3/arch/ppc64/kernel/eeh.c 2005-10-31 12:06:16.222121166 -0600 @@ -1,32 +1,31 @@ /* * eeh.c * Copyright (C) 2001 Dave Engebretsen & Todd Inglett IBM Corporation - * + * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. - * + * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. - * + * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ -#include #include #include -#include #include #include #include #include #include #include +#include #include #include #include @@ -49,8 +48,8 @@ * were "empty": all reads return 0xff's and all writes are silently * ignored. EEH slot isolation events can be triggered by parity * errors on the address or data busses (e.g. during posted writes), - * which in turn might be caused by dust, vibration, humidity, - * radioactivity or plain-old failed hardware. + * which in turn might be caused by low voltage on the bus, dust, + * vibration, humidity, radioactivity or plain-old failed hardware. * * Note, however, that one of the leading causes of EEH slot * freeze events are buggy device drivers, buggy device microcode, @@ -256,18 +255,17 @@ dn = pci_device_to_OF_node(dev); if (!dn) { - printk(KERN_WARNING "PCI: no pci dn found for dev=%s\n", - pci_name(dev)); + printk(KERN_WARNING "PCI: no pci dn found for dev=%s\n", pci_name(dev)); return; } /* Skip any devices for which EEH is not enabled. */ - pdn = dn->data; + pdn = PCI_DN(dn); if (!(pdn->eeh_mode & EEH_MODE_SUPPORTED) || pdn->eeh_mode & EEH_MODE_NOCHECK) { #ifdef DEBUG - printk(KERN_INFO "PCI: skip building address cache for=%s\n", - pci_name(dev)); + printk(KERN_INFO "PCI: skip building address cache for=%s - %s\n", + pci_name(dev), pdn->node->full_name); #endif return; } @@ -410,16 +408,16 @@ * @dn: device node to read * @rets: array to return results in */ -static int read_slot_reset_state(struct device_node *dn, int rets[]) +static int read_slot_reset_state(struct pci_dn *pdn, int rets[]) { int token, outputs; - struct pci_dn *pdn = dn->data; if (ibm_read_slot_reset_state2 != RTAS_UNKNOWN_SERVICE) { token = ibm_read_slot_reset_state2; outputs = 4; } else { token = ibm_read_slot_reset_state; + rets[2] = 0; /* fake PE Unavailable info */ outputs = 3; } @@ -496,7 +494,7 @@ /** * eeh_token_to_phys - convert EEH address token to phys address - * @token i/o token, should be address in the form 0xE.... + * @token i/o token, should be address in the form 0xA.... */ static inline unsigned long eeh_token_to_phys(unsigned long token) { @@ -522,7 +520,7 @@ * will query firmware for the EEH status. * * Returns 0 if there has not been an EEH error; otherwise returns - * a non-zero value and queues up a solt isolation event notification. + * a non-zero value and queues up a slot isolation event notification. * * It is safe to call this routine in an interrupt context. */ @@ -542,7 +540,7 @@ if (!dn) return 0; - pdn = dn->data; + pdn = PCI_DN(dn); /* Access to IO BARs might get this far and still not want checking. */ if (!pdn->eeh_capable || !(pdn->eeh_mode & EEH_MODE_SUPPORTED) || @@ -562,7 +560,7 @@ atomic_inc(&eeh_fail_count); if (atomic_read(&eeh_fail_count) >= EEH_MAX_FAILS) { /* re-read the slot reset state */ - if (read_slot_reset_state(dn, rets) != 0) + if (read_slot_reset_state(pdn, rets) != 0) rets[0] = -1; /* reset state unknown */ eeh_panic(dev, rets[0]); } @@ -576,7 +574,7 @@ * function zero of a multi-function device. * In any case they must share a common PHB. */ - ret = read_slot_reset_state(dn, rets); + ret = read_slot_reset_state(pdn, rets); if (!(ret == 0 && rets[1] == 1 && (rets[0] == 2 || rets[0] == 4))) { __get_cpu_var(false_positives)++; return 0; @@ -635,7 +633,6 @@ * @token i/o token, should be address in the form 0xA.... * @val value, should be all 1's (XXX why do we need this arg??) * - * Check for an eeh failure at the given token address. * Check for an EEH failure at the given token address. Call this * routine if the result of a read was all 0xff's and you want to * find out if this is due to an EEH slot freeze event. This routine @@ -680,7 +677,7 @@ u32 *device_id = (u32 *)get_property(dn, "device-id", NULL); u32 *regs; int enable; - struct pci_dn *pdn = dn->data; + struct pci_dn *pdn = PCI_DN(dn); pdn->eeh_mode = 0; @@ -732,7 +729,7 @@ /* This device doesn't support EEH, but it may have an * EEH parent, in which case we mark it as supported. */ - if (dn->parent && dn->parent->data + if (dn->parent && PCI_DN(dn->parent) && (PCI_DN(dn->parent)->eeh_mode & EEH_MODE_SUPPORTED)) { /* Parent supports EEH. */ pdn->eeh_mode |= EEH_MODE_SUPPORTED; @@ -745,7 +742,7 @@ dn->full_name); } - return NULL; + return NULL; } /* @@ -793,13 +790,11 @@ for (phb = of_find_node_by_name(NULL, "pci"); phb; phb = of_find_node_by_name(phb, "pci")) { unsigned long buid; - struct pci_dn *pci; buid = get_phb_buid(phb); - if (buid == 0 || phb->data == NULL) + if (buid == 0 || PCI_DN(phb) == NULL) continue; - pci = phb->data; info.buid_lo = BUID_LO(buid); info.buid_hi = BUID_HI(buid); traverse_pci_devices(phb, early_enable_eeh, &info); @@ -828,11 +823,13 @@ struct pci_controller *phb; struct eeh_early_enable_info info; - if (!dn || !dn->data) + if (!dn || !PCI_DN(dn)) return; phb = PCI_DN(dn)->phb; if (NULL == phb || 0 == phb->buid) { - printk(KERN_WARNING "EEH: Expected buid but found none\n"); + printk(KERN_WARNING "EEH: Expected buid but found none for %s\n", + dn->full_name); + dump_stack(); return; } From linas at linas.org Fri Nov 4 11:48:45 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:48:45 -0600 Subject: [PATCH 3/42]: ppc64: PCI address cache minor fixes References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104004845.GA26803@mail.gnucash.org> 03-eeh-addr-cache-cleanup.patch This is a minor patch to clean up a buglet related to the PCI address cache. (The buglet doesn't manifes itself unless there are also bugs elsewhere, which is why its minor.). Also: -- Improved debug printing. -- Declare some private routines as static -- Adds reference counting to struct pci_dn->pcidev structure Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/ppc64/kernel/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/ppc64/kernel/eeh.c 2005-10-31 12:07:15.072864803 -0600 +++ linux-2.6.14-git3/arch/ppc64/kernel/eeh.c 2005-10-31 12:10:23.985360685 -0600 @@ -219,9 +219,9 @@ while (*p) { parent = *p; piar = rb_entry(parent, struct pci_io_addr_range, rb_node); - if (alo < piar->addr_lo) { + if (ahi < piar->addr_lo) { p = &parent->rb_left; - } else if (ahi > piar->addr_hi) { + } else if (alo > piar->addr_hi) { p = &parent->rb_right; } else { if (dev != piar->pcidev || @@ -240,6 +240,11 @@ piar->pcidev = dev; piar->flags = flags; +#ifdef DEBUG + printk(KERN_DEBUG "PIAR: insert range=[%lx:%lx] dev=%s\n", + alo, ahi, pci_name (dev)); +#endif + rb_link_node(&piar->rb_node, parent, p); rb_insert_color(&piar->rb_node, &pci_io_addr_cache_root.rb_root); @@ -301,7 +306,7 @@ * we maintain a cache of devices that can be quickly searched. * This routine adds a device to that cache. */ -void pci_addr_cache_insert_device(struct pci_dev *dev) +static void pci_addr_cache_insert_device(struct pci_dev *dev) { unsigned long flags; @@ -344,7 +349,7 @@ * the tree multiple times (once per resource). * But so what; device removal doesn't need to be that fast. */ -void pci_addr_cache_remove_device(struct pci_dev *dev) +static void pci_addr_cache_remove_device(struct pci_dev *dev) { unsigned long flags; @@ -366,6 +371,9 @@ { struct pci_dev *dev = NULL; + if (!eeh_subsystem_enabled) + return; + spin_lock_init(&pci_io_addr_cache_root.piar_lock); while ((dev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) { @@ -837,7 +845,7 @@ info.buid_lo = BUID_LO(phb->buid); early_enable_eeh(dn, &info); } -EXPORT_SYMBOL(eeh_add_device_early); +EXPORT_SYMBOL_GPL(eeh_add_device_early); /** * eeh_add_device_late - perform EEH initialization for the indicated pci device @@ -848,6 +856,8 @@ */ void eeh_add_device_late(struct pci_dev *dev) { + struct device_node *dn; + if (!dev || !eeh_subsystem_enabled) return; @@ -855,9 +865,13 @@ printk(KERN_DEBUG "EEH: adding device %s\n", pci_name(dev)); #endif + pci_dev_get (dev); + dn = pci_device_to_OF_node(dev); + PCI_DN(dn)->pcidev = dev; + pci_addr_cache_insert_device (dev); } -EXPORT_SYMBOL(eeh_add_device_late); +EXPORT_SYMBOL_GPL(eeh_add_device_late); /** * eeh_remove_device - undo EEH setup for the indicated pci device @@ -868,6 +882,7 @@ */ void eeh_remove_device(struct pci_dev *dev) { + struct device_node *dn; if (!dev || !eeh_subsystem_enabled) return; @@ -876,8 +891,12 @@ printk(KERN_DEBUG "EEH: remove device %s\n", pci_name(dev)); #endif pci_addr_cache_remove_device(dev); + + dn = pci_device_to_OF_node(dev); + PCI_DN(dn)->pcidev = NULL; + pci_dev_put (dev); } -EXPORT_SYMBOL(eeh_remove_device); +EXPORT_SYMBOL_GPL(eeh_remove_device); static int proc_eeh_show(struct seq_file *m, void *v) { Index: linux-2.6.14-git3/include/asm-powerpc/ppc-pci.h =================================================================== --- linux-2.6.14-git3.orig/include/asm-powerpc/ppc-pci.h 2005-10-31 12:01:21.404477769 -0600 +++ linux-2.6.14-git3/include/asm-powerpc/ppc-pci.h 2005-10-31 12:10:06.152862619 -0600 @@ -39,10 +39,6 @@ void pci_devs_phb_init(void); void pci_devs_phb_init_dynamic(struct pci_controller *phb); -/* PCI address cache management routines */ -void pci_addr_cache_insert_device(struct pci_dev *dev); -void pci_addr_cache_remove_device(struct pci_dev *dev); - /* From rtas_pci.h */ void init_pci_config_tokens (void); unsigned long get_phb_buid (struct device_node *); From linas at linas.org Fri Nov 4 11:48:52 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:48:52 -0600 Subject: [PATCH 4/42]: ppc64: PCI error rate statistics References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104004852.GA26811@mail.gnucash.org> 04-eeh-statistics.patch This minor patch adds some statistics-gathering counters that allow the behaviour of the EEH subsystem o be monitored. While far from perfect, it does provide a rudimentary device that makes understanding of the current state of the system a bit easier. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/ppc64/kernel/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/ppc64/kernel/eeh.c 2005-10-31 12:10:23.985360685 -0600 +++ linux-2.6.14-git3/arch/ppc64/kernel/eeh.c 2005-10-31 12:11:57.134291514 -0600 @@ -102,6 +102,10 @@ static int eeh_error_buf_size; /* System monitoring statistics */ +static DEFINE_PER_CPU(unsigned long, no_device); +static DEFINE_PER_CPU(unsigned long, no_dn); +static DEFINE_PER_CPU(unsigned long, no_cfg_addr); +static DEFINE_PER_CPU(unsigned long, ignored_check); static DEFINE_PER_CPU(unsigned long, total_mmio_ffs); static DEFINE_PER_CPU(unsigned long, false_positives); static DEFINE_PER_CPU(unsigned long, ignored_failures); @@ -493,8 +497,6 @@ notifier_call_chain (&eeh_notifier_chain, EEH_NOTIFY_FREEZE, event); - __get_cpu_var(slot_resets)++; - pci_dev_put(event->dev); kfree(event); } @@ -546,17 +548,24 @@ if (!eeh_subsystem_enabled) return 0; - if (!dn) + if (!dn) { + __get_cpu_var(no_dn)++; return 0; + } pdn = PCI_DN(dn); /* Access to IO BARs might get this far and still not want checking. */ if (!pdn->eeh_capable || !(pdn->eeh_mode & EEH_MODE_SUPPORTED) || pdn->eeh_mode & EEH_MODE_NOCHECK) { + __get_cpu_var(ignored_check)++; +#ifdef DEBUG + printk ("EEH:ignored check for %s %s\n", pci_name (dev), dn->full_name); +#endif return 0; } if (!pdn->eeh_config_addr) { + __get_cpu_var(no_cfg_addr)++; return 0; } @@ -590,6 +599,7 @@ /* prevent repeated reports of this failure */ pdn->eeh_mode |= EEH_MODE_ISOLATED; + __get_cpu_var(slot_resets)++; reset_state = rets[0]; @@ -657,8 +667,10 @@ /* Finding the phys addr + pci device; this is pretty quick. */ addr = eeh_token_to_phys((unsigned long __force) token); dev = pci_get_device_by_addr(addr); - if (!dev) + if (!dev) { + __get_cpu_var(no_device)++; return val; + } dn = pci_device_to_OF_node(dev); eeh_dn_check_failure (dn, dev); @@ -903,12 +915,17 @@ unsigned int cpu; unsigned long ffs = 0, positives = 0, failures = 0; unsigned long resets = 0; + unsigned long no_dev = 0, no_dn = 0, no_cfg = 0, no_check = 0; for_each_cpu(cpu) { ffs += per_cpu(total_mmio_ffs, cpu); positives += per_cpu(false_positives, cpu); failures += per_cpu(ignored_failures, cpu); resets += per_cpu(slot_resets, cpu); + no_dev += per_cpu(no_device, cpu); + no_dn += per_cpu(no_dn, cpu); + no_cfg += per_cpu(no_cfg_addr, cpu); + no_check += per_cpu(ignored_check, cpu); } if (0 == eeh_subsystem_enabled) { @@ -916,13 +933,17 @@ seq_printf(m, "eeh_total_mmio_ffs=%ld\n", ffs); } else { seq_printf(m, "EEH Subsystem is enabled\n"); - seq_printf(m, "eeh_total_mmio_ffs=%ld\n" - "eeh_false_positives=%ld\n" - "eeh_ignored_failures=%ld\n" - "eeh_slot_resets=%ld\n" - "eeh_fail_count=%d\n", - ffs, positives, failures, resets, - eeh_fail_count.counter); + seq_printf(m, + "no device=%ld\n" + "no device node=%ld\n" + "no config address=%ld\n" + "check not wanted=%ld\n" + "eeh_total_mmio_ffs=%ld\n" + "eeh_false_positives=%ld\n" + "eeh_ignored_failures=%ld\n" + "eeh_slot_resets=%ld\n", + no_dev, no_dn, no_cfg, no_check, + ffs, positives, failures, resets); } return 0; From linas at linas.org Fri Nov 4 11:49:01 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:49:01 -0600 Subject: [PATCH 5/42]: ppc64: RTAS error reporting restructuring References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104004901.GA26819@mail.gnucash.org> 05-eeh-slot-error-detail.patch This patch encapsulates a section of code that reports the EEH event. The new subroutine can be used in several places to report the error. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/ppc64/kernel/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/ppc64/kernel/eeh.c 2005-10-31 12:11:57.134291514 -0600 +++ linux-2.6.14-git3/arch/ppc64/kernel/eeh.c 2005-10-31 12:13:09.282168648 -0600 @@ -397,6 +397,28 @@ /* --------------------------------------------------------------- */ /* Above lies the PCI Address Cache. Below lies the EEH event infrastructure */ +void eeh_slot_error_detail (struct pci_dn *pdn, int severity) +{ + unsigned long flags; + int rc; + + /* Log the error with the rtas logger */ + spin_lock_irqsave(&slot_errbuf_lock, flags); + memset(slot_errbuf, 0, eeh_error_buf_size); + + rc = rtas_call(ibm_slot_error_detail, + 8, 1, NULL, pdn->eeh_config_addr, + BUID_HI(pdn->phb->buid), + BUID_LO(pdn->phb->buid), NULL, 0, + virt_to_phys(slot_errbuf), + eeh_error_buf_size, + severity); + + if (rc == 0) + log_error(slot_errbuf, ERR_TYPE_RTAS_LOG, 0); + spin_unlock_irqrestore(&slot_errbuf_lock, flags); +} + /** * eeh_register_notifier - Register to find out about EEH events. * @nb: notifier block to callback on events @@ -454,9 +476,12 @@ * Since the panic_on_oops sysctl is used to halt the system * in light of potential corruption, we can use it here. */ - if (panic_on_oops) + if (panic_on_oops) { + struct device_node *dn = pci_device_to_OF_node(dev); + eeh_slot_error_detail (PCI_DN(dn), 2 /* Permanent Error */); panic("EEH: MMIO failure (%d) on device:%s\n", reset_state, pci_name(dev)); + } else { __get_cpu_var(ignored_failures)++; printk(KERN_INFO "EEH: Ignored MMIO failure (%d) on device:%s\n", @@ -539,7 +564,7 @@ int ret; int rets[3]; unsigned long flags; - int rc, reset_state; + int reset_state; struct eeh_event *event; struct pci_dn *pdn; @@ -603,20 +628,7 @@ reset_state = rets[0]; - spin_lock_irqsave(&slot_errbuf_lock, flags); - memset(slot_errbuf, 0, eeh_error_buf_size); - - rc = rtas_call(ibm_slot_error_detail, - 8, 1, NULL, pdn->eeh_config_addr, - BUID_HI(pdn->phb->buid), - BUID_LO(pdn->phb->buid), NULL, 0, - virt_to_phys(slot_errbuf), - eeh_error_buf_size, - 1 /* Temporary Error */); - - if (rc == 0) - log_error(slot_errbuf, ERR_TYPE_RTAS_LOG, 0); - spin_unlock_irqrestore(&slot_errbuf_lock, flags); + eeh_slot_error_detail (pdn, 1 /* Temporary Error */); printk(KERN_INFO "EEH: MMIO failure (%d) on device: %s %s\n", rets[0], dn->name, dn->full_name); @@ -783,6 +795,8 @@ struct device_node *phb, *np; struct eeh_early_enable_info info; + spin_lock_init(&slot_errbuf_lock); + np = of_find_node_by_path("/rtas"); if (np == NULL) return; From linas at linas.org Fri Nov 4 11:49:15 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:49:15 -0600 Subject: [PATCH 6/42]: ppc64: avoid PCI error reporting for empty slots References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104004915.GA26827@mail.gnucash.org> 06-eeh-empty-slot-error.patch Performing PCI config-space reads to empty PCI slots can lead to reports of "permanent failure" from the firmware. Ignore permanent failures on empty slots. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/ppc64/kernel/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/ppc64/kernel/eeh.c 2005-10-31 12:13:09.282168648 -0600 +++ linux-2.6.14-git3/arch/ppc64/kernel/eeh.c 2005-10-31 12:15:26.162962756 -0600 @@ -617,7 +617,32 @@ * In any case they must share a common PHB. */ ret = read_slot_reset_state(pdn, rets); - if (!(ret == 0 && rets[1] == 1 && (rets[0] == 2 || rets[0] == 4))) { + + /* If the call to firmware failed, punt */ + if (ret != 0) { + printk(KERN_WARNING "EEH: read_slot_reset_state() failed; rc=%d dn=%s\n", + ret, dn->full_name); + __get_cpu_var(false_positives)++; + return 0; + } + + /* If EEH is not supported on this device, punt. */ + if (rets[1] != 1) { + printk(KERN_WARNING "EEH: event on unsupported device, rc=%d dn=%s\n", + ret, dn->full_name); + __get_cpu_var(false_positives)++; + return 0; + } + + /* If not the kind of error we know about, punt. */ + if (rets[0] != 2 && rets[0] != 4 && rets[0] != 5) { + __get_cpu_var(false_positives)++; + return 0; + } + + /* Note that config-io to empty slots may fail; + * we recognize empty because they don't have children. */ + if ((rets[0] == 5) && (dn->child == NULL)) { __get_cpu_var(false_positives)++; return 0; } @@ -650,7 +675,7 @@ /* Most EEH events are due to device driver bugs. Having * a stack trace will help the device-driver authors figure * out what happened. So print that out. */ - dump_stack(); + if (rets[0] != 5) dump_stack(); schedule_work(&eeh_event_wq); return 0; From linas at linas.org Fri Nov 4 11:49:23 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:49:23 -0600 Subject: [PATCH 7/42]: ppc64: serialize reports of PCI errors References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104004923.GA26835@mail.gnucash.org> 07-eeh-report-race.patch When a PCI slot is isolated, all PCI functions under that slot are affected. If hese functions have separate device drivers, the EEH isolation event might be reported multiple times. This patch adds a lock to prevent the racing of such multiple reports. It also marks every device under the slot as having experienced an EEH event, so that multiple reports may be recognized more easily. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/ppc64/kernel/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/ppc64/kernel/eeh.c 2005-10-31 12:15:26.162962756 -0600 +++ linux-2.6.14-git3/arch/ppc64/kernel/eeh.c 2005-10-31 12:16:19.766441392 -0600 @@ -96,6 +96,9 @@ static int eeh_subsystem_enabled; +/* Lock to avoid races due to multiple reports of an error */ +static DEFINE_SPINLOCK(confirm_error_lock); + /* Buffer for reporting slot-error-detail rtas calls */ static unsigned char slot_errbuf[RTAS_ERROR_LOG_MAX]; static DEFINE_SPINLOCK(slot_errbuf_lock); @@ -544,6 +547,55 @@ return pa | (token & (PAGE_SIZE-1)); } +/** + * Return the "partitionable endpoint" (pe) under which this device lies + */ +static struct device_node * find_device_pe(struct device_node *dn) +{ + while ((dn->parent) && PCI_DN(dn->parent) && + (PCI_DN(dn->parent)->eeh_mode & EEH_MODE_SUPPORTED)) { + dn = dn->parent; + } + return dn; +} + +/** Mark all devices that are peers of this device as failed. + * Mark the device driver too, so that it can see the failure + * immediately; this is critical, since some drivers poll + * status registers in interrupts ... If a driver is polling, + * and the slot is frozen, then the driver can deadlock in + * an interrupt context, which is bad. + */ + +static inline void __eeh_mark_slot (struct device_node *dn) +{ + while (dn) { + PCI_DN(dn)->eeh_mode |= EEH_MODE_ISOLATED; + + if (dn->child) + __eeh_mark_slot (dn->child); + dn = dn->sibling; + } +} + +static inline void __eeh_clear_slot (struct device_node *dn) +{ + while (dn) { + PCI_DN(dn)->eeh_mode &= ~EEH_MODE_ISOLATED; + if (dn->child) + __eeh_clear_slot (dn->child); + dn = dn->sibling; + } +} + +static inline void eeh_clear_slot (struct device_node *dn) +{ + unsigned long flags; + spin_lock_irqsave(&confirm_error_lock, flags); + __eeh_clear_slot (dn); + spin_unlock_irqrestore(&confirm_error_lock, flags); +} + /** * eeh_dn_check_failure - check if all 1's data is due to EEH slot freeze * @dn device node @@ -567,6 +619,8 @@ int reset_state; struct eeh_event *event; struct pci_dn *pdn; + struct device_node *pe_dn; + int rc = 0; __get_cpu_var(total_mmio_ffs)++; @@ -594,10 +648,14 @@ return 0; } - /* - * If we already have a pending isolation event for this - * slot, we know it's bad already, we don't need to check... + /* If we already have a pending isolation event for this + * slot, we know it's bad already, we don't need to check. + * Do this checking under a lock; as multiple PCI devices + * in one slot might report errors simultaneously, and we + * only want one error recovery routine running. */ + spin_lock_irqsave(&confirm_error_lock, flags); + rc = 1; if (pdn->eeh_mode & EEH_MODE_ISOLATED) { atomic_inc(&eeh_fail_count); if (atomic_read(&eeh_fail_count) >= EEH_MAX_FAILS) { @@ -606,7 +664,7 @@ rets[0] = -1; /* reset state unknown */ eeh_panic(dev, rets[0]); } - return 0; + goto dn_unlock; } /* @@ -623,7 +681,8 @@ printk(KERN_WARNING "EEH: read_slot_reset_state() failed; rc=%d dn=%s\n", ret, dn->full_name); __get_cpu_var(false_positives)++; - return 0; + rc = 0; + goto dn_unlock; } /* If EEH is not supported on this device, punt. */ @@ -631,25 +690,33 @@ printk(KERN_WARNING "EEH: event on unsupported device, rc=%d dn=%s\n", ret, dn->full_name); __get_cpu_var(false_positives)++; - return 0; + rc = 0; + goto dn_unlock; } /* If not the kind of error we know about, punt. */ if (rets[0] != 2 && rets[0] != 4 && rets[0] != 5) { __get_cpu_var(false_positives)++; - return 0; + rc = 0; + goto dn_unlock; } /* Note that config-io to empty slots may fail; * we recognize empty because they don't have children. */ if ((rets[0] == 5) && (dn->child == NULL)) { __get_cpu_var(false_positives)++; - return 0; + rc = 0; + goto dn_unlock; } - /* prevent repeated reports of this failure */ - pdn->eeh_mode |= EEH_MODE_ISOLATED; - __get_cpu_var(slot_resets)++; + __get_cpu_var(slot_resets)++; + + /* Avoid repeated reports of this failure, including problems + * with other functions on this device, and functions under + * bridges. */ + pe_dn = find_device_pe (dn); + __eeh_mark_slot (pe_dn); + spin_unlock_irqrestore(&confirm_error_lock, flags); reset_state = rets[0]; @@ -678,10 +745,14 @@ if (rets[0] != 5) dump_stack(); schedule_work(&eeh_event_wq); - return 0; + return 1; + +dn_unlock: + spin_unlock_irqrestore(&confirm_error_lock, flags); + return rc; } -EXPORT_SYMBOL(eeh_dn_check_failure); +EXPORT_SYMBOL_GPL(eeh_dn_check_failure); /** * eeh_check_failure - check if all 1's data is due to EEH slot freeze @@ -820,6 +891,7 @@ struct device_node *phb, *np; struct eeh_early_enable_info info; + spin_lock_init(&confirm_error_lock); spin_lock_init(&slot_errbuf_lock); np = of_find_node_by_path("/rtas"); From linas at linas.org Fri Nov 4 11:49:31 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:49:31 -0600 Subject: [PATCH 8/42]: ppc64: escape hatch for spinning interrupt deadlocks References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104004931.GA26844@mail.gnucash.org> 08-eeh-spin-counter.patch One an EEH event is triggers, all further I/O to a device is blocked (until reset). Bad device drivers may end up spinning in their interrupt handlers, trying to read an interrupt status register that will never change state. This patch moves that spin counter to a per-device structure, and adds some diagnostic prints to help locate the bad driver. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/ppc64/kernel/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/ppc64/kernel/eeh.c 2005-10-31 12:16:19.766441392 -0600 +++ linux-2.6.14-git3/arch/ppc64/kernel/eeh.c 2005-10-31 12:18:21.924300428 -0600 @@ -78,14 +78,12 @@ static struct notifier_block *eeh_notifier_chain; -/* - * If a device driver keeps reading an MMIO register in an interrupt +/* If a device driver keeps reading an MMIO register in an interrupt * handler after a slot isolation event has occurred, we assume it * is broken and panic. This sets the threshold for how many read * attempts we allow before panicking. */ -#define EEH_MAX_FAILS 1000 -static atomic_t eeh_fail_count; +#define EEH_MAX_FAILS 100000 /* RTAS tokens */ static int ibm_set_eeh_option; @@ -521,7 +519,6 @@ "%s\n", event->reset_state, pci_name(event->dev)); - atomic_set(&eeh_fail_count, 0); notifier_call_chain (&eeh_notifier_chain, EEH_NOTIFY_FREEZE, event); @@ -657,12 +654,18 @@ spin_lock_irqsave(&confirm_error_lock, flags); rc = 1; if (pdn->eeh_mode & EEH_MODE_ISOLATED) { - atomic_inc(&eeh_fail_count); - if (atomic_read(&eeh_fail_count) >= EEH_MAX_FAILS) { + pdn->eeh_check_count ++; + if (pdn->eeh_check_count >= EEH_MAX_FAILS) { + printk (KERN_ERR "EEH: Device driver ignored %d bad reads, panicing\n", + pdn->eeh_check_count); + dump_stack(); + /* re-read the slot reset state */ if (read_slot_reset_state(pdn, rets) != 0) rets[0] = -1; /* reset state unknown */ - eeh_panic(dev, rets[0]); + + /* If we are here, then we hit an infinite loop. Stop. */ + panic("EEH: MMIO halt (%d) on device:%s\n", rets[0], pci_name(dev)); } goto dn_unlock; } @@ -808,6 +811,8 @@ struct pci_dn *pdn = PCI_DN(dn); pdn->eeh_mode = 0; + pdn->eeh_check_count = 0; + pdn->eeh_freeze_count = 0; if (status && strcmp(status, "ok") != 0) return NULL; /* ignore devices with bad status */ From linas at linas.org Fri Nov 4 11:49:38 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:49:38 -0600 Subject: [PATCH 9/42]: ppc64: bugfix: crash on PCI hotplug References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104004938.GA26852@mail.gnucash.org> 09-hotplug-bugfix.patch In the current 2.6.14-rc2-git6 kernel, performing a Dynamic LPAR Add of a hotplug slot will crash the system, with the following (abbreviated) stack trace: cpu 0x3: Vector: 700 (Program Check) at [c000000053dff7f0] pc: c0000000004f5974: .__alloc_bootmem+0x0/0xb0 lr: c0000000000258a0: .update_dn_pci_info+0x108/0x118 c0000000000257c8 .update_dn_pci_info+0x30/0x118 (unreliable) c0000000000258fc .pci_dn_reconfig_notifier+0x4c/0x64 c000000000060754 .notifier_call_chain+0x68/0x9c The root cause was that __init __alloc_bootmem() was called long after boot had finished, resulting in a crash because this routine is undefined after boot time. The patch below fixes this crash, and adds some docs to clarify the code. p.s. congrats to all for getting slashdotted on this yesterday! Signed-off-by: Linas Vepstas Mailed to: paulus at samba.org CC: linuxppc64-dev at ozlabs.org, linux-kernel at vger.kernel.org, johnrose at linux.ibm.com On Monday 3 October 2005 revised on 4 Ocober to [PATCH 1/2] ppc64: Crash in DLPAR code on PCI hotplug add Index: linux-2.6.14-git3/arch/ppc64/kernel/pci_dn.c =================================================================== --- linux-2.6.14-git3.orig/arch/ppc64/kernel/pci_dn.c 2005-10-31 12:19:03.211506966 -0600 +++ linux-2.6.14-git3/arch/ppc64/kernel/pci_dn.c 2005-10-31 12:19:47.420303479 -0600 @@ -43,7 +43,7 @@ u32 *regs; struct pci_dn *pdn; - if (phb->is_dynamic) + if (mem_init_done) pdn = kmalloc(sizeof(*pdn), GFP_KERNEL); else pdn = alloc_bootmem(sizeof(*pdn)); @@ -120,6 +120,14 @@ return NULL; } +/** + * pci_devs_phb_init_dynamic - setup pci devices under this PHB + * phb: pci-to-host bridge (top-level bridge connecting to cpu) + * + * This routine is called both during boot, (before the memory + * subsystem is set up, before kmalloc is valid) and during the + * dynamic lpar operation of adding a PHB to a running system. + */ void __devinit pci_devs_phb_init_dynamic(struct pci_controller *phb) { struct device_node * dn = (struct device_node *) phb->arch_data; @@ -200,9 +208,14 @@ .notifier_call = pci_dn_reconfig_notifier, }; -/* - * Actually initialize the phbs. - * The buswalk on this phb has not happened yet. +/** + * pci_devs_phb_init - Initialize phbs and pci devs under them. + * + * This routine walks over all phb's (pci-host bridges) on the + * system, and sets up assorted pci-related structures + * (including pci info in the device node structs) for each + * pci device found underneath. This routine runs once, + * early in the boot sequence. */ void __init pci_devs_phb_init(void) { From linas at linas.org Fri Nov 4 11:49:45 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:49:45 -0600 Subject: [PATCH 10/42]: ppc64: bugfix: don't silently gnore PCI errors References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104004945.GA26860@mail.gnucash.org> 10-EEH-enable-bugfix.patch Bugfix: With the curent linux-2.6.14-rc2-git6, EEH errors are ignored because thier detection requires an unusued, uninitialized flag to be set. This patch removes the unused flag. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/ppc64/kernel/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/ppc64/kernel/eeh.c 2005-10-31 12:54:20.919034814 -0600 +++ linux-2.6.14-git3/arch/ppc64/kernel/eeh.c 2005-10-31 12:54:48.165215962 -0600 @@ -631,11 +631,12 @@ pdn = PCI_DN(dn); /* Access to IO BARs might get this far and still not want checking. */ - if (!pdn->eeh_capable || !(pdn->eeh_mode & EEH_MODE_SUPPORTED) || + if (!(pdn->eeh_mode & EEH_MODE_SUPPORTED) || pdn->eeh_mode & EEH_MODE_NOCHECK) { __get_cpu_var(ignored_check)++; #ifdef DEBUG - printk ("EEH:ignored check for %s %s\n", pci_name (dev), dn->full_name); + printk ("EEH:ignored check (%x) for %s %s\n", + pdn->eeh_mode, pci_name (dev), dn->full_name); #endif return 0; } Index: linux-2.6.14-git3/include/asm-ppc64/pci-bridge.h =================================================================== --- linux-2.6.14-git3.orig/include/asm-ppc64/pci-bridge.h 2005-10-31 12:54:20.919034814 -0600 +++ linux-2.6.14-git3/include/asm-ppc64/pci-bridge.h 2005-10-31 12:54:48.167215682 -0600 @@ -63,7 +63,6 @@ int devfn; /* for pci devices */ int eeh_mode; /* See eeh.h for possible EEH_MODEs */ int eeh_config_addr; - int eeh_capable; /* from firmware */ int eeh_check_count; /* # times driver ignored error */ int eeh_freeze_count; /* # times this device froze up. */ int eeh_is_bridge; /* device is pci-to-pci bridge */ From linas at linas.org Fri Nov 4 11:49:51 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:49:51 -0600 Subject: [PATCH 11/42]: ppc64: move code to powerpc directory from ppc64 References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104004951.GA26868@mail.gnucash.org> 11-eeh-move-to-powerpc.patch Move arch/ppc64/kernel/eeh.c to arch//powerpc/platforms/pseries/eeh.c No other changes (except for Makefile to build it) Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/ppc64/kernel/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/ppc64/kernel/eeh.c 2005-11-02 14:29:22.485829789 -0600 +++ /dev/null 1970-01-01 00:00:00.000000000 +0000 @@ -1,1093 +0,0 @@ -/* - * eeh.c - * Copyright (C) 2001 Dave Engebretsen & Todd Inglett IBM Corporation - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#undef DEBUG - -/** Overview: - * EEH, or "Extended Error Handling" is a PCI bridge technology for - * dealing with PCI bus errors that can't be dealt with within the - * usual PCI framework, except by check-stopping the CPU. Systems - * that are designed for high-availability/reliability cannot afford - * to crash due to a "mere" PCI error, thus the need for EEH. - * An EEH-capable bridge operates by converting a detected error - * into a "slot freeze", taking the PCI adapter off-line, making - * the slot behave, from the OS'es point of view, as if the slot - * were "empty": all reads return 0xff's and all writes are silently - * ignored. EEH slot isolation events can be triggered by parity - * errors on the address or data busses (e.g. during posted writes), - * which in turn might be caused by low voltage on the bus, dust, - * vibration, humidity, radioactivity or plain-old failed hardware. - * - * Note, however, that one of the leading causes of EEH slot - * freeze events are buggy device drivers, buggy device microcode, - * or buggy device hardware. This is because any attempt by the - * device to bus-master data to a memory address that is not - * assigned to the device will trigger a slot freeze. (The idea - * is to prevent devices-gone-wild from corrupting system memory). - * Buggy hardware/drivers will have a miserable time co-existing - * with EEH. - * - * Ideally, a PCI device driver, when suspecting that an isolation - * event has occured (e.g. by reading 0xff's), will then ask EEH - * whether this is the case, and then take appropriate steps to - * reset the PCI slot, the PCI device, and then resume operations. - * However, until that day, the checking is done here, with the - * eeh_check_failure() routine embedded in the MMIO macros. If - * the slot is found to be isolated, an "EEH Event" is synthesized - * and sent out for processing. - */ - -/* EEH event workqueue setup. */ -static DEFINE_SPINLOCK(eeh_eventlist_lock); -LIST_HEAD(eeh_eventlist); -static void eeh_event_handler(void *); -DECLARE_WORK(eeh_event_wq, eeh_event_handler, NULL); - -static struct notifier_block *eeh_notifier_chain; - -/* If a device driver keeps reading an MMIO register in an interrupt - * handler after a slot isolation event has occurred, we assume it - * is broken and panic. This sets the threshold for how many read - * attempts we allow before panicking. - */ -#define EEH_MAX_FAILS 100000 - -/* RTAS tokens */ -static int ibm_set_eeh_option; -static int ibm_set_slot_reset; -static int ibm_read_slot_reset_state; -static int ibm_read_slot_reset_state2; -static int ibm_slot_error_detail; - -static int eeh_subsystem_enabled; - -/* Lock to avoid races due to multiple reports of an error */ -static DEFINE_SPINLOCK(confirm_error_lock); - -/* Buffer for reporting slot-error-detail rtas calls */ -static unsigned char slot_errbuf[RTAS_ERROR_LOG_MAX]; -static DEFINE_SPINLOCK(slot_errbuf_lock); -static int eeh_error_buf_size; - -/* System monitoring statistics */ -static DEFINE_PER_CPU(unsigned long, no_device); -static DEFINE_PER_CPU(unsigned long, no_dn); -static DEFINE_PER_CPU(unsigned long, no_cfg_addr); -static DEFINE_PER_CPU(unsigned long, ignored_check); -static DEFINE_PER_CPU(unsigned long, total_mmio_ffs); -static DEFINE_PER_CPU(unsigned long, false_positives); -static DEFINE_PER_CPU(unsigned long, ignored_failures); -static DEFINE_PER_CPU(unsigned long, slot_resets); - -/** - * The pci address cache subsystem. This subsystem places - * PCI device address resources into a red-black tree, sorted - * according to the address range, so that given only an i/o - * address, the corresponding PCI device can be **quickly** - * found. It is safe to perform an address lookup in an interrupt - * context; this ability is an important feature. - * - * Currently, the only customer of this code is the EEH subsystem; - * thus, this code has been somewhat tailored to suit EEH better. - * In particular, the cache does *not* hold the addresses of devices - * for which EEH is not enabled. - * - * (Implementation Note: The RB tree seems to be better/faster - * than any hash algo I could think of for this problem, even - * with the penalty of slow pointer chases for d-cache misses). - */ -struct pci_io_addr_range -{ - struct rb_node rb_node; - unsigned long addr_lo; - unsigned long addr_hi; - struct pci_dev *pcidev; - unsigned int flags; -}; - -static struct pci_io_addr_cache -{ - struct rb_root rb_root; - spinlock_t piar_lock; -} pci_io_addr_cache_root; - -static inline struct pci_dev *__pci_get_device_by_addr(unsigned long addr) -{ - struct rb_node *n = pci_io_addr_cache_root.rb_root.rb_node; - - while (n) { - struct pci_io_addr_range *piar; - piar = rb_entry(n, struct pci_io_addr_range, rb_node); - - if (addr < piar->addr_lo) { - n = n->rb_left; - } else { - if (addr > piar->addr_hi) { - n = n->rb_right; - } else { - pci_dev_get(piar->pcidev); - return piar->pcidev; - } - } - } - - return NULL; -} - -/** - * pci_get_device_by_addr - Get device, given only address - * @addr: mmio (PIO) phys address or i/o port number - * - * Given an mmio phys address, or a port number, find a pci device - * that implements this address. Be sure to pci_dev_put the device - * when finished. I/O port numbers are assumed to be offset - * from zero (that is, they do *not* have pci_io_addr added in). - * It is safe to call this function within an interrupt. - */ -static struct pci_dev *pci_get_device_by_addr(unsigned long addr) -{ - struct pci_dev *dev; - unsigned long flags; - - spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags); - dev = __pci_get_device_by_addr(addr); - spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags); - return dev; -} - -#ifdef DEBUG -/* - * Handy-dandy debug print routine, does nothing more - * than print out the contents of our addr cache. - */ -static void pci_addr_cache_print(struct pci_io_addr_cache *cache) -{ - struct rb_node *n; - int cnt = 0; - - n = rb_first(&cache->rb_root); - while (n) { - struct pci_io_addr_range *piar; - piar = rb_entry(n, struct pci_io_addr_range, rb_node); - printk(KERN_DEBUG "PCI: %s addr range %d [%lx-%lx]: %s\n", - (piar->flags & IORESOURCE_IO) ? "i/o" : "mem", cnt, - piar->addr_lo, piar->addr_hi, pci_name(piar->pcidev)); - cnt++; - n = rb_next(n); - } -} -#endif - -/* Insert address range into the rb tree. */ -static struct pci_io_addr_range * -pci_addr_cache_insert(struct pci_dev *dev, unsigned long alo, - unsigned long ahi, unsigned int flags) -{ - struct rb_node **p = &pci_io_addr_cache_root.rb_root.rb_node; - struct rb_node *parent = NULL; - struct pci_io_addr_range *piar; - - /* Walk tree, find a place to insert into tree */ - while (*p) { - parent = *p; - piar = rb_entry(parent, struct pci_io_addr_range, rb_node); - if (ahi < piar->addr_lo) { - p = &parent->rb_left; - } else if (alo > piar->addr_hi) { - p = &parent->rb_right; - } else { - if (dev != piar->pcidev || - alo != piar->addr_lo || ahi != piar->addr_hi) { - printk(KERN_WARNING "PIAR: overlapping address range\n"); - } - return piar; - } - } - piar = (struct pci_io_addr_range *)kmalloc(sizeof(struct pci_io_addr_range), GFP_ATOMIC); - if (!piar) - return NULL; - - piar->addr_lo = alo; - piar->addr_hi = ahi; - piar->pcidev = dev; - piar->flags = flags; - -#ifdef DEBUG - printk(KERN_DEBUG "PIAR: insert range=[%lx:%lx] dev=%s\n", - alo, ahi, pci_name (dev)); -#endif - - rb_link_node(&piar->rb_node, parent, p); - rb_insert_color(&piar->rb_node, &pci_io_addr_cache_root.rb_root); - - return piar; -} - -static void __pci_addr_cache_insert_device(struct pci_dev *dev) -{ - struct device_node *dn; - struct pci_dn *pdn; - int i; - int inserted = 0; - - dn = pci_device_to_OF_node(dev); - if (!dn) { - printk(KERN_WARNING "PCI: no pci dn found for dev=%s\n", pci_name(dev)); - return; - } - - /* Skip any devices for which EEH is not enabled. */ - pdn = PCI_DN(dn); - if (!(pdn->eeh_mode & EEH_MODE_SUPPORTED) || - pdn->eeh_mode & EEH_MODE_NOCHECK) { -#ifdef DEBUG - printk(KERN_INFO "PCI: skip building address cache for=%s - %s\n", - pci_name(dev), pdn->node->full_name); -#endif - return; - } - - /* The cache holds a reference to the device... */ - pci_dev_get(dev); - - /* Walk resources on this device, poke them into the tree */ - for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) { - unsigned long start = pci_resource_start(dev,i); - unsigned long end = pci_resource_end(dev,i); - unsigned int flags = pci_resource_flags(dev,i); - - /* We are interested only bus addresses, not dma or other stuff */ - if (0 == (flags & (IORESOURCE_IO | IORESOURCE_MEM))) - continue; - if (start == 0 || ~start == 0 || end == 0 || ~end == 0) - continue; - pci_addr_cache_insert(dev, start, end, flags); - inserted = 1; - } - - /* If there was nothing to add, the cache has no reference... */ - if (!inserted) - pci_dev_put(dev); -} - -/** - * pci_addr_cache_insert_device - Add a device to the address cache - * @dev: PCI device whose I/O addresses we are interested in. - * - * In order to support the fast lookup of devices based on addresses, - * we maintain a cache of devices that can be quickly searched. - * This routine adds a device to that cache. - */ -static void pci_addr_cache_insert_device(struct pci_dev *dev) -{ - unsigned long flags; - - spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags); - __pci_addr_cache_insert_device(dev); - spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags); -} - -static inline void __pci_addr_cache_remove_device(struct pci_dev *dev) -{ - struct rb_node *n; - int removed = 0; - -restart: - n = rb_first(&pci_io_addr_cache_root.rb_root); - while (n) { - struct pci_io_addr_range *piar; - piar = rb_entry(n, struct pci_io_addr_range, rb_node); - - if (piar->pcidev == dev) { - rb_erase(n, &pci_io_addr_cache_root.rb_root); - removed = 1; - kfree(piar); - goto restart; - } - n = rb_next(n); - } - - /* The cache no longer holds its reference to this device... */ - if (removed) - pci_dev_put(dev); -} - -/** - * pci_addr_cache_remove_device - remove pci device from addr cache - * @dev: device to remove - * - * Remove a device from the addr-cache tree. - * This is potentially expensive, since it will walk - * the tree multiple times (once per resource). - * But so what; device removal doesn't need to be that fast. - */ -static void pci_addr_cache_remove_device(struct pci_dev *dev) -{ - unsigned long flags; - - spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags); - __pci_addr_cache_remove_device(dev); - spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags); -} - -/** - * pci_addr_cache_build - Build a cache of I/O addresses - * - * Build a cache of pci i/o addresses. This cache will be used to - * find the pci device that corresponds to a given address. - * This routine scans all pci busses to build the cache. - * Must be run late in boot process, after the pci controllers - * have been scaned for devices (after all device resources are known). - */ -void __init pci_addr_cache_build(void) -{ - struct pci_dev *dev = NULL; - - if (!eeh_subsystem_enabled) - return; - - spin_lock_init(&pci_io_addr_cache_root.piar_lock); - - while ((dev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) { - /* Ignore PCI bridges ( XXX why ??) */ - if ((dev->class >> 16) == PCI_BASE_CLASS_BRIDGE) { - continue; - } - pci_addr_cache_insert_device(dev); - } - -#ifdef DEBUG - /* Verify tree built up above, echo back the list of addrs. */ - pci_addr_cache_print(&pci_io_addr_cache_root); -#endif -} - -/* --------------------------------------------------------------- */ -/* Above lies the PCI Address Cache. Below lies the EEH event infrastructure */ - -void eeh_slot_error_detail (struct pci_dn *pdn, int severity) -{ - unsigned long flags; - int rc; - - /* Log the error with the rtas logger */ - spin_lock_irqsave(&slot_errbuf_lock, flags); - memset(slot_errbuf, 0, eeh_error_buf_size); - - rc = rtas_call(ibm_slot_error_detail, - 8, 1, NULL, pdn->eeh_config_addr, - BUID_HI(pdn->phb->buid), - BUID_LO(pdn->phb->buid), NULL, 0, - virt_to_phys(slot_errbuf), - eeh_error_buf_size, - severity); - - if (rc == 0) - log_error(slot_errbuf, ERR_TYPE_RTAS_LOG, 0); - spin_unlock_irqrestore(&slot_errbuf_lock, flags); -} - -/** - * eeh_register_notifier - Register to find out about EEH events. - * @nb: notifier block to callback on events - */ -int eeh_register_notifier(struct notifier_block *nb) -{ - return notifier_chain_register(&eeh_notifier_chain, nb); -} - -/** - * eeh_unregister_notifier - Unregister to an EEH event notifier. - * @nb: notifier block to callback on events - */ -int eeh_unregister_notifier(struct notifier_block *nb) -{ - return notifier_chain_unregister(&eeh_notifier_chain, nb); -} - -/** - * read_slot_reset_state - Read the reset state of a device node's slot - * @dn: device node to read - * @rets: array to return results in - */ -static int read_slot_reset_state(struct pci_dn *pdn, int rets[]) -{ - int token, outputs; - - if (ibm_read_slot_reset_state2 != RTAS_UNKNOWN_SERVICE) { - token = ibm_read_slot_reset_state2; - outputs = 4; - } else { - token = ibm_read_slot_reset_state; - rets[2] = 0; /* fake PE Unavailable info */ - outputs = 3; - } - - return rtas_call(token, 3, outputs, rets, pdn->eeh_config_addr, - BUID_HI(pdn->phb->buid), BUID_LO(pdn->phb->buid)); -} - -/** - * eeh_panic - call panic() for an eeh event that cannot be handled. - * The philosophy of this routine is that it is better to panic and - * halt the OS than it is to risk possible data corruption by - * oblivious device drivers that don't know better. - * - * @dev pci device that had an eeh event - * @reset_state current reset state of the device slot - */ -static void eeh_panic(struct pci_dev *dev, int reset_state) -{ - /* - * XXX We should create a separate sysctl for this. - * - * Since the panic_on_oops sysctl is used to halt the system - * in light of potential corruption, we can use it here. - */ - if (panic_on_oops) { - struct device_node *dn = pci_device_to_OF_node(dev); - eeh_slot_error_detail (PCI_DN(dn), 2 /* Permanent Error */); - panic("EEH: MMIO failure (%d) on device:%s\n", reset_state, - pci_name(dev)); - } - else { - __get_cpu_var(ignored_failures)++; - printk(KERN_INFO "EEH: Ignored MMIO failure (%d) on device:%s\n", - reset_state, pci_name(dev)); - } -} - -/** - * eeh_event_handler - dispatch EEH events. The detection of a frozen - * slot can occur inside an interrupt, where it can be hard to do - * anything about it. The goal of this routine is to pull these - * detection events out of the context of the interrupt handler, and - * re-dispatch them for processing at a later time in a normal context. - * - * @dummy - unused - */ -static void eeh_event_handler(void *dummy) -{ - unsigned long flags; - struct eeh_event *event; - - while (1) { - spin_lock_irqsave(&eeh_eventlist_lock, flags); - event = NULL; - if (!list_empty(&eeh_eventlist)) { - event = list_entry(eeh_eventlist.next, struct eeh_event, list); - list_del(&event->list); - } - spin_unlock_irqrestore(&eeh_eventlist_lock, flags); - if (event == NULL) - break; - - printk(KERN_INFO "EEH: MMIO failure (%d), notifiying device " - "%s\n", event->reset_state, - pci_name(event->dev)); - - notifier_call_chain (&eeh_notifier_chain, - EEH_NOTIFY_FREEZE, event); - - pci_dev_put(event->dev); - kfree(event); - } -} - -/** - * eeh_token_to_phys - convert EEH address token to phys address - * @token i/o token, should be address in the form 0xA.... - */ -static inline unsigned long eeh_token_to_phys(unsigned long token) -{ - pte_t *ptep; - unsigned long pa; - - ptep = find_linux_pte(init_mm.pgd, token); - if (!ptep) - return token; - pa = pte_pfn(*ptep) << PAGE_SHIFT; - - return pa | (token & (PAGE_SIZE-1)); -} - -/** - * Return the "partitionable endpoint" (pe) under which this device lies - */ -static struct device_node * find_device_pe(struct device_node *dn) -{ - while ((dn->parent) && PCI_DN(dn->parent) && - (PCI_DN(dn->parent)->eeh_mode & EEH_MODE_SUPPORTED)) { - dn = dn->parent; - } - return dn; -} - -/** Mark all devices that are peers of this device as failed. - * Mark the device driver too, so that it can see the failure - * immediately; this is critical, since some drivers poll - * status registers in interrupts ... If a driver is polling, - * and the slot is frozen, then the driver can deadlock in - * an interrupt context, which is bad. - */ - -static inline void __eeh_mark_slot (struct device_node *dn) -{ - while (dn) { - PCI_DN(dn)->eeh_mode |= EEH_MODE_ISOLATED; - - if (dn->child) - __eeh_mark_slot (dn->child); - dn = dn->sibling; - } -} - -static inline void __eeh_clear_slot (struct device_node *dn) -{ - while (dn) { - PCI_DN(dn)->eeh_mode &= ~EEH_MODE_ISOLATED; - if (dn->child) - __eeh_clear_slot (dn->child); - dn = dn->sibling; - } -} - -static inline void eeh_clear_slot (struct device_node *dn) -{ - unsigned long flags; - spin_lock_irqsave(&confirm_error_lock, flags); - __eeh_clear_slot (dn); - spin_unlock_irqrestore(&confirm_error_lock, flags); -} - -/** - * eeh_dn_check_failure - check if all 1's data is due to EEH slot freeze - * @dn device node - * @dev pci device, if known - * - * Check for an EEH failure for the given device node. Call this - * routine if the result of a read was all 0xff's and you want to - * find out if this is due to an EEH slot freeze. This routine - * will query firmware for the EEH status. - * - * Returns 0 if there has not been an EEH error; otherwise returns - * a non-zero value and queues up a slot isolation event notification. - * - * It is safe to call this routine in an interrupt context. - */ -int eeh_dn_check_failure(struct device_node *dn, struct pci_dev *dev) -{ - int ret; - int rets[3]; - unsigned long flags; - int reset_state; - struct eeh_event *event; - struct pci_dn *pdn; - struct device_node *pe_dn; - int rc = 0; - - __get_cpu_var(total_mmio_ffs)++; - - if (!eeh_subsystem_enabled) - return 0; - - if (!dn) { - __get_cpu_var(no_dn)++; - return 0; - } - pdn = PCI_DN(dn); - - /* Access to IO BARs might get this far and still not want checking. */ - if (!(pdn->eeh_mode & EEH_MODE_SUPPORTED) || - pdn->eeh_mode & EEH_MODE_NOCHECK) { - __get_cpu_var(ignored_check)++; -#ifdef DEBUG - printk ("EEH:ignored check (%x) for %s %s\n", - pdn->eeh_mode, pci_name (dev), dn->full_name); -#endif - return 0; - } - - if (!pdn->eeh_config_addr) { - __get_cpu_var(no_cfg_addr)++; - return 0; - } - - /* If we already have a pending isolation event for this - * slot, we know it's bad already, we don't need to check. - * Do this checking under a lock; as multiple PCI devices - * in one slot might report errors simultaneously, and we - * only want one error recovery routine running. - */ - spin_lock_irqsave(&confirm_error_lock, flags); - rc = 1; - if (pdn->eeh_mode & EEH_MODE_ISOLATED) { - pdn->eeh_check_count ++; - if (pdn->eeh_check_count >= EEH_MAX_FAILS) { - printk (KERN_ERR "EEH: Device driver ignored %d bad reads, panicing\n", - pdn->eeh_check_count); - dump_stack(); - - /* re-read the slot reset state */ - if (read_slot_reset_state(pdn, rets) != 0) - rets[0] = -1; /* reset state unknown */ - - /* If we are here, then we hit an infinite loop. Stop. */ - panic("EEH: MMIO halt (%d) on device:%s\n", rets[0], pci_name(dev)); - } - goto dn_unlock; - } - - /* - * Now test for an EEH failure. This is VERY expensive. - * Note that the eeh_config_addr may be a parent device - * in the case of a device behind a bridge, or it may be - * function zero of a multi-function device. - * In any case they must share a common PHB. - */ - ret = read_slot_reset_state(pdn, rets); - - /* If the call to firmware failed, punt */ - if (ret != 0) { - printk(KERN_WARNING "EEH: read_slot_reset_state() failed; rc=%d dn=%s\n", - ret, dn->full_name); - __get_cpu_var(false_positives)++; - rc = 0; - goto dn_unlock; - } - - /* If EEH is not supported on this device, punt. */ - if (rets[1] != 1) { - printk(KERN_WARNING "EEH: event on unsupported device, rc=%d dn=%s\n", - ret, dn->full_name); - __get_cpu_var(false_positives)++; - rc = 0; - goto dn_unlock; - } - - /* If not the kind of error we know about, punt. */ - if (rets[0] != 2 && rets[0] != 4 && rets[0] != 5) { - __get_cpu_var(false_positives)++; - rc = 0; - goto dn_unlock; - } - - /* Note that config-io to empty slots may fail; - * we recognize empty because they don't have children. */ - if ((rets[0] == 5) && (dn->child == NULL)) { - __get_cpu_var(false_positives)++; - rc = 0; - goto dn_unlock; - } - - __get_cpu_var(slot_resets)++; - - /* Avoid repeated reports of this failure, including problems - * with other functions on this device, and functions under - * bridges. */ - pe_dn = find_device_pe (dn); - __eeh_mark_slot (pe_dn); - spin_unlock_irqrestore(&confirm_error_lock, flags); - - reset_state = rets[0]; - - eeh_slot_error_detail (pdn, 1 /* Temporary Error */); - - printk(KERN_INFO "EEH: MMIO failure (%d) on device: %s %s\n", - rets[0], dn->name, dn->full_name); - event = kmalloc(sizeof(*event), GFP_ATOMIC); - if (event == NULL) { - eeh_panic(dev, reset_state); - return 1; - } - - event->dev = dev; - event->dn = dn; - event->reset_state = reset_state; - - /* We may or may not be called in an interrupt context */ - spin_lock_irqsave(&eeh_eventlist_lock, flags); - list_add(&event->list, &eeh_eventlist); - spin_unlock_irqrestore(&eeh_eventlist_lock, flags); - - /* Most EEH events are due to device driver bugs. Having - * a stack trace will help the device-driver authors figure - * out what happened. So print that out. */ - if (rets[0] != 5) dump_stack(); - schedule_work(&eeh_event_wq); - - return 1; - -dn_unlock: - spin_unlock_irqrestore(&confirm_error_lock, flags); - return rc; -} - -EXPORT_SYMBOL_GPL(eeh_dn_check_failure); - -/** - * eeh_check_failure - check if all 1's data is due to EEH slot freeze - * @token i/o token, should be address in the form 0xA.... - * @val value, should be all 1's (XXX why do we need this arg??) - * - * Check for an EEH failure at the given token address. Call this - * routine if the result of a read was all 0xff's and you want to - * find out if this is due to an EEH slot freeze event. This routine - * will query firmware for the EEH status. - * - * Note this routine is safe to call in an interrupt context. - */ -unsigned long eeh_check_failure(const volatile void __iomem *token, unsigned long val) -{ - unsigned long addr; - struct pci_dev *dev; - struct device_node *dn; - - /* Finding the phys addr + pci device; this is pretty quick. */ - addr = eeh_token_to_phys((unsigned long __force) token); - dev = pci_get_device_by_addr(addr); - if (!dev) { - __get_cpu_var(no_device)++; - return val; - } - - dn = pci_device_to_OF_node(dev); - eeh_dn_check_failure (dn, dev); - - pci_dev_put(dev); - return val; -} - -EXPORT_SYMBOL(eeh_check_failure); - -struct eeh_early_enable_info { - unsigned int buid_hi; - unsigned int buid_lo; -}; - -/* Enable eeh for the given device node. */ -static void *early_enable_eeh(struct device_node *dn, void *data) -{ - struct eeh_early_enable_info *info = data; - int ret; - char *status = get_property(dn, "status", NULL); - u32 *class_code = (u32 *)get_property(dn, "class-code", NULL); - u32 *vendor_id = (u32 *)get_property(dn, "vendor-id", NULL); - u32 *device_id = (u32 *)get_property(dn, "device-id", NULL); - u32 *regs; - int enable; - struct pci_dn *pdn = PCI_DN(dn); - - pdn->eeh_mode = 0; - pdn->eeh_check_count = 0; - pdn->eeh_freeze_count = 0; - - if (status && strcmp(status, "ok") != 0) - return NULL; /* ignore devices with bad status */ - - /* Ignore bad nodes. */ - if (!class_code || !vendor_id || !device_id) - return NULL; - - /* There is nothing to check on PCI to ISA bridges */ - if (dn->type && !strcmp(dn->type, "isa")) { - pdn->eeh_mode |= EEH_MODE_NOCHECK; - return NULL; - } - - /* - * Now decide if we are going to "Disable" EEH checking - * for this device. We still run with the EEH hardware active, - * but we won't be checking for ff's. This means a driver - * could return bad data (very bad!), an interrupt handler could - * hang waiting on status bits that won't change, etc. - * But there are a few cases like display devices that make sense. - */ - enable = 1; /* i.e. we will do checking */ - if ((*class_code >> 16) == PCI_BASE_CLASS_DISPLAY) - enable = 0; - - if (!enable) - pdn->eeh_mode |= EEH_MODE_NOCHECK; - - /* Ok... see if this device supports EEH. Some do, some don't, - * and the only way to find out is to check each and every one. */ - regs = (u32 *)get_property(dn, "reg", NULL); - if (regs) { - /* First register entry is addr (00BBSS00) */ - /* Try to enable eeh */ - ret = rtas_call(ibm_set_eeh_option, 4, 1, NULL, - regs[0], info->buid_hi, info->buid_lo, - EEH_ENABLE); - if (ret == 0) { - eeh_subsystem_enabled = 1; - pdn->eeh_mode |= EEH_MODE_SUPPORTED; - pdn->eeh_config_addr = regs[0]; -#ifdef DEBUG - printk(KERN_DEBUG "EEH: %s: eeh enabled\n", dn->full_name); -#endif - } else { - - /* This device doesn't support EEH, but it may have an - * EEH parent, in which case we mark it as supported. */ - if (dn->parent && PCI_DN(dn->parent) - && (PCI_DN(dn->parent)->eeh_mode & EEH_MODE_SUPPORTED)) { - /* Parent supports EEH. */ - pdn->eeh_mode |= EEH_MODE_SUPPORTED; - pdn->eeh_config_addr = PCI_DN(dn->parent)->eeh_config_addr; - return NULL; - } - } - } else { - printk(KERN_WARNING "EEH: %s: unable to get reg property.\n", - dn->full_name); - } - - return NULL; -} - -/* - * Initialize EEH by trying to enable it for all of the adapters in the system. - * As a side effect we can determine here if eeh is supported at all. - * Note that we leave EEH on so failed config cycles won't cause a machine - * check. If a user turns off EEH for a particular adapter they are really - * telling Linux to ignore errors. Some hardware (e.g. POWER5) won't - * grant access to a slot if EEH isn't enabled, and so we always enable - * EEH for all slots/all devices. - * - * The eeh-force-off option disables EEH checking globally, for all slots. - * Even if force-off is set, the EEH hardware is still enabled, so that - * newer systems can boot. - */ -void __init eeh_init(void) -{ - struct device_node *phb, *np; - struct eeh_early_enable_info info; - - spin_lock_init(&confirm_error_lock); - spin_lock_init(&slot_errbuf_lock); - - np = of_find_node_by_path("/rtas"); - if (np == NULL) - return; - - ibm_set_eeh_option = rtas_token("ibm,set-eeh-option"); - ibm_set_slot_reset = rtas_token("ibm,set-slot-reset"); - ibm_read_slot_reset_state2 = rtas_token("ibm,read-slot-reset-state2"); - ibm_read_slot_reset_state = rtas_token("ibm,read-slot-reset-state"); - ibm_slot_error_detail = rtas_token("ibm,slot-error-detail"); - - if (ibm_set_eeh_option == RTAS_UNKNOWN_SERVICE) - return; - - eeh_error_buf_size = rtas_token("rtas-error-log-max"); - if (eeh_error_buf_size == RTAS_UNKNOWN_SERVICE) { - eeh_error_buf_size = 1024; - } - if (eeh_error_buf_size > RTAS_ERROR_LOG_MAX) { - printk(KERN_WARNING "EEH: rtas-error-log-max is bigger than allocated " - "buffer ! (%d vs %d)", eeh_error_buf_size, RTAS_ERROR_LOG_MAX); - eeh_error_buf_size = RTAS_ERROR_LOG_MAX; - } - - /* Enable EEH for all adapters. Note that eeh requires buid's */ - for (phb = of_find_node_by_name(NULL, "pci"); phb; - phb = of_find_node_by_name(phb, "pci")) { - unsigned long buid; - - buid = get_phb_buid(phb); - if (buid == 0 || PCI_DN(phb) == NULL) - continue; - - info.buid_lo = BUID_LO(buid); - info.buid_hi = BUID_HI(buid); - traverse_pci_devices(phb, early_enable_eeh, &info); - } - - if (eeh_subsystem_enabled) - printk(KERN_INFO "EEH: PCI Enhanced I/O Error Handling Enabled\n"); - else - printk(KERN_WARNING "EEH: No capable adapters found\n"); -} - -/** - * eeh_add_device_early - enable EEH for the indicated device_node - * @dn: device node for which to set up EEH - * - * This routine must be used to perform EEH initialization for PCI - * devices that were added after system boot (e.g. hotplug, dlpar). - * This routine must be called before any i/o is performed to the - * adapter (inluding any config-space i/o). - * Whether this actually enables EEH or not for this device depends - * on the CEC architecture, type of the device, on earlier boot - * command-line arguments & etc. - */ -void eeh_add_device_early(struct device_node *dn) -{ - struct pci_controller *phb; - struct eeh_early_enable_info info; - - if (!dn || !PCI_DN(dn)) - return; - phb = PCI_DN(dn)->phb; - if (NULL == phb || 0 == phb->buid) { - printk(KERN_WARNING "EEH: Expected buid but found none for %s\n", - dn->full_name); - dump_stack(); - return; - } - - info.buid_hi = BUID_HI(phb->buid); - info.buid_lo = BUID_LO(phb->buid); - early_enable_eeh(dn, &info); -} -EXPORT_SYMBOL_GPL(eeh_add_device_early); - -/** - * eeh_add_device_late - perform EEH initialization for the indicated pci device - * @dev: pci device for which to set up EEH - * - * This routine must be used to complete EEH initialization for PCI - * devices that were added after system boot (e.g. hotplug, dlpar). - */ -void eeh_add_device_late(struct pci_dev *dev) -{ - struct device_node *dn; - - if (!dev || !eeh_subsystem_enabled) - return; - -#ifdef DEBUG - printk(KERN_DEBUG "EEH: adding device %s\n", pci_name(dev)); -#endif - - pci_dev_get (dev); - dn = pci_device_to_OF_node(dev); - PCI_DN(dn)->pcidev = dev; - - pci_addr_cache_insert_device (dev); -} -EXPORT_SYMBOL_GPL(eeh_add_device_late); - -/** - * eeh_remove_device - undo EEH setup for the indicated pci device - * @dev: pci device to be removed - * - * This routine should be when a device is removed from a running - * system (e.g. by hotplug or dlpar). - */ -void eeh_remove_device(struct pci_dev *dev) -{ - struct device_node *dn; - if (!dev || !eeh_subsystem_enabled) - return; - - /* Unregister the device with the EEH/PCI address search system */ -#ifdef DEBUG - printk(KERN_DEBUG "EEH: remove device %s\n", pci_name(dev)); -#endif - pci_addr_cache_remove_device(dev); - - dn = pci_device_to_OF_node(dev); - PCI_DN(dn)->pcidev = NULL; - pci_dev_put (dev); -} -EXPORT_SYMBOL_GPL(eeh_remove_device); - -static int proc_eeh_show(struct seq_file *m, void *v) -{ - unsigned int cpu; - unsigned long ffs = 0, positives = 0, failures = 0; - unsigned long resets = 0; - unsigned long no_dev = 0, no_dn = 0, no_cfg = 0, no_check = 0; - - for_each_cpu(cpu) { - ffs += per_cpu(total_mmio_ffs, cpu); - positives += per_cpu(false_positives, cpu); - failures += per_cpu(ignored_failures, cpu); - resets += per_cpu(slot_resets, cpu); - no_dev += per_cpu(no_device, cpu); - no_dn += per_cpu(no_dn, cpu); - no_cfg += per_cpu(no_cfg_addr, cpu); - no_check += per_cpu(ignored_check, cpu); - } - - if (0 == eeh_subsystem_enabled) { - seq_printf(m, "EEH Subsystem is globally disabled\n"); - seq_printf(m, "eeh_total_mmio_ffs=%ld\n", ffs); - } else { - seq_printf(m, "EEH Subsystem is enabled\n"); - seq_printf(m, - "no device=%ld\n" - "no device node=%ld\n" - "no config address=%ld\n" - "check not wanted=%ld\n" - "eeh_total_mmio_ffs=%ld\n" - "eeh_false_positives=%ld\n" - "eeh_ignored_failures=%ld\n" - "eeh_slot_resets=%ld\n", - no_dev, no_dn, no_cfg, no_check, - ffs, positives, failures, resets); - } - - return 0; -} - -static int proc_eeh_open(struct inode *inode, struct file *file) -{ - return single_open(file, proc_eeh_show, NULL); -} - -static struct file_operations proc_eeh_operations = { - .open = proc_eeh_open, - .read = seq_read, - .llseek = seq_lseek, - .release = single_release, -}; - -static int __init eeh_init_proc(void) -{ - struct proc_dir_entry *e; - - if (systemcfg->platform & PLATFORM_PSERIES) { - e = create_proc_entry("ppc64/eeh", 0, NULL); - if (e) - e->proc_fops = &proc_eeh_operations; - } - - return 0; -} -__initcall(eeh_init_proc); Index: linux-2.6.14-git3/arch/ppc64/kernel/Makefile =================================================================== --- linux-2.6.14-git3.orig/arch/ppc64/kernel/Makefile 2005-11-02 14:29:22.485829789 -0600 +++ linux-2.6.14-git3/arch/ppc64/kernel/Makefile 2005-11-02 14:30:49.805589414 -0600 @@ -35,7 +35,6 @@ bpa_iic.o spider-pic.o obj-$(CONFIG_KEXEC) += machine_kexec.o -obj-$(CONFIG_EEH) += eeh.o obj-$(CONFIG_PROC_FS) += proc_ppc64.o obj-$(CONFIG_RTAS_FLASH) += rtas_flash.o obj-$(CONFIG_SMP) += smp.o Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/Makefile =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/Makefile 2005-10-31 11:19:47.000000000 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/Makefile 2005-11-02 14:31:36.150092654 -0600 @@ -3,3 +3,4 @@ obj-$(CONFIG_SMP) += smp.o obj-$(CONFIG_IBMVIO) += vio.o obj-$(CONFIG_XICS) += xics.o +obj-$(CONFIG_EEH) += eeh.o Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:30:49.790591516 -0600 @@ -0,0 +1,1093 @@ +/* + * eeh.c + * Copyright (C) 2001 Dave Engebretsen & Todd Inglett IBM Corporation + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#undef DEBUG + +/** Overview: + * EEH, or "Extended Error Handling" is a PCI bridge technology for + * dealing with PCI bus errors that can't be dealt with within the + * usual PCI framework, except by check-stopping the CPU. Systems + * that are designed for high-availability/reliability cannot afford + * to crash due to a "mere" PCI error, thus the need for EEH. + * An EEH-capable bridge operates by converting a detected error + * into a "slot freeze", taking the PCI adapter off-line, making + * the slot behave, from the OS'es point of view, as if the slot + * were "empty": all reads return 0xff's and all writes are silently + * ignored. EEH slot isolation events can be triggered by parity + * errors on the address or data busses (e.g. during posted writes), + * which in turn might be caused by low voltage on the bus, dust, + * vibration, humidity, radioactivity or plain-old failed hardware. + * + * Note, however, that one of the leading causes of EEH slot + * freeze events are buggy device drivers, buggy device microcode, + * or buggy device hardware. This is because any attempt by the + * device to bus-master data to a memory address that is not + * assigned to the device will trigger a slot freeze. (The idea + * is to prevent devices-gone-wild from corrupting system memory). + * Buggy hardware/drivers will have a miserable time co-existing + * with EEH. + * + * Ideally, a PCI device driver, when suspecting that an isolation + * event has occured (e.g. by reading 0xff's), will then ask EEH + * whether this is the case, and then take appropriate steps to + * reset the PCI slot, the PCI device, and then resume operations. + * However, until that day, the checking is done here, with the + * eeh_check_failure() routine embedded in the MMIO macros. If + * the slot is found to be isolated, an "EEH Event" is synthesized + * and sent out for processing. + */ + +/* EEH event workqueue setup. */ +static DEFINE_SPINLOCK(eeh_eventlist_lock); +LIST_HEAD(eeh_eventlist); +static void eeh_event_handler(void *); +DECLARE_WORK(eeh_event_wq, eeh_event_handler, NULL); + +static struct notifier_block *eeh_notifier_chain; + +/* If a device driver keeps reading an MMIO register in an interrupt + * handler after a slot isolation event has occurred, we assume it + * is broken and panic. This sets the threshold for how many read + * attempts we allow before panicking. + */ +#define EEH_MAX_FAILS 100000 + +/* RTAS tokens */ +static int ibm_set_eeh_option; +static int ibm_set_slot_reset; +static int ibm_read_slot_reset_state; +static int ibm_read_slot_reset_state2; +static int ibm_slot_error_detail; + +static int eeh_subsystem_enabled; + +/* Lock to avoid races due to multiple reports of an error */ +static DEFINE_SPINLOCK(confirm_error_lock); + +/* Buffer for reporting slot-error-detail rtas calls */ +static unsigned char slot_errbuf[RTAS_ERROR_LOG_MAX]; +static DEFINE_SPINLOCK(slot_errbuf_lock); +static int eeh_error_buf_size; + +/* System monitoring statistics */ +static DEFINE_PER_CPU(unsigned long, no_device); +static DEFINE_PER_CPU(unsigned long, no_dn); +static DEFINE_PER_CPU(unsigned long, no_cfg_addr); +static DEFINE_PER_CPU(unsigned long, ignored_check); +static DEFINE_PER_CPU(unsigned long, total_mmio_ffs); +static DEFINE_PER_CPU(unsigned long, false_positives); +static DEFINE_PER_CPU(unsigned long, ignored_failures); +static DEFINE_PER_CPU(unsigned long, slot_resets); + +/** + * The pci address cache subsystem. This subsystem places + * PCI device address resources into a red-black tree, sorted + * according to the address range, so that given only an i/o + * address, the corresponding PCI device can be **quickly** + * found. It is safe to perform an address lookup in an interrupt + * context; this ability is an important feature. + * + * Currently, the only customer of this code is the EEH subsystem; + * thus, this code has been somewhat tailored to suit EEH better. + * In particular, the cache does *not* hold the addresses of devices + * for which EEH is not enabled. + * + * (Implementation Note: The RB tree seems to be better/faster + * than any hash algo I could think of for this problem, even + * with the penalty of slow pointer chases for d-cache misses). + */ +struct pci_io_addr_range +{ + struct rb_node rb_node; + unsigned long addr_lo; + unsigned long addr_hi; + struct pci_dev *pcidev; + unsigned int flags; +}; + +static struct pci_io_addr_cache +{ + struct rb_root rb_root; + spinlock_t piar_lock; +} pci_io_addr_cache_root; + +static inline struct pci_dev *__pci_get_device_by_addr(unsigned long addr) +{ + struct rb_node *n = pci_io_addr_cache_root.rb_root.rb_node; + + while (n) { + struct pci_io_addr_range *piar; + piar = rb_entry(n, struct pci_io_addr_range, rb_node); + + if (addr < piar->addr_lo) { + n = n->rb_left; + } else { + if (addr > piar->addr_hi) { + n = n->rb_right; + } else { + pci_dev_get(piar->pcidev); + return piar->pcidev; + } + } + } + + return NULL; +} + +/** + * pci_get_device_by_addr - Get device, given only address + * @addr: mmio (PIO) phys address or i/o port number + * + * Given an mmio phys address, or a port number, find a pci device + * that implements this address. Be sure to pci_dev_put the device + * when finished. I/O port numbers are assumed to be offset + * from zero (that is, they do *not* have pci_io_addr added in). + * It is safe to call this function within an interrupt. + */ +static struct pci_dev *pci_get_device_by_addr(unsigned long addr) +{ + struct pci_dev *dev; + unsigned long flags; + + spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags); + dev = __pci_get_device_by_addr(addr); + spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags); + return dev; +} + +#ifdef DEBUG +/* + * Handy-dandy debug print routine, does nothing more + * than print out the contents of our addr cache. + */ +static void pci_addr_cache_print(struct pci_io_addr_cache *cache) +{ + struct rb_node *n; + int cnt = 0; + + n = rb_first(&cache->rb_root); + while (n) { + struct pci_io_addr_range *piar; + piar = rb_entry(n, struct pci_io_addr_range, rb_node); + printk(KERN_DEBUG "PCI: %s addr range %d [%lx-%lx]: %s\n", + (piar->flags & IORESOURCE_IO) ? "i/o" : "mem", cnt, + piar->addr_lo, piar->addr_hi, pci_name(piar->pcidev)); + cnt++; + n = rb_next(n); + } +} +#endif + +/* Insert address range into the rb tree. */ +static struct pci_io_addr_range * +pci_addr_cache_insert(struct pci_dev *dev, unsigned long alo, + unsigned long ahi, unsigned int flags) +{ + struct rb_node **p = &pci_io_addr_cache_root.rb_root.rb_node; + struct rb_node *parent = NULL; + struct pci_io_addr_range *piar; + + /* Walk tree, find a place to insert into tree */ + while (*p) { + parent = *p; + piar = rb_entry(parent, struct pci_io_addr_range, rb_node); + if (ahi < piar->addr_lo) { + p = &parent->rb_left; + } else if (alo > piar->addr_hi) { + p = &parent->rb_right; + } else { + if (dev != piar->pcidev || + alo != piar->addr_lo || ahi != piar->addr_hi) { + printk(KERN_WARNING "PIAR: overlapping address range\n"); + } + return piar; + } + } + piar = (struct pci_io_addr_range *)kmalloc(sizeof(struct pci_io_addr_range), GFP_ATOMIC); + if (!piar) + return NULL; + + piar->addr_lo = alo; + piar->addr_hi = ahi; + piar->pcidev = dev; + piar->flags = flags; + +#ifdef DEBUG + printk(KERN_DEBUG "PIAR: insert range=[%lx:%lx] dev=%s\n", + alo, ahi, pci_name (dev)); +#endif + + rb_link_node(&piar->rb_node, parent, p); + rb_insert_color(&piar->rb_node, &pci_io_addr_cache_root.rb_root); + + return piar; +} + +static void __pci_addr_cache_insert_device(struct pci_dev *dev) +{ + struct device_node *dn; + struct pci_dn *pdn; + int i; + int inserted = 0; + + dn = pci_device_to_OF_node(dev); + if (!dn) { + printk(KERN_WARNING "PCI: no pci dn found for dev=%s\n", pci_name(dev)); + return; + } + + /* Skip any devices for which EEH is not enabled. */ + pdn = PCI_DN(dn); + if (!(pdn->eeh_mode & EEH_MODE_SUPPORTED) || + pdn->eeh_mode & EEH_MODE_NOCHECK) { +#ifdef DEBUG + printk(KERN_INFO "PCI: skip building address cache for=%s - %s\n", + pci_name(dev), pdn->node->full_name); +#endif + return; + } + + /* The cache holds a reference to the device... */ + pci_dev_get(dev); + + /* Walk resources on this device, poke them into the tree */ + for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) { + unsigned long start = pci_resource_start(dev,i); + unsigned long end = pci_resource_end(dev,i); + unsigned int flags = pci_resource_flags(dev,i); + + /* We are interested only bus addresses, not dma or other stuff */ + if (0 == (flags & (IORESOURCE_IO | IORESOURCE_MEM))) + continue; + if (start == 0 || ~start == 0 || end == 0 || ~end == 0) + continue; + pci_addr_cache_insert(dev, start, end, flags); + inserted = 1; + } + + /* If there was nothing to add, the cache has no reference... */ + if (!inserted) + pci_dev_put(dev); +} + +/** + * pci_addr_cache_insert_device - Add a device to the address cache + * @dev: PCI device whose I/O addresses we are interested in. + * + * In order to support the fast lookup of devices based on addresses, + * we maintain a cache of devices that can be quickly searched. + * This routine adds a device to that cache. + */ +static void pci_addr_cache_insert_device(struct pci_dev *dev) +{ + unsigned long flags; + + spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags); + __pci_addr_cache_insert_device(dev); + spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags); +} + +static inline void __pci_addr_cache_remove_device(struct pci_dev *dev) +{ + struct rb_node *n; + int removed = 0; + +restart: + n = rb_first(&pci_io_addr_cache_root.rb_root); + while (n) { + struct pci_io_addr_range *piar; + piar = rb_entry(n, struct pci_io_addr_range, rb_node); + + if (piar->pcidev == dev) { + rb_erase(n, &pci_io_addr_cache_root.rb_root); + removed = 1; + kfree(piar); + goto restart; + } + n = rb_next(n); + } + + /* The cache no longer holds its reference to this device... */ + if (removed) + pci_dev_put(dev); +} + +/** + * pci_addr_cache_remove_device - remove pci device from addr cache + * @dev: device to remove + * + * Remove a device from the addr-cache tree. + * This is potentially expensive, since it will walk + * the tree multiple times (once per resource). + * But so what; device removal doesn't need to be that fast. + */ +static void pci_addr_cache_remove_device(struct pci_dev *dev) +{ + unsigned long flags; + + spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags); + __pci_addr_cache_remove_device(dev); + spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags); +} + +/** + * pci_addr_cache_build - Build a cache of I/O addresses + * + * Build a cache of pci i/o addresses. This cache will be used to + * find the pci device that corresponds to a given address. + * This routine scans all pci busses to build the cache. + * Must be run late in boot process, after the pci controllers + * have been scaned for devices (after all device resources are known). + */ +void __init pci_addr_cache_build(void) +{ + struct pci_dev *dev = NULL; + + if (!eeh_subsystem_enabled) + return; + + spin_lock_init(&pci_io_addr_cache_root.piar_lock); + + while ((dev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) { + /* Ignore PCI bridges ( XXX why ??) */ + if ((dev->class >> 16) == PCI_BASE_CLASS_BRIDGE) { + continue; + } + pci_addr_cache_insert_device(dev); + } + +#ifdef DEBUG + /* Verify tree built up above, echo back the list of addrs. */ + pci_addr_cache_print(&pci_io_addr_cache_root); +#endif +} + +/* --------------------------------------------------------------- */ +/* Above lies the PCI Address Cache. Below lies the EEH event infrastructure */ + +void eeh_slot_error_detail (struct pci_dn *pdn, int severity) +{ + unsigned long flags; + int rc; + + /* Log the error with the rtas logger */ + spin_lock_irqsave(&slot_errbuf_lock, flags); + memset(slot_errbuf, 0, eeh_error_buf_size); + + rc = rtas_call(ibm_slot_error_detail, + 8, 1, NULL, pdn->eeh_config_addr, + BUID_HI(pdn->phb->buid), + BUID_LO(pdn->phb->buid), NULL, 0, + virt_to_phys(slot_errbuf), + eeh_error_buf_size, + severity); + + if (rc == 0) + log_error(slot_errbuf, ERR_TYPE_RTAS_LOG, 0); + spin_unlock_irqrestore(&slot_errbuf_lock, flags); +} + +/** + * eeh_register_notifier - Register to find out about EEH events. + * @nb: notifier block to callback on events + */ +int eeh_register_notifier(struct notifier_block *nb) +{ + return notifier_chain_register(&eeh_notifier_chain, nb); +} + +/** + * eeh_unregister_notifier - Unregister to an EEH event notifier. + * @nb: notifier block to callback on events + */ +int eeh_unregister_notifier(struct notifier_block *nb) +{ + return notifier_chain_unregister(&eeh_notifier_chain, nb); +} + +/** + * read_slot_reset_state - Read the reset state of a device node's slot + * @dn: device node to read + * @rets: array to return results in + */ +static int read_slot_reset_state(struct pci_dn *pdn, int rets[]) +{ + int token, outputs; + + if (ibm_read_slot_reset_state2 != RTAS_UNKNOWN_SERVICE) { + token = ibm_read_slot_reset_state2; + outputs = 4; + } else { + token = ibm_read_slot_reset_state; + rets[2] = 0; /* fake PE Unavailable info */ + outputs = 3; + } + + return rtas_call(token, 3, outputs, rets, pdn->eeh_config_addr, + BUID_HI(pdn->phb->buid), BUID_LO(pdn->phb->buid)); +} + +/** + * eeh_panic - call panic() for an eeh event that cannot be handled. + * The philosophy of this routine is that it is better to panic and + * halt the OS than it is to risk possible data corruption by + * oblivious device drivers that don't know better. + * + * @dev pci device that had an eeh event + * @reset_state current reset state of the device slot + */ +static void eeh_panic(struct pci_dev *dev, int reset_state) +{ + /* + * XXX We should create a separate sysctl for this. + * + * Since the panic_on_oops sysctl is used to halt the system + * in light of potential corruption, we can use it here. + */ + if (panic_on_oops) { + struct device_node *dn = pci_device_to_OF_node(dev); + eeh_slot_error_detail (PCI_DN(dn), 2 /* Permanent Error */); + panic("EEH: MMIO failure (%d) on device:%s\n", reset_state, + pci_name(dev)); + } + else { + __get_cpu_var(ignored_failures)++; + printk(KERN_INFO "EEH: Ignored MMIO failure (%d) on device:%s\n", + reset_state, pci_name(dev)); + } +} + +/** + * eeh_event_handler - dispatch EEH events. The detection of a frozen + * slot can occur inside an interrupt, where it can be hard to do + * anything about it. The goal of this routine is to pull these + * detection events out of the context of the interrupt handler, and + * re-dispatch them for processing at a later time in a normal context. + * + * @dummy - unused + */ +static void eeh_event_handler(void *dummy) +{ + unsigned long flags; + struct eeh_event *event; + + while (1) { + spin_lock_irqsave(&eeh_eventlist_lock, flags); + event = NULL; + if (!list_empty(&eeh_eventlist)) { + event = list_entry(eeh_eventlist.next, struct eeh_event, list); + list_del(&event->list); + } + spin_unlock_irqrestore(&eeh_eventlist_lock, flags); + if (event == NULL) + break; + + printk(KERN_INFO "EEH: MMIO failure (%d), notifiying device " + "%s\n", event->reset_state, + pci_name(event->dev)); + + notifier_call_chain (&eeh_notifier_chain, + EEH_NOTIFY_FREEZE, event); + + pci_dev_put(event->dev); + kfree(event); + } +} + +/** + * eeh_token_to_phys - convert EEH address token to phys address + * @token i/o token, should be address in the form 0xA.... + */ +static inline unsigned long eeh_token_to_phys(unsigned long token) +{ + pte_t *ptep; + unsigned long pa; + + ptep = find_linux_pte(init_mm.pgd, token); + if (!ptep) + return token; + pa = pte_pfn(*ptep) << PAGE_SHIFT; + + return pa | (token & (PAGE_SIZE-1)); +} + +/** + * Return the "partitionable endpoint" (pe) under which this device lies + */ +static struct device_node * find_device_pe(struct device_node *dn) +{ + while ((dn->parent) && PCI_DN(dn->parent) && + (PCI_DN(dn->parent)->eeh_mode & EEH_MODE_SUPPORTED)) { + dn = dn->parent; + } + return dn; +} + +/** Mark all devices that are peers of this device as failed. + * Mark the device driver too, so that it can see the failure + * immediately; this is critical, since some drivers poll + * status registers in interrupts ... If a driver is polling, + * and the slot is frozen, then the driver can deadlock in + * an interrupt context, which is bad. + */ + +static inline void __eeh_mark_slot (struct device_node *dn) +{ + while (dn) { + PCI_DN(dn)->eeh_mode |= EEH_MODE_ISOLATED; + + if (dn->child) + __eeh_mark_slot (dn->child); + dn = dn->sibling; + } +} + +static inline void __eeh_clear_slot (struct device_node *dn) +{ + while (dn) { + PCI_DN(dn)->eeh_mode &= ~EEH_MODE_ISOLATED; + if (dn->child) + __eeh_clear_slot (dn->child); + dn = dn->sibling; + } +} + +static inline void eeh_clear_slot (struct device_node *dn) +{ + unsigned long flags; + spin_lock_irqsave(&confirm_error_lock, flags); + __eeh_clear_slot (dn); + spin_unlock_irqrestore(&confirm_error_lock, flags); +} + +/** + * eeh_dn_check_failure - check if all 1's data is due to EEH slot freeze + * @dn device node + * @dev pci device, if known + * + * Check for an EEH failure for the given device node. Call this + * routine if the result of a read was all 0xff's and you want to + * find out if this is due to an EEH slot freeze. This routine + * will query firmware for the EEH status. + * + * Returns 0 if there has not been an EEH error; otherwise returns + * a non-zero value and queues up a slot isolation event notification. + * + * It is safe to call this routine in an interrupt context. + */ +int eeh_dn_check_failure(struct device_node *dn, struct pci_dev *dev) +{ + int ret; + int rets[3]; + unsigned long flags; + int reset_state; + struct eeh_event *event; + struct pci_dn *pdn; + struct device_node *pe_dn; + int rc = 0; + + __get_cpu_var(total_mmio_ffs)++; + + if (!eeh_subsystem_enabled) + return 0; + + if (!dn) { + __get_cpu_var(no_dn)++; + return 0; + } + pdn = PCI_DN(dn); + + /* Access to IO BARs might get this far and still not want checking. */ + if (!(pdn->eeh_mode & EEH_MODE_SUPPORTED) || + pdn->eeh_mode & EEH_MODE_NOCHECK) { + __get_cpu_var(ignored_check)++; +#ifdef DEBUG + printk ("EEH:ignored check (%x) for %s %s\n", + pdn->eeh_mode, pci_name (dev), dn->full_name); +#endif + return 0; + } + + if (!pdn->eeh_config_addr) { + __get_cpu_var(no_cfg_addr)++; + return 0; + } + + /* If we already have a pending isolation event for this + * slot, we know it's bad already, we don't need to check. + * Do this checking under a lock; as multiple PCI devices + * in one slot might report errors simultaneously, and we + * only want one error recovery routine running. + */ + spin_lock_irqsave(&confirm_error_lock, flags); + rc = 1; + if (pdn->eeh_mode & EEH_MODE_ISOLATED) { + pdn->eeh_check_count ++; + if (pdn->eeh_check_count >= EEH_MAX_FAILS) { + printk (KERN_ERR "EEH: Device driver ignored %d bad reads, panicing\n", + pdn->eeh_check_count); + dump_stack(); + + /* re-read the slot reset state */ + if (read_slot_reset_state(pdn, rets) != 0) + rets[0] = -1; /* reset state unknown */ + + /* If we are here, then we hit an infinite loop. Stop. */ + panic("EEH: MMIO halt (%d) on device:%s\n", rets[0], pci_name(dev)); + } + goto dn_unlock; + } + + /* + * Now test for an EEH failure. This is VERY expensive. + * Note that the eeh_config_addr may be a parent device + * in the case of a device behind a bridge, or it may be + * function zero of a multi-function device. + * In any case they must share a common PHB. + */ + ret = read_slot_reset_state(pdn, rets); + + /* If the call to firmware failed, punt */ + if (ret != 0) { + printk(KERN_WARNING "EEH: read_slot_reset_state() failed; rc=%d dn=%s\n", + ret, dn->full_name); + __get_cpu_var(false_positives)++; + rc = 0; + goto dn_unlock; + } + + /* If EEH is not supported on this device, punt. */ + if (rets[1] != 1) { + printk(KERN_WARNING "EEH: event on unsupported device, rc=%d dn=%s\n", + ret, dn->full_name); + __get_cpu_var(false_positives)++; + rc = 0; + goto dn_unlock; + } + + /* If not the kind of error we know about, punt. */ + if (rets[0] != 2 && rets[0] != 4 && rets[0] != 5) { + __get_cpu_var(false_positives)++; + rc = 0; + goto dn_unlock; + } + + /* Note that config-io to empty slots may fail; + * we recognize empty because they don't have children. */ + if ((rets[0] == 5) && (dn->child == NULL)) { + __get_cpu_var(false_positives)++; + rc = 0; + goto dn_unlock; + } + + __get_cpu_var(slot_resets)++; + + /* Avoid repeated reports of this failure, including problems + * with other functions on this device, and functions under + * bridges. */ + pe_dn = find_device_pe (dn); + __eeh_mark_slot (pe_dn); + spin_unlock_irqrestore(&confirm_error_lock, flags); + + reset_state = rets[0]; + + eeh_slot_error_detail (pdn, 1 /* Temporary Error */); + + printk(KERN_INFO "EEH: MMIO failure (%d) on device: %s %s\n", + rets[0], dn->name, dn->full_name); + event = kmalloc(sizeof(*event), GFP_ATOMIC); + if (event == NULL) { + eeh_panic(dev, reset_state); + return 1; + } + + event->dev = dev; + event->dn = dn; + event->reset_state = reset_state; + + /* We may or may not be called in an interrupt context */ + spin_lock_irqsave(&eeh_eventlist_lock, flags); + list_add(&event->list, &eeh_eventlist); + spin_unlock_irqrestore(&eeh_eventlist_lock, flags); + + /* Most EEH events are due to device driver bugs. Having + * a stack trace will help the device-driver authors figure + * out what happened. So print that out. */ + if (rets[0] != 5) dump_stack(); + schedule_work(&eeh_event_wq); + + return 1; + +dn_unlock: + spin_unlock_irqrestore(&confirm_error_lock, flags); + return rc; +} + +EXPORT_SYMBOL_GPL(eeh_dn_check_failure); + +/** + * eeh_check_failure - check if all 1's data is due to EEH slot freeze + * @token i/o token, should be address in the form 0xA.... + * @val value, should be all 1's (XXX why do we need this arg??) + * + * Check for an EEH failure at the given token address. Call this + * routine if the result of a read was all 0xff's and you want to + * find out if this is due to an EEH slot freeze event. This routine + * will query firmware for the EEH status. + * + * Note this routine is safe to call in an interrupt context. + */ +unsigned long eeh_check_failure(const volatile void __iomem *token, unsigned long val) +{ + unsigned long addr; + struct pci_dev *dev; + struct device_node *dn; + + /* Finding the phys addr + pci device; this is pretty quick. */ + addr = eeh_token_to_phys((unsigned long __force) token); + dev = pci_get_device_by_addr(addr); + if (!dev) { + __get_cpu_var(no_device)++; + return val; + } + + dn = pci_device_to_OF_node(dev); + eeh_dn_check_failure (dn, dev); + + pci_dev_put(dev); + return val; +} + +EXPORT_SYMBOL(eeh_check_failure); + +struct eeh_early_enable_info { + unsigned int buid_hi; + unsigned int buid_lo; +}; + +/* Enable eeh for the given device node. */ +static void *early_enable_eeh(struct device_node *dn, void *data) +{ + struct eeh_early_enable_info *info = data; + int ret; + char *status = get_property(dn, "status", NULL); + u32 *class_code = (u32 *)get_property(dn, "class-code", NULL); + u32 *vendor_id = (u32 *)get_property(dn, "vendor-id", NULL); + u32 *device_id = (u32 *)get_property(dn, "device-id", NULL); + u32 *regs; + int enable; + struct pci_dn *pdn = PCI_DN(dn); + + pdn->eeh_mode = 0; + pdn->eeh_check_count = 0; + pdn->eeh_freeze_count = 0; + + if (status && strcmp(status, "ok") != 0) + return NULL; /* ignore devices with bad status */ + + /* Ignore bad nodes. */ + if (!class_code || !vendor_id || !device_id) + return NULL; + + /* There is nothing to check on PCI to ISA bridges */ + if (dn->type && !strcmp(dn->type, "isa")) { + pdn->eeh_mode |= EEH_MODE_NOCHECK; + return NULL; + } + + /* + * Now decide if we are going to "Disable" EEH checking + * for this device. We still run with the EEH hardware active, + * but we won't be checking for ff's. This means a driver + * could return bad data (very bad!), an interrupt handler could + * hang waiting on status bits that won't change, etc. + * But there are a few cases like display devices that make sense. + */ + enable = 1; /* i.e. we will do checking */ + if ((*class_code >> 16) == PCI_BASE_CLASS_DISPLAY) + enable = 0; + + if (!enable) + pdn->eeh_mode |= EEH_MODE_NOCHECK; + + /* Ok... see if this device supports EEH. Some do, some don't, + * and the only way to find out is to check each and every one. */ + regs = (u32 *)get_property(dn, "reg", NULL); + if (regs) { + /* First register entry is addr (00BBSS00) */ + /* Try to enable eeh */ + ret = rtas_call(ibm_set_eeh_option, 4, 1, NULL, + regs[0], info->buid_hi, info->buid_lo, + EEH_ENABLE); + if (ret == 0) { + eeh_subsystem_enabled = 1; + pdn->eeh_mode |= EEH_MODE_SUPPORTED; + pdn->eeh_config_addr = regs[0]; +#ifdef DEBUG + printk(KERN_DEBUG "EEH: %s: eeh enabled\n", dn->full_name); +#endif + } else { + + /* This device doesn't support EEH, but it may have an + * EEH parent, in which case we mark it as supported. */ + if (dn->parent && PCI_DN(dn->parent) + && (PCI_DN(dn->parent)->eeh_mode & EEH_MODE_SUPPORTED)) { + /* Parent supports EEH. */ + pdn->eeh_mode |= EEH_MODE_SUPPORTED; + pdn->eeh_config_addr = PCI_DN(dn->parent)->eeh_config_addr; + return NULL; + } + } + } else { + printk(KERN_WARNING "EEH: %s: unable to get reg property.\n", + dn->full_name); + } + + return NULL; +} + +/* + * Initialize EEH by trying to enable it for all of the adapters in the system. + * As a side effect we can determine here if eeh is supported at all. + * Note that we leave EEH on so failed config cycles won't cause a machine + * check. If a user turns off EEH for a particular adapter they are really + * telling Linux to ignore errors. Some hardware (e.g. POWER5) won't + * grant access to a slot if EEH isn't enabled, and so we always enable + * EEH for all slots/all devices. + * + * The eeh-force-off option disables EEH checking globally, for all slots. + * Even if force-off is set, the EEH hardware is still enabled, so that + * newer systems can boot. + */ +void __init eeh_init(void) +{ + struct device_node *phb, *np; + struct eeh_early_enable_info info; + + spin_lock_init(&confirm_error_lock); + spin_lock_init(&slot_errbuf_lock); + + np = of_find_node_by_path("/rtas"); + if (np == NULL) + return; + + ibm_set_eeh_option = rtas_token("ibm,set-eeh-option"); + ibm_set_slot_reset = rtas_token("ibm,set-slot-reset"); + ibm_read_slot_reset_state2 = rtas_token("ibm,read-slot-reset-state2"); + ibm_read_slot_reset_state = rtas_token("ibm,read-slot-reset-state"); + ibm_slot_error_detail = rtas_token("ibm,slot-error-detail"); + + if (ibm_set_eeh_option == RTAS_UNKNOWN_SERVICE) + return; + + eeh_error_buf_size = rtas_token("rtas-error-log-max"); + if (eeh_error_buf_size == RTAS_UNKNOWN_SERVICE) { + eeh_error_buf_size = 1024; + } + if (eeh_error_buf_size > RTAS_ERROR_LOG_MAX) { + printk(KERN_WARNING "EEH: rtas-error-log-max is bigger than allocated " + "buffer ! (%d vs %d)", eeh_error_buf_size, RTAS_ERROR_LOG_MAX); + eeh_error_buf_size = RTAS_ERROR_LOG_MAX; + } + + /* Enable EEH for all adapters. Note that eeh requires buid's */ + for (phb = of_find_node_by_name(NULL, "pci"); phb; + phb = of_find_node_by_name(phb, "pci")) { + unsigned long buid; + + buid = get_phb_buid(phb); + if (buid == 0 || PCI_DN(phb) == NULL) + continue; + + info.buid_lo = BUID_LO(buid); + info.buid_hi = BUID_HI(buid); + traverse_pci_devices(phb, early_enable_eeh, &info); + } + + if (eeh_subsystem_enabled) + printk(KERN_INFO "EEH: PCI Enhanced I/O Error Handling Enabled\n"); + else + printk(KERN_WARNING "EEH: No capable adapters found\n"); +} + +/** + * eeh_add_device_early - enable EEH for the indicated device_node + * @dn: device node for which to set up EEH + * + * This routine must be used to perform EEH initialization for PCI + * devices that were added after system boot (e.g. hotplug, dlpar). + * This routine must be called before any i/o is performed to the + * adapter (inluding any config-space i/o). + * Whether this actually enables EEH or not for this device depends + * on the CEC architecture, type of the device, on earlier boot + * command-line arguments & etc. + */ +void eeh_add_device_early(struct device_node *dn) +{ + struct pci_controller *phb; + struct eeh_early_enable_info info; + + if (!dn || !PCI_DN(dn)) + return; + phb = PCI_DN(dn)->phb; + if (NULL == phb || 0 == phb->buid) { + printk(KERN_WARNING "EEH: Expected buid but found none for %s\n", + dn->full_name); + dump_stack(); + return; + } + + info.buid_hi = BUID_HI(phb->buid); + info.buid_lo = BUID_LO(phb->buid); + early_enable_eeh(dn, &info); +} +EXPORT_SYMBOL_GPL(eeh_add_device_early); + +/** + * eeh_add_device_late - perform EEH initialization for the indicated pci device + * @dev: pci device for which to set up EEH + * + * This routine must be used to complete EEH initialization for PCI + * devices that were added after system boot (e.g. hotplug, dlpar). + */ +void eeh_add_device_late(struct pci_dev *dev) +{ + struct device_node *dn; + + if (!dev || !eeh_subsystem_enabled) + return; + +#ifdef DEBUG + printk(KERN_DEBUG "EEH: adding device %s\n", pci_name(dev)); +#endif + + pci_dev_get (dev); + dn = pci_device_to_OF_node(dev); + PCI_DN(dn)->pcidev = dev; + + pci_addr_cache_insert_device (dev); +} +EXPORT_SYMBOL_GPL(eeh_add_device_late); + +/** + * eeh_remove_device - undo EEH setup for the indicated pci device + * @dev: pci device to be removed + * + * This routine should be when a device is removed from a running + * system (e.g. by hotplug or dlpar). + */ +void eeh_remove_device(struct pci_dev *dev) +{ + struct device_node *dn; + if (!dev || !eeh_subsystem_enabled) + return; + + /* Unregister the device with the EEH/PCI address search system */ +#ifdef DEBUG + printk(KERN_DEBUG "EEH: remove device %s\n", pci_name(dev)); +#endif + pci_addr_cache_remove_device(dev); + + dn = pci_device_to_OF_node(dev); + PCI_DN(dn)->pcidev = NULL; + pci_dev_put (dev); +} +EXPORT_SYMBOL_GPL(eeh_remove_device); + +static int proc_eeh_show(struct seq_file *m, void *v) +{ + unsigned int cpu; + unsigned long ffs = 0, positives = 0, failures = 0; + unsigned long resets = 0; + unsigned long no_dev = 0, no_dn = 0, no_cfg = 0, no_check = 0; + + for_each_cpu(cpu) { + ffs += per_cpu(total_mmio_ffs, cpu); + positives += per_cpu(false_positives, cpu); + failures += per_cpu(ignored_failures, cpu); + resets += per_cpu(slot_resets, cpu); + no_dev += per_cpu(no_device, cpu); + no_dn += per_cpu(no_dn, cpu); + no_cfg += per_cpu(no_cfg_addr, cpu); + no_check += per_cpu(ignored_check, cpu); + } + + if (0 == eeh_subsystem_enabled) { + seq_printf(m, "EEH Subsystem is globally disabled\n"); + seq_printf(m, "eeh_total_mmio_ffs=%ld\n", ffs); + } else { + seq_printf(m, "EEH Subsystem is enabled\n"); + seq_printf(m, + "no device=%ld\n" + "no device node=%ld\n" + "no config address=%ld\n" + "check not wanted=%ld\n" + "eeh_total_mmio_ffs=%ld\n" + "eeh_false_positives=%ld\n" + "eeh_ignored_failures=%ld\n" + "eeh_slot_resets=%ld\n", + no_dev, no_dn, no_cfg, no_check, + ffs, positives, failures, resets); + } + + return 0; +} + +static int proc_eeh_open(struct inode *inode, struct file *file) +{ + return single_open(file, proc_eeh_show, NULL); +} + +static struct file_operations proc_eeh_operations = { + .open = proc_eeh_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static int __init eeh_init_proc(void) +{ + struct proc_dir_entry *e; + + if (systemcfg->platform & PLATFORM_PSERIES) { + e = create_proc_entry("ppc64/eeh", 0, NULL); + if (e) + e->proc_fops = &proc_eeh_operations; + } + + return 0; +} +__initcall(eeh_init_proc); From linas at linas.org Fri Nov 4 11:50:04 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:50:04 -0600 Subject: [PATCH 12/42]: ppc64: PCI error event dispatcher References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005004.GA26878@mail.gnucash.org> 12-eeh-event-dispatcher.patch ppc64: EEH Recovery dispatcher thread This patch adds a mechanism to create recovery threads when an EEH event is received. Since an EEH freeze state may be detected within an interrupt context, we need to get out of the interrupt context before starting recovery. This dispatcher does this in two steps: first, it uses a workqueue to get out, and then lanuches a kernel thread, so that the recovery routine can sleep for exteded periods without upseting the keventd. A kernel thread is created with each EEH event, rather than having one long-running daemon started at boot time. This is because it is anticipated that EEH events will be very rare (very very rare, ideally) and so its pointless to cluter the process tables with a daemon that will almost never run. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:30:49.790591516 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:32:35.713742506 -0600 @@ -19,7 +19,6 @@ #include #include -#include #include #include #include @@ -27,12 +26,12 @@ #include #include #include +#include #include #include +#include #include -#include #include -#include #undef DEBUG @@ -70,14 +69,6 @@ * and sent out for processing. */ -/* EEH event workqueue setup. */ -static DEFINE_SPINLOCK(eeh_eventlist_lock); -LIST_HEAD(eeh_eventlist); -static void eeh_event_handler(void *); -DECLARE_WORK(eeh_event_wq, eeh_event_handler, NULL); - -static struct notifier_block *eeh_notifier_chain; - /* If a device driver keeps reading an MMIO register in an interrupt * handler after a slot isolation event has occurred, we assume it * is broken and panic. This sets the threshold for how many read @@ -421,24 +412,6 @@ } /** - * eeh_register_notifier - Register to find out about EEH events. - * @nb: notifier block to callback on events - */ -int eeh_register_notifier(struct notifier_block *nb) -{ - return notifier_chain_register(&eeh_notifier_chain, nb); -} - -/** - * eeh_unregister_notifier - Unregister to an EEH event notifier. - * @nb: notifier block to callback on events - */ -int eeh_unregister_notifier(struct notifier_block *nb) -{ - return notifier_chain_unregister(&eeh_notifier_chain, nb); -} - -/** * read_slot_reset_state - Read the reset state of a device node's slot * @dn: device node to read * @rets: array to return results in @@ -461,73 +434,6 @@ } /** - * eeh_panic - call panic() for an eeh event that cannot be handled. - * The philosophy of this routine is that it is better to panic and - * halt the OS than it is to risk possible data corruption by - * oblivious device drivers that don't know better. - * - * @dev pci device that had an eeh event - * @reset_state current reset state of the device slot - */ -static void eeh_panic(struct pci_dev *dev, int reset_state) -{ - /* - * XXX We should create a separate sysctl for this. - * - * Since the panic_on_oops sysctl is used to halt the system - * in light of potential corruption, we can use it here. - */ - if (panic_on_oops) { - struct device_node *dn = pci_device_to_OF_node(dev); - eeh_slot_error_detail (PCI_DN(dn), 2 /* Permanent Error */); - panic("EEH: MMIO failure (%d) on device:%s\n", reset_state, - pci_name(dev)); - } - else { - __get_cpu_var(ignored_failures)++; - printk(KERN_INFO "EEH: Ignored MMIO failure (%d) on device:%s\n", - reset_state, pci_name(dev)); - } -} - -/** - * eeh_event_handler - dispatch EEH events. The detection of a frozen - * slot can occur inside an interrupt, where it can be hard to do - * anything about it. The goal of this routine is to pull these - * detection events out of the context of the interrupt handler, and - * re-dispatch them for processing at a later time in a normal context. - * - * @dummy - unused - */ -static void eeh_event_handler(void *dummy) -{ - unsigned long flags; - struct eeh_event *event; - - while (1) { - spin_lock_irqsave(&eeh_eventlist_lock, flags); - event = NULL; - if (!list_empty(&eeh_eventlist)) { - event = list_entry(eeh_eventlist.next, struct eeh_event, list); - list_del(&event->list); - } - spin_unlock_irqrestore(&eeh_eventlist_lock, flags); - if (event == NULL) - break; - - printk(KERN_INFO "EEH: MMIO failure (%d), notifiying device " - "%s\n", event->reset_state, - pci_name(event->dev)); - - notifier_call_chain (&eeh_notifier_chain, - EEH_NOTIFY_FREEZE, event); - - pci_dev_put(event->dev); - kfree(event); - } -} - -/** * eeh_token_to_phys - convert EEH address token to phys address * @token i/o token, should be address in the form 0xA.... */ @@ -613,8 +519,6 @@ int ret; int rets[3]; unsigned long flags; - int reset_state; - struct eeh_event *event; struct pci_dn *pdn; struct device_node *pe_dn; int rc = 0; @@ -722,33 +626,12 @@ __eeh_mark_slot (pe_dn); spin_unlock_irqrestore(&confirm_error_lock, flags); - reset_state = rets[0]; - - eeh_slot_error_detail (pdn, 1 /* Temporary Error */); - - printk(KERN_INFO "EEH: MMIO failure (%d) on device: %s %s\n", - rets[0], dn->name, dn->full_name); - event = kmalloc(sizeof(*event), GFP_ATOMIC); - if (event == NULL) { - eeh_panic(dev, reset_state); - return 1; - } - - event->dev = dev; - event->dn = dn; - event->reset_state = reset_state; - - /* We may or may not be called in an interrupt context */ - spin_lock_irqsave(&eeh_eventlist_lock, flags); - list_add(&event->list, &eeh_eventlist); - spin_unlock_irqrestore(&eeh_eventlist_lock, flags); - + eeh_send_failure_event (dn, dev, rets[0], rets[2]); + /* Most EEH events are due to device driver bugs. Having * a stack trace will help the device-driver authors figure * out what happened. So print that out. */ if (rets[0] != 5) dump_stack(); - schedule_work(&eeh_event_wq); - return 1; dn_unlock: @@ -793,6 +676,14 @@ EXPORT_SYMBOL(eeh_check_failure); +/* ------------------------------------------------------------- */ +/* The code below deals with enabling EEH for devices during the + * early boot sequence. EEH must be enabled before any PCI probing + * can be done. + */ + +#define EEH_ENABLE 1 + struct eeh_early_enable_info { unsigned int buid_hi; unsigned int buid_lo; @@ -850,8 +741,9 @@ /* First register entry is addr (00BBSS00) */ /* Try to enable eeh */ ret = rtas_call(ibm_set_eeh_option, 4, 1, NULL, - regs[0], info->buid_hi, info->buid_lo, - EEH_ENABLE); + regs[0], info->buid_hi, info->buid_lo, + EEH_ENABLE); + if (ret == 0) { eeh_subsystem_enabled = 1; pdn->eeh_mode |= EEH_MODE_SUPPORTED; Index: linux-2.6.14-git3/include/asm-powerpc/eeh_event.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.14-git3/include/asm-powerpc/eeh_event.h 2005-11-02 14:32:35.718741805 -0600 @@ -0,0 +1,52 @@ +/* + * eeh_event.h + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Copyright (c) 2005 Linas Vepstas + */ + +#ifndef ASM_PPC64_EEH_EVENT_H +#define ASM_PPC64_EEH_EVENT_H + +/** EEH event -- structure holding pci controller data that describes + * a change in the isolation status of a PCI slot. A pointer + * to this struct is passed as the data pointer in a notify callback. + */ +struct eeh_event { + struct list_head list; + struct device_node *dn; /* struct device node */ + struct pci_dev *dev; /* affected device */ + int state; + int time_unavail; /* milliseconds until device might be available */ +}; + +/** + * eeh_send_failure_event - generate a PCI error event + * @dev pci device + * + * This routine builds a PCI error event which will be delivered + * to all listeners on the peh_notifier_chain. + * + * This routine can be called within an interrupt context; + * the actual event will be delivered in a normal context + * (from a workqueue). + */ +int eeh_send_failure_event (struct device_node *dn, + struct pci_dev *dev, + int reset_state, + int time_unavail); + +#endif /* ASM_PPC64_EEH_EVENT_H */ Index: linux-2.6.14-git3/include/asm-ppc64/eeh.h =================================================================== --- linux-2.6.14-git3.orig/include/asm-ppc64/eeh.h 2005-11-02 14:29:21.496968403 -0600 +++ linux-2.6.14-git3/include/asm-ppc64/eeh.h 2005-11-02 14:32:35.725740824 -0600 @@ -1,4 +1,4 @@ -/* +/* * eeh.h * Copyright (C) 2001 Dave Engebretsen & Todd Inglett IBM Corporation. * @@ -6,12 +6,12 @@ * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. - * + * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. - * + * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA @@ -27,8 +27,6 @@ struct pci_dev; struct device_node; -struct device_node; -struct notifier_block; #ifdef CONFIG_EEH @@ -37,6 +35,10 @@ #define EEH_MODE_NOCHECK (1<<1) #define EEH_MODE_ISOLATED (1<<2) +/* Max number of EEH freezes allowed before we consider the device + * to be permanently disabled. */ +#define EEH_MAX_ALLOWED_FREEZES 5 + void __init eeh_init(void); unsigned long eeh_check_failure(const volatile void __iomem *token, unsigned long val); @@ -59,36 +61,14 @@ * eeh_remove_device - undo EEH setup for the indicated pci device * @dev: pci device to be removed * - * This routine should be when a device is removed from a running - * system (e.g. by hotplug or dlpar). + * This routine should be called when a device is removed from + * a running system (e.g. by hotplug or dlpar). It unregisters + * the PCI device from the EEH subsystem. I/O errors affecting + * this device will no longer be detected after this call; thus, + * i/o errors affecting this slot may leave this device unusable. */ void eeh_remove_device(struct pci_dev *); -#define EEH_DISABLE 0 -#define EEH_ENABLE 1 -#define EEH_RELEASE_LOADSTORE 2 -#define EEH_RELEASE_DMA 3 - -/** - * Notifier event flags. - */ -#define EEH_NOTIFY_FREEZE 1 - -/** EEH event -- structure holding pci slot data that describes - * a change in the isolation status of a PCI slot. A pointer - * to this struct is passed as the data pointer in a notify callback. - */ -struct eeh_event { - struct list_head list; - struct pci_dev *dev; - struct device_node *dn; - int reset_state; -}; - -/** Register to find out about EEH events. */ -int eeh_register_notifier(struct notifier_block *nb); -int eeh_unregister_notifier(struct notifier_block *nb); - /** * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure. * @@ -129,7 +109,7 @@ #define EEH_IO_ERROR_VALUE(size) (-1UL) #endif /* CONFIG_EEH */ -/* +/* * MMIO read/write operations with EEH support. */ static inline u8 eeh_readb(const volatile void __iomem *addr) Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh_event.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh_event.c 2005-11-02 14:32:35.731739983 -0600 @@ -0,0 +1,155 @@ +/* + * eeh_event.c + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Copyright (c) 2005 Linas Vepstas + */ + +#include +#include +#include + +/** Overview: + * EEH error states may be detected within exception handlers; + * however, the recovery processing needs to occur asynchronously + * in a normal kernel context and not an interrupt context. + * This pair of routines creates an event and queues it onto a + * work-queue, where a worker thread can drive recovery. + */ + +/* EEH event workqueue setup. */ +static spinlock_t eeh_eventlist_lock = SPIN_LOCK_UNLOCKED; +LIST_HEAD(eeh_eventlist); +static void eeh_thread_launcher(void *); +DECLARE_WORK(eeh_event_wq, eeh_thread_launcher, NULL); + +/** + * eeh_panic - call panic() for an eeh event that cannot be handled. + * The philosophy of this routine is that it is better to panic and + * halt the OS than it is to risk possible data corruption by + * oblivious device drivers that don't know better. + * + * @dev pci device that had an eeh event + * @reset_state current reset state of the device slot + */ +static void eeh_panic(struct pci_dev *dev, int reset_state) +{ + /* + * Since the panic_on_oops sysctl is used to halt the system + * in light of potential corruption, we can use it here. + */ + if (panic_on_oops) { + panic("EEH: MMIO failure (%d) on device:%s\n", reset_state, + pci_name(dev)); + } + else { + printk(KERN_INFO "EEH: Ignored MMIO failure (%d) on device:%s\n", + reset_state, pci_name(dev)); + } +} + +/** + * eeh_event_handler - dispatch EEH events. The detection of a frozen + * slot can occur inside an interrupt, where it can be hard to do + * anything about it. The goal of this routine is to pull these + * detection events out of the context of the interrupt handler, and + * re-dispatch them for processing at a later time in a normal context. + * + * @dummy - unused + */ +static int eeh_event_handler(void * dummy) +{ + unsigned long flags; + struct eeh_event *event; + + daemonize ("eehd"); + + while (1) { + set_current_state(TASK_INTERRUPTIBLE); + + spin_lock_irqsave(&eeh_eventlist_lock, flags); + event = NULL; + if (!list_empty(&eeh_eventlist)) { + event = list_entry(eeh_eventlist.next, struct eeh_event, list); + list_del(&event->list); + } + spin_unlock_irqrestore(&eeh_eventlist_lock, flags); + if (event == NULL) + break; + + printk(KERN_INFO "EEH: Detected PCI bus error on device %s\n", + pci_name(event->dev)); + + eeh_panic (event->dev, event->state); + + kfree(event); + } + + return 0; +} + +/** + * eeh_thread_launcher + * + * @dummy - unused + */ +static void eeh_thread_launcher(void *dummy) +{ + if (kernel_thread(eeh_event_handler, NULL, CLONE_KERNEL) < 0) + printk(KERN_ERR "Failed to start EEH daemon\n"); +} + +/** + * eeh_send_failure_event - generate a PCI error event + * @dev pci device + * + * This routine can be called within an interrupt context; + * the actual event will be delivered in a normal context + * (from a workqueue). + */ +int eeh_send_failure_event (struct device_node *dn, + struct pci_dev *dev, + int state, + int time_unavail) +{ + unsigned long flags; + struct eeh_event *event; + + event = kmalloc(sizeof(*event), GFP_ATOMIC); + if (event == NULL) { + printk (KERN_ERR "EEH: out of memory, event not handled\n"); + return 1; + } + + if (dev) + pci_dev_get(dev); + + event->dn = dn; + event->dev = dev; + event->state = state; + event->time_unavail = time_unavail; + + /* We may or may not be called in an interrupt context */ + spin_lock_irqsave(&eeh_eventlist_lock, flags); + list_add(&event->list, &eeh_eventlist); + spin_unlock_irqrestore(&eeh_eventlist_lock, flags); + + schedule_work(&eeh_event_wq); + + return 0; +} + +/********************** END OF FILE ******************************/ Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/Makefile =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/Makefile 2005-11-02 14:31:36.150092654 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/Makefile 2005-11-02 14:32:55.306995693 -0600 @@ -3,4 +3,4 @@ obj-$(CONFIG_SMP) += smp.o obj-$(CONFIG_IBMVIO) += vio.o obj-$(CONFIG_XICS) += xics.o -obj-$(CONFIG_EEH) += eeh.o +obj-$(CONFIG_EEH) += eeh.o eeh_event.o From linas at linas.org Fri Nov 4 11:50:10 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:50:10 -0600 Subject: [PATCH 13/42]: ppc64: PCI reset support routines References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005010.GA26901@mail.gnucash.org> 13-eeh-recovery-support-routines.patch EEH Recovery support routines This patch adds routines required to help drive the recovery of EEH-frozen slots. The main function is to drive the PCI #RST signal line high for a qurter of a second, and then allow for a second & a half of settle time. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/include/asm-powerpc/ppc-pci.h =================================================================== --- linux-2.6.14-git3.orig/include/asm-powerpc/ppc-pci.h 2005-11-02 14:29:20.596094683 -0600 +++ linux-2.6.14-git3/include/asm-powerpc/ppc-pci.h 2005-11-02 14:33:42.083437903 -0600 @@ -51,4 +51,18 @@ extern unsigned long pci_assign_all_buses; extern int pci_read_irq_line(struct pci_dev *pci_dev); +/* ---- EEH internal-use-only related routines ---- */ +#ifdef CONFIG_EEH +/** + * rtas_set_slot_reset -- unfreeze a frozen slot + * + * Clear the EEH-frozen condition on a slot. This routine + * does this by asserting the PCI #RST line for 1/8th of + * a second; this routine will sleep while the adapter is + * being reset. + */ +void rtas_set_slot_reset (struct pci_dn *); + +#endif + #endif /* _ASM_POWERPC_PPC_PCI_H */ Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:32:35.713742506 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:33:42.096436081 -0600 @@ -17,6 +17,7 @@ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ +#include #include #include #include @@ -677,6 +678,104 @@ EXPORT_SYMBOL(eeh_check_failure); /* ------------------------------------------------------------- */ +/* The code below deals with error recovery */ + +/** Return negative value if a permanent error, else return + * a number of milliseconds to wait until the PCI slot is + * ready to be used. + */ +static int +eeh_slot_availability(struct pci_dn *pdn) +{ + int rc; + int rets[3]; + + rc = read_slot_reset_state(pdn, rets); + + if (rc) return rc; + + if (rets[1] == 0) return -1; /* EEH is not supported */ + if (rets[0] == 0) return 0; /* Oll Korrect */ + if (rets[0] == 5) { + if (rets[2] == 0) return -1; /* permanently unavailable */ + return rets[2]; /* number of millisecs to wait */ + } + return -1; +} + +/** rtas_pci_slot_reset raises/lowers the pci #RST line + * state: 1/0 to raise/lower the #RST + * + * Clear the EEH-frozen condition on a slot. This routine + * asserts the PCI #RST line if the 'state' argument is '1', + * and drops the #RST line if 'state is '0'. This routine is + * safe to call in an interrupt context. + * + */ + +static void +rtas_pci_slot_reset(struct pci_dn *pdn, int state) +{ + int rc; + + BUG_ON (pdn==NULL); + + if (!pdn->phb) { + printk (KERN_WARNING "EEH: in slot reset, device node %s has no phb\n", + pdn->node->full_name); + return; + } + + rc = rtas_call(ibm_set_slot_reset,4,1, NULL, + pdn->eeh_config_addr, + BUID_HI(pdn->phb->buid), + BUID_LO(pdn->phb->buid), + state); + if (rc) { + printk (KERN_WARNING "EEH: Unable to reset the failed slot, (%d) #RST=%d dn=%s\n", + rc, state, pdn->node->full_name); + return; + } + + if (state == 0) + eeh_clear_slot (pdn->node->parent->child); +} + +/** rtas_set_slot_reset -- assert the pci #RST line for 1/4 second + * dn -- device node to be reset. + */ + +void +rtas_set_slot_reset(struct pci_dn *pdn) +{ + int i, rc; + + rtas_pci_slot_reset (pdn, 1); + + /* The PCI bus requires that the reset be held high for at least + * a 100 milliseconds. We wait a bit longer 'just in case'. */ + +#define PCI_BUS_RST_HOLD_TIME_MSEC 250 + msleep (PCI_BUS_RST_HOLD_TIME_MSEC); + rtas_pci_slot_reset (pdn, 0); + + /* After a PCI slot has been reset, the PCI Express spec requires + * a 1.5 second idle time for the bus to stabilize, before starting + * up traffic. */ +#define PCI_BUS_SETTLE_TIME_MSEC 1800 + msleep (PCI_BUS_SETTLE_TIME_MSEC); + + /* Now double check with the firmware to make sure the device is + * ready to be used; if not, wait for recovery. */ + for (i=0; i<10; i++) { + rc = eeh_slot_availability (pdn); + if (rc <= 0) break; + + msleep (rc+100); + } +} + +/* ------------------------------------------------------------- */ /* The code below deals with enabling EEH for devices during the * early boot sequence. EEH must be enabled before any PCI probing * can be done. From linas at linas.org Fri Nov 4 11:50:17 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:50:17 -0600 Subject: [PATCH 14/42]: ppc64: Save & restore of PCI device BARS References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005017.GA26911@mail.gnucash.org> 14-eeh-device-bar-save.patch After a PCI device has been resest, the device BAR's and other config space info must be restored to the same state as they were in when the firmware first handed us this device. This will allow the PCI device driver, when restarted, to correctly recognize and set up the device. Tis patch saves the device config space as early as reasonable after the firmware has handed over the device. Te state resore funcion is inteded for use by the EEH recovery routines. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:33:42.096436081 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:34:19.926132452 -0600 @@ -77,6 +77,9 @@ */ #define EEH_MAX_FAILS 100000 +/* Misc forward declaraions */ +static void eeh_save_bars(struct pci_dev * pdev, struct pci_dn *pdn); + /* RTAS tokens */ static int ibm_set_eeh_option; static int ibm_set_slot_reset; @@ -366,6 +369,7 @@ */ void __init pci_addr_cache_build(void) { + struct device_node *dn; struct pci_dev *dev = NULL; if (!eeh_subsystem_enabled) @@ -379,6 +383,10 @@ continue; } pci_addr_cache_insert_device(dev); + + /* Save the BAR's; firmware doesn't restore these after EEH reset */ + dn = pci_device_to_OF_node(dev); + eeh_save_bars(dev, PCI_DN(dn)); } #ifdef DEBUG @@ -775,6 +783,108 @@ } } +/* ------------------------------------------------------- */ +/** Save and restore of PCI BARs + * + * Although firmware will set up BARs during boot, it doesn't + * set up device BAR's after a device reset, although it will, + * if requested, set up bridge configuration. Thus, we need to + * configure the PCI devices ourselves. + */ + +/** + * __restore_bars - Restore the Base Address Registers + * Loads the PCI configuration space base address registers, + * the expansion ROM base address, the latency timer, and etc. + * from the saved values in the device node. + */ +static inline void __restore_bars (struct pci_dn *pdn) +{ + int i; + + if (NULL==pdn->phb) return; + for (i=4; i<10; i++) { + rtas_write_config(pdn, i*4, 4, pdn->config_space[i]); + } + + /* 12 == Expansion ROM Address */ + rtas_write_config(pdn, 12*4, 4, pdn->config_space[12]); + +#define BYTE_SWAP(OFF) (8*((OFF)/4)+3-(OFF)) +#define SAVED_BYTE(OFF) (((u8 *)(pdn->config_space))[BYTE_SWAP(OFF)]) + + rtas_write_config (pdn, PCI_CACHE_LINE_SIZE, 1, + SAVED_BYTE(PCI_CACHE_LINE_SIZE)); + + rtas_write_config (pdn, PCI_LATENCY_TIMER, 1, + SAVED_BYTE(PCI_LATENCY_TIMER)); + + /* max latency, min grant, interrupt pin and line */ + rtas_write_config(pdn, 15*4, 4, pdn->config_space[15]); +} + +/** + * eeh_restore_bars - restore the PCI config space info + * + * This routine performs a recursive walk to the children + * of this device as well. + */ +void eeh_restore_bars(struct pci_dn *pdn) +{ + struct device_node *dn; + if (!pdn) + return; + + if (! pdn->eeh_is_bridge) + __restore_bars (pdn); + + dn = pdn->node->child; + while (dn) { + eeh_restore_bars (PCI_DN(dn)); + dn = dn->sibling; + } +} + +/** + * eeh_save_bars - save device bars + * + * Save the values of the device bars. Unlike the restore + * routine, this routine is *not* recursive. This is because + * PCI devices are added individuallly; but, for the restore, + * an entire slot is reset at a time. + */ +static void eeh_save_bars(struct pci_dev * pdev, struct pci_dn *pdn) +{ + int i; + + if (!pdev || !pdn ) + return; + + for (i = 0; i < 16; i++) + pci_read_config_dword(pdev, i * 4, &pdn->config_space[i]); + + if (pdev->hdr_type == PCI_HEADER_TYPE_BRIDGE) + pdn->eeh_is_bridge = 1; +} + +void +rtas_configure_bridge(struct pci_dn *pdn) +{ + int token = rtas_token ("ibm,configure-bridge"); + int rc; + + if (token == RTAS_UNKNOWN_SERVICE) + return; + rc = rtas_call(token,3,1, NULL, + pdn->eeh_config_addr, + BUID_HI(pdn->phb->buid), + BUID_LO(pdn->phb->buid)); + if (rc) { + printk (KERN_WARNING "EEH: Unable to configure device bridge (%d) for %s\n", + rc, pdn->node->full_name); + } +} + /* ------------------------------------------------------------- */ /* The code below deals with enabling EEH for devices during the * early boot sequence. EEH must be enabled before any PCI probing @@ -977,6 +1087,7 @@ void eeh_add_device_late(struct pci_dev *dev) { struct device_node *dn; + struct pci_dn *pdn; if (!dev || !eeh_subsystem_enabled) return; @@ -987,9 +1098,11 @@ pci_dev_get (dev); dn = pci_device_to_OF_node(dev); - PCI_DN(dn)->pcidev = dev; + pdn = PCI_DN(dn); + pdn->pcidev = dev; pci_addr_cache_insert_device (dev); + eeh_save_bars(dev, pdn); } EXPORT_SYMBOL_GPL(eeh_add_device_late); Index: linux-2.6.14-git3/include/asm-powerpc/ppc-pci.h =================================================================== --- linux-2.6.14-git3.orig/include/asm-powerpc/ppc-pci.h 2005-11-02 14:33:42.083437903 -0600 +++ linux-2.6.14-git3/include/asm-powerpc/ppc-pci.h 2005-11-02 14:34:19.931131751 -0600 @@ -63,6 +63,29 @@ */ void rtas_set_slot_reset (struct pci_dn *); +/** + * eeh_restore_bars - Restore device configuration info. + * + * A reset of a PCI device will clear out its config space. + * This routines will restore the config space for this + * device, and is children, to values previously obtained + * from the firmware. + */ +void eeh_restore_bars(struct pci_dn *); + +/** + * rtas_configure_bridge -- firmware initialization of pci bridge + * + * Ask the firmware to configure all PCI bridges devices + * located behind the indicated node. Required after a + * pci device reset. Does essentially the same hing as + * eeh_restore_bars, but for brdges, and lets firmware + * do the work. + */ +void rtas_configure_bridge(struct pci_dn *); + +int rtas_write_config(struct pci_dn *, int where, int size, u32 val); + #endif #endif /* _ASM_POWERPC_PPC_PCI_H */ From linas at linas.org Fri Nov 4 11:50:26 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:50:26 -0600 Subject: [PATCH 15/42]: Documentation: PCI Error Recovery References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005026.GA26919@mail.gnucash.org> 215-pci-error-recovery_docs.patch PCI Error Recovery: documentation patch Various PCI bus errors can be signaled by newer PCI controllers. Recovering from those errors requires an infrastructure to notify affected device drivers of the error, and a way of walking through a reset sequence. This patch adds documentation describing the current error recovery proposal. Signed-off-by: Linas Vepstas Documentation/pci-error-recovery.txt | 246 +++++++++++++++++++++++++++++++++++ MAINTAINERS | 7 2 files changed, 253 insertions(+) Index: linux-2.6.14-git3/Documentation/pci-error-recovery.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.14-git3/Documentation/pci-error-recovery.txt 2005-11-02 14:34:25.663328101 -0600 @@ -0,0 +1,246 @@ + + PCI Error Recovery + ------------------ + May 31, 2005 + + Current document maintainer: + Linas Vepstas + + +Some PCI bus controllers are able to detect certain "hard" PCI errors +on the bus, such as parity errors on the data and address busses, as +well as SERR and PERR errors. These chipsets are then able to disable +I/O to/from the affected device, so that, for example, a bad DMA +address doesn't end up corrupting system memory. These same chipsets +are also able to reset the affected PCI device, and return it to +working condition. This document describes a generic API form +performing error recovery. + +The core idea is that after a PCI error has been detected, there must +be a way for the kernel to coordinate with all affected device drivers +so that the pci card can be made operational again, possibly after +performing a full electrical #RST of the PCI card. The API below +provides a generic API for device drivers to be notified of PCI +errors, and to be notified of, and respond to, a reset sequence. + +Preliminary sketch of API, cut-n-pasted-n-modified email from +Ben Herrenschmidt, circa 5 april 2005 + +The error recovery API support is exposed to the driver in the form of +a structure of function pointers pointed to by a new field in struct +pci_driver. The absence of this pointer in pci_driver denotes an +"non-aware" driver, behaviour on these is platform dependant. +Platforms like ppc64 can try to simulate pci hotplug remove/add. + +The definition of "pci_error_token" is not covered here. It is based on +Seto's work on the synchronous error detection. We still need to define +functions for extracting infos out of an opaque error token. This is +separate from this API. + +This structure has the form: + +struct pci_error_handlers +{ + int (*error_detected)(struct pci_dev *dev, pci_error_token error); + int (*mmio_enabled)(struct pci_dev *dev); + int (*resume)(struct pci_dev *dev); + int (*link_reset)(struct pci_dev *dev); + int (*slot_reset)(struct pci_dev *dev); +}; + +A driver doesn't have to implement all of these callbacks. The +only mandatory one is error_detected(). If a callback is not +implemented, the corresponding feature is considered unsupported. +For example, if mmio_enabled() and resume() aren't there, then the +driver is assumed as not doing any direct recovery and requires +a reset. If link_reset() is not implemented, the card is assumed as +not caring about link resets, in which case, if recover is supported, +the core can try recover (but not slot_reset() unless it really did +reset the slot). If slot_reset() is not supported, link_reset() can +be called instead on a slot reset. + +At first, the call will always be : + + 1) error_detected() + + Error detected. This is sent once after an error has been detected. At +this point, the device might not be accessible anymore depending on the +platform (the slot will be isolated on ppc64). The driver may already +have "noticed" the error because of a failing IO, but this is the proper +"synchronisation point", that is, it gives a chance to the driver to +cleanup, waiting for pending stuff (timers, whatever, etc...) to +complete; it can take semaphores, schedule, etc... everything but touch +the device. Within this function and after it returns, the driver +shouldn't do any new IOs. Called in task context. This is sort of a +"quiesce" point. See note about interrupts at the end of this doc. + + Result codes: + - PCIERR_RESULT_CAN_RECOVER: + Driever returns this if it thinks it might be able to recover + the HW by just banging IOs or if it wants to be given + a chance to extract some diagnostic informations (see + below). + - PCIERR_RESULT_NEED_RESET: + Driver returns this if it thinks it can't recover unless the + slot is reset. + - PCIERR_RESULT_DISCONNECT: + Return this if driver thinks it won't recover at all, + (this will detach the driver ? or just leave it + dangling ? to be decided) + +So at this point, we have called error_detected() for all drivers +on the segment that had the error. On ppc64, the slot is isolated. What +happens now typically depends on the result from the drivers. If all +drivers on the segment/slot return PCIERR_RESULT_CAN_RECOVER, we would +re-enable IOs on the slot (or do nothing special if the platform doesn't +isolate slots) and call 2). If not and we can reset slots, we go to 4), +if neither, we have a dead slot. If it's an hotplug slot, we might +"simulate" reset by triggering HW unplug/replug though. + +>>> Current ppc64 implementation assumes that a device driver will +>>> *not* schedule or semaphore in this routine; the current ppc64 +>>> implementation uses one kernel thread to notify all devices; +>>> thus, of one device sleeps/schedules, all devices are affected. +>>> Doing better requires complex multi-threaded logic in the error +>>> recovery implementation (e.g. waiting for all notification threads +>>> to "join" before proceeding with recovery.) This seems excessively +>>> complex and not worth implementing. + +>>> The current ppc64 implementation doesn't much care if the device +>>> attempts i/o at this point, or not. I/O's will fail, returning +>>> a value of 0xff on read, and writes will be dropped. If the device +>>> driver attempts more than 10K I/O's to a frozen adapter, it will +>>> assume that the device driver has gone into an infinite loop, and +>>> it will panic the the kernel. + + 2) mmio_enabled() + + This is the "early recovery" call. IOs are allowed again, but DMA is +not (hrm... to be discussed, I prefer not), with some restrictions. This +is NOT a callback for the driver to start operations again, only to +peek/poke at the device, extract diagnostic information, if any, and +eventually do things like trigger a device local reset or some such, +but not restart operations. This is sent if all drivers on a segment +agree that they can try to recover and no automatic link reset was +performed by the HW. If the platform can't just re-enable IOs without +a slot reset or a link reset, it doesn't call this callback and goes +directly to 3) or 4). All IOs should be done _synchronously_ from +within this callback, errors triggered by them will be returned via +the normal pci_check_whatever() api, no new error_detected() callback +will be issued due to an error happening here. However, such an error +might cause IOs to be re-blocked for the whole segment, and thus +invalidate the recovery that other devices on the same segment might +have done, forcing the whole segment into one of the next states, +that is link reset or slot reset. + + Result codes: + - PCIERR_RESULT_RECOVERED + Driver returns this if it thinks the device is fully + functionnal and thinks it is ready to start + normal driver operations again. There is no + guarantee that the driver will actually be + allowed to proceed, as another driver on the + same segment might have failed and thus triggered a + slot reset on platforms that support it. + + - PCIERR_RESULT_NEED_RESET + Driver returns this if it thinks the device is not + recoverable in it's current state and it needs a slot + reset to proceed. + + - PCIERR_RESULT_DISCONNECT + Same as above. Total failure, no recovery even after + reset driver dead. (To be defined more precisely) + +>>> The current ppc64 implementation does not implement this callback. + + 3) link_reset() + + This is called after the link has been reset. This is typically +a PCI Express specific state at this point and is done whenever a +non-fatal error has been detected that can be "solved" by resetting +the link. This call informs the driver of the reset and the driver +should check if the device appears to be in working condition. +This function acts a bit like 2) mmio_enabled(), in that the driver +is not supposed to restart normal driver I/O operations right away. +Instead, it should just "probe" the device to check it's recoverability +status. If all is right, then the core will call resume() once all +drivers have ack'd link_reset(). + + Result codes: + (identical to mmio_enabled) + +>>> The current ppc64 implementation does not implement this callback. + + 4) slot_reset() + + This is called after the slot has been soft or hard reset by the +platform. A soft reset consists of asserting the adapter #RST line +and then restoring the PCI BARs and PCI configuration header. If the +platform supports PCI hotplug, then it might instead perform a hard +reset by toggling power on the slot off/on. This call gives drivers +the chance to re-initialize the hardware (re-download firmware, etc.), +but drivers shouldn't restart normal I/O processing operations at +this point. (See note about interrupts; interrupts aren't guaranteed +to be delivered until the resume() callback has been called). If all +device drivers report success on this callback, the patform will call +resume() to complete the error handling and let the driver restart +normal I/O processing. + +A driver can still return a critical failure for this function if +it can't get the device operational after reset. If the platform +previously tried a soft reset, it migh now try a hard reset (power +cycle) and then call slot_reset() again. It the device still can't +be recovered, there is nothing more that can be done; the platform +will typically report a "permanent failure" in such a case. The +device will be considered "dead" in this case. + + Result codes: + - PCIERR_RESULT_DISCONNECT + Same as above. + +>>> The current ppc64 implementation does not try a power-cycle reset +>>> if the driver returned PCIERR_RESULT_DISCONNECT. However, it should. + + 5) resume() + + This is called if all drivers on the segment have returned +PCIERR_RESULT_RECOVERED from one of the 3 prevous callbacks. +That basically tells the driver to restart activity, tht everything +is back and running. No result code is taken into account here. If +a new error happens, it will restart a new error handling process. + +That's it. I think this covers all the possibilities. The way those +callbacks are called is platform policy. A platform with no slot reset +capability for example may want to just "ignore" drivers that can't +recover (disconnect them) and try to let other cards on the same segment +recover. Keep in mind that in most real life cases, though, there will +be only one driver per segment. + +Now, there is a note about interrupts. If you get an interrupt and your +device is dead or has been isolated, there is a problem :) + +After much thinking, I decided to leave that to the platform. That is, +the recovery API only precies that: + + - There is no guarantee that interrupt delivery can proceed from any +device on the segment starting from the error detection and until the +restart callback is sent, at which point interrupts are expected to be +fully operational. + + - There is no guarantee that interrupt delivery is stopped, that is, ad +river that gets an interrupts after detecting an error, or that detects +and error within the interrupt handler such that it prevents proper +ack'ing of the interrupt (and thus removal of the source) should just +return IRQ_NOTHANDLED. It's up to the platform to deal with taht +condition, typically by masking the irq source during the duration of +the error handling. It is expected that the platform "knows" which +interrupts are routed to error-management capable slots and can deal +with temporarily disabling that irq number during error processing (this +isn't terribly complex). That means some IRQ latency for other devices +sharing the interrupt, but there is simply no other way. High end +platforms aren't supposed to share interrupts between many devices +anyway :) + + +Revised: 31 May 2005 Linas Vepstas Index: linux-2.6.14-git3/MAINTAINERS =================================================================== --- linux-2.6.14-git3.orig/MAINTAINERS 2005-11-02 14:29:19.433257684 -0600 +++ linux-2.6.14-git3/MAINTAINERS 2005-11-02 14:34:25.700322915 -0600 @@ -1885,6 +1885,13 @@ L: linux-abi-devel at lists.sourceforge.net S: Maintained +PCI ERROR RECOVERY +P: Linas Vepstas +M: linas at austin.ibm.com +L: linux-kernel at vger.kernel.org +L: linux-pci at atrey.karlin.mff.cuni.cz +S: Supported + PCI SOUND DRIVERS (ES1370, ES1371 and SONICVIBES) P: Thomas Sailer M: sailer at ife.ee.ethz.ch From linas at linas.org Fri Nov 4 11:50:35 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:50:35 -0600 Subject: [PATCH 16/42]: PCI: PCI Error reporting callbacks References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005035.GA26929@mail.gnucash.org> 16-pci-error-recovery_header.patch PCI Error Recovery: header file patch Various PCI bus errors can be signaled by newer PCI controllers. Recovering from those errors requires an infrastructure to notify affected device drivers of the error, and a way of walking through a reset sequence. This patch adds a set of callbacks to be used by error recovery routines to notify device drivers of the various stages of recovery. Signed-off-by: Linas Vepstas -- include/linux/pci.h | 49 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 49 insertions(+) Index: linux-2.6.14-git3/include/linux/pci.h =================================================================== --- linux-2.6.14-git3.orig/include/linux/pci.h 2005-11-02 14:29:18.856338553 -0600 +++ linux-2.6.14-git3/include/linux/pci.h 2005-11-02 14:34:32.272401512 -0600 @@ -78,6 +78,16 @@ #define PCI_UNKNOWN ((pci_power_t __force) 5) #define PCI_POWER_ERROR ((pci_power_t __force) -1) +/** The pci_channel state describes connectivity between the CPU and + * the pci device. If some PCI bus between here and the pci device + * has crashed or locked up, this info is reflected here. + */ +enum pci_channel_state { + pci_channel_io_normal = 0, /* I/O channel is in normal state */ + pci_channel_io_frozen = 1, /* I/O to channel is blocked */ + pci_channel_io_perm_failure, /* PCI card is dead */ +}; + /* * The pci_dev structure is used to describe PCI devices. */ @@ -110,6 +120,7 @@ this is D0-D3, D0 being fully functional, and D3 being off. */ + enum pci_channel_state error_state; /* current connectivity state */ struct device dev; /* Generic device interface */ /* device is compatible with these IDs */ @@ -232,6 +243,43 @@ unsigned int use_driver_data:1; /* pci_driver->driver_data is used */ }; +/* ---------------------------------------------------------------- */ +/** PCI error recovery infrastructure. If a PCI device driver provides + * a set fof callbacks in struct pci_error_handlers, then that device driver + * will be notified of PCI bus errors, and will be driven to recovery + * when an error occurs. + */ + +enum pcierr_result { + PCIERR_RESULT_NONE=0, /* no result/none/not supported in device driver */ + PCIERR_RESULT_CAN_RECOVER=1, /* Device driver can recover without slot reset */ + PCIERR_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */ + PCIERR_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */ + PCIERR_RESULT_RECOVERED, /* Device driver is fully recovered and operational */ +}; + +/* PCI bus error event callbacks */ +struct pci_error_handlers +{ + /* PCI bus error detected on this device */ + int (*error_detected)(struct pci_dev *dev, + enum pci_channel_state error); + + /* MMIO has been re-enabled, but not DMA */ + int (*mmio_enabled)(struct pci_dev *dev); + + /* PCI Express link has been reset */ + int (*link_reset)(struct pci_dev *dev); + + /* PCI slot has been reset */ + int (*slot_reset)(struct pci_dev *dev); + + /* Device driver may resume normal operations */ + void (*resume)(struct pci_dev *dev); +}; + +/* ---------------------------------------------------------------- */ + struct module; struct pci_driver { struct list_head node; @@ -245,6 +293,7 @@ int (*enable_wake) (struct pci_dev *dev, pci_power_t state, int enable); /* Enable wake event */ void (*shutdown) (struct pci_dev *dev); + struct pci_error_handlers *err_handler; struct device_driver driver; struct pci_dynids dynids; }; From linas at linas.org Fri Nov 4 11:50:48 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:50:48 -0600 Subject: [PATCH 17/42]: ppc64: mark failed devices References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005048.GA26970@mail.gnucash.org> 17-eeh-slot-marking-bug.patch A device that experiences a PCI outage may be just one deivce out of many that was affected. In order to avoid repeated reports of a failure, the entire tree of affected devices should be marked as failed. This patch marks up the entire tree. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:34:19.926132452 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:35:39.290005477 -0600 @@ -479,32 +479,47 @@ * an interrupt context, which is bad. */ -static inline void __eeh_mark_slot (struct device_node *dn) +static inline void __eeh_mark_slot (struct device_node *dn, int mode_flag) { while (dn) { - PCI_DN(dn)->eeh_mode |= EEH_MODE_ISOLATED; + if (PCI_DN(dn)) { + PCI_DN(dn)->eeh_mode |= mode_flag; - if (dn->child) - __eeh_mark_slot (dn->child); + if (dn->child) + __eeh_mark_slot (dn->child, mode_flag); + } dn = dn->sibling; } } -static inline void __eeh_clear_slot (struct device_node *dn) +void eeh_mark_slot (struct device_node *dn, int mode_flag) +{ + dn = find_device_pe (dn); + PCI_DN(dn)->eeh_mode |= mode_flag; + __eeh_mark_slot (dn->child, mode_flag); +} + +static inline void __eeh_clear_slot (struct device_node *dn, int mode_flag) { while (dn) { - PCI_DN(dn)->eeh_mode &= ~EEH_MODE_ISOLATED; - if (dn->child) - __eeh_clear_slot (dn->child); + if (PCI_DN(dn)) { + PCI_DN(dn)->eeh_mode &= ~mode_flag; + PCI_DN(dn)->eeh_check_count = 0; + if (dn->child) + __eeh_clear_slot (dn->child, mode_flag); + } dn = dn->sibling; } } -static inline void eeh_clear_slot (struct device_node *dn) +void eeh_clear_slot (struct device_node *dn, int mode_flag) { unsigned long flags; spin_lock_irqsave(&confirm_error_lock, flags); - __eeh_clear_slot (dn); + dn = find_device_pe (dn); + PCI_DN(dn)->eeh_mode &= ~mode_flag; + PCI_DN(dn)->eeh_check_count = 0; + __eeh_clear_slot (dn->child, mode_flag); spin_unlock_irqrestore(&confirm_error_lock, flags); } @@ -529,7 +544,6 @@ int rets[3]; unsigned long flags; struct pci_dn *pdn; - struct device_node *pe_dn; int rc = 0; __get_cpu_var(total_mmio_ffs)++; @@ -631,8 +645,7 @@ /* Avoid repeated reports of this failure, including problems * with other functions on this device, and functions under * bridges. */ - pe_dn = find_device_pe (dn); - __eeh_mark_slot (pe_dn); + eeh_mark_slot (dn, EEH_MODE_ISOLATED); spin_unlock_irqrestore(&confirm_error_lock, flags); eeh_send_failure_event (dn, dev, rets[0], rets[2]); @@ -744,9 +757,6 @@ rc, state, pdn->node->full_name); return; } - - if (state == 0) - eeh_clear_slot (pdn->node->parent->child); } /** rtas_set_slot_reset -- assert the pci #RST line for 1/4 second @@ -765,6 +775,12 @@ #define PCI_BUS_RST_HOLD_TIME_MSEC 250 msleep (PCI_BUS_RST_HOLD_TIME_MSEC); + + /* We might get hit with another EEH freeze as soon as the + * pci slot reset line is dropped. Make sure we don't miss + * these, and clear the flag now. */ + eeh_clear_slot (pdn->node, EEH_MODE_ISOLATED); + rtas_pci_slot_reset (pdn, 0); /* After a PCI slot has been reset, the PCI Express spec requires Index: linux-2.6.14-git3/include/asm-powerpc/ppc-pci.h =================================================================== --- linux-2.6.14-git3.orig/include/asm-powerpc/ppc-pci.h 2005-11-02 14:34:19.931131751 -0600 +++ linux-2.6.14-git3/include/asm-powerpc/ppc-pci.h 2005-11-02 14:35:39.295004776 -0600 @@ -86,6 +86,13 @@ int rtas_write_config(struct pci_dn *, int where, int size, u32 val); +/** + * mark and clear slots: find "partition endpoint" PE and set or + * clear the flags for each subnode of the PE. + */ +void eeh_mark_slot (struct device_node *dn, int mode_flag); +void eeh_clear_slot (struct device_node *dn, int mode_flag); + #endif #endif /* _ASM_POWERPC_PPC_PCI_H */ From linas at linas.org Fri Nov 4 11:51:03 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:51:03 -0600 Subject: [PATCH 18/42]: ppc64: bugfix: crash on dlpar slot add, remove References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005103.GA26983@mail.gnucash.org> 18-crash-on-pci-slot-add.patch This patch fixes a bugs related to dlpar slot add. -- Crash is due to the fact the some children of pci nodes are not pci nodes themselves, and thus do not have pci_dn structures. For example: /pci at 800000020000002/pci at 2,3/usb at 1/hub at 1 /pci at 800000020000002/pci at 2,3/usb at 1,1/hub at 1 A typical stack trace: Vector: 300 (Data Access) at [c0000000555637d0] pc: c000000000202a50: .dlpar_add_slot+0x108/0x410 c000000000202e78 .add_slot_store+0x7c/0xac c000000000202da0 .dlpar_attr_store+0x48/0x64 c0000000000f8ee4 .sysfs_write_file+0x100/0x1a0 A similar stack trace is involved for the slot remove. This code survived testing, of adding and removing different slots, 23 times each, so far, as of this writing. Signed-off-by: Linas Vepstas emailed to To: paulus at samba.org Cc: linuxppc64-dev at ozlabs.org, johnrose at linux.ibm.com, linux-kernel at vger.kernel.org Subject: [PATCH 2/2] ppc64: Crash in DLPAR code on remove operation on 4 October 2005 Index: linux-2.6.14-git6/arch/ppc64/kernel/pci_dn.c =================================================================== --- linux-2.6.14-git6.orig/arch/ppc64/kernel/pci_dn.c 2005-11-03 14:15:40.520737607 -0600 +++ linux-2.6.14-git6/arch/ppc64/kernel/pci_dn.c 2005-11-03 14:15:45.182083115 -0600 @@ -194,7 +194,10 @@ switch (action) { case PSERIES_RECONFIG_ADD: - pci = np->parent->data; + pci = PCI_DN(np->parent); + if (!pci) + return NOTIFY_OK; + update_dn_pci_info(np, pci->phb); break; default: Index: linux-2.6.14-git6/arch/powerpc/platforms/pseries/iommu.c =================================================================== --- linux-2.6.14-git6.orig/arch/powerpc/platforms/pseries/iommu.c 2005-11-03 14:14:32.131340002 -0600 +++ linux-2.6.14-git6/arch/powerpc/platforms/pseries/iommu.c 2005-11-03 14:49:42.621970876 -0600 @@ -494,10 +494,13 @@ { int err = NOTIFY_OK; struct device_node *np = node; - struct pci_dn *pci = np->data; + struct pci_dn *pci; switch (action) { case PSERIES_RECONFIG_REMOVE: + pci = PCI_DN(np); + if (!pci) + return NOTIFY_OK; if (pci->iommu_table && get_property(np, "ibm,dma-window", NULL)) iommu_free_table(np); From linas at linas.org Fri Nov 4 11:51:17 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:51:17 -0600 Subject: [PATCH 19/42]: ppc64: bugfix: crash on PHB add References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005117.GA26991@mail.gnucash.org> 19-rpaphp-crashing.patch This patch fixes a bug related to dlpar PHB add, after a PHB removal. -- The crash was due to the PHB not having a pci_dn structure yet, when the phb is being added. This code survived testing, of adding and removeig the PHB and all slots underneath it, 17 times so far, as of this writing. Signed-off-by: Linas Vepstas emailed to To: paulus at samba.org Cc: linuxppc64-dev at ozlabs.org, linux-pci at atrey.karlin.mff.cuni.cz, johnrose at linux.ibm.com, linux-kernel at vger.kernel.org Subject: [PATCH] rpaphp: PCI Hotplug crash on PHB DLPAR add on 4 October 2005 Index: linux-2.6.14-git3/drivers/pci/hotplug/rpadlpar_core.c =================================================================== --- linux-2.6.14-git3.orig/drivers/pci/hotplug/rpadlpar_core.c 2005-11-02 14:29:02.115685162 -0600 +++ linux-2.6.14-git3/drivers/pci/hotplug/rpadlpar_core.c 2005-11-02 14:35:52.800111285 -0600 @@ -306,7 +306,7 @@ { struct pci_controller *phb; - if (PCI_DN(dn)->phb) { + if (PCI_DN(dn) && PCI_DN(dn)->phb) { /* PHB already exists */ return -EINVAL; } From linas at linas.org Fri Nov 4 11:51:31 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:51:31 -0600 Subject: [PATCH 20/42]: ppc64: PCI hotplug common code elimination References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005131.GA27000@mail.gnucash.org> 20-rpaphp-eeh-cleanup.patch This patch move some code from the rpaphp directory, to the ppc64 directory, where it should have been all along (Among other things, I need it in the ppc64 directory for the PCI error recovery.) Please note that patch affects TWO maintainers: Paul, after applying the ppc64 part, please ask that GregKH appli the PCI part. It is safe to have the ppc64 part go in first. It would be bad to have the PCI part go in first. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:35:39.290005477 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:36:41.255317484 -0600 @@ -1093,6 +1093,15 @@ } EXPORT_SYMBOL_GPL(eeh_add_device_early); +void eeh_add_device_tree_early(struct device_node *dn) +{ + struct device_node *sib; + for (sib = dn->child; sib; sib = sib->sibling) + eeh_add_device_tree_early(sib); + eeh_add_device_early(dn); +} +EXPORT_SYMBOL_GPL(eeh_add_device_tree_early); + /** * eeh_add_device_late - perform EEH initialization for the indicated pci device * @dev: pci device for which to set up EEH @@ -1147,6 +1156,23 @@ } EXPORT_SYMBOL_GPL(eeh_remove_device); +void eeh_remove_bus_device(struct pci_dev *dev) +{ + eeh_remove_device(dev); + if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) { + struct pci_bus *bus = dev->subordinate; + struct list_head *ln; + if (!bus) + return; + for (ln = bus->devices.next; ln != &bus->devices; ln = ln->next) { + struct pci_dev *pdev = pci_dev_b(ln); + if (pdev) + eeh_remove_bus_device(pdev); + } + } +} +EXPORT_SYMBOL_GPL(eeh_remove_bus_device); + static int proc_eeh_show(struct seq_file *m, void *v) { unsigned int cpu; Index: linux-2.6.14-git3/include/asm-ppc64/eeh.h =================================================================== --- linux-2.6.14-git3.orig/include/asm-ppc64/eeh.h 2005-11-02 14:32:35.725740824 -0600 +++ linux-2.6.14-git3/include/asm-ppc64/eeh.h 2005-11-02 14:36:41.263316362 -0600 @@ -55,6 +55,7 @@ * to finish the eeh setup for this device. */ void eeh_add_device_early(struct device_node *); +void eeh_add_device_tree_early(struct device_node *); void eeh_add_device_late(struct pci_dev *); /** @@ -70,6 +71,15 @@ void eeh_remove_device(struct pci_dev *); /** + * eeh_remove_device_recursive - undo EEH for device & children. + * @dev: pci device to be removed + * + * As above, this removes the device; it also removes child + * pci devices as well. + */ +void eeh_remove_bus_device(struct pci_dev *); + +/** * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure. * * If this macro yields TRUE, the caller relays to eeh_check_failure() Index: linux-2.6.14-git3/drivers/pci/hotplug/rpaphp_pci.c =================================================================== --- linux-2.6.14-git3.orig/drivers/pci/hotplug/rpaphp_pci.c 2005-11-02 14:28:58.955128188 -0600 +++ linux-2.6.14-git3/drivers/pci/hotplug/rpaphp_pci.c 2005-11-02 14:36:41.271315241 -0600 @@ -253,17 +253,6 @@ return dev; } -static void enable_eeh(struct device_node *dn) -{ - struct device_node *sib; - - for (sib = dn->child; sib; sib = sib->sibling) - enable_eeh(sib); - eeh_add_device_early(dn); - return; - -} - static void print_slot_pci_funcs(struct pci_bus *bus) { struct device_node *dn; @@ -289,7 +278,7 @@ if (!dn) goto exit; - enable_eeh(dn); + eeh_add_device_tree_early(dn); dev = rpaphp_pci_config_slot(bus); if (!dev) { err("%s: can't find any devices.\n", __FUNCTION__); @@ -303,30 +292,12 @@ } EXPORT_SYMBOL_GPL(rpaphp_config_pci_adapter); -static void rpaphp_eeh_remove_bus_device(struct pci_dev *dev) -{ - eeh_remove_device(dev); - if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) { - struct pci_bus *bus = dev->subordinate; - struct list_head *ln; - if (!bus) - return; - for (ln = bus->devices.next; ln != &bus->devices; ln = ln->next) { - struct pci_dev *pdev = pci_dev_b(ln); - if (pdev) - rpaphp_eeh_remove_bus_device(pdev); - } - - } - return; -} - int rpaphp_unconfig_pci_adapter(struct pci_bus *bus) { struct pci_dev *dev, *tmp; list_for_each_entry_safe(dev, tmp, &bus->devices, bus_list) { - rpaphp_eeh_remove_bus_device(dev); + eeh_remove_bus_device(dev); pci_remove_bus_device(dev); } return 0; From linas at linas.org Fri Nov 4 11:51:46 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:51:46 -0600 Subject: [PATCH 21/42]: PCI: cleanup/simplify ppc64 PCI hotplug code References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005146.GA27008@mail.gnucash.org> 21-rpaphp-eeh-cleanup.patch This patch cleans up some rpa dlpar code. Basically, the rpaphp_config_pci_adapter() was a wrapper routine, which made two calls, and wrapped a bunch of verbose no-op code around it. This was consolidated wih the routine it called. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/drivers/pci/hotplug/rpaphp_pci.c =================================================================== --- linux-2.6.14-git3.orig/drivers/pci/hotplug/rpaphp_pci.c 2005-11-02 14:36:41.271315241 -0600 +++ linux-2.6.14-git3/drivers/pci/hotplug/rpaphp_pci.c 2005-11-02 14:36:48.081360405 -0600 @@ -221,18 +221,21 @@ rpaphp_pci_config_slot() will configure all devices under the given slot->dn and return the the first pci_dev. *****************************************************************************/ -static struct pci_dev * -rpaphp_pci_config_slot(struct pci_bus *bus) +int +rpaphp_config_pci_adapter(struct pci_bus *bus) { struct device_node *dn = pci_bus_to_OF_node(bus); struct pci_dev *dev = NULL; + int rc = -ENODEV; int slotno; int num; dbg("Enter %s: dn=%s bus=%s\n", __FUNCTION__, dn->full_name, bus->name); if (!dn || !dn->child) - return NULL; + goto exit; + eeh_add_device_tree_early(dn); + slotno = PCI_SLOT(PCI_DN(dn->child)->devfn); /* pci_scan_slot should find all children */ @@ -243,15 +246,23 @@ } if (list_empty(&bus->devices)) { err("%s: No new device found\n", __FUNCTION__); - return NULL; + goto exit; } list_for_each_entry(dev, &bus->devices, bus_list) { if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) rpaphp_pci_config_bridge(dev); } - return dev; + dbg("%s: pci_devs of slot[%s]\n", __FUNCTION__, dn->full_name); + list_for_each_entry (dev, &bus->devices, bus_list) + dbg("\t%s\n", pci_name(dev)); + + rc = 0; +exit: + dbg("Exit %s: rc=%d\n", __FUNCTION__, rc); + return rc; } +EXPORT_SYMBOL_GPL(rpaphp_config_pci_adapter); static void print_slot_pci_funcs(struct pci_bus *bus) { @@ -268,30 +279,6 @@ return; } -int rpaphp_config_pci_adapter(struct pci_bus *bus) -{ - struct device_node *dn = pci_bus_to_OF_node(bus); - struct pci_dev *dev; - int rc = -ENODEV; - - dbg("Entry %s: slot[%s]\n", __FUNCTION__, dn->full_name); - if (!dn) - goto exit; - - eeh_add_device_tree_early(dn); - dev = rpaphp_pci_config_slot(bus); - if (!dev) { - err("%s: can't find any devices.\n", __FUNCTION__); - goto exit; - } - print_slot_pci_funcs(bus); - rc = 0; -exit: - dbg("Exit %s: rc=%d\n", __FUNCTION__, rc); - return rc; -} -EXPORT_SYMBOL_GPL(rpaphp_config_pci_adapter); - int rpaphp_unconfig_pci_adapter(struct pci_bus *bus) { struct pci_dev *dev, *tmp; From linas at linas.org Fri Nov 4 11:52:01 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:52:01 -0600 Subject: [PATCH 22/42]: PCI: remove duplicted pci hotplug code References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005201.GA27016@mail.gnucash.org> 22-rpaphp-eliminate-dupe-code.patch The RPAPHP code contains two routines that appear to be gratiuitous copies of very similar pci code. In particular, rpaphp_claim_resource ~~ pci_claim_resource rpadlpar_claim_one_bus == pcibios_claim_one_bus This patch removes the rpaphp versions of the code. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/drivers/pci/hotplug/rpaphp_pci.c =================================================================== --- linux-2.6.14-git3.orig/drivers/pci/hotplug/rpaphp_pci.c 2005-11-02 14:36:48.081360405 -0600 +++ linux-2.6.14-git3/drivers/pci/hotplug/rpaphp_pci.c 2005-11-02 14:36:51.785840999 -0600 @@ -62,28 +62,6 @@ } EXPORT_SYMBOL_GPL(rpaphp_find_pci_bus); -int rpaphp_claim_resource(struct pci_dev *dev, int resource) -{ - struct resource *res = &dev->resource[resource]; - struct resource *root = pci_find_parent_resource(dev, res); - char *dtype = resource < PCI_BRIDGE_RESOURCES ? "device" : "bridge"; - int err = -EINVAL; - - if (root != NULL) { - err = request_resource(root, res); - } - - if (err) { - err("PCI: %s region %d of %s %s [%lx:%lx]\n", - root ? "Address space collision on" : - "No parent found for", - resource, dtype, pci_name(dev), res->start, res->end); - } - return err; -} - -EXPORT_SYMBOL_GPL(rpaphp_claim_resource); - static int rpaphp_get_sensor_state(struct slot *slot, int *state) { int rc; @@ -178,7 +156,7 @@ if (r->parent || !r->start || !r->flags) continue; - rpaphp_claim_resource(dev, i); + pci_claim_resource(dev, i); } } } Index: linux-2.6.14-git3/drivers/pci/hotplug/rpadlpar_core.c =================================================================== --- linux-2.6.14-git3.orig/drivers/pci/hotplug/rpadlpar_core.c 2005-11-02 14:35:52.800111285 -0600 +++ linux-2.6.14-git3/drivers/pci/hotplug/rpadlpar_core.c 2005-11-02 14:36:51.793839877 -0600 @@ -112,28 +112,6 @@ return NULL; } -static void rpadlpar_claim_one_bus(struct pci_bus *b) -{ - struct list_head *ld; - struct pci_bus *child_bus; - - for (ld = b->devices.next; ld != &b->devices; ld = ld->next) { - struct pci_dev *dev = pci_dev_b(ld); - int i; - - for (i = 0; i < PCI_NUM_RESOURCES; i++) { - struct resource *r = &dev->resource[i]; - - if (r->parent || !r->start || !r->flags) - continue; - rpaphp_claim_resource(dev, i); - } - } - - list_for_each_entry(child_bus, &b->children, node) - rpadlpar_claim_one_bus(child_bus); -} - static int pci_add_secondary_bus(struct device_node *dn, struct pci_dev *bridge_dev) { @@ -158,7 +136,7 @@ pcibios_fixup_bus(child); /* Claim new bus resources */ - rpadlpar_claim_one_bus(bridge_dev->bus); + pcibios_claim_one_bus(bridge_dev->bus); if (hose->last_busno < child->number) hose->last_busno = child->number; Index: linux-2.6.14-git3/arch/ppc64/kernel/pci.c =================================================================== --- linux-2.6.14-git3.orig/arch/ppc64/kernel/pci.c 2005-11-02 14:28:57.119385510 -0600 +++ linux-2.6.14-git3/arch/ppc64/kernel/pci.c 2005-11-02 14:36:51.808837774 -0600 @@ -197,7 +197,7 @@ spin_unlock(&hose_spinlock); } -static void __init pcibios_claim_one_bus(struct pci_bus *b) +void __devinit pcibios_claim_one_bus(struct pci_bus *b) { struct pci_dev *dev; struct pci_bus *child_bus; Index: linux-2.6.14-git3/include/asm-ppc64/pci.h =================================================================== --- linux-2.6.14-git3.orig/include/asm-ppc64/pci.h 2005-11-02 14:28:57.119385510 -0600 +++ linux-2.6.14-git3/include/asm-ppc64/pci.h 2005-11-02 14:36:51.813837073 -0600 @@ -160,6 +160,8 @@ extern void pcibios_fixup_device_resources(struct pci_dev *dev, struct pci_bus *bus); +extern void pcibios_claim_one_bus(struct pci_bus *b); + extern struct pci_controller *init_phb_dynamic(struct device_node *dn); extern int pci_read_irq_line(struct pci_dev *dev); From linas at linas.org Fri Nov 4 11:52:16 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:52:16 -0600 Subject: [PATCH 23/42]: ppc64: migrate common PCI hotplug code References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005216.GA27025@mail.gnucash.org> 23-rpaphp-migrate.patch This patch moves some pci device add & remove code from the PCI hotplug directory to the arch/ppc64/kernel directory, and cleans it up a tad. The primary reason for this is that the code performs some fairly generic operations that are shared with the PCI error recovery code (living in the arch/ppc64/kernel directory). Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/pci_dlpar.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/pci_dlpar.c 2005-11-02 14:39:24.724396565 -0600 @@ -0,0 +1,174 @@ +/* + * PCI Dynamic LPAR, PCI Hot Plug and PCI EEH recovery code + * for RPA-compliant PPC64 platform. + * Copyright (C) 2003 Linda Xie + * Copyright (C) 2005 International Business Machines + * + * Updates, 2005, John Rose + * Updates, 2005, Linas Vepstas + * + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or (at + * your option) any later version. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or + * NON INFRINGEMENT. See the GNU General Public License for more + * details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + */ + +#include +#include + +static struct pci_bus * +find_bus_among_children(struct pci_bus *bus, + struct device_node *dn) +{ + struct pci_bus *child = NULL; + struct list_head *tmp; + struct device_node *busdn; + + busdn = pci_bus_to_OF_node(bus); + if (busdn == dn) + return bus; + + list_for_each(tmp, &bus->children) { + child = find_bus_among_children(pci_bus_b(tmp), dn); + if (child) + break; + }; + return child; +} + +struct pci_bus * +pcibios_find_pci_bus(struct device_node *dn) +{ + struct pci_dn *pdn = dn->data; + + if (!pdn || !pdn->phb || !pdn->phb->bus) + return NULL; + + return find_bus_among_children(pdn->phb->bus, dn); +} + +/** + * pcibios_remove_pci_devices - remove all devices under this bus + * + * Remove all of the PCI devices under this bus both from the + * linux pci device tree, and from the ppc64 EEH address cache. + */ +void +pcibios_remove_pci_devices(struct pci_bus *bus) +{ + struct pci_dev *dev, *tmp; + + list_for_each_entry_safe(dev, tmp, &bus->devices, bus_list) { + eeh_remove_bus_device(dev); + pci_remove_bus_device(dev); + } +} + +/* Must be called before pci_bus_add_devices */ +static void +pcibios_fixup_new_pci_devices(struct pci_bus *bus, int fix_bus) +{ + struct pci_dev *dev; + + list_for_each_entry(dev, &bus->devices, bus_list) { + /* + * Skip already-present devices (which are on the + * global device list.) + */ + if (list_empty(&dev->global_list)) { + int i; + + /* Need to setup IOMMU tables */ + ppc_md.iommu_dev_setup(dev); + + if(fix_bus) + pcibios_fixup_device_resources(dev, bus); + pci_read_irq_line(dev); + for (i = 0; i < PCI_NUM_RESOURCES; i++) { + struct resource *r = &dev->resource[i]; + + if (r->parent || !r->start || !r->flags) + continue; + pci_claim_resource(dev, i); + } + } + } +} + +static int +pcibios_pci_config_bridge(struct pci_dev *dev) +{ + u8 sec_busno; + struct pci_bus *child_bus; + struct pci_dev *child_dev; + + /* Get busno of downstream bus */ + pci_read_config_byte(dev, PCI_SECONDARY_BUS, &sec_busno); + + /* Add to children of PCI bridge dev->bus */ + child_bus = pci_add_new_bus(dev->bus, dev, sec_busno); + if (!child_bus) { + printk (KERN_ERR "%s: could not add second bus\n", __FUNCTION__); + return -EIO; + } + sprintf(child_bus->name, "PCI Bus #%02x", child_bus->number); + + pci_scan_child_bus(child_bus); + + list_for_each_entry(child_dev, &child_bus->devices, bus_list) { + eeh_add_device_late(child_dev); + } + + /* Fixup new pci devices without touching bus struct */ + pcibios_fixup_new_pci_devices(child_bus, 0); + + /* Make the discovered devices available */ + pci_bus_add_devices(child_bus); + return 0; +} + +/** + * pcibios_add_pci_devices - adds new pci devices to bus + * + * This routine will find and fixup new pci devices under + * the indicated bus. This routine presumes that there + * might already be some devices under this bridge, so + * it carefully tries to add only new devices. (And that + * is how this routine differs from other, similar pcibios + * routines.) + */ +void +pcibios_add_pci_devices(struct pci_bus * bus) +{ + int slotno, num; + struct pci_dev *dev; + struct device_node *dn = pci_bus_to_OF_node(bus); + + eeh_add_device_tree_early(dn); + + /* pci_scan_slot should find all children */ + slotno = PCI_SLOT(PCI_DN(dn->child)->devfn); + num = pci_scan_slot(bus, PCI_DEVFN(slotno, 0)); + if (num) { + pcibios_fixup_new_pci_devices(bus, 1); + pci_bus_add_devices(bus); + } + + list_for_each_entry(dev, &bus->devices, bus_list) { + eeh_add_device_late (dev); + if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) + pcibios_pci_config_bridge(dev); + } +} Index: linux-2.6.14-git3/drivers/pci/hotplug/rpaphp_pci.c =================================================================== --- linux-2.6.14-git3.orig/drivers/pci/hotplug/rpaphp_pci.c 2005-11-02 14:36:51.785840999 -0600 +++ linux-2.6.14-git3/drivers/pci/hotplug/rpaphp_pci.c 2005-11-02 14:39:24.730395724 -0600 @@ -32,36 +32,6 @@ #include "../pci.h" /* for pci_add_new_bus */ #include "rpaphp.h" -static struct pci_bus *find_bus_among_children(struct pci_bus *bus, - struct device_node *dn) -{ - struct pci_bus *child = NULL; - struct list_head *tmp; - struct device_node *busdn; - - busdn = pci_bus_to_OF_node(bus); - if (busdn == dn) - return bus; - - list_for_each(tmp, &bus->children) { - child = find_bus_among_children(pci_bus_b(tmp), dn); - if (child) - break; - } - return child; -} - -struct pci_bus *rpaphp_find_pci_bus(struct device_node *dn) -{ - struct pci_dn *pdn = dn->data; - - if (!pdn || !pdn->phb || !pdn->phb->bus) - return NULL; - - return find_bus_among_children(pdn->phb->bus, dn); -} -EXPORT_SYMBOL_GPL(rpaphp_find_pci_bus); - static int rpaphp_get_sensor_state(struct slot *slot, int *state) { int rc; @@ -120,7 +90,7 @@ /* config/unconfig adapter */ *value = slot->state; } else { - bus = rpaphp_find_pci_bus(slot->dn); + bus = pcibios_find_pci_bus(slot->dn); if (bus && !list_empty(&bus->devices)) *value = CONFIGURED; else @@ -131,117 +101,6 @@ return rc; } -/* Must be called before pci_bus_add_devices */ -static void -rpaphp_fixup_new_pci_devices(struct pci_bus *bus, int fix_bus) -{ - struct pci_dev *dev; - - list_for_each_entry(dev, &bus->devices, bus_list) { - /* - * Skip already-present devices (which are on the - * global device list.) - */ - if (list_empty(&dev->global_list)) { - int i; - - /* Need to setup IOMMU tables */ - ppc_md.iommu_dev_setup(dev); - - if(fix_bus) - pcibios_fixup_device_resources(dev, bus); - pci_read_irq_line(dev); - for (i = 0; i < PCI_NUM_RESOURCES; i++) { - struct resource *r = &dev->resource[i]; - - if (r->parent || !r->start || !r->flags) - continue; - pci_claim_resource(dev, i); - } - } - } -} - -static int rpaphp_pci_config_bridge(struct pci_dev *dev) -{ - u8 sec_busno; - struct pci_bus *child_bus; - struct pci_dev *child_dev; - - dbg("Enter %s: BRIDGE dev=%s\n", __FUNCTION__, pci_name(dev)); - - /* get busno of downstream bus */ - pci_read_config_byte(dev, PCI_SECONDARY_BUS, &sec_busno); - - /* add to children of PCI bridge dev->bus */ - child_bus = pci_add_new_bus(dev->bus, dev, sec_busno); - if (!child_bus) { - err("%s: could not add second bus\n", __FUNCTION__); - return -EIO; - } - sprintf(child_bus->name, "PCI Bus #%02x", child_bus->number); - /* do pci_scan_child_bus */ - pci_scan_child_bus(child_bus); - - list_for_each_entry(child_dev, &child_bus->devices, bus_list) { - eeh_add_device_late(child_dev); - } - - /* fixup new pci devices without touching bus struct */ - rpaphp_fixup_new_pci_devices(child_bus, 0); - - /* Make the discovered devices available */ - pci_bus_add_devices(child_bus); - return 0; -} - -/***************************************************************************** - rpaphp_pci_config_slot() will configure all devices under the - given slot->dn and return the the first pci_dev. - *****************************************************************************/ -int -rpaphp_config_pci_adapter(struct pci_bus *bus) -{ - struct device_node *dn = pci_bus_to_OF_node(bus); - struct pci_dev *dev = NULL; - int rc = -ENODEV; - int slotno; - int num; - - dbg("Enter %s: dn=%s bus=%s\n", __FUNCTION__, dn->full_name, bus->name); - if (!dn || !dn->child) - goto exit; - - eeh_add_device_tree_early(dn); - - slotno = PCI_SLOT(PCI_DN(dn->child)->devfn); - - /* pci_scan_slot should find all children */ - num = pci_scan_slot(bus, PCI_DEVFN(slotno, 0)); - if (num) { - rpaphp_fixup_new_pci_devices(bus, 1); - pci_bus_add_devices(bus); - } - if (list_empty(&bus->devices)) { - err("%s: No new device found\n", __FUNCTION__); - goto exit; - } - list_for_each_entry(dev, &bus->devices, bus_list) { - if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) - rpaphp_pci_config_bridge(dev); - } - - dbg("%s: pci_devs of slot[%s]\n", __FUNCTION__, dn->full_name); - list_for_each_entry (dev, &bus->devices, bus_list) - dbg("\t%s\n", pci_name(dev)); - - rc = 0; -exit: - dbg("Exit %s: rc=%d\n", __FUNCTION__, rc); - return rc; -} -EXPORT_SYMBOL_GPL(rpaphp_config_pci_adapter); - static void print_slot_pci_funcs(struct pci_bus *bus) { struct device_node *dn; @@ -257,17 +116,6 @@ return; } -int rpaphp_unconfig_pci_adapter(struct pci_bus *bus) -{ - struct pci_dev *dev, *tmp; - - list_for_each_entry_safe(dev, tmp, &bus->devices, bus_list) { - eeh_remove_bus_device(dev); - pci_remove_bus_device(dev); - } - return 0; -} - static int setup_pci_hotplug_slot_info(struct slot *slot) { dbg("%s Initilize the PCI slot's hotplug->info structure ...\n", @@ -303,7 +151,7 @@ struct pci_bus *bus; BUG_ON(!dn); - bus = rpaphp_find_pci_bus(dn); + bus = pcibios_find_pci_bus(dn); if (!bus) { err("%s: no pci_bus for dn %s\n", __FUNCTION__, dn->full_name); goto exit_rc; @@ -328,10 +176,7 @@ if (slot->hotplug_slot->info->adapter_status == NOT_CONFIGURED) { dbg("%s CONFIGURING pci adapter in slot[%s]\n", __FUNCTION__, slot->name); - if (rpaphp_config_pci_adapter(slot->bus)) { - err("%s: CONFIG pci adapter failed\n", __FUNCTION__); - goto exit_rc; - } + pcibios_add_pci_devices(slot->bus); } else if (slot->hotplug_slot->info->adapter_status != CONFIGURED) { err("%s: slot[%s]'s adapter_status is NOT_VALID.\n", @@ -377,16 +222,10 @@ /* if slot is not empty, enable the adapter */ if (state == PRESENT) { dbg("%s : slot[%s] is occupied.\n", __FUNCTION__, slot->name); - retval = rpaphp_config_pci_adapter(slot->bus); - if (!retval) { - slot->state = CONFIGURED; - dbg("%s: PCI devices in slot[%s] has been configured\n", + pcibios_add_pci_devices(slot->bus); + slot->state = CONFIGURED; + dbg("%s: PCI devices in slot[%s] has been configured\n", __FUNCTION__, slot->name); - } else { - slot->state = NOT_CONFIGURED; - dbg("%s: no pci_dev struct for adapter in slot[%s]\n", - __FUNCTION__, slot->name); - } } else if (state == EMPTY) { dbg("%s : slot[%s] is empty\n", __FUNCTION__, slot->name); slot->state = EMPTY; Index: linux-2.6.14-git3/drivers/pci/hotplug/rpadlpar_core.c =================================================================== --- linux-2.6.14-git3.orig/drivers/pci/hotplug/rpadlpar_core.c 2005-11-02 14:36:51.793839877 -0600 +++ linux-2.6.14-git3/drivers/pci/hotplug/rpadlpar_core.c 2005-11-02 14:39:24.737394743 -0600 @@ -197,9 +197,8 @@ static int dlpar_add_pci_slot(char *drc_name, struct device_node *dn) { struct pci_dev *dev; - int rc; - if (rpaphp_find_pci_bus(dn)) + if (pcibios_find_pci_bus(dn)) return -EINVAL; /* Add pci bus */ @@ -211,12 +210,7 @@ } if (dn->child) { - rc = rpaphp_config_pci_adapter(dev->subordinate); - if (rc < 0) { - printk(KERN_ERR "%s: unable to enable slot %s\n", - __FUNCTION__, drc_name); - return -EIO; - } + pcibios_add_pci_devices(dev->subordinate); } /* Add hotplug slot */ @@ -255,7 +249,7 @@ struct pci_dn *pdn; int rc = 0; - if (!rpaphp_find_pci_bus(dn)) + if (!pcibios_find_pci_bus(dn)) return -EINVAL; slot = find_slot(dn); @@ -400,7 +394,7 @@ struct pci_bus *bus; struct slot *slot; - bus = rpaphp_find_pci_bus(dn); + bus = pcibios_find_pci_bus(dn); if (!bus) return -EINVAL; Index: linux-2.6.14-git3/drivers/pci/hotplug/rpaphp_core.c =================================================================== --- linux-2.6.14-git3.orig/drivers/pci/hotplug/rpaphp_core.c 2005-11-02 14:28:55.984544585 -0600 +++ linux-2.6.14-git3/drivers/pci/hotplug/rpaphp_core.c 2005-11-02 14:39:24.744393761 -0600 @@ -426,7 +426,8 @@ dbg("DISABLING SLOT %s\n", slot->name); down(&rpaphp_sem); - retval = rpaphp_unconfig_pci_adapter(slot->bus); + pcibios_remove_pci_devices(slot->bus); + retval = 0; up(&rpaphp_sem); slot->state = NOT_CONFIGURED; info("%s: devices in slot[%s] unconfigured.\n", __FUNCTION__, Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/Makefile =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/Makefile 2005-11-02 14:32:55.306995693 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/Makefile 2005-11-02 14:40:05.531674439 -0600 @@ -1,6 +1,6 @@ obj-y := pci.o lpar.o hvCall.o nvram.o reconfig.o \ - setup.o iommu.o rtas-fw.o ras.o + setup.o iommu.o rtas-fw.o ras.o pci_dlpar.o obj-$(CONFIG_SMP) += smp.o obj-$(CONFIG_IBMVIO) += vio.o obj-$(CONFIG_XICS) += xics.o -obj-$(CONFIG_EEH) += eeh.o eeh_event.o +obj-$(CONFIG_EEH) += eeh.o eeh_event.o Index: linux-2.6.14-git3/include/asm-ppc64/pci-bridge.h =================================================================== --- linux-2.6.14-git3.orig/include/asm-ppc64/pci-bridge.h 2005-11-02 14:28:55.984544585 -0600 +++ linux-2.6.14-git3/include/asm-ppc64/pci-bridge.h 2005-11-02 14:39:24.755392219 -0600 @@ -121,9 +121,18 @@ return bus->sysdata; /* Must be root bus (PHB) */ } +/** Find the bus corresponding to the indicated device node */ +struct pci_bus * pcibios_find_pci_bus(struct device_node *dn); + extern void pci_process_bridge_OF_ranges(struct pci_controller *hose, struct device_node *dev, int primary); +/** Remove all of the PCI devices under this bus */ +void pcibios_remove_pci_devices(struct pci_bus *bus); + +/** Discover new pci devices under this bus, and add them */ +void pcibios_add_pci_devices(struct pci_bus * bus); + extern int pcibios_remove_root_bus(struct pci_controller *phb); extern void phbs_remap_io(void); From linas at linas.org Fri Nov 4 11:52:49 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:52:49 -0600 Subject: [PATCH 24/42]: ppc64: PCI Error Recovery: PPC64 core recovery routines References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005249.GA27034@mail.gnucash.org> Various PCI bus errors can be signaled by newer PCI controllers. The core error recovery routines are architecture dependent. This patch adds a recovery infrastructure for the PPC64 pSeries systems. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:36:41.255317484 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:41:18.427452474 -0600 @@ -485,6 +485,11 @@ if (PCI_DN(dn)) { PCI_DN(dn)->eeh_mode |= mode_flag; + /* Mark the pci device driver too */ + struct pci_dev *dev = PCI_DN(dn)->pcidev; + if (dev && dev->driver) + dev->error_state = pci_channel_io_frozen; + if (dn->child) __eeh_mark_slot (dn->child, mode_flag); } @@ -544,6 +549,7 @@ int rets[3]; unsigned long flags; struct pci_dn *pdn; + enum pci_channel_state state; int rc = 0; __get_cpu_var(total_mmio_ffs)++; @@ -648,8 +654,13 @@ eeh_mark_slot (dn, EEH_MODE_ISOLATED); spin_unlock_irqrestore(&confirm_error_lock, flags); - eeh_send_failure_event (dn, dev, rets[0], rets[2]); - + state = pci_channel_io_normal; + if ((rets[0] == 2) || (rets[0] == 4)) + state = pci_channel_io_frozen; + if (rets[0] == 5) + state = pci_channel_io_perm_failure; + eeh_send_failure_event (dn, dev, state, rets[2]); + /* Most EEH events are due to device driver bugs. Having * a stack trace will help the device-driver authors figure * out what happened. So print that out. */ @@ -953,8 +964,10 @@ * But there are a few cases like display devices that make sense. */ enable = 1; /* i.e. we will do checking */ +#if 0 if ((*class_code >> 16) == PCI_BASE_CLASS_DISPLAY) enable = 0; +#endif if (!enable) pdn->eeh_mode |= EEH_MODE_NOCHECK; Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh_driver.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh_driver.c 2005-11-02 14:41:18.435451353 -0600 @@ -0,0 +1,366 @@ +/* + * PCI Error Recovery Driver for RPA-compliant PPC64 platform. + * Copyright (C) 2004, 2005 Linas Vepstas + * + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or (at + * your option) any later version. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or + * NON INFRINGEMENT. See the GNU General Public License for more + * details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + * + * Send feedback to + * + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + + +static inline const char * pcid_name (struct pci_dev *pdev) +{ + if (pdev->dev.driver) + return pdev->dev.driver->name; + return ""; +} + +/** + * Return the "partitionable endpoint" (pe) under which this device lies + */ +static struct device_node * find_device_pe(struct device_node *dn) +{ + while ((dn->parent) && PCI_DN(dn->parent) && + (PCI_DN(dn->parent)->eeh_mode & EEH_MODE_SUPPORTED)) { + dn = dn->parent; + } + return dn; +} + + +#ifdef DEBUG +static void print_device_node_tree (struct pci_dn *pdn, int dent) +{ + int i; + if (!pdn) return; + for (i=0;inode->name, pdn->eeh_mode, pdn->eeh_config_addr, + pdn->eeh_pe_config_addr, pdn->node->full_name); + dent += 3; + struct device_node *pc = pdn->node->child; + while (pc) { + print_device_node_tree(PCI_DN(pc), dent); + pc = pc->sibling; + } +} +#endif + +/** + * irq_in_use - return true if this irq is being used + */ +static int irq_in_use(unsigned int irq) +{ + int rc = 0; + unsigned long flags; + struct irq_desc *desc = irq_desc + irq; + + spin_lock_irqsave(&desc->lock, flags); + if (desc->action) + rc = 1; + spin_unlock_irqrestore(&desc->lock, flags); + return rc; +} + +/* ------------------------------------------------------- */ +/** eeh_report_error - report an EEH error to each device, + * collect up and merge the device responses. + */ + +static void eeh_report_error(struct pci_dev *dev, void *userdata) +{ + enum pcierr_result rc, *res = userdata; + struct pci_driver *driver = dev->driver; + + dev->error_state = pci_channel_io_frozen; + + if (!driver) + return; + + if (irq_in_use (dev->irq)) { + struct device_node *dn = pci_device_to_OF_node(dev); + PCI_DN(dn)->eeh_mode |= EEH_MODE_IRQ_DISABLED; + disable_irq_nosync(dev->irq); + } + if (!driver->err_handler) + return; + if (!driver->err_handler->error_detected) + return; + + rc = driver->err_handler->error_detected (dev, pci_channel_io_frozen); + if (*res == PCIERR_RESULT_NONE) *res = rc; + if (*res == PCIERR_RESULT_NEED_RESET) return; + if (*res == PCIERR_RESULT_DISCONNECT && + rc == PCIERR_RESULT_NEED_RESET) *res = rc; +} + +/** eeh_report_reset -- tell this device that the pci slot + * has been reset. + */ + +static void eeh_report_reset(struct pci_dev *dev, void *userdata) +{ + struct pci_driver *driver = dev->driver; + struct device_node *dn = pci_device_to_OF_node(dev); + + if (!driver) + return; + + if ((PCI_DN(dn)->eeh_mode) & EEH_MODE_IRQ_DISABLED) { + PCI_DN(dn)->eeh_mode &= ~EEH_MODE_IRQ_DISABLED; + enable_irq(dev->irq); + } + if (!driver->err_handler) + return; + if (!driver->err_handler->slot_reset) + return; + + driver->err_handler->slot_reset(dev); +} + +static void eeh_report_resume(struct pci_dev *dev, void *userdata) +{ + struct pci_driver *driver = dev->driver; + + dev->error_state = pci_channel_io_normal; + + if (!driver) + return; + if (!driver->err_handler) + return; + if (!driver->err_handler->resume) + return; + + driver->err_handler->resume(dev); +} + +static void eeh_report_failure(struct pci_dev *dev, void *userdata) +{ + struct pci_driver *driver = dev->driver; + + dev->error_state = pci_channel_io_perm_failure; + + if (!driver) + return; + + if (irq_in_use (dev->irq)) { + struct device_node *dn = pci_device_to_OF_node(dev); + PCI_DN(dn)->eeh_mode |= EEH_MODE_IRQ_DISABLED; + disable_irq_nosync(dev->irq); + } + if (!driver->err_handler) + return; + if (!driver->err_handler->error_detected) + return; + driver->err_handler->error_detected(dev, pci_channel_io_perm_failure); +} + +/* ------------------------------------------------------- */ +/** + * handle_eeh_events -- reset a PCI device after hard lockup. + * + * pSeries systems will isolate a PCI slot if the PCI-Host + * bridge detects address or data parity errors, DMA's + * occuring to wild addresses (which usually happen due to + * bugs in device drivers or in PCI adapter firmware). + * Slot isolations also occur if #SERR, #PERR or other misc + * PCI-related errors are detected. + * + * Recovery process consists of unplugging the device driver + * (which generated hotplug events to userspace), then issuing + * a PCI #RST to the device, then reconfiguring the PCI config + * space for all bridges & devices under this slot, and then + * finally restarting the device drivers (which cause a second + * set of hotplug events to go out to userspace). + */ + +/** + * eeh_reset_device() -- perform actual reset of a pci slot + * Args: bus: pointer to the pci bus structure corresponding + * to the isolated slot. A non-null value will + * cause all devices under the bus to be removed + * and then re-added. + * pe_dn: pointer to a "Partionable Endpoint" device node. + * This is the top-level structure on which pci + * bus resets can be performed. + */ + +static void eeh_reset_device (struct pci_dn *pe_dn, struct pci_bus *bus) +{ + if (bus) + pcibios_remove_pci_devices(bus); + + /* Reset the pci controller. (Asserts RST#; resets config space). + * Reconfigure bridges and devices */ + rtas_set_slot_reset(pe_dn); + + /* Walk over all functions on this device */ + rtas_configure_bridge(pe_dn); + eeh_restore_bars(pe_dn); + + /* Give the system 5 seconds to finish running the user-space + * hotplug shutdown scripts, e.g. ifdown for ethernet. Yes, + * this is a hack, but if we don't do this, and try to bring + * the device up before the scripts have taken it down, + * potentially weird things happen. + */ + if (bus) { + ssleep (5); + pcibios_add_pci_devices(bus); + } +} + +/* The longest amount of time to wait for a pci device + * to come back on line, in seconds. + */ +#define MAX_WAIT_FOR_RECOVERY 15 + +void handle_eeh_events (struct eeh_event *event) +{ + struct device_node *frozen_dn; + struct pci_dn *frozen_pdn; + struct pci_bus *frozen_bus; + int perm_failure = 0; + + frozen_dn = find_device_pe(event->dn); + frozen_bus = pcibios_find_pci_bus(frozen_dn); + + if (!frozen_dn) { + printk(KERN_ERR "EEH: Error: Cannot find partition endpoint for %s\n", + pci_name(event->dev)); + return; + } + + /* There are two different styles for coming up with the PE. + * In the old style, it was the highest EEH-capable device + * which was always an EADS pci bridge. In the new style, + * there might not be any EADS bridges, and even when there are, + * the firmware marks them as "EEH incapable". So another + * two-step is needed to find the pci bus.. */ + if (!frozen_bus) + frozen_bus = pcibios_find_pci_bus (frozen_dn->parent); + + if (!frozen_bus) { + printk(KERN_ERR "EEH: Cannot find PCI bus for %s\n", + frozen_dn->full_name); + return; + } + +#if 0 + /* We may get "permanent failure" messages on empty slots. + * These are false alarms. Empty slots have no child dn. */ + if ((event->state == pci_channel_io_perm_failure) && (frozen_device == NULL)) + return; +#endif + + frozen_pdn = PCI_DN(frozen_dn); + frozen_pdn->eeh_freeze_count++; + + if (frozen_pdn->eeh_freeze_count > EEH_MAX_ALLOWED_FREEZES) + perm_failure = 1; + + /* If the reset state is a '5' and the time to reset is 0 (infinity) + * or is more then 15 seconds, then mark this as a permanent failure. + */ + if ((event->state == pci_channel_io_perm_failure) && + ((event->time_unavail <= 0) || + (event->time_unavail > MAX_WAIT_FOR_RECOVERY*1000))) + { + perm_failure = 1; + } + + /* Log the error with the rtas logger. */ + if (perm_failure) { + /* + * About 90% of all real-life EEH failures in the field + * are due to poorly seated PCI cards. Only 10% or so are + * due to actual, failed cards. + */ + printk(KERN_ERR + "EEH: PCI device %s - %s has failed %d times \n" + "and has been permanently disabled. Please try reseating\n" + "this device or replacing it.\n", + pci_name (frozen_pdn->pcidev), + pcid_name(frozen_pdn->pcidev), + frozen_pdn->eeh_freeze_count); + + eeh_slot_error_detail(frozen_pdn, 2 /* Permanent Error */); + + /* Notify all devices that they're about to go down. */ + pci_walk_bus(frozen_bus, eeh_report_failure, 0); + + /* Shut down the device drivers for good. */ + pcibios_remove_pci_devices(frozen_bus); + return; + } + + eeh_slot_error_detail(frozen_pdn, 1 /* Temporary Error */); + printk(KERN_WARNING + "EEH: This PCI device has failed %d times since last reboot: %s - %s\n", + frozen_pdn->eeh_freeze_count, + pci_name (frozen_pdn->pcidev), + pcid_name(frozen_pdn->pcidev)); + + /* Walk the various device drivers attached to this slot through + * a reset sequence, giving each an opportunity to do what it needs + * to accomplish the reset. Each child gets a report of the + * status ... if any child can't handle the reset, then the entire + * slot is dlpar removed and added. + */ + enum pcierr_result result = PCIERR_RESULT_NONE; + pci_walk_bus(frozen_bus, eeh_report_error, &result); + + /* If all device drivers were EEH-unaware, then shut + * down all of the device drivers, and hope they + * go down willingly, without panicing the system. + */ + if (result == PCIERR_RESULT_NONE) { + eeh_reset_device(frozen_pdn, frozen_bus); + } + + /* If any device called out for a reset, then reset the slot */ + if (result == PCIERR_RESULT_NEED_RESET) { + eeh_reset_device(frozen_pdn, NULL); + pci_walk_bus(frozen_bus, eeh_report_reset, 0); + } + + /* If all devices reported they can proceed, the re-enable PIO */ + if (result == PCIERR_RESULT_CAN_RECOVER) { + /* XXX Not supported; we brute-force reset the device */ + eeh_reset_device(frozen_pdn, NULL); + pci_walk_bus(frozen_bus, eeh_report_reset, 0); + } + + /* Tell all device drivers that they can resume operations */ + pci_walk_bus(frozen_bus, eeh_report_resume, 0); +} + +/* ---------- end of file ---------- */ Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh_event.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/eeh_event.c 2005-11-02 14:32:35.731739983 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh_event.c 2005-11-02 14:41:18.440450652 -0600 @@ -21,6 +21,7 @@ #include #include #include +#include /** Overview: * EEH error states may be detected within exception handlers; @@ -37,31 +38,6 @@ DECLARE_WORK(eeh_event_wq, eeh_thread_launcher, NULL); /** - * eeh_panic - call panic() for an eeh event that cannot be handled. - * The philosophy of this routine is that it is better to panic and - * halt the OS than it is to risk possible data corruption by - * oblivious device drivers that don't know better. - * - * @dev pci device that had an eeh event - * @reset_state current reset state of the device slot - */ -static void eeh_panic(struct pci_dev *dev, int reset_state) -{ - /* - * Since the panic_on_oops sysctl is used to halt the system - * in light of potential corruption, we can use it here. - */ - if (panic_on_oops) { - panic("EEH: MMIO failure (%d) on device:%s\n", reset_state, - pci_name(dev)); - } - else { - printk(KERN_INFO "EEH: Ignored MMIO failure (%d) on device:%s\n", - reset_state, pci_name(dev)); - } -} - -/** * eeh_event_handler - dispatch EEH events. The detection of a frozen * slot can occur inside an interrupt, where it can be hard to do * anything about it. The goal of this routine is to pull these @@ -82,10 +58,16 @@ spin_lock_irqsave(&eeh_eventlist_lock, flags); event = NULL; + + /* Unqueue the event, get ready to process. */ if (!list_empty(&eeh_eventlist)) { event = list_entry(eeh_eventlist.next, struct eeh_event, list); list_del(&event->list); } + + if (event) + eeh_mark_slot(event->dn, EEH_MODE_RECOVERING); + spin_unlock_irqrestore(&eeh_eventlist_lock, flags); if (event == NULL) break; @@ -93,8 +75,11 @@ printk(KERN_INFO "EEH: Detected PCI bus error on device %s\n", pci_name(event->dev)); - eeh_panic (event->dev, event->state); + handle_eeh_events(event); + + eeh_clear_slot(event->dn, EEH_MODE_RECOVERING); + pci_dev_put(event->dev); kfree(event); } @@ -122,7 +107,7 @@ */ int eeh_send_failure_event (struct device_node *dn, struct pci_dev *dev, - int state, + enum pci_channel_state state, int time_unavail) { unsigned long flags; Index: linux-2.6.14-git3/include/asm-powerpc/eeh_event.h =================================================================== --- linux-2.6.14-git3.orig/include/asm-powerpc/eeh_event.h 2005-11-02 14:32:35.718741805 -0600 +++ linux-2.6.14-git3/include/asm-powerpc/eeh_event.h 2005-11-02 14:41:18.444450091 -0600 @@ -29,7 +29,7 @@ struct list_head list; struct device_node *dn; /* struct device node */ struct pci_dev *dev; /* affected device */ - int state; + enum pci_channel_state state; /* PCI bus state for the affected device */ int time_unavail; /* milliseconds until device might be available */ }; @@ -46,7 +46,10 @@ */ int eeh_send_failure_event (struct device_node *dn, struct pci_dev *dev, - int reset_state, + enum pci_channel_state state, int time_unavail); +/* Main recovery function */ +void handle_eeh_events (struct eeh_event *); + #endif /* ASM_PPC64_EEH_EVENT_H */ Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/Makefile =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/Makefile 2005-11-02 14:40:05.531674439 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/Makefile 2005-11-02 14:41:48.393250352 -0600 @@ -3,4 +3,4 @@ obj-$(CONFIG_SMP) += smp.o obj-$(CONFIG_IBMVIO) += vio.o obj-$(CONFIG_XICS) += xics.o -obj-$(CONFIG_EEH) += eeh.o eeh_event.o +obj-$(CONFIG_EEH) += eeh.o eeh_driver.o eeh_event.o Index: linux-2.6.14-git3/include/asm-powerpc/ppc-pci.h =================================================================== --- linux-2.6.14-git3.orig/include/asm-powerpc/ppc-pci.h 2005-11-02 14:35:39.295004776 -0600 +++ linux-2.6.14-git3/include/asm-powerpc/ppc-pci.h 2005-11-02 14:41:18.454448689 -0600 @@ -54,6 +54,15 @@ /* ---- EEH internal-use-only related routines ---- */ #ifdef CONFIG_EEH /** + * eeh_slot_error_detail -- record and EEH error condition to the log + * @severity: 1 if temporary, 2 if permanent failure. + * + * Obtains the the EEH error details from the RTAS subsystem, + * and then logs these details with the RTAS error log system. + */ +void eeh_slot_error_detail (struct pci_dn *pdn, int severity); + +/** * rtas_set_slot_reset -- unfreeze a frozen slot * * Clear the EEH-frozen condition on a slot. This routine Index: linux-2.6.14-git3/include/asm-ppc64/eeh.h =================================================================== --- linux-2.6.14-git3.orig/include/asm-ppc64/eeh.h 2005-11-02 14:36:41.263316362 -0600 +++ linux-2.6.14-git3/include/asm-ppc64/eeh.h 2005-11-02 14:41:18.461447707 -0600 @@ -31,9 +31,11 @@ #ifdef CONFIG_EEH /* Values for eeh_mode bits in device_node */ -#define EEH_MODE_SUPPORTED (1<<0) -#define EEH_MODE_NOCHECK (1<<1) -#define EEH_MODE_ISOLATED (1<<2) +#define EEH_MODE_SUPPORTED (1<<0) +#define EEH_MODE_NOCHECK (1<<1) +#define EEH_MODE_ISOLATED (1<<2) +#define EEH_MODE_RECOVERING (1<<3) +#define EEH_MODE_IRQ_DISABLED (1<<4) /* Max number of EEH freezes allowed before we consider the device * to be permanently disabled. */ From linas at linas.org Fri Nov 4 11:53:07 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:53:07 -0600 Subject: [PATCH 25/42]: ppc64: Split out PCI address cache to its own file References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005307.GA27041@mail.gnucash.org> 25-pci-address-cache.patch The core EEH files is rather large. This patch splits out a self-contained chunk of it into its own file. This is the chunk that performes the caching and lookup of pci devices based on the i/o addresses of thier resoures. This code is almos archiecture-independent and could be used by any system that wanted to find a pci device based only on the i/o address used by the device. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/Makefile =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/Makefile 2005-11-02 14:41:48.393250352 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/Makefile 2005-11-02 14:42:58.323443756 -0600 @@ -3,4 +3,4 @@ obj-$(CONFIG_SMP) += smp.o obj-$(CONFIG_IBMVIO) += vio.o obj-$(CONFIG_XICS) += xics.o -obj-$(CONFIG_EEH) += eeh.o eeh_driver.o eeh_event.o +obj-$(CONFIG_EEH) += eeh.o eeh_cache.o eeh_driver.o eeh_event.o Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:41:18.427452474 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:42:38.986155538 -0600 @@ -77,9 +77,6 @@ */ #define EEH_MAX_FAILS 100000 -/* Misc forward declaraions */ -static void eeh_save_bars(struct pci_dev * pdev, struct pci_dn *pdn); - /* RTAS tokens */ static int ibm_set_eeh_option; static int ibm_set_slot_reset; @@ -107,296 +104,8 @@ static DEFINE_PER_CPU(unsigned long, ignored_failures); static DEFINE_PER_CPU(unsigned long, slot_resets); -/** - * The pci address cache subsystem. This subsystem places - * PCI device address resources into a red-black tree, sorted - * according to the address range, so that given only an i/o - * address, the corresponding PCI device can be **quickly** - * found. It is safe to perform an address lookup in an interrupt - * context; this ability is an important feature. - * - * Currently, the only customer of this code is the EEH subsystem; - * thus, this code has been somewhat tailored to suit EEH better. - * In particular, the cache does *not* hold the addresses of devices - * for which EEH is not enabled. - * - * (Implementation Note: The RB tree seems to be better/faster - * than any hash algo I could think of for this problem, even - * with the penalty of slow pointer chases for d-cache misses). - */ -struct pci_io_addr_range -{ - struct rb_node rb_node; - unsigned long addr_lo; - unsigned long addr_hi; - struct pci_dev *pcidev; - unsigned int flags; -}; - -static struct pci_io_addr_cache -{ - struct rb_root rb_root; - spinlock_t piar_lock; -} pci_io_addr_cache_root; - -static inline struct pci_dev *__pci_get_device_by_addr(unsigned long addr) -{ - struct rb_node *n = pci_io_addr_cache_root.rb_root.rb_node; - - while (n) { - struct pci_io_addr_range *piar; - piar = rb_entry(n, struct pci_io_addr_range, rb_node); - - if (addr < piar->addr_lo) { - n = n->rb_left; - } else { - if (addr > piar->addr_hi) { - n = n->rb_right; - } else { - pci_dev_get(piar->pcidev); - return piar->pcidev; - } - } - } - - return NULL; -} - -/** - * pci_get_device_by_addr - Get device, given only address - * @addr: mmio (PIO) phys address or i/o port number - * - * Given an mmio phys address, or a port number, find a pci device - * that implements this address. Be sure to pci_dev_put the device - * when finished. I/O port numbers are assumed to be offset - * from zero (that is, they do *not* have pci_io_addr added in). - * It is safe to call this function within an interrupt. - */ -static struct pci_dev *pci_get_device_by_addr(unsigned long addr) -{ - struct pci_dev *dev; - unsigned long flags; - - spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags); - dev = __pci_get_device_by_addr(addr); - spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags); - return dev; -} - -#ifdef DEBUG -/* - * Handy-dandy debug print routine, does nothing more - * than print out the contents of our addr cache. - */ -static void pci_addr_cache_print(struct pci_io_addr_cache *cache) -{ - struct rb_node *n; - int cnt = 0; - - n = rb_first(&cache->rb_root); - while (n) { - struct pci_io_addr_range *piar; - piar = rb_entry(n, struct pci_io_addr_range, rb_node); - printk(KERN_DEBUG "PCI: %s addr range %d [%lx-%lx]: %s\n", - (piar->flags & IORESOURCE_IO) ? "i/o" : "mem", cnt, - piar->addr_lo, piar->addr_hi, pci_name(piar->pcidev)); - cnt++; - n = rb_next(n); - } -} -#endif - -/* Insert address range into the rb tree. */ -static struct pci_io_addr_range * -pci_addr_cache_insert(struct pci_dev *dev, unsigned long alo, - unsigned long ahi, unsigned int flags) -{ - struct rb_node **p = &pci_io_addr_cache_root.rb_root.rb_node; - struct rb_node *parent = NULL; - struct pci_io_addr_range *piar; - - /* Walk tree, find a place to insert into tree */ - while (*p) { - parent = *p; - piar = rb_entry(parent, struct pci_io_addr_range, rb_node); - if (ahi < piar->addr_lo) { - p = &parent->rb_left; - } else if (alo > piar->addr_hi) { - p = &parent->rb_right; - } else { - if (dev != piar->pcidev || - alo != piar->addr_lo || ahi != piar->addr_hi) { - printk(KERN_WARNING "PIAR: overlapping address range\n"); - } - return piar; - } - } - piar = (struct pci_io_addr_range *)kmalloc(sizeof(struct pci_io_addr_range), GFP_ATOMIC); - if (!piar) - return NULL; - - piar->addr_lo = alo; - piar->addr_hi = ahi; - piar->pcidev = dev; - piar->flags = flags; - -#ifdef DEBUG - printk(KERN_DEBUG "PIAR: insert range=[%lx:%lx] dev=%s\n", - alo, ahi, pci_name (dev)); -#endif - - rb_link_node(&piar->rb_node, parent, p); - rb_insert_color(&piar->rb_node, &pci_io_addr_cache_root.rb_root); - - return piar; -} - -static void __pci_addr_cache_insert_device(struct pci_dev *dev) -{ - struct device_node *dn; - struct pci_dn *pdn; - int i; - int inserted = 0; - - dn = pci_device_to_OF_node(dev); - if (!dn) { - printk(KERN_WARNING "PCI: no pci dn found for dev=%s\n", pci_name(dev)); - return; - } - - /* Skip any devices for which EEH is not enabled. */ - pdn = PCI_DN(dn); - if (!(pdn->eeh_mode & EEH_MODE_SUPPORTED) || - pdn->eeh_mode & EEH_MODE_NOCHECK) { -#ifdef DEBUG - printk(KERN_INFO "PCI: skip building address cache for=%s - %s\n", - pci_name(dev), pdn->node->full_name); -#endif - return; - } - - /* The cache holds a reference to the device... */ - pci_dev_get(dev); - - /* Walk resources on this device, poke them into the tree */ - for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) { - unsigned long start = pci_resource_start(dev,i); - unsigned long end = pci_resource_end(dev,i); - unsigned int flags = pci_resource_flags(dev,i); - - /* We are interested only bus addresses, not dma or other stuff */ - if (0 == (flags & (IORESOURCE_IO | IORESOURCE_MEM))) - continue; - if (start == 0 || ~start == 0 || end == 0 || ~end == 0) - continue; - pci_addr_cache_insert(dev, start, end, flags); - inserted = 1; - } - - /* If there was nothing to add, the cache has no reference... */ - if (!inserted) - pci_dev_put(dev); -} - -/** - * pci_addr_cache_insert_device - Add a device to the address cache - * @dev: PCI device whose I/O addresses we are interested in. - * - * In order to support the fast lookup of devices based on addresses, - * we maintain a cache of devices that can be quickly searched. - * This routine adds a device to that cache. - */ -static void pci_addr_cache_insert_device(struct pci_dev *dev) -{ - unsigned long flags; - - spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags); - __pci_addr_cache_insert_device(dev); - spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags); -} - -static inline void __pci_addr_cache_remove_device(struct pci_dev *dev) -{ - struct rb_node *n; - int removed = 0; - -restart: - n = rb_first(&pci_io_addr_cache_root.rb_root); - while (n) { - struct pci_io_addr_range *piar; - piar = rb_entry(n, struct pci_io_addr_range, rb_node); - - if (piar->pcidev == dev) { - rb_erase(n, &pci_io_addr_cache_root.rb_root); - removed = 1; - kfree(piar); - goto restart; - } - n = rb_next(n); - } - - /* The cache no longer holds its reference to this device... */ - if (removed) - pci_dev_put(dev); -} - -/** - * pci_addr_cache_remove_device - remove pci device from addr cache - * @dev: device to remove - * - * Remove a device from the addr-cache tree. - * This is potentially expensive, since it will walk - * the tree multiple times (once per resource). - * But so what; device removal doesn't need to be that fast. - */ -static void pci_addr_cache_remove_device(struct pci_dev *dev) -{ - unsigned long flags; - - spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags); - __pci_addr_cache_remove_device(dev); - spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags); -} - -/** - * pci_addr_cache_build - Build a cache of I/O addresses - * - * Build a cache of pci i/o addresses. This cache will be used to - * find the pci device that corresponds to a given address. - * This routine scans all pci busses to build the cache. - * Must be run late in boot process, after the pci controllers - * have been scaned for devices (after all device resources are known). - */ -void __init pci_addr_cache_build(void) -{ - struct device_node *dn; - struct pci_dev *dev = NULL; - - if (!eeh_subsystem_enabled) - return; - - spin_lock_init(&pci_io_addr_cache_root.piar_lock); - - while ((dev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) { - /* Ignore PCI bridges ( XXX why ??) */ - if ((dev->class >> 16) == PCI_BASE_CLASS_BRIDGE) { - continue; - } - pci_addr_cache_insert_device(dev); - - /* Save the BAR's; firmware doesn't restore these after EEH reset */ - dn = pci_device_to_OF_node(dev); - eeh_save_bars(dev, PCI_DN(dn)); - } - -#ifdef DEBUG - /* Verify tree built up above, echo back the list of addrs. */ - pci_addr_cache_print(&pci_io_addr_cache_root); -#endif -} - /* --------------------------------------------------------------- */ -/* Above lies the PCI Address Cache. Below lies the EEH event infrastructure */ +/* Below lies the EEH event infrastructure */ void eeh_slot_error_detail (struct pci_dn *pdn, int severity) { @@ -880,7 +589,7 @@ * PCI devices are added individuallly; but, for the restore, * an entire slot is reset at a time. */ -static void eeh_save_bars(struct pci_dev * pdev, struct pci_dn *pdn) +void eeh_save_bars(struct pci_dev * pdev, struct pci_dn *pdn) { int i; Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh_cache.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh_cache.c 2005-11-02 14:42:38.994154417 -0600 @@ -0,0 +1,317 @@ +/* + * eeh_cache.c + * PCI address cache; allows the lookup of PCI devices based on I/O address + * + * Copyright (C) 2004 Linas Vepstas IBM Corporation + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#undef DEBUG + +/** + * The pci address cache subsystem. This subsystem places + * PCI device address resources into a red-black tree, sorted + * according to the address range, so that given only an i/o + * address, the corresponding PCI device can be **quickly** + * found. It is safe to perform an address lookup in an interrupt + * context; this ability is an important feature. + * + * Currently, the only customer of this code is the EEH subsystem; + * thus, this code has been somewhat tailored to suit EEH better. + * In particular, the cache does *not* hold the addresses of devices + * for which EEH is not enabled. + * + * (Implementation Note: The RB tree seems to be better/faster + * than any hash algo I could think of for this problem, even + * with the penalty of slow pointer chases for d-cache misses). + */ +struct pci_io_addr_range +{ + struct rb_node rb_node; + unsigned long addr_lo; + unsigned long addr_hi; + struct pci_dev *pcidev; + unsigned int flags; +}; + +static struct pci_io_addr_cache +{ + struct rb_root rb_root; + spinlock_t piar_lock; +} pci_io_addr_cache_root; + +static inline struct pci_dev *__pci_get_device_by_addr(unsigned long addr) +{ + struct rb_node *n = pci_io_addr_cache_root.rb_root.rb_node; + + while (n) { + struct pci_io_addr_range *piar; + piar = rb_entry(n, struct pci_io_addr_range, rb_node); + + if (addr < piar->addr_lo) { + n = n->rb_left; + } else { + if (addr > piar->addr_hi) { + n = n->rb_right; + } else { + pci_dev_get(piar->pcidev); + return piar->pcidev; + } + } + } + + return NULL; +} + +/** + * pci_get_device_by_addr - Get device, given only address + * @addr: mmio (PIO) phys address or i/o port number + * + * Given an mmio phys address, or a port number, find a pci device + * that implements this address. Be sure to pci_dev_put the device + * when finished. I/O port numbers are assumed to be offset + * from zero (that is, they do *not* have pci_io_addr added in). + * It is safe to call this function within an interrupt. + */ +struct pci_dev *pci_get_device_by_addr(unsigned long addr) +{ + struct pci_dev *dev; + unsigned long flags; + + spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags); + dev = __pci_get_device_by_addr(addr); + spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags); + return dev; +} + +#ifdef DEBUG +/* + * Handy-dandy debug print routine, does nothing more + * than print out the contents of our addr cache. + */ +static void pci_addr_cache_print(struct pci_io_addr_cache *cache) +{ + struct rb_node *n; + int cnt = 0; + + n = rb_first(&cache->rb_root); + while (n) { + struct pci_io_addr_range *piar; + piar = rb_entry(n, struct pci_io_addr_range, rb_node); + printk(KERN_DEBUG "PCI: %s addr range %d [%lx-%lx]: %s\n", + (piar->flags & IORESOURCE_IO) ? "i/o" : "mem", cnt, + piar->addr_lo, piar->addr_hi, pci_name(piar->pcidev)); + cnt++; + n = rb_next(n); + } +} +#endif + +/* Insert address range into the rb tree. */ +static struct pci_io_addr_range * +pci_addr_cache_insert(struct pci_dev *dev, unsigned long alo, + unsigned long ahi, unsigned int flags) +{ + struct rb_node **p = &pci_io_addr_cache_root.rb_root.rb_node; + struct rb_node *parent = NULL; + struct pci_io_addr_range *piar; + + /* Walk tree, find a place to insert into tree */ + while (*p) { + parent = *p; + piar = rb_entry(parent, struct pci_io_addr_range, rb_node); + if (ahi < piar->addr_lo) { + p = &parent->rb_left; + } else if (alo > piar->addr_hi) { + p = &parent->rb_right; + } else { + if (dev != piar->pcidev || + alo != piar->addr_lo || ahi != piar->addr_hi) { + printk(KERN_WARNING "PIAR: overlapping address range\n"); + } + return piar; + } + } + piar = (struct pci_io_addr_range *)kmalloc(sizeof(struct pci_io_addr_range), GFP_ATOMIC); + if (!piar) + return NULL; + + piar->addr_lo = alo; + piar->addr_hi = ahi; + piar->pcidev = dev; + piar->flags = flags; + +#ifdef DEBUG + printk(KERN_DEBUG "PIAR: insert range=[%lx:%lx] dev=%s\n", + alo, ahi, pci_name (dev)); +#endif + + rb_link_node(&piar->rb_node, parent, p); + rb_insert_color(&piar->rb_node, &pci_io_addr_cache_root.rb_root); + + return piar; +} + +static void __pci_addr_cache_insert_device(struct pci_dev *dev) +{ + struct device_node *dn; + struct pci_dn *pdn; + int i; + int inserted = 0; + + dn = pci_device_to_OF_node(dev); + if (!dn) { + printk(KERN_WARNING "PCI: no pci dn found for dev=%s\n", pci_name(dev)); + return; + } + + /* Skip any devices for which EEH is not enabled. */ + pdn = PCI_DN(dn); + if (!(pdn->eeh_mode & EEH_MODE_SUPPORTED) || + pdn->eeh_mode & EEH_MODE_NOCHECK) { +#ifdef DEBUG + printk(KERN_INFO "PCI: skip building address cache for=%s - %s\n", + pci_name(dev), pdn->node->full_name); +#endif + return; + } + + /* The cache holds a reference to the device... */ + pci_dev_get(dev); + + /* Walk resources on this device, poke them into the tree */ + for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) { + unsigned long start = pci_resource_start(dev,i); + unsigned long end = pci_resource_end(dev,i); + unsigned int flags = pci_resource_flags(dev,i); + + /* We are interested only bus addresses, not dma or other stuff */ + if (0 == (flags & (IORESOURCE_IO | IORESOURCE_MEM))) + continue; + if (start == 0 || ~start == 0 || end == 0 || ~end == 0) + continue; + pci_addr_cache_insert(dev, start, end, flags); + inserted = 1; + } + + /* If there was nothing to add, the cache has no reference... */ + if (!inserted) + pci_dev_put(dev); +} + +/** + * pci_addr_cache_insert_device - Add a device to the address cache + * @dev: PCI device whose I/O addresses we are interested in. + * + * In order to support the fast lookup of devices based on addresses, + * we maintain a cache of devices that can be quickly searched. + * This routine adds a device to that cache. + */ +void pci_addr_cache_insert_device(struct pci_dev *dev) +{ + unsigned long flags; + + spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags); + __pci_addr_cache_insert_device(dev); + spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags); +} + +static inline void __pci_addr_cache_remove_device(struct pci_dev *dev) +{ + struct rb_node *n; + int removed = 0; + +restart: + n = rb_first(&pci_io_addr_cache_root.rb_root); + while (n) { + struct pci_io_addr_range *piar; + piar = rb_entry(n, struct pci_io_addr_range, rb_node); + + if (piar->pcidev == dev) { + rb_erase(n, &pci_io_addr_cache_root.rb_root); + removed = 1; + kfree(piar); + goto restart; + } + n = rb_next(n); + } + + /* The cache no longer holds its reference to this device... */ + if (removed) + pci_dev_put(dev); +} + +/** + * pci_addr_cache_remove_device - remove pci device from addr cache + * @dev: device to remove + * + * Remove a device from the addr-cache tree. + * This is potentially expensive, since it will walk + * the tree multiple times (once per resource). + * But so what; device removal doesn't need to be that fast. + */ +void pci_addr_cache_remove_device(struct pci_dev *dev) +{ + unsigned long flags; + + spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags); + __pci_addr_cache_remove_device(dev); + spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags); +} + +/** + * pci_addr_cache_build - Build a cache of I/O addresses + * + * Build a cache of pci i/o addresses. This cache will be used to + * find the pci device that corresponds to a given address. + * This routine scans all pci busses to build the cache. + * Must be run late in boot process, after the pci controllers + * have been scaned for devices (after all device resources are known). + */ +void __init pci_addr_cache_build(void) +{ + struct device_node *dn; + struct pci_dev *dev = NULL; + + spin_lock_init(&pci_io_addr_cache_root.piar_lock); + + while ((dev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) { + /* Ignore PCI bridges */ + if ((dev->class >> 16) == PCI_BASE_CLASS_BRIDGE) + continue; + + pci_addr_cache_insert_device(dev); + + /* Save the BAR's; firmware doesn't restore these after EEH reset */ + dn = pci_device_to_OF_node(dev); + eeh_save_bars(dev, PCI_DN(dn)); + } + +#ifdef DEBUG + /* Verify tree built up above, echo back the list of addrs. */ + pci_addr_cache_print(&pci_io_addr_cache_root); +#endif +} + Index: linux-2.6.14-git3/include/asm-powerpc/ppc-pci.h =================================================================== --- linux-2.6.14-git3.orig/include/asm-powerpc/ppc-pci.h 2005-11-02 14:41:18.454448689 -0600 +++ linux-2.6.14-git3/include/asm-powerpc/ppc-pci.h 2005-11-02 14:42:38.998153856 -0600 @@ -53,6 +53,14 @@ /* ---- EEH internal-use-only related routines ---- */ #ifdef CONFIG_EEH + +void pci_addr_cache_insert_device(struct pci_dev *dev); +void pci_addr_cache_remove_device(struct pci_dev *dev); +void pci_addr_cache_build(void); +struct pci_dev *pci_get_device_by_addr(unsigned long addr); + +void eeh_save_bars(struct pci_dev * pdev, struct pci_dn *pdn); + /** * eeh_slot_error_detail -- record and EEH error condition to the log * @severity: 1 if temporary, 2 if permanent failure. From linas at linas.org Fri Nov 4 11:53:20 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:53:20 -0600 Subject: [PATCH 26/42]: ppc64: Add "partion endpoint" support References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005320.GA27049@mail.gnucash.org> 26-eeh-partition-endpoint.patch New versions of firmware introduce a new method by which the "partition endpoint" (the point at which the pci bus is cut). This code adds the support for this (mandatory) new feature. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:42:38.986155538 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:43:49.212307192 -0600 @@ -83,6 +83,7 @@ static int ibm_read_slot_reset_state; static int ibm_read_slot_reset_state2; static int ibm_slot_error_detail; +static int ibm_get_config_addr_info; static int eeh_subsystem_enabled; @@ -457,6 +458,7 @@ static void rtas_pci_slot_reset(struct pci_dn *pdn, int state) { + int config_addr; int rc; BUG_ON (pdn==NULL); @@ -467,8 +469,13 @@ return; } + /* Use PE configuration address, if present */ + config_addr = pdn->eeh_config_addr; + if (pdn->eeh_pe_config_addr) + config_addr = pdn->eeh_pe_config_addr; + rc = rtas_call(ibm_set_slot_reset,4,1, NULL, - pdn->eeh_config_addr, + config_addr, BUID_HI(pdn->phb->buid), BUID_LO(pdn->phb->buid), state); @@ -695,8 +702,22 @@ eeh_subsystem_enabled = 1; pdn->eeh_mode |= EEH_MODE_SUPPORTED; pdn->eeh_config_addr = regs[0]; + + /* If the newer, better, ibm,get-config-addr-info is supported, + * then use that instead. */ + pdn->eeh_pe_config_addr = 0; + if (ibm_get_config_addr_info != RTAS_UNKNOWN_SERVICE) { + unsigned int rets[2]; + ret = rtas_call (ibm_get_config_addr_info, 4, 2, rets, + pdn->eeh_config_addr, + info->buid_hi, info->buid_lo, + 0); + if (ret == 0) + pdn->eeh_pe_config_addr = rets[0]; + } #ifdef DEBUG - printk(KERN_DEBUG "EEH: %s: eeh enabled\n", dn->full_name); + printk(KERN_DEBUG "EEH: %s: eeh enabled, config=%x pe_config=%x\n", + dn->full_name, pdn->eeh_config_addr, pdn->eeh_pe_config_addr); #endif } else { @@ -748,6 +769,7 @@ ibm_read_slot_reset_state2 = rtas_token("ibm,read-slot-reset-state2"); ibm_read_slot_reset_state = rtas_token("ibm,read-slot-reset-state"); ibm_slot_error_detail = rtas_token("ibm,slot-error-detail"); + ibm_get_config_addr_info = rtas_token("ibm,get-config-addr-info"); if (ibm_set_eeh_option == RTAS_UNKNOWN_SERVICE) return; Index: linux-2.6.14-git3/include/asm-ppc64/pci-bridge.h =================================================================== --- linux-2.6.14-git3.orig/include/asm-ppc64/pci-bridge.h 2005-11-02 14:39:24.755392219 -0600 +++ linux-2.6.14-git3/include/asm-ppc64/pci-bridge.h 2005-11-02 14:43:49.218306351 -0600 @@ -63,6 +63,7 @@ int devfn; /* for pci devices */ int eeh_mode; /* See eeh.h for possible EEH_MODEs */ int eeh_config_addr; + int eeh_pe_config_addr; /* new-style partition endpoint address */ int eeh_check_count; /* # times driver ignored error */ int eeh_freeze_count; /* # times this device froze up. */ int eeh_is_bridge; /* device is pci-to-pci bridge */ From linas at linas.org Fri Nov 4 11:53:36 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:53:36 -0600 Subject: [PATCH 27/42]: SCSI: add PCI error recovery to IPR dev driver References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005336.GA27057@mail.gnucash.org> 27-pci-error-recovery_IPR-driver.patch Subject: PCI Error Recovery: IPR SCSI device driver Various PCI bus errors can be signaled by newer PCI controllers. This patch adds the PCI error recovery callbacks to the IPR SCSI device driver. The patch has been tested, and appears to work well. Signed-off-by: Linas Vepstas Signed-off-by: Brian King -- Index: linux-2.6.14-git3/drivers/scsi/ipr.c =================================================================== --- linux-2.6.14-git3.orig/drivers/scsi/ipr.c 2005-11-02 14:28:53.284922999 -0600 +++ linux-2.6.14-git3/drivers/scsi/ipr.c 2005-11-02 14:43:52.782806465 -0600 @@ -5328,6 +5328,94 @@ shutdown_type); } +/* --------------- PCI Error Recovery infrastructure ----------- */ +/** If the PCI slot is frozen, hold off all i/o + * activity; then, as soon as the slot is available again, + * initiate an adapter reset. + */ +static int ipr_reset_freeze(struct ipr_cmnd *ipr_cmd) +{ + /* Disallow new interrupts, avoid loop */ + ipr_cmd->ioa_cfg->allow_interrupts = 0; + list_add_tail(&ipr_cmd->queue, &ipr_cmd->ioa_cfg->pending_q); + ipr_cmd->done = ipr_reset_ioa_job; + return IPR_RC_JOB_RETURN; +} + +/** ipr_eeh_frozen -- called when slot has experience PCI bus error. + * This routine is called to tell us that the PCI bus is down. + * Can't do anything here, except put the device driver into a + * holding pattern, waiting for the PCI bus to come back. + */ +static void ipr_eeh_frozen (struct pci_dev *pdev) +{ + unsigned long flags = 0; + struct ipr_ioa_cfg *ioa_cfg = pci_get_drvdata(pdev); + + spin_lock_irqsave(ioa_cfg->host->host_lock, flags); + _ipr_initiate_ioa_reset(ioa_cfg, ipr_reset_freeze, IPR_SHUTDOWN_NONE); + spin_unlock_irqrestore(ioa_cfg->host->host_lock, flags); +} + +/** ipr_eeh_slot_reset - called when pci slot has been reset. + * + * This routine is called by the pci error recovery recovery + * code after the PCI slot has been reset, just before we + * should resume normal operations. + */ +static int ipr_eeh_slot_reset(struct pci_dev *pdev) +{ + unsigned long flags = 0; + struct ipr_ioa_cfg *ioa_cfg = pci_get_drvdata(pdev); + + // pci_enable_device(pdev); + // pci_set_master(pdev); + spin_lock_irqsave(ioa_cfg->host->host_lock, flags); + _ipr_initiate_ioa_reset(ioa_cfg, ipr_reset_restore_cfg_space, + IPR_SHUTDOWN_NONE); + spin_unlock_irqrestore(ioa_cfg->host->host_lock, flags); + + return PCIERR_RESULT_RECOVERED; +} + +/** This routine is called when the PCI bus has permanently + * failed. This routine should purge all pending I/O and + * shut down the device driver (close and unload). + */ +static void ipr_eeh_perm_failure(struct pci_dev *pdev) +{ + unsigned long flags = 0; + struct ipr_ioa_cfg *ioa_cfg = pci_get_drvdata(pdev); + + spin_lock_irqsave(ioa_cfg->host->host_lock, flags); + if (ioa_cfg->sdt_state == WAIT_FOR_DUMP) + ioa_cfg->sdt_state = ABORT_DUMP; + ioa_cfg->reset_retries = IPR_NUM_RESET_RELOAD_RETRIES; + ioa_cfg->in_ioa_bringdown = 1; + ipr_initiate_ioa_reset(ioa_cfg, IPR_SHUTDOWN_NONE); + spin_unlock_irqrestore(ioa_cfg->host->host_lock, flags); +} + +static int ipr_eeh_error_detected(struct pci_dev *pdev, + enum pci_channel_state state) +{ + switch (state) { + case pci_channel_io_frozen: + ipr_eeh_frozen (pdev); + return PCIERR_RESULT_NEED_RESET; + + case pci_channel_io_perm_failure: + ipr_eeh_perm_failure (pdev); + return PCIERR_RESULT_DISCONNECT; + break; + default: + break; + } + return PCIERR_RESULT_NEED_RESET; +} + +/* ------------- end of PCI Error Recovery suport ----------- */ + /** * ipr_probe_ioa_part2 - Initializes IOAs found in ipr_probe_ioa(..) * @ioa_cfg: ioa cfg struct @@ -6065,12 +6153,18 @@ }; MODULE_DEVICE_TABLE(pci, ipr_pci_table); +static struct pci_error_handlers ipr_err_handler = { + .error_detected = ipr_eeh_error_detected, + .slot_reset = ipr_eeh_slot_reset, +}; + static struct pci_driver ipr_driver = { .name = IPR_NAME, .id_table = ipr_pci_table, .probe = ipr_probe, .remove = ipr_remove, .shutdown = ipr_shutdown, + .err_handler = &ipr_err_handler, }; /** From linas at linas.org Fri Nov 4 11:53:46 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:53:46 -0600 Subject: [PATCH 28/42]: SCSI: add PCI error recovery to Symbios dev driver References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005346.GA27066@mail.gnucash.org> Various PCI bus errors can be signaled by newer PCI controllers. This patch adds the PCI error recovery callbacks to the Symbios SCSI device driver. The patch has been tested, and appears to work well. Signed-off-by: Linas Vepstas -- Index: linux-2.6.14-git3/drivers/scsi/sym53c8xx_2/sym_glue.c =================================================================== --- linux-2.6.14-git3.orig/drivers/scsi/sym53c8xx_2/sym_glue.c 2005-11-02 14:28:52.512031337 -0600 +++ linux-2.6.14-git3/drivers/scsi/sym53c8xx_2/sym_glue.c 2005-11-02 14:43:56.084343457 -0600 @@ -686,6 +686,10 @@ if (DEBUG_FLAGS & DEBUG_TINY) printf_debug ("["); + /* Avoid spinloop trying to handle interrupts on frozen device */ + if (np->s.io_state != pci_channel_io_normal) + return IRQ_HANDLED; + spin_lock_irqsave(np->s.host->host_lock, flags); sym_interrupt(np); spin_unlock_irqrestore(np->s.host->host_lock, flags); @@ -759,6 +763,25 @@ */ static void sym_eh_timeout(u_long p) { __sym_eh_done((struct scsi_cmnd *)p, 1); } +static void sym_eeh_timeout(u_long p) +{ + struct sym_eh_wait *ep = (struct sym_eh_wait *) p; + if (!ep) + return; + complete(&ep->done); +} + +static void sym_eeh_done(struct sym_eh_wait *ep) +{ + if (!ep) + return; + ep->timed_out = 0; + if (!del_timer(&ep->timer)) + return; + + complete(&ep->done); +} + /* * Generic method for our eh processing. * The 'op' argument tells what we have to do. @@ -799,6 +822,35 @@ /* Try to proceed the operation we have been asked for */ sts = -1; + + /* We may be in an error condition because the PCI bus + * went down. In this case, we need to wait until the + * PCI bus is reset, the card is reset, and only then + * proceed with the scsi error recovery. We'll wait + * for 15 seconds for this to happen. + */ +#define WAIT_FOR_PCI_RECOVERY 15 + if (np->s.io_state != pci_channel_io_normal) { + struct sym_eh_wait eeh, *eep = &eeh; + np->s.io_reset_wait = eep; + init_completion(&eep->done); + init_timer(&eep->timer); + eep->to_do = SYM_EH_DO_WAIT; + eep->timer.expires = jiffies + (WAIT_FOR_PCI_RECOVERY*HZ); + eep->timer.function = sym_eeh_timeout; + eep->timer.data = (u_long)eep; + eep->timed_out = 1; /* Be pessimistic for once :) */ + add_timer(&eep->timer); + spin_unlock_irq(np->s.host->host_lock); + wait_for_completion(&eep->done); + spin_lock_irq(np->s.host->host_lock); + if (eep->timed_out) { + printk (KERN_ERR "%s: Timed out waiting for PCI reset\n", + sym_name(np)); + } + np->s.io_reset_wait = NULL; + } + switch(op) { case SYM_EH_ABORT: sts = sym_abort_scsiio(np, cmd, 1); @@ -1584,6 +1636,8 @@ np->maxoffs = dev->chip.offset_max; np->maxburst = dev->chip.burst_max; np->myaddr = dev->host_id; + np->s.io_state = pci_channel_io_normal; + np->s.io_reset_wait = NULL; /* * Edit its name. @@ -1916,6 +1970,58 @@ return 1; } +/* ------------- PCI Error Recovery infrastructure -------------- */ +/** sym2_io_error_detected() is called when PCI error is detected */ +static int sym2_io_error_detected (struct pci_dev *pdev, enum pci_channel_state state) +{ + struct sym_hcb *np = pci_get_drvdata(pdev); + + np->s.io_state = state; + // XXX If slot is permanently frozen, then what? + // Should we scsi_remove_host() maybe ?? + + /* Request a slot slot reset. */ + return PCIERR_RESULT_NEED_RESET; +} + +/** sym2_io_slot_reset is called when the pci bus has been reset. + * Restart the card from scratch. */ +static int sym2_io_slot_reset (struct pci_dev *pdev) +{ + struct sym_hcb *np = pci_get_drvdata(pdev); + + printk (KERN_INFO "%s: recovering from a PCI slot reset\n", + sym_name(np)); + + if (pci_enable_device(pdev)) + printk (KERN_ERR "%s: device setup failed most egregiously\n", + sym_name(np)); + + pci_set_master(pdev); + enable_irq (pdev->irq); + + /* Perform host reset only on one instance of the card */ + if (0 == PCI_FUNC (pdev->devfn)) + sym_reset_scsi_bus(np, 0); + + return PCIERR_RESULT_RECOVERED; +} + +/** sym2_io_resume is called when the error recovery driver + * tells us that its OK to resume normal operation. + */ +static void sym2_io_resume (struct pci_dev *pdev) +{ + struct sym_hcb *np = pci_get_drvdata(pdev); + + /* Perform device startup only once for this card. */ + if (0 == PCI_FUNC (pdev->devfn)) + sym_start_up (np, 1); + + np->s.io_state = pci_channel_io_normal; + sym_eeh_done (np->s.io_reset_wait); +} + /* * Driver host template. */ @@ -2169,11 +2275,18 @@ MODULE_DEVICE_TABLE(pci, sym2_id_table); +static struct pci_error_handlers sym2_err_handler = { + .error_detected = sym2_io_error_detected, + .slot_reset = sym2_io_slot_reset, + .resume = sym2_io_resume, +}; + static struct pci_driver sym2_driver = { .name = NAME53C8XX, .id_table = sym2_id_table, .probe = sym2_probe, .remove = __devexit_p(sym2_remove), + .err_handler = &sym2_err_handler, }; static int __init sym2_init(void) Index: linux-2.6.14-git3/drivers/scsi/sym53c8xx_2/sym_glue.h =================================================================== --- linux-2.6.14-git3.orig/drivers/scsi/sym53c8xx_2/sym_glue.h 2005-11-02 14:28:52.513031197 -0600 +++ linux-2.6.14-git3/drivers/scsi/sym53c8xx_2/sym_glue.h 2005-11-02 14:43:56.089342756 -0600 @@ -181,6 +181,10 @@ char chip_name[8]; struct pci_dev *device; + /* pci bus i/o state; waiter for clearing of i/o state */ + enum pci_channel_state io_state; + struct sym_eh_wait *io_reset_wait; + struct Scsi_Host *host; void __iomem * ioaddr; /* MMIO kernel io address */ Index: linux-2.6.14-git3/drivers/scsi/sym53c8xx_2/sym_hipd.c =================================================================== --- linux-2.6.14-git3.orig/drivers/scsi/sym53c8xx_2/sym_hipd.c 2005-11-02 14:28:52.513031197 -0600 +++ linux-2.6.14-git3/drivers/scsi/sym53c8xx_2/sym_hipd.c 2005-11-02 14:43:56.141335464 -0600 @@ -2809,6 +2809,7 @@ u_char istat, istatc; u_char dstat; u_short sist; + u_int icnt; /* * interrupt on the fly ? @@ -2850,6 +2851,7 @@ sist = 0; dstat = 0; istatc = istat; + icnt = 0; do { if (istatc & SIP) sist |= INW(np, nc_sist); @@ -2857,6 +2859,19 @@ dstat |= INB(np, nc_dstat); istatc = INB(np, nc_istat); istat |= istatc; + + /* Prevent deadlock waiting on a condition that may never clear. */ + /* XXX this is a temporary kludge; the correct to detect + * a PCI bus error would be to use the io_check interfaces + * proposed by Hidetoshi Seto + * Problem with polling like that is the state flag might not + * be set. + */ + icnt ++; + if (100 < icnt) { + if (np->s.device->error_state != pci_channel_io_normal) + return; + } } while (istatc & (SIP|DIP)); if (DEBUG_FLAGS & DEBUG_TINY) From linas at linas.org Fri Nov 4 11:53:53 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:53:53 -0600 Subject: [PATCH 29/42]: ethernet: add PCI error recovery to e100 dev driver References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005353.GA27074@mail.gnucash.org> Various PCI bus errors can be signaled by newer PCI controllers. This patch adds the PCI error recovery callbacks to the intel ethernet e100 device driver. The patch has been tested, and appears to work well. Signed-off-by: Linas Vepstas -- Index: linux-2.6.14-git3/drivers/net/e100.c =================================================================== --- linux-2.6.14-git3.orig/drivers/net/e100.c 2005-11-02 14:28:51.524169808 -0600 +++ linux-2.6.14-git3/drivers/net/e100.c 2005-11-02 14:43:58.890949857 -0600 @@ -2465,6 +2465,75 @@ } +/* ------------------ PCI Error Recovery infrastructure -------------- */ +/** e100_io_error_detected() is called when PCI error is detected */ +static int e100_io_error_detected(struct pci_dev *pdev, enum pci_channel_state state) +{ + struct net_device *netdev = pci_get_drvdata(pdev); + + /* Same as calling e100_down(netdev_priv(netdev)), but generic */ + netdev->stop(netdev); + + /* Is a detach needed ?? */ + // netif_device_detach(netdev); + + /* Request a slot reset. */ + return PCIERR_RESULT_NEED_RESET; +} + +/** e100_io_slot_reset is called after the pci bus has been reset. + * Restart the card from scratch. */ +static int e100_io_slot_reset(struct pci_dev *pdev) +{ + struct net_device *netdev = pci_get_drvdata(pdev); + struct nic *nic = netdev_priv(netdev); + + if(pci_enable_device(pdev)) { + printk(KERN_ERR "e100: Cannot re-enable PCI device after reset.\n"); + return PCIERR_RESULT_DISCONNECT; + } + pci_set_master(pdev); + + /* Only one device per card can do a reset */ + if (0 != PCI_FUNC (pdev->devfn)) + return PCIERR_RESULT_RECOVERED; + + e100_hw_reset(nic); + e100_phy_init(nic); + + if(e100_hw_init(nic)) { + DPRINTK(HW, ERR, "e100_hw_init failed\n"); + return PCIERR_RESULT_DISCONNECT; + } + + return PCIERR_RESULT_RECOVERED; +} + +/** e100_io_resume is called when the error recovery driver + * tells us that its OK to resume normal operation. + */ +static void e100_io_resume(struct pci_dev *pdev) +{ + struct net_device *netdev = pci_get_drvdata(pdev); + struct nic *nic = netdev_priv(netdev); + + /* ack any pending wake events, disable PME */ + pci_enable_wake(pdev, 0, 0); + + netif_device_attach(netdev); + if(netif_running(netdev)) { + e100_open (netdev); + mod_timer(&nic->watchdog, jiffies); + } +} + +static struct pci_error_handlers e100_err_handler = { + .error_detected = e100_io_error_detected, + .slot_reset = e100_io_slot_reset, + .resume = e100_io_resume, +}; + + static struct pci_driver e100_driver = { .name = DRV_NAME, .id_table = e100_id_table, @@ -2475,6 +2544,7 @@ .resume = e100_resume, #endif .shutdown = e100_shutdown, + .err_handler = &e100_err_handler, }; static int __init e100_init_module(void) From linas at linas.org Fri Nov 4 11:54:04 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:54:04 -0600 Subject: [PATCH 30/42]: ethernet: add PCI error recovery to e1000 dev driver References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005404.GA27082@mail.gnucash.org> Various PCI bus errors can be signaled by newer PCI controllers. This patch adds the PCI error recovery callbacks to the intel gigabit ethernet e1000 device driver. The patch has been tested, and appears to work well. Signed-off-by: Linas Vepstas -- Index: linux-2.6.14-git3/drivers/net/e1000/e1000_main.c =================================================================== --- linux-2.6.14-git3.orig/drivers/net/e1000/e1000_main.c 2005-11-02 14:28:50.471317390 -0600 +++ linux-2.6.14-git3/drivers/net/e1000/e1000_main.c 2005-11-02 14:44:00.730691851 -0600 @@ -206,6 +206,16 @@ void e1000_rx_schedule(void *data); #endif +static int e1000_io_error_detected(struct pci_dev *pdev, enum pci_channel_state state); +static int e1000_io_slot_reset(struct pci_dev *pdev); +static void e1000_io_resume(struct pci_dev *pdev); + +static struct pci_error_handlers e1000_err_handler = { + .error_detected = e1000_io_error_detected, + .slot_reset = e1000_io_slot_reset, + .resume = e1000_io_resume, +}; + /* Exported from other modules */ extern void e1000_check_options(struct e1000_adapter *adapter); @@ -218,8 +228,9 @@ /* Power Managment Hooks */ #ifdef CONFIG_PM .suspend = e1000_suspend, - .resume = e1000_resume + .resume = e1000_resume, #endif + .err_handler = &e1000_err_handler, }; MODULE_AUTHOR("Intel Corporation, "); @@ -2937,6 +2948,10 @@ #define PHY_IDLE_ERROR_COUNT_MASK 0x00FF + /* Prevent stats update while adapter is being reset */ + if (adapter->link_speed == 0) + return; + spin_lock_irqsave(&adapter->stats_lock, flags); /* these counters are modified from e1000_adjust_tbi_stats, @@ -4358,4 +4373,88 @@ } #endif +/* --------------- PCI Error Recovery infrastructure ------------ */ +/** e1000_io_error_detected() is called when PCI error is detected */ +static int e1000_io_error_detected(struct pci_dev *pdev, enum pci_channel_state state) +{ + struct net_device *netdev = pci_get_drvdata(pdev); + struct e1000_adapter *adapter = netdev->priv; + + if (netif_running(netdev)) + e1000_down(adapter); + + /* Request a slot slot reset. */ + return PCIERR_RESULT_NEED_RESET; +} + +/** e1000_io_slot_reset is called after the pci bus has been reset. + * Restart the card from scratch. + * Implementation resembles the first-half of the + * e1000_resume routine. + */ +static int e1000_io_slot_reset(struct pci_dev *pdev) +{ + struct net_device *netdev = pci_get_drvdata(pdev); + struct e1000_adapter *adapter = netdev->priv; + + if (pci_enable_device(pdev)) { + printk(KERN_ERR "e1000: Cannot re-enable PCI device after reset.\n"); + return PCIERR_RESULT_DISCONNECT; + } + pci_set_master(pdev); + + pci_enable_wake(pdev, 3, 0); + pci_enable_wake(pdev, 4, 0); /* 4 == D3 cold */ + + /* Perform card reset only on one instance of the card */ + if(0 != PCI_FUNC (pdev->devfn)) + return PCIERR_RESULT_RECOVERED; + + e1000_reset(adapter); + E1000_WRITE_REG(&adapter->hw, WUS, ~0); + + return PCIERR_RESULT_RECOVERED; +} + +/** e1000_io_resume is called when the error recovery driver + * tells us that its OK to resume normal operation. + * Implementation resembles the second-half of the + * e1000_resume routine. + */ +static void e1000_io_resume(struct pci_dev *pdev) +{ + struct net_device *netdev = pci_get_drvdata(pdev); + struct e1000_adapter *adapter = netdev->priv; + uint32_t manc, swsm; + + if(netif_running(netdev)) { + if (e1000_up(adapter)) { + printk("e1000: can't bring device back up after reset\n"); + return; + } + } + + netif_device_attach(netdev); + + if(adapter->hw.mac_type >= e1000_82540 && + adapter->hw.media_type == e1000_media_type_copper) { + manc = E1000_READ_REG(&adapter->hw, MANC); + manc &= ~(E1000_MANC_ARP_EN); + E1000_WRITE_REG(&adapter->hw, MANC, manc); + } + + switch(adapter->hw.mac_type) { + case e1000_82573: + swsm = E1000_READ_REG(&adapter->hw, SWSM); + E1000_WRITE_REG(&adapter->hw, SWSM, + swsm | E1000_SWSM_DRV_LOAD); + break; + default: + break; + } + + if(netif_running(netdev)) + mod_timer(&adapter->watchdog_timer, jiffies); +} + /* e1000_main.c */ From linas at linas.org Fri Nov 4 11:54:11 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:54:11 -0600 Subject: [PATCH 31/42]: ethernet: add PCI error recovery to ixgb dev driver References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005411.GA27090@mail.gnucash.org> Various PCI bus errors can be signaled by newer PCI controllers. This patch adds the PCI error recovery callbacks to the intel ten-gigabit ethernet ixgb device driver. The patch has been tested, and appears to work well. Signed-off-by: Linas Vepstas -- Index: linux-2.6.14-git3/drivers/net/ixgb/ixgb_main.c =================================================================== --- linux-2.6.14-git3.orig/drivers/net/ixgb/ixgb_main.c 2005-11-02 14:28:49.225492020 -0600 +++ linux-2.6.14-git3/drivers/net/ixgb/ixgb_main.c 2005-11-02 14:44:02.380460486 -0600 @@ -132,6 +132,16 @@ static void ixgb_netpoll(struct net_device *dev); #endif +static int ixgb_io_error_detected (struct pci_dev *pdev, enum pci_channel_state state); +static int ixgb_io_slot_reset (struct pci_dev *pdev); +static void ixgb_io_resume (struct pci_dev *pdev); + +static struct pci_error_handlers ixgb_err_handler = { + .error_detected = ixgb_io_error_detected, + .slot_reset = ixgb_io_slot_reset, + .resume = ixgb_io_resume, +}; + /* Exported from other modules */ extern void ixgb_check_options(struct ixgb_adapter *adapter); @@ -141,6 +151,8 @@ .id_table = ixgb_pci_tbl, .probe = ixgb_probe, .remove = __devexit_p(ixgb_remove), + .err_handler = &ixgb_err_handler, + }; MODULE_AUTHOR("Intel Corporation, "); @@ -1654,8 +1666,16 @@ unsigned int i; #endif +#ifdef XXX_CONFIG_IXGB_EEH_RECOVERY + if(unlikely(icr==EEH_IO_ERROR_VALUE(4))) { + if (eeh_slot_is_isolated (adapter->pdev)) + // disable_irq_nosync (adapter->pdev->irq); + return IRQ_NONE; /* Not our interrupt */ + } +#else if(unlikely(!icr)) return IRQ_NONE; /* Not our interrupt */ +#endif /* CONFIG_IXGB_EEH_RECOVERY */ if(unlikely(icr & (IXGB_INT_RXSEQ | IXGB_INT_LSC))) { mod_timer(&adapter->watchdog_timer, jiffies); @@ -2125,4 +2145,70 @@ } #endif +/* -------------- PCI Error Recovery infrastructure ---------------- */ +/** ixgb_io_error_detected() is called when PCI error is detected */ +static int ixgb_io_error_detected (struct pci_dev *pdev, enum pci_channel_state state) +{ + struct net_device *netdev = pci_get_drvdata(pdev); + struct ixgb_adapter *adapter = netdev->priv; + + if(netif_running(netdev)) + ixgb_down(adapter, TRUE); + + /* Request a slot reset. */ + return PCIERR_RESULT_NEED_RESET; +} + +/** ixgb_io_slot_reset is called after the pci bus has been reset. + * Restart the card from scratch. + * Implementation resembles the first-half of the + * ixgb_resume routine. + */ +static int ixgb_io_slot_reset (struct pci_dev *pdev) +{ + struct net_device *netdev = pci_get_drvdata(pdev); + struct ixgb_adapter *adapter = netdev->priv; + + if(pci_enable_device(pdev)) { + printk(KERN_ERR "ixgb: Cannot re-enable PCI device after reset.\n"); + return PCIERR_RESULT_DISCONNECT; + } + pci_set_master(pdev); + + /* Perform card reset only on one instance of the card */ + if (0 != PCI_FUNC (pdev->devfn)) + return PCIERR_RESULT_RECOVERED; + + ixgb_reset(adapter); + + return PCIERR_RESULT_RECOVERED; +} + +/** ixgb_io_resume is called when the error recovery driver + * tells us that its OK to resume normal operation. + * Implementation resembles the second-half of the + * ixgb_resume routine. + */ +static void ixgb_io_resume (struct pci_dev *pdev) +{ + struct net_device *netdev = pci_get_drvdata(pdev); + struct ixgb_adapter *adapter = netdev->priv; + + if(netif_running(netdev)) { + if(ixgb_up(adapter)) { + printk ("ixgb: can't bring device back up after reset\n"); + return; + } + } + + netif_device_attach(netdev); + if(netif_running(netdev)) + mod_timer(&adapter->watchdog_timer, jiffies); + + /* Reading all-ff's from the adapter will completely hose + * the counts and statistics. So just clear them out */ + memset(&adapter->stats, 0, sizeof(struct ixgb_hw_stats)); + ixgb_update_stats(adapter); +} + /* ixgb_main.c */ From linas at linas.org Fri Nov 4 11:54:17 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:54:17 -0600 Subject: [PATCH 32/42]: RFC: Add compile-time config options References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005417.GA27098@mail.gnucash.org> 32-pci-error-recovery_config-option.patch This OPTIONAL/RFC patch adds ifdef's around the PCI error recovery code in the various device drivers. This patch is "optional" in that its a little bit messy, but it does solve a little problem. -- The good news: this gives some users (e.g. embeddd systems) the option of not compiling in this code, thus making thier device drivers a tiny bit smaller. -- The bad news: This also clutters up the drivers with extraneous markup and the config process with yet another config. I don't know if this patch is worth it. Apply or reject, as desired ... Its up to you ... :-) Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/drivers/scsi/ipr.c =================================================================== --- linux-2.6.14-git3.orig/drivers/scsi/ipr.c 2005-11-02 14:43:52.782806465 -0600 +++ linux-2.6.14-git3/drivers/scsi/ipr.c 2005-11-02 14:44:04.167209911 -0600 @@ -5329,6 +5329,8 @@ } /* --------------- PCI Error Recovery infrastructure ----------- */ +#ifdef CONFIG_PCIERR_RECOVERY + /** If the PCI slot is frozen, hold off all i/o * activity; then, as soon as the slot is available again, * initiate an adapter reset. @@ -5414,6 +5416,7 @@ return PCIERR_RESULT_NEED_RESET; } +#endif /* CONFIG_PCIERR_RECOVERY */ /* ------------- end of PCI Error Recovery suport ----------- */ /** @@ -6153,10 +6156,12 @@ }; MODULE_DEVICE_TABLE(pci, ipr_pci_table); +#ifdef CONFIG_PCIERR_RECOVERY static struct pci_error_handlers ipr_err_handler = { .error_detected = ipr_eeh_error_detected, .slot_reset = ipr_eeh_slot_reset, }; +#endif /* CONFIG_PCIERR_RECOVERY */ static struct pci_driver ipr_driver = { .name = IPR_NAME, @@ -6164,7 +6169,9 @@ .probe = ipr_probe, .remove = ipr_remove, .shutdown = ipr_shutdown, +#ifdef CONFIG_PCIERR_RECOVERY .err_handler = &ipr_err_handler, +#endif /* CONFIG_PCIERR_RECOVERY */ }; /** Index: linux-2.6.14-git3/drivers/pci/Kconfig =================================================================== --- linux-2.6.14-git3.orig/drivers/pci/Kconfig 2005-11-02 14:28:48.597580036 -0600 +++ linux-2.6.14-git3/drivers/pci/Kconfig 2005-11-02 14:44:04.172209210 -0600 @@ -13,6 +13,21 @@ If you don't know what to do here, say N. +config PCIERR_RECOVERY + bool "PCI Error Recovery support" + depends on PCI + depends on PPC_PSERIES + default y + help + PCI Error Recovery is a mechanism by which crashed/hung + PCI adapters are automatically detected and rebooted without + otherwise disturbing the operation of the system. Support + for this recovery requires special PCI bridge chips (some + PCI-E chips may have this support) as well as support in + the device drivers (not all device drivers can handle this). + + When in doubt, say Y. + config PCI_LEGACY_PROC bool "Legacy /proc/pci interface" depends on PCI Index: linux-2.6.14-git3/drivers/scsi/sym53c8xx_2/sym_glue.c =================================================================== --- linux-2.6.14-git3.orig/drivers/scsi/sym53c8xx_2/sym_glue.c 2005-11-02 14:43:56.084343457 -0600 +++ linux-2.6.14-git3/drivers/scsi/sym53c8xx_2/sym_glue.c 2005-11-02 14:44:04.195205985 -0600 @@ -763,6 +763,7 @@ */ static void sym_eh_timeout(u_long p) { __sym_eh_done((struct scsi_cmnd *)p, 1); } +#ifdef CONFIG_PCIERR_RECOVERY static void sym_eeh_timeout(u_long p) { struct sym_eh_wait *ep = (struct sym_eh_wait *) p; @@ -781,6 +782,7 @@ complete(&ep->done); } +#endif /* CONFIG_PCIERR_RECOVERY */ /* * Generic method for our eh processing. @@ -823,6 +825,7 @@ /* Try to proceed the operation we have been asked for */ sts = -1; +#ifdef CONFIG_PCIERR_RECOVERY /* We may be in an error condition because the PCI bus * went down. In this case, we need to wait until the * PCI bus is reset, the card is reset, and only then @@ -850,6 +853,7 @@ } np->s.io_reset_wait = NULL; } +#endif /* CONFIG_PCIERR_RECOVERY */ switch(op) { case SYM_EH_ABORT: @@ -1971,6 +1975,7 @@ } /* ------------- PCI Error Recovery infrastructure -------------- */ +#ifdef CONFIG_PCIERR_RECOVERY /** sym2_io_error_detected() is called when PCI error is detected */ static int sym2_io_error_detected (struct pci_dev *pdev, enum pci_channel_state state) { @@ -2021,6 +2026,7 @@ np->s.io_state = pci_channel_io_normal; sym_eeh_done (np->s.io_reset_wait); } +#endif /* CONFIG_PCIERR_RECOVERY */ /* * Driver host template. @@ -2275,18 +2281,22 @@ MODULE_DEVICE_TABLE(pci, sym2_id_table); +#ifdef CONFIG_PCIERR_RECOVERY static struct pci_error_handlers sym2_err_handler = { .error_detected = sym2_io_error_detected, .slot_reset = sym2_io_slot_reset, .resume = sym2_io_resume, }; +#endif /* CONFIG_PCIERR_RECOVERY */ static struct pci_driver sym2_driver = { .name = NAME53C8XX, .id_table = sym2_id_table, .probe = sym2_probe, .remove = __devexit_p(sym2_remove), +#ifdef CONFIG_PCIERR_RECOVERY .err_handler = &sym2_err_handler, +#endif /* CONFIG_PCIERR_RECOVERY */ }; static int __init sym2_init(void) Index: linux-2.6.14-git3/drivers/net/e100.c =================================================================== --- linux-2.6.14-git3.orig/drivers/net/e100.c 2005-11-02 14:43:58.890949857 -0600 +++ linux-2.6.14-git3/drivers/net/e100.c 2005-11-02 14:44:04.222202199 -0600 @@ -2466,6 +2466,7 @@ /* ------------------ PCI Error Recovery infrastructure -------------- */ +#ifdef CONFIG_PCIERR_RECOVERY /** e100_io_error_detected() is called when PCI error is detected */ static int e100_io_error_detected(struct pci_dev *pdev, enum pci_channel_state state) { @@ -2532,6 +2533,7 @@ .slot_reset = e100_io_slot_reset, .resume = e100_io_resume, }; +#endif /* CONFIG_PCIERR_RECOVERY */ static struct pci_driver e100_driver = { @@ -2544,7 +2546,9 @@ .resume = e100_resume, #endif .shutdown = e100_shutdown, +#ifdef CONFIG_PCIERR_RECOVERY .err_handler = &e100_err_handler, +#endif /* CONFIG_PCIERR_RECOVERY */ }; static int __init e100_init_module(void) Index: linux-2.6.14-git3/drivers/net/e1000/e1000_main.c =================================================================== --- linux-2.6.14-git3.orig/drivers/net/e1000/e1000_main.c 2005-11-02 14:44:00.730691851 -0600 +++ linux-2.6.14-git3/drivers/net/e1000/e1000_main.c 2005-11-02 14:44:04.266196029 -0600 @@ -206,6 +206,7 @@ void e1000_rx_schedule(void *data); #endif +#ifdef CONFIG_PCIERR_RECOVERY static int e1000_io_error_detected(struct pci_dev *pdev, enum pci_channel_state state); static int e1000_io_slot_reset(struct pci_dev *pdev); static void e1000_io_resume(struct pci_dev *pdev); @@ -215,6 +216,7 @@ .slot_reset = e1000_io_slot_reset, .resume = e1000_io_resume, }; +#endif /* CONFIG_PCIERR_RECOVERY */ /* Exported from other modules */ @@ -230,7 +232,9 @@ .suspend = e1000_suspend, .resume = e1000_resume, #endif +#ifdef CONFIG_PCIERR_RECOVERY .err_handler = &e1000_err_handler, +#endif /* CONFIG_PCIERR_RECOVERY */ }; MODULE_AUTHOR("Intel Corporation, "); @@ -4374,6 +4378,7 @@ #endif /* --------------- PCI Error Recovery infrastructure ------------ */ +#ifdef CONFIG_PCIERR_RECOVERY /** e1000_io_error_detected() is called when PCI error is detected */ static int e1000_io_error_detected(struct pci_dev *pdev, enum pci_channel_state state) { @@ -4456,5 +4461,6 @@ if(netif_running(netdev)) mod_timer(&adapter->watchdog_timer, jiffies); } +#endif /* CONFIG_PCIERR_RECOVERY */ /* e1000_main.c */ Index: linux-2.6.14-git3/drivers/net/ixgb/ixgb_main.c =================================================================== --- linux-2.6.14-git3.orig/drivers/net/ixgb/ixgb_main.c 2005-11-02 14:44:02.380460486 -0600 +++ linux-2.6.14-git3/drivers/net/ixgb/ixgb_main.c 2005-11-02 14:44:04.289192804 -0600 @@ -132,6 +132,7 @@ static void ixgb_netpoll(struct net_device *dev); #endif +#ifdef CONFIG_PCIERR_RECOVERY static int ixgb_io_error_detected (struct pci_dev *pdev, enum pci_channel_state state); static int ixgb_io_slot_reset (struct pci_dev *pdev); static void ixgb_io_resume (struct pci_dev *pdev); @@ -141,6 +142,7 @@ .slot_reset = ixgb_io_slot_reset, .resume = ixgb_io_resume, }; +#endif /* CONFIG_PCIERR_RECOVERY */ /* Exported from other modules */ @@ -151,8 +153,9 @@ .id_table = ixgb_pci_tbl, .probe = ixgb_probe, .remove = __devexit_p(ixgb_remove), +#ifdef CONFIG_PCIERR_RECOVERY .err_handler = &ixgb_err_handler, - +#endif /* CONFIG_PCIERR_RECOVERY */ }; MODULE_AUTHOR("Intel Corporation, "); @@ -2146,6 +2149,7 @@ #endif /* -------------- PCI Error Recovery infrastructure ---------------- */ +#ifdef CONFIG_PCIERR_RECOVERY /** ixgb_io_error_detected() is called when PCI error is detected */ static int ixgb_io_error_detected (struct pci_dev *pdev, enum pci_channel_state state) { @@ -2210,5 +2214,6 @@ memset(&adapter->stats, 0, sizeof(struct ixgb_hw_stats)); ixgb_update_stats(adapter); } +#endif /* CONFIG_PCIERR_RECOVERY */ /* ixgb_main.c */ From linas at linas.org Fri Nov 4 11:54:23 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:54:23 -0600 Subject: [PATCH 33/42]: ppc64: remove bogus printk References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005423.GA27106@mail.gnucash.org> 233-eeh-buid-fix.patch Remove un-desired warning print from EEH code. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:43:49.212307192 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:45:00.429319560 -0600 @@ -824,12 +824,10 @@ if (!dn || !PCI_DN(dn)) return; phb = PCI_DN(dn)->phb; - if (NULL == phb || 0 == phb->buid) { - printk(KERN_WARNING "EEH: Expected buid but found none for %s\n", - dn->full_name); - dump_stack(); + + /* USB Bus children of PCI devices will not have BUID's */ + if (NULL == phb || 0 == phb->buid) return; - } info.buid_hi = BUID_HI(phb->buid); info.buid_lo = BUID_LO(phb->buid); From linas at linas.org Fri Nov 4 11:54:29 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:54:29 -0600 Subject: [PATCH 34/42]: ppc64: Remove duplicate code References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005429.GA27114@mail.gnucash.org> 234-eeh-find-pe.patch The find_device_pe() routine is duplicated in two files. Remove one of the two copies, declare the other extern. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh_driver.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/eeh_driver.c 2005-11-02 14:41:18.435451353 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh_driver.c 2005-11-02 14:45:43.638259683 -0600 @@ -42,19 +42,6 @@ return ""; } -/** - * Return the "partitionable endpoint" (pe) under which this device lies - */ -static struct device_node * find_device_pe(struct device_node *dn) -{ - while ((dn->parent) && PCI_DN(dn->parent) && - (PCI_DN(dn->parent)->eeh_mode & EEH_MODE_SUPPORTED)) { - dn = dn->parent; - } - return dn; -} - - #ifdef DEBUG static void print_device_node_tree (struct pci_dn *pdn, int dent) { Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:45:00.429319560 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:45:43.651257860 -0600 @@ -172,7 +172,7 @@ /** * Return the "partitionable endpoint" (pe) under which this device lies */ -static struct device_node * find_device_pe(struct device_node *dn) +struct device_node * find_device_pe(struct device_node *dn) { while ((dn->parent) && PCI_DN(dn->parent) && (PCI_DN(dn->parent)->eeh_mode & EEH_MODE_SUPPORTED)) { Index: linux-2.6.14-git3/include/asm-powerpc/ppc-pci.h =================================================================== --- linux-2.6.14-git3.orig/include/asm-powerpc/ppc-pci.h 2005-11-02 14:42:38.998153856 -0600 +++ linux-2.6.14-git3/include/asm-powerpc/ppc-pci.h 2005-11-02 14:45:43.656257159 -0600 @@ -110,6 +110,9 @@ void eeh_mark_slot (struct device_node *dn, int mode_flag); void eeh_clear_slot (struct device_node *dn, int mode_flag); +/* Find the associated "Partiationable Endpoint" PE */ +struct device_node * find_device_pe(struct device_node *dn); + #endif #endif /* _ASM_POWERPC_PPC_PCI_H */ From linas at linas.org Fri Nov 4 11:54:34 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:54:34 -0600 Subject: [PATCH 35/42]: ppc64: bugfix: fill in un-initialzed field References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005434.GA27122@mail.gnucash.org> 235-eeh-set-pcidev-bugfix.patch The pci device field should be initialized to a valid value. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh_cache.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/eeh_cache.c 2005-11-02 14:42:38.994154417 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh_cache.c 2005-11-02 14:46:23.687642815 -0600 @@ -307,6 +307,9 @@ /* Save the BAR's; firmware doesn't restore these after EEH reset */ dn = pci_device_to_OF_node(dev); eeh_save_bars(dev, PCI_DN(dn)); + + pci_dev_get (dev); /* matching put is in eeh_remove_device() */ + PCI_DN(dn)->pcidev = dev; } #ifdef DEBUG From linas at linas.org Fri Nov 4 11:54:39 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:54:39 -0600 Subject: [PATCH 36/42]: ppc64: Use PE configuration address consistently References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005439.GA27130@mail.gnucash.org> 236-eeh-config-addr.patch The PE configuration address wasn't being cnsistently used in all locations where a config address is called for. This patch adds it to the places it should have appeared in. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:45:43.651257860 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:47:07.798456202 -0600 @@ -110,6 +110,7 @@ void eeh_slot_error_detail (struct pci_dn *pdn, int severity) { + int config_addr; unsigned long flags; int rc; @@ -117,8 +118,13 @@ spin_lock_irqsave(&slot_errbuf_lock, flags); memset(slot_errbuf, 0, eeh_error_buf_size); + /* Use PE configuration address, if present */ + config_addr = pdn->eeh_config_addr; + if (pdn->eeh_pe_config_addr) + config_addr = pdn->eeh_pe_config_addr; + rc = rtas_call(ibm_slot_error_detail, - 8, 1, NULL, pdn->eeh_config_addr, + 8, 1, NULL, config_addr, BUID_HI(pdn->phb->buid), BUID_LO(pdn->phb->buid), NULL, 0, virt_to_phys(slot_errbuf), @@ -138,6 +144,7 @@ static int read_slot_reset_state(struct pci_dn *pdn, int rets[]) { int token, outputs; + int config_addr; if (ibm_read_slot_reset_state2 != RTAS_UNKNOWN_SERVICE) { token = ibm_read_slot_reset_state2; @@ -148,7 +155,12 @@ outputs = 3; } - return rtas_call(token, 3, outputs, rets, pdn->eeh_config_addr, + /* Use PE configuration address, if present */ + config_addr = pdn->eeh_config_addr; + if (pdn->eeh_pe_config_addr) + config_addr = pdn->eeh_pe_config_addr; + + return rtas_call(token, 3, outputs, rets, config_addr, BUID_HI(pdn->phb->buid), BUID_LO(pdn->phb->buid)); } @@ -284,7 +296,7 @@ return 0; } - if (!pdn->eeh_config_addr) { + if (!pdn->eeh_config_addr && !pdn->eeh_pe_config_addr) { __get_cpu_var(no_cfg_addr)++; return 0; } @@ -613,13 +625,20 @@ void rtas_configure_bridge(struct pci_dn *pdn) { + int config_addr; int token = rtas_token ("ibm,configure-bridge"); int rc; if (token == RTAS_UNKNOWN_SERVICE) return; + + /* Use PE configuration address, if present */ + config_addr = pdn->eeh_config_addr; + if (pdn->eeh_pe_config_addr) + config_addr = pdn->eeh_pe_config_addr; + rc = rtas_call(token,3,1, NULL, - pdn->eeh_config_addr, + config_addr, BUID_HI(pdn->phb->buid), BUID_LO(pdn->phb->buid)); if (rc) { From linas at linas.org Fri Nov 4 11:54:47 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:54:47 -0600 Subject: [PATCH 37/42]: ppc64: set up the RTAS token just like the rest of them. References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005447.GA27138@mail.gnucash.org> 237-eeh-bridge-token.patch Minor: the rtas-bridge toekn should be set up the same way that all the other tokens rtas tokens are set up. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:47:07.798456202 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:47:38.997080468 -0600 @@ -84,6 +84,7 @@ static int ibm_read_slot_reset_state2; static int ibm_slot_error_detail; static int ibm_get_config_addr_info; +static int ibm_configure_bridge; static int eeh_subsystem_enabled; @@ -626,18 +627,14 @@ rtas_configure_bridge(struct pci_dn *pdn) { int config_addr; - int token = rtas_token ("ibm,configure-bridge"); int rc; - if (token == RTAS_UNKNOWN_SERVICE) - return; - /* Use PE configuration address, if present */ config_addr = pdn->eeh_config_addr; if (pdn->eeh_pe_config_addr) config_addr = pdn->eeh_pe_config_addr; - rc = rtas_call(token,3,1, NULL, + rc = rtas_call(ibm_configure_bridge,3,1, NULL, config_addr, BUID_HI(pdn->phb->buid), BUID_LO(pdn->phb->buid)); @@ -789,6 +786,7 @@ ibm_read_slot_reset_state = rtas_token("ibm,read-slot-reset-state"); ibm_slot_error_detail = rtas_token("ibm,slot-error-detail"); ibm_get_config_addr_info = rtas_token("ibm,get-config-addr-info"); + ibm_configure_bridge = rtas_token ("ibm,configure-bridge"); if (ibm_set_eeh_option == RTAS_UNKNOWN_SERVICE) return; From linas at linas.org Fri Nov 4 11:54:54 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:54:54 -0600 Subject: [PATCH 38/42]: ppc64: Don't continue with PCI Error recovery if slot reset failed. References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005454.GA27146@mail.gnucash.org> 238-eeh-stop-if-reset_failed.patch If the firmware is unable to reset the PCI slot for some reason, then don't attempt any further recovery steps after that point. Instead, mark the device as permanently failed. Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:47:38.997080468 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:48:13.093298267 -0600 @@ -450,11 +450,16 @@ if (rc) return rc; if (rets[1] == 0) return -1; /* EEH is not supported */ - if (rets[0] == 0) return 0; /* Oll Korrect */ + if (rets[0] == 0) return 0; /* Oll Korrect */ if (rets[0] == 5) { if (rets[2] == 0) return -1; /* permanently unavailable */ return rets[2]; /* number of millisecs to wait */ } + if (rets[0] == 1) + return 250; + + printk (KERN_ERR "EEH: Slot unavailable: rc=%d, rets=%d %d %d\n", + rc, rets[0], rets[1], rets[2]); return -1; } @@ -501,9 +506,11 @@ /** rtas_set_slot_reset -- assert the pci #RST line for 1/4 second * dn -- device node to be reset. + * + * Return 0 if success, else a non-zero value. */ -void +int rtas_set_slot_reset(struct pci_dn *pdn) { int i, rc; @@ -533,10 +540,21 @@ * ready to be used; if not, wait for recovery. */ for (i=0; i<10; i++) { rc = eeh_slot_availability (pdn); - if (rc <= 0) break; + if (rc < 0) + printk (KERN_ERR "EEH: failed (%d) to reset slot %s\n", rc, pdn->node->full_name); + if (rc == 0) + return 0; + if (rc < 0) + return -1; msleep (rc+100); } + + rc = eeh_slot_availability (pdn); + if (rc) + printk (KERN_ERR "EEH: timeout resetting slot %s\n", pdn->node->full_name); + + return rc; } /* ------------------------------------------------------- */ Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh_driver.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/eeh_driver.c 2005-11-02 14:45:43.638259683 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh_driver.c 2005-11-02 14:48:13.100297285 -0600 @@ -200,14 +200,18 @@ * bus resets can be performed. */ -static void eeh_reset_device (struct pci_dn *pe_dn, struct pci_bus *bus) +static int eeh_reset_device (struct pci_dn *pe_dn, struct pci_bus *bus) { + int rc; if (bus) pcibios_remove_pci_devices(bus); /* Reset the pci controller. (Asserts RST#; resets config space). - * Reconfigure bridges and devices */ - rtas_set_slot_reset(pe_dn); + * Reconfigure bridges and devices. Don't try to bring the system + * up if the reset failed for some reason. */ + rc = rtas_set_slot_reset(pe_dn); + if (rc) + return rc; /* Walk over all functions on this device */ rtas_configure_bridge(pe_dn); @@ -223,6 +227,8 @@ ssleep (5); pcibios_add_pci_devices(bus); } + + return 0; } /* The longest amount of time to wait for a pci device @@ -235,7 +241,7 @@ struct device_node *frozen_dn; struct pci_dn *frozen_pdn; struct pci_bus *frozen_bus; - int perm_failure = 0; + int rc = 0; frozen_dn = find_device_pe(event->dn); frozen_bus = pcibios_find_pci_bus(frozen_dn); @@ -272,7 +278,7 @@ frozen_pdn->eeh_freeze_count++; if (frozen_pdn->eeh_freeze_count > EEH_MAX_ALLOWED_FREEZES) - perm_failure = 1; + goto hard_fail; /* If the reset state is a '5' and the time to reset is 0 (infinity) * or is more then 15 seconds, then mark this as a permanent failure. @@ -280,34 +286,7 @@ if ((event->state == pci_channel_io_perm_failure) && ((event->time_unavail <= 0) || (event->time_unavail > MAX_WAIT_FOR_RECOVERY*1000))) - { - perm_failure = 1; - } - - /* Log the error with the rtas logger. */ - if (perm_failure) { - /* - * About 90% of all real-life EEH failures in the field - * are due to poorly seated PCI cards. Only 10% or so are - * due to actual, failed cards. - */ - printk(KERN_ERR - "EEH: PCI device %s - %s has failed %d times \n" - "and has been permanently disabled. Please try reseating\n" - "this device or replacing it.\n", - pci_name (frozen_pdn->pcidev), - pcid_name(frozen_pdn->pcidev), - frozen_pdn->eeh_freeze_count); - - eeh_slot_error_detail(frozen_pdn, 2 /* Permanent Error */); - - /* Notify all devices that they're about to go down. */ - pci_walk_bus(frozen_bus, eeh_report_failure, 0); - - /* Shut down the device drivers for good. */ - pcibios_remove_pci_devices(frozen_bus); - return; - } + goto hard_fail; eeh_slot_error_detail(frozen_pdn, 1 /* Temporary Error */); printk(KERN_WARNING @@ -330,24 +309,54 @@ * go down willingly, without panicing the system. */ if (result == PCIERR_RESULT_NONE) { - eeh_reset_device(frozen_pdn, frozen_bus); + rc = eeh_reset_device(frozen_pdn, frozen_bus); + if (rc) + goto hard_fail; } /* If any device called out for a reset, then reset the slot */ if (result == PCIERR_RESULT_NEED_RESET) { - eeh_reset_device(frozen_pdn, NULL); + rc = eeh_reset_device(frozen_pdn, NULL); + if (rc) + goto hard_fail; pci_walk_bus(frozen_bus, eeh_report_reset, 0); } /* If all devices reported they can proceed, the re-enable PIO */ if (result == PCIERR_RESULT_CAN_RECOVER) { /* XXX Not supported; we brute-force reset the device */ - eeh_reset_device(frozen_pdn, NULL); + rc = eeh_reset_device(frozen_pdn, NULL); + if (rc) + goto hard_fail; pci_walk_bus(frozen_bus, eeh_report_reset, 0); } /* Tell all device drivers that they can resume operations */ pci_walk_bus(frozen_bus, eeh_report_resume, 0); + + return; + +hard_fail: + /* + * About 90% of all real-life EEH failures in the field + * are due to poorly seated PCI cards. Only 10% or so are + * due to actual, failed cards. + */ + printk(KERN_ERR + "EEH: PCI device %s - %s has failed %d times \n" + "and has been permanently disabled. Please try reseating\n" + "this device or replacing it.\n", + pci_name (frozen_pdn->pcidev), + pcid_name(frozen_pdn->pcidev), + frozen_pdn->eeh_freeze_count); + + eeh_slot_error_detail(frozen_pdn, 2 /* Permanent Error */); + + /* Notify all devices that they're about to go down. */ + pci_walk_bus(frozen_bus, eeh_report_failure, 0); + + /* Shut down the device drivers for good. */ + pcibios_remove_pci_devices(frozen_bus); } /* ---------- end of file ---------- */ Index: linux-2.6.14-git3/include/asm-powerpc/ppc-pci.h =================================================================== --- linux-2.6.14-git3.orig/include/asm-powerpc/ppc-pci.h 2005-11-02 14:45:43.656257159 -0600 +++ linux-2.6.14-git3/include/asm-powerpc/ppc-pci.h 2005-11-02 14:48:13.104296724 -0600 @@ -77,8 +77,10 @@ * does this by asserting the PCI #RST line for 1/8th of * a second; this routine will sleep while the adapter is * being reset. + * + * Returns a non-zero value if the reset failed. */ -void rtas_set_slot_reset (struct pci_dn *); +int rtas_set_slot_reset (struct pci_dn *); /** * eeh_restore_bars - Restore device configuration info. From linas at linas.org Fri Nov 4 11:55:01 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:55:01 -0600 Subject: [PATCH 39/42]: ppc64: handle multifunction PCI devices properly References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005501.GA27154@mail.gnucash.org> 239-eeh-multifunction-consolidate.patch New-style firmware will often place multiple different functions under a non-EEH-aware parent. However, tehse devices might share a common PE "partition endpoint" and config address, ad thus any EEH events will affect all of the devices in common. This patch makes the effort to find all of these common devices and handle them together. Signed-off-by: Linas Vepstas -- Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:48:13.093298267 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:48:44.941831253 -0600 @@ -223,6 +223,11 @@ void eeh_mark_slot (struct device_node *dn, int mode_flag) { dn = find_device_pe (dn); + + /* Back up one, since config addrs might be shared */ + if (PCI_DN(dn) && PCI_DN(dn)->eeh_pe_config_addr) + dn = dn->parent; + PCI_DN(dn)->eeh_mode |= mode_flag; __eeh_mark_slot (dn->child, mode_flag); } @@ -244,7 +249,13 @@ { unsigned long flags; spin_lock_irqsave(&confirm_error_lock, flags); + dn = find_device_pe (dn); + + /* Back up one, since config addrs might be shared */ + if (PCI_DN(dn) && PCI_DN(dn)->eeh_pe_config_addr) + dn = dn->parent; + PCI_DN(dn)->eeh_mode &= ~mode_flag; PCI_DN(dn)->eeh_check_count = 0; __eeh_clear_slot (dn->child, mode_flag); @@ -609,7 +620,7 @@ if (!pdn) return; - if (! pdn->eeh_is_bridge) + if ((pdn->eeh_mode & EEH_MODE_SUPPORTED) && (!pdn->eeh_is_bridge)) __restore_bars (pdn); dn = pdn->node->child; Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh_driver.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/eeh_driver.c 2005-11-02 14:48:13.100297285 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh_driver.c 2005-11-02 14:48:44.950829991 -0600 @@ -213,9 +213,23 @@ if (rc) return rc; - /* Walk over all functions on this device */ - rtas_configure_bridge(pe_dn); - eeh_restore_bars(pe_dn); + /* New-style config addrs might be shared across multiple devices, + * Walk over all functions on this device */ + if (pe_dn->eeh_pe_config_addr) { + struct device_node *pe = pe_dn->node; + pe = pe->parent->child; + while (pe) { + struct pci_dn *ppe = PCI_DN(pe); + if (pe_dn->eeh_pe_config_addr == ppe->eeh_pe_config_addr) { + rtas_configure_bridge(ppe); + eeh_restore_bars(ppe); + } + pe = pe->sibling; + } + } else { + rtas_configure_bridge(pe_dn); + eeh_restore_bars(pe_dn); + } /* Give the system 5 seconds to finish running the user-space * hotplug shutdown scripts, e.g. ifdown for ethernet. Yes, From linas at linas.org Fri Nov 4 11:55:14 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:55:14 -0600 Subject: [PATCH 40/42]: ppc64: IOMMU: don't ioremap null pointers References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005514.GA27179@mail.gnucash.org> 240-ioremap-null-ptr-test.patch Under highly unusual circumstances, a buggy driver will ask a null ptr to be ioremapped, an operation that curently suceeds but leads to later trouble. Instead, refuse to remap the null pointer. Signed-off-by: Linas Vepstas -- Index: linux-2.6.14-git3/arch/powerpc/mm/pgtable_64.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/mm/pgtable_64.c 2005-11-02 14:59:56.507624778 -0600 +++ linux-2.6.14-git3/arch/powerpc/mm/pgtable_64.c 2005-11-02 15:01:04.284115774 -0600 @@ -185,7 +185,7 @@ pa = addr & PAGE_MASK; size = PAGE_ALIGN(addr + size) - pa; - if (size == 0) + if ((size == 0) || (pa == 0)) return NULL; if (mem_init_done) { From linas at linas.org Fri Nov 4 11:55:19 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:55:19 -0600 Subject: [PATCH 41/42]: ppc64: Save device BARS much earlier in the boot sequence References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005519.GA27189@mail.gnucash.org> 241-eeh-save-bars-earlier.patch Save the PCI device bars *before* any PCI probing is done. Signed-off-by: Linas Vepstas -- Index: linux-2.6.14-git3/arch/ppc64/kernel/rtas_pci.c =================================================================== --- linux-2.6.14-git3.orig/arch/ppc64/kernel/rtas_pci.c 2005-10-31 12:01:21.000000000 -0600 +++ linux-2.6.14-git3/arch/ppc64/kernel/rtas_pci.c 2005-11-02 16:52:48.556202006 -0600 @@ -72,7 +72,7 @@ return 0; } -static int rtas_read_config(struct pci_dn *pdn, int where, int size, u32 *val) +int rtas_read_config(struct pci_dn *pdn, int where, int size, u32 *val) { int returnval = -1; unsigned long buid, addr; Index: linux-2.6.14-git3/include/asm-powerpc/ppc-pci.h =================================================================== --- linux-2.6.14-git3.orig/include/asm-powerpc/ppc-pci.h 2005-11-02 16:53:29.000000000 -0600 +++ linux-2.6.14-git3/include/asm-powerpc/ppc-pci.h 2005-11-02 17:28:14.843073955 -0600 @@ -59,8 +59,6 @@ void pci_addr_cache_build(void); struct pci_dev *pci_get_device_by_addr(unsigned long addr); -void eeh_save_bars(struct pci_dev * pdev, struct pci_dn *pdn); - /** * eeh_slot_error_detail -- record and EEH error condition to the log * @severity: 1 if temporary, 2 if permanent failure. @@ -104,6 +102,7 @@ void rtas_configure_bridge(struct pci_dn *); int rtas_write_config(struct pci_dn *, int where, int size, u32 val); +int rtas_read_config(struct pci_dn *, int where, int size, u32 *val); /** * mark and clear slots: find "partition endpoint" PE and set or Index: linux-2.6.14-git3/include/asm-ppc64/pci-bridge.h =================================================================== --- linux-2.6.14-git3.orig/include/asm-ppc64/pci-bridge.h 2005-11-02 14:43:49.000000000 -0600 +++ linux-2.6.14-git3/include/asm-ppc64/pci-bridge.h 2005-11-02 17:13:07.358586231 -0600 @@ -58,15 +58,15 @@ struct iommu_table; struct pci_dn { - int busno; /* for pci devices */ - int bussubno; /* for pci devices */ - int devfn; /* for pci devices */ + int busno; /* pci bus number */ + int bussubno; /* pci subordinate bus number */ + int devfn; /* pci device and function number */ + int class_code; /* pci device class */ int eeh_mode; /* See eeh.h for possible EEH_MODEs */ int eeh_config_addr; int eeh_pe_config_addr; /* new-style partition endpoint address */ int eeh_check_count; /* # times driver ignored error */ int eeh_freeze_count; /* # times this device froze up. */ - int eeh_is_bridge; /* device is pci-to-pci bridge */ int pci_ext_config_space; /* for pci devices */ struct pci_controller *phb; /* for pci devices */ Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 16:45:55.000000000 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 18:42:28.243139205 -0600 @@ -106,6 +106,8 @@ static DEFINE_PER_CPU(unsigned long, ignored_failures); static DEFINE_PER_CPU(unsigned long, slot_resets); +#define IS_BRIDGE(class_code) (((class_code)<<16) == PCI_BASE_CLASS_BRIDGE) + /* --------------------------------------------------------------- */ /* Below lies the EEH event infrastructure */ @@ -620,7 +622,7 @@ if (!pdn) return; - if ((pdn->eeh_mode & EEH_MODE_SUPPORTED) && (!pdn->eeh_is_bridge)) + if ((pdn->eeh_mode & EEH_MODE_SUPPORTED) && !IS_BRIDGE(pdn->class_code)) __restore_bars (pdn); dn = pdn->node->child; @@ -638,18 +640,15 @@ * PCI devices are added individuallly; but, for the restore, * an entire slot is reset at a time. */ -void eeh_save_bars(struct pci_dev * pdev, struct pci_dn *pdn) +static void eeh_save_bars(struct pci_dn *pdn) { int i; - if (!pdev || !pdn ) + if (!pdn ) return; for (i = 0; i < 16; i++) - pci_read_config_dword(pdev, i * 4, &pdn->config_space[i]); - - if (pdev->hdr_type == PCI_HEADER_TYPE_BRIDGE) - pdn->eeh_is_bridge = 1; + rtas_read_config(pdn, i * 4, 4, &pdn->config_space[i]); } void @@ -699,6 +698,7 @@ int enable; struct pci_dn *pdn = PCI_DN(dn); + pdn->class_code = *class_code; pdn->eeh_mode = 0; pdn->eeh_check_count = 0; pdn->eeh_freeze_count = 0; @@ -781,6 +781,7 @@ dn->full_name); } + eeh_save_bars(pdn); return NULL; } @@ -915,7 +916,6 @@ pdn->pcidev = dev; pci_addr_cache_insert_device (dev); - eeh_save_bars(dev, pdn); } EXPORT_SYMBOL_GPL(eeh_add_device_late); Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh_cache.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/eeh_cache.c 2005-11-02 16:45:55.000000000 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh_cache.c 2005-11-02 18:40:54.893242771 -0600 @@ -304,10 +304,7 @@ pci_addr_cache_insert_device(dev); - /* Save the BAR's; firmware doesn't restore these after EEH reset */ dn = pci_device_to_OF_node(dev); - eeh_save_bars(dev, PCI_DN(dn)); - pci_dev_get (dev); /* matching put is in eeh_remove_device() */ PCI_DN(dn)->pcidev = dev; } From linas at linas.org Fri Nov 4 11:55:25 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:55:25 -0600 Subject: [PATCH 42/42]: ppc64: get rid of per_cpu counters References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005525.GA27197@mail.gnucash.org> 242-eeh-no-percpu-counters.patch Remove per-cpu counters from the EEH code. These statistics counters are incremented at a very low-frequency, and the performance gains of per-cpu variables are negligable. By conrast, the counters weren't safe against cpu gard operations, and its not worth the effeort to make them so (other than to turn them into plain globals). Signed-off-by: Linas Vepstas -- Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 18:42:28.243139205 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 18:49:24.196716323 -0600 @@ -97,14 +97,14 @@ static int eeh_error_buf_size; /* System monitoring statistics */ -static DEFINE_PER_CPU(unsigned long, no_device); -static DEFINE_PER_CPU(unsigned long, no_dn); -static DEFINE_PER_CPU(unsigned long, no_cfg_addr); -static DEFINE_PER_CPU(unsigned long, ignored_check); -static DEFINE_PER_CPU(unsigned long, total_mmio_ffs); -static DEFINE_PER_CPU(unsigned long, false_positives); -static DEFINE_PER_CPU(unsigned long, ignored_failures); -static DEFINE_PER_CPU(unsigned long, slot_resets); +static unsigned long no_device; +static unsigned long no_dn; +static unsigned long no_cfg_addr; +static unsigned long ignored_check; +static unsigned long total_mmio_ffs; +static unsigned long false_positives; +static unsigned long ignored_failures; +static unsigned long slot_resets; #define IS_BRIDGE(class_code) (((class_code)<<16) == PCI_BASE_CLASS_BRIDGE) @@ -288,13 +288,13 @@ enum pci_channel_state state; int rc = 0; - __get_cpu_var(total_mmio_ffs)++; + total_mmio_ffs++; if (!eeh_subsystem_enabled) return 0; if (!dn) { - __get_cpu_var(no_dn)++; + no_dn++; return 0; } pdn = PCI_DN(dn); @@ -302,7 +302,7 @@ /* Access to IO BARs might get this far and still not want checking. */ if (!(pdn->eeh_mode & EEH_MODE_SUPPORTED) || pdn->eeh_mode & EEH_MODE_NOCHECK) { - __get_cpu_var(ignored_check)++; + ignored_check++; #ifdef DEBUG printk ("EEH:ignored check (%x) for %s %s\n", pdn->eeh_mode, pci_name (dev), dn->full_name); @@ -311,7 +311,7 @@ } if (!pdn->eeh_config_addr && !pdn->eeh_pe_config_addr) { - __get_cpu_var(no_cfg_addr)++; + no_cfg_addr++; return 0; } @@ -353,7 +353,7 @@ if (ret != 0) { printk(KERN_WARNING "EEH: read_slot_reset_state() failed; rc=%d dn=%s\n", ret, dn->full_name); - __get_cpu_var(false_positives)++; + false_positives++; rc = 0; goto dn_unlock; } @@ -362,14 +362,14 @@ if (rets[1] != 1) { printk(KERN_WARNING "EEH: event on unsupported device, rc=%d dn=%s\n", ret, dn->full_name); - __get_cpu_var(false_positives)++; + false_positives++; rc = 0; goto dn_unlock; } /* If not the kind of error we know about, punt. */ if (rets[0] != 2 && rets[0] != 4 && rets[0] != 5) { - __get_cpu_var(false_positives)++; + false_positives++; rc = 0; goto dn_unlock; } @@ -377,12 +377,12 @@ /* Note that config-io to empty slots may fail; * we recognize empty because they don't have children. */ if ((rets[0] == 5) && (dn->child == NULL)) { - __get_cpu_var(false_positives)++; + false_positives++; rc = 0; goto dn_unlock; } - __get_cpu_var(slot_resets)++; + slot_resets++; /* Avoid repeated reports of this failure, including problems * with other functions on this device, and functions under @@ -432,7 +432,7 @@ addr = eeh_token_to_phys((unsigned long __force) token); dev = pci_get_device_by_addr(addr); if (!dev) { - __get_cpu_var(no_device)++; + no_device++; return val; } @@ -963,25 +963,9 @@ static int proc_eeh_show(struct seq_file *m, void *v) { - unsigned int cpu; - unsigned long ffs = 0, positives = 0, failures = 0; - unsigned long resets = 0; - unsigned long no_dev = 0, no_dn = 0, no_cfg = 0, no_check = 0; - - for_each_cpu(cpu) { - ffs += per_cpu(total_mmio_ffs, cpu); - positives += per_cpu(false_positives, cpu); - failures += per_cpu(ignored_failures, cpu); - resets += per_cpu(slot_resets, cpu); - no_dev += per_cpu(no_device, cpu); - no_dn += per_cpu(no_dn, cpu); - no_cfg += per_cpu(no_cfg_addr, cpu); - no_check += per_cpu(ignored_check, cpu); - } - if (0 == eeh_subsystem_enabled) { seq_printf(m, "EEH Subsystem is globally disabled\n"); - seq_printf(m, "eeh_total_mmio_ffs=%ld\n", ffs); + seq_printf(m, "eeh_total_mmio_ffs=%ld\n", total_mmio_ffs); } else { seq_printf(m, "EEH Subsystem is enabled\n"); seq_printf(m, @@ -993,8 +977,10 @@ "eeh_false_positives=%ld\n" "eeh_ignored_failures=%ld\n" "eeh_slot_resets=%ld\n", - no_dev, no_dn, no_cfg, no_check, - ffs, positives, failures, resets); + no_device, no_dn, no_cfg_addr, + ignored_check, total_mmio_ffs, + false_positives, ignored_failures, + slot_resets); } return 0; From linas at linas.org Fri Nov 4 11:57:35 2005 From: linas at linas.org (Linas Vepstas) Date: Thu, 3 Nov 2005 18:57:35 -0600 Subject: [PATCH 11/42]: ppc64: move code to powerpc directory from ppc64 References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104005735.GA27243@mail.gnucash.org> 11-eeh-move-to-powerpc.patch Move arch/ppc64/kernel/eeh.c to arch//powerpc/platforms/pseries/eeh.c No other changes (except for Makefile to build it) Signed-off-by: Linas Vepstas Index: linux-2.6.14-git3/arch/ppc64/kernel/eeh.c =================================================================== --- linux-2.6.14-git3.orig/arch/ppc64/kernel/eeh.c 2005-11-02 14:29:22.485829789 -0600 +++ /dev/null 1970-01-01 00:00:00.000000000 +0000 @@ -1,1093 +0,0 @@ -/* - * eeh.c - * Copyright (C) 2001 Dave Engebretsen & Todd Inglett IBM Corporation - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#undef DEBUG - -/** Overview: - * EEH, or "Extended Error Handling" is a PCI bridge technology for - * dealing with PCI bus errors that can't be dealt with within the - * usual PCI framework, except by check-stopping the CPU. Systems - * that are designed for high-availability/reliability cannot afford - * to crash due to a "mere" PCI error, thus the need for EEH. - * An EEH-capable bridge operates by converting a detected error - * into a "slot freeze", taking the PCI adapter off-line, making - * the slot behave, from the OS'es point of view, as if the slot - * were "empty": all reads return 0xff's and all writes are silently - * ignored. EEH slot isolation events can be triggered by parity - * errors on the address or data busses (e.g. during posted writes), - * which in turn might be caused by low voltage on the bus, dust, - * vibration, humidity, radioactivity or plain-old failed hardware. - * - * Note, however, that one of the leading causes of EEH slot - * freeze events are buggy device drivers, buggy device microcode, - * or buggy device hardware. This is because any attempt by the - * device to bus-master data to a memory address that is not - * assigned to the device will trigger a slot freeze. (The idea - * is to prevent devices-gone-wild from corrupting system memory). - * Buggy hardware/drivers will have a miserable time co-existing - * with EEH. - * - * Ideally, a PCI device driver, when suspecting that an isolation - * event has occured (e.g. by reading 0xff's), will then ask EEH - * whether this is the case, and then take appropriate steps to - * reset the PCI slot, the PCI device, and then resume operations. - * However, until that day, the checking is done here, with the - * eeh_check_failure() routine embedded in the MMIO macros. If - * the slot is found to be isolated, an "EEH Event" is synthesized - * and sent out for processing. - */ - -/* EEH event workqueue setup. */ -static DEFINE_SPINLOCK(eeh_eventlist_lock); -LIST_HEAD(eeh_eventlist); -static void eeh_event_handler(void *); -DECLARE_WORK(eeh_event_wq, eeh_event_handler, NULL); - -static struct notifier_block *eeh_notifier_chain; - -/* If a device driver keeps reading an MMIO register in an interrupt - * handler after a slot isolation event has occurred, we assume it - * is broken and panic. This sets the threshold for how many read - * attempts we allow before panicking. - */ -#define EEH_MAX_FAILS 100000 - -/* RTAS tokens */ -static int ibm_set_eeh_option; -static int ibm_set_slot_reset; -static int ibm_read_slot_reset_state; -static int ibm_read_slot_reset_state2; -static int ibm_slot_error_detail; - -static int eeh_subsystem_enabled; - -/* Lock to avoid races due to multiple reports of an error */ -static DEFINE_SPINLOCK(confirm_error_lock); - -/* Buffer for reporting slot-error-detail rtas calls */ -static unsigned char slot_errbuf[RTAS_ERROR_LOG_MAX]; -static DEFINE_SPINLOCK(slot_errbuf_lock); -static int eeh_error_buf_size; - -/* System monitoring statistics */ -static DEFINE_PER_CPU(unsigned long, no_device); -static DEFINE_PER_CPU(unsigned long, no_dn); -static DEFINE_PER_CPU(unsigned long, no_cfg_addr); -static DEFINE_PER_CPU(unsigned long, ignored_check); -static DEFINE_PER_CPU(unsigned long, total_mmio_ffs); -static DEFINE_PER_CPU(unsigned long, false_positives); -static DEFINE_PER_CPU(unsigned long, ignored_failures); -static DEFINE_PER_CPU(unsigned long, slot_resets); - -/** - * The pci address cache subsystem. This subsystem places - * PCI device address resources into a red-black tree, sorted - * according to the address range, so that given only an i/o - * address, the corresponding PCI device can be **quickly** - * found. It is safe to perform an address lookup in an interrupt - * context; this ability is an important feature. - * - * Currently, the only customer of this code is the EEH subsystem; - * thus, this code has been somewhat tailored to suit EEH better. - * In particular, the cache does *not* hold the addresses of devices - * for which EEH is not enabled. - * - * (Implementation Note: The RB tree seems to be better/faster - * than any hash algo I could think of for this problem, even - * with the penalty of slow pointer chases for d-cache misses). - */ -struct pci_io_addr_range -{ - struct rb_node rb_node; - unsigned long addr_lo; - unsigned long addr_hi; - struct pci_dev *pcidev; - unsigned int flags; -}; - -static struct pci_io_addr_cache -{ - struct rb_root rb_root; - spinlock_t piar_lock; -} pci_io_addr_cache_root; - -static inline struct pci_dev *__pci_get_device_by_addr(unsigned long addr) -{ - struct rb_node *n = pci_io_addr_cache_root.rb_root.rb_node; - - while (n) { - struct pci_io_addr_range *piar; - piar = rb_entry(n, struct pci_io_addr_range, rb_node); - - if (addr < piar->addr_lo) { - n = n->rb_left; - } else { - if (addr > piar->addr_hi) { - n = n->rb_right; - } else { - pci_dev_get(piar->pcidev); - return piar->pcidev; - } - } - } - - return NULL; -} - -/** - * pci_get_device_by_addr - Get device, given only address - * @addr: mmio (PIO) phys address or i/o port number - * - * Given an mmio phys address, or a port number, find a pci device - * that implements this address. Be sure to pci_dev_put the device - * when finished. I/O port numbers are assumed to be offset - * from zero (that is, they do *not* have pci_io_addr added in). - * It is safe to call this function within an interrupt. - */ -static struct pci_dev *pci_get_device_by_addr(unsigned long addr) -{ - struct pci_dev *dev; - unsigned long flags; - - spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags); - dev = __pci_get_device_by_addr(addr); - spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags); - return dev; -} - -#ifdef DEBUG -/* - * Handy-dandy debug print routine, does nothing more - * than print out the contents of our addr cache. - */ -static void pci_addr_cache_print(struct pci_io_addr_cache *cache) -{ - struct rb_node *n; - int cnt = 0; - - n = rb_first(&cache->rb_root); - while (n) { - struct pci_io_addr_range *piar; - piar = rb_entry(n, struct pci_io_addr_range, rb_node); - printk(KERN_DEBUG "PCI: %s addr range %d [%lx-%lx]: %s\n", - (piar->flags & IORESOURCE_IO) ? "i/o" : "mem", cnt, - piar->addr_lo, piar->addr_hi, pci_name(piar->pcidev)); - cnt++; - n = rb_next(n); - } -} -#endif - -/* Insert address range into the rb tree. */ -static struct pci_io_addr_range * -pci_addr_cache_insert(struct pci_dev *dev, unsigned long alo, - unsigned long ahi, unsigned int flags) -{ - struct rb_node **p = &pci_io_addr_cache_root.rb_root.rb_node; - struct rb_node *parent = NULL; - struct pci_io_addr_range *piar; - - /* Walk tree, find a place to insert into tree */ - while (*p) { - parent = *p; - piar = rb_entry(parent, struct pci_io_addr_range, rb_node); - if (ahi < piar->addr_lo) { - p = &parent->rb_left; - } else if (alo > piar->addr_hi) { - p = &parent->rb_right; - } else { - if (dev != piar->pcidev || - alo != piar->addr_lo || ahi != piar->addr_hi) { - printk(KERN_WARNING "PIAR: overlapping address range\n"); - } - return piar; - } - } - piar = (struct pci_io_addr_range *)kmalloc(sizeof(struct pci_io_addr_range), GFP_ATOMIC); - if (!piar) - return NULL; - - piar->addr_lo = alo; - piar->addr_hi = ahi; - piar->pcidev = dev; - piar->flags = flags; - -#ifdef DEBUG - printk(KERN_DEBUG "PIAR: insert range=[%lx:%lx] dev=%s\n", - alo, ahi, pci_name (dev)); -#endif - - rb_link_node(&piar->rb_node, parent, p); - rb_insert_color(&piar->rb_node, &pci_io_addr_cache_root.rb_root); - - return piar; -} - -static void __pci_addr_cache_insert_device(struct pci_dev *dev) -{ - struct device_node *dn; - struct pci_dn *pdn; - int i; - int inserted = 0; - - dn = pci_device_to_OF_node(dev); - if (!dn) { - printk(KERN_WARNING "PCI: no pci dn found for dev=%s\n", pci_name(dev)); - return; - } - - /* Skip any devices for which EEH is not enabled. */ - pdn = PCI_DN(dn); - if (!(pdn->eeh_mode & EEH_MODE_SUPPORTED) || - pdn->eeh_mode & EEH_MODE_NOCHECK) { -#ifdef DEBUG - printk(KERN_INFO "PCI: skip building address cache for=%s - %s\n", - pci_name(dev), pdn->node->full_name); -#endif - return; - } - - /* The cache holds a reference to the device... */ - pci_dev_get(dev); - - /* Walk resources on this device, poke them into the tree */ - for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) { - unsigned long start = pci_resource_start(dev,i); - unsigned long end = pci_resource_end(dev,i); - unsigned int flags = pci_resource_flags(dev,i); - - /* We are interested only bus addresses, not dma or other stuff */ - if (0 == (flags & (IORESOURCE_IO | IORESOURCE_MEM))) - continue; - if (start == 0 || ~start == 0 || end == 0 || ~end == 0) - continue; - pci_addr_cache_insert(dev, start, end, flags); - inserted = 1; - } - - /* If there was nothing to add, the cache has no reference... */ - if (!inserted) - pci_dev_put(dev); -} - -/** - * pci_addr_cache_insert_device - Add a device to the address cache - * @dev: PCI device whose I/O addresses we are interested in. - * - * In order to support the fast lookup of devices based on addresses, - * we maintain a cache of devices that can be quickly searched. - * This routine adds a device to that cache. - */ -static void pci_addr_cache_insert_device(struct pci_dev *dev) -{ - unsigned long flags; - - spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags); - __pci_addr_cache_insert_device(dev); - spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags); -} - -static inline void __pci_addr_cache_remove_device(struct pci_dev *dev) -{ - struct rb_node *n; - int removed = 0; - -restart: - n = rb_first(&pci_io_addr_cache_root.rb_root); - while (n) { - struct pci_io_addr_range *piar; - piar = rb_entry(n, struct pci_io_addr_range, rb_node); - - if (piar->pcidev == dev) { - rb_erase(n, &pci_io_addr_cache_root.rb_root); - removed = 1; - kfree(piar); - goto restart; - } - n = rb_next(n); - } - - /* The cache no longer holds its reference to this device... */ - if (removed) - pci_dev_put(dev); -} - -/** - * pci_addr_cache_remove_device - remove pci device from addr cache - * @dev: device to remove - * - * Remove a device from the addr-cache tree. - * This is potentially expensive, since it will walk - * the tree multiple times (once per resource). - * But so what; device removal doesn't need to be that fast. - */ -static void pci_addr_cache_remove_device(struct pci_dev *dev) -{ - unsigned long flags; - - spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags); - __pci_addr_cache_remove_device(dev); - spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags); -} - -/** - * pci_addr_cache_build - Build a cache of I/O addresses - * - * Build a cache of pci i/o addresses. This cache will be used to - * find the pci device that corresponds to a given address. - * This routine scans all pci busses to build the cache. - * Must be run late in boot process, after the pci controllers - * have been scaned for devices (after all device resources are known). - */ -void __init pci_addr_cache_build(void) -{ - struct pci_dev *dev = NULL; - - if (!eeh_subsystem_enabled) - return; - - spin_lock_init(&pci_io_addr_cache_root.piar_lock); - - while ((dev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) { - /* Ignore PCI bridges ( XXX why ??) */ - if ((dev->class >> 16) == PCI_BASE_CLASS_BRIDGE) { - continue; - } - pci_addr_cache_insert_device(dev); - } - -#ifdef DEBUG - /* Verify tree built up above, echo back the list of addrs. */ - pci_addr_cache_print(&pci_io_addr_cache_root); -#endif -} - -/* --------------------------------------------------------------- */ -/* Above lies the PCI Address Cache. Below lies the EEH event infrastructure */ - -void eeh_slot_error_detail (struct pci_dn *pdn, int severity) -{ - unsigned long flags; - int rc; - - /* Log the error with the rtas logger */ - spin_lock_irqsave(&slot_errbuf_lock, flags); - memset(slot_errbuf, 0, eeh_error_buf_size); - - rc = rtas_call(ibm_slot_error_detail, - 8, 1, NULL, pdn->eeh_config_addr, - BUID_HI(pdn->phb->buid), - BUID_LO(pdn->phb->buid), NULL, 0, - virt_to_phys(slot_errbuf), - eeh_error_buf_size, - severity); - - if (rc == 0) - log_error(slot_errbuf, ERR_TYPE_RTAS_LOG, 0); - spin_unlock_irqrestore(&slot_errbuf_lock, flags); -} - -/** - * eeh_register_notifier - Register to find out about EEH events. - * @nb: notifier block to callback on events - */ -int eeh_register_notifier(struct notifier_block *nb) -{ - return notifier_chain_register(&eeh_notifier_chain, nb); -} - -/** - * eeh_unregister_notifier - Unregister to an EEH event notifier. - * @nb: notifier block to callback on events - */ -int eeh_unregister_notifier(struct notifier_block *nb) -{ - return notifier_chain_unregister(&eeh_notifier_chain, nb); -} - -/** - * read_slot_reset_state - Read the reset state of a device node's slot - * @dn: device node to read - * @rets: array to return results in - */ -static int read_slot_reset_state(struct pci_dn *pdn, int rets[]) -{ - int token, outputs; - - if (ibm_read_slot_reset_state2 != RTAS_UNKNOWN_SERVICE) { - token = ibm_read_slot_reset_state2; - outputs = 4; - } else { - token = ibm_read_slot_reset_state; - rets[2] = 0; /* fake PE Unavailable info */ - outputs = 3; - } - - return rtas_call(token, 3, outputs, rets, pdn->eeh_config_addr, - BUID_HI(pdn->phb->buid), BUID_LO(pdn->phb->buid)); -} - -/** - * eeh_panic - call panic() for an eeh event that cannot be handled. - * The philosophy of this routine is that it is better to panic and - * halt the OS than it is to risk possible data corruption by - * oblivious device drivers that don't know better. - * - * @dev pci device that had an eeh event - * @reset_state current reset state of the device slot - */ -static void eeh_panic(struct pci_dev *dev, int reset_state) -{ - /* - * XXX We should create a separate sysctl for this. - * - * Since the panic_on_oops sysctl is used to halt the system - * in light of potential corruption, we can use it here. - */ - if (panic_on_oops) { - struct device_node *dn = pci_device_to_OF_node(dev); - eeh_slot_error_detail (PCI_DN(dn), 2 /* Permanent Error */); - panic("EEH: MMIO failure (%d) on device:%s\n", reset_state, - pci_name(dev)); - } - else { - __get_cpu_var(ignored_failures)++; - printk(KERN_INFO "EEH: Ignored MMIO failure (%d) on device:%s\n", - reset_state, pci_name(dev)); - } -} - -/** - * eeh_event_handler - dispatch EEH events. The detection of a frozen - * slot can occur inside an interrupt, where it can be hard to do - * anything about it. The goal of this routine is to pull these - * detection events out of the context of the interrupt handler, and - * re-dispatch them for processing at a later time in a normal context. - * - * @dummy - unused - */ -static void eeh_event_handler(void *dummy) -{ - unsigned long flags; - struct eeh_event *event; - - while (1) { - spin_lock_irqsave(&eeh_eventlist_lock, flags); - event = NULL; - if (!list_empty(&eeh_eventlist)) { - event = list_entry(eeh_eventlist.next, struct eeh_event, list); - list_del(&event->list); - } - spin_unlock_irqrestore(&eeh_eventlist_lock, flags); - if (event == NULL) - break; - - printk(KERN_INFO "EEH: MMIO failure (%d), notifiying device " - "%s\n", event->reset_state, - pci_name(event->dev)); - - notifier_call_chain (&eeh_notifier_chain, - EEH_NOTIFY_FREEZE, event); - - pci_dev_put(event->dev); - kfree(event); - } -} - -/** - * eeh_token_to_phys - convert EEH address token to phys address - * @token i/o token, should be address in the form 0xA.... - */ -static inline unsigned long eeh_token_to_phys(unsigned long token) -{ - pte_t *ptep; - unsigned long pa; - - ptep = find_linux_pte(init_mm.pgd, token); - if (!ptep) - return token; - pa = pte_pfn(*ptep) << PAGE_SHIFT; - - return pa | (token & (PAGE_SIZE-1)); -} - -/** - * Return the "partitionable endpoint" (pe) under which this device lies - */ -static struct device_node * find_device_pe(struct device_node *dn) -{ - while ((dn->parent) && PCI_DN(dn->parent) && - (PCI_DN(dn->parent)->eeh_mode & EEH_MODE_SUPPORTED)) { - dn = dn->parent; - } - return dn; -} - -/** Mark all devices that are peers of this device as failed. - * Mark the device driver too, so that it can see the failure - * immediately; this is critical, since some drivers poll - * status registers in interrupts ... If a driver is polling, - * and the slot is frozen, then the driver can deadlock in - * an interrupt context, which is bad. - */ - -static inline void __eeh_mark_slot (struct device_node *dn) -{ - while (dn) { - PCI_DN(dn)->eeh_mode |= EEH_MODE_ISOLATED; - - if (dn->child) - __eeh_mark_slot (dn->child); - dn = dn->sibling; - } -} - -static inline void __eeh_clear_slot (struct device_node *dn) -{ - while (dn) { - PCI_DN(dn)->eeh_mode &= ~EEH_MODE_ISOLATED; - if (dn->child) - __eeh_clear_slot (dn->child); - dn = dn->sibling; - } -} - -static inline void eeh_clear_slot (struct device_node *dn) -{ - unsigned long flags; - spin_lock_irqsave(&confirm_error_lock, flags); - __eeh_clear_slot (dn); - spin_unlock_irqrestore(&confirm_error_lock, flags); -} - -/** - * eeh_dn_check_failure - check if all 1's data is due to EEH slot freeze - * @dn device node - * @dev pci device, if known - * - * Check for an EEH failure for the given device node. Call this - * routine if the result of a read was all 0xff's and you want to - * find out if this is due to an EEH slot freeze. This routine - * will query firmware for the EEH status. - * - * Returns 0 if there has not been an EEH error; otherwise returns - * a non-zero value and queues up a slot isolation event notification. - * - * It is safe to call this routine in an interrupt context. - */ -int eeh_dn_check_failure(struct device_node *dn, struct pci_dev *dev) -{ - int ret; - int rets[3]; - unsigned long flags; - int reset_state; - struct eeh_event *event; - struct pci_dn *pdn; - struct device_node *pe_dn; - int rc = 0; - - __get_cpu_var(total_mmio_ffs)++; - - if (!eeh_subsystem_enabled) - return 0; - - if (!dn) { - __get_cpu_var(no_dn)++; - return 0; - } - pdn = PCI_DN(dn); - - /* Access to IO BARs might get this far and still not want checking. */ - if (!(pdn->eeh_mode & EEH_MODE_SUPPORTED) || - pdn->eeh_mode & EEH_MODE_NOCHECK) { - __get_cpu_var(ignored_check)++; -#ifdef DEBUG - printk ("EEH:ignored check (%x) for %s %s\n", - pdn->eeh_mode, pci_name (dev), dn->full_name); -#endif - return 0; - } - - if (!pdn->eeh_config_addr) { - __get_cpu_var(no_cfg_addr)++; - return 0; - } - - /* If we already have a pending isolation event for this - * slot, we know it's bad already, we don't need to check. - * Do this checking under a lock; as multiple PCI devices - * in one slot might report errors simultaneously, and we - * only want one error recovery routine running. - */ - spin_lock_irqsave(&confirm_error_lock, flags); - rc = 1; - if (pdn->eeh_mode & EEH_MODE_ISOLATED) { - pdn->eeh_check_count ++; - if (pdn->eeh_check_count >= EEH_MAX_FAILS) { - printk (KERN_ERR "EEH: Device driver ignored %d bad reads, panicing\n", - pdn->eeh_check_count); - dump_stack(); - - /* re-read the slot reset state */ - if (read_slot_reset_state(pdn, rets) != 0) - rets[0] = -1; /* reset state unknown */ - - /* If we are here, then we hit an infinite loop. Stop. */ - panic("EEH: MMIO halt (%d) on device:%s\n", rets[0], pci_name(dev)); - } - goto dn_unlock; - } - - /* - * Now test for an EEH failure. This is VERY expensive. - * Note that the eeh_config_addr may be a parent device - * in the case of a device behind a bridge, or it may be - * function zero of a multi-function device. - * In any case they must share a common PHB. - */ - ret = read_slot_reset_state(pdn, rets); - - /* If the call to firmware failed, punt */ - if (ret != 0) { - printk(KERN_WARNING "EEH: read_slot_reset_state() failed; rc=%d dn=%s\n", - ret, dn->full_name); - __get_cpu_var(false_positives)++; - rc = 0; - goto dn_unlock; - } - - /* If EEH is not supported on this device, punt. */ - if (rets[1] != 1) { - printk(KERN_WARNING "EEH: event on unsupported device, rc=%d dn=%s\n", - ret, dn->full_name); - __get_cpu_var(false_positives)++; - rc = 0; - goto dn_unlock; - } - - /* If not the kind of error we know about, punt. */ - if (rets[0] != 2 && rets[0] != 4 && rets[0] != 5) { - __get_cpu_var(false_positives)++; - rc = 0; - goto dn_unlock; - } - - /* Note that config-io to empty slots may fail; - * we recognize empty because they don't have children. */ - if ((rets[0] == 5) && (dn->child == NULL)) { - __get_cpu_var(false_positives)++; - rc = 0; - goto dn_unlock; - } - - __get_cpu_var(slot_resets)++; - - /* Avoid repeated reports of this failure, including problems - * with other functions on this device, and functions under - * bridges. */ - pe_dn = find_device_pe (dn); - __eeh_mark_slot (pe_dn); - spin_unlock_irqrestore(&confirm_error_lock, flags); - - reset_state = rets[0]; - - eeh_slot_error_detail (pdn, 1 /* Temporary Error */); - - printk(KERN_INFO "EEH: MMIO failure (%d) on device: %s %s\n", - rets[0], dn->name, dn->full_name); - event = kmalloc(sizeof(*event), GFP_ATOMIC); - if (event == NULL) { - eeh_panic(dev, reset_state); - return 1; - } - - event->dev = dev; - event->dn = dn; - event->reset_state = reset_state; - - /* We may or may not be called in an interrupt context */ - spin_lock_irqsave(&eeh_eventlist_lock, flags); - list_add(&event->list, &eeh_eventlist); - spin_unlock_irqrestore(&eeh_eventlist_lock, flags); - - /* Most EEH events are due to device driver bugs. Having - * a stack trace will help the device-driver authors figure - * out what happened. So print that out. */ - if (rets[0] != 5) dump_stack(); - schedule_work(&eeh_event_wq); - - return 1; - -dn_unlock: - spin_unlock_irqrestore(&confirm_error_lock, flags); - return rc; -} - -EXPORT_SYMBOL_GPL(eeh_dn_check_failure); - -/** - * eeh_check_failure - check if all 1's data is due to EEH slot freeze - * @token i/o token, should be address in the form 0xA.... - * @val value, should be all 1's (XXX why do we need this arg??) - * - * Check for an EEH failure at the given token address. Call this - * routine if the result of a read was all 0xff's and you want to - * find out if this is due to an EEH slot freeze event. This routine - * will query firmware for the EEH status. - * - * Note this routine is safe to call in an interrupt context. - */ -unsigned long eeh_check_failure(const volatile void __iomem *token, unsigned long val) -{ - unsigned long addr; - struct pci_dev *dev; - struct device_node *dn; - - /* Finding the phys addr + pci device; this is pretty quick. */ - addr = eeh_token_to_phys((unsigned long __force) token); - dev = pci_get_device_by_addr(addr); - if (!dev) { - __get_cpu_var(no_device)++; - return val; - } - - dn = pci_device_to_OF_node(dev); - eeh_dn_check_failure (dn, dev); - - pci_dev_put(dev); - return val; -} - -EXPORT_SYMBOL(eeh_check_failure); - -struct eeh_early_enable_info { - unsigned int buid_hi; - unsigned int buid_lo; -}; - -/* Enable eeh for the given device node. */ -static void *early_enable_eeh(struct device_node *dn, void *data) -{ - struct eeh_early_enable_info *info = data; - int ret; - char *status = get_property(dn, "status", NULL); - u32 *class_code = (u32 *)get_property(dn, "class-code", NULL); - u32 *vendor_id = (u32 *)get_property(dn, "vendor-id", NULL); - u32 *device_id = (u32 *)get_property(dn, "device-id", NULL); - u32 *regs; - int enable; - struct pci_dn *pdn = PCI_DN(dn); - - pdn->eeh_mode = 0; - pdn->eeh_check_count = 0; - pdn->eeh_freeze_count = 0; - - if (status && strcmp(status, "ok") != 0) - return NULL; /* ignore devices with bad status */ - - /* Ignore bad nodes. */ - if (!class_code || !vendor_id || !device_id) - return NULL; - - /* There is nothing to check on PCI to ISA bridges */ - if (dn->type && !strcmp(dn->type, "isa")) { - pdn->eeh_mode |= EEH_MODE_NOCHECK; - return NULL; - } - - /* - * Now decide if we are going to "Disable" EEH checking - * for this device. We still run with the EEH hardware active, - * but we won't be checking for ff's. This means a driver - * could return bad data (very bad!), an interrupt handler could - * hang waiting on status bits that won't change, etc. - * But there are a few cases like display devices that make sense. - */ - enable = 1; /* i.e. we will do checking */ - if ((*class_code >> 16) == PCI_BASE_CLASS_DISPLAY) - enable = 0; - - if (!enable) - pdn->eeh_mode |= EEH_MODE_NOCHECK; - - /* Ok... see if this device supports EEH. Some do, some don't, - * and the only way to find out is to check each and every one. */ - regs = (u32 *)get_property(dn, "reg", NULL); - if (regs) { - /* First register entry is addr (00BBSS00) */ - /* Try to enable eeh */ - ret = rtas_call(ibm_set_eeh_option, 4, 1, NULL, - regs[0], info->buid_hi, info->buid_lo, - EEH_ENABLE); - if (ret == 0) { - eeh_subsystem_enabled = 1; - pdn->eeh_mode |= EEH_MODE_SUPPORTED; - pdn->eeh_config_addr = regs[0]; -#ifdef DEBUG - printk(KERN_DEBUG "EEH: %s: eeh enabled\n", dn->full_name); -#endif - } else { - - /* This device doesn't support EEH, but it may have an - * EEH parent, in which case we mark it as supported. */ - if (dn->parent && PCI_DN(dn->parent) - && (PCI_DN(dn->parent)->eeh_mode & EEH_MODE_SUPPORTED)) { - /* Parent supports EEH. */ - pdn->eeh_mode |= EEH_MODE_SUPPORTED; - pdn->eeh_config_addr = PCI_DN(dn->parent)->eeh_config_addr; - return NULL; - } - } - } else { - printk(KERN_WARNING "EEH: %s: unable to get reg property.\n", - dn->full_name); - } - - return NULL; -} - -/* - * Initialize EEH by trying to enable it for all of the adapters in the system. - * As a side effect we can determine here if eeh is supported at all. - * Note that we leave EEH on so failed config cycles won't cause a machine - * check. If a user turns off EEH for a particular adapter they are really - * telling Linux to ignore errors. Some hardware (e.g. POWER5) won't - * grant access to a slot if EEH isn't enabled, and so we always enable - * EEH for all slots/all devices. - * - * The eeh-force-off option disables EEH checking globally, for all slots. - * Even if force-off is set, the EEH hardware is still enabled, so that - * newer systems can boot. - */ -void __init eeh_init(void) -{ - struct device_node *phb, *np; - struct eeh_early_enable_info info; - - spin_lock_init(&confirm_error_lock); - spin_lock_init(&slot_errbuf_lock); - - np = of_find_node_by_path("/rtas"); - if (np == NULL) - return; - - ibm_set_eeh_option = rtas_token("ibm,set-eeh-option"); - ibm_set_slot_reset = rtas_token("ibm,set-slot-reset"); - ibm_read_slot_reset_state2 = rtas_token("ibm,read-slot-reset-state2"); - ibm_read_slot_reset_state = rtas_token("ibm,read-slot-reset-state"); - ibm_slot_error_detail = rtas_token("ibm,slot-error-detail"); - - if (ibm_set_eeh_option == RTAS_UNKNOWN_SERVICE) - return; - - eeh_error_buf_size = rtas_token("rtas-error-log-max"); - if (eeh_error_buf_size == RTAS_UNKNOWN_SERVICE) { - eeh_error_buf_size = 1024; - } - if (eeh_error_buf_size > RTAS_ERROR_LOG_MAX) { - printk(KERN_WARNING "EEH: rtas-error-log-max is bigger than allocated " - "buffer ! (%d vs %d)", eeh_error_buf_size, RTAS_ERROR_LOG_MAX); - eeh_error_buf_size = RTAS_ERROR_LOG_MAX; - } - - /* Enable EEH for all adapters. Note that eeh requires buid's */ - for (phb = of_find_node_by_name(NULL, "pci"); phb; - phb = of_find_node_by_name(phb, "pci")) { - unsigned long buid; - - buid = get_phb_buid(phb); - if (buid == 0 || PCI_DN(phb) == NULL) - continue; - - info.buid_lo = BUID_LO(buid); - info.buid_hi = BUID_HI(buid); - traverse_pci_devices(phb, early_enable_eeh, &info); - } - - if (eeh_subsystem_enabled) - printk(KERN_INFO "EEH: PCI Enhanced I/O Error Handling Enabled\n"); - else - printk(KERN_WARNING "EEH: No capable adapters found\n"); -} - -/** - * eeh_add_device_early - enable EEH for the indicated device_node - * @dn: device node for which to set up EEH - * - * This routine must be used to perform EEH initialization for PCI - * devices that were added after system boot (e.g. hotplug, dlpar). - * This routine must be called before any i/o is performed to the - * adapter (inluding any config-space i/o). - * Whether this actually enables EEH or not for this device depends - * on the CEC architecture, type of the device, on earlier boot - * command-line arguments & etc. - */ -void eeh_add_device_early(struct device_node *dn) -{ - struct pci_controller *phb; - struct eeh_early_enable_info info; - - if (!dn || !PCI_DN(dn)) - return; - phb = PCI_DN(dn)->phb; - if (NULL == phb || 0 == phb->buid) { - printk(KERN_WARNING "EEH: Expected buid but found none for %s\n", - dn->full_name); - dump_stack(); - return; - } - - info.buid_hi = BUID_HI(phb->buid); - info.buid_lo = BUID_LO(phb->buid); - early_enable_eeh(dn, &info); -} -EXPORT_SYMBOL_GPL(eeh_add_device_early); - -/** - * eeh_add_device_late - perform EEH initialization for the indicated pci device - * @dev: pci device for which to set up EEH - * - * This routine must be used to complete EEH initialization for PCI - * devices that were added after system boot (e.g. hotplug, dlpar). - */ -void eeh_add_device_late(struct pci_dev *dev) -{ - struct device_node *dn; - - if (!dev || !eeh_subsystem_enabled) - return; - -#ifdef DEBUG - printk(KERN_DEBUG "EEH: adding device %s\n", pci_name(dev)); -#endif - - pci_dev_get (dev); - dn = pci_device_to_OF_node(dev); - PCI_DN(dn)->pcidev = dev; - - pci_addr_cache_insert_device (dev); -} -EXPORT_SYMBOL_GPL(eeh_add_device_late); - -/** - * eeh_remove_device - undo EEH setup for the indicated pci device - * @dev: pci device to be removed - * - * This routine should be when a device is removed from a running - * system (e.g. by hotplug or dlpar). - */ -void eeh_remove_device(struct pci_dev *dev) -{ - struct device_node *dn; - if (!dev || !eeh_subsystem_enabled) - return; - - /* Unregister the device with the EEH/PCI address search system */ -#ifdef DEBUG - printk(KERN_DEBUG "EEH: remove device %s\n", pci_name(dev)); -#endif - pci_addr_cache_remove_device(dev); - - dn = pci_device_to_OF_node(dev); - PCI_DN(dn)->pcidev = NULL; - pci_dev_put (dev); -} -EXPORT_SYMBOL_GPL(eeh_remove_device); - -static int proc_eeh_show(struct seq_file *m, void *v) -{ - unsigned int cpu; - unsigned long ffs = 0, positives = 0, failures = 0; - unsigned long resets = 0; - unsigned long no_dev = 0, no_dn = 0, no_cfg = 0, no_check = 0; - - for_each_cpu(cpu) { - ffs += per_cpu(total_mmio_ffs, cpu); - positives += per_cpu(false_positives, cpu); - failures += per_cpu(ignored_failures, cpu); - resets += per_cpu(slot_resets, cpu); - no_dev += per_cpu(no_device, cpu); - no_dn += per_cpu(no_dn, cpu); - no_cfg += per_cpu(no_cfg_addr, cpu); - no_check += per_cpu(ignored_check, cpu); - } - - if (0 == eeh_subsystem_enabled) { - seq_printf(m, "EEH Subsystem is globally disabled\n"); - seq_printf(m, "eeh_total_mmio_ffs=%ld\n", ffs); - } else { - seq_printf(m, "EEH Subsystem is enabled\n"); - seq_printf(m, - "no device=%ld\n" - "no device node=%ld\n" - "no config address=%ld\n" - "check not wanted=%ld\n" - "eeh_total_mmio_ffs=%ld\n" - "eeh_false_positives=%ld\n" - "eeh_ignored_failures=%ld\n" - "eeh_slot_resets=%ld\n", - no_dev, no_dn, no_cfg, no_check, - ffs, positives, failures, resets); - } - - return 0; -} - -static int proc_eeh_open(struct inode *inode, struct file *file) -{ - return single_open(file, proc_eeh_show, NULL); -} - -static struct file_operations proc_eeh_operations = { - .open = proc_eeh_open, - .read = seq_read, - .llseek = seq_lseek, - .release = single_release, -}; - -static int __init eeh_init_proc(void) -{ - struct proc_dir_entry *e; - - if (systemcfg->platform & PLATFORM_PSERIES) { - e = create_proc_entry("ppc64/eeh", 0, NULL); - if (e) - e->proc_fops = &proc_eeh_operations; - } - - return 0; -} -__initcall(eeh_init_proc); Index: linux-2.6.14-git3/arch/ppc64/kernel/Makefile =================================================================== --- linux-2.6.14-git3.orig/arch/ppc64/kernel/Makefile 2005-11-02 14:29:22.485829789 -0600 +++ linux-2.6.14-git3/arch/ppc64/kernel/Makefile 2005-11-02 14:30:49.805589414 -0600 @@ -35,7 +35,6 @@ bpa_iic.o spider-pic.o obj-$(CONFIG_KEXEC) += machine_kexec.o -obj-$(CONFIG_EEH) += eeh.o obj-$(CONFIG_PROC_FS) += proc_ppc64.o obj-$(CONFIG_RTAS_FLASH) += rtas_flash.o obj-$(CONFIG_SMP) += smp.o Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/Makefile =================================================================== --- linux-2.6.14-git3.orig/arch/powerpc/platforms/pseries/Makefile 2005-10-31 11:19:47.000000000 -0600 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/Makefile 2005-11-02 14:31:36.150092654 -0600 @@ -3,3 +3,4 @@ obj-$(CONFIG_SMP) += smp.o obj-$(CONFIG_IBMVIO) += vio.o obj-$(CONFIG_XICS) += xics.o +obj-$(CONFIG_EEH) += eeh.o Index: linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.14-git3/arch/powerpc/platforms/pseries/eeh.c 2005-11-02 14:30:49.790591516 -0600 @@ -0,0 +1,1093 @@ +/* + * eeh.c + * Copyright (C) 2001 Dave Engebretsen & Todd Inglett IBM Corporation + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#undef DEBUG + +/** Overview: + * EEH, or "Extended Error Handling" is a PCI bridge technology for + * dealing with PCI bus errors that can't be dealt with within the + * usual PCI framework, except by check-stopping the CPU. Systems + * that are designed for high-availability/reliability cannot afford + * to crash due to a "mere" PCI error, thus the need for EEH. + * An EEH-capable bridge operates by converting a detected error + * into a "slot freeze", taking the PCI adapter off-line, making + * the slot behave, from the OS'es point of view, as if the slot + * were "empty": all reads return 0xff's and all writes are silently + * ignored. EEH slot isolation events can be triggered by parity + * errors on the address or data busses (e.g. during posted writes), + * which in turn might be caused by low voltage on the bus, dust, + * vibration, humidity, radioactivity or plain-old failed hardware. + * + * Note, however, that one of the leading causes of EEH slot + * freeze events are buggy device drivers, buggy device microcode, + * or buggy device hardware. This is because any attempt by the + * device to bus-master data to a memory address that is not + * assigned to the device will trigger a slot freeze. (The idea + * is to prevent devices-gone-wild from corrupting system memory). + * Buggy hardware/drivers will have a miserable time co-existing + * with EEH. + * + * Ideally, a PCI device driver, when suspecting that an isolation + * event has occured (e.g. by reading 0xff's), will then ask EEH + * whether this is the case, and then take appropriate steps to + * reset the PCI slot, the PCI device, and then resume operations. + * However, until that day, the checking is done here, with the + * eeh_check_failure() routine embedded in the MMIO macros. If + * the slot is found to be isolated, an "EEH Event" is synthesized + * and sent out for processing. + */ + +/* EEH event workqueue setup. */ +static DEFINE_SPINLOCK(eeh_eventlist_lock); +LIST_HEAD(eeh_eventlist); +static void eeh_event_handler(void *); +DECLARE_WORK(eeh_event_wq, eeh_event_handler, NULL); + +static struct notifier_block *eeh_notifier_chain; + +/* If a device driver keeps reading an MMIO register in an interrupt + * handler after a slot isolation event has occurred, we assume it + * is broken and panic. This sets the threshold for how many read + * attempts we allow before panicking. + */ +#define EEH_MAX_FAILS 100000 + +/* RTAS tokens */ +static int ibm_set_eeh_option; +static int ibm_set_slot_reset; +static int ibm_read_slot_reset_state; +static int ibm_read_slot_reset_state2; +static int ibm_slot_error_detail; + +static int eeh_subsystem_enabled; + +/* Lock to avoid races due to multiple reports of an error */ +static DEFINE_SPINLOCK(confirm_error_lock); + +/* Buffer for reporting slot-error-detail rtas calls */ +static unsigned char slot_errbuf[RTAS_ERROR_LOG_MAX]; +static DEFINE_SPINLOCK(slot_errbuf_lock); +static int eeh_error_buf_size; + +/* System monitoring statistics */ +static DEFINE_PER_CPU(unsigned long, no_device); +static DEFINE_PER_CPU(unsigned long, no_dn); +static DEFINE_PER_CPU(unsigned long, no_cfg_addr); +static DEFINE_PER_CPU(unsigned long, ignored_check); +static DEFINE_PER_CPU(unsigned long, total_mmio_ffs); +static DEFINE_PER_CPU(unsigned long, false_positives); +static DEFINE_PER_CPU(unsigned long, ignored_failures); +static DEFINE_PER_CPU(unsigned long, slot_resets); + +/** + * The pci address cache subsystem. This subsystem places + * PCI device address resources into a red-black tree, sorted + * according to the address range, so that given only an i/o + * address, the corresponding PCI device can be **quickly** + * found. It is safe to perform an address lookup in an interrupt + * context; this ability is an important feature. + * + * Currently, the only customer of this code is the EEH subsystem; + * thus, this code has been somewhat tailored to suit EEH better. + * In particular, the cache does *not* hold the addresses of devices + * for which EEH is not enabled. + * + * (Implementation Note: The RB tree seems to be better/faster + * than any hash algo I could think of for this problem, even + * with the penalty of slow pointer chases for d-cache misses). + */ +struct pci_io_addr_range +{ + struct rb_node rb_node; + unsigned long addr_lo; + unsigned long addr_hi; + struct pci_dev *pcidev; + unsigned int flags; +}; + +static struct pci_io_addr_cache +{ + struct rb_root rb_root; + spinlock_t piar_lock; +} pci_io_addr_cache_root; + +static inline struct pci_dev *__pci_get_device_by_addr(unsigned long addr) +{ + struct rb_node *n = pci_io_addr_cache_root.rb_root.rb_node; + + while (n) { + struct pci_io_addr_range *piar; + piar = rb_entry(n, struct pci_io_addr_range, rb_node); + + if (addr < piar->addr_lo) { + n = n->rb_left; + } else { + if (addr > piar->addr_hi) { + n = n->rb_right; + } else { + pci_dev_get(piar->pcidev); + return piar->pcidev; + } + } + } + + return NULL; +} + +/** + * pci_get_device_by_addr - Get device, given only address + * @addr: mmio (PIO) phys address or i/o port number + * + * Given an mmio phys address, or a port number, find a pci device + * that implements this address. Be sure to pci_dev_put the device + * when finished. I/O port numbers are assumed to be offset + * from zero (that is, they do *not* have pci_io_addr added in). + * It is safe to call this function within an interrupt. + */ +static struct pci_dev *pci_get_device_by_addr(unsigned long addr) +{ + struct pci_dev *dev; + unsigned long flags; + + spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags); + dev = __pci_get_device_by_addr(addr); + spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags); + return dev; +} + +#ifdef DEBUG +/* + * Handy-dandy debug print routine, does nothing more + * than print out the contents of our addr cache. + */ +static void pci_addr_cache_print(struct pci_io_addr_cache *cache) +{ + struct rb_node *n; + int cnt = 0; + + n = rb_first(&cache->rb_root); + while (n) { + struct pci_io_addr_range *piar; + piar = rb_entry(n, struct pci_io_addr_range, rb_node); + printk(KERN_DEBUG "PCI: %s addr range %d [%lx-%lx]: %s\n", + (piar->flags & IORESOURCE_IO) ? "i/o" : "mem", cnt, + piar->addr_lo, piar->addr_hi, pci_name(piar->pcidev)); + cnt++; + n = rb_next(n); + } +} +#endif + +/* Insert address range into the rb tree. */ +static struct pci_io_addr_range * +pci_addr_cache_insert(struct pci_dev *dev, unsigned long alo, + unsigned long ahi, unsigned int flags) +{ + struct rb_node **p = &pci_io_addr_cache_root.rb_root.rb_node; + struct rb_node *parent = NULL; + struct pci_io_addr_range *piar; + + /* Walk tree, find a place to insert into tree */ + while (*p) { + parent = *p; + piar = rb_entry(parent, struct pci_io_addr_range, rb_node); + if (ahi < piar->addr_lo) { + p = &parent->rb_left; + } else if (alo > piar->addr_hi) { + p = &parent->rb_right; + } else { + if (dev != piar->pcidev || + alo != piar->addr_lo || ahi != piar->addr_hi) { + printk(KERN_WARNING "PIAR: overlapping address range\n"); + } + return piar; + } + } + piar = (struct pci_io_addr_range *)kmalloc(sizeof(struct pci_io_addr_range), GFP_ATOMIC); + if (!piar) + return NULL; + + piar->addr_lo = alo; + piar->addr_hi = ahi; + piar->pcidev = dev; + piar->flags = flags; + +#ifdef DEBUG + printk(KERN_DEBUG "PIAR: insert range=[%lx:%lx] dev=%s\n", + alo, ahi, pci_name (dev)); +#endif + + rb_link_node(&piar->rb_node, parent, p); + rb_insert_color(&piar->rb_node, &pci_io_addr_cache_root.rb_root); + + return piar; +} + +static void __pci_addr_cache_insert_device(struct pci_dev *dev) +{ + struct device_node *dn; + struct pci_dn *pdn; + int i; + int inserted = 0; + + dn = pci_device_to_OF_node(dev); + if (!dn) { + printk(KERN_WARNING "PCI: no pci dn found for dev=%s\n", pci_name(dev)); + return; + } + + /* Skip any devices for which EEH is not enabled. */ + pdn = PCI_DN(dn); + if (!(pdn->eeh_mode & EEH_MODE_SUPPORTED) || + pdn->eeh_mode & EEH_MODE_NOCHECK) { +#ifdef DEBUG + printk(KERN_INFO "PCI: skip building address cache for=%s - %s\n", + pci_name(dev), pdn->node->full_name); +#endif + return; + } + + /* The cache holds a reference to the device... */ + pci_dev_get(dev); + + /* Walk resources on this device, poke them into the tree */ + for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) { + unsigned long start = pci_resource_start(dev,i); + unsigned long end = pci_resource_end(dev,i); + unsigned int flags = pci_resource_flags(dev,i); + + /* We are interested only bus addresses, not dma or other stuff */ + if (0 == (flags & (IORESOURCE_IO | IORESOURCE_MEM))) + continue; + if (start == 0 || ~start == 0 || end == 0 || ~end == 0) + continue; + pci_addr_cache_insert(dev, start, end, flags); + inserted = 1; + } + + /* If there was nothing to add, the cache has no reference... */ + if (!inserted) + pci_dev_put(dev); +} + +/** + * pci_addr_cache_insert_device - Add a device to the address cache + * @dev: PCI device whose I/O addresses we are interested in. + * + * In order to support the fast lookup of devices based on addresses, + * we maintain a cache of devices that can be quickly searched. + * This routine adds a device to that cache. + */ +static void pci_addr_cache_insert_device(struct pci_dev *dev) +{ + unsigned long flags; + + spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags); + __pci_addr_cache_insert_device(dev); + spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags); +} + +static inline void __pci_addr_cache_remove_device(struct pci_dev *dev) +{ + struct rb_node *n; + int removed = 0; + +restart: + n = rb_first(&pci_io_addr_cache_root.rb_root); + while (n) { + struct pci_io_addr_range *piar; + piar = rb_entry(n, struct pci_io_addr_range, rb_node); + + if (piar->pcidev == dev) { + rb_erase(n, &pci_io_addr_cache_root.rb_root); + removed = 1; + kfree(piar); + goto restart; + } + n = rb_next(n); + } + + /* The cache no longer holds its reference to this device... */ + if (removed) + pci_dev_put(dev); +} + +/** + * pci_addr_cache_remove_device - remove pci device from addr cache + * @dev: device to remove + * + * Remove a device from the addr-cache tree. + * This is potentially expensive, since it will walk + * the tree multiple times (once per resource). + * But so what; device removal doesn't need to be that fast. + */ +static void pci_addr_cache_remove_device(struct pci_dev *dev) +{ + unsigned long flags; + + spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags); + __pci_addr_cache_remove_device(dev); + spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags); +} + +/** + * pci_addr_cache_build - Build a cache of I/O addresses + * + * Build a cache of pci i/o addresses. This cache will be used to + * find the pci device that corresponds to a given address. + * This routine scans all pci busses to build the cache. + * Must be run late in boot process, after the pci controllers + * have been scaned for devices (after all device resources are known). + */ +void __init pci_addr_cache_build(void) +{ + struct pci_dev *dev = NULL; + + if (!eeh_subsystem_enabled) + return; + + spin_lock_init(&pci_io_addr_cache_root.piar_lock); + + while ((dev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) { + /* Ignore PCI bridges ( XXX why ??) */ + if ((dev->class >> 16) == PCI_BASE_CLASS_BRIDGE) { + continue; + } + pci_addr_cache_insert_device(dev); + } + +#ifdef DEBUG + /* Verify tree built up above, echo back the list of addrs. */ + pci_addr_cache_print(&pci_io_addr_cache_root); +#endif +} + +/* --------------------------------------------------------------- */ +/* Above lies the PCI Address Cache. Below lies the EEH event infrastructure */ + +void eeh_slot_error_detail (struct pci_dn *pdn, int severity) +{ + unsigned long flags; + int rc; + + /* Log the error with the rtas logger */ + spin_lock_irqsave(&slot_errbuf_lock, flags); + memset(slot_errbuf, 0, eeh_error_buf_size); + + rc = rtas_call(ibm_slot_error_detail, + 8, 1, NULL, pdn->eeh_config_addr, + BUID_HI(pdn->phb->buid), + BUID_LO(pdn->phb->buid), NULL, 0, + virt_to_phys(slot_errbuf), + eeh_error_buf_size, + severity); + + if (rc == 0) + log_error(slot_errbuf, ERR_TYPE_RTAS_LOG, 0); + spin_unlock_irqrestore(&slot_errbuf_lock, flags); +} + +/** + * eeh_register_notifier - Register to find out about EEH events. + * @nb: notifier block to callback on events + */ +int eeh_register_notifier(struct notifier_block *nb) +{ + return notifier_chain_register(&eeh_notifier_chain, nb); +} + +/** + * eeh_unregister_notifier - Unregister to an EEH event notifier. + * @nb: notifier block to callback on events + */ +int eeh_unregister_notifier(struct notifier_block *nb) +{ + return notifier_chain_unregister(&eeh_notifier_chain, nb); +} + +/** + * read_slot_reset_state - Read the reset state of a device node's slot + * @dn: device node to read + * @rets: array to return results in + */ +static int read_slot_reset_state(struct pci_dn *pdn, int rets[]) +{ + int token, outputs; + + if (ibm_read_slot_reset_state2 != RTAS_UNKNOWN_SERVICE) { + token = ibm_read_slot_reset_state2; + outputs = 4; + } else { + token = ibm_read_slot_reset_state; + rets[2] = 0; /* fake PE Unavailable info */ + outputs = 3; + } + + return rtas_call(token, 3, outputs, rets, pdn->eeh_config_addr, + BUID_HI(pdn->phb->buid), BUID_LO(pdn->phb->buid)); +} + +/** + * eeh_panic - call panic() for an eeh event that cannot be handled. + * The philosophy of this routine is that it is better to panic and + * halt the OS than it is to risk possible data corruption by + * oblivious device drivers that don't know better. + * + * @dev pci device that had an eeh event + * @reset_state current reset state of the device slot + */ +static void eeh_panic(struct pci_dev *dev, int reset_state) +{ + /* + * XXX We should create a separate sysctl for this. + * + * Since the panic_on_oops sysctl is used to halt the system + * in light of potential corruption, we can use it here. + */ + if (panic_on_oops) { + struct device_node *dn = pci_device_to_OF_node(dev); + eeh_slot_error_detail (PCI_DN(dn), 2 /* Permanent Error */); + panic("EEH: MMIO failure (%d) on device:%s\n", reset_state, + pci_name(dev)); + } + else { + __get_cpu_var(ignored_failures)++; + printk(KERN_INFO "EEH: Ignored MMIO failure (%d) on device:%s\n", + reset_state, pci_name(dev)); + } +} + +/** + * eeh_event_handler - dispatch EEH events. The detection of a frozen + * slot can occur inside an interrupt, where it can be hard to do + * anything about it. The goal of this routine is to pull these + * detection events out of the context of the interrupt handler, and + * re-dispatch them for processing at a later time in a normal context. + * + * @dummy - unused + */ +static void eeh_event_handler(void *dummy) +{ + unsigned long flags; + struct eeh_event *event; + + while (1) { + spin_lock_irqsave(&eeh_eventlist_lock, flags); + event = NULL; + if (!list_empty(&eeh_eventlist)) { + event = list_entry(eeh_eventlist.next, struct eeh_event, list); + list_del(&event->list); + } + spin_unlock_irqrestore(&eeh_eventlist_lock, flags); + if (event == NULL) + break; + + printk(KERN_INFO "EEH: MMIO failure (%d), notifiying device " + "%s\n", event->reset_state, + pci_name(event->dev)); + + notifier_call_chain (&eeh_notifier_chain, + EEH_NOTIFY_FREEZE, event); + + pci_dev_put(event->dev); + kfree(event); + } +} + +/** + * eeh_token_to_phys - convert EEH address token to phys address + * @token i/o token, should be address in the form 0xA.... + */ +static inline unsigned long eeh_token_to_phys(unsigned long token) +{ + pte_t *ptep; + unsigned long pa; + + ptep = find_linux_pte(init_mm.pgd, token); + if (!ptep) + return token; + pa = pte_pfn(*ptep) << PAGE_SHIFT; + + return pa | (token & (PAGE_SIZE-1)); +} + +/** + * Return the "partitionable endpoint" (pe) under which this device lies + */ +static struct device_node * find_device_pe(struct device_node *dn) +{ + while ((dn->parent) && PCI_DN(dn->parent) && + (PCI_DN(dn->parent)->eeh_mode & EEH_MODE_SUPPORTED)) { + dn = dn->parent; + } + return dn; +} + +/** Mark all devices that are peers of this device as failed. + * Mark the device driver too, so that it can see the failure + * immediately; this is critical, since some drivers poll + * status registers in interrupts ... If a driver is polling, + * and the slot is frozen, then the driver can deadlock in + * an interrupt context, which is bad. + */ + +static inline void __eeh_mark_slot (struct device_node *dn) +{ + while (dn) { + PCI_DN(dn)->eeh_mode |= EEH_MODE_ISOLATED; + + if (dn->child) + __eeh_mark_slot (dn->child); + dn = dn->sibling; + } +} + +static inline void __eeh_clear_slot (struct device_node *dn) +{ + while (dn) { + PCI_DN(dn)->eeh_mode &= ~EEH_MODE_ISOLATED; + if (dn->child) + __eeh_clear_slot (dn->child); + dn = dn->sibling; + } +} + +static inline void eeh_clear_slot (struct device_node *dn) +{ + unsigned long flags; + spin_lock_irqsave(&confirm_error_lock, flags); + __eeh_clear_slot (dn); + spin_unlock_irqrestore(&confirm_error_lock, flags); +} + +/** + * eeh_dn_check_failure - check if all 1's data is due to EEH slot freeze + * @dn device node + * @dev pci device, if known + * + * Check for an EEH failure for the given device node. Call this + * routine if the result of a read was all 0xff's and you want to + * find out if this is due to an EEH slot freeze. This routine + * will query firmware for the EEH status. + * + * Returns 0 if there has not been an EEH error; otherwise returns + * a non-zero value and queues up a slot isolation event notification. + * + * It is safe to call this routine in an interrupt context. + */ +int eeh_dn_check_failure(struct device_node *dn, struct pci_dev *dev) +{ + int ret; + int rets[3]; + unsigned long flags; + int reset_state; + struct eeh_event *event; + struct pci_dn *pdn; + struct device_node *pe_dn; + int rc = 0; + + __get_cpu_var(total_mmio_ffs)++; + + if (!eeh_subsystem_enabled) + return 0; + + if (!dn) { + __get_cpu_var(no_dn)++; + return 0; + } + pdn = PCI_DN(dn); + + /* Access to IO BARs might get this far and still not want checking. */ + if (!(pdn->eeh_mode & EEH_MODE_SUPPORTED) || + pdn->eeh_mode & EEH_MODE_NOCHECK) { + __get_cpu_var(ignored_check)++; +#ifdef DEBUG + printk ("EEH:ignored check (%x) for %s %s\n", + pdn->eeh_mode, pci_name (dev), dn->full_name); +#endif + return 0; + } + + if (!pdn->eeh_config_addr) { + __get_cpu_var(no_cfg_addr)++; + return 0; + } + + /* If we already have a pending isolation event for this + * slot, we know it's bad already, we don't need to check. + * Do this checking under a lock; as multiple PCI devices + * in one slot might report errors simultaneously, and we + * only want one error recovery routine running. + */ + spin_lock_irqsave(&confirm_error_lock, flags); + rc = 1; + if (pdn->eeh_mode & EEH_MODE_ISOLATED) { + pdn->eeh_check_count ++; + if (pdn->eeh_check_count >= EEH_MAX_FAILS) { + printk (KERN_ERR "EEH: Device driver ignored %d bad reads, panicing\n", + pdn->eeh_check_count); + dump_stack(); + + /* re-read the slot reset state */ + if (read_slot_reset_state(pdn, rets) != 0) + rets[0] = -1; /* reset state unknown */ + + /* If we are here, then we hit an infinite loop. Stop. */ + panic("EEH: MMIO halt (%d) on device:%s\n", rets[0], pci_name(dev)); + } + goto dn_unlock; + } + + /* + * Now test for an EEH failure. This is VERY expensive. + * Note that the eeh_config_addr may be a parent device + * in the case of a device behind a bridge, or it may be + * function zero of a multi-function device. + * In any case they must share a common PHB. + */ + ret = read_slot_reset_state(pdn, rets); + + /* If the call to firmware failed, punt */ + if (ret != 0) { + printk(KERN_WARNING "EEH: read_slot_reset_state() failed; rc=%d dn=%s\n", + ret, dn->full_name); + __get_cpu_var(false_positives)++; + rc = 0; + goto dn_unlock; + } + + /* If EEH is not supported on this device, punt. */ + if (rets[1] != 1) { + printk(KERN_WARNING "EEH: event on unsupported device, rc=%d dn=%s\n", + ret, dn->full_name); + __get_cpu_var(false_positives)++; + rc = 0; + goto dn_unlock; + } + + /* If not the kind of error we know about, punt. */ + if (rets[0] != 2 && rets[0] != 4 && rets[0] != 5) { + __get_cpu_var(false_positives)++; + rc = 0; + goto dn_unlock; + } + + /* Note that config-io to empty slots may fail; + * we recognize empty because they don't have children. */ + if ((rets[0] == 5) && (dn->child == NULL)) { + __get_cpu_var(false_positives)++; + rc = 0; + goto dn_unlock; + } + + __get_cpu_var(slot_resets)++; + + /* Avoid repeated reports of this failure, including problems + * with other functions on this device, and functions under + * bridges. */ + pe_dn = find_device_pe (dn); + __eeh_mark_slot (pe_dn); + spin_unlock_irqrestore(&confirm_error_lock, flags); + + reset_state = rets[0]; + + eeh_slot_error_detail (pdn, 1 /* Temporary Error */); + + printk(KERN_INFO "EEH: MMIO failure (%d) on device: %s %s\n", + rets[0], dn->name, dn->full_name); + event = kmalloc(sizeof(*event), GFP_ATOMIC); + if (event == NULL) { + eeh_panic(dev, reset_state); + return 1; + } + + event->dev = dev; + event->dn = dn; + event->reset_state = reset_state; + + /* We may or may not be called in an interrupt context */ + spin_lock_irqsave(&eeh_eventlist_lock, flags); + list_add(&event->list, &eeh_eventlist); + spin_unlock_irqrestore(&eeh_eventlist_lock, flags); + + /* Most EEH events are due to device driver bugs. Having + * a stack trace will help the device-driver authors figure + * out what happened. So print that out. */ + if (rets[0] != 5) dump_stack(); + schedule_work(&eeh_event_wq); + + return 1; + +dn_unlock: + spin_unlock_irqrestore(&confirm_error_lock, flags); + return rc; +} + +EXPORT_SYMBOL_GPL(eeh_dn_check_failure); + +/** + * eeh_check_failure - check if all 1's data is due to EEH slot freeze + * @token i/o token, should be address in the form 0xA.... + * @val value, should be all 1's (XXX why do we need this arg??) + * + * Check for an EEH failure at the given token address. Call this + * routine if the result of a read was all 0xff's and you want to + * find out if this is due to an EEH slot freeze event. This routine + * will query firmware for the EEH status. + * + * Note this routine is safe to call in an interrupt context. + */ +unsigned long eeh_check_failure(const volatile void __iomem *token, unsigned long val) +{ + unsigned long addr; + struct pci_dev *dev; + struct device_node *dn; + + /* Finding the phys addr + pci device; this is pretty quick. */ + addr = eeh_token_to_phys((unsigned long __force) token); + dev = pci_get_device_by_addr(addr); + if (!dev) { + __get_cpu_var(no_device)++; + return val; + } + + dn = pci_device_to_OF_node(dev); + eeh_dn_check_failure (dn, dev); + + pci_dev_put(dev); + return val; +} + +EXPORT_SYMBOL(eeh_check_failure); + +struct eeh_early_enable_info { + unsigned int buid_hi; + unsigned int buid_lo; +}; + +/* Enable eeh for the given device node. */ +static void *early_enable_eeh(struct device_node *dn, void *data) +{ + struct eeh_early_enable_info *info = data; + int ret; + char *status = get_property(dn, "status", NULL); + u32 *class_code = (u32 *)get_property(dn, "class-code", NULL); + u32 *vendor_id = (u32 *)get_property(dn, "vendor-id", NULL); + u32 *device_id = (u32 *)get_property(dn, "device-id", NULL); + u32 *regs; + int enable; + struct pci_dn *pdn = PCI_DN(dn); + + pdn->eeh_mode = 0; + pdn->eeh_check_count = 0; + pdn->eeh_freeze_count = 0; + + if (status && strcmp(status, "ok") != 0) + return NULL; /* ignore devices with bad status */ + + /* Ignore bad nodes. */ + if (!class_code || !vendor_id || !device_id) + return NULL; + + /* There is nothing to check on PCI to ISA bridges */ + if (dn->type && !strcmp(dn->type, "isa")) { + pdn->eeh_mode |= EEH_MODE_NOCHECK; + return NULL; + } + + /* + * Now decide if we are going to "Disable" EEH checking + * for this device. We still run with the EEH hardware active, + * but we won't be checking for ff's. This means a driver + * could return bad data (very bad!), an interrupt handler could + * hang waiting on status bits that won't change, etc. + * But there are a few cases like display devices that make sense. + */ + enable = 1; /* i.e. we will do checking */ + if ((*class_code >> 16) == PCI_BASE_CLASS_DISPLAY) + enable = 0; + + if (!enable) + pdn->eeh_mode |= EEH_MODE_NOCHECK; + + /* Ok... see if this device supports EEH. Some do, some don't, + * and the only way to find out is to check each and every one. */ + regs = (u32 *)get_property(dn, "reg", NULL); + if (regs) { + /* First register entry is addr (00BBSS00) */ + /* Try to enable eeh */ + ret = rtas_call(ibm_set_eeh_option, 4, 1, NULL, + regs[0], info->buid_hi, info->buid_lo, + EEH_ENABLE); + if (ret == 0) { + eeh_subsystem_enabled = 1; + pdn->eeh_mode |= EEH_MODE_SUPPORTED; + pdn->eeh_config_addr = regs[0]; +#ifdef DEBUG + printk(KERN_DEBUG "EEH: %s: eeh enabled\n", dn->full_name); +#endif + } else { + + /* This device doesn't support EEH, but it may have an + * EEH parent, in which case we mark it as supported. */ + if (dn->parent && PCI_DN(dn->parent) + && (PCI_DN(dn->parent)->eeh_mode & EEH_MODE_SUPPORTED)) { + /* Parent supports EEH. */ + pdn->eeh_mode |= EEH_MODE_SUPPORTED; + pdn->eeh_config_addr = PCI_DN(dn->parent)->eeh_config_addr; + return NULL; + } + } + } else { + printk(KERN_WARNING "EEH: %s: unable to get reg property.\n", + dn->full_name); + } + + return NULL; +} + +/* + * Initialize EEH by trying to enable it for all of the adapters in the system. + * As a side effect we can determine here if eeh is supported at all. + * Note that we leave EEH on so failed config cycles won't cause a machine + * check. If a user turns off EEH for a particular adapter they are really + * telling Linux to ignore errors. Some hardware (e.g. POWER5) won't + * grant access to a slot if EEH isn't enabled, and so we always enable + * EEH for all slots/all devices. + * + * The eeh-force-off option disables EEH checking globally, for all slots. + * Even if force-off is set, the EEH hardware is still enabled, so that + * newer systems can boot. + */ +void __init eeh_init(void) +{ + struct device_node *phb, *np; + struct eeh_early_enable_info info; + + spin_lock_init(&confirm_error_lock); + spin_lock_init(&slot_errbuf_lock); + + np = of_find_node_by_path("/rtas"); + if (np == NULL) + return; + + ibm_set_eeh_option = rtas_token("ibm,set-eeh-option"); + ibm_set_slot_reset = rtas_token("ibm,set-slot-reset"); + ibm_read_slot_reset_state2 = rtas_token("ibm,read-slot-reset-state2"); + ibm_read_slot_reset_state = rtas_token("ibm,read-slot-reset-state"); + ibm_slot_error_detail = rtas_token("ibm,slot-error-detail"); + + if (ibm_set_eeh_option == RTAS_UNKNOWN_SERVICE) + return; + + eeh_error_buf_size = rtas_token("rtas-error-log-max"); + if (eeh_error_buf_size == RTAS_UNKNOWN_SERVICE) { + eeh_error_buf_size = 1024; + } + if (eeh_error_buf_size > RTAS_ERROR_LOG_MAX) { + printk(KERN_WARNING "EEH: rtas-error-log-max is bigger than allocated " + "buffer ! (%d vs %d)", eeh_error_buf_size, RTAS_ERROR_LOG_MAX); + eeh_error_buf_size = RTAS_ERROR_LOG_MAX; + } + + /* Enable EEH for all adapters. Note that eeh requires buid's */ + for (phb = of_find_node_by_name(NULL, "pci"); phb; + phb = of_find_node_by_name(phb, "pci")) { + unsigned long buid; + + buid = get_phb_buid(phb); + if (buid == 0 || PCI_DN(phb) == NULL) + continue; + + info.buid_lo = BUID_LO(buid); + info.buid_hi = BUID_HI(buid); + traverse_pci_devices(phb, early_enable_eeh, &info); + } + + if (eeh_subsystem_enabled) + printk(KERN_INFO "EEH: PCI Enhanced I/O Error Handling Enabled\n"); + else + printk(KERN_WARNING "EEH: No capable adapters found\n"); +} + +/** + * eeh_add_device_early - enable EEH for the indicated device_node + * @dn: device node for which to set up EEH + * + * This routine must be used to perform EEH initialization for PCI + * devices that were added after system boot (e.g. hotplug, dlpar). + * This routine must be called before any i/o is performed to the + * adapter (inluding any config-space i/o). + * Whether this actually enables EEH or not for this device depends + * on the CEC architecture, type of the device, on earlier boot + * command-line arguments & etc. + */ +void eeh_add_device_early(struct device_node *dn) +{ + struct pci_controller *phb; + struct eeh_early_enable_info info; + + if (!dn || !PCI_DN(dn)) + return; + phb = PCI_DN(dn)->phb; + if (NULL == phb || 0 == phb->buid) { + printk(KERN_WARNING "EEH: Expected buid but found none for %s\n", + dn->full_name); + dump_stack(); + return; + } + + info.buid_hi = BUID_HI(phb->buid); + info.buid_lo = BUID_LO(phb->buid); + early_enable_eeh(dn, &info); +} +EXPORT_SYMBOL_GPL(eeh_add_device_early); + +/** + * eeh_add_device_late - perform EEH initialization for the indicated pci device + * @dev: pci device for which to set up EEH + * + * This routine must be used to complete EEH initialization for PCI + * devices that were added after system boot (e.g. hotplug, dlpar). + */ +void eeh_add_device_late(struct pci_dev *dev) +{ + struct device_node *dn; + + if (!dev || !eeh_subsystem_enabled) + return; + +#ifdef DEBUG + printk(KERN_DEBUG "EEH: adding device %s\n", pci_name(dev)); +#endif + + pci_dev_get (dev); + dn = pci_device_to_OF_node(dev); + PCI_DN(dn)->pcidev = dev; + + pci_addr_cache_insert_device (dev); +} +EXPORT_SYMBOL_GPL(eeh_add_device_late); + +/** + * eeh_remove_device - undo EEH setup for the indicated pci device + * @dev: pci device to be removed + * + * This routine should be when a device is removed from a running + * system (e.g. by hotplug or dlpar). + */ +void eeh_remove_device(struct pci_dev *dev) +{ + struct device_node *dn; + if (!dev || !eeh_subsystem_enabled) + return; + + /* Unregister the device with the EEH/PCI address search system */ +#ifdef DEBUG + printk(KERN_DEBUG "EEH: remove device %s\n", pci_name(dev)); +#endif + pci_addr_cache_remove_device(dev); + + dn = pci_device_to_OF_node(dev); + PCI_DN(dn)->pcidev = NULL; + pci_dev_put (dev); +} +EXPORT_SYMBOL_GPL(eeh_remove_device); + +static int proc_eeh_show(struct seq_file *m, void *v) +{ + unsigned int cpu; + unsigned long ffs = 0, positives = 0, failures = 0; + unsigned long resets = 0; + unsigned long no_dev = 0, no_dn = 0, no_cfg = 0, no_check = 0; + + for_each_cpu(cpu) { + ffs += per_cpu(total_mmio_ffs, cpu); + positives += per_cpu(false_positives, cpu); + failures += per_cpu(ignored_failures, cpu); + resets += per_cpu(slot_resets, cpu); + no_dev += per_cpu(no_device, cpu); + no_dn += per_cpu(no_dn, cpu); + no_cfg += per_cpu(no_cfg_addr, cpu); + no_check += per_cpu(ignored_check, cpu); + } + + if (0 == eeh_subsystem_enabled) { + seq_printf(m, "EEH Subsystem is globally disabled\n"); + seq_printf(m, "eeh_total_mmio_ffs=%ld\n", ffs); + } else { + seq_printf(m, "EEH Subsystem is enabled\n"); + seq_printf(m, + "no device=%ld\n" + "no device node=%ld\n" + "no config address=%ld\n" + "check not wanted=%ld\n" + "eeh_total_mmio_ffs=%ld\n" + "eeh_false_positives=%ld\n" + "eeh_ignored_failures=%ld\n" + "eeh_slot_resets=%ld\n", + no_dev, no_dn, no_cfg, no_check, + ffs, positives, failures, resets); + } + + return 0; +} + +static int proc_eeh_open(struct inode *inode, struct file *file) +{ + return single_open(file, proc_eeh_show, NULL); +} + +static struct file_operations proc_eeh_operations = { + .open = proc_eeh_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static int __init eeh_init_proc(void) +{ + struct proc_dir_entry *e; + + if (systemcfg->platform & PLATFORM_PSERIES) { + e = create_proc_entry("ppc64/eeh", 0, NULL); + if (e) + e->proc_fops = &proc_eeh_operations; + } + + return 0; +} +__initcall(eeh_init_proc); From jesse.brandeburg at gmail.com Fri Nov 4 12:34:53 2005 From: jesse.brandeburg at gmail.com (Jesse Brandeburg) Date: Thu, 3 Nov 2005 17:34:53 -0800 Subject: [PATCH 29/42]: ethernet: add PCI error recovery to e100 dev driver In-Reply-To: <20051104005353.GA27074@mail.gnucash.org> References: <20051103235918.GA25616@mail.gnucash.org> <20051104005353.GA27074@mail.gnucash.org> Message-ID: <4807377b0511031734gfc23c5fm31050bc8ee47c0c5@mail.gmail.com> On 11/3/05, Linas Vepstas wrote: > Various PCI bus errors can be signaled by newer PCI controllers. This > patch adds the PCI error recovery callbacks to the intel ethernet e100 > device driver. The patch has been tested, and appears to work well. > > Signed-off-by: Linas Vepstas > > -- > Index: linux-2.6.14-git3/drivers/net/e100.c I think these patches will be great, on the pseries, but is there not a compile option that should compile out all this code, i.e. #ifdef PCI_ERROR_RECOVERY if the arch doesn't support it? Jesse From jesse.brandeburg at gmail.com Fri Nov 4 12:51:22 2005 From: jesse.brandeburg at gmail.com (Jesse Brandeburg) Date: Thu, 3 Nov 2005 17:51:22 -0800 Subject: [PATCH 29/42]: ethernet: add PCI error recovery to e100 dev driver In-Reply-To: <4807377b0511031734gfc23c5fm31050bc8ee47c0c5@mail.gmail.com> References: <20051103235918.GA25616@mail.gnucash.org> <20051104005353.GA27074@mail.gnucash.org> <4807377b0511031734gfc23c5fm31050bc8ee47c0c5@mail.gmail.com> Message-ID: <4807377b0511031751r22301fd8q4397dd504a32fa39@mail.gmail.com> On 11/3/05, Jesse Brandeburg wrote: > I think these patches will be great, on the pseries, but > is there not a compile option that should compile out all this code, i.e. > #ifdef PCI_ERROR_RECOVERY > > if the arch doesn't support it? Uh, i just saw patch 32, never mind. From david at gibson.dropbear.id.au Fri Nov 4 14:16:10 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Fri, 4 Nov 2005 14:16:10 +1100 Subject: powerpc: Consolidate asm compatibility macros Message-ID: <20051104031609.GA962@localhost.localdomain> Paulus, not entirely sure if this is close to what you had in mind for consolidating the asm macro stuff. Apply if you like... This patch consolidates macros used to generate assembly for compatibility across different CPUs or configs. A new header, asm-powerpc/asm-compat.h contains the main compatibility macros. It uses some preprocessor magic to make the macros suitable both for use in .S files, and in inline asm in .c files. Headers (bitops.h, uaccess.h, atomic.h, bug.h) which had their own such compatibility macros are changed to use asm-compat.h. ppc_asm.h is now for use in .S files *only*, and a #error enforces that. As such, we're a lot more careless about namespace pollution here than in asm-compat.h. While we're at it, this patch adds a call to the PPC405_ERR77 macro in futex.h which should have had it already, but didn't. Built and booted on pSeries, Maple and iSeries (ARCH=powerpc). Built for 32-bit powermac (ARCH=powerpc) and Walnut (ARCH=ppc). Signed-off-by: David Gibson Index: working-2.6/include/asm-powerpc/ppc_asm.h =================================================================== --- working-2.6.orig/include/asm-powerpc/ppc_asm.h 2005-11-03 16:26:58.000000000 +1100 +++ working-2.6/include/asm-powerpc/ppc_asm.h 2005-11-04 14:04:05.000000000 +1100 @@ -6,8 +6,13 @@ #include #include +#include -#ifdef __ASSEMBLY__ +#ifndef __ASSEMBLY__ +#error __FILE__ should only be used in assembler files +#else + +#define SZL (BITS_PER_LONG/8) /* * Macros for storing registers into and loading registers from @@ -184,12 +189,6 @@ oris reg,reg,(label)@h; \ ori reg,reg,(label)@l; -/* operations for longs and pointers */ -#define LDL ld -#define STL std -#define CMPI cmpdi -#define SZL 8 - /* offsets for stack frame layout */ #define LRSAVE 16 @@ -203,12 +202,6 @@ #define OFF(name) name at l -/* operations for longs and pointers */ -#define LDL lwz -#define STL stw -#define CMPI cmpwi -#define SZL 4 - /* offsets for stack frame layout */ #define LRSAVE 4 @@ -266,15 +259,6 @@ #endif -#ifdef CONFIG_IBM405_ERR77 -#define PPC405_ERR77(ra,rb) dcbt ra, rb; -#define PPC405_ERR77_SYNC sync; -#else -#define PPC405_ERR77(ra,rb) -#define PPC405_ERR77_SYNC -#endif - - #ifdef CONFIG_IBM440EP_ERR42 #define PPC440EP_ERR42 isync #else @@ -502,17 +486,6 @@ #define N_SLINE 68 #define N_SO 100 -#define ASM_CONST(x) x -#else - #define __ASM_CONST(x) x##UL - #define ASM_CONST(x) __ASM_CONST(x) - -#ifdef CONFIG_PPC64 -#define DATAL ".llong" -#else -#define DATAL ".long" -#endif - #endif /* __ASSEMBLY__ */ #endif /* _ASM_POWERPC_PPC_ASM_H */ Index: working-2.6/arch/powerpc/kernel/fpu.S =================================================================== --- working-2.6.orig/arch/powerpc/kernel/fpu.S 2005-10-31 15:20:20.000000000 +1100 +++ working-2.6/arch/powerpc/kernel/fpu.S 2005-11-04 14:04:05.000000000 +1100 @@ -41,7 +41,7 @@ #ifndef CONFIG_SMP LOADBASE(r3, last_task_used_math) toreal(r3) - LDL r4,OFF(last_task_used_math)(r3) + PPC_LD r4,OFF(last_task_used_math)(r3) CMPI 0,r4,0 beq 1f toreal(r4) @@ -49,12 +49,12 @@ SAVE_32FPRS(0, r4) mffs fr0 stfd fr0,THREAD_FPSCR(r4) - LDL r5,PT_REGS(r4) + PPC_LD r5,PT_REGS(r4) toreal(r5) - LDL r4,_MSR-STACK_FRAME_OVERHEAD(r5) + PPC_LD r4,_MSR-STACK_FRAME_OVERHEAD(r5) li r10,MSR_FP|MSR_FE0|MSR_FE1 andc r4,r4,r10 /* disable FP for previous task */ - STL r4,_MSR-STACK_FRAME_OVERHEAD(r5) + PPC_ST r4,_MSR-STACK_FRAME_OVERHEAD(r5) 1: #endif /* CONFIG_SMP */ /* enable use of FP after return */ @@ -77,7 +77,7 @@ #ifndef CONFIG_SMP subi r4,r5,THREAD fromreal(r4) - STL r4,OFF(last_task_used_math)(r3) + PPC_ST r4,OFF(last_task_used_math)(r3) #endif /* CONFIG_SMP */ /* restore registers and return */ /* we haven't used ctr or xer or lr */ @@ -100,21 +100,21 @@ CMPI 0,r3,0 beqlr- /* if no previous owner, done */ addi r3,r3,THREAD /* want THREAD of task */ - LDL r5,PT_REGS(r3) + PPC_LD r5,PT_REGS(r3) CMPI 0,r5,0 SAVE_32FPRS(0, r3) mffs fr0 stfd fr0,THREAD_FPSCR(r3) beq 1f - LDL r4,_MSR-STACK_FRAME_OVERHEAD(r5) + PPC_LD r4,_MSR-STACK_FRAME_OVERHEAD(r5) li r3,MSR_FP|MSR_FE0|MSR_FE1 andc r4,r4,r3 /* disable FP for previous task */ - STL r4,_MSR-STACK_FRAME_OVERHEAD(r5) + PPC_ST r4,_MSR-STACK_FRAME_OVERHEAD(r5) 1: #ifndef CONFIG_SMP li r5,0 LOADBASE(r4,last_task_used_math) - STL r5,OFF(last_task_used_math)(r4) + PPC_ST r5,OFF(last_task_used_math)(r4) #endif /* CONFIG_SMP */ blr Index: working-2.6/include/asm-powerpc/bitops.h =================================================================== --- working-2.6.orig/include/asm-powerpc/bitops.h 2005-11-03 16:26:58.000000000 +1100 +++ working-2.6/include/asm-powerpc/bitops.h 2005-11-04 14:04:05.000000000 +1100 @@ -40,6 +40,7 @@ #include #include +#include #include /* @@ -52,16 +53,6 @@ #define BITOP_WORD(nr) ((nr) / BITS_PER_LONG) #define BITOP_LE_SWIZZLE ((BITS_PER_LONG-1) & ~0x7) -#ifdef CONFIG_PPC64 -#define LARXL "ldarx" -#define STCXL "stdcx." -#define CNTLZL "cntlzd" -#else -#define LARXL "lwarx" -#define STCXL "stwcx." -#define CNTLZL "cntlzw" -#endif - static __inline__ void set_bit(int nr, volatile unsigned long *addr) { unsigned long old; @@ -69,10 +60,10 @@ unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr); __asm__ __volatile__( -"1:" LARXL " %0,0,%3 # set_bit\n" +"1:" PPC_LARX "%0,0,%3 # set_bit\n" "or %0,%0,%2\n" PPC405_ERR77(0,%3) - STCXL " %0,0,%3\n" + PPC_STCX "%0,0,%3\n" "bne- 1b" : "=&r"(old), "=m"(*p) : "r"(mask), "r"(p), "m"(*p) @@ -86,10 +77,10 @@ unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr); __asm__ __volatile__( -"1:" LARXL " %0,0,%3 # set_bit\n" +"1:" PPC_LARX "%0,0,%3 # clear_bit\n" "andc %0,%0,%2\n" PPC405_ERR77(0,%3) - STCXL " %0,0,%3\n" + PPC_STCX "%0,0,%3\n" "bne- 1b" : "=&r"(old), "=m"(*p) : "r"(mask), "r"(p), "m"(*p) @@ -103,10 +94,10 @@ unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr); __asm__ __volatile__( -"1:" LARXL " %0,0,%3 # set_bit\n" +"1:" PPC_LARX "%0,0,%3 # change_bit\n" "xor %0,%0,%2\n" PPC405_ERR77(0,%3) - STCXL " %0,0,%3\n" + PPC_STCX "%0,0,%3\n" "bne- 1b" : "=&r"(old), "=m"(*p) : "r"(mask), "r"(p), "m"(*p) @@ -122,10 +113,10 @@ __asm__ __volatile__( EIEIO_ON_SMP -"1:" LARXL " %0,0,%3 # test_and_set_bit\n" +"1:" PPC_LARX "%0,0,%3 # test_and_set_bit\n" "or %1,%0,%2 \n" PPC405_ERR77(0,%3) - STCXL " %1,0,%3 \n" + PPC_STCX "%1,0,%3 \n" "bne- 1b" ISYNC_ON_SMP : "=&r" (old), "=&r" (t) @@ -144,10 +135,10 @@ __asm__ __volatile__( EIEIO_ON_SMP -"1:" LARXL " %0,0,%3 # test_and_clear_bit\n" +"1:" PPC_LARX "%0,0,%3 # test_and_clear_bit\n" "andc %1,%0,%2 \n" PPC405_ERR77(0,%3) - STCXL " %1,0,%3 \n" + PPC_STCX "%1,0,%3 \n" "bne- 1b" ISYNC_ON_SMP : "=&r" (old), "=&r" (t) @@ -166,10 +157,10 @@ __asm__ __volatile__( EIEIO_ON_SMP -"1:" LARXL " %0,0,%3 # test_and_change_bit\n" +"1:" PPC_LARX "%0,0,%3 # test_and_change_bit\n" "xor %1,%0,%2 \n" PPC405_ERR77(0,%3) - STCXL " %1,0,%3 \n" + PPC_STCX "%1,0,%3 \n" "bne- 1b" ISYNC_ON_SMP : "=&r" (old), "=&r" (t) @@ -184,9 +175,9 @@ unsigned long old; __asm__ __volatile__( -"1:" LARXL " %0,0,%3 # set_bit\n" +"1:" PPC_LARX "%0,0,%3 # set_bits\n" "or %0,%0,%2\n" - STCXL " %0,0,%3\n" + PPC_STCX "%0,0,%3\n" "bne- 1b" : "=&r" (old), "=m" (*addr) : "r" (mask), "r" (addr), "m" (*addr) @@ -268,7 +259,7 @@ { int lz; - asm (CNTLZL " %0,%1" : "=r" (lz) : "r" (x)); + asm (PPC_CNTLZ "%0,%1" : "=r" (lz) : "r" (x)); return BITS_PER_LONG - 1 - lz; } Index: working-2.6/include/asm-powerpc/bug.h =================================================================== --- working-2.6.orig/include/asm-powerpc/bug.h 2005-11-03 16:26:58.000000000 +1100 +++ working-2.6/include/asm-powerpc/bug.h 2005-11-04 14:04:05.000000000 +1100 @@ -1,6 +1,7 @@ #ifndef _ASM_POWERPC_BUG_H #define _ASM_POWERPC_BUG_H +#include /* * Define an illegal instr to trap on the bug. * We don't use 0 because that marks the end of a function @@ -11,14 +12,6 @@ #ifndef __ASSEMBLY__ -#ifdef __powerpc64__ -#define BUG_TABLE_ENTRY ".llong" -#define BUG_TRAP_OP "tdnei" -#else -#define BUG_TABLE_ENTRY ".long" -#define BUG_TRAP_OP "twnei" -#endif /* __powerpc64__ */ - struct bug_entry { unsigned long bug_addr; long line; @@ -40,16 +33,16 @@ __asm__ __volatile__( \ "1: twi 31,0,0\n" \ ".section __bug_table,\"a\"\n" \ - "\t"BUG_TABLE_ENTRY" 1b,%0,%1,%2\n" \ + "\t"PPC_LONG" 1b,%0,%1,%2\n" \ ".previous" \ : : "i" (__LINE__), "i" (__FILE__), "i" (__FUNCTION__)); \ } while (0) #define BUG_ON(x) do { \ __asm__ __volatile__( \ - "1: "BUG_TRAP_OP" %0,0\n" \ + "1: "PPC_TNEI" %0,0\n" \ ".section __bug_table,\"a\"\n" \ - "\t"BUG_TABLE_ENTRY" 1b,%1,%2,%3\n" \ + "\t"PPC_LONG" 1b,%1,%2,%3\n" \ ".previous" \ : : "r" ((long)(x)), "i" (__LINE__), \ "i" (__FILE__), "i" (__FUNCTION__)); \ @@ -57,9 +50,9 @@ #define WARN_ON(x) do { \ __asm__ __volatile__( \ - "1: "BUG_TRAP_OP" %0,0\n" \ + "1: "PPC_TNEI" %0,0\n" \ ".section __bug_table,\"a\"\n" \ - "\t"BUG_TABLE_ENTRY" 1b,%1,%2,%3\n" \ + "\t"PPC_LONG" 1b,%1,%2,%3\n" \ ".previous" \ : : "r" ((long)(x)), \ "i" (__LINE__ + BUG_WARNING_TRAP), \ Index: working-2.6/include/asm-powerpc/futex.h =================================================================== --- working-2.6.orig/include/asm-powerpc/futex.h 2005-11-03 16:26:58.000000000 +1100 +++ working-2.6/include/asm-powerpc/futex.h 2005-11-04 14:04:05.000000000 +1100 @@ -7,13 +7,14 @@ #include #include #include -#include +#include #define __futex_atomic_op(insn, ret, oldval, uaddr, oparg) \ __asm__ __volatile ( \ SYNC_ON_SMP \ "1: lwarx %0,0,%2\n" \ insn \ + PPC405_ERR77(0, %2) \ "2: stwcx. %1,0,%2\n" \ "bne- 1b\n" \ "li %1,0\n" \ @@ -23,7 +24,7 @@ ".previous\n" \ ".section __ex_table,\"a\"\n" \ ".align 3\n" \ - DATAL " 1b,4b,2b,4b\n" \ + PPC_LONG "1b,4b,2b,4b\n" \ ".previous" \ : "=&r" (oldval), "=&r" (ret) \ : "b" (uaddr), "i" (-EFAULT), "1" (oparg) \ Index: working-2.6/include/asm-powerpc/cputable.h =================================================================== --- working-2.6.orig/include/asm-powerpc/cputable.h 2005-10-31 15:20:22.000000000 +1100 +++ working-2.6/include/asm-powerpc/cputable.h 2005-11-04 14:04:05.000000000 +1100 @@ -2,7 +2,7 @@ #define __ASM_POWERPC_CPUTABLE_H #include -#include /* for ASM_CONST */ +#include #define PPC_FEATURE_32 0x80000000 #define PPC_FEATURE_64 0x40000000 Index: working-2.6/include/asm-ppc64/mmu.h =================================================================== --- working-2.6.orig/include/asm-ppc64/mmu.h 2005-10-31 15:20:22.000000000 +1100 +++ working-2.6/include/asm-ppc64/mmu.h 2005-11-04 14:04:05.000000000 +1100 @@ -14,7 +14,7 @@ #define _PPC64_MMU_H_ #include -#include /* for ASM_CONST */ +#include #include /* Index: working-2.6/include/asm-ppc64/page.h =================================================================== --- working-2.6.orig/include/asm-ppc64/page.h 2005-10-31 15:20:22.000000000 +1100 +++ working-2.6/include/asm-ppc64/page.h 2005-11-04 14:04:05.000000000 +1100 @@ -11,7 +11,7 @@ */ #include -#include /* for ASM_CONST */ +#include /* PAGE_SHIFT determines the page size */ #define PAGE_SHIFT 12 Index: working-2.6/include/asm-powerpc/asm-compat.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ working-2.6/include/asm-powerpc/asm-compat.h 2005-11-04 14:04:05.000000000 +1100 @@ -0,0 +1,55 @@ +#ifndef _ASM_POWERPC_ASM_COMPAT_H +#define _ASM_POWERPC_ASM_COMPAT_H + +#include +#include + +#ifdef __ASSEMBLY__ +# define stringify_in_c(...) __VA_ARGS__ +# define ASM_CONST(x) x +#else +/* This version of stringify will deal with commas... */ +# define __stringify_in_c(...) #__VA_ARGS__ +# define stringify_in_c(...) __stringify_in_c(__VA_ARGS__) " " +# define __ASM_CONST(x) x##UL +# define ASM_CONST(x) __ASM_CONST(x) +#endif + +#ifdef __powerpc64__ + +/* operations for longs and pointers */ +#define PPC_LD stringify_in_c(ld) +#define PPC_ST stringify_in_c(std) +#define PPC_CMPI stringify_in_c(cmpdi) +#define PPC_LONG stringify_in_c(.llong) +#define PPC_TNEI stringify_in_c(tdnei) +#define PPC_LARX stringify_in_c(ldarx) +#define PPC_STCX stringify_in_c(stdcx.) +#define PPC_CNTLZ stringify_in_c(cntlzd) + +#else /* 32-bit */ + +/* operations for longs and pointers */ +#define PPC_LD stringify_in_c(lwz) +#define PPC_ST stringify_in_c(stw) +#define PPC_CMPI stringify_in_c(cmpwi) +#define PPC_LONG stringify_in_c(.long) +#define PPC_TNEI stringify_in_c(twnei) +#define PPC_LARX stringify_in_c(lwarx) +#define PPC_STCX stringify_in_c(stwcx.) +#define PPC_CNTLZ stringify_in_c(cntlzw) + +#endif + +#ifdef CONFIG_IBM405_ERR77 +/* Erratum #77 on the 405 means we need a sync or dcbt before every + * stwcx. The old ATOMIC_SYNC_FIX covered some but not all of this. + */ +#define PPC405_ERR77(ra,rb) stringify_in_c(dcbt ra, rb;) +#define PPC405_ERR77_SYNC stringify_in_c(sync;) +#else +#define PPC405_ERR77(ra,rb) +#define PPC405_ERR77_SYNC +#endif + +#endif /* _ASM_POWERPC_ASM_COMPAT_H */ Index: working-2.6/arch/powerpc/xmon/setjmp.S =================================================================== --- working-2.6.orig/arch/powerpc/xmon/setjmp.S 2005-10-31 15:20:57.000000000 +1100 +++ working-2.6/arch/powerpc/xmon/setjmp.S 2005-11-04 14:04:05.000000000 +1100 @@ -14,61 +14,61 @@ _GLOBAL(xmon_setjmp) mflr r0 - STL r0,0(r3) - STL r1,SZL(r3) - STL r2,2*SZL(r3) + PPC_ST r0,0(r3) + PPC_ST r1,SZL(r3) + PPC_ST r2,2*SZL(r3) mfcr r0 - STL r0,3*SZL(r3) - STL r13,4*SZL(r3) - STL r14,5*SZL(r3) - STL r15,6*SZL(r3) - STL r16,7*SZL(r3) - STL r17,8*SZL(r3) - STL r18,9*SZL(r3) - STL r19,10*SZL(r3) - STL r20,11*SZL(r3) - STL r21,12*SZL(r3) - STL r22,13*SZL(r3) - STL r23,14*SZL(r3) - STL r24,15*SZL(r3) - STL r25,16*SZL(r3) - STL r26,17*SZL(r3) - STL r27,18*SZL(r3) - STL r28,19*SZL(r3) - STL r29,20*SZL(r3) - STL r30,21*SZL(r3) - STL r31,22*SZL(r3) + PPC_ST r0,3*SZL(r3) + PPC_ST r13,4*SZL(r3) + PPC_ST r14,5*SZL(r3) + PPC_ST r15,6*SZL(r3) + PPC_ST r16,7*SZL(r3) + PPC_ST r17,8*SZL(r3) + PPC_ST r18,9*SZL(r3) + PPC_ST r19,10*SZL(r3) + PPC_ST r20,11*SZL(r3) + PPC_ST r21,12*SZL(r3) + PPC_ST r22,13*SZL(r3) + PPC_ST r23,14*SZL(r3) + PPC_ST r24,15*SZL(r3) + PPC_ST r25,16*SZL(r3) + PPC_ST r26,17*SZL(r3) + PPC_ST r27,18*SZL(r3) + PPC_ST r28,19*SZL(r3) + PPC_ST r29,20*SZL(r3) + PPC_ST r30,21*SZL(r3) + PPC_ST r31,22*SZL(r3) li r3,0 blr _GLOBAL(xmon_longjmp) - CMPI r4,0 + PPC_CMPI r4,0 bne 1f li r4,1 -1: LDL r13,4*SZL(r3) - LDL r14,5*SZL(r3) - LDL r15,6*SZL(r3) - LDL r16,7*SZL(r3) - LDL r17,8*SZL(r3) - LDL r18,9*SZL(r3) - LDL r19,10*SZL(r3) - LDL r20,11*SZL(r3) - LDL r21,12*SZL(r3) - LDL r22,13*SZL(r3) - LDL r23,14*SZL(r3) - LDL r24,15*SZL(r3) - LDL r25,16*SZL(r3) - LDL r26,17*SZL(r3) - LDL r27,18*SZL(r3) - LDL r28,19*SZL(r3) - LDL r29,20*SZL(r3) - LDL r30,21*SZL(r3) - LDL r31,22*SZL(r3) - LDL r0,3*SZL(r3) +1: PPC_LD r13,4*SZL(r3) + PPC_LD r14,5*SZL(r3) + PPC_LD r15,6*SZL(r3) + PPC_LD r16,7*SZL(r3) + PPC_LD r17,8*SZL(r3) + PPC_LD r18,9*SZL(r3) + PPC_LD r19,10*SZL(r3) + PPC_LD r20,11*SZL(r3) + PPC_LD r21,12*SZL(r3) + PPC_LD r22,13*SZL(r3) + PPC_LD r23,14*SZL(r3) + PPC_LD r24,15*SZL(r3) + PPC_LD r25,16*SZL(r3) + PPC_LD r26,17*SZL(r3) + PPC_LD r27,18*SZL(r3) + PPC_LD r28,19*SZL(r3) + PPC_LD r29,20*SZL(r3) + PPC_LD r30,21*SZL(r3) + PPC_LD r31,22*SZL(r3) + PPC_LD r0,3*SZL(r3) mtcrf 0x38,r0 - LDL r0,0(r3) - LDL r1,SZL(r3) - LDL r2,2*SZL(r3) + PPC_LD r0,0(r3) + PPC_LD r1,SZL(r3) + PPC_LD r2,2*SZL(r3) mtlr r0 mr r3,r4 blr @@ -84,52 +84,52 @@ * different ABIs, though). */ _GLOBAL(xmon_save_regs) - STL r0,0*SZL(r3) - STL r2,2*SZL(r3) - STL r3,3*SZL(r3) - STL r4,4*SZL(r3) - STL r5,5*SZL(r3) - STL r6,6*SZL(r3) - STL r7,7*SZL(r3) - STL r8,8*SZL(r3) - STL r9,9*SZL(r3) - STL r10,10*SZL(r3) - STL r11,11*SZL(r3) - STL r12,12*SZL(r3) - STL r13,13*SZL(r3) - STL r14,14*SZL(r3) - STL r15,15*SZL(r3) - STL r16,16*SZL(r3) - STL r17,17*SZL(r3) - STL r18,18*SZL(r3) - STL r19,19*SZL(r3) - STL r20,20*SZL(r3) - STL r21,21*SZL(r3) - STL r22,22*SZL(r3) - STL r23,23*SZL(r3) - STL r24,24*SZL(r3) - STL r25,25*SZL(r3) - STL r26,26*SZL(r3) - STL r27,27*SZL(r3) - STL r28,28*SZL(r3) - STL r29,29*SZL(r3) - STL r30,30*SZL(r3) - STL r31,31*SZL(r3) + PPC_ST r0,0*SZL(r3) + PPC_ST r2,2*SZL(r3) + PPC_ST r3,3*SZL(r3) + PPC_ST r4,4*SZL(r3) + PPC_ST r5,5*SZL(r3) + PPC_ST r6,6*SZL(r3) + PPC_ST r7,7*SZL(r3) + PPC_ST r8,8*SZL(r3) + PPC_ST r9,9*SZL(r3) + PPC_ST r10,10*SZL(r3) + PPC_ST r11,11*SZL(r3) + PPC_ST r12,12*SZL(r3) + PPC_ST r13,13*SZL(r3) + PPC_ST r14,14*SZL(r3) + PPC_ST r15,15*SZL(r3) + PPC_ST r16,16*SZL(r3) + PPC_ST r17,17*SZL(r3) + PPC_ST r18,18*SZL(r3) + PPC_ST r19,19*SZL(r3) + PPC_ST r20,20*SZL(r3) + PPC_ST r21,21*SZL(r3) + PPC_ST r22,22*SZL(r3) + PPC_ST r23,23*SZL(r3) + PPC_ST r24,24*SZL(r3) + PPC_ST r25,25*SZL(r3) + PPC_ST r26,26*SZL(r3) + PPC_ST r27,27*SZL(r3) + PPC_ST r28,28*SZL(r3) + PPC_ST r29,29*SZL(r3) + PPC_ST r30,30*SZL(r3) + PPC_ST r31,31*SZL(r3) /* go up one stack frame for SP */ - LDL r4,0(r1) - STL r4,1*SZL(r3) + PPC_LD r4,0(r1) + PPC_ST r4,1*SZL(r3) /* get caller's LR */ - LDL r0,LRSAVE(r4) - STL r0,_NIP-STACK_FRAME_OVERHEAD(r3) - STL r0,_LINK-STACK_FRAME_OVERHEAD(r3) + PPC_LD r0,LRSAVE(r4) + PPC_ST r0,_NIP-STACK_FRAME_OVERHEAD(r3) + PPC_ST r0,_LINK-STACK_FRAME_OVERHEAD(r3) mfmsr r0 - STL r0,_MSR-STACK_FRAME_OVERHEAD(r3) + PPC_ST r0,_MSR-STACK_FRAME_OVERHEAD(r3) mfctr r0 - STL r0,_CTR-STACK_FRAME_OVERHEAD(r3) + PPC_ST r0,_CTR-STACK_FRAME_OVERHEAD(r3) mfxer r0 - STL r0,_XER-STACK_FRAME_OVERHEAD(r3) + PPC_ST r0,_XER-STACK_FRAME_OVERHEAD(r3) mfcr r0 - STL r0,_CCR-STACK_FRAME_OVERHEAD(r3) + PPC_ST r0,_CCR-STACK_FRAME_OVERHEAD(r3) li r0,0 - STL r0,_TRAP-STACK_FRAME_OVERHEAD(r3) + PPC_ST r0,_TRAP-STACK_FRAME_OVERHEAD(r3) blr Index: working-2.6/include/asm-powerpc/system.h =================================================================== --- working-2.6.orig/include/asm-powerpc/system.h 2005-10-31 15:45:01.000000000 +1100 +++ working-2.6/include/asm-powerpc/system.h 2005-11-04 14:04:05.000000000 +1100 @@ -8,7 +8,6 @@ #include #include -#include #include /* Index: working-2.6/include/asm-powerpc/atomic.h =================================================================== --- working-2.6.orig/include/asm-powerpc/atomic.h 2005-10-31 15:20:22.000000000 +1100 +++ working-2.6/include/asm-powerpc/atomic.h 2005-11-04 14:04:05.000000000 +1100 @@ -9,21 +9,13 @@ #ifdef __KERNEL__ #include +#include #define ATOMIC_INIT(i) { (i) } #define atomic_read(v) ((v)->counter) #define atomic_set(v,i) (((v)->counter) = (i)) -/* Erratum #77 on the 405 means we need a sync or dcbt before every stwcx. - * The old ATOMIC_SYNC_FIX covered some but not all of this. - */ -#ifdef CONFIG_IBM405_ERR77 -#define PPC405_ERR77(ra,rb) "dcbt " #ra "," #rb ";" -#else -#define PPC405_ERR77(ra,rb) -#endif - static __inline__ void atomic_add(int a, atomic_t *v) { int t; Index: working-2.6/include/asm-powerpc/uaccess.h =================================================================== --- working-2.6.orig/include/asm-powerpc/uaccess.h 2005-11-03 16:26:58.000000000 +1100 +++ working-2.6/include/asm-powerpc/uaccess.h 2005-11-04 14:04:05.000000000 +1100 @@ -120,14 +120,6 @@ extern long __put_user_bad(void); -#ifdef __powerpc64__ -#define __EX_TABLE_ALIGN "3" -#define __EX_TABLE_TYPE "llong" -#else -#define __EX_TABLE_ALIGN "2" -#define __EX_TABLE_TYPE "long" -#endif - /* * We don't tell gcc that we are accessing memory, but this is OK * because we do not write to any memory gcc knows about, so there @@ -142,11 +134,12 @@ " b 2b\n" \ ".previous\n" \ ".section __ex_table,\"a\"\n" \ - " .align " __EX_TABLE_ALIGN "\n" \ - " ."__EX_TABLE_TYPE" 1b,3b\n" \ + " .balign %5\n" \ + PPC_LONG "1b,3b\n" \ ".previous" \ : "=r" (err) \ - : "r" (x), "b" (addr), "i" (-EFAULT), "0" (err)) + : "r" (x), "b" (addr), "i" (-EFAULT), "0" (err),\ + "i"(sizeof(unsigned long))) #ifdef __powerpc64__ #define __put_user_asm2(x, ptr, retval) \ @@ -162,12 +155,13 @@ " b 3b\n" \ ".previous\n" \ ".section __ex_table,\"a\"\n" \ - " .align " __EX_TABLE_ALIGN "\n" \ - " ." __EX_TABLE_TYPE " 1b,4b\n" \ - " ." __EX_TABLE_TYPE " 2b,4b\n" \ + " .balign %5\n" \ + PPC_LONG "1b,4b\n" \ + PPC_LONG "2b,4b\n" \ ".previous" \ : "=r" (err) \ - : "r" (x), "b" (addr), "i" (-EFAULT), "0" (err)) + : "r" (x), "b" (addr), "i" (-EFAULT), "0" (err),\ + "i"(sizeof(unsigned long))) #endif /* __powerpc64__ */ #define __put_user_size(x, ptr, size, retval) \ @@ -213,11 +207,12 @@ " b 2b\n" \ ".previous\n" \ ".section __ex_table,\"a\"\n" \ - " .align "__EX_TABLE_ALIGN "\n" \ - " ." __EX_TABLE_TYPE " 1b,3b\n" \ + " .balign %5\n" \ + PPC_LONG "1b,3b\n" \ ".previous" \ : "=r" (err), "=r" (x) \ - : "b" (addr), "i" (-EFAULT), "0" (err)) + : "b" (addr), "i" (-EFAULT), "0" (err), \ + "i"(sizeof(unsigned long))) #ifdef __powerpc64__ #define __get_user_asm2(x, addr, err) \ @@ -235,12 +230,13 @@ " b 3b\n" \ ".previous\n" \ ".section __ex_table,\"a\"\n" \ - " .align " __EX_TABLE_ALIGN "\n" \ - " ." __EX_TABLE_TYPE " 1b,4b\n" \ - " ." __EX_TABLE_TYPE " 2b,4b\n" \ + " .balign %5\n" \ + PPC_LONG "1b,4b\n" \ + PPC_LONG "2b,4b\n" \ ".previous" \ : "=r" (err), "=&r" (x) \ - : "b" (addr), "i" (-EFAULT), "0" (err)) + : "b" (addr), "i" (-EFAULT), "0" (err), \ + "i"(sizeof(unsigned long))) #endif /* __powerpc64__ */ #define __get_user_size(x, ptr, size, retval) \ Index: working-2.6/arch/powerpc/platforms/iseries/misc.S =================================================================== --- working-2.6.orig/arch/powerpc/platforms/iseries/misc.S 2005-10-31 15:20:20.000000000 +1100 +++ working-2.6/arch/powerpc/platforms/iseries/misc.S 2005-11-04 14:04:05.000000000 +1100 @@ -15,6 +15,7 @@ #include #include +#include .text Index: working-2.6/arch/ppc/boot/openfirmware/Makefile =================================================================== --- working-2.6.orig/arch/ppc/boot/openfirmware/Makefile 2005-10-25 11:59:53.000000000 +1000 +++ working-2.6/arch/ppc/boot/openfirmware/Makefile 2005-11-04 14:04:05.000000000 +1100 @@ -80,8 +80,7 @@ $(call if_changed,mknote) -$(obj)/coffcrt0.o: EXTRA_AFLAGS := -traditional -DXCOFF -$(obj)/crt0.o: EXTRA_AFLAGS := -traditional +$(obj)/coffcrt0.o: EXTRA_AFLAGS := -DXCOFF targets += coffcrt0.o crt0.o $(obj)/coffcrt0.o $(obj)/crt0.o: $(common)/crt0.S FORCE $(call if_changed_dep,as_o_S) -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson From geoffrey.levand at am.sony.com Fri Nov 4 14:55:19 2005 From: geoffrey.levand at am.sony.com (Geoff Levand) Date: Thu, 03 Nov 2005 19:55:19 -0800 Subject: pci_resource_end() changed problem with 2.6.14 Message-ID: <436ADBA7.7030706@am.sony.com> I found that the serial port probe code in drivers/serial/8250_pci.c no longer works properly for ppc64 in 2.6.14. It seems the value returned by pci_resource_len() on ppc64 changed from 8 to 16 since 2.6.13. I tested on a PC and pci_resource_len() returns 8 as expected. Any help on on where to look for the problem would be appreciated. Here's the code that hits the problem: if (pci_resource_flags(dev, i) & IORESOURCE_IO && pci_resource_len(dev, i) == 8 && And here are some test results: 2.6.13-ppc64 --serial_pci_guess_board flags: 101h, start: 80, end: 87, len: 8 --serial_pci_guess_board found --serial_pci_guess_board flags: 101h, start: 64, end: 71, len: 8 --serial_pci_guess_board found 2.6.14-ppc64 --serial_pci_guess_board flags: 101h, start: 80, end: 95, len: 16 --serial_pci_guess_board not found --serial_pci_guess_board flags: 101h, start: 64, end: 79, len: 16 --serial_pci_guess_board not found 2.6.14-i386: --serial_pci_guess_board flags: 101h, start: 48128, end: 48135, len: 8 --serial_pci_guess_board found --serial_pci_guess_board flags: 101h, start: 46080, end: 46087, len: 8 --serial_pci_guess_board found -Geoff From mikey at neuling.org Fri Nov 4 16:02:11 2005 From: mikey at neuling.org (Michael Neuling) Date: Fri, 4 Nov 2005 16:02:11 +1100 Subject: [PATCH] HVC init race Message-ID: <20051104160211.c66d82f3.mikey@neuling.org> I've been hitting a crash on boot where tty_open is being called before the hvc console driver setup is complete. Below patch fixes this problem. Thanks to benh for his help on this. Mikey Signed-off-by: Michael Neuling drivers/char/hvc_console.c | 32 ++++++++++++++++++-------------- 1 files changed, 18 insertions(+), 14 deletions(-) Index: linux-2.6/drivers/char/hvc_console.c =================================================================== --- linux-2.6.orig/drivers/char/hvc_console.c +++ linux-2.6/drivers/char/hvc_console.c @@ -823,34 +823,38 @@ * interfaces start to become available. */ int __init hvc_init(void) { + struct tty_driver *drv; + /* We need more than hvc_count adapters due to hotplug additions. */ - hvc_driver = alloc_tty_driver(HVC_ALLOC_TTY_ADAPTERS); - if (!hvc_driver) + drv = alloc_tty_driver(HVC_ALLOC_TTY_ADAPTERS); + if (!drv) return -ENOMEM; - hvc_driver->owner = THIS_MODULE; - hvc_driver->devfs_name = "hvc/"; - hvc_driver->driver_name = "hvc"; - hvc_driver->name = "hvc"; - hvc_driver->major = HVC_MAJOR; - hvc_driver->minor_start = HVC_MINOR; - hvc_driver->type = TTY_DRIVER_TYPE_SYSTEM; - hvc_driver->init_termios = tty_std_termios; - hvc_driver->flags = TTY_DRIVER_REAL_RAW; - tty_set_operations(hvc_driver, &hvc_ops); + drv->owner = THIS_MODULE; + drv->devfs_name = "hvc/"; + drv->driver_name = "hvc"; + drv->name = "hvc"; + drv->major = HVC_MAJOR; + drv->minor_start = HVC_MINOR; + drv->type = TTY_DRIVER_TYPE_SYSTEM; + drv->init_termios = tty_std_termios; + drv->flags = TTY_DRIVER_REAL_RAW; + tty_set_operations(drv, &hvc_ops); /* Always start the kthread because there can be hotplug vty adapters * added later. */ hvc_task = kthread_run(khvcd, NULL, "khvcd"); if (IS_ERR(hvc_task)) { panic("Couldn't create kthread for console.\n"); - put_tty_driver(hvc_driver); + put_tty_driver(drv); return -EIO; } - if (tty_register_driver(hvc_driver)) + if (tty_register_driver(drv)) panic("Couldn't register hvc console driver\n"); + mb(); + hvc_driver = drv; return 0; } module_init(hvc_init); From hollis at penguinppc.org Fri Nov 4 16:10:19 2005 From: hollis at penguinppc.org (Hollis Blanchard) Date: Thu, 3 Nov 2005 23:10:19 -0600 Subject: [PATCH] HVC init race In-Reply-To: <20051104160211.c66d82f3.mikey@neuling.org> References: <20051104160211.c66d82f3.mikey@neuling.org> Message-ID: On Nov 3, 2005, at 11:02 PM, Michael Neuling wrote: > I've been hitting a crash on boot where tty_open is being called > before the > hvc console driver setup is complete. Below patch fixes this problem. What is the race exactly? I guess nothing should be calling into hvc_open before tty_register_driver()...? -Hollis > drivers/char/hvc_console.c | 32 ++++++++++++++++++-------------- > 1 files changed, 18 insertions(+), 14 deletions(-) > > Index: linux-2.6/drivers/char/hvc_console.c > =================================================================== > --- linux-2.6.orig/drivers/char/hvc_console.c > +++ linux-2.6/drivers/char/hvc_console.c > @@ -823,34 +823,38 @@ > * interfaces start to become available. */ > int __init hvc_init(void) > { > + struct tty_driver *drv; > + > /* We need more than hvc_count adapters due to hotplug additions. */ > - hvc_driver = alloc_tty_driver(HVC_ALLOC_TTY_ADAPTERS); > - if (!hvc_driver) > + drv = alloc_tty_driver(HVC_ALLOC_TTY_ADAPTERS); > + if (!drv) > return -ENOMEM; > > - hvc_driver->owner = THIS_MODULE; > - hvc_driver->devfs_name = "hvc/"; > - hvc_driver->driver_name = "hvc"; > - hvc_driver->name = "hvc"; > - hvc_driver->major = HVC_MAJOR; > - hvc_driver->minor_start = HVC_MINOR; > - hvc_driver->type = TTY_DRIVER_TYPE_SYSTEM; > - hvc_driver->init_termios = tty_std_termios; > - hvc_driver->flags = TTY_DRIVER_REAL_RAW; > - tty_set_operations(hvc_driver, &hvc_ops); > + drv->owner = THIS_MODULE; > + drv->devfs_name = "hvc/"; > + drv->driver_name = "hvc"; > + drv->name = "hvc"; > + drv->major = HVC_MAJOR; > + drv->minor_start = HVC_MINOR; > + drv->type = TTY_DRIVER_TYPE_SYSTEM; > + drv->init_termios = tty_std_termios; > + drv->flags = TTY_DRIVER_REAL_RAW; > + tty_set_operations(drv, &hvc_ops); > > /* Always start the kthread because there can be hotplug vty adapters > * added later. */ > hvc_task = kthread_run(khvcd, NULL, "khvcd"); > if (IS_ERR(hvc_task)) { > panic("Couldn't create kthread for console.\n"); > - put_tty_driver(hvc_driver); > + put_tty_driver(drv); > return -EIO; > } > > - if (tty_register_driver(hvc_driver)) > + if (tty_register_driver(drv)) > panic("Couldn't register hvc console driver\n"); > > + mb(); > + hvc_driver = drv; > return 0; > } > module_init(hvc_init); > _______________________________________________ > Linuxppc64-dev mailing list > Linuxppc64-dev at ozlabs.org > https://ozlabs.org/mailman/listinfo/linuxppc64-dev > From galak at kernel.crashing.org Fri Nov 4 16:33:14 2005 From: galak at kernel.crashing.org (Kumar Gala) Date: Thu, 3 Nov 2005 23:33:14 -0600 Subject: powerpc: Consolidate asm compatibility macros In-Reply-To: <20051104031609.GA962@localhost.localdomain> References: <20051104031609.GA962@localhost.localdomain> Message-ID: <7487F450-429B-4836-AF05-DD47B02D5BC1@kernel.crashing.org> David, I hate to be anal, but I think keep the 'L' is useful in the macro names. PPC_LD -> PPC_LL I read 'PPC_LD' as either "PPC load" or "PPC load double" never of which is useful. How about "PPC_LL", which I read as "PPC load long". I would propose the following names which at least follow some PPC naming convention: PPC_LL PPC_STL PPC_LLARX PPC_STLCX - kumar On Nov 3, 2005, at 9:16 PM, David Gibson wrote: > Paulus, not entirely sure if this is close to what you had in mind for > consolidating the asm macro stuff. Apply if you like... > > This patch consolidates macros used to generate assembly for > compatibility across different CPUs or configs. A new header, > asm-powerpc/asm-compat.h contains the main compatibility macros. It > uses some preprocessor magic to make the macros suitable both for use > in .S files, and in inline asm in .c files. Headers (bitops.h, > uaccess.h, atomic.h, bug.h) which had their own such compatibility > macros are changed to use asm-compat.h. > > ppc_asm.h is now for use in .S files *only*, and a #error enforces > that. As such, we're a lot more careless about namespace pollution > here than in asm-compat.h. > > While we're at it, this patch adds a call to the PPC405_ERR77 macro in > futex.h which should have had it already, but didn't. > > Built and booted on pSeries, Maple and iSeries (ARCH=powerpc). Built > for 32-bit powermac (ARCH=powerpc) and Walnut (ARCH=ppc). > > Signed-off-by: David Gibson > > Index: working-2.6/include/asm-powerpc/ppc_asm.h > =================================================================== > --- working-2.6.orig/include/asm-powerpc/ppc_asm.h 2005-11-03 > 16:26:58.000000000 +1100 > +++ working-2.6/include/asm-powerpc/ppc_asm.h 2005-11-04 > 14:04:05.000000000 +1100 > @@ -6,8 +6,13 @@ > > #include > #include > +#include > > -#ifdef __ASSEMBLY__ > +#ifndef __ASSEMBLY__ > +#error __FILE__ should only be used in assembler files > +#else > + > +#define SZL (BITS_PER_LONG/8) > > /* > * Macros for storing registers into and loading registers from > @@ -184,12 +189,6 @@ > oris reg,reg,(label)@h; \ > ori reg,reg,(label)@l; > > -/* operations for longs and pointers */ > -#define LDL ld > -#define STL std > -#define CMPI cmpdi > -#define SZL 8 > - > /* offsets for stack frame layout */ > #define LRSAVE 16 > > @@ -203,12 +202,6 @@ > > #define OFF(name) name at l > > -/* operations for longs and pointers */ > -#define LDL lwz > -#define STL stw > -#define CMPI cmpwi > -#define SZL 4 > - > /* offsets for stack frame layout */ > #define LRSAVE 4 > > @@ -266,15 +259,6 @@ > #endif > > > -#ifdef CONFIG_IBM405_ERR77 > -#define PPC405_ERR77(ra,rb) dcbt ra, rb; > -#define PPC405_ERR77_SYNC sync; > -#else > -#define PPC405_ERR77(ra,rb) > -#define PPC405_ERR77_SYNC > -#endif > - > - > #ifdef CONFIG_IBM440EP_ERR42 > #define PPC440EP_ERR42 isync > #else > @@ -502,17 +486,6 @@ > #define N_SLINE 68 > #define N_SO 100 > > -#define ASM_CONST(x) x > -#else > - #define __ASM_CONST(x) x##UL > - #define ASM_CONST(x) __ASM_CONST(x) > - > -#ifdef CONFIG_PPC64 > -#define DATAL ".llong" > -#else > -#define DATAL ".long" > -#endif > - > #endif /* __ASSEMBLY__ */ > > #endif /* _ASM_POWERPC_PPC_ASM_H */ > Index: working-2.6/arch/powerpc/kernel/fpu.S > =================================================================== > --- working-2.6.orig/arch/powerpc/kernel/fpu.S 2005-10-31 > 15:20:20.000000000 +1100 > +++ working-2.6/arch/powerpc/kernel/fpu.S 2005-11-04 > 14:04:05.000000000 +1100 > @@ -41,7 +41,7 @@ > #ifndef CONFIG_SMP > LOADBASE(r3, last_task_used_math) > toreal(r3) > - LDL r4,OFF(last_task_used_math)(r3) > + PPC_LD r4,OFF(last_task_used_math)(r3) > CMPI 0,r4,0 > beq 1f > toreal(r4) > @@ -49,12 +49,12 @@ > SAVE_32FPRS(0, r4) > mffs fr0 > stfd fr0,THREAD_FPSCR(r4) > - LDL r5,PT_REGS(r4) > + PPC_LD r5,PT_REGS(r4) > toreal(r5) > - LDL r4,_MSR-STACK_FRAME_OVERHEAD(r5) > + PPC_LD r4,_MSR-STACK_FRAME_OVERHEAD(r5) > li r10,MSR_FP|MSR_FE0|MSR_FE1 > andc r4,r4,r10 /* disable FP for previous task */ > - STL r4,_MSR-STACK_FRAME_OVERHEAD(r5) > + PPC_ST r4,_MSR-STACK_FRAME_OVERHEAD(r5) > 1: > #endif /* CONFIG_SMP */ > /* enable use of FP after return */ > @@ -77,7 +77,7 @@ > #ifndef CONFIG_SMP > subi r4,r5,THREAD > fromreal(r4) > - STL r4,OFF(last_task_used_math)(r3) > + PPC_ST r4,OFF(last_task_used_math)(r3) > #endif /* CONFIG_SMP */ > /* restore registers and return */ > /* we haven't used ctr or xer or lr */ > @@ -100,21 +100,21 @@ > CMPI 0,r3,0 > beqlr- /* if no previous owner, done */ > addi r3,r3,THREAD /* want THREAD of task */ > - LDL r5,PT_REGS(r3) > + PPC_LD r5,PT_REGS(r3) > CMPI 0,r5,0 > SAVE_32FPRS(0, r3) > mffs fr0 > stfd fr0,THREAD_FPSCR(r3) > beq 1f > - LDL r4,_MSR-STACK_FRAME_OVERHEAD(r5) > + PPC_LD r4,_MSR-STACK_FRAME_OVERHEAD(r5) > li r3,MSR_FP|MSR_FE0|MSR_FE1 > andc r4,r4,r3 /* disable FP for previous task */ > - STL r4,_MSR-STACK_FRAME_OVERHEAD(r5) > + PPC_ST r4,_MSR-STACK_FRAME_OVERHEAD(r5) > 1: > #ifndef CONFIG_SMP > li r5,0 > LOADBASE(r4,last_task_used_math) > - STL r5,OFF(last_task_used_math)(r4) > + PPC_ST r5,OFF(last_task_used_math)(r4) > #endif /* CONFIG_SMP */ > blr > > Index: working-2.6/include/asm-powerpc/bitops.h > =================================================================== > --- working-2.6.orig/include/asm-powerpc/bitops.h 2005-11-03 > 16:26:58.000000000 +1100 > +++ working-2.6/include/asm-powerpc/bitops.h 2005-11-04 > 14:04:05.000000000 +1100 > @@ -40,6 +40,7 @@ > > #include > #include > +#include > #include > > /* > @@ -52,16 +53,6 @@ > #define BITOP_WORD(nr) ((nr) / BITS_PER_LONG) > #define BITOP_LE_SWIZZLE ((BITS_PER_LONG-1) & ~0x7) > > -#ifdef CONFIG_PPC64 > -#define LARXL "ldarx" > -#define STCXL "stdcx." > -#define CNTLZL "cntlzd" > -#else > -#define LARXL "lwarx" > -#define STCXL "stwcx." > -#define CNTLZL "cntlzw" > -#endif > - > static __inline__ void set_bit(int nr, volatile unsigned long *addr) > { > unsigned long old; > @@ -69,10 +60,10 @@ > unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr); > > __asm__ __volatile__( > -"1:" LARXL " %0,0,%3 # set_bit\n" > +"1:" PPC_LARX "%0,0,%3 # set_bit\n" > "or %0,%0,%2\n" > PPC405_ERR77(0,%3) > - STCXL " %0,0,%3\n" > + PPC_STCX "%0,0,%3\n" > "bne- 1b" > : "=&r"(old), "=m"(*p) > : "r"(mask), "r"(p), "m"(*p) > @@ -86,10 +77,10 @@ > unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr); > > __asm__ __volatile__( > -"1:" LARXL " %0,0,%3 # set_bit\n" > +"1:" PPC_LARX "%0,0,%3 # clear_bit\n" > "andc %0,%0,%2\n" > PPC405_ERR77(0,%3) > - STCXL " %0,0,%3\n" > + PPC_STCX "%0,0,%3\n" > "bne- 1b" > : "=&r"(old), "=m"(*p) > : "r"(mask), "r"(p), "m"(*p) > @@ -103,10 +94,10 @@ > unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr); > > __asm__ __volatile__( > -"1:" LARXL " %0,0,%3 # set_bit\n" > +"1:" PPC_LARX "%0,0,%3 # change_bit\n" > "xor %0,%0,%2\n" > PPC405_ERR77(0,%3) > - STCXL " %0,0,%3\n" > + PPC_STCX "%0,0,%3\n" > "bne- 1b" > : "=&r"(old), "=m"(*p) > : "r"(mask), "r"(p), "m"(*p) > @@ -122,10 +113,10 @@ > > __asm__ __volatile__( > EIEIO_ON_SMP > -"1:" LARXL " %0,0,%3 # test_and_set_bit\n" > +"1:" PPC_LARX "%0,0,%3 # test_and_set_bit\n" > "or %1,%0,%2 \n" > PPC405_ERR77(0,%3) > - STCXL " %1,0,%3 \n" > + PPC_STCX "%1,0,%3 \n" > "bne- 1b" > ISYNC_ON_SMP > : "=&r" (old), "=&r" (t) > @@ -144,10 +135,10 @@ > > __asm__ __volatile__( > EIEIO_ON_SMP > -"1:" LARXL " %0,0,%3 # test_and_clear_bit\n" > +"1:" PPC_LARX "%0,0,%3 # test_and_clear_bit\n" > "andc %1,%0,%2 \n" > PPC405_ERR77(0,%3) > - STCXL " %1,0,%3 \n" > + PPC_STCX "%1,0,%3 \n" > "bne- 1b" > ISYNC_ON_SMP > : "=&r" (old), "=&r" (t) > @@ -166,10 +157,10 @@ > > __asm__ __volatile__( > EIEIO_ON_SMP > -"1:" LARXL " %0,0,%3 # test_and_change_bit\n" > +"1:" PPC_LARX "%0,0,%3 # test_and_change_bit\n" > "xor %1,%0,%2 \n" > PPC405_ERR77(0,%3) > - STCXL " %1,0,%3 \n" > + PPC_STCX "%1,0,%3 \n" > "bne- 1b" > ISYNC_ON_SMP > : "=&r" (old), "=&r" (t) > @@ -184,9 +175,9 @@ > unsigned long old; > > __asm__ __volatile__( > -"1:" LARXL " %0,0,%3 # set_bit\n" > +"1:" PPC_LARX "%0,0,%3 # set_bits\n" > "or %0,%0,%2\n" > - STCXL " %0,0,%3\n" > + PPC_STCX "%0,0,%3\n" > "bne- 1b" > : "=&r" (old), "=m" (*addr) > : "r" (mask), "r" (addr), "m" (*addr) > @@ -268,7 +259,7 @@ > { > int lz; > > - asm (CNTLZL " %0,%1" : "=r" (lz) : "r" (x)); > + asm (PPC_CNTLZ "%0,%1" : "=r" (lz) : "r" (x)); > return BITS_PER_LONG - 1 - lz; > } > > Index: working-2.6/include/asm-powerpc/bug.h > =================================================================== > --- working-2.6.orig/include/asm-powerpc/bug.h 2005-11-03 > 16:26:58.000000000 +1100 > +++ working-2.6/include/asm-powerpc/bug.h 2005-11-04 > 14:04:05.000000000 +1100 > @@ -1,6 +1,7 @@ > #ifndef _ASM_POWERPC_BUG_H > #define _ASM_POWERPC_BUG_H > > +#include > /* > * Define an illegal instr to trap on the bug. > * We don't use 0 because that marks the end of a function > @@ -11,14 +12,6 @@ > > #ifndef __ASSEMBLY__ > > -#ifdef __powerpc64__ > -#define BUG_TABLE_ENTRY ".llong" > -#define BUG_TRAP_OP "tdnei" > -#else > -#define BUG_TABLE_ENTRY ".long" > -#define BUG_TRAP_OP "twnei" > -#endif /* __powerpc64__ */ > - > struct bug_entry { > unsigned long bug_addr; > long line; > @@ -40,16 +33,16 @@ > __asm__ __volatile__( \ > "1: twi 31,0,0\n" \ > ".section __bug_table,\"a\"\n" \ > - "\t"BUG_TABLE_ENTRY" 1b,%0,%1,%2\n" \ > + "\t"PPC_LONG" 1b,%0,%1,%2\n" \ > ".previous" \ > : : "i" (__LINE__), "i" (__FILE__), "i" (__FUNCTION__)); \ > } while (0) > > #define BUG_ON(x) do { \ > __asm__ __volatile__( \ > - "1: "BUG_TRAP_OP" %0,0\n" \ > + "1: "PPC_TNEI" %0,0\n" \ > ".section __bug_table,\"a\"\n" \ > - "\t"BUG_TABLE_ENTRY" 1b,%1,%2,%3\n" \ > + "\t"PPC_LONG" 1b,%1,%2,%3\n" \ > ".previous" \ > : : "r" ((long)(x)), "i" (__LINE__), \ > "i" (__FILE__), "i" (__FUNCTION__)); \ > @@ -57,9 +50,9 @@ > > #define WARN_ON(x) do { \ > __asm__ __volatile__( \ > - "1: "BUG_TRAP_OP" %0,0\n" \ > + "1: "PPC_TNEI" %0,0\n" \ > ".section __bug_table,\"a\"\n" \ > - "\t"BUG_TABLE_ENTRY" 1b,%1,%2,%3\n" \ > + "\t"PPC_LONG" 1b,%1,%2,%3\n" \ > ".previous" \ > : : "r" ((long)(x)), \ > "i" (__LINE__ + BUG_WARNING_TRAP), \ > Index: working-2.6/include/asm-powerpc/futex.h > =================================================================== > --- working-2.6.orig/include/asm-powerpc/futex.h 2005-11-03 > 16:26:58.000000000 +1100 > +++ working-2.6/include/asm-powerpc/futex.h 2005-11-04 > 14:04:05.000000000 +1100 > @@ -7,13 +7,14 @@ > #include > #include > #include > -#include > +#include > > #define __futex_atomic_op(insn, ret, oldval, uaddr, oparg) \ > __asm__ __volatile ( \ > SYNC_ON_SMP \ > "1: lwarx %0,0,%2\n" \ > insn \ > + PPC405_ERR77(0, %2) \ > "2: stwcx. %1,0,%2\n" \ > "bne- 1b\n" \ > "li %1,0\n" \ > @@ -23,7 +24,7 @@ > ".previous\n" \ > ".section __ex_table,\"a\"\n" \ > ".align 3\n" \ > - DATAL " 1b,4b,2b,4b\n" \ > + PPC_LONG "1b,4b,2b,4b\n" \ > ".previous" \ > : "=&r" (oldval), "=&r" (ret) \ > : "b" (uaddr), "i" (-EFAULT), "1" (oparg) \ > Index: working-2.6/include/asm-powerpc/cputable.h > =================================================================== > --- working-2.6.orig/include/asm-powerpc/cputable.h 2005-10-31 > 15:20:22.000000000 +1100 > +++ working-2.6/include/asm-powerpc/cputable.h 2005-11-04 > 14:04:05.000000000 +1100 > @@ -2,7 +2,7 @@ > #define __ASM_POWERPC_CPUTABLE_H > > #include > -#include /* for ASM_CONST */ > +#include > > #define PPC_FEATURE_32 0x80000000 > #define PPC_FEATURE_64 0x40000000 > Index: working-2.6/include/asm-ppc64/mmu.h > =================================================================== > --- working-2.6.orig/include/asm-ppc64/mmu.h 2005-10-31 > 15:20:22.000000000 +1100 > +++ working-2.6/include/asm-ppc64/mmu.h 2005-11-04 > 14:04:05.000000000 +1100 > @@ -14,7 +14,7 @@ > #define _PPC64_MMU_H_ > > #include > -#include /* for ASM_CONST */ > +#include > #include > > /* > Index: working-2.6/include/asm-ppc64/page.h > =================================================================== > --- working-2.6.orig/include/asm-ppc64/page.h 2005-10-31 > 15:20:22.000000000 +1100 > +++ working-2.6/include/asm-ppc64/page.h 2005-11-04 > 14:04:05.000000000 +1100 > @@ -11,7 +11,7 @@ > */ > > #include > -#include /* for ASM_CONST */ > +#include > > /* PAGE_SHIFT determines the page size */ > #define PAGE_SHIFT 12 > Index: working-2.6/include/asm-powerpc/asm-compat.h > =================================================================== > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > +++ working-2.6/include/asm-powerpc/asm-compat.h 2005-11-04 > 14:04:05.000000000 +1100 > @@ -0,0 +1,55 @@ > +#ifndef _ASM_POWERPC_ASM_COMPAT_H > +#define _ASM_POWERPC_ASM_COMPAT_H > + > +#include > +#include > + > +#ifdef __ASSEMBLY__ > +# define stringify_in_c(...) __VA_ARGS__ > +# define ASM_CONST(x) x > +#else > +/* This version of stringify will deal with commas... */ > +# define __stringify_in_c(...) #__VA_ARGS__ > +# define stringify_in_c(...) __stringify_in_c(__VA_ARGS__) " " > +# define __ASM_CONST(x) x##UL > +# define ASM_CONST(x) __ASM_CONST(x) > +#endif > + > +#ifdef __powerpc64__ > + > +/* operations for longs and pointers */ > +#define PPC_LD stringify_in_c(ld) > +#define PPC_ST stringify_in_c(std) > +#define PPC_CMPI stringify_in_c(cmpdi) > +#define PPC_LONG stringify_in_c(.llong) > +#define PPC_TNEI stringify_in_c(tdnei) > +#define PPC_LARX stringify_in_c(ldarx) > +#define PPC_STCX stringify_in_c(stdcx.) > +#define PPC_CNTLZ stringify_in_c(cntlzd) > + > +#else /* 32-bit */ > + > +/* operations for longs and pointers */ > +#define PPC_LD stringify_in_c(lwz) > +#define PPC_ST stringify_in_c(stw) > +#define PPC_CMPI stringify_in_c(cmpwi) > +#define PPC_LONG stringify_in_c(.long) > +#define PPC_TNEI stringify_in_c(twnei) > +#define PPC_LARX stringify_in_c(lwarx) > +#define PPC_STCX stringify_in_c(stwcx.) > +#define PPC_CNTLZ stringify_in_c(cntlzw) > + > +#endif > + > +#ifdef CONFIG_IBM405_ERR77 > +/* Erratum #77 on the 405 means we need a sync or dcbt before every > + * stwcx. The old ATOMIC_SYNC_FIX covered some but not all of this. > + */ > +#define PPC405_ERR77(ra,rb) stringify_in_c(dcbt ra, rb;) > +#define PPC405_ERR77_SYNC stringify_in_c(sync;) > +#else > +#define PPC405_ERR77(ra,rb) > +#define PPC405_ERR77_SYNC > +#endif > + > +#endif /* _ASM_POWERPC_ASM_COMPAT_H */ > Index: working-2.6/arch/powerpc/xmon/setjmp.S > =================================================================== > --- working-2.6.orig/arch/powerpc/xmon/setjmp.S 2005-10-31 > 15:20:57.000000000 +1100 > +++ working-2.6/arch/powerpc/xmon/setjmp.S 2005-11-04 > 14:04:05.000000000 +1100 > @@ -14,61 +14,61 @@ > > _GLOBAL(xmon_setjmp) > mflr r0 > - STL r0,0(r3) > - STL r1,SZL(r3) > - STL r2,2*SZL(r3) > + PPC_ST r0,0(r3) > + PPC_ST r1,SZL(r3) > + PPC_ST r2,2*SZL(r3) > mfcr r0 > - STL r0,3*SZL(r3) > - STL r13,4*SZL(r3) > - STL r14,5*SZL(r3) > - STL r15,6*SZL(r3) > - STL r16,7*SZL(r3) > - STL r17,8*SZL(r3) > - STL r18,9*SZL(r3) > - STL r19,10*SZL(r3) > - STL r20,11*SZL(r3) > - STL r21,12*SZL(r3) > - STL r22,13*SZL(r3) > - STL r23,14*SZL(r3) > - STL r24,15*SZL(r3) > - STL r25,16*SZL(r3) > - STL r26,17*SZL(r3) > - STL r27,18*SZL(r3) > - STL r28,19*SZL(r3) > - STL r29,20*SZL(r3) > - STL r30,21*SZL(r3) > - STL r31,22*SZL(r3) > + PPC_ST r0,3*SZL(r3) > + PPC_ST r13,4*SZL(r3) > + PPC_ST r14,5*SZL(r3) > + PPC_ST r15,6*SZL(r3) > + PPC_ST r16,7*SZL(r3) > + PPC_ST r17,8*SZL(r3) > + PPC_ST r18,9*SZL(r3) > + PPC_ST r19,10*SZL(r3) > + PPC_ST r20,11*SZL(r3) > + PPC_ST r21,12*SZL(r3) > + PPC_ST r22,13*SZL(r3) > + PPC_ST r23,14*SZL(r3) > + PPC_ST r24,15*SZL(r3) > + PPC_ST r25,16*SZL(r3) > + PPC_ST r26,17*SZL(r3) > + PPC_ST r27,18*SZL(r3) > + PPC_ST r28,19*SZL(r3) > + PPC_ST r29,20*SZL(r3) > + PPC_ST r30,21*SZL(r3) > + PPC_ST r31,22*SZL(r3) > li r3,0 > blr > > _GLOBAL(xmon_longjmp) > - CMPI r4,0 > + PPC_CMPI r4,0 > bne 1f > li r4,1 > -1: LDL r13,4*SZL(r3) > - LDL r14,5*SZL(r3) > - LDL r15,6*SZL(r3) > - LDL r16,7*SZL(r3) > - LDL r17,8*SZL(r3) > - LDL r18,9*SZL(r3) > - LDL r19,10*SZL(r3) > - LDL r20,11*SZL(r3) > - LDL r21,12*SZL(r3) > - LDL r22,13*SZL(r3) > - LDL r23,14*SZL(r3) > - LDL r24,15*SZL(r3) > - LDL r25,16*SZL(r3) > - LDL r26,17*SZL(r3) > - LDL r27,18*SZL(r3) > - LDL r28,19*SZL(r3) > - LDL r29,20*SZL(r3) > - LDL r30,21*SZL(r3) > - LDL r31,22*SZL(r3) > - LDL r0,3*SZL(r3) > +1: PPC_LD r13,4*SZL(r3) > + PPC_LD r14,5*SZL(r3) > + PPC_LD r15,6*SZL(r3) > + PPC_LD r16,7*SZL(r3) > + PPC_LD r17,8*SZL(r3) > + PPC_LD r18,9*SZL(r3) > + PPC_LD r19,10*SZL(r3) > + PPC_LD r20,11*SZL(r3) > + PPC_LD r21,12*SZL(r3) > + PPC_LD r22,13*SZL(r3) > + PPC_LD r23,14*SZL(r3) > + PPC_LD r24,15*SZL(r3) > + PPC_LD r25,16*SZL(r3) > + PPC_LD r26,17*SZL(r3) > + PPC_LD r27,18*SZL(r3) > + PPC_LD r28,19*SZL(r3) > + PPC_LD r29,20*SZL(r3) > + PPC_LD r30,21*SZL(r3) > + PPC_LD r31,22*SZL(r3) > + PPC_LD r0,3*SZL(r3) > mtcrf 0x38,r0 > - LDL r0,0(r3) > - LDL r1,SZL(r3) > - LDL r2,2*SZL(r3) > + PPC_LD r0,0(r3) > + PPC_LD r1,SZL(r3) > + PPC_LD r2,2*SZL(r3) > mtlr r0 > mr r3,r4 > blr > @@ -84,52 +84,52 @@ > * different ABIs, though). > */ > _GLOBAL(xmon_save_regs) > - STL r0,0*SZL(r3) > - STL r2,2*SZL(r3) > - STL r3,3*SZL(r3) > - STL r4,4*SZL(r3) > - STL r5,5*SZL(r3) > - STL r6,6*SZL(r3) > - STL r7,7*SZL(r3) > - STL r8,8*SZL(r3) > - STL r9,9*SZL(r3) > - STL r10,10*SZL(r3) > - STL r11,11*SZL(r3) > - STL r12,12*SZL(r3) > - STL r13,13*SZL(r3) > - STL r14,14*SZL(r3) > - STL r15,15*SZL(r3) > - STL r16,16*SZL(r3) > - STL r17,17*SZL(r3) > - STL r18,18*SZL(r3) > - STL r19,19*SZL(r3) > - STL r20,20*SZL(r3) > - STL r21,21*SZL(r3) > - STL r22,22*SZL(r3) > - STL r23,23*SZL(r3) > - STL r24,24*SZL(r3) > - STL r25,25*SZL(r3) > - STL r26,26*SZL(r3) > - STL r27,27*SZL(r3) > - STL r28,28*SZL(r3) > - STL r29,29*SZL(r3) > - STL r30,30*SZL(r3) > - STL r31,31*SZL(r3) > + PPC_ST r0,0*SZL(r3) > + PPC_ST r2,2*SZL(r3) > + PPC_ST r3,3*SZL(r3) > + PPC_ST r4,4*SZL(r3) > + PPC_ST r5,5*SZL(r3) > + PPC_ST r6,6*SZL(r3) > + PPC_ST r7,7*SZL(r3) > + PPC_ST r8,8*SZL(r3) > + PPC_ST r9,9*SZL(r3) > + PPC_ST r10,10*SZL(r3) > + PPC_ST r11,11*SZL(r3) > + PPC_ST r12,12*SZL(r3) > + PPC_ST r13,13*SZL(r3) > + PPC_ST r14,14*SZL(r3) > + PPC_ST r15,15*SZL(r3) > + PPC_ST r16,16*SZL(r3) > + PPC_ST r17,17*SZL(r3) > + PPC_ST r18,18*SZL(r3) > + PPC_ST r19,19*SZL(r3) > + PPC_ST r20,20*SZL(r3) > + PPC_ST r21,21*SZL(r3) > + PPC_ST r22,22*SZL(r3) > + PPC_ST r23,23*SZL(r3) > + PPC_ST r24,24*SZL(r3) > + PPC_ST r25,25*SZL(r3) > + PPC_ST r26,26*SZL(r3) > + PPC_ST r27,27*SZL(r3) > + PPC_ST r28,28*SZL(r3) > + PPC_ST r29,29*SZL(r3) > + PPC_ST r30,30*SZL(r3) > + PPC_ST r31,31*SZL(r3) > /* go up one stack frame for SP */ > - LDL r4,0(r1) > - STL r4,1*SZL(r3) > + PPC_LD r4,0(r1) > + PPC_ST r4,1*SZL(r3) > /* get caller's LR */ > - LDL r0,LRSAVE(r4) > - STL r0,_NIP-STACK_FRAME_OVERHEAD(r3) > - STL r0,_LINK-STACK_FRAME_OVERHEAD(r3) > + PPC_LD r0,LRSAVE(r4) > + PPC_ST r0,_NIP-STACK_FRAME_OVERHEAD(r3) > + PPC_ST r0,_LINK-STACK_FRAME_OVERHEAD(r3) > mfmsr r0 > - STL r0,_MSR-STACK_FRAME_OVERHEAD(r3) > + PPC_ST r0,_MSR-STACK_FRAME_OVERHEAD(r3) > mfctr r0 > - STL r0,_CTR-STACK_FRAME_OVERHEAD(r3) > + PPC_ST r0,_CTR-STACK_FRAME_OVERHEAD(r3) > mfxer r0 > - STL r0,_XER-STACK_FRAME_OVERHEAD(r3) > + PPC_ST r0,_XER-STACK_FRAME_OVERHEAD(r3) > mfcr r0 > - STL r0,_CCR-STACK_FRAME_OVERHEAD(r3) > + PPC_ST r0,_CCR-STACK_FRAME_OVERHEAD(r3) > li r0,0 > - STL r0,_TRAP-STACK_FRAME_OVERHEAD(r3) > + PPC_ST r0,_TRAP-STACK_FRAME_OVERHEAD(r3) > blr > Index: working-2.6/include/asm-powerpc/system.h > =================================================================== > --- working-2.6.orig/include/asm-powerpc/system.h 2005-10-31 > 15:45:01.000000000 +1100 > +++ working-2.6/include/asm-powerpc/system.h 2005-11-04 > 14:04:05.000000000 +1100 > @@ -8,7 +8,6 @@ > #include > > #include > -#include > #include > > /* > Index: working-2.6/include/asm-powerpc/atomic.h > =================================================================== > --- working-2.6.orig/include/asm-powerpc/atomic.h 2005-10-31 > 15:20:22.000000000 +1100 > +++ working-2.6/include/asm-powerpc/atomic.h 2005-11-04 > 14:04:05.000000000 +1100 > @@ -9,21 +9,13 @@ > > #ifdef __KERNEL__ > #include > +#include > > #define ATOMIC_INIT(i) { (i) } > > #define atomic_read(v) ((v)->counter) > #define atomic_set(v,i) (((v)->counter) = (i)) > > -/* Erratum #77 on the 405 means we need a sync or dcbt before > every stwcx. > - * The old ATOMIC_SYNC_FIX covered some but not all of this. > - */ > -#ifdef CONFIG_IBM405_ERR77 > -#define PPC405_ERR77(ra,rb) "dcbt " #ra "," #rb ";" > -#else > -#define PPC405_ERR77(ra,rb) > -#endif > - > static __inline__ void atomic_add(int a, atomic_t *v) > { > int t; > Index: working-2.6/include/asm-powerpc/uaccess.h > =================================================================== > --- working-2.6.orig/include/asm-powerpc/uaccess.h 2005-11-03 > 16:26:58.000000000 +1100 > +++ working-2.6/include/asm-powerpc/uaccess.h 2005-11-04 > 14:04:05.000000000 +1100 > @@ -120,14 +120,6 @@ > > extern long __put_user_bad(void); > > -#ifdef __powerpc64__ > -#define __EX_TABLE_ALIGN "3" > -#define __EX_TABLE_TYPE "llong" > -#else > -#define __EX_TABLE_ALIGN "2" > -#define __EX_TABLE_TYPE "long" > -#endif > - > /* > * We don't tell gcc that we are accessing memory, but this is OK > * because we do not write to any memory gcc knows about, so there > @@ -142,11 +134,12 @@ > " b 2b\n" \ > ".previous\n" \ > ".section __ex_table,\"a\"\n" \ > - " .align " __EX_TABLE_ALIGN "\n" \ > - " ."__EX_TABLE_TYPE" 1b,3b\n" \ > + " .balign %5\n" \ > + PPC_LONG "1b,3b\n" \ > ".previous" \ > : "=r" (err) \ > - : "r" (x), "b" (addr), "i" (-EFAULT), "0" (err)) > + : "r" (x), "b" (addr), "i" (-EFAULT), "0" (err),\ > + "i"(sizeof(unsigned long))) > > #ifdef __powerpc64__ > #define __put_user_asm2(x, ptr, retval) \ > @@ -162,12 +155,13 @@ > " b 3b\n" \ > ".previous\n" \ > ".section __ex_table,\"a\"\n" \ > - " .align " __EX_TABLE_ALIGN "\n" \ > - " ." __EX_TABLE_TYPE " 1b,4b\n" \ > - " ." __EX_TABLE_TYPE " 2b,4b\n" \ > + " .balign %5\n" \ > + PPC_LONG "1b,4b\n" \ > + PPC_LONG "2b,4b\n" \ > ".previous" \ > : "=r" (err) \ > - : "r" (x), "b" (addr), "i" (-EFAULT), "0" (err)) > + : "r" (x), "b" (addr), "i" (-EFAULT), "0" (err),\ > + "i"(sizeof(unsigned long))) > #endif /* __powerpc64__ */ > > #define __put_user_size(x, ptr, size, retval) \ > @@ -213,11 +207,12 @@ > " b 2b\n" \ > ".previous\n" \ > ".section __ex_table,\"a\"\n" \ > - " .align "__EX_TABLE_ALIGN "\n" \ > - " ." __EX_TABLE_TYPE " 1b,3b\n" \ > + " .balign %5\n" \ > + PPC_LONG "1b,3b\n" \ > ".previous" \ > : "=r" (err), "=r" (x) \ > - : "b" (addr), "i" (-EFAULT), "0" (err)) > + : "b" (addr), "i" (-EFAULT), "0" (err), \ > + "i"(sizeof(unsigned long))) > > #ifdef __powerpc64__ > #define __get_user_asm2(x, addr, err) \ > @@ -235,12 +230,13 @@ > " b 3b\n" \ > ".previous\n" \ > ".section __ex_table,\"a\"\n" \ > - " .align " __EX_TABLE_ALIGN "\n" \ > - " ." __EX_TABLE_TYPE " 1b,4b\n" \ > - " ." __EX_TABLE_TYPE " 2b,4b\n" \ > + " .balign %5\n" \ > + PPC_LONG "1b,4b\n" \ > + PPC_LONG "2b,4b\n" \ > ".previous" \ > : "=r" (err), "=&r" (x) \ > - : "b" (addr), "i" (-EFAULT), "0" (err)) > + : "b" (addr), "i" (-EFAULT), "0" (err), \ > + "i"(sizeof(unsigned long))) > #endif /* __powerpc64__ */ > > #define __get_user_size(x, ptr, size, retval) \ > Index: working-2.6/arch/powerpc/platforms/iseries/misc.S > =================================================================== > --- working-2.6.orig/arch/powerpc/platforms/iseries/misc.S > 2005-10-31 15:20:20.000000000 +1100 > +++ working-2.6/arch/powerpc/platforms/iseries/misc.S 2005-11-04 > 14:04:05.000000000 +1100 > @@ -15,6 +15,7 @@ > > #include > #include > +#include > > .text > > Index: working-2.6/arch/ppc/boot/openfirmware/Makefile > =================================================================== > --- working-2.6.orig/arch/ppc/boot/openfirmware/Makefile 2005-10-25 > 11:59:53.000000000 +1000 > +++ working-2.6/arch/ppc/boot/openfirmware/Makefile 2005-11-04 > 14:04:05.000000000 +1100 > @@ -80,8 +80,7 @@ > $(call if_changed,mknote) > > > -$(obj)/coffcrt0.o: EXTRA_AFLAGS := -traditional -DXCOFF > -$(obj)/crt0.o: EXTRA_AFLAGS := -traditional > +$(obj)/coffcrt0.o: EXTRA_AFLAGS := -DXCOFF > targets += coffcrt0.o crt0.o > $(obj)/coffcrt0.o $(obj)/crt0.o: $(common)/crt0.S FORCE > $(call if_changed_dep,as_o_S) > > -- > David Gibson | I'll have my music baroque, and my code > david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ > _other_ > | _way_ _around_! > http://www.ozlabs.org/people/dgibson > _______________________________________________ > Linuxppc64-dev mailing list > Linuxppc64-dev at ozlabs.org > https://ozlabs.org/mailman/listinfo/linuxppc64-dev From benh at kernel.crashing.org Fri Nov 4 17:56:09 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Fri, 04 Nov 2005 17:56:09 +1100 Subject: pci_resource_end() changed problem with 2.6.14 In-Reply-To: <436ADBA7.7030706@am.sony.com> References: <436ADBA7.7030706@am.sony.com> Message-ID: <1131087370.4680.238.camel@gaston> On Thu, 2005-11-03 at 19:55 -0800, Geoff Levand wrote: > I found that the serial port probe code in drivers/serial/8250_pci.c > no longer works properly for ppc64 in 2.6.14. It seems the value > returned by pci_resource_len() on ppc64 changed from 8 to 16 since > 2.6.13. I tested on a PC and pci_resource_len() returns 8 as > expected. > > Any help on on where to look for the problem would be appreciated. > > Here's the code that hits the problem: > > if (pci_resource_flags(dev, i) & IORESOURCE_IO && > pci_resource_len(dev, i) == 8 && > > And here are some test results: Interesting... What does an lspci -vv shows for the BARs of the PCI card ? Also, what do you have in /proc/device-tree ? What is the machine precisely ? 2.6.14 now uses the OF device-tree to generate the linux PCI tree instead of going directly to PCI probing. It's possible that this is causing your problem if for some reason, the BAR sizing done by OF ends up being different than what the kernel does ... Ben. From benh at kernel.crashing.org Fri Nov 4 17:58:58 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Fri, 04 Nov 2005 17:58:58 +1100 Subject: [PATCH] HVC init race In-Reply-To: References: <20051104160211.c66d82f3.mikey@neuling.org> Message-ID: <1131087539.4680.242.camel@gaston> On Thu, 2005-11-03 at 23:10 -0600, Hollis Blanchard wrote: > On Nov 3, 2005, at 11:02 PM, Michael Neuling wrote: > > > I've been hitting a crash on boot where tty_open is being called > > before the > > hvc console driver setup is complete. Below patch fixes this problem. > > What is the race exactly? I guess nothing should be calling into > hvc_open before tty_register_driver()...? Well, in addition to the total lack of locking of the tty code you mean ? :) Seriously, nothing protects tty_register_driver vs; concurrent tty_open()... The race we are hitting here is a bit different though, it is that tty_open(), when asked for the default console driver, will first "ask" the kernel console if it has a matching tty driver. It does that by calling struct console -> device() callback. The hvc implementation of this callback returns the global hvc_driver. However, we fill that global before initializing it's content or registering the tty... kaboom. Ben. From mikey at neuling.org Fri Nov 4 18:02:31 2005 From: mikey at neuling.org (Michael Neuling) Date: Fri, 4 Nov 2005 18:02:31 +1100 Subject: [PATCH] HVC init race In-Reply-To: References: <20051104160211.c66d82f3.mikey@neuling.org> Message-ID: <20051104180231.ca288f01.mikey@neuling.org> > What is the race exactly? I guess nothing should be calling into > hvc_open before tty_register_driver()...? init_dev (from tty_io.c) seems to be racing with hvc_init (from hvc_console.c). Hence we can hit this section at the start if init_dev: if (driver->flags & TTY_DRIVER_DEVPTS_MEM) { tty = devpts_get_tty(idx); if (tty && driver->subtype == PTY_TYPE_MASTER) tty = tty->link; } else { tty = driver->ttys[idx]; /* Crashing here */ } and driver->flags is good but driver->ttys[idx] is not inited yet. ---- cpu 0x5: Vector: 300 (Data Access) at [c000000039f23690] pc: c0000000002a1dc8: .init_dev+0x158/0x760 lr: c0000000002a1db8: .init_dev+0x148/0x760 sp: c000000039f23910 msr: 9000000000009032 dar: 0 dsisr: 40000000 current = 0xc00000003a4147e0 paca = 0xc0000000005d7800 pid = 17150, comm = modprobe enter ? for help 5:mon> t [c000000039f239f0] c0000000002a2564 .tty_open+0x194/0x480 [c000000039f23ab0] c0000000000ce994 .chrdev_open+0x154/0x2a0 [c000000039f23b60] c0000000000bf280 .__dentry_open+0x140/0x3d0 [c000000039f23c10] c0000000000bf664 .filp_open+0x64/0x80 [c000000039f23d00] c0000000000bf868 .do_sys_open+0x68/0x120 [c000000039f23db0] c0000000000fd3e0 .compat_sys_open+0x10/0x30 [c000000039f23e30] c000000000008900 syscall_exit+0x0/0x18 --- Exception: c01 (System Call) at 000000000ff69810 SP (ff918950) is in userspace 5:mon> Mikey From arnd at arndb.de Sat Nov 5 02:31:16 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Fri, 4 Nov 2005 16:31:16 +0100 Subject: [PATCH] powerpc: mem_init crash for sparsemem Message-ID: <200511041631.17237.arnd@arndb.de> I have a Cell blade with some broken memory in the middle of the physical address space and this is correctly detected by the firmware, but not relocated. When I enable CONFIG_SPARSEMEM, the memsections for the nonexistant address space do not get struct page entries allocated, as expected. However, mem_init for the non-NUMA configuration tries to access these pages without first looking if they are there. I'm currently using the hack below to work around that, but I have the feeling that there should be a cleaner solution for this. Please comment. Signed-off-by: Arnd Bergmann --- linux-2.6.15-rc.orig/arch/powerpc/mm/mem.c +++ linux-2.6.15-rc/arch/powerpc/mm/mem.c @@ -348,6 +348,9 @@ void __init mem_init(void) #endif for_each_pgdat(pgdat) { for (i = 0; i < pgdat->node_spanned_pages; i++) { + if (!section_has_mem_map(__pfn_to_section + (pgdat->node_start_pfn + i))) + continue; page = pgdat_page_nr(pgdat, i); if (PageReserved(page)) reservedpages++; From johnrose at austin.ibm.com Sat Nov 5 03:20:55 2005 From: johnrose at austin.ibm.com (John Rose) Date: Fri, 04 Nov 2005 10:20:55 -0600 Subject: [PATCH 19/42]: ppc64: bugfix: crash on PHB add In-Reply-To: <20051104005117.GA26991@mail.gnucash.org> References: <20051103235918.GA25616@mail.gnucash.org> <20051104005117.GA26991@mail.gnucash.org> Message-ID: <1131121255.9574.11.camel@sinatra.austin.ibm.com> > This patch fixes a bug related to dlpar PHB add, after a PHB removal. This and patch 18 seem logically separate from the feature. This complicates review and adds to an already large patch set. Could we handle these separately? Thanks- John From linas at austin.ibm.com Sat Nov 5 03:35:57 2005 From: linas at austin.ibm.com (linas) Date: Fri, 4 Nov 2005 10:35:57 -0600 Subject: [PATCH 19/42]: ppc64: bugfix: crash on PHB add In-Reply-To: <1131121255.9574.11.camel@sinatra.austin.ibm.com> References: <20051103235918.GA25616@mail.gnucash.org> <20051104005117.GA26991@mail.gnucash.org> <1131121255.9574.11.camel@sinatra.austin.ibm.com> Message-ID: <20051104163557.GR19593@austin.ibm.com> On Fri, Nov 04, 2005 at 10:20:55AM -0600, John Rose was heard to remark: > > This patch fixes a bug related to dlpar PHB add, after a PHB removal. > > This and patch 18 seem logically separate from the feature. This > complicates review and adds to an already large patch set. Could we > handle these separately? I sent these in separetely, a month ago, as bug fixes for the dlpar crashes in the pre-2.6.14 kernels, but these were never applied. Since they're needed to get EEH to work, I just sent them in again with this set. Yes, I'm aware that the patch you sent yesterday fixes the same bug in almost the same way. What you really want to concentrate on are patches 20 through 23 which mess with the guts of the rpaphp code. But again, these are the same old patches, they have not changed since the submit last month. --linas From geoffrey.levand at am.sony.com Sat Nov 5 05:36:58 2005 From: geoffrey.levand at am.sony.com (Geoff Levand) Date: Fri, 04 Nov 2005 10:36:58 -0800 Subject: pci_resource_end() changed problem with 2.6.14 In-Reply-To: <1131087370.4680.238.camel@gaston> References: <436ADBA7.7030706@am.sony.com> <1131087370.4680.238.camel@gaston> Message-ID: <436BAA4A.5030007@am.sony.com> Benjamin Herrenschmidt wrote: > On Thu, 2005-11-03 at 19:55 -0800, Geoff Levand wrote: > >>I found that the serial port probe code in drivers/serial/8250_pci.c >>no longer works properly for ppc64 in 2.6.14. It seems the value >>returned by pci_resource_len() on ppc64 changed from 8 to 16 since >>2.6.13. I tested on a PC and pci_resource_len() returns 8 as >>expected. >> > Interesting... What does an lspci -vv shows for the BARs of the PCI > card ? Also, what do you have in /proc/device-tree ? What is the > machine precisely ? > > 2.6.14 now uses the OF device-tree to generate the linux PCI tree > instead of going directly to PCI probing. It's possible that this is > causing your problem if for some reason, the BAR sizing done by OF ends > up being different than what the kernel does ... > Sorry, I should have mentioned it, this is on my PowerMac G5 with a generic 8250 serial PCI card (StarTech PCI4S550N). Here's what lspci gives me: 0001:05:03.0 Serial controller: NetMos Technology PCI 9845 Multi-I/O Controller (rev 01) (prog-if 02 [16550]) Subsystem: LSI Logic / Symbios Logic 0P4S (4 port 16550A serial card) Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- References: <436ADBA7.7030706@am.sony.com> <1131087370.4680.238.camel@gaston> <436BAA4A.5030007@am.sony.com> Message-ID: <436BB1A3.5000404@am.sony.com> Geoff Levand wrote: > Benjamin Herrenschmidt wrote: > >>On Thu, 2005-11-03 at 19:55 -0800, Geoff Levand wrote: >> >> >>>I found that the serial port probe code in drivers/serial/8250_pci.c >>>no longer works properly for ppc64 in 2.6.14. It seems the value >>>returned by pci_resource_len() on ppc64 changed from 8 to 16 since >>>2.6.13. I tested on a PC and pci_resource_len() returns 8 as >>>expected. >>> >> >>Interesting... What does an lspci -vv shows for the BARs of the PCI >>card ? Also, what do you have in /proc/device-tree ? What is the >>machine precisely ? >> >>2.6.14 now uses the OF device-tree to generate the linux PCI tree >>instead of going directly to PCI probing. It's possible that this is >>causing your problem if for some reason, the BAR sizing done by OF ends >>up being different than what the kernel does ... >> > > > Sorry, I should have mentioned it, this is on my PowerMac G5 with a > generic 8250 serial PCI card (StarTech PCI4S550N). Here's what lspci > gives me: > > 0001:05:03.0 Serial controller: NetMos Technology PCI 9845 Multi-I/O Controller (rev 01) (prog-if 02 [16550]) > Subsystem: LSI Logic / Symbios Logic 0P4S (4 port 16550A serial card) > Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- > Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- Interrupt: pin A routed to IRQ 53 > Region 0: I/O ports at f4000050 [size=16] > Region 1: I/O ports at f4000040 [size=16] > Region 2: I/O ports at f4000030 [size=16] > Region 3: I/O ports at f4000020 [size=16] > Region 4: I/O ports at f4000010 [size=16] > Region 5: I/O ports at f4000000 [size=16] > > It could be the change to using the OF device-tree. What's an easy way to > see the size OF has used? > OK, from the OF prompt I can list the properties, and I see assigned-addresses has a length of 16. I think to change the the OF device tree parsing defeats the purpose of the using the OF device tree, so I'll look into making the serial port probe routine more clever. -Geoff From apw at shadowen.org Sat Nov 5 07:18:19 2005 From: apw at shadowen.org (Andy Whitcroft) Date: Fri, 04 Nov 2005 20:18:19 +0000 Subject: [PATCH] powerpc: mem_init crash for sparsemem In-Reply-To: <200511041631.17237.arnd@arndb.de> References: <200511041631.17237.arnd@arndb.de> Message-ID: <436BC20B.9070704@shadowen.org> Arnd Bergmann wrote: > I have a Cell blade with some broken memory in the middle of the > physical address space and this is correctly detected by the > firmware, but not relocated. When I enable CONFIG_SPARSEMEM, > the memsections for the nonexistant address space do not > get struct page entries allocated, as expected. > > However, mem_init for the non-NUMA configuration tries to > access these pages without first looking if they are there. > I'm currently using the hack below to work around that, but > I have the feeling that there should be a cleaner solution > for this. > > Please comment. > > Signed-off-by: Arnd Bergmann > > --- linux-2.6.15-rc.orig/arch/powerpc/mm/mem.c > +++ linux-2.6.15-rc/arch/powerpc/mm/mem.c > @@ -348,6 +348,9 @@ void __init mem_init(void) > #endif > for_each_pgdat(pgdat) { > for (i = 0; i < pgdat->node_spanned_pages; i++) { > + if (!section_has_mem_map(__pfn_to_section > + (pgdat->node_start_pfn + i))) > + continue; > page = pgdat_page_nr(pgdat, i); > if (PageReserved(page)) > reservedpages++; Would it not make sense to use pfn_valid(), as that is not sparsemem specific? Not looked at the code in question specifically, but if you can use section_has_mem_map() it should be equivalent: if (!pfn_valid(pgdat->node_start_pfn + i)) continue; Want to spin us a patch and I'll give it some general testing. -apw From kravetz at us.ibm.com Sat Nov 5 07:57:58 2005 From: kravetz at us.ibm.com (Mike Kravetz) Date: Fri, 4 Nov 2005 12:57:58 -0800 Subject: [PATCH] powerpc: mem_init crash for sparsemem In-Reply-To: <436BC20B.9070704@shadowen.org> References: <200511041631.17237.arnd@arndb.de> <436BC20B.9070704@shadowen.org> Message-ID: <20051104205758.GA5397@w-mikek2.ibm.com> On Fri, Nov 04, 2005 at 08:18:19PM +0000, Andy Whitcroft wrote: > Arnd Bergmann wrote: > > I have a Cell blade with some broken memory in the middle of the > > physical address space and this is correctly detected by the > > firmware, but not relocated. When I enable CONFIG_SPARSEMEM, > > the memsections for the nonexistant address space do not > > get struct page entries allocated, as expected. > > > > However, mem_init for the non-NUMA configuration tries to > > access these pages without first looking if they are there. This earlier statement in mem_init (or at least the comment), num_physpages = max_pfn; /* RAM is assumed contiguous */ may be a cause for concern. I'm pretty sure max_pfn has previously been set based on the value of lmb_end_of_DRAM(). My guess is that we are going to report the system as having more memory that it actually does (will not account for the hole(s)). That being said, the pfn_valid() check is still needed here. But, it looks like that code was originally written under the assumption that there were no holes. Can someone 'more in the know' of ppc architecture comment on the ram is contiguous assumption? Is this no longer the case? -- Mike From johnrose at austin.ibm.com Sat Nov 5 08:30:56 2005 From: johnrose at austin.ibm.com (John Rose) Date: Fri, 04 Nov 2005 15:30:56 -0600 Subject: [PATCH] dlpar enable for OF pci probe Message-ID: <1131139856.9574.49.camel@sinatra.austin.ibm.com> This patch contains the arch/ppc64 bits for enabling DLPAR and PCI Hotplug for the new OF-based PCI probe mechanism. This code path is currently broken. Please apply if appropriate. Thanks- John Signed-off-by: John Rose diff -puN arch/ppc64/kernel/rtas_pci.c~base_changes arch/ppc64/kernel/rtas_pci.c --- 2_6_linus/arch/ppc64/kernel/rtas_pci.c~base_changes 2005-11-04 13:54:50.000000000 -0600 +++ 2_6_linus-johnrose/arch/ppc64/kernel/rtas_pci.c 2005-11-04 13:54:50.000000000 -0600 @@ -440,7 +440,6 @@ struct pci_controller * __devinit init_p struct device_node *root = of_find_node_by_path("/"); unsigned int root_size_cells = 0; struct pci_controller *phb; - struct pci_bus *bus; int primary; root_size_cells = prom_n_size_cells(root); @@ -456,10 +455,7 @@ struct pci_controller * __devinit init_p of_node_put(root); pci_devs_phb_init_dynamic(phb); - phb->last_busno = 0xff; - bus = pci_scan_bus(phb->first_busno, phb->ops, phb->arch_data); - phb->bus = bus; - phb->last_busno = bus->subordinate; + scan_phb(phb); return phb; } diff -puN arch/ppc64/kernel/pci.c~base_changes arch/ppc64/kernel/pci.c --- 2_6_linus/arch/ppc64/kernel/pci.c~base_changes 2005-11-04 13:54:50.000000000 -0600 +++ 2_6_linus-johnrose/arch/ppc64/kernel/pci.c 2005-11-04 13:54:50.000000000 -0600 @@ -295,8 +295,8 @@ static void pci_parse_of_addrs(struct de } } -static struct pci_dev *of_create_pci_dev(struct device_node *node, - struct pci_bus *bus, int devfn) +struct pci_dev *of_create_pci_dev(struct device_node *node, + struct pci_bus *bus, int devfn) { struct pci_dev *dev; const char *type; @@ -354,10 +354,9 @@ static struct pci_dev *of_create_pci_dev return dev; } +EXPORT_SYMBOL(of_create_pci_dev); -static void of_scan_pci_bridge(struct device_node *node, struct pci_dev *dev); - -static void __devinit of_scan_bus(struct device_node *node, +void __devinit of_scan_bus(struct device_node *node, struct pci_bus *bus) { struct device_node *child = NULL; @@ -381,9 +380,10 @@ static void __devinit of_scan_bus(struct do_bus_setup(bus); } +EXPORT_SYMBOL(of_scan_bus); -static void __devinit of_scan_pci_bridge(struct device_node *node, - struct pci_dev *dev) +void __devinit of_scan_pci_bridge(struct device_node *node, + struct pci_dev *dev) { struct pci_bus *bus; u32 *busrange, *ranges; @@ -464,9 +464,10 @@ static void __devinit of_scan_pci_bridge else if (mode == PCI_PROBE_NORMAL) pci_scan_child_bus(bus); } +EXPORT_SYMBOL(of_scan_pci_bridge); #endif /* CONFIG_PPC_MULTIPLATFORM */ -static void __devinit scan_phb(struct pci_controller *hose) +void __devinit scan_phb(struct pci_controller *hose) { struct pci_bus *bus; struct device_node *node = hose->arch_data; diff -puN include/asm-ppc64/pci.h~base_changes include/asm-ppc64/pci.h --- 2_6_linus/include/asm-ppc64/pci.h~base_changes 2005-11-04 13:54:50.000000000 -0600 +++ 2_6_linus-johnrose/include/asm-ppc64/pci.h 2005-11-04 13:54:50.000000000 -0600 @@ -162,6 +162,14 @@ pcibios_fixup_device_resources(struct pc extern struct pci_controller *init_phb_dynamic(struct device_node *dn); +extern struct pci_dev *of_create_pci_dev(struct device_node *node, + struct pci_bus *bus, int devfn); + +extern void of_scan_pci_bridge(struct device_node *node, + struct pci_dev *dev); + +extern void of_scan_bus(struct device_node *node, struct pci_bus *bus); + extern int pci_read_irq_line(struct pci_dev *dev); extern void pcibios_add_platform_entries(struct pci_dev *dev); diff -puN include/asm-powerpc/ppc-pci.h~base_changes include/asm-powerpc/ppc-pci.h --- 2_6_linus/include/asm-powerpc/ppc-pci.h~base_changes 2005-11-04 13:54:50.000000000 -0600 +++ 2_6_linus-johnrose/include/asm-powerpc/ppc-pci.h 2005-11-04 13:54:50.000000000 -0600 @@ -34,6 +34,7 @@ void *traverse_pci_devices(struct device void pci_devs_phb_init(void); void pci_devs_phb_init_dynamic(struct pci_controller *phb); +void __devinit scan_phb(struct pci_controller *hose); /* PCI address cache management routines */ void pci_addr_cache_insert_device(struct pci_dev *dev); _ From arnd at arndb.de Sat Nov 5 08:43:48 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Fri, 4 Nov 2005 22:43:48 +0100 Subject: [PATCH] powerpc: mem_init crash for sparsemem In-Reply-To: <436BC20B.9070704@shadowen.org> References: <200511041631.17237.arnd@arndb.de> <436BC20B.9070704@shadowen.org> Message-ID: <200511042243.49661.arnd@arndb.de> On Freedag 04 November 2005 21:18, Andy Whitcroft wrote: > Would it not make sense to use pfn_valid(), as that is not sparsemem > specific? ?Not looked at the code in question specifically, but if you > can use section_has_mem_map() it should be equivalent: > > ????????if (!pfn_valid(pgdat->node_start_pfn + i)) > ????????????????continue; > > Want to spin us a patch and I'll give it some general testing. Yes, I guess pfn_valid() is the function I was looking for, thanks for pointing that out. Unfortunately, I don't have access to the machine over the weekend, so I won't be able to test that until Monday. Arnd <>< From benh at kernel.crashing.org Sat Nov 5 08:46:10 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sat, 05 Nov 2005 08:46:10 +1100 Subject: pci_resource_end() changed problem with 2.6.14 In-Reply-To: <436BAA4A.5030007@am.sony.com> References: <436ADBA7.7030706@am.sony.com> <1131087370.4680.238.camel@gaston> <436BAA4A.5030007@am.sony.com> Message-ID: <1131140771.29195.11.camel@gaston> On Fri, 2005-11-04 at 10:36 -0800, Geoff Levand wrote: > 0001:05:03.0 Serial controller: NetMos Technology PCI 9845 Multi-I/O Controller (rev 01) (prog-if 02 [16550]) > Subsystem: LSI Logic / Symbios Logic 0P4S (4 port 16550A serial card) > Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- > Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- Interrupt: pin A routed to IRQ 53 > Region 0: I/O ports at f4000050 [size=16] > Region 1: I/O ports at f4000040 [size=16] > Region 2: I/O ports at f4000030 [size=16] > Region 3: I/O ports at f4000020 [size=16] > Region 4: I/O ports at f4000010 [size=16] > Region 5: I/O ports at f4000000 [size=16] > > It could be the change to using the OF device-tree. What's an easy way to > see the size OF has used? Find the card in /proc/device-tree and look at "assigned-addresses" Ben. From johnrose at austin.ibm.com Sat Nov 5 08:54:26 2005 From: johnrose at austin.ibm.com (John Rose) Date: Fri, 04 Nov 2005 15:54:26 -0600 Subject: [PATCH 22/42]: PCI: remove duplicted pci hotplug code In-Reply-To: <20051104005201.GA27016@mail.gnucash.org> References: <20051103235918.GA25616@mail.gnucash.org> <20051104005201.GA27016@mail.gnucash.org> Message-ID: <1131141266.9574.60.camel@sinatra.austin.ibm.com> > +extern void pcibios_claim_one_bus(struct pci_bus *b); > + Might need to export this for module use by the kernel. Thanks- John From arnd at arndb.de Sat Nov 5 08:59:33 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Fri, 4 Nov 2005 22:59:33 +0100 Subject: [PATCH] powerpc: mem_init crash for sparsemem In-Reply-To: <20051104205758.GA5397@w-mikek2.ibm.com> References: <200511041631.17237.arnd@arndb.de> <436BC20B.9070704@shadowen.org> <20051104205758.GA5397@w-mikek2.ibm.com> Message-ID: <200511042259.33880.arnd@arndb.de> On Freedag 04 November 2005 21:57, Mike Kravetz wrote: > This earlier statement in mem_init (or at least the comment), > > num_physpages = max_pfn; /* RAM is assumed contiguous */ > > may be a cause for concern. I'm pretty sure max_pfn has previously > been set based on the value of lmb_end_of_DRAM(). My guess is that we > are going to report the system as having more memory that it actually > does (will not account for the hole(s)). Yes, that's likely to cause trouble later. Unfortunately, there are still multiple places that determine the memory size by different means and save the result in a global variable, so it's hard to get them all right. I'll probably move Cell to use NUMA mode for setups with multiple CPUs (each of which is already SMT), which means we can use the code that we already know handles this correctly, in addtition to the option of using NUMA aware memory allocation and scheduling for the SPUs. > That being said, the pfn_valid() check is still needed here. But, > it looks like that code was originally written under the assumption > that there were no holes. > > Can someone 'more in the know' of ppc architecture comment on the > ram is contiguous assumption? Is this no longer the case? For all I know, the firmware interface can legally declare noncontiguous memory, but that is not done on product level hardware except NUMA. The configuration for SPARSEMEM without NUMA is normally not possible on ppc64, I had to hack Kconfig to allow this in the first place. Without SPARSEMEM, the noncontiguous memory seems to be handled well except for the size detection. Arnd <>< From linas at austin.ibm.com Sat Nov 5 09:14:00 2005 From: linas at austin.ibm.com (linas) Date: Fri, 4 Nov 2005 16:14:00 -0600 Subject: [PATCH 41/42]: ppc64: Save device BARS much earlier in the boot sequence In-Reply-To: <20051104005519.GA27189@mail.gnucash.org> References: <20051103235918.GA25616@mail.gnucash.org> <20051104005519.GA27189@mail.gnucash.org> Message-ID: <20051104221400.GS19593@austin.ibm.com> Hi, On Thu, Nov 03, 2005 at 06:55:19PM -0600, Linas Vepstas was heard to remark: > Save the PCI device bars *before* any PCI probing is done. After a tiny bit of extra testing, I found one of those forehead slapping bugs in this patch. Here it is again, with the bug fixed. (So far, all the other tests look good; I've survived 24 hours of several thousand artifical pci errors injected onto ethernet and scsi intermingled with hundreds of pci slot adds/removes.) ------------ 241-eeh-save-bars-earlier.patch Save the PCI device bars *before* any PCI probing is done. Signed-off-by: Linas Vepstas -- Index: linux-2.6.14-git6/arch/ppc64/kernel/rtas_pci.c =================================================================== --- linux-2.6.14-git6.orig/arch/ppc64/kernel/rtas_pci.c 2005-11-03 14:46:40.000000000 -0600 +++ linux-2.6.14-git6/arch/ppc64/kernel/rtas_pci.c 2005-11-03 14:50:22.000000000 -0600 @@ -72,7 +72,7 @@ return 0; } -static int rtas_read_config(struct pci_dn *pdn, int where, int size, u32 *val) +int rtas_read_config(struct pci_dn *pdn, int where, int size, u32 *val) { int returnval = -1; unsigned long buid, addr; Index: linux-2.6.14-git6/include/asm-powerpc/ppc-pci.h =================================================================== --- linux-2.6.14-git6.orig/include/asm-powerpc/ppc-pci.h 2005-11-03 14:50:21.000000000 -0600 +++ linux-2.6.14-git6/include/asm-powerpc/ppc-pci.h 2005-11-03 14:50:22.000000000 -0600 @@ -59,8 +59,6 @@ void pci_addr_cache_build(void); struct pci_dev *pci_get_device_by_addr(unsigned long addr); -void eeh_save_bars(struct pci_dev * pdev, struct pci_dn *pdn); - /** * eeh_slot_error_detail -- record and EEH error condition to the log * @severity: 1 if temporary, 2 if permanent failure. @@ -104,6 +102,7 @@ void rtas_configure_bridge(struct pci_dn *); int rtas_write_config(struct pci_dn *, int where, int size, u32 val); +int rtas_read_config(struct pci_dn *, int where, int size, u32 *val); /** * mark and clear slots: find "partition endpoint" PE and set or Index: linux-2.6.14-git6/include/asm-ppc64/pci-bridge.h =================================================================== --- linux-2.6.14-git6.orig/include/asm-ppc64/pci-bridge.h 2005-11-03 14:50:15.000000000 -0600 +++ linux-2.6.14-git6/include/asm-ppc64/pci-bridge.h 2005-11-03 14:50:22.000000000 -0600 @@ -58,15 +58,15 @@ struct iommu_table; struct pci_dn { - int busno; /* for pci devices */ - int bussubno; /* for pci devices */ - int devfn; /* for pci devices */ + int busno; /* pci bus number */ + int bussubno; /* pci subordinate bus number */ + int devfn; /* pci device and function number */ + int class_code; /* pci device class */ int eeh_mode; /* See eeh.h for possible EEH_MODEs */ int eeh_config_addr; int eeh_pe_config_addr; /* new-style partition endpoint address */ int eeh_check_count; /* # times driver ignored error */ int eeh_freeze_count; /* # times this device froze up. */ - int eeh_is_bridge; /* device is pci-to-pci bridge */ int pci_ext_config_space; /* for pci devices */ struct pci_controller *phb; /* for pci devices */ Index: linux-2.6.14-git6/arch/powerpc/platforms/pseries/eeh.c =================================================================== --- linux-2.6.14-git6.orig/arch/powerpc/platforms/pseries/eeh.c 2005-11-03 14:50:21.000000000 -0600 +++ linux-2.6.14-git6/arch/powerpc/platforms/pseries/eeh.c 2005-11-04 16:07:29.596059751 -0600 @@ -106,6 +106,8 @@ static DEFINE_PER_CPU(unsigned long, ignored_failures); static DEFINE_PER_CPU(unsigned long, slot_resets); +#define IS_BRIDGE(class_code) (((class_code)<<16) == PCI_BASE_CLASS_BRIDGE) + /* --------------------------------------------------------------- */ /* Below lies the EEH event infrastructure */ @@ -620,7 +622,7 @@ if (!pdn) return; - if ((pdn->eeh_mode & EEH_MODE_SUPPORTED) && (!pdn->eeh_is_bridge)) + if ((pdn->eeh_mode & EEH_MODE_SUPPORTED) && !IS_BRIDGE(pdn->class_code)) __restore_bars (pdn); dn = pdn->node->child; @@ -638,18 +640,15 @@ * PCI devices are added individuallly; but, for the restore, * an entire slot is reset at a time. */ -void eeh_save_bars(struct pci_dev * pdev, struct pci_dn *pdn) +static void eeh_save_bars(struct pci_dn *pdn) { int i; - if (!pdev || !pdn ) + if (!pdn ) return; for (i = 0; i < 16; i++) - pci_read_config_dword(pdev, i * 4, &pdn->config_space[i]); - - if (pdev->hdr_type == PCI_HEADER_TYPE_BRIDGE) - pdn->eeh_is_bridge = 1; + rtas_read_config(pdn, i * 4, 4, &pdn->config_space[i]); } void @@ -703,6 +702,9 @@ pdn->eeh_check_count = 0; pdn->eeh_freeze_count = 0; + if (class_code) + pdn->class_code = *class_code; + if (status && strcmp(status, "ok") != 0) return NULL; /* ignore devices with bad status */ @@ -781,6 +783,7 @@ dn->full_name); } + eeh_save_bars(pdn); return NULL; } @@ -915,7 +918,6 @@ pdn->pcidev = dev; pci_addr_cache_insert_device (dev); - eeh_save_bars(dev, pdn); } EXPORT_SYMBOL_GPL(eeh_add_device_late); Index: linux-2.6.14-git6/arch/powerpc/platforms/pseries/eeh_cache.c =================================================================== --- linux-2.6.14-git6.orig/arch/powerpc/platforms/pseries/eeh_cache.c 2005-11-03 14:50:19.000000000 -0600 +++ linux-2.6.14-git6/arch/powerpc/platforms/pseries/eeh_cache.c 2005-11-04 10:22:51.000000000 -0600 @@ -304,10 +304,7 @@ pci_addr_cache_insert_device(dev); - /* Save the BAR's; firmware doesn't restore these after EEH reset */ dn = pci_device_to_OF_node(dev); - eeh_save_bars(dev, PCI_DN(dn)); - pci_dev_get (dev); /* matching put is in eeh_remove_device() */ PCI_DN(dn)->pcidev = dev; } From greg at kroah.com Sat Nov 5 09:14:37 2005 From: greg at kroah.com (Greg KH) Date: Fri, 4 Nov 2005 14:14:37 -0800 Subject: [PATCH 0/42] PCI Error Recovery for PPC64 and misc device drivers In-Reply-To: <20051103235918.GA25616@mail.gnucash.org> References: <20051103235918.GA25616@mail.gnucash.org> Message-ID: <20051104221437.GA20004@kroah.com> On Thu, Nov 03, 2005 at 05:59:18PM -0600, Linas Vepstas wrote: > What follows is a long sequence of mostly small patches to implement > PCI Error Recovery by adding notification callbacks to the PCI device > driver structure, implementing the recovery in 5 device drivers > (3 ethernet, two scsi drivers), and adding the actual error detection > and recovery code to the ppc64/powerpc arch tree. > > Highlights: > > -- Patches 1-14: Misc required ppc64/powerpc cleanup/bugfixes/restructuring > -- Patch 15: Overview documentation > -- Patch 16: changes to include/linux/pci.h > -- Patches 17-26: error detection and recovery for pSeries PCI bridge chips > -- Patchs 27-32: recovery patches for ethernet, scsi device drivers > -- Patches 33-42: More misc ppc64-specific changes Ok, so at first glance, I only need to pay attention to patches 15, 16, and 27-32? If so, please send the ppc64 specific patches through the ppc64 maintainers, and the rpaphp specific patches through that specific maintainer. Then care to resend the 8 remaining patches to me, so I can stage them in -mm for a while? thanks, greg k-h From holindho at cs.helsinki.fi Sat Nov 5 09:08:21 2005 From: holindho at cs.helsinki.fi (Heikki Lindholm) Date: Sat, 05 Nov 2005 00:08:21 +0200 Subject: [PATCH] Fix G5 UP build of 2.6.14 Message-ID: <436BDBD5.9040009@cs.helsinki.fi> 2.6.14 broke UP build for G5. This is against 2.6.14 release (some were already posted by Olof Johansson). Compiled with g5_defconfig minus CONFIG_SMP. -- Heikki Lindholm -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: 2.6.14-ppc64-build-UP.patch Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20051105/7c4e04d9/attachment.txt From kravetz at us.ibm.com Sat Nov 5 10:15:52 2005 From: kravetz at us.ibm.com (Mike Kravetz) Date: Fri, 4 Nov 2005 15:15:52 -0800 Subject: [PATCH 0/4] Memory Add Fixes for ppc64 Message-ID: <20051104231552.GA25545@w-mikek2.ibm.com> When memory add was merged into mainline in 2.6.14, there were various bits and pieces missing that prevent it from working on ppc64. The following patches are against 2.6.14-git7 and address all but one of the know issues. 1) Create hptes for new sections 2) Clear page count before freeing new pages 3) Kludge to add new memory to node 0 4) Ensure probe file is created for memory add via sysfs -- Mike From kravetz at us.ibm.com Sat Nov 5 10:18:00 2005 From: kravetz at us.ibm.com (Mike Kravetz) Date: Fri, 4 Nov 2005 15:18:00 -0800 Subject: [PATCH 1/4] Memory Add Fixes for ppc64 In-Reply-To: <20051104231552.GA25545@w-mikek2.ibm.com> References: <20051104231552.GA25545@w-mikek2.ibm.com> Message-ID: <20051104231800.GB25545@w-mikek2.ibm.com> Add the create_section_mapping() routine to create hptes for memory sections dynamically added after system boot. Signed-off-by: Mike Kravetz diff -Naupr linux-2.6.14-git7/arch/powerpc/mm/hash_utils_64.c linux-2.6.14-git7.work/arch/powerpc/mm/hash_utils_64.c --- linux-2.6.14-git7/arch/powerpc/mm/hash_utils_64.c 2005-11-04 21:21:05.000000000 +0000 +++ linux-2.6.14-git7.work/arch/powerpc/mm/hash_utils_64.c 2005-11-04 22:05:06.000000000 +0000 @@ -176,6 +176,15 @@ static unsigned long get_hashtable_size( return pteg_count << 7; } +#ifdef CONFIG_MEMORY_HOTPLUG +void create_section_mapping(unsigned long start, unsigned long end) +{ + create_pte_mapping(start, end, + _PAGE_ACCESSED | _PAGE_COHERENT | PP_RWXX, + cur_cpu_spec->cpu_features & CPU_FTR_16M_PAGE ? 1 : 0); +} +#endif /* CONFIG_MEMORY_HOTPLUG */ + void __init htab_initialize(void) { unsigned long table, htab_size_bytes; diff -Naupr linux-2.6.14-git7/arch/powerpc/mm/mem.c linux-2.6.14-git7.work/arch/powerpc/mm/mem.c --- linux-2.6.14-git7/arch/powerpc/mm/mem.c 2005-11-04 21:21:05.000000000 +0000 +++ linux-2.6.14-git7.work/arch/powerpc/mm/mem.c 2005-11-04 22:05:06.000000000 +0000 @@ -124,6 +124,9 @@ int __devinit add_memory(u64 start, u64 unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; + start += KERNELBASE; + create_section_mapping(start, start + size); + /* this should work for most non-highmem platforms */ zone = pgdata->node_zones; diff -Naupr linux-2.6.14-git7/include/asm-ppc64/sparsemem.h linux-2.6.14-git7.work/include/asm-ppc64/sparsemem.h --- linux-2.6.14-git7/include/asm-ppc64/sparsemem.h 2005-10-28 00:02:08.000000000 +0000 +++ linux-2.6.14-git7.work/include/asm-ppc64/sparsemem.h 2005-11-04 22:05:06.000000000 +0000 @@ -11,6 +11,10 @@ #define MAX_PHYSADDR_BITS 38 #define MAX_PHYSMEM_BITS 36 +#ifdef CONFIG_MEMORY_HOTPLUG +extern void create_section_mapping(unsigned long start, unsigned long end); +#endif /* CONFIG_MEMORY_HOTPLUG */ + #endif /* CONFIG_SPARSEMEM */ #endif /* _ASM_PPC64_SPARSEMEM_H */ From kravetz at us.ibm.com Sat Nov 5 10:19:32 2005 From: kravetz at us.ibm.com (Mike Kravetz) Date: Fri, 4 Nov 2005 15:19:32 -0800 Subject: [PATCH 2/4] Memory Add Fixes for ppc64 In-Reply-To: <20051104231552.GA25545@w-mikek2.ibm.com> References: <20051104231552.GA25545@w-mikek2.ibm.com> Message-ID: <20051104231932.GC25545@w-mikek2.ibm.com> memmap_init_zone() sets page count to 1. Before 'freeing' the page, we need to clear the count. This is the same that is done on free_all_bootmem_core() for memory discovered at boot time. Signed-off-by: Mike Kravetz diff -Naupr linux-2.6.14-git7/arch/powerpc/mm/mem.c linux-2.6.14-git7.work/arch/powerpc/mm/mem.c --- linux-2.6.14-git7/arch/powerpc/mm/mem.c 2005-11-04 21:21:05.000000000 +0000 +++ linux-2.6.14-git7.work/arch/powerpc/mm/mem.c 2005-11-04 22:09:59.000000000 +0000 @@ -107,6 +107,7 @@ EXPORT_SYMBOL(phys_mem_access_prot); void online_page(struct page *page) { ClearPageReserved(page); + set_page_count(page, 0); free_cold_page(page); totalram_pages++; num_physpages++; From kravetz at us.ibm.com Sat Nov 5 10:20:24 2005 From: kravetz at us.ibm.com (Mike Kravetz) Date: Fri, 4 Nov 2005 15:20:24 -0800 Subject: [PATCH 3/4] Memory Add Fixes for ppc64 In-Reply-To: <20051104231552.GA25545@w-mikek2.ibm.com> References: <20051104231552.GA25545@w-mikek2.ibm.com> Message-ID: <20051104232024.GD25545@w-mikek2.ibm.com> This is a temporary kludge that supports adding all new memory to node 0. I will provide a more complete solution similar to that used for dynamically added CPUs in a few days. Signed-off-by: Mike Kravetz diff -Naupr linux-2.6.14-git7/include/asm-ppc64/mmzone.h linux-2.6.14-git7.work/include/asm-ppc64/mmzone.h --- linux-2.6.14-git7/include/asm-ppc64/mmzone.h 2005-11-04 21:21:09.000000000 +0000 +++ linux-2.6.14-git7.work/include/asm-ppc64/mmzone.h 2005-11-04 22:10:44.000000000 +0000 @@ -33,6 +33,9 @@ extern int numa_cpu_lookup_table[]; extern char *numa_memory_lookup_table; extern cpumask_t numa_cpumask_lookup_table[]; extern int nr_cpus_in_node[]; +#ifdef CONFIG_MEMORY_HOTPLUG +extern unsigned long max_pfn; +#endif /* 16MB regions */ #define MEMORY_INCREMENT_SHIFT 24 @@ -45,6 +48,11 @@ static inline int pa_to_nid(unsigned lon { int nid; +#ifdef CONFIG_MEMORY_HOTPLUG + /* kludge hot added sections default to node 0 */ + if (pa >= (max_pfn << PAGE_SHIFT)) + return 0; +#endif nid = numa_memory_lookup_table[pa >> MEMORY_INCREMENT_SHIFT]; #ifdef DEBUG_NUMA From kravetz at us.ibm.com Sat Nov 5 10:21:09 2005 From: kravetz at us.ibm.com (Mike Kravetz) Date: Fri, 4 Nov 2005 15:21:09 -0800 Subject: [PATCH 4/4] Memory Add Fixes for ppc64 In-Reply-To: <20051104231552.GA25545@w-mikek2.ibm.com> References: <20051104231552.GA25545@w-mikek2.ibm.com> Message-ID: <20051104232109.GE25545@w-mikek2.ibm.com> ppc64 needs a special sysfs probe file for adding new memory. Signed-off-by: Mike Kravetz diff -Naupr linux-2.6.14-git7/arch/ppc64/Kconfig linux-2.6.14-git7.work/arch/ppc64/Kconfig --- linux-2.6.14-git7/arch/ppc64/Kconfig 2005-11-04 21:21:06.000000000 +0000 +++ linux-2.6.14-git7.work/arch/ppc64/Kconfig 2005-11-04 22:11:16.000000000 +0000 @@ -277,6 +277,10 @@ config HAVE_ARCH_EARLY_PFN_TO_NID def_bool y depends on NEED_MULTIPLE_NODES +config ARCH_MEMORY_PROBE + def_bool y + depends on MEMORY_HOTPLUG + # Some NUMA nodes have memory ranges that span # other nodes. Even though a pfn is valid and # between a node's start and end pfns, it may not From benh at kernel.crashing.org Sat Nov 5 11:04:30 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sat, 05 Nov 2005 11:04:30 +1100 Subject: [PATCH 1/4] Memory Add Fixes for ppc64 In-Reply-To: <20051104231800.GB25545@w-mikek2.ibm.com> References: <20051104231552.GA25545@w-mikek2.ibm.com> <20051104231800.GB25545@w-mikek2.ibm.com> Message-ID: <1131149070.29195.41.camel@gaston> On Fri, 2005-11-04 at 15:18 -0800, Mike Kravetz wrote: > Add the create_section_mapping() routine to create hptes for memory > sections dynamically added after system boot. > > Signed-off-by: Mike Kravetz This patch will have to be slightly reworked on top of the 64k pages one. It should be trivial though. Ben. > diff -Naupr linux-2.6.14-git7/arch/powerpc/mm/hash_utils_64.c linux-2.6.14-git7.work/arch/powerpc/mm/hash_utils_64.c > --- linux-2.6.14-git7/arch/powerpc/mm/hash_utils_64.c 2005-11-04 21:21:05.000000000 +0000 > +++ linux-2.6.14-git7.work/arch/powerpc/mm/hash_utils_64.c 2005-11-04 22:05:06.000000000 +0000 > @@ -176,6 +176,15 @@ static unsigned long get_hashtable_size( > return pteg_count << 7; > } > > +#ifdef CONFIG_MEMORY_HOTPLUG > +void create_section_mapping(unsigned long start, unsigned long end) > +{ > + create_pte_mapping(start, end, > + _PAGE_ACCESSED | _PAGE_COHERENT | PP_RWXX, > + cur_cpu_spec->cpu_features & CPU_FTR_16M_PAGE ? 1 : 0); > +} > +#endif /* CONFIG_MEMORY_HOTPLUG */ > + > void __init htab_initialize(void) > { > unsigned long table, htab_size_bytes; > diff -Naupr linux-2.6.14-git7/arch/powerpc/mm/mem.c linux-2.6.14-git7.work/arch/powerpc/mm/mem.c > --- linux-2.6.14-git7/arch/powerpc/mm/mem.c 2005-11-04 21:21:05.000000000 +0000 > +++ linux-2.6.14-git7.work/arch/powerpc/mm/mem.c 2005-11-04 22:05:06.000000000 +0000 > @@ -124,6 +124,9 @@ int __devinit add_memory(u64 start, u64 > unsigned long start_pfn = start >> PAGE_SHIFT; > unsigned long nr_pages = size >> PAGE_SHIFT; > > + start += KERNELBASE; > + create_section_mapping(start, start + size); > + > /* this should work for most non-highmem platforms */ > zone = pgdata->node_zones; > > diff -Naupr linux-2.6.14-git7/include/asm-ppc64/sparsemem.h linux-2.6.14-git7.work/include/asm-ppc64/sparsemem.h > --- linux-2.6.14-git7/include/asm-ppc64/sparsemem.h 2005-10-28 00:02:08.000000000 +0000 > +++ linux-2.6.14-git7.work/include/asm-ppc64/sparsemem.h 2005-11-04 22:05:06.000000000 +0000 > @@ -11,6 +11,10 @@ > #define MAX_PHYSADDR_BITS 38 > #define MAX_PHYSMEM_BITS 36 > > +#ifdef CONFIG_MEMORY_HOTPLUG > +extern void create_section_mapping(unsigned long start, unsigned long end); > +#endif /* CONFIG_MEMORY_HOTPLUG */ > + > #endif /* CONFIG_SPARSEMEM */ > > #endif /* _ASM_PPC64_SPARSEMEM_H */ > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ From paulus at samba.org Sat Nov 5 11:08:17 2005 From: paulus at samba.org (Paul Mackerras) Date: Sat, 5 Nov 2005 11:08:17 +1100 Subject: [PATCH 0/42] PCI Error Recovery for PPC64 and misc device drivers In-Reply-To: <20051104221437.GA20004@kroah.com> References: <20051103235918.GA25616@mail.gnucash.org> <20051104221437.GA20004@kroah.com> Message-ID: <17259.63473.450876.276151@cargo.ozlabs.ibm.com> Greg KH writes: > Ok, so at first glance, I only need to pay attention to patches 15, 16, > and 27-32? If so, please send the ppc64 specific patches through the > ppc64 maintainers, and the rpaphp specific patches through that specific > maintainer. Then care to resend the 8 remaining patches to me, so I can > stage them in -mm for a while? I'm happy to take care of the ppc64-specific patches. I would *really* like to see 16 go to Linus as soon as possible, since everything else depends on it, and since it has very little chance of breaking any existing code. Would you be OK with sending 16 to Linus within the next week? Thanks, Paul. From greg at kroah.com Sat Nov 5 11:28:43 2005 From: greg at kroah.com (Greg KH) Date: Fri, 4 Nov 2005 16:28:43 -0800 Subject: [PATCH 0/42] PCI Error Recovery for PPC64 and misc device drivers In-Reply-To: <17259.63473.450876.276151@cargo.ozlabs.ibm.com> References: <20051103235918.GA25616@mail.gnucash.org> <20051104221437.GA20004@kroah.com> <17259.63473.450876.276151@cargo.ozlabs.ibm.com> Message-ID: <20051105002842.GB22574@kroah.com> On Sat, Nov 05, 2005 at 11:08:17AM +1100, Paul Mackerras wrote: > Greg KH writes: > > > Ok, so at first glance, I only need to pay attention to patches 15, 16, > > and 27-32? If so, please send the ppc64 specific patches through the > > ppc64 maintainers, and the rpaphp specific patches through that specific > > maintainer. Then care to resend the 8 remaining patches to me, so I can > > stage them in -mm for a while? > > I'm happy to take care of the ppc64-specific patches. > > I would *really* like to see 16 go to Linus as soon as possible, since > everything else depends on it, and since it has very little chance of > breaking any existing code. Would you be OK with sending 16 to Linus > within the next week? Can I take 15, 16, 27-32 now without the ppc64 patches dying without it? thanks, greg k-h From kravetz at us.ibm.com Sat Nov 5 11:35:57 2005 From: kravetz at us.ibm.com (Mike Kravetz) Date: Fri, 4 Nov 2005 16:35:57 -0800 Subject: [PATCH 1/4] Memory Add Fixes for ppc64 In-Reply-To: <1131149070.29195.41.camel@gaston> References: <20051104231552.GA25545@w-mikek2.ibm.com> <20051104231800.GB25545@w-mikek2.ibm.com> <1131149070.29195.41.camel@gaston> Message-ID: <20051105003557.GC5397@w-mikek2.ibm.com> On Sat, Nov 05, 2005 at 11:04:30AM +1100, Benjamin Herrenschmidt wrote: > On Fri, 2005-11-04 at 15:18 -0800, Mike Kravetz wrote: > > Add the create_section_mapping() routine to create hptes for memory > > sections dynamically added after system boot. > > > > Signed-off-by: Mike Kravetz > > This patch will have to be slightly reworked on top of the 64k pages > one. It should be trivial though. OK. I'll respin on top of your patch at: http://gate.crashing.org/~benh/ppc64-64k-pages.diff Let me know if there is a different version going upstream. -- Mike From hch at lst.de Sat Nov 5 11:38:19 2005 From: hch at lst.de (Christoph Hellwig) Date: Sat, 5 Nov 2005 01:38:19 +0100 Subject: [PATCH] ppc64: 64K pages support In-Reply-To: <1130916198.20136.17.camel@gaston> References: <1130915220.20136.14.camel@gaston> <1130916198.20136.17.camel@gaston> Message-ID: <20051105003819.GA11505@lst.de> So how does the 64k on 4k hardware emulation work? When Hugh did bigger softpagesize for x86 based on 2.4.x he had to fix drivers all over to deal with that. From benh at kernel.crashing.org Sat Nov 5 11:43:07 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sat, 05 Nov 2005 11:43:07 +1100 Subject: [PATCH 1/4] Memory Add Fixes for ppc64 In-Reply-To: <20051105003557.GC5397@w-mikek2.ibm.com> References: <20051104231552.GA25545@w-mikek2.ibm.com> <20051104231800.GB25545@w-mikek2.ibm.com> <1131149070.29195.41.camel@gaston> <20051105003557.GC5397@w-mikek2.ibm.com> Message-ID: <1131151387.29195.43.camel@gaston> On Fri, 2005-11-04 at 16:35 -0800, Mike Kravetz wrote: > On Sat, Nov 05, 2005 at 11:04:30AM +1100, Benjamin Herrenschmidt wrote: > > On Fri, 2005-11-04 at 15:18 -0800, Mike Kravetz wrote: > > > Add the create_section_mapping() routine to create hptes for memory > > > sections dynamically added after system boot. > > > > > > Signed-off-by: Mike Kravetz > > > > This patch will have to be slightly reworked on top of the 64k pages > > one. It should be trivial though. > > OK. I'll respin on top of your patch at: > > http://gate.crashing.org/~benh/ppc64-64k-pages.diff > > Let me know if there is a different version going upstream I'll check if it still applied after linus pulls the next round of ppc updates Ben. From benh at kernel.crashing.org Sat Nov 5 11:44:47 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sat, 05 Nov 2005 11:44:47 +1100 Subject: [PATCH] ppc64: 64K pages support In-Reply-To: <20051105003819.GA11505@lst.de> References: <1130915220.20136.14.camel@gaston> <1130916198.20136.17.camel@gaston> <20051105003819.GA11505@lst.de> Message-ID: <1131151488.29195.46.camel@gaston> On Sat, 2005-11-05 at 01:38 +0100, Christoph Hellwig wrote: > So how does the 64k on 4k hardware emulation work? When Hugh did > bigger softpagesize for x86 based on 2.4.x he had to fix drivers all > over to deal with that. What was the problem with drivers ? On ppc64, it's all hidden in the arch code. All the kernel sees is a 64k page size. I extended the PTE to contain tracking informations for the 16 sub pages (HPTE bits & hash slot index). Sub pages are faulted on demand and flushed all at once, but it's all transparent to the generic code. Ben. From paulus at samba.org Sat Nov 5 11:46:44 2005 From: paulus at samba.org (Paul Mackerras) Date: Sat, 5 Nov 2005 11:46:44 +1100 Subject: [PATCH 0/42] PCI Error Recovery for PPC64 and misc device drivers In-Reply-To: <20051105002842.GB22574@kroah.com> References: <20051103235918.GA25616@mail.gnucash.org> <20051104221437.GA20004@kroah.com> <17259.63473.450876.276151@cargo.ozlabs.ibm.com> <20051105002842.GB22574@kroah.com> Message-ID: <17260.244.435084.466507@cargo.ozlabs.ibm.com> Greg KH writes: > Can I take 15, 16, 27-32 now without the ppc64 patches dying without it? Sorry, I'm having trouble parsing that. If you mean, will it break ppc64 if you send those patches to Linus before the ppc64 bits get in, the answer is no it won't, please send them on. Thanks, Paul. From greg at kroah.com Sat Nov 5 12:28:06 2005 From: greg at kroah.com (Greg KH) Date: Fri, 4 Nov 2005 17:28:06 -0800 Subject: [PATCH 0/42] PCI Error Recovery for PPC64 and misc device drivers In-Reply-To: <17260.244.435084.466507@cargo.ozlabs.ibm.com> References: <20051103235918.GA25616@mail.gnucash.org> <20051104221437.GA20004@kroah.com> <17259.63473.450876.276151@cargo.ozlabs.ibm.com> <20051105002842.GB22574@kroah.com> <17260.244.435084.466507@cargo.ozlabs.ibm.com> Message-ID: <20051105012806.GA23675@kroah.com> On Sat, Nov 05, 2005 at 11:46:44AM +1100, Paul Mackerras wrote: > Greg KH writes: > > > Can I take 15, 16, 27-32 now without the ppc64 patches dying without it? > > Sorry, I'm having trouble parsing that. If you mean, will it break > ppc64 if you send those patches to Linus before the ppc64 bits get in, > the answer is no it won't, please send them on. Ok, I'll go look at them and see if they can be added to my tree for testing in -mm before I'll send them to Linus. thanks, greg k-h From greg at kroah.com Sat Nov 5 17:11:14 2005 From: greg at kroah.com (Greg KH) Date: Fri, 4 Nov 2005 22:11:14 -0800 Subject: [PATCH 16/42]: PCI: PCI Error reporting callbacks In-Reply-To: <20051104005035.GA26929@mail.gnucash.org> References: <20051103235918.GA25616@mail.gnucash.org> <20051104005035.GA26929@mail.gnucash.org> Message-ID: <20051105061114.GA27016@kroah.com> On Thu, Nov 03, 2005 at 06:50:35PM -0600, Linas Vepstas wrote: > +/* ---------------------------------------------------------------- */ > +/** PCI error recovery infrastructure. If a PCI device driver provides > + * a set fof callbacks in struct pci_error_handlers, then that device driver > + * will be notified of PCI bus errors, and will be driven to recovery > + * when an error occurs. > + */ > + > +enum pcierr_result { > + PCIERR_RESULT_NONE=0, /* no result/none/not supported in device driver */ > + PCIERR_RESULT_CAN_RECOVER=1, /* Device driver can recover without slot reset */ > + PCIERR_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */ > + PCIERR_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */ > + PCIERR_RESULT_RECOVERED, /* Device driver is fully recovered and operational */ > +}; No, do not create new types of error or return codes. Use the standard -EFOO values. You can document what they should each return, and mean, but do not create new codes. Also, you create an enum, but yet do not use it in your function callback definition, which means you really didn't want to create it in the first place... I'll add 15 and 16 to my tree for now, so they will show up in -mm, but I want to see updated versions before sending them off to Linus. thanks, greg k-h From olh at suse.de Sat Nov 5 23:13:52 2005 From: olh at suse.de (Olaf Hering) Date: Sat, 5 Nov 2005 13:13:52 +0100 Subject: [PATCH] ppc64: add MODALIAS= for vio bus In-Reply-To: <20051030213900.GA22510@suse.de> References: <20051030213900.GA22510@suse.de> Message-ID: <20051105121352.GA8814@suse.de> A non-broken udev would autoload also the drivers for devices on the pseries vio bus, like ibmveth, ibmvscsic and hvsc. This is similar to pci, usb and ieee1394: /lib/modules/`uname -r`/modules.alias alias vio:TvscsiSIBM,v-scsi* ibmvscsic alias vio:TnetworkSIBM,l-lan* ibmveth alias vio:Tserial-serverShvterm2* hvcs /events/debug.00004.pci.add.1394:MODALIAS='pci:v00001014d00000188sv00000000sd00000000bc06sc04i0f' /events/debug.00005.pci.add.1509:MODALIAS='pci:v00008086d00001229sv00001014sd000001FFbc02sc00i00' /events/debug.00026.vio.add.1519:MODALIAS='vio:TserialShvterm1' /events/debug.00027.vio.add.1446:MODALIAS='vio:TvscsiSIBM,v-scsi' /events/debug.00028.vio.add.1451:MODALIAS='vio:TnetworkSIBM,l-lan' modprobe -v vio:TnetworkSIBM,l-lan insmod /lib/modules/2.6.14-20051030_vio-ppc64/kernel/drivers/net/ibmveth.ko Signed-off-by: Olaf Hering arch/powerpc/kernel/vio.c | 27 +++++++++++++++++++++++++++ 1 files changed, 27 insertions(+) Index: linux-2.6.14-olh/arch/powerpc/kernel/vio.c =================================================================== --- linux-2.6.14-olh.orig/arch/powerpc/kernel/vio.c +++ linux-2.6.14-olh/arch/powerpc/kernel/vio.c @@ -21,6 +21,7 @@ #include #include #include +#include static const struct vio_device_id *vio_match_device( const struct vio_device_id *, const struct vio_dev *); @@ -265,7 +266,33 @@ static int vio_bus_match(struct device * return (ids != NULL) && (vio_match_device(ids, vio_dev) != NULL); } +static int vio_hotplug(struct device *dev, char **envp, int num_envp, + char *buffer, int buffer_size) +{ + const struct vio_dev *vio_dev = to_vio_dev(dev); + char *cp; + int length; + + if (!num_envp) + return -ENOMEM; + + if (!vio_dev->dev.platform_data) + return -ENODEV; + cp = (char *)get_property(vio_dev->dev.platform_data, "compatible", &length); + if (!cp) + return -ENODEV; + + envp[0] = buffer; + length = scnprintf(buffer, buffer_size, "MODALIAS=vio:T%sS%s", + vio_dev->type, cp); + if (buffer_size - length <= 0) + return -ENOMEM; + envp[1] = NULL; + return 0; +} + struct bus_type vio_bus_type = { .name = "vio", + .hotplug = vio_hotplug, .match = vio_bus_match, }; -- short story of a lazy sysadmin: alias appserv=wotan From airlied at gmail.com Sat Nov 5 17:37:24 2005 From: airlied at gmail.com (Dave Airlie) Date: Sat, 5 Nov 2005 17:37:24 +1100 Subject: [PATCH] ppc64: 64K pages support In-Reply-To: <1131151488.29195.46.camel@gaston> References: <1130915220.20136.14.camel@gaston> <1130916198.20136.17.camel@gaston> <20051105003819.GA11505@lst.de> <1131151488.29195.46.camel@gaston> Message-ID: <21d7e9970511042237p618d6306qb63272a4fa2263ea@mail.gmail.com> > What was the problem with drivers ? On ppc64, it's all hidden in the > arch code. All the kernel sees is a 64k page size. I extended the PTE to > contain tracking informations for the 16 sub pages (HPTE bits & hash > slot index). Sub pages are faulted on demand and flushed all at once, > but it's all transparent to the generic code. > We did that with the VAX port about 5 years ago :-), granted for different reasons.. The VAX has 512 byte hw pages, we had to make a 4K pagesize for the kernel by grouping 8 hw pages together and hiding it all in the arch dir.. granted I don't know if it broke any drivers, we didn't have any... Dave. From david at gibson.dropbear.id.au Sun Nov 6 13:55:58 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Sun, 6 Nov 2005 13:55:58 +1100 Subject: powerpc: Consolidate asm compatibility macros In-Reply-To: <7487F450-429B-4836-AF05-DD47B02D5BC1@kernel.crashing.org> References: <20051104031609.GA962@localhost.localdomain> <7487F450-429B-4836-AF05-DD47B02D5BC1@kernel.crashing.org> Message-ID: <20051106025558.GC17292@localhost.localdomain> On Thu, Nov 03, 2005 at 11:33:14PM -0600, Kumar Gala wrote: > David, > > I hate to be anal, but I think keep the 'L' is useful in the macro > names. > > PPC_LD -> PPC_LL > > I read 'PPC_LD' as either "PPC load" or "PPC load double" never of > which is useful. How about "PPC_LL", which I read as "PPC load long". > > I would propose the following names which at least follow some PPC > naming convention: > > PPC_LL > PPC_STL > PPC_LLARX > PPC_STLCX Hrm.. I actually deliberately removed the L, on the grounds that "long" doesn't necessarily have a consistent meaning between C and asm. The idea is that all these operations work on the "natural" size for the arch, which is to say the size of the GPRs, which is to say the size of a C long. Up to you, paulus. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson From anton at samba.org Mon Nov 7 03:25:11 2005 From: anton at samba.org (Anton Blanchard) Date: Mon, 7 Nov 2005 03:25:11 +1100 Subject: powerpc: Kill ppcdebug In-Reply-To: <200511041134.30106.michael@ellerman.id.au> References: <20051104001653.GC29025@localhost.localdomain> <200511041134.30106.michael@ellerman.id.au> Message-ID: <20051106162511.GI12353@krispykreme> Hi, > I agree it's pretty ugly, but I thought the concept was at least nice, ie. > runttime enablable debugging. The current scheme of having to #define DEBUG > in a gazillion different files is pretty painful. I found it hard to use because it ended up printing way too much stuff that only the writer of that particular bit of debug code could understand. The PCI code was a case in point, I always went in and wrote my own debug code even though a lot of ppcdebug stuff existed. I agree some runtime debug stuff is useful (numa=debug has been very useful in the past), but Im not sure we need common infrastructure for this. Anton From olof at lixom.net Mon Nov 7 09:04:39 2005 From: olof at lixom.net (Olof Johansson) Date: Sun, 6 Nov 2005 14:04:39 -0800 Subject: [PATCH] powerpc: Nicer printing of address at oops Message-ID: <20051106220439.GA7166@pb15.lixom.net> Hi, Please apply. --- Add nicer printing of faulting address on unresolvable kernel faults. Makes life a little easier for those who don't know how to decode our register contents at oops time. Signed-off-by: Olof Johansson Index: 2.6/arch/powerpc/mm/fault.c =================================================================== --- 2.6.orig/arch/powerpc/mm/fault.c 2005-11-06 12:55:22.000000000 -0800 +++ 2.6/arch/powerpc/mm/fault.c 2005-11-06 14:02:20.000000000 -0800 @@ -389,5 +389,22 @@ void bad_page_fault(struct pt_regs *regs } /* kernel has accessed a bad area */ + + printk(KERN_ALERT "Unable to handle kernel paging request for "); + switch (regs->trap) { + case 0x300: + case 0x380: + printk("data at address 0x%016lx\n", regs->dar); + break; + case 0x400: + case 0x480: + printk("instruction fetch\n"); + break; + default: + printk("unknown fault\n"); + } + printk(KERN_ALERT "Faulting instruction address: 0x%016lx\n", + regs->nip); + die("Kernel access of bad area", regs, sig); } From paulus at samba.org Mon Nov 7 09:44:50 2005 From: paulus at samba.org (Paul Mackerras) Date: Mon, 7 Nov 2005 09:44:50 +1100 Subject: [PATCH] powerpc: Nicer printing of address at oops In-Reply-To: <20051106220439.GA7166@pb15.lixom.net> References: <20051106220439.GA7166@pb15.lixom.net> Message-ID: <17262.34658.917350.594965@cargo.ozlabs.ibm.com> Olof Johansson writes: > + printk("data at address 0x%016lx\n", regs->dar); Nice idea, but 16 digits is a bit excessive for 32-bit... Paul. From olof at lixom.net Mon Nov 7 09:54:36 2005 From: olof at lixom.net (Olof Johansson) Date: Sun, 6 Nov 2005 14:54:36 -0800 Subject: [PATCH] powerpc: Nicer printing of address at oops In-Reply-To: <17262.34658.917350.594965@cargo.ozlabs.ibm.com> References: <20051106220439.GA7166@pb15.lixom.net> <17262.34658.917350.594965@cargo.ozlabs.ibm.com> Message-ID: <20051106225435.GB7166@pb15.lixom.net> On Mon, Nov 07, 2005 at 09:44:50AM +1100, Paul Mackerras wrote: > Olof Johansson writes: > > > + printk("data at address 0x%016lx\n", regs->dar); > > Nice idea, but 16 digits is a bit excessive for 32-bit... Ack, it's shared now, I forgot. Thanks. Here's a new patch. I don't like the thought of ifdeffing for it so we'll just have to live with 8 digit padding for small 64-bit pointers. :-) -Olof Add nicer printing of faulting address on unresolvable kernel faults. Makes life a little easier for those who don't know how to decode our register contents at oops time. Signed-off-by: Olof Johansson Index: 2.6/arch/powerpc/mm/fault.c =================================================================== --- 2.6.orig/arch/powerpc/mm/fault.c 2005-11-06 12:55:22.000000000 -0800 +++ 2.6/arch/powerpc/mm/fault.c 2005-11-06 14:52:23.000000000 -0800 @@ -389,5 +389,22 @@ void bad_page_fault(struct pt_regs *regs } /* kernel has accessed a bad area */ + + printk(KERN_ALERT "Unable to handle kernel paging request for "); + switch (regs->trap) { + case 0x300: + case 0x380: + printk("data at address 0x%08lx\n", regs->dar); + break; + case 0x400: + case 0x480: + printk("instruction fetch\n"); + break; + default: + printk("unknown fault\n"); + } + printk(KERN_ALERT "Faulting instruction address: 0x%08lx\n", + regs->nip); + die("Kernel access of bad area", regs, sig); } From david at gibson.dropbear.id.au Mon Nov 7 09:49:43 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Mon, 7 Nov 2005 09:49:43 +1100 Subject: powerpc: Kill ppcdebug (revised) In-Reply-To: <20051104001653.GC29025@localhost.localdomain> References: <20051104001653.GC29025@localhost.localdomain> Message-ID: <20051106224943.GA11833@localhost.localdomain> The previous version of the kill ppcdebug patch confliced with Michael Ellerman's plpar_wrappers.h patch. Here's a revised version which fixes the conflict. The ancient ppcdebug/PPCDBG mechanism is now only used in two places. First, in the hash setup code, one of the bits allows the size of the hash table to be reduced by a factor of 8 - which would be better accomplished with a command line option for that purpose. The other was a bunch of bus walking related messages in the iSeries code, which would seem to be insufficient reason to keep the mechanism. This patch removes the last traces of this mechanism. Built and booted on iSeries and pSeries POWER5 LPAR (ARCH=powerpc). Signed-off-by: David Gibson Index: working-2.6/arch/powerpc/kernel/signal_32.c =================================================================== --- working-2.6.orig/arch/powerpc/kernel/signal_32.c 2005-11-04 14:18:13.000000000 +1100 +++ working-2.6/arch/powerpc/kernel/signal_32.c 2005-11-07 09:41:24.000000000 +1100 @@ -44,7 +44,6 @@ #include #ifdef CONFIG_PPC64 #include "ppc32.h" -#include #include #include #else Index: working-2.6/arch/powerpc/mm/init_64.c =================================================================== --- working-2.6.orig/arch/powerpc/mm/init_64.c 2005-10-31 15:20:20.000000000 +1100 +++ working-2.6/arch/powerpc/mm/init_64.c 2005-11-07 09:41:24.000000000 +1100 @@ -57,7 +57,6 @@ #include #include #include -#include #include #include #include Index: working-2.6/arch/powerpc/mm/pgtable_64.c =================================================================== --- working-2.6.orig/arch/powerpc/mm/pgtable_64.c 2005-10-31 15:44:59.000000000 +1100 +++ working-2.6/arch/powerpc/mm/pgtable_64.c 2005-11-07 09:41:24.000000000 +1100 @@ -59,7 +59,6 @@ #include #include #include -#include #include #include #include Index: working-2.6/arch/powerpc/platforms/iseries/smp.c =================================================================== --- working-2.6.orig/arch/powerpc/platforms/iseries/smp.c 2005-11-04 14:18:13.000000000 +1100 +++ working-2.6/arch/powerpc/platforms/iseries/smp.c 2005-11-07 09:41:24.000000000 +1100 @@ -40,7 +40,6 @@ #include #include #include -#include #include #include #include Index: working-2.6/arch/powerpc/platforms/pseries/iommu.c =================================================================== --- working-2.6.orig/arch/powerpc/platforms/pseries/iommu.c 2005-11-07 09:38:01.000000000 +1100 +++ working-2.6/arch/powerpc/platforms/pseries/iommu.c 2005-11-07 09:41:24.000000000 +1100 @@ -37,7 +37,6 @@ #include #include #include -#include #include #include #include Index: working-2.6/arch/powerpc/platforms/pseries/lpar.c =================================================================== --- working-2.6.orig/arch/powerpc/platforms/pseries/lpar.c 2005-11-07 09:38:01.000000000 +1100 +++ working-2.6/arch/powerpc/platforms/pseries/lpar.c 2005-11-07 09:41:35.000000000 +1100 @@ -31,13 +31,13 @@ #include #include #include -#include #include #include #include #include #include #include +#include #include "plpar_wrappers.h" Index: working-2.6/arch/powerpc/platforms/pseries/ras.c =================================================================== --- working-2.6.orig/arch/powerpc/platforms/pseries/ras.c 2005-10-31 15:20:20.000000000 +1100 +++ working-2.6/arch/powerpc/platforms/pseries/ras.c 2005-11-07 09:41:24.000000000 +1100 @@ -48,7 +48,7 @@ #include #include #include -#include +#include static unsigned char ras_log_buf[RTAS_ERROR_LOG_MAX]; static DEFINE_SPINLOCK(ras_log_buf_lock); Index: working-2.6/arch/ppc64/kernel/prom.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/prom.c 2005-10-31 15:44:59.000000000 +1100 +++ working-2.6/arch/ppc64/kernel/prom.c 2005-11-07 09:41:24.000000000 +1100 @@ -46,7 +46,6 @@ #include #include #include -#include #include #include #include Index: working-2.6/arch/ppc64/kernel/prom_init.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/prom_init.c 2005-11-04 14:18:13.000000000 +1100 +++ working-2.6/arch/ppc64/kernel/prom_init.c 2005-11-07 09:41:24.000000000 +1100 @@ -44,7 +44,6 @@ #include #include #include -#include #include #include #include Index: working-2.6/arch/powerpc/sysdev/u3_iommu.c =================================================================== --- working-2.6.orig/arch/powerpc/sysdev/u3_iommu.c 2005-11-04 14:18:13.000000000 +1100 +++ working-2.6/arch/powerpc/sysdev/u3_iommu.c 2005-11-07 09:41:24.000000000 +1100 @@ -37,7 +37,6 @@ #include #include #include -#include #include #include #include Index: working-2.6/arch/powerpc/kernel/setup_64.c =================================================================== --- working-2.6.orig/arch/powerpc/kernel/setup_64.c 2005-11-07 09:38:01.000000000 +1100 +++ working-2.6/arch/powerpc/kernel/setup_64.c 2005-11-07 09:41:24.000000000 +1100 @@ -41,7 +41,6 @@ #include #include #include -#include #include #include #include @@ -60,6 +59,7 @@ #include #include #include +#include #ifdef DEBUG #define DBG(fmt...) udbg_printf(fmt) @@ -244,12 +244,6 @@ DBG(" -> early_setup()\n"); /* - * Fill the default DBG level (do we want to keep - * that old mecanism around forever ?) - */ - ppcdbg_initialize(); - - /* * Do early initializations using the flattened device * tree, like retreiving the physical memory map or * calculating/retreiving the hash table size @@ -516,7 +510,6 @@ printk("-----------------------------------------------------\n"); printk("ppc64_pft_size = 0x%lx\n", ppc64_pft_size); - printk("ppc64_debug_switch = 0x%lx\n", ppc64_debug_switch); printk("ppc64_interrupt_controller = 0x%ld\n", ppc64_interrupt_controller); printk("systemcfg = 0x%p\n", systemcfg); printk("systemcfg->platform = 0x%x\n", systemcfg->platform); Index: working-2.6/include/asm-ppc64/ppcdebug.h =================================================================== --- working-2.6.orig/include/asm-ppc64/ppcdebug.h 2005-10-25 11:59:59.000000000 +1000 +++ /dev/null 1970-01-01 00:00:00.000000000 +0000 @@ -1,108 +0,0 @@ -#ifndef __PPCDEBUG_H -#define __PPCDEBUG_H -/******************************************************************** - * Author: Adam Litke, IBM Corp - * (c) 2001 - * - * This file contains definitions and macros for a runtime debugging - * system for ppc64 (This should also work on 32 bit with a few - * adjustments. - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * - ********************************************************************/ - -#include -#include -#include -#include - -#define PPCDBG_BITVAL(X) ((1UL)<<((unsigned long)(X))) - -/* Defined below are the bit positions of various debug flags in the - * ppc64_debug_switch variable. - * -- When adding new values, please enter them into trace names below -- - * - * Values 62 & 63 can be used to stress the hardware page table management - * code. They must be set statically, any attempt to change them dynamically - * would be a very bad idea. - */ -#define PPCDBG_MMINIT PPCDBG_BITVAL(0) -#define PPCDBG_MM PPCDBG_BITVAL(1) -#define PPCDBG_SYS32 PPCDBG_BITVAL(2) -#define PPCDBG_SYS32NI PPCDBG_BITVAL(3) -#define PPCDBG_SYS32X PPCDBG_BITVAL(4) -#define PPCDBG_SYS32M PPCDBG_BITVAL(5) -#define PPCDBG_SYS64 PPCDBG_BITVAL(6) -#define PPCDBG_SYS64NI PPCDBG_BITVAL(7) -#define PPCDBG_SYS64X PPCDBG_BITVAL(8) -#define PPCDBG_SIGNAL PPCDBG_BITVAL(9) -#define PPCDBG_SIGNALXMON PPCDBG_BITVAL(10) -#define PPCDBG_BINFMT32 PPCDBG_BITVAL(11) -#define PPCDBG_BINFMT64 PPCDBG_BITVAL(12) -#define PPCDBG_BINFMTXMON PPCDBG_BITVAL(13) -#define PPCDBG_BINFMT_32ADDR PPCDBG_BITVAL(14) -#define PPCDBG_ALIGNFIXUP PPCDBG_BITVAL(15) -#define PPCDBG_TCEINIT PPCDBG_BITVAL(16) -#define PPCDBG_TCE PPCDBG_BITVAL(17) -#define PPCDBG_PHBINIT PPCDBG_BITVAL(18) -#define PPCDBG_SMP PPCDBG_BITVAL(19) -#define PPCDBG_BOOT PPCDBG_BITVAL(20) -#define PPCDBG_BUSWALK PPCDBG_BITVAL(21) -#define PPCDBG_PROM PPCDBG_BITVAL(22) -#define PPCDBG_RTAS PPCDBG_BITVAL(23) -#define PPCDBG_HTABSTRESS PPCDBG_BITVAL(62) -#define PPCDBG_HTABSIZE PPCDBG_BITVAL(63) -#define PPCDBG_NONE (0UL) -#define PPCDBG_ALL (0xffffffffUL) - -/* The default initial value for the debug switch */ -#define PPC_DEBUG_DEFAULT 0 -/* #define PPC_DEBUG_DEFAULT PPCDBG_ALL */ - -#define PPCDBG_NUM_FLAGS 64 - -extern u64 ppc64_debug_switch; - -#ifdef WANT_PPCDBG_TAB -/* A table of debug switch names to allow name lookup in xmon - * (and whoever else wants it. - */ -char *trace_names[PPCDBG_NUM_FLAGS] = { - /* Known debug names */ - "mminit", "mm", - "syscall32", "syscall32_ni", "syscall32x", "syscall32m", - "syscall64", "syscall64_ni", "syscall64x", - "signal", "signal_xmon", - "binfmt32", "binfmt64", "binfmt_xmon", "binfmt_32addr", - "alignfixup", "tceinit", "tce", "phb_init", - "smp", "boot", "buswalk", "prom", - "rtas" -}; -#else -extern char *trace_names[64]; -#endif /* WANT_PPCDBG_TAB */ - -#ifdef CONFIG_PPCDBG -/* Macro to conditionally print debug based on debug_switch */ -#define PPCDBG(...) udbg_ppcdbg(__VA_ARGS__) - -/* Macro to conditionally call a debug routine based on debug_switch */ -#define PPCDBGCALL(FLAGS,FUNCTION) ifppcdebug(FLAGS) FUNCTION - -/* Macros to test for debug states */ -#define ifppcdebug(FLAGS) if (udbg_ifdebug(FLAGS)) -#define ppcdebugset(FLAGS) (udbg_ifdebug(FLAGS)) -#define PPCDBG_BINFMT (test_thread_flag(TIF_32BIT) ? PPCDBG_BINFMT32 : PPCDBG_BINFMT64) - -#else -#define PPCDBG(...) do {;} while (0) -#define PPCDBGCALL(FLAGS,FUNCTION) do {;} while (0) -#define ifppcdebug(...) if (0) -#define ppcdebugset(FLAGS) (0) -#endif /* CONFIG_PPCDBG */ - -#endif /*__PPCDEBUG_H */ Index: working-2.6/arch/ppc64/kernel/udbg.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/udbg.c 2005-10-25 11:59:53.000000000 +1000 +++ working-2.6/arch/ppc64/kernel/udbg.c 2005-11-07 09:41:24.000000000 +1100 @@ -10,12 +10,10 @@ */ #include -#define WANT_PPCDBG_TAB /* Only defined here */ #include #include #include #include -#include #include void (*udbg_putc)(unsigned char c); @@ -89,59 +87,6 @@ va_end(args); } -/* PPCDBG stuff */ - -u64 ppc64_debug_switch; - -/* Special print used by PPCDBG() macro */ -void udbg_ppcdbg(unsigned long debug_flags, const char *fmt, ...) -{ - unsigned long active_debugs = debug_flags & ppc64_debug_switch; - - if (active_debugs) { - va_list ap; - unsigned char buf[UDBG_BUFSIZE]; - unsigned long i, len = 0; - - for (i=0; i < PPCDBG_NUM_FLAGS; i++) { - if (((1U << i) & active_debugs) && - trace_names[i]) { - len += strlen(trace_names[i]); - udbg_puts(trace_names[i]); - break; - } - } - - snprintf(buf, UDBG_BUFSIZE, " [%s]: ", current->comm); - len += strlen(buf); - udbg_puts(buf); - - while (len < 18) { - udbg_puts(" "); - len++; - } - - va_start(ap, fmt); - vsnprintf(buf, UDBG_BUFSIZE, fmt, ap); - udbg_puts(buf); - va_end(ap); - } -} - -unsigned long udbg_ifdebug(unsigned long flags) -{ - return (flags & ppc64_debug_switch); -} - -/* - * Initialize the PPCDBG state. Called before relocation has been enabled. - */ -void __init ppcdbg_initialize(void) -{ - ppc64_debug_switch = PPC_DEBUG_DEFAULT; /* | PPCDBG_BUSWALK | */ - /* PPCDBG_PHBINIT | PPCDBG_MM | PPCDBG_MMINIT | PPCDBG_TCEINIT | PPCDBG_TCE */; -} - /* * Early boot console based on udbg */ Index: working-2.6/include/asm-ppc64/udbg.h =================================================================== --- working-2.6.orig/include/asm-ppc64/udbg.h 2005-10-31 15:20:22.000000000 +1100 +++ working-2.6/include/asm-ppc64/udbg.h 2005-11-07 09:41:24.000000000 +1100 @@ -23,9 +23,6 @@ extern void register_early_udbg_console(void); extern void udbg_printf(const char *fmt, ...); -extern void udbg_ppcdbg(unsigned long flags, const char *fmt, ...); -extern unsigned long udbg_ifdebug(unsigned long flags); -extern void __init ppcdbg_initialize(void); extern void udbg_init_uart(void __iomem *comport, unsigned int speed); Index: working-2.6/arch/powerpc/mm/hash_utils_64.c =================================================================== --- working-2.6.orig/arch/powerpc/mm/hash_utils_64.c 2005-10-31 15:20:20.000000000 +1100 +++ working-2.6/arch/powerpc/mm/hash_utils_64.c 2005-11-07 09:41:24.000000000 +1100 @@ -32,7 +32,6 @@ #include #include -#include #include #include #include @@ -194,12 +193,6 @@ htab_size_bytes = get_hashtable_size(); pteg_count = htab_size_bytes >> 7; - /* For debug, make the HTAB 1/8 as big as it normally would be. */ - ifppcdebug(PPCDBG_HTABSIZE) { - pteg_count >>= 3; - htab_size_bytes = pteg_count << 7; - } - htab_hash_mask = pteg_count - 1; if (systemcfg->platform & PLATFORM_LPAR) { Index: working-2.6/arch/powerpc/platforms/iseries/irq.c =================================================================== --- working-2.6.orig/arch/powerpc/platforms/iseries/irq.c 2005-11-04 14:18:13.000000000 +1100 +++ working-2.6/arch/powerpc/platforms/iseries/irq.c 2005-11-07 09:41:24.000000000 +1100 @@ -35,7 +35,6 @@ #include #include -#include #include #include #include @@ -227,8 +226,6 @@ /* Unmask secondary INTA */ mask = 0x80000000; HvCallPci_unmaskInterrupts(bus, subBus, deviceId, mask); - PPCDBG(PPCDBG_BUSWALK, "iSeries_enable_IRQ 0x%02X.%02X.%02X 0x%04X\n", - bus, subBus, deviceId, irq); } /* This is called by iSeries_activate_IRQs */ @@ -310,8 +307,6 @@ /* Mask secondary INTA */ mask = 0x80000000; HvCallPci_maskInterrupts(bus, subBus, deviceId, mask); - PPCDBG(PPCDBG_BUSWALK, "iSeries_disable_IRQ 0x%02X.%02X.%02X 0x%04X\n", - bus, subBus, deviceId, irq); } /* Index: working-2.6/arch/powerpc/platforms/iseries/pci.c =================================================================== --- working-2.6.orig/arch/powerpc/platforms/iseries/pci.c 2005-11-04 14:18:13.000000000 +1100 +++ working-2.6/arch/powerpc/platforms/iseries/pci.c 2005-11-07 09:41:24.000000000 +1100 @@ -32,7 +32,6 @@ #include #include #include -#include #include #include @@ -207,10 +206,6 @@ struct device_node *node; struct pci_dn *pdn; - PPCDBG(PPCDBG_BUSWALK, - "-build_device_node 0x%02X.%02X.%02X Function: %02X\n", - Bus, SubBus, AgentId, Function); - node = kmalloc(sizeof(struct device_node), GFP_KERNEL); if (node == NULL) return NULL; @@ -243,8 +238,6 @@ struct pci_controller *phb; HvBusNumber bus; - PPCDBG(PPCDBG_BUSWALK, "find_and_init_phbs Entry\n"); - /* Check all possible buses. */ for (bus = 0; bus < 256; bus++) { int ret = HvCallXm_testBus(bus); @@ -261,9 +254,6 @@ phb->last_busno = bus; phb->ops = &iSeries_pci_ops; - PPCDBG(PPCDBG_BUSWALK, "PCI:Create iSeries pci_controller(%p), Bus: %04X\n", - phb, bus); - /* Find and connect the devices. */ scan_PHB_slots(phb); } @@ -285,11 +275,9 @@ */ void iSeries_pcibios_init(void) { - PPCDBG(PPCDBG_BUSWALK, "iSeries_pcibios_init Entry.\n"); iomm_table_initialize(); find_and_init_phbs(); io_page_mask = -1; - PPCDBG(PPCDBG_BUSWALK, "iSeries_pcibios_init Exit.\n"); } /* @@ -301,8 +289,6 @@ struct device_node *node; int DeviceCount = 0; - PPCDBG(PPCDBG_BUSWALK, "iSeries_pcibios_fixup Entry.\n"); - /* Fix up at the device node and pci_dev relationship */ mf_display_src(0xC9000100); @@ -316,9 +302,6 @@ ++DeviceCount; pdev->sysdata = (void *)node; PCI_DN(node)->pcidev = pdev; - PPCDBG(PPCDBG_BUSWALK, - "pdev 0x%p <==> DevNode 0x%p\n", - pdev, node); allocate_device_bars(pdev); iSeries_Device_Information(pdev, DeviceCount); iommu_devnode_init_iSeries(node); @@ -333,13 +316,10 @@ void pcibios_fixup_bus(struct pci_bus *PciBus) { - PPCDBG(PPCDBG_BUSWALK, "iSeries_pcibios_fixup_bus(0x%04X) Entry.\n", - PciBus->number); } void pcibios_fixup_resources(struct pci_dev *pdev) { - PPCDBG(PPCDBG_BUSWALK, "fixup_resources pdev %p\n", pdev); } /* @@ -401,9 +381,6 @@ printk("found device at bus %d idsel %d func %d (AgentId %x)\n", bus, IdSel, Function, AgentId); /* Connect EADs: 0x18.00.12 = 0x00 */ - PPCDBG(PPCDBG_BUSWALK, - "PCI:Connect EADs: 0x%02X.%02X.%02X\n", - bus, SubBus, AgentId); HvRc = HvCallPci_getBusUnitInfo(bus, SubBus, AgentId, iseries_hv_addr(BridgeInfo), sizeof(struct HvCallPci_BridgeInfo)); @@ -414,14 +391,6 @@ BridgeInfo->maxAgents, BridgeInfo->maxSubBusNumber, BridgeInfo->logicalSlotNumber); - PPCDBG(PPCDBG_BUSWALK, - "PCI: BridgeInfo, Type:0x%02X, SubBus:0x%02X, MaxAgents:0x%02X, MaxSubBus: 0x%02X, LSlot: 0x%02X\n", - BridgeInfo->busUnitInfo.deviceType, - BridgeInfo->subBusNumber, - BridgeInfo->maxAgents, - BridgeInfo->maxSubBusNumber, - BridgeInfo->logicalSlotNumber); - if (BridgeInfo->busUnitInfo.deviceType == HvCallPci_BridgeDevice) { /* Scan_Bridge_Slot...: 0x18.00.12 */ @@ -454,9 +423,6 @@ /* iSeries_allocate_IRQ.: 0x18.00.12(0xA3) */ Irq = iSeries_allocate_IRQ(Bus, 0, EADsIdSel); - PPCDBG(PPCDBG_BUSWALK, - "PCI:- allocate and assign IRQ 0x%02X.%02X.%02X = 0x%02X\n", - Bus, 0, EADsIdSel, Irq); /* * Connect all functions of any device found. @@ -482,9 +448,6 @@ printk("read vendor ID: %x\n", VendorId); /* FoundDevice: 0x18.28.10 = 0x12AE */ - PPCDBG(PPCDBG_BUSWALK, - "PCI:- FoundDevice: 0x%02X.%02X.%02X = 0x%04X, irq %d\n", - Bus, SubBus, AgentId, VendorId, Irq); HvRc = HvCallPci_configStore8(Bus, SubBus, AgentId, PCI_INTERRUPT_LINE, Irq); if (HvRc != 0) Index: working-2.6/arch/powerpc/platforms/iseries/setup.c =================================================================== --- working-2.6.orig/arch/powerpc/platforms/iseries/setup.c 2005-11-04 14:18:13.000000000 +1100 +++ working-2.6/arch/powerpc/platforms/iseries/setup.c 2005-11-07 09:41:24.000000000 +1100 @@ -71,8 +71,6 @@ #endif /* Function Prototypes */ -extern void ppcdbg_initialize(void); - static void build_iSeries_Memory_Map(void); static void iseries_shared_idle(void); static void iseries_dedicated_idle(void); @@ -309,8 +307,6 @@ ppc64_firmware_features = FW_FEATURE_ISERIES; - ppcdbg_initialize(); - ppc64_interrupt_controller = IC_ISERIES; #if defined(CONFIG_BLK_DEV_INITRD) Index: working-2.6/arch/ppc64/Kconfig.debug =================================================================== --- working-2.6.orig/arch/ppc64/Kconfig.debug 2005-10-25 11:59:53.000000000 +1000 +++ working-2.6/arch/ppc64/Kconfig.debug 2005-11-07 09:41:24.000000000 +1100 @@ -55,10 +55,6 @@ xmon is normally disabled unless booted with 'xmon=on'. Use 'xmon=off' to disable xmon init during runtime. -config PPCDBG - bool "Include PPCDBG realtime debugging" - depends on DEBUG_KERNEL - config IRQSTACKS bool "Use separate kernel stacks when processing interrupts" help Index: working-2.6/arch/powerpc/kernel/signal_64.c =================================================================== --- working-2.6.orig/arch/powerpc/kernel/signal_64.c 2005-11-04 14:18:13.000000000 +1100 +++ working-2.6/arch/powerpc/kernel/signal_64.c 2005-11-07 09:41:24.000000000 +1100 @@ -33,7 +33,6 @@ #include #include #include -#include #include #include #include -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson From paulus at samba.org Mon Nov 7 10:25:39 2005 From: paulus at samba.org (Paul Mackerras) Date: Mon, 7 Nov 2005 10:25:39 +1100 Subject: [PATCH 16/42]: PCI: PCI Error reporting callbacks In-Reply-To: <20051105061114.GA27016@kroah.com> References: <20051103235918.GA25616@mail.gnucash.org> <20051104005035.GA26929@mail.gnucash.org> <20051105061114.GA27016@kroah.com> Message-ID: <17262.37107.857718.184055@cargo.ozlabs.ibm.com> Greg KH writes: > > +enum pcierr_result { > > + PCIERR_RESULT_NONE=0, /* no result/none/not supported in device driver */ > > + PCIERR_RESULT_CAN_RECOVER=1, /* Device driver can recover without slot reset */ > > + PCIERR_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */ > > + PCIERR_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */ > > + PCIERR_RESULT_RECOVERED, /* Device driver is fully recovered and operational */ > > +}; > > No, do not create new types of error or return codes. Use the standard > -EFOO values. You can document what they should each return, and mean, > but do not create new codes. Actually, these are not error or return codes, but rather requested actions (maybe somewhat misnamed). We can map them on to -EFOO values but it will be rather strained (-ECONNRESET for "please reset the slot", anyone? :). > Also, you create an enum, but yet do not use it in your function > callback definition, which means you really didn't want to create it in > the first place... Yes, they could be #defines. Paul. From paulus at samba.org Mon Nov 7 11:59:42 2005 From: paulus at samba.org (Paul Mackerras) Date: Mon, 7 Nov 2005 11:59:42 +1100 Subject: [PATCH 4/4] Memory Add Fixes for ppc64 In-Reply-To: <20051104232109.GE25545@w-mikek2.ibm.com> References: <20051104231552.GA25545@w-mikek2.ibm.com> <20051104232109.GE25545@w-mikek2.ibm.com> Message-ID: <17262.42750.810366.294231@cargo.ozlabs.ibm.com> Mike Kravetz writes: > ppc64 needs a special sysfs probe file for adding new memory. > > Signed-off-by: Mike Kravetz > > diff -Naupr linux-2.6.14-git7/arch/ppc64/Kconfig linux-2.6.14-git7.work/arch/ppc64/Kconfig > --- linux-2.6.14-git7/arch/ppc64/Kconfig 2005-11-04 21:21:06.000000000 +0000 > +++ linux-2.6.14-git7.work/arch/ppc64/Kconfig 2005-11-04 22:11:16.000000000 +0000 > @@ -277,6 +277,10 @@ config HAVE_ARCH_EARLY_PFN_TO_NID > def_bool y > depends on NEED_MULTIPLE_NODES > > +config ARCH_MEMORY_PROBE > + def_bool y > + depends on MEMORY_HOTPLUG > + Does arch/powerpc/Kconfig need a similar fix then? Paul. From paulus at samba.org Mon Nov 7 12:05:23 2005 From: paulus at samba.org (Paul Mackerras) Date: Mon, 7 Nov 2005 12:05:23 +1100 Subject: powerpc: Consolidate asm compatibility macros In-Reply-To: <20051106025558.GC17292@localhost.localdomain> References: <20051104031609.GA962@localhost.localdomain> <7487F450-429B-4836-AF05-DD47B02D5BC1@kernel.crashing.org> <20051106025558.GC17292@localhost.localdomain> Message-ID: <17262.43091.764598.779557@cargo.ozlabs.ibm.com> David Gibson writes: > Hrm.. I actually deliberately removed the L, on the grounds that > "long" doesn't necessarily have a consistent meaning between C and > asm. The idea is that all these operations work on the "natural" size > for the arch, which is to say the size of the GPRs, which is to say > the size of a C long. Yes, long is the type that (happens to) correspond to a register width, so I like Kumar's suggestion. Paul. From benh at kernel.crashing.org Mon Nov 7 14:22:18 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 07 Nov 2005 14:22:18 +1100 Subject: [PATCH] ppc64: Fix zImage boot Message-ID: <1131333739.5229.148.camel@gaston> The zImage wrapper has a bug where it doesn't claim() the memory for the kernel properly, it forgets to take into account the offset between the ELF header and the kernel itself. This results on some machines, like G5s, into a kernel that crashes at boot when clearing the BSS. Signed-off-by: Benjamin Herrenschmidt Index: linux-work/arch/ppc64/boot/main.c =================================================================== --- linux-work.orig/arch/ppc64/boot/main.c 2005-11-01 14:13:53.000000000 +1100 +++ linux-work/arch/ppc64/boot/main.c 2005-11-07 14:20:54.000000000 +1100 @@ -203,8 +203,15 @@ if (elf64ph->p_type == PT_LOAD && elf64ph->p_offset != 0) break; } - vmlinux.size = (unsigned long)elf64ph->p_filesz; - vmlinux.memsize = (unsigned long)elf64ph->p_memsz; + vmlinux.size = (unsigned long)elf64ph->p_filesz + + (unsigned long)elf64ph->p_offset; + /* We need to claim the memsize plus the file offset since gzip + * will expand the header (file offset), then the kernel, then + * possible rubbish we don't care about. But the kernel bss must + * be claimed (it will be zero'd by the kernel itself) + */ + vmlinux.memsize = (unsigned long)elf64ph->p_memsz + + (unsigned long)elf64ph->p_offset; printf("Allocating 0x%lx bytes for kernel ...\n\r", vmlinux.memsize); vmlinux.addr = try_claim(vmlinux.memsize); if (vmlinux.addr == 0) { From benh at kernel.crashing.org Mon Nov 7 14:27:33 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 07 Nov 2005 14:27:33 +1100 Subject: [PATCH] ppc64: SMU based macs cpufreq support Message-ID: <1131334053.5229.151.camel@gaston> CPU freq support using 970FX powertune facility for iMac G5 and SMU based single CPU desktop. Signed-off-by: Benjamin Herrenschmidt Index: linux-work/arch/ppc64/kernel/misc.S =================================================================== --- linux-work.orig/arch/ppc64/kernel/misc.S 2005-11-07 11:58:28.000000000 +1100 +++ linux-work/arch/ppc64/kernel/misc.S 2005-11-07 11:58:31.000000000 +1100 @@ -560,7 +560,7 @@ isync blr - /* +/* * Do an IO access in real mode */ _GLOBAL(real_writeb) @@ -593,6 +593,76 @@ #endif /* defined(CONFIG_PPC_PMAC) || defined(CONFIG_PPC_MAPLE) */ /* + * SCOM access functions for 970 (FX only for now) + * + * unsigned long scom970_read(unsigned int address); + * void scom970_write(unsigned int address, unsigned long value); + * + * The address passed in is the 24 bits register address. This code + * is 970 specific and will not check the status bits, so you should + * know what you are doing. + */ +_GLOBAL(scom970_read) + /* interrupts off */ + mfmsr r4 + ori r0,r4,MSR_EE + xori r0,r0,MSR_EE + mtmsrd r0,1 + + /* rotate 24 bits SCOM address 8 bits left and mask out it's low 8 bits + * (including parity). On current CPUs they must be 0'd, + * and finally or in RW bit + */ + rlwinm r3,r3,8,0,15 + ori r3,r3,0x8000 + + /* do the actual scom read */ + sync + mtspr SPRN_SCOMC,r3 + isync + mfspr r3,SPRN_SCOMD + isync + mfspr r0,SPRN_SCOMC + isync + + /* XXX: fixup result on some buggy 970's (ouch ! we lost a bit, bah + * that's the best we can do). Not implemented yet as we don't use + * the scom on any of the bogus CPUs yet, but may have to be done + * ultimately + */ + + /* restore interrupts */ + mtmsrd r4,1 + blr + + +_GLOBAL(scom970_write) + /* interrupts off */ + mfmsr r5 + ori r0,r5,MSR_EE + xori r0,r0,MSR_EE + mtmsrd r0,1 + + /* rotate 24 bits SCOM address 8 bits left and mask out it's low 8 bits + * (including parity). On current CPUs they must be 0'd. + */ + + rlwinm r3,r3,8,0,15 + + sync + mtspr SPRN_SCOMD,r4 /* write data */ + isync + mtspr SPRN_SCOMC,r3 /* write command */ + isync + mfspr 3,SPRN_SCOMC + isync + + /* restore interrupts */ + mtmsrd r5,1 + blr + + +/* * Create a kernel thread * kernel_thread(fn, arg, flags) */ Index: linux-work/arch/ppc64/Kconfig =================================================================== --- linux-work.orig/arch/ppc64/Kconfig 2005-11-07 11:58:28.000000000 +1100 +++ linux-work/arch/ppc64/Kconfig 2005-11-07 11:58:31.000000000 +1100 @@ -169,6 +169,16 @@ support. As of this writing the exact hardware interface is strongly in flux, so no good recommendation can be made. +source "drivers/cpufreq/Kconfig" + +config CPU_FREQ_PMAC64 + bool "Support for some Apple G5s" + depends on CPU_FREQ && PMAC_SMU && PPC64 + select CPU_FREQ_TABLE + help + This adds support for frequency switching on Apple iMac G5, + and some of the more recent desktop G5 machines as well. + config IBMVIO depends on PPC_PSERIES || PPC_ISERIES bool Index: linux-work/drivers/macintosh/smu.c =================================================================== --- linux-work.orig/drivers/macintosh/smu.c 2005-11-07 11:58:28.000000000 +1100 +++ linux-work/drivers/macintosh/smu.c 2005-11-07 11:58:31.000000000 +1100 @@ -845,6 +845,18 @@ return 0; } +struct smu_sdbp_header *smu_get_sdb_partition(int id, unsigned int *size) +{ + char pname[32]; + + if (!smu) + return NULL; + + sprintf(pname, "sdb-partition-%02x", id); + return (struct smu_sdbp_header *)get_property(smu->of_node, + pname, size); +} +EXPORT_SYMBOL(smu_get_sdb_partition); /* Index: linux-work/arch/powerpc/platforms/powermac/cpufreq.c =================================================================== --- linux-work.orig/arch/powerpc/platforms/powermac/cpufreq.c 2005-11-07 11:58:28.000000000 +1100 +++ /dev/null 1970-01-01 00:00:00.000000000 +0000 @@ -1,726 +0,0 @@ -/* - * arch/ppc/platforms/pmac_cpufreq.c - * - * Copyright (C) 2002 - 2005 Benjamin Herrenschmidt - * Copyright (C) 2004 John Steele Scott - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - * - * TODO: Need a big cleanup here. Basically, we need to have different - * cpufreq_driver structures for the different type of HW instead of the - * current mess. We also need to better deal with the detection of the - * type of machine. - * - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -/* WARNING !!! This will cause calibrate_delay() to be called, - * but this is an __init function ! So you MUST go edit - * init/main.c to make it non-init before enabling DEBUG_FREQ - */ -#undef DEBUG_FREQ - -/* - * There is a problem with the core cpufreq code on SMP kernels, - * it won't recalculate the Bogomips properly - */ -#ifdef CONFIG_SMP -#warning "WARNING, CPUFREQ not recommended on SMP kernels" -#endif - -extern void low_choose_7447a_dfs(int dfs); -extern void low_choose_750fx_pll(int pll); -extern void low_sleep_handler(void); - -/* - * Currently, PowerMac cpufreq supports only high & low frequencies - * that are set by the firmware - */ -static unsigned int low_freq; -static unsigned int hi_freq; -static unsigned int cur_freq; -static unsigned int sleep_freq; - -/* - * Different models uses different mecanisms to switch the frequency - */ -static int (*set_speed_proc)(int low_speed); -static unsigned int (*get_speed_proc)(void); - -/* - * Some definitions used by the various speedprocs - */ -static u32 voltage_gpio; -static u32 frequency_gpio; -static u32 slew_done_gpio; -static int no_schedule; -static int has_cpu_l2lve; -static int is_pmu_based; - -/* There are only two frequency states for each processor. Values - * are in kHz for the time being. - */ -#define CPUFREQ_HIGH 0 -#define CPUFREQ_LOW 1 - -static struct cpufreq_frequency_table pmac_cpu_freqs[] = { - {CPUFREQ_HIGH, 0}, - {CPUFREQ_LOW, 0}, - {0, CPUFREQ_TABLE_END}, -}; - -static struct freq_attr* pmac_cpu_freqs_attr[] = { - &cpufreq_freq_attr_scaling_available_freqs, - NULL, -}; - -static inline void local_delay(unsigned long ms) -{ - if (no_schedule) - mdelay(ms); - else - msleep(ms); -} - -#ifdef DEBUG_FREQ -static inline void debug_calc_bogomips(void) -{ - /* This will cause a recalc of bogomips and display the - * result. We backup/restore the value to avoid affecting the - * core cpufreq framework's own calculation. - */ - extern void calibrate_delay(void); - - unsigned long save_lpj = loops_per_jiffy; - calibrate_delay(); - loops_per_jiffy = save_lpj; -} -#endif /* DEBUG_FREQ */ - -/* Switch CPU speed under 750FX CPU control - */ -static int cpu_750fx_cpu_speed(int low_speed) -{ - u32 hid2; - - if (low_speed == 0) { - /* ramping up, set voltage first */ - pmac_call_feature(PMAC_FTR_WRITE_GPIO, NULL, voltage_gpio, 0x05); - /* Make sure we sleep for at least 1ms */ - local_delay(10); - - /* tweak L2 for high voltage */ - if (has_cpu_l2lve) { - hid2 = mfspr(SPRN_HID2); - hid2 &= ~0x2000; - mtspr(SPRN_HID2, hid2); - } - } -#ifdef CONFIG_6xx - low_choose_750fx_pll(low_speed); -#endif - if (low_speed == 1) { - /* tweak L2 for low voltage */ - if (has_cpu_l2lve) { - hid2 = mfspr(SPRN_HID2); - hid2 |= 0x2000; - mtspr(SPRN_HID2, hid2); - } - - /* ramping down, set voltage last */ - pmac_call_feature(PMAC_FTR_WRITE_GPIO, NULL, voltage_gpio, 0x04); - local_delay(10); - } - - return 0; -} - -static unsigned int cpu_750fx_get_cpu_speed(void) -{ - if (mfspr(SPRN_HID1) & HID1_PS) - return low_freq; - else - return hi_freq; -} - -/* Switch CPU speed using DFS */ -static int dfs_set_cpu_speed(int low_speed) -{ - if (low_speed == 0) { - /* ramping up, set voltage first */ - pmac_call_feature(PMAC_FTR_WRITE_GPIO, NULL, voltage_gpio, 0x05); - /* Make sure we sleep for at least 1ms */ - local_delay(1); - } - - /* set frequency */ -#ifdef CONFIG_6xx - low_choose_7447a_dfs(low_speed); -#endif - udelay(100); - - if (low_speed == 1) { - /* ramping down, set voltage last */ - pmac_call_feature(PMAC_FTR_WRITE_GPIO, NULL, voltage_gpio, 0x04); - local_delay(1); - } - - return 0; -} - -static unsigned int dfs_get_cpu_speed(void) -{ - if (mfspr(SPRN_HID1) & HID1_DFS) - return low_freq; - else - return hi_freq; -} - - -/* Switch CPU speed using slewing GPIOs - */ -static int gpios_set_cpu_speed(int low_speed) -{ - int gpio, timeout = 0; - - /* If ramping up, set voltage first */ - if (low_speed == 0) { - pmac_call_feature(PMAC_FTR_WRITE_GPIO, NULL, voltage_gpio, 0x05); - /* Delay is way too big but it's ok, we schedule */ - local_delay(10); - } - - /* Set frequency */ - gpio = pmac_call_feature(PMAC_FTR_READ_GPIO, NULL, frequency_gpio, 0); - if (low_speed == ((gpio & 0x01) == 0)) - goto skip; - - pmac_call_feature(PMAC_FTR_WRITE_GPIO, NULL, frequency_gpio, - low_speed ? 0x04 : 0x05); - udelay(200); - do { - if (++timeout > 100) - break; - local_delay(1); - gpio = pmac_call_feature(PMAC_FTR_READ_GPIO, NULL, slew_done_gpio, 0); - } while((gpio & 0x02) == 0); - skip: - /* If ramping down, set voltage last */ - if (low_speed == 1) { - pmac_call_feature(PMAC_FTR_WRITE_GPIO, NULL, voltage_gpio, 0x04); - /* Delay is way too big but it's ok, we schedule */ - local_delay(10); - } - -#ifdef DEBUG_FREQ - debug_calc_bogomips(); -#endif - - return 0; -} - -/* Switch CPU speed under PMU control - */ -static int pmu_set_cpu_speed(int low_speed) -{ - struct adb_request req; - unsigned long save_l2cr; - unsigned long save_l3cr; - unsigned int pic_prio; - unsigned long flags; - - preempt_disable(); - -#ifdef DEBUG_FREQ - printk(KERN_DEBUG "HID1, before: %x\n", mfspr(SPRN_HID1)); -#endif - pmu_suspend(); - - /* Disable all interrupt sources on openpic */ - pic_prio = mpic_cpu_get_priority(); - mpic_cpu_set_priority(0xf); - - /* Make sure the decrementer won't interrupt us */ - asm volatile("mtdec %0" : : "r" (0x7fffffff)); - /* Make sure any pending DEC interrupt occuring while we did - * the above didn't re-enable the DEC */ - mb(); - asm volatile("mtdec %0" : : "r" (0x7fffffff)); - - /* We can now disable MSR_EE */ - local_irq_save(flags); - - /* Giveup the FPU & vec */ - enable_kernel_fp(); - -#ifdef CONFIG_ALTIVEC - if (cpu_has_feature(CPU_FTR_ALTIVEC)) - enable_kernel_altivec(); -#endif /* CONFIG_ALTIVEC */ - - /* Save & disable L2 and L3 caches */ - save_l3cr = _get_L3CR(); /* (returns -1 if not available) */ - save_l2cr = _get_L2CR(); /* (returns -1 if not available) */ - - /* Send the new speed command. My assumption is that this command - * will cause PLL_CFG[0..3] to be changed next time CPU goes to sleep - */ - pmu_request(&req, NULL, 6, PMU_CPU_SPEED, 'W', 'O', 'O', 'F', low_speed); - while (!req.complete) - pmu_poll(); - - /* Prepare the northbridge for the speed transition */ - pmac_call_feature(PMAC_FTR_SLEEP_STATE,NULL,1,1); - - /* Call low level code to backup CPU state and recover from - * hardware reset - */ - low_sleep_handler(); - - /* Restore the northbridge */ - pmac_call_feature(PMAC_FTR_SLEEP_STATE,NULL,1,0); - - /* Restore L2 cache */ - if (save_l2cr != 0xffffffff && (save_l2cr & L2CR_L2E) != 0) - _set_L2CR(save_l2cr); - /* Restore L3 cache */ - if (save_l3cr != 0xffffffff && (save_l3cr & L3CR_L3E) != 0) - _set_L3CR(save_l3cr); - - /* Restore userland MMU context */ - set_context(current->active_mm->context, current->active_mm->pgd); - -#ifdef DEBUG_FREQ - printk(KERN_DEBUG "HID1, after: %x\n", mfspr(SPRN_HID1)); -#endif - - /* Restore low level PMU operations */ - pmu_unlock(); - - /* Restore decrementer */ - wakeup_decrementer(); - - /* Restore interrupts */ - mpic_cpu_set_priority(pic_prio); - - /* Let interrupts flow again ... */ - local_irq_restore(flags); - -#ifdef DEBUG_FREQ - debug_calc_bogomips(); -#endif - - pmu_resume(); - - preempt_enable(); - - return 0; -} - -static int do_set_cpu_speed(int speed_mode, int notify) -{ - struct cpufreq_freqs freqs; - unsigned long l3cr; - static unsigned long prev_l3cr; - - freqs.old = cur_freq; - freqs.new = (speed_mode == CPUFREQ_HIGH) ? hi_freq : low_freq; - freqs.cpu = smp_processor_id(); - - if (freqs.old == freqs.new) - return 0; - - if (notify) - cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE); - if (speed_mode == CPUFREQ_LOW && - cpu_has_feature(CPU_FTR_L3CR)) { - l3cr = _get_L3CR(); - if (l3cr & L3CR_L3E) { - prev_l3cr = l3cr; - _set_L3CR(0); - } - } - set_speed_proc(speed_mode == CPUFREQ_LOW); - if (speed_mode == CPUFREQ_HIGH && - cpu_has_feature(CPU_FTR_L3CR)) { - l3cr = _get_L3CR(); - if ((prev_l3cr & L3CR_L3E) && l3cr != prev_l3cr) - _set_L3CR(prev_l3cr); - } - if (notify) - cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE); - cur_freq = (speed_mode == CPUFREQ_HIGH) ? hi_freq : low_freq; - - return 0; -} - -static unsigned int pmac_cpufreq_get_speed(unsigned int cpu) -{ - return cur_freq; -} - -static int pmac_cpufreq_verify(struct cpufreq_policy *policy) -{ - return cpufreq_frequency_table_verify(policy, pmac_cpu_freqs); -} - -static int pmac_cpufreq_target( struct cpufreq_policy *policy, - unsigned int target_freq, - unsigned int relation) -{ - unsigned int newstate = 0; - - if (cpufreq_frequency_table_target(policy, pmac_cpu_freqs, - target_freq, relation, &newstate)) - return -EINVAL; - - return do_set_cpu_speed(newstate, 1); -} - -unsigned int pmac_get_one_cpufreq(int i) -{ - /* Supports only one CPU for now */ - return (i == 0) ? cur_freq : 0; -} - -static int pmac_cpufreq_cpu_init(struct cpufreq_policy *policy) -{ - if (policy->cpu != 0) - return -ENODEV; - - policy->governor = CPUFREQ_DEFAULT_GOVERNOR; - policy->cpuinfo.transition_latency = CPUFREQ_ETERNAL; - policy->cur = cur_freq; - - cpufreq_frequency_table_get_attr(pmac_cpu_freqs, policy->cpu); - return cpufreq_frequency_table_cpuinfo(policy, pmac_cpu_freqs); -} - -static u32 read_gpio(struct device_node *np) -{ - u32 *reg = (u32 *)get_property(np, "reg", NULL); - u32 offset; - - if (reg == NULL) - return 0; - /* That works for all keylargos but shall be fixed properly - * some day... The problem is that it seems we can't rely - * on the "reg" property of the GPIO nodes, they are either - * relative to the base of KeyLargo or to the base of the - * GPIO space, and the device-tree doesn't help. - */ - offset = *reg; - if (offset < KEYLARGO_GPIO_LEVELS0) - offset += KEYLARGO_GPIO_LEVELS0; - return offset; -} - -static int pmac_cpufreq_suspend(struct cpufreq_policy *policy, pm_message_t pmsg) -{ - /* Ok, this could be made a bit smarter, but let's be robust for now. We - * always force a speed change to high speed before sleep, to make sure - * we have appropriate voltage and/or bus speed for the wakeup process, - * and to make sure our loops_per_jiffies are "good enough", that is will - * not cause too short delays if we sleep in low speed and wake in high - * speed.. - */ - no_schedule = 1; - sleep_freq = cur_freq; - if (cur_freq == low_freq && !is_pmu_based) - do_set_cpu_speed(CPUFREQ_HIGH, 0); - return 0; -} - -static int pmac_cpufreq_resume(struct cpufreq_policy *policy) -{ - /* If we resume, first check if we have a get() function */ - if (get_speed_proc) - cur_freq = get_speed_proc(); - else - cur_freq = 0; - - /* We don't, hrm... we don't really know our speed here, best - * is that we force a switch to whatever it was, which is - * probably high speed due to our suspend() routine - */ - do_set_cpu_speed(sleep_freq == low_freq ? - CPUFREQ_LOW : CPUFREQ_HIGH, 0); - - no_schedule = 0; - return 0; -} - -static struct cpufreq_driver pmac_cpufreq_driver = { - .verify = pmac_cpufreq_verify, - .target = pmac_cpufreq_target, - .get = pmac_cpufreq_get_speed, - .init = pmac_cpufreq_cpu_init, - .suspend = pmac_cpufreq_suspend, - .resume = pmac_cpufreq_resume, - .flags = CPUFREQ_PM_NO_WARN, - .attr = pmac_cpu_freqs_attr, - .name = "powermac", - .owner = THIS_MODULE, -}; - - -static int pmac_cpufreq_init_MacRISC3(struct device_node *cpunode) -{ - struct device_node *volt_gpio_np = of_find_node_by_name(NULL, - "voltage-gpio"); - struct device_node *freq_gpio_np = of_find_node_by_name(NULL, - "frequency-gpio"); - struct device_node *slew_done_gpio_np = of_find_node_by_name(NULL, - "slewing-done"); - u32 *value; - - /* - * Check to see if it's GPIO driven or PMU only - * - * The way we extract the GPIO address is slightly hackish, but it - * works well enough for now. We need to abstract the whole GPIO - * stuff sooner or later anyway - */ - - if (volt_gpio_np) - voltage_gpio = read_gpio(volt_gpio_np); - if (freq_gpio_np) - frequency_gpio = read_gpio(freq_gpio_np); - if (slew_done_gpio_np) - slew_done_gpio = read_gpio(slew_done_gpio_np); - - /* If we use the frequency GPIOs, calculate the min/max speeds based - * on the bus frequencies - */ - if (frequency_gpio && slew_done_gpio) { - int lenp, rc; - u32 *freqs, *ratio; - - freqs = (u32 *)get_property(cpunode, "bus-frequencies", &lenp); - lenp /= sizeof(u32); - if (freqs == NULL || lenp != 2) { - printk(KERN_ERR "cpufreq: bus-frequencies incorrect or missing\n"); - return 1; - } - ratio = (u32 *)get_property(cpunode, "processor-to-bus-ratio*2", NULL); - if (ratio == NULL) { - printk(KERN_ERR "cpufreq: processor-to-bus-ratio*2 missing\n"); - return 1; - } - - /* Get the min/max bus frequencies */ - low_freq = min(freqs[0], freqs[1]); - hi_freq = max(freqs[0], freqs[1]); - - /* Grrrr.. It _seems_ that the device-tree is lying on the low bus - * frequency, it claims it to be around 84Mhz on some models while - * it appears to be approx. 101Mhz on all. Let's hack around here... - * fortunately, we don't need to be too precise - */ - if (low_freq < 98000000) - low_freq = 101000000; - - /* Convert those to CPU core clocks */ - low_freq = (low_freq * (*ratio)) / 2000; - hi_freq = (hi_freq * (*ratio)) / 2000; - - /* Now we get the frequencies, we read the GPIO to see what is out current - * speed - */ - rc = pmac_call_feature(PMAC_FTR_READ_GPIO, NULL, frequency_gpio, 0); - cur_freq = (rc & 0x01) ? hi_freq : low_freq; - - set_speed_proc = gpios_set_cpu_speed; - return 1; - } - - /* If we use the PMU, look for the min & max frequencies in the - * device-tree - */ - value = (u32 *)get_property(cpunode, "min-clock-frequency", NULL); - if (!value) - return 1; - low_freq = (*value) / 1000; - /* The PowerBook G4 12" (PowerBook6,1) has an error in the device-tree - * here */ - if (low_freq < 100000) - low_freq *= 10; - - value = (u32 *)get_property(cpunode, "max-clock-frequency", NULL); - if (!value) - return 1; - hi_freq = (*value) / 1000; - set_speed_proc = pmu_set_cpu_speed; - is_pmu_based = 1; - - return 0; -} - -static int pmac_cpufreq_init_7447A(struct device_node *cpunode) -{ - struct device_node *volt_gpio_np; - - if (get_property(cpunode, "dynamic-power-step", NULL) == NULL) - return 1; - - volt_gpio_np = of_find_node_by_name(NULL, "cpu-vcore-select"); - if (volt_gpio_np) - voltage_gpio = read_gpio(volt_gpio_np); - if (!voltage_gpio){ - printk(KERN_ERR "cpufreq: missing cpu-vcore-select gpio\n"); - return 1; - } - - /* OF only reports the high frequency */ - hi_freq = cur_freq; - low_freq = cur_freq/2; - - /* Read actual frequency from CPU */ - cur_freq = dfs_get_cpu_speed(); - set_speed_proc = dfs_set_cpu_speed; - get_speed_proc = dfs_get_cpu_speed; - - return 0; -} - -static int pmac_cpufreq_init_750FX(struct device_node *cpunode) -{ - struct device_node *volt_gpio_np; - u32 pvr, *value; - - if (get_property(cpunode, "dynamic-power-step", NULL) == NULL) - return 1; - - hi_freq = cur_freq; - value = (u32 *)get_property(cpunode, "reduced-clock-frequency", NULL); - if (!value) - return 1; - low_freq = (*value) / 1000; - - volt_gpio_np = of_find_node_by_name(NULL, "cpu-vcore-select"); - if (volt_gpio_np) - voltage_gpio = read_gpio(volt_gpio_np); - - pvr = mfspr(SPRN_PVR); - has_cpu_l2lve = !((pvr & 0xf00) == 0x100); - - set_speed_proc = cpu_750fx_cpu_speed; - get_speed_proc = cpu_750fx_get_cpu_speed; - cur_freq = cpu_750fx_get_cpu_speed(); - - return 0; -} - -/* Currently, we support the following machines: - * - * - Titanium PowerBook 1Ghz (PMU based, 667Mhz & 1Ghz) - * - Titanium PowerBook 800 (PMU based, 667Mhz & 800Mhz) - * - Titanium PowerBook 400 (PMU based, 300Mhz & 400Mhz) - * - Titanium PowerBook 500 (PMU based, 300Mhz & 500Mhz) - * - iBook2 500/600 (PMU based, 400Mhz & 500/600Mhz) - * - iBook2 700 (CPU based, 400Mhz & 700Mhz, support low voltage) - * - Recent MacRISC3 laptops - * - All new machines with 7447A CPUs - */ -static int __init pmac_cpufreq_setup(void) -{ - struct device_node *cpunode; - u32 *value; - - if (strstr(cmd_line, "nocpufreq")) - return 0; - - /* Assume only one CPU */ - cpunode = find_type_devices("cpu"); - if (!cpunode) - goto out; - - /* Get current cpu clock freq */ - value = (u32 *)get_property(cpunode, "clock-frequency", NULL); - if (!value) - goto out; - cur_freq = (*value) / 1000; - - /* Check for 7447A based MacRISC3 */ - if (machine_is_compatible("MacRISC3") && - get_property(cpunode, "dynamic-power-step", NULL) && - PVR_VER(mfspr(SPRN_PVR)) == 0x8003) { - pmac_cpufreq_init_7447A(cpunode); - /* Check for other MacRISC3 machines */ - } else if (machine_is_compatible("PowerBook3,4") || - machine_is_compatible("PowerBook3,5") || - machine_is_compatible("MacRISC3")) { - pmac_cpufreq_init_MacRISC3(cpunode); - /* Else check for iBook2 500/600 */ - } else if (machine_is_compatible("PowerBook4,1")) { - hi_freq = cur_freq; - low_freq = 400000; - set_speed_proc = pmu_set_cpu_speed; - is_pmu_based = 1; - } - /* Else check for TiPb 550 */ - else if (machine_is_compatible("PowerBook3,3") && cur_freq == 550000) { - hi_freq = cur_freq; - low_freq = 500000; - set_speed_proc = pmu_set_cpu_speed; - is_pmu_based = 1; - } - /* Else check for TiPb 400 & 500 */ - else if (machine_is_compatible("PowerBook3,2")) { - /* We only know about the 400 MHz and the 500Mhz model - * they both have 300 MHz as low frequency - */ - if (cur_freq < 350000 || cur_freq > 550000) - goto out; - hi_freq = cur_freq; - low_freq = 300000; - set_speed_proc = pmu_set_cpu_speed; - is_pmu_based = 1; - } - /* Else check for 750FX */ - else if (PVR_VER(mfspr(SPRN_PVR)) == 0x7000) - pmac_cpufreq_init_750FX(cpunode); -out: - if (set_speed_proc == NULL) - return -ENODEV; - - pmac_cpu_freqs[CPUFREQ_LOW].frequency = low_freq; - pmac_cpu_freqs[CPUFREQ_HIGH].frequency = hi_freq; - - printk(KERN_INFO "Registering PowerMac CPU frequency driver\n"); - printk(KERN_INFO "Low: %d Mhz, High: %d Mhz, Boot: %d Mhz\n", - low_freq/1000, hi_freq/1000, cur_freq/1000); - - return cpufreq_register_driver(&pmac_cpufreq_driver); -} - -module_init(pmac_cpufreq_setup); - Index: linux-work/arch/powerpc/platforms/powermac/cpufreq_32.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/arch/powerpc/platforms/powermac/cpufreq_32.c 2005-11-07 11:58:31.000000000 +1100 @@ -0,0 +1,727 @@ +/* + * arch/ppc/platforms/pmac_cpufreq.c + * + * Copyright (C) 2002 - 2005 Benjamin Herrenschmidt + * Copyright (C) 2004 John Steele Scott + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * TODO: Need a big cleanup here. Basically, we need to have different + * cpufreq_driver structures for the different type of HW instead of the + * current mess. We also need to better deal with the detection of the + * type of machine. + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* WARNING !!! This will cause calibrate_delay() to be called, + * but this is an __init function ! So you MUST go edit + * init/main.c to make it non-init before enabling DEBUG_FREQ + */ +#undef DEBUG_FREQ + +/* + * There is a problem with the core cpufreq code on SMP kernels, + * it won't recalculate the Bogomips properly + */ +#ifdef CONFIG_SMP +#warning "WARNING, CPUFREQ not recommended on SMP kernels" +#endif + +extern void low_choose_7447a_dfs(int dfs); +extern void low_choose_750fx_pll(int pll); +extern void low_sleep_handler(void); + +/* + * Currently, PowerMac cpufreq supports only high & low frequencies + * that are set by the firmware + */ +static unsigned int low_freq; +static unsigned int hi_freq; +static unsigned int cur_freq; +static unsigned int sleep_freq; + +/* + * Different models uses different mecanisms to switch the frequency + */ +static int (*set_speed_proc)(int low_speed); +static unsigned int (*get_speed_proc)(void); + +/* + * Some definitions used by the various speedprocs + */ +static u32 voltage_gpio; +static u32 frequency_gpio; +static u32 slew_done_gpio; +static int no_schedule; +static int has_cpu_l2lve; +static int is_pmu_based; + +/* There are only two frequency states for each processor. Values + * are in kHz for the time being. + */ +#define CPUFREQ_HIGH 0 +#define CPUFREQ_LOW 1 + +static struct cpufreq_frequency_table pmac_cpu_freqs[] = { + {CPUFREQ_HIGH, 0}, + {CPUFREQ_LOW, 0}, + {0, CPUFREQ_TABLE_END}, +}; + +static struct freq_attr* pmac_cpu_freqs_attr[] = { + &cpufreq_freq_attr_scaling_available_freqs, + NULL, +}; + +static inline void local_delay(unsigned long ms) +{ + if (no_schedule) + mdelay(ms); + else + msleep(ms); +} + +#ifdef DEBUG_FREQ +static inline void debug_calc_bogomips(void) +{ + /* This will cause a recalc of bogomips and display the + * result. We backup/restore the value to avoid affecting the + * core cpufreq framework's own calculation. + */ + extern void calibrate_delay(void); + + unsigned long save_lpj = loops_per_jiffy; + calibrate_delay(); + loops_per_jiffy = save_lpj; +} +#endif /* DEBUG_FREQ */ + +/* Switch CPU speed under 750FX CPU control + */ +static int cpu_750fx_cpu_speed(int low_speed) +{ + u32 hid2; + + if (low_speed == 0) { + /* ramping up, set voltage first */ + pmac_call_feature(PMAC_FTR_WRITE_GPIO, NULL, voltage_gpio, 0x05); + /* Make sure we sleep for at least 1ms */ + local_delay(10); + + /* tweak L2 for high voltage */ + if (has_cpu_l2lve) { + hid2 = mfspr(SPRN_HID2); + hid2 &= ~0x2000; + mtspr(SPRN_HID2, hid2); + } + } +#ifdef CONFIG_6xx + low_choose_750fx_pll(low_speed); +#endif + if (low_speed == 1) { + /* tweak L2 for low voltage */ + if (has_cpu_l2lve) { + hid2 = mfspr(SPRN_HID2); + hid2 |= 0x2000; + mtspr(SPRN_HID2, hid2); + } + + /* ramping down, set voltage last */ + pmac_call_feature(PMAC_FTR_WRITE_GPIO, NULL, voltage_gpio, 0x04); + local_delay(10); + } + + return 0; +} + +static unsigned int cpu_750fx_get_cpu_speed(void) +{ + if (mfspr(SPRN_HID1) & HID1_PS) + return low_freq; + else + return hi_freq; +} + +/* Switch CPU speed using DFS */ +static int dfs_set_cpu_speed(int low_speed) +{ + if (low_speed == 0) { + /* ramping up, set voltage first */ + pmac_call_feature(PMAC_FTR_WRITE_GPIO, NULL, voltage_gpio, 0x05); + /* Make sure we sleep for at least 1ms */ + local_delay(1); + } + + /* set frequency */ +#ifdef CONFIG_6xx + low_choose_7447a_dfs(low_speed); +#endif + udelay(100); + + if (low_speed == 1) { + /* ramping down, set voltage last */ + pmac_call_feature(PMAC_FTR_WRITE_GPIO, NULL, voltage_gpio, 0x04); + local_delay(1); + } + + return 0; +} + +static unsigned int dfs_get_cpu_speed(void) +{ + if (mfspr(SPRN_HID1) & HID1_DFS) + return low_freq; + else + return hi_freq; +} + + +/* Switch CPU speed using slewing GPIOs + */ +static int gpios_set_cpu_speed(int low_speed) +{ + int gpio, timeout = 0; + + /* If ramping up, set voltage first */ + if (low_speed == 0) { + pmac_call_feature(PMAC_FTR_WRITE_GPIO, NULL, voltage_gpio, 0x05); + /* Delay is way too big but it's ok, we schedule */ + local_delay(10); + } + + /* Set frequency */ + gpio = pmac_call_feature(PMAC_FTR_READ_GPIO, NULL, frequency_gpio, 0); + if (low_speed == ((gpio & 0x01) == 0)) + goto skip; + + pmac_call_feature(PMAC_FTR_WRITE_GPIO, NULL, frequency_gpio, + low_speed ? 0x04 : 0x05); + udelay(200); + do { + if (++timeout > 100) + break; + local_delay(1); + gpio = pmac_call_feature(PMAC_FTR_READ_GPIO, NULL, slew_done_gpio, 0); + } while((gpio & 0x02) == 0); + skip: + /* If ramping down, set voltage last */ + if (low_speed == 1) { + pmac_call_feature(PMAC_FTR_WRITE_GPIO, NULL, voltage_gpio, 0x04); + /* Delay is way too big but it's ok, we schedule */ + local_delay(10); + } + +#ifdef DEBUG_FREQ + debug_calc_bogomips(); +#endif + + return 0; +} + +/* Switch CPU speed under PMU control + */ +static int pmu_set_cpu_speed(int low_speed) +{ + struct adb_request req; + unsigned long save_l2cr; + unsigned long save_l3cr; + unsigned int pic_prio; + unsigned long flags; + + preempt_disable(); + +#ifdef DEBUG_FREQ + printk(KERN_DEBUG "HID1, before: %x\n", mfspr(SPRN_HID1)); +#endif + pmu_suspend(); + + /* Disable all interrupt sources on openpic */ + pic_prio = mpic_cpu_get_priority(); + mpic_cpu_set_priority(0xf); + + /* Make sure the decrementer won't interrupt us */ + asm volatile("mtdec %0" : : "r" (0x7fffffff)); + /* Make sure any pending DEC interrupt occuring while we did + * the above didn't re-enable the DEC */ + mb(); + asm volatile("mtdec %0" : : "r" (0x7fffffff)); + + /* We can now disable MSR_EE */ + local_irq_save(flags); + + /* Giveup the FPU & vec */ + enable_kernel_fp(); + +#ifdef CONFIG_ALTIVEC + if (cpu_has_feature(CPU_FTR_ALTIVEC)) + enable_kernel_altivec(); +#endif /* CONFIG_ALTIVEC */ + + /* Save & disable L2 and L3 caches */ + save_l3cr = _get_L3CR(); /* (returns -1 if not available) */ + save_l2cr = _get_L2CR(); /* (returns -1 if not available) */ + + /* Send the new speed command. My assumption is that this command + * will cause PLL_CFG[0..3] to be changed next time CPU goes to sleep + */ + pmu_request(&req, NULL, 6, PMU_CPU_SPEED, 'W', 'O', 'O', 'F', low_speed); + while (!req.complete) + pmu_poll(); + + /* Prepare the northbridge for the speed transition */ + pmac_call_feature(PMAC_FTR_SLEEP_STATE,NULL,1,1); + + /* Call low level code to backup CPU state and recover from + * hardware reset + */ + low_sleep_handler(); + + /* Restore the northbridge */ + pmac_call_feature(PMAC_FTR_SLEEP_STATE,NULL,1,0); + + /* Restore L2 cache */ + if (save_l2cr != 0xffffffff && (save_l2cr & L2CR_L2E) != 0) + _set_L2CR(save_l2cr); + /* Restore L3 cache */ + if (save_l3cr != 0xffffffff && (save_l3cr & L3CR_L3E) != 0) + _set_L3CR(save_l3cr); + + /* Restore userland MMU context */ + set_context(current->active_mm->context, current->active_mm->pgd); + +#ifdef DEBUG_FREQ + printk(KERN_DEBUG "HID1, after: %x\n", mfspr(SPRN_HID1)); +#endif + + /* Restore low level PMU operations */ + pmu_unlock(); + + /* Restore decrementer */ + wakeup_decrementer(); + + /* Restore interrupts */ + mpic_cpu_set_priority(pic_prio); + + /* Let interrupts flow again ... */ + local_irq_restore(flags); + +#ifdef DEBUG_FREQ + debug_calc_bogomips(); +#endif + + pmu_resume(); + + preempt_enable(); + + return 0; +} + +static int do_set_cpu_speed(int speed_mode, int notify) +{ + struct cpufreq_freqs freqs; + unsigned long l3cr; + static unsigned long prev_l3cr; + + freqs.old = cur_freq; + freqs.new = (speed_mode == CPUFREQ_HIGH) ? hi_freq : low_freq; + freqs.cpu = smp_processor_id(); + + if (freqs.old == freqs.new) + return 0; + + if (notify) + cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE); + if (speed_mode == CPUFREQ_LOW && + cpu_has_feature(CPU_FTR_L3CR)) { + l3cr = _get_L3CR(); + if (l3cr & L3CR_L3E) { + prev_l3cr = l3cr; + _set_L3CR(0); + } + } + set_speed_proc(speed_mode == CPUFREQ_LOW); + if (speed_mode == CPUFREQ_HIGH && + cpu_has_feature(CPU_FTR_L3CR)) { + l3cr = _get_L3CR(); + if ((prev_l3cr & L3CR_L3E) && l3cr != prev_l3cr) + _set_L3CR(prev_l3cr); + } + if (notify) + cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE); + cur_freq = (speed_mode == CPUFREQ_HIGH) ? hi_freq : low_freq; + + return 0; +} + +static unsigned int pmac_cpufreq_get_speed(unsigned int cpu) +{ + return cur_freq; +} + +static int pmac_cpufreq_verify(struct cpufreq_policy *policy) +{ + return cpufreq_frequency_table_verify(policy, pmac_cpu_freqs); +} + +static int pmac_cpufreq_target( struct cpufreq_policy *policy, + unsigned int target_freq, + unsigned int relation) +{ + unsigned int newstate = 0; + int rc; + + if (cpufreq_frequency_table_target(policy, pmac_cpu_freqs, + target_freq, relation, &newstate)) + return -EINVAL; + + rc = do_set_cpu_speed(newstate, 1); + + ppc_proc_freq = cur_freq * 1000ul; + return rc; +} + +static int pmac_cpufreq_cpu_init(struct cpufreq_policy *policy) +{ + if (policy->cpu != 0) + return -ENODEV; + + policy->governor = CPUFREQ_DEFAULT_GOVERNOR; + policy->cpuinfo.transition_latency = CPUFREQ_ETERNAL; + policy->cur = cur_freq; + + cpufreq_frequency_table_get_attr(pmac_cpu_freqs, policy->cpu); + return cpufreq_frequency_table_cpuinfo(policy, pmac_cpu_freqs); +} + +static u32 read_gpio(struct device_node *np) +{ + u32 *reg = (u32 *)get_property(np, "reg", NULL); + u32 offset; + + if (reg == NULL) + return 0; + /* That works for all keylargos but shall be fixed properly + * some day... The problem is that it seems we can't rely + * on the "reg" property of the GPIO nodes, they are either + * relative to the base of KeyLargo or to the base of the + * GPIO space, and the device-tree doesn't help. + */ + offset = *reg; + if (offset < KEYLARGO_GPIO_LEVELS0) + offset += KEYLARGO_GPIO_LEVELS0; + return offset; +} + +static int pmac_cpufreq_suspend(struct cpufreq_policy *policy, pm_message_t pmsg) +{ + /* Ok, this could be made a bit smarter, but let's be robust for now. We + * always force a speed change to high speed before sleep, to make sure + * we have appropriate voltage and/or bus speed for the wakeup process, + * and to make sure our loops_per_jiffies are "good enough", that is will + * not cause too short delays if we sleep in low speed and wake in high + * speed.. + */ + no_schedule = 1; + sleep_freq = cur_freq; + if (cur_freq == low_freq && !is_pmu_based) + do_set_cpu_speed(CPUFREQ_HIGH, 0); + return 0; +} + +static int pmac_cpufreq_resume(struct cpufreq_policy *policy) +{ + /* If we resume, first check if we have a get() function */ + if (get_speed_proc) + cur_freq = get_speed_proc(); + else) + cur_freq = 0; + + /* We don't, hrm... we don't really know our speed here, best + * is that we force a switch to whatever it was, which is + * probably high speed due to our suspend() routine + */ + do_set_cpu_speed(sleep_freq == low_freq ? + CPUFREQ_LOW : CPUFREQ_HIGH, 0); + + ppc_proc_freq = cur_freq * 1000ul; + + no_schedule = 0; + return 0; +} + +static struct cpufreq_driver pmac_cpufreq_driver = { + .verify = pmac_cpufreq_verify, + .target = pmac_cpufreq_target, + .get = pmac_cpufreq_get_speed, + .init = pmac_cpufreq_cpu_init, + .suspend = pmac_cpufreq_suspend, + .resume = pmac_cpufreq_resume, + .flags = CPUFREQ_PM_NO_WARN, + .attr = pmac_cpu_freqs_attr, + .name = "powermac", + .owner = THIS_MODULE, +}; + + +static int pmac_cpufreq_init_MacRISC3(struct device_node *cpunode) +{ + struct device_node *volt_gpio_np = of_find_node_by_name(NULL, + "voltage-gpio"); + struct device_node *freq_gpio_np = of_find_node_by_name(NULL, + "frequency-gpio"); + struct device_node *slew_done_gpio_np = of_find_node_by_name(NULL, + "slewing-done"); + u32 *value; + + /* + * Check to see if it's GPIO driven or PMU only + * + * The way we extract the GPIO address is slightly hackish, but it + * works well enough for now. We need to abstract the whole GPIO + * stuff sooner or later anyway + */ + + if (volt_gpio_np) + voltage_gpio = read_gpio(volt_gpio_np); + if (freq_gpio_np) + frequency_gpio = read_gpio(freq_gpio_np); + if (slew_done_gpio_np) + slew_done_gpio = read_gpio(slew_done_gpio_np); + + /* If we use the frequency GPIOs, calculate the min/max speeds based + * on the bus frequencies + */ + if (frequency_gpio && slew_done_gpio) { + int lenp, rc; + u32 *freqs, *ratio; + + freqs = (u32 *)get_property(cpunode, "bus-frequencies", &lenp); + lenp /= sizeof(u32); + if (freqs == NULL || lenp != 2) { + printk(KERN_ERR "cpufreq: bus-frequencies incorrect or missing\n"); + return 1; + } + ratio = (u32 *)get_property(cpunode, "processor-to-bus-ratio*2", NULL); + if (ratio == NULL) { + printk(KERN_ERR "cpufreq: processor-to-bus-ratio*2 missing\n"); + return 1; + } + + /* Get the min/max bus frequencies */ + low_freq = min(freqs[0], freqs[1]); + hi_freq = max(freqs[0], freqs[1]); + + /* Grrrr.. It _seems_ that the device-tree is lying on the low bus + * frequency, it claims it to be around 84Mhz on some models while + * it appears to be approx. 101Mhz on all. Let's hack around here... + * fortunately, we don't need to be too precise + */ + if (low_freq < 98000000) + low_freq = 101000000; + + /* Convert those to CPU core clocks */ + low_freq = (low_freq * (*ratio)) / 2000; + hi_freq = (hi_freq * (*ratio)) / 2000; + + /* Now we get the frequencies, we read the GPIO to see what is out current + * speed + */ + rc = pmac_call_feature(PMAC_FTR_READ_GPIO, NULL, frequency_gpio, 0); + cur_freq = (rc & 0x01) ? hi_freq : low_freq; + + set_speed_proc = gpios_set_cpu_speed; + return 1; + } + + /* If we use the PMU, look for the min & max frequencies in the + * device-tree + */ + value = (u32 *)get_property(cpunode, "min-clock-frequency", NULL); + if (!value) + return 1; + low_freq = (*value) / 1000; + /* The PowerBook G4 12" (PowerBook6,1) has an error in the device-tree + * here */ + if (low_freq < 100000) + low_freq *= 10; + + value = (u32 *)get_property(cpunode, "max-clock-frequency", NULL); + if (!value) + return 1; + hi_freq = (*value) / 1000; + set_speed_proc = pmu_set_cpu_speed; + is_pmu_based = 1; + + return 0; +} + +static int pmac_cpufreq_init_7447A(struct device_node *cpunode) +{ + struct device_node *volt_gpio_np; + + if (get_property(cpunode, "dynamic-power-step", NULL) == NULL) + return 1; + + volt_gpio_np = of_find_node_by_name(NULL, "cpu-vcore-select"); + if (volt_gpio_np) + voltage_gpio = read_gpio(volt_gpio_np); + if (!voltage_gpio){ + printk(KERN_ERR "cpufreq: missing cpu-vcore-select gpio\n"); + return 1; + } + + /* OF only reports the high frequency */ + hi_freq = cur_freq; + low_freq = cur_freq/2; + + /* Read actual frequency from CPU */ + cur_freq = dfs_get_cpu_speed(); + set_speed_proc = dfs_set_cpu_speed; + get_speed_proc = dfs_get_cpu_speed; + + return 0; +} + +static int pmac_cpufreq_init_750FX(struct device_node *cpunode) +{ + struct device_node *volt_gpio_np; + u32 pvr, *value; + + if (get_property(cpunode, "dynamic-power-step", NULL) == NULL) + return 1; + + hi_freq = cur_freq; + value = (u32 *)get_property(cpunode, "reduced-clock-frequency", NULL); + if (!value) + return 1; + low_freq = (*value) / 1000; + + volt_gpio_np = of_find_node_by_name(NULL, "cpu-vcore-select"); + if (volt_gpio_np) + voltage_gpio = read_gpio(volt_gpio_np); + + pvr = mfspr(SPRN_PVR); + has_cpu_l2lve = !((pvr & 0xf00) == 0x100); + + set_speed_proc = cpu_750fx_cpu_speed; + get_speed_proc = cpu_750fx_get_cpu_speed; + cur_freq = cpu_750fx_get_cpu_speed(); + + return 0; +} + +/* Currently, we support the following machines: + * + * - Titanium PowerBook 1Ghz (PMU based, 667Mhz & 1Ghz) + * - Titanium PowerBook 800 (PMU based, 667Mhz & 800Mhz) + * - Titanium PowerBook 400 (PMU based, 300Mhz & 400Mhz) + * - Titanium PowerBook 500 (PMU based, 300Mhz & 500Mhz) + * - iBook2 500/600 (PMU based, 400Mhz & 500/600Mhz) + * - iBook2 700 (CPU based, 400Mhz & 700Mhz, support low voltage) + * - Recent MacRISC3 laptops + * - All new machines with 7447A CPUs + */ +static int __init pmac_cpufreq_setup(void) +{ + struct device_node *cpunode; + u32 *value; + + if (strstr(cmd_line, "nocpufreq")) + return 0; + + /* Assume only one CPU */ + cpunode = find_type_devices("cpu"); + if (!cpunode) + goto out; + + /* Get current cpu clock freq */ + value = (u32 *)get_property(cpunode, "clock-frequency", NULL); + if (!value) + goto out; + cur_freq = (*value) / 1000; + + /* Check for 7447A based MacRISC3 */ + if (machine_is_compatible("MacRISC3") && + get_property(cpunode, "dynamic-power-step", NULL) && + PVR_VER(mfspr(SPRN_PVR)) == 0x8003) { + pmac_cpufreq_init_7447A(cpunode); + /* Check for other MacRISC3 machines */ + } else if (machine_is_compatible("PowerBook3,4") || + machine_is_compatible("PowerBook3,5") || + machine_is_compatible("MacRISC3")) { + pmac_cpufreq_init_MacRISC3(cpunode); + /* Else check for iBook2 500/600 */ + } else if (machine_is_compatible("PowerBook4,1")) { + hi_freq = cur_freq; + low_freq = 400000; + set_speed_proc = pmu_set_cpu_speed; + is_pmu_based = 1; + } + /* Else check for TiPb 550 */ + else if (machine_is_compatible("PowerBook3,3") && cur_freq == 550000) { + hi_freq = cur_freq; + low_freq = 500000; + set_speed_proc = pmu_set_cpu_speed; + is_pmu_based = 1; + } + /* Else check for TiPb 400 & 500 */ + else if (machine_is_compatible("PowerBook3,2")) { + /* We only know about the 400 MHz and the 500Mhz model + * they both have 300 MHz as low frequency + */ + if (cur_freq < 350000 || cur_freq > 550000) + goto out; + hi_freq = cur_freq; + low_freq = 300000; + set_speed_proc = pmu_set_cpu_speed; + is_pmu_based = 1; + } + /* Else check for 750FX */ + else if (PVR_VER(mfspr(SPRN_PVR)) == 0x7000) + pmac_cpufreq_init_750FX(cpunode); +out: + if (set_speed_proc == NULL) + return -ENODEV; + + pmac_cpu_freqs[CPUFREQ_LOW].frequency = low_freq; + pmac_cpu_freqs[CPUFREQ_HIGH].frequency = hi_freq; + ppc_proc_freq = cur_freq * 1000ul; + + printk(KERN_INFO "Registering PowerMac CPU frequency driver\n"); + printk(KERN_INFO "Low: %d Mhz, High: %d Mhz, Boot: %d Mhz\n", + low_freq/1000, hi_freq/1000, cur_freq/1000); + + return cpufreq_register_driver(&pmac_cpufreq_driver); +} + +module_init(pmac_cpufreq_setup); + Index: linux-work/arch/powerpc/platforms/powermac/cpufreq_64.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/arch/powerpc/platforms/powermac/cpufreq_64.c 2005-11-07 12:01:00.000000000 +1100 @@ -0,0 +1,323 @@ +/* + * Copyright (C) 2002 - 2005 Benjamin Herrenschmidt + * and Markus Demleitner + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This driver adds basic cpufreq support for SMU & 970FX based G5 Macs, + * that is iMac G5 and latest single CPU desktop. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#undef DEBUG + +#ifdef DEBUG +#define DBG(fmt...) printk(fmt) +#else +#define DBG(fmt...) +#endif + +/* see 970FX user manual */ + +#define SCOM_PCR 0x0aa001 /* PCR scom addr */ + +#define PCR_HILO_SELECT 0x80000000U /* 1 = PCR, 0 = PCRH */ +#define PCR_SPEED_FULL 0x00000000U /* 1:1 speed value */ +#define PCR_SPEED_HALF 0x00020000U /* 1:2 speed value */ +#define PCR_SPEED_QUARTER 0x00040000U /* 1:4 speed value */ +#define PCR_SPEED_MASK 0x000e0000U /* speed mask */ +#define PCR_SPEED_SHIFT 17 +#define PCR_FREQ_REQ_VALID 0x00010000U /* freq request valid */ +#define PCR_VOLT_REQ_VALID 0x00008000U /* volt request valid */ +#define PCR_TARGET_TIME_MASK 0x00006000U /* target time */ +#define PCR_STATLAT_MASK 0x00001f00U /* STATLAT value */ +#define PCR_SNOOPLAT_MASK 0x000000f0U /* SNOOPLAT value */ +#define PCR_SNOOPACC_MASK 0x0000000fU /* SNOOPACC value */ + +#define SCOM_PSR 0x408001 /* PSR scom addr */ +/* warning: PSR is a 64 bits register */ +#define PSR_CMD_RECEIVED 0x2000000000000000U /* command received */ +#define PSR_CMD_COMPLETED 0x1000000000000000U /* command completed */ +#define PSR_CUR_SPEED_MASK 0x0300000000000000U /* current speed */ +#define PSR_CUR_SPEED_SHIFT (56) + +/* + * The G5 only supports two frequencies (Quarter speed is not supported) + */ +#define CPUFREQ_HIGH 0 +#define CPUFREQ_LOW 1 + +static struct cpufreq_frequency_table g5_cpu_freqs[] = { + {CPUFREQ_HIGH, 0}, + {CPUFREQ_LOW, 0}, + {0, CPUFREQ_TABLE_END}, +}; + +static struct freq_attr* g5_cpu_freqs_attr[] = { + &cpufreq_freq_attr_scaling_available_freqs, + NULL, +}; + +/* Power mode data is an array of the 32 bits PCR values to use for + * the various frequencies, retreived from the device-tree + */ +static u32 *g5_pmode_data; +static int g5_pmode_max; +static int g5_pmode_cur; + +static DECLARE_MUTEX(g5_switch_mutex); + + +static struct smu_sdbp_fvt *g5_fvt_table; /* table of op. points */ +static int g5_fvt_count; /* number of op. points */ +static int g5_fvt_cur; /* current op. point */ + +/* ----------------- real hardware interface */ + +static void g5_switch_volt(int speed_mode) +{ + struct smu_simple_cmd cmd; + + DECLARE_COMPLETION(comp); + smu_queue_simple(&cmd, SMU_CMD_POWER_COMMAND, 8, smu_done_complete, + &comp, 'V', 'S', 'L', 'E', 'W', + 0xff, g5_fvt_cur+1, speed_mode); + wait_for_completion(&comp); +} + +static int g5_switch_freq(int speed_mode) +{ + struct cpufreq_freqs freqs; + int to; + + if (g5_pmode_cur == speed_mode) + return 0; + + down(&g5_switch_mutex); + + freqs.old = g5_cpu_freqs[g5_pmode_cur].frequency; + freqs.new = g5_cpu_freqs[speed_mode].frequency; + freqs.cpu = 0; + + cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE); + + /* If frequency is going up, first ramp up the voltage */ + if (speed_mode < g5_pmode_cur) + g5_switch_volt(speed_mode); + + /* Clear PCR high */ + scom970_write(SCOM_PCR, 0); + /* Clear PCR low */ + scom970_write(SCOM_PCR, PCR_HILO_SELECT | 0); + /* Set PCR low */ + scom970_write(SCOM_PCR, PCR_HILO_SELECT | + g5_pmode_data[speed_mode]); + + /* Wait for completion */ + for (to = 0; to < 10; to++) { + unsigned long psr = scom970_read(SCOM_PSR); + + if ((psr & PSR_CMD_RECEIVED) == 0 && + (((psr >> PSR_CUR_SPEED_SHIFT) ^ + (g5_pmode_data[speed_mode] >> PCR_SPEED_SHIFT)) & 0x3) + == 0) + break; + if (psr & PSR_CMD_COMPLETED) + break; + udelay(100); + } + + /* If frequency is going down, last ramp the voltage */ + if (speed_mode > g5_pmode_cur) + g5_switch_volt(speed_mode); + + g5_pmode_cur = speed_mode; + ppc_proc_freq = g5_cpu_freqs[speed_mode].frequency * 1000ul; + + cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE); + + up(&g5_switch_mutex); + + return 0; +} + +static int g5_query_freq(void) +{ + unsigned long psr = scom970_read(SCOM_PSR); + int i; + + for (i = 0; i <= g5_pmode_max; i++) + if ((((psr >> PSR_CUR_SPEED_SHIFT) ^ + (g5_pmode_data[i] >> PCR_SPEED_SHIFT)) & 0x3) == 0) + break; + return i; +} + +/* ----------------- cpufreq bookkeeping */ + +static int g5_cpufreq_verify(struct cpufreq_policy *policy) +{ + return cpufreq_frequency_table_verify(policy, g5_cpu_freqs); +} + +static int g5_cpufreq_target(struct cpufreq_policy *policy, + unsigned int target_freq, unsigned int relation) +{ + unsigned int newstate = 0; + + if (cpufreq_frequency_table_target(policy, g5_cpu_freqs, + target_freq, relation, &newstate)) + return -EINVAL; + + return g5_switch_freq(newstate); +} + +static unsigned int g5_cpufreq_get_speed(unsigned int cpu) +{ + return g5_cpu_freqs[g5_pmode_cur].frequency; +} + +static int g5_cpufreq_cpu_init(struct cpufreq_policy *policy) +{ + if (policy->cpu != 0) + return -ENODEV; + + policy->governor = CPUFREQ_DEFAULT_GOVERNOR; + policy->cpuinfo.transition_latency = CPUFREQ_ETERNAL; + policy->cur = g5_cpu_freqs[g5_query_freq()].frequency; + cpufreq_frequency_table_get_attr(g5_cpu_freqs, policy->cpu); + + return cpufreq_frequency_table_cpuinfo(policy, + g5_cpu_freqs); +} + + +static struct cpufreq_driver g5_cpufreq_driver = { + .name = "powermac", + .owner = THIS_MODULE, + .flags = CPUFREQ_CONST_LOOPS, + .init = g5_cpufreq_cpu_init, + .verify = g5_cpufreq_verify, + .target = g5_cpufreq_target, + .get = g5_cpufreq_get_speed, + .attr = g5_cpu_freqs_attr, +}; + + +static int __init g5_cpufreq_init(void) +{ + struct device_node *cpunode; + unsigned int psize, ssize; + struct smu_sdbp_header *shdr; + unsigned long max_freq; + u32 *valp; + int rc = -ENODEV; + + /* Look for CPU and SMU nodes */ + cpunode = of_find_node_by_type(NULL, "cpu"); + if (!cpunode) { + DBG("No CPU node !\n"); + return -ENODEV; + } + + /* Check 970FX for now */ + valp = (u32 *)get_property(cpunode, "cpu-version", NULL); + if (!valp) { + DBG("No cpu-version property !\n"); + goto bail_noprops; + } + if (((*valp) >> 16) != 0x3c) { + DBG("Wrong CPU version: %08x\n", *valp); + goto bail_noprops; + } + + /* Look for the powertune data in the device-tree */ + g5_pmode_data = (u32 *)get_property(cpunode, "power-mode-data",&psize); + if (!g5_pmode_data) { + DBG("No power-mode-data !\n"); + goto bail_noprops; + } + g5_pmode_max = psize / sizeof(u32) - 1; + + /* Look for the FVT table */ + shdr = smu_get_sdb_partition(SMU_SDB_FVT_ID, NULL); + if (!shdr) + goto bail_noprops; + g5_fvt_table = (struct smu_sdbp_fvt *)&shdr[1]; + ssize = (shdr->len * sizeof(u32)) - sizeof(struct smu_sdbp_header); + g5_fvt_count = ssize / sizeof(struct smu_sdbp_fvt); + g5_fvt_cur = 0; + + /* Sanity checking */ + if (g5_fvt_count < 1 || g5_pmode_max < 1) + goto bail_noprops; + + /* + * From what I see, clock-frequency is always the maximal frequency. + * The current driver can not slew sysclk yet, so we really only deal + * with powertune steps for now. We also only implement full freq and + * half freq in this version. So far, I haven't yet seen a machine + * supporting anything else. + */ + valp = (u32 *)get_property(cpunode, "clock-frequency", NULL); + if (!valp) + return -ENODEV; + max_freq = (*valp)/1000; + g5_cpu_freqs[0].frequency = max_freq; + g5_cpu_freqs[1].frequency = max_freq/2; + + /* Check current frequency */ + g5_pmode_cur = g5_query_freq(); + if (g5_pmode_cur > 1) + /* We don't support anything but 1:1 and 1:2, fixup ... */ + g5_pmode_cur = 1; + + /* Force apply current frequency to make sure everything is in + * sync (voltage is right for example). Firmware may leave us with + * a strange setting ... + */ + g5_switch_freq(g5_pmode_cur); + + printk(KERN_INFO "Registering G5 CPU frequency driver\n"); + printk(KERN_INFO "Low: %d Mhz, High: %d Mhz, Cur: %d MHz\n", + g5_cpu_freqs[1].frequency/1000, + g5_cpu_freqs[0].frequency/1000, + g5_cpu_freqs[g5_pmode_cur].frequency/1000); + + rc = cpufreq_register_driver(&g5_cpufreq_driver); + + /* We keep the CPU node on hold... hopefully, Apple G5 don't have + * hotplug CPU with a dynamic device-tree ... + */ + return rc; + + bail_noprops: + of_node_put(cpunode); + + return rc; +} + +module_init(g5_cpufreq_init); + + +MODULE_LICENSE("GPL"); Index: linux-work/include/asm-powerpc/reg.h =================================================================== --- linux-work.orig/include/asm-powerpc/reg.h 2005-11-07 11:58:28.000000000 +1100 +++ linux-work/include/asm-powerpc/reg.h 2005-11-07 11:58:31.000000000 +1100 @@ -396,6 +396,9 @@ #define SPRN_VRSAVE 0x100 /* Vector Register Save Register */ #define SPRN_XER 0x001 /* Fixed Point Exception Register */ +#define SPRN_SCOMC 0x114 /* SCOM Access Control */ +#define SPRN_SCOMD 0x115 /* SCOM Access DATA */ + /* Performance monitor SPRs */ #ifdef CONFIG_PPC64 #define SPRN_MMCR0 795 @@ -594,7 +597,11 @@ mtspr(SPRN_CTRLT, ctrl); } } -#endif + +extern unsigned long scom970_read(unsigned int address); +extern void scom970_write(unsigned int address, unsigned long value); + +#endif /* CONFIG_PPC64 */ #define __get_SP() ({unsigned long sp; \ asm volatile("mr %0,1": "=r" (sp)); sp;}) Index: linux-work/include/asm-powerpc/smu.h =================================================================== --- linux-work.orig/include/asm-powerpc/smu.h 2005-11-07 11:58:28.000000000 +1100 +++ linux-work/include/asm-powerpc/smu.h 2005-11-07 11:58:31.000000000 +1100 @@ -144,7 +144,11 @@ * - lenght 8 ("VSLEWxyz") has 3 additional bytes appended, and is * used to set the voltage slewing point. The SMU replies with "DONE" * I yet have to figure out their exact meaning of those 3 bytes in - * both cases. + * both cases. They seem to be: + * x = processor mask + * y = op. point index + * z = processor freq. step index + * I haven't yet decyphered result codes * */ #define SMU_CMD_POWER_COMMAND 0xaa @@ -333,6 +337,60 @@ #endif /* __KERNEL__ */ + +/* + * - SMU "sdb" partitions informations - + */ + + +/* + * Partition header format + */ +struct smu_sdbp_header { + __u8 id; + __u8 len; + __u8 version; + __u8 flags; +}; + +/* + * 32 bits integers are usually encoded with 2x16 bits swapped, + * this demangles them + */ +#define SMU_U32_MIX(x) ((((x) << 16) & 0xffff0000u) | (((x) >> 16) & 0xffffu)) + +/* This is the definition of the SMU sdb-partition-0x12 table (called + * CPU F/V/T operating points in Darwin). The definition for all those + * SMU tables should be moved to some separate file + */ +#define SMU_SDB_FVT_ID 0x12 + +struct smu_sdbp_fvt { + __u32 sysclk; /* Base SysClk frequency in Hz for + * this operating point + */ + __u8 pad; + __u8 maxtemp; /* Max temp. supported by this + * operating point + */ + + __u16 volts[3]; /* CPU core voltage for the 3 + * PowerTune modes, a mode with + * 0V = not supported. + */ +}; + +#ifdef __KERNEL__ +/* + * This returns the pointer to an SMU "sdb" partition data or NULL + * if not found. The data format is described below + */ +extern struct smu_sdbp_header *smu_get_sdb_partition(int id, + unsigned int *size); + +#endif /* __KERNEL__ */ + + /* * - Userland interface - */ Index: linux-work/arch/powerpc/Kconfig =================================================================== --- linux-work.orig/arch/powerpc/Kconfig 2005-11-07 11:58:28.000000000 +1100 +++ linux-work/arch/powerpc/Kconfig 2005-11-07 11:58:31.000000000 +1100 @@ -404,6 +404,14 @@ this currently includes some models of iBook & Titanium PowerBook. +config CPU_FREQ_PMAC64 + bool "Support for some Apple G5s" + depends on CPU_FREQ && PMAC_SMU && PPC64 + select CPU_FREQ_TABLE + help + This adds support for frequency switching on Apple iMac G5, + and some of the more recent desktop G5 machines as well. + config PPC601_SYNC_FIX bool "Workarounds for PPC601 bugs" depends on 6xx && (PPC_PREP || PPC_PMAC) Index: linux-work/arch/powerpc/kernel/misc_64.S =================================================================== --- linux-work.orig/arch/powerpc/kernel/misc_64.S 2005-11-07 11:58:28.000000000 +1100 +++ linux-work/arch/powerpc/kernel/misc_64.S 2005-11-07 11:58:31.000000000 +1100 @@ -604,6 +604,76 @@ #endif /* defined(CONFIG_PPC_PMAC) || defined(CONFIG_PPC_MAPLE) */ /* + * SCOM access functions for 970 (FX only for now) + * + * unsigned long scom970_read(unsigned int address); + * void scom970_write(unsigned int address, unsigned long value); + * + * The address passed in is the 24 bits register address. This code + * is 970 specific and will not check the status bits, so you should + * know what you are doing. + */ +_GLOBAL(scom970_read) + /* interrupts off */ + mfmsr r4 + ori r0,r4,MSR_EE + xori r0,r0,MSR_EE + mtmsrd r0,1 + + /* rotate 24 bits SCOM address 8 bits left and mask out it's low 8 bits + * (including parity). On current CPUs they must be 0'd, + * and finally or in RW bit + */ + rlwinm r3,r3,8,0,15 + ori r3,r3,0x8000 + + /* do the actual scom read */ + sync + mtspr SPRN_SCOMC,r3 + isync + mfspr r3,SPRN_SCOMD + isync + mfspr r0,SPRN_SCOMC + isync + + /* XXX: fixup result on some buggy 970's (ouch ! we lost a bit, bah + * that's the best we can do). Not implemented yet as we don't use + * the scom on any of the bogus CPUs yet, but may have to be done + * ultimately + */ + + /* restore interrupts */ + mtmsrd r4,1 + blr + + +_GLOBAL(scom970_write) + /* interrupts off */ + mfmsr r5 + ori r0,r5,MSR_EE + xori r0,r0,MSR_EE + mtmsrd r0,1 + + /* rotate 24 bits SCOM address 8 bits left and mask out it's low 8 bits + * (including parity). On current CPUs they must be 0'd. + */ + + rlwinm r3,r3,8,0,15 + + sync + mtspr SPRN_SCOMD,r4 /* write data */ + isync + mtspr SPRN_SCOMC,r3 /* write command */ + isync + mfspr 3,SPRN_SCOMC + isync + + /* restore interrupts */ + mtmsrd r5,1 + blr + + +/* * Create a kernel thread * kernel_thread(fn, arg, flags) */ Index: linux-work/arch/powerpc/platforms/powermac/Makefile =================================================================== --- linux-work.orig/arch/powerpc/platforms/powermac/Makefile 2005-11-07 11:58:28.000000000 +1100 +++ linux-work/arch/powerpc/platforms/powermac/Makefile 2005-11-07 11:58:31.000000000 +1100 @@ -1,7 +1,8 @@ obj-y += pic.o setup.o time.o feature.o pci.o \ sleep.o low_i2c.o cache.o obj-$(CONFIG_PMAC_BACKLIGHT) += backlight.o -obj-$(CONFIG_CPU_FREQ_PMAC) += cpufreq.o +obj-$(CONFIG_CPU_FREQ_PMAC) += cpufreq_32.o +obj-$(CONFIG_CPU_FREQ_PMAC64) += cpufreq_64.o obj-$(CONFIG_NVRAM) += nvram.o # ppc64 pmac doesn't define CONFIG_NVRAM but needs nvram stuff obj-$(CONFIG_PPC64) += nvram.o Index: linux-work/arch/powerpc/platforms/powermac/setup.c =================================================================== --- linux-work.orig/arch/powerpc/platforms/powermac/setup.c 2005-11-07 11:58:28.000000000 +1100 +++ linux-work/arch/powerpc/platforms/powermac/setup.c 2005-11-07 11:58:31.000000000 +1100 @@ -193,18 +193,6 @@ pmac_newworld ? "NewWorld" : "OldWorld"); } -static void pmac_show_percpuinfo(struct seq_file *m, int i) -{ -#ifdef CONFIG_CPU_FREQ_PMAC - extern unsigned int pmac_get_one_cpufreq(int i); - unsigned int freq = pmac_get_one_cpufreq(i); - if (freq != 0) { - seq_printf(m, "clock\t\t: %dMHz\n", freq/1000); - return; - } -#endif /* CONFIG_CPU_FREQ_PMAC */ -} - #ifndef CONFIG_ADB_CUDA int find_via_cuda(void) { @@ -767,7 +755,6 @@ .setup_arch = pmac_setup_arch, .init_early = pmac_init_early, .show_cpuinfo = pmac_show_cpuinfo, - .show_percpuinfo = pmac_show_percpuinfo, .init_IRQ = pmac_pic_init, .get_irq = mpic_get_irq, /* changed later */ .pcibios_fixup = pmac_pcibios_fixup, From benh at kernel.crashing.org Mon Nov 7 14:29:02 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 07 Nov 2005 14:29:02 +1100 Subject: [PATCH] ppc64: SMU partition recovery Message-ID: <1131334142.5229.154.camel@gaston> This patch adds the ability to the SMU driver to recover missing calibration partitions from the SMU chip itself. It also adds some dynamic mecanism to /proc/device-tree so that new properties are visible to userland. Signed-off-by: Benjamin Herrenschmidt Index: linux-work/drivers/macintosh/smu.c =================================================================== --- linux-work.orig/drivers/macintosh/smu.c 2005-11-07 12:45:42.000000000 +1100 +++ linux-work/drivers/macintosh/smu.c 2005-11-07 12:45:44.000000000 +1100 @@ -47,13 +47,13 @@ #include #include -#define VERSION "0.6" +#define VERSION "0.7" #define AUTHOR "(c) 2005 Benjamin Herrenschmidt, IBM Corp." #undef DEBUG_SMU #ifdef DEBUG_SMU -#define DPRINTK(fmt, args...) do { printk(KERN_DEBUG fmt , ##args); } while (0) +#define DPRINTK(fmt, args...) do { udbg_printf(KERN_DEBUG fmt , ##args); } while (0) #else #define DPRINTK(fmt, args...) do { } while (0) #endif @@ -92,7 +92,7 @@ * for now, just hard code that */ static struct smu_device *smu; - +static DECLARE_MUTEX(smu_part_access); /* * SMU driver low level stuff @@ -113,9 +113,11 @@ DPRINTK("SMU: starting cmd %x, %d bytes data\n", cmd->cmd, cmd->data_len); - DPRINTK("SMU: data buffer: %02x %02x %02x %02x ...\n", + DPRINTK("SMU: data buffer: %02x %02x %02x %02x %02x %02x %02x %02x\n", ((u8 *)cmd->data_buf)[0], ((u8 *)cmd->data_buf)[1], - ((u8 *)cmd->data_buf)[2], ((u8 *)cmd->data_buf)[3]); + ((u8 *)cmd->data_buf)[2], ((u8 *)cmd->data_buf)[3], + ((u8 *)cmd->data_buf)[4], ((u8 *)cmd->data_buf)[5], + ((u8 *)cmd->data_buf)[6], ((u8 *)cmd->data_buf)[7]); /* Fill the SMU command buffer */ smu->cmd_buf->cmd = cmd->cmd; @@ -440,7 +442,7 @@ EXPORT_SYMBOL(smu_present); -int smu_init (void) +int __init smu_init (void) { struct device_node *np; u32 *data; @@ -845,16 +847,154 @@ return 0; } -struct smu_sdbp_header *smu_get_sdb_partition(int id, unsigned int *size) +/* + * Handling of "partitions" + */ + +static int smu_read_datablock(u8 *dest, unsigned int addr, unsigned int len) +{ + DECLARE_COMPLETION(comp); + unsigned int chunk; + struct smu_cmd cmd; + int rc; + u8 params[8]; + + /* We currently use a chunk size of 0xe. We could check the + * SMU firmware version and use bigger sizes though + */ + chunk = 0xe; + + while (len) { + unsigned int clen = min(len, chunk); + + cmd.cmd = SMU_CMD_MISC_ee_COMMAND; + cmd.data_len = 7; + cmd.data_buf = params; + cmd.reply_len = chunk; + cmd.reply_buf = dest; + cmd.done = smu_done_complete; + cmd.misc = ∁ + params[0] = SMU_CMD_MISC_ee_GET_DATABLOCK_REC; + params[1] = 0x4; + *((u32 *)¶ms[2]) = addr; + params[6] = clen; + + rc = smu_queue_cmd(&cmd); + if (rc) + return rc; + wait_for_completion(&comp); + if (cmd.status != 0) + return rc; + if (cmd.reply_len != clen) { + printk(KERN_DEBUG "SMU: short read in " + "smu_read_datablock, got: %d, want: %d\n", + cmd.reply_len, clen); + return -EIO; + } + len -= clen; + addr += clen; + dest += clen; + } + return 0; +} + +static struct smu_sdbp_header *smu_create_sdb_partition(int id) +{ + DECLARE_COMPLETION(comp); + struct smu_simple_cmd cmd; + unsigned int addr, len, tlen; + struct smu_sdbp_header *hdr; + struct property *prop; + + /* First query the partition info */ + smu_queue_simple(&cmd, SMU_CMD_PARTITION_COMMAND, 2, + smu_done_complete, &comp, + SMU_CMD_PARTITION_LATEST, id); + wait_for_completion(&comp); + + /* Partition doesn't exist (or other error) */ + if (cmd.cmd.status != 0 || cmd.cmd.reply_len != 6) + return NULL; + + /* Fetch address and length from reply */ + addr = *((u16 *)cmd.buffer); + len = cmd.buffer[3] << 2; + /* Calucluate total length to allocate, including the 17 bytes + * for "sdb-partition-XX" that we append at the end of the buffer + */ + tlen = sizeof(struct property) + len + 18; + + prop = kcalloc(tlen, 1, GFP_KERNEL); + if (prop == NULL) + return NULL; + hdr = (struct smu_sdbp_header *)(prop + 1); + prop->name = ((char *)prop) + tlen - 18; + sprintf(prop->name, "sdb-partition-%02x", id); + prop->length = len; + prop->value = (unsigned char *)hdr; + prop->next = NULL; + + /* Read the datablock */ + if (smu_read_datablock((u8 *)hdr, addr, len)) { + printk(KERN_DEBUG "SMU: datablock read failed while reading " + "partition %02x !\n", id); + goto failure; + } + + /* Got it, check a few things and create the property */ + if (hdr->id != id) { + printk(KERN_DEBUG "SMU: Reading partition %02x and got " + "%02x !\n", id, hdr->id); + goto failure; + } + if (prom_add_property(smu->of_node, prop)) { + printk(KERN_DEBUG "SMU: Failed creating sdb-partition-%02x " + "property !\n", id); + goto failure; + } + + return hdr; + failure: + kfree(prop); + return NULL; +} + +/* Note: Only allowed to return error code in pointers (using ERR_PTR) + * when interruptible is 1 + */ +struct smu_sdbp_header *__smu_get_sdb_partition(int id, unsigned int *size, + int interruptible) { char pname[32]; + struct smu_sdbp_header *part; if (!smu) return NULL; sprintf(pname, "sdb-partition-%02x", id); - return (struct smu_sdbp_header *)get_property(smu->of_node, + + if (interruptible) { + int rc; + rc = down_interruptible(&smu_part_access); + if (rc) + return ERR_PTR(rc); + } else + down(&smu_part_access); + + part = (struct smu_sdbp_header *)get_property(smu->of_node, pname, size); + if (part == NULL) { + part = smu_create_sdb_partition(id); + if (part != NULL && size) + *size = part->len << 2; + } + up(&smu_part_access); + return part; +} + +struct smu_sdbp_header *smu_get_sdb_partition(int id, unsigned int *size) +{ + return __smu_get_sdb_partition(id, size, 0); } EXPORT_SYMBOL(smu_get_sdb_partition); @@ -930,6 +1070,14 @@ else if (hdr.cmdtype == SMU_CMDTYPE_WANTS_EVENTS) { pp->mode = smu_file_events; return 0; + } else if (hdr.cmdtype == SMU_CMDTYPE_GET_PARTITION) { + struct smu_sdbp_header *part; + part = __smu_get_sdb_partition(hdr.cmd, NULL, 1); + if (part == NULL) + return -EINVAL; + else if (IS_ERR(part)) + return PTR_ERR(part); + return 0; } else if (hdr.cmdtype != SMU_CMDTYPE_SMU) return -EINVAL; else if (pp->mode != smu_file_commands) Index: linux-work/arch/ppc64/kernel/prom.c =================================================================== --- linux-work.orig/arch/ppc64/kernel/prom.c 2005-11-07 12:45:42.000000000 +1100 +++ linux-work/arch/ppc64/kernel/prom.c 2005-11-07 12:45:44.000000000 +1100 @@ -31,6 +31,7 @@ #include #include #include +#include #include #include @@ -1894,17 +1895,32 @@ EXPORT_SYMBOL(get_property); /* - * Add a property to a node + * Add a property to a node. */ -void +int prom_add_property(struct device_node* np, struct property* prop) { - struct property **next = &np->properties; + struct property **next; prop->next = NULL; - while (*next) + write_lock(&devtree_lock); + next = &np->properties; + while (*next) { + if (strcmp(prop->name, (*next)->name) == 0) { + /* duplicate ! don't insert it */ + write_unlock(&devtree_lock); + return -1; + } next = &(*next)->next; + } *next = prop; + write_unlock(&devtree_lock); + + /* try to add to proc as well if it was initialized */ + if (np->pde) + proc_device_tree_add_prop(np->pde, prop); + + return 0; } #if 0 Index: linux-work/fs/proc/proc_devtree.c =================================================================== --- linux-work.orig/fs/proc/proc_devtree.c 2005-11-07 12:45:42.000000000 +1100 +++ linux-work/fs/proc/proc_devtree.c 2005-11-07 12:45:44.000000000 +1100 @@ -49,6 +49,39 @@ */ /* + * Add a property to a node + */ +static struct proc_dir_entry * +__proc_device_tree_add_prop(struct proc_dir_entry *de, struct property *pp) +{ + struct proc_dir_entry *ent; + + /* + * Unfortunately proc_register puts each new entry + * at the beginning of the list. So we rearrange them. + */ + ent = create_proc_read_entry(pp->name, + strncmp(pp->name, "security-", 9) + ? S_IRUGO : S_IRUSR, de, + property_read_proc, pp); + if (ent == NULL) + return NULL; + + if (!strncmp(pp->name, "security-", 9)) + ent->size = 0; /* don't leak number of password chars */ + else + ent->size = pp->length; + + return ent; +} + + +void proc_device_tree_add_prop(struct proc_dir_entry *pde, struct property *prop) +{ + __proc_device_tree_add_prop(pde, prop); +} + +/* * Process a node, adding entries for its children and its properties. */ void proc_device_tree_add_node(struct device_node *np, @@ -57,11 +90,9 @@ struct property *pp; struct proc_dir_entry *ent; struct device_node *child; - struct proc_dir_entry *list = NULL, **lastp; const char *p; set_node_proc_entry(np, de); - lastp = &list; for (child = NULL; (child = of_get_next_child(np, child));) { p = strrchr(child->full_name, '/'); if (!p) @@ -71,9 +102,6 @@ ent = proc_mkdir(p, de); if (ent == 0) break; - *lastp = ent; - ent->next = NULL; - lastp = &ent->next; proc_device_tree_add_node(child, ent); } of_node_put(child); @@ -84,7 +112,7 @@ * properties are quite unimportant for us though, thus we * simply "skip" them here, but we do have to check. */ - for (ent = list; ent != NULL; ent = ent->next) + for (ent = de->subdir; ent != NULL; ent = ent->next) if (!strcmp(ent->name, pp->name)) break; if (ent != NULL) { @@ -94,25 +122,10 @@ continue; } - /* - * Unfortunately proc_register puts each new entry - * at the beginning of the list. So we rearrange them. - */ - ent = create_proc_read_entry(pp->name, - strncmp(pp->name, "security-", 9) - ? S_IRUGO : S_IRUSR, de, - property_read_proc, pp); + ent = __proc_device_tree_add_prop(de, pp); if (ent == 0) break; - if (!strncmp(pp->name, "security-", 9)) - ent->size = 0; /* don't leak number of password chars */ - else - ent->size = pp->length; - ent->next = NULL; - *lastp = ent; - lastp = &ent->next; } - de->subdir = list; } /* Index: linux-work/include/asm-ppc/prom.h =================================================================== --- linux-work.orig/include/asm-ppc/prom.h 2005-11-07 12:45:42.000000000 +1100 +++ linux-work/include/asm-ppc/prom.h 2005-11-07 12:45:44.000000000 +1100 @@ -93,7 +93,7 @@ extern int machine_is_compatible(const char *compat); extern unsigned char *get_property(struct device_node *node, const char *name, int *lenp); -extern void prom_add_property(struct device_node* np, struct property* prop); +extern int prom_add_property(struct device_node* np, struct property* prop); extern void prom_get_irq_senses(unsigned char *, int, int); extern int prom_n_addr_cells(struct device_node* np); extern int prom_n_size_cells(struct device_node* np); Index: linux-work/include/asm-ppc64/prom.h =================================================================== --- linux-work.orig/include/asm-ppc64/prom.h 2005-11-07 12:45:42.000000000 +1100 +++ linux-work/include/asm-ppc64/prom.h 2005-11-07 12:45:44.000000000 +1100 @@ -205,6 +205,6 @@ extern int prom_n_size_cells(struct device_node* np); extern int prom_n_intr_cells(struct device_node* np); extern void prom_get_irq_senses(unsigned char *senses, int off, int max); -extern void prom_add_property(struct device_node* np, struct property* prop); +extern int prom_add_property(struct device_node* np, struct property* prop); #endif /* _PPC64_PROM_H */ Index: linux-work/include/linux/proc_fs.h =================================================================== --- linux-work.orig/include/linux/proc_fs.h 2005-11-07 12:45:42.000000000 +1100 +++ linux-work/include/linux/proc_fs.h 2005-11-07 12:45:44.000000000 +1100 @@ -139,15 +139,12 @@ /* * proc_devtree.c */ +#ifdef CONFIG_PROC_DEVICETREE struct device_node; +struct property; extern void proc_device_tree_init(void); -#ifdef CONFIG_PROC_DEVICETREE extern void proc_device_tree_add_node(struct device_node *, struct proc_dir_entry *); -#else /* !CONFIG_PROC_DEVICETREE */ -static inline void proc_device_tree_add_node(struct device_node *np, struct proc_dir_entry *pde) -{ - return; -} +extern void proc_device_tree_add_prop(struct proc_dir_entry *pde, struct property *prop); #endif /* CONFIG_PROC_DEVICETREE */ extern struct proc_dir_entry *proc_symlink(const char *, Index: linux-work/arch/ppc/syslib/prom.c =================================================================== --- linux-work.orig/arch/ppc/syslib/prom.c 2005-11-07 12:45:42.000000000 +1100 +++ linux-work/arch/ppc/syslib/prom.c 2005-11-07 12:45:44.000000000 +1100 @@ -1165,7 +1165,7 @@ /* * Add a property to a node */ -void +int prom_add_property(struct device_node* np, struct property* prop) { struct property **next = &np->properties; @@ -1174,6 +1174,8 @@ while (*next) next = &(*next)->next; *next = prop; + + return 0; } /* I quickly hacked that one, check against spec ! */ Index: linux-work/include/asm-powerpc/smu.h =================================================================== --- linux-work.orig/include/asm-powerpc/smu.h 2005-11-07 12:45:42.000000000 +1100 +++ linux-work/include/asm-powerpc/smu.h 2005-11-07 12:45:44.000000000 +1100 @@ -20,16 +20,52 @@ /* * Partition info commands * - * I do not know what those are for at this point + * These commands are used to retreive the sdb-partition-XX datas from + * the SMU. The lenght is always 2. First byte is the subcommand code + * and second byte is the partition ID. + * + * The reply is 6 bytes: + * + * - 0..1 : partition address + * - 2 : a byte containing the partition ID + * - 3 : length (maybe other bits are rest of header ?) + * + * The data must then be obtained with calls to another command: + * SMU_CMD_MISC_ee_GET_DATABLOCK_REC (described below). */ #define SMU_CMD_PARTITION_COMMAND 0x3e +#define SMU_CMD_PARTITION_LATEST 0x01 +#define SMU_CMD_PARTITION_BASE 0x02 +#define SMU_CMD_PARTITION_UPDATE 0x03 /* * Fan control * - * This is a "mux" for fan control commands, first byte is the - * "sub" command. + * This is a "mux" for fan control commands. The command seem to + * act differently based on the number of arguments. With 1 byte + * of argument, this seem to be queries for fans status, setpoint, + * etc..., while with 0xe arguments, we will set the fans speeds. + * + * Queries (1 byte arg): + * --------------------- + * + * arg=0x01: read RPM fans status + * arg=0x02: read RPM fans setpoint + * arg=0x11: read PWM fans status + * arg=0x12: read PWM fans setpoint + * + * the "status" queries return the current speed while the "setpoint" ones + * return the programmed/target speed. It _seems_ that the result is a bit + * mask in the first byte of active/available fans, followed by 6 words (16 + * bits) containing the requested speed. + * + * Setpoint (14 bytes arg): + * ------------------------ + * + * first arg byte is 0 for RPM fans and 0x10 for PWM. Second arg byte is the + * mask of fans affected by the command. Followed by 6 words containing the + * setpoint value for selected fans in the mask (or 0 if mask value is 0) */ #define SMU_CMD_FAN_COMMAND 0x4a @@ -156,6 +192,14 @@ #define SMU_CMD_POWER_SHUTDOWN "SHUTDOWN" #define SMU_CMD_POWER_VOLTAGE_SLEW "VSLEW" +/* + * Read ADC sensors + * + * This command takes one byte of parameter: the sensor ID (or "reg" + * value in the device-tree) and returns a 16 bits value + */ +#define SMU_CMD_READ_ADC 0xd8 + /* Misc commands * * This command seem to be a grab bag of various things @@ -176,6 +220,25 @@ * Misc commands * * This command seem to be a grab bag of various things + * + * SMU_CMD_MISC_ee_GET_DATABLOCK_REC is used, among others, to + * transfer blocks of data from the SMU. So far, I've decrypted it's + * usage to retreive partition data. In order to do that, you have to + * break your transfer in "chunks" since that command cannot transfer + * more than a chunk at a time. The chunk size used by OF is 0xe bytes, + * but it seems that the darwin driver will let you do 0x1e bytes if + * your "PMU" version is >= 0x30. You can get the "PMU" version apparently + * either in the last 16 bits of property "smu-version-pmu" or as the 16 + * bytes at offset 1 of "smu-version-info" + * + * For each chunk, the command takes 7 bytes of arguments: + * byte 0: subcommand code (0x02) + * byte 1: 0x04 (always, I don't know what it means, maybe the address + * space to use or some other nicety. It's hard coded in OF) + * byte 2..5: SMU address of the chunk (big endian 32 bits) + * byte 6: size to transfer (up to max chunk size) + * + * The data is returned directly */ #define SMU_CMD_MISC_ee_COMMAND 0xee #define SMU_CMD_MISC_ee_GET_DATABLOCK_REC 0x02 @@ -353,21 +416,26 @@ __u8 flags; }; -/* - * 32 bits integers are usually encoded with 2x16 bits swapped, - * this demangles them + + /* + * demangle 16 and 32 bits integer in some SMU partitions + * (currently, afaik, this concerns only the FVT partition + * (0x12) */ -#define SMU_U32_MIX(x) ((((x) << 16) & 0xffff0000u) | (((x) >> 16) & 0xffffu)) +#define SMU_U16_MIX(x) le16_to_cpu(x); +#define SMU_U32_MIX(x) ((((x) & 0xff00ff00u) >> 8)|(((x) & 0x00ff00ffu) << 8)) + /* This is the definition of the SMU sdb-partition-0x12 table (called * CPU F/V/T operating points in Darwin). The definition for all those * SMU tables should be moved to some separate file */ -#define SMU_SDB_FVT_ID 0x12 +#define SMU_SDB_FVT_ID 0x12 struct smu_sdbp_fvt { __u32 sysclk; /* Base SysClk frequency in Hz for - * this operating point + * this operating point. Value need to + * be unmixed with SMU_U32_MIX() */ __u8 pad; __u8 maxtemp; /* Max temp. supported by this @@ -376,10 +444,73 @@ __u16 volts[3]; /* CPU core voltage for the 3 * PowerTune modes, a mode with - * 0V = not supported. + * 0V = not supported. Value need + * to be unmixed with SMU_U16_MIX() */ }; +/* This partition contains voltage & current sensor calibration + * informations + */ +#define SMU_SDB_CPUVCP_ID 0x21 + +struct smu_sdbp_cpuvcp { + __u16 volt_scale; /* u4.12 fixed point */ + __s16 volt_offset; /* s4.12 fixed point */ + __u16 curr_scale; /* u4.12 fixed point */ + __s16 curr_offset; /* s4.12 fixed point */ + __s32 power_quads[3]; /* s4.28 fixed point */ +}; + +/* This partition contains CPU thermal diode calibration + */ +#define SMU_SDB_CPUDIODE_ID 0x18 + +struct smu_sdbp_cpudiode { + __u16 m_value; /* u1.15 fixed point */ + __s16 b_value; /* s10.6 fixed point */ + +}; + +/* This partition contains Slots power calibration + */ +#define SMU_SDB_SLOTSPOW_ID 0x78 + +struct smu_sdbp_slotspow { + __u16 pow_scale; /* u4.12 fixed point */ + __s16 pow_offset; /* s4.12 fixed point */ +}; + +/* This partition contains machine specific version information about + * the sensor/control layout + */ +#define SMU_SDB_SENSORTREE_ID 0x25 + +struct smu_sdbp_sensortree { + u8 model_id; + u8 unknown[3]; +}; + +/* This partition contains CPU thermal control PID informations. So far + * only single CPU machines have been seen with an SMU, so we assume this + * carries only informations for those + */ +#define SMU_SDB_CPUPIDDATA_ID 0x17 + +struct smu_sdbp_cpupiddata { + u8 unknown1; + u8 target_temp_delta; + u8 unknown2; + u8 history_len; + s16 power_adj; + u16 max_power; + s32 gp,gr,gd; +}; + + +/* Other partitions without known structures */ +#define SMU_SDB_DEBUG_SWITCHES_ID 0x05 + #ifdef __KERNEL__ /* * This returns the pointer to an SMU "sdb" partition data or NULL @@ -423,8 +554,10 @@ __u32 cmdtype; #define SMU_CMDTYPE_SMU 0 /* SMU command */ #define SMU_CMDTYPE_WANTS_EVENTS 1 /* switch fd to events mode */ +#define SMU_CMDTYPE_GET_PARTITION 2 /* retreive an sdb partition */ __u8 cmd; /* SMU command byte */ + __u8 pad[3]; /* padding */ __u32 data_len; /* Lenght of data following */ }; Index: linux-work/arch/powerpc/kernel/prom.c =================================================================== --- linux-work.orig/arch/powerpc/kernel/prom.c 2005-11-07 12:45:42.000000000 +1100 +++ linux-work/arch/powerpc/kernel/prom.c 2005-11-07 12:46:45.000000000 +1100 @@ -1986,14 +1986,29 @@ /* * Add a property to a node */ -void prom_add_property(struct device_node* np, struct property* prop) +int prom_add_property(struct device_node* np, struct property* prop) { - struct property **next = &np->properties; + struct property **next; prop->next = NULL; - while (*next) + write_lock(&devtree_lock); + next = &np->properties; + while (*next) { + if (strcmp(prop->name, (*next)->name) == 0) { + /* duplicate ! don't insert it */ + write_unlock(&devtree_lock); + return -1; + } next = &(*next)->next; + } *next = prop; + write_unlock(&devtree_lock); + + /* try to add to proc as well if it was initialized */ + if (np->pde) + proc_device_tree_add_prop(np->pde, prop); + + return 0; } /* I quickly hacked that one, check against spec ! */ Index: linux-work/include/asm-powerpc/prom.h =================================================================== --- linux-work.orig/include/asm-powerpc/prom.h 2005-11-07 12:45:42.000000000 +1100 +++ linux-work/include/asm-powerpc/prom.h 2005-11-07 12:45:44.000000000 +1100 @@ -195,7 +195,7 @@ extern int prom_n_size_cells(struct device_node* np); extern int prom_n_intr_cells(struct device_node* np); extern void prom_get_irq_senses(unsigned char *senses, int off, int max); -extern void prom_add_property(struct device_node* np, struct property* prop); +extern int prom_add_property(struct device_node* np, struct property* prop); #ifdef CONFIG_PPC32 /* From benh at kernel.crashing.org Mon Nov 7 14:30:28 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 07 Nov 2005 14:30:28 +1100 Subject: [PATCH] ppc64: Thermal control for SMU based machiens Message-ID: <1131334230.5229.157.camel@gaston> This adds a new thermal control framework for PowerMac, along with the implementation for PowerMac8,1, PowerMac8,2 (iMac G5 rev 1 and 2), and PowerMac9,1 (latest single CPU desktop). In the future, I expect to move the older G5 thermal control to the new framework as well. Signed-off-by: Benjamin Herrenschmidt Index: linux-work/drivers/macintosh/smu.c =================================================================== --- linux-work.orig/drivers/macintosh/smu.c 2005-11-07 13:30:45.000000000 +1100 +++ linux-work/drivers/macintosh/smu.c 2005-11-07 13:30:46.000000000 +1100 @@ -590,6 +590,8 @@ sprintf(name, "smu-i2c-%02x", *reg); of_platform_device_create(np, name, &smu->of_dev->dev); } + if (device_is_compatible(np, "smu-sensors")) + of_platform_device_create(np, "smu-sensors", &smu->of_dev->dev); } } Index: linux-work/drivers/macintosh/Kconfig =================================================================== --- linux-work.orig/drivers/macintosh/Kconfig 2005-11-07 13:29:50.000000000 +1100 +++ linux-work/drivers/macintosh/Kconfig 2005-11-07 13:30:46.000000000 +1100 @@ -169,6 +169,25 @@ This driver provides thermostat and fan control for the desktop G5 machines. +config WINDFARM + tristate "New PowerMac thermal control infrastructure" + +config WINDFARM_PM81 + tristate "Support for thermal management on iMac G5" + depends on WINDFARM && I2C && CPU_FREQ_PMAC64 && PMAC_SMU + select I2C_PMAC_SMU + help + This driver provides thermal control for the iMacG5 + +config WINDFARM_PM91 + tristate "Support for thermal management on PowerMac9,1" + depends on WINDFARM && I2C && CPU_FREQ_PMAC64 && PMAC_SMU + select I2C_PMAC_SMU + help + This driver provides thermal control for the PowerMac9,1 + which is the recent (SMU based) single CPU desktop G5 + + config ANSLCD tristate "Support for ANS LCD display" depends on ADB_CUDA && PPC_PMAC Index: linux-work/drivers/macintosh/Makefile =================================================================== --- linux-work.orig/drivers/macintosh/Makefile 2005-11-07 13:29:50.000000000 +1100 +++ linux-work/drivers/macintosh/Makefile 2005-11-07 13:30:46.000000000 +1100 @@ -26,3 +26,12 @@ obj-$(CONFIG_THERM_PM72) += therm_pm72.o obj-$(CONFIG_THERM_WINDTUNNEL) += therm_windtunnel.o obj-$(CONFIG_THERM_ADT746X) += therm_adt746x.o +obj-$(CONFIG_WINDFARM) += windfarm_core.o +obj-$(CONFIG_WINDFARM_PM81) += windfarm_smu_controls.o \ + windfarm_smu_sensors.o \ + windfarm_lm75_sensor.o windfarm_pid.o \ + windfarm_cpufreq_clamp.o windfarm_pm81.o +obj-$(CONFIG_WINDFARM_PM91) += windfarm_smu_controls.o \ + windfarm_smu_sensors.o \ + windfarm_lm75_sensor.o windfarm_pid.o \ + windfarm_cpufreq_clamp.o windfarm_pm91.o Index: linux-work/drivers/macintosh/windfarm.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/drivers/macintosh/windfarm.h 2005-11-07 13:30:46.000000000 +1100 @@ -0,0 +1,122 @@ +#ifndef __WINDFARM_H__ +#define __WINDFARM_H__ + +#include +#include +#include +#include + +/* Display a 16.16 fixed point value */ +#define FIX32TOPRINT(f) ((f) >> 16),((((f) & 0xffff) * 1000) >> 16) + +/* + * Control objects + */ + +struct wf_control; + +struct wf_control_ops { + int (*set_value)(struct wf_control *ct, s32 val); + int (*get_value)(struct wf_control *ct, s32 *val); + s32 (*get_min)(struct wf_control *ct); + s32 (*get_max)(struct wf_control *ct); + void (*release)(struct wf_control *ct); + struct module *owner; +}; + +struct wf_control { + struct list_head link; + struct wf_control_ops *ops; + char *name; + int type; + struct kref ref; +}; + +#define WF_CONTROL_TYPE_GENERIC 0 +#define WF_CONTROL_RPM_FAN 1 +#define WF_CONTROL_PWM_FAN 2 + + +/* Note about lifetime rules: wf_register_control() will initialize + * the kref and wf_unregister_control will decrement it, thus the + * object creating/disposing a given control shouldn't assume it + * still exists after wf_unregister_control has been called. + * wf_find_control will inc the refcount for you + */ +extern int wf_register_control(struct wf_control *ct); +extern void wf_unregister_control(struct wf_control *ct); +extern struct wf_control * wf_find_control(const char *name); +extern int wf_get_control(struct wf_control *ct); +extern void wf_put_control(struct wf_control *ct); + +static inline int wf_control_set_max(struct wf_control *ct) +{ + s32 vmax = ct->ops->get_max(ct); + return ct->ops->set_value(ct, vmax); +} + +static inline int wf_control_set_min(struct wf_control *ct) +{ + s32 vmin = ct->ops->get_min(ct); + return ct->ops->set_value(ct, vmin); +} + +/* + * Sensor objects + */ + +struct wf_sensor; + +struct wf_sensor_ops { + int (*get_value)(struct wf_sensor *sr, s32 *val); + void (*release)(struct wf_sensor *sr); + struct module *owner; +}; + +struct wf_sensor { + struct list_head link; + struct wf_sensor_ops *ops; + char *name; + struct kref ref; +}; + +/* Same lifetime rules as controls */ +extern int wf_register_sensor(struct wf_sensor *sr); +extern void wf_unregister_sensor(struct wf_sensor *sr); +extern struct wf_sensor * wf_find_sensor(const char *name); +extern int wf_get_sensor(struct wf_sensor *sr); +extern void wf_put_sensor(struct wf_sensor *sr); + +/* For use by clients. Note that we are a bit racy here since + * notifier_block doesn't have a module owner field. I may fix + * it one day ... + * + * LOCKING NOTE ! + * + * All "events" except WF_EVENT_TICK are called with an internal mutex + * held which will deadlock if you call basically any core routine. + * So don't ! Just take note of the event and do your actual operations + * from the ticker. + * + */ +extern int wf_register_client(struct notifier_block *nb); +extern int wf_unregister_client(struct notifier_block *nb); + +/* Overtemp conditions. Those are refcounted */ +extern void wf_set_overtemp(void); +extern void wf_clear_overtemp(void); +extern int wf_is_overtemp(void); + +#define WF_EVENT_NEW_CONTROL 0 /* param is wf_control * */ +#define WF_EVENT_NEW_SENSOR 1 /* param is wf_sensor * */ +#define WF_EVENT_OVERTEMP 2 /* no param */ +#define WF_EVENT_NORMALTEMP 3 /* overtemp condition cleared */ +#define WF_EVENT_TICK 4 /* 1 second tick */ + +/* Note: If that driver gets more broad use, we could replace the + * simplistic overtemp bits with "environmental conditions". That + * could then be used to also notify of things like fan failure, + * case open, battery conditions, ... + */ + +#endif /* __WINDFARM_H__ */ Index: linux-work/drivers/macintosh/windfarm_core.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/drivers/macintosh/windfarm_core.c 2005-11-07 13:30:46.000000000 +1100 @@ -0,0 +1,427 @@ +/* + * Windfarm PowerMac thermal control. Core + * + * (c) Copyright 2005 Benjamin Herrenschmidt, IBM Corp. + * + * + * Released under the term of the GNU GPL v2. + * + * This core code tracks the list of sensors & controls, register + * clients, and holds the kernel thread used for control. + * + * TODO: + * + * Add some information about sensor/control type and data format to + * sensors/controls, and have the sysfs attribute stuff be moved + * generically here instead of hard coded in the platform specific + * driver as it us currently + * + * This however requires solving some annoying lifetime issues with + * sysfs which doesn't seem to have lifetime rules for struct attribute, + * I may have to create full features kobjects for every sensor/control + * instead which is a bit of an overkill imho + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "windfarm.h" + +#define VERSION "0.2" + +#undef DEBUG + +#ifdef DEBUG +#define DBG(args...) printk(args) +#else +#define DBG(args...) do { } while(0) +#endif + +static LIST_HEAD(wf_controls); +static LIST_HEAD(wf_sensors); +static DECLARE_MUTEX(wf_lock); +static struct notifier_block *wf_client_list; +static int wf_client_count; +static unsigned int wf_overtemp; +static unsigned int wf_overtemp_counter; +struct task_struct *wf_thread; + +/* + * Utilities & tick thread + */ + +static inline void wf_notify(int event, void *param) +{ + notifier_call_chain(&wf_client_list, event, param); +} + +int wf_critical_overtemp(void) +{ + static char * critical_overtemp_path = "/sbin/critical_overtemp"; + char *argv[] = { critical_overtemp_path, NULL }; + static char *envp[] = { "HOME=/", + "TERM=linux", + "PATH=/sbin:/usr/sbin:/bin:/usr/bin", + NULL }; + + return call_usermodehelper(critical_overtemp_path, argv, envp, 0); +} +EXPORT_SYMBOL_GPL(wf_critical_overtemp); + +static int wf_thread_func(void *data) +{ + unsigned long next, delay; + + next = jiffies; + + DBG("wf: thread started\n"); + + while(!kthread_should_stop()) { + try_to_freeze(); + + if (time_after_eq(jiffies, next)) { + wf_notify(WF_EVENT_TICK, NULL); + if (wf_overtemp) { + wf_overtemp_counter++; + /* 10 seconds overtemp, notify userland */ + if (wf_overtemp_counter > 10) + wf_critical_overtemp(); + /* 30 seconds, shutdown */ + if (wf_overtemp_counter > 30) { + printk(KERN_ERR "windfarm: Overtemp " + "for more than 30" + " seconds, shutting down\n"); + machine_power_off(); + } + } + next += HZ; + } + + delay = next - jiffies; + if (delay <= HZ) + schedule_timeout_interruptible(delay); + + /* there should be no signal, but oh well */ + if (signal_pending(current)) { + printk(KERN_WARNING "windfarm: thread got sigl !\n"); + break; + } + } + + DBG("wf: thread stopped\n"); + + return 0; +} + +static void wf_start_thread(void) +{ + wf_thread = kthread_run(wf_thread_func, NULL, "kwindfarm"); + if (IS_ERR(wf_thread)) { + printk(KERN_ERR "windfarm: failed to create thread,err %ld\n", + PTR_ERR(wf_thread)); + wf_thread = NULL; + } +} + + +static void wf_stop_thread(void) +{ + if (wf_thread) + kthread_stop(wf_thread); + wf_thread = NULL; +} + +/* + * Controls + */ + +static void wf_control_release(struct kref *kref) +{ + struct wf_control *ct = container_of(kref, struct wf_control, ref); + + DBG("wf: Deleting control %s\n", ct->name); + + if (ct->ops && ct->ops->release) + ct->ops->release(ct); + else + kfree(ct); +} + +int wf_register_control(struct wf_control *new_ct) +{ + struct wf_control *ct; + + down(&wf_lock); + list_for_each_entry(ct, &wf_controls, link) { + if (!strcmp(ct->name, new_ct->name)) { + printk(KERN_WARNING "windfarm: trying to register" + " duplicate control %s\n", ct->name); + up(&wf_lock); + return -EEXIST; + } + } + kref_init(&new_ct->ref); + list_add(&new_ct->link, &wf_controls); + + DBG("wf: Registered control %s\n", new_ct->name); + + wf_notify(WF_EVENT_NEW_CONTROL, new_ct); + up(&wf_lock); + + return 0; +} +EXPORT_SYMBOL_GPL(wf_register_control); + +void wf_unregister_control(struct wf_control *ct) +{ + down(&wf_lock); + list_del(&ct->link); + up(&wf_lock); + + DBG("wf: Unregistered control %s\n", ct->name); + + kref_put(&ct->ref, wf_control_release); +} +EXPORT_SYMBOL_GPL(wf_unregister_control); + +struct wf_control * wf_find_control(const char *name) +{ + struct wf_control *ct; + + down(&wf_lock); + list_for_each_entry(ct, &wf_controls, link) { + if (!strcmp(ct->name, name)) { + if (wf_get_control(ct)) + ct = NULL; + up(&wf_lock); + return ct; + } + } + up(&wf_lock); + return NULL; +} +EXPORT_SYMBOL_GPL(wf_find_control); + +int wf_get_control(struct wf_control *ct) +{ + if (!try_module_get(ct->ops->owner)) + return -ENODEV; + kref_get(&ct->ref); + return 0; +} +EXPORT_SYMBOL_GPL(wf_get_control); + +void wf_put_control(struct wf_control *ct) +{ + struct module *mod = ct->ops->owner; + kref_put(&ct->ref, wf_control_release); + module_put(mod); +} +EXPORT_SYMBOL_GPL(wf_put_control); + + +/* + * Sensors + */ + + +static void wf_sensor_release(struct kref *kref) +{ + struct wf_sensor *sr = container_of(kref, struct wf_sensor, ref); + + DBG("wf: Deleting sensor %s\n", sr->name); + + if (sr->ops && sr->ops->release) + sr->ops->release(sr); + else + kfree(sr); +} + +int wf_register_sensor(struct wf_sensor *new_sr) +{ + struct wf_sensor *sr; + + down(&wf_lock); + list_for_each_entry(sr, &wf_sensors, link) { + if (!strcmp(sr->name, new_sr->name)) { + printk(KERN_WARNING "windfarm: trying to register" + " duplicate sensor %s\n", sr->name); + up(&wf_lock); + return -EEXIST; + } + } + kref_init(&new_sr->ref); + list_add(&new_sr->link, &wf_sensors); + + DBG("wf: Registered sensor %s\n", new_sr->name); + + wf_notify(WF_EVENT_NEW_SENSOR, new_sr); + up(&wf_lock); + + return 0; +} +EXPORT_SYMBOL_GPL(wf_register_sensor); + +void wf_unregister_sensor(struct wf_sensor *sr) +{ + down(&wf_lock); + list_del(&sr->link); + up(&wf_lock); + + DBG("wf: Unregistered sensor %s\n", sr->name); + + wf_put_sensor(sr); +} +EXPORT_SYMBOL_GPL(wf_unregister_sensor); + +struct wf_sensor * wf_find_sensor(const char *name) +{ + struct wf_sensor *sr; + + down(&wf_lock); + list_for_each_entry(sr, &wf_sensors, link) { + if (!strcmp(sr->name, name)) { + if (wf_get_sensor(sr)) + sr = NULL; + up(&wf_lock); + return sr; + } + } + up(&wf_lock); + return NULL; +} +EXPORT_SYMBOL_GPL(wf_find_sensor); + +int wf_get_sensor(struct wf_sensor *sr) +{ + if (!try_module_get(sr->ops->owner)) + return -ENODEV; + kref_get(&sr->ref); + return 0; +} +EXPORT_SYMBOL_GPL(wf_get_sensor); + +void wf_put_sensor(struct wf_sensor *sr) +{ + struct module *mod = sr->ops->owner; + kref_put(&sr->ref, wf_sensor_release); + module_put(mod); +} +EXPORT_SYMBOL_GPL(wf_put_sensor); + + +/* + * Client & notification + */ + +int wf_register_client(struct notifier_block *nb) +{ + int rc; + struct wf_control *ct; + struct wf_sensor *sr; + + down(&wf_lock); + rc = notifier_chain_register(&wf_client_list, nb); + if (rc != 0) + goto bail; + wf_client_count++; + list_for_each_entry(ct, &wf_controls, link) + wf_notify(WF_EVENT_NEW_CONTROL, ct); + list_for_each_entry(sr, &wf_sensors, link) + wf_notify(WF_EVENT_NEW_SENSOR, sr); + if (wf_client_count == 1) + wf_start_thread(); + bail: + up(&wf_lock); + return rc; +} +EXPORT_SYMBOL_GPL(wf_register_client); + +int wf_unregister_client(struct notifier_block *nb) +{ + down(&wf_lock); + notifier_chain_unregister(&wf_client_list, nb); + wf_client_count++; + if (wf_client_count == 0) + wf_stop_thread(); + up(&wf_lock); + + return 0; +} +EXPORT_SYMBOL_GPL(wf_unregister_client); + +void wf_set_overtemp(void) +{ + down(&wf_lock); + wf_overtemp++; + if (wf_overtemp == 1) { + printk(KERN_WARNING "windfarm: Overtemp condition detected !\n"); + wf_overtemp_counter = 0; + wf_notify(WF_EVENT_OVERTEMP, NULL); + } + up(&wf_lock); +} +EXPORT_SYMBOL_GPL(wf_set_overtemp); + +void wf_clear_overtemp(void) +{ + down(&wf_lock); + WARN_ON(wf_overtemp == 0); + if (wf_overtemp == 0) { + up(&wf_lock); + return; + } + wf_overtemp--; + if (wf_overtemp == 0) { + printk(KERN_WARNING "windfarm: Overtemp condition cleared !\n"); + wf_notify(WF_EVENT_NORMALTEMP, NULL); + } + up(&wf_lock); +} +EXPORT_SYMBOL_GPL(wf_clear_overtemp); + +int wf_is_overtemp(void) +{ + return (wf_overtemp != 0); +} +EXPORT_SYMBOL_GPL(wf_is_overtemp); + +static struct platform_device wf_platform_device = { + .name = "windfarm", +}; + +static int __init windfarm_core_init(void) +{ + DBG("wf: core loaded\n"); + + platform_device_register(&wf_platform_device); + return 0; +} + +static void __exit windfarm_core_exit(void) +{ + BUG_ON(wf_client_count != 0); + + DBG("wf: core unloaded\n"); + + platform_device_unregister(&wf_platform_device); +} + + +module_init(windfarm_core_init); +module_exit(windfarm_core_exit); + +MODULE_AUTHOR("Benjamin Herrenschmidt "); +MODULE_DESCRIPTION("Core component of PowerMac thermal control"); +MODULE_LICENSE("GPL"); + Index: linux-work/drivers/macintosh/windfarm_smu_controls.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/drivers/macintosh/windfarm_smu_controls.c 2005-11-07 13:30:46.000000000 +1100 @@ -0,0 +1,274 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "windfarm.h" + +#define VERSION "0.3" + +#undef DEBUG + +#ifdef DEBUG +#define DBG(args...) printk(args) +#else +#define DBG(args...) do { } while(0) +#endif + +/* + * SMU fans control object + */ + +static LIST_HEAD(smu_fans); + +struct smu_fan_control { + struct list_head link; + int fan_type; /* 0 = rpm, 1 = pwm */ + u32 reg; /* index in SMU */ + s32 value; /* current value */ + s32 min, max; /* min/max values */ + struct wf_control ctrl; +}; +#define to_smu_fan(c) container_of(c, struct smu_fan_control, ctrl) + +static int smu_set_fan(int pwm, u8 id, u16 value) +{ + struct smu_cmd cmd; + u8 buffer[16]; + DECLARE_COMPLETION(comp); + int rc; + + /* Fill SMU command structure */ + cmd.cmd = SMU_CMD_FAN_COMMAND; + cmd.data_len = 14; + cmd.reply_len = 16; + cmd.data_buf = cmd.reply_buf = buffer; + cmd.status = 0; + cmd.done = smu_done_complete; + cmd.misc = ∁ + + /* Fill argument buffer */ + memset(buffer, 0, 16); + buffer[0] = pwm ? 0x10 : 0x00; + buffer[1] = 0x01 << id; + *((u16 *)&buffer[2 + id * 2]) = value; + + rc = smu_queue_cmd(&cmd); + if (rc) + return rc; + wait_for_completion(&comp); + return cmd.status; +} + +static void smu_fan_release(struct wf_control *ct) +{ + struct smu_fan_control *fct = to_smu_fan(ct); + + kfree(fct); +} + +static int smu_fan_set(struct wf_control *ct, s32 value) +{ + struct smu_fan_control *fct = to_smu_fan(ct); + + if (value < fct->min) + value = fct->min; + if (value > fct->max) + value = fct->max; + fct->value = value; + + return smu_set_fan(fct->fan_type, fct->reg, value); +} + +static int smu_fan_get(struct wf_control *ct, s32 *value) +{ + struct smu_fan_control *fct = to_smu_fan(ct); + *value = fct->value; /* todo: read from SMU */ + return 0; +} + +static s32 smu_fan_min(struct wf_control *ct) +{ + struct smu_fan_control *fct = to_smu_fan(ct); + return fct->min; +} + +static s32 smu_fan_max(struct wf_control *ct) +{ + struct smu_fan_control *fct = to_smu_fan(ct); + return fct->max; +} + +static struct wf_control_ops smu_fan_ops = { + .set_value = smu_fan_set, + .get_value = smu_fan_get, + .get_min = smu_fan_min, + .get_max = smu_fan_max, + .release = smu_fan_release, + .owner = THIS_MODULE, +}; + +static struct smu_fan_control *smu_fan_create(struct device_node *node, + int pwm_fan) +{ + struct smu_fan_control *fct; + s32 *v; u32 *reg; + char *l; + + fct = kmalloc(sizeof(struct smu_fan_control), GFP_KERNEL); + if (fct == NULL) + return NULL; + fct->ctrl.ops = &smu_fan_ops; + l = (char *)get_property(node, "location", NULL); + if (l == NULL) + goto fail; + + fct->fan_type = pwm_fan; + fct->ctrl.type = pwm_fan ? WF_CONTROL_PWM_FAN : WF_CONTROL_RPM_FAN; + + /* We use the name & location here the same way we do for SMU sensors, + * see the comment in windfarm_smu_sensors.c. The locations are a bit + * less consistent here between the iMac and the desktop models, but + * that is good enough for our needs for now at least. + * + * One problem though is that Apple seem to be inconsistent with case + * and the kernel doesn't have strcasecmp =P + */ + + fct->ctrl.name = NULL; + + /* Names used on desktop models */ + if (!strcmp(l, "Rear Fan 0") || !strcmp(l, "Rear Fan") || + !strcmp(l, "Rear fan 0") || !strcmp(l, "Rear fan")) + fct->ctrl.name = "cpu-rear-fan-0"; + else if (!strcmp(l, "Rear Fan 1") || !strcmp(l, "Rear fan 1")) + fct->ctrl.name = "cpu-rear-fan-1"; + else if (!strcmp(l, "Front Fan 0") || !strcmp(l, "Front Fan") || + !strcmp(l, "Front fan 0") || !strcmp(l, "Front fan")) + fct->ctrl.name = "cpu-front-fan-0"; + else if (!strcmp(l, "Front Fan 1") || !strcmp(l, "Front fan 1")) + fct->ctrl.name = "cpu-front-fan-1"; + else if (!strcmp(l, "Slots Fan") || !strcmp(l, "Slots fan")) + fct->ctrl.name = "slots-fan"; + else if (!strcmp(l, "Drive Bay") || !strcmp(l, "Drive bay")) + fct->ctrl.name = "drive-bay-fan"; + + /* Names used on iMac models */ + if (!strcmp(l, "System Fan") || !strcmp(l, "System fan")) + fct->ctrl.name = "system-fan"; + else if (!strcmp(l, "CPU Fan") || !strcmp(l, "CPU fan")) + fct->ctrl.name = "cpu-fan"; + else if (!strcmp(l, "Hard Drive") || !strcmp(l, "Hard drive")) + fct->ctrl.name = "drive-bay-fan"; + + /* Unrecognized fan, bail out */ + if (fct->ctrl.name == NULL) + goto fail; + + /* Get min & max values*/ + v = (s32 *)get_property(node, "min-value", NULL); + if (v == NULL) + goto fail; + fct->min = *v; + v = (s32 *)get_property(node, "max-value", NULL); + if (v == NULL) + goto fail; + fct->max = *v; + + /* Get "reg" value */ + reg = (u32 *)get_property(node, "reg", NULL); + if (reg == NULL) + goto fail; + fct->reg = *reg; + + if (wf_register_control(&fct->ctrl)) + goto fail; + + return fct; + fail: + kfree(fct); + return NULL; +} + + +static int __init smu_controls_init(void) +{ + struct device_node *smu, *fans, *fan; + + if (!smu_present()) + return -ENODEV; + + smu = of_find_node_by_type(NULL, "smu"); + if (smu == NULL) + return -ENODEV; + + /* Look for RPM fans */ + for (fans = NULL; (fans = of_get_next_child(smu, fans)) != NULL;) + if (!strcmp(fans->name, "rpm-fans")) + break; + for (fan = NULL; + fans && (fan = of_get_next_child(fans, fan)) != NULL;) { + struct smu_fan_control *fct; + + fct = smu_fan_create(fan, 0); + if (fct == NULL) { + printk(KERN_WARNING "windfarm: Failed to create SMU " + "RPM fan %s\n", fan->name); + continue; + } + list_add(&fct->link, &smu_fans); + } + of_node_put(fans); + + + /* Look for PWM fans */ + for (fans = NULL; (fans = of_get_next_child(smu, fans)) != NULL;) + if (!strcmp(fans->name, "pwm-fans")) + break; + for (fan = NULL; + fans && (fan = of_get_next_child(fans, fan)) != NULL;) { + struct smu_fan_control *fct; + + fct = smu_fan_create(fan, 1); + if (fct == NULL) { + printk(KERN_WARNING "windfarm: Failed to create SMU " + "PWM fan %s\n", fan->name); + continue; + } + list_add(&fct->link, &smu_fans); + } + of_node_put(fans); + of_node_put(smu); + + return 0; +} + +static void __exit smu_controls_exit(void) +{ + struct smu_fan_control *fct; + + while (!list_empty(&smu_fans)) { + fct = list_entry(smu_fans.next, struct smu_fan_control, link); + list_del(&fct->link); + wf_unregister_control(&fct->ctrl); + } +} + + +module_init(smu_controls_init); +module_exit(smu_controls_exit); + +MODULE_AUTHOR("Benjamin Herrenschmidt "); +MODULE_DESCRIPTION("SMU control objects for PowerMacs thermal control"); +MODULE_LICENSE("GPL"); + Index: linux-work/drivers/macintosh/windfarm_smu_sensors.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/drivers/macintosh/windfarm_smu_sensors.c 2005-11-07 13:30:46.000000000 +1100 @@ -0,0 +1,471 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "windfarm.h" + +#define VERSION "0.2" + +#undef DEBUG + +#ifdef DEBUG +#define DBG(args...) printk(args) +#else +#define DBG(args...) do { } while(0) +#endif + +/* + * Various SMU "partitions" calibration objects for which we + * keep pointers here for use by bits & pieces of the driver + */ +static struct smu_sdbp_cpuvcp *cpuvcp; +static int cpuvcp_version; +static struct smu_sdbp_cpudiode *cpudiode; +static struct smu_sdbp_slotspow *slotspow; +static u8 *debugswitches; + +/* + * SMU basic sensors objects + */ + +static LIST_HEAD(smu_ads); + +struct smu_ad_sensor { + struct list_head link; + u32 reg; /* index in SMU */ + struct wf_sensor sens; +}; +#define to_smu_ads(c) container_of(c, struct smu_ad_sensor, sens) + +static void smu_ads_release(struct wf_sensor *sr) +{ + struct smu_ad_sensor *ads = to_smu_ads(sr); + + kfree(ads); +} + +static int smu_read_adc(u8 id, s32 *value) +{ + struct smu_simple_cmd cmd; + DECLARE_COMPLETION(comp); + int rc; + + rc = smu_queue_simple(&cmd, SMU_CMD_READ_ADC, 1, + smu_done_complete, &comp, id); + if (rc) + return rc; + wait_for_completion(&comp); + if (cmd.cmd.status != 0) + return cmd.cmd.status; + if (cmd.cmd.reply_len != 2) { + printk(KERN_ERR "winfarm: read ADC 0x%x returned %d bytes !\n", + id, cmd.cmd.reply_len); + return -EIO; + } + *value = *((u16 *)cmd.buffer); + return 0; +} + +static int smu_cputemp_get(struct wf_sensor *sr, s32 *value) +{ + struct smu_ad_sensor *ads = to_smu_ads(sr); + int rc; + s32 val; + s64 scaled; + + rc = smu_read_adc(ads->reg, &val); + if (rc) { + printk(KERN_ERR "windfarm: read CPU temp failed, err %d\n", + rc); + return rc; + } + + /* Ok, we have to scale & adjust, taking units into account */ + scaled = (s64)(((u64)val) * (u64)cpudiode->m_value); + scaled >>= 3; + scaled += ((s64)cpudiode->b_value) << 9; + *value = (s32)(scaled << 1); + + return 0; +} + +static int smu_cpuamp_get(struct wf_sensor *sr, s32 *value) +{ + struct smu_ad_sensor *ads = to_smu_ads(sr); + s32 val, scaled; + int rc; + + rc = smu_read_adc(ads->reg, &val); + if (rc) { + printk(KERN_ERR "windfarm: read CPU current failed, err %d\n", + rc); + return rc; + } + + /* Ok, we have to scale & adjust, taking units into account */ + scaled = (s32)(val * (u32)cpuvcp->curr_scale); + scaled += (s32)cpuvcp->curr_offset; + *value = scaled << 4; + + return 0; +} + +static int smu_cpuvolt_get(struct wf_sensor *sr, s32 *value) +{ + struct smu_ad_sensor *ads = to_smu_ads(sr); + s32 val, scaled; + int rc; + + rc = smu_read_adc(ads->reg, &val); + if (rc) { + printk(KERN_ERR "windfarm: read CPU voltage failed, err %d\n", + rc); + return rc; + } + + /* Ok, we have to scale & adjust, taking units into account */ + scaled = (s32)(val * (u32)cpuvcp->volt_scale); + scaled += (s32)cpuvcp->volt_offset; + *value = scaled << 4; + + return 0; +} + +static int smu_slotspow_get(struct wf_sensor *sr, s32 *value) +{ + struct smu_ad_sensor *ads = to_smu_ads(sr); + s32 val, scaled; + int rc; + + rc = smu_read_adc(ads->reg, &val); + if (rc) { + printk(KERN_ERR "windfarm: read slots power failed, err %d\n", + rc); + return rc; + } + + /* Ok, we have to scale & adjust, taking units into account */ + scaled = (s32)(val * (u32)slotspow->pow_scale); + scaled += (s32)slotspow->pow_offset; + *value = scaled << 4; + + return 0; +} + + +static struct wf_sensor_ops smu_cputemp_ops = { + .get_value = smu_cputemp_get, + .release = smu_ads_release, + .owner = THIS_MODULE, +}; +static struct wf_sensor_ops smu_cpuamp_ops = { + .get_value = smu_cpuamp_get, + .release = smu_ads_release, + .owner = THIS_MODULE, +}; +static struct wf_sensor_ops smu_cpuvolt_ops = { + .get_value = smu_cpuvolt_get, + .release = smu_ads_release, + .owner = THIS_MODULE, +}; +static struct wf_sensor_ops smu_slotspow_ops = { + .get_value = smu_slotspow_get, + .release = smu_ads_release, + .owner = THIS_MODULE, +}; + + +static struct smu_ad_sensor *smu_ads_create(struct device_node *node) +{ + struct smu_ad_sensor *ads; + char *c, *l; + u32 *v; + + ads = kmalloc(sizeof(struct smu_ad_sensor), GFP_KERNEL); + if (ads == NULL) + return NULL; + c = (char *)get_property(node, "device_type", NULL); + l = (char *)get_property(node, "location", NULL); + if (c == NULL || l == NULL) + goto fail; + + /* We currently pick the sensors based on the OF name and location + * properties, while Darwin uses the sensor-id's. + * The problem with the IDs is that they are model specific while it + * looks like apple has been doing a reasonably good job at keeping + * the names and locations consistents so I'll stick with the names + * and locations for now. + */ + if (!strcmp(c, "temp-sensor") && + !strcmp(l, "CPU T-Diode")) { + ads->sens.ops = &smu_cputemp_ops; + ads->sens.name = "cpu-temp"; + } else if (!strcmp(c, "current-sensor") && + !strcmp(l, "CPU Current")) { + ads->sens.ops = &smu_cpuamp_ops; + ads->sens.name = "cpu-current"; + } else if (!strcmp(c, "voltage-sensor") && + !strcmp(l, "CPU Voltage")) { + ads->sens.ops = &smu_cpuvolt_ops; + ads->sens.name = "cpu-voltage"; + } else if (!strcmp(c, "power-sensor") && + !strcmp(l, "Slots Power")) { + ads->sens.ops = &smu_slotspow_ops; + ads->sens.name = "slots-power"; + if (slotspow == NULL) { + DBG("wf: slotspow partition (%02x) not found\n", + SMU_SDB_SLOTSPOW_ID); + goto fail; + } + } else + goto fail; + + v = (u32 *)get_property(node, "reg", NULL); + if (v == NULL) + goto fail; + ads->reg = *v; + + if (wf_register_sensor(&ads->sens)) + goto fail; + return ads; + fail: + kfree(ads); + return NULL; +} + +/* + * SMU Power combo sensor object + */ + +struct smu_cpu_power_sensor { + struct list_head link; + struct wf_sensor *volts; + struct wf_sensor *amps; + int fake_volts : 1; + int quadratic : 1; + struct wf_sensor sens; +}; +#define to_smu_cpu_power(c) container_of(c, struct smu_cpu_power_sensor, sens) + +static struct smu_cpu_power_sensor *smu_cpu_power; + +static void smu_cpu_power_release(struct wf_sensor *sr) +{ + struct smu_cpu_power_sensor *pow = to_smu_cpu_power(sr); + + if (pow->volts) + wf_put_sensor(pow->volts); + if (pow->amps) + wf_put_sensor(pow->amps); + kfree(pow); +} + +static int smu_cpu_power_get(struct wf_sensor *sr, s32 *value) +{ + struct smu_cpu_power_sensor *pow = to_smu_cpu_power(sr); + s32 volts, amps, power; + u64 tmps, tmpa, tmpb; + int rc; + + rc = pow->amps->ops->get_value(pow->amps, &s); + if (rc) + return rc; + + if (pow->fake_volts) { + *value = amps * 12 - 0x30000; + return 0; + } + + rc = pow->volts->ops->get_value(pow->volts, &volts); + if (rc) + return rc; + + power = (s32)((((u64)volts) * ((u64)amps)) >> 16); + if (!pow->quadratic) { + *value = power; + return 0; + } + tmps = (((u64)power) * ((u64)power)) >> 16; + tmpa = ((u64)cpuvcp->power_quads[0]) * tmps; + tmpb = ((u64)cpuvcp->power_quads[1]) * ((u64)power); + *value = (tmpa >> 28) + (tmpb >> 28) + (cpuvcp->power_quads[2] >> 12); + + return 0; +} + +static struct wf_sensor_ops smu_cpu_power_ops = { + .get_value = smu_cpu_power_get, + .release = smu_cpu_power_release, + .owner = THIS_MODULE, +}; + + +static struct smu_cpu_power_sensor * +smu_cpu_power_create(struct wf_sensor *volts, struct wf_sensor *amps) +{ + struct smu_cpu_power_sensor *pow; + + pow = kmalloc(sizeof(struct smu_cpu_power_sensor), GFP_KERNEL); + if (pow == NULL) + return NULL; + pow->sens.ops = &smu_cpu_power_ops; + pow->sens.name = "cpu-power"; + + wf_get_sensor(volts); + pow->volts = volts; + wf_get_sensor(amps); + pow->amps = amps; + + /* Some early machines need a faked voltage */ + if (debugswitches && ((*debugswitches) & 0x80)) { + printk(KERN_INFO "windfarm: CPU Power sensor using faked" + " voltage !\n"); + pow->fake_volts = 1; + } else + pow->fake_volts = 0; + + /* Try to use quadratic transforms on PowerMac8,1 and 9,1 for now, + * I yet have to figure out what's up with 8,2 and will have to + * adjust for later, unless we can 100% trust the SDB partition... + */ + if ((machine_is_compatible("PowerMac8,1") || + machine_is_compatible("PowerMac8,2") || + machine_is_compatible("PowerMac9,1")) && + cpuvcp_version >= 2) { + pow->quadratic = 1; + DBG("windfarm: CPU Power using quadratic transform\n"); + } else + pow->quadratic = 0; + + if (wf_register_sensor(&pow->sens)) + goto fail; + return pow; + fail: + kfree(pow); + return NULL; +} + +static int smu_fetch_param_partitions(void) +{ + struct smu_sdbp_header *hdr; + + /* Get CPU voltage/current/power calibration data */ + hdr = smu_get_sdb_partition(SMU_SDB_CPUVCP_ID, NULL); + if (hdr == NULL) { + DBG("wf: cpuvcp partition (%02x) not found\n", + SMU_SDB_CPUVCP_ID); + return -ENODEV; + } + cpuvcp = (struct smu_sdbp_cpuvcp *)&hdr[1]; + /* Keep version around */ + cpuvcp_version = hdr->version; + + /* Get CPU diode calibration data */ + hdr = smu_get_sdb_partition(SMU_SDB_CPUDIODE_ID, NULL); + if (hdr == NULL) { + DBG("wf: cpudiode partition (%02x) not found\n", + SMU_SDB_CPUDIODE_ID); + return -ENODEV; + } + cpudiode = (struct smu_sdbp_cpudiode *)&hdr[1]; + + /* Get slots power calibration data if any */ + hdr = smu_get_sdb_partition(SMU_SDB_SLOTSPOW_ID, NULL); + if (hdr != NULL) + slotspow = (struct smu_sdbp_slotspow *)&hdr[1]; + + /* Get debug switches if any */ + hdr = smu_get_sdb_partition(SMU_SDB_DEBUG_SWITCHES_ID, NULL); + if (hdr != NULL) + debugswitches = (u8 *)&hdr[1]; + + return 0; +} + +static int __init smu_sensors_init(void) +{ + struct device_node *smu, *sensors, *s; + struct smu_ad_sensor *volt_sensor = NULL, *curr_sensor = NULL; + int rc; + + if (!smu_present()) + return -ENODEV; + + /* Get parameters partitions */ + rc = smu_fetch_param_partitions(); + if (rc) + return rc; + + smu = of_find_node_by_type(NULL, "smu"); + if (smu == NULL) + return -ENODEV; + + /* Look for sensors subdir */ + for (sensors = NULL; + (sensors = of_get_next_child(smu, sensors)) != NULL;) + if (!strcmp(sensors->name, "sensors")) + break; + + of_node_put(smu); + + /* Create basic sensors */ + for (s = NULL; + sensors && (s = of_get_next_child(sensors, s)) != NULL;) { + struct smu_ad_sensor *ads; + + ads = smu_ads_create(s); + if (ads == NULL) + continue; + list_add(&ads->link, &smu_ads); + /* keep track of cpu voltage & current */ + if (!strcmp(ads->sens.name, "cpu-voltage")) + volt_sensor = ads; + else if (!strcmp(ads->sens.name, "cpu-current")) + curr_sensor = ads; + } + + of_node_put(sensors); + + /* Create CPU power sensor if possible */ + if (volt_sensor && curr_sensor) + smu_cpu_power = smu_cpu_power_create(&volt_sensor->sens, + &curr_sensor->sens); + + return 0; +} + +static void __exit smu_sensors_exit(void) +{ + struct smu_ad_sensor *ads; + + /* dispose of power sensor */ + if (smu_cpu_power) + wf_unregister_sensor(&smu_cpu_power->sens); + + /* dispose of basic sensors */ + while (!list_empty(&smu_ads)) { + ads = list_entry(smu_ads.next, struct smu_ad_sensor, link); + list_del(&ads->link); + wf_unregister_sensor(&ads->sens); + } +} + + +module_init(smu_sensors_init); +module_exit(smu_sensors_exit); + +MODULE_AUTHOR("Benjamin Herrenschmidt "); +MODULE_DESCRIPTION("SMU sensor objects for PowerMacs thermal control"); +MODULE_LICENSE("GPL"); + Index: linux-work/drivers/macintosh/windfarm_lm75_sensor.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/drivers/macintosh/windfarm_lm75_sensor.c 2005-11-07 13:30:46.000000000 +1100 @@ -0,0 +1,255 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "windfarm.h" + +#define VERSION "0.1" + +#undef DEBUG + +#ifdef DEBUG +#define DBG(args...) printk(args) +#else +#define DBG(args...) do { } while(0) +#endif + +struct wf_lm75_sensor { + int ds1775 : 1; + int inited : 1; + struct i2c_client i2c; + struct wf_sensor sens; +}; +#define wf_to_lm75(c) container_of(c, struct wf_lm75_sensor, sens) +#define i2c_to_lm75(c) container_of(c, struct wf_lm75_sensor, i2c) + +static int wf_lm75_attach(struct i2c_adapter *adapter); +static int wf_lm75_detach(struct i2c_client *client); + +static struct i2c_driver wf_lm75_driver = { + .owner = THIS_MODULE, + .name = "wf_lm75", + .flags = I2C_DF_NOTIFY, + .attach_adapter = wf_lm75_attach, + .detach_client = wf_lm75_detach, +}; + +static int wf_lm75_get(struct wf_sensor *sr, s32 *value) +{ + struct wf_lm75_sensor *lm = wf_to_lm75(sr); + s32 data; + + if (lm->i2c.adapter == NULL) + return -ENODEV; + + /* Init chip if necessary */ + if (!lm->inited) { + u8 cfg_new, cfg = (u8)i2c_smbus_read_byte_data(&lm->i2c, 1); + + DBG("wf_lm75: Initializing %s, cfg was: %02x\n", + sr->name, cfg); + + /* clear shutdown bit, keep other settings as left by + * the firmware for now + */ + cfg_new = cfg & ~0x01; + i2c_smbus_write_byte_data(&lm->i2c, 1, cfg_new); + lm->inited = 1; + + /* If we just powered it up, let's wait 200 ms */ + msleep(200); + } + + /* Read temperature register */ + data = (s32)le16_to_cpu(i2c_smbus_read_word_data(&lm->i2c, 0)); + data <<= 8; + *value = data; + + return 0; +} + +static void wf_lm75_release(struct wf_sensor *sr) +{ + struct wf_lm75_sensor *lm = wf_to_lm75(sr); + + /* check if client is registered and detach from i2c */ + if (lm->i2c.adapter) { + i2c_detach_client(&lm->i2c); + lm->i2c.adapter = NULL; + } + + kfree(lm); +} + +static struct wf_sensor_ops wf_lm75_ops = { + .get_value = wf_lm75_get, + .release = wf_lm75_release, + .owner = THIS_MODULE, +}; + +static struct wf_lm75_sensor *wf_lm75_create(struct i2c_adapter *adapter, + u8 addr, int ds1775, + const char *loc) +{ + struct wf_lm75_sensor *lm; + + DBG("wf_lm75: creating %s device at address 0x%02x\n", + ds1775 ? "ds1775" : "lm75", addr); + + lm = kmalloc(sizeof(struct wf_lm75_sensor), GFP_KERNEL); + if (lm == NULL) + return NULL; + memset(lm, 0, sizeof(struct wf_lm75_sensor)); + + /* Usual rant about sensor names not beeing very consistent in + * the device-tree, oh well ... + * Add more entries below as you deal with more setups + */ + if (!strcmp(loc, "Hard drive") || !strcmp(loc, "DRIVE BAY")) + lm->sens.name = "hd-temp"; + else + goto fail; + + lm->inited = 0; + lm->sens.ops = &wf_lm75_ops; + lm->ds1775 = ds1775; + lm->i2c.addr = (addr >> 1) & 0x7f; + lm->i2c.adapter = adapter; + lm->i2c.driver = &wf_lm75_driver; + strncpy(lm->i2c.name, lm->sens.name, I2C_NAME_SIZE-1); + + if (i2c_attach_client(&lm->i2c)) { + printk(KERN_ERR "windfarm: failed to attach %s %s to i2c\n", + ds1775 ? "ds1775" : "lm75", lm->i2c.name); + goto fail; + } + + if (wf_register_sensor(&lm->sens)) { + i2c_detach_client(&lm->i2c); + goto fail; + } + + return lm; + fail: + kfree(lm); + return NULL; +} + +static int wf_lm75_attach(struct i2c_adapter *adapter) +{ + u8 bus_id; + struct device_node *smu, *bus, *dev; + + /* We currently only deal with LM75's hanging off the SMU + * i2c busses. If we extend that driver to other/older + * machines, we should split this function into SMU-i2c, + * keywest-i2c, PMU-i2c, ... + */ + + DBG("wf_lm75: adapter %s detected\n", adapter->name); + + if (strncmp(adapter->name, "smu-i2c-", 8) != 0) + return 0; + smu = of_find_node_by_type(NULL, "smu"); + if (smu == NULL) + return 0; + + /* Look for the bus in the device-tree */ + bus_id = (u8)simple_strtoul(adapter->name + 8, NULL, 16); + + DBG("wf_lm75: bus ID is %x\n", bus_id); + + /* Look for sensors subdir */ + for (bus = NULL; + (bus = of_get_next_child(smu, bus)) != NULL;) { + u32 *reg; + + if (strcmp(bus->name, "i2c")) + continue; + reg = (u32 *)get_property(bus, "reg", NULL); + if (reg == NULL) + continue; + if (bus_id == *reg) + break; + } + of_node_put(smu); + if (bus == NULL) { + printk(KERN_WARNING "windfarm: SMU i2c bus 0x%x not found" + " in device-tree !\n", bus_id); + return 0; + } + + DBG("wf_lm75: bus found, looking for device...\n"); + + /* Now look for lm75(s) in there */ + for (dev = NULL; + (dev = of_get_next_child(bus, dev)) != NULL;) { + const char *loc = + get_property(dev, "hwsensor-location", NULL); + u32 *reg = (u32 *)get_property(dev, "reg", NULL); + DBG(" dev: %s... (loc: %p, reg: %p)\n", dev->name, loc, reg); + if (loc == NULL || reg == NULL) + continue; + /* real lm75 */ + if (device_is_compatible(dev, "lm75")) + wf_lm75_create(adapter, *reg, 0, loc); + /* ds1775 (compatible, better resolution */ + else if (device_is_compatible(dev, "ds1775")) + wf_lm75_create(adapter, *reg, 1, loc); + } + + of_node_put(bus); + + return 0; +} + +static int wf_lm75_detach(struct i2c_client *client) +{ + struct wf_lm75_sensor *lm = i2c_to_lm75(client); + + DBG("wf_lm75: i2c detatch called for %s\n", lm->sens.name); + + /* Mark client detached */ + lm->i2c.adapter = NULL; + + /* release sensor */ + wf_unregister_sensor(&lm->sens); + + return 0; +} + +static int __init wf_lm75_sensor_init(void) +{ + int rc; + + rc = i2c_add_driver(&wf_lm75_driver); + if (rc < 0) + return rc; + return 0; +} + +static void __exit wf_lm75_sensor_exit(void) +{ + i2c_del_driver(&wf_lm75_driver); +} + + +module_init(wf_lm75_sensor_init); +module_exit(wf_lm75_sensor_exit); + +MODULE_AUTHOR("Benjamin Herrenschmidt "); +MODULE_DESCRIPTION("LM75 sensor objects for PowerMacs thermal control"); +MODULE_LICENSE("GPL"); + Index: linux-work/drivers/macintosh/windfarm_pid.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/drivers/macintosh/windfarm_pid.c 2005-11-07 13:30:46.000000000 +1100 @@ -0,0 +1,146 @@ +/* + * Windfarm PowerMac thermal control. Generic PID helpers + * + * (c) Copyright 2005 Benjamin Herrenschmidt, IBM Corp. + * + * + * Released under the term of the GNU GPL v2. + */ + +#include +#include +#include +#include +#include +#include + +#include "windfarm_pid.h" + +#undef DEBUG + +#ifdef DEBUG +#define DBG(args...) printk(args) +#else +#define DBG(args...) do { } while(0) +#endif + +void wf_pid_init(struct wf_pid_state *st, struct wf_pid_param *param) +{ + memset(st, 0, sizeof(struct wf_pid_state)); + st->param = *param; + st->first = 1; +} +EXPORT_SYMBOL_GPL(wf_pid_init); + +s32 wf_pid_run(struct wf_pid_state *st, s32 new_sample) +{ + s64 error, integ, deriv; + s32 target; + int i, hlen = st->param.history_len; + + /* Calculate error term */ + error = new_sample - st->param.itarget; + + /* Get samples into our history buffer */ + if (st->first) { + for (i = 0; i < hlen; i++) { + st->samples[i] = new_sample; + st->errors[i] = error; + } + st->first = 0; + st->index = 0; + } else { + st->index = (st->index + 1) % hlen; + st->samples[st->index] = new_sample; + st->errors[st->index] = error; + } + + /* Calculate integral term */ + for (i = 0, integ = 0; i < hlen; i++) + integ += st->errors[(st->index + hlen - i) % hlen]; + integ *= st->param.interval; + + /* Calculate derivative term */ + deriv = st->errors[st->index] - + st->errors[(st->index + hlen - 1) % hlen]; + deriv /= st->param.interval; + + /* Calculate target */ + target = (s32)((integ * (s64)st->param.gr + deriv * (s64)st->param.gd + + error * (s64)st->param.gp) >> 36); + if (st->param.additive) + target += st->target; + target = max(target, st->param.min); + target = min(target, st->param.max); + st->target = target; + + return st->target; +} +EXPORT_SYMBOL_GPL(wf_pid_run); + +void wf_cpu_pid_init(struct wf_cpu_pid_state *st, + struct wf_cpu_pid_param *param) +{ + memset(st, 0, sizeof(struct wf_cpu_pid_state)); + st->param = *param; + st->first = 1; +} +EXPORT_SYMBOL_GPL(wf_cpu_pid_init); + +s32 wf_cpu_pid_run(struct wf_cpu_pid_state *st, s32 new_power, s32 new_temp) +{ + s64 error, integ, deriv, prop; + s32 target, sval, adj; + int i, hlen = st->param.history_len; + + /* Calculate error term */ + error = st->param.pmaxadj - new_power; + + /* Get samples into our history buffer */ + if (st->first) { + for (i = 0; i < hlen; i++) { + st->powers[i] = new_power; + st->errors[i] = error; + } + st->temps[0] = st->temps[1] = new_temp; + st->first = 0; + st->index = st->tindex = 0; + } else { + st->index = (st->index + 1) % hlen; + st->powers[st->index] = new_power; + st->errors[st->index] = error; + st->tindex = (st->tindex + 1) % 2; + st->temps[st->tindex] = new_temp; + } + + /* Calculate integral term */ + for (i = 0, integ = 0; i < hlen; i++) + integ += st->errors[(st->index + hlen - i) % hlen]; + integ *= st->param.interval; + integ *= st->param.gr; + sval = st->param.tmax - ((integ >> 20) & 0xffffffff); + adj = min(st->param.ttarget, sval); + + DBG("integ: %lx, sval: %lx, adj: %lx\n", integ, sval, adj); + + /* Calculate derivative term */ + deriv = st->temps[st->tindex] - + st->temps[(st->tindex + 2 - 1) % 2]; + deriv /= st->param.interval; + deriv *= st->param.gd; + + /* Calculate proportional term */ + prop = (new_temp - adj); + prop *= st->param.gp; + + DBG("deriv: %lx, prop: %lx\n", deriv, prop); + + /* Calculate target */ + target = st->target + (s32)((deriv + prop) >> 36); + target = max(target, st->param.min); + target = min(target, st->param.max); + st->target = target; + + return st->target; +} +EXPORT_SYMBOL_GPL(wf_cpu_pid_run); Index: linux-work/drivers/macintosh/windfarm_pid.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/drivers/macintosh/windfarm_pid.h 2005-11-07 13:30:46.000000000 +1100 @@ -0,0 +1,84 @@ +/* + * Windfarm PowerMac thermal control. Generic PID helpers + * + * (c) Copyright 2005 Benjamin Herrenschmidt, IBM Corp. + * + * + * Released under the term of the GNU GPL v2. + * + * This is a pair of generic PID helpers that can be used by + * control loops. One is the basic PID implementation, the + * other one is more specifically tailored to the loops used + * for CPU control with 2 input sample types (temp and power) + */ + +/* + * *** Simple PID *** + */ + +#define WF_PID_MAX_HISTORY 32 + +/* This parameter array is passed to the PID algorithm. Currently, + * we don't support changing parameters on the fly as it's not needed + * but could be implemented (with necessary adjustment of the history + * buffer + */ +struct wf_pid_param { + int interval; /* Interval between samples in seconds */ + int history_len; /* Size of history buffer */ + int additive; /* 1: target relative to previous value */ + s32 gd, gp, gr; /* PID gains */ + s32 itarget; /* PID input target */ + s32 min,max; /* min and max target values */ +}; + +struct wf_pid_state { + int first; /* first run of the loop */ + int index; /* index of current sample */ + s32 target; /* current target value */ + s32 samples[WF_PID_MAX_HISTORY]; /* samples history buffer */ + s32 errors[WF_PID_MAX_HISTORY]; /* error history buffer */ + + struct wf_pid_param param; +}; + +extern void wf_pid_init(struct wf_pid_state *st, struct wf_pid_param *param); +extern s32 wf_pid_run(struct wf_pid_state *st, s32 sample); + + +/* + * *** CPU PID *** + */ + +#define WF_CPU_PID_MAX_HISTORY 32 + +/* This parameter array is passed to the CPU PID algorithm. Currently, + * we don't support changing parameters on the fly as it's not needed + * but could be implemented (with necessary adjustment of the history + * buffer + */ +struct wf_cpu_pid_param { + int interval; /* Interval between samples in seconds */ + int history_len; /* Size of history buffer */ + s32 gd, gp, gr; /* PID gains */ + s32 pmaxadj; /* PID max power adjust */ + s32 ttarget; /* PID input target */ + s32 tmax; /* PID input max */ + s32 min,max; /* min and max target values */ +}; + +struct wf_cpu_pid_state { + int first; /* first run of the loop */ + int index; /* index of current power */ + int tindex; /* index of current temp */ + s32 target; /* current target value */ + s32 powers[WF_PID_MAX_HISTORY]; /* power history buffer */ + s32 errors[WF_PID_MAX_HISTORY]; /* error history buffer */ + s32 temps[2]; /* temp. history buffer */ + + struct wf_cpu_pid_param param; +}; + +extern void wf_cpu_pid_init(struct wf_cpu_pid_state *st, + struct wf_cpu_pid_param *param); +extern s32 wf_cpu_pid_run(struct wf_cpu_pid_state *st, s32 power, s32 temp); Index: linux-work/drivers/macintosh/windfarm_cpufreq_clamp.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/drivers/macintosh/windfarm_cpufreq_clamp.c 2005-11-07 13:30:46.000000000 +1100 @@ -0,0 +1,105 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "windfarm.h" + +#define VERSION "0.3" + +static int clamped; +static struct wf_control *clamp_control; + +static int clamp_notifier_call(struct notifier_block *self, + unsigned long event, void *data) +{ + struct cpufreq_policy *p = data; + unsigned long max_freq; + + if (event != CPUFREQ_ADJUST) + return 0; + + max_freq = clamped ? (p->cpuinfo.min_freq) : (p->cpuinfo.max_freq); + cpufreq_verify_within_limits(p, 0, max_freq); + + return 0; +} + +static struct notifier_block clamp_notifier = { + .notifier_call = clamp_notifier_call, +}; + +static int clamp_set(struct wf_control *ct, s32 value) +{ + if (value) + printk(KERN_INFO "windfarm: Clamping CPU frequency to " + "minimum !\n"); + else + printk(KERN_INFO "windfarm: CPU frequency unclamped !\n"); + clamped = value; + cpufreq_update_policy(0); + return 0; +} + +static int clamp_get(struct wf_control *ct, s32 *value) +{ + *value = clamped; + return 0; +} + +static s32 clamp_min(struct wf_control *ct) +{ + return 0; +} + +static s32 clamp_max(struct wf_control *ct) +{ + return 1; +} + +static struct wf_control_ops clamp_ops = { + .set_value = clamp_set, + .get_value = clamp_get, + .get_min = clamp_min, + .get_max = clamp_max, + .owner = THIS_MODULE, +}; + +static int __init wf_cpufreq_clamp_init(void) +{ + struct wf_control *clamp; + + clamp = kmalloc(sizeof(struct wf_control), GFP_KERNEL); + if (clamp == NULL) + return -ENOMEM; + cpufreq_register_notifier(&clamp_notifier, CPUFREQ_POLICY_NOTIFIER); + clamp->ops = &clamp_ops; + clamp->name = "cpufreq-clamp"; + if (wf_register_control(clamp)) + goto fail; + clamp_control = clamp; + return 0; + fail: + kfree(clamp); + return -ENODEV; +} + +static void __exit wf_cpufreq_clamp_exit(void) +{ + if (clamp_control) + wf_unregister_control(clamp_control); +} + + +module_init(wf_cpufreq_clamp_init); +module_exit(wf_cpufreq_clamp_exit); + +MODULE_AUTHOR("Benjamin Herrenschmidt "); +MODULE_DESCRIPTION("CPU frequency clamp for PowerMacs thermal control"); +MODULE_LICENSE("GPL"); + Index: linux-work/drivers/macintosh/windfarm_pm81.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/drivers/macintosh/windfarm_pm81.c 2005-11-07 13:35:04.000000000 +1100 @@ -0,0 +1,881 @@ +/* + * Windfarm PowerMac thermal control. iMac G5 + * + * (c) Copyright 2005 Benjamin Herrenschmidt, IBM Corp. + * + * + * Released under the term of the GNU GPL v2. + * + * The algorithm used is the PID control algorithm, used the same + * way the published Darwin code does, using the same values that + * are present in the Darwin 8.2 snapshot property lists (note however + * that none of the code has been re-used, it's a complete re-implementation + * + * The various control loops found in Darwin config file are: + * + * PowerMac8,1 and PowerMac8,2 + * =========================== + * + * System Fans control loop. Different based on models. In addition to the + * usual PID algorithm, the control loop gets 2 additional pairs of linear + * scaling factors (scale/offsets) expressed as 4.12 fixed point values + * signed offset, unsigned scale) + * + * The targets are modified such as: + * - the linked control (second control) gets the target value as-is + * (typically the drive fan) + * - the main control (first control) gets the target value scaled with + * the first pair of factors, and is then modified as below + * - the value of the target of the CPU Fan control loop is retreived, + * scaled with the second pair of factors, and the max of that and + * the scaled target is applied to the main control. + * + * # model_id: 2 + * controls : system-fan, drive-bay-fan + * sensors : hd-temp + * PID params : G_d = 0x15400000 + * G_p = 0x00200000 + * G_r = 0x000002fd + * History = 2 entries + * Input target = 0x3a0000 + * Interval = 5s + * linear-factors : offset = 0xff38 scale = 0x0ccd + * offset = 0x0208 scale = 0x07ae + * + * # model_id: 3 + * controls : system-fan, drive-bay-fan + * sensors : hd-temp + * PID params : G_d = 0x08e00000 + * G_p = 0x00566666 + * G_r = 0x0000072b + * History = 2 entries + * Input target = 0x350000 + * Interval = 5s + * linear-factors : offset = 0xff38 scale = 0x0ccd + * offset = 0x0000 scale = 0x0000 + * + * # model_id: 5 + * controls : system-fan + * sensors : hd-temp + * PID params : G_d = 0x15400000 + * G_p = 0x00233333 + * G_r = 0x000002fd + * History = 2 entries + * Input target = 0x3a0000 + * Interval = 5s + * linear-factors : offset = 0x0000 scale = 0x1000 + * offset = 0x0091 scale = 0x0bae + * + * CPU Fan control loop. The loop is identical for all models. it + * has an additional pair of scaling factor. This is used to scale the + * systems fan control loop target result (the one before it gets scaled + * by the System Fans control loop itself). Then, the max value of the + * calculated target value and system fan value is sent to the fans + * + * controls : cpu-fan + * sensors : cpu-temp cpu-power + * PID params : From SMU sdb partition + * linear-factors : offset = 0xfb50 scale = 0x1000 + * + * CPU Slew control loop. Not implemented. The cpufreq driver in linux is + * completely separate for now, though we could find a way to link it, either + * as a client reacting to overtemp notifications, or directling monitoring + * the CPU temperature + * + * WARNING ! The CPU control loop requires the CPU tmax for the current + * operating point. However, we currently are completely separated from + * the cpufreq driver and thus do not know what the current operating + * point is. Fortunately, we also do not have any hardware supporting anything + * but operating point 0 at the moment, thus we just peek that value directly + * from the SDB partition. If we ever end up with actually slewing the system + * clock and thus changing operating points, we'll have to find a way to + * communicate with the CPU freq driver; + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "windfarm.h" +#include "windfarm_pid.h" + +#define VERSION "0.4" + +#undef DEBUG + +#ifdef DEBUG +#define DBG(args...) printk(args) +#else +#define DBG(args...) do { } while(0) +#endif + +/* define this to force CPU overtemp to 74 degree, useful for testing + * the overtemp code + */ +#undef HACKED_OVERTEMP + +static int wf_smu_mach_model; /* machine model id */ + +static struct device *wf_smu_dev; + +/* Controls & sensors */ +static struct wf_sensor *sensor_cpu_power; +static struct wf_sensor *sensor_cpu_temp; +static struct wf_sensor *sensor_hd_temp; +static struct wf_control *fan_cpu_main; +static struct wf_control *fan_hd; +static struct wf_control *fan_system; +static struct wf_control *cpufreq_clamp; + +/* Set to kick the control loop into life */ +static int wf_smu_all_controls_ok, wf_smu_all_sensors_ok, wf_smu_started; + +/* Failure handling.. could be nicer */ +#define FAILURE_FAN 0x01 +#define FAILURE_SENSOR 0x02 +#define FAILURE_OVERTEMP 0x04 + +static unsigned int wf_smu_failure_state; +static int wf_smu_readjust, wf_smu_skipping; + +/* + * ****** System Fans Control Loop ****** + * + */ + +/* Parameters for the System Fans control loop. Parameters + * not in this table such as interval, history size, ... + * are common to all versions and thus hard coded for now. + */ +struct wf_smu_sys_fans_param { + int model_id; + s32 itarget; + s32 gd, gp, gr; + + s16 offset0; + u16 scale0; + s16 offset1; + u16 scale1; +}; + +#define WF_SMU_SYS_FANS_INTERVAL 5 +#define WF_SMU_SYS_FANS_HISTORY_SIZE 2 + +/* State data used by the system fans control loop + */ +struct wf_smu_sys_fans_state { + int ticks; + s32 sys_setpoint; + s32 hd_setpoint; + s16 offset0; + u16 scale0; + s16 offset1; + u16 scale1; + struct wf_pid_state pid; +}; + +/* + * Configs for SMU Sytem Fan control loop + */ +static struct wf_smu_sys_fans_param wf_smu_sys_all_params[] = { + /* Model ID 2 */ + { + .model_id = 2, + .itarget = 0x3a0000, + .gd = 0x15400000, + .gp = 0x00200000, + .gr = 0x000002fd, + .offset0 = 0xff38, + .scale0 = 0x0ccd, + .offset1 = 0x0208, + .scale1 = 0x07ae, + }, + /* Model ID 3 */ + { + .model_id = 2, + .itarget = 0x350000, + .gd = 0x08e00000, + .gp = 0x00566666, + .gr = 0x0000072b, + .offset0 = 0xff38, + .scale0 = 0x0ccd, + .offset1 = 0x0000, + .scale1 = 0x0000, + }, + /* Model ID 5 */ + { + .model_id = 2, + .itarget = 0x3a0000, + .gd = 0x15400000, + .gp = 0x00233333, + .gr = 0x000002fd, + .offset0 = 0x0000, + .scale0 = 0x1000, + .offset1 = 0x0091, + .scale1 = 0x0bae, + }, +}; +#define WF_SMU_SYS_FANS_NUM_CONFIGS ARRAY_SIZE(wf_smu_sys_all_params) + +static struct wf_smu_sys_fans_state *wf_smu_sys_fans; + +/* + * ****** CPU Fans Control Loop ****** + * + */ + + +#define WF_SMU_CPU_FANS_INTERVAL 1 +#define WF_SMU_CPU_FANS_MAX_HISTORY 16 +#define WF_SMU_CPU_FANS_SIBLING_SCALE 0x00001000 +#define WF_SMU_CPU_FANS_SIBLING_OFFSET 0xfffffb50 + +/* State data used by the cpu fans control loop + */ +struct wf_smu_cpu_fans_state { + int ticks; + s32 cpu_setpoint; + s32 scale; + s32 offset; + struct wf_cpu_pid_state pid; +}; + +static struct wf_smu_cpu_fans_state *wf_smu_cpu_fans; + + + +/* + * ***** Implementation ***** + * + */ + +static void wf_smu_create_sys_fans(void) +{ + struct wf_smu_sys_fans_param *param = NULL; + struct wf_pid_param pid_param; + int i; + + /* First, locate the params for this model */ + for (i = 0; i < WF_SMU_SYS_FANS_NUM_CONFIGS; i++) + if (wf_smu_sys_all_params[i].model_id == wf_smu_mach_model) { + param = &wf_smu_sys_all_params[i]; + break; + } + + /* No params found, put fans to max */ + if (param == NULL) { + printk(KERN_WARNING "windfarm: System fan config not found " + "for this machine model, max fan speed\n"); + goto fail; + } + + /* Alloc & initialize state */ + wf_smu_sys_fans = kmalloc(sizeof(struct wf_smu_sys_fans_state), + GFP_KERNEL); + if (wf_smu_sys_fans == NULL) { + printk(KERN_WARNING "windfarm: Memory allocation error" + " max fan speed\n"); + goto fail; + } + wf_smu_sys_fans->ticks = 1; + wf_smu_sys_fans->scale0 = param->scale0; + wf_smu_sys_fans->offset0 = param->offset0; + wf_smu_sys_fans->scale1 = param->scale1; + wf_smu_sys_fans->offset1 = param->offset1; + + /* Fill PID params */ + pid_param.gd = param->gd; + pid_param.gp = param->gp; + pid_param.gr = param->gr; + pid_param.interval = WF_SMU_SYS_FANS_INTERVAL; + pid_param.history_len = WF_SMU_SYS_FANS_HISTORY_SIZE; + pid_param.itarget = param->itarget; + pid_param.min = fan_system->ops->get_min(fan_system); + pid_param.max = fan_system->ops->get_max(fan_system); + if (fan_hd) { + pid_param.min = + max(pid_param.min,fan_hd->ops->get_min(fan_hd)); + pid_param.max = + min(pid_param.max,fan_hd->ops->get_max(fan_hd)); + } + wf_pid_init(&wf_smu_sys_fans->pid, &pid_param); + + DBG("wf: System Fan control initialized.\n"); + DBG(" itarged=%d.%03d, min=%d RPM, max=%d RPM\n", + FIX32TOPRINT(pid_param.itarget), pid_param.min, pid_param.max); + return; + + fail: + + if (fan_system) + wf_control_set_max(fan_system); + if (fan_hd) + wf_control_set_max(fan_hd); +} + +static void wf_smu_sys_fans_tick(struct wf_smu_sys_fans_state *st) +{ + s32 new_setpoint, temp, scaled, cputarget; + int rc; + + if (--st->ticks != 0) { + if (wf_smu_readjust) + goto readjust; + return; + } + st->ticks = WF_SMU_SYS_FANS_INTERVAL; + + rc = sensor_hd_temp->ops->get_value(sensor_hd_temp, &temp); + if (rc) { + printk(KERN_WARNING "windfarm: HD temp sensor error %d\n", + rc); + wf_smu_failure_state |= FAILURE_SENSOR; + return; + } + + DBG("wf_smu: System Fans tick ! HD temp: %d.%03d\n", + FIX32TOPRINT(temp)); + + if (temp > (st->pid.param.itarget + 0x50000)) + wf_smu_failure_state |= FAILURE_OVERTEMP; + + new_setpoint = wf_pid_run(&st->pid, temp); + + DBG("wf_smu: new_setpoint: %d RPM\n", (int)new_setpoint); + + scaled = ((((s64)new_setpoint) * (s64)st->scale0) >> 12) + st->offset0; + + DBG("wf_smu: scaled setpoint: %d RPM\n", (int)scaled); + + cputarget = wf_smu_cpu_fans ? wf_smu_cpu_fans->pid.target : 0; + cputarget = ((((s64)cputarget) * (s64)st->scale1) >> 12) + st->offset1; + scaled = max(scaled, cputarget); + scaled = max(scaled, st->pid.param.min); + scaled = min(scaled, st->pid.param.max); + + DBG("wf_smu: adjusted setpoint: %d RPM\n", (int)scaled); + + if (st->sys_setpoint == scaled && new_setpoint == st->hd_setpoint) + return; + st->sys_setpoint = scaled; + st->hd_setpoint = new_setpoint; + readjust: + if (fan_system && wf_smu_failure_state == 0) { + rc = fan_system->ops->set_value(fan_system, st->sys_setpoint); + if (rc) { + printk(KERN_WARNING "windfarm: Sys fan error %d\n", + rc); + wf_smu_failure_state |= FAILURE_FAN; + } + } + if (fan_hd && wf_smu_failure_state == 0) { + rc = fan_hd->ops->set_value(fan_hd, st->hd_setpoint); + if (rc) { + printk(KERN_WARNING "windfarm: HD fan error %d\n", + rc); + wf_smu_failure_state |= FAILURE_FAN; + } + } +} + +static void wf_smu_create_cpu_fans(void) +{ + struct wf_cpu_pid_param pid_param; + struct smu_sdbp_header *hdr; + struct smu_sdbp_cpupiddata *piddata; + struct smu_sdbp_fvt *fvt; + s32 tmax, tdelta, maxpow, powadj; + + /* First, locate the PID params in SMU SBD */ + hdr = smu_get_sdb_partition(SMU_SDB_CPUPIDDATA_ID, NULL); + if (hdr == 0) { + printk(KERN_WARNING "windfarm: CPU PID fan config not found " + "max fan speed\n"); + goto fail; + } + piddata = (struct smu_sdbp_cpupiddata *)&hdr[1]; + + /* Get the FVT params for operating point 0 (the only supported one + * for now) in order to get tmax + */ + hdr = smu_get_sdb_partition(SMU_SDB_FVT_ID, NULL); + if (hdr) { + fvt = (struct smu_sdbp_fvt *)&hdr[1]; + tmax = ((s32)fvt->maxtemp) << 16; + } else + tmax = 0x5e0000; /* 94 degree default */ + + /* Alloc & initialize state */ + wf_smu_cpu_fans = kmalloc(sizeof(struct wf_smu_cpu_fans_state), + GFP_KERNEL); + if (wf_smu_cpu_fans == NULL) + goto fail; + wf_smu_cpu_fans->ticks = 1; + + wf_smu_cpu_fans->scale = WF_SMU_CPU_FANS_SIBLING_SCALE; + wf_smu_cpu_fans->offset = WF_SMU_CPU_FANS_SIBLING_OFFSET; + + /* Fill PID params */ + pid_param.interval = WF_SMU_CPU_FANS_INTERVAL; + pid_param.history_len = piddata->history_len; + if (pid_param.history_len > WF_CPU_PID_MAX_HISTORY) { + printk(KERN_WARNING "windfarm: History size overflow on " + "CPU control loop (%d)\n", piddata->history_len); + pid_param.history_len = WF_CPU_PID_MAX_HISTORY; + } + pid_param.gd = piddata->gd; + pid_param.gp = piddata->gp; + pid_param.gr = piddata->gr / pid_param.history_len; + + tdelta = ((s32)piddata->target_temp_delta) << 16; + maxpow = ((s32)piddata->max_power) << 16; + powadj = ((s32)piddata->power_adj) << 16; + + pid_param.tmax = tmax; + pid_param.ttarget = tmax - tdelta; + pid_param.pmaxadj = maxpow - powadj; + + pid_param.min = fan_cpu_main->ops->get_min(fan_cpu_main); + pid_param.max = fan_cpu_main->ops->get_max(fan_cpu_main); + + wf_cpu_pid_init(&wf_smu_cpu_fans->pid, &pid_param); + + DBG("wf: CPU Fan control initialized.\n"); + DBG(" ttarged=%d.%03d, tmax=%d.%03d, min=%d RPM, max=%d RPM\n", + FIX32TOPRINT(pid_param.ttarget), FIX32TOPRINT(pid_param.tmax), + pid_param.min, pid_param.max); + + return; + + fail: + printk(KERN_WARNING "windfarm: CPU fan config not found\n" + "for this machine model, max fan speed\n"); + + if (cpufreq_clamp) + wf_control_set_max(cpufreq_clamp); + if (fan_cpu_main) + wf_control_set_max(fan_cpu_main); +} + +static void wf_smu_cpu_fans_tick(struct wf_smu_cpu_fans_state *st) +{ + s32 new_setpoint, temp, power, systarget; + int rc; + + if (--st->ticks != 0) { + if (wf_smu_readjust) + goto readjust; + return; + } + st->ticks = WF_SMU_CPU_FANS_INTERVAL; + + rc = sensor_cpu_temp->ops->get_value(sensor_cpu_temp, &temp); + if (rc) { + printk(KERN_WARNING "windfarm: CPU temp sensor error %d\n", + rc); + wf_smu_failure_state |= FAILURE_SENSOR; + return; + } + + rc = sensor_cpu_power->ops->get_value(sensor_cpu_power, &power); + if (rc) { + printk(KERN_WARNING "windfarm: CPU power sensor error %d\n", + rc); + wf_smu_failure_state |= FAILURE_SENSOR; + return; + } + + DBG("wf_smu: CPU Fans tick ! CPU temp: %d.%03d, power: %d.%03d\n", + FIX32TOPRINT(temp), FIX32TOPRINT(power)); + +#ifdef HACKED_OVERTEMP + if (temp > 0x4a0000) + wf_smu_failure_state |= FAILURE_OVERTEMP; +#else + if (temp > st->pid.param.tmax) + wf_smu_failure_state |= FAILURE_OVERTEMP; +#endif + new_setpoint = wf_cpu_pid_run(&st->pid, power, temp); + + DBG("wf_smu: new_setpoint: %d RPM\n", (int)new_setpoint); + + systarget = wf_smu_sys_fans ? wf_smu_sys_fans->pid.target : 0; + systarget = ((((s64)systarget) * (s64)st->scale) >> 12) + + st->offset; + new_setpoint = max(new_setpoint, systarget); + new_setpoint = max(new_setpoint, st->pid.param.min); + new_setpoint = min(new_setpoint, st->pid.param.max); + + DBG("wf_smu: adjusted setpoint: %d RPM\n", (int)new_setpoint); + + if (st->cpu_setpoint == new_setpoint) + return; + st->cpu_setpoint = new_setpoint; + readjust: + if (fan_cpu_main && wf_smu_failure_state == 0) { + rc = fan_cpu_main->ops->set_value(fan_cpu_main, + st->cpu_setpoint); + if (rc) { + printk(KERN_WARNING "windfarm: CPU main fan" + " error %d\n", rc); + wf_smu_failure_state |= FAILURE_FAN; + } + } +} + + +/* + * ****** Attributes ****** + * + */ + +#define BUILD_SHOW_FUNC_FIX(name, data) \ +static ssize_t show_##name(struct device *dev, \ + struct device_attribute *attr, \ + char *buf) \ +{ \ + ssize_t r; \ + s32 val = 0; \ + data->ops->get_value(data, &val); \ + r = sprintf(buf, "%d.%03d", FIX32TOPRINT(val)); \ + return r; \ +} \ +static DEVICE_ATTR(name,S_IRUGO,show_##name, NULL); + + +#define BUILD_SHOW_FUNC_INT(name, data) \ +static ssize_t show_##name(struct device *dev, \ + struct device_attribute *attr, \ + char *buf) \ +{ \ + s32 val = 0; \ + data->ops->get_value(data, &val); \ + return sprintf(buf, "%d", val); \ +} \ +static DEVICE_ATTR(name,S_IRUGO,show_##name, NULL); + +BUILD_SHOW_FUNC_INT(cpu_fan, fan_cpu_main); +BUILD_SHOW_FUNC_INT(sys_fan, fan_system); +BUILD_SHOW_FUNC_INT(hd_fan, fan_hd); + +BUILD_SHOW_FUNC_FIX(cpu_temp, sensor_cpu_temp); +BUILD_SHOW_FUNC_FIX(cpu_power, sensor_cpu_power); +BUILD_SHOW_FUNC_FIX(hd_temp, sensor_hd_temp); + +/* + * ****** Setup / Init / Misc ... ****** + * + */ + +static void wf_smu_tick(void) +{ + unsigned int last_failure = wf_smu_failure_state; + unsigned int new_failure; + + if (!wf_smu_started) { + DBG("wf: creating control loops !\n"); + wf_smu_create_sys_fans(); + wf_smu_create_cpu_fans(); + wf_smu_started = 1; + } + + /* Skipping ticks */ + if (wf_smu_skipping && --wf_smu_skipping) + return; + + wf_smu_failure_state = 0; + if (wf_smu_sys_fans) + wf_smu_sys_fans_tick(wf_smu_sys_fans); + if (wf_smu_cpu_fans) + wf_smu_cpu_fans_tick(wf_smu_cpu_fans); + + wf_smu_readjust = 0; + new_failure = wf_smu_failure_state & ~last_failure; + + /* If entering failure mode, clamp cpufreq and ramp all + * fans to full speed. + */ + if (wf_smu_failure_state && !last_failure) { + if (cpufreq_clamp) + wf_control_set_max(cpufreq_clamp); + if (fan_system) + wf_control_set_max(fan_system); + if (fan_cpu_main) + wf_control_set_max(fan_cpu_main); + if (fan_hd) + wf_control_set_max(fan_hd); + } + + /* If leaving failure mode, unclamp cpufreq and readjust + * all fans on next iteration + */ + if (!wf_smu_failure_state && last_failure) { + if (cpufreq_clamp) + wf_control_set_min(cpufreq_clamp); + wf_smu_readjust = 1; + } + + /* Overtemp condition detected, notify and start skipping a couple + * ticks to let the temperature go down + */ + if (new_failure & FAILURE_OVERTEMP) { + wf_set_overtemp(); + wf_smu_skipping = 2; + } + + /* We only clear the overtemp condition if overtemp is cleared + * _and_ no other failure is present. Since a sensor error will + * clear the overtemp condition (can't measure temperature) at + * the control loop levels, but we don't want to keep it clear + * here in this case + */ + if (new_failure == 0 && last_failure & FAILURE_OVERTEMP) + wf_clear_overtemp(); +} + +static void wf_smu_new_control(struct wf_control *ct) +{ + if (wf_smu_all_controls_ok) + return; + + if (fan_cpu_main == NULL && !strcmp(ct->name, "cpu-fan")) { + if (wf_get_control(ct) == 0) { + fan_cpu_main = ct; + device_create_file(wf_smu_dev, &dev_attr_cpu_fan); + } + } + + if (fan_system == NULL && !strcmp(ct->name, "system-fan")) { + if (wf_get_control(ct) == 0) { + fan_system = ct; + device_create_file(wf_smu_dev, &dev_attr_sys_fan); + } + } + + if (cpufreq_clamp == NULL && !strcmp(ct->name, "cpufreq-clamp")) { + if (wf_get_control(ct) == 0) + cpufreq_clamp = ct; + } + + /* Darwin property list says the HD fan is only for model ID + * 0, 1, 2 and 3 + */ + + if (wf_smu_mach_model > 3) { + if (fan_system && fan_cpu_main && cpufreq_clamp) + wf_smu_all_controls_ok = 1; + return; + } + + if (fan_hd == NULL && !strcmp(ct->name, "drive-bay-fan")) { + if (wf_get_control(ct) == 0) { + fan_hd = ct; + device_create_file(wf_smu_dev, &dev_attr_hd_fan); + } + } + + if (fan_system && fan_hd && fan_cpu_main && cpufreq_clamp) + wf_smu_all_controls_ok = 1; +} + +static void wf_smu_new_sensor(struct wf_sensor *sr) +{ + if (wf_smu_all_sensors_ok) + return; + + if (sensor_cpu_power == NULL && !strcmp(sr->name, "cpu-power")) { + if (wf_get_sensor(sr) == 0) { + sensor_cpu_power = sr; + device_create_file(wf_smu_dev, &dev_attr_cpu_power); + } + } + + if (sensor_cpu_temp == NULL && !strcmp(sr->name, "cpu-temp")) { + if (wf_get_sensor(sr) == 0) { + sensor_cpu_temp = sr; + device_create_file(wf_smu_dev, &dev_attr_cpu_temp); + } + } + + if (sensor_hd_temp == NULL && !strcmp(sr->name, "hd-temp")) { + if (wf_get_sensor(sr) == 0) { + sensor_hd_temp = sr; + device_create_file(wf_smu_dev, &dev_attr_hd_temp); + } + } + + if (sensor_cpu_power && sensor_cpu_temp && sensor_hd_temp) + wf_smu_all_sensors_ok = 1; +} + + +static int wf_smu_notify(struct notifier_block *self, + unsigned long event, void *data) +{ + switch(event) { + case WF_EVENT_NEW_CONTROL: + DBG("wf: new control %s detected\n", + ((struct wf_control *)data)->name); + wf_smu_new_control(data); + wf_smu_readjust = 1; + break; + case WF_EVENT_NEW_SENSOR: + DBG("wf: new sensor %s detected\n", + ((struct wf_sensor *)data)->name); + wf_smu_new_sensor(data); + break; + case WF_EVENT_TICK: + if (wf_smu_all_controls_ok && wf_smu_all_sensors_ok) + wf_smu_tick(); + } + + return 0; +} + +static struct notifier_block wf_smu_events = { + .notifier_call = wf_smu_notify, +}; + +static int wf_init_pm(void) +{ + struct smu_sdbp_header *hdr; + + hdr = smu_get_sdb_partition(SMU_SDB_SENSORTREE_ID, NULL); + if (hdr != 0) { + struct smu_sdbp_sensortree *st = + (struct smu_sdbp_sensortree *)&hdr[1]; + wf_smu_mach_model = st->model_id; + } + + printk(KERN_INFO "windfarm: Initializing for iMacG5 model ID %d\n", + wf_smu_mach_model); + + return 0; +} + +static int wf_smu_probe(struct device *ddev) +{ + wf_smu_dev = ddev; + + wf_register_client(&wf_smu_events); + + return 0; +} + +static int wf_smu_remove(struct device *ddev) +{ + wf_unregister_client(&wf_smu_events); + + /* XXX We don't have yet a guarantee that our callback isn't + * in progress when returning from wf_unregister_client, so + * we add an arbitrary delay. I'll have to fix that in the core + */ + msleep(1000); + + /* Release all sensors */ + /* One more crappy race: I don't think we have any guarantee here + * that the attribute callback won't race with the sensor beeing + * disposed of, and I'm not 100% certain what best way to deal + * with that except by adding locks all over... I'll do that + * eventually but heh, who ever rmmod this module anyway ? + */ + if (sensor_cpu_power) { + device_remove_file(wf_smu_dev, &dev_attr_cpu_power); + wf_put_sensor(sensor_cpu_power); + } + if (sensor_cpu_temp) { + device_remove_file(wf_smu_dev, &dev_attr_cpu_temp); + wf_put_sensor(sensor_cpu_temp); + } + if (sensor_hd_temp) { + device_remove_file(wf_smu_dev, &dev_attr_hd_temp); + wf_put_sensor(sensor_hd_temp); + } + + /* Release all controls */ + if (fan_cpu_main) { + device_remove_file(wf_smu_dev, &dev_attr_cpu_fan); + wf_put_control(fan_cpu_main); + } + if (fan_hd) { + device_remove_file(wf_smu_dev, &dev_attr_hd_fan); + wf_put_control(fan_hd); + } + if (fan_system) { + device_remove_file(wf_smu_dev, &dev_attr_sys_fan); + wf_put_control(fan_system); + } + if (cpufreq_clamp) + wf_put_control(cpufreq_clamp); + + /* Destroy control loops state structures */ + if (wf_smu_sys_fans) + kfree(wf_smu_sys_fans); + if (wf_smu_cpu_fans) + kfree(wf_smu_cpu_fans); + + wf_smu_dev = NULL; + + return 0; +} + +static struct device_driver wf_smu_driver = { + .name = "windfarm", + .bus = &platform_bus_type, + .probe = wf_smu_probe, + .remove = wf_smu_remove, +}; + + +static int __init wf_smu_init(void) +{ + int rc = -ENODEV; + + if (machine_is_compatible("PowerMac8,1") || + machine_is_compatible("PowerMac8,2")) + rc = wf_init_pm(); + + if (rc == 0) { +#ifdef MODULE + request_module("windfarm_smu_controls"); + request_module("windfarm_smu_sensors"); + request_module("windfarm_lm75_sensor"); + +#endif /* MODULE */ + driver_register(&wf_smu_driver); + } + + return rc; +} + +static void __exit wf_smu_exit(void) +{ + + driver_unregister(&wf_smu_driver); +} + + +module_init(wf_smu_init); +module_exit(wf_smu_exit); + +MODULE_AUTHOR("Benjamin Herrenschmidt "); +MODULE_DESCRIPTION("Thermal control logic for iMac G5"); +MODULE_LICENSE("GPL"); + Index: linux-work/drivers/macintosh/windfarm_pm91.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/drivers/macintosh/windfarm_pm91.c 2005-11-07 13:35:12.000000000 +1100 @@ -0,0 +1,816 @@ +/* + * Windfarm PowerMac thermal control. SMU based 1 CPU desktop control loops + * + * (c) Copyright 2005 Benjamin Herrenschmidt, IBM Corp. + * + * + * Released under the term of the GNU GPL v2. + * + * The algorithm used is the PID control algorithm, used the same + * way the published Darwin code does, using the same values that + * are present in the Darwin 8.2 snapshot property lists (note however + * that none of the code has been re-used, it's a complete re-implementation + * + * The various control loops found in Darwin config file are: + * + * PowerMac9,1 + * =========== + * + * Has 3 control loops: CPU fans is similar to PowerMac8,1 (though it doesn't + * try to play with other control loops fans). Drive bay is rather basic PID + * with one sensor and one fan. Slots area is a bit different as the Darwin + * driver is supposed to be capable of working in a special "AGP" mode which + * involves the presence of an AGP sensor and an AGP fan (possibly on the + * AGP card itself). I can't deal with that special mode as I don't have + * access to those additional sensor/fans for now (though ultimately, it would + * be possible to add sensor objects for them) so I'm only implementing the + * basic PCI slot control loop + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "windfarm.h" +#include "windfarm_pid.h" + +#define VERSION "0.4" + +#undef DEBUG + +#ifdef DEBUG +#define DBG(args...) printk(args) +#else +#define DBG(args...) do { } while(0) +#endif + +/* define this to force CPU overtemp to 74 degree, useful for testing + * the overtemp code + */ +#undef HACKED_OVERTEMP + +static struct device *wf_smu_dev; + +/* Controls & sensors */ +static struct wf_sensor *sensor_cpu_power; +static struct wf_sensor *sensor_cpu_temp; +static struct wf_sensor *sensor_hd_temp; +static struct wf_sensor *sensor_slots_power; +static struct wf_control *fan_cpu_main; +static struct wf_control *fan_cpu_second; +static struct wf_control *fan_cpu_third; +static struct wf_control *fan_hd; +static struct wf_control *fan_slots; +static struct wf_control *cpufreq_clamp; + +/* Set to kick the control loop into life */ +static int wf_smu_all_controls_ok, wf_smu_all_sensors_ok, wf_smu_started; + +/* Failure handling.. could be nicer */ +#define FAILURE_FAN 0x01 +#define FAILURE_SENSOR 0x02 +#define FAILURE_OVERTEMP 0x04 + +static unsigned int wf_smu_failure_state; +static int wf_smu_readjust, wf_smu_skipping; + +/* + * ****** CPU Fans Control Loop ****** + * + */ + + +#define WF_SMU_CPU_FANS_INTERVAL 1 +#define WF_SMU_CPU_FANS_MAX_HISTORY 16 + +/* State data used by the cpu fans control loop + */ +struct wf_smu_cpu_fans_state { + int ticks; + s32 cpu_setpoint; + struct wf_cpu_pid_state pid; +}; + +static struct wf_smu_cpu_fans_state *wf_smu_cpu_fans; + + + +/* + * ****** Drive Fan Control Loop ****** + * + */ + +struct wf_smu_drive_fans_state { + int ticks; + s32 setpoint; + struct wf_pid_state pid; +}; + +static struct wf_smu_drive_fans_state *wf_smu_drive_fans; + +/* + * ****** Slots Fan Control Loop ****** + * + */ + +struct wf_smu_slots_fans_state { + int ticks; + s32 setpoint; + struct wf_pid_state pid; +}; + +static struct wf_smu_slots_fans_state *wf_smu_slots_fans; + +/* + * ***** Implementation ***** + * + */ + + +static void wf_smu_create_cpu_fans(void) +{ + struct wf_cpu_pid_param pid_param; + struct smu_sdbp_header *hdr; + struct smu_sdbp_cpupiddata *piddata; + struct smu_sdbp_fvt *fvt; + s32 tmax, tdelta, maxpow, powadj; + + /* First, locate the PID params in SMU SBD */ + hdr = smu_get_sdb_partition(SMU_SDB_CPUPIDDATA_ID, NULL); + if (hdr == 0) { + printk(KERN_WARNING "windfarm: CPU PID fan config not found " + "max fan speed\n"); + goto fail; + } + piddata = (struct smu_sdbp_cpupiddata *)&hdr[1]; + + /* Get the FVT params for operating point 0 (the only supported one + * for now) in order to get tmax + */ + hdr = smu_get_sdb_partition(SMU_SDB_FVT_ID, NULL); + if (hdr) { + fvt = (struct smu_sdbp_fvt *)&hdr[1]; + tmax = ((s32)fvt->maxtemp) << 16; + } else + tmax = 0x5e0000; /* 94 degree default */ + + /* Alloc & initialize state */ + wf_smu_cpu_fans = kmalloc(sizeof(struct wf_smu_cpu_fans_state), + GFP_KERNEL); + if (wf_smu_cpu_fans == NULL) + goto fail; + wf_smu_cpu_fans->ticks = 1; + + /* Fill PID params */ + pid_param.interval = WF_SMU_CPU_FANS_INTERVAL; + pid_param.history_len = piddata->history_len; + if (pid_param.history_len > WF_CPU_PID_MAX_HISTORY) { + printk(KERN_WARNING "windfarm: History size overflow on " + "CPU control loop (%d)\n", piddata->history_len); + pid_param.history_len = WF_CPU_PID_MAX_HISTORY; + } + pid_param.gd = piddata->gd; + pid_param.gp = piddata->gp; + pid_param.gr = piddata->gr / pid_param.history_len; + + tdelta = ((s32)piddata->target_temp_delta) << 16; + maxpow = ((s32)piddata->max_power) << 16; + powadj = ((s32)piddata->power_adj) << 16; + + pid_param.tmax = tmax; + pid_param.ttarget = tmax - tdelta; + pid_param.pmaxadj = maxpow - powadj; + + pid_param.min = fan_cpu_main->ops->get_min(fan_cpu_main); + pid_param.max = fan_cpu_main->ops->get_max(fan_cpu_main); + + wf_cpu_pid_init(&wf_smu_cpu_fans->pid, &pid_param); + + DBG("wf: CPU Fan control initialized.\n"); + DBG(" ttarged=%d.%03d, tmax=%d.%03d, min=%d RPM, max=%d RPM\n", + FIX32TOPRINT(pid_param.ttarget), FIX32TOPRINT(pid_param.tmax), + pid_param.min, pid_param.max); + + return; + + fail: + printk(KERN_WARNING "windfarm: CPU fan config not found\n" + "for this machine model, max fan speed\n"); + + if (cpufreq_clamp) + wf_control_set_max(cpufreq_clamp); + if (fan_cpu_main) + wf_control_set_max(fan_cpu_main); +} + +static void wf_smu_cpu_fans_tick(struct wf_smu_cpu_fans_state *st) +{ + s32 new_setpoint, temp, power; + int rc; + + if (--st->ticks != 0) { + if (wf_smu_readjust) + goto readjust; + return; + } + st->ticks = WF_SMU_CPU_FANS_INTERVAL; + + rc = sensor_cpu_temp->ops->get_value(sensor_cpu_temp, &temp); + if (rc) { + printk(KERN_WARNING "windfarm: CPU temp sensor error %d\n", + rc); + wf_smu_failure_state |= FAILURE_SENSOR; + return; + } + + rc = sensor_cpu_power->ops->get_value(sensor_cpu_power, &power); + if (rc) { + printk(KERN_WARNING "windfarm: CPU power sensor error %d\n", + rc); + wf_smu_failure_state |= FAILURE_SENSOR; + return; + } + + DBG("wf_smu: CPU Fans tick ! CPU temp: %d.%03d, power: %d.%03d\n", + FIX32TOPRINT(temp), FIX32TOPRINT(power)); + +#ifdef HACKED_OVERTEMP + if (temp > 0x4a0000) + wf_smu_failure_state |= FAILURE_OVERTEMP; +#else + if (temp > st->pid.param.tmax) + wf_smu_failure_state |= FAILURE_OVERTEMP; +#endif + new_setpoint = wf_cpu_pid_run(&st->pid, power, temp); + + DBG("wf_smu: new_setpoint: %d RPM\n", (int)new_setpoint); + + if (st->cpu_setpoint == new_setpoint) + return; + st->cpu_setpoint = new_setpoint; + readjust: + if (fan_cpu_main && wf_smu_failure_state == 0) { + rc = fan_cpu_main->ops->set_value(fan_cpu_main, + st->cpu_setpoint); + if (rc) { + printk(KERN_WARNING "windfarm: CPU main fan" + " error %d\n", rc); + wf_smu_failure_state |= FAILURE_FAN; + } + } + if (fan_cpu_second && wf_smu_failure_state == 0) { + rc = fan_cpu_second->ops->set_value(fan_cpu_second, + st->cpu_setpoint); + if (rc) { + printk(KERN_WARNING "windfarm: CPU second fan" + " error %d\n", rc); + wf_smu_failure_state |= FAILURE_FAN; + } + } + if (fan_cpu_third && wf_smu_failure_state == 0) { + rc = fan_cpu_main->ops->set_value(fan_cpu_third, + st->cpu_setpoint); + if (rc) { + printk(KERN_WARNING "windfarm: CPU third fan" + " error %d\n", rc); + wf_smu_failure_state |= FAILURE_FAN; + } + } +} + +static void wf_smu_create_drive_fans(void) +{ + struct wf_pid_param param = { + .interval = 5, + .history_len = 2, + .gd = 0x01e00000, + .gp = 0x00500000, + .gr = 0x00000000, + .itarget = 0x00200000, + }; + + /* Alloc & initialize state */ + wf_smu_drive_fans = kmalloc(sizeof(struct wf_smu_drive_fans_state), + GFP_KERNEL); + if (wf_smu_drive_fans == NULL) { + printk(KERN_WARNING "windfarm: Memory allocation error" + " max fan speed\n"); + goto fail; + } + wf_smu_drive_fans->ticks = 1; + + /* Fill PID params */ + param.additive = (fan_hd->type == WF_CONTROL_RPM_FAN); + param.min = fan_hd->ops->get_min(fan_hd); + param.max = fan_hd->ops->get_max(fan_hd); + wf_pid_init(&wf_smu_drive_fans->pid, ¶m); + + DBG("wf: Drive Fan control initialized.\n"); + DBG(" itarged=%d.%03d, min=%d RPM, max=%d RPM\n", + FIX32TOPRINT(param.itarget), param.min, param.max); + return; + + fail: + if (fan_hd) + wf_control_set_max(fan_hd); +} + +static void wf_smu_drive_fans_tick(struct wf_smu_drive_fans_state *st) +{ + s32 new_setpoint, temp; + int rc; + + if (--st->ticks != 0) { + if (wf_smu_readjust) + goto readjust; + return; + } + st->ticks = st->pid.param.interval; + + rc = sensor_hd_temp->ops->get_value(sensor_hd_temp, &temp); + if (rc) { + printk(KERN_WARNING "windfarm: HD temp sensor error %d\n", + rc); + wf_smu_failure_state |= FAILURE_SENSOR; + return; + } + + DBG("wf_smu: Drive Fans tick ! HD temp: %d.%03d\n", + FIX32TOPRINT(temp)); + + if (temp > (st->pid.param.itarget + 0x50000)) + wf_smu_failure_state |= FAILURE_OVERTEMP; + + new_setpoint = wf_pid_run(&st->pid, temp); + + DBG("wf_smu: new_setpoint: %d\n", (int)new_setpoint); + + if (st->setpoint == new_setpoint) + return; + st->setpoint = new_setpoint; + readjust: + if (fan_hd && wf_smu_failure_state == 0) { + rc = fan_hd->ops->set_value(fan_hd, st->setpoint); + if (rc) { + printk(KERN_WARNING "windfarm: HD fan error %d\n", + rc); + wf_smu_failure_state |= FAILURE_FAN; + } + } +} + +static void wf_smu_create_slots_fans(void) +{ + struct wf_pid_param param = { + .interval = 1, + .history_len = 8, + .gd = 0x00000000, + .gp = 0x00000000, + .gr = 0x00020000, + .itarget = 0x00000000 + }; + + /* Alloc & initialize state */ + wf_smu_slots_fans = kmalloc(sizeof(struct wf_smu_slots_fans_state), + GFP_KERNEL); + if (wf_smu_slots_fans == NULL) { + printk(KERN_WARNING "windfarm: Memory allocation error" + " max fan speed\n"); + goto fail; + } + wf_smu_slots_fans->ticks = 1; + + /* Fill PID params */ + param.additive = (fan_slots->type == WF_CONTROL_RPM_FAN); + param.min = fan_slots->ops->get_min(fan_slots); + param.max = fan_slots->ops->get_max(fan_slots); + wf_pid_init(&wf_smu_slots_fans->pid, ¶m); + + DBG("wf: Slots Fan control initialized.\n"); + DBG(" itarged=%d.%03d, min=%d RPM, max=%d RPM\n", + FIX32TOPRINT(param.itarget), param.min, param.max); + return; + + fail: + if (fan_slots) + wf_control_set_max(fan_slots); +} + +static void wf_smu_slots_fans_tick(struct wf_smu_slots_fans_state *st) +{ + s32 new_setpoint, power; + int rc; + + if (--st->ticks != 0) { + if (wf_smu_readjust) + goto readjust; + return; + } + st->ticks = st->pid.param.interval; + + rc = sensor_slots_power->ops->get_value(sensor_slots_power, &power); + if (rc) { + printk(KERN_WARNING "windfarm: Slots power sensor error %d\n", + rc); + wf_smu_failure_state |= FAILURE_SENSOR; + return; + } + + DBG("wf_smu: Slots Fans tick ! Slots power: %d.%03d\n", + FIX32TOPRINT(power)); + +#if 0 /* Check what makes a good overtemp condition */ + if (power > (st->pid.param.itarget + 0x50000)) + wf_smu_failure_state |= FAILURE_OVERTEMP; +#endif + + new_setpoint = wf_pid_run(&st->pid, power); + + DBG("wf_smu: new_setpoint: %d\n", (int)new_setpoint); + + if (st->setpoint == new_setpoint) + return; + st->setpoint = new_setpoint; + readjust: + if (fan_slots && wf_smu_failure_state == 0) { + rc = fan_slots->ops->set_value(fan_slots, st->setpoint); + if (rc) { + printk(KERN_WARNING "windfarm: Slots fan error %d\n", + rc); + wf_smu_failure_state |= FAILURE_FAN; + } + } +} + + +/* + * ****** Attributes ****** + * + */ + +#define BUILD_SHOW_FUNC_FIX(name, data) \ +static ssize_t show_##name(struct device *dev, \ + struct device_attribute *attr, \ + char *buf) \ +{ \ + ssize_t r; \ + s32 val = 0; \ + data->ops->get_value(data, &val); \ + r = sprintf(buf, "%d.%03d", FIX32TOPRINT(val)); \ + return r; \ +} \ +static DEVICE_ATTR(name,S_IRUGO,show_##name, NULL); + + +#define BUILD_SHOW_FUNC_INT(name, data) \ +static ssize_t show_##name(struct device *dev, \ + struct device_attribute *attr, \ + char *buf) \ +{ \ + s32 val = 0; \ + data->ops->get_value(data, &val); \ + return sprintf(buf, "%d", val); \ +} \ +static DEVICE_ATTR(name,S_IRUGO,show_##name, NULL); + +BUILD_SHOW_FUNC_INT(cpu_fan, fan_cpu_main); +BUILD_SHOW_FUNC_INT(hd_fan, fan_hd); +BUILD_SHOW_FUNC_INT(slots_fan, fan_slots); + +BUILD_SHOW_FUNC_FIX(cpu_temp, sensor_cpu_temp); +BUILD_SHOW_FUNC_FIX(cpu_power, sensor_cpu_power); +BUILD_SHOW_FUNC_FIX(hd_temp, sensor_hd_temp); +BUILD_SHOW_FUNC_FIX(slots_power, sensor_slots_power); + +/* + * ****** Setup / Init / Misc ... ****** + * + */ + +static void wf_smu_tick(void) +{ + unsigned int last_failure = wf_smu_failure_state; + unsigned int new_failure; + + if (!wf_smu_started) { + DBG("wf: creating control loops !\n"); + wf_smu_create_drive_fans(); + wf_smu_create_slots_fans(); + wf_smu_create_cpu_fans(); + wf_smu_started = 1; + } + + /* Skipping ticks */ + if (wf_smu_skipping && --wf_smu_skipping) + return; + + wf_smu_failure_state = 0; + if (wf_smu_drive_fans) + wf_smu_drive_fans_tick(wf_smu_drive_fans); + if (wf_smu_slots_fans) + wf_smu_slots_fans_tick(wf_smu_slots_fans); + if (wf_smu_cpu_fans) + wf_smu_cpu_fans_tick(wf_smu_cpu_fans); + + wf_smu_readjust = 0; + new_failure = wf_smu_failure_state & ~last_failure; + + /* If entering failure mode, clamp cpufreq and ramp all + * fans to full speed. + */ + if (wf_smu_failure_state && !last_failure) { + if (cpufreq_clamp) + wf_control_set_max(cpufreq_clamp); + if (fan_cpu_main) + wf_control_set_max(fan_cpu_main); + if (fan_cpu_second) + wf_control_set_max(fan_cpu_second); + if (fan_cpu_third) + wf_control_set_max(fan_cpu_third); + if (fan_hd) + wf_control_set_max(fan_hd); + if (fan_slots) + wf_control_set_max(fan_slots); + } + + /* If leaving failure mode, unclamp cpufreq and readjust + * all fans on next iteration + */ + if (!wf_smu_failure_state && last_failure) { + if (cpufreq_clamp) + wf_control_set_min(cpufreq_clamp); + wf_smu_readjust = 1; + } + + /* Overtemp condition detected, notify and start skipping a couple + * ticks to let the temperature go down + */ + if (new_failure & FAILURE_OVERTEMP) { + wf_set_overtemp(); + wf_smu_skipping = 2; + } + + /* We only clear the overtemp condition if overtemp is cleared + * _and_ no other failure is present. Since a sensor error will + * clear the overtemp condition (can't measure temperature) at + * the control loop levels, but we don't want to keep it clear + * here in this case + */ + if (new_failure == 0 && last_failure & FAILURE_OVERTEMP) + wf_clear_overtemp(); +} + + +static void wf_smu_new_control(struct wf_control *ct) +{ + if (wf_smu_all_controls_ok) + return; + + if (fan_cpu_main == NULL && !strcmp(ct->name, "cpu-rear-fan-0")) { + if (wf_get_control(ct) == 0) { + fan_cpu_main = ct; + device_create_file(wf_smu_dev, &dev_attr_cpu_fan); + } + } + + if (fan_cpu_second == NULL && !strcmp(ct->name, "cpu-rear-fan-1")) { + if (wf_get_control(ct) == 0) + fan_cpu_second = ct; + } + + if (fan_cpu_third == NULL && !strcmp(ct->name, "cpu-front-fan-0")) { + if (wf_get_control(ct) == 0) + fan_cpu_third = ct; + } + + if (cpufreq_clamp == NULL && !strcmp(ct->name, "cpufreq-clamp")) { + if (wf_get_control(ct) == 0) + cpufreq_clamp = ct; + } + + if (fan_hd == NULL && !strcmp(ct->name, "drive-bay-fan")) { + if (wf_get_control(ct) == 0) { + fan_hd = ct; + device_create_file(wf_smu_dev, &dev_attr_hd_fan); + } + } + + if (fan_slots == NULL && !strcmp(ct->name, "slots-fan")) { + if (wf_get_control(ct) == 0) { + fan_slots = ct; + device_create_file(wf_smu_dev, &dev_attr_slots_fan); + } + } + + if (fan_cpu_main && (fan_cpu_second || fan_cpu_third) && fan_hd && + fan_slots && cpufreq_clamp) + wf_smu_all_controls_ok = 1; +} + +static void wf_smu_new_sensor(struct wf_sensor *sr) +{ + if (wf_smu_all_sensors_ok) + return; + + if (sensor_cpu_power == NULL && !strcmp(sr->name, "cpu-power")) { + if (wf_get_sensor(sr) == 0) { + sensor_cpu_power = sr; + device_create_file(wf_smu_dev, &dev_attr_cpu_power); + } + } + + if (sensor_cpu_temp == NULL && !strcmp(sr->name, "cpu-temp")) { + if (wf_get_sensor(sr) == 0) { + sensor_cpu_temp = sr; + device_create_file(wf_smu_dev, &dev_attr_cpu_temp); + } + } + + if (sensor_hd_temp == NULL && !strcmp(sr->name, "hd-temp")) { + if (wf_get_sensor(sr) == 0) { + sensor_hd_temp = sr; + device_create_file(wf_smu_dev, &dev_attr_hd_temp); + } + } + + if (sensor_slots_power == NULL && !strcmp(sr->name, "slots-power")) { + if (wf_get_sensor(sr) == 0) { + sensor_slots_power = sr; + device_create_file(wf_smu_dev, &dev_attr_slots_power); + } + } + + if (sensor_cpu_power && sensor_cpu_temp && + sensor_hd_temp && sensor_slots_power) + wf_smu_all_sensors_ok = 1; +} + + +static int wf_smu_notify(struct notifier_block *self, + unsigned long event, void *data) +{ + switch(event) { + case WF_EVENT_NEW_CONTROL: + DBG("wf: new control %s detected\n", + ((struct wf_control *)data)->name); + wf_smu_new_control(data); + wf_smu_readjust = 1; + break; + case WF_EVENT_NEW_SENSOR: + DBG("wf: new sensor %s detected\n", + ((struct wf_sensor *)data)->name); + wf_smu_new_sensor(data); + break; + case WF_EVENT_TICK: + if (wf_smu_all_controls_ok && wf_smu_all_sensors_ok) + wf_smu_tick(); + } + + return 0; +} + +static struct notifier_block wf_smu_events = { + .notifier_call = wf_smu_notify, +}; + +static int wf_init_pm(void) +{ + printk(KERN_INFO "windfarm: Initializing for Desktop G5 model\n"); + + return 0; +} + +static int wf_smu_probe(struct device *ddev) +{ + wf_smu_dev = ddev; + + wf_register_client(&wf_smu_events); + + return 0; +} + +static int wf_smu_remove(struct device *ddev) +{ + wf_unregister_client(&wf_smu_events); + + /* XXX We don't have yet a guarantee that our callback isn't + * in progress when returning from wf_unregister_client, so + * we add an arbitrary delay. I'll have to fix that in the core + */ + msleep(1000); + + /* Release all sensors */ + /* One more crappy race: I don't think we have any guarantee here + * that the attribute callback won't race with the sensor beeing + * disposed of, and I'm not 100% certain what best way to deal + * with that except by adding locks all over... I'll do that + * eventually but heh, who ever rmmod this module anyway ? + */ + if (sensor_cpu_power) { + device_remove_file(wf_smu_dev, &dev_attr_cpu_power); + wf_put_sensor(sensor_cpu_power); + } + if (sensor_cpu_temp) { + device_remove_file(wf_smu_dev, &dev_attr_cpu_temp); + wf_put_sensor(sensor_cpu_temp); + } + if (sensor_hd_temp) { + device_remove_file(wf_smu_dev, &dev_attr_hd_temp); + wf_put_sensor(sensor_hd_temp); + } + if (sensor_slots_power) { + device_remove_file(wf_smu_dev, &dev_attr_slots_power); + wf_put_sensor(sensor_slots_power); + } + + /* Release all controls */ + if (fan_cpu_main) { + device_remove_file(wf_smu_dev, &dev_attr_cpu_fan); + wf_put_control(fan_cpu_main); + } + if (fan_cpu_second) + wf_put_control(fan_cpu_second); + if (fan_cpu_third) + wf_put_control(fan_cpu_third); + if (fan_hd) { + device_remove_file(wf_smu_dev, &dev_attr_hd_fan); + wf_put_control(fan_hd); + } + if (fan_slots) { + device_remove_file(wf_smu_dev, &dev_attr_slots_fan); + wf_put_control(fan_slots); + } + if (cpufreq_clamp) + wf_put_control(cpufreq_clamp); + + /* Destroy control loops state structures */ + if (wf_smu_slots_fans) + kfree(wf_smu_cpu_fans); + if (wf_smu_drive_fans) + kfree(wf_smu_cpu_fans); + if (wf_smu_cpu_fans) + kfree(wf_smu_cpu_fans); + + wf_smu_dev = NULL; + + return 0; +} + +static struct device_driver wf_smu_driver = { + .name = "windfarm", + .bus = &platform_bus_type, + .probe = wf_smu_probe, + .remove = wf_smu_remove, +}; + + +static int __init wf_smu_init(void) +{ + int rc = -ENODEV; + + if (machine_is_compatible("PowerMac9,1")) + rc = wf_init_pm(); + + if (rc == 0) { +#ifdef MODULE + request_module("windfarm_smu_controls"); + request_module("windfarm_smu_sensors"); + request_module("windfarm_lm75_sensor"); + +#endif /* MODULE */ + driver_register(&wf_smu_driver); + } + + return rc; +} + +static void __exit wf_smu_exit(void) +{ + + driver_unregister(&wf_smu_driver); +} + + +module_init(wf_smu_init); +module_exit(wf_smu_exit); + +MODULE_AUTHOR("Benjamin Herrenschmidt "); +MODULE_DESCRIPTION("Thermal control logic for PowerMac9,1"); +MODULE_LICENSE("GPL"); + From benh at kernel.crashing.org Mon Nov 7 14:32:28 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 07 Nov 2005 14:32:28 +1100 Subject: [PATCH] ppc64: Update g5_defconfig for ARCH=powerpc Message-ID: <1131334348.5229.160.camel@gaston> This patch updates g5_defconfig for ARCH=powerpc in order to add the SMU support & thermal drivers to it, the pmac sound driver (works on some G5s) and replaces rivafb with nvidiafb which works better for the cards found in G5 based machines. Signed-off-by: Benjamin Herrenschmidt Index: linux-work/arch/powerpc/configs/g5_defconfig =================================================================== --- linux-work.orig/arch/powerpc/configs/g5_defconfig 2005-11-07 13:34:24.000000000 +1100 +++ linux-work/arch/powerpc/configs/g5_defconfig 2005-11-07 13:38:08.000000000 +1100 @@ -1,18 +1,32 @@ # # Automatically generated make config: don't edit -# Linux kernel version: 2.6.14-rc4 -# Thu Oct 20 08:30:23 2005 +# Linux kernel version: 2.6.14 +# Mon Nov 7 13:37:59 2005 # +CONFIG_PPC64=y CONFIG_64BIT=y +CONFIG_PPC_MERGE=y CONFIG_MMU=y +CONFIG_GENERIC_HARDIRQS=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y -CONFIG_GENERIC_ISA_DMA=y +CONFIG_PPC=y CONFIG_EARLY_PRINTK=y CONFIG_COMPAT=y +CONFIG_SYSVIPC_COMPAT=y CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y -CONFIG_FORCE_MAX_ZONEORDER=13 + +# +# Processor support +# +CONFIG_POWER4_ONLY=y +CONFIG_POWER4=y +CONFIG_PPC_FPU=y +CONFIG_ALTIVEC=y +CONFIG_PPC_STD_MMU=y +CONFIG_SMP=y +CONFIG_NR_CPUS=2 # # Code maturity level options @@ -67,30 +81,60 @@ CONFIG_MODULE_SRCVERSION_ALL=y CONFIG_KMOD=y CONFIG_STOP_MACHINE=y -CONFIG_SYSVIPC_COMPAT=y # # Platform support # -# CONFIG_PPC_ISERIES is not set CONFIG_PPC_MULTIPLATFORM=y +# CONFIG_PPC_ISERIES is not set +# CONFIG_EMBEDDED6xx is not set +# CONFIG_APUS is not set # CONFIG_PPC_PSERIES is not set -# CONFIG_PPC_BPA is not set CONFIG_PPC_PMAC=y +CONFIG_PPC_PMAC64=y # CONFIG_PPC_MAPLE is not set -CONFIG_PPC=y -CONFIG_PPC64=y +# CONFIG_PPC_CELL is not set CONFIG_PPC_OF=y -CONFIG_MPIC=y -CONFIG_ALTIVEC=y -CONFIG_KEXEC=y CONFIG_U3_DART=y -CONFIG_PPC_PMAC64=y -CONFIG_BOOTX_TEXT=y -CONFIG_POWER4_ONLY=y +CONFIG_MPIC=y +# CONFIG_PPC_RTAS is not set +# CONFIG_MMIO_NVRAM is not set +# CONFIG_PPC_MPC106 is not set +CONFIG_GENERIC_TBSYNC=y +CONFIG_CPU_FREQ=y +CONFIG_CPU_FREQ_TABLE=y +# CONFIG_CPU_FREQ_DEBUG is not set +CONFIG_CPU_FREQ_STAT=y +# CONFIG_CPU_FREQ_STAT_DETAILS is not set +CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y +# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set +CONFIG_CPU_FREQ_GOV_PERFORMANCE=y +CONFIG_CPU_FREQ_GOV_POWERSAVE=y +CONFIG_CPU_FREQ_GOV_USERSPACE=y +# CONFIG_CPU_FREQ_GOV_ONDEMAND is not set +# CONFIG_CPU_FREQ_GOV_CONSERVATIVE is not set +CONFIG_CPU_FREQ_PMAC64=y +# CONFIG_WANT_EARLY_SERIAL is not set + +# +# Kernel options +# +# CONFIG_HZ_100 is not set +CONFIG_HZ_250=y +# CONFIG_HZ_1000 is not set +CONFIG_HZ=250 +CONFIG_PREEMPT_NONE=y +# CONFIG_PREEMPT_VOLUNTARY is not set +# CONFIG_PREEMPT is not set +# CONFIG_PREEMPT_BKL is not set +CONFIG_BINFMT_ELF=y +# CONFIG_BINFMT_MISC is not set +CONFIG_FORCE_MAX_ZONEORDER=13 CONFIG_IOMMU_VMERGE=y -CONFIG_SMP=y -CONFIG_NR_CPUS=2 +# CONFIG_HOTPLUG_CPU is not set +CONFIG_KEXEC=y +CONFIG_IRQ_ALL_CPUS=y +# CONFIG_NUMA is not set CONFIG_ARCH_SELECT_MEMORY_MODEL=y CONFIG_ARCH_FLATMEM_ENABLE=y CONFIG_SELECT_MEMORY_MODEL=y @@ -100,28 +144,21 @@ CONFIG_FLATMEM=y CONFIG_FLAT_NODE_MEM_MAP=y # CONFIG_SPARSEMEM_STATIC is not set -# CONFIG_NUMA is not set +CONFIG_SPLIT_PTLOCK_CPUS=4 +# CONFIG_PPC_64K_PAGES is not set # CONFIG_SCHED_SMT is not set -CONFIG_PREEMPT_NONE=y -# CONFIG_PREEMPT_VOLUNTARY is not set -# CONFIG_PREEMPT is not set -# CONFIG_PREEMPT_BKL is not set -# CONFIG_HZ_100 is not set -CONFIG_HZ_250=y -# CONFIG_HZ_1000 is not set -CONFIG_HZ=250 -CONFIG_GENERIC_HARDIRQS=y -CONFIG_SECCOMP=y -CONFIG_BINFMT_ELF=y -# CONFIG_BINFMT_MISC is not set -# CONFIG_HOTPLUG_CPU is not set CONFIG_PROC_DEVICETREE=y # CONFIG_CMDLINE_BOOL is not set +# CONFIG_PM is not set +CONFIG_SECCOMP=y CONFIG_ISA_DMA_API=y # -# Bus Options +# Bus options # +CONFIG_GENERIC_ISA_DMA=y +# CONFIG_PPC_I8259 is not set +# CONFIG_PPC_INDIRECT_PCI is not set CONFIG_PCI=y CONFIG_PCI_DOMAINS=y CONFIG_PCI_LEGACY_PROC=y @@ -136,6 +173,7 @@ # PCI Hotplug Support # # CONFIG_HOTPLUG_PCI is not set +CONFIG_KERNEL_START=0xc000000000000000 # # Networking @@ -276,6 +314,10 @@ # CONFIG_NET_DIVERT is not set # CONFIG_ECONET is not set # CONFIG_WAN_ROUTER is not set + +# +# QoS and/or fair queueing +# # CONFIG_NET_SCHED is not set CONFIG_NET_CLS_ROUTE=y @@ -348,6 +390,11 @@ CONFIG_IOSCHED_AS=y CONFIG_IOSCHED_DEADLINE=y CONFIG_IOSCHED_CFQ=y +CONFIG_DEFAULT_AS=y +# CONFIG_DEFAULT_DEADLINE is not set +# CONFIG_DEFAULT_CFQ is not set +# CONFIG_DEFAULT_NOOP is not set +CONFIG_DEFAULT_IOSCHED="anticipatory" # CONFIG_ATA_OVER_ETH is not set # @@ -449,6 +496,7 @@ # # SCSI low-level drivers # +# CONFIG_ISCSI_TCP is not set # CONFIG_BLK_DEV_3W_XXXX_RAID is not set # CONFIG_SCSI_3W_9XXX is not set # CONFIG_SCSI_ACARD is not set @@ -465,10 +513,12 @@ # CONFIG_SCSI_ATA_PIIX is not set # CONFIG_SCSI_SATA_MV is not set # CONFIG_SCSI_SATA_NV is not set -# CONFIG_SCSI_SATA_PROMISE is not set +# CONFIG_SCSI_PDC_ADMA is not set # CONFIG_SCSI_SATA_QSTOR is not set +# CONFIG_SCSI_SATA_PROMISE is not set # CONFIG_SCSI_SATA_SX4 is not set # CONFIG_SCSI_SATA_SIL is not set +# CONFIG_SCSI_SATA_SIL24 is not set # CONFIG_SCSI_SATA_SIS is not set # CONFIG_SCSI_SATA_ULI is not set # CONFIG_SCSI_SATA_VIA is not set @@ -567,6 +617,9 @@ CONFIG_ADB_PMU=y CONFIG_PMAC_SMU=y CONFIG_THERM_PM72=y +CONFIG_WINDFARM=y +CONFIG_WINDFARM_PM81=y +CONFIG_WINDFARM_PM91=y # # Network device support @@ -603,6 +656,7 @@ # CONFIG_NET_TULIP is not set # CONFIG_HP100 is not set # CONFIG_NET_PCI is not set +# CONFIG_FEC_8XX is not set # # Ethernet (1000 Mbit) @@ -768,6 +822,7 @@ # TPM devices # # CONFIG_TCG_TPM is not set +# CONFIG_TELCLOCK is not set # # I2C support @@ -820,6 +875,7 @@ # CONFIG_SENSORS_PCF8591 is not set # CONFIG_SENSORS_RTC8564 is not set # CONFIG_SENSORS_MAX6875 is not set +# CONFIG_RTC_X1205_I2C is not set # CONFIG_I2C_DEBUG_CORE is not set # CONFIG_I2C_DEBUG_ALGO is not set # CONFIG_I2C_DEBUG_BUS is not set @@ -876,10 +932,9 @@ # CONFIG_FB_ASILIANT is not set # CONFIG_FB_IMSTT is not set # CONFIG_FB_VGA16 is not set -# CONFIG_FB_NVIDIA is not set -CONFIG_FB_RIVA=y -# CONFIG_FB_RIVA_I2C is not set -# CONFIG_FB_RIVA_DEBUG is not set +CONFIG_FB_NVIDIA=y +CONFIG_FB_NVIDIA_I2C=y +# CONFIG_FB_RIVA is not set # CONFIG_FB_MATROX is not set # CONFIG_FB_RADEON_OLD is not set CONFIG_FB_RADEON=y @@ -924,7 +979,96 @@ # # Sound # -# CONFIG_SOUND is not set +CONFIG_SOUND=m + +# +# Advanced Linux Sound Architecture +# +CONFIG_SND=m +CONFIG_SND_TIMER=m +CONFIG_SND_PCM=m +CONFIG_SND_HWDEP=m +CONFIG_SND_RAWMIDI=m +CONFIG_SND_SEQUENCER=m +# CONFIG_SND_SEQ_DUMMY is not set +CONFIG_SND_OSSEMUL=y +CONFIG_SND_MIXER_OSS=m +CONFIG_SND_PCM_OSS=m +CONFIG_SND_SEQUENCER_OSS=y +# CONFIG_SND_VERBOSE_PRINTK is not set +# CONFIG_SND_DEBUG is not set +CONFIG_SND_GENERIC_DRIVER=y + +# +# Generic devices +# +# CONFIG_SND_DUMMY is not set +# CONFIG_SND_VIRMIDI is not set +# CONFIG_SND_MTPAV is not set +# CONFIG_SND_SERIAL_U16550 is not set +# CONFIG_SND_MPU401 is not set + +# +# PCI devices +# +# CONFIG_SND_ALI5451 is not set +# CONFIG_SND_ATIIXP is not set +# CONFIG_SND_ATIIXP_MODEM is not set +# CONFIG_SND_AU8810 is not set +# CONFIG_SND_AU8820 is not set +# CONFIG_SND_AU8830 is not set +# CONFIG_SND_AZT3328 is not set +# CONFIG_SND_BT87X is not set +# CONFIG_SND_CS46XX is not set +# CONFIG_SND_CS4281 is not set +# CONFIG_SND_EMU10K1 is not set +# CONFIG_SND_EMU10K1X is not set +# CONFIG_SND_CA0106 is not set +# CONFIG_SND_KORG1212 is not set +# CONFIG_SND_MIXART is not set +# CONFIG_SND_NM256 is not set +# CONFIG_SND_RME32 is not set +# CONFIG_SND_RME96 is not set +# CONFIG_SND_RME9652 is not set +# CONFIG_SND_HDSP is not set +# CONFIG_SND_HDSPM is not set +# CONFIG_SND_TRIDENT is not set +# CONFIG_SND_YMFPCI is not set +# CONFIG_SND_AD1889 is not set +# CONFIG_SND_ALS4000 is not set +# CONFIG_SND_CMIPCI is not set +# CONFIG_SND_ENS1370 is not set +# CONFIG_SND_ENS1371 is not set +# CONFIG_SND_ES1938 is not set +# CONFIG_SND_ES1968 is not set +# CONFIG_SND_MAESTRO3 is not set +# CONFIG_SND_FM801 is not set +# CONFIG_SND_ICE1712 is not set +# CONFIG_SND_ICE1724 is not set +# CONFIG_SND_INTEL8X0 is not set +# CONFIG_SND_INTEL8X0M is not set +# CONFIG_SND_SONICVIBES is not set +# CONFIG_SND_VIA82XX is not set +# CONFIG_SND_VIA82XX_MODEM is not set +# CONFIG_SND_VX222 is not set +# CONFIG_SND_HDA_INTEL is not set + +# +# ALSA PowerMac devices +# +CONFIG_SND_POWERMAC=m +CONFIG_SND_POWERMAC_AUTO_DRC=y + +# +# USB devices +# +CONFIG_SND_USB_AUDIO=m +# CONFIG_SND_USB_USX2Y is not set + +# +# Open Sound System +# +# CONFIG_SOUND_PRIME is not set # # USB support @@ -958,12 +1102,16 @@ # # USB Device Class drivers # -# CONFIG_USB_BLUETOOTH_TTY is not set +# CONFIG_OBSOLETE_OSS_USB_DRIVER is not set CONFIG_USB_ACM=m CONFIG_USB_PRINTER=y # -# NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support' may also be needed; see USB_STORAGE Help for more information +# NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support' +# + +# +# may also be needed; see USB_STORAGE Help for more information # CONFIG_USB_STORAGE=y # CONFIG_USB_STORAGE_DEBUG is not set @@ -1074,6 +1222,7 @@ CONFIG_USB_SERIAL_KLSI=m CONFIG_USB_SERIAL_KOBIL_SCT=m CONFIG_USB_SERIAL_MCT_U232=m +# CONFIG_USB_SERIAL_NOKIA_DKU2 is not set CONFIG_USB_SERIAL_PL2303=m # CONFIG_USB_SERIAL_HP4X is not set CONFIG_USB_SERIAL_SAFE=m @@ -1311,6 +1460,20 @@ CONFIG_NLS_UTF8=y # +# Library routines +# +CONFIG_CRC_CCITT=m +# CONFIG_CRC16 is not set +CONFIG_CRC32=y +CONFIG_LIBCRC32C=m +CONFIG_ZLIB_INFLATE=y +CONFIG_ZLIB_DEFLATE=m +CONFIG_TEXTSEARCH=y +CONFIG_TEXTSEARCH_KMP=m +CONFIG_TEXTSEARCH_BM=m +CONFIG_TEXTSEARCH_FSM=m + +# # Profiling support # CONFIG_PROFILING=y @@ -1331,12 +1494,14 @@ # CONFIG_DEBUG_KOBJECT is not set # CONFIG_DEBUG_INFO is not set CONFIG_DEBUG_FS=y +# CONFIG_DEBUG_VM is not set +# CONFIG_RCU_TORTURE_TEST is not set # CONFIG_DEBUG_STACKOVERFLOW is not set # CONFIG_KPROBES is not set # CONFIG_DEBUG_STACK_USAGE is not set # CONFIG_DEBUGGER is not set -# CONFIG_PPCDBG is not set CONFIG_IRQSTACKS=y +CONFIG_BOOTX_TEXT=y # # Security options @@ -1376,17 +1541,3 @@ # # Hardware crypto devices # - -# -# Library routines -# -CONFIG_CRC_CCITT=m -# CONFIG_CRC16 is not set -CONFIG_CRC32=y -CONFIG_LIBCRC32C=m -CONFIG_ZLIB_INFLATE=y -CONFIG_ZLIB_DEFLATE=m -CONFIG_TEXTSEARCH=y -CONFIG_TEXTSEARCH_KMP=m -CONFIG_TEXTSEARCH_BM=m -CONFIG_TEXTSEARCH_FSM=m From benh at kernel.crashing.org Mon Nov 7 14:36:21 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 07 Nov 2005 14:36:21 +1100 Subject: [PATCH] ppc64: More U3 device-tree fixes Message-ID: <1131334582.5229.163.camel@gaston> Some more U3 revisions have the missing "interrupts" property in U3, this adds them to the fixup code in prom_init.c Signed-off-by: Benjamin Herrenschmidt Index: linux-work/arch/powerpc/kernel/prom_init.c =================================================================== --- linux-work.orig/arch/powerpc/kernel/prom_init.c 2005-11-07 14:21:10.000000000 +1100 +++ linux-work/arch/powerpc/kernel/prom_init.c 2005-11-07 14:36:13.000000000 +1100 @@ -1872,7 +1872,7 @@ if (prom_getprop(u3, "device-rev", &u3_rev, sizeof(u3_rev)) == PROM_ERROR) return; - if (u3_rev != 0x35 && u3_rev != 0x37) + if (u3_rev < 0x35 || u3_rev > 0x39) return; /* does it need fixup ? */ if (prom_getproplen(i2c, "interrupts") > 0) Index: linux-work/arch/ppc64/kernel/prom_init.c =================================================================== --- linux-work.orig/arch/ppc64/kernel/prom_init.c 2005-11-07 10:31:39.000000000 +1100 +++ linux-work/arch/ppc64/kernel/prom_init.c 2005-11-07 14:36:17.000000000 +1100 @@ -1825,7 +1825,7 @@ if (prom_getprop(u3, "device-rev", &u3_rev, sizeof(u3_rev)) == PROM_ERROR) return; - if (u3_rev != 0x35 && u3_rev != 0x37) + if (u3_rev < 0x35 || u3_rev > 0x39) return; /* does it need fixup ? */ if (prom_getproplen(i2c, "interrupts") > 0) From olof at lixom.net Mon Nov 7 14:40:08 2005 From: olof at lixom.net (Olof Johansson) Date: Sun, 6 Nov 2005 19:40:08 -0800 Subject: [PATCH] ppc64: SMU based macs cpufreq support In-Reply-To: <1131334053.5229.151.camel@gaston> References: <1131334053.5229.151.camel@gaston> Message-ID: <20051107034008.GC7166@pb15.lixom.net> On Mon, Nov 07, 2005 at 02:27:33PM +1100, Benjamin Herrenschmidt wrote: > CPU freq support using 970FX powertune facility for iMac G5 and SMU > based single CPU desktop. > > Signed-off-by: Benjamin Herrenschmidt > > Index: linux-work/arch/ppc64/kernel/misc.S [...] > +_GLOBAL(scom970_read) [...] > Index: linux-work/arch/powerpc/kernel/misc_64.S [...] > +_GLOBAL(scom970_read) Are they needed in both places? With the move to arch/powerpc, can't the first go? -Olof From benh at kernel.crashing.org Mon Nov 7 14:45:04 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 07 Nov 2005 14:45:04 +1100 Subject: [PATCH] ppc64: SMU based macs cpufreq support In-Reply-To: <20051107034008.GC7166@pb15.lixom.net> References: <1131334053.5229.151.camel@gaston> <20051107034008.GC7166@pb15.lixom.net> Message-ID: <1131335105.5229.169.camel@gaston> On Sun, 2005-11-06 at 19:40 -0800, Olof Johansson wrote: > On Mon, Nov 07, 2005 at 02:27:33PM +1100, Benjamin Herrenschmidt wrote: > > CPU freq support using 970FX powertune facility for iMac G5 and SMU > > based single CPU desktop. > > > > Signed-off-by: Benjamin Herrenschmidt > > > > Index: linux-work/arch/ppc64/kernel/misc.S > [...] > > +_GLOBAL(scom970_read) > > [...] > > > Index: linux-work/arch/powerpc/kernel/misc_64.S > [...] > > +_GLOBAL(scom970_read) > > > Are they needed in both places? With the move to arch/powerpc, can't the > first go? It will go when arch/ppc64 is gone. In the meantime, it's needed if you build an ARCH=ppc64 kernel Ben. From olof at lixom.net Mon Nov 7 15:29:31 2005 From: olof at lixom.net (Olof Johansson) Date: Sun, 6 Nov 2005 20:29:31 -0800 Subject: [PATCH] ppc64: SMU based macs cpufreq support In-Reply-To: <1131335105.5229.169.camel@gaston> References: <1131334053.5229.151.camel@gaston> <20051107034008.GC7166@pb15.lixom.net> <1131335105.5229.169.camel@gaston> Message-ID: <20051107042931.GD7166@pb15.lixom.net> On Mon, Nov 07, 2005 at 02:45:04PM +1100, Benjamin Herrenschmidt wrote: > On Sun, 2005-11-06 at 19:40 -0800, Olof Johansson wrote: > > Are they needed in both places? With the move to arch/powerpc, can't the > > first go? > > It will go when arch/ppc64 is gone. In the meantime, it's needed if you > build an ARCH=ppc64 kernel Ok. It just seemed odd to add functionality to ppc64 when the focus is so much on moving everything over. -Olof From benh at kernel.crashing.org Mon Nov 7 15:31:10 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 07 Nov 2005 15:31:10 +1100 Subject: [PATCH] ppc64: SMU based macs cpufreq support In-Reply-To: <20051107042931.GD7166@pb15.lixom.net> References: <1131334053.5229.151.camel@gaston> <20051107034008.GC7166@pb15.lixom.net> <1131335105.5229.169.camel@gaston> <20051107042931.GD7166@pb15.lixom.net> Message-ID: <1131337870.11406.0.camel@gaston> On Sun, 2005-11-06 at 20:29 -0800, Olof Johansson wrote: > On Mon, Nov 07, 2005 at 02:45:04PM +1100, Benjamin Herrenschmidt wrote: > > On Sun, 2005-11-06 at 19:40 -0800, Olof Johansson wrote: > > > Are they needed in both places? With the move to arch/powerpc, can't the > > > first go? > > > > It will go when arch/ppc64 is gone. In the meantime, it's needed if you > > build an ARCH=ppc64 kernel > > Ok. It just seemed odd to add functionality to ppc64 when the focus is > so much on moving everything over. Well, I'm adding a functionality to the pmac platform :) It needs that support asm, wether it's in ppc64 or powerpc doesn't matter but I'm not (yet) ready to make pmac unbuildable (or building different features) on ppc64. Ben. From hch at lst.de Mon Nov 7 15:49:23 2005 From: hch at lst.de (Christoph Hellwig) Date: Mon, 7 Nov 2005 05:49:23 +0100 Subject: [PATCH] ppc64: Thermal control for SMU based machiens In-Reply-To: <1131334230.5229.157.camel@gaston> References: <1131334230.5229.157.camel@gaston> Message-ID: <20051107044923.GA16703@lst.de> On Mon, Nov 07, 2005 at 02:30:28PM +1100, Benjamin Herrenschmidt wrote: > This adds a new thermal control framework for PowerMac, along with the > implementation for PowerMac8,1, PowerMac8,2 (iMac G5 rev 1 and 2), and > PowerMac9,1 (latest single CPU desktop). In the future, I expect to move > the older G5 thermal control to the new framework as well. the core code doesn't seem to be mac-specific, could you please move it to common code? Especially the userland callout should be shared with existing drivers on sparc64 and ia64. I don't request you to move those over yet, I'll work with the maintainer to get there later. What about creating drivers/windfarm/ and move all the files there? > +struct wf_sensor_ops { > + int (*get_value)(struct wf_sensor *sr, s32 *val); > + void (*release)(struct wf_sensor *sr); > + struct module *owner; tabs instead of spaces here :) > +#include not needed anymore, latest mainline includes the CONFIG_ symbols automatically. > +#include not needed. > +int wf_register_client(struct notifier_block *nb) > +{ > + int rc; > + struct wf_control *ct; > + struct wf_sensor *sr; > + > + down(&wf_lock); > + rc = notifier_chain_register(&wf_client_list, nb); > + if (rc != 0) > + goto bail; > + wf_client_count++; > + list_for_each_entry(ct, &wf_controls, link) > + wf_notify(WF_EVENT_NEW_CONTROL, ct); > + list_for_each_entry(sr, &wf_sensors, link) > + wf_notify(WF_EVENT_NEW_SENSOR, sr); > + if (wf_client_count == 1) > + wf_start_thread(); > + bail: > + up(&wf_lock); > + return rc; > +} shouldn't clients get proper methods instead of the notifier_block mess? > Index: linux-work/drivers/macintosh/windfarm_smu_controls.c > =================================================================== > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > +++ linux-work/drivers/macintosh/windfarm_smu_controls.c 2005-11-07 13:30:46.000000000 +1100 > @@ -0,0 +1,274 @@ no copyright notices? From mikey at neuling.org Mon Nov 7 15:51:45 2005 From: mikey at neuling.org (Michael Neuling) Date: Mon, 7 Nov 2005 15:51:45 +1100 Subject: [PATCH 1/2] kexec tools: device tree blob reserve memory map entry cleanup In-Reply-To: <20051026125112.ceddab99.mikey@neuling.org> References: <20051026115202.f2acc73f.mikey@neuling.org> <20051026125112.ceddab99.mikey@neuling.org> Message-ID: <20051107155145.f033ecfd.mikey@neuling.org> Retransmitting this patch as the last one version was causing seg faults (due to an excess of brown paper bags in my cubical). ---- Patch cleans up how the reserve memory maps entry for the device tree are modified. Shouldn't change any functionality. Signed-off-by: Michael Neuling 1 files changed, 18 insertions(+), 21 deletions(-) Index: kexec-tools-1.101/kexec/arch/ppc64/kexec-elf-ppc64.c =================================================================== --- kexec-tools-1.101.orig/kexec/arch/ppc64/kexec-elf-ppc64.c +++ kexec-tools-1.101/kexec/arch/ppc64/kexec-elf-ppc64.c @@ -171,10 +171,7 @@ /* Add v2wrap to the current image */ unsigned char *v2wrap_buf = NULL; off_t v2wrap_size = 0; - unsigned int off_len; - unsigned char *seg_buf; - unsigned int rsvmap_len; - unsigned long long *ptr; + unsigned long long *rsvmap_ptr; struct bootblock *bb_ptr; unsigned int devtree_size; @@ -189,23 +186,23 @@ add_buffer(info, v2wrap_buf, v2wrap_size, v2wrap_size, 0, 0, 0xFFFFFFFFFFFFFFFFUL, -1); - /* patch reserve map address for flattened device-tree */ - base_addr = info->segment[(info->nr_segments)-1].mem; - seg_buf = (unsigned char *)info->segment[(info->nr_segments)-1].buf; - seg_buf = seg_buf + 0x100; /* offset to end of v2wrap */ - bb_ptr = (struct bootblock *)seg_buf; - rsvmap_len = bb_ptr->off_dt_struct - bb_ptr->off_mem_rsvmap; - devtree_size = bb_ptr->totalsize; - off_len = sizeof(struct bootblock); - off_len += 7; off_len &= ~7; - seg_buf = seg_buf + off_len; - off_len = rsvmap_len / (2 * sizeof(unsigned long long)); - - ptr = (unsigned long long *)seg_buf; - ptr = ptr + 2*(off_len-2); - *ptr = base_addr + 0x100; - ptr++; - *ptr = (unsigned long long)devtree_size; + /* patch reserve map address for flattened device-tree + find last entry (both 0) in the reserve mem list. Assume DT + entry is before this one */ + bb_ptr = (struct bootblock *)( + (unsigned char *)info->segment[(info->nr_segments)-1].buf + + 0x100); + rsvmap_ptr = (long long *)( + (unsigned char *)info->segment[(info->nr_segments)-1].buf + + bb_ptr->off_mem_rsvmap + 0x100); + while (*rsvmap_ptr || *(rsvmap_ptr+1)){ + rsvmap_ptr += 2; + } + rsvmap_ptr -= 2; + *rsvmap_ptr = (unsigned long long)( + info->segment[(info->nr_segments)-1].mem + 0x100); + rsvmap_ptr++; + *rsvmap_ptr = (unsigned long long)bb_ptr->totalsize; unsigned int nr_segments; nr_segments = info->nr_segments; From benh at kernel.crashing.org Mon Nov 7 16:05:29 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 07 Nov 2005 16:05:29 +1100 Subject: [PATCH] ppc64: Thermal control for SMU based machiens In-Reply-To: <20051107044923.GA16703@lst.de> References: <1131334230.5229.157.camel@gaston> <20051107044923.GA16703@lst.de> Message-ID: <1131339930.11406.10.camel@gaston> On Mon, 2005-11-07 at 05:49 +0100, Christoph Hellwig wrote: > On Mon, Nov 07, 2005 at 02:30:28PM +1100, Benjamin Herrenschmidt wrote: > > This adds a new thermal control framework for PowerMac, along with the > > implementation for PowerMac8,1, PowerMac8,2 (iMac G5 rev 1 and 2), and > > PowerMac9,1 (latest single CPU desktop). In the future, I expect to move > > the older G5 thermal control to the new framework as well. > > the core code doesn't seem to be mac-specific, could you please move > it to common code? Especially the userland callout should be shared > with existing drivers on sparc64 and ia64. I don't request you to move > those over yet, I'll work with the maintainer to get there later. > > What about creating drivers/windfarm/ and move all the files there? I'd like to massage it a bit more before going to common code. I want to port the old thermal control, and maybe a bit smarter overtemp code as well. For now, I'd rather keep it there... I need it in now (users are complaining enough about their machines wanting to be vacuum cleaners) and I have no time this week before the patch gates close to do the other changes I have in mind, so I put that as a target for after 2.6.15. > > +struct wf_sensor_ops { > > + int (*get_value)(struct wf_sensor *sr, s32 *val); > > + void (*release)(struct wf_sensor *sr); > > + struct module *owner; > > tabs instead of spaces here :) oops... oh well...fixing :) > > +#include > > not needed anymore, latest mainline includes the CONFIG_ symbols > automatically. Good to know > > +#include > > not needed. Yah, remaining of some older bits. > > > +int wf_register_client(struct notifier_block *nb) > > +{ > > + int rc; > > + struct wf_control *ct; > > + struct wf_sensor *sr; > > + > > + down(&wf_lock); > > + rc = notifier_chain_register(&wf_client_list, nb); > > + if (rc != 0) > > + goto bail; > > + wf_client_count++; > > + list_for_each_entry(ct, &wf_controls, link) > > + wf_notify(WF_EVENT_NEW_CONTROL, ct); > > + list_for_each_entry(sr, &wf_sensors, link) > > + wf_notify(WF_EVENT_NEW_SENSOR, sr); > > + if (wf_client_count == 1) > > + wf_start_thread(); > > + bail: > > + up(&wf_lock); > > + return rc; > > +} > > shouldn't clients get proper methods instead of the notifier_block mess? It was easier that way for a first implementation. That's what I'm thinking about "massaging it a bit more" :) There are some lifetime issues with clients I want to address as well so yes, I think the notifier will go :) The initial intent of the notifier however was to be able to have smarter mecanisms like "meta" sensors that catch sensor creations and create their own sensor (like a power sensor that catch the voltage and current sensors and exposes a combo power sensor) though I ended up not doing it that way. I also wanted a way to broadcast things like overtemp conditions system wide. But then, there is no limit on what a client can be. That is, the main "controller" for a given machine is a client, but anything else could be as well... > > Index: linux-work/drivers/macintosh/windfarm_smu_controls.c > > =================================================================== > > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > > +++ linux-work/drivers/macintosh/windfarm_smu_controls.c 2005-11-07 13:30:46.000000000 +1100 > > @@ -0,0 +1,274 @@ > > no copyright notices? Will fix. New patch on the way... Thanks, Ben. From benh at kernel.crashing.org Mon Nov 7 16:08:17 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 07 Nov 2005 16:08:17 +1100 Subject: [PATCH] ppc64: Thermal control for SMU based machines Message-ID: <1131340098.11406.14.camel@gaston> This adds a new thermal control framework for PowerMac, along with the implementation for PowerMac8,1, PowerMac8,2 (iMac G5 rev 1 and 2), and PowerMac9,1 (latest single CPU desktop). In the future, I expect to move the older G5 thermal control to the new framework as well. Signed-off-by: Benjamin Herrenschmidt --- New version addressing some of Christoph comments Index: linux-work/drivers/macintosh/smu.c =================================================================== --- linux-work.orig/drivers/macintosh/smu.c 2005-11-07 13:30:45.000000000 +1100 +++ linux-work/drivers/macintosh/smu.c 2005-11-07 13:30:46.000000000 +1100 @@ -590,6 +590,8 @@ sprintf(name, "smu-i2c-%02x", *reg); of_platform_device_create(np, name, &smu->of_dev->dev); } + if (device_is_compatible(np, "smu-sensors")) + of_platform_device_create(np, "smu-sensors", &smu->of_dev->dev); } } Index: linux-work/drivers/macintosh/Kconfig =================================================================== --- linux-work.orig/drivers/macintosh/Kconfig 2005-11-07 13:29:50.000000000 +1100 +++ linux-work/drivers/macintosh/Kconfig 2005-11-07 13:30:46.000000000 +1100 @@ -169,6 +169,25 @@ This driver provides thermostat and fan control for the desktop G5 machines. +config WINDFARM + tristate "New PowerMac thermal control infrastructure" + +config WINDFARM_PM81 + tristate "Support for thermal management on iMac G5" + depends on WINDFARM && I2C && CPU_FREQ_PMAC64 && PMAC_SMU + select I2C_PMAC_SMU + help + This driver provides thermal control for the iMacG5 + +config WINDFARM_PM91 + tristate "Support for thermal management on PowerMac9,1" + depends on WINDFARM && I2C && CPU_FREQ_PMAC64 && PMAC_SMU + select I2C_PMAC_SMU + help + This driver provides thermal control for the PowerMac9,1 + which is the recent (SMU based) single CPU desktop G5 + + config ANSLCD tristate "Support for ANS LCD display" depends on ADB_CUDA && PPC_PMAC Index: linux-work/drivers/macintosh/Makefile =================================================================== --- linux-work.orig/drivers/macintosh/Makefile 2005-11-07 13:29:50.000000000 +1100 +++ linux-work/drivers/macintosh/Makefile 2005-11-07 13:30:46.000000000 +1100 @@ -26,3 +26,12 @@ obj-$(CONFIG_THERM_PM72) += therm_pm72.o obj-$(CONFIG_THERM_WINDTUNNEL) += therm_windtunnel.o obj-$(CONFIG_THERM_ADT746X) += therm_adt746x.o +obj-$(CONFIG_WINDFARM) += windfarm_core.o +obj-$(CONFIG_WINDFARM_PM81) += windfarm_smu_controls.o \ + windfarm_smu_sensors.o \ + windfarm_lm75_sensor.o windfarm_pid.o \ + windfarm_cpufreq_clamp.o windfarm_pm81.o +obj-$(CONFIG_WINDFARM_PM91) += windfarm_smu_controls.o \ + windfarm_smu_sensors.o \ + windfarm_lm75_sensor.o windfarm_pid.o \ + windfarm_cpufreq_clamp.o windfarm_pm91.o Index: linux-work/drivers/macintosh/windfarm.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/drivers/macintosh/windfarm.h 2005-11-07 16:05:27.000000000 +1100 @@ -0,0 +1,131 @@ +/* + * Windfarm PowerMac thermal control. + * + * (c) Copyright 2005 Benjamin Herrenschmidt, IBM Corp. + * + * + * Released under the term of the GNU GPL v2. + */ + +#ifndef __WINDFARM_H__ +#define __WINDFARM_H__ + +#include +#include +#include +#include + +/* Display a 16.16 fixed point value */ +#define FIX32TOPRINT(f) ((f) >> 16),((((f) & 0xffff) * 1000) >> 16) + +/* + * Control objects + */ + +struct wf_control; + +struct wf_control_ops { + int (*set_value)(struct wf_control *ct, s32 val); + int (*get_value)(struct wf_control *ct, s32 *val); + s32 (*get_min)(struct wf_control *ct); + s32 (*get_max)(struct wf_control *ct); + void (*release)(struct wf_control *ct); + struct module *owner; +}; + +struct wf_control { + struct list_head link; + struct wf_control_ops *ops; + char *name; + int type; + struct kref ref; +}; + +#define WF_CONTROL_TYPE_GENERIC 0 +#define WF_CONTROL_RPM_FAN 1 +#define WF_CONTROL_PWM_FAN 2 + + +/* Note about lifetime rules: wf_register_control() will initialize + * the kref and wf_unregister_control will decrement it, thus the + * object creating/disposing a given control shouldn't assume it + * still exists after wf_unregister_control has been called. + * wf_find_control will inc the refcount for you + */ +extern int wf_register_control(struct wf_control *ct); +extern void wf_unregister_control(struct wf_control *ct); +extern struct wf_control * wf_find_control(const char *name); +extern int wf_get_control(struct wf_control *ct); +extern void wf_put_control(struct wf_control *ct); + +static inline int wf_control_set_max(struct wf_control *ct) +{ + s32 vmax = ct->ops->get_max(ct); + return ct->ops->set_value(ct, vmax); +} + +static inline int wf_control_set_min(struct wf_control *ct) +{ + s32 vmin = ct->ops->get_min(ct); + return ct->ops->set_value(ct, vmin); +} + +/* + * Sensor objects + */ + +struct wf_sensor; + +struct wf_sensor_ops { + int (*get_value)(struct wf_sensor *sr, s32 *val); + void (*release)(struct wf_sensor *sr); + struct module *owner; +}; + +struct wf_sensor { + struct list_head link; + struct wf_sensor_ops *ops; + char *name; + struct kref ref; +}; + +/* Same lifetime rules as controls */ +extern int wf_register_sensor(struct wf_sensor *sr); +extern void wf_unregister_sensor(struct wf_sensor *sr); +extern struct wf_sensor * wf_find_sensor(const char *name); +extern int wf_get_sensor(struct wf_sensor *sr); +extern void wf_put_sensor(struct wf_sensor *sr); + +/* For use by clients. Note that we are a bit racy here since + * notifier_block doesn't have a module owner field. I may fix + * it one day ... + * + * LOCKING NOTE ! + * + * All "events" except WF_EVENT_TICK are called with an internal mutex + * held which will deadlock if you call basically any core routine. + * So don't ! Just take note of the event and do your actual operations + * from the ticker. + * + */ +extern int wf_register_client(struct notifier_block *nb); +extern int wf_unregister_client(struct notifier_block *nb); + +/* Overtemp conditions. Those are refcounted */ +extern void wf_set_overtemp(void); +extern void wf_clear_overtemp(void); +extern int wf_is_overtemp(void); + +#define WF_EVENT_NEW_CONTROL 0 /* param is wf_control * */ +#define WF_EVENT_NEW_SENSOR 1 /* param is wf_sensor * */ +#define WF_EVENT_OVERTEMP 2 /* no param */ +#define WF_EVENT_NORMALTEMP 3 /* overtemp condition cleared */ +#define WF_EVENT_TICK 4 /* 1 second tick */ + +/* Note: If that driver gets more broad use, we could replace the + * simplistic overtemp bits with "environmental conditions". That + * could then be used to also notify of things like fan failure, + * case open, battery conditions, ... + */ + +#endif /* __WINDFARM_H__ */ Index: linux-work/drivers/macintosh/windfarm_core.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/drivers/macintosh/windfarm_core.c 2005-11-07 16:03:12.000000000 +1100 @@ -0,0 +1,426 @@ +/* + * Windfarm PowerMac thermal control. Core + * + * (c) Copyright 2005 Benjamin Herrenschmidt, IBM Corp. + * + * + * Released under the term of the GNU GPL v2. + * + * This core code tracks the list of sensors & controls, register + * clients, and holds the kernel thread used for control. + * + * TODO: + * + * Add some information about sensor/control type and data format to + * sensors/controls, and have the sysfs attribute stuff be moved + * generically here instead of hard coded in the platform specific + * driver as it us currently + * + * This however requires solving some annoying lifetime issues with + * sysfs which doesn't seem to have lifetime rules for struct attribute, + * I may have to create full features kobjects for every sensor/control + * instead which is a bit of an overkill imho + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "windfarm.h" + +#define VERSION "0.2" + +#undef DEBUG + +#ifdef DEBUG +#define DBG(args...) printk(args) +#else +#define DBG(args...) do { } while(0) +#endif + +static LIST_HEAD(wf_controls); +static LIST_HEAD(wf_sensors); +static DECLARE_MUTEX(wf_lock); +static struct notifier_block *wf_client_list; +static int wf_client_count; +static unsigned int wf_overtemp; +static unsigned int wf_overtemp_counter; +struct task_struct *wf_thread; + +/* + * Utilities & tick thread + */ + +static inline void wf_notify(int event, void *param) +{ + notifier_call_chain(&wf_client_list, event, param); +} + +int wf_critical_overtemp(void) +{ + static char * critical_overtemp_path = "/sbin/critical_overtemp"; + char *argv[] = { critical_overtemp_path, NULL }; + static char *envp[] = { "HOME=/", + "TERM=linux", + "PATH=/sbin:/usr/sbin:/bin:/usr/bin", + NULL }; + + return call_usermodehelper(critical_overtemp_path, argv, envp, 0); +} +EXPORT_SYMBOL_GPL(wf_critical_overtemp); + +static int wf_thread_func(void *data) +{ + unsigned long next, delay; + + next = jiffies; + + DBG("wf: thread started\n"); + + while(!kthread_should_stop()) { + try_to_freeze(); + + if (time_after_eq(jiffies, next)) { + wf_notify(WF_EVENT_TICK, NULL); + if (wf_overtemp) { + wf_overtemp_counter++; + /* 10 seconds overtemp, notify userland */ + if (wf_overtemp_counter > 10) + wf_critical_overtemp(); + /* 30 seconds, shutdown */ + if (wf_overtemp_counter > 30) { + printk(KERN_ERR "windfarm: Overtemp " + "for more than 30" + " seconds, shutting down\n"); + machine_power_off(); + } + } + next += HZ; + } + + delay = next - jiffies; + if (delay <= HZ) + schedule_timeout_interruptible(delay); + + /* there should be no signal, but oh well */ + if (signal_pending(current)) { + printk(KERN_WARNING "windfarm: thread got sigl !\n"); + break; + } + } + + DBG("wf: thread stopped\n"); + + return 0; +} + +static void wf_start_thread(void) +{ + wf_thread = kthread_run(wf_thread_func, NULL, "kwindfarm"); + if (IS_ERR(wf_thread)) { + printk(KERN_ERR "windfarm: failed to create thread,err %ld\n", + PTR_ERR(wf_thread)); + wf_thread = NULL; + } +} + + +static void wf_stop_thread(void) +{ + if (wf_thread) + kthread_stop(wf_thread); + wf_thread = NULL; +} + +/* + * Controls + */ + +static void wf_control_release(struct kref *kref) +{ + struct wf_control *ct = container_of(kref, struct wf_control, ref); + + DBG("wf: Deleting control %s\n", ct->name); + + if (ct->ops && ct->ops->release) + ct->ops->release(ct); + else + kfree(ct); +} + +int wf_register_control(struct wf_control *new_ct) +{ + struct wf_control *ct; + + down(&wf_lock); + list_for_each_entry(ct, &wf_controls, link) { + if (!strcmp(ct->name, new_ct->name)) { + printk(KERN_WARNING "windfarm: trying to register" + " duplicate control %s\n", ct->name); + up(&wf_lock); + return -EEXIST; + } + } + kref_init(&new_ct->ref); + list_add(&new_ct->link, &wf_controls); + + DBG("wf: Registered control %s\n", new_ct->name); + + wf_notify(WF_EVENT_NEW_CONTROL, new_ct); + up(&wf_lock); + + return 0; +} +EXPORT_SYMBOL_GPL(wf_register_control); + +void wf_unregister_control(struct wf_control *ct) +{ + down(&wf_lock); + list_del(&ct->link); + up(&wf_lock); + + DBG("wf: Unregistered control %s\n", ct->name); + + kref_put(&ct->ref, wf_control_release); +} +EXPORT_SYMBOL_GPL(wf_unregister_control); + +struct wf_control * wf_find_control(const char *name) +{ + struct wf_control *ct; + + down(&wf_lock); + list_for_each_entry(ct, &wf_controls, link) { + if (!strcmp(ct->name, name)) { + if (wf_get_control(ct)) + ct = NULL; + up(&wf_lock); + return ct; + } + } + up(&wf_lock); + return NULL; +} +EXPORT_SYMBOL_GPL(wf_find_control); + +int wf_get_control(struct wf_control *ct) +{ + if (!try_module_get(ct->ops->owner)) + return -ENODEV; + kref_get(&ct->ref); + return 0; +} +EXPORT_SYMBOL_GPL(wf_get_control); + +void wf_put_control(struct wf_control *ct) +{ + struct module *mod = ct->ops->owner; + kref_put(&ct->ref, wf_control_release); + module_put(mod); +} +EXPORT_SYMBOL_GPL(wf_put_control); + + +/* + * Sensors + */ + + +static void wf_sensor_release(struct kref *kref) +{ + struct wf_sensor *sr = container_of(kref, struct wf_sensor, ref); + + DBG("wf: Deleting sensor %s\n", sr->name); + + if (sr->ops && sr->ops->release) + sr->ops->release(sr); + else + kfree(sr); +} + +int wf_register_sensor(struct wf_sensor *new_sr) +{ + struct wf_sensor *sr; + + down(&wf_lock); + list_for_each_entry(sr, &wf_sensors, link) { + if (!strcmp(sr->name, new_sr->name)) { + printk(KERN_WARNING "windfarm: trying to register" + " duplicate sensor %s\n", sr->name); + up(&wf_lock); + return -EEXIST; + } + } + kref_init(&new_sr->ref); + list_add(&new_sr->link, &wf_sensors); + + DBG("wf: Registered sensor %s\n", new_sr->name); + + wf_notify(WF_EVENT_NEW_SENSOR, new_sr); + up(&wf_lock); + + return 0; +} +EXPORT_SYMBOL_GPL(wf_register_sensor); + +void wf_unregister_sensor(struct wf_sensor *sr) +{ + down(&wf_lock); + list_del(&sr->link); + up(&wf_lock); + + DBG("wf: Unregistered sensor %s\n", sr->name); + + wf_put_sensor(sr); +} +EXPORT_SYMBOL_GPL(wf_unregister_sensor); + +struct wf_sensor * wf_find_sensor(const char *name) +{ + struct wf_sensor *sr; + + down(&wf_lock); + list_for_each_entry(sr, &wf_sensors, link) { + if (!strcmp(sr->name, name)) { + if (wf_get_sensor(sr)) + sr = NULL; + up(&wf_lock); + return sr; + } + } + up(&wf_lock); + return NULL; +} +EXPORT_SYMBOL_GPL(wf_find_sensor); + +int wf_get_sensor(struct wf_sensor *sr) +{ + if (!try_module_get(sr->ops->owner)) + return -ENODEV; + kref_get(&sr->ref); + return 0; +} +EXPORT_SYMBOL_GPL(wf_get_sensor); + +void wf_put_sensor(struct wf_sensor *sr) +{ + struct module *mod = sr->ops->owner; + kref_put(&sr->ref, wf_sensor_release); + module_put(mod); +} +EXPORT_SYMBOL_GPL(wf_put_sensor); + + +/* + * Client & notification + */ + +int wf_register_client(struct notifier_block *nb) +{ + int rc; + struct wf_control *ct; + struct wf_sensor *sr; + + down(&wf_lock); + rc = notifier_chain_register(&wf_client_list, nb); + if (rc != 0) + goto bail; + wf_client_count++; + list_for_each_entry(ct, &wf_controls, link) + wf_notify(WF_EVENT_NEW_CONTROL, ct); + list_for_each_entry(sr, &wf_sensors, link) + wf_notify(WF_EVENT_NEW_SENSOR, sr); + if (wf_client_count == 1) + wf_start_thread(); + bail: + up(&wf_lock); + return rc; +} +EXPORT_SYMBOL_GPL(wf_register_client); + +int wf_unregister_client(struct notifier_block *nb) +{ + down(&wf_lock); + notifier_chain_unregister(&wf_client_list, nb); + wf_client_count++; + if (wf_client_count == 0) + wf_stop_thread(); + up(&wf_lock); + + return 0; +} +EXPORT_SYMBOL_GPL(wf_unregister_client); + +void wf_set_overtemp(void) +{ + down(&wf_lock); + wf_overtemp++; + if (wf_overtemp == 1) { + printk(KERN_WARNING "windfarm: Overtemp condition detected !\n"); + wf_overtemp_counter = 0; + wf_notify(WF_EVENT_OVERTEMP, NULL); + } + up(&wf_lock); +} +EXPORT_SYMBOL_GPL(wf_set_overtemp); + +void wf_clear_overtemp(void) +{ + down(&wf_lock); + WARN_ON(wf_overtemp == 0); + if (wf_overtemp == 0) { + up(&wf_lock); + return; + } + wf_overtemp--; + if (wf_overtemp == 0) { + printk(KERN_WARNING "windfarm: Overtemp condition cleared !\n"); + wf_notify(WF_EVENT_NORMALTEMP, NULL); + } + up(&wf_lock); +} +EXPORT_SYMBOL_GPL(wf_clear_overtemp); + +int wf_is_overtemp(void) +{ + return (wf_overtemp != 0); +} +EXPORT_SYMBOL_GPL(wf_is_overtemp); + +static struct platform_device wf_platform_device = { + .name = "windfarm", +}; + +static int __init windfarm_core_init(void) +{ + DBG("wf: core loaded\n"); + + platform_device_register(&wf_platform_device); + return 0; +} + +static void __exit windfarm_core_exit(void) +{ + BUG_ON(wf_client_count != 0); + + DBG("wf: core unloaded\n"); + + platform_device_unregister(&wf_platform_device); +} + + +module_init(windfarm_core_init); +module_exit(windfarm_core_exit); + +MODULE_AUTHOR("Benjamin Herrenschmidt "); +MODULE_DESCRIPTION("Core component of PowerMac thermal control"); +MODULE_LICENSE("GPL"); + Index: linux-work/drivers/macintosh/windfarm_smu_controls.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/drivers/macintosh/windfarm_smu_controls.c 2005-11-07 16:05:16.000000000 +1100 @@ -0,0 +1,282 @@ +/* + * Windfarm PowerMac thermal control. SMU based controls + * + * (c) Copyright 2005 Benjamin Herrenschmidt, IBM Corp. + * + * + * Released under the term of the GNU GPL v2. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "windfarm.h" + +#define VERSION "0.3" + +#undef DEBUG + +#ifdef DEBUG +#define DBG(args...) printk(args) +#else +#define DBG(args...) do { } while(0) +#endif + +/* + * SMU fans control object + */ + +static LIST_HEAD(smu_fans); + +struct smu_fan_control { + struct list_head link; + int fan_type; /* 0 = rpm, 1 = pwm */ + u32 reg; /* index in SMU */ + s32 value; /* current value */ + s32 min, max; /* min/max values */ + struct wf_control ctrl; +}; +#define to_smu_fan(c) container_of(c, struct smu_fan_control, ctrl) + +static int smu_set_fan(int pwm, u8 id, u16 value) +{ + struct smu_cmd cmd; + u8 buffer[16]; + DECLARE_COMPLETION(comp); + int rc; + + /* Fill SMU command structure */ + cmd.cmd = SMU_CMD_FAN_COMMAND; + cmd.data_len = 14; + cmd.reply_len = 16; + cmd.data_buf = cmd.reply_buf = buffer; + cmd.status = 0; + cmd.done = smu_done_complete; + cmd.misc = ∁ + + /* Fill argument buffer */ + memset(buffer, 0, 16); + buffer[0] = pwm ? 0x10 : 0x00; + buffer[1] = 0x01 << id; + *((u16 *)&buffer[2 + id * 2]) = value; + + rc = smu_queue_cmd(&cmd); + if (rc) + return rc; + wait_for_completion(&comp); + return cmd.status; +} + +static void smu_fan_release(struct wf_control *ct) +{ + struct smu_fan_control *fct = to_smu_fan(ct); + + kfree(fct); +} + +static int smu_fan_set(struct wf_control *ct, s32 value) +{ + struct smu_fan_control *fct = to_smu_fan(ct); + + if (value < fct->min) + value = fct->min; + if (value > fct->max) + value = fct->max; + fct->value = value; + + return smu_set_fan(fct->fan_type, fct->reg, value); +} + +static int smu_fan_get(struct wf_control *ct, s32 *value) +{ + struct smu_fan_control *fct = to_smu_fan(ct); + *value = fct->value; /* todo: read from SMU */ + return 0; +} + +static s32 smu_fan_min(struct wf_control *ct) +{ + struct smu_fan_control *fct = to_smu_fan(ct); + return fct->min; +} + +static s32 smu_fan_max(struct wf_control *ct) +{ + struct smu_fan_control *fct = to_smu_fan(ct); + return fct->max; +} + +static struct wf_control_ops smu_fan_ops = { + .set_value = smu_fan_set, + .get_value = smu_fan_get, + .get_min = smu_fan_min, + .get_max = smu_fan_max, + .release = smu_fan_release, + .owner = THIS_MODULE, +}; + +static struct smu_fan_control *smu_fan_create(struct device_node *node, + int pwm_fan) +{ + struct smu_fan_control *fct; + s32 *v; u32 *reg; + char *l; + + fct = kmalloc(sizeof(struct smu_fan_control), GFP_KERNEL); + if (fct == NULL) + return NULL; + fct->ctrl.ops = &smu_fan_ops; + l = (char *)get_property(node, "location", NULL); + if (l == NULL) + goto fail; + + fct->fan_type = pwm_fan; + fct->ctrl.type = pwm_fan ? WF_CONTROL_PWM_FAN : WF_CONTROL_RPM_FAN; + + /* We use the name & location here the same way we do for SMU sensors, + * see the comment in windfarm_smu_sensors.c. The locations are a bit + * less consistent here between the iMac and the desktop models, but + * that is good enough for our needs for now at least. + * + * One problem though is that Apple seem to be inconsistent with case + * and the kernel doesn't have strcasecmp =P + */ + + fct->ctrl.name = NULL; + + /* Names used on desktop models */ + if (!strcmp(l, "Rear Fan 0") || !strcmp(l, "Rear Fan") || + !strcmp(l, "Rear fan 0") || !strcmp(l, "Rear fan")) + fct->ctrl.name = "cpu-rear-fan-0"; + else if (!strcmp(l, "Rear Fan 1") || !strcmp(l, "Rear fan 1")) + fct->ctrl.name = "cpu-rear-fan-1"; + else if (!strcmp(l, "Front Fan 0") || !strcmp(l, "Front Fan") || + !strcmp(l, "Front fan 0") || !strcmp(l, "Front fan")) + fct->ctrl.name = "cpu-front-fan-0"; + else if (!strcmp(l, "Front Fan 1") || !strcmp(l, "Front fan 1")) + fct->ctrl.name = "cpu-front-fan-1"; + else if (!strcmp(l, "Slots Fan") || !strcmp(l, "Slots fan")) + fct->ctrl.name = "slots-fan"; + else if (!strcmp(l, "Drive Bay") || !strcmp(l, "Drive bay")) + fct->ctrl.name = "drive-bay-fan"; + + /* Names used on iMac models */ + if (!strcmp(l, "System Fan") || !strcmp(l, "System fan")) + fct->ctrl.name = "system-fan"; + else if (!strcmp(l, "CPU Fan") || !strcmp(l, "CPU fan")) + fct->ctrl.name = "cpu-fan"; + else if (!strcmp(l, "Hard Drive") || !strcmp(l, "Hard drive")) + fct->ctrl.name = "drive-bay-fan"; + + /* Unrecognized fan, bail out */ + if (fct->ctrl.name == NULL) + goto fail; + + /* Get min & max values*/ + v = (s32 *)get_property(node, "min-value", NULL); + if (v == NULL) + goto fail; + fct->min = *v; + v = (s32 *)get_property(node, "max-value", NULL); + if (v == NULL) + goto fail; + fct->max = *v; + + /* Get "reg" value */ + reg = (u32 *)get_property(node, "reg", NULL); + if (reg == NULL) + goto fail; + fct->reg = *reg; + + if (wf_register_control(&fct->ctrl)) + goto fail; + + return fct; + fail: + kfree(fct); + return NULL; +} + + +static int __init smu_controls_init(void) +{ + struct device_node *smu, *fans, *fan; + + if (!smu_present()) + return -ENODEV; + + smu = of_find_node_by_type(NULL, "smu"); + if (smu == NULL) + return -ENODEV; + + /* Look for RPM fans */ + for (fans = NULL; (fans = of_get_next_child(smu, fans)) != NULL;) + if (!strcmp(fans->name, "rpm-fans")) + break; + for (fan = NULL; + fans && (fan = of_get_next_child(fans, fan)) != NULL;) { + struct smu_fan_control *fct; + + fct = smu_fan_create(fan, 0); + if (fct == NULL) { + printk(KERN_WARNING "windfarm: Failed to create SMU " + "RPM fan %s\n", fan->name); + continue; + } + list_add(&fct->link, &smu_fans); + } + of_node_put(fans); + + + /* Look for PWM fans */ + for (fans = NULL; (fans = of_get_next_child(smu, fans)) != NULL;) + if (!strcmp(fans->name, "pwm-fans")) + break; + for (fan = NULL; + fans && (fan = of_get_next_child(fans, fan)) != NULL;) { + struct smu_fan_control *fct; + + fct = smu_fan_create(fan, 1); + if (fct == NULL) { + printk(KERN_WARNING "windfarm: Failed to create SMU " + "PWM fan %s\n", fan->name); + continue; + } + list_add(&fct->link, &smu_fans); + } + of_node_put(fans); + of_node_put(smu); + + return 0; +} + +static void __exit smu_controls_exit(void) +{ + struct smu_fan_control *fct; + + while (!list_empty(&smu_fans)) { + fct = list_entry(smu_fans.next, struct smu_fan_control, link); + list_del(&fct->link); + wf_unregister_control(&fct->ctrl); + } +} + + +module_init(smu_controls_init); +module_exit(smu_controls_exit); + +MODULE_AUTHOR("Benjamin Herrenschmidt "); +MODULE_DESCRIPTION("SMU control objects for PowerMacs thermal control"); +MODULE_LICENSE("GPL"); + Index: linux-work/drivers/macintosh/windfarm_smu_sensors.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/drivers/macintosh/windfarm_smu_sensors.c 2005-11-07 16:05:04.000000000 +1100 @@ -0,0 +1,479 @@ +/* + * Windfarm PowerMac thermal control. SMU based sensors + * + * (c) Copyright 2005 Benjamin Herrenschmidt, IBM Corp. + * + * + * Released under the term of the GNU GPL v2. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "windfarm.h" + +#define VERSION "0.2" + +#undef DEBUG + +#ifdef DEBUG +#define DBG(args...) printk(args) +#else +#define DBG(args...) do { } while(0) +#endif + +/* + * Various SMU "partitions" calibration objects for which we + * keep pointers here for use by bits & pieces of the driver + */ +static struct smu_sdbp_cpuvcp *cpuvcp; +static int cpuvcp_version; +static struct smu_sdbp_cpudiode *cpudiode; +static struct smu_sdbp_slotspow *slotspow; +static u8 *debugswitches; + +/* + * SMU basic sensors objects + */ + +static LIST_HEAD(smu_ads); + +struct smu_ad_sensor { + struct list_head link; + u32 reg; /* index in SMU */ + struct wf_sensor sens; +}; +#define to_smu_ads(c) container_of(c, struct smu_ad_sensor, sens) + +static void smu_ads_release(struct wf_sensor *sr) +{ + struct smu_ad_sensor *ads = to_smu_ads(sr); + + kfree(ads); +} + +static int smu_read_adc(u8 id, s32 *value) +{ + struct smu_simple_cmd cmd; + DECLARE_COMPLETION(comp); + int rc; + + rc = smu_queue_simple(&cmd, SMU_CMD_READ_ADC, 1, + smu_done_complete, &comp, id); + if (rc) + return rc; + wait_for_completion(&comp); + if (cmd.cmd.status != 0) + return cmd.cmd.status; + if (cmd.cmd.reply_len != 2) { + printk(KERN_ERR "winfarm: read ADC 0x%x returned %d bytes !\n", + id, cmd.cmd.reply_len); + return -EIO; + } + *value = *((u16 *)cmd.buffer); + return 0; +} + +static int smu_cputemp_get(struct wf_sensor *sr, s32 *value) +{ + struct smu_ad_sensor *ads = to_smu_ads(sr); + int rc; + s32 val; + s64 scaled; + + rc = smu_read_adc(ads->reg, &val); + if (rc) { + printk(KERN_ERR "windfarm: read CPU temp failed, err %d\n", + rc); + return rc; + } + + /* Ok, we have to scale & adjust, taking units into account */ + scaled = (s64)(((u64)val) * (u64)cpudiode->m_value); + scaled >>= 3; + scaled += ((s64)cpudiode->b_value) << 9; + *value = (s32)(scaled << 1); + + return 0; +} + +static int smu_cpuamp_get(struct wf_sensor *sr, s32 *value) +{ + struct smu_ad_sensor *ads = to_smu_ads(sr); + s32 val, scaled; + int rc; + + rc = smu_read_adc(ads->reg, &val); + if (rc) { + printk(KERN_ERR "windfarm: read CPU current failed, err %d\n", + rc); + return rc; + } + + /* Ok, we have to scale & adjust, taking units into account */ + scaled = (s32)(val * (u32)cpuvcp->curr_scale); + scaled += (s32)cpuvcp->curr_offset; + *value = scaled << 4; + + return 0; +} + +static int smu_cpuvolt_get(struct wf_sensor *sr, s32 *value) +{ + struct smu_ad_sensor *ads = to_smu_ads(sr); + s32 val, scaled; + int rc; + + rc = smu_read_adc(ads->reg, &val); + if (rc) { + printk(KERN_ERR "windfarm: read CPU voltage failed, err %d\n", + rc); + return rc; + } + + /* Ok, we have to scale & adjust, taking units into account */ + scaled = (s32)(val * (u32)cpuvcp->volt_scale); + scaled += (s32)cpuvcp->volt_offset; + *value = scaled << 4; + + return 0; +} + +static int smu_slotspow_get(struct wf_sensor *sr, s32 *value) +{ + struct smu_ad_sensor *ads = to_smu_ads(sr); + s32 val, scaled; + int rc; + + rc = smu_read_adc(ads->reg, &val); + if (rc) { + printk(KERN_ERR "windfarm: read slots power failed, err %d\n", + rc); + return rc; + } + + /* Ok, we have to scale & adjust, taking units into account */ + scaled = (s32)(val * (u32)slotspow->pow_scale); + scaled += (s32)slotspow->pow_offset; + *value = scaled << 4; + + return 0; +} + + +static struct wf_sensor_ops smu_cputemp_ops = { + .get_value = smu_cputemp_get, + .release = smu_ads_release, + .owner = THIS_MODULE, +}; +static struct wf_sensor_ops smu_cpuamp_ops = { + .get_value = smu_cpuamp_get, + .release = smu_ads_release, + .owner = THIS_MODULE, +}; +static struct wf_sensor_ops smu_cpuvolt_ops = { + .get_value = smu_cpuvolt_get, + .release = smu_ads_release, + .owner = THIS_MODULE, +}; +static struct wf_sensor_ops smu_slotspow_ops = { + .get_value = smu_slotspow_get, + .release = smu_ads_release, + .owner = THIS_MODULE, +}; + + +static struct smu_ad_sensor *smu_ads_create(struct device_node *node) +{ + struct smu_ad_sensor *ads; + char *c, *l; + u32 *v; + + ads = kmalloc(sizeof(struct smu_ad_sensor), GFP_KERNEL); + if (ads == NULL) + return NULL; + c = (char *)get_property(node, "device_type", NULL); + l = (char *)get_property(node, "location", NULL); + if (c == NULL || l == NULL) + goto fail; + + /* We currently pick the sensors based on the OF name and location + * properties, while Darwin uses the sensor-id's. + * The problem with the IDs is that they are model specific while it + * looks like apple has been doing a reasonably good job at keeping + * the names and locations consistents so I'll stick with the names + * and locations for now. + */ + if (!strcmp(c, "temp-sensor") && + !strcmp(l, "CPU T-Diode")) { + ads->sens.ops = &smu_cputemp_ops; + ads->sens.name = "cpu-temp"; + } else if (!strcmp(c, "current-sensor") && + !strcmp(l, "CPU Current")) { + ads->sens.ops = &smu_cpuamp_ops; + ads->sens.name = "cpu-current"; + } else if (!strcmp(c, "voltage-sensor") && + !strcmp(l, "CPU Voltage")) { + ads->sens.ops = &smu_cpuvolt_ops; + ads->sens.name = "cpu-voltage"; + } else if (!strcmp(c, "power-sensor") && + !strcmp(l, "Slots Power")) { + ads->sens.ops = &smu_slotspow_ops; + ads->sens.name = "slots-power"; + if (slotspow == NULL) { + DBG("wf: slotspow partition (%02x) not found\n", + SMU_SDB_SLOTSPOW_ID); + goto fail; + } + } else + goto fail; + + v = (u32 *)get_property(node, "reg", NULL); + if (v == NULL) + goto fail; + ads->reg = *v; + + if (wf_register_sensor(&ads->sens)) + goto fail; + return ads; + fail: + kfree(ads); + return NULL; +} + +/* + * SMU Power combo sensor object + */ + +struct smu_cpu_power_sensor { + struct list_head link; + struct wf_sensor *volts; + struct wf_sensor *amps; + int fake_volts : 1; + int quadratic : 1; + struct wf_sensor sens; +}; +#define to_smu_cpu_power(c) container_of(c, struct smu_cpu_power_sensor, sens) + +static struct smu_cpu_power_sensor *smu_cpu_power; + +static void smu_cpu_power_release(struct wf_sensor *sr) +{ + struct smu_cpu_power_sensor *pow = to_smu_cpu_power(sr); + + if (pow->volts) + wf_put_sensor(pow->volts); + if (pow->amps) + wf_put_sensor(pow->amps); + kfree(pow); +} + +static int smu_cpu_power_get(struct wf_sensor *sr, s32 *value) +{ + struct smu_cpu_power_sensor *pow = to_smu_cpu_power(sr); + s32 volts, amps, power; + u64 tmps, tmpa, tmpb; + int rc; + + rc = pow->amps->ops->get_value(pow->amps, &s); + if (rc) + return rc; + + if (pow->fake_volts) { + *value = amps * 12 - 0x30000; + return 0; + } + + rc = pow->volts->ops->get_value(pow->volts, &volts); + if (rc) + return rc; + + power = (s32)((((u64)volts) * ((u64)amps)) >> 16); + if (!pow->quadratic) { + *value = power; + return 0; + } + tmps = (((u64)power) * ((u64)power)) >> 16; + tmpa = ((u64)cpuvcp->power_quads[0]) * tmps; + tmpb = ((u64)cpuvcp->power_quads[1]) * ((u64)power); + *value = (tmpa >> 28) + (tmpb >> 28) + (cpuvcp->power_quads[2] >> 12); + + return 0; +} + +static struct wf_sensor_ops smu_cpu_power_ops = { + .get_value = smu_cpu_power_get, + .release = smu_cpu_power_release, + .owner = THIS_MODULE, +}; + + +static struct smu_cpu_power_sensor * +smu_cpu_power_create(struct wf_sensor *volts, struct wf_sensor *amps) +{ + struct smu_cpu_power_sensor *pow; + + pow = kmalloc(sizeof(struct smu_cpu_power_sensor), GFP_KERNEL); + if (pow == NULL) + return NULL; + pow->sens.ops = &smu_cpu_power_ops; + pow->sens.name = "cpu-power"; + + wf_get_sensor(volts); + pow->volts = volts; + wf_get_sensor(amps); + pow->amps = amps; + + /* Some early machines need a faked voltage */ + if (debugswitches && ((*debugswitches) & 0x80)) { + printk(KERN_INFO "windfarm: CPU Power sensor using faked" + " voltage !\n"); + pow->fake_volts = 1; + } else + pow->fake_volts = 0; + + /* Try to use quadratic transforms on PowerMac8,1 and 9,1 for now, + * I yet have to figure out what's up with 8,2 and will have to + * adjust for later, unless we can 100% trust the SDB partition... + */ + if ((machine_is_compatible("PowerMac8,1") || + machine_is_compatible("PowerMac8,2") || + machine_is_compatible("PowerMac9,1")) && + cpuvcp_version >= 2) { + pow->quadratic = 1; + DBG("windfarm: CPU Power using quadratic transform\n"); + } else + pow->quadratic = 0; + + if (wf_register_sensor(&pow->sens)) + goto fail; + return pow; + fail: + kfree(pow); + return NULL; +} + +static int smu_fetch_param_partitions(void) +{ + struct smu_sdbp_header *hdr; + + /* Get CPU voltage/current/power calibration data */ + hdr = smu_get_sdb_partition(SMU_SDB_CPUVCP_ID, NULL); + if (hdr == NULL) { + DBG("wf: cpuvcp partition (%02x) not found\n", + SMU_SDB_CPUVCP_ID); + return -ENODEV; + } + cpuvcp = (struct smu_sdbp_cpuvcp *)&hdr[1]; + /* Keep version around */ + cpuvcp_version = hdr->version; + + /* Get CPU diode calibration data */ + hdr = smu_get_sdb_partition(SMU_SDB_CPUDIODE_ID, NULL); + if (hdr == NULL) { + DBG("wf: cpudiode partition (%02x) not found\n", + SMU_SDB_CPUDIODE_ID); + return -ENODEV; + } + cpudiode = (struct smu_sdbp_cpudiode *)&hdr[1]; + + /* Get slots power calibration data if any */ + hdr = smu_get_sdb_partition(SMU_SDB_SLOTSPOW_ID, NULL); + if (hdr != NULL) + slotspow = (struct smu_sdbp_slotspow *)&hdr[1]; + + /* Get debug switches if any */ + hdr = smu_get_sdb_partition(SMU_SDB_DEBUG_SWITCHES_ID, NULL); + if (hdr != NULL) + debugswitches = (u8 *)&hdr[1]; + + return 0; +} + +static int __init smu_sensors_init(void) +{ + struct device_node *smu, *sensors, *s; + struct smu_ad_sensor *volt_sensor = NULL, *curr_sensor = NULL; + int rc; + + if (!smu_present()) + return -ENODEV; + + /* Get parameters partitions */ + rc = smu_fetch_param_partitions(); + if (rc) + return rc; + + smu = of_find_node_by_type(NULL, "smu"); + if (smu == NULL) + return -ENODEV; + + /* Look for sensors subdir */ + for (sensors = NULL; + (sensors = of_get_next_child(smu, sensors)) != NULL;) + if (!strcmp(sensors->name, "sensors")) + break; + + of_node_put(smu); + + /* Create basic sensors */ + for (s = NULL; + sensors && (s = of_get_next_child(sensors, s)) != NULL;) { + struct smu_ad_sensor *ads; + + ads = smu_ads_create(s); + if (ads == NULL) + continue; + list_add(&ads->link, &smu_ads); + /* keep track of cpu voltage & current */ + if (!strcmp(ads->sens.name, "cpu-voltage")) + volt_sensor = ads; + else if (!strcmp(ads->sens.name, "cpu-current")) + curr_sensor = ads; + } + + of_node_put(sensors); + + /* Create CPU power sensor if possible */ + if (volt_sensor && curr_sensor) + smu_cpu_power = smu_cpu_power_create(&volt_sensor->sens, + &curr_sensor->sens); + + return 0; +} + +static void __exit smu_sensors_exit(void) +{ + struct smu_ad_sensor *ads; + + /* dispose of power sensor */ + if (smu_cpu_power) + wf_unregister_sensor(&smu_cpu_power->sens); + + /* dispose of basic sensors */ + while (!list_empty(&smu_ads)) { + ads = list_entry(smu_ads.next, struct smu_ad_sensor, link); + list_del(&ads->link); + wf_unregister_sensor(&ads->sens); + } +} + + +module_init(smu_sensors_init); +module_exit(smu_sensors_exit); + +MODULE_AUTHOR("Benjamin Herrenschmidt "); +MODULE_DESCRIPTION("SMU sensor objects for PowerMacs thermal control"); +MODULE_LICENSE("GPL"); + Index: linux-work/drivers/macintosh/windfarm_lm75_sensor.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/drivers/macintosh/windfarm_lm75_sensor.c 2005-11-07 16:04:50.000000000 +1100 @@ -0,0 +1,263 @@ +/* + * Windfarm PowerMac thermal control. LM75 sensor + * + * (c) Copyright 2005 Benjamin Herrenschmidt, IBM Corp. + * + * + * Released under the term of the GNU GPL v2. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "windfarm.h" + +#define VERSION "0.1" + +#undef DEBUG + +#ifdef DEBUG +#define DBG(args...) printk(args) +#else +#define DBG(args...) do { } while(0) +#endif + +struct wf_lm75_sensor { + int ds1775 : 1; + int inited : 1; + struct i2c_client i2c; + struct wf_sensor sens; +}; +#define wf_to_lm75(c) container_of(c, struct wf_lm75_sensor, sens) +#define i2c_to_lm75(c) container_of(c, struct wf_lm75_sensor, i2c) + +static int wf_lm75_attach(struct i2c_adapter *adapter); +static int wf_lm75_detach(struct i2c_client *client); + +static struct i2c_driver wf_lm75_driver = { + .owner = THIS_MODULE, + .name = "wf_lm75", + .flags = I2C_DF_NOTIFY, + .attach_adapter = wf_lm75_attach, + .detach_client = wf_lm75_detach, +}; + +static int wf_lm75_get(struct wf_sensor *sr, s32 *value) +{ + struct wf_lm75_sensor *lm = wf_to_lm75(sr); + s32 data; + + if (lm->i2c.adapter == NULL) + return -ENODEV; + + /* Init chip if necessary */ + if (!lm->inited) { + u8 cfg_new, cfg = (u8)i2c_smbus_read_byte_data(&lm->i2c, 1); + + DBG("wf_lm75: Initializing %s, cfg was: %02x\n", + sr->name, cfg); + + /* clear shutdown bit, keep other settings as left by + * the firmware for now + */ + cfg_new = cfg & ~0x01; + i2c_smbus_write_byte_data(&lm->i2c, 1, cfg_new); + lm->inited = 1; + + /* If we just powered it up, let's wait 200 ms */ + msleep(200); + } + + /* Read temperature register */ + data = (s32)le16_to_cpu(i2c_smbus_read_word_data(&lm->i2c, 0)); + data <<= 8; + *value = data; + + return 0; +} + +static void wf_lm75_release(struct wf_sensor *sr) +{ + struct wf_lm75_sensor *lm = wf_to_lm75(sr); + + /* check if client is registered and detach from i2c */ + if (lm->i2c.adapter) { + i2c_detach_client(&lm->i2c); + lm->i2c.adapter = NULL; + } + + kfree(lm); +} + +static struct wf_sensor_ops wf_lm75_ops = { + .get_value = wf_lm75_get, + .release = wf_lm75_release, + .owner = THIS_MODULE, +}; + +static struct wf_lm75_sensor *wf_lm75_create(struct i2c_adapter *adapter, + u8 addr, int ds1775, + const char *loc) +{ + struct wf_lm75_sensor *lm; + + DBG("wf_lm75: creating %s device at address 0x%02x\n", + ds1775 ? "ds1775" : "lm75", addr); + + lm = kmalloc(sizeof(struct wf_lm75_sensor), GFP_KERNEL); + if (lm == NULL) + return NULL; + memset(lm, 0, sizeof(struct wf_lm75_sensor)); + + /* Usual rant about sensor names not beeing very consistent in + * the device-tree, oh well ... + * Add more entries below as you deal with more setups + */ + if (!strcmp(loc, "Hard drive") || !strcmp(loc, "DRIVE BAY")) + lm->sens.name = "hd-temp"; + else + goto fail; + + lm->inited = 0; + lm->sens.ops = &wf_lm75_ops; + lm->ds1775 = ds1775; + lm->i2c.addr = (addr >> 1) & 0x7f; + lm->i2c.adapter = adapter; + lm->i2c.driver = &wf_lm75_driver; + strncpy(lm->i2c.name, lm->sens.name, I2C_NAME_SIZE-1); + + if (i2c_attach_client(&lm->i2c)) { + printk(KERN_ERR "windfarm: failed to attach %s %s to i2c\n", + ds1775 ? "ds1775" : "lm75", lm->i2c.name); + goto fail; + } + + if (wf_register_sensor(&lm->sens)) { + i2c_detach_client(&lm->i2c); + goto fail; + } + + return lm; + fail: + kfree(lm); + return NULL; +} + +static int wf_lm75_attach(struct i2c_adapter *adapter) +{ + u8 bus_id; + struct device_node *smu, *bus, *dev; + + /* We currently only deal with LM75's hanging off the SMU + * i2c busses. If we extend that driver to other/older + * machines, we should split this function into SMU-i2c, + * keywest-i2c, PMU-i2c, ... + */ + + DBG("wf_lm75: adapter %s detected\n", adapter->name); + + if (strncmp(adapter->name, "smu-i2c-", 8) != 0) + return 0; + smu = of_find_node_by_type(NULL, "smu"); + if (smu == NULL) + return 0; + + /* Look for the bus in the device-tree */ + bus_id = (u8)simple_strtoul(adapter->name + 8, NULL, 16); + + DBG("wf_lm75: bus ID is %x\n", bus_id); + + /* Look for sensors subdir */ + for (bus = NULL; + (bus = of_get_next_child(smu, bus)) != NULL;) { + u32 *reg; + + if (strcmp(bus->name, "i2c")) + continue; + reg = (u32 *)get_property(bus, "reg", NULL); + if (reg == NULL) + continue; + if (bus_id == *reg) + break; + } + of_node_put(smu); + if (bus == NULL) { + printk(KERN_WARNING "windfarm: SMU i2c bus 0x%x not found" + " in device-tree !\n", bus_id); + return 0; + } + + DBG("wf_lm75: bus found, looking for device...\n"); + + /* Now look for lm75(s) in there */ + for (dev = NULL; + (dev = of_get_next_child(bus, dev)) != NULL;) { + const char *loc = + get_property(dev, "hwsensor-location", NULL); + u32 *reg = (u32 *)get_property(dev, "reg", NULL); + DBG(" dev: %s... (loc: %p, reg: %p)\n", dev->name, loc, reg); + if (loc == NULL || reg == NULL) + continue; + /* real lm75 */ + if (device_is_compatible(dev, "lm75")) + wf_lm75_create(adapter, *reg, 0, loc); + /* ds1775 (compatible, better resolution */ + else if (device_is_compatible(dev, "ds1775")) + wf_lm75_create(adapter, *reg, 1, loc); + } + + of_node_put(bus); + + return 0; +} + +static int wf_lm75_detach(struct i2c_client *client) +{ + struct wf_lm75_sensor *lm = i2c_to_lm75(client); + + DBG("wf_lm75: i2c detatch called for %s\n", lm->sens.name); + + /* Mark client detached */ + lm->i2c.adapter = NULL; + + /* release sensor */ + wf_unregister_sensor(&lm->sens); + + return 0; +} + +static int __init wf_lm75_sensor_init(void) +{ + int rc; + + rc = i2c_add_driver(&wf_lm75_driver); + if (rc < 0) + return rc; + return 0; +} + +static void __exit wf_lm75_sensor_exit(void) +{ + i2c_del_driver(&wf_lm75_driver); +} + + +module_init(wf_lm75_sensor_init); +module_exit(wf_lm75_sensor_exit); + +MODULE_AUTHOR("Benjamin Herrenschmidt "); +MODULE_DESCRIPTION("LM75 sensor objects for PowerMacs thermal control"); +MODULE_LICENSE("GPL"); + Index: linux-work/drivers/macintosh/windfarm_pid.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/drivers/macintosh/windfarm_pid.c 2005-11-07 16:04:41.000000000 +1100 @@ -0,0 +1,145 @@ +/* + * Windfarm PowerMac thermal control. Generic PID helpers + * + * (c) Copyright 2005 Benjamin Herrenschmidt, IBM Corp. + * + * + * Released under the term of the GNU GPL v2. + */ + +#include +#include +#include +#include +#include + +#include "windfarm_pid.h" + +#undef DEBUG + +#ifdef DEBUG +#define DBG(args...) printk(args) +#else +#define DBG(args...) do { } while(0) +#endif + +void wf_pid_init(struct wf_pid_state *st, struct wf_pid_param *param) +{ + memset(st, 0, sizeof(struct wf_pid_state)); + st->param = *param; + st->first = 1; +} +EXPORT_SYMBOL_GPL(wf_pid_init); + +s32 wf_pid_run(struct wf_pid_state *st, s32 new_sample) +{ + s64 error, integ, deriv; + s32 target; + int i, hlen = st->param.history_len; + + /* Calculate error term */ + error = new_sample - st->param.itarget; + + /* Get samples into our history buffer */ + if (st->first) { + for (i = 0; i < hlen; i++) { + st->samples[i] = new_sample; + st->errors[i] = error; + } + st->first = 0; + st->index = 0; + } else { + st->index = (st->index + 1) % hlen; + st->samples[st->index] = new_sample; + st->errors[st->index] = error; + } + + /* Calculate integral term */ + for (i = 0, integ = 0; i < hlen; i++) + integ += st->errors[(st->index + hlen - i) % hlen]; + integ *= st->param.interval; + + /* Calculate derivative term */ + deriv = st->errors[st->index] - + st->errors[(st->index + hlen - 1) % hlen]; + deriv /= st->param.interval; + + /* Calculate target */ + target = (s32)((integ * (s64)st->param.gr + deriv * (s64)st->param.gd + + error * (s64)st->param.gp) >> 36); + if (st->param.additive) + target += st->target; + target = max(target, st->param.min); + target = min(target, st->param.max); + st->target = target; + + return st->target; +} +EXPORT_SYMBOL_GPL(wf_pid_run); + +void wf_cpu_pid_init(struct wf_cpu_pid_state *st, + struct wf_cpu_pid_param *param) +{ + memset(st, 0, sizeof(struct wf_cpu_pid_state)); + st->param = *param; + st->first = 1; +} +EXPORT_SYMBOL_GPL(wf_cpu_pid_init); + +s32 wf_cpu_pid_run(struct wf_cpu_pid_state *st, s32 new_power, s32 new_temp) +{ + s64 error, integ, deriv, prop; + s32 target, sval, adj; + int i, hlen = st->param.history_len; + + /* Calculate error term */ + error = st->param.pmaxadj - new_power; + + /* Get samples into our history buffer */ + if (st->first) { + for (i = 0; i < hlen; i++) { + st->powers[i] = new_power; + st->errors[i] = error; + } + st->temps[0] = st->temps[1] = new_temp; + st->first = 0; + st->index = st->tindex = 0; + } else { + st->index = (st->index + 1) % hlen; + st->powers[st->index] = new_power; + st->errors[st->index] = error; + st->tindex = (st->tindex + 1) % 2; + st->temps[st->tindex] = new_temp; + } + + /* Calculate integral term */ + for (i = 0, integ = 0; i < hlen; i++) + integ += st->errors[(st->index + hlen - i) % hlen]; + integ *= st->param.interval; + integ *= st->param.gr; + sval = st->param.tmax - ((integ >> 20) & 0xffffffff); + adj = min(st->param.ttarget, sval); + + DBG("integ: %lx, sval: %lx, adj: %lx\n", integ, sval, adj); + + /* Calculate derivative term */ + deriv = st->temps[st->tindex] - + st->temps[(st->tindex + 2 - 1) % 2]; + deriv /= st->param.interval; + deriv *= st->param.gd; + + /* Calculate proportional term */ + prop = (new_temp - adj); + prop *= st->param.gp; + + DBG("deriv: %lx, prop: %lx\n", deriv, prop); + + /* Calculate target */ + target = st->target + (s32)((deriv + prop) >> 36); + target = max(target, st->param.min); + target = min(target, st->param.max); + st->target = target; + + return st->target; +} +EXPORT_SYMBOL_GPL(wf_cpu_pid_run); Index: linux-work/drivers/macintosh/windfarm_pid.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/drivers/macintosh/windfarm_pid.h 2005-11-07 13:30:46.000000000 +1100 @@ -0,0 +1,84 @@ +/* + * Windfarm PowerMac thermal control. Generic PID helpers + * + * (c) Copyright 2005 Benjamin Herrenschmidt, IBM Corp. + * + * + * Released under the term of the GNU GPL v2. + * + * This is a pair of generic PID helpers that can be used by + * control loops. One is the basic PID implementation, the + * other one is more specifically tailored to the loops used + * for CPU control with 2 input sample types (temp and power) + */ + +/* + * *** Simple PID *** + */ + +#define WF_PID_MAX_HISTORY 32 + +/* This parameter array is passed to the PID algorithm. Currently, + * we don't support changing parameters on the fly as it's not needed + * but could be implemented (with necessary adjustment of the history + * buffer + */ +struct wf_pid_param { + int interval; /* Interval between samples in seconds */ + int history_len; /* Size of history buffer */ + int additive; /* 1: target relative to previous value */ + s32 gd, gp, gr; /* PID gains */ + s32 itarget; /* PID input target */ + s32 min,max; /* min and max target values */ +}; + +struct wf_pid_state { + int first; /* first run of the loop */ + int index; /* index of current sample */ + s32 target; /* current target value */ + s32 samples[WF_PID_MAX_HISTORY]; /* samples history buffer */ + s32 errors[WF_PID_MAX_HISTORY]; /* error history buffer */ + + struct wf_pid_param param; +}; + +extern void wf_pid_init(struct wf_pid_state *st, struct wf_pid_param *param); +extern s32 wf_pid_run(struct wf_pid_state *st, s32 sample); + + +/* + * *** CPU PID *** + */ + +#define WF_CPU_PID_MAX_HISTORY 32 + +/* This parameter array is passed to the CPU PID algorithm. Currently, + * we don't support changing parameters on the fly as it's not needed + * but could be implemented (with necessary adjustment of the history + * buffer + */ +struct wf_cpu_pid_param { + int interval; /* Interval between samples in seconds */ + int history_len; /* Size of history buffer */ + s32 gd, gp, gr; /* PID gains */ + s32 pmaxadj; /* PID max power adjust */ + s32 ttarget; /* PID input target */ + s32 tmax; /* PID input max */ + s32 min,max; /* min and max target values */ +}; + +struct wf_cpu_pid_state { + int first; /* first run of the loop */ + int index; /* index of current power */ + int tindex; /* index of current temp */ + s32 target; /* current target value */ + s32 powers[WF_PID_MAX_HISTORY]; /* power history buffer */ + s32 errors[WF_PID_MAX_HISTORY]; /* error history buffer */ + s32 temps[2]; /* temp. history buffer */ + + struct wf_cpu_pid_param param; +}; + +extern void wf_cpu_pid_init(struct wf_cpu_pid_state *st, + struct wf_cpu_pid_param *param); +extern s32 wf_cpu_pid_run(struct wf_cpu_pid_state *st, s32 power, s32 temp); Index: linux-work/drivers/macintosh/windfarm_cpufreq_clamp.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/drivers/macintosh/windfarm_cpufreq_clamp.c 2005-11-07 13:30:46.000000000 +1100 @@ -0,0 +1,105 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "windfarm.h" + +#define VERSION "0.3" + +static int clamped; +static struct wf_control *clamp_control; + +static int clamp_notifier_call(struct notifier_block *self, + unsigned long event, void *data) +{ + struct cpufreq_policy *p = data; + unsigned long max_freq; + + if (event != CPUFREQ_ADJUST) + return 0; + + max_freq = clamped ? (p->cpuinfo.min_freq) : (p->cpuinfo.max_freq); + cpufreq_verify_within_limits(p, 0, max_freq); + + return 0; +} + +static struct notifier_block clamp_notifier = { + .notifier_call = clamp_notifier_call, +}; + +static int clamp_set(struct wf_control *ct, s32 value) +{ + if (value) + printk(KERN_INFO "windfarm: Clamping CPU frequency to " + "minimum !\n"); + else + printk(KERN_INFO "windfarm: CPU frequency unclamped !\n"); + clamped = value; + cpufreq_update_policy(0); + return 0; +} + +static int clamp_get(struct wf_control *ct, s32 *value) +{ + *value = clamped; + return 0; +} + +static s32 clamp_min(struct wf_control *ct) +{ + return 0; +} + +static s32 clamp_max(struct wf_control *ct) +{ + return 1; +} + +static struct wf_control_ops clamp_ops = { + .set_value = clamp_set, + .get_value = clamp_get, + .get_min = clamp_min, + .get_max = clamp_max, + .owner = THIS_MODULE, +}; + +static int __init wf_cpufreq_clamp_init(void) +{ + struct wf_control *clamp; + + clamp = kmalloc(sizeof(struct wf_control), GFP_KERNEL); + if (clamp == NULL) + return -ENOMEM; + cpufreq_register_notifier(&clamp_notifier, CPUFREQ_POLICY_NOTIFIER); + clamp->ops = &clamp_ops; + clamp->name = "cpufreq-clamp"; + if (wf_register_control(clamp)) + goto fail; + clamp_control = clamp; + return 0; + fail: + kfree(clamp); + return -ENODEV; +} + +static void __exit wf_cpufreq_clamp_exit(void) +{ + if (clamp_control) + wf_unregister_control(clamp_control); +} + + +module_init(wf_cpufreq_clamp_init); +module_exit(wf_cpufreq_clamp_exit); + +MODULE_AUTHOR("Benjamin Herrenschmidt "); +MODULE_DESCRIPTION("CPU frequency clamp for PowerMacs thermal control"); +MODULE_LICENSE("GPL"); + Index: linux-work/drivers/macintosh/windfarm_pm81.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/drivers/macintosh/windfarm_pm81.c 2005-11-07 16:04:23.000000000 +1100 @@ -0,0 +1,879 @@ +/* + * Windfarm PowerMac thermal control. iMac G5 + * + * (c) Copyright 2005 Benjamin Herrenschmidt, IBM Corp. + * + * + * Released under the term of the GNU GPL v2. + * + * The algorithm used is the PID control algorithm, used the same + * way the published Darwin code does, using the same values that + * are present in the Darwin 8.2 snapshot property lists (note however + * that none of the code has been re-used, it's a complete re-implementation + * + * The various control loops found in Darwin config file are: + * + * PowerMac8,1 and PowerMac8,2 + * =========================== + * + * System Fans control loop. Different based on models. In addition to the + * usual PID algorithm, the control loop gets 2 additional pairs of linear + * scaling factors (scale/offsets) expressed as 4.12 fixed point values + * signed offset, unsigned scale) + * + * The targets are modified such as: + * - the linked control (second control) gets the target value as-is + * (typically the drive fan) + * - the main control (first control) gets the target value scaled with + * the first pair of factors, and is then modified as below + * - the value of the target of the CPU Fan control loop is retreived, + * scaled with the second pair of factors, and the max of that and + * the scaled target is applied to the main control. + * + * # model_id: 2 + * controls : system-fan, drive-bay-fan + * sensors : hd-temp + * PID params : G_d = 0x15400000 + * G_p = 0x00200000 + * G_r = 0x000002fd + * History = 2 entries + * Input target = 0x3a0000 + * Interval = 5s + * linear-factors : offset = 0xff38 scale = 0x0ccd + * offset = 0x0208 scale = 0x07ae + * + * # model_id: 3 + * controls : system-fan, drive-bay-fan + * sensors : hd-temp + * PID params : G_d = 0x08e00000 + * G_p = 0x00566666 + * G_r = 0x0000072b + * History = 2 entries + * Input target = 0x350000 + * Interval = 5s + * linear-factors : offset = 0xff38 scale = 0x0ccd + * offset = 0x0000 scale = 0x0000 + * + * # model_id: 5 + * controls : system-fan + * sensors : hd-temp + * PID params : G_d = 0x15400000 + * G_p = 0x00233333 + * G_r = 0x000002fd + * History = 2 entries + * Input target = 0x3a0000 + * Interval = 5s + * linear-factors : offset = 0x0000 scale = 0x1000 + * offset = 0x0091 scale = 0x0bae + * + * CPU Fan control loop. The loop is identical for all models. it + * has an additional pair of scaling factor. This is used to scale the + * systems fan control loop target result (the one before it gets scaled + * by the System Fans control loop itself). Then, the max value of the + * calculated target value and system fan value is sent to the fans + * + * controls : cpu-fan + * sensors : cpu-temp cpu-power + * PID params : From SMU sdb partition + * linear-factors : offset = 0xfb50 scale = 0x1000 + * + * CPU Slew control loop. Not implemented. The cpufreq driver in linux is + * completely separate for now, though we could find a way to link it, either + * as a client reacting to overtemp notifications, or directling monitoring + * the CPU temperature + * + * WARNING ! The CPU control loop requires the CPU tmax for the current + * operating point. However, we currently are completely separated from + * the cpufreq driver and thus do not know what the current operating + * point is. Fortunately, we also do not have any hardware supporting anything + * but operating point 0 at the moment, thus we just peek that value directly + * from the SDB partition. If we ever end up with actually slewing the system + * clock and thus changing operating points, we'll have to find a way to + * communicate with the CPU freq driver; + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "windfarm.h" +#include "windfarm_pid.h" + +#define VERSION "0.4" + +#undef DEBUG + +#ifdef DEBUG +#define DBG(args...) printk(args) +#else +#define DBG(args...) do { } while(0) +#endif + +/* define this to force CPU overtemp to 74 degree, useful for testing + * the overtemp code + */ +#undef HACKED_OVERTEMP + +static int wf_smu_mach_model; /* machine model id */ + +static struct device *wf_smu_dev; + +/* Controls & sensors */ +static struct wf_sensor *sensor_cpu_power; +static struct wf_sensor *sensor_cpu_temp; +static struct wf_sensor *sensor_hd_temp; +static struct wf_control *fan_cpu_main; +static struct wf_control *fan_hd; +static struct wf_control *fan_system; +static struct wf_control *cpufreq_clamp; + +/* Set to kick the control loop into life */ +static int wf_smu_all_controls_ok, wf_smu_all_sensors_ok, wf_smu_started; + +/* Failure handling.. could be nicer */ +#define FAILURE_FAN 0x01 +#define FAILURE_SENSOR 0x02 +#define FAILURE_OVERTEMP 0x04 + +static unsigned int wf_smu_failure_state; +static int wf_smu_readjust, wf_smu_skipping; + +/* + * ****** System Fans Control Loop ****** + * + */ + +/* Parameters for the System Fans control loop. Parameters + * not in this table such as interval, history size, ... + * are common to all versions and thus hard coded for now. + */ +struct wf_smu_sys_fans_param { + int model_id; + s32 itarget; + s32 gd, gp, gr; + + s16 offset0; + u16 scale0; + s16 offset1; + u16 scale1; +}; + +#define WF_SMU_SYS_FANS_INTERVAL 5 +#define WF_SMU_SYS_FANS_HISTORY_SIZE 2 + +/* State data used by the system fans control loop + */ +struct wf_smu_sys_fans_state { + int ticks; + s32 sys_setpoint; + s32 hd_setpoint; + s16 offset0; + u16 scale0; + s16 offset1; + u16 scale1; + struct wf_pid_state pid; +}; + +/* + * Configs for SMU Sytem Fan control loop + */ +static struct wf_smu_sys_fans_param wf_smu_sys_all_params[] = { + /* Model ID 2 */ + { + .model_id = 2, + .itarget = 0x3a0000, + .gd = 0x15400000, + .gp = 0x00200000, + .gr = 0x000002fd, + .offset0 = 0xff38, + .scale0 = 0x0ccd, + .offset1 = 0x0208, + .scale1 = 0x07ae, + }, + /* Model ID 3 */ + { + .model_id = 2, + .itarget = 0x350000, + .gd = 0x08e00000, + .gp = 0x00566666, + .gr = 0x0000072b, + .offset0 = 0xff38, + .scale0 = 0x0ccd, + .offset1 = 0x0000, + .scale1 = 0x0000, + }, + /* Model ID 5 */ + { + .model_id = 2, + .itarget = 0x3a0000, + .gd = 0x15400000, + .gp = 0x00233333, + .gr = 0x000002fd, + .offset0 = 0x0000, + .scale0 = 0x1000, + .offset1 = 0x0091, + .scale1 = 0x0bae, + }, +}; +#define WF_SMU_SYS_FANS_NUM_CONFIGS ARRAY_SIZE(wf_smu_sys_all_params) + +static struct wf_smu_sys_fans_state *wf_smu_sys_fans; + +/* + * ****** CPU Fans Control Loop ****** + * + */ + + +#define WF_SMU_CPU_FANS_INTERVAL 1 +#define WF_SMU_CPU_FANS_MAX_HISTORY 16 +#define WF_SMU_CPU_FANS_SIBLING_SCALE 0x00001000 +#define WF_SMU_CPU_FANS_SIBLING_OFFSET 0xfffffb50 + +/* State data used by the cpu fans control loop + */ +struct wf_smu_cpu_fans_state { + int ticks; + s32 cpu_setpoint; + s32 scale; + s32 offset; + struct wf_cpu_pid_state pid; +}; + +static struct wf_smu_cpu_fans_state *wf_smu_cpu_fans; + + + +/* + * ***** Implementation ***** + * + */ + +static void wf_smu_create_sys_fans(void) +{ + struct wf_smu_sys_fans_param *param = NULL; + struct wf_pid_param pid_param; + int i; + + /* First, locate the params for this model */ + for (i = 0; i < WF_SMU_SYS_FANS_NUM_CONFIGS; i++) + if (wf_smu_sys_all_params[i].model_id == wf_smu_mach_model) { + param = &wf_smu_sys_all_params[i]; + break; + } + + /* No params found, put fans to max */ + if (param == NULL) { + printk(KERN_WARNING "windfarm: System fan config not found " + "for this machine model, max fan speed\n"); + goto fail; + } + + /* Alloc & initialize state */ + wf_smu_sys_fans = kmalloc(sizeof(struct wf_smu_sys_fans_state), + GFP_KERNEL); + if (wf_smu_sys_fans == NULL) { + printk(KERN_WARNING "windfarm: Memory allocation error" + " max fan speed\n"); + goto fail; + } + wf_smu_sys_fans->ticks = 1; + wf_smu_sys_fans->scale0 = param->scale0; + wf_smu_sys_fans->offset0 = param->offset0; + wf_smu_sys_fans->scale1 = param->scale1; + wf_smu_sys_fans->offset1 = param->offset1; + + /* Fill PID params */ + pid_param.gd = param->gd; + pid_param.gp = param->gp; + pid_param.gr = param->gr; + pid_param.interval = WF_SMU_SYS_FANS_INTERVAL; + pid_param.history_len = WF_SMU_SYS_FANS_HISTORY_SIZE; + pid_param.itarget = param->itarget; + pid_param.min = fan_system->ops->get_min(fan_system); + pid_param.max = fan_system->ops->get_max(fan_system); + if (fan_hd) { + pid_param.min = + max(pid_param.min,fan_hd->ops->get_min(fan_hd)); + pid_param.max = + min(pid_param.max,fan_hd->ops->get_max(fan_hd)); + } + wf_pid_init(&wf_smu_sys_fans->pid, &pid_param); + + DBG("wf: System Fan control initialized.\n"); + DBG(" itarged=%d.%03d, min=%d RPM, max=%d RPM\n", + FIX32TOPRINT(pid_param.itarget), pid_param.min, pid_param.max); + return; + + fail: + + if (fan_system) + wf_control_set_max(fan_system); + if (fan_hd) + wf_control_set_max(fan_hd); +} + +static void wf_smu_sys_fans_tick(struct wf_smu_sys_fans_state *st) +{ + s32 new_setpoint, temp, scaled, cputarget; + int rc; + + if (--st->ticks != 0) { + if (wf_smu_readjust) + goto readjust; + return; + } + st->ticks = WF_SMU_SYS_FANS_INTERVAL; + + rc = sensor_hd_temp->ops->get_value(sensor_hd_temp, &temp); + if (rc) { + printk(KERN_WARNING "windfarm: HD temp sensor error %d\n", + rc); + wf_smu_failure_state |= FAILURE_SENSOR; + return; + } + + DBG("wf_smu: System Fans tick ! HD temp: %d.%03d\n", + FIX32TOPRINT(temp)); + + if (temp > (st->pid.param.itarget + 0x50000)) + wf_smu_failure_state |= FAILURE_OVERTEMP; + + new_setpoint = wf_pid_run(&st->pid, temp); + + DBG("wf_smu: new_setpoint: %d RPM\n", (int)new_setpoint); + + scaled = ((((s64)new_setpoint) * (s64)st->scale0) >> 12) + st->offset0; + + DBG("wf_smu: scaled setpoint: %d RPM\n", (int)scaled); + + cputarget = wf_smu_cpu_fans ? wf_smu_cpu_fans->pid.target : 0; + cputarget = ((((s64)cputarget) * (s64)st->scale1) >> 12) + st->offset1; + scaled = max(scaled, cputarget); + scaled = max(scaled, st->pid.param.min); + scaled = min(scaled, st->pid.param.max); + + DBG("wf_smu: adjusted setpoint: %d RPM\n", (int)scaled); + + if (st->sys_setpoint == scaled && new_setpoint == st->hd_setpoint) + return; + st->sys_setpoint = scaled; + st->hd_setpoint = new_setpoint; + readjust: + if (fan_system && wf_smu_failure_state == 0) { + rc = fan_system->ops->set_value(fan_system, st->sys_setpoint); + if (rc) { + printk(KERN_WARNING "windfarm: Sys fan error %d\n", + rc); + wf_smu_failure_state |= FAILURE_FAN; + } + } + if (fan_hd && wf_smu_failure_state == 0) { + rc = fan_hd->ops->set_value(fan_hd, st->hd_setpoint); + if (rc) { + printk(KERN_WARNING "windfarm: HD fan error %d\n", + rc); + wf_smu_failure_state |= FAILURE_FAN; + } + } +} + +static void wf_smu_create_cpu_fans(void) +{ + struct wf_cpu_pid_param pid_param; + struct smu_sdbp_header *hdr; + struct smu_sdbp_cpupiddata *piddata; + struct smu_sdbp_fvt *fvt; + s32 tmax, tdelta, maxpow, powadj; + + /* First, locate the PID params in SMU SBD */ + hdr = smu_get_sdb_partition(SMU_SDB_CPUPIDDATA_ID, NULL); + if (hdr == 0) { + printk(KERN_WARNING "windfarm: CPU PID fan config not found " + "max fan speed\n"); + goto fail; + } + piddata = (struct smu_sdbp_cpupiddata *)&hdr[1]; + + /* Get the FVT params for operating point 0 (the only supported one + * for now) in order to get tmax + */ + hdr = smu_get_sdb_partition(SMU_SDB_FVT_ID, NULL); + if (hdr) { + fvt = (struct smu_sdbp_fvt *)&hdr[1]; + tmax = ((s32)fvt->maxtemp) << 16; + } else + tmax = 0x5e0000; /* 94 degree default */ + + /* Alloc & initialize state */ + wf_smu_cpu_fans = kmalloc(sizeof(struct wf_smu_cpu_fans_state), + GFP_KERNEL); + if (wf_smu_cpu_fans == NULL) + goto fail; + wf_smu_cpu_fans->ticks = 1; + + wf_smu_cpu_fans->scale = WF_SMU_CPU_FANS_SIBLING_SCALE; + wf_smu_cpu_fans->offset = WF_SMU_CPU_FANS_SIBLING_OFFSET; + + /* Fill PID params */ + pid_param.interval = WF_SMU_CPU_FANS_INTERVAL; + pid_param.history_len = piddata->history_len; + if (pid_param.history_len > WF_CPU_PID_MAX_HISTORY) { + printk(KERN_WARNING "windfarm: History size overflow on " + "CPU control loop (%d)\n", piddata->history_len); + pid_param.history_len = WF_CPU_PID_MAX_HISTORY; + } + pid_param.gd = piddata->gd; + pid_param.gp = piddata->gp; + pid_param.gr = piddata->gr / pid_param.history_len; + + tdelta = ((s32)piddata->target_temp_delta) << 16; + maxpow = ((s32)piddata->max_power) << 16; + powadj = ((s32)piddata->power_adj) << 16; + + pid_param.tmax = tmax; + pid_param.ttarget = tmax - tdelta; + pid_param.pmaxadj = maxpow - powadj; + + pid_param.min = fan_cpu_main->ops->get_min(fan_cpu_main); + pid_param.max = fan_cpu_main->ops->get_max(fan_cpu_main); + + wf_cpu_pid_init(&wf_smu_cpu_fans->pid, &pid_param); + + DBG("wf: CPU Fan control initialized.\n"); + DBG(" ttarged=%d.%03d, tmax=%d.%03d, min=%d RPM, max=%d RPM\n", + FIX32TOPRINT(pid_param.ttarget), FIX32TOPRINT(pid_param.tmax), + pid_param.min, pid_param.max); + + return; + + fail: + printk(KERN_WARNING "windfarm: CPU fan config not found\n" + "for this machine model, max fan speed\n"); + + if (cpufreq_clamp) + wf_control_set_max(cpufreq_clamp); + if (fan_cpu_main) + wf_control_set_max(fan_cpu_main); +} + +static void wf_smu_cpu_fans_tick(struct wf_smu_cpu_fans_state *st) +{ + s32 new_setpoint, temp, power, systarget; + int rc; + + if (--st->ticks != 0) { + if (wf_smu_readjust) + goto readjust; + return; + } + st->ticks = WF_SMU_CPU_FANS_INTERVAL; + + rc = sensor_cpu_temp->ops->get_value(sensor_cpu_temp, &temp); + if (rc) { + printk(KERN_WARNING "windfarm: CPU temp sensor error %d\n", + rc); + wf_smu_failure_state |= FAILURE_SENSOR; + return; + } + + rc = sensor_cpu_power->ops->get_value(sensor_cpu_power, &power); + if (rc) { + printk(KERN_WARNING "windfarm: CPU power sensor error %d\n", + rc); + wf_smu_failure_state |= FAILURE_SENSOR; + return; + } + + DBG("wf_smu: CPU Fans tick ! CPU temp: %d.%03d, power: %d.%03d\n", + FIX32TOPRINT(temp), FIX32TOPRINT(power)); + +#ifdef HACKED_OVERTEMP + if (temp > 0x4a0000) + wf_smu_failure_state |= FAILURE_OVERTEMP; +#else + if (temp > st->pid.param.tmax) + wf_smu_failure_state |= FAILURE_OVERTEMP; +#endif + new_setpoint = wf_cpu_pid_run(&st->pid, power, temp); + + DBG("wf_smu: new_setpoint: %d RPM\n", (int)new_setpoint); + + systarget = wf_smu_sys_fans ? wf_smu_sys_fans->pid.target : 0; + systarget = ((((s64)systarget) * (s64)st->scale) >> 12) + + st->offset; + new_setpoint = max(new_setpoint, systarget); + new_setpoint = max(new_setpoint, st->pid.param.min); + new_setpoint = min(new_setpoint, st->pid.param.max); + + DBG("wf_smu: adjusted setpoint: %d RPM\n", (int)new_setpoint); + + if (st->cpu_setpoint == new_setpoint) + return; + st->cpu_setpoint = new_setpoint; + readjust: + if (fan_cpu_main && wf_smu_failure_state == 0) { + rc = fan_cpu_main->ops->set_value(fan_cpu_main, + st->cpu_setpoint); + if (rc) { + printk(KERN_WARNING "windfarm: CPU main fan" + " error %d\n", rc); + wf_smu_failure_state |= FAILURE_FAN; + } + } +} + + +/* + * ****** Attributes ****** + * + */ + +#define BUILD_SHOW_FUNC_FIX(name, data) \ +static ssize_t show_##name(struct device *dev, \ + struct device_attribute *attr, \ + char *buf) \ +{ \ + ssize_t r; \ + s32 val = 0; \ + data->ops->get_value(data, &val); \ + r = sprintf(buf, "%d.%03d", FIX32TOPRINT(val)); \ + return r; \ +} \ +static DEVICE_ATTR(name,S_IRUGO,show_##name, NULL); + + +#define BUILD_SHOW_FUNC_INT(name, data) \ +static ssize_t show_##name(struct device *dev, \ + struct device_attribute *attr, \ + char *buf) \ +{ \ + s32 val = 0; \ + data->ops->get_value(data, &val); \ + return sprintf(buf, "%d", val); \ +} \ +static DEVICE_ATTR(name,S_IRUGO,show_##name, NULL); + +BUILD_SHOW_FUNC_INT(cpu_fan, fan_cpu_main); +BUILD_SHOW_FUNC_INT(sys_fan, fan_system); +BUILD_SHOW_FUNC_INT(hd_fan, fan_hd); + +BUILD_SHOW_FUNC_FIX(cpu_temp, sensor_cpu_temp); +BUILD_SHOW_FUNC_FIX(cpu_power, sensor_cpu_power); +BUILD_SHOW_FUNC_FIX(hd_temp, sensor_hd_temp); + +/* + * ****** Setup / Init / Misc ... ****** + * + */ + +static void wf_smu_tick(void) +{ + unsigned int last_failure = wf_smu_failure_state; + unsigned int new_failure; + + if (!wf_smu_started) { + DBG("wf: creating control loops !\n"); + wf_smu_create_sys_fans(); + wf_smu_create_cpu_fans(); + wf_smu_started = 1; + } + + /* Skipping ticks */ + if (wf_smu_skipping && --wf_smu_skipping) + return; + + wf_smu_failure_state = 0; + if (wf_smu_sys_fans) + wf_smu_sys_fans_tick(wf_smu_sys_fans); + if (wf_smu_cpu_fans) + wf_smu_cpu_fans_tick(wf_smu_cpu_fans); + + wf_smu_readjust = 0; + new_failure = wf_smu_failure_state & ~last_failure; + + /* If entering failure mode, clamp cpufreq and ramp all + * fans to full speed. + */ + if (wf_smu_failure_state && !last_failure) { + if (cpufreq_clamp) + wf_control_set_max(cpufreq_clamp); + if (fan_system) + wf_control_set_max(fan_system); + if (fan_cpu_main) + wf_control_set_max(fan_cpu_main); + if (fan_hd) + wf_control_set_max(fan_hd); + } + + /* If leaving failure mode, unclamp cpufreq and readjust + * all fans on next iteration + */ + if (!wf_smu_failure_state && last_failure) { + if (cpufreq_clamp) + wf_control_set_min(cpufreq_clamp); + wf_smu_readjust = 1; + } + + /* Overtemp condition detected, notify and start skipping a couple + * ticks to let the temperature go down + */ + if (new_failure & FAILURE_OVERTEMP) { + wf_set_overtemp(); + wf_smu_skipping = 2; + } + + /* We only clear the overtemp condition if overtemp is cleared + * _and_ no other failure is present. Since a sensor error will + * clear the overtemp condition (can't measure temperature) at + * the control loop levels, but we don't want to keep it clear + * here in this case + */ + if (new_failure == 0 && last_failure & FAILURE_OVERTEMP) + wf_clear_overtemp(); +} + +static void wf_smu_new_control(struct wf_control *ct) +{ + if (wf_smu_all_controls_ok) + return; + + if (fan_cpu_main == NULL && !strcmp(ct->name, "cpu-fan")) { + if (wf_get_control(ct) == 0) { + fan_cpu_main = ct; + device_create_file(wf_smu_dev, &dev_attr_cpu_fan); + } + } + + if (fan_system == NULL && !strcmp(ct->name, "system-fan")) { + if (wf_get_control(ct) == 0) { + fan_system = ct; + device_create_file(wf_smu_dev, &dev_attr_sys_fan); + } + } + + if (cpufreq_clamp == NULL && !strcmp(ct->name, "cpufreq-clamp")) { + if (wf_get_control(ct) == 0) + cpufreq_clamp = ct; + } + + /* Darwin property list says the HD fan is only for model ID + * 0, 1, 2 and 3 + */ + + if (wf_smu_mach_model > 3) { + if (fan_system && fan_cpu_main && cpufreq_clamp) + wf_smu_all_controls_ok = 1; + return; + } + + if (fan_hd == NULL && !strcmp(ct->name, "drive-bay-fan")) { + if (wf_get_control(ct) == 0) { + fan_hd = ct; + device_create_file(wf_smu_dev, &dev_attr_hd_fan); + } + } + + if (fan_system && fan_hd && fan_cpu_main && cpufreq_clamp) + wf_smu_all_controls_ok = 1; +} + +static void wf_smu_new_sensor(struct wf_sensor *sr) +{ + if (wf_smu_all_sensors_ok) + return; + + if (sensor_cpu_power == NULL && !strcmp(sr->name, "cpu-power")) { + if (wf_get_sensor(sr) == 0) { + sensor_cpu_power = sr; + device_create_file(wf_smu_dev, &dev_attr_cpu_power); + } + } + + if (sensor_cpu_temp == NULL && !strcmp(sr->name, "cpu-temp")) { + if (wf_get_sensor(sr) == 0) { + sensor_cpu_temp = sr; + device_create_file(wf_smu_dev, &dev_attr_cpu_temp); + } + } + + if (sensor_hd_temp == NULL && !strcmp(sr->name, "hd-temp")) { + if (wf_get_sensor(sr) == 0) { + sensor_hd_temp = sr; + device_create_file(wf_smu_dev, &dev_attr_hd_temp); + } + } + + if (sensor_cpu_power && sensor_cpu_temp && sensor_hd_temp) + wf_smu_all_sensors_ok = 1; +} + + +static int wf_smu_notify(struct notifier_block *self, + unsigned long event, void *data) +{ + switch(event) { + case WF_EVENT_NEW_CONTROL: + DBG("wf: new control %s detected\n", + ((struct wf_control *)data)->name); + wf_smu_new_control(data); + wf_smu_readjust = 1; + break; + case WF_EVENT_NEW_SENSOR: + DBG("wf: new sensor %s detected\n", + ((struct wf_sensor *)data)->name); + wf_smu_new_sensor(data); + break; + case WF_EVENT_TICK: + if (wf_smu_all_controls_ok && wf_smu_all_sensors_ok) + wf_smu_tick(); + } + + return 0; +} + +static struct notifier_block wf_smu_events = { + .notifier_call = wf_smu_notify, +}; + +static int wf_init_pm(void) +{ + struct smu_sdbp_header *hdr; + + hdr = smu_get_sdb_partition(SMU_SDB_SENSORTREE_ID, NULL); + if (hdr != 0) { + struct smu_sdbp_sensortree *st = + (struct smu_sdbp_sensortree *)&hdr[1]; + wf_smu_mach_model = st->model_id; + } + + printk(KERN_INFO "windfarm: Initializing for iMacG5 model ID %d\n", + wf_smu_mach_model); + + return 0; +} + +static int wf_smu_probe(struct device *ddev) +{ + wf_smu_dev = ddev; + + wf_register_client(&wf_smu_events); + + return 0; +} + +static int wf_smu_remove(struct device *ddev) +{ + wf_unregister_client(&wf_smu_events); + + /* XXX We don't have yet a guarantee that our callback isn't + * in progress when returning from wf_unregister_client, so + * we add an arbitrary delay. I'll have to fix that in the core + */ + msleep(1000); + + /* Release all sensors */ + /* One more crappy race: I don't think we have any guarantee here + * that the attribute callback won't race with the sensor beeing + * disposed of, and I'm not 100% certain what best way to deal + * with that except by adding locks all over... I'll do that + * eventually but heh, who ever rmmod this module anyway ? + */ + if (sensor_cpu_power) { + device_remove_file(wf_smu_dev, &dev_attr_cpu_power); + wf_put_sensor(sensor_cpu_power); + } + if (sensor_cpu_temp) { + device_remove_file(wf_smu_dev, &dev_attr_cpu_temp); + wf_put_sensor(sensor_cpu_temp); + } + if (sensor_hd_temp) { + device_remove_file(wf_smu_dev, &dev_attr_hd_temp); + wf_put_sensor(sensor_hd_temp); + } + + /* Release all controls */ + if (fan_cpu_main) { + device_remove_file(wf_smu_dev, &dev_attr_cpu_fan); + wf_put_control(fan_cpu_main); + } + if (fan_hd) { + device_remove_file(wf_smu_dev, &dev_attr_hd_fan); + wf_put_control(fan_hd); + } + if (fan_system) { + device_remove_file(wf_smu_dev, &dev_attr_sys_fan); + wf_put_control(fan_system); + } + if (cpufreq_clamp) + wf_put_control(cpufreq_clamp); + + /* Destroy control loops state structures */ + if (wf_smu_sys_fans) + kfree(wf_smu_sys_fans); + if (wf_smu_cpu_fans) + kfree(wf_smu_cpu_fans); + + wf_smu_dev = NULL; + + return 0; +} + +static struct device_driver wf_smu_driver = { + .name = "windfarm", + .bus = &platform_bus_type, + .probe = wf_smu_probe, + .remove = wf_smu_remove, +}; + + +static int __init wf_smu_init(void) +{ + int rc = -ENODEV; + + if (machine_is_compatible("PowerMac8,1") || + machine_is_compatible("PowerMac8,2")) + rc = wf_init_pm(); + + if (rc == 0) { +#ifdef MODULE + request_module("windfarm_smu_controls"); + request_module("windfarm_smu_sensors"); + request_module("windfarm_lm75_sensor"); + +#endif /* MODULE */ + driver_register(&wf_smu_driver); + } + + return rc; +} + +static void __exit wf_smu_exit(void) +{ + + driver_unregister(&wf_smu_driver); +} + + +module_init(wf_smu_init); +module_exit(wf_smu_exit); + +MODULE_AUTHOR("Benjamin Herrenschmidt "); +MODULE_DESCRIPTION("Thermal control logic for iMac G5"); +MODULE_LICENSE("GPL"); + Index: linux-work/drivers/macintosh/windfarm_pm91.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/drivers/macintosh/windfarm_pm91.c 2005-11-07 16:04:13.000000000 +1100 @@ -0,0 +1,814 @@ +/* + * Windfarm PowerMac thermal control. SMU based 1 CPU desktop control loops + * + * (c) Copyright 2005 Benjamin Herrenschmidt, IBM Corp. + * + * + * Released under the term of the GNU GPL v2. + * + * The algorithm used is the PID control algorithm, used the same + * way the published Darwin code does, using the same values that + * are present in the Darwin 8.2 snapshot property lists (note however + * that none of the code has been re-used, it's a complete re-implementation + * + * The various control loops found in Darwin config file are: + * + * PowerMac9,1 + * =========== + * + * Has 3 control loops: CPU fans is similar to PowerMac8,1 (though it doesn't + * try to play with other control loops fans). Drive bay is rather basic PID + * with one sensor and one fan. Slots area is a bit different as the Darwin + * driver is supposed to be capable of working in a special "AGP" mode which + * involves the presence of an AGP sensor and an AGP fan (possibly on the + * AGP card itself). I can't deal with that special mode as I don't have + * access to those additional sensor/fans for now (though ultimately, it would + * be possible to add sensor objects for them) so I'm only implementing the + * basic PCI slot control loop + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "windfarm.h" +#include "windfarm_pid.h" + +#define VERSION "0.4" + +#undef DEBUG + +#ifdef DEBUG +#define DBG(args...) printk(args) +#else +#define DBG(args...) do { } while(0) +#endif + +/* define this to force CPU overtemp to 74 degree, useful for testing + * the overtemp code + */ +#undef HACKED_OVERTEMP + +static struct device *wf_smu_dev; + +/* Controls & sensors */ +static struct wf_sensor *sensor_cpu_power; +static struct wf_sensor *sensor_cpu_temp; +static struct wf_sensor *sensor_hd_temp; +static struct wf_sensor *sensor_slots_power; +static struct wf_control *fan_cpu_main; +static struct wf_control *fan_cpu_second; +static struct wf_control *fan_cpu_third; +static struct wf_control *fan_hd; +static struct wf_control *fan_slots; +static struct wf_control *cpufreq_clamp; + +/* Set to kick the control loop into life */ +static int wf_smu_all_controls_ok, wf_smu_all_sensors_ok, wf_smu_started; + +/* Failure handling.. could be nicer */ +#define FAILURE_FAN 0x01 +#define FAILURE_SENSOR 0x02 +#define FAILURE_OVERTEMP 0x04 + +static unsigned int wf_smu_failure_state; +static int wf_smu_readjust, wf_smu_skipping; + +/* + * ****** CPU Fans Control Loop ****** + * + */ + + +#define WF_SMU_CPU_FANS_INTERVAL 1 +#define WF_SMU_CPU_FANS_MAX_HISTORY 16 + +/* State data used by the cpu fans control loop + */ +struct wf_smu_cpu_fans_state { + int ticks; + s32 cpu_setpoint; + struct wf_cpu_pid_state pid; +}; + +static struct wf_smu_cpu_fans_state *wf_smu_cpu_fans; + + + +/* + * ****** Drive Fan Control Loop ****** + * + */ + +struct wf_smu_drive_fans_state { + int ticks; + s32 setpoint; + struct wf_pid_state pid; +}; + +static struct wf_smu_drive_fans_state *wf_smu_drive_fans; + +/* + * ****** Slots Fan Control Loop ****** + * + */ + +struct wf_smu_slots_fans_state { + int ticks; + s32 setpoint; + struct wf_pid_state pid; +}; + +static struct wf_smu_slots_fans_state *wf_smu_slots_fans; + +/* + * ***** Implementation ***** + * + */ + + +static void wf_smu_create_cpu_fans(void) +{ + struct wf_cpu_pid_param pid_param; + struct smu_sdbp_header *hdr; + struct smu_sdbp_cpupiddata *piddata; + struct smu_sdbp_fvt *fvt; + s32 tmax, tdelta, maxpow, powadj; + + /* First, locate the PID params in SMU SBD */ + hdr = smu_get_sdb_partition(SMU_SDB_CPUPIDDATA_ID, NULL); + if (hdr == 0) { + printk(KERN_WARNING "windfarm: CPU PID fan config not found " + "max fan speed\n"); + goto fail; + } + piddata = (struct smu_sdbp_cpupiddata *)&hdr[1]; + + /* Get the FVT params for operating point 0 (the only supported one + * for now) in order to get tmax + */ + hdr = smu_get_sdb_partition(SMU_SDB_FVT_ID, NULL); + if (hdr) { + fvt = (struct smu_sdbp_fvt *)&hdr[1]; + tmax = ((s32)fvt->maxtemp) << 16; + } else + tmax = 0x5e0000; /* 94 degree default */ + + /* Alloc & initialize state */ + wf_smu_cpu_fans = kmalloc(sizeof(struct wf_smu_cpu_fans_state), + GFP_KERNEL); + if (wf_smu_cpu_fans == NULL) + goto fail; + wf_smu_cpu_fans->ticks = 1; + + /* Fill PID params */ + pid_param.interval = WF_SMU_CPU_FANS_INTERVAL; + pid_param.history_len = piddata->history_len; + if (pid_param.history_len > WF_CPU_PID_MAX_HISTORY) { + printk(KERN_WARNING "windfarm: History size overflow on " + "CPU control loop (%d)\n", piddata->history_len); + pid_param.history_len = WF_CPU_PID_MAX_HISTORY; + } + pid_param.gd = piddata->gd; + pid_param.gp = piddata->gp; + pid_param.gr = piddata->gr / pid_param.history_len; + + tdelta = ((s32)piddata->target_temp_delta) << 16; + maxpow = ((s32)piddata->max_power) << 16; + powadj = ((s32)piddata->power_adj) << 16; + + pid_param.tmax = tmax; + pid_param.ttarget = tmax - tdelta; + pid_param.pmaxadj = maxpow - powadj; + + pid_param.min = fan_cpu_main->ops->get_min(fan_cpu_main); + pid_param.max = fan_cpu_main->ops->get_max(fan_cpu_main); + + wf_cpu_pid_init(&wf_smu_cpu_fans->pid, &pid_param); + + DBG("wf: CPU Fan control initialized.\n"); + DBG(" ttarged=%d.%03d, tmax=%d.%03d, min=%d RPM, max=%d RPM\n", + FIX32TOPRINT(pid_param.ttarget), FIX32TOPRINT(pid_param.tmax), + pid_param.min, pid_param.max); + + return; + + fail: + printk(KERN_WARNING "windfarm: CPU fan config not found\n" + "for this machine model, max fan speed\n"); + + if (cpufreq_clamp) + wf_control_set_max(cpufreq_clamp); + if (fan_cpu_main) + wf_control_set_max(fan_cpu_main); +} + +static void wf_smu_cpu_fans_tick(struct wf_smu_cpu_fans_state *st) +{ + s32 new_setpoint, temp, power; + int rc; + + if (--st->ticks != 0) { + if (wf_smu_readjust) + goto readjust; + return; + } + st->ticks = WF_SMU_CPU_FANS_INTERVAL; + + rc = sensor_cpu_temp->ops->get_value(sensor_cpu_temp, &temp); + if (rc) { + printk(KERN_WARNING "windfarm: CPU temp sensor error %d\n", + rc); + wf_smu_failure_state |= FAILURE_SENSOR; + return; + } + + rc = sensor_cpu_power->ops->get_value(sensor_cpu_power, &power); + if (rc) { + printk(KERN_WARNING "windfarm: CPU power sensor error %d\n", + rc); + wf_smu_failure_state |= FAILURE_SENSOR; + return; + } + + DBG("wf_smu: CPU Fans tick ! CPU temp: %d.%03d, power: %d.%03d\n", + FIX32TOPRINT(temp), FIX32TOPRINT(power)); + +#ifdef HACKED_OVERTEMP + if (temp > 0x4a0000) + wf_smu_failure_state |= FAILURE_OVERTEMP; +#else + if (temp > st->pid.param.tmax) + wf_smu_failure_state |= FAILURE_OVERTEMP; +#endif + new_setpoint = wf_cpu_pid_run(&st->pid, power, temp); + + DBG("wf_smu: new_setpoint: %d RPM\n", (int)new_setpoint); + + if (st->cpu_setpoint == new_setpoint) + return; + st->cpu_setpoint = new_setpoint; + readjust: + if (fan_cpu_main && wf_smu_failure_state == 0) { + rc = fan_cpu_main->ops->set_value(fan_cpu_main, + st->cpu_setpoint); + if (rc) { + printk(KERN_WARNING "windfarm: CPU main fan" + " error %d\n", rc); + wf_smu_failure_state |= FAILURE_FAN; + } + } + if (fan_cpu_second && wf_smu_failure_state == 0) { + rc = fan_cpu_second->ops->set_value(fan_cpu_second, + st->cpu_setpoint); + if (rc) { + printk(KERN_WARNING "windfarm: CPU second fan" + " error %d\n", rc); + wf_smu_failure_state |= FAILURE_FAN; + } + } + if (fan_cpu_third && wf_smu_failure_state == 0) { + rc = fan_cpu_main->ops->set_value(fan_cpu_third, + st->cpu_setpoint); + if (rc) { + printk(KERN_WARNING "windfarm: CPU third fan" + " error %d\n", rc); + wf_smu_failure_state |= FAILURE_FAN; + } + } +} + +static void wf_smu_create_drive_fans(void) +{ + struct wf_pid_param param = { + .interval = 5, + .history_len = 2, + .gd = 0x01e00000, + .gp = 0x00500000, + .gr = 0x00000000, + .itarget = 0x00200000, + }; + + /* Alloc & initialize state */ + wf_smu_drive_fans = kmalloc(sizeof(struct wf_smu_drive_fans_state), + GFP_KERNEL); + if (wf_smu_drive_fans == NULL) { + printk(KERN_WARNING "windfarm: Memory allocation error" + " max fan speed\n"); + goto fail; + } + wf_smu_drive_fans->ticks = 1; + + /* Fill PID params */ + param.additive = (fan_hd->type == WF_CONTROL_RPM_FAN); + param.min = fan_hd->ops->get_min(fan_hd); + param.max = fan_hd->ops->get_max(fan_hd); + wf_pid_init(&wf_smu_drive_fans->pid, ¶m); + + DBG("wf: Drive Fan control initialized.\n"); + DBG(" itarged=%d.%03d, min=%d RPM, max=%d RPM\n", + FIX32TOPRINT(param.itarget), param.min, param.max); + return; + + fail: + if (fan_hd) + wf_control_set_max(fan_hd); +} + +static void wf_smu_drive_fans_tick(struct wf_smu_drive_fans_state *st) +{ + s32 new_setpoint, temp; + int rc; + + if (--st->ticks != 0) { + if (wf_smu_readjust) + goto readjust; + return; + } + st->ticks = st->pid.param.interval; + + rc = sensor_hd_temp->ops->get_value(sensor_hd_temp, &temp); + if (rc) { + printk(KERN_WARNING "windfarm: HD temp sensor error %d\n", + rc); + wf_smu_failure_state |= FAILURE_SENSOR; + return; + } + + DBG("wf_smu: Drive Fans tick ! HD temp: %d.%03d\n", + FIX32TOPRINT(temp)); + + if (temp > (st->pid.param.itarget + 0x50000)) + wf_smu_failure_state |= FAILURE_OVERTEMP; + + new_setpoint = wf_pid_run(&st->pid, temp); + + DBG("wf_smu: new_setpoint: %d\n", (int)new_setpoint); + + if (st->setpoint == new_setpoint) + return; + st->setpoint = new_setpoint; + readjust: + if (fan_hd && wf_smu_failure_state == 0) { + rc = fan_hd->ops->set_value(fan_hd, st->setpoint); + if (rc) { + printk(KERN_WARNING "windfarm: HD fan error %d\n", + rc); + wf_smu_failure_state |= FAILURE_FAN; + } + } +} + +static void wf_smu_create_slots_fans(void) +{ + struct wf_pid_param param = { + .interval = 1, + .history_len = 8, + .gd = 0x00000000, + .gp = 0x00000000, + .gr = 0x00020000, + .itarget = 0x00000000 + }; + + /* Alloc & initialize state */ + wf_smu_slots_fans = kmalloc(sizeof(struct wf_smu_slots_fans_state), + GFP_KERNEL); + if (wf_smu_slots_fans == NULL) { + printk(KERN_WARNING "windfarm: Memory allocation error" + " max fan speed\n"); + goto fail; + } + wf_smu_slots_fans->ticks = 1; + + /* Fill PID params */ + param.additive = (fan_slots->type == WF_CONTROL_RPM_FAN); + param.min = fan_slots->ops->get_min(fan_slots); + param.max = fan_slots->ops->get_max(fan_slots); + wf_pid_init(&wf_smu_slots_fans->pid, ¶m); + + DBG("wf: Slots Fan control initialized.\n"); + DBG(" itarged=%d.%03d, min=%d RPM, max=%d RPM\n", + FIX32TOPRINT(param.itarget), param.min, param.max); + return; + + fail: + if (fan_slots) + wf_control_set_max(fan_slots); +} + +static void wf_smu_slots_fans_tick(struct wf_smu_slots_fans_state *st) +{ + s32 new_setpoint, power; + int rc; + + if (--st->ticks != 0) { + if (wf_smu_readjust) + goto readjust; + return; + } + st->ticks = st->pid.param.interval; + + rc = sensor_slots_power->ops->get_value(sensor_slots_power, &power); + if (rc) { + printk(KERN_WARNING "windfarm: Slots power sensor error %d\n", + rc); + wf_smu_failure_state |= FAILURE_SENSOR; + return; + } + + DBG("wf_smu: Slots Fans tick ! Slots power: %d.%03d\n", + FIX32TOPRINT(power)); + +#if 0 /* Check what makes a good overtemp condition */ + if (power > (st->pid.param.itarget + 0x50000)) + wf_smu_failure_state |= FAILURE_OVERTEMP; +#endif + + new_setpoint = wf_pid_run(&st->pid, power); + + DBG("wf_smu: new_setpoint: %d\n", (int)new_setpoint); + + if (st->setpoint == new_setpoint) + return; + st->setpoint = new_setpoint; + readjust: + if (fan_slots && wf_smu_failure_state == 0) { + rc = fan_slots->ops->set_value(fan_slots, st->setpoint); + if (rc) { + printk(KERN_WARNING "windfarm: Slots fan error %d\n", + rc); + wf_smu_failure_state |= FAILURE_FAN; + } + } +} + + +/* + * ****** Attributes ****** + * + */ + +#define BUILD_SHOW_FUNC_FIX(name, data) \ +static ssize_t show_##name(struct device *dev, \ + struct device_attribute *attr, \ + char *buf) \ +{ \ + ssize_t r; \ + s32 val = 0; \ + data->ops->get_value(data, &val); \ + r = sprintf(buf, "%d.%03d", FIX32TOPRINT(val)); \ + return r; \ +} \ +static DEVICE_ATTR(name,S_IRUGO,show_##name, NULL); + + +#define BUILD_SHOW_FUNC_INT(name, data) \ +static ssize_t show_##name(struct device *dev, \ + struct device_attribute *attr, \ + char *buf) \ +{ \ + s32 val = 0; \ + data->ops->get_value(data, &val); \ + return sprintf(buf, "%d", val); \ +} \ +static DEVICE_ATTR(name,S_IRUGO,show_##name, NULL); + +BUILD_SHOW_FUNC_INT(cpu_fan, fan_cpu_main); +BUILD_SHOW_FUNC_INT(hd_fan, fan_hd); +BUILD_SHOW_FUNC_INT(slots_fan, fan_slots); + +BUILD_SHOW_FUNC_FIX(cpu_temp, sensor_cpu_temp); +BUILD_SHOW_FUNC_FIX(cpu_power, sensor_cpu_power); +BUILD_SHOW_FUNC_FIX(hd_temp, sensor_hd_temp); +BUILD_SHOW_FUNC_FIX(slots_power, sensor_slots_power); + +/* + * ****** Setup / Init / Misc ... ****** + * + */ + +static void wf_smu_tick(void) +{ + unsigned int last_failure = wf_smu_failure_state; + unsigned int new_failure; + + if (!wf_smu_started) { + DBG("wf: creating control loops !\n"); + wf_smu_create_drive_fans(); + wf_smu_create_slots_fans(); + wf_smu_create_cpu_fans(); + wf_smu_started = 1; + } + + /* Skipping ticks */ + if (wf_smu_skipping && --wf_smu_skipping) + return; + + wf_smu_failure_state = 0; + if (wf_smu_drive_fans) + wf_smu_drive_fans_tick(wf_smu_drive_fans); + if (wf_smu_slots_fans) + wf_smu_slots_fans_tick(wf_smu_slots_fans); + if (wf_smu_cpu_fans) + wf_smu_cpu_fans_tick(wf_smu_cpu_fans); + + wf_smu_readjust = 0; + new_failure = wf_smu_failure_state & ~last_failure; + + /* If entering failure mode, clamp cpufreq and ramp all + * fans to full speed. + */ + if (wf_smu_failure_state && !last_failure) { + if (cpufreq_clamp) + wf_control_set_max(cpufreq_clamp); + if (fan_cpu_main) + wf_control_set_max(fan_cpu_main); + if (fan_cpu_second) + wf_control_set_max(fan_cpu_second); + if (fan_cpu_third) + wf_control_set_max(fan_cpu_third); + if (fan_hd) + wf_control_set_max(fan_hd); + if (fan_slots) + wf_control_set_max(fan_slots); + } + + /* If leaving failure mode, unclamp cpufreq and readjust + * all fans on next iteration + */ + if (!wf_smu_failure_state && last_failure) { + if (cpufreq_clamp) + wf_control_set_min(cpufreq_clamp); + wf_smu_readjust = 1; + } + + /* Overtemp condition detected, notify and start skipping a couple + * ticks to let the temperature go down + */ + if (new_failure & FAILURE_OVERTEMP) { + wf_set_overtemp(); + wf_smu_skipping = 2; + } + + /* We only clear the overtemp condition if overtemp is cleared + * _and_ no other failure is present. Since a sensor error will + * clear the overtemp condition (can't measure temperature) at + * the control loop levels, but we don't want to keep it clear + * here in this case + */ + if (new_failure == 0 && last_failure & FAILURE_OVERTEMP) + wf_clear_overtemp(); +} + + +static void wf_smu_new_control(struct wf_control *ct) +{ + if (wf_smu_all_controls_ok) + return; + + if (fan_cpu_main == NULL && !strcmp(ct->name, "cpu-rear-fan-0")) { + if (wf_get_control(ct) == 0) { + fan_cpu_main = ct; + device_create_file(wf_smu_dev, &dev_attr_cpu_fan); + } + } + + if (fan_cpu_second == NULL && !strcmp(ct->name, "cpu-rear-fan-1")) { + if (wf_get_control(ct) == 0) + fan_cpu_second = ct; + } + + if (fan_cpu_third == NULL && !strcmp(ct->name, "cpu-front-fan-0")) { + if (wf_get_control(ct) == 0) + fan_cpu_third = ct; + } + + if (cpufreq_clamp == NULL && !strcmp(ct->name, "cpufreq-clamp")) { + if (wf_get_control(ct) == 0) + cpufreq_clamp = ct; + } + + if (fan_hd == NULL && !strcmp(ct->name, "drive-bay-fan")) { + if (wf_get_control(ct) == 0) { + fan_hd = ct; + device_create_file(wf_smu_dev, &dev_attr_hd_fan); + } + } + + if (fan_slots == NULL && !strcmp(ct->name, "slots-fan")) { + if (wf_get_control(ct) == 0) { + fan_slots = ct; + device_create_file(wf_smu_dev, &dev_attr_slots_fan); + } + } + + if (fan_cpu_main && (fan_cpu_second || fan_cpu_third) && fan_hd && + fan_slots && cpufreq_clamp) + wf_smu_all_controls_ok = 1; +} + +static void wf_smu_new_sensor(struct wf_sensor *sr) +{ + if (wf_smu_all_sensors_ok) + return; + + if (sensor_cpu_power == NULL && !strcmp(sr->name, "cpu-power")) { + if (wf_get_sensor(sr) == 0) { + sensor_cpu_power = sr; + device_create_file(wf_smu_dev, &dev_attr_cpu_power); + } + } + + if (sensor_cpu_temp == NULL && !strcmp(sr->name, "cpu-temp")) { + if (wf_get_sensor(sr) == 0) { + sensor_cpu_temp = sr; + device_create_file(wf_smu_dev, &dev_attr_cpu_temp); + } + } + + if (sensor_hd_temp == NULL && !strcmp(sr->name, "hd-temp")) { + if (wf_get_sensor(sr) == 0) { + sensor_hd_temp = sr; + device_create_file(wf_smu_dev, &dev_attr_hd_temp); + } + } + + if (sensor_slots_power == NULL && !strcmp(sr->name, "slots-power")) { + if (wf_get_sensor(sr) == 0) { + sensor_slots_power = sr; + device_create_file(wf_smu_dev, &dev_attr_slots_power); + } + } + + if (sensor_cpu_power && sensor_cpu_temp && + sensor_hd_temp && sensor_slots_power) + wf_smu_all_sensors_ok = 1; +} + + +static int wf_smu_notify(struct notifier_block *self, + unsigned long event, void *data) +{ + switch(event) { + case WF_EVENT_NEW_CONTROL: + DBG("wf: new control %s detected\n", + ((struct wf_control *)data)->name); + wf_smu_new_control(data); + wf_smu_readjust = 1; + break; + case WF_EVENT_NEW_SENSOR: + DBG("wf: new sensor %s detected\n", + ((struct wf_sensor *)data)->name); + wf_smu_new_sensor(data); + break; + case WF_EVENT_TICK: + if (wf_smu_all_controls_ok && wf_smu_all_sensors_ok) + wf_smu_tick(); + } + + return 0; +} + +static struct notifier_block wf_smu_events = { + .notifier_call = wf_smu_notify, +}; + +static int wf_init_pm(void) +{ + printk(KERN_INFO "windfarm: Initializing for Desktop G5 model\n"); + + return 0; +} + +static int wf_smu_probe(struct device *ddev) +{ + wf_smu_dev = ddev; + + wf_register_client(&wf_smu_events); + + return 0; +} + +static int wf_smu_remove(struct device *ddev) +{ + wf_unregister_client(&wf_smu_events); + + /* XXX We don't have yet a guarantee that our callback isn't + * in progress when returning from wf_unregister_client, so + * we add an arbitrary delay. I'll have to fix that in the core + */ + msleep(1000); + + /* Release all sensors */ + /* One more crappy race: I don't think we have any guarantee here + * that the attribute callback won't race with the sensor beeing + * disposed of, and I'm not 100% certain what best way to deal + * with that except by adding locks all over... I'll do that + * eventually but heh, who ever rmmod this module anyway ? + */ + if (sensor_cpu_power) { + device_remove_file(wf_smu_dev, &dev_attr_cpu_power); + wf_put_sensor(sensor_cpu_power); + } + if (sensor_cpu_temp) { + device_remove_file(wf_smu_dev, &dev_attr_cpu_temp); + wf_put_sensor(sensor_cpu_temp); + } + if (sensor_hd_temp) { + device_remove_file(wf_smu_dev, &dev_attr_hd_temp); + wf_put_sensor(sensor_hd_temp); + } + if (sensor_slots_power) { + device_remove_file(wf_smu_dev, &dev_attr_slots_power); + wf_put_sensor(sensor_slots_power); + } + + /* Release all controls */ + if (fan_cpu_main) { + device_remove_file(wf_smu_dev, &dev_attr_cpu_fan); + wf_put_control(fan_cpu_main); + } + if (fan_cpu_second) + wf_put_control(fan_cpu_second); + if (fan_cpu_third) + wf_put_control(fan_cpu_third); + if (fan_hd) { + device_remove_file(wf_smu_dev, &dev_attr_hd_fan); + wf_put_control(fan_hd); + } + if (fan_slots) { + device_remove_file(wf_smu_dev, &dev_attr_slots_fan); + wf_put_control(fan_slots); + } + if (cpufreq_clamp) + wf_put_control(cpufreq_clamp); + + /* Destroy control loops state structures */ + if (wf_smu_slots_fans) + kfree(wf_smu_cpu_fans); + if (wf_smu_drive_fans) + kfree(wf_smu_cpu_fans); + if (wf_smu_cpu_fans) + kfree(wf_smu_cpu_fans); + + wf_smu_dev = NULL; + + return 0; +} + +static struct device_driver wf_smu_driver = { + .name = "windfarm", + .bus = &platform_bus_type, + .probe = wf_smu_probe, + .remove = wf_smu_remove, +}; + + +static int __init wf_smu_init(void) +{ + int rc = -ENODEV; + + if (machine_is_compatible("PowerMac9,1")) + rc = wf_init_pm(); + + if (rc == 0) { +#ifdef MODULE + request_module("windfarm_smu_controls"); + request_module("windfarm_smu_sensors"); + request_module("windfarm_lm75_sensor"); + +#endif /* MODULE */ + driver_register(&wf_smu_driver); + } + + return rc; +} + +static void __exit wf_smu_exit(void) +{ + + driver_unregister(&wf_smu_driver); +} + + +module_init(wf_smu_init); +module_exit(wf_smu_exit); + +MODULE_AUTHOR("Benjamin Herrenschmidt "); +MODULE_DESCRIPTION("Thermal control logic for PowerMac9,1"); +MODULE_LICENSE("GPL"); + From anton at samba.org Mon Nov 7 17:43:07 2005 From: anton at samba.org (Anton Blanchard) Date: Mon, 7 Nov 2005 17:43:07 +1100 Subject: [PATCH] ppc64: fix Memory: summary line Message-ID: <20051107064306.GK12353@krispykreme> On ppc64 we end up with a negative value for the data size in the memory boot message: Memory: 2035560k/2097152k available (5792k kernel code, 89564k reserved, 18014398509481632k data, 870k bss, 352k init) It turns out the section ordering of the linker script is different on ppc32 and ppc64, so just count data as _edata - _sdata which should work on both. Signed-off-by: Anton Blanchard --- Index: build/arch/powerpc/mm/mem.c =================================================================== --- build.orig/arch/powerpc/mm/mem.c 2005-11-07 16:56:10.000000000 +1100 +++ build/arch/powerpc/mm/mem.c 2005-11-07 17:24:41.000000000 +1100 @@ -358,7 +358,7 @@ } codesize = (unsigned long)&_sdata - (unsigned long)&_stext; - datasize = (unsigned long)&__init_begin - (unsigned long)&_sdata; + datasize = (unsigned long)&_edata - (unsigned long)&_sdata; initsize = (unsigned long)&__init_end - (unsigned long)&__init_begin; bsssize = (unsigned long)&__bss_stop - (unsigned long)&__bss_start; From anton at samba.org Mon Nov 7 18:43:56 2005 From: anton at samba.org (Anton Blanchard) Date: Mon, 7 Nov 2005 18:43:56 +1100 Subject: [PATCH] ppc64: fix oprofile sample bit handling Message-ID: <20051107074356.GL12353@krispykreme> Oprofile was hardwiring the MMCRA sample bit to 1 but on newer cpus (eg POWER5) we want to vary it based on the group being sampled. Add a temporary workaround until people update their oprofile userspace. Signed-off-by: Anton Blanchard --- Index: build/arch/powerpc/oprofile/op_model_power4.c =================================================================== --- build.orig/arch/powerpc/oprofile/op_model_power4.c 2005-11-05 20:51:08.000000000 +1100 +++ build/arch/powerpc/oprofile/op_model_power4.c 2005-11-07 18:17:29.000000000 +1100 @@ -17,6 +17,7 @@ #include #include #include +#include #define dbg(args...) @@ -81,6 +82,26 @@ extern void ppc64_enable_pmcs(void); +/* + * Older CPUs require the MMCRA sample bit to be always set, but newer + * CPUs only want it set for some groups. Eventually we will remove all + * knowledge of this bit in the kernel, oprofile userspace should be + * setting it when required. + * + * In order to keep current installations working we force the bit for + * those older CPUs. Once everyone has updated their oprofile userspace we + * can remove this hack. + */ +static inline int mmcra_must_set_sample(void) +{ + if (__is_processor(PV_POWER4) || __is_processor(PV_POWER4p) || + __is_processor(PV_970) || __is_processor(PV_970FX) || + __is_processor(PV_970MP)) + return 1; + + return 0; +} + static void power4_cpu_setup(void *unused) { unsigned int mmcr0 = mmcr0_val; @@ -98,7 +119,8 @@ mtspr(SPRN_MMCR1, mmcr1_val); - mmcra |= MMCRA_SAMPLE_ENABLE; + if (mmcra_must_set_sample()) + mmcra |= MMCRA_SAMPLE_ENABLE; mtspr(SPRN_MMCRA, mmcra); dbg("setup on cpu %d, mmcr0 %lx\n", smp_processor_id(), From anton at samba.org Mon Nov 7 19:05:31 2005 From: anton at samba.org (Anton Blanchard) Date: Mon, 7 Nov 2005 19:05:31 +1100 Subject: [PATCH] ppc64: remove some direct xmon calls Message-ID: <20051107080531.GN12353@krispykreme> Even though we can enable and disable xmon at runtime now, there are a few places in the merge tree that call xmon and xmon_printf directly. In the case below we call die() which will call xmon if it is enabled. Also remove an unnecessary include of xmon.h in smp.c. Signed-off-by: Anton Blanchard --- Index: linux-2.6/arch/powerpc/kernel/smp.c =================================================================== --- linux-2.6.orig/arch/powerpc/kernel/smp.c 2005-11-05 17:44:42.000000000 +1100 +++ linux-2.6/arch/powerpc/kernel/smp.c 2005-11-06 14:40:07.000000000 +1100 @@ -40,7 +40,6 @@ #include #include #include -#include #include #include #include Index: linux-2.6/arch/powerpc/kernel/traps.c =================================================================== --- linux-2.6.orig/arch/powerpc/kernel/traps.c 2005-11-05 17:44:42.000000000 +1100 +++ linux-2.6/arch/powerpc/kernel/traps.c 2005-11-06 14:41:15.000000000 +1100 @@ -39,7 +39,6 @@ #include #include #include -#include #include #ifdef CONFIG_PPC32 #include @@ -748,22 +747,12 @@ return 0; if (bug->line & BUG_WARNING_TRAP) { /* this is a WARN_ON rather than BUG/BUG_ON */ -#ifdef CONFIG_XMON - xmon_printf(KERN_ERR "Badness in %s at %s:%ld\n", - bug->function, bug->file, - bug->line & ~BUG_WARNING_TRAP); -#endif /* CONFIG_XMON */ printk(KERN_ERR "Badness in %s at %s:%ld\n", bug->function, bug->file, bug->line & ~BUG_WARNING_TRAP); dump_stack(); return 1; } -#ifdef CONFIG_XMON - xmon_printf(KERN_CRIT "kernel BUG in %s at %s:%ld!\n", - bug->function, bug->file, bug->line); - xmon(regs); -#endif /* CONFIG_XMON */ printk(KERN_CRIT "kernel BUG in %s at %s:%ld!\n", bug->function, bug->file, bug->line); From michael at ellerman.id.au Tue Nov 8 00:06:45 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Tue, 8 Nov 2005 00:06:45 +1100 (EST) Subject: [PATCH 0/12] powerpc: PPC64 Kdump Support Message-ID: <1131368803.622486.677825384736.qpush@concordia> Hi y'all, This is the current stack of patches we have for implementing Kdump on PPC64. These aren't ready for merging yet, but they're close. I wanted to get them out to the list sooner rather than later, so you can all see them. A lot of these have already been via the list, but some haven't. Comments requested! These sit on top of a merged version of page.h which no longer works since 64k pages went it, so don't try to run them. I'm hoping to get page.h merged soon, as well as machine_kexec.c and a few other files we touch. cheers From michael at ellerman.id.au Tue Nov 8 00:06:48 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Tue, 8 Nov 2005 00:06:48 +1100 (EST) Subject: [PATCH 1/12] powerpc: Seperate usage of KERNELBASE and PAGE_OFFSET In-Reply-To: <1131368803.622486.677825384736.qpush@concordia> Message-ID: <20051107130648.430786869D@ozlabs.org> This patch tries to seperate usage of KERNELBASE and PAGE_OFFSET. PAGE_OFFSET == 0xC00..00 and always will. It's the quantity you subtract from a virtual kernel address to get a physical one. KERNELBASE == 0xC00..00 + SOMETHING, where SOMETHING tends to be 0, but might not be. It points to the start of the kernel text + data in virtual memory. arch/powerpc/kernel/entry_64.S | 4 ++-- arch/powerpc/kernel/lparmap.c | 6 +++--- arch/powerpc/mm/hash_utils_64.c | 6 +++--- arch/powerpc/mm/slb.c | 4 ++-- arch/powerpc/mm/slb_low.S | 6 +++--- arch/powerpc/mm/stab.c | 10 +++++----- arch/powerpc/mm/tlb_64.c | 2 +- arch/ppc64/kernel/machine_kexec.c | 5 ++--- 8 files changed, 21 insertions(+), 22 deletions(-) Index: kexec/arch/powerpc/mm/stab.c =================================================================== --- kexec.orig/arch/powerpc/mm/stab.c +++ kexec/arch/powerpc/mm/stab.c @@ -40,7 +40,7 @@ static int make_ste(unsigned long stab, unsigned long entry, group, old_esid, castout_entry, i; unsigned int global_entry; struct stab_entry *ste, *castout_ste; - unsigned long kernel_segment = (esid << SID_SHIFT) >= KERNELBASE; + unsigned long kernel_segment = (esid << SID_SHIFT) >= PAGE_OFFSET; vsid_data = vsid << STE_VSID_SHIFT; esid_data = esid << SID_SHIFT | STE_ESID_KP | STE_ESID_V; @@ -83,7 +83,7 @@ static int make_ste(unsigned long stab, } /* Dont cast out the first kernel segment */ - if ((castout_ste->esid_data & ESID_MASK) != KERNELBASE) + if ((castout_ste->esid_data & ESID_MASK) != PAGE_OFFSET) break; castout_entry = (castout_entry + 1) & 0xf; @@ -248,7 +248,7 @@ void stabs_alloc(void) panic("Unable to allocate segment table for CPU %d.\n", cpu); - newstab += KERNELBASE; + newstab = (unsigned long)__va(newstab); memset((void *)newstab, 0, PAGE_SIZE); @@ -265,13 +265,13 @@ void stabs_alloc(void) */ void stab_initialize(unsigned long stab) { - unsigned long vsid = get_kernel_vsid(KERNELBASE); + unsigned long vsid = get_kernel_vsid(PAGE_OFFSET); if (cpu_has_feature(CPU_FTR_SLB)) { slb_initialize(); } else { asm volatile("isync; slbia; isync":::"memory"); - make_ste(stab, GET_ESID(KERNELBASE), vsid); + make_ste(stab, GET_ESID(PAGE_OFFSET), vsid); /* Order update */ asm volatile("sync":::"memory"); Index: kexec/arch/ppc64/kernel/machine_kexec.c =================================================================== --- kexec.orig/arch/ppc64/kernel/machine_kexec.c +++ kexec/arch/ppc64/kernel/machine_kexec.c @@ -171,9 +171,8 @@ void kexec_copy_flush(struct kimage *ima * including ones that were in place on the original copy */ for (i = 0; i < nr_segments; i++) - flush_icache_range(ranges[i].mem + KERNELBASE, - ranges[i].mem + KERNELBASE + - ranges[i].memsz); + flush_icache_range((unsigned long)__va(ranges[i].mem), + (unsigned long)__va(ranges[i].mem + ranges[i].memsz)); } #ifdef CONFIG_SMP Index: kexec/arch/powerpc/mm/hash_utils_64.c =================================================================== --- kexec.orig/arch/powerpc/mm/hash_utils_64.c +++ kexec/arch/powerpc/mm/hash_utils_64.c @@ -239,7 +239,7 @@ void __init htab_initialize(void) /* create bolted the linear mapping in the hash table */ for (i=0; i < lmb.memory.cnt; i++) { - base = lmb.memory.region[i].base + KERNELBASE; + base = (unsigned long)__va(lmb.memory.region[i].base); size = lmb.memory.region[i].size; DBG("creating mapping for region: %lx : %lx\n", base, size); @@ -276,8 +276,8 @@ void __init htab_initialize(void) * for either 4K or 16MB pages. */ if (tce_alloc_start) { - tce_alloc_start += KERNELBASE; - tce_alloc_end += KERNELBASE; + tce_alloc_start = (unsigned long)__va(tce_alloc_start); + tce_alloc_end = (unsigned long)__va(tce_alloc_end); if (base + size >= tce_alloc_start) tce_alloc_start = base + size + 1; Index: kexec/arch/powerpc/mm/slb.c =================================================================== --- kexec.orig/arch/powerpc/mm/slb.c +++ kexec/arch/powerpc/mm/slb.c @@ -55,7 +55,7 @@ static void slb_flush_and_rebolt(void) ksp_flags |= SLB_VSID_L; ksp_esid_data = mk_esid_data(get_paca()->kstack, 2); - if ((ksp_esid_data & ESID_MASK) == KERNELBASE) + if ((ksp_esid_data & ESID_MASK) == PAGE_OFFSET) ksp_esid_data &= ~SLB_ESID_V; /* We need to do this all in asm, so we're sure we don't touch @@ -145,7 +145,7 @@ void slb_initialize(void) asm volatile("isync":::"memory"); asm volatile("slbmte %0,%0"::"r" (0) : "memory"); asm volatile("isync; slbia; isync":::"memory"); - create_slbe(KERNELBASE, flags, 0); + create_slbe(PAGE_OFFSET, flags, 0); create_slbe(VMALLOCBASE, SLB_VSID_KERNEL, 1); /* We don't bolt the stack for the time being - we're in boot, * so the stack is in the bolted segment. By the time it goes Index: kexec/arch/powerpc/kernel/entry_64.S =================================================================== --- kexec.orig/arch/powerpc/kernel/entry_64.S +++ kexec/arch/powerpc/kernel/entry_64.S @@ -674,7 +674,7 @@ _GLOBAL(enter_rtas) /* Setup our real return addr */ SET_REG_TO_LABEL(r4,.rtas_return_loc) - SET_REG_TO_CONST(r9,KERNELBASE) + SET_REG_TO_CONST(r9,PAGE_OFFSET) sub r4,r4,r9 mtlr r4 @@ -702,7 +702,7 @@ _GLOBAL(enter_rtas) _STATIC(rtas_return_loc) /* relocation is off at this point */ mfspr r4,SPRN_SPRG3 /* Get PACA */ - SET_REG_TO_CONST(r5, KERNELBASE) + SET_REG_TO_CONST(r5, PAGE_OFFSET) sub r4,r4,r5 /* RELOC the PACA base pointer */ mfmsr r6 Index: kexec/arch/powerpc/mm/slb_low.S =================================================================== --- kexec.orig/arch/powerpc/mm/slb_low.S +++ kexec/arch/powerpc/mm/slb_low.S @@ -66,12 +66,12 @@ _GLOBAL(slb_allocate) srdi r9,r3,60 /* get region */ srdi r3,r3,28 /* get esid */ - cmpldi cr7,r9,0xc /* cmp KERNELBASE for later use */ + cmpldi cr7,r9,0xc /* cmp PAGE_OFFSET for later use */ rldimi r10,r3,28,0 /* r10= ESID<<28 | entry */ oris r10,r10,SLB_ESID_V at h /* r10 |= SLB_ESID_V */ - /* r3 = esid, r10 = esid_data, cr7 = <>KERNELBASE */ + /* r3 = esid, r10 = esid_data, cr7 = <> PAGE_OFFSET */ blt cr7,0f /* user or kernel? */ @@ -114,7 +114,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_16M_PAGE) ld r9,PACACONTEXTID(r13) rldimi r3,r9,USER_ESID_BITS,0 -9: /* r3 = protovsid, r11 = flags, r10 = esid_data, cr7 = <>KERNELBASE */ +9: /* r3 = protovsid, r11 = flags, r10 = esid_data, cr7 = <> PAGE_OFFSET */ ASM_VSID_SCRAMBLE(r3,r9) rldimi r11,r3,SLB_VSID_SHIFT,16 /* combine VSID and flags */ Index: kexec/arch/powerpc/kernel/lparmap.c =================================================================== --- kexec.orig/arch/powerpc/kernel/lparmap.c +++ kexec/arch/powerpc/kernel/lparmap.c @@ -16,8 +16,8 @@ const struct LparMap __attribute__((__se .xSegmentTableOffs = STAB0_PAGE, .xEsids = { - { .xKernelEsid = GET_ESID(KERNELBASE), - .xKernelVsid = KERNEL_VSID(KERNELBASE), }, + { .xKernelEsid = GET_ESID(PAGE_OFFSET), + .xKernelVsid = KERNEL_VSID(PAGE_OFFSET), }, { .xKernelEsid = GET_ESID(VMALLOCBASE), .xKernelVsid = KERNEL_VSID(VMALLOCBASE), }, }, @@ -25,7 +25,7 @@ const struct LparMap __attribute__((__se .xRanges = { { .xPages = HvPagesToMap, .xOffset = 0, - .xVPN = KERNEL_VSID(KERNELBASE) << (SID_SHIFT - PAGE_SHIFT), + .xVPN = KERNEL_VSID(PAGE_OFFSET) << (SID_SHIFT - PAGE_SHIFT), }, }, }; Index: kexec/arch/powerpc/mm/tlb_64.c =================================================================== --- kexec.orig/arch/powerpc/mm/tlb_64.c +++ kexec/arch/powerpc/mm/tlb_64.c @@ -149,7 +149,7 @@ void hpte_update(struct mm_struct *mm, u batch->mm = mm; batch->large = pte_huge(pte); } - if (addr < KERNELBASE) { + if (!is_kernel_addr(addr)) { vsid = get_vsid(mm->context.id, addr); WARN_ON(vsid == 0); } else From michael at ellerman.id.au Tue Nov 8 00:06:50 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Tue, 8 Nov 2005 00:06:50 +1100 (EST) Subject: [PATCH 2/12] powerpc: Add a is_kernel_addr() macro In-Reply-To: <1131368803.622486.677825384736.qpush@concordia> Message-ID: <20051107130650.D8FF6686BA@ozlabs.org> There's a bunch of code that compares an address with KERNELBASE to see if it's a "kernel address", ie. >= KERNELBASE. Replace all of them with an is_kernel_addr() macro that does the same thing. This will save us some pain when we change KERNELBASE, and also makes the code more readable IMHO. arch/powerpc/kernel/prom_init.c | 2 +- arch/powerpc/kernel/setup_64.c | 2 +- arch/powerpc/mm/slb.c | 6 +++--- arch/powerpc/mm/stab.c | 6 +++--- arch/powerpc/oprofile/op_model_power4.c | 4 ++-- arch/powerpc/oprofile/op_model_rs64.c | 3 +-- arch/powerpc/xmon/xmon.c | 4 ++-- include/asm-powerpc/page.h | 6 ++++++ include/asm-ppc64/pgtable.h | 2 +- 9 files changed, 20 insertions(+), 15 deletions(-) Index: kexec/arch/powerpc/mm/stab.c =================================================================== --- kexec.orig/arch/powerpc/mm/stab.c +++ kexec/arch/powerpc/mm/stab.c @@ -122,7 +122,7 @@ static int __ste_allocate(unsigned long unsigned long offset; /* Kernel or user address? */ - if (ea >= KERNELBASE) { + if (is_kernel_addr(ea)) { vsid = get_kernel_vsid(ea); } else { if ((ea >= TASK_SIZE_USER64) || (! mm)) @@ -133,7 +133,7 @@ static int __ste_allocate(unsigned long stab_entry = make_ste(get_paca()->stab_addr, GET_ESID(ea), vsid); - if (ea < KERNELBASE) { + if (!is_kernel_addr(ea)) { offset = __get_cpu_var(stab_cache_ptr); if (offset < NR_STAB_CACHE_ENTRIES) __get_cpu_var(stab_cache[offset++]) = stab_entry; @@ -190,7 +190,7 @@ void switch_stab(struct task_struct *tsk entry++, ste++) { unsigned long ea; ea = ste->esid_data & ESID_MASK; - if (ea < KERNELBASE) { + if (!is_kernel_addr(ea)) { ste->esid_data = 0; } } Index: kexec/arch/powerpc/kernel/prom_init.c =================================================================== --- kexec.orig/arch/powerpc/kernel/prom_init.c +++ kexec/arch/powerpc/kernel/prom_init.c @@ -1917,7 +1917,7 @@ static void __init prom_check_initrd(uns if (r3 && r4 && r4 != 0xdeadbeef) { unsigned long val; - RELOC(prom_initrd_start) = (r3 >= KERNELBASE) ? __pa(r3) : r3; + RELOC(prom_initrd_start) = is_kernel_addr(r3) ? __pa(r3) : r3; RELOC(prom_initrd_end) = RELOC(prom_initrd_start) + r4; val = RELOC(prom_initrd_start); Index: kexec/arch/powerpc/kernel/setup_64.c =================================================================== --- kexec.orig/arch/powerpc/kernel/setup_64.c +++ kexec/arch/powerpc/kernel/setup_64.c @@ -421,7 +421,7 @@ static void __init check_for_initrd(void /* If we were passed an initrd, set the ROOT_DEV properly if the values * look sensible. If not, clear initrd reference. */ - if (initrd_start >= KERNELBASE && initrd_end >= KERNELBASE && + if (is_kernel_addr(initrd_start) && is_kernel_addr(initrd_end) && initrd_end > initrd_start) ROOT_DEV = Root_RAM0; else Index: kexec/arch/powerpc/mm/slb.c =================================================================== --- kexec.orig/arch/powerpc/mm/slb.c +++ kexec/arch/powerpc/mm/slb.c @@ -111,14 +111,14 @@ void switch_slb(struct task_struct *tsk, else unmapped_base = TASK_UNMAPPED_BASE_USER64; - if (pc >= KERNELBASE) + if (is_kernel_addr(pc)) return; slb_allocate(pc); if (GET_ESID(pc) == GET_ESID(stack)) return; - if (stack >= KERNELBASE) + if (is_kernel_addr(stack)) return; slb_allocate(stack); @@ -126,7 +126,7 @@ void switch_slb(struct task_struct *tsk, || (GET_ESID(stack) == GET_ESID(unmapped_base))) return; - if (unmapped_base >= KERNELBASE) + if (is_kernel_addr(unmapped_base)) return; slb_allocate(unmapped_base); } Index: kexec/arch/powerpc/oprofile/op_model_power4.c =================================================================== --- kexec.orig/arch/powerpc/oprofile/op_model_power4.c +++ kexec/arch/powerpc/oprofile/op_model_power4.c @@ -232,7 +232,7 @@ static unsigned long get_pc(struct pt_re return (unsigned long)__va(pc); /* Not sure where we were */ - if (pc < KERNELBASE) + if (!is_kernel_addr(pc)) /* function descriptor madness */ return *((unsigned long *)kernel_unknown_bucket); @@ -244,7 +244,7 @@ static int get_kernel(unsigned long pc) int is_kernel; if (!mmcra_has_sihv) { - is_kernel = (pc >= KERNELBASE); + is_kernel = is_kernel_addr(pc); } else { unsigned long mmcra = mfspr(SPRN_MMCRA); is_kernel = ((mmcra & MMCRA_SIPR) == 0); Index: kexec/arch/powerpc/xmon/xmon.c =================================================================== --- kexec.orig/arch/powerpc/xmon/xmon.c +++ kexec/arch/powerpc/xmon/xmon.c @@ -1015,7 +1015,7 @@ static long check_bp_loc(unsigned long a unsigned int instr; addr &= ~3; - if (addr < KERNELBASE) { + if (!is_kernel_addr(addr)) { printf("Breakpoints may only be placed at kernel addresses\n"); return 0; } @@ -1066,7 +1066,7 @@ bpt_cmds(void) dabr.address = 0; dabr.enabled = 0; if (scanhex(&dabr.address)) { - if (dabr.address < KERNELBASE) { + if (!is_kernel_addr(dabr.address)) { printf(badaddr); break; } Index: kexec/include/asm-powerpc/page.h =================================================================== --- kexec.orig/include/asm-powerpc/page.h +++ kexec/include/asm-powerpc/page.h @@ -106,6 +106,12 @@ #define KERNELBASE PAGE_OFFSET #define VMALLOCBASE ASM_CONST(0xD000000000000000) +/* + * Don't compare things with KERNELBASE or PAGE_OFFSET to test for + * "kernelness", use is_kernel_addr() - it should do what you want. + */ +#define is_kernel_addr(x) ((x) >= PAGE_OFFSET) + #ifndef __ASSEMBLY__ #ifdef __powerpc64__ Index: kexec/include/asm-ppc64/pgtable.h =================================================================== --- kexec.orig/include/asm-ppc64/pgtable.h +++ kexec/include/asm-ppc64/pgtable.h @@ -212,7 +212,7 @@ static inline pte_t pfn_pte(unsigned lon #define pte_pfn(x) ((unsigned long)((pte_val(x) >> PTE_SHIFT))) #define pte_page(x) pfn_to_page(pte_pfn(x)) -#define pmd_set(pmdp, ptep) ({BUG_ON((u64)ptep < KERNELBASE); pmd_val(*(pmdp)) = (unsigned long)(ptep);}) +#define pmd_set(pmdp, ptep) ({BUG_ON(!is_kernel_addr((u64)ptep)); pmd_val(*(pmdp)) = (unsigned long)(ptep);}) #define pmd_none(pmd) (!pmd_val(pmd)) #define pmd_bad(pmd) (pmd_val(pmd) == 0) #define pmd_present(pmd) (pmd_val(pmd) != 0) Index: kexec/arch/powerpc/oprofile/op_model_rs64.c =================================================================== --- kexec.orig/arch/powerpc/oprofile/op_model_rs64.c +++ kexec/arch/powerpc/oprofile/op_model_rs64.c @@ -178,7 +178,6 @@ static void rs64_handle_interrupt(struct int val; int i; unsigned long pc = mfspr(SPRN_SIAR); - int is_kernel = (pc >= KERNELBASE); /* set the PMM bit (see comment below) */ mtmsrd(mfmsr() | MSR_PMM); @@ -187,7 +186,7 @@ static void rs64_handle_interrupt(struct val = ctr_read(i); if (val < 0) { if (ctr[i].enabled) { - oprofile_add_pc(pc, is_kernel, i); + oprofile_add_pc(pc, is_kernel_addr(pc), i); ctr_write(i, reset_value[i]); } else { ctr_write(i, 0); From michael at ellerman.id.au Tue Nov 8 00:06:52 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Tue, 8 Nov 2005 00:06:52 +1100 (EST) Subject: [PATCH 3/12] powerpc: Add CONFIG_CRASH_DUMP In-Reply-To: <1131368803.622486.677825384736.qpush@concordia> Message-ID: <20051107130652.B2BE6686BC@ozlabs.org> This patch adds a Kconfig variable, CONFIG_CRASH_DUMP, which configures the built kernel for use as a Kdump kernel. Currently "all" this involves is changing the value of KERNELBASE to 32 MB. arch/powerpc/Kconfig | 11 +++++++++++ arch/powerpc/kernel/setup_64.c | 3 +++ include/asm-powerpc/page.h | 10 +++++++++- 3 files changed, 23 insertions(+), 1 deletion(-) Index: kexec/arch/powerpc/Kconfig =================================================================== --- kexec.orig/arch/powerpc/Kconfig +++ kexec/arch/powerpc/Kconfig @@ -379,6 +379,17 @@ config CELL_IIC bool default y +config CRASH_DUMP + bool "kernel crash dumps (EXPERIMENTAL)" + depends on PPC_MULTIPLATFORM + depends on EXPERIMENTAL + help + Build a kernel suitable for use as a kdump capture kernel. + The kernel will be linked at a different address than normal, and + so can only be used for Kdump. + + Don't change this unless you know what you are doing. + config IBMVIO depends on PPC_PSERIES || PPC_ISERIES bool Index: kexec/arch/powerpc/kernel/setup_64.c =================================================================== --- kexec.orig/arch/powerpc/kernel/setup_64.c +++ kexec/arch/powerpc/kernel/setup_64.c @@ -528,6 +528,9 @@ void __init setup_system(void) ppc64_caches.iline_size); printk("htab_address = 0x%p\n", htab_address); printk("htab_hash_mask = 0x%lx\n", htab_hash_mask); +#if PHYSICAL_START > 0 + printk("physical_start = 0x%lx\n", PHYSICAL_START); +#endif printk("-----------------------------------------------------\n"); mm_init_ppc64(); Index: kexec/include/asm-powerpc/page.h =================================================================== --- kexec.orig/include/asm-powerpc/page.h +++ kexec/include/asm-powerpc/page.h @@ -103,7 +103,15 @@ #define PAGE_OFFSET CONFIG_KERNEL_START #endif /* __powerpc64__ */ -#define KERNELBASE PAGE_OFFSET +#ifdef CONFIG_CRASH_DUMP +/* Kdump kernel runs at 32 MB, change at your peril. */ +#define PHYSICAL_START ASM_CONST(0x2000000) +#else +#define PHYSICAL_START ASM_CONST(0x0) +#endif + +#define KERNELBASE (PAGE_OFFSET + PHYSICAL_START) + #define VMALLOCBASE ASM_CONST(0xD000000000000000) /* From michael at ellerman.id.au Tue Nov 8 00:06:55 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Tue, 8 Nov 2005 00:06:55 +1100 (EST) Subject: [PATCH 4/12] powerpc: Create a trampoline for the fwnmi vectors In-Reply-To: <1131368803.622486.677825384736.qpush@concordia> Message-ID: <20051107130655.2649A686CD@ozlabs.org> The fwnmi vectors can be anywhere < 32 MB, so we need to use a trampoline for them. The kdump kernel will register the trampoline addresses, which will then jump up to the real code above 32 MB. arch/powerpc/kernel/head_64.S | 2 ++ arch/powerpc/platforms/pseries/ras.c | 6 ++---- arch/powerpc/platforms/pseries/setup.c | 17 +++++++++-------- include/asm-powerpc/firmware.h | 6 ++++++ 4 files changed, 19 insertions(+), 12 deletions(-) Index: kexec/arch/powerpc/kernel/head_64.S =================================================================== --- kexec.orig/arch/powerpc/kernel/head_64.S +++ kexec/arch/powerpc/kernel/head_64.S @@ -512,6 +512,7 @@ _GLOBAL(do_stab_bolted_pSeries) * Vectors for the FWNMI option. Share common code. */ .globl system_reset_fwnmi + .align 7 system_reset_fwnmi: HMT_MEDIUM mtspr SPRN_SPRG1,r13 /* save r13 */ @@ -519,6 +520,7 @@ system_reset_fwnmi: EXCEPTION_PROLOG_PSERIES(PACA_EXGEN, system_reset_common) .globl machine_check_fwnmi + .align 7 machine_check_fwnmi: HMT_MEDIUM mtspr SPRN_SPRG1,r13 /* save r13 */ Index: kexec/arch/powerpc/platforms/pseries/setup.c =================================================================== --- kexec.orig/arch/powerpc/platforms/pseries/setup.c +++ kexec/arch/powerpc/platforms/pseries/setup.c @@ -75,8 +75,6 @@ #endif extern void find_udbg_vterm(void); -extern void system_reset_fwnmi(void); /* from head.S */ -extern void machine_check_fwnmi(void); /* from head.S */ extern void generic_find_legacy_serial_ports(u64 *physport, unsigned int *default_speed); @@ -104,18 +102,21 @@ void pSeries_show_cpuinfo(struct seq_fil /* Initialize firmware assisted non-maskable interrupts if * the firmware supports this feature. - * */ static void __init fwnmi_init(void) { - int ret; + unsigned long a1, a2; + int ibm_nmi_register = rtas_token("ibm,nmi-register"); if (ibm_nmi_register == RTAS_UNKNOWN_SERVICE) return; - ret = rtas_call(ibm_nmi_register, 2, 1, NULL, - __pa((unsigned long)system_reset_fwnmi), - __pa((unsigned long)machine_check_fwnmi)); - if (ret == 0) + + /* If the kernel's not linked at zero we point the firmware at low + * addresses anyway, and use a trampoline to get to the real code. */ + a1 = __pa(system_reset_fwnmi) - PHYSICAL_START; + a2 = __pa(machine_check_fwnmi) - PHYSICAL_START; + + if (0 == rtas_call(ibm_nmi_register, 2, 1, NULL, a1, a2)) fwnmi_active = 1; } Index: kexec/include/asm-powerpc/firmware.h =================================================================== --- kexec.orig/include/asm-powerpc/firmware.h +++ kexec/include/asm-powerpc/firmware.h @@ -92,6 +92,12 @@ typedef struct { extern firmware_feature_t firmware_features_table[]; #endif +extern void system_reset_fwnmi(void); +extern void machine_check_fwnmi(void); + +/* This is true if we are using the firmware NMI handler (typically LPAR) */ +extern int fwnmi_active; + #endif /* __ASSEMBLY__ */ #endif /* __KERNEL__ */ #endif /* __ASM_POWERPC_FIRMWARE_H */ Index: kexec/arch/powerpc/platforms/pseries/ras.c =================================================================== --- kexec.orig/arch/powerpc/platforms/pseries/ras.c +++ kexec/arch/powerpc/platforms/pseries/ras.c @@ -49,14 +49,12 @@ #include #include #include +#include static unsigned char ras_log_buf[RTAS_ERROR_LOG_MAX]; static DEFINE_SPINLOCK(ras_log_buf_lock); -char mce_data_buf[RTAS_ERROR_LOG_MAX] -; -/* This is true if we are using the firmware NMI handler (typically LPAR) */ -extern int fwnmi_active; +char mce_data_buf[RTAS_ERROR_LOG_MAX]; static int ras_get_sensor_state_token; static int ras_check_exception_token; From michael at ellerman.id.au Tue Nov 8 00:06:57 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Tue, 8 Nov 2005 00:06:57 +1100 (EST) Subject: [PATCH 5/12] powerpc: Reroute interrupts from 0 + offset to PHYSICAL_START + offset In-Reply-To: <1131368803.622486.677825384736.qpush@concordia> Message-ID: <20051107130657.980E5686CC@ozlabs.org> Regardless of where the kernel's linked we always get interrupts at low addresses. This patch creates a trampoline in the first 3 pages of memory, where interrupts land, and patches those addresses to jump into the real kernel code at PHYSICAL_START. We also need to reserve the trampoline code and a bit more in prom.c arch/powerpc/kernel/Makefile | 3 +- arch/powerpc/kernel/crash.c | 52 +++++++++++++++++++++++++++++++++++++++++ arch/powerpc/kernel/prom.c | 6 +++- arch/powerpc/kernel/setup_64.c | 5 +++ include/asm-powerpc/kdump.h | 13 ++++++++++ 5 files changed, 77 insertions(+), 2 deletions(-) Index: kexec/arch/powerpc/kernel/setup_64.c =================================================================== --- kexec.orig/arch/powerpc/kernel/setup_64.c +++ kexec/arch/powerpc/kernel/setup_64.c @@ -34,6 +34,7 @@ #include #include #include +#include #include #include #include @@ -274,6 +275,10 @@ void __init early_setup(unsigned long dt } ppc_md = **mach; +#ifdef CONFIG_CRASH_DUMP + kdump_setup(); +#endif + DBG("Found, Initializing memory management...\n"); /* Index: kexec/arch/powerpc/kernel/prom.c =================================================================== --- kexec.orig/arch/powerpc/kernel/prom.c +++ kexec/arch/powerpc/kernel/prom.c @@ -37,6 +37,7 @@ #include #include #include +#include #include #include #include @@ -1354,11 +1355,14 @@ void __init early_init_devtree(void *par #ifdef CONFIG_PPC64 systemcfg->physicalMemorySize = lmb_phys_mem_size(); #endif - lmb_reserve(0, __pa(klimit)); DBG("Phys. mem: %lx\n", lmb_phys_mem_size()); /* Reserve LMB regions used by kernel, initrd, dt, etc... */ + lmb_reserve(__pa(KERNELBASE), __pa(klimit) - __pa(KERNELBASE)); +#ifdef CONFIG_CRASH_DUMP + lmb_reserve(0, KDUMP_BACKUP_LIMIT); +#endif early_reserve_mem(); DBG("Scanning CPUs ...\n"); Index: kexec/include/asm-powerpc/kdump.h =================================================================== --- /dev/null +++ kexec/include/asm-powerpc/kdump.h @@ -0,0 +1,13 @@ +#ifndef _PPC64_KDUMP_H +#define _PPC64_KDUMP_H + +/* How many bytes to backup from zero for kdump. The backup limit should + * be greater or equal to the trampoline's end address. */ +#define KDUMP_BACKUP_LIMIT 0x8000 + +#define KDUMP_TRAMPOLINE_START 0x0100 +#define KDUMP_TRAMPOLINE_END 0x3000 + +extern void kdump_setup(void); + +#endif /* __PPC64_KDUMP_H */ Index: kexec/arch/powerpc/kernel/Makefile =================================================================== --- kexec.orig/arch/powerpc/kernel/Makefile +++ kexec/arch/powerpc/kernel/Makefile @@ -11,7 +11,7 @@ CFLAGS_btext.o += -fPIC endif obj-y := semaphore.o cputable.o ptrace.o syscalls.o \ - signal_32.o pmc.o + signal_32.o pmc.o crash.o obj-$(CONFIG_PPC64) += setup_64.o binfmt_elf32.o sys_ppc32.o \ signal_64.o ptrace32.o systbl.o obj-$(CONFIG_ALTIVEC) += vecemu.o vector.o @@ -22,6 +22,7 @@ obj-$(CONFIG_RTAS_FLASH) += rtas_flash.o obj-$(CONFIG_RTAS_PROC) += rtas-proc.o obj-$(CONFIG_IBMVIO) += vio.o obj-$(CONFIG_GENERIC_TBSYNC) += smp-tbsync.o +obj-$(CONFIG_CRASH_DUMP) += crash.o ifeq ($(CONFIG_PPC_MERGE),y) Index: kexec/arch/powerpc/kernel/crash.c =================================================================== --- /dev/null +++ kexec/arch/powerpc/kernel/crash.c @@ -0,0 +1,52 @@ +/* + * Routines for doing kexec-based kdump. + * + * Copyright (C) 2005, IBM Corp. + * + * Created by: Michael Ellerman + * + * This source code is licensed under the GNU General Public License, + * Version 2. See the file COPYING for more details. + */ + +#undef DEBUG + +#include +#include +#include + +#ifdef DEBUG +#define DBG(fmt...) udbg_printf(fmt) +#else +#define DBG(fmt...) +#endif + +static void __init create_trampoline(unsigned long addr) +{ + /* The maximum range of a single instruction branch, is the current + * instruction's address + (32 MB - 4) bytes. For the trampoline we + * need to branch to current address + 32 MB. So we insert a nop at + * the trampoline address, then the next instruction (+ 4 bytes) + * does a branch to (32 MB - 4). The net effect is that when we + * branch to "addr" we jump to ("addr" + 32 MB). Although it requires + * two